Performance Optimizing a Squareword solver
Overview
Previously, I started to create a program to solve Squareword puzzles. Squareword is similar to Wordle, except that instead of guessing one word, the final solution is a 5x5 grid of letters where every row and column is a word.
I’d seen videos like 3Blue1Brown’s on writing a program to solve Wordle. I thought it’d be interesting to do the same thing, but for SquareWord. One big difference is that finding a solution in Wordle is trivial: it’s just a single word. With SquareWord, however, each solution is non-trivial, and it actually takes a while to find all the possible solutions.
While playing SquareWord, I’ve noticed a few patterns that I wanted to maintain:
- All ten words are unique. Columns and rows never repeat words.
- If you know one solution, then you actually know two. You can flip the columns and rows to generate a new solution.
Knowing these restrictions, we should be able to implement additional performance improvements.
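The flip property above can be sketched directly. The grid representation and function name here are my own, for illustration; the post’s actual types differ:

```rust
// A solution is a 5x5 grid of ASCII letters; every row and every column is a
// word. (Representation assumed for illustration, not the post's actual types.)
type Grid = [[u8; 5]; 5];

// Transposing swaps rows and columns, turning one valid solution into the
// other member of its pair.
fn transpose(g: &Grid) -> Grid {
    let mut t = [[0u8; 5]; 5];
    for r in 0..5 {
        for c in 0..5 {
            t[c][r] = g[r][c];
        }
    }
    t
}
```

Transposing twice gives back the original grid, which is why the two solutions always come in pairs.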
There are two parts to this program: generating solutions and evaluating guesses.
Goal
My goal with this project is to document my engineering approach to doing performance optimizations on finding SquareWord solutions.
My end goal for the program is to be able to generate all the possible solutions within a day.
At the start, I thought it would be fun to write a program that would also tell you what to guess to get the lowest average score, but I found that generating the solutions and finding the optimal first guess was more helpful than anticipated.
Current State
This is a project that I’ve already done some work on, so I won’t be starting from scratch. The code is here. I’ve gotten a working implementation for finding solutions, but I’d like to make it faster.
My current implementation goes something like this: build a solution from the top down. We guess a word to go in the top row, making sure that the columns can still be turned into valid words.
As we place words, we ‘order’ solutions. Since we can take the transpose of a solution to get a new solution, we break early if the word in the first column is alphabetically before the word in the first row. This ensures we only generate one member of each pair of solutions; when generating solutions, we build both members of the pair at the same time.
If we manage to place all 5 words, then we have to check that no words are repeated between the rows and columns.
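The break-early ordering check can be sketched like this (grid representation and function names are my own, not the post’s code):

```rust
type Grid = [[u8; 5]; 5];

// The first column of the grid, read as a word.
fn first_column(g: &Grid) -> [u8; 5] {
    [g[0][0], g[1][0], g[2][0], g[3][0], g[4][0]]
}

// Keep only the canonical member of each transpose pair: the one whose first
// row is alphabetically no later than its first column. The other member is
// recovered by transposing, so nothing is lost.
fn is_canonical(g: &Grid) -> bool {
    g[0] <= first_column(g)
}
```

Because the first row is fixed early, this check can reject a whole branch as soon as the first column’s letters are known.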
I’ve made some optimizations like:
- Creating a struct with constant-time lookup to check whether, given some starting letters, any word begins with those letters.
- Using ASCII-only strings
- Using the builder pattern to create solutions while modifying the builder
- Multithreading: each thread works on solutions with a particular starting word, then moves on to the next word in the order.
First steps
The first thing I need to do is make my solution single-threaded, so that I can understand it better. Multithreading makes things faster, but less reliable with respect to timing. My thought is that I’ll iterate over solutions based on the starting word; I can later change this iterator to rayon’s par_iter to parallelize it.
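The shape of that plan can be sketched like this; `solve_from` is a hypothetical stand-in for the real search, and the rayon line shows the intended one-line swap:

```rust
// Hypothetical stand-in for the real search: all solutions whose first row
// starts from `start`. (Name and shape are illustrative, not the post's code.)
fn solve_from(start: &str) -> Vec<String> {
    vec![format!("{} ...", start)]
}

fn solve_all(starts: &[&str]) -> Vec<String> {
    // Sequential version. With rayon, `iter()` becomes `par_iter()` and each
    // starting word is searched in parallel, with no other changes:
    //     starts.par_iter().flat_map(|s| solve_from(s)).collect()
    starts.iter().flat_map(|s| solve_from(s)).collect()
}
```

Keeping the search expressed as an iterator over starting words is what makes the later parallelization nearly free.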
commit: c0e3c7cb84cc339d5c0eb0d34ce5d43a31249192
When running the code, I need to find a good list of words to process. With 1700 words I didn’t find any solutions, but with 2000 words I got 4. In order to get accurate results I want at least a few solutions to be found, so I’d like to use the first 2000 words. I also need to figure out how long the profiler should run, as it takes about 5 minutes to generate the 4 solutions.
Here’s a blog post explaining how to generate flamegraphs on mac. https://carol-nichols.com/2017/04/20/rust-profiling-with-dtrace-on-osx/
This commit is the starting point for the single-threaded version. I also verified that I can use rayon to easily multithread the whole thing.
commit 56744dbbb8802e257b127f7f6a2808e4e9825904
The commands I’ll use to generate a flamegraph (via dtrace) are:
cargo install flamegraph
sudo flamegraph -- ./target/release/solve
Time: 5:05.20s
Removing duplicate self.column() calls
commit: 33de928d95c1737aeb596ec3de0b2e1d318ee4cc
time: 2:57.09s
New algorithm
I knew there were a few more optimizations I could make with the previous method, but I wanted to see what I could do with a different starting point. I created a new algorithm that alternates between filling in rows and columns. This seems like it might be more optimizable. The commit for that is:
I tried to write it with minimal optimizations. I knew I would be able to optimize it later, and I just wanted to see what a different approach would look like. Starting off, it seems much better already: it takes only 2:31s to generate solutions from 1800 words. During this, I also discovered that my implementations are not both correct, as they don’t return the same solutions for the given words.
To fix this, I decided to use property-based testing to find a minimal set of words that causes the two implementations to diverge. I used the proptest crate instead of quickcheck, as it claims better shrinking, which is exactly what I want.
The docs seemed a little bare of examples, but I was able to get something working that would pass in a subset of the words I knew were a problem. The first pass found a list of about 750 words that caused a difference, but when I re-ran it, starting with that set of words, I got a shorter list of 11 words.
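The core of comparative testing plus shrinking can be sketched without the crate; proptest generates inputs and shrinks automatically, but the greedy idea looks like this (the two “solvers” here are toy stand-ins, with a deliberate bug in one):

```rust
// Two implementations to pit against each other. These are toy stand-ins:
// both "solvers" just count five-letter words, and the buggy one also
// (wrongly) skips words containing 'q'.
fn solver_a(words: &[&str]) -> usize {
    words.iter().filter(|w| w.len() == 5).count()
}
fn solver_b(words: &[&str]) -> usize {
    words.iter().filter(|w| w.len() == 5 && !w.contains('q')).count()
}

// Greedy shrink: starting from a failing input, repeatedly try dropping one
// word at a time, and keep the drop whenever the implementations still
// disagree on the smaller input.
fn shrink<'a>(mut words: Vec<&'a str>) -> Vec<&'a str> {
    let mut i = 0;
    while i < words.len() {
        let mut smaller = words.clone();
        smaller.remove(i);
        if solver_a(&smaller) != solver_b(&smaller) {
            words = smaller; // still diverges without this word: drop it
        } else {
            i += 1; // this word is needed to reproduce the divergence
        }
    }
    words
}
```

proptest’s shrinking is smarter than this greedy loop, but the end result is the same kind of thing: a much smaller input that still reproduces the disagreement.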
This seems close to the smallest possible set; as expected, adding one more word to a known-good set is enough to make things diverge. With only 11 words, I should be able to understand the differences more easily.
My new implementation found more solutions than my previous one, and checking by hand, the new implementation is correct and the old one was wrong. Going forward I’ll use the new implementation. Maybe I can use the old one someday to practice debugging non-trivial Rust programs.
Optimizing
I’ve switched from 2000 words to 1800, as this runs faster, and the extra words don’t give me much more information.
8 solutions: Current time for 1800 words: 2:26.77 Commit: 2dfef70b5a1ff8d21b2aa662fb03c4f37c151638
After using the transposition trick, we get a significant speed up of about 2x.
New time: 1:11.18s commit 475e1e4bff52b8b7ef9695b33260f8889181531d
Switched to storing indexes to the words instead of the words themselves. This isn’t much faster, but I think I will be able to use this information to go faster with other methods, such as skipping over the words we’ve already used, reducing our search space.
New Time: 1:10.60 commit 6f510a9cc1973c020db3b762cd6eb362367fa670
Since we are now storing the indexes of the words we’re using, we can skip them ahead of time and avoid checking for duplicates. After implementing this, I found it took about twice as long. I feel that with some further optimizations, I can make it faster. I think most of the slowdown is from allocating Vecs quite often. I’ll try to reduce the number of allocations and see how that performs.
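The skip-ahead idea can be sketched like this (names are hypothetical, not the post’s code):

```rust
// Candidate word indices that haven't been placed yet. `used` holds the
// indices already in the partial solution; a linear scan of it is cheap
// because it never has more than nine entries.
fn unused_candidates(word_count: usize, used: &[usize]) -> impl Iterator<Item = usize> + '_ {
    (0..word_count).filter(move |i| !used.contains(i))
}
```

Filtering candidates up front replaces the separate duplicate check at the end, at the cost of doing the membership test on every candidate.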
New time: 2:44.29s commit 33cb2d904dc976789a04da335ad6339e288ff418
After removing some of the calls to .clone(), the time did reduce a little, but it’s still slower than when I started. I’m going to revert to before I made this change, and see if it helps later on.
New time: 1:41.98s commit 8982914cd1599ec401c104d6406f1991f4145b8e
I switched to using AsciiStr instead of plain str. This happened to improve performance, but the main reason is that I can now index directly into strings. This should allow for more improvements in the future.
New time: 1:00.66s commit efb0d91939fdc7e208cbbf618b893a3884150d68
After switching to storing words as arrays, [AsciiChar; 5], we got a pretty big performance improvement. This should also allow us to use the stack more in the future, reducing our need for heap allocations, since we can now index into the words directly.
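The same direct indexing can be sketched with plain byte arrays (the post uses the ascii crate’s `[AsciiChar; 5]`; bytes are shown here to keep the sketch dependency-free). A `&str` can’t be indexed by position because UTF-8 is variable-width, but a fixed-size ASCII array can, and it lives on the stack:

```rust
type Word = [u8; 5]; // stand-in for the post's [AsciiChar; 5]

// With rows stored as fixed-size arrays, extracting a column is five direct
// index operations, and the result is another stack-allocated Word.
fn column(rows: &[Word; 5], c: usize) -> Word {
    [rows[0][c], rows[1][c], rows[2][c], rows[3][c], rows[4][c]]
}
```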
New time: 44.436s commit 11e2b2ceef26c1d704eacc105253305bfcab0742
The next change took a little while to figure out, and I’m still not 100% happy with it. I wanted a way to use arrays instead of Vecs, since that would reduce calls to alloc. I tried a few approaches with const generics, but I couldn’t get anything to work. I ended up splitting up my recursive function, making each call to itself a separate function. This let me know exactly what the state of the internal Vecs was at each depth, so I knew their sizes and was able to build arrays of exactly the right length.
I’d like to refactor it some more, since I added a few very repetitive methods. Still, this change was worth it, since it now runs about twice as fast.
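The shape of that split can be sketched like this (a toy that just counts branches; names and logic are illustrative, not the post’s code):

```rust
type Word = [u8; 5];

// Before: one recursive function carrying a growing Vec<Word> (allocates).
// After: one function per depth; each takes an array whose length is known
// at compile time, so partial solutions live entirely on the stack.
fn place_first(words: &[Word]) -> usize {
    words.iter().map(|&w| place_second(words, [w])).sum()
}

fn place_second(words: &[Word], placed: [Word; 1]) -> usize {
    // ...at depth 2 the partial solution would be exactly [Word; 2], and so on.
    words.iter().filter(|&&w| w != placed[0]).count()
}
```

The cost is exactly the repetitiveness mentioned above: one nearly identical function per depth, in exchange for alloc-free partial solutions.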
New time: 22.889s commit 9343ebffee24df201a3a8b15ecb9bcef8013a87e
After making all these functions, I noticed that I was passing a reference to the list of words through every function. I decided to just store it inside the inner struct. No significant change to performance; just a dev-experience change.
New time: 21.898s commit 34671599d7da2ccc5c2b383f739f90b6e2edf603
I added something that I had done in my previous implementation: checking whether the partially filled rows and columns are still valid. By valid, I mean that there exist words that can complete the unfinished row or column. This kills off branches that are already dead, earlier. I only made this change when filling in row and column 3, as doing the check deeper in actually slowed things down. I imagine this is sensitive to the density of words in our space: the more words we have in our list, the more helpful this check will be.
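A sketch of the kind of check involved, using a hash set of prefixes as a stand-in for the constant-time lookup struct mentioned earlier (the real struct’s layout is not shown in the post):

```rust
use std::collections::HashSet;

// All prefixes of every word; lookup is a single hash probe.
fn build_prefixes(words: &[&str]) -> HashSet<String> {
    let mut set = HashSet::new();
    for w in words {
        for i in 1..=w.len() {
            set.insert(w[..i].to_string());
        }
    }
    set
}

// A partial column (its first k letters) is viable only if some word starts
// with those letters; otherwise the whole branch is dead and can be pruned.
fn is_viable(prefixes: &HashSet<String>, partial: &str) -> bool {
    partial.is_empty() || prefixes.contains(partial)
}
```

The trade-off described above falls out naturally: each check costs a lookup, so it only pays off at depths where it prunes enough branches.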
New time: 8.415s commit 518c8fd59722d3bf90d290a6296d55b9c1f65df6
This is a graph of the time complexity of the program so far. It’s a log-log plot, meaning I’ve graphed the log of the input size against the log of the time it took to process. On such a plot, if t = c·n^k, then log t = k·log n + log c, so the slope of the line corresponds to the exponent in our big-O. The trivial algorithm is n^5, and this test shows that our current implementation is also n^5.

Now that I can create more things on the stack, I added the except_for approach back in. This time I did get a speed-up, since I could deal with arrays instead of Vecs. The times are a bit wonky here, as I made these changes in a different session and my previous runs took a different amount of time, but I still get a speed-up.
New time: 7.890s commit 0dfc1e533317977182bf6c1895bae34ddd5968f8
Takeaways
During this process I have learned quite a lot. This has probably been the biggest Rust program that I have written. This has also been the first time I’ve taken optimization seriously.
Multiple Optimization Strategies
One realization I made is that I mixed two styles of optimization: algorithmic and implementation. I found that adding an algorithmic improvement could, in some cases, be slower, because its implementation was not at the same polish level as the rest of the solution.
Going forward, I expect better results from working purely on algorithmic improvements, like reducing the big-O time complexity, before optimizing the implementation.
Comparative Property based testing
This is something that I had figured out, then was pleased when I heard people talking about it on this Oxide and Friends episode.
Property-based testing works by verifying properties of a function’s output. All the examples I’ve seen online work by setting up constraints; for example, a hash function’s output should only contain the characters 0-9 and a-f, and be of length 40.
I wasn’t able to think of simple rules like that, so I tried pitting two implementations against each other. If they ever returned something different, then I knew at least one of them was wrong. This ended up helping me in this project, though I didn’t end up solving the underlying issue.
Performance Optimizations can hurt code readability
One thing that annoyed me a little about Rust is that const generics act as types, instead of instances of types. An implication of this is that if you have a const generic usize called N, you can’t do any arithmetic with that N. Here is an example that doesn’t compile:
fn foo<const N: usize>() {
    // error: generic parameters may not be used in const operations
    let arr = [(); 2 * N];
}
I’m not sure how often I would need this, but I would have wanted to use a feature like it here. I now understand that it isn’t allowed because the usize you pass in as a generic parameter is truly part of a type, not a value.
Optimizations I didn’t make
There are a few things I didn’t have a good idea how to optimize. The two that come to mind are:
- SIMD
- Cache size optimization
SIMD
I haven’t used SIMD before, and I never found a place where I thought it would be helpful. SIMD seems to be helpful when you need to run the same operation over a lot of data at once. The nature of this problem didn’t seem to lend itself to that; there was too much jumping around to get any use out of SIMD.
Cache size optimization
This is similar to SIMD, in that I couldn’t find a way to effectively shrink my working set to fit in cache. The nature of this problem is that solutions jump around in the word list, and I couldn’t think of a way to bring my data closer together.