Need for Speed. W7
Today I want to try something new: concurrency. Includes: Thrilling Video.
As constant readers know, creating a SolutionDictionary for the full Wordle vocabulary takes about 30 seconds on my M1 MacBook Air. I have an idea for a way to make it faster.
The first part of what follows is based on How to Check if Numbers are Prime Concurrently in Python, provided by Jason Brownlee, Ph.D. H/T to Jason.
There is a similar example in the Python Docs, but I found Jason’s page first, and he has an amazing amount of material, much of it free. Another H/T to Jason.
Module ‘concurrent’
Python has a module, concurrent.futures, that can run tasks in separate processes, on separate cores if your machine has them. I write these tests:
import time
from concurrent.futures import ProcessPoolExecutor

# is_prime: a predicate returning True for primes, defined elsewhere


class TestSpeed:
    NUMBERS = [17977, 10619863, 106198, 6620830889, 80630964769, 228204732751,
               ...]

    def test_primes(self):
        t0 = time.time()
        results = self.check_numbers_prime()
        count = sum(1 for result in results if result[1])
        delta_time = time.time() - t0
        assert count == 112
        assert delta_time == 0  # fails on purpose, so the report shows the time

    def test_primes_parallel(self):
        t0 = time.time()
        results = self.check_numbers_prime_concurrently()
        count = sum(1 for result in results if result[1])
        delta_time = time.time() - t0
        assert count == 112
        assert delta_time == 0  # fails on purpose, so the report shows the time

    def check_numbers_prime(self):
        return zip(self.NUMBERS, map(is_prime, self.NUMBERS))

    def check_numbers_prime_concurrently(self):
        with ProcessPoolExecutor() as executor:
            results = executor.map(is_prime, self.NUMBERS)
            return zip(self.NUMBERS, results)
In check_numbers_prime, we map the function is_prime over the numbers, giving a sequence of True/False values. We zip that with the original numbers to return pairs (number, is_prime). No reason why I did it that way, it just kind of evolved.
In check_numbers_prime_concurrently, we create the same True/False values using ProcessPoolExecutor.map, applying the same function. But the executor hands each function application to the next available worker process, which can run on a separate core.
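The is_prime predicate isn't shown here. For completeness, a minimal trial-division sketch (my own stand-in, not necessarily the version from Jason's article):

```python
import math


def is_prime(n):
    """True if n is prime. Trial division up to sqrt(n);
    plenty fast for numbers in the test's range."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True
```

Anything with this signature works with both the plain map and the executor version, as long as it is a top-level function that the process pool can pickle.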
The serial version runs in almost 20 seconds, the parallel version in less than 5. Here’s a little video. See if you notice the same thing that I did.
Of course the first thing one notices is all the cores firing up at once. Very gratifying, although I imagine that if we did that for very long I could warm my coffee on the laptop, if I drank coffee, which I do not.
But the next thing one might notice is that the test window switches to running the parallel test and quite a bit of time elapses before the cores light up. Now part of that could be latency in Activity Monitor, but I wonder also whether the executor needs a bit of time to figure out what it’s going to do. I wouldn’t expect seconds, however, so I’ll chalk the delay up to latency, but it is interesting.
- Later
- A little research tells me that the normal update cycle for Activity Monitor is every 5 seconds. When I set the update cycle to 1 second, the apparent delay in redlining the cores is much shorter. Good news.
We should expect, however, that firing up a separate process takes time, and these prime calculations are mostly pretty short. Naturally, what I’m interested in is whether I can speed up the calculation of the big SolutionDictionary using concurrency.
We want to have a function / method to apply via map calls, although there are other ways we could use the ProcessPoolExecutor thingie. Our current code creating the SolutionDictionary looks like this:
    @staticmethod
    def create_dict(guesses, solutions):
        solutions_dict = {}  # guess -> dict (score -> [solutions])
        for guess in guesses:
            guess_desc = GuessDescription()
            solutions_dict[guess] = guess_desc
            for solution in solutions:
                score = guess.score(solution)
                guess_desc.add_word(score, solution)
        return solutions_dict
Let’s reorder that:
    @staticmethod
    def create_dict(guesses, solutions):
        solutions_dict = {}  # guess -> dict (score -> [solutions])
        for guess in guesses:
            guess_desc = GuessDescription()
            for solution in solutions:
                score = guess.score(solution)
                guess_desc.add_word(score, solution)
            solutions_dict[guess] = guess_desc
        return solutions_dict
We want to compute all the GuessDescriptions in a batch, so that we can parallelize. I notice that a guess description does not have the guess in it. I think we may want that. Let’s add it:
class GuessDescription:
    def __init__(self, guess_word):
        self.guess_word = guess_word
        self.score_descriptions = {}


class SolutionDictionary:
    @staticmethod
    def create_dict(guesses, solutions):
        solutions_dict = {}  # guess -> dict (score -> [solutions])
        for guess in guesses:
            guess_desc = GuessDescription(guess)
            for solution in solutions:
                score = guess.score(solution)
                guess_desc.add_word(score, solution)
            solutions_dict[guess] = guess_desc
        return solutions_dict
Let’s compute the GuessDescriptions all at once and then plug them into the SolutionDictionary. We extract a method, giving:
    @staticmethod
    def create_dict(guesses, solutions):
        solutions_dict = {}  # guess -> dict (score -> [solutions])
        for guess in guesses:
            guess_desc = SolutionDictionary.guess_description(guess, solutions)
            solutions_dict[guess] = guess_desc
        return solutions_dict

    @staticmethod
    def guess_description(guess, solutions):
        guess_desc = GuessDescription(guess)
        for solution in solutions:
            score = guess.score(solution)
            guess_desc.add_word(score, solution)
        return guess_desc
Now we can change create_dict to use map. We’ll go in steps. Oh, that reminds me: commit: refactoring toward map and concurrency.
First step:
    @staticmethod
    def create_dict(guesses, solutions):
        guess_descriptions = []
        for guess in guesses:
            guess_desc = SolutionDictionary.guess_description(guess, solutions)
            guess_descriptions.append(guess_desc)
        return {desc.guess_word: desc for desc in guess_descriptions}
Now we’re getting there. We have a collection of GuessDescriptions. We construct the required dictionary from them.
Now to use map to create them. After a bit of fumbling and removing @staticmethod:
    def create_dict(self, guesses, solutions):
        guess_descriptions = map(self.guess_description, guesses, repeat(solutions))
        return {desc.guess_word: desc for desc in guess_descriptions}

    def guess_description(self, guess, solutions):
        guess_desc = GuessDescription(guess)
        for solution in solutions:
            score = guess.score(solution)
            guess_desc.add_word(score, solution)
        return guess_desc
Now I have a map call doing the work. The trick with repeat(solutions) passes the entire solutions collection to each call, as needed. repeat is in itertools, and that is its sole purpose as far as I know.
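A tiny illustration of the repeat trick, independent of the Wordle code (toy function and data of my own invention):

```python
from itertools import repeat


def count_matches(word, words):
    # count entries in `words` sharing a first letter with `word`
    return sum(1 for w in words if w[0] == word[0])

words = ["apple", "ant", "bee"]
# map calls count_matches(word, words) once per word;
# repeat(words) supplies the same second argument every time
counts = list(map(count_matches, words, repeat(words)))
print(counts)  # [2, 2, 1] -- each word's count includes itself
```

This is exactly the shape create_dict needs: a varying first argument paired with a constant second one.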
My tests are green, but I want to turn on the slow one just to be sure. Runs, takes 32 seconds.
Let’s just put in the parallel stuff. Better commit first: Using map.
    def create_dict(self, guesses, solutions, concurrent=False):
        if concurrent:
            with ProcessPoolExecutor(8) as executor:
                guess_descriptions = executor.map(
                    self.guess_description, guesses, repeat(solutions), chunksize=500)
        else:
            guess_descriptions = map(self.guess_description, guesses, repeat(solutions))
        return {desc.guess_word: desc for desc in guess_descriptions}
With concurrent True, the time is about 16 seconds, False about 32. So those two lines are worth a factor of two improvement in speed. Still too slow to let the long process run as part of the standard test cycle, but noticeable.
The time is fairly sensitive to the chunk size. With the default size of 1, the time actually tripled over the standard 30-ish seconds. And remember, we’re doing 12000 guess descriptions in 30 seconds, so each one is only 0.0025 seconds, and starting a process takes time.
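The back-of-envelope arithmetic behind that, using the figures from the text (roughly 12,000 guesses in about 30 seconds):

```python
guesses = 12_000
total_seconds = 30.0

# per-item cost is tiny compared to the cost of dispatching a task
per_item = total_seconds / guesses
print(per_item)  # 0.0025 seconds per guess description

# with chunksize=500, each dispatched task carries 500 descriptions,
# enough work to amortize the inter-process overhead
per_chunk = per_item * 500
print(per_chunk)  # 1.25 seconds of work per chunk
```

With chunksize=1, every 2.5-millisecond computation pays the full cost of pickling arguments and shipping them to a worker, which is how the concurrent version can end up slower than the serial one.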
It might be fun to think of a better way to divide up the processing, but as an experiment, this was interesting and fun. More interesting than Yet Another Wordle Solver in my view.
That said, a factor of two just isn’t enough. To make it feasible to create the big SolutionDictionary during testing, I’d need more like 0.5 seconds, not 16. We’re not going to get there this way.
Next time, probably back to Wordle analysis. See you then!