In a Pickle. W9
Convinced that I can’t make generating the SolutionDictionary fast enough, I’ll explore pickling it.
Yesterday’s efforts convinced me that I can’t make building the SolutionDictionary fast enough to just build it every time I want to analyze something. Who has 15 or 30 seconds to spare? I mean, really. Get serious.
In reading about concurrent, I learned that Python has to “pickle” information passed to and from processes. That’s their term for serializing and de-serializing object structures, and given that we’re dealing with over 12,000 × 2,000 objects, containing objects, …, that’s just going to take a lot of time setting up and tearing down processes, which explains why I was never able to peg the CPU cores while creating the dictionary.
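As a minimal sketch of what “pickling” means, here is a small nested dictionary making the round trip to bytes and back. The data is a made-up stand-in, not the real SolutionDictionary, and the variable names are mine.

```python
import pickle

# Hypothetical stand-in data: a dictionary of dictionaries of lists.
original = {"berth": {100: ["frail", "grasp", "rival"]}}

# Serialize to bytes, as Python must do to pass an object between processes.
payload = pickle.dumps(original)

# De-serialize: this builds a brand-new object structure equal to the first.
restored = pickle.loads(payload)

print(type(payload).__name__, len(payload))
print(restored == original, restored is original)
```

The restored structure is equal to the original but is a separate copy, which is why every object sent to another process has to be rebuilt on the far side.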
So my plan today is to try pickling it and unpickling it, to see how long that takes. It looks to be pretty straightforward. We’ll do it in tests, of course. Let’s go:
Pickling Your Dict
I’ll start with a small one. In fact, I’ll start with figuring out how to write a file to the desktop.
def test_write_file(self):
    with open("/users/ron/Desktop/scratch.txt", "w") as scratch:
        scratch.writelines(["hello\n", "world\n"])
TIL: writelines would be better named writestrings, because it doesn’t add newlines to the strings.
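The writelines behavior can be seen in a short sketch. It writes to a temporary folder rather than the desktop, and the variable names are just for illustration.

```python
import os
import tempfile

# Write to a temporary directory rather than a real desktop path.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scratch.txt")

    # writelines writes each string exactly as given: no newlines are added.
    with open(path, "w") as scratch:
        scratch.writelines(["hello", "world"])
    with open(path) as scratch:
        run_together = scratch.read()

    # Supplying the newlines ourselves gives one word per line.
    with open(path, "w") as scratch:
        scratch.writelines(word + "\n" for word in ["hello", "world"])
    with open(path) as scratch:
        separate_lines = scratch.read()

print(repr(run_together))    # 'helloworld'
print(repr(separate_lines))  # 'hello\nworld\n'
```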
TIL: To get the path to a file in Finder, or on the desktop, option right click the file and it offers “copy … as Pathname”.
Now I’ll try a small pickling of the SolutionDictionary.
@pytest.mark.skip("working on it")
def test_pickling_and_unpickling(self):
    all_guesses = WordCollection.from_file("valid_combined.txt")
    all_solutions = WordCollection.from_file("valid_solutions.txt")
    # take a small sample of each list to keep the test quick
    guesses = all_guesses[0:10000:500]
    solutions = all_solutions[0:2000:100]
    solution_dict = SolutionDictionary(guesses, solutions)
    # time the write
    t0 = time.time()
    with open("/users/ron/Desktop/test.pcl", "wb") as pick:
        pickle.dump(solution_dict, pick)
    t1 = time.time()
    t_write = t1 - t0
    # time the read
    with open("/users/ron/Desktop/test.pcl", "rb") as pick:
        unpickled = pickle.load(pick)
    t_read = time.time() - t1
    print(f"Pickle: {t_write:.5f}, Unpickle: {t_read:.5f}")
    # check that the unpickled dictionary still gives the expected answers
    guess = Word("berth")
    solution = Word("frail")
    score = guess.score(solution)
    assert score == 100
    solutions = unpickled.solutions_for(guess, score)
    expected = ScoreDescription.from_strings(score, "frail", "grasp", "rival")
    assert solutions == expected
    # fail on purpose so that the printed timings show up
    assert False
The action here is simple enough. I’ve lifted this test from another one. We create a SolutionDictionary from a subset of the data, pickle it and unpickle it, recording the time, and then verify that it gives us the answers we expect, which reassures me that it was correctly unpickled.
I used the pytest.mark.skip to keep the thing from running every time I paused my typing.
The time to do this file was trivial:
Pickle: 0.00173, Unpickle: 0.00030
Along the way I learned that the right file open mode is “wb” for writing and “rb” for reading. PyCharm objects at compile time if that’s not the case, but its diagnostic was not helpful. Apparently one could also use “ab”, append binary, to write multiple objects to the file. I don’t know what would come back from such a thing, and don’t plan to find out.
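For the curious, here is a minimal sketch of the “ab” case, using made-up data and a temporary folder. As far as I can tell from the pickle module’s behavior, each call to pickle.load reads back one pickled object, and a call at the end of the file raises EOFError.

```python
import os
import pickle
import tempfile

# Hypothetical objects to store one after the other.
first = {"guess": "berth"}
second = ["frail", "grasp", "rival"]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "test.pcl")

    # Two separate dumps, each opening the file in append-binary mode.
    for item in (first, second):
        with open(path, "ab") as pick:
            pickle.dump(item, pick)

    # Each load reads one object and leaves the file positioned at the next.
    loaded = []
    with open(path, "rb") as pick:
        while True:
            try:
                loaded.append(pickle.load(pick))
            except EOFError:
                break  # nothing left to read

print(loaded)
```

So what comes back is the objects, one per load, in the order they were written.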
Let’s commit these little tests: testing pickling.
Real Deal
While I go make my morning iced chai latte, I’ll let the computer build the real pickle file for the full SolutionDictionary.
I’ll spare you the code: it’s much like above only with all the data. Time is 32 seconds to build, and 3.25 seconds to save. The file is 186 megabytes. That should be no surprise: there are 30,030,180 combinations of guess and solution considered in the SolutionDictionary, so it would be 30 megabytes if it were only one byte per combination.
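The arithmetic behind that file size, as a quick sketch. It uses only the two figures quoted above, and assumes that a megabyte means a million bytes.

```python
# Figures quoted above.
file_megabytes = 186
combinations = 30_030_180

# Assumption: a megabyte here means 1,000,000 bytes.
bytes_per_combination = file_megabytes * 1_000_000 / combinations

print(f"{bytes_per_combination:.2f} bytes per combination")  # 6.19
```

That works out to a little over six bytes per guess-and-solution pair, which seems modest for a nested structure of objects.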
Now I’ll test how long it takes to read it in: load time: 3.553 seconds.
That’s still too long for my auto-running unit tests, but for getting some statistics it’ll be OK.
Let’s pause here, reflect, and sum up.
What He Said
The solution dictionary has the information we need if we wanted to move toward the information theory approach to selecting guesses. If and when we need it, we need only wait a few seconds and there it is. I’m not sure if we’ll do that analysis or not. In favor is the opportunity for me to refresh my ancient understanding of information theory, and to try to explain it here. Opposed is that it has already been done and explained better than I ever could, and I foresee absolutely no use for information theory in the near future.
There are other analyses that might be interesting, letter frequencies and such, which would be easily done and would involve only the word lists. I might do that just to learn what some good ways to do it might be. More about learning Python than anything about Wordle.
As for Wordle itself, I suppose one interesting thing would be to replicate the game itself, just to learn how to do that sort of screen thing. I wonder if I could use my little game framework, with independent letters flying around like aliens or space ships. I bet I could. But I hear Dr Ian Malcolm saying that thing he says.
So far, what have I learned? Quite a lot. I’ve checked (√) some that seem especially notable.
- √ How to run multiple processes, filling all my CPU cores.
- √ Limitations of multiple processes related to how much data needs to move.
- Suspicion that I could do something really exotic with shared memory to get around that.
- Near certainty that that would be seriously deep in the bag of tricks.
- But it might be fun.
- √ Python pickle is powerful and easy to use.
- It’s not that hard to write a file out.
- √ Iterators, such as generators and maps, can only be realized once.
- √ Quite a bit of practice making objects respond like collections, print themselves nicely, check for equality, and the like.
- It really bugs me to have my unit tests require even three seconds to run.
- A dictionary of dictionaries of objects containing collections is hard to think about.
- Creating named objects to cover the dictionaries helps a bit, but the structure is still nested and therefore hard to think about.
- √ Suspicion that there is some kind of way of thinking about this that would be easier.
- Maybe a DSL? (Domain Specific Language)
- Every single word in the list produces a set of scores in which at least one score is produced by exactly one word in the solutions.
- There must be some fundamental mathematical reason for the preceding fact.
- Finding out facts with unit tests is easy.
- Finding out facts with unit tests makes me need to end them with assert False, which is not great.
- Finding out facts with unit tests makes me skip a lot of tests after I get the info I want.
- Finding out facts with unit tests is still better than with REPL, because I save them.
- √ It might be good to write a little tool / GUI to do some of this.
- I have no idea just now what that should look like.
- If there is a really nifty clear way of scoring a Wordle guess, no one I know has found it.
- Doing several billion calculations is sometimes too much.
- √ It might be interesting and somewhat educational to program the actual game.
- √ I still have a tendency to try to go too fast sometimes, and I almost inevitably stumble when I do.
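The item in the list about iterators being realizable only once can be seen in a few lines. This is a minimal sketch with a made-up word list, not code from the project.

```python
words = ["berth", "frail", "grasp"]

# A generator expression and a map are both one-shot iterators.
lengths = (len(word) for word in words)
uppercased = map(str.upper, words)

first_pass = list(lengths)    # consumes the generator
second_pass = list(lengths)   # nothing left: an empty list

print(first_pass, second_pass)
print(list(uppercased), list(uppercased))  # the second list is empty too

# Realizing into a list once allows as many passes as needed.
kept = list(map(str.upper, words))
print(kept, kept)
```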
I’ll reflect on this list. In particular, as much as I like writing little tests to find things out, the technique is limited in some regards. Maybe I can think of some tooling for that. I know that Hill would: he calls it the “Making App”, I think.
I only got nerd-sniped into this because Ken Pugh has been playing with it, and the list above certainly suggests a lot of nerd fun. Fact is, though, I also spent a lot of time in nerd frustration, because I found the data hard to think about. It might almost be worthwhile to find a way of thinking that is easier for me. I think it was Alan Kay who said something like “Point of view is worth 80 IQ points”. I could have used those a few times in the past week or so.
What’s next? I don’t know. I promise that it will be interesting—at least to me. Join me to find out if it interests you, too.