A Bit of Information. W10.
I can’t explain why, but I’ve decided it will be interesting to try the information theory ideas I’ve found online.^{1}
I’ll be basing this article on Grant Sanderson’s video about Wordle and information theory. Grant^{2} has a delightful series of videos on things mathematical, if you’re into that sort of thing, which, for my sins, I often am.
I don’t plan to go as far as Grant did, but just to calculate the basic expected value of information for each guess, scored against the actual answers.
That value is the sum over all the scores of the probability of that score times the log base two of one over the probability.
The probability of a given score is the number of words from the solutions set that receive that score, over the total number of words in the set.
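In code, that formula is only a few lines. Here’s a minimal, self-contained sketch, with a function name and plain-list input of my own choosing (the article’s real method comes later):

```python
from math import log2

def expected_information(bucket_sizes):
    """Expected information (in bits) of a guess, given the sizes of
    its score buckets: the sum over buckets of p * log2(1/p), where p
    is the bucket size over the total word count."""
    total = sum(bucket_sizes)
    return sum((n / total) * log2(total / n) for n in bucket_sizes)

# An uneven split: buckets of 1, 1, and 2 words out of 4 total.
# (1/4)*2 + (1/4)*2 + (1/2)*1 = 1.5 bits.
print(expected_information([1, 1, 2]))  # 1.5
```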
Our SolutionDictionary has, for each word we might guess, a dictionary from guessed word to a GuessDescription, which itself has a dictionary from score (the 243 possible scores that a word might get) to the solutions that produce that score.
DogGONE but I do find this hard to think about.^{3} Anyway, we want each GuessDescription to be able to calculate its expected information.
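A tiny concrete fragment may help. This is my own toy sketch of the nesting just described, using the two words that show up in a printout later in the article:

```python
# Hypothetical fragment of the nested structure:
# guess word -> { score -> [solutions that produce that score] }
solution_dictionary = {
    "abate": {
        22000: ["abhor", "abyss"],  # one of abate's 84 score buckets
        # ... the other 83 buckets for this guess
    },
    # ... one entry for every word we might guess
}
print(solution_dictionary["abate"][22000])  # ['abhor', 'abyss']
```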
I think I’ll do a new test suite for today’s work. Because I have trouble remembering just what is in there, I write my first test to provide me some reeducation:
def test_first_info(self):
    all_solutions = WordCollection.from_file("valid_solutions.txt")
    word = Word("abate")
    gd = GuessDescription(word, all_solutions)
    keys = gd.score_descriptions.keys()
    print(22000, gd.score_descriptions[22000])
    assert len(keys) == 84  # just because it is.
    assert False
Sure enough, what prints is a ScoreDescription:
22000 ScoreDescription(22000:
..[Word(abhor), Word(abyss)]
Now the values we want to sum up involve the probability of each of these ScoreDescriptions, which is the number of words in the ScoreDescription, over the total number of words in the solutions list. Let’s verify that number in our test.
def test_first_info(self):
    all_solutions = WordCollection.from_file("valid_solutions.txt")
    word = Word("abate")
    gd = GuessDescription(word, all_solutions)
    keys = gd.score_descriptions.keys()
    print(22000, gd.score_descriptions[22000])
    assert len(keys) == 84  # just because it is.
    total_words = sum(len(sd.words) for sd in gd.score_descriptions.values())
    assert total_words == 2315
That passes, because there are 2315 words in the solutions, and therefore there will be 2315 words in each GuessDescription, partitioned among some number of scores.
I think I’ll have to invent some fake data for a test I can trust, but let’s drive out the method first.
def test_first_info(self):
    all_solutions = WordCollection.from_file("valid_solutions.txt")
    word = Word("abate")
    gd = GuessDescription(word, all_solutions)
    keys = gd.score_descriptions.keys()
    print(22000, gd.score_descriptions[22000])
    assert len(keys) == 84  # just because it is.
    total_words = sum(len(sd.words) for sd in gd.score_descriptions.values())
    assert total_words == 2315
    expected_info = gd.expected_value()
    assert expected_info == pytest.approx(4.63, abs=0.05)
The answer there at the bottom is the answer that the code produces. I still need a test with known results, but that drove out this code:
from math import log2

class GuessDescription:
    def __init__(self, guess_word, solutions):
        self.guess_word = guess_word
        self.score_descriptions = {}
        self.solutions_count = 0
        for solution in solutions:
            self.solutions_count += 1
            score = guess_word.score(solution)
            self.classify_word(score, solution)

    def expected_value(self):
        total = 0
        for score_description in self.score_descriptions.values():
            probability = score_description.probability(self.solutions_count)
            log = log2(1 / probability)
            total += probability * log
        return total

class ScoreDescription:
    def probability(self, total):
        return len(self.words) / total
We record the total number of solutions. Could do that with len, I guess.^{4} Sorry, I was thinking loops at the time.
For the expected value, which we should rename expected_information^{4}, we ask the ScoreDescription for its probability, which is of course just the number of words it has over the total. We return the value to be summed, which is the probability times log base 2 of 1/probability, because it is.
I think this is good, but let’s create a GuessDescription we know the answer for.
def test_known_values(self):
    solutions = WordCollection.from_strings("azzzz", "zbzzz", "zzczz", "zzzdz")
    guess = Word("dcbay")
    gd = GuessDescription(guess, solutions)
    keys = set(gd.score_descriptions.keys())
    expected = {10, 100, 1000, 10000}
    assert keys == expected
    info = gd.expected_information()
    assert info == 2
These handcrafted solutions each find one character, in the wrong position, given our sample guess “dcbay”. So each score key will have one word in it, each with probability 1/4, and the log2 of 4 is 2, so each one contributes 2/4, and there are four of them, so the answer is two bits of information. Another way of thinking of it: there are four equally probable answers, and that’s two bits of info, because each bit of information divides our solution space in half.
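That arithmetic is easy enough to check in a throwaway snippet (purely my own sanity check, not part of the test suite):

```python
from math import log2

p = 1 / 4                     # each of four score buckets holds one word
per_bucket = p * log2(1 / p)  # 0.25 * log2(4) = 0.25 * 2 = 0.5 bits
total = 4 * per_bucket        # four such buckets
print(total)  # 2.0
```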
So I am confident in this result and the expected_information function.
What do we want? Maximum Information! When do we want it? Now!
So let’s add that info into our statistics. Stats are calculated by SolutionDictionary and recorded in the dumb object Statistic:
def create_statistics(self):
    stats = []
    for word in self.dict:
        guess_description = self.dict[word]  # {score -> scored words}
        expected_info = guess_description.expected_information()
        number_of_buckets = guess_description.number_of_buckets
        max_words = max(len(bucket) for bucket in guess_description.buckets)
        min_words = min(len(bucket) for bucket in guess_description.buckets)
        avg_words = sum(len(bucket) for bucket in guess_description.buckets) / number_of_buckets
        stat = Statistic(word, number_of_buckets, max_words, min_words, avg_words)
        stats.append(stat)

    def my_key(stat: Statistic):
        return stat.number_of_buckets

    stats.sort(key=my_key)
    return stats
I propose to add the expected information to the statistic and sort on it.
Thwarted
I was going to use my conveniently-saved massive solution dictionary … but I’ve changed the form of the GuessDescription. So I must recreate it. Stand by.
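That’s the price of caching with pickle: objects are stored by class reference, so when a class’s shape changes, old pickled instances come back without the new attributes and the cache has to be rebuilt and re-dumped. A minimal sketch of the pattern, with a plain dict (and stand-in data) in place of the heavyweight SolutionDictionary:

```python
import pickle

# Build the expensive structure once (a dict stands in here), dump it,
# and reload it instantly on later runs instead of rebuilding.
expensive = {"soare": 5.89, "qajaq": 1.89}  # stand-in data
blob = pickle.dumps(expensive)              # what pickle.dump writes to the file
restored = pickle.loads(blob)               # what pickle.load reads back
print(restored == expensive)  # True
```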
OK, ready to go, let’s add our new statistic and change the sort.
# @pytest.mark.skip("working")
def test_statistics(self):
    with open("/users/ron/Desktop/SD.pcl", "rb") as pick:
        sd = pickle.load(pick)
    stats = sd.create_statistics()
    print(Statistic.header)
    for stat in stats[0:20]:
        print(stat)
    print("...")
    for stat in stats[-10:]:
        print(stat)
    assert False
class Statistic:
    def __init__(self, word, number_of_buckets, max_words, min_words, avg_words, expected_info):
        self.word = word
        self.number_of_buckets = number_of_buckets
        self.max_words = max_words
        self.min_words = min_words
        self.avg_words = avg_words
        self.expected_info = expected_info

    def __repr__(self):
        return (f"{self.word.word} {self.expected_info:5.2f} {self.number_of_buckets:4d} {self.min_words:5d} "
                f"{self.avg_words:7.2f}{self.max_words:5d}")

    @classmethod
    @property
    def header(cls):
        return "\nWord Info Buckets Min Avg Max"
class SolutionDictionary:
    def create_statistics(self):
        stats = []
        for word in self.dict:
            guess_description = self.dict[word]  # {score -> scored words}
            expected_info = guess_description.expected_information()
            number_of_buckets = guess_description.number_of_buckets
            max_words = max(len(bucket) for bucket in guess_description.buckets)
            min_words = min(len(bucket) for bucket in guess_description.buckets)
            avg_words = sum(len(bucket) for bucket in guess_description.buckets) / number_of_buckets
            stat = Statistic(word, number_of_buckets, max_words, min_words, avg_words, expected_info)
            stats.append(stat)

        def my_key(stat: Statistic):
            return stat.expected_info

        stats.sort(key=my_key, reverse=True)
        return stats
And here’s my result:
Word Info Buckets Min Avg Max
soare 5.89 127 1 18.23 183
roate 5.88 126 1 18.37 195
raise 5.88 132 1 17.54 168
raile 5.87 128 1 18.09 173
reast 5.87 147 1 15.75 227
slate 5.86 147 1 15.75 221
crate 5.83 148 1 15.64 246
salet 5.83 148 1 15.64 221
irate 5.83 124 1 18.67 194
trace 5.83 150 1 15.43 246
arise 5.82 123 1 18.82 168
orate 5.82 127 1 18.23 195
stare 5.81 133 1 17.41 227
carte 5.79 146 1 15.86 246
raine 5.79 129 1 17.95 195
caret 5.78 145 1 15.97 246
ariel 5.78 125 1 18.52 173
taler 5.77 134 1 17.28 196
carle 5.77 144 1 16.08 249
slane 5.77 133 1 17.41 225
...
gyppy 2.29 41 1 56.46 1380
kudzu 2.26 29 1 79.83 1369
fuffy 2.24 32 1 72.34 1374
jaffa 2.23 35 1 66.14 1247
pzazz 2.21 25 1 92.60 1165
yukky 2.21 27 1 85.74 1368
xylyl 2.19 28 1 82.68 1334
immix 2.05 34 1 68.09 1429
jujus 2.04 23 1 100.65 1353
qajaq 1.89 18 1 128.61 1369
I believe this result, because I see in my list words that other explorers have found to be of high value, like soare, raise, crate, slate, salet, even taler. I also believe it because my tests run and I actually have a handcrafted data set with known expected info.
Commit: Test and calculate information provided by each guess vs the solutions set.
Summary
OK. Did what I set out to do: computed expected info, and sorted the words to get a look at which ones provide the best information as a first guess. Life is good and it was easy enough to do.
I know that the intrepid and not at all irritating GeePaw Hill, who started out on this venture at the same time I did, has already kissed this project up to heaven, and I think that I will probably do the same. As listed last time, I’ve learned some useful things about Python and about what works well for me. I’ll probably do one more article summarizing the important items from the long list of learnings. But you never know what I’ll do. I don’t even know what I’ll do.
I hope you’ll check in and find out with me, next time!

Yes, of course I intended the pun. What kind of person do you think I am? ↩

I freely grant^{5} that I called Grant “Brian” in at least one earlier article. That’s because there is an author Brian Sanderson some of whose work I have read and it stuck in my mind. I think I have fixed all those mistakes. I’m sure that may other mistakes continue to exist^{6}. ↩

That’s not the word I actually used. ↩

Oops, I did it again!^{7} ↩

There’s one right there. I left it in because irony. ↩

Yes, I know about the Britney Spears song. I’m sorry. ↩