FAFO on GitHub

Having given Hill some bizarre advice last week, I propose to “Fool” Around and Find Out some things about the idea.

In the preceding article I reported on a weird and very limiting suggestion that the FGNO[1] gave to GeePaw Hill regarding a key-value store that he thinks he needs. He is not wrong, in my view: he does need such a thing. The question is what such a thing is, and how to get from here to there. As you can read in that article:

The solution we were proposing was to store the pages in a file folder (somewhere) with the tags all encoded in some trivial way into the file name: slug-refactoring_level-1_date-20240126T073905Z_status-released, with the contents being that version of the text.
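To make the encoding concrete, here is a minimal sketch of one way such a file name could be built and parsed. The function names encode_tags and decode_tags are hypothetical, and the sketch assumes that tag names contain neither "-" nor "_" and that values contain no "_". Nothing in the proposal promises that.

```python
def encode_tags(tags):
    """Join (name, value) pairs as name-value, separated by underscores."""
    return "_".join(f"{name}-{value}" for name, value in tags)


def decode_tags(file_name):
    """Split a file name back into (name, value) pairs."""
    pairs = []
    for part in file_name.split("_"):
        name, value = part.split("-", 1)  # split on the first hyphen only
        pairs.append((name, value))
    return pairs


tags = [
    ("slug", "refactoring"),
    ("level", "1"),
    ("date", "20240126T073905Z"),
    ("status", "released"),
]
file_name = encode_tags(tags)
print(file_name)
print(decode_tags(file_name) == tags)
```

Even this tiny version shows where the scheme pinches: a value containing an underscore would break the round trip.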

Today, and for the next few days, I propose to FA[2] with that idea, or even simpler versions of that idea, to FO[3] what I can learn about the problem and potential solutions. I propose to do that in a clean code base. I set up a new PyCharm project, FAFO. I decided to follow the PyCharm page on using pytest, and now I have this class and test:

class Tag:
    pass

class TestTag:
    def test_tag(self):
        assert True

Why those? Here are some thoughts on the “problem” I’m addressing:

KV Store
We’re working on a curriculum for students. The curriculum basically consists of a large number of documents. Hundreds, possibly a few thousand. I think of these documents as containing a day’s lesson, or part of a day’s lesson. Doesn’t matter, they are just documents from the viewpoint of our storage scheme, currently named KV Store.

Each document has an arbitrary number of “tags”. A tag is an ordered pair of strings, tag name and tag value. (“author”, “ron”) sort of thing.

Document retrieval will consist of specifying one or more tags, and the KV Store will return, with all their tags, all the documents that have the tags provided. If we specify (author, ron), we’ll get all my documents, with all their tags.

I expect that some tag names can occur more than once on a document: (author, ron), (author, hill). Other tags, I imagine, will be allowed to occur only once (released, true) xor (released, false). I will not be surprised to find that a document has exactly one timestamp (time, 20240128T082634Z)
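As a rough picture of these rules, assuming plain tuples and sets rather than any class we have yet to write, retrieval is a subset check: a document matches when every requested tag is among its tags. The names documents and matching are made up for illustration.

```python
# Each document is a (tags, text) pair; tags are a set of (name, value) tuples.
# A tag name such as "author" may appear more than once on one document.
documents = [
    ({("author", "ron"), ("author", "hill"), ("released", "true")}, "lesson one"),
    ({("author", "hill"), ("released", "false")}, "lesson two"),
]


def matching(query, docs):
    """Return every (tags, text) pair whose tags include all the query tags."""
    return [(tags, text) for tags, text in docs if query <= tags]


ron_docs = matching({("author", "ron")}, documents)
hill_docs = matching({("author", "hill")}, documents)
print([text for _, text in ron_docs])
print([text for _, text in hill_docs])
```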

The various operations that will need to be done on these documents will arise, but will clearly include various flavors of creation, reading, updating, and deleting. Why? Because that’s all there is.

Possible “Solutions”

As experienced developers, we know many systems and tools for “solving” this “problem”. Some of us know how to solve it with a relational database, using SQL. Others wisely point out that a hierarchic database would be just the thing, but unfortunately those pretty much got wiped out by relational. Some would say that we need a NoSQL solution, and point to existing document databases and general key-value stores.

Some few would say to ignore all that and store the data first, in memory, and then, later, on files, until one day it becomes clear what one really needs. One of those few is writing this article. And he wants to tell you why he proposes to do such an obviously wrong thing.

Why does he want to do such an obviously wrong thing?

The reason is that he wants to focus on the operations needed by this application first, developing “abstractions” that make sense in the context of a curriculum. Then, with that in hand, he asserts, we can “plug in” a suitable practical solution.

And, because he’s writing the article, that’s what we’re going to do.

The Tag

I think a tag is an immutable ordered pair of name and value. That’s not going to need a lot of testing, but here goes:

    def test_creation(self):
        tag = Tag("author", "ron")
        assert tag.name == "author"
        assert tag.value == "ron"

And, more or less obviously:

class Tag:
    def __init__(self, name, value):
        self._name = name
        self._value = value

    @property
    def name(self):
        return self._name

    @property
    def value(self):
        return self._value

More obvious would be to have used name and value as the members and allow direct access. I didn’t want to do that: I’m trying to be a bit strict here, for some reason. Let’s do put in a test:

    # Requires `import pytest` at the top of the test file.
    def test_modification(self):
        tag = Tag("author", "ron")
        with pytest.raises(AttributeError):
            tag.value = "not gonna happen"

That test passes. Yay!
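For comparison, and only as a sketch of an alternative rather than what this article uses, the standard library’s frozen dataclass gives the same read-only behavior with less typing. The class name FrozenTag is hypothetical. Assigning to a field raises dataclasses.FrozenInstanceError, which is a subclass of AttributeError, so a test like the one above would still pass.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class FrozenTag:
    name: str
    value: str


tag = FrozenTag("author", "ron")
try:
    tag.value = "not gonna happen"
    modified = True
except AttributeError as error:
    # FrozenInstanceError is caught here because it derives from AttributeError.
    modified = False
    error_type = type(error)

print(modified, tag.value)
```

A frozen dataclass also generates equality for free, which may or may not be what we want once we start comparing tags.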

Now I do suspect that we’ll want to change the values of tags on a document. Our document has (released, false) and we want (released, true). Or something like that. We’re not going to code speculatively on that: we’re going to come up with scenarios, use cases, situations, that describe what we really need to do and let that drive the implementation of such things.

I think the Tag is done, for now. Create GitHub project, commit: Initial Commit.

So that’s nice. What now? I think I want to create an initial tagged storage thing to work with. Let’s call it MemoryStore, although I hate to waste such a good name already. We have rename.

Memory Store

I’ll begin with a test, my standard first test in a new test file:

class TestMemoryStore():
    def test_hookup(self):
        assert 2 + 1 == 4

My tests automatically run, but this one is not picked up. The pinned tab just runs the tag tests. I need a new configuration.

I edit the existing one into submission, like this:

[Image: config file]

Then make the hookup pass:

class TestMemoryStore():
    def test_hookup(self):
        assert 2 + 2 == 4

Commit: new config and passing hookup for memory store.

I think I’d like to be a bit more explicit about types here. Like specifying that the type of name and value is string, for example. I’ll hold off on that for a bit.

What would be a good first test for our MemoryStore? I suppose putting something in, with some tags, and getting it back out.

What does the memory store contain? Tagged documents. Does this mean that we need a Document class? Perhaps, but not yet. Soon, probably, but not yet. I have my eye on the MemoryStore, right or wrong. I think it doesn’t care what it stores. It just takes tags and a thing. First we need to have one:

    def test_store_exists(self):
        store = MemoryStore()

Curiously, that does not pass. Perhaps it is the total absence of the class MemoryStore. Let’s alleviate that concern.

PyCharm offers to create the class locally. We’ll allow it and move it soon:

class MemoryStore:
    pass

Tests are green, mize well commit: initial test and MemoryStore class in test file.

Let’s store and retrieve something. I think storing will take an item, but retrieving will return a list of items. Why? Because more than one thing might have the keys we search on. (author, ron) will return 5600 items.

And it becomes clear that we need a new kind of object here, with a tag collection and a document (string).

We’ll get there. First the storing part of the test:

    def test_can_store(self):
        store = MemoryStore()
        tags = [Tag("author", "ron"), Tag("title", "nature")]
        document = "these are the facts"
        store.add_document(tags, document)

I do not love having to type out all those tags, and I’m not really sure that add_document is what we’ll want. We might want to create a document with tags and then add it. We’ll learn those things as we use the classes.

I need add_document:

class MemoryStore:
    def __init__(self):
        self._store = []

    def add_document(self, tags, document):
        self._store.append((tags, document))

Test passes. Commit: first add_document. Let’s test fetching:

    def test_can_fetch(self):
        store = MemoryStore()
        tags = [Tag("author", "ron"), Tag("title", "nature")]
        document = "these are the facts"
        store.add_document(tags, document)
        pairs = store.fetch([Tag("author", "ron")])
        tags, document = pairs[0]
        assert document == "these are the facts"

I made some arbitrary decisions here. Notably, the fetch method returns a list of pairs, each one containing a tags collection and a document (string). Here, we expect just one. And I feel sure we’ll want a named class, not just a pair. We’ll get there, probably.

    def fetch(self, tags):
        result = []
        for item in self._store:
            item_tags, item_document = item
            if all((tag in item_tags for tag in tags)):
                result.append(item)
        return result

This checks each item in the store to see whether all the provided tags are in that item’s tags and adds it to the result if so. We need an equality check on Tag:

    def __eq__(self, other):
        if isinstance(other, Tag):
            return self.name == other.name and self.value == other.value
        else:
            return NotImplemented

I’ve seen somewhere that that’s the approved way to do that. I could imagine others that I might prefer. But the convention in equality is to return NotImplemented if the comparison can’t be made, because Python will try things the other way around on the off chance.

Shall we test that? Sure, we should do.

    def test_does_not_equal_non_tag(self):
        tag = Tag("author", "ron")
        assert tag != ("author", "ron")

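That passes. I thought I could catch a NotImplementedError, but that’s not the case: NotImplemented is a value that __eq__ returns, not an exception that it raises, so there is nothing to catch. Python tries the comparison the other way around behind the curtain and, when both sides decline, falls back to comparing identities, so == comes out False and != comes out True.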
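A small sketch of what goes on behind the curtain, using a cut-down copy of Tag (plain attributes, no properties) so that it runs on its own. Calling __eq__ directly shows the NotImplemented value, while the operators quietly produce ordinary booleans.

```python
class Tag:
    # Cut-down stand-in for the article's Tag, just enough to compare.
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __eq__(self, other):
        if isinstance(other, Tag):
            return self.name == other.name and self.value == other.value
        return NotImplemented


tag = Tag("author", "ron")
pair = ("author", "ron")

direct = tag.__eq__(pair)  # the raw return value; no exception is raised
equal = tag == pair        # both sides decline, so Python compares identities
not_equal = tag != pair
print(direct, equal, not_equal)
```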

It did occur to me that it might be convenient to allow sequences like (“author”, “ron”) to be equal to Tags, but that seems likely to lead to trouble. We’ll not do that, at least not for now.
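One kind of trouble worth noting, as a side remark rather than something the tests so far exercise: a class that defines __eq__ without also defining __hash__ becomes unhashable, so such a Tag cannot go into a set or serve as a dict key. And if a Tag ever did compare equal to a tuple, the two would have to hash alike as well. The sketch below uses two hypothetical classes, PlainTag and HashableTag, to show the difference.

```python
class PlainTag:
    """Cut-down Tag with __eq__ only; Python sets its __hash__ to None."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __eq__(self, other):
        if isinstance(other, PlainTag):
            return self.name == other.name and self.value == other.value
        return NotImplemented


class HashableTag(PlainTag):
    """Same equality, plus a hash computed from the same two fields."""

    def __hash__(self):
        return hash((self.name, self.value))


try:
    {PlainTag("author", "ron")}
    plain_is_hashable = True
except TypeError:
    plain_is_hashable = False

# Two equal tags collapse to one entry in a set.
unique = {HashableTag("author", "ron"), HashableTag("author", "ron")}
print(plain_is_hashable, len(unique))
```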

Reflection

I took a break to make an iced chai, so let’s reflect.

We can see some objects that seem to be nascent. TagGroup, maybe, Document, probably, TaggedDocument, probably.

I was thinking I’d do one more test, where we have two documents, maybe same author different title, and we fetch on author and get both documents. But having taken the break, I think we’ll wrap up here.
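For the record, that test might look roughly like the sketch below. The Tag and MemoryStore here are condensed copies of the classes above so that the example runs by itself, and the test name and the extra document texts are invented.

```python
class Tag:
    def __init__(self, name, value):
        self._name = name
        self._value = value

    @property
    def name(self):
        return self._name

    @property
    def value(self):
        return self._value

    def __eq__(self, other):
        if isinstance(other, Tag):
            return self.name == other.name and self.value == other.value
        return NotImplemented


class MemoryStore:
    def __init__(self):
        self._store = []

    def add_document(self, tags, document):
        self._store.append((tags, document))

    def fetch(self, tags):
        # Keep each item whose tags include every requested tag.
        return [item for item in self._store
                if all(tag in item[0] for tag in tags)]


def test_fetch_returns_all_matches():
    store = MemoryStore()
    store.add_document([Tag("author", "ron"), Tag("title", "nature")], "these are the facts")
    store.add_document([Tag("author", "ron"), Tag("title", "nurture")], "these are the habits")
    store.add_document([Tag("author", "hill"), Tag("title", "nature")], "another view")
    documents = [document for _, document in store.fetch([Tag("author", "ron")])]
    assert documents == ["these are the facts", "these are the habits"]


test_fetch_returns_all_matches()
```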

One semi-deep thought. This stems, at least in part, from my set-theoretic database mostly debacle past. That leads me to think that what we have here is not really a MemoryStore, at least not one that the user-programmer looks at. It is a particular implementation of a more general thing, a TaggedDocumentSet.

Now this probably flies in the face of my starting proposition, which was to do something really simple, but … the idea of the simple “solution” was to eliminate all the assumptions about the database and focus on what kind of objects and operations we really want.

In that light … a document set that happens to be on the drive, or on the Internet, or in memory … is all the same kind of thing from the viewpoint of someone writing the curriculum management stuff. I’ll try not to go wild with actual set theory, but the idea that these collections are all the same except for where they live … might be of value. We’ll try to keep that idea warm.
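One way to keep that idea warm, sketched with hypothetical names throughout (TaggedDocumentSet, InMemoryDocumentSet, documents_by), is a small protocol that any document set could satisfy, wherever it lives. Tags are plain (name, value) tuples here just to keep the sketch short, and it assumes Python 3.9 or later for the built-in generic types.

```python
from typing import Iterable, Protocol

TagPair = tuple[str, str]
Item = tuple[frozenset[TagPair], str]


class TaggedDocumentSet(Protocol):
    """What curriculum code needs, regardless of where documents live."""

    def add_document(self, tags: Iterable[TagPair], document: str) -> None: ...

    def fetch(self, tags: Iterable[TagPair]) -> list[Item]: ...


class InMemoryDocumentSet:
    """One possible home for the documents: a list in memory."""

    def __init__(self) -> None:
        self._items: list[Item] = []

    def add_document(self, tags: Iterable[TagPair], document: str) -> None:
        self._items.append((frozenset(tags), document))

    def fetch(self, tags: Iterable[TagPair]) -> list[Item]:
        wanted = frozenset(tags)
        return [item for item in self._items if wanted <= item[0]]


def documents_by(author: str, documents: TaggedDocumentSet) -> list[str]:
    """Curriculum-level code sees only the protocol, not the storage."""
    return [text for _, text in documents.fetch([("author", author)])]


library = InMemoryDocumentSet()
library.add_document([("author", "ron"), ("title", "nature")], "these are the facts")
library.add_document([("author", "hill"), ("title", "change")], "many more steps")
print(documents_by("ron", library))
```

A file-backed or networked set would need only the same two methods for documents_by to work unchanged.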

Summary

We really just have a few tests and a few sketchy objects, but a bit of a shape is starting to come out of the fog. You may still think this is a terrible idea: you could even be right. I might even learn that later … or even worse, never come to see the truth. But what you see here is what I would really try to do when faced with a “big” question, “what kind of database”. I’d write a lot of small objects to see what the problem really is.

I would try not to face the user-programmer with built-in objects to manipulate, though I would try using some native collections naturally, where they are easier to use. But in general, I’d be moving toward objects that make sense in the problem space, in this case, documents with lots of varied tags.

What will happen? I am excited to find out! See you next time!



  1. Friday Geeks Night Out, a Zoom meeting we hold each Tuesday evening. 

  2. Fool Around. What did you think it stood for? 

  3. and Find Out.