Last night I had an interesting dream. It had to do with a thing like Finder or a file browser. I was manipulating little chip-sized tokens, dropping them into or onto the browser. They were basically functions, and they were manipulating the file names. Fundamentally they were map and reduce kinds of things, and they were helping me do whatever it was that I was doing.
I then sort of dreamed and sort of thought about the idea, and now that I’m awake (I believe), I’ve decided to write about it.
Expect that I’ll be using terms here metaphorically. I’ll not be trying to be precise in terminology, at least not yet, probably never. By the time I’m done, I’ll probably draw some rough pictures. I have no idea where this leads: probably nowhere. Probably, cubed. Very long odds on this one.
A file name is basically a very long string. In that string, the slash character (/) and the dot character (.) have somewhat special meanings. We think of the things between slashes as folder names, and the final bit after the last slash as the file name. The bit after the last dot is the extension.
In the Dream Folders context, these are just conventions. There is a thing called a file, and it has a name, which is just that very long string. We think of those file name strings as the fundamental name. Internally, are things organized into folders like /Users/ron/Documents? We don’t know. We do know that the name of this file I’m writing is
Now suppose we were looking at a list of all the file names on the whole system, just a flat list. It would be quite long, starting with
/Applications/Alfred.app /Applications/App Cleaner.app /Applications/App Store.app /Applications/Automator.app /Applications/Calculator.app /Applications/Calendar.app /Applications/Chess.app
with this file’s name somewhere in there:
and ending, I don’t know, with
… whatever that is. And there’d be upwards of half a million lines in the list.
We could filter those strings any way we wanted, maybe using regular expressions, because they’re really cool, or perhaps some simplified form of notation. We might find all the “md” files by filtering on
*.md, or, more generally, on
.*\.md$. Probably, assuming I’ve got that right. You know how it goes with regular expressions. If we did, right away we’d have that half a million items cut down to only 1782. If we looked for
.*/index.md$, we’d cut that right down to 1279. Nearly convenient.
If we looked for
.*/018-01ff/.*/index.md$ we’d only find 20, including this article. This would be almost useful, nearly practical.
What if we looked for
/Users/ron/Dropbox/rj-com-db/articles/.*$? That would list all the files in all the folders in my articles section. There’d be a lot. If we sorted the list, we see my sub-folders:
0000-experiments 015-10 015-11 016-0607
And so on. Each one would have one or more files in it but those folder names would repeat in the list.
What about some kind of viewing filters? If we just searched for
018-01ff, there’s a chance that string would show up in more than one place. And in fact it does, because there’s an old copy of my site somewhere else on my hard drive, from the era before I put it all in Dropbox, and when I left it in the old place for backup, until I was sure the Dropbox plan would work. Someday I should delete that, I suppose.
One obvious view, of course, would be to treat the bits of the names between slashes as folders, and to treat them as rooted at the left end, and to compress out the duplicates, and show it all as a hierarchy:
Users ron Dropbox rj-com-db articles 018-01ff abandon-1 agile-riff ... dream-folders
My, that would be clever. And there’s no reason why we couldn’t, say, double-click on the line that said
018-01ff and get a nice picture like this:
It’s all filters, map reduction, and such, working on the big long strings that are the “real” file names.
A digression: Ages ago, I mean like in the 1960s, I was working with a team on an operating system for the SDS Sigma 7, or maybe it was even before that, on the SDS 940. Anyway, I personally didn’t know about Unix, if it even existed yet. Somewhere though, someone on the team got the idea that files should have “tracks”. The file named
proggycould have a
.symtrack for its source code and a
.exetrack for its executable code, and so on. We picked 8 names, and packed the track identifier into three bits. Some of the tracks, like
.exe, had dedicated secure meanings. Others, like
.txt, were more open. Now in those days we didn’t have lots of memory. The 940 had 64,000 24 bit words. Characters were six bits, if I recall, so we had 256,000 characters of memory. The Sigmas were much larger: they had a full megabyte of memory! We had a good reason to pack things into a few bits. Nonetheless, we really kind of missed a bet by not letting our tracks be a string, even had we only allowed three characters.
But what about today? All the operating systems I know of make a more or less real distinction between a folder, or directory, and a file. Some of the applications I use, such as Scrivener, make less of a distinction internally. Scrivener folders can contain paragraphs, and documents can contain other documents. The distinction is mostly in the icon used to show the item, plus a little bit of leaning one way or the other. I think a folder’s default view might be index cards and a document’s might be its text.
But mostly, file systems make a pretty solid distinction between folder and file, and certainly don’t represent everything in a big flat structure of files with long names containing lots of slashes. But they could!
If I have half a million files on my computer and if the average name including folder names was 100 characters, that would still only be 50 megabytes. My laptop has 80 times that much main memory and 5000 times that much file storage. Two hundredths of one percent of my file storage would be taken up by file names if we wrote them all out longhand!
I’m tempted to generate the file just to see how easy it is to process it. Imagine you wanted to find all the files named index.md. Today, you’d have to do some kind of recursive descent through all the folders on the system. With the file names written out longhand, you’d just look at all the strings and check the end for
'/index.md'. That would be easier, and likely a lot faster as well.
Plus, of course, we know that the storage structure of an idea, and its information structure need not be the same. Suppose we’re building a file browser and it needs to be able to support ideas like “files whose names are index.md”. We’d surely start with some simple idea like “scan the list of all files and create a new list of those whose names end in ‘index.md’”. That would work fine, of course, and it might even be faster than a raw search today. (Mind you, my Mac builds an index and the index nearly works.)
So suppose we wanted that kind of query to go faster. No problem, we might create, first, a list
name was the file name at the end of the long string, and
index was the record number in the big long list of names. We might want to make that even faster, and produce a structure like
With that, we could look up the first character of the file name sought, and that would point us to a list of all the files whose name started with that initial.
Or we could just create
index was the index into the big list of
We could make these structures flat, or make them into chunked hierarchic or network structures on the hard drive. We’ve known storage structures like these for decades.
But we haven’t had 4 and 8 and 16 and 32 gig of main memory for decades. All the file names on my computer will fit, not in one gig, but 50 or maybe 100 megabytes. The machine code to search a file that is known to be a series of strings is incredibly fast. It’s quite likely faster than even one file access, even from the solid state drive on my laptop.
It’s tempting to do the experiment, isn’t it? But wait, there’s more.
As an object-oriented kind of guy, I’d probably represent the various physical structures as objects representing functions, typically maps or views or reductions. Then internally, those objects would have structures consisting of indexes and the like, whether structured ones in memory, or flat ones. I’d be tempted to work with flat ones, because I happen to have done that in the past and I know how amazingly fast that can be. But the details would be hidden in my objects.
But I was thinking today, what if you were doing it in LISP? LISP’s natural data structure is just lists, including numbers and string kinds of things like
(quote /Users/ron/Dropbox ...)
It’d be easy enough to create functions to return all the elements of a list that match some regex, and so on. And there’s nothing to stop you from writing something like this:
files_named_index = (quote (select_by_index (1 3 5 99) all_the_files))
The above, in case it isn’t clear, means to me to return the elements 1, 3, 5, and 99 from the second list, the big list of all the files. That might be the result of an earlier query about returning the numbers of the records in the big file list whose names are ‘index.md’. And notice, this is a list, which if evaluated will produce the list we want. It’s dynamic.
So whether we did all this in some OO language, or some functional language, we could readily represent and store alternative data structures, for speed or convenience. In a real LISP, we could even rewrite lists in a new form, or update them. Imagine that our user has renamed one of the ‘index.md’ files to something else. In renaming it, we of course notice that he renamed number 5. So, given our list
things_named_index, as above, all we have to do to update it is remove the 5 from the first portion of the
select_by_index. Poof, we have
files_named_index = (quote (select_by_index (1 3 99) all_the_files)).
We’re good to go. Time to update this index? A lot like zero.
There is a rub. (Only one? Well, no, but here’s the one I have in mind):
Removing that 5 from the list of files named index is pretty obviously correct, but as we build more and more different indexing structure, managing them gets more and more complex, and our clever approaches to updating become more and more ad hoc. And we really don’t want much ad-hoc in our file system, now, do we? We’d like it to work and to be really bullet-proof.
That would lead me back to “Extended Set Theory”, about which I’ve written in the distant past. XST is a mathematical representation of sets that has an important mathematical fix in it, having to do with a somewhat esoteric glitch in how set are axiomatized. More important to us here is that, because it is founded in very pure math–more pure than even relational theory–we can reason with XST and prove that our special structures really work.
This isn’t really exactly easier than ad hoc approaches, and in fact by the time you’ve coded it there’s probably still room for bugs, but once you get the basics of XST in place, things can get easier. Equally interesting is that we can readily build up and tear down structures that are custom-made for the storage hardware we have available. We could, in principle, build something that cached a lot of information in main memory, backed up by special indexing structures on the solid-state drive, backed up by main storage on the rotating drive. We could set up redundancy and so on, with a more formal, more reliable approach.
Then, maybe, we program it all up in something like Elixer or Erlang, where we can set up concurrent, even distributed processes to do the work.
All this is a long way from a settled issue, but it does suggest to me that, if not today, then someday soon, we might do better to have a file naming approach that’s more like a flat collection of strings, with clear functional operations on those strings, and clear convenient views over those operations.
Will we ever have that? I certainly don’t know. Unix has been with us nearly 50 years and is still going strong. But, hey, who knows what’s really inside there? Most of us don’t know, and don’t care.
So how’s that for a dream and a riff on the dream?