Files, Formats and Byte Arrays

The “file” and the “file system” are ubiquitous concepts employed by all the major operating systems and hard to avoid in mainstream programming. In this episode, I take a critical look at the notion of a file to point out the many shortcomings lurking in this simple abstraction.

Conceptually, a file is just a named array of bytes. A file oriented system promotes the idea that:

All concepts must be represented as one-dimensional arrays of bytes.

This is incredibly flexible but also incredibly tedious. There are deep, often overlooked implications of this idea.

File Boundaries

Physical boundaries are the immutable reality of our material world. Two pieces of paper are distinct, thus copying printed text from one to the other must necessarily produce a physical copy such that the copied matter is completely disconnected from the original. However, this doesn’t have to be the case in machines that manipulate virtual constructs! We should be able to manipulate abstract concepts that may be interconnected in any which way, without having to sever those connections. The laws of physical materials don’t apply to made-up constructs in computers.

Yet, ‘files’ mimic this severe shortcoming of the material world—hard boundaries and completely detached copying.

We really want to represent all kinds of interconnected concepts in the system, but we're forced to work with disjoint clumps of bytes.

Of course, we can represent links between concepts by encoding them as special formats. Once we start looking for these we find them all over the place—Git’s tree files and blobs, HTML’s links and embedded images, import statements in programming languages, and so on. Pretty much everywhere you have related information split across files you end up with cross links and inclusions.

But why, in a machine capable of representing all such links, is the default mode of operation connection blindness rather than connection awareness?

What I mean by connection blindness is this: if I explore a typical file system I'll find a large number of named “clumps of bytes” but with no knowledge of the structure or interconnections of the concepts they represent.

One exception to this is the nesting of directories, which provides one system-visible dimension that can be used to encode certain interconnections. For instance, in Python a directory is a “package” and each file within it is a contained “module”. The 'module X is in package Y' relationship is visible to the system (partially, as “X is in Y”), so we can use generic system tools on it, such as zip to archive a Python package or ls to explore its contents. Contrast this with the import statement in Python: that relationship between modules is completely hidden from the system.
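To make the contrast concrete, here is a minimal Python sketch. The package name pkg and the files a.py and b.py are made up for illustration. Containment is answered by the file system alone, while the import relationship only appears once the bytes are parsed as Python source.

```python
import ast
import os
import tempfile


def contained_files(package_dir):
    """Containment: answered by the file system, no knowledge of Python needed."""
    return sorted(os.listdir(package_dir))


def imported_modules(source_path):
    """Imports: recoverable only by interpreting the bytes as Python source."""
    with open(source_path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return sorted(names)


def demo():
    # Build a throwaway package; all names here are hypothetical.
    with tempfile.TemporaryDirectory() as root:
        pkg = os.path.join(root, "pkg")
        os.mkdir(pkg)
        with open(os.path.join(pkg, "a.py"), "w", encoding="utf-8") as f:
            f.write("import json\nfrom pkg import b\n")
        with open(os.path.join(pkg, "b.py"), "w", encoding="utf-8") as f:
            f.write("X = 1\n")
        return contained_files(pkg), imported_modules(os.path.join(pkg, "a.py"))


files, imports = demo()
print(files)    # what any generic tool can see
print(imports)  # what only a Python-aware tool can see
```

The first function would work unchanged on a directory of images or spreadsheets. The second is useless outside Python, and every other language needs its own equivalent.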

Why must every application define its own formats and reconstruct these interconnections (temporarily, in-memory) from disjoint clumps of bytes? Why are the links themselves only visible to the select few with special knowledge of how to interpret the bytes? Why can't a simple concept such as 'content A here is linked to content B there, via transformation T' be directly represented in the (file) system, instead of indirectly as per some file format? Which brings me to my second point.

File Formats

We use computers to organize, manipulate and transmit arbitrary constructs and high level media (such as pictures). The “file” idea requires that we first map these higher level concepts into byte arrays—before transmission or persistence. The mapping between the concepts and the byte arrays—called the file format—is essential to interpretation of the bytes. However, the file format itself is transmitted entirely out of band. The “meaning” of the file can only be extracted when the file contents are viewed through the lens of the correct file format.
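A small sketch of what “out of band” means in practice, using Python's struct module. The four bytes and the three candidate interpretations are arbitrary choices of mine. Nothing in the bytes themselves says which reading is the intended one.

```python
import struct

# Four bytes with no accompanying format information.
payload = bytes([0x00, 0x00, 0x80, 0x3F])

# Three equally valid "lenses" over the same bytes.
as_float = struct.unpack("<f", payload)[0]   # little-endian 32-bit float
as_uint = struct.unpack("<I", payload)[0]    # little-endian unsigned 32-bit integer
as_channels = struct.unpack("4B", payload)   # four independent 8-bit values

print(as_float)     # 1.0
print(as_uint)      # 1065353216
print(as_channels)  # (0, 0, 128, 63)
```

Which of these is the “meaning” of the file depends entirely on knowledge the reader brings from elsewhere.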

While knowledge of file formats is critical, there is no standardized representation for a “file format” definition.

This makes evolution and invention painful and slow because it requires knowledge of any file format to be pre-shared. This can only work at scale if we have a small set of well known formats that is globally agreed upon—welcome to the sad state of the world today.

If the notion of “file format” were reified (and say, itself represented as a file), it could be transmitted with the file and new formats would be trivial to transmit. I'll only touch upon the problem here: it would mean we first concretize concepts that are not byte-arrays into our system, so we can define the mapping to these. Further, why only have a flat, one-level mapping from higher level concepts to bytes? Maybe we can map concepts to other concepts; there is plenty of room to explore here. One final point on this note: parsing grammars and the Erlang bit syntax are two examples of reification of formats, demonstrating that we keep reinventing the general idea in lesser forms with limited applicability.
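As a rough illustration of a format travelling with its data, here is a sketch in Python. The envelope layout (a 2-byte length, a JSON description, then the payload) and the field names are my own invention for this example, not an existing standard.

```python
import json
import struct


def pack_with_format(field_names, struct_format, values):
    """Prefix the payload with a description of its own layout."""
    description = json.dumps(
        {"fields": field_names, "struct": struct_format}
    ).encode("utf-8")
    header = struct.pack(">H", len(description))  # 2-byte description length
    return header + description + struct.pack(struct_format, *values)


def unpack_with_format(blob):
    """Decode a payload using only the description shipped alongside it."""
    (desc_len,) = struct.unpack(">H", blob[:2])
    description = json.loads(blob[2:2 + desc_len].decode("utf-8"))
    values = struct.unpack(description["struct"], blob[2 + desc_len:])
    return dict(zip(description["fields"], values))


blob = pack_with_format(["width", "height", "gamma"], "<HHf", (640, 480, 0.5))
decoded = unpack_with_format(blob)
print(decoded)  # {'width': 640, 'height': 480, 'gamma': 0.5}
```

The receiver needs no prior knowledge of this particular record, but it still has to know the envelope layout, JSON and the struct mini-language in advance. The pre-shared knowledge has shrunk and become reusable, not vanished.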

It’s (probably) impossible to have a scheme where you have zero shared knowledge and can still transmit concepts between two systems. But shouldn’t minimizing pre-shared knowledge be an essential goal of our system design, rather than maximizing it, as we do with files? One tangent here is the encapsulated object idea from Alan Kay and Smalltalk—i.e. organize your system as encapsulated objects (not exposed byte-arrays) but that is a topic for another time.

Composability and Encapsulation

There isn’t much to write here because the file concept doesn’t have much of either. Not much is hidden: the guts are completely exposed. You can’t “compose” a new file from some existing files, not in the sense of how we can compose functions or objects to make higher level ones. All higher level concepts exist outside the file concept: within running programs, source code and people’s heads.

In “Our operating systems are incorrectly factored”, Tony Garnock-Jones writes:

It’s not only a Unix problem. Windows and OS X are just as bad. They, too, offer no higher-level model than byte sequences to their applications. Even Android is a missed opportunity.

Joel Jakubovic has a long essay with an overlapping theme called “There is only one OS, and it’s been obsolete for decades”. An excerpt:

Anyway, once you supply Unix with a name, it hands back to you a stream of bits. Now, despite that this is pretty indisputably the lowest-level picture you could possibly get of anything, at least we can build on top of it.

I wrote previously about misfeatures of plain text. Looking deeper, we might see that those misfeatures are really concrete examples of the general problems I describe above.

Finally, this discussion would be grossly incomplete without mentioning the ideas of transclusion and xanalinks from Ted Nelson. Nelson defines transclusion as “the same content knowably in more than one place”—which would support a provenance-aware copy of a construct—very different from the detached byte-array copying we do with files. In Xanalogical Structure..., Nelson writes:

Our intention has been not merely to create an electronic literary structure, but to import literary concepts into a redesign of the rest of the software world. We sought to reduce the influence of hierarchical directories and conventional files (which we see as large lumps with stuck names in fixed places, with compulsory gratuitous naming—unsuited to overlap, interpenetration, rich connectivity, reasonable backtracking, and most human thinking and creative work.)

Byte-array Orientation

While I've been harping on about files, everything I’m saying really applies to all opaque byte-array based constructs such as sockets and file streams. I’d say the mainstream operating systems are all byte-array oriented operating systems. They provide a large hierarchical key-value store of byte-arrays (the file system) and inter-process communication via byte-arrays, and they intentionally treat these byte-arrays as opaque.

When overlapping, interconnected high level constructs are fused and solidified into byte-arrays, the structure is lost and it is much harder to extract meaning (even partial meaning such as identities and relationships of the inner elements). Further, the extraction of meaning requires out-of-band communication and fragile re-implementation of the “meaning extractors”. What we need is a system that preserves the rich structures and elevates the level of inter process communication.

Counterpoint

Pragmatic counterpoints to all of the above are “you'll have bytes/byte-arrays somewhere in your stack”, “any construct can be represented in byte arrays” and “files are sufficient”, etc.

While I agree with the first two, I don’t find them a compelling defense of the file idea. The level of abstraction offered by a system determines a lot about how the system is used, what power it provides and the context that higher level structures evolve within. We'll probably always have NAND gates at some level but if that were the level of abstraction available, we’d be building very different systems.

Here’s a quick thought experiment—say Unix had originally provided a standard byte syntax to represent a link to a file that can be embedded within another file. How would this simple addition affect the evolution of programming languages, text editors, file formats and other tooling? For languages, a possible emergent pattern could be to store each code function in a distinct file while keeping links to the functions in a module file. Another possible pattern might be to use links instead of import statements. Since links would be a system wide standard, all editors would allow traversing the embedded links. New tools would be designed with links in mind.

Links could also be used to represent directories, symlinks, build files, tar files, git trees and so on. How many ways do we currently have to represent a reference to another file? All of those diverse representations collapse into the link. The underlying point I’m making is that the abstractions provided by the substrate deeply affect the structures and processes we build on top of it.
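To play the thought experiment out a little, here is a sketch in Python. The marker syntax (an ESC byte, the word LINK, then a path in square brackets) and the file paths are purely hypothetical. The point is that a single generic extractor would work on any file, whatever else the file contains.

```python
import re

# Hypothetical system-wide link syntax: ESC, "LINK", then a path in brackets.
LINK_PATTERN = re.compile(rb"\x1bLINK\[([^\]]+)\]")


def make_link(path):
    """Encode a link to another file using the hypothetical syntax."""
    return b"\x1bLINK[" + path.encode("utf-8") + b"]"


def extract_links(data):
    """Return every linked path in a byte array, knowing nothing else about it."""
    return [m.decode("utf-8") for m in LINK_PATTERN.findall(data)]


# A "module file" that keeps links to its functions instead of their text.
module_file = (
    b"# module geometry\n"
    + make_link("fn/area.py") + b"\n"
    + make_link("fn/perimeter.py") + b"\n"
)

# An unrelated binary file that happens to embed a link.
archive_file = b"\x00\x01binary header" + make_link("docs/readme.txt") + b"\xff\xfe"

print(extract_links(module_file))   # ['fn/area.py', 'fn/perimeter.py']
print(extract_links(archive_file))  # ['docs/readme.txt']
```

An editor, an archiver and a build tool could all share extract_links, where today each carries its own parser for import statements, symlinks, manifests and so on.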

There is nothing particularly natural about the prevalent disjoint file idea. It’s just an abstraction we are all stuck with (perhaps because it seems useful, being just good enough to get something going) and find hard to evolve away from.

I believe that not only should the system as a whole support the notion of arbitrary constructs and their relationships, it should also support construction of higher level constructs via composition.

Summary

The main ideas I've expressed above are:

Files have no notion of interconnections or structure, either internally or externally. This requires all file formats and file based systems to reinvent the mapping of these higher level concepts into an “array of bytes” in a myriad different ways, resulting in an explosion of representations for a few ideas.
There is no encapsulation of meaning in a file-centric world. Extracting meaning from a file requires knowledge transmitted out of band, via a filename, a MIME type or human-interpreted text. This requires extensive pre-shared knowledge of file formats.
Our operating systems are byte-array oriented and files are an artifact of this design choice.

In other posts I'll attempt to explore what other overarching ideas might form the basis of systems.
