
Files, Formats and Byte Arrays

The 'file' and the 'file system' are ubiquitous concepts employed by all the major operating systems and hard to avoid in mainstream programming. In this episode, I take a critical look at the notion of a file to point out the many shortcomings lurking in this simple abstraction.

Conceptually, a file is just a named array of bytes. A file-oriented system promotes the idea that:

All concepts must be represented as one-dimensional arrays of bytes.

This is incredibly flexible but also incredibly tedious. There are deep, often overlooked implications of this idea.

File Boundaries

Physical boundaries are the immutable reality of our material world. Two pieces of paper are distinct, thus copying printed text from one to the other must necessarily produce a physical copy such that the copied matter is completely disconnected from the original. However this doesn’t have to be the case in machines that manipulate virtual constructs! We should be able to manipulate abstract concepts that may be interconnected in any which way, without having to sever those connections. The laws of physical materials don’t apply to made-up constructs in computers.

Yet, ‘files’ mimic this severe shortcoming of the material world - hard boundaries and completely detached copying.

We really want to represent all kinds of interconnected concepts in the system, but we're forced to work with disjoint clumps of bytes.

Of course, we can represent links between concepts by encoding them in special formats. Once we start looking for these we find them all over the place - Git's tree and blob objects, HTML's links and embedded images, import statements in programming languages, and so on. Pretty much everywhere you have related information split across files, you end up with cross links and inclusions.

But why, in a machine capable of representing all such links, is the default mode of operation connection blindness rather than connection awareness?

What I mean by connection blindness is this: if I explore a typical file system I'll find a large number of named 'clumps of bytes' but with no knowledge of the structure or interconnections of the concepts they represent.

Why must every application define its own formats and (temporarily) reconstruct these links from disjoint clumps of bytes? Why are the links themselves only visible to the select few with special knowledge of how to interpret the bytes? Why can't a simple concept such as 'content A here is linked to content B there, via transformation T' be directly represented in the (file) system, rather than indirectly as per some file format? Which brings me to my second point.

File Formats

We use computers to organize, manipulate and transmit arbitrary constructs and high level media (such as pictures). The 'file' idea requires that we first map these higher level concepts into byte arrays - before transmission or persistence. The mapping between the concepts and the byte arrays - called the file format - is essential to the interpretation of the bytes. However, the file format itself is transmitted entirely out of band. The 'meaning' of the file can only be extracted when the file contents are viewed through the lens of the correct file format.
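As a small illustration of the 'lens' point, here is a minimal Python sketch. The four bytes and the three candidate formats are arbitrary choices made for this example, not anything from a real file type. The same byte array yields three unrelated 'meanings', and nothing in the bytes themselves says which one was intended.

```python
import struct

# Four bytes with no format attached: a tiny "file" in the sense used above.
raw = bytes([0x00, 0x00, 0x80, 0x3F])

def interpretations(data):
    """Read the same four bytes through three assumed formats."""
    return {
        # little-endian IEEE 754 32-bit float
        "float32": struct.unpack("<f", data)[0],
        # little-endian unsigned 32-bit integer
        "uint32": struct.unpack("<I", data)[0],
        # four 8-bit grayscale samples
        "pixels": list(data),
    }

views = interpretations(raw)
print(views)
```

Which of the three readings is 'correct' depends entirely on knowledge that travels outside the byte array.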

While knowledge of file formats is critical, there is no standardized representation for a 'file format' definition.

This makes evolution and invention painful and slow because it requires knowledge of any file format to be pre-shared. This can only work at scale if we have a small set of well known formats that is globally agreed upon - welcome to the sad state of the world today.

If the notion of 'file format' were reified (and say, itself represented as a file), it could be transmitted with the file and new formats would be trivial to transmit. I'll only touch upon the problem here - it would mean we first concretize concepts that are not bytearrays into our system, so we can define the mapping to these. Further, why only have a flat, one level mapping from higher level concepts to bytes? Maybe we can map concepts to other concepts - there is plenty of room to explore here. One final point on this note - parsing grammars and the Erlang bit syntax are two examples of reification of formats - demonstrating that we keep reinventing the general idea in lesser forms with limited applicability.
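One very limited way to picture a format that travels with its data is sketched below in Python. Everything here is an assumption made for illustration: the container layout (a 4-byte length, a JSON header, then the payload), the function names `pack_with_format` and `unpack_with_format`, and the use of `struct` type codes as the vocabulary for describing fields. It is a toy, and it still relies on the reader pre-sharing the container layout and the meaning of the type codes.

```python
import json
import struct

def pack_with_format(fields, values):
    """Prefix the payload with a description of its own layout.

    `fields` is a list of (name, struct_code) pairs, e.g. ("width", "H").
    """
    header = json.dumps(fields).encode("utf-8")
    layout = "<" + "".join(code for _, code in fields)
    payload = struct.pack(layout, *values)
    # 4-byte little-endian header length, then the header, then the payload.
    return struct.pack("<I", len(header)) + header + payload

def unpack_with_format(blob):
    """Decode a blob using only the layout description it carries."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    fields = json.loads(blob[4:4 + header_len].decode("utf-8"))
    layout = "<" + "".join(code for _, code in fields)
    values = struct.unpack_from(layout, blob, 4 + header_len)
    return {name: value for (name, _), value in zip(fields, values)}

blob = pack_with_format(
    [("width", "H"), ("height", "H"), ("gamma", "d")],
    [640, 480, 2.2],
)
record = unpack_with_format(blob)
print(record)
```

The reader never sees the field list ahead of time, so a sender can introduce a new record shape without coordinating with the receiver. The shared knowledge has shrunk to the meta-format, but it has not disappeared.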

It's (probably) impossible to have a scheme where you have zero shared knowledge and can still transmit concepts between two systems. But shouldn't minimizing pre-shared knowledge be an essential goal of our system design, rather than maximizing it, as we do with files? One tangent here is the encapsulated object idea from Alan Kay and Smalltalk - i.e. organize your system as encapsulated objects (not exposed bytearrays) but that is a topic for another time.

Composability and Encapsulation

There isn't much to write here because the file concept doesn't have much of either. Not much is hidden - the guts are completely exposed. You can't 'compose' a new file from some existing files - not in the sense of how we can compose functions or objects to make higher level ones. All higher level concepts exist outside the file concept - within running programs, source code and people's heads.


In "Our operating systems are incorrectly factored", Tony Garnock-Jones writes:

It’s not only a Unix problem. Windows and OS X are just as bad. They, too, offer no higher-level model than byte sequences to their applications. Even Android is a missed opportunity.

Joel Jakubovic has a long essay with an overlapping theme called "There is only one OS, and it’s been obsolete for decades". An excerpt:

Anyway, once you supply Unix with a name, it hands back to you a stream of bits. Now, despite that this is pretty indisputably the lowest-level picture you could possibly get of anything, at least we can build on top of it.

I wrote previously about misfeatures of plain text. I realize now it's really about concrete versions of the general problems I describe above.

Finally, this discussion would be grossly incomplete without mentioning the ideas of transclusion and xanalinks from Ted Nelson. Nelson defines transclusion as "the same content knowably in more than one place" - which would support a provenance-aware copy of a construct - very different from the detached bytearray copying we do with files. In Xanalogical Structure..., Nelson writes:

Our intention has been not merely to create an electronic literary structure, but to import literary concepts into a redesign of the rest of the software world. We sought to reduce the influence of hierarchical directories and conventional files (which we see as large lumps with stuck names in fixed places, with compulsory gratuitous naming -- unsuited to overlap, interpenetration, rich connectivity, reasonable backtracking, and most human thinking and creative work.

Bytearray Orientation

While I've been harping on about files, everything I'm saying really applies to all bytearray-based constructs such as sockets and file streams. I'd say the mainstream operating systems are all bytearray oriented operating systems. They provide a large hierarchical key value store of bytearrays (the file system) and inter process communication via bytearrays, and they deliberately treat these bytearrays as opaque.


Pragmatic counterpoints to all of the above are "you'll have bytes/bytearrays somewhere in your stack", "any construct can be represented in byte arrays" and "files are sufficient", etc.

While I agree with the first two, I don't find them a compelling defense of the file idea. The level of abstraction offered by a system determines a lot about how the system is used, what power it provides and the context that higher level structures evolve within. We'll probably always have NAND gates at some level but if that were the level of abstraction available, we'd be building very different systems.

Here's a quick thought experiment - say Unix had originally provided a standard byte syntax to represent a link to a file that can be embedded within another file. How would this simple addition affect the evolution of programming languages, text editors, file formats and other tooling? Consider the notion of a directory, a symlink, an import statement, a build file, a tar file - how much of that gets subsumed? Can many diverse representations collapse into the same concept? What about internal structure - could each code function be stored in a distinct anonymous file, if all editors transparently traversed the embedded links? The substrate constructs provided deeply affect the structures we build on top of them.
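To make the thought experiment a little more tangible, here is a minimal Python sketch. The `@{name}` link syntax, the `expand` function and the dict standing in for a file system are all hypothetical, invented for this example. Unix defines no such syntax, which is the point of the thought experiment.

```python
import re

# Hypothetical link syntax, invented for this sketch: @{name}
LINK = re.compile(rb"@\{([^}]+)\}")

def expand(store, name, _seen=()):
    """Return the bytes of `name` with every embedded link replaced
    by the (recursively expanded) bytes of the file it points to."""
    if name in _seen:
        raise ValueError(f"link cycle through {name!r}")

    def follow(match):
        target = match.group(1).decode("utf-8")
        return expand(store, target, _seen + (name,))

    return LINK.sub(follow, store[name])

# A dict stands in for the file system: names mapped to byte arrays.
store = {
    "greet": b"def greet():\n    return 'hi'\n",
    "main": b"@{greet}print(greet())\n",
}

print(expand(store, "main").decode("utf-8"))
```

In this toy, an 'import statement', an 'include' and a 'directory listing' would all be the same thing: a byte array that contains links. A tool that only knows the link syntax can traverse the structure without understanding anything else about the content.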

There is nothing particularly natural about the prevalent disjoint file idea - it's just an abstraction that we are all stuck with (perhaps because it seems useful - it is just good enough to get something going) and that is hard to evolve away from.

I believe that not only should the system as a whole support the notion of arbitrary constructs and their relationships, it should also support construction of higher level constructs via composition.


The main ideas I've expressed above are:

  • Files have no notion of interconnections or structure, either internal or external. This requires all file formats and file based systems to reinvent the mapping of these higher level concepts into an 'array of bytes' in a myriad of different ways, resulting in an explosion of representations for a few ideas.

  • There is no encapsulation of meaning in a file centric world - extracting meaning from a file requires knowledge transmitted out of band - via a filename or MIME type or human interpreted text. This requires extensive pre-shared knowledge of file formats.

  • Our operating systems are bytearray oriented and files are an artifact of this design choice.

In other posts I'll attempt to explore what other overarching ideas might form the basis of systems.

Next, read "Where's my Simulator?".

© shalabh.