
Misfeatures of Plain Text Files

Continuing the theme of problems with existing systems, I want to enumerate the misfeatures we typically tolerate when we use plain text files as a representation medium.

Linear Order

Plain text provides the notion of ordered lines and characters but no other structure. I could swap the order of two sections in a plain text file - modifying the literal structure of the text - with zero effect on the semantic structure (parsed representation) of the information. In effect, plain text forces us to specify a linear order where none might exist; it has no notion of an unordered collection.
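
For example, here is a small sketch using Python's ast module on two made-up toy sources: the same pair of definitions written in two different orders produces two different files, yet parses into the same collection of definitions.

    import ast

    # Two orderings of the same two definitions (made-up toy sources).
    source_a = "def foo():\n    return 1\n\ndef bar():\n    return 2\n"
    source_b = "def bar():\n    return 2\n\ndef foo():\n    return 1\n"

    def definitions(source):
        # Collect the top-level definitions as an order-free set of
        # (name, structure) pairs, ignoring where they appear in the file.
        return {(node.name, ast.dump(node)) for node in ast.parse(source).body}

    print(source_a == source_b)                            # False - the text differs
    print(definitions(source_a) == definitions(source_b))  # True - same definitions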

Data or Visualization?

A lot of the whitespace in source code is meaningless to the computer and promptly discarded by parsers. Eliminating superfluous and redundant data is generally considered good practice - we don't want to store visual padding spaces in the name field for a User data type, for instance - but plain text gets a free pass here.

In fact, plain text requires us to consider both the visual presentation aspects (line lengths, text spacing and such, dictated by 'style guides') and the semantic aspects (the program you are trying to represent) in the same medium. Isn't this a gross conflation of concerns? Whether I change the presentation or the semantics of my plain text encoded program, I change the same file.

Plain text is a strange mixture of data and visualization.
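
A small Python sketch (with a made-up one-line program) makes the conflation concrete: a purely visual edit changes the file - and therefore its diff - while the parsed representation the computer works with stays identical.

    import ast

    # Two versions of the 'same' statement - only the visual padding and a
    # comment differ (made-up example).
    before = "total = price*quantity\n"
    after = "total = price * quantity  # spaced out per the style guide\n"

    # The bytes stored in the file (and diffed by version control) differ...
    print(before == after)  # False

    # ...but the parser discards the padding and the comment entirely.
    print(ast.dump(ast.parse(before)) == ast.dump(ast.parse(after)))  # True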

Zoom-ability

When you open a plain text file, you might see a long list of text lines, but no indication of where to start reading, which ideas in the file are high level and which are low level details, and so on. While presentation is one supposed purpose of plain text files, the encoding itself offers no rich exploratory features.
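
Any zoomed-out view therefore has to be reconstructed outside the encoding, typically by parsing the text all over again. A minimal sketch of such an 'outline' view, using Python's ast module on a made-up module, might look like this:

    import ast

    # A made-up module - to plain text it is just a flat run of lines.
    source = (
        "class Cache:\n"
        "    def get(self, key): ...\n"
        "    def put(self, key, value): ...\n"
        "\n"
        "def main():\n"
        "    ...\n"
    )

    # The encoding itself offers no outline; a zoomed-out view has to be
    # rebuilt by a separate tool that re-parses the text.
    outline = [node.name for node in ast.parse(source).body
               if isinstance(node, (ast.ClassDef, ast.FunctionDef))]
    print(outline)  # ['Cache', 'main']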

Denormalized Links

Most information we want to encode includes various kinds of links and interconnections. These could be references to other files (such as import path.to.other) or references to entities in the same file (such as a file global name). Plain text has no way to encode rich structure, so each of these links has to be stored denormalized - repeating a text representation of the linked entity.
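
A small sketch (reusing the import example above; the surrounding file names are made up) shows what this denormalization costs: every file repeats the link as text, so relocating the target means a blunt system-wide search and replace, which can easily catch text that merely resembles the link.

    # Made-up files that all 'link' to a module by repeating its path as
    # text; there is no single, normalized reference to update.
    files = {
        "a.py": "import path.to.other\n",
        "b.py": "from path.to.other import helper\n",
        "notes.txt": "See path.to.other_utils for the workaround.\n",
    }

    # Relocating the target means rewriting every file that mentions it...
    renamed = {name: text.replace("path.to.other", "path.to.new")
               for name, text in files.items()}

    # ...and the blunt replacement also mangles the unrelated mention.
    print(renamed["notes.txt"])  # See path.to.new_utils for the workaround.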

Convenience and Familiarity

One thing going for text is that it is very easy to generate large amounts of it from a keyboard. It is also a familiar medium - we spend years in early childhood learning to read and write text. Another convenient aspect is that a very large number of existing tools work with it, and these tools can be composed to the degree that text's rudimentary line and character based structure allows. Still, I feel the best job these tools can do is a poor one, as I wrote previously in Stuck with Plain Text.

Summary

All the misfeatures listed above have real implications for complexity in systems.

Since semantic structure is subject to parsing and interpretation, multiple readers must re-implement the same logic for extracting useful meaning. Editors and compilers, for instance, parse the same files into similar tree structures. Since visualization is intertwined with semantics, version control systems cannot differentiate between purely visual changes and semantic changes.

Links are stored denormalized and interpreted outside the file itself, which means any relocation of the link target requires a system-wide text search and replace operation. This problem is compounded by the fact that the text representation of a link has no standard syntax, with each language format inventing its own.

With no embedded semantic structure (besides lines), editors can provide no useful affordances for viewing, creating and manipulating text - unless, of course, they re-implement some parser and layer some semantics on top.

Admittedly, some of these issues apply not just to text files but to files in general; however, it's still useful to identify these specific misfeatures of text files.

In general, I don't think the convenience aspect outweighs these misfeatures. I've come to the conclusion that 'plain text' should be, at most, a transient representation - used to communicate with the system, but quickly parsed into something richer and not used as a long term canonical representation.

Why strip the system of the knowledge of semantic interconnections and richer structures, only to reconstruct them in a transient form, inside a few special programs? Why not make the system capable of storing the rich structures directly, so we might have more powerful generic tools?
