I used to like plain text.

Those binary files seem so uptight and impenetrable. By contrast, plain text files are open and accessible—pop them in any editor, cat, head and grep away to slice and dice. So many tools—so convenient!

But, I want to talk about a shortcoming of plain text.

The two meanings in text

In text files, the newline characters are special because they have a universally agreed upon meaning—there is no question about how the file is structured in terms of lines. You could say the newline is part of the core syntax of text files and implies a specific visual and structural meaning.

However, text almost always encodes another structure—one that comes from the program that interprets it. For instance, configuration files have sections and sub-sections, programming languages have classes, methods, and so on. The syntax for this structure is unique for each context.

The problem, then, is that generic tools can only work with the core syntax of text, not the interpreted structure:

  • Source control “diffs” show which lines were added or deleted
  • Counters show how many lines are in a file
  • Up and down arrow in editors navigate lines

It would be a lot more meaningful if these tools could work with the richer meaning instead. Tools can be made to do this (editors do this to different degrees already), but only by implementing special support to parse each custom syntax.

What if?

What if plain text had included special invisible characters that could indicate nesting? After all, aren’t all control characters arbitrary anyway?

You would be able to encode a tree of nodes instead of a sequence of lines in this ubiquitous file format. Languages designed in such a world would not invent syntax just for the purpose of nesting (they would still need keywords to attach meaning to the nodes). Generic tools would offer tree semantics. Source control diffs could report something like:

+   Node 'class Shiny()' added at position 3, with 10 children, 105 in subtree
-   Node 'def rusty()' deleted at position 4.2, with 33 children

Editors would be tree editors, not line editors. Overall, things would have been incrementally better, at least in my imagination. So now, every time I see a line based diff, it’s just a little annoying when I have to mentally map the diff to something more meaningful.

Slightly more bothersome is the feeling that we are stuck with plain text—it’s a file format that wasn’t designed to evolve, and it’s not a great thing to be stuck with. I'll leave that as the closing thought for now, and revisit this general topic in a future post.


Instead of a comment you can also annotate this page.