Stuck with Plain Text

I used to like plain text.

Those binary files seem so uptight and impenetrable. By contrast, plain text files are open and accessible — pop them in any editor, cat, head and grep away to slice and dice. So many tools — so convenient!

But these days I cannot stop noticing a shortcoming of plain text.

In text files, newline characters are special because they have a universally agreed-upon meaning: there is no question about how the file is structured in terms of lines. You could say the newline is part of the core syntax of text files, and that it carries a specific visual meaning (start a new line).

However, text almost always encodes another structure, one that comes from the program that interprets it. For instance, configuration files have sections and sub-sections, and programming languages have classes, methods, and so on. The syntax for this structure is unique to each context.

The problem, and the source of my irritation, is that generic tools can only work with the core syntax of text, not the contextual structure:

  • Source control 'diffs' show which lines were added or deleted
  • Counters show how many lines are in a file
  • The up and down arrow keys in editors move between lines

It would be a lot more meaningful if these tools could work with the contextual structure instead. Tools can be made to do this (editors do this to different degrees already), but only by implementing special support for each custom syntax.
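To make the line-only view concrete, here is a minimal sketch using Python's standard difflib module. The two snippets being compared are made up for illustration: the same function, before and after being moved into a class. A line-based diff has no notion that "a function moved into a class"; it can only say which lines went away and which appeared.

```python
import difflib

# Hypothetical "before" and "after" versions of a tiny source file.
old = [
    "def rusty():",
    "    return 1",
]
new = [
    "class Shiny:",
    "    def rusty(self):",
    "        return 1",
]


def line_diff(a, b):
    """Return only the added/removed lines of a unified diff, without headers."""
    return [
        line
        for line in difflib.unified_diff(a, b, lineterm="", n=0)
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


changes = line_diff(old, new)
for line in changes:
    print(line)
```

Every line of the old file is reported as removed and every line of the new file as added, even though the structural change is small. Mapping that back to "the function was wrapped in a class" is left to the reader.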

"What if," I imagine, "plain text had included special invisible characters that could indicate syntactic nesting?"

You would be able to encode a tree of nodes instead of a sequence of lines in this ubiquitous file format. Languages designed in such a world would not invent syntax just for the purpose of nesting (they would still need keywords to attach meaning to the nodes). Generic tools would offer tree semantics. Source control diffs could report something like:

+   Node 'class Shiny()' added at position 3, with 10 children, 105 in subtree
-   Node 'def rusty()' deleted at position 4.2, with 33 children
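Here is a rough sketch of what such a generic tool might look like. Everything in it is an assumption made for illustration: I picked two existing ASCII control characters (Shift Out and Shift In) to stand in for the imaginary "begin children" and "end children" markers, and the names Node, parse and diff_children are hypothetical. The diff only compares top-level nodes by their label, so it is nowhere near a real tree-diff algorithm, and the parser assumes well-formed input.

```python
from __future__ import annotations

from dataclasses import dataclass, field

OPEN = "\x0e"   # assumed "begin children" marker (ASCII Shift Out)
CLOSE = "\x0f"  # assumed "end children" marker (ASCII Shift In)


@dataclass
class Node:
    text: str
    children: list[Node] = field(default_factory=list)

    def subtree_size(self) -> int:
        """Count all descendants: children, grandchildren, and so on."""
        return sum(1 + child.subtree_size() for child in self.children)


def parse(text: str) -> Node:
    """Build a tree from text that uses OPEN/CLOSE for nesting.

    Siblings are separated by newlines. Input is assumed to be well-formed.
    """
    root = Node("<root>")
    stack = [root]
    label: list[str] = []

    def flush() -> None:
        # Turn the characters collected so far into a child of the current node.
        name = "".join(label).strip()
        label.clear()
        if name:
            stack[-1].children.append(Node(name))

    for ch in text:
        if ch == "\n":
            flush()
        elif ch == OPEN:
            flush()
            # Whatever follows belongs to the node that was just added.
            stack.append(stack[-1].children[-1])
        elif ch == CLOSE:
            flush()
            stack.pop()
        else:
            label.append(ch)
    flush()
    return root


def diff_children(old: Node, new: Node) -> list[str]:
    """Report top-level nodes that were added or deleted, matched by label."""
    old_labels = [c.text for c in old.children]
    new_labels = [c.text for c in new.children]
    report = []
    for pos, child in enumerate(new.children, start=1):
        if child.text not in old_labels:
            report.append(
                f"+ Node {child.text!r} added at position {pos}, "
                f"with {len(child.children)} children, "
                f"{child.subtree_size()} in subtree"
            )
    for pos, child in enumerate(old.children, start=1):
        if child.text not in new_labels:
            report.append(
                f"- Node {child.text!r} deleted at position {pos}, "
                f"with {len(child.children)} children"
            )
    return report


# Made-up example documents.
old_doc = "def rusty()" + OPEN + "step one\nstep two" + CLOSE + "\ndef keep()"
new_doc = "def keep()\nclass Shiny()" + OPEN + "def polish()" + OPEN + "buff" + CLOSE + CLOSE

report = diff_children(parse(old_doc), parse(new_doc))
for line in report:
    print(line)
```

The point is that nothing in parse or diff_children knows about classes or functions. The nesting comes entirely from the two marker characters, so the same tool would work on any format that used them.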

Editors would be designed to handle a tree structure from the get-go. Overall, things would have been much better, at least in my imagination. So now, every time I see a line-based diff, it's just a little annoying to have to mentally map it to something more meaningful.

Slightly more bothersome is the feeling that we are stuck with plain text — it's a file format that wasn't designed to evolve. I'll leave that as the closing thought for now, and revisit this general topic in a future post.