Imagine two identical subroutines in the same program, but with different names. The program works but it doesn’t look right—there’s this code smell. Maybe someone didn’t know that a subroutine already exists and just rewrote it. The obvious cleanup is to remove one copy and fix up all references to point to the other.
The above example involves a single program, but can we stretch the same idea to whole systems as well? Where does duplication arise in systems? Is it even a problem?
In this essay I look at the various methods of composition, reuse and duplication in our systems.
Examples of Duplication
The example above is about an oversight, but the kinds of duplication I’m really interested in are ones that require something to be re-implemented even though it is already available. This happens when an existing implementation doesn’t fit the context where it is needed. For example, a library available in one programming language must be re-implemented for use in another programming language.
Here are a few more examples:
Compilers parse languages. Editors also parse languages. Do they share the parser? Rarely. An integrated environment that provides both—a runtime and an editor—would not reimplement parsing because the parsed structures can be easily shared.
In distributed applications, protocol implementations are duplicated across many different languages and stacks.
Schema definitions—such as the definition of an id field—are duplicated across multiple frontend, backend and database systems.
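To make the schema example concrete, here is a sketch (all names and layers illustrative) of the same id field declared once per layer, written in Python for brevity:

```python
# The same "id" field, declared once per layer (names are illustrative).

from dataclasses import dataclass

# 1. Database layer: SQL DDL, kept as a string or a migration file.
USERS_DDL = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"

# 2. Backend layer: an in-language model type.
@dataclass
class User:
    id: int
    name: str

# 3. Frontend/API layer: a JSON Schema shipped to clients.
USER_JSON_SCHEMA = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "name": {"type": "string"}},
    "required": ["id", "name"],
}

# Nothing checks that these three definitions agree; each must be
# updated by hand whenever the schema changes.
```

Three declarations, one concept, and no system-level link between them.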
The cases above deal with information interchange across different programs. Parsing and serialization code is often rewritten in various languages. Are duplicate implementations necessary here? It certainly seems so, considering how the systems are currently built.
It’s not just parsing logic that is duplicated. I’m going to dig deeper into a popular composable environment.
Consider the powerful, time-honored environment that gives us many “small programs, each doing one thing well”: the Unix shell. There is a sort command, an ls command, and many more. A versatile collection of blocks that I can snap together in different ways (yay pipes!). There isn’t much duplication of commands, and the environment seems to have nice composition properties.
I think of composition as having elements and verbs. In a shell, the elements are the commands and binaries, and the verbs are the operations we perform with them, such as running one or piping one into another.
But it only goes so far.
If I write a program in Unix using Java or Python, can I reuse the Unix sort command to sort an array of items inside my program? Of course not, what an improper question! The sensible choice is to reimplement sorting in my program (or use the standard library, where someone else has already re-implemented it).
Even so, let’s pause and revisit the question as a hand-wavy beginner might phrase it:
The computer already knows how to sort things, why do I need to tell it again?
This question is somewhat reasonable and deserves an answer.
The reason we cannot just reuse the sort command within a programming language is that shell commands and in-program concepts don’t connect easily. It’s like trying to fit Lego blocks into a jigsaw puzzle. You need parts that fit the other parts.
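The connection can be forced, but only through a low-level bridge. A sketch in Python (assuming a POSIX sort on the PATH): to “reuse” the shell’s sort inside a program, the data must be serialized out to bytes and parsed back in:

```python
import subprocess

def shell_sort(items):
    """Sort a list of strings by shelling out to sort(1).

    Everything must be flattened to newline-separated bytes on the
    way out and parsed back on the way in; the glue dwarfs the
    actual sorting.
    """
    payload = "\n".join(items).encode()          # serialize
    out = subprocess.run(["sort"], input=payload,
                         capture_output=True, check=True).stdout
    return out.decode().splitlines()             # parse back

print(shell_sort(["pear", "apple", "banana"]))
```

Contrast this with the in-island sorted(items): one call, no child process, no serialization.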
Now, if I reimplement the sort command for the Unix shell I’ll rightly be told that my work is pointless: “we already have a sort program, how is yours any different?” Yet when I cross the boundary from the shell world into the in-program world and implement sorting with the exact same algorithm, nobody raises any duplication concerns. This is “the way things are!”, “how could it be any different?”, “deep reuse like that isn’t feasible/possible/practical”.
What was a program smell—two commands or subroutines that do the same thing—is in fact a system fragrance—two implementations that do the same thing!
Or is it?
We do end up re-implementing a lot of functionality that is already available in other forms in the same system. Half the fun of a new language/paradigm/ecosystem is re-implementing all the stuff that already exists. (But by golly won’t we do it right this time!)
In sharp contrast to a person, who may learn a single algorithm and be able to apply it in various contexts, in-computer representations of algorithms seem to be tightly coupled to the context and require repeated expressions for different contexts. There is no way today for me to express any sort algorithm in a pure way such that I can reuse it by only decorating the definition with some context specific glue.
Generic programming is an attempt at improving the reuse of algorithm implementations, but its applicability is severely limited: it applies only to in-language code of a certain kind.
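Within a single language, that kind of reuse does work to a degree. A sketch: one sort definition, with a key function as the context-specific glue, reused for numbers and records alike:

```python
def merge_sort(xs, key=lambda x: x):
    """One sorting definition, parameterized by a key function."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid], key), merge_sort(xs[mid:], key)
    out, i, j = [], 0, 0
    # Merge the two sorted halves, consulting only the key function.
    while i < len(left) and j < len(right):
        if key(left[i]) <= key(right[j]):
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

print(merge_sort([3, 1, 2]))                                       # numbers
print(merge_sort([{"id": 2}, {"id": 1}], key=lambda r: r["id"]))   # records
```

But the reuse stops at the language boundary: this definition cannot be handed to the shell, a database, or a C program without being re-expressed.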
Is it really true that sort cannot be expressed in a pure form and decorated to fit into any context? I’m interested in any counterexamples.
Step back to our scenario with the shell sort command and the in-program sort. There is a line between the shell world and the in-program world. Let’s call this boundary a composition barrier. On one side of the barrier we have the shell elements—commands, pipes and such. On the other side we have a very different world—subroutines, types, values—programming language elements. Elements across composition barriers cannot be directly connected.
We have multiple distinct and isolated worlds of composition, each with its own elements and composition rules.
It’s easy to see that programming languages are each fenced by a composition barrier. Each is an island with a rich selection of artifacts and composition patterns. The artifacts include not just the language concepts (procedures, pure functions, algebraic types, classes, inheritance, and so on) but also various libraries. You can play around with them all you want but you cannot take any artifacts off the island.
Well, it’s not entirely true that we cannot connect elements across composition barriers. But to do so requires both sides to agree on a shared mechanism. These tend to be fairly low level—some examples of such mechanisms are the C ABI, pipes, sockets and other protocols built upon these. Even file formats are an interoperability bridge. I realize these are all somewhat different and range from in-process bindings to remote ones, but from my squinty eyes these are all just ways to cross composition barriers.
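As a concrete example of such a bridge, here is a sketch (POSIX assumed) of Python reaching libc’s qsort across the C ABI via ctypes. Note how much ceremony the crossing demands: fixed-width types, explicit pointers, and a comparator squeezed into a C function-pointer shape:

```python
import ctypes

# Resolve libc symbols from the running process (works on POSIX).
libc = ctypes.CDLL(None)

# qsort's comparator has a fixed C ABI shape:
#   int (*)(const void *, const void *)
CMP = ctypes.CFUNCTYPE(ctypes.c_int,
                       ctypes.POINTER(ctypes.c_int),
                       ctypes.POINTER(ctypes.c_int))

libc.qsort.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                       ctypes.c_size_t, CMP]
libc.qsort.restype = None

def compare(a, b):
    # Dereference the element pointers and return -1 / 0 / 1.
    x, y = a[0], b[0]
    return (x > y) - (x < y)

nums = (ctypes.c_int * 4)(42, 7, 19, 3)   # a C array, not a Python list
libc.qsort(nums, len(nums), ctypes.sizeof(ctypes.c_int), CMP(compare))
print(list(nums))
```

Every value had to be repackaged into C shapes before the “reuse” could happen, which is the bridge toll the essay is describing.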
If there is one thing we seem to like doing more than building islands, it is building bridges.
By the way, whenever we're grappling with all this plumbing, isn’t it the most grunty of all the grunt work we call programming today?
Composing across a bridge isn’t quite the same as working within an island. Notice how when working within “my favorite language”, the elements seem to connect quite elegantly, but whenever I need to cross the bridge, things get ugly—I need all kinds of compromises.
Some things are really not possible across many composition barriers. For instance, a C ABI function call can never be inlined because we're dealing with pre-generated machine code here. By contrast, an OS designed with late binding in mind might be able to do such inlining optimizations.
The Composition Hodgepodge
So who decides where these composition barriers arise and where islands flourish? It’s the operating system—Unix, really. It provides a baseline composition model with files, binaries, compilation, running and so on:
- At the bottom you have files
- Files are bound together, compiled to form binaries or dynamic libraries
- Binaries are run to form processes, which are bound to libraries
- Binaries compose with other running binaries using bytearray oriented IPC, file storage or databases
- Binaries call library and kernel syscalls via binary ABIs
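The bytearray-oriented IPC in that list can be sketched from Python (POSIX printf and sort assumed): two binaries composed through a kernel pipe, just as a shell would wire them:

```python
import subprocess

# printf's stdout feeds sort's stdin through a kernel pipe,
# exactly what the shell builds for `printf ... | sort`.
p1 = subprocess.Popen(["printf", "b\na\nc\n"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["sort"], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()            # let p1 see EPIPE if p2 exits early
out, _ = p2.communicate()
print(out.decode().split())
```

The only shared vocabulary between the two processes is a stream of bytes; any structure must be agreed on out of band.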
This framework is really quite rigid. However, there is plenty of freedom to design within the file formats—which is where we exercise it. It’s as if the system provides strong affordances: places where we can extend it.
Affordance is a term from UX, where it means cues that hint at how users may interact with something. I think it can be applied to system design as well.
What we have is a bunch of ad-hoc composition models that don’t scale well. The OS provides the DIY kit for in-program composition—here you employ objects, functions, or whatever language features you use—but these are mostly statically bound in one-shot compilation. The next level of composition is “libraries”, where you must use the ABI. This means very low-level semantics and no code inlining. The next level is inter-process communication, either via bytearray streams or weakly consistent files. Both require really low-level definition, serialization and copying. There is no strong OS support for an ensemble of processes that forms a larger computation. Nor is there any support for system-wide schema or type definitions. All these notions have to be manually built on top.
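What “really low-level definition, serialization and copying” means in practice: a sketch of a hand-rolled wire format (layout hypothetical) that every process on the wire must re-implement, field by field:

```python
import struct

# A tiny length-prefixed framing for a record {id: u32, name: str},
# the kind of byte-level contract every process pair must agree on.

def encode(rec_id, name):
    data = name.encode("utf-8")
    # u32 id, u16 name length, then the bytes; big-endian on the wire.
    return struct.pack(">IH", rec_id, len(data)) + data

def decode(buf):
    rec_id, n = struct.unpack_from(">IH", buf)
    name = buf[6:6 + n].decode("utf-8")
    return rec_id, name

wire = encode(7, "alice")
print(decode(wire))
```

Byte offsets, endianness, length prefixes: all of it lives outside any type system, and a mismatch on either end is just silent corruption.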
What I think is particularly fascinating is how in-process composition is drastically different from cross process composition. When you zoom out and look at a running system—it has all these interconnections—the couplings exist both within and across processes but the way we build and think about these bindings is so different.
In-process couplings—function to function, or function to data type—are bound by the compiler in one shot and are tighter in nature. Cross-process couplings are bound at runtime by the processes themselves and are dynamic.
I see this bifurcation as a system smell.
I realize this whole essay might seem like a non-problem, but I really believe there is something here. If it doesn’t look like an elephant in the room, perhaps the elephant is the room. Our composition models don’t scale and we're pushing them to the limits. Tractability is a growing problem as we add layer upon layer of technology to cobble together the previous layer of elements.
What we're doing seems like stacking brick structures to their limit. Stronger bricks won't give us skyscrapers; we need better ideas.
What I’m really after here is better composition models. A good model should scale up—we shouldn’t need to keep reinventing new ways to compose elements together. Perhaps this involves a uniform notion of the binding between elements and some kind of recursive design? For bonus points, can we have a way to describe a general algorithm that is never reimplemented, only refined and tweaked by decorators so it can fit different contexts?
Are these things possible? I hope so.
An excellent write-up on a similar theme is Composition is the root of all evil, where Konrad Hinsen writes:
Digital information is, by definition, information expressed in a formal language. Composition of digital information produces a new, more complex, digital information item, which is of course expressed in a formal language as well. And since the ingredients remain accessible as parts of the whole, everything must be expressed in one and the same formal language. And that’s where all our trouble comes from.
In Architectural Mismatch: Why Reuse is Still So Hard, David Garlan, Robert Allen, and John Ockerbloom reflect on the state of architectural mismatch, a term they coined earlier:
Applications must include extensive code to ensure compatibility with different browsers, languages, and library implementations.