Experimental Engineering at Scale

One of my biggest pet peeves with the state of software engineering is the general lack of numerical literacy on the part of programmers. Across the dozen companies I've consulted or worked for over the past 13 years, I can count on one hand the projects where I've seen mechanistic models used to work out the design of a system, and fewer still where that modeling was followed up with predictive modeling to understand how the system would respond to future changes. Most of the systems I build for profit deal with big data: petabytes of data a month, exabytes over a system's lifespan. When you have thousands of servers, tens of thousands of cores, and hundreds of thousands of processes, you can only approach the problem space statistically. Each small change often has spectacularly weird side effects when interacting at scale; your assumptions must be tested, because your small-scale intuitions are just wrong.

Look at a concept like batching. If you tend to work with database applications that involve a single machine and are typically I/O bound, you'll likely look to batch or stream your data in larger chunks. Because the repeated connection cost from your single server connecting to that single DB can be a significant source of overhead at small scale, you can shave a few percentage points of load with a simple recoding of your algorithm. But what happens when you grow to a multi-master, federated, eventually consistent, geographically distributed database with several modes of operation? Well, you're no longer running on one core, you're no longer locking the DB and waiting for each user's critical update to process, and your writes are no longer duplicate-free or globally consistent. Odds are, too, that if you want to eke out any performance, the additional layers of proxies and load balancers are voiding any validity in your connection-cost logic. Finally, if you're running at scale, you're no longer hitting the database with a single-threaded application, but with thousands of multiprocess applications. Odds are your intuition of "send bigger batches" is not only wrong but harmful.
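To make the small-scale intuition concrete, here's a toy cost model of that single-server, single-DB case. Every constant here is a made-up assumption for illustration, not a measurement:

```python
# Back-of-envelope model of when batching pays off on one DB connection.
# connect_ms and per_row_ms are illustrative assumptions, not measurements.

def total_time_ms(n_rows, batch_size, connect_ms=5.0, per_row_ms=0.1):
    """Time to ship n_rows in batches: one connection cost per batch,
    plus a fixed per-row processing cost."""
    n_batches = -(-n_rows // batch_size)  # ceiling division
    return n_batches * connect_ms + n_rows * per_row_ms

rows = 10_000
one_by_one = total_time_ms(rows, 1)    # pay the connection cost per row
batched = total_time_ms(rows, 500)     # amortize it over 500 rows

print(f"row-at-a-time: {one_by_one:.0f} ms")
print(f"batched (500): {batched:.0f} ms")
```

Under these assumed constants the batched version wins by a wide margin, which is exactly the kind of single-machine result that tempts you into carrying the "bigger batches" rule into a distributed system where the model no longer holds.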

If your app sends large batches to process, the backend now sees the incoming requests as a very spiky signal. You have troughs with very little processing, and then great big spikes of activity. To accommodate the spikes, the system's scheduler will need to break down the large request and queue the smaller requests so as to distribute the load across the clusters. This means your spike will likely be transformed into a sawtooth signal, drained at a rate proportional to the number of workers and the average job duration. If the workers cannot keep up with the spike, the queue will eventually overflow, and most systems will discard the TTL-exceeding messages in favor of preserving availability for new requests. If the app doesn't break the large requests into smaller ones, you'll see massive pipeline stalls as some nodes go unresponsive, overloaded with work, while other nodes have difficulty getting any work at all because there aren't enough small requests to fill their time. In this case your spiky signal gets translated into a collection of dwarf square waves, with only a small fraction of the theoretical throughput being realized.
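The spike-to-sawtooth behavior can be sketched with a toy tick-based queue simulation. The `drain` function, the tick granularity, and all the parameters below are hypothetical choices of mine, not any particular system's scheduler:

```python
# Toy simulation: a scheduler splits one big batch into unit jobs and
# drains them at a fixed worker rate, dropping jobs that exceed a TTL.
from collections import deque

def drain(spike_size, workers_per_tick, ttl_ticks):
    """Enqueue spike_size unit jobs at t=0, drain workers_per_tick per
    tick, and expire jobs older than ttl_ticks. Returns the queue-depth
    trace (one tooth of the sawtooth) and the count of dropped jobs."""
    queue = deque([0] * spike_size)  # each entry is its enqueue time
    depths, dropped, t = [], 0, 0
    while queue:
        t += 1
        # expire jobs that have waited past their TTL
        while queue and t - queue[0] > ttl_ticks:
            queue.popleft()
            dropped += 1
        # workers pull from the front of the queue
        for _ in range(min(workers_per_tick, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths, dropped

depths, dropped = drain(spike_size=1000, workers_per_tick=50, ttl_ticks=10)
```

One spike produces one descending tooth; repeated spikes re-fill the queue and trace out the full sawtooth. With these assumed numbers the workers only get through half the batch before the TTL bites, and the other half is silently discarded, which is exactly the overflow failure mode described above.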

If your backend is efficient, the queue size will trace a sawtooth wave that trades latency for efficiency. You can shorten the processing delay for the full dataset by adding more resources to shorten the base of each sawtooth, but capacity will then go unused. The nodes can service other requests during that period, but at the risk of adding delay to the next batch. This produces seemingly non-deterministic behavior when enough clients send batches at effectively random times. Since you never know what impact environmental effects may have on your system (users, or historic events for which there are no good priors), the combination of batching and job-queue scheduling can produce SLA-violating results that you cannot simply "add servers" to overcome.
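That apparent non-determinism is easy to reproduce in a sketch: hold the average load constant and the worst-case backlog still swings run to run, depending entirely on how the clients' batches happen to cluster. The `peak_backlog` function and every number in it are illustrative assumptions:

```python
# Toy model: clients fire whole batches at random ticks; peak queue depth
# (a proxy for worst-case latency) varies even at constant average load.
import random

def peak_backlog(seed, clients=20, batch=500, horizon=1000, service_rate=12):
    """Peak backlog over the horizon when each client dumps one batch
    at a uniformly random tick and workers drain at a fixed rate."""
    rng = random.Random(seed)
    arrivals = [0] * horizon
    for _ in range(clients):
        arrivals[rng.randrange(horizon)] += batch  # whole batch at once
    backlog, peak = 0, 0
    for t in range(horizon):
        backlog = max(0, backlog + arrivals[t] - service_rate)
        peak = max(peak, backlog)
    return peak

# Same workload (10,000 jobs) and same ~83% average utilization every
# run; only the random batch timing changes.
peaks = [peak_backlog(seed) for seed in range(10)]
```

Average utilization never exceeds capacity here, yet the peak backlog is at the mercy of batch collisions. That spread is what makes SLA planning for batchy clients so treacherous: provisioning for the mean tells you almost nothing about the tail.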

What ends up being a better bet nearly every time is sending lots of little messages. The overhead of messaging, which may be as high as 20-40%, can be baked into your design costs, and it buys a number of spectacular advantages over a "faster" but more dangerous design. Lots of little messages allow for smarter distribution and usually simpler processing requirements. As processing-time upper bounds become stable, you can ensure fair scheduling for tasks, improving throughput and utilization. As the number of messengers grows, so too does the opportunity to use QoS to control ingress and egress at the system layer. Priority queuing allows the queuing system to communicate with the client and the worker scheduler to ensure the most important SLA-impacting events are responded to first, which is especially true of any command-and-control signaling. Furthermore, you can exploit the law of large numbers, which allows random events to drown each other out to a stable baseline. You start loving randomness and sampling once your system has a high enough volume of messages to make them not just practical but efficient. All of a sudden you find yourself looking at your monitoring tools as your mission-critical system.
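The law-of-large-numbers point can be illustrated by comparing load variability for the same total volume sent two ways. `load_cv` and all of its parameters are my own illustrative construction:

```python
# Sketch: aggregate load from many small independent senders is far
# steadier, relative to its mean, than the same volume in a few batches.
import random
import statistics

def load_cv(n_senders, msgs_per_sender, ticks=1000, seed=1):
    """Coefficient of variation (stdev/mean) of per-tick load when each
    sender drops all its messages at one uniformly random tick."""
    rng = random.Random(seed)
    load = [0] * ticks
    for _ in range(n_senders):
        load[rng.randrange(ticks)] += msgs_per_sender
    return statistics.pstdev(load) / statistics.mean(load)

# Identical total volume (10,000 messages) and identical mean load.
few_big = load_cv(n_senders=10, msgs_per_sender=1000)
many_small = load_cv(n_senders=10_000, msgs_per_sender=1)
```

With these assumed numbers, the many-small-senders signal hovers near its mean while the batched signal is almost all spike, so the backend provisioned for the smooth case needs only a fraction of the headroom. That stable baseline is precisely what lets you start trusting sampling and statistical monitoring.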

But when you start thinking in this big-data, large-scale way, you start to feel the need for greater levels of numerical literacy. If your engineers can't look at a pile of numbers and derive useful statistics, they can't engineer anything. If you don't know how to create a probabilistic model of the behavior of your system, you have no way to develop and test plans for future development. Finally, if you can't construct a hypothesis and tie it back to a set of variables, you really shouldn't be allowed to touch a keyboard -- let alone program. Microoptimizations have no place in this world, and the choice of trendy programming language du jour is just bikeshedding. Joe Armstrong is right about solving the wrong problem, but I'm afraid the general state of software engineering is such that most lack the numerical literacy to understand what the problem is in the first place. Erlang and R are a good step in the right direction.