Living in a Distributed World

Odds are if you are reading this, you are living in a distributed world. The software which rendered this post is connected to a global network of machines which ferried you these bits like a relay race.  The machine that did the serving, itself is a remote database periodically updated via a period poll to the device I type most of these posts. Since I do most of my typing on public transit, I use a whole host of servers and networks to deliver these bits to my remote persistent cache.  At each step of the way, machines are caching and forwarding this information, distributing it to the ends of the earth, if my geo-ip database is to be believed. 

We live in a distributed world, where computing for our most common tasks is scattered over dozens of machines on a regular basis. When you think of the total cost of this infrastructure, and how much of it is used for mere amusement, it is truly awe inspiring.  

And yet the basics of building large scale distributed systems are very much a black art bordering on magic to a surprising number of the practitioners of the craft. Not understanding even the basics of how their hardware works, or how a network is routed, or even how their operating system handles files, is considered ordinary and in some circles even applauded.  When was the last time you saw a college graduate java programmer measure the impact of their data representation on disk and network I/O?  Odds are they are too busy trying to tame Hibernate to give such real world limitations a second thought.  How can one expect a software engineer a solution when their tools are too abstract for reality?

And yet this is the world we find ourselves. 

At the heart of this distributed world is a client / server model. A server consisting of a generic daemon and some custom code handles requests for resources and doles out responses.  Probably the best example of this model is inetd, a daemon which listens on a wide variety of ports and then hands off the processing of the socket's data to a separate process.  This daemon allows any script or program that reads STDIN and writes STDOUT to process a socket connection.  Today few servers run inetd for their primary services, but back 15 years ago it was common. 

Early web servers supporting CGI worked in a similar fashion. The daemon would listen on port 80, and when a file in the cgi-bin directory was requested the server would:

 execve(char const path, char const const argv, char const const envp);

The script with the header values and query string parameters passed as environment variables via envp. The script or program could then call getenv(), use %ENV in Perl, or just use the variable straight in Bash.  Secondary connections to databases or proxies were rare, as a single machine could run all of an organization's services.  Updating a BerkeleyDB was a common action for those requiring storage, as they were relatively cheap hashed string stores. 

Enter LAMP: Linux Apache MySQL Perl/PHP/(and eventually Python)

If you were doing anything in the dot com days, you were either doing it on Apache or likely rolling your own. And if you were rolling your own you were probably doing it wrong.  Apache was a big step up from shell scripts running through inetd.   Apache still exec'd CGI scripts, but also managed a forked pool of workers which ensured if one worker segfaulted it didn't bring down the entire server.

 In an age when C modules linked into your Apache runtime were the best way to get performance, some popular web frameworks went a step further and ran a secondary daemon that cached database results and did heavy lifting internally. These designs reached their peak around 2000, when E10000 Solaris systems running Oracle on 16-32 cores were common among sites with millions of active users. Smaller organizations, in comparison were running MySQL on dual core AMD or Intel, but we're running into the same issues: Slow clients, fast servers. 

The reason you ran Sun hardware was your data set was too large for commodity servers. Either your working set size was approaching the 2GB system limit or 2 CPUs could not keep up with query volumes. The reason you ran MySQL was Oracle was frigging expensive.  But even with MySQL, your application was now a distributed application even on the backend. Some processing was cheap to do on 1u pizza boxes, but that which required a lot of shared data still ran on the database.   If you had an advanced middleware tier that did the business logic processing, and ran Apache with mod_proxy, you might have a tier of cheap pizza boxes serving a bunch of slow connections, a collection of beefier boxes for caching data and processing business logic, and a backend pool of databases which ran large queries and updated large sets of shared data. 

Move forward few more years and your world probably became more distributed than ever before. As JavaScript engines got faster, so too did more processing move to the edge. CDNs now were your largest database, and the idea of "a database" had lost ground to clusters. Every part of a typical application from presentation, to control, to storage, to communication involved many processes running on many machines. Most meaningful apps are distributed apps in a networked world. 

When the first distributed systems were put into production, wire runs were measured in miles.  Soon then after, they were measured in yards as computers on the same campus were connected. Then as the computers in the same room connected, feet. Then in cut to length data centers, inches. Now on multicore multi-CPU machines millimeters.  With chips like the GA144, Tilera64, and some of the new ARM offerings, we are measuring the traces between computers in nanometers.   As you shrink the network down to the atomic level, the same principles apply just using smaller machines. 

This is why most modern approaches to programming suck in a distributed world.  It doesn't matter how fast your code runs on a general CPU today, in the near future it isn't going to scale to the computing fabric. When I have thousands of cores per cubic millimeter of computing fabric, processes and threads aren't the bottleneck.  I can dedicate a core per object and have cores to spare. The problem becomes one of modeling the flow of data over those objects, and altering the routing characteristics of the object network in flight.   The traditional instruction pointer is meaningless when you have thousands of parallel cores. 

And in most interesting applications we are already there. On a daily basis, I write programs to run on 10,000 cores.  No application runs exclusively on one machine, and pools of workers wax and wane with load.  With 24-32 core machines (My dev setup has 48 cores), what matters most to both performance and maintenance is having a clear understanding of data flow  between the elements of the system.  And that means becoming comfortable with emergent properties of an asynchronous system.