Serious Play: The Fun Side of Engineering

What passes for software engineering in a lot of shops is a little less than lip service. Rather than develop software based upon a firm understanding of empirical evidence, most software programmers just program with their gut. We see a lot of gutsy programming projects crash and burn, and pretend that software engineering must be hard. Similarly, we look at the cost of labor in our department budgets, and look at the capital cost of hardware, and pretend that humans are expensive and computers are cheap. As such, we should use tools that make our jobs easier even if that means the product works less well.

The Plural of Anecdote

While the plural of anecdotes may not be data, they can be illustrative of how our preconceptions can be wrong. Let's consider the cost of programmer time vs. the cost of computer time. During my last two weeks on a job, a number of years back, I asked to tackle an issue that had bugged me for the proceeding 2 years. Repeated profiles of an application showed that it was spending an inordinate amount of time in its HTML parser. In fact, around 34% of the total processing time was spent processing the web pages over and over again, as it went through a series of permutations which produced the final game output.

Looking at it for two weeks, I discovered that whoever wrote the original parser basically implemented a very elegant generic parser for HTML that was capable of identifying and handling mismatched tags. It was so tolerant of faults, that it was tolerant to a fault. A few quick tests of our existing site demonstrated:

1. We had only a handful of easy to fix mismatched tags. and
2. None of the functionality of the site depended on mismatching

Further more, a full code review showed that the parser was a generic parser that matched all of the tags on the page each iteration, and then only after all of the tags were matched and balanced, did it attempt to identify which tags were of actual interest. In other words, 99.98% of the work it was doing was wasted effort. Changing the parser to a mind-numbingly simple, match the 4 characters that match our custom begin and end tags, resulted in the parser taking only 0.2% of the total process time.

Real World Costs

So what were the real world costs associated with the process of applying metrics to the code and evaluating their actual performance against real world data?

2 weeks of programmer time @ $125/hr = $10,000

Now that may look like a huge figure, but that included testing and verifying an app that served at peak hours over 1.2 million simultaneous users. Now consider what the hardware that this system was running:

4x DB servers = $160,000
12x Web servers = $48,000
4x Switches, Load balancers, Firewalls = $250,000
6x Mail servers = $12,000

Now we're only talking about a production system, and not the development system that mirrored half of this exact environment. The capital outlay is a measly $470,000 in replacement cost. This system was at capacity during peak load, and was in need of an expansion. What was the result of reducing the front end load by 33%?

$10,000 bought enough optimization work to increase the max concurrent users from 1.2 million to around 1.6 million. As there was around a 3:1 total app users per peak user, the application could support another 1.2 million users, before another round of hardware upgrades would be required. For a business where a paying customer represented around $30/year, that represents a potential increase of an addition $36 million in revenue with out any additional capital outlay.

There aren't just capital costs associated with rolling out new hardware as well. There's power costs, hosting costs, and administration costs. If like many businesses, you finance your hardware through a lease deal, there are also additional financing costs. Hosting for the infrastructure was already annually around a $1.2 million expense, or about $0.30 / user. Any improvement in performance, reduction in power or bandwidth, could directly impact this bottom line. After all you pay per bit as well as per watt.

If you look at the actual numbers, spending the money on optimization, on that supposedly expensive programmer, produced a substantial long term cost savings. That $10k could have effectively only increased the physical capacity of the hardware by around 2%, so any improvement in performance greater than 2% would have broken even. Obviously not all apps have such low hanging fruit that can produce a 33% speed improvement in 2 weeks, but many applications have a collection of smaller improvements that can easily add up to a 10-20% improvement in only a couple weeks.

Total Cost of Development

If you look at the cost of development as a function of dollars spent per unit processing accomplished:

programmer dollars spent / CPU cycles processed

You'll note that both the top and bottom of this equation are parameterized by time. The longer the project takes to develop, the greater the money spent on development. The shorter the projected lifespan of the project, the more inefficient the money spent on development. A project that is quick to develop, but runs for years, gives you the greatest bang for your buck, as it trends towards zero. Even a massive project can trend to zero, if it is run by enough people for a long enough period of time. Operating system are a good example of that class of application.

When you are making design decisions, however, there is another programmer cost that continues for the lifespan of the code. The cost to maintain, update, and support the application are also programmer costs, that continue independent of the initial development. Design decisions made early on, may hamper the maintenance effort years down the road, resulting in a low up front development cost, but a substantially higher maintenance cost. So we can modify our equation as follows:

development(t) + maintenance(t) / lifetime(t)

This gives us a fairly good estimate of the cost of the program from the point of view of the development house. But there is also another cost to the system as a whole, and that is the time of the user. A user's time is seldom free. If you're hiring a lawyer by the hour, every hour wasted waiting for a document to save or a memo to load is money out of your pocket. If you're in advertising, every ad you can't serve in a timely fashion can have a dramatic impact on what ad rate you can charge (the more ads you can serve, the more you can charge per ad!). So these issues come into play as well. For a lot of software, a bad review because your app was sluggish or unresponsive can also spell a kiss of financial death in the market place. So let's revise our costs once more:

development(t) + maintenance(t) + user(t) / lifetime(t)

And that gives you a better estimate of what the actual total cost of software can be. When view by this metric, programmer time, especially the time for initial development may actually be one of the smallest costs. For most businesses it is the ongoing maintenance costs that account for much of their personnel issues. The more complicated, poorly designed, and ill documented your project is, the greater the training, maintenance, and support costs you'll face. Furthermore, the less geared towards the user's needs, the slower the runtime, the fewer requests per second you serve, the greater the user costs, and the knock-on costs to your customers. And remember folks, customers who can't afford to upgrade, don't buy upgrades!

Why Metrics Matter

While on several occasions I've blasted less than clueful project managers for their use of metrics, it is not because I believe metrics don't matter. Rather it is the proper user of them that does. For example, on one project a 3rd party project manager hired by a CEO to keep the dev team honest, began to freak out when my code base during the last month leading up to launch was shrinking by 1000 lines of C code a day. An emergency meeting was called, and all of the developers flew in to conference on what was going wrong, and why it was we were going to miss our deadlines. Where did 15k lines of code disappear to?

The answer was simple. For every component that I developed, I did 5 different versions. Each of these versions tests a hypothesis or theory I had regarding how the system may interact with the hardware or with other components. Additionally, each of these 5 versions, may themselves go through 2-10 successive revisions, where each piece of the puzzle was refactored based on real world testing. As a result, in order to produce a final 1000 lines of code, I produced 5000 lines of prototype code, and another 10,000 lines of revisions. All of these were being stored in a repository, and tested, and as I picked the variations that worked best together, I cleaned out the ones I no longer needed. They were still in the revision history if we ever needed them.

Now from the project manager's point of view, all of this code was wasted. There was no need to write 16,000 lines of code in a couple months, if 1000 lines would do. The problem with this mentality, however, is that no one knows ahead of time what those 1000 lines need be. The gap between theory and practice can only be filled with some serious play. The reason that metrics matter is that they help drive development into productive directions. By playing around with an idea, and studying your solution from all possible angles, you can find a more optimal solution than the first thing that comes to your head. The fun part of software engineering is playing around with the ideas.

The Importance of Play

This is the serious play that separates great hackers from middling programmers. It is also the thing that distinguishes a good project manager from a pointy haired boss. Only by appreciating the importance of play, and using metrics to guide your explorations, can you hit the sweet spot that makes for beautiful code. But it is more than just fun and games. This form of serious play as a rigorous exercise also has tangible business results. By basing your design decisions on a combination of empirical evidence and visceral experience, you can assure quality and drive down costs. Play well enough, and you can build systems that require no active maintenance for several years at a time. While there may be a slightly higher up front cost, and sometime seeming waste, the experience gained from experimentation is only wasted if it is not applied and valued.