Lessons Learned in 2017

Even though 2017 wasn't a good year, I did manage to learn a number of things that I think might be worth remembering:

  • You may recognize and understand problems your customers won't realize they have for years
  • Things you were doing 10 years ago are someone else's cutting edge technology today
  • An 80% correct solution today is immeasurably better than no solution at all
  • Systems engineering is a dying art as platforms become increasingly abstract

These lessons will no doubt be reflected in what I choose to do in 2018:

    Problem Recognition

    One problem I became painfully aware of this past year is the problem of "problem recognition". You may have encountered it when communicating with a client and realizing that they don't understand what you're selling because they don't understand their actual problem. Marketing exists largely to help a market segment identify a problem, and advertising then helps associate your solution with said problem, but there can be a gap between marketing and customer experience so wide that no amount of marketing can be effective. This gap is what I call the "problem recognition" problem.

    For example, your customer has decided to "move to the cloud", and they adopt a set of tools sold to them by various vendors without understanding the problems those tools solve. All they know is that other successful organizations have used these tools, and that the consultants they hired to drive the process recommended them too. After natural attrition, the people involved in the original implementation of the move to the cloud have left the team, and now you are left with a staff who don't understand any of the problems the tools were intended to solve. No amount of marketing for new or better tools will bridge this gap, because the organization you are selling into has lost the institutional knowledge necessary to recognize even the existence of the problems your tools solve.

    In another example, your customer has decided to adopt a device management platform which follows industry standards, but does not really understand the problems those standards were attempting to address. In fact, while the platform is standards compliant, it fails miserably to meet the basic functional requirements of the product because it doesn't solve a soft-realtime command and control requirement. The reason is that the standard was never intended to address soft-realtime command and control, but merely RESTful state transfer with eventual consistency. Educating your customer about how many different IoT problems there are can be more work than implementing an entire new platform from scratch!

    Everything old is new again

    I began automating server builds on an industrial scale professionally 18 years ago. Building dependency checkers, validating hardware performance, burn-in and failure testing, and adding new device support were all just things you did to get a professional grade Linux server live. Having worked in a factory setting for servers, I had also developed an appreciation for the work being done in the HPC world and in large scale datacenter style deployments. When I branched out into game and app development, I kept those automating tendencies with me, and have scripted every build, deployment, and release ever since. It turns out much of the world still doesn't work that way.

    There is a terrible tendency in technology circles to discount "old code" and "old tech", as if a solution is invalid because it wasn't just recently discovered. You can see this in a lot of fad tech: cmake, Go, Rust, the JavaScript framework du jour, systemd, etc., where the design decisions are a product of frustration with existing tools, only to invent new, equally frustrating tools with their own special inadequacies. Each of these tools tends to push for what I would call "more magic and less discoverability". While these tools often make the happy path seem easy, they actively hamper understanding by hiding too much complexity. For those experts whose use cases fall off the happy path (nearly everything I seem to do these days), these tools fail miserably by lacking the flexibility of the tools they seek to replace.

    Today, probably the best new tool for building reliable server artifacts that work across clouds is, in fact, the Dockerfile. Why? Because a Dockerfile is a hipster formatted shell script (just add meta-comments inline). If you want to replicate the build process, you can pretty much just run all the RUN lines on the command line and get the same result. It doesn't solve the management of your upstream repository mirrors, properly validating API and ABI versions of software, or properly managing security patches, but it is a sufficiently flexible build system that you can support most crazy edge cases.
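    To make that concrete, here is a minimal sketch. The image base, paths, and file contents are all hypothetical; the point is only that each RUN line is an ordinary shell command you can replay by hand.

```shell
#!/bin/sh
# A hypothetical Dockerfile might read:
#
#   FROM debian:stable
#   RUN mkdir -p /tmp/app-build/etc /tmp/app-build/srv/app
#   RUN printf 'listen 8080\n' > /tmp/app-build/etc/app.conf
#
# Replaying those RUN lines as a plain shell script yields the same artifact:
set -e   # fail fast, just as a failed RUN aborts a docker build

mkdir -p /tmp/app-build/etc /tmp/app-build/srv/app
printf 'listen 8080\n' > /tmp/app-build/etc/app.conf

echo "built artifact under /tmp/app-build"
```

    The discoverability win is that there is no hidden state: the whole build is the visible sequence of commands, which you can bisect and debug with nothing more than a shell.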

    This year, I discovered to my astonishment that just getting IT personnel to write shell scripts can be a huge multimillion dollar win for some of my customers. Doing basic shell scripting to apply security updates can save some businesses millions of dollars a day in fines. Forget trying to get those customers to adopt modern CI/CD methodologies; get them up to par with what you were doing 20 years ago. Don't worry about autoscaling their apps for cost savings; just try to get them to have a way to reproduce a build and a configuration.
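    The bar really is that low. A sketch of such a script, assuming a Debian-style host: the log path and apt-get invocations are placeholders, and RUNNER defaults to "echo" so it can be rehearsed as a dry run before pointing it at a real machine.

```shell
#!/bin/sh
# Minimal scheduled security-update script (hypothetical paths/commands).
# RUNNER defaults to "echo" for a safe dry run; set RUNNER="" on a real
# host (as root, e.g. from cron) to actually apply the updates.
set -eu

LOG="${LOG:-/tmp/security-updates.log}"
RUNNER="${RUNNER:-echo}"

run_security_updates() {
    {
        echo "== security update run: $(date -u +%Y-%m-%dT%H:%M:%SZ) =="
        $RUNNER apt-get update
        DEBIAN_FRONTEND=noninteractive $RUNNER apt-get -y upgrade
        echo "== completed =="
    } >>"$LOG" 2>&1
}

run_security_updates
echo "log written to $LOG"
```

    Even this much gives you a timestamped record of what was patched and when, which is exactly the audit trail that keeps the fines away.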

    Working today is good enough

    Over the past decade, I have spent most of my time in R&D. I've been building cutting edge tech that often made it into production right after the proof of concept was committed. Many of my co-workers have complained that my PoCs were not ready for production, but then they got stuck "productizing" them while I went off to work on the next new thing. Why were we selling something that was obviously only 60-80% baked?

    Well, for a lot of business problems, a solution that partially works today can be vastly more useful than no solution at all. This past year, I ran into an issue where a solution I had built in another context was immediately applicable to a new business problem. While my solution is only about 80% effective in its machine learning aspect, that is 100% more data than the company was collecting with its current solution (i.e. none). This raises the question of what one can charge for such a product. Since their current solution costs at least $120/yr, anything less than that would be a no-brainer, and even at a slightly higher cost it may still be worth it due to the enhanced capabilities. Working today is often good enough.

    The War on Systems Engineering

    In 2017 I had multiple projects that I built on various "serverless" platforms, including the one from https://serverless.com/, and I see these as following in the "more magic, less discoverability" camp. As each of the cloud platforms evolves, they are trying harder and harder to move up the abstraction stack. It allows them to market their services directly to programmers, and allows managers to pretend that they don't need to hire the specialized systems engineering skills that make a production system work at scale. As with most closeted outsourcing business decisions, the cost of these decisions will not be realized by the businesses that adopt them for many years.

    While it is "easy" to deploy a simple "function" to Google or Amazon, it is very difficult to properly architect a solution for any given use case. If you design your application to take advantage of auto-scale capabilities, you can throw money at the problem rather than solving it. If your application's use case doesn't take advantage of auto-scale (due to a functional requirement impedance), you won't have the systems knowledge or data to properly scale your design. The serverless platform becomes yet another black box that you can trial-and-error your way into a functional solution with, some of the time.

    Having used 4 different serverless-style platforms in 2017, I can say I ran into edge cases with each one fairly quickly. In each case there was no solution available that didn't involve at least spinning up multiple containers. By the end of the year, I was fully convinced that we were doing "serverless" better at wot.io in 2014 than any of the current providers are doing today. Once again it is a happy path problem, where the use cases these platforms were designed to solve are more specific than those of the generalized tools they seek to replace.

    As more and more companies adopt these technologies, they are locking themselves in for another round of painful lessons a few years from now, when these platforms suffer major design changes. Building a mission critical system on one of these platforms feels foolhardy at best, negligent at worst. While you may get a good time to market for your proof of concept implementation, your customers will suffer from the lack of proper systems engineering that goes into the design of these largely black box systems. Until their inner workings, profilers, tunable settings, and system parameters are properly documented and made visible to systems engineers, they should be avoided for serious work.