Tech, 10 yrs ago
Back when I started in tech, the most common deployment was what is known as a LAMP stack: an architecture that was simple to reason about. We commonly deployed such a stack by leasing servers from vendors like SoftLayer and Rackspace, or by leasing space in datacenters directly and dealing with all the details of networking, cage space, cooling, power, and so on. That environment created a generation of operations engineers who knew far too much about how the internet really works. Baling wire and glue, we liked to call it. You only have to read one BGP incident review writeup to see how fragile it all really is. And because that stack was simple and had ‘A’ database, we could get away with deploying it by hand. Configuration was artisanal, tuned to specifically match our workload: snowflakes and highlanders all the way down.
Fast forward more than ten years, and a diminishing number of companies run their tech that way. For shops running at moderate scale, or new startups that are still iterating and haven’t hit their stride yet, the common path is the cloud. Not just leasing EC2 instances from AWS, but using managed services that shorten the time to features for engineers. Need a database? Here is one that is not just up and running in minutes, but comes with write failover, replicas in multiple regions, and basic monitoring and metrics all baked in. Need container orchestration? Here is a hosted kube cluster where all you have to do is provide the helm charts describing your deployments; its care, feeding, and management are all solved for you. Need to deploy services? You don’t need to size some hosted virtual machine or manage instance reservations anymore. Use Lambda to deploy your code directly, set your concurrency and conditions for running, and you are off to the races.
We don’t need to learn how these complex stacks work under the hood; it’s all managed, right?
Cloud reality check
In reality, and put bluntly, there is no free lunch. Not forever, at least. Yes, managed databases and Kafkas and kube are convenient and much simpler to use when things are new, you don’t have a lot of customers yet, and you just want to focus on shipping new features. But if you are truly banking on long-term success and a growing market share, you cannot rely on managed services forever.
All systems have limits, and when you are using a managed service, the best you can get are the uptime and reliability guarantees that the cloud provider offers. Sometimes those guarantees are not enough. And even when they are, sometimes vendors don’t meet them, and the best they can do is refund you some money or hand out some ‘cloud credits’. If you are already past the ‘growing pains’ stage as a company and have a large roster of very big customers who have uptime expectations of your service, you can’t just tell them “well, our cloud provider apologized”. As Jeff Hodges asks on one of the best websites out there: Who owns your availability?
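To make the ‘nines’ point concrete, here is a back-of-the-envelope sketch of why your availability ceiling is set by your providers. The dependency count and the 99.9% SLA figures are illustrative, not taken from any real contract:

```python
# Back-of-the-envelope availability math: if a request must touch several
# managed services in series, your availability ceiling is the product of
# their SLAs -- you cannot be more available than that product.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def composite_availability(slas):
    """Availability ceiling for a path that needs every dependency to be up."""
    ceiling = 1.0
    for sla in slas:
        ceiling *= sla
    return ceiling

# Illustrative numbers: three managed dependencies, each promising 99.9%.
deps = [0.999, 0.999, 0.999]
ceiling = composite_availability(deps)
downtime_minutes = (1 - ceiling) * MINUTES_PER_YEAR

print(f"availability ceiling: {ceiling:.5f}")               # ~0.99700
print(f"implied downtime/yr:  {downtime_minutes:.0f} min")  # ~1575 min, ~26 hours
```

Three honest 99.9% dependencies leave you promising roughly a day of downtime a year before your own bugs even enter the picture, which is exactly why “our cloud provider apologized” doesn’t work as an answer to customers.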
And this goes beyond just uptime nines and availability. As your product grows, you will learn the limits, sharp edges, and bear traps of all sorts of managed cloud services. Aurora doesn’t like going past a certain write throughput? Dynamo can develop hot shards? Certain parts of EKS are not yet as configurable as you need them to be? These are all real things that can cause scaling issues, unexpected behavior in your software, or, at the extreme, outages. The managed service technically did not have an ‘incident’, but like all complex distributed systems, it has limits, and the risk of your success is being the first, or one of the first, to find them.
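In practice, hitting those limits usually shows up as throttling errors long before an outright outage, and the standard client-side mitigation is capped exponential backoff with jitter. A minimal sketch of that pattern follows; the base delay and cap are illustrative choices, not vendor recommendations:

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter exponential backoff.

    Returns a random sleep in [0, min(cap, base * 2**attempt)]. This is
    the usual client-side response to throttling from a managed service
    (e.g. a DynamoDB throughput-exceeded error); base and cap here are
    illustrative, not tuned recommendations.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# The delay ceiling grows with each retry but never exceeds the cap, and
# the jitter spreads retries out so a fleet of clients doesn't thunder
# back against the service in lockstep.
for attempt in range(10):
    ceiling = min(30.0, 0.1 * (2 ** attempt))
    print(f"attempt {attempt}: sleep up to {ceiling:.1f}s")
```

The jitter matters as much as the exponent: without it, every throttled client retries at the same instant and re-creates the hot spot that caused the throttling in the first place.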
Working with younger engineers/Where senior engineers come from
Engineers early in their careers often find that the advice to ‘learn the sharp edges of your tools’ constricts them: it keeps them from providing value, from building things, from getting stuff done. That tension can frustrate seasoned engineers, who may feel the new team members are about to make the same mistakes they once did. This is not a surprising state of affairs, but it is something organizations need to get ahead of and treat as an opportunity for learning. Design reviews, pairing between senior team members and newcomers, brown bags that tell stories of past mistakes: these all help cross-pollinate knowledge and level up the new team members. More importantly, they fill gaps in the experience of the senior engineers you hire. Because, like it or not, even engineers with years of experience may not have encountered the things you had to scale against in your specific organization or product.
Learning from incidents and near incidents. Nurturing inquisitiveness
This is a topic that can take up not just a whole other blog post but entire books and academic papers, and it does. But I would be remiss if I did not also mention the huge importance of learning, as an organization, how to learn from both incidents and near-misses. This is about more than “how do we steer/facilitate retrospectives” or “what artifacts should we produce from such meetings”, though both are important questions. At the intersection of “learning from incidents” and “how we grow senior engineers”, I am interested in how these learning exercises, when done right, become a critical tool for transferring the invaluable ‘smell test’ or ‘hunch’ that valuable senior engineers have, and for showing less experienced team members how to troubleshoot the complex systems we have built. Whether you run in a data center and manage your own kube clusters or are gluing together managed services in a public cloud, it is the incidents and near-misses, and how you reason about them, that evolve the team’s ability to know what their code really does. Once that value proposition sinks in, it becomes clear how important it is that your engineering organization fosters an atmosphere of inquisitiveness.
Invest in your engineers’ learning. There is no way around this being an explicit investment by fast-growing companies. The cloud is certainly convenient, and it is completely understandable why companies prefer to start new products there and orient rewrites of old systems toward it. Engineering time is the most expensive asset a tech company has, and all companies are now tech companies, so anything that speeds getting new features into customers’ hands is a win. That makes it all the more important for companies that want to build a scaling engineering team to invest in learning intentionally. In the past, this learning happened ‘by accident’: old-school system admins learned the sharp edges of tech by bleeding on them, but now the cloud is leaving younger engineers less aware of the things that will cause issues at 10x (or 100x) scale. And for those of us who know the sharp edges exist: keep reminding yourself what it was like to be a novice. It will keep you humble and make you a better mentor for the newcomers.