Back when I started my career, budgets for servers were pretty straightforward. The IT team had a supplier who would install something in a datacenter, so you had a known cost of acquiring the hardware. And once it was in the DC, there was a known operating cost based on whatever rate you had with the DC operator. After that it was up to you to add load onto it, or VMs, or whatever else you fancied. Zero complexity.
The rise of cloud computing has drastically changed that cost structure. Rather than paying for potential capacity (i.e. 'a server') you now pay for the resources you actually use. It's a cost reduction across the board in most cases, to say nothing of the convenience of not having to manage or operate a DC (or deal with DC operators) and the inherent scalability. In short, for the majority of early stage software companies going cloud-first is by far the better option. Unfortunately, for many people going from a fixed resource cost to something variable has created a "we don't know what this will cost" world. As a CTO, not knowing the cost of running your operation is bad for you, impossible to work with for finance, and likely something that puts a very short lifetime on your presence in the C suite. So let's work through some practical steps on establishing and predicting your cloud costs, and being friends with the CFO. (note: I'm going to refer a lot to AWS in this post, but the principles apply to other cloud providers just the same)
Accuracy, Precision and Materiality
Before diving into the meat of this topic I want to make a small sidestep and talk about accuracy and precision, and the concept of materiality in financial terms. Materiality is technically defined as follows: "Information is material if omitting, misstating or obscuring it could reasonably be expected to influence the decisions that the primary users of general purpose financial statements make on the basis of those financial statements, which provide financial information about a specific reporting entity." Or, in a shorter interpretation: does the number have a meaningful impact on the overall result?
Precision and accuracy are of course a little more generally understood from back in high school science, even though in everyday use they often get conflated. So to refresh your memory: accuracy is the level of correctness compared to the "real" value you are trying to observe. Precision, on the other hand, refers to the amount of deviation you would see when measuring the same thing repeatedly, usually as a consequence of your measuring setup and instruments.
Let's take the example of trying to measure the temperature in a room (which is 20 degrees C). Your thermometer is consistently showing you values between 24.9 and 25.1. From that you could conclude that you are quite precise (there is little variation in your measurements) but somewhat inaccurate (seeing that you are about 5 degrees off). For the statistically minded folks, this is referred to as bias (how far off the true value you are) and variability (how much your measurement fluctuates around a central value). If you take this to an extreme you could say something like "we have 1 million users rounded to the nearest million", or "we have 1,000,000 users". If in reality you had 600,000 users, the former is accurate but not precise, the latter is precise but not accurate.
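The thermometer example above can be made concrete with a few lines of code. The readings below are invented to match the scenario (true temperature 20 °C, readings around 25 °C); bias captures accuracy, spread captures precision:

```python
# Hypothetical thermometer readings; the true room temperature is 20 °C.
readings = [24.9, 25.0, 25.1, 24.9, 25.1, 25.0]

true_value = 20.0
mean_reading = sum(readings) / len(readings)

# Bias: how far the central measurement sits from the true value (accuracy).
bias = mean_reading - true_value

# Variability: how much readings spread around their own mean (precision).
variance = sum((r - mean_reading) ** 2 for r in readings) / len(readings)
std_dev = variance ** 0.5

print(f"bias: {bias:+.1f} °C")      # large bias: inaccurate
print(f"spread: {std_dev:.2f} °C")  # small spread: precise
```

A big bias with a small spread is exactly the thermometer situation: precise, but not accurate.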
Combining this with the materiality principle of finance, the goal of this exercise is not to predict costs down to the penny but rather to provide adequate guidance. In other words, you want to be fairly accurate but not too precise. That's exactly where I feel the AWS Cost Calculator comes up short. It requires far too much detail that is often not material, and generally creates more confusion than answers.
Cloud costs are based on the principle that you pay for what you use. And ultimately that use is driven by usage (i.e. the number of visitors to your website, the amount of API requests your app makes, ...). However, costs don't all scale in the same way. There are three main models of cost scaling:
- Linear: The traditional pay-as-you-go type. CDN traffic cost is a great example of this. You pay for every GB of traffic that is served off the CDN.
- Step Function: Typically something like an EC2 instance on AWS, and the closest you'll get to the old school cost model. Essentially you pay for having capacity available which can serve N users. Once you go over N you need to add another bit of capacity, which then creates availability for up to 2N users.
- Fixed costs: Some parts of your cost don't really vary at all. Reserved instances on AWS (where you essentially pre-pay for capacity) are a great example of this.
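To make the three scaling models tangible, here's a toy cost function combining them. Every number in it (the fixed fee, the per-GB rate, the step size and price) is invented purely for illustration, not an actual AWS price:

```python
import math

def monthly_cost(users: int) -> float:
    """Toy monthly cost model combining the three scaling types.
    All rates and thresholds are made-up illustrative numbers."""
    # Fixed: reserved instances, support plans, etc. Doesn't move with usage.
    fixed = 500.0

    # Linear: e.g. CDN traffic at $0.08/GB, assuming ~2 GB served per user.
    linear = users * 2 * 0.08

    # Step function: one $120 instance per 10,000 users of capacity.
    steps = math.ceil(users / 10_000) * 120.0

    return fixed + linear + steps

for u in (5_000, 10_000, 10_001, 50_000):
    print(u, monthly_cost(u))
```

Notice the jump between 10,000 and 10,001 users: one extra user triggers a whole new capacity step, which is exactly the behaviour that makes cloud bills feel unpredictable if you haven't mapped those thresholds.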
Every architecture is ultimately a set of decisions and tradeoffs. Did you optimize for computation-intensive tasks, are you running a microservice setup with an event driven system, is it a monolith application with a single big database, is it replicated across the globe or fronted by a CDN, etc. And with those decisions come different cost implications. For instance, a system that needs minimal compute and serves everything off the edge will scale linearly in cost with usage, whereas a process that relies on heavy calculations and lots of CPU firepower will have a fairly hefty step once you hit a threshold. Those architecture decisions matter not just for how your system will perform, but also for how your cost will scale.
Usually this kind of cloud cost prediction problem isn't really a concern early on in the business. Typically it starts when there is some system in place and you're dealing with growth while the business is on a path towards being profitable (or increasing margins). I'm going to assume that you've got some basics in place with the set up of your cloud provider: resources are appropriately tagged or under different billing accounts. If that's not the case, go fix that first. You need some amount of sane input data before you can start slicing it.
The first thing to do is simply go back over a few months of data and answer some basic questions:
- Which services are taking the bulk of the cost?
- How do the various services' costs behave over time? And how does that tie in with usage data or big changes/releases?
- Does all that make sense?
When you have a feel for both the state of the bill and the behaviour you have a baseline. All things being equal, this behaviour should hold. Accounting for a bit of fluctuation in exchange rates etc, you could say that within a small tolerance these costs should remain stable.
Once you have an established baseline it's time to start understanding the limits of your system. The exact technical details are outside the scope of this article, but in summary you can build a simple tool that will hit your servers the way a user typically would. There are enough web-automation frameworks out there that can simulate user journeys, or you could simply write a headless version that hits your API in a sequence. The concept is to create an "agent", a piece of code that behaves as a user would. Once you have one of those it becomes a matter of deploying it at scale and pointing all the agents at your servers. It's a variation on classic performance testing.
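A minimal sketch of such an agent might look like this. The journey paths are hypothetical placeholders, and `fetch` is deliberately left as a pluggable callable so you can wire in requests, httpx, or whatever HTTP client you prefer:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# A typical user journey expressed as a sequence of (hypothetical) API paths.
JOURNEY = ["/login", "/home", "/search?q=widgets", "/product/42", "/checkout"]

def run_agent(fetch, journey=JOURNEY, think_time=0.0):
    """Walk one user journey. `fetch` is any callable that takes a path and
    returns an HTTP status code (e.g. a thin wrapper around your HTTP client)."""
    results = []
    for path in journey:
        status = fetch(path)
        results.append((path, status))
        time.sleep(think_time)  # simulate a human pausing between actions
    return results

def run_swarm(fetch, n_agents=50):
    """Deploy many agents concurrently, all pointed at the same system."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        futures = [pool.submit(run_agent, fetch) for _ in range(n_agents)]
        return [f.result() for f in futures]
```

Ramping `n_agents` up while watching response times and resource utilization is what surfaces the capacity thresholds discussed next.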
While there is a lot to learn here about how your system behaves under load, the goal is to establish what those step functions look like, i.e. at what point your current capacity maxes out and you would need to add another step.
Putting it all together
If you've gone through those exercises you've got some interesting data. Let's put it all together.
- You established a baseline of costs which give you a budget for current performance.
- You understand how extra user load will affect the step functions in your cost model, i.e. at what point you need to make a step in resources.
- You know your fixed costs and overheads from your current bill.
- You understand the structure of your cost in terms of what is and isn't costly.
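Those four pieces of data combine naturally into a rough forecasting function. Every parameter below is a placeholder you would replace with your own baseline figures and load-test findings:

```python
import math

def forecast(projected_users,
             baseline_cost=7_000.0,   # current monthly bill (placeholder)
             baseline_users=20_000,   # load the baseline was measured at
             linear_rate=0.15,        # variable cost per user above baseline
             step_capacity=10_000,    # users one capacity "step" can serve
             step_cost=120.0):        # monthly cost of adding one step
    """Rough monthly cost forecast from a measured baseline plus the linear
    and step-function components. All defaults are illustrative assumptions."""
    extra_users = max(0, projected_users - baseline_users)
    extra_steps = math.ceil(extra_users / step_capacity)
    return baseline_cost + extra_users * linear_rate + extra_steps * step_cost
```

Crude as it is, a function like this answers the CFO's question directly: "if we grow to X users, the bill should land around Y, give or take" — accurate enough to budget against, without pretending to penny-level precision.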
Ultimately this becomes a powerful tool to help the engineering team focus, while giving them the autonomy to make decisions. You know what a day of engineering time costs, and you know what a small saving represents. Which means engineers can now focus their efforts on high-value optimizations rather than trying to make everything better, and as a CTO you can clearly demonstrate why that work is valuable, or what the tradeoff is between new features and cost programs.