Sep 29, 2020
Back when I started my career, budgets for servers were pretty straightforward. The IT team had a supplier who would install something in a datacenter, so you had a known cost of acquiring the hardware. And once it was in the DC there was a known operating cost based on whatever rate you had with the DC operator. After that it was up to you to add load onto it, or VMs, or whatever else you fancied. Zero complexity.
The rise of cloud computing has drastically changed that cost structure. Rather than paying for potential capacity (i.e. 'a server') you now pay for the resources you actually use. It's a cost reduction across the board in most cases, not to mention the convenience of not having to manage or operate a DC (or DC operators) and the inherent scalability. In short, for the majority of early stage software companies going cloud-first is by far the better option. Unfortunately, for many people, going from a fixed resource cost to something variable has created a "we don't know what this will cost" world.
As a CTO, not knowing the cost of running your operation is bad for you, impossible to work with for finance and likely something that puts a very short lifetime on your presence in the C suite.
So let's work through some practical steps on establishing and predicting your cloud costs, and being friends with the CFO. (note: I'm going to refer a lot to AWS in this post, but the principles apply for other cloud providers just the same)
Accuracy, Precision and Materiality
Before diving into the meat of this topic I want to make a small sidestep and talk about accuracy and precision, and the concept of materiality in financial terms. Materiality is technically defined as follows: "Information is material if omitting, misstating or obscuring it could reasonably be expected to influence the decisions that the primary users of general purpose financial statements make on the basis of those financial statements, which provide financial information about a specific reporting entity." Or, in shorter form: does the number have a meaningful impact on the overall result?
Precision and accuracy are of course a little more generally understood from back in high school science, even though in common everyday use they often get conflated. So to refresh the memory: accuracy is a level of correctness compared to the "real" value you are trying to observe. Precision on the other hand refers to the amount of deviation you would see when measuring the same thing repeatedly, usually as a consequence of your measuring setup and instruments.
Let's take the example of you trying to measure the temperature in a room (which is 20 degrees C). Your thermometer is consistently showing you values between 24.9 and 25.1. From that you could conclude that you are quite precise (there is little variation in your measures) but you are somewhat inaccurate (seeing that you are about 5 degrees off). For the statistically minded folks, this is referred to as bias (how far off the true value are you) and variability (how much does your measurement fluctuate from a central value).
If you take this to an extreme you could say something like "we have 1 million users, rounded to the nearest million", or "we have 1,000,000 users". If in reality you had 600,000 users, the former is accurate but not precise, while the latter is precise but not accurate.
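To make the distinction concrete, here's a tiny Python sketch of the thermometer example above: the bias captures the accuracy problem, the standard deviation captures the precision. The readings are made-up samples consistent with the example.

```python
from statistics import mean, stdev

# Thermometer example: the true room temperature is 20 °C,
# but our instrument consistently reads between 24.9 and 25.1.
true_value = 20.0
readings = [24.9, 25.0, 25.1, 24.9, 25.1]  # hypothetical repeated measurements

bias = mean(readings) - true_value  # accuracy: how far off the true value are we?
variability = stdev(readings)       # precision: how much do we fluctuate?

print(f"bias: {bias:.1f} °C")           # ~5.0 → quite inaccurate
print(f"variability: {variability:.2f} °C")  # ~0.10 → very precise
```

A high bias with low variability is exactly the "precise but inaccurate" thermometer from the example.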
Combining this with the materiality principle from finance, the goal of this exercise is not to predict costs down to the penny but rather to provide adequate guidance. In other words, you want to be fairly accurate but not overly precise. That's exactly where I feel the AWS cost calculator comes up short: it requires far too much detail that is often not material, and generally creates more confusion than answers.
Cloud costs are based on the principle that you pay for what you use. And ultimately your cost is driven by usage (i.e. the number of visitors to your website, the number of API requests your app makes, ...). However, these costs don't all scale in the same way. There are 3 main models of cost scaling:
- Linear: The traditional pay-as-you-go type. CDN traffic cost is a great example of this: you pay for every GB of traffic that is served off the CDN.
- Step Function: Typically something like an EC2 instance on AWS, and the closest you'll get to the old school cost model. Essentially you pay for having capacity available which can serve N users. Once you go over N you need to add another unit of capacity, which then creates availability for up to 2N users.
- Fixed costs: Some parts of your cost don't really vary at all. Reserved instances on AWS (where you essentially pre-pay for capacity) are a great example of this.
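A minimal sketch of the three models in Python. All prices and capacities below are made-up illustrative numbers, not real AWS rates:

```python
import math

# Hypothetical unit prices -- illustrative only, not real AWS pricing.
CDN_PRICE_PER_GB = 0.08      # linear: pay per GB served
INSTANCE_PRICE = 70.0        # step: monthly cost of one instance
USERS_PER_INSTANCE = 10_000  # capacity N of a single instance
RESERVED_BASE = 500.0        # fixed: reserved instances, support plans, ...

def linear_cost(gb_served: float) -> float:
    """Scales directly with usage."""
    return gb_served * CDN_PRICE_PER_GB

def step_cost(users: int) -> float:
    """Jumps by one instance each time usage exceeds a multiple of N."""
    instances = max(1, math.ceil(users / USERS_PER_INSTANCE))
    return instances * INSTANCE_PRICE

def fixed_cost() -> float:
    """Doesn't vary with usage at all."""
    return RESERVED_BASE

print(step_cost(9_000))   # 70.0  -- one instance covers up to 10k users
print(step_cost(10_001))  # 140.0 -- one user over N forces the next step
```

Note how the step function is completely flat until the threshold, then jumps by a whole instance. That's the shape that makes cloud bills feel unpredictable if you don't know where the thresholds sit.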
Every architecture is ultimately a set of decisions and tradeoffs. Did you optimize for computation intensive tasks, are you running a microservice setup with an event driven system, is it a monolith application with a single big database, is it replicated across the globe or fronted by a CDN, and so on. And with those decisions come different cost implications.
For instance, a system that needs minimal compute and serves everything off the edge will scale linearly in cost with usage, whereas a process that runs heavy calculations and needs lots of CPU firepower will have a fairly hefty step once you hit a threshold.
Those architecture decisions are important not just to how your system will perform, but also how your cost will scale.
Usually this kind of cloud cost prediction problem isn't really a concern early on in the business. Typically it starts when there is some system in place and you're dealing with growth while the business is on a path towards profitability (or increasing margins). I'm going to assume that you've got some basics in place in the setup of your cloud provider: resources are appropriately tagged or split across billing accounts. If that's not the case, go fix that first. You need some amount of sane input data before you can start slicing it.
The first thing to do is to simply go back over a few months of data and answer some basic questions:
- Which services are taking the bulk of the cost?
- How do the various services' costs behave over time? And how does that tie in with usage data or big changes/releases?
- Does all that make sense?
When you have a feel for both the state of the bill and its behaviour, you have a baseline. All things being equal, this behaviour should hold. Accounting for a bit of fluctuation in exchange rates etc., you could say that within a small tolerance these costs should remain stable.
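As a sketch of what that baseline check could look like in practice. The service names, dollar amounts and tolerance below are all hypothetical; the shape of the data is the kind of per-service breakdown you'd pull from your billing console once tagging is in place:

```python
from statistics import mean

# Hypothetical monthly bills per tagged service, in dollars.
history = {
    "compute": [4200, 4350, 4280],
    "cdn":     [900, 950, 1010],
    "storage": [310, 315, 320],
}

TOLERANCE = 0.10  # flag anything more than 10% off its baseline

# Baseline: the average of the last few months, per service.
baseline = {svc: mean(bills) for svc, bills in history.items()}

def check_month(current: dict) -> list:
    """Return the services whose latest bill drifts outside the tolerance band."""
    return [svc for svc, cost in current.items()
            if abs(cost - baseline[svc]) / baseline[svc] > TOLERANCE]

# A new month comes in: CDN costs have jumped well past the baseline.
print(check_month({"compute": 4300, "cdn": 1200, "storage": 318}))
```

The point isn't the arithmetic, it's the habit: every month, compare the bill against the baseline and investigate anything that drifted.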
Once you have an established baseline it's time to start understanding the limits of your system. The exact technical details are outside the scope of this article, but in summary you can build a simple tool that hits your servers the way a user typically would. There are enough web-automation frameworks out there that can simulate user journeys, or you could simply write a headless version that hits your API in a sequence. The idea is to create an "agent": a piece of code that behaves as a user would. Once you have one of those it becomes a matter of deploying it at scale and pointing all of the agents at your server. It's a variation on classic performance testing.
While there is a lot to learn here about how your system behaves under load, the goal is to establish what those step functions look like. That is: at what point does your current capacity max out, forcing you to add another step?
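A toy version of that threshold hunt. Here a synthetic latency curve stands in for the real measurements your agents would collect, and the SLA, capacity and ramp increment are all assumed numbers; in practice `observed_latency_ms` would be replaced by actual load-test results:

```python
SLA_MS = 450.0  # assumed latency budget per request

def observed_latency_ms(concurrent_agents: int) -> float:
    # Hypothetical stand-in for real measurements: latency degrades
    # sharply as load approaches the system's capacity.
    base, capacity = 80.0, 5000
    utilization = min(concurrent_agents / capacity, 0.999)
    return base / (1 - utilization)

def find_step_threshold(max_agents: int = 10_000, increment: int = 100) -> int:
    """Ramp up the agents and return the smallest load at which the SLA
    breaks -- i.e. the point where you'd need to add the next capacity step."""
    for agents in range(increment, max_agents + 1, increment):
        if observed_latency_ms(agents) > SLA_MS:
            return agents
    return max_agents

print(find_step_threshold())  # 4200 -- with these made-up numbers
```

That returned number is the "N" in your step function: the usage level at which your current capacity maxes out and another unit of cost kicks in.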
Putting it all together
If you've gone through those exercises you've got some interesting data. Let's put it all together.
- You established a baseline of costs which give you a budget for current performance.
- You understand how extra user load will affect the step functions in your cost model, i.e. at what point you need to make a step up in resources.
- You know your fixed costs and overheads from your current bill.
- You understand the structure of your cost in terms of what is and isn't costly.
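Those ingredients combine into a rough forecasting model. This is a sketch with entirely hypothetical rates, traffic profiles and capacities; the point is the shape (fixed + linear + step), not the numbers:

```python
import math

# All figures below are made-up assumptions for illustration.
FIXED = 500.0                # reserved instances, support plans, overheads
GB_PER_USER = 0.5            # CDN traffic per user per month
PRICE_PER_GB = 0.08
USERS_PER_INSTANCE = 10_000  # the step threshold found via load testing
INSTANCE_PRICE = 70.0        # monthly cost per capacity step

def forecast(users: int) -> float:
    """Monthly cost estimate: fixed base + linear traffic + stepped compute."""
    linear = users * GB_PER_USER * PRICE_PER_GB
    step = math.ceil(users / USERS_PER_INSTANCE) * INSTANCE_PRICE
    return FIXED + linear + step

for users in (10_000, 25_000, 50_000):
    print(users, round(forecast(users), 2))
```

Hand your CFO a table like that, with the assumptions spelled out, and "we don't know what this will cost" becomes "here's the cost at 2x and 5x our current usage".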
Ultimately this becomes a powerful tool to help the engineering team focus, while giving them the autonomy to make decisions. You know what a day of engineering time costs, and you know what a small saving represents. Which means engineers can now focus their efforts on high value optimizations rather than trying to make everything better, and as a CTO you can clearly demonstrate why that work is valuable, or what the tradeoff is between new features and cost programs.
Sep 15, 2020
When I started getting involved in project management it didn't take long for someone to bring up the iron triangle. In essence it describes the relationship between cost, time, scope and quality, and how as a manager changing one necessarily means a change in the others. These days my LinkedIn feed is full of wisdom along the lines of "Good, Cheap, Fast: choose two". The times they aren't exactly a-changin'...
So rather than regurgitate that triangle, let me substitute the 3 corners with a different set of values: Form, Function and Execution.
It could look a bit like this
Function -- The 'what'
Taking ownership of "what" is typically the realm of product teams. I'm not trying to suggest that they are the only ones involved in defining the function of a product or feature, but they are leading the charge on discovering and defining this.
Without trying to provide an exhaustive list, there are quite a few good tools out there for uncovering function. The Design Sprint can be quite effective, particularly if you're in need of rapid prototyping and lots of fresh ideas. Equally the lean startup ethos and Lean Canvas can be useful frameworks. And of course you can also ground yourself in user data, whether that's focus groups, surveys or insights gathered from existing usage. All of those ultimately serve to answer the same macro question: "what does this thing do" or, more aptly, "how does it add value for the customer". (and yes, the customer could be internal to the org)
Form -- The 'experience'
This is predominantly the corner of the creative team, and I take creative in a fairly broad sense. Obviously it'll feature the usual suspects producing stunning visual designs, but I also think of User Experience and related research, copywriters, etc. Most roles that contribute, to first order, to "the look and feel" of your product are in this corner for me. I do say first order, because obviously there are second-order things such as performance, reaction time, etc.
While Form has a value of its own, for most commercial products I would say that its main purpose is to support and enhance function. There is a lot of literature on form vs function (or how form follows function if you're into industrial design), but the reality is that they can't really live without each other. Even "no design" is still a design choice.
Execution -- Bringing it all together
The third component in this trinity is what I call execution. It's all the activities that take the input from product & design and transform it into a tangible product for people to use. In software, that means the bulk of the work here is in the engineering team. In reality product & design should work closely with the engineering team, because engineers make thousands of small decisions as they work. All of those have an impact on the final output.
To give you an idea: something as simple as rounded corners on a button used to be rather difficult technically, and so even though the design team might have them everywhere, when time is finite engineers might agree with design to ship rectangular buttons. A particular report might be a great feature, but it turns out it takes 5 minutes to generate... does that still meet the "adding value for the customer" test?
These kinds of questions do come up, and no engineer tries to do a bad job, but the decisions they make determine what the final product will look like... by definition.
Ok...so play nice?
I'm hoping it's obvious that these 3 groups don't really play in isolation, even though I've seen many organisations pretend they are. How often have you seen a company go through this cycle: define a feature, get the creatives to design it, give the "final design" to the dev team, dev team produces something within feasible technical constraints, ... original feature doesn't really address the problem and isn't quite what the design team thought it would be.
It's one of the main reasons that I believe in cross-functional teams. If these 3 teams work closely together and are aligned (and incentivized) to produce a result together, the outcome tends to be better.
Aug 02, 2020
"Let's apply some duct tape and we'll sort it out properly later on"
I'm going to take a wild guess and claim you've said or thought that at some point in your life. It doesn't really matter if it was a DIY situation, a broken down vehicle, some plumbing or any other situation where you needed a 'quick fix'.
Quick fixes in software
The truth is that when writing software we would love to fix everything properly, but for a long list of reasons you might not get there. It could be anything from deadlines to bugs you can't quite smash to a tired developer, ... The problem with applying duct tape in software, though, is that it's not quite as visible as in real life and a lot easier to forget. If it ain't broke, don't fix it; right? So why would you go back and undo the duct tape for a proper solution?
Given the reality that we will have duct tape, I'd rather shift the attention to the decision of applying it. In my ideal world, duct tape is a choice. Not a tool to be forbidden, but one to be applied knowingly. Anyone who's been on my team for a while has heard me say this:
"There will be duct tape, I just want to know where it is"
Most of the time when creating delivery processes my goal is not to eliminate the duct tape. In fact, just like in real life there are many scenarios where a short term "quick fix" is a perfectly acceptable solution. What I am trying to create is a situation in which the choice to apply duct tape is made for the right reasons, or at least with the tradeoffs understood. A good process in that sense creates visibility and room for discussion, without being overly religious about doing "the right thing"; instead it takes things on a case by case basis.
In theory this is very easy: creating visibility into what's going on can be done any number of ways, and there are entire books written about methodologies that raise awareness of the work going on. No matter if you are looking at an agile or a more traditional process, they all have aspects that create visibility and inspect the work. The hard part, however, is creating a culture in which engineers, or generally anyone at the front line, feel comfortable highlighting that they are about to apply duct tape and can have that conversation. That kind of culture, in particular, is what I focus on. A process is only as good as the way it is executed, and a lot of that comes back to culture and teamwork.
Aug 01, 2020
I love a good model. Imperfect as they can be, they're invaluable in trying to organize the world and create clarity between detail and big picture.
As a consultant it was only ever going to be a matter of time before I ended up creating my own model for the world around me: the world of (software) delivery organisations.
At heart my job is usually very simple to define: create a system in which the right work can be delivered in a controlled manner. Let's unpack that a little bit. It starts by knowing what "the right work" really is.
Delivering the wrong thing isn't the goal here, or in fact anywhere I'd hope. So any system with that goal has to deal first with understanding the input and how to make sure that input fits those goals.
The second part, "a controlled manner" doesn't mean a command-control "manage the detail" style. Instead I mean controlled in a more statistical interpretation, as in "within the bounds of acceptable variation".
The last part to think about, and arguably the most important, is that any process or framework ultimately has a number of people to think about. While I do subscribe in part to the Ohno "it's the system, stupid" line of thinking, in the end the work still is done by people and so that's a dimension that has to be factored in. Each of those is worthy of a long discussion, but for now let me stick to introducing the Onion.
The Onion really is a meta-framework, if you will. It doesn't prescribe any particular actions the way some project management methods would. Instead it takes a step back and tries to place the various methodologies and practices in context. In its simplest form, the onion is a set of concentric layers from the inside out: delivery, quality, automation, framework, governance.
The heart of the onion is delivery. Going back to what I said earlier, designing systems in which the right work can be delivered is essentially my job. At its core, the capacity to put products or services in the hands of customers is what drives a business forward and generates revenue. It doesn't really matter if your product is a small toy, a complex supercar, some software or a product that is really a service, like a lawyer or consultant. At the end of the day, value comes from delivering this product to people who have a use for it. That's as true on a macro level (i.e. company wide) as it is on a team or even individual level. The only difference really is the scope or type of customer, which could vary from team mates to internal teams to external customers.
I tend to look at quality through 2 questions:
1. Are you delivering the right thing?
2. Are you delivering the thing right?
Or, slightly more philosophically: is delivering the wrong thing right better than delivering the right thing wrong?
The first perspective is all about understanding demand and requirements, the second is about inspecting the processes. Not looking at both of these perspectives is a classic pitfall when organisations "go Agile". They'll put all their focus on the second perspective, trying to do a better version of Scrum/Kanban/SAFe, but end up simply building the wrong thing better.
In a perfect world where you are delivering both the right thing and in the right way, the focus shifts to repeatability. I would also expand that definition to "repeatable without extra variability". In other words, consistency is also important.
That's where automation comes in.
I'm going to take a pretty broad definition here of automation as any rules-based system that gets consistently executed. In its purest form that is probably code, but it could equally be a set of rules for people to follow to help them make decisions.
If you think about it in terms of the dollar value of a person's time, it's easy to see that situations requiring judgement and creative solutions are worth more than repeating a set of steps over and over. And if anything, the latter is more prone to human error.
So with that in mind, gear your automation towards maximum leverage while allowing enough variability in the creative/hard/unclear situations. 100% automation isn't the goal; it's about freeing up people to engage with the hard problems rather than the mundane.
This layer and the governance layer around it are less about directly affecting the product; they focus more on the processes, both their creation and their management. In my experience people discuss this layer a lot, but often don't bother reflecting on how it affects the 3 layers inside it.
"We're doing Scrum" has become shorthand for a few ceremonies with sprints and daily standups. Similarly, Kanban seems to have been reduced to swimlanes and no sprints.
Consider this though: which company has ever won because they had the best PRINCE2, Scrum, Kanban, SAFe, ... implementation? I'd argue those that won consistently delivered value to their customers, and their internal process was likely some amalgamation of things that worked for them.
In many ways this layer deals with the most important decision of all: the decision of how you make decisions.
Governance isn't about picking a particular framework, let alone directly impacting quality levels in the delivery. The core problem of governance is defining focus, boundaries, tolerances, and even strategy. It sets the broad guidelines for what is valued, what the organisational design should optimize for and the top level execution of how that will be achieved.
As you move through the layers, two dimensions change. The first is very obvious: you are moving away from directly impacting the core. The second is the time it takes to get feedback on a change. Changing a line of code has an immediate result; tweaking the parameters of how you make decisions will take quite a while before the effects are seen throughout the organisation.
This is at heart the difficulty with moving from individual contributor roles into leadership, and either of those trying to overreach too far is likely to have negative effects. In fact individual contributors typically look from the inside out, whereas senior managers tend to look from the outside layers inwards. As a result they might optimize for different things and at times even seem at odds. This model however has helped me for well over a decade in staying grounded, and help teams focus on the place where they can add most value for the entire organisation.