<aside> 💡 This is a WIP. It started as part of the Unreal Server Scaling Exercise but can/should be spun out into its own thing.
</aside>
With your SLI in hand, you can discuss your SLO, and from there you can get to your Service Level Agreement (SLA).
An SLA is a contract you sign with yourself and your customers.
<aside> 💡 The gut response is zero tolerance. That is prohibitively expensive, even if it were possible.
AWS goes down sometimes. So do GCP and every other service provider in your critical infrastructure. So right off the bat you probably aren’t hitting 99.999% uptime.
</aside>
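To put a number on why 99.999% is such a tall order, here is a minimal sketch that converts an uptime target into the downtime it leaves you per year. The function name and the 365-day year are assumptions made for illustration, not part of any provider's SLA definition.

```python
# Hypothetical helper: how much downtime does an uptime target allow?
# Assumes a 365-day year; real SLAs usually define their own measurement window.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    return (1 - availability_pct / 100) * period_minutes


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.999% that works out to a little over five minutes a year, which a single outage at one of your providers can use up on its own.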
But the much bigger point is opportunity cost.
Building new and exciting features for your players introduces instability to your service. Hardening these features before launch means shipping slower.
You gotta be shipping content to keep those KPIs up.
This defines the central tension between stability and business nimbleness.
SLAs give you a hard number to define this tension.
This is the way to think about SLAs:
How many times would you accept being in the “Feels Bad” zone if it meant launching a feature a month early?
One hour of bad vibes for getting a major feature out a month early? Probably a good deal.
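If you want to state that deal as an SLA number, here is a rough sketch that goes the other way: from hours of "Feels Bad" to an availability percentage. The function name and the 30-day month are assumptions for illustration only.

```python
# Hypothetical helper: what availability does a given amount of bad time imply?
# Assumes a 30-day month as the measurement window.
HOURS_PER_30_DAY_MONTH = 30 * 24  # 720 hours


def availability_pct(bad_hours: float,
                     period_hours: float = HOURS_PER_30_DAY_MONTH) -> float:
    """Return the availability percentage left after `bad_hours` of downtime."""
    return 100 * (1 - bad_hours / period_hours)


# One hour of bad vibes in a 30-day month:
print(f"{availability_pct(1):.2f}% availability for the month")
```

One bad hour in a month lands at roughly 99.86%, so agreeing to that trade means agreeing to a monthly target somewhere below 99.9%.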
Engineers, this is how you talk to Business people. You will see their eyes glaze over in real time if you start explaining why something is the way it is.
Don’t explain why. Convert the why into a choice that can be made, and give them the two sides of a deal.