Cisco AppDynamics Community

Prateek.Sachan · ‎05-26-2020

Estimated Reading Time: 5 mins

Why use Error Budget

Developers want to push features to production as quickly as possible. SREs and Ops engineers want to keep availability SLOs compliant and production stable. As a service owner, how do you weigh the risk of deploying changes to production against its current stability? Use error budgets!

Your availability SLO quantitatively defines your error budget. Too much downtime in a month could (should!) mean that you stop deploying features for the rest of that month, because you are no longer meeting your availability SLO. Product management should state the availability targets clearly and be objective and quantitative about the risk the service is willing to tolerate. This is where error budgets help!

Understanding Error Budget

Typically, an availability SLO is a figure such as 99.99%. The error budget is the remaining 0.01%.
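To make that 0.01% concrete, here is a minimal sketch that converts a time-based SLO into allowed downtime. The function name `error_budget` and the 30-day window are assumptions chosen for illustration, not part of any AppDynamics API.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Allowed downtime within `window` for a time-based availability SLO."""
    budget_fraction = (100 - slo_percent) / 100  # e.g. 99.99% leaves 0.01%
    return window * budget_fraction


# Assumed window: 30 days. A 99.99% SLO leaves only a few minutes of budget.
budget_30d = error_budget(99.99, timedelta(days=30))
print(f"{budget_30d.total_seconds() / 60:.2f} minutes")
```

For a 30-day window this works out to roughly 4.3 minutes of downtime, which shows how little room a "four nines" target leaves.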

Error budgets are also often misunderstood! Most commonly, a 99.99% SLO is read as the service having been available 99.99% of the TIME over the window in which the SLO is calculated, e.g. the last quarter, the last 30 days, or the last week.

Let me ask you a question here. Consider a sample service called myService that receives 100 requests a day. It was up for the entire day, but when it received those 100 requests, say between 11:00 AM and 2:00 PM, it returned errors for 90 of them. What would you count as its availability? 100% or 10%?

If you know the best model for availability, you will know the right error budget as well. The unit for error budgets need not be time. Google's SRE practices define two models for calculating availability:

Time-based availability:

availability = uptime / (uptime+downtime)

Aggregate availability:

availability = successful requests / total requests
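As a quick illustration, the two formulas can be applied to the myService example above. This is only a sketch; the function names are made up for this post.

```python
def time_based_availability(uptime_s: float, downtime_s: float) -> float:
    """availability = uptime / (uptime + downtime)"""
    return uptime_s / (uptime_s + downtime_s)


def aggregate_availability(successful: int, total: int) -> float:
    """availability = successful requests / total requests"""
    return successful / total


# myService: up for all 24 hours, but 90 of its 100 requests returned errors.
time_based = time_based_availability(uptime_s=24 * 3600, downtime_s=0)
aggregate = aggregate_availability(successful=10, total=100)
print(f"time-based: {time_based:.0%}, aggregate: {aggregate:.0%}")
# prints "time-based: 100%, aggregate: 10%"
```

The same day of traffic looks perfect under one model and terrible under the other, which is why the choice of model matters before you set a budget.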

Both approaches are quite self-explanatory. However, we are missing one key SLI here: latency!

Let's take another look at our example. myService is up for the entire day. It receives 100 requests between 11:00 AM and 2:00 PM and serves all of them successfully. Most requests take under 500 ms. However, 10 requests take more than 2000 ms, which is 4x the typical response time. What would your availability be now?

An improved aggregate availability model should then look something like this (an example):

availability = (successful requests whose latency < median response time + 2 standard deviations) / (total requests)

This availability model considers not only myService's success in terms of volume but also its error and latency SLIs. If you get your availability SLO right, you will get your error budgets right!

Service owners should then use this budget to balance the risk of exhausting the availability target against continuously deploying new features or making changes in production.
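A rough sketch of such a latency-aware model follows. It assumes the threshold means "median response time plus 2 standard deviations", which is one reading of the example formula, and it computes that threshold from the same sample for simplicity. The `Request` type and the latency values (400 ms and 2100 ms) are hypothetical stand-ins for the myService example.

```python
from dataclasses import dataclass
from statistics import median, pstdev


@dataclass
class Request:
    ok: bool            # True if the request did not return an error
    latency_ms: float   # observed response time


def latency_aware_availability(requests: list[Request], k: float = 2.0) -> float:
    """Share of requests that succeeded AND stayed under the latency threshold."""
    latencies = [r.latency_ms for r in requests]
    # Assumed threshold: median + k standard deviations of the sample.
    threshold = median(latencies) + k * pstdev(latencies)
    good = sum(1 for r in requests if r.ok and r.latency_ms < threshold)
    return good / len(requests)


# myService: all 100 requests succeed, but 10 of them are very slow.
requests = [Request(True, 400)] * 90 + [Request(True, 2100)] * 10
print(latency_aware_availability(requests))  # 0.9
```

In practice you would likely derive the threshold from a longer baseline rather than from the window being scored, since a burst of slow requests inflates its own standard deviation.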
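One way to turn this balancing act into a number is to track how much of the budget is still unspent. The sketch below is illustrative only: the function name, the traffic figures, and the "freeze when the budget is gone" policy are assumptions, and your own release policy may differ.

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    `slo` is a fraction below 1, e.g. 0.999 for 99.9%.
    """
    allowed_bad = (1 - slo) * total   # requests the SLO lets us fail
    actual_bad = total - good         # requests that actually failed
    return 1 - actual_bad / allowed_bad


# Hypothetical month: 1,000,000 requests, 600 of them bad, SLO of 99.9%.
remaining = budget_remaining(slo=0.999, good=999_400, total=1_000_000)
can_deploy = remaining > 0  # assumed policy: freeze releases once the budget is spent
print(f"budget remaining: {remaining:.0%}, deploy allowed: {can_deploy}")
```

Here 600 of the 1,000 allowed failures are used, so about 40% of the budget is left to spend on risky changes.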

How AppDynamics helps in calculating error budgets

AppDynamics helps you create business transactions, which are core functions delivered by one or more of your services. AppDynamics categorizes each transaction, based on latency and errors, as Error, Normal, Slow, Very Slow, or Stall. The default thresholds that mark a request as Slow, Very Slow, or Stall use standard deviations from an exponential moving average.

Hence, a very good way to measure latency-inclusive aggregate availability is to use the feature called "Transaction Score". The score is available not only for an entire application but also for each core function, i.e. each business transaction (BT). You can sort the BTs by transaction score and identify the functions that are pulling down your availability and eating away your error budget.

The score you receive for all your normal requests will help you set accurate error budgets and see how your services are tracking against them. So honestly, you don't have to look far. Use it to find better availability SLOs and error budgets. Use this credit to reliably push new features and changes to your service in production while balancing the risks.
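To show the idea outside the product, here is a small sketch that treats the share of Normal requests as the availability figure and sorts business transactions by it. It does not call any AppDynamics API and is not the product's exact scoring formula; the BT names and counts are invented and would have to be read off your own dashboards.

```python
# Hypothetical per-BT counts by AppDynamics category (entered by hand).
bt_counts = {
    "checkout": {"Normal": 9_400, "Slow": 300, "Very Slow": 100, "Stall": 50, "Error": 150},
    "login": {"Normal": 9_990, "Slow": 0, "Very Slow": 0, "Stall": 0, "Error": 10},
    "search": {"Normal": 9_870, "Slow": 60, "Very Slow": 25, "Stall": 5, "Error": 40},
}


def normal_share(counts: dict[str, int]) -> float:
    """Share of requests categorized as Normal; 0.0 if there was no traffic."""
    total = sum(counts.values())
    return counts.get("Normal", 0) / total if total else 0.0


# Lowest share first: these BTs are consuming the most error budget.
worst_first = sorted(bt_counts, key=lambda bt: normal_share(bt_counts[bt]))
for bt in worst_first:
    print(f"{bt}: {normal_share(bt_counts[bt]):.1%}")
```

With these made-up numbers, checkout (94.0%) comes first, followed by search (98.7%) and login (99.9%), so checkout is where the budget is being spent.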
