My problem with baselines and moving averages ingesting bad data

Kathryn.Green
Creator

Hi,

I've been struggling for some time with the way baselines seem to work, to the extent that I no longer feel I can trust them to alert us to degraded performance in our systems.  I thought I would describe the issue and get the community's thoughts.  I'm looking for input from folks who are happy with baselines on how they are mitigating the issue I'm experiencing, or confirmation that my thinking on this is correct.

I have proposed what I think could be a fix towards the end.  Apologies if this ends up being a bit of a long read but it feels to me like this is an important issue – baselines are fundamental to AppD alerting and currently I don’t see how they can reliably be used.

To summarise the issue before I go into more detail: it looks to me like AppD baselines, and the moving average used for transaction thresholds, ingest bad data during a period of performance degradation, which renders the baselines unfit for their purpose of representing ‘normal’ performance.  This then impacts any health rules or alerting that make use of these baselines.

Let me provide an example which will hopefully make the issue clear.

A short time ago we had a network outage which resulted in a Major Incident (MI) and significantly increased average response time (ART) for many of our BTs.

[Screenshot: KathrynGreen_0-1726156308057.png]

Because the ART metric baseline uses these abnormal ART values to generate the ongoing baseline it meant that the baseline itself rapidly increased.

[Screenshot: KathrynGreen_1-1726156346912.png]

The outage ART should have registered as multiple SDs above the expected ‘normal’ baseline.  But because the bad data from the outage increased the baseline, other than the very brief spike right at the start the increase in ART barely reached 1 SD above baseline.
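To make the mechanism concrete, here's a toy simulation of a rolling mean/SD baseline ingesting an outage.  This is not AppD's actual baseline algorithm (the window size, sample rate and ART values are all invented for illustration), but it shows how quickly a sustained spike stops looking abnormal once it is fed back into its own baseline:

```python
# Toy model of a rolling mean/SD baseline that ingests its own outage data.
# Window size, sample rate and ART values are invented for illustration.
from statistics import mean, stdev

WINDOW = 60
window = [90.0, 110.0] * (WINDOW // 2)   # 'normal' ART of ~100 ms with mild jitter

for i in range(1, 31):                   # a sustained outage: ART pinned at ~2000 ms
    outage_art = 2000.0
    baseline, sd = mean(window), stdev(window)
    sds_above = (outage_art - baseline) / sd
    if i in (1, 5, 10, 20, 30):
        print(f"outage sample {i:2d}: baseline={baseline:6.0f} ms, "
              f"SD={sd:5.0f} ms, outage is {sds_above:6.1f} SDs above")
    window.pop(0)                        # the abnormal sample is ingested anyway,
    window.append(outage_art)            # inflating both the baseline and the SD
```

In this toy run the outage drops from nearly 190 SDs above baseline on the first sample to barely 1 SD above by sample 30, despite ART never improving.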

Furthermore, the nature of the Weekly Trend – Last 3 Months baseline means that this ‘bad’ baseline will propagate forward.  Looking at the first screenshot above we can clearly see that the baseline is expecting ‘normal’ ART to be significantly elevated every Tuesday morning now.  Presumably this will continue until the original outage spike moves out of the baseline rolling window in 3 months.

This is more clearly shown if we look more closely at the current week so that the chart re-scales without the original ART spike present.

[Screenshot: KathrynGreen_2-1726156346916.png]

As far as the baseline is concerned, a large spike in ART every Tuesday morning is now normal.   This means that less extreme (but still valid) ART degradation will not trigger any health rules that use this baseline.  In fact, it could also generate spurious alerts on healthy performance if we were using an alert based on < baseline SD, as the healthy ART now looks to be massively below the ‘normal’ baseline.
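To spell out the alerting consequence, here is a sketch of what a baseline-relative health rule sees the following Tuesday once the outage is baked into the baseline.  The numbers (baseline, SD, thresholds) are invented purely for illustration:

```python
# What a baseline-relative health rule sees the following Tuesday, once the
# outage samples are baked into the baseline. All numbers are invented.
POLLUTED_BASELINE = 800.0   # ms -- the baseline now 'expects' a Tuesday spike
POLLUTED_SD = 300.0         # ms -- SD inflated by the outage samples
N = 2                       # SDs used by the health rule

def evaluate(art_ms):
    if art_ms > POLLUTED_BASELINE + N * POLLUTED_SD:
        return "violation: above baseline"
    if art_ms < POLLUTED_BASELINE - N * POLLUTED_SD:
        return "violation: below baseline"
    return "healthy"

print(evaluate(400.0))   # 'healthy' -- a genuine 4x degradation goes unnoticed
print(evaluate(100.0))   # 'violation: below baseline' -- normal ART looks abnormal
```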

To my mind this simply can’t be correct behaviour by the baseline.  It clearly no longer represents normal performance which by my understanding is the very purpose of the baselines.

The same problem is demonstrated with the other baseline types, but I'll not include those findings here for the sake of this already long post not becoming a saga.

This issue of ingesting bad data also impacts the Slow/VerySlow/Stalled thresholds and the Transaction Score chart:

[Screenshot: KathrynGreen_3-1726156346923.jpeg]

As can be seen, we had a major network outage which caused an increase in ART for an extended period.  This increase was correctly reflected in the Transaction Score chart for a short while, but as the bad data was ingested and inflated the moving average used for the thresholds, the health of the transactions stopped being orange Very Slow and moved through yellow Slow back to green Normal.  And yet the outage was ongoing, the Major Incident was ongoing, and ART had not improved from its abnormally high, service-impacting value.

These later transactions are most certainly not Normal, by a very long way, and yet AppD believes them to be because the moving average has been polluted by ingesting the outage ART data.  After a short period the moving average used to define a Slow/Very Slow transaction no longer represents normal ART; instead it has decided that the elevated ART caused by the outage is the new normal.  I'd like to think I'm not the only one who finds this undesirable.
Any alerting based on slow-transaction metrics would stop firing and would report normal performance even though the outage was ongoing and service was still being impacted.
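The same toy approach shows the Slow/Very Slow drift.  The threshold definitions below (3 and 4 SDs above a moving average that keeps ingesting every sample) are my assumption for illustration, not AppD's exact defaults, but the decay pattern matches the Transaction Score chart above:

```python
# Toy Slow/Very Slow classification against a moving average/SD that keeps
# ingesting every sample, including the outage ones. The threshold definitions
# (3 and 4 SDs) are assumptions for illustration, not AppD's exact defaults.
from collections import deque
from statistics import mean, stdev

window = deque([90.0, 110.0] * 30, maxlen=60)   # 'normal' ART of ~100 ms

def classify(art_ms):
    avg, sd = mean(window), stdev(window)
    if art_ms > avg + 4 * sd:
        label = "VERY SLOW"
    elif art_ms > avg + 3 * sd:
        label = "SLOW"
    else:
        label = "NORMAL"
    window.append(art_ms)    # the sample is ingested regardless of its label
    return label

# ART is pinned at ~2000 ms for the whole outage, yet the label decays:
labels = [classify(2000.0) for _ in range(40)]
for minute in (1, 5, 10, 20, 40):
    print(f"outage minute {minute:2d}: {labels[minute - 1]}")
```

In this toy run the classification decays from Very Slow back to Normal within roughly ten samples, even though ART never comes down.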

Now it’s not my way to raise a problem without at least trying to provide a potential solution and in this case I have two initial thoughts:

  1. AppD adds the ability to lock the baseline in much the same way as we lock BTs.  So a BT is allowed to build up a baseline until it looks like it matches ‘normal’ behaviour as closely as we’re likely to get.  At this point the baseline is locked and no further data is added to the baseline.  If a service changes and we believe we have a new normal performance then the baseline can be unlocked to ingest the new metrics and update the baseline to the new normal, at which point it can be locked again.

  2. Instead of locking baselines, AppD could perhaps implement a system whereby bad data is not ingested into the baseline.  Perhaps something like: any incoming data point which triggers a health rule (or transaction threshold) is taken as evidence of abnormal performance and is not used to generate the baseline; instead the last known non-triggering data point could be used.  The baseline probably would still increase somewhat during an outage (working on the assumption that a service degrades before failing, so the points immediately prior to the triggering of an alert might already be elevated above normal), but the change would not be as fast or as catastrophic as with the current method of calculating the rolling baseline/moving average.  (A rough sketch of this idea follows below.)
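Here is that rough sketch of option 2: a guard that refuses to fold breaching samples into the rolling window.  The class name, threshold and window size are all made up; this is just the idea in code, not an AppD feature:

```python
# Rough sketch of option 2: refuse to fold 'breaching' samples into the rolling
# window, substituting the last known good value instead. The class, window
# size and threshold are all invented; this is just the idea, not an AppD feature.
from collections import deque
from statistics import mean, stdev

class GuardedBaseline:
    def __init__(self, seed, n_sds=3, maxlen=60):
        self.window = deque(seed, maxlen=maxlen)
        self.n_sds = n_sds
        self.last_good = seed[-1]

    def ingest(self, art_ms):
        """Classify a sample; only non-breaching samples update the baseline."""
        threshold = mean(self.window) + self.n_sds * stdev(self.window)
        breaching = art_ms > threshold
        # Breaching samples are evidence of abnormal performance, so the
        # last good value is ingested in their place.
        self.window.append(self.last_good if breaching else art_ms)
        if not breaching:
            self.last_good = art_ms
        return breaching

baseline = GuardedBaseline([90.0, 110.0] * 30)
alerts = sum(baseline.ingest(2000.0) for _ in range(40))
print(f"{alerts} of 40 outage samples still breach the baseline")
```

Because the outage never pollutes its own baseline, every sample of a sustained outage keeps breaching, rather than the alerting fading out after a few minutes.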

Well, that pretty much wraps it up I think.  If you've made it this far then thanks for your time and I'd really appreciate knowing if other folks are having a similar issue with baselines or have found ways to work around it.

1 REPLY

Ryan.Paredez
Community Manager

Hello @Kathryn.Green,

I was told you should be having a conversation soon with AppDynamics about your questions here, as they have reached out privately.


Thanks,

Ryan, Cisco AppDynamics Community Manager



