09-12-2024 09:05 AM
Hi,
I've been struggling for some time with the way baselines seem to work, to the extent that I feel I can't trust them to alert us to degraded performance in our systems. I thought I'd describe the issue and get the community's thoughts. I'm looking for input from folks who are happy with baselines on how they mitigate the issue I'm describing, or confirmation that my thinking on this is correct.
I have proposed what I think could be a fix towards the end. Apologies if this ends up being a bit of a long read but it feels to me like this is an important issue – baselines are fundamental to AppD alerting and currently I don’t see how they can reliably be used.
To summarise the issue before I go into more detail: it looks to me like AppD baselines, and the moving average used for transaction thresholds, ingest bad data during performance degradation, which renders the baselines unfit for their purpose of representing 'normal' performance. This in turn affects any health rules or alerting that make use of these baselines.
Let me provide an example which will hopefully make the issue clear.
A short time ago we had a network outage which resulted in a Major Incident (MI) and significantly increased average response time (ART) for many of our BTs.
Because the ART metric baseline uses these abnormal ART values to generate the ongoing baseline, the baseline itself rapidly increased.
The outage should have registered as multiple SDs above the expected 'normal' baseline. But because the bad data from the outage inflated the baseline, other than the very brief spike right at the start, the increase in ART barely reached 1 SD above baseline.
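AppD doesn't publish the exact baseline math, but the masking effect I'm describing can be sketched with a toy rolling mean/SD baseline (all numbers, window sizes, and the formula itself are hypothetical, purely for illustration):

```python
import statistics

def sds_above_baseline(sample, history):
    """How many standard deviations `sample` sits above the rolling baseline."""
    mean = statistics.fmean(history)
    sd = statistics.pstdev(history)
    return (sample - mean) / sd

# Normal operation: ART jitters around 100 ms.
history = [95.0, 105.0] * 30          # 60-sample rolling window
outage_art = 1000.0                   # sustained outage response time (ms)

# The first outage sample is a huge anomaly against the clean baseline.
first = sds_above_baseline(outage_art, history)

# But every outage sample is then ingested into the same window...
for _ in range(55):
    history.append(outage_art)
    history.pop(0)

# ...so a later sample of the *same* degraded ART barely registers.
later = sds_above_baseline(outage_art, history)
print(round(first), round(later))     # → 180 0
```

Under this toy model the identical degraded ART goes from ~180 SDs above baseline to well under 1 SD once the outage data has been ingested, which matches the behaviour I saw on the charts.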
Furthermore, the nature of the Weekly Trend – Last 3 Months baseline means this 'bad' baseline propagates forward. Looking at the first screenshot above, we can clearly see that the baseline now expects 'normal' ART to be significantly elevated every Tuesday morning. Presumably this will continue until the original outage spike ages out of the three-month rolling window.
This is more clearly shown if we look more closely at the current week so that the chart re-scales without the original ART spike present.
As far as the baseline is concerned, a large spike in ART every Tuesday morning is now normal. This means that less extreme (but still valid) ART degradation will not trigger any health rules that use this baseline. Worse, it could also generate spurious alerts on healthy performance if we were alerting on ART falling a given number of SDs below baseline, as the healthy ART now looks to be massively below the 'normal' baseline.
To my mind this simply can’t be correct behaviour by the baseline. It clearly no longer represents normal performance which by my understanding is the very purpose of the baselines.
The same problem is demonstrated if we use other baselines but I’ll not include my findings here for the sake of this already long post not becoming a saga.
This issue of ingesting bad data also impacts the Slow/VerySlow/Stalled thresholds and the Transaction Score chart:
As can be seen, we had a major network outage which caused an increase in ART for an extended period. The increase was correctly reflected in the Transaction Score chart for a short time, but as the bad data was ingested and pushed up the moving average used for the thresholds, the health of the transactions moved from orange Very Slow, through yellow Slow, back to green Normal, even though the outage was ongoing, the Major Incident was ongoing, and ART had not improved from its abnormally high, service-impacting level.
These later transactions are most certainly not Normal, by a very long way, yet AppD believes they are because the moving average has been polluted by the outage ART data. After a short period, the moving average used to define a Slow/Very Slow transaction no longer represents normal ART; instead, the elevated ART caused by the outage has become the new normal. I'd like to think I'm not the only one who finds this undesirable.
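The same drift can be sketched for the Slow/Very Slow thresholds. I don't know AppD's exact formula, so this assumes a simple exponential moving average and a Slow threshold of 3× that average (both assumptions, chosen only to show the shape of the problem):

```python
def classify(art, avg, slow_mult=3.0):
    """Toy transaction scoring: Slow when ART exceeds slow_mult x moving average."""
    return "SLOW" if art > slow_mult * avg else "NORMAL"

avg = 100.0          # moving average ART (ms) before the outage
alpha = 0.1          # EWMA smoothing factor (assumed)
outage_art = 1000.0  # sustained outage ART (ms)

labels = []
for minute in range(60):
    labels.append(classify(outage_art, avg))
    # The degraded sample is fed straight back into the average...
    avg = alpha * outage_art + (1 - alpha) * avg

# ...so the *unchanged* outage ART is reclassified within a few minutes.
print(labels[0], labels[-1], round(avg))  # → SLOW NORMAL 998
```

In this sketch the very same 1000 ms response time flips from SLOW to NORMAL after only a handful of samples, because the threshold chases the polluted average rather than holding to pre-incident 'normal'.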
Any alerting based on using slow transaction metrics would stop alerting and would report normal performance even though the outage was ongoing with service still being impacted.
Now it’s not my way to raise a problem without at least trying to provide a potential solution and in this case I have two initial thoughts:
Well, that pretty much wraps it up I think. If you've made it this far then thanks for your time and I'd really appreciate knowing if other folks are having a similar issue with baselines or have found ways to work around it.
09-18-2024 09:20 AM
Hello @Kathryn.Green,
I was told you should be having a conversation soon with AppDynamics about your questions here, as they have reached us privately.
Thanks,
Ryan, Cisco AppDynamics Community Manager
Found something helpful? Click the Accept as Solution button to help others find answers faster.
Liked something? Click the Thumbs Up button.
Check out Observability in Action – new deep dive videos weekly in the Knowledge Base.