The troubleshooter's mind: an introduction

Why do we troubleshoot?
Troubleshooting an application means troubleshooting the business
 

 

Applications are central to our businesses — or they are the actual business. We rely heavily on them to perform user-requested actions and downstream business processing. Any application-oriented performance problems inevitably become business problems.  

 

AppDynamics gives you a holistic view of an application’s performance, along with the tools to narrow a problem down to the specific scope that is experiencing service degradation.

 

To narrow a problem down to its root cause, follow these AppDynamics best practices:

  • Drill down in TIME, to pinpoint the exact time of the issue.
  • Drill down in SCOPE, to understand the actual scope of the issue.

 

Because each application fits your business needs in its own way, only you can determine a problem’s context. Proactively and continuously adjust your AppDynamics application configuration to the needs of your business.

 

These rules will result in more streamlined and successful troubleshooting sessions, so you can resolve issues before they impact your business severely. 

 

No matter how technical you are, you are troubleshooting your business.

 

 

Using the See, Act, Know troubleshooting procedure

Because any troubleshooting process may be complicated, it’s essential to use a streamlined approach. The troubleshooting procedure outlined here follows AppDynamics best practices and consists of three parts: 

  1. See
  2. Act
  3. Know

This successful troubleshooting strategy builds on your “configured-to-measure” Application Performance Monitoring foundation, as well as a broad understanding of the application’s environment. 

SEE
the root cause

What is it? Triaging and troubleshooting application problems, so you can find and describe the right problems within the right scope and timeframe. 

 

Since Dashboards present the status of the application environment visually and in real time, they are the starting point of the troubleshooting process. When an emerging issue turns a Dashboard element yellow or red, that is the trigger for analysis.

 

When a problem is revealed, many tools in the AppDynamics platform enable you to determine its root cause. 

 

NOTE: See the detailed process for successful troubleshooting below, under the Seeing the problem: Troubleshooting Principles section.

 

ACT
resolve early

What is it? The target for application performance is to eliminate problems before they can impact customers. 

 

To that end, the Alerting Model informs teams early about any abnormal behavior, at any layer of the application. As a result, the relevant teams can take remedial action immediately. In the best cases, AppDynamics triggers fully automated actions that prevent the application from running into critical issues.

 

You can apply these principles in a testing environment, to prevent problems before they’re promoted to production. 

 

KNOW
the impact

What is it? The last step of the troubleshooting session should determine the event’s impact on the organization.

 

The time between a performance bottleneck’s emergence and its resolution is also the time during which the business was disrupted. An affected application cannot offer services to the end user, so it adds to lost business opportunities. In the long run, poor user experience leads to loss of customers and loss of revenue.

 

The AppDynamics platform’s capability to determine impact is called Business iQ, and is provided through the set of AppDynamics products that correlate your application performance with business metrics (revenue, active users, product type, payment ID, etc.) in any given timeframe.
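As a rough illustration of the arithmetic behind "knowing the impact", the sketch below multiplies the duration of an incident by the gap between the normal and the observed revenue rate. It does not use Business iQ or any AppDynamics API. The function name `estimate_impact`, the timestamps, and the per-minute revenue figures are all hypothetical.

```python
from datetime import datetime


def estimate_impact(start, end, baseline_per_minute, actual_per_minute):
    """Estimate revenue lost between the emergence and the resolution of an issue.

    All inputs are assumptions supplied by you: the incident window and the
    normal versus observed revenue per minute during that window.
    """
    minutes = (end - start).total_seconds() / 60
    expected = baseline_per_minute * minutes
    actual = actual_per_minute * minutes
    return {"minutes": minutes, "lost_revenue": expected - actual}


# Hypothetical incident: 45 minutes during which revenue dropped
# from a normal 200.0 per minute to 50.0 per minute.
impact = estimate_impact(
    start=datetime(2021, 4, 12, 10, 0),
    end=datetime(2021, 4, 12, 10, 45),
    baseline_per_minute=200.0,
    actual_per_minute=50.0,
)
print(impact)  # {'minutes': 45.0, 'lost_revenue': 6750.0}
```

The estimate is only as good as the start time you feed it, which is one more reason to locate the true beginning of the issue rather than the moment it was first reported.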

 


Seeing the problem: troubleshooting principles

There are two main principles of troubleshooting: 

 

Drill down in time and scope

Well-defined Health Rules reveal where an issue appears, giving you a picture of the current application status. When a Health Rule is violated, AppDynamics informs you either by sending a notification or by showing the problem in a Dashboard. This is one of many possible starting points. Others include an end-user complaint or a ticket from another team.

 

Once your application experiences problems, what should you do? A notification or Dashboard visualization as discussed above gives you the issue’s rough scope and starting time. 

 

The next step is to pin down the exact time at which the problem arose in the application, and then to identify the specific component that was affected at the very beginning of the issue, which points you to the root cause. You accomplish this by iteratively following all impacted components and metrics in a given timeframe, which ultimately leads you to the scope and time of the problem.

 

[Figure: Troubleshooting Drill Down]

 

Time

When you drill down in time, the best practice is to locate the beginning of the issue. That beginning is usually visible as a change of behavior in different metrics. Since a problem becomes harder to read the longer it exists, it’s important to start at the beginning.

Starting there also lets you compare what’s happening now with what was normal in the past by:

  • using baselines,
  • shifting the time to a day or week before the problem happened.
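The comparison described above can be sketched in a few lines. This is a minimal illustration and not how AppDynamics computes its baselines. It assumes you have exported two series of per-minute average response times, one from a normal period (for example the same window a week earlier) and one from today. The function name `find_onset`, the three-sigma band, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def find_onset(baseline, current, n_sigma=3.0, sustain=3):
    """Return the index in `current` where the metric first rises above the
    baseline band and stays there for `sustain` consecutive samples.

    Returns None if no sustained deviation is found.
    """
    upper = mean(baseline) + n_sigma * stdev(baseline)
    run = 0
    for i, value in enumerate(current):
        # Count consecutive samples above the band; reset on any normal sample.
        run = run + 1 if value > upper else 0
        if run == sustain:
            return i - sustain + 1
    return None


# Hypothetical per-minute average response times in milliseconds.
last_week = [100, 104, 98, 101, 97, 103, 99, 102, 100, 96]
today = [101, 99, 140, 102, 100, 180, 210, 250, 260, 300]

onset = find_onset(last_week, today)
print(onset)  # 5
```

Requiring several consecutive samples above the band skips the isolated spike at index 2 and reports index 5, where the sustained degradation actually begins.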

 

Scope

A scope is a representative component, function, metric, group of components, configuration, and so on.

Because problems often don't show up in a single scope, it’s important to understand all involved or impacted scopes that might be related to the problem, and to triage down to the one scope that is affecting all the others.
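One way to picture triaging across scopes is to order the impacted scopes by the moment each one first left its normal range. The sketch below does exactly that on made-up data. The scope names, thresholds, and helper functions are hypothetical, and the earliest deviation is only a candidate for the root cause, not proof of it.

```python
def first_breach(series, threshold):
    """Return the index of the first sample above `threshold`, or None."""
    for i, value in enumerate(series):
        if value > threshold:
            return i
    return None


def rank_scopes_by_onset(metrics, thresholds):
    """Return (scope, onset_index) pairs, earliest onset first.

    Scopes that never breach their threshold are left out.
    """
    onsets = {}
    for scope, series in metrics.items():
        index = first_breach(series, thresholds[scope])
        if index is not None:
            onsets[scope] = index
    return sorted(onsets.items(), key=lambda item: item[1])


# Hypothetical per-minute response times (ms) for four scopes.
metrics = {
    "web-tier": [120, 118, 125, 130, 400, 450],
    "order-service": [80, 82, 85, 300, 320, 350],
    "database": [10, 11, 90, 95, 100, 110],
    "cache": [2, 2, 3, 2, 2, 3],
}
thresholds = {"web-tier": 200, "order-service": 150, "database": 50, "cache": 10}

ranking = rank_scopes_by_onset(metrics, thresholds)
print(ranking)  # [('database', 2), ('order-service', 3), ('web-tier', 4)]
```

In this example the database deviates first and the tiers above it follow, which makes it the scope to investigate before the others.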

 

A Health Rule event shows you a time, a scope, and an issue. Whether that event is the root cause of the issue is something you can only find out by triaging the problem as described above.

 

Follow the red 

Iteratively triage the Average Response Time of transactions, Events, Errors, and resource issues, taking notes at every stage of the troubleshooting session. By analyzing the poorly performing components and their performance indicators in a given timeframe, you’ll be able to understand the affected scope and gather relevant information for further remedial actions.
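The iterative walk can be pictured as following unhealthy components downstream until nothing further down is degraded. The sketch below models that on a hand-written call map. It is not an AppDynamics API: the component names, the `calls` and `status` dictionaries, and the function `follow_the_red` are all hypothetical.

```python
def follow_the_red(start, calls, status):
    """Walk downstream from `start`, stepping into an unhealthy component at
    each hop (critical preferred over warning), and return the path taken."""
    path = [start]
    visited = {start}
    current = start
    while True:
        unhealthy = [
            component
            for component in calls.get(current, [])
            if status.get(component) in ("warning", "critical")
            and component not in visited
        ]
        if not unhealthy:
            # Nothing degraded further downstream: the path ends here.
            return path
        # Stable sort: critical components first, warnings after.
        unhealthy.sort(key=lambda c: 0 if status[c] == "critical" else 1)
        current = unhealthy[0]
        visited.add(current)
        path.append(current)


# Hypothetical call map and health states.
calls = {
    "eum-page": ["web-tier"],
    "web-tier": ["order-service", "auth-service"],
    "order-service": ["database", "payment-gateway"],
}
status = {
    "eum-page": "critical",
    "web-tier": "critical",
    "order-service": "critical",
    "auth-service": "normal",
    "database": "warning",
    "payment-gateway": "critical",
}

path = follow_the_red("eum-page", calls, status)
print(path)  # ['eum-page', 'web-tier', 'order-service', 'payment-gateway']
```

The returned path doubles as the notes the text recommends taking: it records each component you passed through on the way to the deepest degraded one. A real session would also revisit the branches not taken, such as the database in warning state here.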

 

 

 

 

When you see that an EUM Page is performing poorly, keep in mind that the root cause can lie in any component of your application environment.

  • Start troubleshooting in the End User Monitoring section. Analyze all available metrics, and try to link Browser Snapshots with the correlated Transaction Snapshot.

  • Drill down to the back end and analyze metrics in the given timeframe, not only for the application itself but also for the network, the infrastructure, and other entities.

This should give you an understanding of the issue: the scope of the affected components and the time of the problem.

 

Understanding the context

At some point, you may need more data (for example, by adding more agents to expand visibility), or you may need to collaborate with other teams (such as Infrastructure, Development, or Business) who know the subject matter in detail and can explain why some of the metrics you discovered in AppDynamics showed poor performance.

 

A troubleshooter's skill lies not only in troubleshooting and understanding the details of a given scenario, but also in knowing their limitations and being able to escalate problems outside their expertise.

 

Consequently, you must understand the application environment. AppDynamics makes detailed measurements and shows baselined performance status, but it does not explain the full context of the problem. The context may be technical, such as an understanding of your database characteristics. But it can also be non-technical. For instance, only you can determine which Business Transactions are business-critical for your organization. A deeper understanding of the application environment and the application “context” is therefore essential.


 

AppDynamics troubleshooting tools, by issue

AppDynamics provides various tools for troubleshooting different application issues. Official documentation includes several pages dedicated to troubleshooting different application components. 

 

Imagine what you might do in each of the situations below. The documentation resources that apply are listed under each one.

Your service desk calls, telling you that there are complaints about slow logins to your application. What do you do?

  • Slow Response Times
  • .NET Slow Response Times

And by the way, have you checked Analytics to see how the issue impacted your business?

An alert reported a higher than usual error rate for certain Business Transactions, and you simultaneously spotted a sudden increase in exceptions in the application dashboard. How do you analyze it?

  • Errors and Exceptions

While analyzing a Business Transaction, you discovered a lot of issues between Tiers and towards the shared backend. None of the application or system information points to a specific root cause, but you can see that the response time is inexplicably high. Instead of escalating to the network team, can you use Network Visibility yourself to troubleshoot further?

  • Network Issues

You notice some bottlenecks in the Java Virtual Machines, and your team of developers asks for more detail about the problems you discovered in AppDynamics.

  • Java Resource Issues
  • Java Memory Leaks
  • Java Memory Thrash
  • Code Deadlocks for Java

The application that you monitor makes use of multithreading. Do you know how to discover potential problems in the AppD Controller?

  • Thread Contention
  • Event Loop Blocking in Node.js

 

Automated troubleshooting with AIOps

One of the tools provided by the AppDynamics SaaS platform is Anomaly Detection. AIOps supports AppDynamics SaaS customers with automatic troubleshooting, which replaces the manual process described in the troubleshooting sections above.

 

Machine learning-based algorithms use the data in the AppDynamics platform (such as transaction metrics, components, their relations, and events) to correlate abnormal behaviors and to inform AppDynamics users about possible issues with a given Business Transaction in the application.

Version history: revision 4 of 4, last updated 04-12-2021 12:05 PM