Health rules are based on one or more of an application's performance metrics and let you specify the parameters for what is considered “normal” for your application. When the health rule is violated, an event is raised that triggers a health policy. Health policies can also be triggered by other types of events (e.g. errors). In sum, health rules work on metrics, policies work on events, and the evaluation of health rules triggers events on which policies take action. For a diagram of how this functions, see Alert and Respond. Below are the most common reasons and steps to take if your health policies aren’t triggering.
1. Confirm that the health rule violation is being triggered
If your policies are triggered by health rule violation events and your health rules aren’t configured properly, the violation events may never be raised.
- Review your health rule configuration - Ensure you followed the workflows outlined in Configure Health Rules when creating your health rule.
- Check your Health Rule Evaluation configuration - This is the “Use the last <x> minutes of data when evaluating the health rule” field and the default is 30 minutes. This is the window of time in which the violation is detected, rather than a range over which the violation must persist. See the "Health Rule Evaluation Window" section in this document for additional information.
- Check your Wait Time after Violation configuration - This value is used to avoid alert storms after a violation has occurred. When a violation is detected, the health rule is not re-evaluated until after this wait time has expired. If the erroneous conditions correct themselves before that window of time expires, you won’t receive any additional alerts. See the "Health Rule Wait Time After Violation" section in this document for details.
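To make the interaction between the evaluation window and the wait time concrete, here is a minimal simulation. It is a simplified model, not the Controller's actual evaluation logic: it assumes one metric sample per minute, a condition of "average over the window exceeds a threshold", and hypothetical names (`simulate_alerts`, `samples`, `threshold`). Only the 30-minute default for the evaluation window comes from this article.

```python
from statistics import mean


def simulate_alerts(samples, threshold, wait_minutes, window_minutes=30):
    """Return the minute indexes at which a simplified health rule raises a violation.

    samples        -- one metric value per minute (assumption for this sketch)
    threshold      -- the rule violates when the window average exceeds this value
    wait_minutes   -- minutes during which the rule is not re-evaluated after a violation
    window_minutes -- how many of the most recent minutes of data are evaluated
    """
    alerts = []
    next_eval = 0  # earliest minute at which the rule may be evaluated again
    for minute in range(len(samples)):
        if minute < next_eval:
            continue  # still inside the wait time after the previous violation
        window = samples[max(0, minute - window_minutes + 1): minute + 1]
        if mean(window) > threshold:
            alerts.append(minute)
            next_eval = minute + wait_minutes
    return alerts


# A short spike that corrects itself before the wait time expires: one alert only.
short_spike = [100] * 5 + [900] * 3 + [100] * 52
print(simulate_alerts(short_spike, threshold=500, wait_minutes=30, window_minutes=5))

# A condition that persists: a new alert each time the wait time expires.
persistent = [900] * 60
print(simulate_alerts(persistent, threshold=500, wait_minutes=30, window_minutes=5))
```

In this model the short spike produces a single alert, because the condition has cleared by the time the rule is evaluated again. A missing follow-up alert is therefore not necessarily a sign of a misconfigured rule.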
2. Confirm that the event that triggers the policy is occurring
Policies can be fired by health rule violation events or events that are automatically raised under certain conditions, such as errors and slow transactions. Events can also be manually registered or created programmatically using the REST API. For a list of events, see Events Reference. If you are on an older version of AppDynamics, refer to the events list for your specific version. If you do not see the expected event in the Controller UI, evaluate the following:
- Slow transactions - Check the defined threshold and the baseline. A baseline can't be established if it hasn't had enough time to be calculated or if there isn't enough load on the application. Without a baseline, deviations from the baseline can’t be monitored.
- Errors - Check if the agent logs contain the expected error. If the error is found but the event didn’t trigger, contact AppDynamics Support.
- Code Problems - Resource Pool Limit Reached is a predefined event that can trigger a policy. Policies for this event are not configured by default, so you would need to create one. The event is triggered by 80% usage of EJB-related resource pools (e.g. connection pools and thread pools), but that threshold can be configured using the jms-metric-threshold-percentage node property. Changing the value requires a restart of the agent JVM.
- Type: Integer
- Default: 80%
- Platform: Java
- Create discrete policies for health rule events - Rather than create one all-encompassing policy, create more specific ones so you can better understand the scope of the policy configuration. Before creating a policy, read: Configuring Policies
- For more information, see: Policy Triggers
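If you prefer to confirm from a script that the expected events were raised, the Controller's REST API can list events for an application. The sketch below only builds the request URL. The endpoint path and parameter names follow the "retrieve event data" call as commonly documented, but treat them as assumptions and verify them against the REST API documentation for your Controller version. The host, application name, and event type names are placeholders; take the real event type names from Events Reference.

```python
from urllib.parse import quote, urlencode


def build_events_url(controller_url, application, event_types, duration_mins=30,
                     severities=("INFO", "WARN", "ERROR")):
    """Build a URL that lists recent events of the given types for one application.

    controller_url -- base URL of the Controller (placeholder in the example below)
    application    -- application name or ID as shown in the Controller UI
    event_types    -- event type names to look for (check Events Reference)
    duration_mins  -- how far back to look, in minutes
    """
    params = {
        "time-range-type": "BEFORE_NOW",
        "duration-in-mins": duration_mins,
        "event-types": ",".join(event_types),
        "severities": ",".join(severities),
        "output": "JSON",
    }
    path = "/controller/rest/applications/" + quote(str(application), safe="") + "/events"
    # Keep commas readable in the query string; everything else is percent-encoded.
    return controller_url.rstrip("/") + path + "?" + urlencode(params, safe=",")


# Placeholder host, application, and event types for illustration only.
print(build_events_url("http://controller.example.com:8090", "My App",
                       ["POLICY_OPEN_CRITICAL", "APPLICATION_ERROR"]))
```

Request the resulting URL with the credentials of a Controller user (for example with `curl --user`). An empty result for the time range means the event that should trigger the policy was not raised.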
3. If the event exists, check the following:
- Evaluated object - When the health rule is defined, you select the objects that you want monitored. If you create a policy based on the health rule, you can also select the objects that the policy applies to. If there is a mismatch between the health rule definition and the policy definition, the policy will not work as expected. If the event was not on the expected object, edit the health rule and/or policy configuration.
- Check the expected policy action - You can assign the actions that are taken when a policy is triggered. Verify the action you expected:
- If the expected action is to send an email, check if the email server is working.
- If the expected action is to run a remediation Machine Agent script, check the following:
- The user running the script has execute permissions
- The dependent files or directory exist
- The Machine Agent is running and reachable by the Controller
- If the expected action is to take a thread dump, check if the agent node is running.
- If the expected action is to run a Custom Action, validate your configuration.
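The remediation script checks above (execute permissions and dependent files) can be automated with a small preflight script. This is a minimal sketch using only the Python standard library. The function name `check_remediation_script` and the example paths are hypothetical. Run it as the same operating system user that runs the Machine Agent, otherwise the permission check is not meaningful.

```python
import os


def check_remediation_script(script_path, dependencies=()):
    """Return a list of problems that would keep a remediation script from running.

    script_path  -- path to the script the policy action executes
    dependencies -- files or directories the script needs at run time
    """
    problems = []
    if not os.path.isfile(script_path):
        problems.append(f"script not found: {script_path}")
    elif not os.access(script_path, os.X_OK):
        # os.access checks the permissions of the user running this check.
        problems.append(f"current user cannot execute: {script_path}")
    for path in dependencies:
        if not os.path.exists(path):
            problems.append(f"missing dependency: {path}")
    return problems


if __name__ == "__main__":
    # Hypothetical paths for illustration; replace them with your own.
    for problem in check_remediation_script("/opt/scripts/restart_service.sh",
                                            ["/opt/scripts/conf"]):
        print(problem)
```

An empty list means the script passed these two checks. It does not confirm that the Machine Agent itself is running or reachable by the Controller.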
4. If the event doesn’t exist, check the following:
- Evaluation time - Metrics that are being evaluated against a baseline need time and load to build up the baseline. If the timeframe or the load is insufficient, evaluation against the baseline can't occur.
- Confirm metrics exist - When metrics are discovered, they are registered with the Controller. If something happens to the entities that identify the source of specific metrics (e.g., the application, tier, or node), the health rule evaluation will fail and there won’t be a rule violation to trigger the associated policy.
- Symptoms (besides no event or "alert") include log entries in Controller logs similar to the following:
[#|SEVERE|glassfish3.1.2|com.appdynamics.RULES.PROCESSING|_ThreadID=75;_ThreadName=Thread-6;|An Error occured while evaluating Policy com.singularity.ee.controller.api.exceptions.ObjectNotFoundException: Metric not found: <metric_id> at com.singularity.ee.controller.beans.manage.policies.PolicyProcessorBean$SingleThreadedRuleProcessorBean.evaluateLeafCondition(PolicyProcessorBean.java:1895)
- From the command line, change to the Controller's bin directory and use the appropriate script to log in to the Controller database:
- For Windows:
controller.bat login-db
- For Linux:
sh controller.sh login-db
- Run the following query on the Controller database:
mysql> use controller;
mysql> select id, name, application_id from metric where id in (<metric_IDs from controller log errors>);
- The query returns an empty set (no results) if the metrics don't exist. If a tier, node, or specific metric used in a health rule has been deleted, the health rule's behavior is unpredictable. The metric must exist or the health rule can’t operate properly.
- If an incorrect metric was used, select the correct one and re-apply load on the application.
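When the Controller log contains many "Metric not found" entries, collecting the IDs by hand is tedious. The following sketch extracts them and builds the query shown above. It assumes the metric ID in the log message is a plain integer, which the log excerpt in this article does not confirm, and the function names are hypothetical.

```python
import re

# Assumes the ID that follows "Metric not found:" is numeric.
METRIC_NOT_FOUND = re.compile(r"Metric not found: (\d+)")


def missing_metric_ids(log_lines):
    """Return the sorted, de-duplicated metric IDs reported as not found."""
    ids = set()
    for line in log_lines:
        ids.update(int(match) for match in METRIC_NOT_FOUND.findall(line))
    return sorted(ids)


def build_metric_query(metric_ids):
    """Build the metric lookup query for the Controller database."""
    if not metric_ids:
        raise ValueError("no metric IDs to look up")
    id_list = ", ".join(str(metric_id) for metric_id in metric_ids)
    return f"select id, name, application_id from metric where id in ({id_list});"


if __name__ == "__main__":
    import sys

    # Usage: python find_missing_metrics.py <path to a Controller log file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        print(build_metric_query(missing_metric_ids(log_file)))
```

Paste the printed statement at the mysql> prompt after running `use controller;`. IDs that are absent from the result set belong to metrics that no longer exist.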