This article describes a known socket-handling issue in the third-party library known as ZMQ (ZeroMQ) and outlines the steps needed to determine whether that issue is occurring.
The AppDynamics (Apache) Agent's current design implements a communication channel between the Agent (which runs within the Apache worker threads) and a Proxy task (a self-contained JVM application). The Proxy task is responsible for batching the data and sending it to the Controller. The Agent communicates with the Proxy via socket files (by default located in the
.../logs/appd-sdk directory). Every worker thread consumes at least two sockets. AppDynamics uses the ZeroMQ socket library as the underlying mechanism for these connections.
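The per-worker socket usage can be sanity-checked with a short sketch like the following. The `APPD_SDK_DIR` path is an assumption, as is the idea of counting directory entries; substitute your actual `.../logs/appd-sdk` location.

```shell
# Sketch: check the Agent/Proxy socket directory against the expected count.
# APPD_SDK_DIR is an assumed example path -- use your installation's path.
APPD_SDK_DIR="${APPD_SDK_DIR:-/opt/appdynamics/logs/appd-sdk}"

# Count the entries in the appd-sdk directory (the Agent's socket files live here).
count_sdk_files() {
    ls -1 "$1" 2>/dev/null | wc -l
}

# With N worker threads, at least 2*N sockets should be in use.
min_expected_sockets() {
    echo $(( $1 * 2 ))
}
```

For example, with 2 worker threads you would expect `count_sdk_files "$APPD_SDK_DIR"` to report at least `min_expected_sockets 2`, i.e. 4 entries.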
A known problem exists in the version of ZMQ that we use, whereby a crash can (infrequently) occur when closing a socket. This causes the Apache worker thread to exit prematurely. The in-flight HTTP request is lost, and the main Apache control thread spawns a new worker thread to replace the one that died. This behavior is usually seen on very heavily loaded systems (CPU usage > 50%, RAM usage > 50%) and appears to be mitigated by one or more of the following:
To determine whether the issue is occurring, first make sure that the ulimit for the Apache worker userid (usually "httpd" or "apache2") has its core file size set high enough to capture a core file, and that its open-file-descriptors value is greater than 10,000. You will also need the gdb debugger installed, and you must be running a symbolicated build of the Agent, as provided by AppDynamics technical support. When a crash occurs, open the core file with gdb, produce a stack trace for every thread, and search the output for the term "epoll.c". If any thread ended abnormally inside that routine, it is very likely that the crash was caused by the ZMQ bug.
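The triage steps above can be sketched as follows. The httpd binary and core file paths are examples only; substitute the actual locations on your system.

```shell
# 1. Confirm core dumps can be written and the fd limit is high enough.
ulimit -c            # should not be 0
ulimit -n            # should be greater than 10000

# 2. Produce a full stack trace of every thread from the core file.
#    (Example paths -- use your actual httpd binary and core file.)
# gdb /usr/sbin/httpd /tmp/core.httpd.12345 -batch \
#     -ex "thread apply all bt" > backtrace.txt

# 3. Check whether any thread died inside ZeroMQ's epoll code.
has_epoll_crash() {
    grep -q "epoll\.c" "$1"
}
```

If `has_epoll_crash backtrace.txt` succeeds, attach the stack trace to your support ticket as described below.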
If you believe this is the case, please submit an APM-Apache support request and include the stack trace output in the ticket, along with the ulimit settings (
/etc/security/limits.conf) for the Apache worker userid.
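The limit settings requested above can be gathered with a sketch like the following; the "apache" userid and the file contents are examples, so substitute your system's worker userid and limits file.

```shell
# Capture the effective ulimits for the worker userid (example: "apache").
# su -s /bin/sh apache -c 'ulimit -a' > ulimits.txt

# Pull any explicit entries for that userid out of the limits file.
# Note: this only matches exact-userid lines, not "@group" or "*" entries.
limits_for_user() {
    # $1 = userid, $2 = limits file (e.g. /etc/security/limits.conf)
    awk -v u="$1" '$1 == u' "$2"
}
```

Attaching both the effective ulimits and the matching limits-file entries lets support confirm whether the core-size and open-file settings were actually in force for the worker userid.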
Under typical operating conditions, the browser resends the HTTP request. The net effect of this issue is a small increase in Apache overhead and some reduction in web server throughput, plus any additional disk or CPU consumption from writing core files.