Cisco AppDynamics Community

Georgiy.Chigrichenko · 11-23-2020

What are the recommended considerations and steps for patching nodes in Events Service?

This article provides recommendations on how to safely patch an Events Service node. It includes an example in which you must effectively stop a node for an extended period of time, and then return it to the cluster using either old index information or a fresh, clean node.

This article references the official Elasticsearch guidelines for rolling upgrades. Step 2 of those guidelines asks the administrator to stop non-essential indexing while the node is being stopped, in order to shorten recovery time. However, this step is not recommended in a heavily active production cluster.

Additionally, from the Events Service index management API perspective, setting cluster.routing.allocation.enable to "none" may lead to unintended consequences. For example, if time-consuming index creation and management tasks occur between the "none" and "all" settings, indices may fail to be created.

What should I consider before patching nodes in the Events Service?

When patching nodes in Events Service, consider the following practical limitations:

Are the upgrade time windows allowed by policy and by the application sufficient in practice?
If the cluster is actively ingesting production monitoring data, you may not be able to stop indexing.
If a node is brought back with stale metadata and stale data, Elasticsearch synchronization errors (split-brain) can occur.
Shard rebalancing may affect performance on active nodes that are handling ingestion and search at the same time.
A specific node may not return to the cluster if the patch process fails.
Changing the following settings may cause a split-brain scenario, according to test results on recovery speed and operational safety that used combinations of disabling and enabling them:

"cluster.routing.allocation.enable": "primaries" | "none" | "all"
"cluster.routing.rebalance.enable": "primaries" | "none" | "all"
"indices.recovery.max_bytes_per_sec": "1000mb"
"cluster.routing.allocation.node_initial_primaries_recoveries": 1-10
"cluster.routing.allocation.cluster_concurrent_rebalance": 2-8
"cluster.routing.allocation.node_concurrent_recoveries": 2-8
"indices.recovery.concurrent_streams": 1-6
"cluster.routing.allocation.exclude._ip"

What is the recommended process for rotating nodes?

The AppDynamics Analytics team has adopted and recommends the following practice when rotating nodes in or out.

Considerations

After identifying the nodes to replace or upgrade in-place, consider the following, one node at a time:

If you remove more than one node from the cluster, you must temporarily remove the shard allocation restriction (index.routing.allocation.total_shards_per_node) on all indices. The Events Service default for this setting is 3.

Note: Do not leave the Events Service without shard allocation restrictions for an extended period of time. Please re-apply any shard restrictions immediately upon completing the patch or upgrade.
```
curl -XPUT 'localhost:9200/*/_settings' -d '
{
    "index": {
        "index.routing.allocation.total_shards_per_node" : -1
    }
}'
```

(Optional) You can increase the rebalancing and recovery speed:

curl -XPUT localhost:9200/_cluster/settings -d '
{
    "transient": {
        "indices.recovery.max_bytes_per_sec": "1000mb",
        "cluster.routing.allocation.node_initial_primaries_recoveries": 1,
        "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
        "cluster.routing.allocation.node_concurrent_recoveries": 2,
        "indices.recovery.concurrent_streams": 6
    }
}'

Steps for patching, upgrading, or removing nodes

For each node in the cluster to patch, upgrade, or remove:

Retrieve the IP address of the node:

curl -s 'http://localhost:9200/_cat/nodes?v'

Exclude the single node from the cluster. Given the volume of data stored on a single node, this may take a significant amount of time to complete.

curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
  "transient" : {
      "cluster.routing.allocation.exclude._ip" : "W.X.Y.Z"
   }
}'
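
When the exclusion is applied to several nodes in turn, it can help to build the request body from a variable instead of editing the JSON by hand. The following is a minimal sketch; the function name exclude_ip_payload is hypothetical and not part of Events Service or Elasticsearch.

```shell
# Hypothetical helper: print the cluster-settings body that excludes one IP.
# Passing an empty string produces the body that clears the exclusion again.
exclude_ip_payload() {
  printf '{ "transient": { "cluster.routing.allocation.exclude._ip": "%s" } }\n' "$1"
}

# Example use (replace W.X.Y.Z with the address of the node to exclude):
#   curl -XPUT localhost:9200/_cluster/settings \
#     -H 'Content-Type: application/json' \
#     -d "$(exclude_ip_payload W.X.Y.Z)"
```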

Wait for the process to complete and then verify that the excluded node has 0 shards.
```
curl -s 'http://localhost:9200/_cat/allocation?v'
```
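
To check the shard count for one node without reading the whole table, the allocation output can be filtered. This is a sketch that assumes the output is requested with only the shards and ip columns (h=shards,ip); the function name shards_on_node is hypothetical.

```shell
# Hypothetical helper: print the shard count reported for one node IP.
# Reads "shards ip" rows on stdin and exits non-zero if the IP is not listed.
shards_on_node() {
  awk -v ip="$1" '
    $2 == ip { print $1; found = 1 }
    END { if (!found) exit 1 }
  '
}

# Example use (replace W.X.Y.Z with the excluded node's address):
#   curl -s 'http://localhost:9200/_cat/allocation?h=shards,ip' | shards_on_node W.X.Y.Z
# The node is safe to stop once the printed value is 0.
```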

From the Platform Admin or on the node itself, stop the node:

<$PLATFORM_PATH>/product/events-service/processor/bin/events-service.sh stop

Patch or replace the node.

Remove the temporary node exclusion:

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient" : {
      "cluster.routing.allocation.exclude._ip" : ""
   }
}'

From the Enterprise Console UI or on the node itself, restart the node:

nohup $PLATFORM_DIRECTORY/product/events-service/processor/bin/events-service.sh start -p $PLATFORM_DIRECTORY/product/events-service/processor/conf/events-service-api-store.properties &

Wait for the shard migration to complete and for the cluster indicator to turn green.
Repeat this process (steps 1-8) for the next node to patch, as needed.
After you have completed patching, downsizing, or migrating all of the nodes, if total_shards_per_node was set to -1, then re-apply the total_shards_per_node default limit:
```
curl -XPUT 'localhost:9200/*/_settings' -d '
{
    "index": {
        "index.routing.allocation.total_shards_per_node" : 3
    }
}'
```
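
For the step that waits for the cluster indicator to turn green, the wait can be scripted as a polling loop. The following is a sketch under stated assumptions: the function name wait_for_green is hypothetical, and the attempt count and delay are illustrative values, not tested recommendations. The status is read from the _cat/health API, limited to its status column.

```shell
# Hypothetical helper: poll a status command until it prints "green".
# $1 = maximum attempts, $2 = seconds between attempts,
# remaining arguments = command that prints the cluster status.
wait_for_green() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$("$@" | tr -d '[:space:]')
    if [ "$status" = "green" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1   # gave up before the cluster turned green
}

# Example use (poll every 30 seconds, up to 120 times):
#   wait_for_green 120 30 curl -s 'http://localhost:9200/_cat/health?h=status'
```

Passing the status command as arguments keeps the loop independent of host, port, and authentication details, which vary between deployments.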

Additional Recommendations

When performing a rolling restart of the Elasticsearch data nodes during a minor update, our tests showed that a 30-60 minute window is required (excluding the upgrade or patch time).

Note: AppDynamics Analytics does not recommend stopping Events Service (Elasticsearch) nodes for an extended period of time. As a result, we perform this type of operation at 3-6 month intervals, with 60 minutes allocated for each node of the Elasticsearch cluster.

Troubleshooting

To analyze why rebalancing or allocation is not working correctly, you can use the following troubleshooting commands. Start by listing the unassigned shards and the reason each one is unassigned:

curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

You can retrieve unassigned.reason descriptions from https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-shards.html
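
On a cluster with many indices, a count per reason is often easier to read than the raw list. This is a minimal sketch that assumes the same column order as the command above (index, shard, prirep, state, unassigned.reason); the function name unassigned_by_reason is hypothetical.

```shell
# Hypothetical helper: count unassigned shards per unassigned.reason.
# Reads "index shard prirep state unassigned.reason" rows on stdin.
unassigned_by_reason() {
  awk '$4 == "UNASSIGNED" { count[$5]++ }
       END { for (reason in count) print reason, count[reason] }' | sort
}

# Example use:
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
#     | unassigned_by_reason
```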

To see whether a cluster-wide allocation restriction exists, check the existing cluster settings under cluster.routing:

curl -XGET 'localhost:9200/_cluster/settings?pretty=true'

Ensure that no cluster-based allocation restriction is in place by setting cluster.routing.allocation.enable to "all":

curl -XPUT 'localhost:9200/_cluster/settings' -d' { "transient": { "cluster.routing.allocation.enable" : "all" } } '

Ensure that no cluster-based relocation restriction is in place by setting cluster.routing.rebalance.enable to "all":

curl -XPUT 'localhost:9200/_cluster/settings' -d' { "transient": { "cluster.routing.rebalance.enable" : "all" } } '

Ensure that no cluster-based rebalance restriction is in place by setting cluster.routing.allocation.allow_rebalance to "always":

curl -XPUT 'localhost:9200/_cluster/settings' -d' { "transient": { "cluster.routing.allocation.allow_rebalance" : "always" } } '

To enable faster rebalancing before and after removing or adding a node:

curl -XPUT 'localhost:9200/_cluster/settings' -d' { "transient": { "cluster.routing.allocation.cluster_concurrent_rebalance" : 10 } } '

Check for index-level restrictions by retrieving the settings of the problematic index and reviewing the values under index.routing.allocation:

curl -XGET 'localhost:9200/<index_name_goes_here>/_settings?pretty=true'