Published on 11-23-2020 09:20 PM - edited on 11-23-2020 09:23 PM by Claudia.Landiva
This article provides recommendations on how to safely patch an Events Service node. It includes an example in which you must effectively stop a node for an extended period of time, and then return it to the cluster using either old index information or a fresh, clean node.
This article references the official Elasticsearch guidelines for Rolling Upgrades. Step 2 of those guidelines asks the administrator to stop non-essential indexing while the node is being stopped in order to save time; however, this step is not recommended in a heavily active production cluster.
Additionally, from the Events Service index management API perspective, setting cluster.routing.allocation.enable to "none" may lead to unintended consequences. For example, if time-consuming index creation and management tasks happen to occur between the "none" and "all" settings, indices may fail to be created.
When patching nodes in Events Service, consider the following practical limitations:
"cluster.routing.allocation.enable": "primaries" | “none” | “all”
"cluster.routing.rebalance.enable": "primaries" | “none” | “all”
"indices.recovery.max_bytes_per_sec": "1000mb"
"cluster.routing.allocation.node_initial_primaries_recoveries": 1-10,
"cluster.routing.allocation.cluster_concurrent_rebalance": 2-8,
"cluster.routing.allocation.node_concurrent_recoveries": 2-8,
"indices.recovery.concurrent_streams": 1-6
"cluster.routing.allocation.exclude._ip"
The AppDynamics Analytics team has adopted and recommends the following practice when rotating nodes in or out.
After identifying the nodes to replace or upgrade in-place, consider the following, one node at a time:
curl -XPUT 'localhost:9200/*/_settings' -d '
{
"index": {
"index.routing.allocation.total_shards_per_node": -1
}
}'
curl -XPUT 'localhost:9200/_cluster/settings' -d '
{ "transient": {
"indices.recovery.max_bytes_per_sec": "1000mb",
"cluster.routing.allocation.node_initial_primaries_recoveries": 1,
"cluster.routing.allocation.cluster_concurrent_rebalance": 2,
"cluster.routing.allocation.node_concurrent_recoveries": 2,
"indices.recovery.concurrent_streams": 6
} }'
For each node in the cluster to patch, upgrade, or remove:
curl -s 'http://localhost:9200/_cat/nodes?v'
curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
"transient" :{
"cluster.routing.allocation.exclude._ip" : "W.X.Y.Z"
}
}'
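One way to keep the exclude step repeatable per node is to parameterize the payload; the sketch below is an assumption (the `exclude_payload` helper and `NODE_IP` variable are not part of the official procedure):

```shell
#!/bin/sh
# Hypothetical sketch: build the exclude._ip payload from a variable so the
# same command can be reused for each node being drained.
exclude_payload() {
  printf '{ "transient": { "cluster.routing.allocation.exclude._ip": "%s" } }' "$1"
}

NODE_IP="10.0.0.1"   # example value; substitute the IP of the node to drain
exclude_payload "$NODE_IP"

# Live usage sketch:
#   curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' \
#        -d "$(exclude_payload "$NODE_IP")"
# Passing an empty string later clears the exclusion, as shown in the article.
```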
curl -s 'http://localhost:9200/_cat/allocation?v'
<$PLATFORM_PATH>/product/events-service/processor/bin/events-service.sh stop
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
"transient" :{ "cluster.routing.allocation.exclude._ip" : "" } }'
nohup $PLATFORM_DIRECTORY/product/events-service/processor/bin/events-service.sh start -p $PLATFORM_DIRECTORY/product/events-service/processor/conf/events-service-api-store.properties &
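Before moving on to the next node, it is common practice (an assumption here, not an AppDynamics-documented step) to wait for cluster health to return to green; the `is_green` predicate below is hypothetical:

```shell
#!/bin/sh
# Hypothetical predicate: succeed only when a _cluster/health response
# reports "status":"green".
is_green() {
  printf '%s' "$1" | grep -q '"status"[[:space:]]*:[[:space:]]*"green"'
}

# Polling sketch against a live cluster (assumes localhost:9200):
#   until is_green "$(curl -s 'http://localhost:9200/_cluster/health')"; do
#     sleep 10
#   done
is_green '{"cluster_name":"es","status":"green"}' && echo "cluster is green"
```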
curl -XPUT 'localhost:9200/*/_settings' -d '
{
"index": {
"index.routing.allocation.total_shards_per_node": 3
}
}'
When performing a rolling restart of the Elasticsearch data nodes during a minor update, our testing determined that a 30-60 minute window is required (excluding the upgrade or patch time itself).
Note: AppDynamics Analytics does not recommend stopping Events Service (Elasticsearch) nodes for an extended period of time. As a result, we perform this type of operation in 3-6 month intervals with 60 minutes allocated for each node of the Elasticsearch cluster.
If shards are not rebalancing or allocating as expected, you can use the following commands to troubleshoot the cause:
curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
You can retrieve unassigned.reason descriptions from https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-shards.html
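To track progress while shards recover, the UNASSIGNED lines can be counted; the `count_unassigned` helper below is a hypothetical convenience, not part of the Elasticsearch API:

```shell
#!/bin/sh
# Hypothetical helper: count UNASSIGNED shards in _cat/shards output.
# grep -c prints the match count (including 0) but exits non-zero when there
# are no matches, hence the trailing "|| true".
count_unassigned() {
  grep -c 'UNASSIGNED' || true
}

# Live usage sketch (assumes localhost:9200):
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
#     | count_unassigned

# Local demonstration on sample _cat/shards lines:
printf 'logs-1 0 p STARTED\nlogs-1 0 r UNASSIGNED NODE_LEFT\n' | count_unassigned
```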
If a per-cluster allocation restriction exists, you can find it in the existing cluster settings under cluster.routing:
curl -XGET 'localhost:9200/_cluster/settings?pretty=true'
Verify that you do not have cluster-based allocation restrictions:
curl -XPUT 'localhost:9200/_cluster/settings' -d' { "transient": { "cluster.routing.allocation.enable" : "all" } } '
Verify that you do not have cluster-based rebalance restrictions:
curl -XPUT 'localhost:9200/_cluster/settings' -d' { "transient": { "cluster.routing.rebalance.enable" : "all" } } '
Verify that rebalancing is not gated on shard allocation state (cluster.routing.allocation.allow_rebalance):
curl -XPUT 'localhost:9200/_cluster/settings' -d' { "transient": { "cluster.routing.allocation.allow_rebalance" : "always" } } '
To enable faster rebalancing before and after removing or adding a node:
curl -XPUT 'localhost:9200/_cluster/settings' -d' { "transient": { "cluster.routing.allocation.cluster_concurrent_rebalance" : 10 } } '
Verify by retrieving the problematic index settings under index.routing.allocation:
curl -XGET 'localhost:9200/<index_name_goes_here>/_settings?pretty=true'
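If the index-level settings reveal a lingering index.routing.allocation.total_shards_per_node cap, a payload like the one below can lift it (a sketch assuming -1 removes the cap, matching the value used earlier in this article; the `lift_shard_cap` helper is hypothetical):

```shell
#!/bin/sh
# Hypothetical sketch: build the index-settings payload that removes the
# per-node shard cap (-1 means unlimited, as used earlier in the article).
lift_shard_cap() {
  printf '{ "index": { "index.routing.allocation.total_shards_per_node": -1 } }'
}

# Live usage sketch (index name is a placeholder):
#   curl -XPUT 'localhost:9200/<index_name_goes_here>/_settings' -d "$(lift_shard_cap)"
lift_shard_cap | python3 -m json.tool
```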