Controller (SaaS, On Premise)


Configurations deleted after Machine agent went down.

Amith.B
Creator


The configurations that were set up in the Service Availability module were deleted after the Machine Agent went down (URL monitoring was done from this Machine Agent). How can they be recovered now? Almost 200 URLs had been configured on the server through the Controller (not from the yml file).

2 REPLIES

Re: Configurations deleted after Machine agent went down.

Hi,

I am also noticing the same behavior with SAM configurations on both Linux-based and Windows-based Machine Agents. Were you able to find a cause for this? Is it a bug or expected behavior?

Also, in my case it took over 24 hours for the configuration to vanish. How long did it take in yours?

Thanks,

Ven

Brian.Homrich
Creator

Re: Configurations deleted after Machine agent went down.

Amith.B,

I'm working with another environment that saw the same behavior. In our case we were permanently moving the Machine Agent that had been supporting the SAM checks to another Controller. After the node was removed (we saw the deletion in the on-prem audit log), the SAM check was removed as well.

It may be tied to the fact that you can't create a SAM check configuration without a node to run it.

Unfortunately, the SAM configuration appears to have been removed from the Controller database as well: the table in the on-prem Controller database was empty. In our case the number of checks was small (approximately a dozen).

We're planning two workarounds going forward:

  1. Use the ConfigExporter to back up the SAM checks once they're recreated.
  2. Add documentation to the environment wiki to make sure we don't delete that agent.

It should be possible to put a health rule on the availability of that specific agent so that it goes critical quickly if the agent stops reporting. I would have that alert warn and go critical aggressively, notifying the NOC or Operations team that manages the SAM configuration.

I say that because, in our case, the Controller audit log showed the node was deleted after 2 days, which doesn't match any of the node retention settings in effect on the Controller (node.retention.period was 500 and node.permanent.deletion.period was 720, so roughly 20 days to mark the node and 30 to remove it).
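For anyone checking the same numbers, here is a quick sketch of that arithmetic in Python. It assumes both settings are expressed in hours, which is what the 20-day and 30-day figures imply. The function and variable names are just illustrative, not anything from the Controller.

```python
HOURS_PER_DAY = 24


def hours_to_days(hours: float) -> float:
    """Convert a retention value given in hours to days."""
    return hours / HOURS_PER_DAY


# Values reported in this post; the unit (hours) is an assumption.
retention_settings_hours = {
    "node.retention.period": 500,
    "node.permanent.deletion.period": 720,
}

# Deletion delay seen in the audit log, in days.
observed_deletion_days = 2

for name, hours in retention_settings_hours.items():
    print(f"{name}: {hours} h = {hours_to_days(hours):.1f} days")

# The shortest configured window, for comparison with what was observed.
shortest_window_days = min(
    hours_to_days(h) for h in retention_settings_hours.values()
)
print(
    f"observed deletion after {observed_deletion_days} days, "
    f"shortest configured window is {shortest_window_days:.1f} days"
)
```

If the observed delay is far shorter than the shortest configured window, as it was for us, the deletion was probably not driven by those two settings alone.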

 

Good Luck