...

The easiest way to trigger an alert for testing purposes is to shut down one gateway.

Note

If you are aware of an alert and know that resolving it will take several days or weeks, you can silence it via the Alertmanager GUI on port 9093.
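
Silences can also be created without the GUI. The sketch below is one possible approach, assuming the Alertmanager instance exposes its v2 HTTP API on the same port 9093. The host name, the alert name `GatewayDown`, and the helper names `build_silence` and `post_silence` are hypothetical and should be replaced with values from your own setup.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request


def build_silence(matchers, days, created_by, comment, now=None):
    """Build a silence payload for Alertmanager's v2 API.

    `matchers` maps label names to the exact values to silence.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(days=days)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(base_url, silence):
    """POST the silence to Alertmanager and return the parsed JSON reply."""
    req = request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Example usage (hypothetical host and alert name):
# silence = build_silence({"alertname": "GatewayDown"}, days=14,
#                         created_by="ops", comment="Fix scheduled")
# post_silence("http://alertmanager.example:9093", silence)
```

Giving every silence an explicit end time and a comment makes it easier for the rest of the team to see why an alert is quiet and when it will fire again.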

...

General advice on defining new alerts

  • Pages should be urgent, important, actionable, and real. 

  • They should represent either ongoing or imminent problems with your service. 

  • Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring. 

  • You should almost always be able to classify the problem into one of: availability & basic functionality; latency; correctness (completeness, freshness and durability of data); and feature-specific problems. 

  • Alerting on symptoms rather than causes captures more problems, more comprehensively and robustly, with less effort.

  • Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes. 

  • The further up your serving stack you go, the more distinct problems you catch in a single rule. But don't go so far that you can't sufficiently distinguish what's going on.

  • If you want a quiet on-call rotation, it's imperative to have a system for dealing with things that need a timely response but are not imminently critical.
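
The advice above can be read as a small decision procedure for a proposed alert. The following is a minimal sketch of that reading, not a rule from any monitoring tool: the `Alert` fields, the category names, and the three destinations (`page`, `ticket`, `dashboard`) are illustrative assumptions.

```python
from dataclasses import dataclass

# The four problem classes named in the advice above (short names are ours).
CATEGORIES = {"availability", "latency", "correctness", "feature"}


@dataclass
class Alert:
    name: str
    category: str       # one of CATEGORIES
    is_symptom: bool    # user-visible symptom, as opposed to an internal cause
    actionable: bool    # someone can do something about it
    urgent: bool        # ongoing or imminent problem


def route(alert):
    """Decide where a proposed alert should go: page, ticket, or dashboard."""
    if alert.category not in CATEGORIES:
        raise ValueError(f"unclassified alert: {alert.name}")
    if not alert.actionable or not alert.is_symptom:
        # Cause-based or non-actionable signals belong on dashboards,
        # where they can add context to a symptom-based page.
        return "dashboard"
    if alert.urgent:
        return "page"
    # Needs a timely response but is not imminently critical.
    return "ticket"


# Hypothetical examples:
print(route(Alert("GatewayDown", "availability", True, True, True)))
print(route(Alert("DiskFillingSlowly", "availability", True, True, False)))
print(route(Alert("HighCpu", "latency", False, True, True)))
```

The `ticket` destination stands in for whatever non-paging channel the team uses, which is what keeps non-critical work out of the on-call rotation.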