Monasca/Incident Manager

Use Cases

  1. Create a new incident
  2. Display all incidents in Ops Console
  3. Display all open, acknowledged or resolved incidents in Ops Console
  4. Display all open, acknowledged or resolved incidents assigned to a user in Ops Console
  5. Acknowledge an incident in Ops Console
  6. Resolve an incident in Ops Console
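
The display use cases above amount to filtering incidents by status and, optionally, by assigned user. A minimal sketch of that filter follows; the `Incident` fields and the `select_incidents` helper are hypothetical names used for illustration, not part of any Monasca API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Incident:
    # Hypothetical, simplified record; field names are assumptions.
    alarm_id: str
    status: str                      # "OPEN", "ACKNOWLEDGED" or "RESOLVED"
    assignee: Optional[str] = None   # None means unassigned


def select_incidents(
    incidents: Iterable[Incident],
    status: Optional[str] = None,
    assignee: Optional[str] = None,
) -> List[Incident]:
    """Return incidents matching the given status and assignee.

    A filter left as None is not applied, so calling with no filters
    returns every incident (use case 2).
    """
    return [
        incident
        for incident in incidents
        if (status is None or incident.status == status)
        and (assignee is None or incident.assignee == assignee)
    ]
```

For example, `select_incidents(incidents, status="OPEN", assignee="alice")` would correspond to use case 4 for open incidents assigned to a user named `alice`.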

Concepts

  • Incidents
    • Incidents are created when an alarm transitions to the ALARM or UNDETERMINED state and are associated with an alarm.
    • Incidents enable alarms to
      • Have their status tracked
      • Be assigned to users
      • Be commented on by users
    • An incident has one of three statuses
      • OPEN: When an incident is created, it is OPEN.
      • ACKNOWLEDGED: When an incident is being worked on, it is ACKNOWLEDGED.
      • RESOLVED: When an incident is closed, it is RESOLVED.
    • Some of the concepts around incidents are "borrowed" from PagerDuty. See https://developer.pagerduty.com/documentation/rest/incidents.
  • Alarm
    • There are three states of an alarm
      • OK
      • ALARM
      • UNDETERMINED
  • Alarm state transition event
    • An event created by the Threshold Engine when an alarm changes state.
  • Assignment/Owner
    • The user that the incident is assigned to.
  • Comment
    • A comment on an incident.
  • Actions
    • As with alarm definition actions in Monasca, incidents can have actions that occur when an incident is modified.
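
The concepts above can be pictured as a small data model. The sketch below is illustrative only: the class and field names (`Incident`, `assignee`, `events`, `old_state`, and so on) are assumptions made for this example, not the schema of an actual implementation. Actions are left out because the text does not describe their shape.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AlarmState(Enum):
    OK = "OK"
    ALARM = "ALARM"
    UNDETERMINED = "UNDETERMINED"


class IncidentStatus(Enum):
    OPEN = "OPEN"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    RESOLVED = "RESOLVED"


@dataclass
class AlarmStateTransitionEvent:
    # Emitted by the Threshold Engine when an alarm changes state.
    alarm_id: str
    old_state: AlarmState   # hypothetical field
    new_state: AlarmState


@dataclass
class Comment:
    author: str   # hypothetical field
    text: str


@dataclass
class Incident:
    alarm_id: str                                   # the alarm this incident is associated with
    status: IncidentStatus = IncidentStatus.OPEN    # a new incident starts as OPEN
    assignee: Optional[str] = None                  # owner; None while unassigned
    events: List[AlarmStateTransitionEvent] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)
```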

Incident Lifecycle

This section describes the lifecycle of an incident.

Alarm state transition events are processed as follows:

  1. To ALARM
    1. Open a new incident for the supplied alarm, or add the alarm state transition event to an existing incident.
      1. If no incident exists for the alarm, or the incident's status is RESOLVED, a new incident is created with a status of OPEN.
      2. If an incident with a status of OPEN or ACKNOWLEDGED exists for the alarm, the alarm state transition event is added to that incident, and its status is not modified.
  2. To OK
    1. Add the alarm state transition event to an existing incident, if there is one.
      1. If no incident exists for the alarm, or the incident's status is RESOLVED, nothing is done.
      2. If an incident with a status of OPEN or ACKNOWLEDGED exists for the alarm, the alarm state transition event is added to that incident, and its status is not modified.
  3. To UNDETERMINED
    1. Open a new incident for the supplied alarm, or add the alarm state transition event to an existing incident.
      1. If no incident exists for the alarm, or the incident's status is RESOLVED, a new incident is created with a status of OPEN.
      2. If an incident with a status of OPEN or ACKNOWLEDGED exists for the alarm, the alarm state transition event is added to that incident, and its status is not modified.
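
The three cases above reduce to two rules: an OPEN or ACKNOWLEDGED incident always absorbs the event, and otherwise only a transition to ALARM or UNDETERMINED opens a new incident. The following sketch shows one way to express that. The `process_transition` function, the `latest` dictionary (mapping each alarm ID to its most recent incident), and the simplified `Incident` record are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

TRIGGERING_STATES = {"ALARM", "UNDETERMINED"}   # transitions that may open an incident
ACTIVE_STATUSES = {"OPEN", "ACKNOWLEDGED"}      # incidents that still accept events


@dataclass
class Incident:
    # Simplified, hypothetical record; events are stored as the new alarm state.
    alarm_id: str
    status: str = "OPEN"
    events: List[str] = field(default_factory=list)


def process_transition(
    latest: Dict[str, Incident], alarm_id: str, new_state: str
) -> Optional[Incident]:
    """Apply one alarm state transition event.

    Returns the incident the event was recorded on, or None if nothing was done.
    """
    incident = latest.get(alarm_id)
    if incident is not None and incident.status in ACTIVE_STATUSES:
        # Existing OPEN or ACKNOWLEDGED incident: add the event, keep the status.
        incident.events.append(new_state)
        return incident
    if new_state in TRIGGERING_STATES:
        # No incident yet, or the last one is RESOLVED: create a new OPEN incident.
        incident = Incident(alarm_id=alarm_id, events=[new_state])
        latest[alarm_id] = incident
        return incident
    # Transition to OK with no active incident: nothing is done.
    return None
```

Keeping only the most recent incident per alarm in `latest` is a simplification for the example; a real store would presumably retain resolved incidents as history.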

Acknowledge incident

  1. Set the incident's status to ACKNOWLEDGED.
  2. An acknowledged incident does not generate any additional notifications, even if it receives new alarm state transition events.

Resolve incident

  1. Set the incident's status to RESOLVED.
  2. A resolved incident does not generate any additional notifications.
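
Acknowledging and resolving are both status changes that silence further notifications. The sketch below illustrates that rule with hypothetical helper names (`acknowledge`, `resolve`, `should_notify`). It assumes that only OPEN incidents generate notifications; the text states only that ACKNOWLEDGED and RESOLVED incidents do not.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Simplified, hypothetical record.
    alarm_id: str
    status: str = "OPEN"


def acknowledge(incident: Incident) -> None:
    """Mark the incident as being worked on."""
    incident.status = "ACKNOWLEDGED"


def resolve(incident: Incident) -> None:
    """Close the incident."""
    incident.status = "RESOLVED"


def should_notify(incident: Incident) -> bool:
    # Assumption: OPEN incidents notify; ACKNOWLEDGED and RESOLVED ones stay quiet,
    # even when new alarm state transition events arrive.
    return incident.status == "OPEN"
```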

Assignment and reassignment of incidents are processed as follows:

  1. When an incident is created, it is unassigned. It can be assigned or reassigned later.