Monasca/Incident Manager

Revision as of 23:48, 3 April 2015

Use Cases

  1. Create a new incident
  2. Display all incidents in UI
  3. Display all open, acknowledged or resolved incidents in UI
  4. Display all open, acknowledged or resolved incidents assigned to a user in UI
  5. Acknowledge an incident in UI
  6. Resolve an incident in UI

Concepts

  • Incidents
    • Incidents are created when an alarm transitions to the ALARM or UNDETERMINED state.
    • Incidents are associated with an alarm.
    • Incidents enable alarms to be
      • Tracked by status
      • Assigned to users
      • Commented on by users
    • There are three statuses of an incident
      • OPEN: When an incident is created it is in the OPEN state.
      • ACKNOWLEDGED: When an incident is being worked on it is ACKNOWLEDGED.
      • RESOLVED: When an incident is closed it is RESOLVED.
    • Some of the concepts around incidents are "borrowed" from PagerDuty. See https://developer.pagerduty.com/documentation/rest/incidents.
  • Alarm
    • There are three states of an alarm
      • OK
      • ALARM
      • UNDETERMINED
  • Alarm state transition event
    • An event that is created by the Threshold Engine when the alarm transitions state.
  • Assignment/Owner
    • The user that the incident is assigned to.
  • Comment
    • A comment on an incident.
  • Actions
    • Similar to alarm definition actions in Monasca, incidents can also have actions which occur when an incident is modified.

Incident Lifecycle

This section describes the lifecycle of an incident.

Alarm states

Alarm state transitions are processed as follows:

  1. To ALARM
    1. Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
      1. If an incident doesn't exist for the alarm, or the status of the existing incident is RESOLVED, a new incident is created with the incident status as OPEN.
      2. If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
  2. To OK
    1. Add an alarm state transition event to an existing incident.
      1. If an incident doesn't exist for the alarm, or the status of the existing incident is RESOLVED, nothing is done.
      2. If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
  3. To UNDETERMINED
    1. Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
      1. If an incident doesn't exist for the alarm, or the status of the existing incident is RESOLVED, a new incident is created with the incident status as OPEN.
      2. If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
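The transition rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Incident Management Engine itself: the names Incident, IncidentStore and process_transition are hypothetical, and an in-memory dictionary stands in for the MySQL database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

OPEN, ACKNOWLEDGED, RESOLVED = "OPEN", "ACKNOWLEDGED", "RESOLVED"
ALARM_STATES = ("OK", "ALARM", "UNDETERMINED")


@dataclass
class Incident:
    """Hypothetical in-memory incident; only the fields the rules need."""
    alarm_id: str
    status: str = OPEN
    transitions: List[str] = field(default_factory=list)


class IncidentStore:
    """Hypothetical store that keeps the most recent incident per alarm ID."""

    def __init__(self) -> None:
        self._latest: Dict[str, Incident] = {}

    def process_transition(self, alarm_id: str, new_state: str) -> Optional[Incident]:
        if new_state not in ALARM_STATES:
            raise ValueError(f"unknown alarm state: {new_state}")
        current = self._latest.get(alarm_id)
        if current is not None and current.status in (OPEN, ACKNOWLEDGED):
            # An OPEN or ACKNOWLEDGED incident exists: record the event
            # and leave the incident status unchanged.
            current.transitions.append(new_state)
            return current
        if new_state in ("ALARM", "UNDETERMINED"):
            # No incident yet, or the last one is RESOLVED: open a new one.
            incident = Incident(alarm_id=alarm_id, transitions=[new_state])
            self._latest[alarm_id] = incident
            return incident
        # Transition to OK with no active incident: nothing is done.
        return None
```

Keying the store by alarm ID assumes a direct mapping between alarms and incidents, which is one of the open questions listed under Issues.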

Acknowledge incident

  1. Modify the incident to ACKNOWLEDGED.
  2. If an incident is acknowledged, it won't generate any additional notifications, even if it receives new alarm state transition events.

Resolve incident

  1. Modify the incident to RESOLVED.
  2. If an incident is resolved, it won't generate any additional notifications.
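The notification behaviour described for acknowledged and resolved incidents could be expressed as a single check. This is a minimal sketch: should_notify is a hypothetical helper, and it assumes (the text implies this but does not state it) that OPEN incidents do generate notifications when new alarm state transition events arrive.

```python
OPEN, ACKNOWLEDGED, RESOLVED = "OPEN", "ACKNOWLEDGED", "RESOLVED"


def should_notify(incident_status: str) -> bool:
    """Return True if a new event on the incident may trigger a notification.

    Assumption: only OPEN incidents notify; ACKNOWLEDGED and RESOLVED
    incidents stay silent, as described above.
    """
    if incident_status not in (OPEN, ACKNOWLEDGED, RESOLVED):
        raise ValueError(f"unknown incident status: {incident_status}")
    return incident_status == OPEN
```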

Assign or reassign incident

Assignment and reassignment of incidents are processed as follows:

  1. When an incident is created it is initially unassigned. It can then be assigned or reassigned later.

Incidents

  • GET /v2.0/incidents/
    • Query parameters
      • status
      • state
      • assigned_to
      • acknowledged_by
      • create_start_time
      • status_update_start_time
  • GET /v2.0/incidents/{incident-id}
  • PATCH /v2.0/incidents/{incident-id}: Update an incident, such as modifying the status to ACKNOWLEDGED or RESOLVED.
  • GET /v2.0/incidents/history: Get the history of all incidents filtering on the supplied query parameters.
    • Query parameters
      • status (string, optional)
      • state (string, optional)
      • created_timestamp (string, optional)
  • GET /v2.0/incidents/{incident-id}/history/: Get the history of a specific incident
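As an illustration of how a client might filter the incident list, the sketch below builds the request URL for GET /v2.0/incidents/ from the query parameters listed above. The helper name incidents_url and the host in the usage note are hypothetical, and the accepted value formats (for example, of the time parameters) are not specified in this document.

```python
from typing import Optional
from urllib.parse import urlencode

# Query parameters listed for GET /v2.0/incidents/ in this document.
ALLOWED_PARAMS = (
    "status",
    "state",
    "assigned_to",
    "acknowledged_by",
    "create_start_time",
    "status_update_start_time",
)


def incidents_url(base_url: str, **params: Optional[str]) -> str:
    """Build the list-incidents URL, rejecting parameters not listed above."""
    unknown = sorted(set(params) - set(ALLOWED_PARAMS))
    if unknown:
        raise ValueError(f"unsupported query parameters: {unknown}")
    # Keep a stable parameter order and skip parameters left as None.
    pairs = [(k, params[k]) for k in ALLOWED_PARAMS if params.get(k) is not None]
    url = base_url.rstrip("/") + "/v2.0/incidents/"
    return f"{url}?{urlencode(pairs)}" if pairs else url
```

For example, incidents_url("http://monasca.example", status="OPEN", assigned_to="alice") yields a URL that selects open incidents assigned to the user alice.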

Incident Response Object

  • id: The ID of the incident.
  • name: The name of the incident.
  • description: The description of the incident.
  • alarm: {alarm}
  • alarm_state_transitions: [{alarm_state_transition}]
  • status: OPEN, ACKNOWLEDGED, RESOLVED
  • created_timestamp: The timestamp when the incident was created.
  • status_updated_timestamp: The timestamp when the status of the incident was last updated.
  • comments: [comment-id]: An array of comments for the incident.
  • assignments: [{Assignment}]: The ID of the user the incident was assigned to and the timestamp of the assignment.
  • acknowledgments: [{Acknowledgment}]: The ID of the user who acknowledged the incident and the timestamp of the acknowledgment.
  • actions: [{notification-method}]: Array of notification method IDs that are invoked when the incident is modified in any way.
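A client could model the response object along the following lines. This is a sketch under assumptions: the field names come from the list above, but the Python types, the class name IncidentResponse, the helper parse_incident and the JSON encoding are all hypothetical, since the wire format is not specified here.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List

VALID_STATUSES = ("OPEN", "ACKNOWLEDGED", "RESOLVED")


@dataclass
class IncidentResponse:
    """Client-side view of an incident; types are assumed, not specified."""
    id: str
    name: str
    description: str
    alarm: Dict[str, Any]
    status: str
    created_timestamp: str
    status_updated_timestamp: str
    alarm_state_transitions: List[Dict[str, Any]] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)
    assignments: List[Dict[str, Any]] = field(default_factory=list)
    acknowledgments: List[Dict[str, Any]] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject any status other than the three defined for incidents.
        if self.status not in VALID_STATUSES:
            raise ValueError(f"invalid incident status: {self.status}")


def parse_incident(text: str) -> IncidentResponse:
    """Parse a JSON document (assumed encoding) into an IncidentResponse."""
    return IncidentResponse(**json.loads(text))
```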

Comments

  • GET /v2.0/comments
    • Query parameters
      • incident_id (string, optional)
  • GET /v2.0/comments/{comment-id}
  • POST /v2.0/comments

Comment Object

  • id
  • incident_id
  • created_timestamp
  • comment
  • user-id (string, required)

Architecture

  • Monasca Incident Management API
    • Query and update incidents
    • Create and query comments
  • Monasca Incident Management Engine
    • Consumes alarm state transition events from the Kafka alarm state transition events topic.
    • Creates incidents in the MySQL database based on the rules listed above.
    • Publishes incident transition events to the incident transition events topic in Kafka, which are consumed by the Notification Engine.
  • MySQL
    • Schemas are
      • Incidents
      • Comments

Issues

  1. How to assign actions when a new incident is created?
  2. Should alarm IDs match to incidents directly, or should there be a level of indirection between an incident ID and an alarm ID?