
Difference between revisions of "Monasca/Incident Manager"

Revision as of 23:12, 3 April 2015

Use Cases

  1. Create a new incident
  2. Display all incidents in Ops Console
  3. Display all open, acknowledged or resolved incidents in Ops Console
  4. Display all open, acknowledged or resolved incidents assigned to a user in Ops Console
  5. Acknowledge an incident in Ops Console
  6. Resolve an incident in Ops Console

Concepts

  • Incidents
    • Incidents are created when an alarm transitions to the ALARM or UNDETERMINED state and are associated with an alarm.
    • Incidents enable alarms to
      • Have their status tracked
      • Be assigned to users
      • Be commented on by users
    • There are three statuses of an incident
      • OPEN: When an incident is created, its status is OPEN.
      • ACKNOWLEDGED: When an incident is being worked on it is ACKNOWLEDGED.
      • RESOLVED: When an incident is closed, it is resolved.
    • Some of the concepts around incidents are "borrowed" from PagerDuty. See https://developer.pagerduty.com/documentation/rest/incidents.
  • Alarm
    • There are three states of an alarm
      • OK
      • ALARM
      • UNDETERMINED
  • Alarm state transition event
    • An event that is created by the Threshold Engine when the alarm transitions state.
  • Assignment/Owner
    • The user that the incident is assigned to.
  • Comment
    • A comment on an incident.
  • Actions
    • Similar to alarm definition actions in Monasca, incidents can also have actions which occur when an incident is modified.

Incident Lifecycle

This section describes the lifecycle of an incident.

Alarm state transition events are processed as follows:

  1. To ALARM
    1. Opens a new incident for the supplied alarm, or adds an alarm state transition event to an existing incident.
      1. If an incident doesn't exist for the alarm, or the status of the incident has been RESOLVED, a new incident is created with the incident status as OPEN.
      2. If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
  2. To OK
    1. Adds an alarm state transition event to an existing incident.
      1. If an incident doesn't exist for the alarm, or the status of the incident has been RESOLVED, nothing is done.
      2. If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
  3. To UNDETERMINED
    1. Opens a new incident for the supplied alarm, or adds an alarm state transition event to an existing incident.
      1. If an incident doesn't exist for the alarm, or the status of the incident has been RESOLVED, a new incident is created with the incident status as OPEN.
      2. If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
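
The rules above can be sketched as a small in-memory model. This is only an illustration, not the engine's implementation: `IncidentStore`, `Incident`, and `process_transition` are hypothetical names, the real engine would persist incidents in a database, and recording the triggering event on a newly created incident is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

OPEN, ACKNOWLEDGED, RESOLVED = "OPEN", "ACKNOWLEDGED", "RESOLVED"


@dataclass
class Incident:
    alarm_id: str
    status: str = OPEN
    alarm_state_transitions: List[str] = field(default_factory=list)


class IncidentStore:
    """Hypothetical in-memory stand-in for the incident database."""

    def __init__(self) -> None:
        # Latest incident per alarm ID; older incidents are not kept in this sketch.
        self._latest: Dict[str, Incident] = {}

    def active_incident(self, alarm_id: str) -> Optional[Incident]:
        """Return the OPEN or ACKNOWLEDGED incident for the alarm, if any."""
        incident = self._latest.get(alarm_id)
        if incident is not None and incident.status != RESOLVED:
            return incident
        return None

    def process_transition(self, alarm_id: str, new_state: str) -> Optional[Incident]:
        """Apply one alarm state transition event (to ALARM, OK or UNDETERMINED)."""
        incident = self.active_incident(alarm_id)
        if incident is None:
            if new_state == "OK":
                # No incident, or the last one is RESOLVED: nothing is done.
                return None
            # To ALARM or UNDETERMINED: a new incident is created as OPEN.
            incident = Incident(alarm_id=alarm_id)
            self._latest[alarm_id] = incident
        # Assumption: the triggering event is also recorded on a new incident.
        # The status of an existing incident is never modified here.
        incident.alarm_state_transitions.append(new_state)
        return incident
```

For example, calling `process_transition("alarm-1", "ALARM")` on an empty store creates an OPEN incident, and a later transition to OK is appended to that same incident without changing its status.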

Acknowledge incident

  1. Modify the incident to ACKNOWLEDGED.
  2. If an incident is acknowledged, it won't generate any additional notifications, even if it receives new alarm state transition events.

Resolve incident

  1. Modify the incident to RESOLVED.
  2. If an incident is resolved, it won't generate any additional notifications.
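
The notification rule for acknowledged and resolved incidents can be expressed as a single predicate. This is a minimal sketch: the function name `notifies_on_new_event` is hypothetical, and the idea that an OPEN incident does notify on new events is an assumption inferred from the word "additional" above.

```python
OPEN, ACKNOWLEDGED, RESOLVED = "OPEN", "ACKNOWLEDGED", "RESOLVED"


def notifies_on_new_event(status: str) -> bool:
    """Return True if a new event on an incident with this status should notify."""
    if status not in (OPEN, ACKNOWLEDGED, RESOLVED):
        raise ValueError(f"unknown incident status: {status}")
    # ACKNOWLEDGED and RESOLVED incidents stay silent.
    # Assumption: OPEN incidents do generate notifications.
    return status == OPEN
```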

Incident assignment and reassignment are processed as follows:

  1. When an incident is created it is initially unassigned. It can then be assigned or reassigned later.

Incidents

  • GET /v2.0/incidents/
    • Query parameters
      • status
      • state
      • assigned_to
      • acknowledged_by
      • create_start_time
      • status_update_start_time
  • GET /v2.0/incidents/{incident-id}
  • PATCH /v2.0/incidents/{incident-id}: Update an incident, such as modifying the status to ACKNOWLEDGED or RESOLVED.
  • GET /v2.0/incidents/history: Get the history of all incidents filtering on the supplied query parameters.
    • Query parameters
      • status (string, optional)
      • state (string, optional)
      • created_timestamp (string, optional)
  • GET /v2.0/incidents/{incident-id}/history/: Get the history of a specific incident
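
As an illustration of how a client might address these endpoints, the sketch below builds a list query URL and a PATCH body using only the Python standard library. The base URL, the helper names, the parameter values, and the JSON shape of the PATCH body are all assumptions; the page above only names the paths and query parameters.

```python
import json
from urllib.parse import urlencode, urljoin

BASE_URL = "http://monasca.example.com:8070"  # hypothetical endpoint

# Query parameters listed for GET /v2.0/incidents/.
LIST_FILTERS = {
    "status",
    "state",
    "assigned_to",
    "acknowledged_by",
    "create_start_time",
    "status_update_start_time",
}


def incidents_url(base_url: str, **filters: str) -> str:
    """Build the URL for GET /v2.0/incidents/ with optional filters."""
    unknown = set(filters) - LIST_FILTERS
    if unknown:
        raise ValueError(f"unsupported query parameters: {sorted(unknown)}")
    url = urljoin(base_url, "/v2.0/incidents/")
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(filters.items()))
    return f"{url}?{query}" if query else url


def status_patch_body(status: str) -> str:
    """Build a JSON body for PATCH /v2.0/incidents/{incident-id} (assumed shape)."""
    if status not in ("ACKNOWLEDGED", "RESOLVED"):
        raise ValueError(f"unsupported status update: {status}")
    return json.dumps({"status": status})
```

A call such as `incidents_url(BASE_URL, status="OPEN", assigned_to="user-1")` would then cover the "open incidents assigned to a user" use case.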

Incident Response Object

  • id: The ID of the incident.
  • name: The name of the incident.
  • description: The description of the incident.
  • alarm: {alarm}
  • alarm_state_transitions: [{alarm_state_transition}]
  • status: OPEN, ACKNOWLEDGED, RESOLVED
  • created_timestamp: The timestamp when the incident was created.
  • status_updated_timestamp: The timestamp when the status of the incident was last updated.
  • comments: [comment-id]: An array of comments for the incident.
  • assignments: [{Assignment}]: The user ID that the incident was assigned to and the timestamp of each assignment.
  • acknowledgments: [{Acknowledgment}]: The user ID that acknowledged the incident and the timestamp of each acknowledgment.
  • actions: [{notification-method}]: Array of notification method IDs that are invoked when the incident is modified in any way.
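
To make the field list concrete, here is a sketch of what a response object could look like, together with a small completeness check. Every value is made up, and the shapes of the nested objects (for example `user_id` and `timestamp` inside an assignment, or the contents of `alarm`) and the ISO 8601 timestamp format are assumptions, since the page does not define them.

```python
import json

# Top-level field names taken from the list above.
INCIDENT_FIELDS = {
    "id", "name", "description", "alarm", "alarm_state_transitions",
    "status", "created_timestamp", "status_updated_timestamp",
    "comments", "assignments", "acknowledgments", "actions",
}

# Hypothetical example; all values and nested shapes are assumptions.
example_incident = {
    "id": "incident-1",
    "name": "High CPU on host-1",
    "description": "CPU usage alarm entered the ALARM state.",
    "alarm": {"id": "alarm-1", "state": "ALARM"},
    "alarm_state_transitions": [{"old_state": "OK", "new_state": "ALARM"}],
    "status": "ACKNOWLEDGED",
    "created_timestamp": "2015-04-03T23:12:00Z",
    "status_updated_timestamp": "2015-04-03T23:20:00Z",
    "comments": ["comment-1"],
    "assignments": [{"user_id": "user-1", "timestamp": "2015-04-03T23:15:00Z"}],
    "acknowledgments": [{"user_id": "user-1", "timestamp": "2015-04-03T23:20:00Z"}],
    "actions": ["notification-method-1"],
}


def missing_fields(incident: dict) -> list:
    """Return the documented top-level fields absent from an incident object."""
    return sorted(INCIDENT_FIELDS - incident.keys())


print(json.dumps(example_incident, indent=2))
```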

Comments

  • GET /v2.0/comments
    • Query parameters
      • incident_id (string, optional)
  • GET /v2.0/comments/{comment-id}
  • POST /v2.0/comments

Comment Object

  • id
  • incident_id
  • created_timestamp
  • comment
  • user-id (string, required)
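
A request body for POST /v2.0/comments might be assembled as follows. This is a sketch under assumptions: the helper name `comment_body` is hypothetical, and it assumes that `id` and `created_timestamp` are generated by the server and therefore omitted from the request, which the page does not state.

```python
import json


def comment_body(incident_id: str, user_id: str, comment: str) -> str:
    """Build a JSON body for POST /v2.0/comments (assumed request shape)."""
    if not user_id:
        # user-id is the only field marked as required in the list above.
        raise ValueError("user-id is required")
    payload = {
        "incident_id": incident_id,
        "user-id": user_id,  # field name kept exactly as listed above
        "comment": comment,
    }
    return json.dumps(payload, sort_keys=True)
```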

Architecture

  • Monasca Incident Management API
    • Query and update incidents
    • Create and query comments
  • Monasca Incident Management Engine
    • Consumes alarm state transition events from the Kafka alarm state transition events topic.
    • Creates incidents in the MySQL database based on the rules listed above.
    • Publishes incident transition events to the incident transition events topic in Kafka, where they are consumed by the Notification Engine.
  • MySQL
    • Schemas are
      • Incidents
      • Comments

Issues

  1. How to assign actions when a new incident is created?
  2. Should alarm IDs match to incidents directly, or should there be a level of indirection between an incident ID and an alarm ID?