
Difference between revisions of "Monasca/Incident Manager"

== Use Cases ==
 
# Open a new incident when an alarm occurs
 
# Display all incidents
 
# Display all open, acknowledged or resolved incidents
 
# Display my incidents
 
# Display all open, acknowledged or resolved incidents assigned to a user
 
# Display all incidents for an alarm.
 
# Display all open incidents
 
# Display all open and unassigned incidents.
 
# Assign/Reassign an incident to a user
 
# Acknowledge an incident by a user
 
# Resolve an incident by a user
 
# Comment on an incident by a user
 
# Create an incident report for a user.
 
  
== Domain Model ==
 
This section describes the domain model by establishing the core concepts and ubiquitous language for incident management.
 
 
* Incidents
 
** Incidents are resources that are created when an alarm transitions to the ALARM or UNDETERMINED state.
 
** Incidents are associated with an alarm.
 
** Incidents allow users to manage alarms as follows:
 
*** Assign and query the state of incidents
 
*** Track the history of alarm events for an incident.
 
*** Assign incidents to users. Store and query the history of assignments for an incident.
 
*** Allow users to comment on incidents. Store and query the history of comments for an incident.
 
** There are four states of an incident:
 
*** OPEN: When a new alarm occurs and an incident is created, the incident is in the OPEN state and is not assigned to a user.
 
*** ASSIGNED: From the OPEN state the incident can be assigned to a user.
 
*** ACKNOWLEDGED: After an incident is assigned to a user the user must acknowledge that they are working on it.
 
*** RESOLVED: When you are no longer interested in tracking an incident, it can be resolved.
 
* Alarms
 
** Alarms are resources in Monasca that are created by the Threshold Engine when new metrics are received that match one or more alarm definitions.
 
** There are three states of an alarm:
 
*** OK
 
*** ALARM
 
*** UNDETERMINED
 
** The state of an alarm is controlled by the Threshold Engine unless it is explicitly set using the Monasca API.
 
* Alarm state transition events
 
** An event that is generated when the state of an alarm transitions from one state to another.
 
** An event is published by the Threshold Engine to the Message Queue when an alarm transitions state.
 
* Comment
 
** Comments are resources that are created when a user comments on an incident.
 
* Actions
 
** Similar to actions for alarm definitions in Monasca, incidents can also have actions which occur when an incident is modified.
 
** Actions can be associated with notification methods and result in webhooks, emails or other notifications.
 
* Users: Users are users in Keystone.
 
* Incident events
 
 
 
Note: several of the concepts related to incidents were "borrowed" from PagerDuty. See https://developer.pagerduty.com/documentation/rest/incidents. The PagerDuty incident lifecycle is described at https://support.pagerduty.com/hc/en-us/articles/202829260-Incident-Lifecycle.
 
 
== Incident Lifecycle ==
 
An incident starts out in the OPEN state and, over the course of its lifecycle, undergoes several state transitions until it is resolved and its lifecycle is complete. This section describes the lifecycle of an incident, which includes creating incidents, handling alarm state transitions, updating the state of incidents, assigning incidents and commenting on incidents.
 
 
=== Alarm states ===
 
Alarm state transition events are created by the Threshold Engine and are processed by the Incident Manager as follows:
 
 
# To ALARM
 
## Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
 
### If an incident doesn't exist for the alarm, or the state of the incident matching an alarm has been RESOLVED, a new incident is created with an incident state of OPEN.
 
### If there exists an incident with a state of OPEN, ASSIGNED or ACKNOWLEDGED matching the alarm, the alarm state transition event is added to the existing incident, and the state is not modified.
 
# To OK
 
## Add an alarm state transition event to an existing incident.
 
### If an incident doesn't exist for the alarm, or the state of the incident matching an alarm has been RESOLVED, nothing is done.
 
### If there exists an incident with a state of OPEN, ASSIGNED or ACKNOWLEDGED matching the alarm, the alarm state transition event is added to the existing incident, and the state is not modified.
 
# To UNDETERMINED
 
## Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
 
### If an incident doesn't exist for the alarm, or the state of the incident matching an alarm has been RESOLVED, a new incident is created with an incident state of OPEN.
 
### If there exists an incident with a state of OPEN, ASSIGNED or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the state is not modified.
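
The rules above can be sketched as a single handler. This is a minimal illustration rather than the MIM implementation: <code>Incident</code>, <code>IncidentStore</code> and <code>handle_alarm_transition</code> are hypothetical names, the store is an in-memory stand-in for the database, and recording the transition that opens a new incident is an assumption that the rules do not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Incident states in which an incident still tracks its alarm.
ACTIVE_STATES = {"OPEN", "ASSIGNED", "ACKNOWLEDGED"}


@dataclass
class Incident:
    alarm_id: str
    state: str = "OPEN"
    alarm_state_transitions: List[str] = field(default_factory=list)


class IncidentStore:
    """In-memory stand-in for the incident database (hypothetical)."""

    def __init__(self) -> None:
        self.incidents: List[Incident] = []

    def find_active(self, alarm_id: str) -> Optional[Incident]:
        # Return the unresolved incident for this alarm, if there is one.
        for incident in self.incidents:
            if incident.alarm_id == alarm_id and incident.state in ACTIVE_STATES:
                return incident
        return None


def handle_alarm_transition(
    store: IncidentStore, alarm_id: str, new_alarm_state: str
) -> Optional[Incident]:
    """Apply one alarm state transition event (to ALARM, OK or UNDETERMINED)."""
    incident = store.find_active(alarm_id)
    if incident is None:
        if new_alarm_state == "OK":
            # No incident, or only RESOLVED ones: nothing is done.
            return None
        # To ALARM or UNDETERMINED: open a new incident in the OPEN state.
        incident = Incident(alarm_id=alarm_id)
        store.incidents.append(incident)
    # Assumption: the opening transition is recorded too. The incident
    # state itself is never modified by an alarm transition.
    incident.alarm_state_transitions.append(new_alarm_state)
    return incident
```

Keeping the lookup in <code>find_active</code> means a RESOLVED incident is treated the same as no incident at all, which is what makes a later ALARM open a fresh incident.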
 
 
=== Incident state ===
 
The state of an Incident is modified via the Incidents API and processed as follows:
 
 
# To OPEN
 
## When an incident is initially created, the state is set to OPEN and the incident is unassigned.
 
 
# To ASSIGNED
 
## An incident can be assigned to a user from the OPEN state.
 
 
# To ACKNOWLEDGED
 
## If an incident is in the ASSIGNED state, the assignee can acknowledge it by setting the state to ACKNOWLEDGED using the Incidents API.
 
## An incident state event is published to the Message Queue which is processed by the Notification Engine.
 
## If an incident is acknowledged, it won't generate any additional notifications, even if it receives new alarm state transition events.
 
 
# To RESOLVED
 
## If an incident is in the ACKNOWLEDGED state the state can be set to RESOLVED using the Incidents API.
 
## An incident state event is published to the Message Queue which is processed by the Notification Engine.
 
## If an incident is resolved, it won't generate any additional notifications.
 
 
 
Whenever the state of an incident is modified the user that modified the incident and timestamp are recorded.
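
The transitions described in this section can be read as a small state machine. The sketch below assumes the strictest reading (OPEN to ASSIGNED to ACKNOWLEDGED to RESOLVED, with no re-opening, which the Issues section leaves undecided). The function name <code>change_state</code> and the <code>state_history</code> key are hypothetical and are not part of the Incident object defined later.

```python
from datetime import datetime, timezone

# Allowed incident state transitions under a strict reading of this section.
ALLOWED_TRANSITIONS = {
    "OPEN": {"ASSIGNED"},
    "ASSIGNED": {"ACKNOWLEDGED"},
    "ACKNOWLEDGED": {"RESOLVED"},
    "RESOLVED": set(),
}


def change_state(incident: dict, new_state: str, user_id: str) -> dict:
    """Move an incident to new_state, recording who changed it and when."""
    current = incident["state"]
    if new_state not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move incident from {current} to {new_state}")
    now = datetime.now(timezone.utc).isoformat()
    incident["state"] = new_state
    incident["state_updated_timestamp"] = now
    # Hypothetical audit trail: the user and timestamp of each modification.
    incident.setdefault("state_history", []).append(
        {"state": new_state, "user_id": user_id, "timestamp": now}
    )
    return incident
```

If re-opening incidents is later supported, it would only require adding entries to <code>ALLOWED_TRANSITIONS</code>.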
 
 
=== Assign or reassign incident ===
 
Incidents can be assigned to users. The assignment or reassignment of an incident is done via the Incidents API and is processed as follows:
 
 
# When an incident is initially created it is unassigned. It can then be assigned or reassigned later using the Incidents API.
 
# An incident assignment/reassignment event is published to the Message Queue which is then processed by the Notification Engine.
 
 
 
When an incident is assigned or reassigned the assigner, assignee and timestamp are recorded.
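
As a rough sketch of the steps above, the function below records an assignment and emits an event. The names are hypothetical: <code>assign_incident</code>, the <code>incident.assigned</code> event type and the plain list standing in for the Message Queue are all illustrative. Moving an OPEN incident to ASSIGNED on first assignment is an assumption based on the state descriptions.

```python
from datetime import datetime, timezone
from typing import List


def assign_incident(
    incident: dict, assigner_id: str, assignee_id: str, queue: List[dict]
) -> dict:
    """Assign or reassign an incident and publish an assignment event."""
    record = {
        "assigner_id": assigner_id,
        "assignee_id": assignee_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Keep the full history so earlier assignments can still be queried.
    incident.setdefault("assignments", []).append(record)
    # Assumption: the first assignment moves an OPEN incident to ASSIGNED;
    # a reassignment leaves the current state untouched.
    if incident.get("state") == "OPEN":
        incident["state"] = "ASSIGNED"
    # Stand-in for publishing to the Message Queue (hypothetical event shape).
    queue.append(
        {"type": "incident.assigned", "incident_id": incident.get("id"), **record}
    )
    return record
```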
 
 
=== Comment on incident ===
 
Incidents can be commented on by users. Comments can be created via the Incidents API and are processed as follows:
 
 
# When a comment is added to an incident, the comment is stored.
 
# An incident comment event is published to the Message Queue and then processed by the Notification Engine.
 
 
 
When an incident is commented on, the user, timestamp and comment are recorded.
 
 
== Incidents API ==
 
* GET /v2.0/incidents/
 
** Query parameters
 
*** state
 
*** alarm_state
 
*** assigned_to
 
*** acknowledged_by
 
*** create_start_time
 
*** state_update_start_time
 
*** alarm ID
 
*** user ID
 
* GET /v2.0/incidents/{incident-id}
 
* PATCH /v2.0/incidents/{incident-id}: Update an incident, such as modifying the state to ACKNOWLEDGED or RESOLVED.
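
A client might build these requests as follows. This is only a sketch: the base URL is a placeholder, the helper names are hypothetical, and the PATCH body of <code>{"state": ...}</code> is an assumption, since the request format is not specified here. The <code>X-Auth-Token</code> header is the usual way to pass a Keystone token.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_incidents_url(base_url: str, **filters: Optional[str]) -> str:
    """Build the GET /v2.0/incidents/ URL, skipping filters that are None."""
    query = {key: value for key, value in filters.items() if value is not None}
    url = base_url.rstrip("/") + "/v2.0/incidents/"
    return url + ("?" + urlencode(query) if query else "")


def build_state_patch(
    base_url: str, incident_id: str, new_state: str, token: str
) -> Request:
    """Build (but do not send) a PATCH request that updates the state."""
    body = json.dumps({"state": new_state}).encode("utf-8")  # assumed body shape
    return Request(
        base_url.rstrip("/") + "/v2.0/incidents/" + incident_id,
        data=body,
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="PATCH",
    )
```

For example, <code>build_incidents_url("http://localhost:8080", state="OPEN", assigned_to="alice")</code> produces a URL that lists the open incidents assigned to one user.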
 
 
=== Incident Object ===
 
* id - The ID of the incident.
 
* name - The name of the incident.
 
* description - The description of the incident.
 
* alarm: ({alarm}, required) - The alarm associated with the incident
 
* alarm_state_transitions [{alarm_state_transition}] - The history of alarm state transitions for the incident
 
* state - OPEN, ASSIGNED, ACKNOWLEDGED, RESOLVED
 
* created_timestamp - The timestamp when the incident was created.
 
* state_updated_timestamp - The timestamp when the state of the incident was last updated.
 
* comments: [comment-id] - An array of comments for the incident.
 
* assignments: [{Assignment}] - The assigner, assignee and timestamp when the incident was assigned/reassigned.
 
* acknowledgments: [{Acknowledgment}] - The user and timestamp that acknowledged or resolved the incident. 
 
* actions: [{notification-method}] - Array of notification method IDs that are invoked when the incident is modified.
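
To make the shape concrete, an incident might serialize roughly as shown below. The field names come from the list above. Every value, and the inner layout of the nested <code>alarm</code>, <code>alarm_state_transitions</code>, <code>assignments</code> and <code>acknowledgments</code> entries, is illustrative only.

```python
import json

# Illustrative incident; all identifiers and timestamps are made up.
incident = {
    "id": "incident-1",
    "name": "High CPU on web tier",
    "description": "CPU alarm fired for the web servers.",
    "alarm": {"id": "alarm-1", "state": "ALARM"},  # assumed alarm shape
    "alarm_state_transitions": [
        {"old_state": "OK", "new_state": "ALARM", "timestamp": "2015-04-24T15:00:00Z"}
    ],
    "state": "ACKNOWLEDGED",
    "created_timestamp": "2015-04-24T15:00:01Z",
    "state_updated_timestamp": "2015-04-24T15:10:00Z",
    "comments": ["comment-1"],
    "assignments": [
        {"assigner": "bob", "assignee": "alice", "timestamp": "2015-04-24T15:05:00Z"}
    ],
    "acknowledgments": [
        {"user": "alice", "timestamp": "2015-04-24T15:10:00Z"}
    ],
    "actions": ["notification-method-1"],
}

# Serialize the incident the way an API response body might carry it.
encoded = json.dumps(incident, indent=2)
```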
 
 
== Comments API ==
 
* GET /v2.0/comments
 
** Query parameters
 
*** incident_id (string, optional) - Filter comments by incident ID.
 
*** user_id (string, optional) - Filter comments by user ID.
 
*** alarm_id (string, optional) - Filter comments by alarm ID.
 
* GET /v2.0/comments/{comment-id}
 
* POST /v2.0/comments
 
 
=== Comment Object ===
 
* id
 
* incident_id
 
* created_timestamp
 
* comment
 
* user_id (string, required)
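
Creating a comment could look like the sketch below. The POST body is not specified on this page, so sending <code>incident_id</code>, <code>user_id</code> and <code>comment</code> (with <code>id</code> and <code>created_timestamp</code> left to the server) is an assumption. The helper name and base URL are hypothetical.

```python
import json
from urllib.request import Request


def build_comment_request(
    base_url: str, incident_id: str, user_id: str, text: str, token: str
) -> Request:
    """Build (but do not send) a POST /v2.0/comments request."""
    # Assumed body: the client-supplied fields of the Comment object.
    body = {"incident_id": incident_id, "user_id": user_id, "comment": text}
    return Request(
        base_url.rstrip("/") + "/v2.0/comments",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )
```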
 
 
== Architecture ==
 
* Monasca Incident Manager
 
** Provides an API that enables the following:
 
*** Incidents: Query and update incidents, such as updating the state of incidents.
 
*** Comments: Create and query comments
 
** Consumes alarm state transition events from the Message Queue alarm state transition events topic.
 
** Creates incidents in the MySQL database based on the rules listed above
 
** Publishes incident events to the incident events topic in the Message Queue, which are consumed by the Notification Engine and can potentially result in notifications being sent.
 
** MIM will be implemented in Python based on the Falcon WSGI framework, similar to how the Monasca Python API is implemented.
 
** The MIM is a stand-alone system that doesn't have any dependencies on Monasca. It gets registered in the Keystone catalog as a separate service.
 
* Python Monasca Incident Manager Client
 
** A Python client and library for interacting with the Monasca Incident Manager API.
 
* MySQL
 
** Schemas
 
*** Incidents
 
**** id: The ID of the incident.
 
**** tenant_id
 
**** name: The name of the incident.
 
**** description: The description of the incident.
 
**** alarm_id
 
**** alarm_state_transitions: [{alarm_state_transition}]
 
**** state: OPEN, ASSIGNED, ACKNOWLEDGED, RESOLVED
 
**** created_timestamp: The timestamp when the incident was created.
 
**** state_updated_timestamp: The timestamp when the incident was last updated.
 
*** IncidentAcknowledgments
 
**** id
 
**** incident_id
 
**** state
 
**** user_id
 
**** timestamp
 
*** IncidentAssignments
 
**** id
 
**** incident_id
 
**** assigner_id
 
**** assignee_id
 
**** timestamp
 
*** Comments
 
**** id
 
**** incident_id
 
**** user_id
 
**** timestamp
 
**** comment_text
 
*** IncidentActions
 
**** id
 
**** incident_id
 
**** action
 
*** IncidentAlarmHistory
 
**** ?
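
Part of the schema above could be sketched as follows. To keep the example self-contained it uses Python's built-in <code>sqlite3</code> module with an in-memory database rather than MySQL. The column types, constraints and snake_case table names are all assumptions, and only three of the listed tables are shown. IncidentAlarmHistory is omitted because its columns are still undefined.

```python
import sqlite3

# Assumed DDL; an actual MySQL schema would use MySQL types and options.
SCHEMA = """
CREATE TABLE incidents (
    id TEXT PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    name TEXT,
    description TEXT,
    alarm_id TEXT NOT NULL,
    state TEXT NOT NULL
        CHECK (state IN ('OPEN', 'ASSIGNED', 'ACKNOWLEDGED', 'RESOLVED')),
    created_timestamp TEXT NOT NULL,
    state_updated_timestamp TEXT NOT NULL
);
CREATE TABLE incident_assignments (
    id INTEGER PRIMARY KEY,
    incident_id TEXT NOT NULL REFERENCES incidents(id),
    assigner_id TEXT NOT NULL,
    assignee_id TEXT NOT NULL,
    timestamp TEXT NOT NULL
);
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    incident_id TEXT NOT NULL REFERENCES incidents(id),
    user_id TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    comment_text TEXT NOT NULL
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the sketched tables on the given connection."""
    conn.executescript(SCHEMA)
```

Storing assignments and comments in their own tables, keyed by <code>incident_id</code>, keeps the full history queryable without rewriting the incident row.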
 
 
== Comparison to PagerDuty ==
 
This section describes some of the similarities and differences with PagerDuty.
 
 
Monasca Incident Manager does not support the following:
 
 
* Automatic assignment of incidents to a user based on a service team, on-call schedule and escalation policy. In the initial release the MIM does not support these concepts. As a result, incidents must be manually assigned to users.
 
 
== Issues ==
 
# How to assign actions to incidents when a new incident is created? It seems like some notifications should be sent based on what team that the incident is for and who is on-call within that team. So, while it seems necessary to have notification methods for incidents so that users can subscribe to updates on an incident, it also seems like it is necessary to have some global notifications be automatically sent to users or teams based on some policies around the mapping of alarms to users/teams and the schedule for the members of the team.
 
# Should alarm IDs map to incidents directly, or should there be a level of indirection between an incident ID and an alarm ID? In PagerDuty you create an incident and get a response that has the incident ID, which the client should store. On subsequent events, the same incident ID can be provided for the same alarm. If the incident has been resolved, a new incident is created and a new incident ID is returned. If the incident has not been resolved, the event is added to the incident. In PagerDuty the responsibility is on the client to manage the incident IDs associated with an alarm, such that on subsequent alarm events the previous incident ID can be provided. What is described here is that the Incident Manager creates a new incident when an alarm occurs, if the incident tracking the alarm has already been resolved. From the perspective of the Threshold Engine this is a nice approach, as the Threshold Engine just publishes alarm state transition events to the Message Queue and there isn't a way for the Threshold Engine to get responses back from the Incident Manager. Another possibility is to not model alarms in the Incidents API as an explicit field, but to have a field that can be used to store a dictionary. The dictionary would be used to store an alarm ID, and this is what the Message Queue consumer would use. However, if a POST API were added for incidents, clients could supply any fields they want.
 
# Escalation policies and schedules: PagerDuty has the concept of schedules and escalation policies. This is out of scope for the initial release of MIM.
 
# Teams and groups: PagerDuty has the ability to assign incidents to teams, groups or individuals with schedules and escalation policies. This is out of scope for the initial release of MIM. However, we want to ensure that we can enhance the system to do this.
 
# Maintenance schedules: PagerDuty has the concept of maintenance schedules. This is out of scope for the initial release of MIM.
 
# There are a number of interesting capabilities in PagerDuty. One is the idea of creating incidents for service teams that have an escalation policy, which assigns the incident to a member of the service team that is on-call based on a schedule and escalation policy. Currently, what is described above doesn't have these capabilities. Alarm state transitions are created that result in incidents, but how can incidents be associated with service teams, let alone basing that assignment on a schedule or escalation policy? One problem with the above design is that alarms are consumed by the Incident Manager, but we don't know which service team they should be associated with. There are several ways of addressing this. One is to use webhooks similar to PagerDuty. Application IDs would be created in the Incident Manager for service teams and then alarm notification methods could be associated with services. However, we were trying to avoid this overhead and management. Another idea is to use metric names and dimensions. Normally, metric names and dimensions are used to describe metrics. For example, a dimension of service is used to associate metrics with a specific service. If we add a calendar and escalation policy to the Incident Manager, one way to describe the association with service teams is to allow service teams to be created in the Incident Manager and then describe the metrics that they are interested in. This seems like a more powerful construct than the mechanism used in PagerDuty.
 
# Can you reopen an incident after it has been acknowledged or resolved? This could potentially be a useful capability.
 
# If we do support re-opening incidents should we add a new state for REOPENED?
 
# Should the comments resource be moved below the incidents resources as in /v2.0/incidents/comments?
 
# Consider changing IncidentAlarmHistory to AlarmStateHistory.
 

Latest revision as of 15:14, 24 April 2015