
Difference between revisions of "Monasca/Incident Manager"

== Use Cases ==
 
# Create a new incident
 
# Display all incidents in UI
 
# Display all open, acknowledged or resolved incidents in UI
 
# Display all open, acknowledged or resolved incidents assigned to a user in UI
 
# Acknowledge an incident in UI
 
# Resolve an incident in UI
 
# Comment on an incident
 
# Assign an incident
 
  
== Concepts ==
 
* Incidents
 
** Incidents are resources that are created when an alarm transitions to the ALARM or UNDETERMINED state.
 
** Incidents are associated with an alarm.
 
** Incidents enable alarms to
 
*** Track status
 
*** Be assigned to users
 
*** Be commented on by users
 
** There are three statuses of an incident
 
*** OPEN: When an incident is created it is in the OPEN state.
 
*** ACKNOWLEDGED: When an incident is being worked on it is ACKNOWLEDGED.
 
*** RESOLVED: When an incident is closed it is RESOLVED.
 
** Some of the concepts around incidents are "borrowed" from PagerDuty. See https://developer.pagerduty.com/documentation/rest/incidents.
 
* Alarm
 
** There are three states of an alarm
 
*** OK
 
*** ALARM
 
*** UNDETERMINED
 
** The state of an alarm is controlled by the Threshold Engine unless it is explicitly set via the Monasca API.
 
* Alarm state transition event
 
** An event that is created by the Threshold Engine when the alarm transitions state.
 
* Assignment/Owner
 
** The user that the incident is assigned to.
 
* Comment
 
** Comments are resources that are created when a user comments on an incident.
 
* Actions
 
** Similar to actions for alarm definitions in Monasca, incidents can also have actions which occur when an incident is modified.
 
 
== Incident Lifecycle ==
 
This section describes the lifecycle of an incident, which includes creating incidents, handling alarm state transitions, updating the status of incidents, assigning incidents and commenting on incidents.
 
 
=== Alarm states ===
 
Alarm state transition events are created by the Threshold Engine and are processed as follows:
 
 
# To ALARM
 
## Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
 
### If an incident doesn't exist for the alarm, or the status of the incident has been RESOLVED, a new incident is created with the incident status as OPEN.
 
### If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
 
# To OK
 
## Add an alarm state transition event to an existing incident.
 
### If an incident doesn't exist for the alarm, or the status of the incident has been RESOLVED, nothing is done.
 
### If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
 
# To UNDETERMINED
 
## Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
 
### If an incident doesn't exist for the alarm, or the status of the incident has been RESOLVED, a new incident is created with the incident status as OPEN.
 
### If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
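The rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Incident Manager implementation: the <code>Incident</code> and <code>IncidentStore</code> names, the in-memory dictionary, and the choice to record the triggering event on a newly created incident are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Statuses for which an existing incident keeps collecting events.
ACTIVE_STATUSES = {"OPEN", "ACKNOWLEDGED"}


@dataclass
class Incident:
    """Hypothetical, simplified incident record."""
    alarm_id: str
    status: str = "OPEN"
    alarm_state_transitions: List[str] = field(default_factory=list)


class IncidentStore:
    """Hypothetical in-memory store holding the latest incident per alarm."""

    def __init__(self) -> None:
        self.latest: Dict[str, Incident] = {}

    def handle_transition(self, alarm_id: str, new_state: str) -> Optional[Incident]:
        incident = self.latest.get(alarm_id)

        # An OPEN or ACKNOWLEDGED incident receives the event; status is unchanged.
        if incident is not None and incident.status in ACTIVE_STATUSES:
            incident.alarm_state_transitions.append(new_state)
            return incident

        # No incident, or the last one is RESOLVED: ALARM and UNDETERMINED open a new one.
        if new_state in ("ALARM", "UNDETERMINED"):
            incident = Incident(alarm_id=alarm_id)
            # Assumption: the triggering event is recorded on the new incident.
            incident.alarm_state_transitions.append(new_state)
            self.latest[alarm_id] = incident
            return incident

        # A transition to OK with no active incident is ignored.
        return None
```

Under these assumptions, a transition to OK never creates an incident, while a transition to ALARM or UNDETERMINED after a RESOLVED incident always yields a fresh OPEN incident.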
 
 
=== Incident status ===
 
The Incident status is modified via the Incident Manager API and processed as follows:
 
 
# To ACKNOWLEDGED
 
## Modify the incident to ACKNOWLEDGED.
 
## Publish an incident status event to Kafka, which is processed by the Notification Engine.
 
## If an incident is acknowledged, it won't generate any additional notifications, even if it receives new alarm state transition events.
 
 
# To RESOLVED
 
## Modify the incident to RESOLVED.
 
## Publish an incident status event to Kafka, which is processed by the Notification Engine.
 
## If an incident is resolved, it won't generate any additional notifications.
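A minimal sketch of the status update flow follows. It assumes the incident is a plain dictionary and that <code>publish</code> is any callable standing in for a Kafka producer; the function names and the event fields are hypothetical. It also assumes that only OPEN incidents notify on new alarm state transition events, which is implied but not stated above.

```python
from typing import Callable, Dict

Event = Dict[str, str]


def update_status(incident: Dict[str, str], new_status: str,
                  publish: Callable[[Event], None]) -> None:
    """Set the incident status and publish an incident status event."""
    # Only the two status changes described above are accepted in this sketch.
    if new_status not in ("ACKNOWLEDGED", "RESOLVED"):
        raise ValueError("unsupported status: " + new_status)
    incident["status"] = new_status
    # 'publish' stands in for a Kafka producer; the event shape is an assumption.
    publish({"incident_id": incident["id"], "status": new_status})


def notifies_on_alarm_transition(incident: Dict[str, str]) -> bool:
    """ACKNOWLEDGED and RESOLVED incidents generate no further notifications."""
    return incident["status"] == "OPEN"
```

Passing <code>events.append</code> of a plain list as <code>publish</code> is enough to try the flow without a message queue.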
 
 
=== Assign or reassign incident ===
 
The assignment/reassignment of an incident is done via the Incidents API and is processed as follows:
 
 
# When an incident is created it is initially unassigned. It can then be assigned or reassigned later.
 
# Publish an incident assignment/reassignment event to the Message Queue which is processed by the Notification Engine.
 
 
=== Comment on incident ===
 
Comments can be created via the Incidents API and are processed as follows:
 
 
# When a comment is added to an incident, the comment is stored.
 
# An incident comment event is published to the Message Queue and then processed by the Notification Engine.
 
 
== Incidents ==
 
* GET /v2.0/incidents/
 
** Query parameters
 
*** status
 
*** state
 
*** assigned_to
 
*** acknowledged_by
 
*** create_start_time
 
*** status_update_start_time
 
* GET /v2.0/incidents/{incident-id}
 
* PATCH /v2.0/incidents/{incident-id}: Update an incident, such as modifying the status to ACKNOWLEDGED or RESOLVED.
 
* GET /v2.0/incidents/history: Get the history of all incidents filtering on the supplied query parameters.
 
** Query parameters
 
*** status (string, optional)
 
*** state (string, optional)
 
*** created_timestamp (string, optional)
 
* GET /v2.0/incidents/{incident-id}/history/: Get the history of a specific incident
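As an illustration of the list query, the sketch below builds a URL for <code>GET /v2.0/incidents/</code> from the query parameters listed above. The helper name and the base URL used in the usage note are hypothetical, and the parameter value formats are not defined by this page.

```python
from urllib.parse import urlencode

# Query parameters listed for GET /v2.0/incidents/.
ALLOWED_FILTERS = {
    "status", "state", "assigned_to", "acknowledged_by",
    "create_start_time", "status_update_start_time",
}


def incidents_url(base_url: str, **filters: str) -> str:
    """Build the incident list URL; rejects parameters not listed above."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError("unknown query parameters: " + ", ".join(sorted(unknown)))
    url = base_url.rstrip("/") + "/v2.0/incidents/"
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(filters.items()))
    return url + "?" + query if query else url
```

For example, <code>incidents_url("http://localhost:8080", status="OPEN", assigned_to="user-1")</code> yields a URL ending in <code>/v2.0/incidents/?assigned_to=user-1&status=OPEN</code>; the host and port here are placeholders.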
 
 
== Incident Response Object ==
 
* id: The ID of the incident.
 
* name: The name of the incident.
 
* description: The description of the incident.
 
* alarm: {alarm} 
 
* alarm_state_transitions: [{alarm_state_transition}]
 
* status: OPEN, ACKNOWLEDGED, RESOLVED
 
* created_timestamp: The timestamp when the incident was created.
 
* status_updated_timestamp: The timestamp when the status of the incident was last updated.
 
* comments: [comment-id]: An array of comments for the incident.
 
* assignments: [{Assignment}]: The user ID and timestamp that the incident was assigned.
 
* acknowledgments: [{Acknowledgment}]: The user ID and timestamp that acknowledged the incident. 
 
* actions: [{notification-method}]: Array of notification method IDs that are invoked when the incident is modified in any way.
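For orientation, an incident response might look like the following when serialized. Every value is made up, and the shapes of the nested alarm, alarm state transition, assignment and acknowledgment objects (including key names such as <code>user_id</code> and <code>timestamp</code>) are assumptions, since this page does not define them.

```python
import json

# Hypothetical example; only the top-level field names come from the list above.
example_incident = {
    "id": "incident-1",
    "name": "High CPU on host-1",
    "description": "CPU usage exceeded the alarm threshold.",
    "alarm": {"id": "alarm-1", "state": "ALARM"},  # nested shape assumed
    "alarm_state_transitions": [
        {"state": "ALARM", "timestamp": "2015-04-24T15:00:00Z"},  # shape assumed
    ],
    "status": "OPEN",  # one of OPEN, ACKNOWLEDGED, RESOLVED
    "created_timestamp": "2015-04-24T15:00:00Z",
    "status_updated_timestamp": "2015-04-24T15:00:00Z",
    "comments": ["comment-1"],
    "assignments": [
        {"user_id": "user-1", "timestamp": "2015-04-24T15:05:00Z"},  # shape assumed
    ],
    "acknowledgments": [],  # empty until a user acknowledges the incident
    "actions": ["notification-method-1"],
}

print(json.dumps(example_incident, indent=2))
```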
 
 
== Comments ==
 
* GET /v2.0/comments
 
** Query parameters
 
*** incident_id (string, optional)
 
* GET /v2.0/comments/{comment-id}
 
* POST /v2.0/comments
 
 
=== Comment Object ===
 
* id
 
* incident_id
 
* created_timestamp
 
* comment
 
* user-id (string, required)
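A request body for <code>POST /v2.0/comments</code> could be assembled as sketched below. The helper name is hypothetical, and the sketch assumes that <code>id</code> and <code>created_timestamp</code> are assigned by the server rather than supplied by the client, which this page does not state.

```python
import json


def new_comment_body(incident_id: str, user_id: str, comment: str) -> str:
    """Serialize a hypothetical request body for creating a comment."""
    # user-id is marked as required in the comment object above.
    if not user_id:
        raise ValueError("user-id is required")
    body = {
        "incident_id": incident_id,
        "user-id": user_id,
        "comment": comment,
        # Assumption: id and created_timestamp are filled in by the server.
    }
    return json.dumps(body, sort_keys=True)
```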
 
 
== Architecture ==
 
* Monasca Incident Manager
 
** Provides an API that enables the following:
 
*** Query and update incidents, such as updating the status of incidents.
 
*** Create and query comments
 
** Consumes alarm state transition events from the Kafka alarm state transition events topic.
 
** Creates incidents in the MySQL database based on the rules listed above
 
** Publishes incident events to the incident events topic in Kafka, which are consumed by the Notification Engine and can potentially result in notifications being sent.
 
* MySQL
 
** Schemas
 
*** Incidents
 
*** Comments
 
 
== Issues ==
 
# How to assign actions when a new incident is created?
 
# Should alarm IDs map to incidents directly, or should there be a level of indirection between an incident ID and an alarm ID? In PagerDuty you create an incident and get a response that has the incident ID, which the client should store. On subsequent events, the same incident ID can be provided for the same alarm. If the incident has been resolved, a new incident is created and a new incident ID is returned. If the incident has not been resolved, the event is added to the incident. In PagerDuty the responsibility is on the client to manage the incident IDs associated with an alarm, so that the incident ID can be provided on subsequent alarm events. What is described here instead is that the Incident Manager itself creates a new incident when an alarm event occurs and the incident tracking the alarm has already been resolved.
 
# Teams and Groups. PagerDuty has the ability to assign incidents to teams, groups or individuals with escalation policies.
 
# Maintenance Schedules
 
# Should incidents be unassigned when created, or assigned to a user based on an "escalation" policy?
 

Latest revision as of 15:14, 24 April 2015