
Monasca/Incident Manager

== Use Cases ==
 
# Create a new incident
 
# Display all incidents in UI
 
# Display all open, acknowledged or resolved incidents in UI
 
# Display all open, acknowledged or resolved incidents assigned to a user in UI
 
# Acknowledge an incident in UI
 
# Resolve an incident in UI
 
  
== Concepts ==
 
* Incidents
 
** Incidents are created when an alarm transitions to the ALARM or UNDETERMINED state.
 
** Incidents are associated with an alarm.
 
** Incidents enable alarms to:

*** Have their status tracked

*** Be assigned to users

*** Be commented on by users
 
** An incident has one of three statuses:
 
*** OPEN: When an incident is created it is in the OPEN state.
 
*** ACKNOWLEDGED: When an incident is being worked on it is ACKNOWLEDGED.
 
*** RESOLVED: When an incident is closed, it is RESOLVED.
 
** Some of the concepts around incidents are "borrowed" from PagerDuty. See https://developer.pagerduty.com/documentation/rest/incidents.
 
* Alarm
 
** An alarm has one of three states:
 
*** OK
 
*** ALARM
 
*** UNDETERMINED
 
** The state of an alarm is controlled by the Threshold Engine unless it is explicitly set via the Monasca API.
 
* Alarm state transition event
 
** An event that is created by the Threshold Engine when the alarm transitions state.
 
* Assignment/Owner
 
** The user that the incident is assigned to.
 
* Comment
 
** A comment on an incident.
 
* Actions
 
** Similar to alarm definition actions in Monasca, incidents can also have actions which occur when an incident is modified.
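
The alarm states and incident statuses listed above could be modelled as enumerations. The following is only a sketch: the class names, the constant <code>INCIDENT_TRIGGER_STATES</code> and the helper <code>triggers_incident</code> are hypothetical and do not come from Monasca itself.

```python
from enum import Enum


class AlarmState(Enum):
    """The three alarm states listed under Concepts."""
    OK = "OK"
    ALARM = "ALARM"
    UNDETERMINED = "UNDETERMINED"


class IncidentStatus(Enum):
    """The three incident statuses listed under Concepts."""
    OPEN = "OPEN"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    RESOLVED = "RESOLVED"


# Alarm states whose transitions cause an incident to be created.
INCIDENT_TRIGGER_STATES = frozenset({AlarmState.ALARM, AlarmState.UNDETERMINED})


def triggers_incident(state: AlarmState) -> bool:
    """Return True if a transition to `state` can create an incident."""
    return state in INCIDENT_TRIGGER_STATES
```

Using string-valued members keeps the enumerations directly comparable with the status and state values that appear in API payloads.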
 
 
== Incident Lifecycle ==
 
This section describes the lifecycle of an incident.
 
 
=== Alarm states ===

Alarm state transitions are processed as follows:
 
 
# To ALARM
 
## Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
 
### If no incident exists for the alarm, or the existing incident is RESOLVED, a new incident is created with the status OPEN.
 
### If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
 
# To OK
 
## Add an alarm state transition event to an existing incident.

### If no incident exists for the alarm, or the existing incident is RESOLVED, nothing is done.
 
### If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
 
# To UNDETERMINED
 
## Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.

### If no incident exists for the alarm, or the existing incident is RESOLVED, a new incident is created with the status OPEN.
 
### If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
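
The transition rules above can be sketched as a single function. This is an illustration under assumptions, not the engine's actual code: the <code>Incident</code> class, the in-memory dictionary keyed by alarm ID and the name <code>handle_transition</code> are hypothetical, and recording the triggering event on a newly created incident is an assumption the rules do not state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Statuses for which an incident still accepts new transition events.
ACTIVE_STATUSES = {"OPEN", "ACKNOWLEDGED"}


@dataclass
class Incident:
    alarm_id: str
    status: str = "OPEN"
    alarm_state_transitions: List[str] = field(default_factory=list)


def handle_transition(
    latest: Dict[str, Incident], alarm_id: str, new_state: str
) -> Optional[Incident]:
    """Apply one alarm state transition and return the affected incident, if any.

    `latest` maps an alarm ID to the most recent incident for that alarm.
    """
    incident = latest.get(alarm_id)
    if incident is not None and incident.status in ACTIVE_STATUSES:
        # OPEN or ACKNOWLEDGED incident: record the event, keep the status.
        incident.alarm_state_transitions.append(new_state)
        return incident
    if new_state in ("ALARM", "UNDETERMINED"):
        # No incident, or the last one is RESOLVED: open a new incident.
        # Assumption: the triggering event is stored on the new incident.
        incident = Incident(alarm_id=alarm_id, alarm_state_transitions=[new_state])
        latest[alarm_id] = incident
        return incident
    # Transition to OK without an active incident: nothing is done.
    return None
```

Checking for an active incident first lets the three cases (ALARM, OK, UNDETERMINED) share one code path, since they differ only in what happens when no active incident exists.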
 
 
=== Acknowledge incident ===
 
# Modify the incident to ACKNOWLEDGED.
 
# If an incident is acknowledged, it won't generate any additional notifications, even if it receives new alarm state transition events.
 
 
=== Resolve incident ===
 
# Modify the incident to RESOLVED.
 
# If an incident is resolved, it won't generate any additional notifications.
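
A minimal sketch of the acknowledge and resolve steps and their effect on notifications follows. The function names and the dictionary representation of an incident are hypothetical; the sketch only covers notifications triggered by new alarm state transition events, and the ISO 8601 timestamp format is an assumption.

```python
from datetime import datetime, timezone


def set_status(incident: dict, new_status: str) -> dict:
    """Move an incident to ACKNOWLEDGED or RESOLVED and stamp the update time."""
    if new_status not in ("ACKNOWLEDGED", "RESOLVED"):
        raise ValueError(f"unsupported status: {new_status}")
    incident["status"] = new_status
    # Assumption: timestamps are stored as ISO 8601 strings in UTC.
    incident["status_updated_timestamp"] = datetime.now(timezone.utc).isoformat()
    return incident


def notifies_on_new_event(incident: dict) -> bool:
    """Only OPEN incidents generate notifications for new transition events."""
    return incident["status"] == "OPEN"
```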
 
 
=== Assign or reassign incident ===
 
Assignment and reassignment of incidents are processed as follows:
 
# When an incident is created it is initially unassigned. It can then be assigned or reassigned later.
 
 
== Incidents ==
 
* GET /v2.0/incidents/
 
** Query parameters
 
*** status
 
*** state
 
*** assigned_to
 
*** acknowledged_by
 
*** create_start_time
 
*** status_update_start_time
 
* GET /v2.0/incidents/{incident-id}
 
* PATCH /v2.0/incidents/{incident-id}: Update an incident, such as modifying the status to ACKNOWLEDGED or RESOLVED.
 
* GET /v2.0/incidents/history: Get the history of all incidents, filtered by the supplied query parameters.
 
** Query parameters
 
*** status (string, optional)
 
*** state (string, optional)
 
*** created_timestamp (string, optional)
 
* GET /v2.0/incidents/{incident-id}/history/: Get the history of a specific incident
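
As an illustration of how a client might address these endpoints, the sketch below builds a filtered list URL and a PATCH request using only the Python standard library. The host name, the helper names and the JSON body of the PATCH request are assumptions; the page does not define the request body format, and no request is actually sent.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Hypothetical endpoint; replace with the address of a real deployment.
BASE_URL = "http://monasca.example.com:8070"


def incidents_url(base_url: str, **filters: str) -> str:
    """Build the GET /v2.0/incidents/ URL with optional query parameters."""
    query = urlencode({k: v for k, v in sorted(filters.items()) if v is not None})
    url = urljoin(base_url, "/v2.0/incidents/")
    return f"{url}?{query}" if query else url


def acknowledge_request(base_url: str, incident_id: str) -> Request:
    """Prepare (but do not send) a PATCH that sets the status to ACKNOWLEDGED."""
    # Assumption: the PATCH body is a JSON object containing the new status.
    body = json.dumps({"status": "ACKNOWLEDGED"}).encode("utf-8")
    return Request(
        urljoin(base_url, f"/v2.0/incidents/{incident_id}"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

For example, <code>incidents_url(BASE_URL, status="OPEN", assigned_to="user-1")</code> yields a URL that lists the open incidents assigned to one user, which corresponds to the fourth use case.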
 
 
== Incident Response Object ==
 
* id: The ID of the incident.
 
* name: The name of the incident.
 
* description: The description of the incident.
 
* alarm: {alarm} 
 
* alarm_state_transitions: [{alarm_state_transition}]
 
* status: OPEN, ACKNOWLEDGED, RESOLVED
 
* created_timestamp: The timestamp when the incident was created.
 
* status_updated_timestamp: The timestamp when the status of the incident was last updated.

* comments: [comment-id]: An array of comment IDs for the incident.

* assignments: [{Assignment}]: The ID of each user the incident was assigned to and the timestamp of the assignment.

* acknowledgments: [{Acknowledgment}]: The ID of each user that acknowledged the incident and the timestamp of the acknowledgment.
 
* actions: [{notification-method}]: Array of notification method IDs that are invoked when the incident is modified in any way.
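
To make the shape of the response object concrete, the following sketch shows one possible incident as a Python dictionary, together with a small completeness check. All values are invented for illustration, and the inner keys of the <code>alarm</code>, <code>alarm_state_transitions</code>, <code>assignments</code> and <code>acknowledgments</code> entries are assumptions, since the page names only the top-level fields.

```python
import json

# Top-level fields of the incident response object, as listed above.
INCIDENT_FIELDS = (
    "id", "name", "description", "alarm", "alarm_state_transitions",
    "status", "created_timestamp", "status_updated_timestamp",
    "comments", "assignments", "acknowledgments", "actions",
)

# Hypothetical example; every value and nested key is illustrative only.
example_incident = {
    "id": "incident-1",
    "name": "High CPU on host-01",
    "description": "CPU usage exceeded the alarm threshold.",
    "alarm": {"id": "alarm-1", "state": "ALARM"},
    "alarm_state_transitions": [
        {"old_state": "OK", "new_state": "ALARM",
         "timestamp": "2015-04-24T15:00:00Z"},
    ],
    "status": "ACKNOWLEDGED",
    "created_timestamp": "2015-04-24T15:00:01Z",
    "status_updated_timestamp": "2015-04-24T15:05:00Z",
    "comments": ["comment-1"],
    "assignments": [{"user_id": "user-1", "timestamp": "2015-04-24T15:02:00Z"}],
    "acknowledgments": [{"user_id": "user-1", "timestamp": "2015-04-24T15:05:00Z"}],
    "actions": ["notification-method-1"],
}


def missing_fields(incident: dict) -> list:
    """Return the listed top-level fields that are absent from `incident`."""
    return [name for name in INCIDENT_FIELDS if name not in incident]
```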
 
 
== Comments ==
 
* GET /v2.0/comments
 
** Query parameters
 
*** incident_id (string, optional)
 
* GET /v2.0/comments/{comment-id}
 
* POST /v2.0/comments
 
 
=== Comment Object ===
 
* id
 
* incident_id
 
* created_timestamp
 
* comment
 
* user-id (string, required)
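
A request body for POST /v2.0/comments might be assembled as sketched below. The helper name is hypothetical, and it is an assumption that <code>id</code> and <code>created_timestamp</code> are assigned by the server rather than supplied by the client.

```python
import json


def new_comment_body(incident_id: str, user_id: str, comment: str) -> str:
    """Serialize the client-supplied fields of a comment as JSON."""
    if not user_id:
        # user-id is marked as required in the comment object.
        raise ValueError("user-id is required")
    return json.dumps(
        {"incident_id": incident_id, "user-id": user_id, "comment": comment},
        sort_keys=True,
    )
```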
 
 
== Architecture ==
 
* Monasca Incident Management API
 
** Query and update incidents
 
** Create and query comments
 
* Monasca Incident Management Engine
 
** Consumes alarm state transition events from the Kafka alarm state transition events topic.
 
** Creates incidents in the MySQL database based on the rules listed above under Incident Lifecycle.

** Publishes incident transition events to the incident transition events topic in Kafka, where they are consumed by the Notification Engine.
 
* MySQL
 
** Schemas
 
*** Incidents
 
*** Comments
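
The data flow through the Incident Management Engine can be sketched without any Kafka or MySQL dependency by passing the consumer, the rule logic and the publisher in as plain Python objects. Everything here is hypothetical: <code>run_engine</code>, the event field names and the callables stand in for the real Kafka consumer, the incident rules and the Kafka producer.

```python
from typing import Callable, Iterable, Optional


def run_engine(
    events: Iterable[dict],
    apply_rules: Callable[[dict], Optional[dict]],
    publish: Callable[[dict], None],
) -> int:
    """Consume alarm state transition events and publish incident events.

    `events` stands in for the alarm state transition events topic,
    `apply_rules` for the incident lifecycle rules (including persistence),
    and `publish` for the incident transition events topic.
    Returns the number of incident transition events published.
    """
    published = 0
    for event in events:
        incident_event = apply_rules(event)
        if incident_event is not None:
            publish(incident_event)
            published += 1
    return published
```

Injecting the three collaborators keeps the loop testable with in-memory lists, while a deployment would substitute real Kafka and database clients.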
 
 
== Issues ==
 
# How should actions be assigned when a new incident is created?

# Should alarm IDs map to incidents directly, or should there be a level of indirection between an incident ID and an alarm ID?
 

Latest revision as of 15:14, 24 April 2015