|
|
(140 intermediate revisions by the same user not shown) |
− | == Use Cases ==
| |
− | # Create a new incident
| |
− | # Display all incidents in Ops Console
| |
− | # Display all open, acknowledged or resolved incidents in Ops Console
| |
− | # Display all open, acknowledged or resolved incidents assigned to a user in Ops Console
| |
− | # Acknowledge an incident in Ops Console
| |
− | # Resolve an incident in Ops Console
| |
| | | |
− | == Concepts ==
| |
− | * Incidents
| |
− | ** An incident is created when an alarm transitions to the ALARM or UNDETERMINED state, and it is associated with that alarm.
| |
− | ** Incidents enable alarms to:
| |
− | *** Have their status tracked
| |
− | *** Be assigned to users
| |
− | *** Be commented on by users
| |
− | ** An incident has one of three statuses:
| |
− | *** OPEN: A newly created incident has the OPEN status.
| |
− | *** ACKNOWLEDGED: An incident that is being worked on has the ACKNOWLEDGED status.
| |
− | *** RESOLVED: An incident that has been closed has the RESOLVED status.
| |
− | ** Some of the concepts around incidents are "borrowed" from PagerDuty. See https://developer.pagerduty.com/documentation/rest/incidents.
| |
− | * Alarm
| |
− | ** There are three states of an alarm
| |
− | *** OK
| |
− | *** ALARM
| |
− | *** UNDETERMINED
| |
− | * Alarm state transition event
| |
− | ** An event that is created by the Threshold Engine when the alarm transitions state.
| |
− | * Assignment/Owner
| |
− | ** The user that the incident is assigned to.
| |
− | * Comment
| |
− | ** A comment on an incident.
| |
− | * Actions
| |
− | ** Similar to alarm definition actions in Monasca, incidents can also have actions which occur when an incident is modified.
| |
− |
| |
− | == Incident Lifecycle ==
| |
− | This section describes the lifecycle of an incident.
| |
− |
| |
− | Alarm state transition events are processed as follows:
| |
− | # To ALARM
| |
− | ## Opens a new incident for the supplied alarm, or adds an alarm state transition event to an existing incident.
| |
− | ### If no incident exists for the alarm, or the status of the existing incident is RESOLVED, a new incident is created with the status OPEN.
| |
− | ### If an incident with a status of OPEN or ACKNOWLEDGED exists for the alarm, the alarm state transition event is added to that incident, and the status is not modified.
| |
− | # To OK
| |
− | ## Adds an alarm state transition event to an existing incident, if one with a status of OPEN or ACKNOWLEDGED exists.
| |
− | ### If no incident exists for the alarm, or the status of the existing incident is RESOLVED, nothing is done.
| |
− | ### If an incident with a status of OPEN or ACKNOWLEDGED exists for the alarm, the alarm state transition event is added to that incident, and the status is not modified.
| |
− | # To UNDETERMINED
| |
− | ## Opens a new incident for the supplied alarm, or adds an alarm state transition event to an existing incident.
| |
− | ### If no incident exists for the alarm, or the status of the existing incident is RESOLVED, a new incident is created with the status OPEN.
| |
− | ### If an incident with a status of OPEN or ACKNOWLEDGED exists for the alarm, the alarm state transition event is added to that incident, and the status is not modified.
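The transition rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Incident` class, the `process_transition` function and the in-memory `latest` dictionary are hypothetical names, not part of the proposed design, and a real engine would persist incidents rather than hold them in a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ALARM_STATES = ("OK", "ALARM", "UNDETERMINED")


@dataclass
class Incident:
    # Hypothetical in-memory stand-in for a stored incident.
    alarm_id: str
    status: str = "OPEN"  # OPEN, ACKNOWLEDGED or RESOLVED
    alarm_state_transitions: List[str] = field(default_factory=list)


def process_transition(latest: Dict[str, Incident], alarm_id: str,
                       new_state: str) -> Optional[Incident]:
    """Apply the rules above; `latest` maps an alarm ID to its newest incident."""
    if new_state not in ALARM_STATES:
        raise ValueError(f"unknown alarm state: {new_state}")
    current = latest.get(alarm_id)
    if current is not None and current.status != "RESOLVED":
        # OPEN or ACKNOWLEDGED: record the event, leave the status untouched.
        current.alarm_state_transitions.append(new_state)
        return current
    if new_state == "OK":
        # No incident, or the last one is RESOLVED: nothing is done.
        return None
    # ALARM or UNDETERMINED without an unresolved incident: open a new one.
    incident = Incident(alarm_id=alarm_id, alarm_state_transitions=[new_state])
    latest[alarm_id] = incident
    return incident
```

Calling `process_transition(store, "alarm-1", "ALARM")` on an empty `store` would open an incident, and a following `"OK"` event for the same alarm would be appended to it without changing its status.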
| |
− |
| |
− | Acknowledge incident
| |
− | # Modify the incident to ACKNOWLEDGED.
| |
− | # If an incident is acknowledged, it won't generate any additional notifications, even if it receives new alarm state transition events.
| |
− |
| |
− | Resolve incident
| |
− | # Modify the incident to RESOLVED.
| |
− | # If an incident is resolved, it won't generate any additional notifications.
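As a rough sketch of the acknowledge and resolve steps, the status changes and the resulting notification suppression might look as follows. The names `Incident`, `acknowledge`, `resolve` and `should_notify` are hypothetical, and the sketch assumes that only incidents with the OPEN status generate further notifications; the text above states the suppression for ACKNOWLEDGED and RESOLVED only.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical minimal incident carrying only its status.
    status: str = "OPEN"


def acknowledge(incident: Incident) -> None:
    """Modify the incident to ACKNOWLEDGED."""
    incident.status = "ACKNOWLEDGED"


def resolve(incident: Incident) -> None:
    """Modify the incident to RESOLVED."""
    incident.status = "RESOLVED"


def should_notify(incident: Incident) -> bool:
    """Assumption: only OPEN incidents generate additional notifications."""
    return incident.status == "OPEN"
```

Which status changes are permitted (for example, whether a RESOLVED incident may be acknowledged again) is not specified above, so the sketch does not validate transitions.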
| |
− |
| |
− | Assigning or reassigning an incident is processed as follows:
| |
− | # A newly created incident is unassigned. It can then be assigned, and later reassigned, to a user.
| |
− |
| |
− | == Incidents ==
| |
− | * GET /v2.0/incidents/
| |
− | ** Query parameters
| |
− | *** status
| |
− | *** state
| |
− | *** assigned_to
| |
− | *** acknowledged_by
| |
− | *** create_start_time
| |
− | *** status_update_start_time
| |
− | * GET /v2.0/incidents/{incident-id}
| |
− | * PATCH /v2.0/incidents/{incident-id}: Update an incident, such as modifying the status to ACKNOWLEDGED or RESOLVED.
| |
− | * GET /v2.0/incidents/history: Get the history of all incidents filtering on the supplied query parameters.
| |
− | ** Query parameters
| |
− | *** status (string, optional)
| |
− | *** state (string, optional)
| |
− | *** created_timestamp (string, optional)
| |
− | * GET /v2.0/incidents/{incident-id}/history/: Get the history of a specific incident
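To illustrate how a client might address these endpoints, the following sketch builds a filtered list URL and a PATCH body using only the Python standard library. The helper names and the example host are hypothetical, and the JSON shape `{"status": ...}` for the PATCH body is an assumption, since the request format is not defined above.

```python
import json
from urllib.parse import urlencode

INCIDENT_STATUSES = ("OPEN", "ACKNOWLEDGED", "RESOLVED")


def incidents_url(base_url: str, **filters: str) -> str:
    """Build a GET /v2.0/incidents/ URL from the query parameters listed above."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = f"{base_url}/v2.0/incidents/"
    return f"{url}?{query}" if query else url


def status_patch_body(status: str) -> str:
    """Build a PATCH body that changes the status (body shape is assumed)."""
    if status not in INCIDENT_STATUSES:
        raise ValueError(f"unknown incident status: {status}")
    return json.dumps({"status": status})
```

For example, `incidents_url("http://monasca.example", status="OPEN", assigned_to="user-1")` corresponds to the use case of displaying all open incidents assigned to a user.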
| |
− |
| |
− | == Incident Response Object ==
| |
− | * id: The ID of the incident.
| |
− | * name: The name of the incident.
| |
− | * description: The description of the incident.
| |
− | * alarm: {alarm}
| |
− | * alarm_state_transitions: [{alarm_state_transition}]
| |
− | * status: OPEN, ACKNOWLEDGED, RESOLVED
| |
− | * created_timestamp: The timestamp when the incident was created.
| |
− | * status_updated_timestamp: The timestamp when the status of the incident was last updated.
| |
− | * comments: [comment-id]: An array of comment IDs for the incident.
| |
− | * assignments: [{Assignment}]: The user ID and timestamp of each assignment of the incident.
| |
− | * acknowledgments: [{Acknowledgment}]: The user ID and timestamp of each acknowledgment of the incident.
| |
− | * actions: [{notification-method}]: Array of notification method IDs that are invoked when the incident is modified in any way.
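The fields above could be modelled on the client side roughly as follows. This is a sketch, not a definition of the wire format: the class name `IncidentResponse` is hypothetical, timestamps are assumed to be strings, and the contents of the `alarm`, `assignments` and `acknowledgments` objects are left as plain dictionaries because their fields are not specified here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

INCIDENT_STATUSES = ("OPEN", "ACKNOWLEDGED", "RESOLVED")


@dataclass
class IncidentResponse:
    id: str
    name: str
    description: str
    alarm: Dict[str, Any]
    status: str
    created_timestamp: str          # assumed to be a string timestamp
    status_updated_timestamp: str   # assumed to be a string timestamp
    alarm_state_transitions: List[Dict[str, Any]] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)  # comment IDs
    assignments: List[Dict[str, Any]] = field(default_factory=list)
    acknowledgments: List[Dict[str, Any]] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)   # notification method IDs

    def __post_init__(self) -> None:
        # Reject any status outside the three documented values.
        if self.status not in INCIDENT_STATUSES:
            raise ValueError(f"unknown incident status: {self.status}")
```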
| |
− |
| |
− | == Comments ==
| |
− | * GET /v2.0/comments
| |
− | ** Query parameters
| |
− | *** incident_id (string, optional)
| |
− | * GET /v2.0/comments/{comment-id}
| |
− | * POST /v2.0/comments
| |
− |
| |
− | === Comment Object ===
| |
− | * id
| |
− | * incident_id
| |
− | * created_timestamp
| |
− | * comment
| |
− | * user-id (string, required)
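A request body for POST /v2.0/comments might be assembled as in the sketch below. The function name is hypothetical, and the sketch assumes that `id` and `created_timestamp` are assigned by the server, so only `incident_id`, `user-id` and `comment` are sent; the page does not state this explicitly.

```python
import json


def new_comment_body(incident_id: str, user_id: str, comment: str) -> str:
    """Build a JSON body for POST /v2.0/comments (field selection is assumed)."""
    if not comment.strip():
        raise ValueError("comment must not be empty")
    return json.dumps({
        "incident_id": incident_id,
        "user-id": user_id,  # key spelled as in the Comment Object above
        "comment": comment,
    })
```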
| |
− |
| |
− | == Architecture ==
| |
− | * Monasca Incident Management API
| |
− | ** Query and update incidents
| |
− | ** Create and query comments
| |
− | * Monasca Incident Management Engine
| |
− | ** Consumes alarm state transition events from the Kafka alarm state transition events topic.
| |
− | ** Creates incidents in the MySQL database based on the rules listed above.
| |
− | ** Publishes incident transition events to the incident transition events topic in Kafka, from which they are consumed by the Notification Engine.
| |
− | * MySQL
| |
− | ** Schemas are
| |
− | *** Incidents
| |
− | *** Comments
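The flow through the Incident Management Engine (consume, apply rules and persist, publish) can be sketched without committing to a Kafka or MySQL client. Everything here is hypothetical: `run_engine`, `apply_rules` and `publish` are illustrative names, with `apply_rules` standing in for the database write and `publish` for the Kafka producer.

```python
from typing import Callable, Dict, Iterable, Optional

Event = Dict[str, str]


def run_engine(events: Iterable[Event],
               apply_rules: Callable[[Event], Optional[Event]],
               publish: Callable[[Event], None]) -> int:
    """Consume alarm events, apply the incident rules, publish the results.

    `apply_rules` returns an incident transition event, or None when the
    alarm event changes nothing. Returns the number of events published.
    """
    published = 0
    for event in events:
        incident_event = apply_rules(event)
        if incident_event is not None:
            publish(incident_event)
            published += 1
    return published
```

Passing the consumer, rules and publisher in as parameters keeps the sketch independent of any particular client library and makes the flow easy to exercise with plain lists.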
| |
− |
| |
− | == Issues ==
| |
− | # How to assign actions when a new incident is created?
| |
− | # Should alarm IDs map to incidents directly, or should there be a level of indirection between an incident ID and an alarm ID?
| |