Monasca/Incident Manager

Revision as of 15:14, 6 April 2015 by Roland Hochmuth (talk | contribs) (Comment on incident)

Use Cases

  1. Create a new incident when alarm occurs
  2. Display all incidents
  3. Display all open, acknowledged or resolved incidents
  4. Display my incidents
  5. Display all open, acknowledged or resolved incidents assigned to a user
  6. Acknowledge an incident
  7. Resolve an incident
  8. Comment on an incident
  9. Assign/Reassign an incident

Domain Model

This section describes the domain model by establishing the core concepts and ubiquitous language for incident management.

  • Incidents
    • Incidents are resources that are created when an alarm transitions to the ALARM or UNDETERMINED state.
    • Incidents are associated with an alarm.
    • Incidents allow users to manage alarms as follows:
      • Set and query the status of incidents.
      • Track the history of alarm events for an incident.
      • Assign incidents to users. Store and query the history of assignments for an incident.
      • Allow users to comment on incidents. Store and query the history of comments for an incident.
    • An incident has one of three statuses:
      • OPEN: When a new alarm occurs and an incident is created, the incident is in the OPEN state.
      • ACKNOWLEDGED: When an incident is being worked on it is ACKNOWLEDGED.
      • RESOLVED: When an incident is closed, it is resolved.
  • Alarms
    • Alarms are resources in Monasca that are created by the Threshold Engine when new metrics are received that match one or more alarm definitions.
    • There are three states of an alarm:
      • OK
      • ALARM
      • UNDETERMINED
    • The state of an alarm is controlled by the Threshold Engine unless it is explicitly set using the Monasca API.
  • Alarm state transition events
    • An event that is generated when the state of an alarm transitions from one state to another.
    • An event is published by the Threshold Engine to the Message Queue when an alarm transitions state.
  • Assignment
    • The user that an incident is assigned to.
  • Comment
    • Comments are resources that are created when a user comments on an incident.
  • Actions
    • Similar to actions for alarm definitions in Monasca, incidents can also have actions which occur when an incident is modified.
    • Actions can be associated with notification methods and result in webhooks, emails or other notifications.


Note, several of the concepts related to incidents were "borrowed" from PagerDuty. See https://developer.pagerduty.com/documentation/rest/incidents.

Incident Lifecycle

An incident starts out with a status of OPEN, and over the course of its lifecycle it undergoes several state transitions until it is resolved and its lifecycle is complete. This section describes the lifecycle of an incident, which includes creating incidents, handling alarm state transitions, updating the status of incidents, assigning incidents and commenting on incidents.

Alarm states

Alarm state transition events are created by the Threshold Engine and are processed by the Incident Manager as follows:

  1. To ALARM
    1. Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
      1. If an incident doesn't exist for the alarm, or the status of the incident matching an alarm has been RESOLVED, a new incident is created with an incident status of OPEN.
      2. If there exists an incident with a status of OPEN or ACKNOWLEDGED matching the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
  2. To OK
    1. Add an alarm state transition event to an existing incident.
      1. If an incident doesn't exist for the alarm, or the status of the incident matching an alarm has been RESOLVED, nothing is done.
      2. If there exists an incident with a status of OPEN or ACKNOWLEDGED matching the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
  3. To UNDETERMINED
    1. Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
      1. If an incident doesn't exist for the alarm, or the status of the incident matching an alarm has been RESOLVED, a new incident is created with an incident status of OPEN.
      2. If there exists an incident with a status of OPEN or ACKNOWLEDGED for the alarm, the alarm state transition event is added to the existing incident, and the status is not modified.
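
The rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Incident` and `IncidentStore` classes and the `handle_alarm_transition` function are hypothetical names, the store is an in-memory list rather than MySQL, and recording the triggering transition on a newly created incident is an assumption that the rules do not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Statuses for which an incident is still tracking its alarm.
ACTIVE_STATUSES = {"OPEN", "ACKNOWLEDGED"}


@dataclass
class Incident:
    alarm_id: str
    status: str = "OPEN"
    alarm_state_transitions: List[str] = field(default_factory=list)


class IncidentStore:
    """Hypothetical in-memory store standing in for the database."""

    def __init__(self) -> None:
        self.incidents: List[Incident] = []

    def find_active(self, alarm_id: str) -> Optional[Incident]:
        # Return the OPEN or ACKNOWLEDGED incident for the alarm, if any.
        for incident in self.incidents:
            if incident.alarm_id == alarm_id and incident.status in ACTIVE_STATUSES:
                return incident
        return None


def handle_alarm_transition(store: IncidentStore, alarm_id: str,
                            new_state: str) -> Optional[Incident]:
    incident = store.find_active(alarm_id)
    if incident is not None:
        # An active incident exists: add the event, leave the status alone.
        incident.alarm_state_transitions.append(new_state)
        return incident
    if new_state in ("ALARM", "UNDETERMINED"):
        # No incident, or only RESOLVED ones: open a new incident.
        incident = Incident(alarm_id=alarm_id)
        incident.alarm_state_transitions.append(new_state)  # assumption
        store.incidents.append(incident)
        return incident
    # Transition to OK with no active incident: nothing is done.
    return None
```

Looking up only OPEN or ACKNOWLEDGED incidents covers both "no incident exists" and "the incident has been RESOLVED" with a single check, which mirrors how the three cases above share the same two sub-rules.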

Incident status

The status of an Incident is modified via the Incidents API and processed as follows:

  1. To OPEN
    1. When an incident is initially created the status is set to OPEN.
  2. To ACKNOWLEDGED
    1. If an incident is in the OPEN state the status can be set to ACKNOWLEDGED using the Incidents API.
    2. An incident status event is published to the Message Queue which is processed by the Notification Engine.
    3. If an incident is acknowledged, it won't generate any additional notifications, even if it receives new alarm state transition events.
  3. To RESOLVED
    1. If an incident is in the ACKNOWLEDGED state the status can be set to RESOLVED using the Incidents API.
    2. An incident status event is published to the Message Queue which is processed by the Notification Engine.
    3. If an incident is resolved, it won't generate any additional notifications.


Whenever the status of an incident is modified, the user that modified the incident and the timestamp are recorded.
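
A minimal sketch of the status rules, assuming a strict reading in which OPEN can only move to ACKNOWLEDGED and ACKNOWLEDGED can only move to RESOLVED. The text does not say whether other transitions (for example OPEN directly to RESOLVED) are permitted, so the `ALLOWED_TRANSITIONS` table, the `update_status` function and the `InvalidTransition` exception are hypothetical. Publishing the incident status event to the Message Queue is omitted here.

```python
from datetime import datetime, timezone

# Assumed transition table based on a strict reading of the rules above.
ALLOWED_TRANSITIONS = {
    "OPEN": {"ACKNOWLEDGED"},
    "ACKNOWLEDGED": {"RESOLVED"},
    "RESOLVED": set(),
}


class InvalidTransition(Exception):
    """Raised when a status change is not permitted."""


def update_status(incident: dict, new_status: str, user_id: str) -> dict:
    current = incident["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new_status} is not allowed")
    now = datetime.now(timezone.utc).isoformat()
    incident["status"] = new_status
    incident["status_updated_timestamp"] = now
    # Record who changed the status and when.
    incident.setdefault("acknowledgments", []).append(
        {"status": new_status, "user_id": user_id, "timestamp": now}
    )
    return incident
```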

Assign or reassign incident

The assignment/reassignment of an incident is done via the Incidents API and is processed as follows:

  1. When an incident is initially created it is unassigned. It can then be assigned or reassigned later using the Incidents API.
  2. An incident assignment/reassignment event is published to the Message Queue which is then processed by the Notification Engine.


When an incident is assigned or reassigned the assigner, assignee and timestamp are recorded.
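
As an illustration, assignment could be recorded and published roughly as below. The function names, the event fields and the `publish` callback (standing in for a Message Queue producer) are all assumptions made for this sketch.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def assign_incident(incident: dict, assigner_id: str, assignee_id: str,
                    publish: Callable[[dict], None]) -> dict:
    # Record the assigner, assignee and timestamp on the incident.
    record = {
        "assigner_id": assigner_id,
        "assignee_id": assignee_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    incident.setdefault("assignments", []).append(record)
    # Publish an assignment event; the event shape is hypothetical.
    publish({"type": "incident-assigned", "incident_id": incident["id"], **record})
    return record


def current_assignee(incident: dict) -> Optional[str]:
    # A newly created incident has no assignments and is unassigned.
    assignments = incident.get("assignments", [])
    return assignments[-1]["assignee_id"] if assignments else None
```

Keeping every assignment record, rather than overwriting a single field, preserves the history of assignments that the domain model asks for.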

Comment on incident

Comments can be created via the Incidents API and are processed as follows:

  1. When a comment is added to an incident, the comment is stored.
  2. An incident comment event is published to the Message Queue and then processed by the Notification Engine.


When an incident is commented on, the user, timestamp and comment are recorded.

Incidents API

  • GET /v2.0/incidents/
    • Query parameters
      • status
      • alarm_state
      • assigned_to
      • acknowledged_by
      • create_start_time
      • status_update_start_time
  • GET /v2.0/incidents/{incident-id}
  • PATCH /v2.0/incidents/{incident-id}: Update an incident, such as modifying the status to ACKNOWLEDGED or RESOLVED.
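
The following sketch shows how a client might build requests for these endpoints. It only constructs URLs and bodies and sends nothing. The base URL is a placeholder, and the PATCH body of `{"status": ...}` is an assumption, since the request format is not specified here.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://monasca.example.com:8080"  # hypothetical endpoint

# Query parameters listed for GET /v2.0/incidents/.
ALLOWED_FILTERS = {
    "status", "alarm_state", "assigned_to", "acknowledged_by",
    "create_start_time", "status_update_start_time",
}


def list_incidents_url(**filters: str) -> str:
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unsupported query parameters: {sorted(unknown)}")
    url = f"{BASE_URL}/v2.0/incidents/"
    query = urlencode(sorted(filters.items()))
    return f"{url}?{query}" if query else url


def patch_incident_request(incident_id: str, status: str):
    # Returns (method, url, body); the body shape is an assumption.
    url = f"{BASE_URL}/v2.0/incidents/{incident_id}"
    return "PATCH", url, json.dumps({"status": status})
```

For example, `list_incidents_url(status="OPEN", assigned_to="u1")` corresponds to the use case of displaying the open incidents assigned to a user.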

Incident Object

  • id: The ID of the incident.
  • name: The name of the incident.
  • description: The description of the incident.
  • alarm: {alarm}
  • alarm_state_transitions: [{alarm_state_transition}]
  • status: OPEN, ACKNOWLEDGED, RESOLVED
  • created_timestamp: The timestamp when the incident was created.
  • status_updated_timestamp: The timestamp when the status of the incident was last updated.
  • comments: [comment-id]: An array of comments for the incident.
  • assignments: [{Assignment}]: The assigner, assignee and timestamp that the incident was assigned.
  • acknowledgments: [{Acknowledgment}]: The user and timestamp that acknowledged or resolved the incident.
  • actions: [{notification-method}]: Array of notification method IDs that are invoked when the incident is modified in any way.
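
For illustration, an incident serialized as JSON might look like the output of the snippet below. All values are invented, and the inner fields of the alarm, alarm state transition, assignment and acknowledgment objects are assumptions, since only the top-level fields are listed above.

```python
import json

incident = {
    "id": "b1c2d3",
    "name": "High CPU on web tier",
    "description": "CPU alarm on the web tier",
    "alarm": {"id": "a9f8e7", "state": "ALARM"},  # inner fields assumed
    "alarm_state_transitions": [
        {"old_state": "OK", "new_state": "ALARM",
         "timestamp": "2015-04-06T15:00:00Z"},
    ],
    "status": "ACKNOWLEDGED",
    "created_timestamp": "2015-04-06T15:00:05Z",
    "status_updated_timestamp": "2015-04-06T15:10:00Z",
    "comments": ["c-1"],
    "assignments": [
        {"assigner_id": "alice", "assignee_id": "bob",
         "timestamp": "2015-04-06T15:05:00Z"},
    ],
    "acknowledgments": [
        {"status": "ACKNOWLEDGED", "user_id": "bob",
         "timestamp": "2015-04-06T15:10:00Z"},
    ],
    "actions": ["nm-1"],  # notification method IDs
}

document = json.dumps(incident, indent=2)
print(document)
```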

Comments API

  • GET /v2.0/comments
    • Query parameters
      • incident_id (string, optional)
  • GET /v2.0/comments/{comment-id}
  • POST /v2.0/comments

Comment Object

  • id
  • incident_id
  • created_timestamp
  • comment
  • user_id (string, required)

Architecture

  • Monasca Incident Manager
    • Provides an API that enables the following:
      • Incidents: Query and update incidents, such as updating the status of incidents.
      • Comments: Create and query comments
    • Consumes alarm state transition events from the Kafka alarm state transition events topic.
    • Creates incidents in the MySQL database based on the rules listed above
    • Publishes incident events to the incident events topic in Kafka, which are consumed by the Notification Engine and can potentially result in notifications being sent.
  • MySQL
    • Schemas
      • Incidents
        • id: The ID of the incident.
        • tenant_id
        • name: The name of the incident.
        • description: The description of the incident.
        • alarm_id
        • alarm_state_transitions: [{alarm_state_transition}]
        • status: OPEN, ACKNOWLEDGED, RESOLVED
        • created_timestamp: The timestamp when the incident was created.
        • status_updated_timestamp: The timestamp when the status of the incident was last updated.
      • IncidentAcknowledgments
        • id
        • incident_id
        • status
        • user_id
        • timestamp
      • IncidentAssignments
        • id
        • incident_id
        • assigner_id
        • assignee_id
        • timestamp
      • Comments
        • id
        • incident_id
        • user_id
        • timestamp
        • comment_text
      • IncidentActions
        • id
        • incident_id
        • action
      • IncidentAlarmHistory
        •  ?
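
A rough sketch of part of this schema is shown below, using Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server. The snake_case table names, column types and constraints are assumptions. The alarm history is left out because its schema is still open above, and the remaining tables would follow the same pattern.

```python
import sqlite3

SCHEMA = """
CREATE TABLE incidents (
    id TEXT PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    name TEXT,
    description TEXT,
    alarm_id TEXT NOT NULL,
    status TEXT NOT NULL
        CHECK (status IN ('OPEN', 'ACKNOWLEDGED', 'RESOLVED')),
    created_timestamp TEXT NOT NULL,
    status_updated_timestamp TEXT
);
CREATE TABLE incident_assignments (
    id INTEGER PRIMARY KEY,
    incident_id TEXT NOT NULL REFERENCES incidents(id),
    assigner_id TEXT NOT NULL,
    assignee_id TEXT NOT NULL,
    timestamp TEXT NOT NULL
);
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    incident_id TEXT NOT NULL REFERENCES incidents(id),
    user_id TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    comment_text TEXT NOT NULL
);
"""


def open_incidents_assigned_to(conn: sqlite3.Connection, user_id: str) -> list:
    # Matches any recorded assignment to the user, not only the latest one.
    rows = conn.execute(
        """
        SELECT DISTINCT i.id
        FROM incidents AS i
        JOIN incident_assignments AS a ON a.incident_id = i.id
        WHERE i.status = 'OPEN' AND a.assignee_id = ?
        ORDER BY i.id
        """,
        (user_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO incidents VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("inc-1", "tenant-1", "High CPU", "", "alarm-1", "OPEN",
     "2015-04-06T15:00:05Z", None),
)
conn.execute(
    "INSERT INTO incident_assignments "
    "(incident_id, assigner_id, assignee_id, timestamp) VALUES (?, ?, ?, ?)",
    ("inc-1", "alice", "bob", "2015-04-06T15:05:00Z"),
)
```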

Issues

  1. How to assign actions to incidents when a new incident is created?
  2. Should alarm IDs match to incidents directly, or should there be a level of indirection between an incident ID and an alarm ID? In PagerDuty you create an incident and get a response that has the incident ID, which the client should store. On subsequent events, the same incident ID can be provided for the same alarm. If the incident has been resolved, a new incident is created and a new incident ID is returned. If the incident has not been resolved, the event is added to the incident. In PagerDuty the responsibility is on the client to manage the incident IDs associated with an alarm such that on subsequent alarm events the previous incident ID can be provided. What is described here is that the Incident Manager creates a new incident when an alarm occurs, if the incident tracking the alarm has already been resolved. From the perspective of the Threshold Engine this is a nice approach, as the Threshold Engine just publishes alarm state transition events to the Message Queue and there isn't a way for the Threshold Engine to get responses back from the Incident Manager.
  3. Teams and Groups. PagerDuty has the ability to assign incidents to teams or groups or individuals with escalation policies.
  4. Maintenance Schedules
  5. Should incidents be unassigned when created or assigned to a user based on an "escalation" policy?
  6. Incident status or state? Which word is better? Alarms have a state. Incidents have a state too. But status seems like a more appropriate term for incidents than state.
  7. There are a number of interesting capabilities in PagerDuty. One is the idea of creating incidents for service teams that have an escalation policy which assigns the incident to a member of the service team that is on-call based on a schedule and escalation policy. Currently, what is described above doesn't have these capabilities. Alarm state transitions are created that result in incidents, but how can incidents be associated with service teams, let alone basing that assignment on a schedule or escalation policy? One problem with the above design is that alarms are consumed by the Incident Manager, but we don't know which service team they should be associated with. There are several ways of addressing this. One is to use webhooks similar to PagerDuty. Application IDs would be created in the Incident Manager for service teams and then alarm notification methods could be associated with services. However, we were trying to avoid this overhead and management. Another idea is to use metric names and dimensions. Normally, metric names and dimensions are used to describe metrics. For example, a dimension of service is used to associate metrics with a specific service. If we add a calendar and escalation policy to the Incident Manager, one way to describe the association with service teams is to allow service teams to be created in the Incident Manager and then describe the metrics that they are interested in. This seems like a more powerful construct than the mechanism used in PagerDuty.