
OpsGuide/Handling a Complete Failure

Revision as of 02:44, 14 November 2017 by David.desrosiers (talk | contribs) (David.desrosiers moved page Handling a Complete Failure to OpsGuide/Handling a Complete Failure without leaving a redirect)

A common way to recover from a full system failure, such as a data center power outage, is to assign each service a priority and restore the services in that order. The table "Example service restoration priority list" shows an example.

Table. Example service restoration priority list
Priority  Services
1         Internal network connectivity
2         Backing storage services
3         Public network connectivity for user virtual machines
4         nova-compute, cinder hosts
5         User virtual machines
10        Message queue and database services
15        Keystone services
20        cinder-scheduler
21        Image Catalog and Delivery services
22        nova-scheduler services
98        cinder-api
99        nova-api services
100       Dashboard node
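
The priority list above can be sketched as a small driver that walks the services in ascending priority and refuses to continue until the current one is healthy. This is a minimal sketch, not part of any OpenStack tooling: the names RESTORE_PLAN and restore_in_order are hypothetical, and the restore_and_verify callback stands in for whatever site-specific procedure actually starts and checks each service.

```python
from typing import Callable, Iterable, List, Tuple

# Priorities and service names copied from the example table.
RESTORE_PLAN: List[Tuple[int, str]] = [
    (1, "Internal network connectivity"),
    (2, "Backing storage services"),
    (3, "Public network connectivity for user virtual machines"),
    (4, "nova-compute, cinder hosts"),
    (5, "User virtual machines"),
    (10, "Message queue and database services"),
    (15, "Keystone services"),
    (20, "cinder-scheduler"),
    (21, "Image Catalog and Delivery services"),
    (22, "nova-scheduler services"),
    (98, "cinder-api"),
    (99, "nova-api services"),
    (100, "Dashboard node"),
]


def restore_in_order(
    plan: Iterable[Tuple[int, str]],
    restore_and_verify: Callable[[str], bool],
) -> List[str]:
    """Restore services by ascending priority, stopping at the first failure.

    restore_and_verify is a hypothetical, site-specific callback that starts
    one service and returns True only once it has been checked as healthy.
    """
    restored: List[str] = []
    for priority, service in sorted(plan, key=lambda item: item[0]):
        if not restore_and_verify(service):
            raise RuntimeError(
                f"Priority {priority} ({service}) failed; fix it before continuing"
            )
        restored.append(service)
    return restored
```

Stopping at the first failure is a deliberate choice in this sketch: later services depend on earlier ones, so continuing past a broken step would only pile more failures onto an unstable base.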

Use this example priority list to ensure that the services that affect users are restored as soon as possible, but not before a stable environment is in place. Although each step appears as a single line item, each one requires significant work. For example, just after starting the database, you should check its integrity. Likewise, after starting the nova services, you should verify that the state of each hypervisor matches the database and fix any mismatches.
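
The hypervisor-versus-database check can be illustrated as a set comparison. This is a minimal sketch under stated assumptions: find_mismatches is a hypothetical helper, and it assumes you have already collected two lists of instance identifiers by your own means, one from the hypervisor and one from the nova database. It does not query either system itself.

```python
from typing import Dict, Iterable, Set


def find_mismatches(
    on_hypervisor: Iterable[str],
    in_database: Iterable[str],
) -> Dict[str, Set[str]]:
    """Compare instance IDs seen on a hypervisor with those the database expects.

    Both inputs are assumed to be gathered beforehand by the operator.
    """
    hv, db = set(on_hypervisor), set(in_database)
    return {
        # The database expects these instances, but the hypervisor lacks them.
        "missing_on_hypervisor": db - hv,
        # The hypervisor runs these instances, but the database has no record.
        "unknown_to_database": hv - db,
    }


# Hypothetical instance IDs, for illustration only.
report = find_mismatches(
    on_hypervisor=["instance-a", "instance-b"],
    in_database=["instance-b", "instance-c"],
)
```

Each non-empty set in the report is a mismatch that needs manual investigation before the next priority level is restored.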