OpsGuide/Maintenance, Failures, and Debugging


 * Cloud Controller and Storage Proxy Failures and Maintenance
   * Planned Maintenance
   * Rebooting a Cloud Controller or Storage Proxy
   * Total Cloud Controller Failure
 * Compute Node Failures and Maintenance
   * Planned Maintenance
   * After a Compute Node Reboots
   * Instances
   * Inspecting and Recovering Data from Failed Instances
   * Managing floating IP addresses between instances
   * Volumes
   * Total Compute Node Failure
   * /var/lib/nova/instances
 * Storage Node Failures and Maintenance
   * Rebooting a Storage Node
   * Shutting Down a Storage Node
   * Replacing a Swift Disk
 * Handling a Complete Failure
 * Configuration Management
 * Working with Hardware
   * Adding a Compute Node
   * Adding an Object Storage Node
   * Replacing Components
 * Databases
   * Database Connectivity
   * Performance and Optimizing
 * RabbitMQ troubleshooting
   * RabbitMQ service hangs
   * RabbitMQ alerts
   * Excessive database management memory consumption
   * File descriptor limits when scaling a cloud environment
 * HDWMY
   * Hourly
   * Daily
   * Weekly
   * Monthly
   * Quarterly
   * Semiannually
 * Determining Which Component Is Broken
   * Tailing Logs
   * Running Daemons on the CLI
 * What to do when things are running slowly
   * OpenStack Identity service
   * OpenStack Image service
   * OpenStack Block Storage service
   * OpenStack Compute service
   * OpenStack Networking service
   * AMQP broker
   * SQL back end
 * Uninstalling

Downtime, whether planned or unscheduled, is a certainty when running a cloud. This chapter provides information for dealing with such events, both proactively and reactively.