Scheduled-images-service
- Launchpad Entry: GlanceSpec:scheduled-images-service
- Created: 29 Oct 2012
- Contributors: Alex Meade, Eddie Sheffield
Summary
This blueprint introduces a new service and Nova API extension for scheduling daily snapshots of an instance.
Service responsibilities include:
- Allow creation, deletion, and listing of schedules
- Determine the specific time for each schedule with regard to load balancing
- Handle rescheduling of failed jobs
- Maintain persistent schedules
- Remove images that fall outside the rotation count
Release Note
Rationale
User stories
Assumptions
Design
API Extension
Create Backup Schedule POST /servers/<id>/backup_schedule
Request Body: { "rotation": INT }
Rotation: Specifies the number of most recent backups to keep for an instance. If a schedule already exists for the instance, it is replaced by the new one.
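The rotation semantics can be illustrated with a short sketch. It assumes each backup image carries a creation timestamp; the `id` and `created_at` keys and the function name `images_outside_rotation` are hypothetical and not defined by this blueprint.

```python
def images_outside_rotation(images, rotation):
    """Return the backups that exceed the rotation count.

    `images` is a list of dicts with the hypothetical keys "id" and
    "created_at". The newest `rotation` images are kept; the rest are
    returned as candidates for removal.
    """
    if rotation < 0:
        raise ValueError("rotation must be non-negative")
    # Sort newest first so the first `rotation` entries are the ones to keep.
    newest_first = sorted(images, key=lambda img: img["created_at"], reverse=True)
    return newest_first[rotation:]
```

With a rotation of 2 and three existing backups, only the oldest backup would be returned for removal.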
Delete Backup Schedule DELETE /servers/<id>/backup_schedule
Show Backup Schedule GET /servers/<id>/backup_schedule
Response Body: { "instance": <UUID>, "rotation": INT }
List All Backup Schedules of a Tenant GET /servers/backup_schedules
Response Body: [ { "instance": <UUID>, "rotation": INT }, ... ]
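As a rough illustration of the four calls above, the following sketch builds the method, path, and body for each one without sending anything over the network. The helper name `backup_schedule_request` and its `action` strings are assumptions made for this example; only the paths and the `rotation` field come from the extension described here.

```python
import json


def backup_schedule_request(action, server_id=None, rotation=None):
    """Return (method, path, body) for one of the backup schedule calls."""
    if action == "list":
        return ("GET", "/servers/backup_schedules", None)
    if server_id is None:
        raise ValueError("server_id is required for action %r" % action)
    path = "/servers/%s/backup_schedule" % server_id
    if action == "create":
        if rotation is None:
            raise ValueError("rotation is required to create a schedule")
        return ("POST", path, json.dumps({"rotation": int(rotation)}))
    if action == "delete":
        return ("DELETE", path, None)
    if action == "show":
        return ("GET", path, None)
    raise ValueError("unknown action: %r" % action)
```

Authentication headers and the base URL of the Nova endpoint are left out, since this blueprint does not specify them.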
Service
The service shall consist of a set of APIs, worker nodes, and a DB.
API - Provides a RESTful interface for adding schedules to the DB
Worker - References schedules in the DB to schedule and perform jobs
DB - Tracks schedules and currently executing jobs
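One possible shape for the records the DB tracks is sketched below. Only `rotation` and the job's `last_touched` field are named in this blueprint; every other field name (`time_of_day`, `status`, `image_id`, `checkpoint`) is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass
class Schedule:
    instance_id: str      # UUID of the instance to snapshot
    rotation: int         # number of recent backups to keep
    time_of_day: time     # hypothetical: daily run time picked by the API


@dataclass
class Job:
    instance_id: str
    last_touched: datetime             # refreshed by the worker while the job runs
    status: str = "queued"             # hypothetical states: queued, running, done, error
    image_id: Optional[str] = None     # image the job is working on, once known
    checkpoint: Optional[str] = None   # hypothetical marker of where the job left off
```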
Implementation
The typical flow of the system is as follows:
- User makes request to Nova extension
- Nova extension passes request to API
- API picks a time of day for the schedule
- API adds schedule entry to DB
- Worker polls DB for schedules needing action
- Worker creates job entry in DB
- Worker initiates image snapshot
- Worker waits for completion while updating 'last_touched' field in the job table (to indicate the Worker has not died)
- Worker updates DB to show the job has been completed
- Worker resumes polling until another schedule needs action
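The flow above can be sketched with plain in-memory lists standing in for the DB. This is a minimal illustration under stated assumptions, not the service's implementation: the least-loaded-hour rule in `pick_hour`, the dict keys, and the `snapshot(instance, touch)` callback signature are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def pick_hour(existing_hours):
    """Pick the hour of day (0-23) with the fewest schedules already assigned."""
    load = Counter(existing_hours)
    # Ties are broken in favour of the earliest hour.
    return min(range(24), key=lambda h: (load[h], h))


def run_once(schedules, jobs, clock, snapshot):
    """One polling pass: create a job per due schedule, snapshot, mark it done."""
    now = clock()
    for sched in [s for s in schedules if s["next_run"] <= now]:
        job = {"instance": sched["instance"], "status": "running", "last_touched": now}
        jobs.append(job)

        def touch(job=job):
            # Heartbeat: shows other workers that this worker is still alive.
            job["last_touched"] = clock()

        # The snapshot callable is expected to call touch() while it waits.
        job["image"] = snapshot(sched["instance"], touch)
        job["status"] = "done"
        touch()
        sched["next_run"] += timedelta(days=1)  # daily snapshots
    return jobs
```

A real worker would call `run_once` in a loop with a sleep between passes and would persist each state change to the DB rather than mutating dicts.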
Edge cases:
Worker dies in the middle of a job:
- A different worker will see that the job has not been updated in a while and take over, performing any cleanup it can.
- Jobs contain information about where they left off and which image they were working on (this allows a job whose worker died in the middle of an upload to be resumed)
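A sketch of how a worker might detect and claim an abandoned job follows. The ten-minute timeout, the dict keys other than `last_touched`, and the function names are assumptions; a real implementation would also need the claim to be atomic in the DB so that two workers cannot take over the same job.

```python
from datetime import datetime, timedelta


def find_stale_jobs(jobs, now, timeout=timedelta(minutes=10)):
    """Return running jobs whose 'last_touched' is older than the timeout."""
    return [
        job for job in jobs
        if job["status"] == "running" and now - job["last_touched"] > timeout
    ]


def take_over(job, worker_id, now):
    """Claim a stale job for another worker and refresh its heartbeat."""
    job["worker"] = worker_id
    job["last_touched"] = now
    return job
```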
Image upload fails:
- Retry a limited number of times; if every attempt fails, leave the image in an error state
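The retry behaviour could look roughly like the sketch below. The default of three attempts, the `status` values, and the use of `IOError` to signal a failed upload are assumptions; the blueprint does not fix any of them.

```python
def upload_with_retries(upload, image, max_attempts=3):
    """Call upload(image) up to max_attempts times.

    Returns True on success. After the final failure the image is marked
    with the hypothetical status "error" and False is returned.
    """
    for _attempt in range(max_attempts):
        try:
            upload(image)
        except IOError:
            continue  # try again until the attempts run out
        image["status"] = "active"
        return True
    image["status"] = "error"
    return False
```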
Instance no longer exists:
- Remove the schedule for the instance