Qonos-scheduling-service
- Launchpad Entry: QonoS scheduling service
- Created: 3 May 2013
- Contributors: Alex Meade, Eddie Sheffield, Andrew Melton, Iccha Sethi, Nikhil Komawar, Brian Rosmaita
Summary
This document describes the design and API of QonoS, a distributed high-availability scheduling service that has been implemented for the cloud[1]. QonoS is currently used as the scheduling component of a scheduled images service that is invoked by a Nova extension, so many of the examples in this document discuss that use case.
Service responsibilities include:
- Create scheduled tasks
- Perform scheduled tasks
- Reschedule failed jobs
- Maintain persistent schedules
QonoS was designed to work with OpenStack and uses OpenStack common components.
Conceptual Overview
The system consists of:
- a REST API
- a database
- one or more schedulers, and
- one or more workers.
The API handles communication, both external requests and internal communication. It creates the schedule for a request and stores it in the database.
The scheduler examines schedules and creates jobs.
A job describes a task that must be performed.
A worker performs a task. It obtains a task by polling the API and picking up the first task it is capable of handling.
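As a minimal sketch of how a worker might choose its next task, the snippet below scans a list of polled jobs and returns the first one it can handle. The function name, the "backup" action, and the job ids are hypothetical; only the "action" and "status" fields and the "queued" and "processing" values come from this document.

```python
def pick_first_capable(jobs, supported_actions):
    """Return the first queued job whose action this worker supports, or None."""
    for job in jobs:
        if job["status"] == "queued" and job["action"] in supported_actions:
            return job
    return None


# Hypothetical jobs, as they might be returned by polling the API.
polled_jobs = [
    {"id": "job-1", "action": "backup", "status": "queued"},
    {"id": "job-2", "action": "snapshot", "status": "processing"},
    {"id": "job-3", "action": "snapshot", "status": "queued"},
]

# A worker that only knows how to take snapshots skips job-1 and job-2.
chosen = pick_first_capable(polled_jobs, {"snapshot"})
print(chosen["id"] if chosen else "nothing to do")  # prints "job-3"
```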
Job Lifecycle
Jobs have the following statuses:
- queued: the job is ready to be processed by a worker
- processing: the job has been picked up by a worker
- done: the worker processing this job has decided that the job has been successfully completed
- timeout: the worker processing this job has decided the job is taking too long and has stopped processing it. A job in this state can be picked up by another worker.
- error: the worker notes that something went wrong, but the job could be retried
- canceled: the worker decides that the job can't be done and should not be retried
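The statuses above can be modelled as a small enumeration. This is only an illustrative sketch: the class name, the two groupings, and the helper function are assumptions, and the grouping is just one reading of the descriptions above, not the QonoS implementation.

```python
from enum import Enum


class JobStatus(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    DONE = "done"
    TIMEOUT = "timeout"
    ERROR = "error"
    CANCELED = "canceled"


# Statuses after which the job may still be run again, per the descriptions above.
RETRIABLE = {JobStatus.TIMEOUT, JobStatus.ERROR}

# Statuses after which no further work on the job is expected.
FINAL = {JobStatus.DONE, JobStatus.CANCELED}


def may_be_retried(status):
    """Return True if a job reporting this status string could be retried."""
    return JobStatus(status) in RETRIABLE
```

Passing a string that is not one of the six statuses raises `ValueError`, which makes typos in a status value visible early.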
Job Timeouts
There are two kinds of timeouts:
- hard timeout: once reached, the job is no longer available for retries
- soft timeout: renewed periodically by the worker to indicate that it is still working on the task (similar to a heartbeat)
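A rough sketch of how the two timeouts could interact, using the "timeout" and "hard_timeout" fields from the job representation in the API section. The function names are hypothetical, and capping the soft timeout at the hard timeout is an assumption made for this example rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone


def renew_soft_timeout(job, now, extension):
    """Heartbeat: push the soft timeout forward, never past the hard timeout (assumed)."""
    job["timeout"] = min(now + extension, job["hard_timeout"])
    return job


def soft_timeout_expired(job, now):
    """True if the worker has not renewed the soft timeout in time."""
    return now >= job["timeout"]


def is_available_for_retry(job, now):
    """A job can be retried only while its hard timeout has not been reached."""
    return now < job["hard_timeout"]


start = datetime(2013, 5, 3, 12, 0, tzinfo=timezone.utc)
job = {
    "timeout": start + timedelta(minutes=5),
    "hard_timeout": start + timedelta(hours=1),
}
# The worker checks in after four minutes and asks for ten more.
renew_soft_timeout(job, start + timedelta(minutes=4), timedelta(minutes=10))
```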
Job Failures
Job failures are reported as job faults and stored in the database.
Scalability
Address or remove?
Reliability
Address or remove?
Overall System Diagram
Design
Entities
- Schedule
  - the general description of what the service will do
  - looks something like:

    {
        "tenant_id" : <tenantId>,
        "schedule_id" : <scheduleId>,
        "job_type" : <keyword>,
        "metadata" : {
            // all the information for this job_type
            "key" : "value"
        },
        "schedule" : <the schedule info, exact format TBD>
    }
API
Schedules
Create Schedule
    POST <version>/schedules

    {
        "schedule": {
            "tenant": "tenant_username",
            "action": "snapshot",
            "minute": 30,
            "hour": 2,
            "day": 3,
            "day_of_week": 5,
            "day_of_month": 23,
            "metadata": {
                "instance_id": "some_uuid",
                "retention": "3"
            }
        }
    }
List schedules
    GET <version>/schedules

    {
        "schedules": [
            {
                # schedule as above
            },
            {
                # schedule as above
            },
            ...
        ]
    }
Query filters
- next_run_after - only list schedules with next_run value >= this value
- next_run_before - only list schedules with next_run value <= this value
Example
List schedules which start in the next five minutes
    GET <version>/schedules?next_run_after={Current_DateTime}&next_run_before={Current_DateTime+5_Minutes}

    GET <version>/schedules?next_run_after=2012-05-16T15:27:36Z&next_run_before=2012-05-16T15:32:36Z
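For illustration, a client could build the query string for such a window as follows. This is a sketch using only the Python standard library; the function name and the "/v1" base path passed in the usage line are assumptions, while the two filter names and the timestamp format are taken from the example above.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def next_run_window_url(base_path, now, window):
    """Build a schedules URL listing runs between `now` and `now + window` (UTC)."""
    params = {
        "next_run_after": now.strftime(TIMESTAMP_FORMAT),
        "next_run_before": (now + window).strftime(TIMESTAMP_FORMAT),
    }
    # safe=":" keeps the colons in the timestamps readable instead of %3A.
    return base_path + "/schedules?" + urlencode(params, safe=":")


url = next_run_window_url(
    "/v1",  # placeholder for <version>
    datetime(2012, 5, 16, 15, 27, 36, tzinfo=timezone.utc),
    timedelta(minutes=5),
)
print(url)
```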
Get a specific schedule
GET /v1/schedules/{id}
Update a schedule
    PUT <version>/schedules/{id}

    {
        "schedule": {
            "minute": 45,
            "hour": 3
        }
    }
Delete a schedule
DELETE <version>/schedules/{id}
Jobs
Create job from schedule
    POST <version>/jobs

    {
        "job": {
            "schedule_id": "some_uuid"
        }
    }
The action, tenant_id, and metadata are copied from the schedule to the job.
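The copy from schedule to job might look like the sketch below. The function name is hypothetical, the field names follow the JSON examples in this section (which use "tenant"), and the initial "queued" status, the zero retry count, and the deep copy of the metadata are assumptions made for illustration.

```python
import copy


def job_from_schedule(schedule):
    """Build a new job dict from a schedule, copying tenant, action and metadata."""
    return {
        "schedule_id": schedule["id"],
        "tenant": schedule["tenant"],
        "action": schedule["action"],
        # Deep copy (assumed) so later changes to the job's metadata
        # do not leak back into the schedule.
        "metadata": copy.deepcopy(schedule["metadata"]),
        "status": "queued",
        "retry_count": 0,
    }


schedule = {
    "id": "some_uuid",
    "tenant": "tenant_username",
    "action": "snapshot",
    "metadata": {"instance_id": "some_uuid", "retention": "3"},
}
job = job_from_schedule(schedule)
```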
Get a specific job
    GET <version>/jobs/{id}

    {
        "job": {
            "id": "{some_uuid}",
            "created_at": "{DateTime}",
            "updated_at": "{DateTime}",
            "schedule_id": "{some_uuid}",
            "worker_id": "{some_uuid}",
            "tenant": "tenant_username",
            "action": "snapshot",
            "status": "queued",
            "retry_count": 0,
            "hard_timeout": "{DateTime}",
            "timeout": "{DateTime}",
            "metadata": {
                "key1": "value1",
                "key2": "value2"
            }
        }
    }
List current jobs
    GET <version>/jobs

    {
        "jobs": [
            {
                # job as above
            },
            {
                # job as above
            },
            ...
        ]
    }
Update status of a job
    PUT <version>/jobs/{id}/status

    {
        "status": {
            "status": "some_status",
            "timeout": "{datetime of next timeout}",
            "error_message": "some message"
        }
    }

The timeout and error_message fields are optional.
NOTE: The error_message field is only looked for if the status is ERROR. In the event of an ERROR status, an entry is created in the job_faults table capturing as much info as possible from the job. If an error_message is provided, it is included in the job fault entry.
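The behavior described in the note can be sketched with in-memory stand-ins for the job record and the job_faults table. The function name and the layout of the fault entry are hypothetical; only the rule that error_message is consulted solely for an error status, and that a fault entry is then created, comes from the text above.

```python
def apply_status_update(job, update, job_faults):
    """Apply the inner "status" object of a status update to an in-memory job."""
    job["status"] = update["status"]
    if "timeout" in update:
        job["timeout"] = update["timeout"]

    # error_message is only looked at when the new status is an error.
    if update["status"].lower() == "error":
        fault = {
            "job_id": job["id"],
            "schedule_id": job.get("schedule_id"),
            "tenant": job.get("tenant"),
            "action": job.get("action"),
            "job_metadata": dict(job.get("metadata", {})),
        }
        if "error_message" in update:
            fault["message"] = update["error_message"]
        job_faults.append(fault)
    return job
```

Here `job_faults` is a plain list playing the role of the table; a real service would write to the database instead.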
Delete (finish) a specific job
DELETE <version>/jobs/{id}
Metadata
Set schedule/job metadata
    PUT <version>/schedules/{id}/metadata

or

    PUT <version>/jobs/{id}/metadata
Note: The resulting metadata for a schedule/job will exactly match what is provided.
    {
        "metadata": {
            "each": "someval",
            "meta": "someval",
            "key": "someval"
        }
    }
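In other words, setting metadata replaces the existing set rather than merging into it. A minimal sketch of that semantics, with a hypothetical function name and an in-memory resource standing in for a schedule or job:

```python
def set_metadata(resource, body):
    """Replace the resource's metadata with exactly what the request body provides."""
    resource["metadata"] = dict(body["metadata"])
    return resource["metadata"]


schedule = {"metadata": {"instance_id": "some_uuid", "retention": "3"}}
set_metadata(schedule, {"metadata": {"retention": "5"}})
# "instance_id" is gone because it was not part of the request.
print(schedule["metadata"])  # prints {'retention': '5'}
```

A client that wants to change a single key would therefore first list the current metadata, modify it locally, and send the complete set back.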
List all metadata for a schedule/job
    GET <version>/schedules/{id}/metadata

or

    GET <version>/jobs/{id}/metadata

    {
        "metadata": {
            "instance_id": "some_uuid",
            "retention": "3"
        }
    }
Workers
Service
The service shall consist of a set of APIs, worker nodes, and a DB.
API - Provides a RESTful interface for adding schedules to the DB
Worker - References schedules in the DB to schedule and perform jobs
DB - Tracks schedules and currently executing jobs
Database
- schedules
- jobs
- job faults
  - must be useful!
Implementation
Typical flow of the system is as follows.
- User makes request to Nova extension
- Nova extension passes request to API
- API picks time of day to schedule
- Adds schedule entry to DB
- Worker polls DB for schedules needing action
- Worker creates job entry in DB
- Worker initiates image snapshot
- Worker waits for completion while updating 'last_touched' field in the job table (to indicate the Worker has not died)
- Worker updates DB to show the job has been completed
- Worker polls until a schedule needs action
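The worker side of this flow can be sketched as a single polling pass over an in-memory stand-in for the DB. Everything here is illustrative: the function name, the dictionary layout, and in particular the "interval" field used to compute the next run are assumptions, since the exact schedule format is not fixed in this document.

```python
from datetime import datetime, timedelta, timezone


def poll_once(db, now, perform):
    """Run every schedule that is due at `now`; return the jobs completed in this pass."""
    completed = []
    for schedule in db["schedules"]:
        if schedule["next_run"] > now:
            continue  # not due yet
        # Worker creates the job entry in the DB.
        job = {"schedule_id": schedule["id"], "status": "processing", "last_touched": now}
        db["jobs"].append(job)
        # Worker performs the task, e.g. initiates the image snapshot.
        perform(schedule["metadata"])
        # Worker updates the DB to show the job has been completed.
        job["status"] = "done"
        # Hypothetical: move the schedule forward so it is not picked up again.
        schedule["next_run"] = now + schedule["interval"]
        completed.append(job)
    return completed
```

A long-running worker would call `poll_once` in a loop with a sleep between passes, and would refresh `last_touched` while a slow task is in progress.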
Edge cases:
Worker dies in middle of job:
- A different worker will see that the job has not been updated in a while and take over, performing any cleanup it can.
- Jobs contain information about where they left off and what image they were working on (this allows a job whose worker died in the middle of an upload to be resumed)
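One way a worker could detect such abandoned jobs is to compare the 'last_touched' field against an idle threshold. This is a sketch only: the function names and the threshold parameter are hypothetical, and the cleanup and resume steps are left out.

```python
from datetime import datetime, timedelta, timezone


def find_abandoned_jobs(jobs, now, max_idle):
    """Return jobs still marked 'processing' whose last_touched is older than max_idle."""
    return [
        job
        for job in jobs
        if job["status"] == "processing" and now - job["last_touched"] > max_idle
    ]


def take_over(job, new_worker_id, now):
    """Claim an abandoned job for another worker and refresh its heartbeat."""
    job["worker_id"] = new_worker_id
    job["last_touched"] = now
    return job
```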
Image upload fails
- Retry a certain number of times; afterwards, leave the image in an error state
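A bounded retry of this kind could be sketched as follows. The function name, the use of IOError to signal a failed upload, and the "uploaded" state label are assumptions for the example; the document only specifies that the image ends up in an error state once the retries are used up.

```python
def upload_with_retries(upload, max_retries):
    """Call upload() until it succeeds or max_retries retries have been used."""
    retry_count = 0
    while True:
        try:
            upload()
            return {"image_state": "uploaded", "retry_count": retry_count}
        except IOError:
            if retry_count >= max_retries:
                # Retries exhausted: leave the image in an error state.
                return {"image_state": "error", "retry_count": retry_count}
            retry_count += 1
```

With `max_retries=3` the upload is attempted at most four times: the initial attempt plus three retries.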
Instance no longer exists
- Remove schedule for instance
Code Repository
References
[1] QonoS was first described in https://wiki.openstack.org/wiki/Scheduled-images-service (29 October 2012).