Difference between revisions of "Qonos-scheduling-service"
Revision as of 20:06, 3 May 2013
- Launchpad Entry: QonoS scheduling service
- Created: 3 May 2013
- Contributors: Alex Meade, Eddie Sheffield, Andrew Melton, Iccha Sethi, Nikhil Komawar, Brian Rosmaita
Summary
This document describes the design and API of QonoS, a distributed high-availability scheduling service that has been implemented for the cloud[1]. QonoS is currently used as the scheduling component of a scheduled images service that is invoked by a Nova extension, so many of the examples in this document discuss that use case.
Service responsibilities include:
- Create scheduled tasks
- Perform scheduled tasks
- Handle rescheduling failed jobs
- Maintain persistent schedules
QonoS was designed to work with OpenStack and uses OpenStack common components.
Conceptual Overview
The system consists of:
- an API
- a database
- one or more schedulers, and
- one or more workers.
The API handles communication, both external requests and internal communication. It creates the schedule for a request and stores it in the database.
The scheduler examines schedules and creates jobs.
A job describes a task that must be performed.
A worker performs a task. It obtains a task by polling the API and picking up the first task it is capable of handling.
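The selection rule described here (take the first task the worker is capable of handling) can be sketched as follows. This is a minimal illustration, not the QonoS implementation: the job dictionaries, the `action` key on a job, and the helper name `pick_first_supported` are assumptions made for the example.

```python
from typing import Iterable, Optional, Set


def pick_first_supported(jobs: Iterable[dict], supported_actions: Set[str]) -> Optional[dict]:
    """Return the first job whose action this worker can handle, or None."""
    for job in jobs:
        if job.get("action") in supported_actions:
            return job
    return None


# Hypothetical list of pending jobs, in the order the API would return them.
pending = [
    {"id": "job-1", "action": "backup"},
    {"id": "job-2", "action": "snapshot"},
    {"id": "job-3", "action": "snapshot"},
]

# A worker that only knows how to take snapshots skips job-1.
chosen = pick_first_supported(pending, {"snapshot"})
print(chosen["id"])
```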
Job Lifecycle
Jobs have the following statuses:
- queued: the job is ready to be processed by a worker
- processing: the job has been picked up by a worker
- done: the worker processing this job has decided that the job has been successfully completed
- timeout: the worker processing this job has decided the job is taking too long and has stopped processing it. A job in this state can be picked up by another worker.
- error: the worker notes that something went wrong, but the job could be retried
- canceled: the worker decides that the job can't be done and should not be retried
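One way to read the statuses above is as a small state machine. The sketch below encodes the status names from this document; the transition table itself is an assumption inferred from the descriptions (for example, that a retried `error` job goes straight back to `processing`), not a confirmed part of the design.

```python
from enum import Enum


class JobStatus(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    DONE = "done"
    TIMEOUT = "timeout"
    ERROR = "error"
    CANCELED = "canceled"


# Assumed transition table, inferred from the status descriptions.
ALLOWED_TRANSITIONS = {
    JobStatus.QUEUED: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {
        JobStatus.DONE,
        JobStatus.TIMEOUT,
        JobStatus.ERROR,
        JobStatus.CANCELED,
    },
    JobStatus.TIMEOUT: {JobStatus.PROCESSING},  # another worker may pick it up
    JobStatus.ERROR: {JobStatus.PROCESSING},    # the job could be retried
    JobStatus.DONE: set(),                      # terminal
    JobStatus.CANCELED: set(),                  # terminal, never retried
}


def can_transition(current: JobStatus, new: JobStatus) -> bool:
    """Return True if moving from `current` to `new` is allowed by the table."""
    return new in ALLOWED_TRANSITIONS[current]


print(can_transition(JobStatus.TIMEOUT, JobStatus.PROCESSING))
print(can_transition(JobStatus.CANCELED, JobStatus.PROCESSING))
```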
Job Timeouts
There are two kinds of timeouts:
- hard timeout: once reached, the job is no longer available for retries
- soft timeout: renewed periodically by the worker to indicate that it is still doing the task (similar to a heartbeat)
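The interaction between the two timeouts might look like the sketch below. The class name, field names, and the decision to cap the soft timeout at the hard timeout are all assumptions for illustration; the document does not specify how the timeouts are stored or renewed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobTimeouts:
    hard_timeout: datetime  # after this, the job is no longer retried
    soft_timeout: datetime  # renewed by the worker, like a heartbeat

    def renew_soft(self, now: datetime, extension: timedelta) -> None:
        """Push the soft timeout forward; capped at the hard timeout (assumption)."""
        self.soft_timeout = min(now + extension, self.hard_timeout)

    def looks_abandoned(self, now: datetime) -> bool:
        """True if the worker has stopped renewing the soft timeout."""
        return now >= self.soft_timeout

    def is_retryable(self, now: datetime) -> bool:
        """True while the hard timeout has not been reached."""
        return now < self.hard_timeout


start = datetime(2013, 5, 3, 20, 0, 0)
timeouts = JobTimeouts(
    hard_timeout=start + timedelta(hours=1),
    soft_timeout=start + timedelta(minutes=5),
)

# The worker checks in after four minutes and buys five more.
timeouts.renew_soft(start + timedelta(minutes=4), timedelta(minutes=5))
print(timeouts.looks_abandoned(start + timedelta(minutes=6)))
print(timeouts.is_retryable(start + timedelta(minutes=61)))
```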
Job Failures
Job failures are reported as job faults and stored in the database.
Scalability
Address or remove?
Reliability
Address or remove?
Overall System Diagram
[Figure: Qonos Diagram.png]
Design
Entities
- Schedule
  - the general description of what the service will do
  - looks something like

    {
        "tenant_id" : <tenantId>,
        "schedule_id" : <scheduleId>,
        "job_type" : <keyword>,
        "metadata" : {
            // all the information for this job_type
            "key" : "value"
        },
        "schedule" : <the schedule info, exact format TBD>
    }
API
CRUD for schedules
Create Schedule
    POST <version>/schedules
    {"schedule":
        {
            "tenant": "tenant_username",
            "action": "snapshot",
            "minute": 30,
            "hour": 2,
            "day": 3,
            "day_of_week": 5,
            "day_of_month": 23,
            "metadata":
            {
                "instance_id": "some_uuid",
                "retention": "3"
            }
        }
    }
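For illustration, a client could assemble the Create Schedule request as below. This sketch only builds the request object and does not send it. The base URL `http://localhost:8080/v1` and the helper name are hypothetical, and only a subset of the schedule fields shown above is included.

```python
import json
import urllib.request


def build_create_schedule_request(base_url, tenant, action, minute, hour, metadata):
    """Build (but do not send) a POST <version>/schedules request."""
    body = {
        "schedule": {
            "tenant": tenant,
            "action": action,
            "minute": minute,
            "hour": hour,
            "metadata": metadata,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/schedules",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint; replace with the real API location and version.
request = build_create_schedule_request(
    "http://localhost:8080/v1",
    tenant="tenant_username",
    action="snapshot",
    minute=30,
    hour=2,
    metadata={"instance_id": "some_uuid", "retention": "3"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the result to `urllib.request.urlopen` would send it; that step is left out so the example runs without a live service.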
List schedules
    GET <version>/schedules
    {
        "schedules":
        [
            {
                # schedule as above
            },
            {
                # schedule as above
            },
            ...
        ]
    }
Query filters
- next_run_after - only list schedules with next_run value >= this value
- next_run_before - only list schedules with next_run value <= this value
Example
List schedules which start in the next five minutes
    GET <version>/schedules?next_run_after={Current_DateTime}&next_run_before={Current_DateTime+5_Minutes}
    GET <version>/schedules?next_run_after=2012-05-16T15:27:36Z&next_run_before=2012-05-16T15:32:36Z
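The query string in the example above can be generated from the current time. The sketch below assumes the timestamps are UTC in the `YYYY-MM-DDTHH:MM:SSZ` form shown in the example; the helper name is hypothetical and it expects a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def next_run_window_query(now: datetime, window: timedelta = timedelta(minutes=5)) -> str:
    """Build the path and query for schedules due between `now` and `now + window`."""
    start = now.astimezone(timezone.utc)
    end = (now + window).astimezone(timezone.utc)
    params = {
        "next_run_after": start.strftime(TIMESTAMP_FORMAT),
        "next_run_before": end.strftime(TIMESTAMP_FORMAT),
    }
    # safe=":" keeps the colons readable instead of percent-encoding them.
    return "schedules?" + urlencode(params, safe=":")


now = datetime(2012, 5, 16, 15, 27, 36, tzinfo=timezone.utc)
query = next_run_window_query(now)
print(query)
# schedules?next_run_after=2012-05-16T15:27:36Z&next_run_before=2012-05-16T15:32:36Z
```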
Get a specific schedule
GET /v1/schedules/{id}
Update a schedule
    PUT <version>/schedules/{id}
    {"schedule":
        {
            "minute": 45,
            "hour": 3
        }
    }
Delete a schedule
DELETE <version>/schedules/{id}
CRUD for jobs
    GET /v1/jobs
    GET /v1/jobs/{jobId}
    DELETE /v1/jobs/{jobId}
    GET /v1/jobs/{jobId}/status
    GET /v1/jobs/{jobId}/heartbeat
    PUT /v1/jobs/{jobId}/status
        * status in request body
    PUT /v1/jobs/{jobId}/heartbeat
        * heartbeat for this job (exact format TBD) in request body
NOTES:
- There is no POST; the job maker handles job creation.
- The worker will mark the job status as 'done' (or another appropriate status) when it finishes.
- The /status and /heartbeat calls may be combined into a single call; this has not been decided yet.
    GET /v1/workers
    GET /v1/workers/{workerId}
    GET /v1/workers/{workerId}/jobs/next
        * return job info, format TBD
    POST /v1/workers
        * returns a workerId, is done when a worker is instantiated, allows the system to keep track of the worker
    DELETE /v1/workers/{workerId}
        * should be called by the worker if/when it's safely taken down
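The worker endpoints above imply a simple lifecycle: register, poll for the next job, and unregister on shutdown. The sketch below models that lifecycle with an in-memory stand-in rather than real HTTP calls. The class and method names are hypothetical, and the job format is a placeholder because the document leaves it TBD.

```python
import uuid
from collections import deque


class InMemoryWorkerApi:
    """Hypothetical in-process stand-in for the worker endpoints."""

    def __init__(self):
        self.workers = {}
        self.job_queue = deque()

    def register_worker(self) -> str:
        """Models POST /v1/workers: returns a new workerId."""
        worker_id = str(uuid.uuid4())
        self.workers[worker_id] = {"id": worker_id}
        return worker_id

    def next_job(self, worker_id: str):
        """Models GET /v1/workers/{workerId}/jobs/next: a job, or None if idle."""
        if worker_id not in self.workers:
            raise KeyError(f"unknown worker: {worker_id}")
        return self.job_queue.popleft() if self.job_queue else None

    def unregister_worker(self, worker_id: str) -> None:
        """Models DELETE /v1/workers/{workerId}."""
        self.workers.pop(worker_id, None)


api = InMemoryWorkerApi()
api.job_queue.append({"id": "job-1", "action": "snapshot"})

worker_id = api.register_worker()
first = api.next_job(worker_id)   # picks up job-1
second = api.next_job(worker_id)  # nothing left to do
api.unregister_worker(worker_id)
print(first, second)
```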
Service
The service shall consist of a set of APIs, worker nodes, and a DB.
API - Provides a RESTful interface for adding schedules to the DB
Worker - References schedules in the DB to schedule and perform jobs
DB - Tracks schedules and currently executing jobs
Database
- schedules
- jobs
- job faults
- must be useful!
Implementation
Typical flow of the system is as follows.
- User makes request to Nova extension
- Nova extension passes request to API
- API picks time of day to schedule
- Adds schedule entry to DB
- Worker polls DB for schedules needing action
- Worker creates job entry in DB
- Worker initiates image snapshot
- Worker waits for completion while updating 'last_touched' field in the job table (to indicate the Worker has not died)
- Worker updates DB to show the job has been completed
- Worker polls until a schedule needs action
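The polling and job-creation steps of this flow can be sketched as below. The `next_run`, `id`, and `action` keys, the in-memory lists standing in for database tables, and the fixed one-day advance of `next_run` are all assumptions made to keep the example small.

```python
from datetime import datetime, timedelta


def create_due_jobs(schedules, jobs, now):
    """Create a queued job for every schedule whose next_run has arrived."""
    created = []
    for schedule in schedules:
        if schedule["next_run"] <= now:
            job = {
                "schedule_id": schedule["id"],
                "action": schedule["action"],
                "status": "queued",
                "last_touched": now,
            }
            jobs.append(job)
            created.append(job)
            # Placeholder recurrence: assume the schedule repeats daily.
            schedule["next_run"] = schedule["next_run"] + timedelta(days=1)
    return created


now = datetime(2013, 5, 3, 2, 30, 0)
schedules = [
    {"id": "sched-1", "action": "snapshot", "next_run": now - timedelta(minutes=1)},
    {"id": "sched-2", "action": "snapshot", "next_run": now + timedelta(hours=3)},
]
jobs = []

created = create_due_jobs(schedules, jobs, now)
print([job["schedule_id"] for job in created])
```

Advancing `next_run` in the same step keeps a second poll from creating a duplicate job for the same run.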
Edge cases:
Worker dies in middle of job:
- A different worker will see that the job has not been updated in a while and take over, performing any cleanup it can.
- Jobs contain information about where they left off and what image they were working on (this allows a job whose worker died in the middle of an upload to be resumed)
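Detecting a dead worker from the 'last_touched' field might look like the following sketch. The `max_silence` threshold, the `worker_id` key, and both helper names are assumptions; the document does not say how long a job may go untouched before another worker takes over.

```python
from datetime import datetime, timedelta


def find_stale_jobs(jobs, now, max_silence):
    """Return processing jobs whose worker has not touched them recently."""
    return [
        job
        for job in jobs
        if job["status"] == "processing" and now - job["last_touched"] > max_silence
    ]


def take_over(job, new_worker_id, now):
    """Reassign a stale job to another worker and refresh its last_touched."""
    job["worker_id"] = new_worker_id
    job["last_touched"] = now
    return job


now = datetime(2013, 5, 3, 20, 0, 0)
jobs = [
    {"id": "job-1", "status": "processing", "worker_id": "w-1",
     "last_touched": now - timedelta(minutes=30)},
    {"id": "job-2", "status": "processing", "worker_id": "w-2",
     "last_touched": now - timedelta(seconds=20)},
    {"id": "job-3", "status": "done", "worker_id": "w-1",
     "last_touched": now - timedelta(hours=2)},
]

stale = find_stale_jobs(jobs, now, max_silence=timedelta(minutes=5))
for job in stale:
    take_over(job, "w-3", now)
print([job["id"] for job in stale])
```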
Image upload fails
- Retry a certain number of times; afterwards, leave the image in an error state
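A bounded retry of this kind can be sketched as follows. The default of three attempts, the `active` and `saving` status strings, and the simulated upload function are assumptions for illustration; only the final `error` state comes from the text above.

```python
def upload_with_retries(upload, image, max_attempts=3):
    """Try the upload up to max_attempts times, then mark the image as errored."""
    for attempt in range(1, max_attempts + 1):
        try:
            upload(image)
        except OSError:
            image["attempts"] = attempt
            continue
        image["attempts"] = attempt
        image["status"] = "active"  # hypothetical success status
        return True
    image["status"] = "error"
    return False


def make_flaky_upload(failures):
    """Return a fake upload function that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def upload(image):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise OSError("simulated upload failure")

    return upload


recovered = {"id": "img-1", "status": "saving"}
gave_up = {"id": "img-2", "status": "saving"}

print(upload_with_retries(make_flaky_upload(2), recovered), recovered["status"])
print(upload_with_retries(make_flaky_upload(5), gave_up), gave_up["status"])
```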
Instance no longer exists
- Remove schedule for instance
Code Repository
References
- [1] QonoS was first described in https://wiki.openstack.org/wiki/Scheduled-images-service (29 October 2012).