Difference between revisions of "Qonos-scheduling-service"

Revision as of 20:06, 3 May 2013

  • Launchpad Entry: QonoS scheduling service
  • Created: 3 May 2013
  • Contributors: Alex Meade, Eddie Sheffield, Andrew Melton, Iccha Sethi, Nikhil Komawar, Brian Rosmaita

Summary

This document describes the design and API of QonoS, a distributed high-availability scheduling service that has been implemented for the cloud[1]. QonoS is currently used as the scheduling component of a scheduled images service that is invoked by a Nova extension, so many of the examples in this document discuss that use case.

Service responsibilities include:

  • Create scheduled tasks
  • Perform scheduled tasks
  • Handle rescheduling failed jobs
  • Maintain persistent schedules

QonoS was designed to work with OpenStack and uses OpenStack common components.

Conceptual Overview

The system consists of:

  • an API
  • a database
  • one or more schedulers, and
  • one or more workers.

The API handles communication, both external requests and internal communication. It creates the schedule for a request and stores it in the database.

The scheduler examines schedules and creates jobs.

A job describes a task that must be performed.

A worker performs a task. It obtains a task by polling the API and picking up the first task it is capable of handling.

Job Lifecycle

Jobs have the following statuses:

  • queued : the job is ready to be processed by a worker
  • processing : the job has been picked up by a worker
  • done : the worker processing this job has decided that the job has been successfully completed
  • timeout : the worker processing this job has decided the job is taking too long and has stopped processing it. A job in this state can be picked up by another worker.
  • error : the worker notes that something went wrong, but the job could be retried
  • canceled : the worker decides that the job can't be done and should not be retried
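As a rough illustration, the statuses above could be modelled as an enumeration. This is only a sketch: the names `JobStatus` and `can_be_picked_up` are hypothetical, and treating `error` jobs as eligible for pickup is an assumption based on the statement that such jobs "could be retried".

```python
from enum import Enum


class JobStatus(Enum):
    """Job statuses as listed above; the string values mirror the document."""
    QUEUED = "queued"
    PROCESSING = "processing"
    DONE = "done"
    TIMEOUT = "timeout"
    ERROR = "error"
    CANCELED = "canceled"


# Assumption: a job may be handed to a worker when it is queued, when its
# previous worker timed out, or when it failed in a retryable way.
_AVAILABLE = {JobStatus.QUEUED, JobStatus.TIMEOUT, JobStatus.ERROR}


def can_be_picked_up(status: JobStatus) -> bool:
    """Return True if a worker may start (or restart) a job in this status."""
    return status in _AVAILABLE
```

Under these assumptions, `done` and `canceled` are terminal, and `processing` jobs stay with their current worker.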

Job Timeouts

There are two kinds of timeouts:

  • hard timeout: once reached, the job is no longer available for retries
  • soft timeout: renewed periodically by the worker to indicate that it is still working on the task (similar to a heartbeat)
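The interplay of the two timeouts might look roughly like the sketch below. The class and method names are hypothetical, and capping the renewed soft timeout at the hard timeout is an assumption, not something this document specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobTimeouts:
    soft_timeout: datetime  # renewed by the worker, like a heartbeat
    hard_timeout: datetime  # after this, the job is no longer retried

    def heartbeat(self, now: datetime, extension: timedelta) -> None:
        """Renew the soft timeout (assumed never to exceed the hard timeout)."""
        self.soft_timeout = min(now + extension, self.hard_timeout)

    def worker_presumed_dead(self, now: datetime) -> bool:
        """True once the worker has failed to renew the soft timeout in time."""
        return now >= self.soft_timeout

    def is_retryable(self, now: datetime) -> bool:
        """True while the hard timeout has not yet been reached."""
        return now < self.hard_timeout
```

A job whose soft timeout has lapsed but whose hard timeout has not would then be a candidate for pickup by another worker.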

Job Failures

Job failures are reported as job faults and stored in the database.

Scalability

Address or remove?

Reliability

Address or remove?

Overall System Diagram

[Figure: Qonos Diagram.png]

Design

Entities

  • Schedule
    • the general description of what the service will do
    • looks something like
{
  "tenant_id" : <tenantId>,
  "schedule_id" : <scheduleId>,
  "job_type" : <keyword>,
  "metadata" : {
    // all the information for this job_type
    "key" : "value"
  },
  "schedule" : <the schedule info, exact format TBD>
}

API

CRUD for schedules

Create Schedule
POST <version>/schedules
    {"schedule":
        {
            "tenant": "tenant_username",
            "action": "snapshot",
            "minute": 30,
            "hour": 2,
            "day": 3,
            "day_of_week": 5,
            "day_of_month": 23,
            "metadata":
            {
                "instance_id": "some_uuid",
                "retention": "3"
            }
        }
    }
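
A client might assemble the request body shown above along the following lines. This is a minimal sketch: the helper name `build_create_schedule_body` is hypothetical, only a subset of the time fields is set, and `retention` is sent as a string simply because the example above does so.

```python
import json


def build_create_schedule_body(tenant, action, instance_id, retention, hour, minute):
    """Return the JSON body for POST <version>/schedules as a string."""
    schedule = {
        "tenant": tenant,
        "action": action,
        "minute": minute,
        "hour": hour,
        "metadata": {
            "instance_id": instance_id,
            # The example above sends retention as a string ("3").
            "retention": str(retention),
        },
    }
    return json.dumps({"schedule": schedule})


body = build_create_schedule_body(
    tenant="tenant_username", action="snapshot",
    instance_id="some_uuid", retention=3, hour=2, minute=30,
)
```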
List schedules
GET <version>/schedules
{
    "schedules":
    [
        {
            # schedule as above
        },
        {
            # schedule as above
        },
        ...
    ]
}
Query filters
  • next_run_after - only list schedules with next_run value >= this value
  • next_run_before - only list schedules with next_run value <= this value
Example

List schedules that start in the next five minutes:

GET <version>/schedules?next_run_after={Current_DateTime}&next_run_before={Current_DateTime+5_Minutes}
GET <version>/schedules?next_run_after=2012-05-16T15:27:36Z&next_run_before=2012-05-16T15:32:36Z
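
The two filter values can be computed from the current time. The sketch below assumes the timestamps are UTC in the format used in the example above, and that the version segment is `v1`; the function name `schedules_due_query` is hypothetical.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # matches 2012-05-16T15:27:36Z


def schedules_due_query(now, window=timedelta(minutes=5), version="v1"):
    """Build the path and query string for schedules due within `window`.

    `now` is assumed to already be a UTC datetime.
    """
    params = {
        "next_run_after": now.strftime(TIMESTAMP_FORMAT),
        "next_run_before": (now + window).strftime(TIMESTAMP_FORMAT),
    }
    # Keep ':' unescaped so the URL reads like the example above.
    return f"/{version}/schedules?{urlencode(params, safe=':')}"


url = schedules_due_query(datetime(2012, 5, 16, 15, 27, 36))
```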
Get a specific schedule
GET /v1/schedules/{id}
Update a schedule
PUT <version>/schedules/{id}
    {"schedule":
        {
            "minute": 45,
            "hour": 3
        }
    }
Delete a schedule
DELETE <version>/schedules/{id}

CRUD for jobs

GET /v1/jobs
GET /v1/jobs/{jobId}
DELETE /v1/jobs/{jobId}
GET /v1/jobs/{jobId}/status
GET /v1/jobs/{jobId}/heartbeat
PUT /v1/jobs/{jobId}/status
 * status in request body
PUT /v1/jobs/{jobId}/heartbeat
 * heartbeat for this job (exact format TBD) in request body

NOTES:

  • No POST, the job maker handles job creation.
  • The worker will mark the job status as 'done' (or another appropriate status) when it finishes.
  • The /status and /heartbeat calls may be combined into a single call; this has not been decided yet.

Worker related

GET /v1/workers
GET /v1/workers/{workerId}
GET /v1/workers/{workerId}/jobs/next
 * return job info, format TBD
POST /v1/workers
 * returns a workerId, is done when a worker is instantiated, allows the system to keep track of the worker
DELETE /v1/workers/{workerId}
 * should be called by the worker if/when it's safely taken down
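
The worker routes above suggest a lifecycle of register, poll for the next job, report status, and unregister. The sketch below runs that lifecycle against an in-memory stand-in rather than real HTTP calls. `InMemoryApi` and `run_worker` are hypothetical names, and marking a job as `processing` when it is handed out is an assumption, since the response format is still TBD.

```python
import itertools


class InMemoryApi:
    """Hypothetical stand-in for the API; comments name the route each method mimics."""

    def __init__(self, jobs):
        self._ids = itertools.count(1)
        self.workers = set()
        self.jobs = jobs  # job_id -> {"status": ...}

    def register_worker(self):  # POST /v1/workers
        worker_id = next(self._ids)
        self.workers.add(worker_id)
        return worker_id

    def next_job(self, worker_id):  # GET /v1/workers/{workerId}/jobs/next
        for job_id, job in self.jobs.items():
            if job["status"] == "queued":
                job["status"] = "processing"  # assumption: handout marks the job
                job["worker_id"] = worker_id
                return job_id
        return None

    def set_status(self, job_id, status):  # PUT /v1/jobs/{jobId}/status
        self.jobs[job_id]["status"] = status

    def unregister_worker(self, worker_id):  # DELETE /v1/workers/{workerId}
        self.workers.discard(worker_id)


def run_worker(api, perform):
    """Process queued jobs until none remain, then unregister the worker."""
    worker_id = api.register_worker()
    try:
        while (job_id := api.next_job(worker_id)) is not None:
            try:
                perform(job_id)
                api.set_status(job_id, "done")
            except Exception:
                # Best effort: record the failure and move on to the next job.
                api.set_status(job_id, "error")
    finally:
        api.unregister_worker(worker_id)
```

A real worker would keep polling rather than exit when no job is available; the loop ends here only to keep the sketch finite.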


Service

The service shall consist of a set of APIs, worker nodes, and a DB.

API - Provides a RESTful interface for adding schedules to the DB

Worker - References schedules in the DB to schedule and perform jobs

DB - Tracks schedules and currently executing jobs

Database

  • schedules
  • jobs
  • job faults
    • must be useful!

Implementation

The typical flow of the system is as follows.

  1. User makes request to Nova extension
  2. Nova extension passes request to API
  3. API picks time of day to schedule
  4. Adds schedule entry to DB
  5. Worker polls DB for schedules needing action
  6. Worker creates job entry in DB
  7. Worker initiates image snapshot
  8. Worker waits for completion while updating 'last_touched' field in the job table (to indicate the Worker has not died)
  9. Worker updates DB to show the job has been completed
  10. Worker polls until a schedule needs action

Edge cases:

Worker dies in middle of job:

  • A different worker will see that the job has not been updated in a while and take over, performing any cleanup it can.
  • Jobs contain information about where they left off and which image they were working on (this allows a job whose worker died in the middle of an upload to be resumed)
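
Detecting such abandoned jobs could be as simple as comparing each job's 'last_touched' value with the current time. In the sketch below, the function name `find_stale_jobs`, the dictionary layout, and the `max_silence` threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_stale_jobs(jobs, now, max_silence):
    """Return ids of jobs being processed whose worker has gone quiet.

    `jobs` maps job_id -> {"status": str, "last_touched": datetime}.
    """
    return [
        job_id
        for job_id, job in jobs.items()
        if job["status"] == "processing"
        and now - job["last_touched"] > max_silence
    ]
```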

Image upload fails

  • Retry a certain number of times; afterwards, leave the image in an error state
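
A bounded retry of this kind might be sketched as follows. The name `upload_with_retries` and the default of three attempts are assumptions; the document does not fix the number of retries.

```python
def upload_with_retries(upload, max_attempts=3):
    """Call `upload` until it succeeds or `max_attempts` is exhausted.

    Returns True on success. On False, the caller is expected to leave
    the image in an error state.
    """
    for _ in range(max_attempts):
        try:
            upload()
            return True
        except Exception:
            continue  # try again until the attempts run out
    return False
```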

Instance no longer exists

  • Remove schedule for instance

Code Repository

References

  1. QonoS was first described in https://wiki.openstack.org/wiki/Scheduled-images-service (29 October 2012).