HTTP Caching In Teller
This document describes the proposed plan to implement HTTP caching in the Teller image delivery service in Glance.
Glance consists of two services: Parallax, the image registry, which stores metadata describing each image and where to fetch it, and Teller, which acts as a proxy for the object store containing the actual image data. Both Parallax and Teller are HTTP servers and can therefore benefit from the performance improvements offered by HTTP caching. The following is a proposal for how to add HTTP caching to the Glance project, and in particular to the Teller sub-project (for Parallax see ParallaxHttpCaching).
HTTP caching is not the only type of caching that could improve the speed of OpenStack builds. Down the road, we leave open the option of adding memcached to Parallax, a BitTorrent distribution system within the cluster, or any number of other approaches. We are starting with HTTP caching because it offers dramatic savings in bandwidth and transfer time for relatively little work and, at the same time, has a clear implementation path dictated by RFC 2616.
Background on HTTP Caching
HTTP caching is built on two fundamental concepts: freshness (also known as cache expiration) and validation. The expiration policy is governed by the max-age directive of the Cache-Control header (we are not using the Expires header since it requires clock synchronization between the client and the origin server). Validation, the process of verifying that cached data is still accurate, relies on a validator header: either Last-Modified or the Etag header added by HTTP/1.1. For this spec, we will only use Etag (Last-Modified suffers from the same clock synchronization issues as Expires).
These two headers, Cache-Control: max-age and Etag, will provide the information that client caches or transparent caching proxies (Squid or Varnish) need to make informed cache decisions regarding eviction and validation.
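As an illustration of the freshness rule, the sketch below extracts max-age from a Cache-Control value and decides whether a cached response of a given age may still be served without contacting the origin. The function names parse_max_age and is_fresh are hypothetical, and the parser handles only simple comma-separated directives; it is not a complete RFC 2616 implementation.

```python
def parse_max_age(cache_control):
    """Return max-age in seconds, or None if it is absent or malformed."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            if value.isascii() and value.isdigit():
                return int(value)
            return None
    return None


def is_fresh(age_seconds, cache_control):
    """True if a response that is age_seconds old is still within max-age."""
    max_age = parse_max_age(cache_control)
    return max_age is not None and age_seconds < max_age


print(is_fresh(30, "public, max-age=3600"))
print(is_fresh(7200, "public, max-age=3600"))
```

A response with no usable max-age is treated as stale here, which errs on the side of contacting the origin.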
Caching In Teller
Teller's main role in OpenStack is to resolve image_uris (which act as globally unique identifiers for images) into image data. It does this by looking up the image_uri in an image registry (in this case Parallax) and then fetching the data from a backend. The lookups to Parallax will also benefit from caching, but that is covered in a separate blueprint (ParallaxHttpCaching). This blueprint discusses how we can cache the image data itself. Since this data may be on the order of gigabytes, the savings in bandwidth and transfer time are potentially very large.
Caching in Teller is predicated on a single extremely important assumption: image data is immutable. This means that if a user would like to modify an image, the image will need to be re-registered in Parallax. The benefit of making this assumption, aside from simplicity, is that we can now cache that image safely at various layers without having to make validation requests to the origin object store (usually Swift). Of course, an image may be deleted or have some other relevant part of its metadata change, so Teller will still need to make lookups in Parallax to ensure availability of the image. (NOTE: for security reasons, whenever we fetch an image from the object store we still need to perform checksum validation to ensure the image described by Parallax matches what actually resides in the backend object store. In other words, the assumption above states that once image data is determined to be valid, it is by definition valid for as long as the image is available.)
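The checksum validation mentioned in the note could look roughly like the following sketch. This blueprint does not fix a checksum algorithm, so SHA-256 is an assumption, and checksum_matches is a hypothetical helper rather than an existing Teller function. The data is hashed in chunks so that a multi-gigabyte image never has to be held in memory.

```python
import hashlib
import io


def checksum_matches(stream, expected_hex, chunk_size=64 * 1024):
    """Hash the stream in chunks and compare with the registered checksum."""
    digest = hashlib.sha256()  # assumption: the blueprint names no algorithm
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


# Stand-ins for the object-store data and the checksum held by the registry.
image = b"fake image bytes"
registered = hashlib.sha256(image).hexdigest()

print(checksum_matches(io.BytesIO(image), registered))
print(checksum_matches(io.BytesIO(image + b"!"), registered))
```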
Description of Changes
Several changes will need to be made in Teller and Parallax to accommodate HTTP caching. Since images may change state often (in the case of one-off snapshots) or almost never (base installs), there is a need for a per-image notion of cache expiration. This will involve the addition of a new column to the Image model in Parallax called cache_expiry. This will be discussed in further detail in ParallaxHttpCaching; the important thing to note is that images possess a freshness window which allows us to avoid round-trips to fetch or validate.
In addition, a prerequisite, as noted above for security reasons, is the computation of a checksum so that we can validate that an image described in Parallax matches the objects stored in Swift (or another backend). This is not strictly necessary for caching; however, the sooner we enforce image immutability, the better off we will be.
We will then need to modify Teller's controller to emit the proper Cache-Control: max-age and Etag headers. Of these two headers, only max-age will actually be used. Since, by definition, the entity body (the image data) is always valid, the Etag is returned for compatibility with transparent caching proxies and HTTP clients down the line that are not aware of this useful property.
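A minimal sketch of the header generation follows. Several details are assumptions rather than part of this blueprint: that image metadata arrives as a dict, that cache_expiry is expressed as a freshness window in seconds (its exact semantics are left to ParallaxHttpCaching), and that the image checksum is reused as the Etag value. The function name build_cache_headers is hypothetical.

```python
def build_cache_headers(image_meta):
    """Return the caching headers for an image response.

    Assumptions (not from the blueprint): image_meta is a dict,
    cache_expiry is a freshness window in seconds, and the image
    checksum doubles as the entity tag.
    """
    max_age = max(0, int(image_meta["cache_expiry"]))
    return {
        "Cache-Control": "max-age=%d" % max_age,
        # Entity tags are quoted strings in HTTP/1.1.
        "Etag": '"%s"' % image_meta["checksum"],
    }


headers = build_cache_headers({"cache_expiry": 86400, "checksum": "abc123"})
for name, value in sorted(headers.items()):
    print("%s: %s" % (name, value))
```

Deriving the Etag from the checksum is convenient because immutable image data always yields the same tag, but any stable per-image identifier would serve the same purpose.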
Once we have Teller acting like a well-behaved, cache-aware HTTP server, it is up to the clients, be they transparent caching proxies or HTTP clients, to actually use this information to speed up response times.
This blueprint does not prescribe how this should be done, though it does offer one possibility.
As a first step, we may place Squid, Varnish, or another HTTP caching proxy in front of Teller. Teller would then be responsible only for fetching an image the first time and for validating that a given image is still available, dramatically reducing its load.
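To make the intended division of labour concrete, here is a toy in-memory cache that stands in for the proxy. It is a sketch only: ImageCache and fake_teller are hypothetical names, the origin is modelled as a plain function instead of an HTTP call, and revalidation is assumed to work like a conditional request in which the origin answers 304 when the Etag still matches.

```python
class ImageCache:
    """Toy in-memory cache keyed by image URI (illustration only)."""

    def __init__(self, fetch_origin):
        # fetch_origin(uri, etag) -> (status, max_age, etag, body)
        self.fetch_origin = fetch_origin
        self.entries = {}

    def get(self, uri, now):
        entry = self.entries.get(uri)
        if entry and now - entry["stored_at"] < entry["max_age"]:
            return entry["body"]  # fresh: no request reaches the origin
        known_etag = entry["etag"] if entry else None
        status, max_age, etag, body = self.fetch_origin(uri, known_etag)
        if status == 304 and entry:
            # Still available and unchanged: restart the freshness window.
            entry["stored_at"] = now
            entry["max_age"] = max_age
            return entry["body"]
        if status == 200:
            self.entries[uri] = {"stored_at": now, "max_age": max_age,
                                 "etag": etag, "body": body}
            return body
        # Any other status (for example 404): the image is gone, so drop it.
        self.entries.pop(uri, None)
        return None


calls = []


def fake_teller(uri, etag):
    """Hypothetical origin: 304 if the caller's Etag matches, else 200."""
    calls.append((uri, etag))
    if etag == '"abc"':
        return 304, 60, etag, None
    return 200, 60, '"abc"', b"image-bytes"


cache = ImageCache(fake_teller)
cache.get("img-1", now=0)   # miss: full fetch from the origin
cache.get("img-1", now=30)  # fresh: served from the cache
cache.get("img-1", now=90)  # stale: revalidated, body not re-sent
print(len(calls))
```

Of the three requests, only the first transfers image data from the origin; the second never leaves the cache and the third costs a single lightweight revalidation.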
Below are web sequence diagrams that show how Glance would behave with a transparent caching proxy inserted between the Nova Node and the Teller service.
The following states are depicted: