Image handling in edge environment

This page contains a summary of the Vancouver Forum discussions about the topic. Full notes of the discussion are here. The features and requirements for edge cloud infrastructure are described in OpenStack_Edge_Discussions_Dublin_PTG. The source of the figures is here.

= Synchronisation strategies =
 * Copy every image to every edge cloud instance: the simplest but least optimal solution
 * Copy images only to those edge cloud instances where they are needed
 * Provide a synchronisation policy together with the image
 * Rely on the pulling of the images
 * Per clarkb, nodepool might be a good option here. It will aggressively try to ensure that the images you want are in all of the locations you want them in.
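
The first four strategies differ mainly in which edge cloud instances receive an image before it is first used. The sketch below is only an illustration of that difference: the `Strategy` names, the `Image` fields `needed_at` and `policy_sites`, and the function `push_targets` are all hypothetical and are not part of Glance or nodepool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Strategy(Enum):
    COPY_EVERYWHERE = "copy_everywhere"      # every image to every edge site
    COPY_WHERE_NEEDED = "copy_where_needed"  # only to sites that need the image
    POLICY_WITH_IMAGE = "policy_with_image"  # the image carries its own policy
    PULL = "pull"                            # nothing is pushed; edges pull on demand


@dataclass
class Image:
    name: str
    # Hypothetical bookkeeping fields, not Glance image properties.
    needed_at: set = field(default_factory=set)
    policy_sites: set = field(default_factory=set)


def push_targets(strategy, image, all_sites):
    """Return the edge sites the image is copied to ahead of first use."""
    sites = set(all_sites)
    if strategy is Strategy.COPY_EVERYWHERE:
        return sites
    if strategy is Strategy.COPY_WHERE_NEEDED:
        return image.needed_at & sites
    if strategy is Strategy.POLICY_WITH_IMAGE:
        return image.policy_sites & sites
    if strategy is Strategy.PULL:
        return set()
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With three sites and an image needed only at `edge-2`, `COPY_EVERYWHERE` yields all three sites, `COPY_WHERE_NEEDED` yields only `edge-2`, and `PULL` yields an empty set because the image is transferred only when an edge site asks for it.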

= CURRENT Architecture options for Glance =

This section contains an updated view of the Glance Architectures currently being looked at.

Glances with 1 or more Backends and Synchronized DBs (optionally w/caching)
NOTE: Current recommendation is this solution with single shared central backend, DB Synchronization and Image Caching enabled. Additional backends would be an enhancement.

Description:
 * Central Glance has Multiple Backends, with R/W Access to all Backends
   * A Locally located Backend (DEFAULT BACKEND), and
   * Nx Remotely located Backends; 1x for each of the Nx Edge Clouds.
 * Each Edge Cloud Glance has Nx Backends, with R/O Access to these Backends
   * A Locally located Backend (DEFAULT BACKEND),
   * A Remotely located Backend; the Central Glance's Local Backend, and
   * (N-1)x Remotely located Backends; the other Edge Clouds' Local Backends.
   * NOTE: Assuming the Glance DBs are replicated between Central and Edge, sharing backends between Glance frontends should be little or no work.
 * Central Glance's R/W DB (including metadata) is being DB-Replicated to the Edge Clouds' Glance's R/O DBs.
 * A User at Central Glance
   * Adds Images to Central Glance, into DEFAULT Centrally located backend.
   * Images that are required on different Edge Cloud Sites are manually copied to Remote Backends representing these sites.
   * NOTE: The ability to copy images between backends is incremental work on top of the Multiple Glance Backends work.
 * A User at Edge Glance
   * Has R/O access to Glance DB and ALL Backends.
   * Accesses to images stored in this Edge's Local Backend
     * Are fast, and
     * Are still accessible when connectivity is lost to other sites.
   * Accesses to images stored in Central's Backend (or other Edges' Backends)
     * Are slow,
     * Are NOT accessible when connectivity is lost to other sites,
     * ALTHOUGH, if caching is enabled, images would be cached in order to speed up subsequent uses of the image, and cached images would be accessible when connectivity is lost to other sites. This is based on existing caching capabilities and the assumption that the Glance DB is being replicated.

Pros:
 * Leverages existing and/or planned Glance capabilities.
 * Technically the Edge Backends could be optional. One could implement the solution with ONLY the Central Glance's Backend and NO Local Edge Backends in the Edge Glance, such that you are relying on caching for implicit availability of image at Edge Cloud.
 * Multiple Backends provides more reliability in ensuring that the required image is available at the Edge when disconnected. For example, a previously cached image at the Edge could be unknowingly purged from the cache when caching an additional image at the Edge, and then not be available when disconnected from the Central site.

Cons:
 * The Multiple Backend approach requires manual push/copy of image to the Edge Cloud. One has to proactively know what images are required at each Edge site.
 * The Multiple Backend approach requires explicit configuration at Central Cloud Glance for each new Edge Cloud added.
 * I don't believe all Glance Backend types would be supported, for example a local file backend would not be supported. Only Backends that are by definition remotely accessible would work.
 * The Edge Glance must be aware and configured with the details of the Central Glance Backend. (Compared to main proposal where the Central Glance Backend is transparent to Edge Glance.)
 * Multiple Backends must explicitly support scaling the number of backends supported to the number of edge clouds required, which could potentially be in the 100s or 1,000s.
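
The availability argument above (local backend always readable, remote backends readable only while connected or while the image is still cached) can be illustrated with a toy model. This is a minimal sketch and not Glance code: the class `EdgeGlanceModel`, its method `read`, and the count-based least-recently-used eviction with `cache_slots` entries are assumptions made for illustration only.

```python
from collections import OrderedDict


class EdgeGlanceModel:
    """Toy model of one Edge Glance with a local backend and an image cache."""

    def __init__(self, local_images, cache_slots):
        self.local_backend = set(local_images)  # images copied to this edge's backend
        self.cache = OrderedDict()              # cache of remotely stored images
        self.cache_slots = cache_slots          # assumed cache capacity, in images
        self.connected = True                   # connectivity to the other sites

    def read(self, image_id):
        """Return where the image was served from, or None if it is unavailable."""
        if image_id in self.local_backend:
            return "local-backend"              # fast, works when disconnected
        if image_id in self.cache:
            self.cache.move_to_end(image_id)    # mark as most recently used
            return "cache"
        if not self.connected:
            return None                         # remote backend is unreachable
        # Slow path: read from a remote backend and keep a cached copy.
        self.cache[image_id] = True
        if len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)      # evict the least recently used image
        return "remote-backend"
```

With `cache_slots=1`, reading `img-b` and then `img-c` from remote backends evicts `img-b`. After connectivity is lost, an image in the local backend and `img-c` are still served, while `img-b` is not. That is the purge risk described under Pros.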

Edge Glances using Central Glance as Backend w/Caching
Description:
 * Central Glance is a full typical Glance deployment.
 * Edge Glances are a Glance API using a 'NEW' data-access / backend type which is a remote Glance (the central Glance in this case).
 * Edge Glances, using the Central Glance's API, access both image and image-meta data from the Central Glance.
 * Edge Glances will cache both image and image-meta data locally.
 * Edge Glances, when disconnected from their backend (i.e. the Central Glance), will use the image-meta data cache and the image cache to serve local requests.

Pros:
 * Does not require any explicit DB Synchronization; all done through caching.

Cons:
 * Requires Glance changes not currently planned.
 * Maintaining cache synchronization in preparation for potential loss of connectivity requires an audit to keep the cache in sync, using up valuable resources.
 * Depending on the Keystone Solution, authentication of Edge Glance with Central Glance could require an additional authentication step.
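
The behaviour described above can be sketched as a read-through cache in front of the Central Glance. This is only a sketch under assumptions: `CentralGlanceStub`, `EdgeGlance`, `CentralUnreachable` and all method names are hypothetical stand-ins, and the choice to refresh metadata on every connected request while downloading image data only on a cache miss is one possible policy, not something the proposal specifies.

```python
class CentralUnreachable(Exception):
    """Raised by the stub when the edge has lost connectivity to the centre."""


class CentralGlanceStub:
    """In-memory stand-in for the Central Glance API (hypothetical interface)."""

    def __init__(self, images):
        self.images = dict(images)  # image_id -> (metadata dict, image bytes)
        self.reachable = True
        self.data_downloads = 0     # counts image data transfers to the edge

    def _check(self):
        if not self.reachable:
            raise CentralUnreachable()

    def get_metadata(self, image_id):
        self._check()
        return dict(self.images[image_id][0])

    def get_data(self, image_id):
        self._check()
        self.data_downloads += 1
        return self.images[image_id][1]


class EdgeGlance:
    """Edge Glance that uses the Central Glance as backend and caches locally."""

    def __init__(self, central):
        self.central = central
        self.meta_cache = {}
        self.data_cache = {}

    def get_image(self, image_id):
        try:
            # Connected: refresh metadata, download data only on a cache miss.
            self.meta_cache[image_id] = self.central.get_metadata(image_id)
            if image_id not in self.data_cache:
                self.data_cache[image_id] = self.central.get_data(image_id)
        except CentralUnreachable:
            # Disconnected: serve only what both caches already hold.
            if image_id not in self.meta_cache or image_id not in self.data_cache:
                raise
        return self.meta_cache[image_id], self.data_cache[image_id]
```

In this sketch an image requested at least once while connected stays available after connectivity is lost, while an image never requested before raises `CentralUnreachable`.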

= PREVIOUS Architecture options for Glance =

Legend for the figures:

One Glance with multiple backends - pull mode
There is one central Glance that is capable of handling multiple backends. Every edge cloud instance is represented by a Glance backend in the central Glance. Every edge cloud instance runs a Glance API server which is configured so that it uses the central Glance to pull the images, but also uses a local cache. All Glance API servers use a shared database, and the database clustering is outside Glance's responsibility. Each Glance API server can access the others' images, but uses its own by default. When accessing a remote image, the Glance API streams the image from the remote location (in this sense this is pull mode). OpenStack services in the edge cloud instances use the images in the Glance backend via a direct URL. Work on multiple backends has already been started with an etherpad and a spec. Cascading (one edge cloud instance is the receiver and the source of the images at the same time) is possible if the central Glance is able to orchestrate the synchronisation of images.
 * Concerns/Questions
   * Network partitioning tolerance?
     * Network connection is required when a given image is started for the first time.
   * Is it safe to store database credentials in the far edge? (It is not possible to provide image access without network connection and not store the database credentials in the edge cloud instance)
     * It is unavoidable in this case.
   * Can the OpenStack services running in the edge cloud instances use the images from the local Glance? There is a worry that OpenStack services (e.g. nova) still need to get the direct URL via the Glance API which is ONLY available at the central site.
     * This can be avoided by allowing Glance to access all the CEPH instances as backends.
   * Is the CEPH backend CEPH block or CEPH RGW in the figure?
     * Glance talks directly to the CEPH block. As an alternative it is possible to use Swift from Glance and use CEPH RGW as a Swift backend.
   * Did I get it correctly that the CEPH backends are in a replication configuration?
     * Not always
 * Pros
   * Relatively easy to implement based on the current Glance architecture
   * Smaller storage needs on the edge cloud instances
   * Images are automatically loaded to the edge cloud instances from the central Glance
 * Cons
   * Requires the same Glance backend in every edge cloud instance
   * Requires the same OpenStack version in every edge cloud instance (apart from during upgrade)
   * Sensitivity to network connection loss is not clear
   * There is no explicit control over how long an image stays cached
   * It is not possible to invalidate an image in the cache
   * It is mandatory to store the Glance database credentials in every edge cloud instance
   * The images are pulled

One Glance with multiple backends - distributed storage
There is one central Glance that is capable of handling multiple backends. Every edge cloud instance is represented by a Glance backend in the central Glance. Every edge cloud instance runs a Glance API server which is configured so that it uses the central Glance to pull the images, but also uses a local cache. All Glance API servers use a shared database, and the database clustering is outside Glance's responsibility. The storage backend supports clustering and is available in every edge cloud instance. The clustering of the storage is not in the scope of Glance. OpenStack services in the edge cloud instances use the images in the Glance backend via a direct URL. Work on multiple backends has already been started with an etherpad and a spec. In Rocky several backends are already supported, but they have to be of different types. Support for multiple backends of the same type is not implemented yet. Cascading (one edge cloud instance is the receiver and the source of the images at the same time) is possible if the central Glance is able to orchestrate the synchronisation of images.
 * Concerns/Questions
 * Pros
   * Relatively easy to implement based on the current Glance architecture
 * Cons
   * Requires the same Glance backend in every edge cloud instance
   * Requires the same OpenStack version in every edge cloud instance (apart from during upgrade)
   * Sensitivity to network connection loss is not clear
   * Every image is replicated to every edge cloud instance
   * It is mandatory to store the Glance database credentials in every edge cloud instance
   * Adding new edge cloud instances requires the reconfiguration of the distributed database

Several Glances with an independent synchronisation service, synch via Glance API
Every edge cloud instance has its own Glance instance. There is a synchronisation service that is able to instruct the Glances to do the synchronisation. The synchronisation of the image data is done using the Glance APIs. Cascading is visible to the synchronisation service only, not to Glance.
 * Pros
   * Every edge cloud instance can have a different Glance backend
   * Can support multiple OpenStack versions in the different edge cloud instances
   * Can be extended to support multiple VIM types
   * Using the API provides support for schema changes in the metadata
 * Cons
   * Needs a new synchronisation service
   * At this moment certain metadata is not visible to API users and it is not possible to assign metadata to an image
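
A minimal sketch of what such a synchronisation service could do is given below, assuming each site's Glance exposes list, download and upload operations. `GlanceEndpointStub` and `synchronise` are hypothetical names and the stub is an in-memory stand-in, not a real Glance client. Cascading appears only as the order in which the service calls `synchronise`; the Glances themselves know nothing about it.

```python
class GlanceEndpointStub:
    """In-memory stand-in for one site's Glance API (hypothetical interface)."""

    def __init__(self, name, images=None):
        self.name = name
        self._images = dict(images or {})  # image name -> image bytes

    def list_images(self):
        return set(self._images)

    def download(self, image):
        return self._images[image]

    def upload(self, image, data):
        self._images[image] = data


def synchronise(source, target, wanted):
    """Copy the wanted images that target lacks from source; return their names."""
    missing = (set(wanted) & source.list_images()) - target.list_images()
    for image in sorted(missing):
        target.upload(image, source.download(image))
    return sorted(missing)


# Cascading: central -> edge-1 -> far-edge, driven entirely by the service.
central = GlanceEndpointStub("central", {"cirros": b"c", "ubuntu": b"u"})
edge_1 = GlanceEndpointStub("edge-1")
far_edge = GlanceEndpointStub("far-edge")
synchronise(central, edge_1, wanted={"cirros", "ubuntu"})
synchronise(edge_1, far_edge, wanted={"cirros"})
```

Because the copy goes through the API of each site, the source and target in this sketch could run different backends or versions, which is the main advantage listed above.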

Several Glances with an independent synchronisation service, synch using the backend
Every edge cloud instance has its own Glance instance. There is a synchronisation service that is able to instruct the Glances to do the synchronisation. The synchronisation of the image data is the responsibility of the backend (e.g. CEPH). Cascading is visible to the synchronisation service only, not to Glance.
 * Concerns/Questions
   * Is CEPH backend CEPH block or CEPH RBD in the figure?
     * CEPH RBD
 * Pros
 * Cons
   * Needs a new synchronisation service

One Glance and multiple Glance API servers
There is one central Glance and every edge cloud instance runs a separate Glance API server. These Glance API servers communicate with the central Glance. The backend is accessed centrally; there is only caching in the edge cloud instances. Nova image caching caches at the compute node, which would work well for all-in-one edge clouds or small edge clouds. glance-api caching, however, caches at the edge cloud level, so it works better for large edge clouds with lots of compute nodes. A description is available in slides. Cascading is not possible, as only the pulling strategy works in this case.
 * Concerns/Questions
   * Do we plan to use Nova image caching or caching in the Glance API server?
   * Are the image metadata also cached in the Glance API server or only the images?
 * Pros
   * Implicitly location aware
 * Cons
   * First usage of an image always takes a long time
   * In case of a network connection error to the central Glance, Nova will have access to the images, but will not be able to figure out if the user has rights to use the image and will not have a path to the image data
   * At the moment the Glance API cache does not support metadata caching.
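
The difference between the two caching levels can be made concrete with a small count of first-use transfers from the central Glance. The function `wan_downloads` below is a hypothetical illustration. It assumes that every compute node in one edge cloud boots the same image at least once, that Nova caches per compute node, and that nothing is evicted from either cache.

```python
def wan_downloads(compute_nodes_using_image, glance_api_cache):
    """First-use downloads of one image from the central Glance for one edge cloud."""
    if compute_nodes_using_image == 0:
        return 0
    if glance_api_cache:
        # The edge-level glance-api cache is filled once and then serves
        # every compute node in that edge cloud.
        return 1
    # With Nova image caching only, each compute node fetches its own copy.
    return compute_nodes_using_image
```

Under these assumptions an all-in-one edge cloud transfers the image once in either case, while an edge cloud with 50 compute nodes transfers it 50 times with Nova caching alone and once with glance-api caching.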

= Edge Scenarios for Glance =

The two scenarios below describe the flow of booting a Nova instance from a Glance image with a distributed and a centralized control architecture.



This scenario describes how to boot a Nova instance when Nova runs on the edge site and the image is getting cached on the Edge site where we need to cache metadata as well.


 * 1) User uploads image to glance in regional datacenter
 * 2) Glance stores the image
 * 3) User calls Nova to boot instance, passes the image name or ID of the new image
 * 4) Nova calls glance in the Edge site to see if image exists
 * 5) Glance checks locally if it has the image
 * 6) If glance does not have the image it will call glance in the Regional DC to confirm the image exists (go to step 6)
 * 7) If glance does have the image, and the image updated_at is greater than the glance cache TTL, then it will call glance in the regional datacenter to confirm the metadata is still the same. If not it will update its copy of the metadata.
 * 8) Nova tells nova-compute to boot the instance
 * 9) Nova-compute calls glance to fetch the image
 * 10) If glance does not have the image it will call glance in the Regional DC to download the image (Go to step 8)
 * 11) If glance does have the image, and the image updated_at is greater than the glance cache TTL, then it will call glance in the regional datacenter to update its copy of the metadata (go to step 8)
 * 12) If glance has the image and updated_at is less than TTL, go to step 10
 * 13) Glance in the Edge calls glance in the regional datacenter to download the image and its metadata
 * 14) Glance stores the image locally
 * 15) Glance returns the image to nova-compute, and the instance boots
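
The fetch decision in the steps above (image missing, image cached but older than the TTL, image cached and fresh) can be sketched as follows. This is one reading of the steps, not Glance behaviour: `RegionalGlanceStub` and `EdgeGlanceCache` are hypothetical, and `updated_at` is assumed to mean the time at which the edge copy of the metadata was last refreshed.

```python
import time


class RegionalGlanceStub:
    """In-memory stand-in for Glance in the regional datacenter (hypothetical)."""

    def __init__(self):
        self.images = {}    # image_id -> (metadata dict, image bytes)
        self.downloads = 0  # counts image data transfers to the edge

    def upload(self, image_id, metadata, data):
        self.images[image_id] = (dict(metadata), data)

    def get_metadata(self, image_id):
        return dict(self.images[image_id][0])

    def download(self, image_id):
        self.downloads += 1
        return self.images[image_id][1]


class EdgeGlanceCache:
    """Edge Glance that caches images and metadata with a TTL on the metadata."""

    def __init__(self, regional, ttl_seconds, clock=time.monotonic):
        self.regional = regional
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}   # image_id -> {"metadata", "data", "updated_at"}

    def fetch(self, image_id):
        now = self.clock()
        entry = self.entries.get(image_id)
        if entry is None:
            # Not cached: download image and metadata, then store them locally.
            entry = {
                "metadata": self.regional.get_metadata(image_id),
                "data": self.regional.download(image_id),
                "updated_at": now,
            }
            self.entries[image_id] = entry
        elif now - entry["updated_at"] > self.ttl:
            # Cached but older than the TTL: refresh the metadata copy only.
            entry["metadata"] = self.regional.get_metadata(image_id)
            entry["updated_at"] = now
        # Cached and fresh (or just refreshed): return the local copy.
        return entry["metadata"], entry["data"]
```

The sketch refreshes only the metadata once the TTL has expired, as the steps do. Whether changed image data must be downloaded again is left open, as in the open questions at the end of this page.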



This scenario shows Nova running in the Regional Datacenter as we don't need control on the edge site in case of a connection loss.


 * 1) User uploads image to glance in main datacenter
 * 2) Glance stores the image
 * 3) User calls Nova to boot instance, passes the image name or ID of the new image
 * 4) Nova calls glance in the main datacenter to see if image exists
 * 5) Nova tells nova-compute to boot the instance
 * 6) Nova-compute calls glance in the edge site to fetch the image
 * 7) If glance does not have the image it will call glance in the Main DC to download the image (Go to step 7)
 * 8) If glance does have the image, and the image updated_at is greater than the glance cache TTL, then it will call glance in the main datacenter to update its copy of the metadata (go to step 9)
 * 9) If glance has the image and updated_at is less than TTL, go to step 10
 * 10) Glance in the Edge calls glance in the main datacenter to download the image and its metadata
 * 11) Glance stores the image locally
 * 12) Glance returns the image to nova-compute, and the instance boots

Work items:


 * As a user of Glance, I want to upload an image in the regional datacenter and boot that image in an edge datacenter. Fetch the image to the edge datacenter with its metadata.

Open questions:


 * What happens if the Glance image is updated on the source - (how) does the cache know that it needs to download a new copy?
 * (longer term) What happens to the image if the network fails during the cache update? Does it need to restart in full, or can it be an rsync-like incremental update?