OpenStack Edge Discussions Dublin PTG

Intro
This page collects the topics discussed at the Edge Workshop of the Dublin PTG. If there is an error on this page or some information is missing, please go ahead and correct it.

The discussions were noted in the following etherpads:
 * PTG schedule
 * Gap analysis
 * Alan's problems

Definitions

 * Application Sustainability: VMs/containers/bare-metal machines (i.e. workloads) already deployed on an edge site can continue to serve requests, e.g. a local user can SSH into them.
 * Control site(s): Sites that host only control services (i.e. these sites do not aim at hosting compute workloads). Please note that there is no particular interest in having such a site yet. We just need the definition of a control site for the gap analysis and the different deployment scenarios that can be considered.
 * Edge cloud infrastructure user: The users who are in direct contact with the edge cloud infrastructure via its different APIs.
 * Edge cloud service user: The users who use the services running in the edge cloud infrastructure. These users do not interact with the edge cloud infrastructure and, in the ideal case, they are not aware of its existence.
 * Edge site(s): Sites where servers may deliver control and compute capabilities.
 * Original site: The site where the operation is performed/executed initially.
 * Remote site(s): Site(s) that are affected by an operation launched from the Original site.
 * Site Sustainability: Local administrators/local users should be able to administer/use local resources when the site is disconnected from remote sites.
 * MVS (Minimum Viable Solution): Required for an edge cloud solution
 * Non-MVS: An Edge Cloud can be viable without those components

Example of Remote and Original sites
In the following figure, Edge cloud site 1 is the Original site of the operation, while Edge cloud sites 2 and 3 are the Remote sites of the operation. The operation in the Original site is always triggered by the user of the edge cloud infrastructure, while in a Remote site it is triggered by the Original site or by another Remote site.

Edge use cases
Although it was not noted in the etherpads, there were many discussions about the use cases for edge clouds. The Edge Computing Group collected the use cases in the OpenStack Edge Computing Whitepaper, which is available from the Edge section of openstack.org, and in a specific use case section of the Edge Computing Group wiki.

Deployment Scenarios
Deployment scenarios are described in the whitepaper of the OPNFV Edge Cloud Project.

Features and requirements
The discussion happened on two levels: 1) the future features of an edge cloud and 2) the concrete requirements found missing by those who try to deploy edge clouds today. Because these features and requirements are on different levels, they are recorded in two separate subchapters.

Architectural paradigms

 * There is a single source of truth of cloud metadata in one edge infrastructure
 * Caching the data should be possible
 * It is possible to have network partitioning between any edge cloud instances of the edge infrastructure
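
The first two paradigms could be combined roughly as follows. This is a minimal Python sketch, not part of any existing project: `CachedMetadata` and the `fetch` callable are hypothetical names, and a `ConnectionError` is assumed to signal a network partition. Reads go to the single source of truth when it is reachable and fall back to the last cached copy otherwise, flagging the result as possibly stale.

```python
class CachedMetadata:
    """Read-through cache in front of a single source of truth (illustrative only)."""

    def __init__(self, fetch):
        # `fetch(key)` is a hypothetical callable that queries the source of
        # truth and raises ConnectionError when the network is partitioned.
        self._fetch = fetch
        self._cache = {}

    def get(self, key):
        """Return (value, is_stale)."""
        try:
            value = self._fetch(key)
        except ConnectionError:
            if key in self._cache:
                # Partitioned: serve the last known copy and say so.
                return self._cache[key], True
            raise
        self._cache[key] = value
        return value, False
```

Whether stale reads are acceptable depends on the kind of metadata; the sketch only makes the staleness visible to the caller.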

Features
Features are organized into feature groups, starting from the elementary operations on one site and going up to the most advanced feature sets.

Base assumptions for the features

 * Hundreds of edge sites that need to be operated first by a single operator and later by multiple operators
 * Each edge site is composed of at least 1 server
 * There can be one or several control sites according to the envisioned scenarios (latency between control sites and edge sites can range from a few ms to hundreds of ms).

Elementary operations on one site
Type: MVS
 * Admin operations
   * Create Role/Project
   * Create/Register service endpoint
   * Telemetry (collect information regarding the infrastructures and VMs)
 * Operator operations
   * Create a VM image
   * Create a VM locally
   * Create/use a remote-attached volume
   * Telemetry (collect information regarding the infrastructures and VMs)

Use of a remote site
Type: MVS
Credentials are only present on the original site.
 * Operator operations
 * All Elementary operations on one site on a remote site
 * Operator should be able to define an explicit list of remote sites where the operations should be executed
 * Sharing of Projects and Users among sites. Note: There is a concern that this results in a non shared-none configuration. The question is whether there is any other way to avoid manually configuring this data on every edge site.

Collaboration between edge cloud instances
Type: Non-MVS?
 * Sharing a VM image between sites
 * Create network between sites
 * Create a mesh application (i.e. several VMs deployed through several sites)
 * Use a remote 'remote attached' volume - please note that the remote attached volume is stored on the remote site (while in L1/L2 the remote volume was stored locally)
 * Relocating workloads from one site to another one
 * There should be a way to discover and negotiate site capabilities as a migration target. There might be differences in hypervisors and hypervisor features.
 * Different hypervisors might use different VM image formats
 * use different hypervisor-specific images, chosen by their metadata and derived from the same original image - also to know which image to start where (on which site)
 * use a common image tested on a certain pool of hypervisors - so one can guarantee that the image is certified for hypervisors X, Y and Z (maybe even with hypervisor-version granularity?)
 * Cold migration
 * Live migration
 * Require direct connectivity between the compute nodes - e.g. through some sort of point-to-point overlay connection (e.g. via IPsec) between two compute nodes (source + destination)
 * How do we handle attached cinder volumes?
 * Mild migration
 * Take a VM snapshot and move that one
 * Authenticity of the edge cloud infrastructures should be one capability
 * Leverage a flavor that has been created on another site.
 * Rollout of one VM on a set of sites
 * Define scope/radius for collaborations (i.e. it should be possible to explicitly define the locations where workloads can be launched and where data can be saved for a particular tenant)
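
For the capability discovery and hypervisor-specific image items in the list above, the selection step could look roughly like the following Python sketch. The dictionary keys (`hypervisor_type`, `disk_format`, `disk_formats`) and the function name are assumptions made for illustration; they are not the schema of Glance, Nova or any other project, and no negotiation protocol was defined in the discussion.

```python
def pick_image_variant(variants, site):
    """Return the id of the first image variant the target site can run, or None.

    `variants` are hypervisor-specific images derived from one original image;
    `site` is the capability record advertised by a candidate migration target.
    Both structures are hypothetical.
    """
    for variant in variants:
        if (variant["hypervisor_type"] == site["hypervisor_type"]
                and variant["disk_format"] in site["disk_formats"]):
            return variant["id"]
    return None


# Example data: two variants of the same original image.
variants = [
    {"id": "img-kvm", "hypervisor_type": "qemu", "disk_format": "qcow2"},
    {"id": "img-vmw", "hypervisor_type": "vmware", "disk_format": "vmdk"},
]
```

A site for which `pick_image_variant` returns `None` would simply not qualify as a migration target for that workload.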

Network unreliability
Type: Non-MVS
 * Both the control and controlled components should be prepared for unreliable networks, therefore they should
   * Have a policy for operation retries without overloading the network
   * Be able to pause the communication while the network is down and restart it after the network has recovered
 * Users of an isolated edge cloud site should be able to execute operations regarding the site.
 * In case of a network partitioning, every side of the partition should be operable.
 * Open questions:
   * Do we expect operations which should be cached in case of a network partitioning?
   * How to handle config data collisions after the restoration of a network partitioning?
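
One common way to retry without overloading the network is capped exponential backoff with random jitter. The Python sketch below is only an illustration of that general technique; the function names, the default values and the `is_network_up` probe are assumptions, not something agreed in the discussion.

```python
import random


def retry_delays(base=1.0, cap=300.0, attempts=8, rng=None):
    """Yield capped exponential backoff delays (in seconds) with full jitter."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter spreads the retries of many sites over time.
        yield rng.uniform(0, ceiling)


def call_with_retries(operation, is_network_up, sleep, delays):
    """Run `operation` until it succeeds or the delays are exhausted.

    While `is_network_up()` reports False, no request is sent; the loop only
    waits, which corresponds to pausing the communication during an outage.
    """
    for delay in delays:
        if not is_network_up():
            sleep(delay)
            continue
        try:
            return operation()
        except ConnectionError:
            sleep(delay)
    raise TimeoutError("operation did not succeed within the retry budget")
```

In a real service `sleep` would be `time.sleep`; it is passed in here so that the policy can be exercised without waiting.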

Containers
Type: MVS?
Same as Collaboration between edge cloud instances, but for containers.

Automatic scheduling between edge cloud instances
Type: Non-MVS
Same as Collaboration between edge cloud instances and Containers, but in an implicit manner.
 * edge compliant scheduler/placement engine
 * Edge cloud instances and the scheduler/placement engine should be aware of the physical location of the edge cloud instances
 * A workload should be relocated autonomously for performance objectives (follow an end-user,...)
 * edge compliant application orchestrator (Heat-like approach)
 * autoscaling of hosts within an edge site or between edge sites (this requires a Ceilometer-like system)
 * Authorisation of the new hosts when joining the edge cloud instance cluster

Administration features
Type: MVS?
 * zero touch provisioning (e.g., from bare-metal Rack Scale Controller (RSC) to an OpenStack)
 * Host OS provisioning
 * OpenStack deployment based on kolla/kubernetes/helm
 * Join the edge cloud infrastructure (authentication of the edge cloud instance and the edge cloud infrastructure)
 * Question: what about network equipment?
 * remote hardware management (inventory, eSW management, configuration of BIOS and similar things)
 * remote upgrade (of OpenStack core services).
 * Service continuity of the control plane
 * Service continuity of the workloads
 * control plane versioning issue
 * This is somewhat similar to the capability discovery
 * The monitoring service gets important events through notifications/triggers.
 * logs and alarms
 * logs/alarms and events
 * logs, alarms, events, and performance metrics
 * new performance metrics related to latency between edge sites
 * Build monitoring dashboards in real time (and on demand in order to define the scope of the resources the administrator wants to monitor on demand)
 * Workload consolidation/relocation for maintenance operations, energy optimization (green energy sources - solar panels/wind turbines..)
 * Perform a particular operation on each edge site (configure my users and tenants only once so the configurations are consistent among all of my edge clouds)
 * Other autonomous mechanisms
 * Collect information on the edge sites and do operations based on this (autoscaling)
 * Dealing with churn challenges (edge sites appearing/being removed)
 * human-based vs. crash-based.
 * Connecting network of a newly provisioned edge site
 * Operators of different parts of the edge infrastructure should have their own view of their operation domain (god mode users can see everything, while operators of the edge clouds of a region can see only the data of those edge clouds)
 * Be able to look at a cloud of clouds and give a user of this cloud of clouds access to a set of features and a set of nodes where those features can be used.

Multiple cloud stacks
Type: MVS?
Different versions of OpenStack and Kubernetes instances
 * Collaboration between edge cloud instances, Containers, Automatic scheduling between edge cloud instances and Administration features, but between different versions of OpenStack
 * Collaboration between edge cloud instances, Containers, Automatic scheduling between edge cloud instances and Administration features, but between different cloud solutions (e.g. OpenStack or Kubernetes)

Multi operator scenarios
Note: This section is even more of a draft than the rest of this page.
Note: Operator to edge id mapping needs consideration.
Type: Non-MVS?
 * Security considerations
 * Guarantee that the communication between the VMs or containers of an application is secure
 * (physical) security issues (can we limit human Byzantine attackers?)
 * data privacy (jurisdiction concerns)
 * HA / Reliability / Recovery challenges

Requirements
This section collects the captured concrete requirements to existing or new open source projects.

An edge cloud site should be aware of its location
Components: OpenStack Keystone? / Kubernetes?
An edge cloud instance should be able to store data about its location.

Discovering of data sources
Components: synch service
An edge cloud instance should be able to discover other edge cloud instances which can be trusted as a source of metadata.

Registering for synchronisation
Components: synch service
An edge cloud instance which is capable of providing metadata synchronisation services should be able to provide a registration API for edge cloud instances which would like to receive the data. The data should be synchronised after the first successful registration. An edge cloud instance should be able to register itself for metadata synchronisation services.

Advertise metadata data source service
Components: synch service or OpenStack Keystone
An edge cloud instance should be able to advertise whether it is able to provide metadata synchronisation services.

User management data source side
Note: Alternatives of Keystone metadata synchronisation in edge environment are discussed in a wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Keystone, Kubernetes
An edge cloud instance should be able to provide user data for synchronisation. The users to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The user management APIs of the target edge clouds are called. In case of an error the erroneous data segment is marked for retry and retried until a 200 OK is received. If the synchronised data changes, it should be re-synched to all receiving edge cloud instances.
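
The source side behaviour described here (mark, send, retry until 200 OK, re-sync on change) could be modelled as in the following Python sketch. It is only an illustration under stated assumptions: `SyncSource` and the `send` callable are hypothetical names, `send` stands in for calling the target edge cloud's API and returning its HTTP status code, and a `ConnectionError` represents an unreachable target.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SyncSource:
    """Tracks which data segments still have to reach which target sites."""

    targets: list
    # Hypothetical transport: send(target, key, value) -> HTTP status code.
    send: Callable[[str, str, dict], int]
    data: dict = field(default_factory=dict)
    pending: set = field(default_factory=set)  # (target, key) pairs without a 200 OK yet

    def mark(self, key, value):
        """Mark (or update) a segment; changed data is re-synched to all targets."""
        self.data[key] = value
        self.pending.update((target, key) for target in self.targets)

    def sync_once(self):
        """Try every pending transfer once and keep the failed ones for retry."""
        for target, key in sorted(self.pending):
            try:
                status = self.send(target, key, self.data[key])
            except ConnectionError:
                continue  # target unreachable, stays marked for retry
            if status == 200:
                self.pending.discard((target, key))
        return len(self.pending)
```

Calling `sync_once` repeatedly, ideally with a backoff between rounds, gives the "retried until a 200 OK is received" behaviour. The same pattern would apply to the RBAC, Flavor, Project and Quota source sides.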

User management data receiver side
Note: Alternatives of Keystone metadata synchronisation in edge environment are discussed in a wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Keystone, Kubernetes
An edge cloud instance should be able to utilize users from a remote site. This means that users can log in to the edge cloud instance without the need to manually provision them on the edge cloud instance. An edge cloud instance could receive users via synchronisation. In this case an API should be provided where the user management data can be set. 200 OK is provided only for data that is correctly stored. The received data should be locked from local editing. As an alternative, the edge cloud instance could auto-provision the users based on a set of pre-provisioned policies and the information available at the first login attempt. The pre-provisioned policies should be either synchronised or static.
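
The synchronisation path of the receiver side (acknowledge only what was stored, then lock it from local editing) could be sketched as below. `SyncedStore` and its methods are hypothetical; the in-memory dictionary stands in for durable storage, and the 400 status returned for unusable data is an assumption, since the text only specifies when 200 OK is returned.

```python
class SyncedStore:
    """Receiver-side store that locks synchronised entries (illustrative only)."""

    def __init__(self):
        self._items = {}
        self._locked = set()

    def receive(self, key, value):
        """Entry point of the synchronisation API; returns an HTTP-like status."""
        try:
            # A real service would persist durably before acknowledging.
            self._items[key] = dict(value)
        except (TypeError, ValueError):
            return 400  # assumption: nothing was stored, so no 200 OK
        self._locked.add(key)
        return 200

    def local_edit(self, key, value):
        """Local administrators may edit only entries that were not synchronised."""
        if key in self._locked:
            raise PermissionError(f"{key} is managed by synchronisation")
        self._items[key] = dict(value)

    def get(self, key):
        return self._items[key]
```

The auto-provisioning alternative mentioned in the text is not covered by this sketch.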

RBAC data source side
Note: Alternatives of Keystone metadata synchronisation in edge environment are discussed in a wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Keystone, Kubernetes
An edge cloud instance should be able to provide RBAC data for synchronisation. The RBAC data to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The RBAC data APIs of the target edge clouds are called. In case of an error the erroneous data segment is marked for retry and retried until a 200 OK is received. If the synchronised data changes, it should be re-synched to all receiving edge cloud instances.

RBAC data receiver side
Note: Alternatives of Keystone metadata synchronisation in edge environment are discussed in a wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Keystone, Kubernetes
An edge cloud instance should be able to receive RBAC data via synchronisation. An API should be provided where the RBAC data can be set. The RBAC data should be consistent with the user data of the edge cloud instance. 200 OK is provided only for data that is correctly stored. The received data should be locked from local editing.

VM images source side
Note: Alternatives of image handling in edge environment are discussed in a separate wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Glance or Glare
An edge cloud instance should be able to provide selected VM images for synchronisation. The VM images to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The VM image APIs of the target edge clouds are called with the hash of the image, a datapath is built for the disk images and the disk images are transferred (the exact technology is for further study). In case of an error the erroneous image is marked for retry and retried until a 200 OK is received. If any of the synchronised VM images is updated, the image should be re-synched to all receiving edge cloud instances. There should be an API where the receiving edge cloud instances can initiate the synchronisation of particular VM images. A version of the images should be maintained.

VM images receiver side
Note: Alternatives of image handling in edge environment are discussed in a separate wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Glance or Glare
An edge cloud instance should be able to receive VM images via synchronisation. An API should be provided where the VM image transfer can be initiated, the datapath for the transfer is built and the hash of the received image is checked. 200 OK is provided only for data that is correctly stored. The received data should be locked from local editing.
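
The hash check on the receiving side could look like the following Python sketch. SHA-256 is an assumption (the text does not name a hash algorithm), as are the function name and the 422 status used for a mismatch; only the rule that 200 OK is returned solely for correctly received data comes from the text.

```python
import hashlib


def receive_image(chunks, expected_sha256):
    """Assemble an image from transferred chunks and verify its hash.

    Returns (status, data): (200, image_bytes) if the digest matches the hash
    announced by the source, otherwise (422, None) so the source retries.
    """
    digest = hashlib.sha256()
    received = bytearray()
    for chunk in chunks:
        digest.update(chunk)
        received.extend(chunk)
    if digest.hexdigest() != expected_sha256:
        return 422, None  # assumption: any non-200 status triggers a retry
    return 200, bytes(received)
```

For real disk images the chunks would be written to storage instead of being kept in memory; the digest handling stays the same.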

Flavors source side
Components: synch service, OpenStack Nova, Kubernetes
An edge cloud instance should be able to provide selected Flavors for synchronisation. The Flavors to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The Flavor APIs of the target edge clouds are called. In case of an error the erroneous Flavor is marked for retry and retried until a 200 OK is received. If any of the synchronised Flavors is changed, it should be re-synched to all receiving edge cloud instances.

Flavors receiver side
Components: synch service, OpenStack Nova, Kubernetes
An edge cloud instance should be able to receive Flavors via synchronisation. An API should be provided where the Flavors can be set. 200 OK is provided only for Flavors which are correctly stored. The received data should be locked from local editing.

Projects source side
Note: Alternatives of Keystone metadata synchronisation in edge environment are discussed in a wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Keystone, Kubernetes
An edge cloud instance should be able to provide selected Project configuration for synchronisation. The Projects to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The Project APIs of the target edge clouds are called. In case of an error the erroneous Project configuration is marked for retry and retried until a 200 OK is received. If any of the synchronised Projects is changed, it should be re-synched to all receiving edge cloud instances.

Projects receiver side
Note: Alternatives of Keystone metadata synchronisation in edge environment are discussed in a wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Keystone, Kubernetes
An edge cloud instance should be able to receive Projects via synchronisation. An API should be provided where the Projects can be set. The stored Projects should be consistent with the user settings of the edge cloud. 200 OK is provided only for Projects which are correctly stored. The received data should be locked from local editing.

Quotas source side
Note: Alternatives of Keystone metadata synchronisation in edge environment are discussed in a wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Keystone, Kubernetes
An edge cloud instance should be able to provide selected Quota configuration for synchronisation. The Quotas to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The Quota APIs of the target edge clouds are called. In case of an error the erroneous Quota configuration is marked for retry and retried until a 200 OK is received. If any of the synchronised Quotas is changed, it should be re-synched to all receiving edge cloud instances.

Quotas receiver side
Note: Alternatives of Keystone metadata synchronisation in edge environment are discussed in a wiki page. The final content of this chapter depends on the solutions discussed there.
Components: synch service, OpenStack Keystone, Kubernetes
An edge cloud instance should be able to receive Quotas via synchronisation. An API should be provided where the Quotas can be set. The stored Quotas should be consistent with the Projects settings of the edge cloud. 200 OK is provided only for Quotas which are correctly stored. The received data should be locked from local editing.

Progress monitoring
Components: synch service
An edge cloud instance with metadata synchronisation services should be able to:
 * report the progress of its own in terms of data segments and target edge cloud instances
 * collect the report of other edge cloud instances with metadata synchronisation services which are "under" it
 * report the progress of its own and all other synchronisation services "under" it
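
The three reporting levels listed above could be modelled as a tree of synchronisation services, as in this Python sketch. `SyncNode` and the counting of progress as (done, total) transfers are assumptions made for illustration; the text does not define a report format.

```python
from dataclasses import dataclass, field


@dataclass
class SyncNode:
    """A synchronisation service and the services "under" it (hypothetical model)."""

    name: str
    done: int = 0    # completed (data segment, target) transfers of this node
    total: int = 0   # all (data segment, target) transfers of this node
    children: list = field(default_factory=list)

    def own_progress(self):
        """Progress of this node only."""
        return (self.done, self.total)

    def aggregated_progress(self):
        """Progress of this node plus everything below it in the hierarchy."""
        done, total = self.done, self.total
        for child in self.children:
            child_done, child_total = child.aggregated_progress()
            done += child_done
            total += child_total
        return (done, total)
```

In a deployment the child reports would be collected over the network rather than read from in-memory objects, so an unreachable child would have to be reported as unknown rather than silently skipped.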

Operability data aggregation data provider part
Components: synch service or something else?
Edge cloud instances should provide an API where they provide operability data about themselves. The provided data should be:
 * List of active alarms
 * What else?

Operability data aggregation data aggregator part
Components: synch service or something else?
Some selected (edge) cloud instances should be able to collect operability data of other edge cloud instances and show these on a UI and a CLI.
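
A minimal Python sketch of the aggregator side, limited to the only data item named so far (the list of active alarms). The function name, the mapping of site names to fetch callables and the use of `ConnectionError` for unreachable sites are all assumptions; the provider API itself is not specified yet.

```python
def collect_alarms(sites):
    """Collect active alarms from several edge cloud instances.

    `sites` maps a site name to a hypothetical callable that queries the
    operability API of that site and returns its list of active alarms.
    Returns (report, unreachable) so that a UI or CLI can show both.
    """
    report = {}
    unreachable = []
    for name, fetch in sites.items():
        try:
            report[name] = list(fetch())
        except ConnectionError:
            # Keep aggregating; an unreachable site is itself worth showing.
            unreachable.append(name)
    return report, unreachable
```

Listing unreachable sites separately keeps a network partition from looking like a site without alarms.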

Remote control controlling part
Components: synch service or something else?
Some selected (edge) cloud instances should be able to issue different operations on other selected edge cloud instances. The supported operations should be:
 * Add operations.

Remote control receiving part
Components: synch service or something else?
Edge cloud instances should be able to receive commands remotely on an API.

Identified open questions

 * How can we make the distinction between the connectivity from the backhaul to the edge site (i.e. inter edge sites) and the connectivity between the edge site and the devops/users that are in the vicinity of the edge site?
 * Regarding the storage of a small edge deployment: Is the storage a single, locally attached unit?
 * Regarding the storage of a small edge deployment: What about the image repository service (i.e. are you expecting a fully independent edge node, or can we envision having just a node that behaves like a compute node in the OpenStack terminology)?
 * At the edge, many VM/CN applications can evolve as needed. Any thoughts on how such application management should be handled?
 * What should be the size of all the OpenStack components in the different edge deployment scenarios in terms of CPU, memory, disk and hardware units?

Further discussions

 * Image handling in edge environment
 * Keystone edge architectures