OpenStack Edge Discussions Dublin PTG

Intro
This page collects the topics discussed at the Edge Workshop during the Dublin PTG. If there is any error on this page or some information is missing, please go ahead and correct it.

The discussions were noted in the following etherpads:
 * PTG schedule
 * Gap analysis
 * Alans problems

Definitions

 * Application Sustainability: VMs/containers/bare-metal machines (i.e. workloads) already deployed on an edge site can continue to serve requests, e.g. a local user can ssh into them
 * Control site(s): Sites that host only control services (i.e. these sites do not aim at hosting compute workloads). Please note that there is no particular interest in having such a site yet. We just need the definition of a control site for the gap analysis and the different deployment scenarios that can be considered.
 * Edge cloud infrastructure user: The users who are in direct contact with the edge cloud infrastructure via the different APIs of the edge cloud infrastructure.
 * Edge cloud service user: The users who are using the services running in the edge cloud infrastructure. These users do not interact with the edge cloud infrastructure and, in the ideal case, are not aware of its existence.
 * Edge site(s): Sites where servers may deliver control and compute capabilities.
 * Original site: The site where the operation is performed/executed initially.
 * Remote site(s): Site(s) that are affected in an operation launched from the Original one.
 * Site Sustainability: Local administrators/Local users should be able to administrate/use local resources in case of disconnections to remote sites.

Example of Remote and Original sites
This section is WIP.

Edge use cases
Although it was not noted in the etherpads, there were lots of discussions about the use cases for edge clouds. As the OpenStack Edge Computing Whitepaper (available from the Edge section of openstack.org) also describes, the number of possible use cases is practically unlimited. The most prominent ones are:
 * IoT data aggregation: In the case of IoT, a large number of devices send their data towards the central cloud. In an edge application this data can be pre-processed and aggregated, so the amount of data sent to the central cloud is smaller.
 * NFV: Telecom operators would like to run realtime applications on an infrastructure close to the radio heads to provide low latency.
 * Autonomous devices: Autonomous cars and other devices will generate large amounts of data and will need low latency handling of this data.
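
The IoT data aggregation use case can be illustrated with a minimal sketch. The function name `aggregate_readings` and the reading format (device id, value) are assumptions made for this example only; they were not defined in the discussions.

```python
from collections import defaultdict
from statistics import mean


def aggregate_readings(readings):
    """Reduce raw (device_id, value) readings to one summary per device.

    Hypothetical edge-side pre-processing: only the summaries would be
    forwarded to the central cloud instead of every raw reading.
    """
    per_device = defaultdict(list)
    for device_id, value in readings:
        per_device[device_id].append(value)
    return {
        device_id: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for device_id, values in per_device.items()
    }


# Three raw readings collapse into two per-device summaries.
raw = [("sensor-a", 20.0), ("sensor-a", 22.0), ("sensor-b", 5.0)]
summary = aggregate_readings(raw)
```

The amount of data leaving the edge site then depends on the number of devices rather than on the number of readings.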

Deployment Scenarios
To support all of the use cases there is a need for edge clouds of different sizes. During the discussions we recognised the following deployment scenarios:

Small edge
This is a single node deployment with multiple instances contained within it (lives in a coffee shop for instance); there should probably be some external management of the collection of these single nodes that does roll-up.
 * Minimum hardware specs: 1 unit of 4 cores, 8 GB RAM, 1 * 240 GB SSD
 * Maximum hardware specs: 1 unit of 16 cores, 64 GB RAM, 1 * 1 TB storage
 * Physical access of maintainer: Rare
 * Physical security: none
 * Expected frequency of updates to hardware: 3-4 year refresh cycle
 * Expected frequency of updates to firmware: 6-12 months
 * Expected frequency of updates to control systems (e.g. OpenStack or Kubernetes controllers): ~ 12 - 24 months, has to be possible from remote management
 * Remote access/connectivity reliability (24/24, periodic, ...): No 100% uptime and variable connectivity expected.

Medium edge

 * Minimum hardware specs: 2 RU
 * Maximum hardware specs: 20 RU
 * Physical access of maintainer: Rare
 * Physical security: Medium, probably not in a secure data center, probably in a semi-physically secure location; each device has some authentication (such as a certificate) to verify it's a legitimate piece of hardware deployed by the operator; network access is all through security enhanced methods (VPN, connected back to DMZ); the VPN itself is not considered secure, so other mechanisms such as HTTPS should be employed as well
 * Expected frequency of updates to hardware: 5-7 years
 * Expected frequency of updates to firmware: Never unless required to fix blocker/critical bug(s)
 * Expected frequency of updates to control systems (e.g. OpenStack or Kubernetes controllers): 12 - 24 months
 * Remote access/connectivity reliability (24/24, periodic, ...): 24/24 (high uptime but connectivity is variable), 100% uptime expected

Features and requirements
The discussion happened on two levels 1) on the level of future features of an edge cloud and 2) on the level of concrete missing requirements by the ones who try to deploy edge clouds today. These features and requirements are on a different level, therefore they are recorded in two separate subchapters.

Architectural paradigms

 * There is a single source of truth of cloud metadata in one edge infrastructure
 * Caching the data should be possible
 * It is possible to have network partitioning between any edge cloud instances of the edge infrastructure
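
How these three paradigms can fit together is sketched below: a read-through cache in front of a single source of truth that keeps serving the cached copy during a network partition. The names `MetadataCache`, `PartitionError` and `fetch` are hypothetical and only serve the illustration.

```python
class PartitionError(Exception):
    """Raised by the fetch callable when the source of truth is unreachable."""


class MetadataCache:
    """Read-through cache in front of a single source of truth (hypothetical)."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable(key) -> value, may raise PartitionError
        self._cache = {}

    def get(self, key):
        """Return (value, is_fresh); fall back to the cached copy when partitioned."""
        try:
            value = self._fetch(key)
        except PartitionError:
            if key in self._cache:
                return self._cache[key], False
            raise
        self._cache[key] = value
        return value, True


# Simulated source of truth whose reachability can be switched off.
source = {"flavor:m1.small": {"vcpus": 1}}
state = {"reachable": True}


def fetch(key):
    if not state["reachable"]:
        raise PartitionError(key)
    return source[key]


cache = MetadataCache(fetch)
fresh = cache.get("flavor:m1.small")   # fetched from the source of truth
state["reachable"] = False             # simulate a network partition
stale = cache.get("flavor:m1.small")   # served from the local cache
```

The `is_fresh` flag lets the caller decide whether possibly outdated metadata is acceptable for a given operation.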

Features
Features are organized into different feature groups, starting from the elementary operations on one site and ending with the most advanced feature sets.

Base assumptions for the features

 * Hundreds of edge sites that need to be operated first by a single operator and then by multiple operators
 * Each edge site is composed of at least 1 server
 * There can be one or several control sites according to the envisioned scenarios (latency between control sites and edge sites can range from a few ms to hundreds of ms).

Elementary operations on one site
Type: MVS
 * Admin operations
   * Create Role/Project
   * Create/Register service endpoint
   * Telemetry (collect information regarding the infrastructures and VMs)
 * Operator operations
   * Create a VM image
   * Create a VM locally
   * Create/use a remote-attached volume
   * Telemetry (collect information regarding the infrastructures and VMs)

Use of a remote site
Type: MVS
Credentials are only present on the original site.
 * Operator operations
 * All Elementary operations on one site on a remote site
 * Operator should be able to define an explicit list of remote sites where the operations should be executed
 * Sharing of Projects and Users among sites. Note: There is a concern that this results in a configuration that is no longer shared-nothing. The question is whether there is any other way to avoid the manual configuration of this data on every edge site.

Network unreliability
Type: Non-MVS
 * Both the control and controlled components should be prepared for unreliable networks, therefore they should:
   * Have a policy for operation retries without overloading the network
   * Be able to pause the communication while the network is down and restart it after the network has recovered
 * An edge cloud site should be able to execute all basic operations (Action: Basic operations should be defined)
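
A retry policy of this kind could look like the following minimal sketch, which uses capped exponential backoff. All names and default values (`backoff_delays`, `call_with_retries`, `is_network_up`, the delays) are assumptions for illustration; the discussions did not settle on a concrete policy.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, jitter=0.0, attempts=5):
    """Yield capped exponential delays, optionally with random jitter."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap) + random.uniform(0, jitter)
        delay *= factor


def call_with_retries(operation, is_network_up, sleep, **policy):
    """Run operation; retry on ConnectionError following the backoff policy.

    While is_network_up() returns False no request is sent (communication is
    paused), but the step still consumes one delay so the sketch stays bounded.
    """
    for delay in backoff_delays(**policy):
        if is_network_up():
            try:
                return operation()
            except ConnectionError:
                pass
        sleep(delay)
    raise ConnectionError("retry budget exhausted")


# Simulated operation that fails twice before succeeding.
outcomes = iter([ConnectionError("down"), ConnectionError("down"), "ok"])


def flaky():
    item = next(outcomes)
    if isinstance(item, Exception):
        raise item
    return item


slept = []   # records the delays instead of really sleeping
result = call_with_retries(flaky, lambda: True, slept.append, base=1.0, attempts=5)
```

Growing the delay between attempts is what keeps retries from overloading a network that is already struggling; adding jitter helps when many edge sites retry at the same time.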

Collaboration between edge cloud instances
Type: Non-MVS?
 * Sharing a VM image between sites
 * Create network between sites
 * Create a mesh application (i.e. several VMs deployed through several sites)
 * Use a remote 'remote attached' volume - please note that the remote attached volume is stored on the remote site (while in L1/L2 the remote volume was stored locally)
 * Relocating workloads from one site to another one
 * There should be a way to discover and negotiate site capabilities as a migration target. There might be differences in hypervisors and hypervisor features.
 * Different hypervisors might use different VM image formats
 * use different hypervisor-specific images, chosen by their metadata and derived from the same original image - also to know which image to start where (on which site)
 * use a common image tested on a certain pool of hypervisors - so one can guarantee that the image is certified for hypervisors X, Y and Z (maybe even with hypervisor-version granularity?)
 * Cold migration
 * Live migration
 * Require direct connectivity between the compute nodes - e.g. through some sort of point-to-point overlay connection (e.g. via IPsec) between two compute nodes (source + destination)
 * How do we handle attached cinder volumes?
 * Mild migration
 * Take a VM snapshot and move that one
 * Leverage a flavor that has been created on another site.
 * Rollout of one VM on a set of sites
 * Define scope/radius for collaborations (i.e. it should be possible to explicitly define locations where workloads can be launched/where data can be saved for a particular tenant)
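
The idea of discovering site capabilities and picking a hypervisor-specific image by its metadata can be sketched as follows. The `SiteCapabilities` structure and the `hypervisor`/`disk_format` metadata keys are hypothetical; no capability format was agreed on in the discussions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCapabilities:
    """Capabilities a site advertises as a migration target (hypothetical)."""
    name: str
    hypervisor: str
    image_formats: frozenset


def select_image(variants, site):
    """Pick the image variant whose metadata matches the target site.

    Returns None when the site cannot run any of the variants.
    """
    for variant in variants:
        if (variant["hypervisor"] == site.hypervisor
                and variant["disk_format"] in site.image_formats):
            return variant
    return None


def migration_targets(variants, sites):
    """Return {site name: chosen variant} for every site able to host the workload."""
    chosen = {}
    for site in sites:
        variant = select_image(variants, site)
        if variant is not None:
            chosen[site.name] = variant
    return chosen


# Two variants derived from the same original image, two candidate sites.
variants = [
    {"name": "app-kvm", "hypervisor": "kvm", "disk_format": "qcow2"},
    {"name": "app-vmware", "hypervisor": "vmware", "disk_format": "vmdk"},
]
sites = [
    SiteCapabilities("edge-1", "kvm", frozenset({"qcow2", "raw"})),
    SiteCapabilities("edge-2", "hyperv", frozenset({"vhd"})),
]
targets = migration_targets(variants, sites)
```

Only `edge-1` ends up as a valid target here, because no image variant exists for the hypervisor of `edge-2`.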

Containers
Type: MVS?
Same as Collaboration between edge cloud instances, but for containers.

Automatic scheduling between edge cloud instances
Type: Non-MVS
Same as Collaboration between edge cloud instances and Containers, but in an implicit manner.
 * edge compliant scheduler/placement engine
 * A workload should be relocated autonomously for performance objectives (follow an end-user, ...)
 * edge compliant application orchestrator (heat like approach)
 * autoscaling of hosts within an edge site or between edge sites (that requires a ceilometer-like system)

Administration features
Type: MVS?
 * zero touch provisioning (e.g. from bare-metal resources to an OpenStack)
 * kolla/kubernetes/helm deployment
 * Join the swarm (authentication of the site and the swarm)
 * what about network equipment?
 * remote hardware management (inventory, eSW management, configuration of BIOS and similar things)
 * remote upgrade (of OpenStack core services).
 * Service continuity of the control plane
 * Service continuity of the workloads
 * control plane versioning issue
 * This is somewhat similar to the capability discovery
 * Monitoring service gets important events through notifications/triggers.
 * logs and alarms
 * logs/alarms and events
 * logs, alarms, events, and performance metrics
 * new performance metrics related to latency between edge sites
 * Build monitoring dashboards in real time (and on demand in order to define the scope of the resources the administrator wants to monitor on demand)
 * Workload consolidation/relocation for maintenance operations, energy optimization (green energy sources - solar panels/wind turbines..)
 * Perform a particular operation on each edge site (configure my users and tenants only once so the configurations are consistent among all of my edge clouds)
 * Other autonomous mechanisms
 * Collect information on the edge sites and do operations based on this (autoscaling)
 * Dealing with churn challenges (edge sites appearing/being removed)
 * human-caused vs crash-caused.
 * Connecting network of a newly provisioned edge site
 * Operators of different parts of the edge infrastructure should have their own view of their operation domain (god-mode users can see everything, while operators of the edge clouds of a region can see only the data of those edge clouds)
 * Be able to look at a cloud of clouds and give a user of this cloud of clouds access to a set of features and a set of nodes where they can use those features.

Multiple cloud stacks
Type: MVS?
Different versions of OpenStack and Kubernetes instances
 * Collaboration between edge cloud instances, Containers, Automatic scheduling between edge cloud instances and Administration features, but between different versions of OpenStack
 * Collaboration between edge cloud instances, Containers, Automatic scheduling between edge cloud instances and Administration features, but between different cloud solutions (e.g. OpenStack or Kubernetes)

Multi operator scenarios
Type: Non-MVS?
 * Security considerations
 * Guarantee that the communication between the VMs or containers of an application is secure
 * (physical) security issues (can we limit human byzantine attackers?)
 * data privacy (jurisdiction concerns)
 * HA / Reliability / Recovery challenges

Requirements
This section collects the captured concrete requirements to existing or new open source projects.

An edge cloud site should be aware of its location
Component: OpenStack Keystone? / Kubernetes?
An edge cloud instance should be able to store data about its location.

Discovering of data sources
Component: synch service (new Kingbird)
An edge cloud instance should be able to discover other edge cloud instances which are trustable as a source of metadata.

Registering for synchronisation
Component: synch service (new Kingbird)
An edge cloud instance which is capable of providing metadata synchronisation services should be able to provide a registration API for edge cloud instances which would like to receive the data. The data should be synchronised after the first successful registration. An edge cloud instance should be able to register itself for metadata synchronisation services.
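
A minimal sketch of the registration step, assuming a hypothetical `SyncProvider` class in which each receiver is represented by a `push` callable. It only illustrates the "synchronise after the first successful registration" rule and is not an API of Kingbird or any other project.

```python
class SyncProvider:
    """Hypothetical provider side of a metadata synchronisation service."""

    def __init__(self, metadata):
        self.metadata = dict(metadata)
        self.receivers = {}   # receiver id -> push callable

    def register(self, receiver_id, push):
        """Register a receiver; push all metadata once on its first registration."""
        first_time = receiver_id not in self.receivers
        self.receivers[receiver_id] = push
        if first_time:
            push(dict(self.metadata))
        return first_time


# An edge cloud instance registers itself twice; only the first call triggers a sync.
received = []
provider = SyncProvider({"projects": ["demo"]})
first = provider.register("edge-1", received.append)
second = provider.register("edge-1", received.append)
```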

Advertise metadata data source service
Component: synch service (new Kingbird) or OpenStack Keystone
An edge cloud instance should be able to advertise whether it is able to provide metadata synchronisation services.

User management data source side
Component: synch service (new Kingbird), OpenStack Keystone, Kubernetes
An edge cloud instance should be able to provide user data for synchronisation. The users to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The target edge clouds' APIs for user management data are called. In case of an error the erroneous data segment is marked for retry and retried until a 200 OK is received. If the synchronised data changes, it should be re-synched to all receiving edge cloud instances.
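
The "mark for retry and retry until a 200 OK is received" behaviour can be sketched as below. The function `sync_segments` and the `send` callable are hypothetical, and unlike the requirement the sketch stops after `max_rounds` so that it always terminates.

```python
def sync_segments(segments, targets, send, max_rounds=3):
    """Send every data segment to every target and retry the ones that fail.

    send(target, segment) is a hypothetical callable returning an HTTP status.
    Returns the (target, segment id) pairs still pending after max_rounds.
    """
    pending = {(target, seg_id) for target in targets for seg_id in segments}
    for _ in range(max_rounds):
        if not pending:
            break
        # sorted() copies the set, so entries can be discarded while looping
        for target, seg_id in sorted(pending):
            if send(target, segments[seg_id]) == 200:
                pending.discard((target, seg_id))
    return pending


attempts = {}


def send(target, segment):
    attempts[target] = attempts.get(target, 0) + 1
    # edge-2 rejects its first request to simulate a transient error
    if target == "edge-2" and attempts[target] == 1:
        return 503
    return 200


left = sync_segments({"users": ["alice", "bob"]}, ["edge-1", "edge-2"], send)
```

The same pattern would apply to the other source side requirements in this section (RBAC data, Flavors, Projects, Quotas), with only the payload changing.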

User management data receiver side
Component: synch service (new Kingbird), OpenStack Keystone, Kubernetes
An edge cloud instance should be able to receive users via synchronisation. An API should be provided where the user management data can be set. 200 OK is provided only for data that is correctly stored. The received data should be locked from local editing.
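
On the receiver side, the two rules (200 OK only for stored data, received data locked from local editing) could be modelled as in this sketch. `ReceivedUserStore` and the 400 status for invalid input are assumptions for illustration.

```python
class ReceivedUserStore:
    """Hypothetical receiver: stores synchronised users and locks them locally."""

    def __init__(self):
        self.users = {}
        self.locked = set()

    def receive(self, user):
        """Handle a synchronised user; return 200 only once it is stored."""
        if "name" not in user:
            return 400
        self.users[user["name"]] = dict(user)
        self.locked.add(user["name"])
        return 200

    def local_edit(self, name, **changes):
        """Apply a local change unless the user was received via synchronisation."""
        if name in self.locked:
            raise PermissionError(f"{name} is managed by synchronisation")
        self.users.setdefault(name, {"name": name}).update(changes)


store = ReceivedUserStore()
status_ok = store.receive({"name": "alice", "email": "alice@example.org"})
status_bad = store.receive({"email": "nobody@example.org"})
store.local_edit("carol", email="carol@example.org")   # purely local user: allowed
```

Calling `store.local_edit("alice", ...)` would raise `PermissionError`, because `alice` arrived via synchronisation.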

RBAC data source side
Component: synch service (new Kingbird), OpenStack Keystone, Kubernetes
An edge cloud instance should be able to provide RBAC data for synchronisation. The RBAC data to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The target edge clouds' APIs for RBAC data are called. In case of an error the erroneous data segment is marked for retry and retried until a 200 OK is received. If the synchronised data changes, it should be re-synched to all receiving edge cloud instances.

RBAC data receiver side
Component: synch service (new Kingbird), OpenStack Keystone, Kubernetes
An edge cloud instance should be able to receive RBAC data via synchronisation. An API should be provided where the RBAC data can be set. The RBAC data should be consistent with the user data of the edge cloud instance. 200 OK is provided only for data that is correctly stored. The received data should be locked from local editing.
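
One possible reading of "consistent with the user data" is that a role assignment may only reference users known to the receiving instance. The sketch below checks exactly that; the assignment format and the function name are hypothetical.

```python
def validate_role_assignments(assignments, known_users):
    """Split role assignments into accepted ones and ones naming unknown users."""
    accepted, rejected = [], []
    for assignment in assignments:
        if assignment["user"] in known_users:
            accepted.append(assignment)
        else:
            rejected.append(assignment)
    return accepted, rejected


assignments = [
    {"user": "alice", "role": "admin"},
    {"user": "mallory", "role": "member"},
]
accepted, rejected = validate_role_assignments(assignments, {"alice", "bob"})
```

A receiver following the requirement above would answer 200 OK only for the accepted entries.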

VM images source side
Component: synch service (new Kingbird), OpenStack Glance or Glare
An edge cloud instance should be able to provide selected VM images for synchronisation. The VM images to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The target edge clouds' APIs for VM image data are called, where the hash of the image is provided, a datapath is built for the disk images and the disk images are transferred (exact technology is FFS). In case of an error the erroneous image is marked for retry and retried until a 200 OK is received. If any of the synchronised VM images is changed, it should be re-synched to all receiving edge cloud instances. There should be an API where the receiving edge cloud instances can initiate the synchronisation of particular VM images. A version of the images should be maintained.

VM images receiver side
Component: synch service (new Kingbird), OpenStack Glance or Glare
An edge cloud instance should be able to receive VM images via synchronisation. An API should be provided where the VM image transfer can be initiated, the datapath for the transfer is built and the hash of the received image is checked. 200 OK is provided only for data that is correctly stored. The received data should be locked from local editing.
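
The hash check on the receiving side can be sketched with the Python standard library. The choice of SHA-256, the 422 status for a mismatch and the function names are assumptions; the notes do not specify a hash algorithm or an error code.

```python
import hashlib


def image_digest(chunks):
    """SHA-256 hex digest of an image delivered as an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def receive_image(expected_hash, chunks):
    """Return 200 if the transferred data matches the announced hash, else 422."""
    return 200 if image_digest(chunks) == expected_hash else 422


# The source announces the hash first, then the image arrives in chunks.
announced = image_digest([b"abcdef"])
complete = receive_image(announced, [b"abc", b"def"])
truncated = receive_image(announced, [b"abc"])
```

Hashing chunk by chunk avoids holding a whole disk image in memory, which matters on a small edge node.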

Flavors source side
Component: synch service (new Kingbird), OpenStack Nova, Kubernetes
An edge cloud instance should be able to provide selected Flavors for synchronisation. The Flavors to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The target edge clouds' APIs for Flavors are called. In case of an error the erroneous Flavor is marked for retry and retried until a 200 OK is received. If any of the synchronised Flavors is changed, it should be re-synched to all receiving edge cloud instances.

Flavors receiver side
Component: synch service (new Kingbird), OpenStack Nova, Kubernetes
An edge cloud instance should be able to receive Flavors via synchronisation. An API should be provided where the Flavors can be set. 200 OK is provided only for Flavors which are correctly stored. The received data should be locked from local editing.

Projects source side
Component: synch service (new Kingbird), OpenStack Keystone, Kubernetes
An edge cloud instance should be able to provide selected Project configuration for synchronisation. The Projects to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The target edge clouds' APIs for Projects are called. In case of an error the erroneous Project configuration is marked for retry and retried until a 200 OK is received. If any of the synchronised Projects is changed, it should be re-synched to all receiving edge cloud instances.

Projects receiver side
Component: synch service (new Kingbird), OpenStack Keystone, Kubernetes
An edge cloud instance should be able to receive Projects via synchronisation. An API should be provided where the Projects can be set. The stored Projects should be consistent with the user settings of the edge cloud. 200 OK is provided only for Projects which are correctly stored. The received data should be locked from local editing.

Quotas source side
Component: synch service (new Kingbird), OpenStack Keystone, Kubernetes
An edge cloud instance should be able to provide selected Quota configuration for synchronisation. The Quotas to be synchronised are either marked (via API, CLI or config file) or received via synchronisation. The target edge clouds' APIs for Quotas are called. In case of an error the erroneous Quota configuration is marked for retry and retried until a 200 OK is received. If any of the synchronised Quotas is changed, it should be re-synched to all receiving edge cloud instances.

Quotas receiver side
Component: synch service (new Kingbird), OpenStack Keystone, Kubernetes
An edge cloud instance should be able to receive Quotas via synchronisation. An API should be provided where the Quotas can be set. The stored Quotas should be consistent with the Projects settings of the edge cloud. 200 OK is provided only for Quotas which are correctly stored. The received data should be locked from local editing.

Progress monitoring
Component: synch service (new Kingbird)
An edge cloud instance with metadata synchronisation services should be able to:
 * report the progress of its own in terms of data segments and target edge cloud instances
 * collect the report of other edge cloud instances with metadata synchronisation services which are "under" it
 * report the progress of its own and all other synchronisation services "under" it
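
The three reporting duties map naturally onto a tree of synchronisation services. In the sketch below progress is counted in (data segment, target instance) pairs; the `SyncNode` class and this unit of measure are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SyncNode:
    """A synchronisation service and the services 'under' it (hypothetical model)."""
    name: str
    done: int = 0    # (segment, target) pairs already synchronised
    total: int = 0   # (segment, target) pairs to synchronise
    children: list = field(default_factory=list)

    def own_progress(self):
        """Progress of this service only."""
        return self.done, self.total

    def subtree_progress(self):
        """Own progress plus the reports collected from all services below."""
        done, total = self.own_progress()
        for child in self.children:
            child_done, child_total = child.subtree_progress()
            done += child_done
            total += child_total
        return done, total


# central -> regional -> edge
edge = SyncNode("edge", done=0, total=3)
regional = SyncNode("regional", done=1, total=2, children=[edge])
central = SyncNode("central", done=4, total=4, children=[regional])
overall = central.subtree_progress()
```

Here the central service has finished its own work (4 of 4) while the overall report still shows 5 of 9 pairs synchronised.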

Operability data aggregation data provider part
Component: synch service (new Kingbird) or something else?
Edge cloud instances should provide an API where they provide operability data about themselves.

The provided data should be:
 * List of active alarms
 * What else?

Operability data aggregation data aggregator part
Component: synch service (new Kingbird) or something else?
Some selected (edge) cloud instances should be able to collect operability data of other edge cloud instances and show these on a UI and a CLI.
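
A minimal sketch of the aggregator part, limited to the only data item named so far (the list of active alarms). `aggregate_alarms` and the `fetch_alarms` callable are hypothetical; treating unreachable sites as a separate result is an assumption motivated by the unreliable connectivity discussed earlier.

```python
def aggregate_alarms(sites, fetch_alarms):
    """Collect active alarms per site; unreachable sites are reported separately."""
    alarms, unreachable = {}, []
    for site in sites:
        try:
            alarms[site] = list(fetch_alarms(site))
        except ConnectionError:
            unreachable.append(site)
    return alarms, unreachable


def fetch_alarms(site):
    # Simulated per-site API call; edge-3 is currently disconnected.
    if site == "edge-3":
        raise ConnectionError(site)
    return {"edge-1": ["disk full"], "edge-2": []}[site]


alarms, unreachable = aggregate_alarms(["edge-1", "edge-2", "edge-3"], fetch_alarms)
```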

Remote control controlling part
Component: synch service (new Kingbird) or something else?
Some selected (edge) cloud instances should be able to issue different operations on other selected edge cloud instances. The supported operations should be:
 * Add operations.

Remote control receiving part
Component: synch service (new Kingbird) or something else?
Edge cloud instances should be able to receive commands remotely via an API.

Identified open questions

 * How can we make the distinction between the connectivity from the backhaul to the edge site (i.e. inter edge sites) and the connectivity between the edge site and the devops/users that are in the vicinity of the edge site?
 * Regarding the storage of a small edge deployment: Is the storage a single, locally attached unit?
 * Regarding the storage of a small edge deployment: What about the image repository service (i.e. are you expecting a fully independent edge node, or can we envision just a node that behaves like a compute node in OpenStack terminology)?