
OpenStack Edge Discussions Dublin PTG

Intro

This page collects the topics discussed at the Edge Workshop during the Dublin PTG. If there is any error on this page or some information is missing, please go ahead and correct it.

The discussions were noted in the following etherpads:

  • PTG schedule [1]
  • Gap analysis [2]
  • Alan's problems [3]

Definitions

  • Application sustainability: VMs/containers/bare metal servers (i.e. workloads) already deployed on an edge site can continue to serve requests, i.e. a local user can ssh to them
  • Control site(s): Sites that host only control services (i.e. these sites do not aim at hosting compute workloads). Please note that there is no particular interest in having such a site yet. We just need the definition of a control site for the gap analysis and the different deployment scenarios that can be considered.
  • Edge site(s): Sites where servers may deliver control and compute capabilities.
  • Original site: The site the operator is connected to.
  • Remote site: A site that is managed by some management component and to which the operator is not directly connected.
  • Site sustainability: Local administrators/local users should be able to administer/use local resources in case of disconnection from remote sites.

Edge use cases

Although it was not noted in the etherpads, there were many discussions about the use cases for edge clouds. As the OpenStack Edge Computing Whitepaper [4], available from the Edge section of openstack.org [5], also describes, the possible use cases are practically unlimited. The most prominent ones are:

  • IoT data aggregation: In IoT scenarios a large number of devices send their data towards the central cloud. An edge application can pre-process and aggregate this data, so the amount of data sent to the central cloud is smaller.
  • NFV: Telecom operators would like to run real-time applications on an infrastructure close to the radio heads to provide low latency.
  • Autonomous devices: Autonomous cars and other devices will generate large amounts of data and will need low-latency handling of this data.

Deployment Scenarios

To support all of these use cases there is a need for edge clouds of different sizes. During the discussions we identified the following deployment scenarios:

Small edge

This is a single-node deployment with multiple instances contained within it (it lives in a coffee shop, for instance); there should probably be some external management of the collection of these single nodes that does roll-up.

  • Minimum hardware specs: 1 unit of 4 cores, 8 GB RAM, 225 GB SSD
  • Maximum hardware specs: 1 unit of ? cores, 64 GB RAM, 1 TB storage
  • Physical access of maintainer: Rare
  • Physical security: none
  • Expected frequency of updates to hardware: 3-4 year refresh cycle
  • Expected frequency of updates to firmware: ~monthly
  • Expected frequency of updates to control systems (e.g. OpenStack or Kubernetes controllers): ~ 6 months, has to be possible from remote management
  • Remote access/connectivity reliability (24/24, periodic, ...): No 100% uptime expected.

Medium edge

  • Minimum hardware specs: 4RU
  • Maximum hardware specs: 20 RU
  • Physical access of maintainer: Rare
  • Physical security: Medium; probably not in a secure data center, but in a semi-physically secure location. Each device has some authentication (such as a certificate) to verify it is a legitimate piece of hardware deployed by the operator; network access is all through security-enhanced methods (VPN, connected back to a DMZ). The VPN itself is not considered secure, so other mechanisms such as HTTPS should be employed as well.
  • Expected frequency of updates to hardware:  ?
  • Expected frequency of updates to firmware: ?
  • Expected frequency of updates to control systems (e.g. OpenStack or Kubernetes controllers): ?
  • Remote access/connectivity reliability (24/24, periodic, ...): 24/24

Features and requirements

The discussion happened on two levels: 1) the future features of an edge cloud and 2) the concrete requirements that are missing today for those who try to deploy edge clouds. These features and requirements are on different levels, therefore they are recorded in two separate subsections.

Architectural paradigms

  • There is a single source of truth for the cloud metadata in one edge infrastructure
  • Caching this data should be possible
  • Network partitions can occur between any edge cloud instances of the edge infrastructure

Features

Features are organized into seven levels, starting from the most basic feature set and progressing to the most advanced feature sets.

Base assumptions for the features

  • Hundreds of edge sites that need to be operated, at first by a single operator and later by multiple operators
  • Each edge site is composed of at least 1 server
  • There can be one or several control sites depending on the envisioned scenario (latency between control sites and edge sites can range from a few ms to hundreds of ms).

Feature levels

Level 1

Elementary operations on the original site (see the API sketch after the list).

  • Admin operations
    • Create Role/Project
    • Create/Register service endpoint
    • Telemetry (collect information regarding the infrastructures and VMs)
  • Operator operations
    • Create a VM image
    • Create a VM locally
    • Create/use a remote-attached volume
    • Telemetry (collect information regarding the infrastructures and VMs)
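
The following is a minimal sketch of what the Level 1 admin and operator operations could look like through openstacksdk. The cloud name original-site and all resource names are illustrative assumptions, not agreed interfaces.

  # A minimal Level 1 sketch, assuming openstacksdk and a clouds.yaml entry
  # named 'original-site'; all resource names below are illustrative.
  import openstack

  conn = openstack.connect(cloud='original-site')

  # Admin operations: create a project and a role on the original site.
  project = conn.identity.create_project(name='edge-demo')
  role = conn.identity.create_role(name='edge-operator')

  # Operator operations: upload a VM image and boot a VM locally.
  image = conn.create_image('cirros-edge', filename='cirros.qcow2', wait=True)
  server = conn.create_server('edge-vm-1', image=image.id, flavor='m1.small',
                              network='edge-net', wait=True)

  # Create a volume and attach it to the VM (remote-attached volume).
  volume = conn.create_volume(size=10, name='edge-vol-1', wait=True)
  conn.attach_volume(server, volume)
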
Level 2

Use of a remote site (assuming your credentials are only present on the original site); see the sketch after the list.

  • Operator operations
    • All Level 1 operations on a remote site
    • The operator should be able to define an explicit list of remote sites where the operations should be executed
    • Sharing of projects and users among sites. Note: there is a concern that this results in a non shared-nothing configuration. The question is whether there is any other way to avoid manually configuring this data on every edge site.
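
As a minimal sketch, the explicit list of remote sites could be modelled as Keystone regions known to the original site, so the original site's credentials are reused for each target. The region names and the clouds.yaml entry below are illustrative assumptions.

  # A minimal Level 2 sketch, assuming the remote sites are registered as
  # regions in the original site's Keystone; all names are illustrative.
  import openstack

  REMOTE_SITES = ['edge-region-1', 'edge-region-2']

  for region in REMOTE_SITES:
      # Reuse the original site's credentials; only the target region changes.
      conn = openstack.connect(cloud='original-site', region_name=region)
      server = conn.create_server('demo-' + region, image='cirros-edge',
                                  flavor='m1.small', network='edge-net',
                                  wait=True)
      print(region, server.status)
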
Level 3

Sustainability / network split brain (see the retry sketch after the list).

  • Both the controlling and controlled components should be prepared for unreliable networks, therefore they should
    • Have a policy for retrying operations without overloading the network
    • Be able to pause the communication while the network is down and resume it after the network has recovered
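
A minimal sketch of such a retry policy, assuming openstacksdk/keystoneauth1; the exponential backoff with jitter avoids flooding a flapping WAN link, and the site name edge-site-1 is an illustrative assumption.

  # A minimal Level 3 sketch: retry API calls with exponential backoff and
  # jitter so a flapping WAN link is not flooded with immediate retries.
  import random
  import time

  import openstack
  from keystoneauth1 import exceptions as ksa_exceptions


  def call_with_backoff(func, max_attempts=5, base_delay=2.0):
      for attempt in range(max_attempts):
          try:
              return func()
          except (ksa_exceptions.ConnectFailure, ksa_exceptions.ConnectTimeout):
              if attempt == max_attempts - 1:
                  raise
              # Wait 2, 4, 8, ... seconds plus jitter before the next attempt.
              time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))


  conn = openstack.connect(cloud='edge-site-1')
  servers = call_with_backoff(lambda: list(conn.compute.servers()))
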
Level 4.1

Collaboration/interaction between edge sites (see the image sharing sketch after the list).

  • Sharing a VM image between sites
  • Create network between sites
  • Create a mesh application (i.e. several VMs deployed through several sites)
  • Use a remote 'remote-attached' volume - please note that this remote-attached volume is stored on the remote site (while in Level 1/Level 2 the remote-attached volume was stored locally)
  • Relocating workloads from one site to another one
    • There should be a way to discover and negotiate site capabilities as a migration target. There might be differences in hypervisors and hypervisor features.
    • Different hypervisors might use different VM image formats
      • use different hypervisor-specific images, chosen by their metadata and derived from the same original image - this also tells which image to start where (on which site)
      • use a common image tested on a certain pool of hypervisors - so one can guarantee that an image is certified for hypervisors X, Y, Z (maybe even with hypervisor-version granularity?)
    • Cold migration
    • Live migration
      • Requires direct connectivity between the compute nodes - e.g. through some sort of point-to-point overlay connection (e.g. via IPsec) between the two compute nodes (source and destination)
      • How do we handle attached cinder volumes?
    • Mild migration
      • Take a VM snapshot and move that one
  • Leverage a flavor that has been created on another site.
  • Rollout of one VM on a set of sites
  • Define a scope/radius for collaborations (i.e. it should be possible to explicitly define the locations where workloads can be launched and where data can be saved for a particular tenant).
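
A minimal sketch of sharing a VM image between two sites by exporting it from one site and importing it into the other, assuming clouds.yaml entries edge-site-1 and edge-site-2 and an illustrative image name; a real deployment might instead rely on a shared or replicated Glance backend.

  # A minimal Level 4.1 sketch: copy a VM image from one edge site to another
  # via export/import; site and image names are illustrative assumptions.
  import openstack

  src = openstack.connect(cloud='edge-site-1')
  dst = openstack.connect(cloud='edge-site-2')

  # Stream the image data out of the source site to a local file.
  image = src.image.find_image('cirros-edge')
  resp = src.image.download_image(image, stream=True)
  with open('/tmp/cirros-edge.img', 'wb') as f:
      for chunk in resp.iter_content(chunk_size=1024 * 1024):
          f.write(chunk)

  # Re-import the image on the destination site with the same formats.
  dst.create_image('cirros-edge', filename='/tmp/cirros-edge.img',
                   disk_format=image.disk_format,
                   container_format=image.container_format, wait=True)
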
Level 4.2

Same as Level 4.1, but for containers.

Level 4.3

Same as Level 4.1 and Level 4.2, but in an implicit manner.

  • Edge-compliant scheduler/placement engine
  • A workload should be relocated autonomously to meet performance objectives (follow an end user, ...)
  • Edge-compliant application orchestrator (Heat-like approach)
  • Autoscaling of hosts within an edge site or between edge sites (this requires a Ceilometer-like system)
Level 5

Administration features

  • Zero-touch provisioning (e.g. from bare-metal resources to an OpenStack)
    • Kolla/Kubernetes/Helm deployment
    • Join the swarm (authentication of the site and the swarm)
    • What about network equipment?
    • Remote hardware management (inventory, eSW management, configuration of BIOS and similar things)
  • Remote upgrade (of OpenStack core services)
    • Service continuity of the control plane
    • Service continuity of the workloads
  • Control plane versioning issues
    • This is somewhat similar to the capability discovery
  • The monitoring service gets important events through notifications/triggers (see the sketch after this list).
    • logs and alarms
    • logs/alarms and events
    • logs, alarms, events, and performance metrics
    • new performance metrics related to latency between edge sites
  • Build monitoring dashboards in real time (and on demand, in order to define the scope of the resources the administrator wants to monitor)
  • Workload consolidation/relocation for maintenance operations and energy optimization (green energy sources - solar panels/wind turbines...)
  • Perform a particular operation on each edge site (configure my users and tenants only once so the configurations are consistent among all of my edge clouds)
  • Other autonomous mechanisms
    • Collect information on the edge sites and perform operations based on it (autoscaling)
  • Dealing with churn challenges (edge site appearances/removals)
    • Human-initiated vs. crash-induced.
    • Connecting the network of a newly provisioned edge site
  • Operators of different parts of the edge infrastructure should have a view on their own operation domain (god-mode users can see everything, while operators of the edge clouds of a region can see only the data of those edge clouds)
  • Be able to look at a cloud of clouds and, for a user of this cloud of clouds, give them access to a set of features and a set of nodes where they can use those features.
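
As a minimal sketch of the monitoring bullet above, an edge site could forward important events to a monitoring service by listening to OpenStack notifications with oslo.messaging; the transport URL, topic, and event handling below are illustrative assumptions, not an agreed design.

  # A minimal Level 5 sketch: listen to OpenStack notifications on an edge
  # site and forward important events; the transport URL is illustrative.
  import oslo_messaging
  from oslo_config import cfg


  class EdgeEventEndpoint(object):
      def info(self, ctxt, publisher_id, event_type, payload, metadata):
          # Forward interesting events (e.g. compute.instance.create.end)
          # to the central monitoring system here.
          print(publisher_id, event_type)


  transport = oslo_messaging.get_notification_transport(
      cfg.CONF, url='rabbit://guest:guest@edge-site-1:5672/')
  targets = [oslo_messaging.Target(topic='notifications')]
  listener = oslo_messaging.get_notification_listener(
      transport, targets, [EdgeEventEndpoint()], executor='threading')
  listener.start()
  listener.wait()
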
Level 6

Multiple cloud stacks (OpenStack, Kubernetes, ...)

  • Level 4 and Level 5 features, but between different versions of OpenStack
  • Level 4 and Level 5 features, but between different cloud solutions (e.g. OpenStack and Kubernetes)
Level 7

Multi-operator scenarios

  • Security considerations
    • Guarantee that the communication between the VMs or containers of an application is secure
    • (Physical) security issues (can we limit human Byzantine attackers?)
    • Data privacy (jurisdiction concerns)
  • HA / Reliability / Recovery challenges

Requirements

This section collects the concrete requirements captured towards existing or new open source projects.

Location awareness

An edge cloud site should be aware of its location.

Component: OpenStack Keystone? / Kubernetes? An edge cloud instance should be able to store data about its location.
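
A minimal sketch of one possible way to store location data, assuming it is attached to the site's Keystone region; the region id and the description format are illustrative assumptions, not a proposed Keystone change.

  # A minimal location-awareness sketch: record the site's location in its
  # Keystone region; the id and description format are illustrative.
  import openstack

  conn = openstack.connect(cloud='edge-site-1')
  conn.identity.create_region(
      id='edge-dublin-01',
      description='lat=53.3498;lon=-6.2603;site=dublin-ptg-demo')
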

Metadata distribution

Discovery of data sources

An edge cloud instance should be able to discover other edge cloud instances that can be trusted as a source of metadata.
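
No mechanism was agreed for this. As a minimal sketch, under the assumption that discovery is bootstrapped from a known control site, the peer list could be read from that site's Keystone region catalogue; the cloud name and the trust assumption are illustrative.

  # A minimal discovery sketch, assuming a known control site whose Keystone
  # region list is treated as the trusted source of peer metadata.
  import openstack

  control = openstack.connect(cloud='control-site')
  trusted_sites = [region.id for region in control.identity.regions()]
  print(trusted_sites)
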


Links

  1. [1]
  2. [2]
  3. [3]
  4. [4]
  5. [5]