This SIG aims to coordinate the use and development of several OpenStack projects which can be combined in various ways to manage OpenStack infrastructure in a policy-driven fashion, reacting to failures and other events by automatically healing services.
One of the biggest promises of the cloud vision was the idea that all the infrastructure could be managed in a policy-driven fashion, reacting to failures and other events by automatically healing and optimising services. Most of the components required to implement such an architecture already exist within OpenStack:
- HA of individual services
- Monasca: monitoring
- Aodh: alarming
- Congress: policy-based governance
- Mistral: workflow
- Senlin: clustering service
- Vitrage: root cause analysis
- Watcher: optimization
- Masakari: compute plane HA
- Freezer-dr: compute plane HA
- Heat: orchestration (normally used for cloud applications, but can also deploy cloud infrastructure via TripleO, and will be able to deploy Vitrage templates)
- Barometer: Monitoring and Service Assurance for NFV
- Doctor: fault management and maintenance for NFV
- Fault Genes Working Group: Fault classification & Recovery Strategy
- Craton: Fleet management
- Kolla: containerized OpenStack deployment tool
However, there is not yet a clear strategy within the community for how these should all tie together. This SIG aims to address that.
The original proposal defined the SIG's scope as self-healing of cloud infrastructure, so for now it is primarily of interest to developers and operators, not end users. However, the scope may be extended in the future to cover self-healing of cloud applications (e.g. see https://www.openstack.org/videos/barcelona-2016/building-self-healing-applications-with-aodh-zaqar-and-mistral), in which case end users could also find the SIG useful.
The scope could encompass not only self-healing of failures and service degradations, but also automatic optimization such as that performed by Watcher. However, this would make the name "self-healing" an imperfect fit, because "healing" implies something is sick or broken, whereas optimization occurs even when the cloud is perfectly healthy. At the Sydney Forum session it was decided that it was better to be pragmatic and start small by focusing on hard failures. Optimization can easily be introduced later if required.
In scenarios where there are multiple solutions to the same self-healing use case, it is not in the scope of the SIG to take an opinionated position by recommending one solution or project over another. The SIG intends to remain project-agnostic, instead merely presenting the facts regarding what is and isn't currently possible, and what is intended for future development. This should enable operators and users to make better-informed decisions based on their own needs.
- Document reference stacks describing what use cases can already be addressed with the existing projects. (Even better if some of these stacks have already been tested in the wild.)
- Document what integrations between the projects already exist at a technical level.
- Collect real-world use cases from operators, including ones which they would like to accomplish but cannot yet.
- From the collected use cases, perform gap analysis to help shape the future direction of these projects, e.g. through specs targeting those gaps.
- Perform overlap analysis to help ensure that the projects are correctly scoped and integrate well without duplicating any significant effort.
- Ensure that operators and developers are connecting on this topic on a regular basis, so that project development is steered in directions which will meet real-world requirements.
- Developers working on the OpenStack projects listed above
- Architects responsible for designing OpenStack deployments
- Operators responsible for deploying and managing OpenStack
As the scope increases in the future, we may also want to include:
- Architects responsible for designing applications which run on OpenStack clouds
- Developers responsible for developing applications which run on OpenStack clouds
- End users of applications which run on OpenStack clouds
From a feature request to a design spec, we value all participation. Please see the SIG's contributor guide.
Community Infrastructure / Resources
- Wiki: this page
- SIG StoryBoard (for an authoritative list of all ongoing work within the SIG)
- Official SIG documentation
- self-healing-sig git repository and associated reviews
- openstack-discuss mailing list; use the
- a list of existing integration points between self-healing projects
- IRC channel: #openstack-self-healing on Freenode IRC
- IRC meetings (including logs from past meetings)
- patch reviews (gerrit)
- Adam Spiers
- Co-lead: Eric Kao
- BoF and SIG sessions at Berlin Forum, November 2018
- Denver PTG, September 2018
- Various events at the Vancouver summit
- BoF session, Thursday, May 24, 1:50pm-2:30pm. Topics were captured in the YVR-self-healing-brainstorming etherpad.
- Cloud Monitoring with Vitrage – Hands-On Lab
- Vitrage - Project Update
- Closing the Loop: VNF end-to-end Failure Detection and Auto Healing
- Proactive Root Cause Analysis with Vitrage, Kubernetes, Zabbix and Prometheus
- Vitrage - Project Onboarding
- Masakari - Project Update
- Masakari - Project Onboarding
- Congress - Project Update
- Mistral - Project Update
- Mistral - Project Onboarding
- Monasca - Project Update
- Monasca - Project Onboarding
- Barometer beyond Service Assurance: monitoring as a service in OPNFV and beyond
- Presentation/discussion at London OpenStack meetup, March 2018
- Session at Tokyo Ops Meetup, March 2018
- Session at Dublin PTG, February 2018
- Session at Sydney Forum, November 2017
- Session at Denver PTG, September 2017
As a small measure of protection against email crawlers, emails are kept at https://ethercalc.openstack.org/docID where docID is e6retozlgrf8. Ongoing work regarding this list is tracked in https://etherpad.openstack.org/p/self-healing-contacts
|Ansible (OpenStack)||Jean-Philippe Evrard||evrardjp|
|Fault Genes WG||Nematollah Bidokhti|||
|OPNFV Barometer||Sunku Ranganath||sunku-ranganath|
|OPNFV Doctor||Tomi Juvonen||tojuvone|
|Senlin||Qi Ming Teng||Qiming|
The idea for the SIG was born out of long-standing efforts to unify the OpenStack HA community around a single solution for instance HA, coupled with the realisation that this was just one of many self-healing use cases required in order for OpenStack infrastructure to be robust and performant.
A longer description of the history is in this blog post.