
Difference between revisions of "Self-healing SIG"

(add integrations list and highlight storyboard)
(21 intermediate revisions by 5 users not shown)
Line 23: Line 23:
 
* [[Masakari]]: compute plane HA
* [https://docs.openstack.org/freezer/latest/ Freezer-dr]: compute plane HA
* [https://docs.openstack.org/heat/latest/ Heat]: orchestration (normally used for cloud applications, but can also deploy cloud infrastructure via [[TripleO]])
+
* [https://docs.openstack.org/heat/latest/ Heat]: orchestration (normally used for cloud applications, but can also deploy cloud infrastructure via [[TripleO]], and [https://storyboard.openstack.org/#!/story/2002684 will be able to deploy Vitrage templates])
 
* [https://wiki.opnfv.org/display/fastpath/Barometer+Home Barometer]: Monitoring and Service Assurance for NFV
* [https://www.opnfv.org/community/projects/doctor Doctor]: fault management and maintenance for NFV
* [[Fault Genes Working Group]]: Fault classification & Recovery Strategy
* [http://craton.readthedocs.io/en/latest/readme.html Craton]: Fleet management
* Kolla: Containerized OpenStack deployment tool
+
* [https://docs.openstack.org/kolla/latest/ Kolla]: Containerized OpenStack deployment tool
* kolla-k8s: same as above but in kubernetes cluster
+
 
  
  
Line 39: Line 39:
  
 
The scope could encompass not only self-healing of failures and service degradations, but also automatic optimization such as that performed by Watcher.  However this would raise the issue that the name "self-healing" is not perfect because "healing" implies something is sick/broken, and optimization occurs even when the cloud is perfectly healthy.  At [https://wiki.openstack.org/wiki/Forum/Sydney2017 the Sydney Forum session] it was decided that it was better to be pragmatic and start small by focusing on hard failures.  Optimization can easily be introduced later if required.
 +
 +
In scenarios where there are multiple solutions to the same self-healing use case, it is not in the scope of the SIG to assume an opinionated position by recommending one solution or project over another.  The SIG intends to remain project-agnostic, instead merely presenting the facts regarding what is and isn't currently possible, and what is intended for future development.  This should enable operators and users to make better-informed decisions based on their own needs.
  
 
=== Goals ===
Line 62: Line 64:
 
* End users of applications which run on OpenStack clouds
  
=== SIG Leads ===
+
=== Getting Involved ===
 +
 
 +
From a feature request to a design spec, we value all participation. Please see the SIG's [https://docs.openstack.org/self-healing-sig/latest/meta/CONTRIBUTING.html contributor guide].
 +
 
 +
=== Documentation ===
 +
 
 +
The [https://docs.openstack.org/self-healing-sig/latest/ official SIG documentation] contains self-healing use cases, cross-project specs, and in the future potentially also cross-project code.
  
* [[User:Adam Spiers|Adam Spiers]]
+
The documentation is generated from [https://git.openstack.org/cgit/openstack/self-healing-sig/ the self-healing-sig git repository]; you can also see [https://review.openstack.org/#/q/project:openstack/self-healing-sig associated change reviews].
* Co-lead: Eric Kao
 
  
 
=== Community Infrastructure / Resources ===
 
For an authoritative list of all ongoing work within the SIG, please see [https://storyboard.openstack.org/#!/project/917 the StoryBoard project].
 
  
 
* Wiki: this page
* [http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-sigs openstack-sigs mailing list]; use the <code>[self-healing]</code> tag
+
* [https://storyboard.openstack.org/#!/project/openstack/self-healing-sig SIG StoryBoard] (for an authoritative list of all ongoing work within the SIG)
 +
* Documentation (see above)
 +
* [http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-discuss openstack-discuss mailing list]; use the <code>[self-healing]</code> tag
 
* [https://etherpad.openstack.org/p/self-healing-project-integrations a list of existing integration points between self-healing projects]
* IRC channel: #openstack-self-healing on Freenode IRC
+
* IRC channel: #openstack-self-healing on [http://freenode.net/ Freenode] IRC
* IRC meetings: TBD, #openstack-self-healing; agenda / details to be linked on SIG page + meetings list
+
* [http://eavesdrop.openstack.org/#Self-healing_SIG_Meeting IRC meetings] (including logs from past meetings)
 +
* [https://review.openstack.org/#/q/project:openstack/self-healing-sig patch reviews] (gerrit)
  
=== Upcoming events ===
+
=== SIG Leads ===
  
There is a [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21830/self-healing-sig-bof BoF session scheduled for the Vancouver Forum on Thursday, May 24, 1:50pm-2:30pm].  Topics are being captured in [https://etherpad.openstack.org/p/YVR-self-healing-brainstorming the YVR-self-healing-brainstorming etherpad].
+
* [[User:Adam Spiers|Adam Spiers]]
 +
* Co-lead: Eric Kao
  
Several other sessions relating to self-healing have been accepted for the Vancouver summit:
+
=== Upcoming events ===
  
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/20957/cloud-monitoring-with-vitrage-hands-on-lab Cloud Monitoring with Vitrage – Hands-On Lab]
+
* Events at Denver Summit / Forum, April/May 2019
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21593/vitrage-project-update Vitrage - Project Update]
+
** [https://etherpad.openstack.org/p/DEN-self-healing-SIG BoF and SIG sessions]
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/20839/closing-the-loop-vnf-end-to-end-failure-detection-and-auto-healing Closing the Loop: VNF end-to-end Failure Detection and Auto Healing]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/20964/proactive-root-cause-analysis-with-vitrage-kubernetes-zabbix-and-prometheus Proactive Root Cause Analysis with Vitrage, Kubernetes, Zabbix and Prometheus]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21653/vitrage-project-onboarding Vitrage - Project Onboarding]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21622/masakari-project-update Masakari - Project Update]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21656/masakari-project-onboarding Masakari - Project Onboarding]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21615/congress-project-update Congress - Project Update]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21620/mistral-project-update Mistral - Project Update]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21638/mistral-project-onboarding Mistral - Project Onboarding]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21594/monasca-project-update Monasca - Project Update]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21639/monasca-project-onboarding Monasca - Project Onboarding]
 
* [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21205/barometer-beyond-service-assurance-monitoring-as-a-service-in-opnfv-and-beyond Barometer beyond Service Assurance: monitoring as a service in OPNFV and beyond]
 
  
 
=== Past events ===
  
 +
* Events at Berlin Forum, November 2018
 +
** [https://etherpad.openstack.org/p/berlin-self-healing-sig-brainstorm BoF and SIG sessions]
 +
** [https://www.openstack.org/videos/berlin-2018/towards-production-grade-database-as-a-service-in-openstack Towards production-grade Database as a Service in OpenStack]
 +
* [https://etherpad.openstack.org/p/self-healing-sig-stein-ptg Denver PTG, Sept 2018]
 +
* Various events at the Vancouver summit
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21830/self-healing-sig-bof BoF session, Thursday, May 24, 1:50pm-2:30pm].  Topics were captured in [https://etherpad.openstack.org/p/YVR-self-healing-brainstorming the YVR-self-healing-brainstorming etherpad].
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/20957/cloud-monitoring-with-vitrage-hands-on-lab Cloud Monitoring with Vitrage – Hands-On Lab]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21593/vitrage-project-update Vitrage - Project Update]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/20839/closing-the-loop-vnf-end-to-end-failure-detection-and-auto-healing Closing the Loop: VNF end-to-end Failure Detection and Auto Healing]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/20964/proactive-root-cause-analysis-with-vitrage-kubernetes-zabbix-and-prometheus Proactive Root Cause Analysis with Vitrage, Kubernetes, Zabbix and Prometheus]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21653/vitrage-project-onboarding Vitrage - Project Onboarding]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21622/masakari-project-update Masakari - Project Update]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21656/masakari-project-onboarding Masakari - Project Onboarding]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21615/congress-project-update Congress - Project Update]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21620/mistral-project-update Mistral - Project Update]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21638/mistral-project-onboarding Mistral - Project Onboarding]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21594/monasca-project-update Monasca - Project Update]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21639/monasca-project-onboarding Monasca - Project Onboarding]
 +
** [https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21205/barometer-beyond-service-assurance-monitoring-as-a-service-in-opnfv-and-beyond Barometer beyond Service Assurance: monitoring as a service in OPNFV and beyond]
 
* [https://aspiers.github.io/openstack-meetup-london-march-2018-self-healing/ Presentation/discussion at London OpenStack meetup, March 2018]
* [https://etherpad.openstack.org/p/TYO-ops-meetup-2018-self-healing Session at Tokyo Ops Meetup, March 2018]
Line 105: Line 122:
 
* [https://etherpad.openstack.org/p/self-healing-queens-ptg Session at Denver PTG, September 2017]
  
=== Project contacts ===
+
=== Project liaisons ===
 +
 
 +
The following people have volunteered to act as liaisons between the SIG and the individual project they are focused on.  The intention of documenting these interface points is to encourage bidirectional assistance:
 +
 
 +
# If anyone is working on a self-healing use case and needs help from a specific project, they should have a greater chance of finding someone from that project who has both the knowledge and interest to help them.
 +
# When projects add new functionality which can benefit self-healing use cases, they can proactively inform the SIG.
  
As a small measure of protection against email crawlers, emails are kept at https://ethercalc.openstack.org/docID where docID is e6retozlgrf8
+
As a small measure of protection against email crawlers, emails are kept at https://ethercalc.openstack.org/docID where docID is e6retozlgrf8.  Ongoing work regarding this list is tracked in https://etherpad.openstack.org/p/self-healing-contacts
  
 
{| class="wikitable"
Line 127: Line 149:
 
| Freezer-DR || Saad Zaher ||  || szaher
|-
| Heat || Rico Lin ||  ||  
+
| Heat || Rico Lin ||  || ricolin
 
|-
| Kolla ||  ||  || 
Line 145: Line 167:
 
| OPNFV Barometer || Sunku Ranganath ||  || sunku-ranganath
|-
| OPNFV Doctor ||  ||  ||  
+
| OPNFV Doctor ||  Tomi Juvonen ||  || tojuvone
 
|-
| Senlin || Qi Ming Teng ||  || Qiming

Revision as of 14:55, 8 March 2019

Self-healing SIG

Status: Formed

Original proposal: http://lists.openstack.org/pipermail/openstack-sigs/2017-September/000054.html

Mission

This SIG aims to coordinate the use and development of several OpenStack projects which can be combined in various ways to manage OpenStack infrastructure in a policy-driven fashion, reacting to failures and other events by automatically healing services.

Background

One of the biggest promises of the cloud vision was the idea that all the infrastructure could be managed in a policy-driven fashion, reacting to failures and other events by automatically healing and optimising services.  Most of the components required to implement such an architecture already exist within OpenStack:


However, there is not yet a clear strategy within the community for how these should all tie together. This SIG aims to address that.

Scope

The original proposal defined the SIG's scope as self-healing of cloud infrastructure, so for now it is primarily of interest to developers and operators, not end users. However, it is also possible that in the future we will extend the scope to self-healing of cloud applications (e.g. see https://www.openstack.org/videos/barcelona-2016/building-self-healing-applications-with-aodh-zaqar-and-mistral), in which case end users could also find the SIG useful.

The scope could encompass not only self-healing of failures and service degradations, but also automatic optimization such as that performed by Watcher. However, this would raise the issue that the name "self-healing" is not perfect, because "healing" implies something is sick/broken, and optimization occurs even when the cloud is perfectly healthy. At the Sydney Forum session it was decided that it was better to be pragmatic and start small by focusing on hard failures. Optimization can easily be introduced later if required.

In scenarios where there are multiple solutions to the same self-healing use case, it is not in the scope of the SIG to assume an opinionated position by recommending one solution or project over another. The SIG intends to remain project-agnostic, instead merely presenting the facts regarding what is and isn't currently possible, and what is intended for future development. This should enable operators and users to make better-informed decisions based on their own needs.

Goals

Audience

  • Developers working on the OpenStack projects listed above
  • Architects responsible for designing OpenStack deployments
  • Operators responsible for deploying and managing OpenStack


As the scope increases in the future, we may also want to include:

  • Architects responsible for designing applications which run on OpenStack clouds
  • Developers responsible for developing applications which run on OpenStack clouds
  • End users of applications which run on OpenStack clouds

Getting Involved

From a feature request to a design spec, we value all participation. Please see the SIG's contributor guide.

Documentation

The official SIG documentation contains self-healing use cases, cross-project specs, and in the future potentially also cross-project code.

The documentation is generated from the self-healing-sig git repository; you can also see associated change reviews.

Community Infrastructure / Resources

SIG Leads

  • Adam Spiers
  • Co-lead: Eric Kao

Upcoming events

Past events

Project liaisons

The following people have volunteered to act as liaisons between the SIG and the individual project they are focused on. The intention of documenting these interface points is to encourage bidirectional assistance:

  1. If anyone is working on a self-healing use case and needs help from a specific project, they should have a greater chance of finding someone from that project who has both the knowledge and interest to help them.
  2. When projects add new functionality which can benefit self-healing use cases, they can proactively inform the SIG.

As a small measure of protection against email crawlers, emails are kept at https://ethercalc.openstack.org/docID where docID is e6retozlgrf8. Ongoing work regarding this list is tracked in https://etherpad.openstack.org/p/self-healing-contacts

Project             | Contact              | Email | IRC Handle
Ansible (OpenStack) | Jean-Philippe Evrard |       | evrardjp
Aodh                |                      |       |
Cinder              |                      |       |
Congress            | Eric Kao             |       | ekcs
Craton              |                      |       |
Fault Genes WG      | Nematollah Bidokhti  |       |
Freezer-DR          | Saad Zaher           |       | szaher
Heat                | Rico Lin             |       | ricolin
Kolla               |                      |       |
Masakari            | Adam Spiers          |       | aspiers
Mistral             | Dougal Matthews      |       | d0ugal
Monasca             | Witold Bedyk         |       | witek
Neutron             | Yushiro Furukawa     |       |
Nova                |                      |       |
OPNFV               | Georg Kunz           |       | georgk
OPNFV Barometer     | Sunku Ranganath      |       | sunku-ranganath
OPNFV Doctor        | Tomi Juvonen         |       | tojuvone
Senlin              | Qi Ming Teng         |       | Qiming
Senlin              | XueFeng Liu          |       | XueFeng
Senlin              | Yuanbin Chen         |       | chenyb4
TripleO             | Michele Baldessari   |       | bandini
TripleO             | Damien Ciabrini      |       |
Vitrage             | Ifat Afek            |       | ifat_afek
Watcher             | Alexander Chadin     |       | alexchadin

History

The idea for the SIG was born out of long-standing efforts to unify the OpenStack HA community around a single solution for instance HA, coupled with the realisation that this was just one of many self-healing use cases required in order for OpenStack infrastructure to be robust and performant.

The first meeting happened at the Denver PTG, and was minuted in this etherpad. The SIG was formally proposed as a result of this meeting.

A Sydney Forum session was proposed, accepted, and took place, after which the SIG was officially formed.

A longer description of the history is in this blog post.