
Difference between revisions of "Self-healing SIG"

(Background: add missing link to TripleO)
(Self-healing SIG: officially formed)
Line 1: Line 1:
 
== Self-healing SIG ==
 
  
'''Status''': Forming (please fill out [https://docs.google.com/forms/d/e/1FAIpQLSekIFAFYc1mpBkQHgZwIVLOj-rQQPjw9Di3-hXL03ilhI80rg/viewform the survey]!)
+
'''Status''': Formed
  
 
'''Original proposal''': http://lists.openstack.org/pipermail/openstack-sigs/2017-September/000054.html
 
 
'''Interim follow-up on survey''': http://lists.openstack.org/pipermail/openstack-sigs/2017-October/000109.html
 
  
 
=== Mission ===
 
  
This SIG aims to coordinate the use and development of several OpenStack projects which can be combined in various ways to manage OpenStack infrastructure in a policy-driven fashion, reacting to failures and other events by automatically healing and optimising services.
+
This SIG aims to coordinate the use and development of several OpenStack projects which can be combined in various ways to manage OpenStack infrastructure in a policy-driven fashion, reacting to failures and other events by automatically healing services.
 
 
=== Scope ===
 
 
 
The [http://lists.openstack.org/pipermail/openstack-sigs/2017-September/000054.html original proposal] defined the SIG's scope as self-healing / optimization of cloud infrastructure, meaning it would primarily be of interest to developers and operators, not end users. However it would also be possible to extend the scope to self-healing / optimization of cloud applications (e.g. see https://www.openstack.org/videos/barcelona-2016/building-self-healing-applications-with-aodh-zaqar-and-mistral), in which case end users could also find the SIG useful.  We have not yet reached consensus on this point, so please submit your opinion via [https://docs.google.com/forms/d/e/1FAIpQLSekIFAFYc1mpBkQHgZwIVLOj-rQQPjw9Di3-hXL03ilhI80rg/viewform the survey]!
 
 
 
=== SIG name ===
 
 
 
We definitely want the scope to include not only self-healing of failures and service degradations, but also automatic optimization such as that performed by Watcher.  However this raises the issue that the name "self-healing" is not perfect because "healing" implies something is sick/broken, and optimization occurs even when the cloud is perfectly healthy.  As a result we have not yet reached consensus on the SIG's name, so please submit your opinion via [https://docs.google.com/forms/d/e/1FAIpQLSekIFAFYc1mpBkQHgZwIVLOj-rQQPjw9Di3-hXL03ilhI80rg/viewform the survey]!
 
  
 
=== Background ===
 
  
One of the biggest promises of the cloud vision was the idea that all the infrastructure could be managed in a policy-driven fashion, reacting to failures and other events by automatically healing and optimising services.  Most of the components required to implement such an architecture already exist:
+
One of the biggest promises of the cloud vision was the idea that all the infrastructure could be managed in a policy-driven fashion, reacting to failures and other events by automatically healing and optimising services.  Most of the components required to implement such an architecture already exist within OpenStack:
  
 +
* [https://docs.openstack.org/ha-guide/ HA of individual services]
 
* [http://monasca.io/ Monasca]: monitoring
 
 
* [https://docs.openstack.org/aodh/latest/ Aodh]: alarming
 
Line 36: Line 27:
 
* [[Fault Genes Working Group]]: Fault classification & Recovery Strategy
 
 
* [http://craton.readthedocs.io/en/latest/readme.html Craton]: Fleet management
 
 +
* Kolla: Containerized OpenStack deployment tool
 +
* kolla-k8s: same as above but in kubernetes cluster
  
 
However, there is not yet a clear strategy within the community for how these should all tie together.  This SIG aims to address that.
 
 +
 +
=== Scope ===
 +
 +
The [http://lists.openstack.org/pipermail/openstack-sigs/2017-September/000054.html original proposal] defined the SIG's scope as self-healing of cloud infrastructure, so for now it is primarily of interest to developers and operators, not end users. However it is also possible that in the future we will extend the scope to self-healing of cloud ''applications'' (e.g. see https://www.openstack.org/videos/barcelona-2016/building-self-healing-applications-with-aodh-zaqar-and-mistral), in which case end users could also find the SIG useful.
 +
 +
The scope could encompass not only self-healing of failures and service degradations, but also automatic optimization such as that performed by Watcher.  However this would raise the issue that the name "self-healing" is not perfect because "healing" implies something is sick/broken, and optimization occurs even when the cloud is perfectly healthy.  At [https://wiki.openstack.org/wiki/Forum/Sydney2017 the Sydney Forum session] it was decided that it was better to be pragmatic and start small by focusing on hard failures.  Optimization can easily be introduced later if required.
  
 
=== Goals ===
 
Line 54: Line 53:
 
* Operators responsible for deploying and managing OpenStack
 
  
Depending on the resolution of the scoping dilemma mentioned above, we may also want to include:
+
As the scope increases in the future, we may also want to include:
  
 
* Architects responsible for designing applications which run on OpenStack clouds
 
Line 62: Line 61:
 
=== SIG Leads ===
 
  
TBD; [[User:Adam Spiers|Adam Spiers]] will volunteer if noone else does, but preferably we should have more than one lead to increase the [https://en.wikipedia.org/wiki/Bus_factor bus factor].
+
* [[User:Adam Spiers|Adam Spiers]]
 +
* Co-lead: Eric Kao
  
 
=== Community Infrastructure ===
 
  
* Wiki: this page (but may be renamed)
+
* Wiki: this page
* [http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-sigs openstack-sigs mailing list]; maybe <code>[self-healing]</code> for the tag, but this depends on the resolution of the naming issue mentioned above
+
* [http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-sigs openstack-sigs mailing list]; use the <code>[self-healing]</code> tag
* IRC channel: TBD; depends on the resolution of the naming issue mentioned above
+
* IRC channel: still to be created
 
* IRC meetings: TBD, #openstack-meeting-$somenumber; agenda / details to be linked on SIG page + meetings list
 
  
 
=== Upcoming events ===
 
  
* [http://forumtopics.openstack.org/cfp/details/26 A session has been proposed for the Sydney Forum]
 
 
* A session will be proposed for the Tokyo ops meetup
 
  
Line 80: Line 79:
 
The idea for the SIG was born out of long-standing efforts to unify the OpenStack HA community around [https://aspiers.github.io/openstack-day-israel-2017-compute-ha/ a single solution for instance HA], coupled with the realisation that this was just one of many self-healing use cases required in order for OpenStack infrastructure to be robust and performant.
 
  
The first meeting of the SIG happened at the Denver [https://www.openstack.org/ptg/ PTG], and was minuted in [https://etherpad.openstack.org/p/self-healing-queens-ptg this etherpad].
+
The first meeting happened at the Denver [https://www.openstack.org/ptg/ PTG], and was minuted in [https://etherpad.openstack.org/p/self-healing-queens-ptg this etherpad].  [http://lists.openstack.org/pipermail/openstack-sigs/2017-September/000054.html The SIG was formally proposed] as a result of this meeting.
 +
 
 +
A Sydney Forum session was [http://forumtopics.openstack.org/cfp/details/26 proposed], [https://www.openstack.org/summit/sydney-2017/summit-schedule/events/20508/self-healing-and-optimization-sig accepted], and [https://etherpad.openstack.org/p/self-healing-rocky-forum took place], after which the SIG was officially formed.

Revision as of 14:10, 27 November 2017

Self-healing SIG

Status: Formed

Original proposal: http://lists.openstack.org/pipermail/openstack-sigs/2017-September/000054.html

Mission

This SIG aims to coordinate the use and development of several OpenStack projects which can be combined in various ways to manage OpenStack infrastructure in a policy-driven fashion, reacting to failures and other events by automatically healing services.

Background

One of the biggest promises of the cloud vision was the idea that all the infrastructure could be managed in a policy-driven fashion, reacting to failures and other events by automatically healing and optimising services.  Most of the components required to implement such an architecture already exist within OpenStack:

However, there is not yet a clear strategy within the community for how these should all tie together. This SIG aims to address that.

Scope

The original proposal defined the SIG's scope as self-healing of cloud infrastructure, so for now it is primarily of interest to developers and operators, not end users. However, the scope may later be extended to self-healing of cloud applications (e.g. see https://www.openstack.org/videos/barcelona-2016/building-self-healing-applications-with-aodh-zaqar-and-mistral), in which case end users could also find the SIG useful.

The scope could encompass not only self-healing of failures and service degradations, but also automatic optimization such as that performed by Watcher. However, this would make the name "self-healing" a poor fit, because "healing" implies something is sick or broken, whereas optimization occurs even when the cloud is perfectly healthy. At the Sydney Forum session it was decided to be pragmatic and start small by focusing on hard failures. Optimization can easily be introduced later if required.

Goals

  • Document reference stacks describing what use cases can already be addressed with the existing projects. (Even better if some of these stacks have already been tested in the wild.)
  • Document what integrations between the projects already exist at a technical level. (This was already started during the Denver PTG, by placing the projects into phases of a high-level flow, and then collaboratively building a Google Drawing to show that.)
  • Collect real-world use cases from operators, including ones which they would like to accomplish but cannot yet.
  • From the collected use cases, perform gap analysis to help shape the future direction of these projects, e.g. through specs targeting those gaps.
  • Perform overlap analysis to help ensure that the projects are correctly scoped and integrate well without duplicating any significant effort.
  • Ensure that operators and developers are connecting on this topic on a regular basis, so that project development is steered in directions which will meet real-world requirements.

Audience

  • Developers working on the OpenStack projects listed above
  • Architects responsible for designing OpenStack deployments
  • Operators responsible for deploying and managing OpenStack

As the scope increases in the future, we may also want to include:

  • Architects responsible for designing applications which run on OpenStack clouds
  • Developers responsible for developing applications which run on OpenStack clouds
  • End users of applications which run on OpenStack clouds

SIG Leads

  • Adam Spiers
  • Co-lead: Eric Kao

Community Infrastructure

  • Wiki: this page
  • openstack-sigs mailing list; use the [self-healing] tag
  • IRC channel: still to be created
  • IRC meetings: TBD, #openstack-meeting-$somenumber; agenda / details to be linked on SIG page + meetings list

Upcoming events

  • A session will be proposed for the Tokyo ops meetup

History

The idea for the SIG was born out of long-standing efforts to unify the OpenStack HA community around a single solution for instance HA, coupled with the realisation that this was just one of many self-healing use cases required in order for OpenStack infrastructure to be robust and performant.

The first meeting happened at the Denver PTG, and was minuted in this etherpad. The SIG was formally proposed as a result of this meeting.

A Sydney Forum session was proposed, accepted, and took place, after which the SIG was officially formed.