Fault Genes Working Group
Status: Active (as of May 2016)
Chairs: Nemat Bidokhti and Rochelle (Rocky) Grober
OpenStack Foundation liaison: David F. Flanders <Flanders@OpenStack.org>
Welcome to the OpenStack Fault Genes Working Group landing page.
Please cite this working group using the #FaultGenes-WG tag on social infrastructure.
The Fault Genes Working Group's goal is to be the DNA of OpenStack fault tolerance. The group supports any community member who wants to help create an OpenStack failure mode taxonomy to be used from design through deployment of OpenStack. It supports both operators and developers. It helps operators by identifying the types of failures that can occur, how they should be reported, and what their impact would be. It supports developers by advising them on what to watch out for and how to mitigate failures in the design. If you are interested in making OpenStack resilient, please join!
The intent of this working group is to serve operators and application developers.
- Primary: Operators - Provide insights into OpenStack failures, their analysis, and an understanding of how logging happens.
- Secondary: Application Developers - The results from our working group will help the logging group produce good error codes.
Link to our working group charter: https://docs.google.com/document/d/16Rc_Ye_qpHWfq4P2uaTaGldfYIEPlajfBAu4VxvJg1Y/edit#
The group currently communicates by email and conference call. Future communication with members is planned to move to the open community mailing list.
- <firstname.lastname@example.org> (Please prefix email subject lines with the tag "[FaultGenes-WG]".)
No formal membership is required. Please introduce yourself and/or fellow colleagues to this working group by emailing email@example.com.
This group is open to all members of the OpenStack community and supporting vendors.
Significant dates for the working group (in reverse chronological order):
- [Please add latest activity for the #FaultGenes-WG here at the top of this list]
- A proposal for hosting a face-to-face working group meeting has been discussed but the date has not been set - TBD
- The group was approved by the OpenStack Foundation's User Committee board on 5th May 2016.
- The wiki page was minted on 2nd June 2016.
- The original proposal and call for nominations were drafted in this etherpad: https://etherpad.openstack.org/p/Fault_Genes-WG
- The #FaultGenes-WG originated from a conversation in Feb 2016, before the Austin OpenStack Summit, with ideas from Nemat Bidokhti and Rochelle Grober.
- Meeting Times: 9:00 AM PST
- Meeting Dates: Every week (Mondays)
- Conference Call Info: Meeting Conference Link: https://www.connectmeeting.att.com (Meeting Number: 8887160594, Code: 3773562)
- USA Toll-Free: 888-716-0594
- USA Caller Paid: 215-861-6199
- For Other Countries: Global Conference Access Numbers: https://www.teleconference.att.com/servlet/glbAccess?process=1&accessCode=3773562&accessNumber=2158616199#C2
Resources and Reference:
- The link to our Google Sheet template: https://docs.google.com/spreadsheets/d/1sekKLp7C8lsTh-niPHNa2QLk5kzEC_2w_UsG6ifC-Pw/edit#gid=2142834673
- Working group charter: https://docs.google.com/document/d/16Rc_Ye_qpHWfq4P2uaTaGldfYIEPlajfBAu4VxvJg1Y/edit?usp=sharing
- Use case: https://review.openstack.org/#/c/317695
The Fault Genes Working Group plans to meet at the OpenStack Design summits, and also at operators mid-cycle meetups (where possible).
- 1st meeting - OpenStack Summit April 2016, Austin, USA: https://etherpad.openstack.org/p/Fault_Genes-WG
- 2nd meeting - OpenStack Summit October 2016, Barcelona, Spain: TBD
For the best global coverage, IRC meetings will be held in the future.
- Fault - A condition that causes the software to fail to perform its required function (IEEE definition).
- Error - The difference between the actual output and the expected output (IEEE definition).
- Failure - The inability of a system or component to perform its required function according to its specification (IEEE definition).
- Priority - Defined by business rules. It indicates how important the defect is and how soon it should be fixed.
- Severity - Defined by the extent of damage done by the defect. It indicates how badly the defect affects the functionality of the software product.
- P1 - Critical: Interruption making a critical functionality inaccessible, or a complete network interruption causing a severe impact on services availability. There is no possible alternative. (4 hours)
- P2 - Important: Critical functionality or network access interrupted, degraded or unusable, having a severe impact on services availability. No acceptable alternative is possible. (24 hours)
- P3 - Normal: Non-critical function or procedure, unusable or hard to use, having an operational impact but no direct impact on services availability. A workaround is available. (3 days)
- P4 - Low: Application or personal procedure unusable, where a workaround is available or a repair is possible. (5 days)
- Normal Behavior - A failure where there is a clear alarm and failure notification.
- Abnormal Behavior - A failure with no alarm or notification.
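The priority levels and the normal/abnormal distinction above can be written down as a small lookup, which may help when tagging entries consistently. The following is only a minimal Python sketch and not part of the working group's tooling. The names `Priority`, `TIME_WINDOW`, and `classify_behavior` are hypothetical. Treating the durations in parentheses as a per-priority time window is an assumption, since the definitions do not state what the durations measure. The "unclassified" result for a failure with an alarm but no notification (or the reverse) is also an assumption, because the definitions do not cover that case.

```python
from datetime import timedelta
from enum import Enum


class Priority(Enum):
    """Priority levels P1-P4 as named in the definitions above."""
    P1 = "Critical"
    P2 = "Important"
    P3 = "Normal"
    P4 = "Low"


# Durations given in parentheses next to each priority level.
# Assumption: read here as a time window per priority; the source
# does not say whether this is a response or a resolution target.
TIME_WINDOW = {
    Priority.P1: timedelta(hours=4),
    Priority.P2: timedelta(hours=24),
    Priority.P3: timedelta(days=3),
    Priority.P4: timedelta(days=5),
}


def classify_behavior(alarm_raised: bool, notification_sent: bool) -> str:
    """Classify a failure as normal or abnormal behavior.

    Normal: a clear alarm and a failure notification were both present.
    Abnormal: neither an alarm nor a notification was present.
    Mixed cases are not covered by the definitions, so they are
    reported as 'unclassified' (an assumption of this sketch).
    """
    if alarm_raised and notification_sent:
        return "normal"
    if not alarm_raised and not notification_sent:
        return "abnormal"
    return "unclassified"
```

For instance, `TIME_WINDOW[Priority.P2]` gives a 24-hour `timedelta`, and `classify_behavior(False, False)` reports a silent failure as abnormal.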
Template Recommendations & Best Practices
The following were recommended by our members for filling out the Google Sheet template:
- Make sure you identify the API version(s) -- project specific, so could be a set
- Where/how failure manifested itself (project/log file/action/etc)
- Log file snippets with key messages are also good here (and name/location of file)
- Abnormal behaviors noticed leading up to the failure - hindsight is fine here.
- Trigger conditions if known
- Signs or indications of whether any part of the deployment might be overly stressed/at its limits
- Is the related log message well formed? Is the context correct?
- Clean up process?
- Map to/from Launchpad
- Affected components/combinations of components installed? For example, Nova networking vs. Neutron networking
- Which Workflow did the event happen with?
- How to reproduce?
- How was this detected and can we have monitoring to detect in the future?
- Is this a code issue or a configuration issue? Was the log helpful or not? (Kibana Dashboard examples?)
- What did the recovery entail?
- How often has the failure occurred, or been short-circuited once precursor symptoms were identified?
- Other architectural features you think are important to the root cause (cells, regions, single vs. multiple copies of components, etc.)
- If this all exists in a bug report already, put the bug report link here and pointers to the info in the report (you could add any info missing from bug to the bug itself and point to the info here)
- If this is a problem you've seen, but someone else submitted it, please annotate with +1 or whatever and comment where your experiences differ.
- Make sure if you have comments, they all are a single color that is unique so the info can be linked to a single deployment, or tack the difference info on at the bottom of the failure
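As a rough illustration of how the recommendations above could be captured in a structured record, here is a minimal Python sketch. It does not reflect the actual columns of the Google Sheet template. The class `FailureReport`, its field names, and the helper `missing_fields` are all hypothetical and are derived only from the list above. The values in the usage lines are placeholders, not real data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FailureReport:
    """Hypothetical record mirroring some of the recommendations above."""
    api_versions: List[str]          # API version(s); project specific, may be a set
    manifested_in: str               # where/how it showed up (project/log file/action)
    how_to_reproduce: str = ""
    trigger_conditions: str = ""     # if known
    log_snippets: List[str] = field(default_factory=list)
    log_file_path: str = ""          # name/location of the log file
    affected_components: List[str] = field(default_factory=list)
    is_configuration_issue: Optional[bool] = None  # None means not yet determined
    recovery_steps: str = ""
    occurrences: Optional[int] = None
    bug_report_url: Optional[str] = None


def missing_fields(report: FailureReport) -> List[str]:
    """Return the names of fields that are still empty or unknown."""
    return [
        name
        for name, value in vars(report).items()
        if value in ("", None) or value == []
    ]


# Usage with placeholder values:
report = FailureReport(api_versions=["v2.1"], manifested_in="nova-api log")
still_needed = missing_fields(report)
```

A helper like `missing_fields` could be used to prompt a submitter for the items that are still blank before an entry is added to the sheet.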