ElasticRecheck
How To Help With Gate Failures
TODO(mriedem): Integrate this wiki content into the elastic-recheck readme: https://github.com/openstack-infra/elastic-recheck/blob/master/README.rst
This page collects information and FAQs on elastic-recheck (e-r): how to use it and how to contribute to it.
When you hit a failure and there is no e-r query comment on your patch, but you do find a bug to recheck against, consider writing an e-r query for that bug so you don't have to dig through the logs next time. Lots of people check the http://status.openstack.org/rechecks/ page, but not all of the bugs listed there have e-r queries.
So what's the thought process for writing an e-r query (best practices)?
- First, identify or open the bug to recheck against; that's standard operating procedure.
- See here for more info: https://wiki.openstack.org/wiki/GerritJenkinsGit#Test_Failures
- Second, check the logs of the failed job for something that uniquely identifies the bug.
- Avoid general error messages from Tempest in console.html since those aren't always unique.
- Look for errors/warnings in the various log files, e.g. logs/screen-n-cpu.txt, and pull the identifying message from them.
- Test your query out in http://logstash.openstack.org:
- Typically start with a simple message and filename query over the last 7 days.
- The query is structured like this: message:"<your unique fail here>" AND filename:"<the log that the failure message appears in relative to the root of the job logs>"
- For example: message:"because vif doesn't exist" AND filename:"logs/screen-n-net.txt"
- If you have hits, make sure there are no false positives by checking 'build_status' on the left side of the logstash page. That shows the success/failure rate for the builds that the query hits. A good e-r query needs a 100% failure rate, i.e. it must never match a build that succeeded.
- Query limitations:
- We currently only index INFO and above level messages, so we can't write queries against DEBUG level messages.
- elastic-recheck doesn't currently have multi-line support, i.e. it can't combine two separate error messages into the same query. See https://review.openstack.org/#/c/60508/ for an example of where this is needed.
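The query form and the 'build_status' check described above can be sketched in Python. This is only an illustration, not part of elastic-recheck: the function names are hypothetical, and the 'FAILURE' and 'SUCCESS' labels are assumed to be the values shown under 'build_status' on the logstash page. The counts are typed in by hand from that page; the sketch does not talk to logstash.

```python
def build_query(message, filename):
    """Return a logstash query in the message/filename form."""
    # Escape embedded double quotes so the message stays one quoted phrase.
    safe_message = message.replace('"', '\\"')
    return 'message:"{}" AND filename:"{}"'.format(safe_message, filename)


def failure_rate(build_status_counts):
    """Return the fraction of hits that came from failed builds.

    build_status_counts maps a build_status label (assumed here to be
    "FAILURE" or "SUCCESS") to the number of hits read off the page.
    """
    total = sum(build_status_counts.values())
    if total == 0:
        raise ValueError("the query has no hits, so there is nothing to judge")
    return build_status_counts.get("FAILURE", 0) / total


def is_good_query(build_status_counts):
    """A good e-r query only ever hits failed builds."""
    return failure_rate(build_status_counts) == 1.0


if __name__ == "__main__":
    print(build_query("because vif doesn't exist", "logs/screen-n-net.txt"))
    # Hypothetical counts: some hits on successful builds, so not usable yet.
    print(is_good_query({"FAILURE": 9, "SUCCESS": 3}))
```

If the check fails, the usual next step is to pick a more specific message or a different log file and try again.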
- Writing the e-r query and pushing it up
- This is pretty easy: create a new query yaml file under elastic-recheck/queries and push it up for review. Here is an example: https://review.openstack.org/#/c/61826/
- Tip: add a Related-Bug: #xxxxxxx line to the commit message so the change is automatically linked back into the bug report for people monitoring gate failure bugs.
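As a rough sketch of what such a change contains, the Python below renders a query file and a commit message. It rests on assumptions that this page does not state: that the file is named after the bug number (queries/<bug number>.yaml) and that the query lives under a single 'query' key. Check the example review or an existing file in the repository before relying on either. The helper names and the bug number 1234567 are made up for illustration.

```python
def query_file_path(bug_number):
    """Path of the query file, assuming one file per bug number."""
    return "elastic-recheck/queries/{}.yaml".format(bug_number)


def render_query_yaml(query):
    """Render the file body, assuming a single 'query' key.

    A YAML folded block scalar (>) is used so that the double quotes
    inside the logstash query need no extra YAML escaping.
    """
    return "query: >\n  {}\n".format(query)


def commit_message(summary, bug_number):
    """Commit message with the Related-Bug line from the tip above."""
    return "{}\n\nRelated-Bug: #{}\n".format(summary, bug_number)


if __name__ == "__main__":
    bug = 1234567  # placeholder bug number
    query = ('message:"because vif doesn\'t exist" AND '
             'filename:"logs/screen-n-net.txt"')
    print(query_file_path(bug))
    print(render_query_yaml(query))
    print(commit_message("Add e-r query for bug {}".format(bug), bug))
```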
- What to do when a bug is resolved
- When a tracked bug is marked as fixed and has dropped off the http://status.openstack.org/elastic-recheck/ page (for TBD # of days?), push a change to archive the query for that bug.
- "Archiving" the query for a fixed bug is pretty easy: just add the 'resolved_at' field to the query yaml file. Example: https://review.openstack.org/#/c/61186/
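A minimal sketch of the archiving step, assuming the query file is plain YAML with top-level keys. The helper name is hypothetical, and this page does not say what format the 'resolved_at' value should take, so the value is passed through untouched. The date in the usage example is only a placeholder; see the example review for the real convention.

```python
def archive_query(yaml_text, resolved_at):
    """Return yaml_text with a top-level resolved_at field appended.

    The text is returned unchanged if the field is already present, so
    running the helper twice does not add a duplicate key.
    """
    lines = yaml_text.splitlines()
    if any(line.startswith("resolved_at:") for line in lines):
        return yaml_text
    # Make sure the new key starts on its own line.
    if yaml_text and not yaml_text.endswith("\n"):
        yaml_text += "\n"
    return yaml_text + "resolved_at: {}\n".format(resolved_at)


if __name__ == "__main__":
    original = 'query: >\n  message:"x" AND filename:"y"\n'
    # "2013-12-12" is a placeholder value, not a documented format.
    print(archive_query(original, "2013-12-12"))
```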
- TODO: doc when to mark a bug as critical so that it shows up in the weekly release status meetings for the PTLs
- Basically if it's not in elastic-recheck then it's not critical
- See the ML thread on this subject: http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html