TripleO Puppet CI
A description of our TripleO Puppet CI job, what it does, and how to interpret results.

The job is loosely based on our TripleO devtest scripts, with some extra Puppet environment variables that are described in detail here:

http://docs.openstack.org/developer/tripleo-incubator/puppet.html

If you are a developer looking to set up your own TripleO Puppet environment, the above link is probably what you want.

The Environment (where the CI jobs run)
TripleO CI is built around a normal OpenStack cloud that is attached to nodepool and used to spin up a Jenkins slave for each job. The Jenkins slave builds images and launches the various devtest scripts from the tripleo-incubator project. Because the Jenkins slave is itself a virtual machine (and we aren't running nested virt in our OpenStack cloud), something else has to provide the extra fake "baremetal" virtual machines used for testing. We use a separate cluster of test environment (testenv) machines for this, which are essentially baremetal machines with pre-created groups of virtual machines. So while the Jenkins slave builds images and launches the scripts that drive the CI process, the test VMs themselves run on a separate testenv "cloud" hosted on a separate baremetal server. Our CI clouds have been running jobs this way for over a year, and the split OpenStack and testenv configuration supports everything we need to test TripleO in various deployments (including HA).

Eventually the goal is to host our TripleO CI needs entirely on OpenStack itself. This effort is described here: QuintupleO. Simply put, it adds a couple of extra features to OpenStack components (Nova and Neutron) to support booting "baremetal" instances in the OpenStack cloud itself.

How does a Puppet CI job work
The following steps describe how Puppet CI jobs flow in TripleO.


 * Jenkins slave is launched
 * A test environment is acquired from the Geard broker. This provides a set of fake "baremetal" VMs for testing.
 * A "seed" image is built. The seed is a special kind of undercloud (to launch baremetal instances) that runs entirely in a virtual machine.
 * The seed VM is launched. Ironic on the seed is configured to use the rest of the fake "baremetal" VMs in the testenv in order to spawn its instances.
 * Overcloud images are built: one controller image and one compute image. At this time we pre-install OpenStack packages into our images. This aligns well with the normal TripleO process, and it is also what allows our CI jobs to run: testenv machines have no external network connectivity, so they currently cannot download packages at deployment time.
 * Puppet modules are installed into the image at image build time via the puppet-modules element. Although the Puppet CI job uses packages to install most things, the Puppet modules are installed directly from Git by setting DIB_INSTALLTYPE_puppet_modules=source.
 * The Overcloud is created by Heat. This is where most of the interesting Puppet work happens, and it is driven by the tripleo-heat-templates project (which contains most of the TripleO Puppet work). During the Heat stack creation process the following happens:
   * "Baremetal" nodes are deployed via Nova and Ironic.
   * Once a node boots, a metadata agent called os-collect-config gathers the metadata provided by Heat and invokes the Puppet hooks. This may occur multiple times depending on the node/role, and it is controlled by Heat dependencies, which gradually supply metadata to drive specific Puppet deployment tasks.
   * When each Puppet Heat deployment task finishes, it signals back to Heat with a SUCCESS or FAILURE state.
   * Heat continues this process until all resources reach a completion state or an error occurs.
 * Once Heat has finished running, the Overcloud is configured by the normal TripleO cloud configuration tools (os-cloud-config). This process creates neutron networks, keystone tenants, etc.
 * After the Overcloud has been configured, an instance is booted on it as a test to make sure everything is working. The instance is booted using a volume backend (this helps us test that Cinder is working). The instance is then assigned a floating IP, which we ping to verify connectivity.
 * If all of this works, the job is a success.
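
The Heat step in the list above amounts to waiting until the stack reaches a terminal state. The following is a minimal sketch of such a wait loop, not the code the CI job actually runs. The function name `wait_for_stack` and the injected `get_status` callable are hypothetical; the only assumption about Heat is that stack statuses look like CREATE_IN_PROGRESS, CREATE_COMPLETE, and CREATE_FAILED.

```python
import time
from typing import Callable


def wait_for_stack(
    get_status: Callable[[], str],
    timeout: float = 3600.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until the stack completes, fails, or times out.

    get_status is a hypothetical callable returning the current stack
    status string; how it talks to Heat is left to the caller.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        # Terminal states: the stack either completed or failed.
        if status.endswith("_COMPLETE") or status.endswith("_FAILED"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"stack still {status} after {timeout} seconds")
        sleep(interval)


if __name__ == "__main__":
    # Illustrative only: a canned sequence of statuses instead of a real cloud.
    canned = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "CREATE_COMPLETE"])
    print(wait_for_stack(lambda: next(canned), sleep=lambda seconds: None))
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays. A returned status ending in _FAILED is the cue to move on to the troubleshooting steps described in the next section.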

Interpreting results (when things go wrong)
The first place to start when interpreting the output of a failed TripleO CI job is the console.log file for the job. This file contains the high-level output from our test scripts and should give you a general idea of where the job is failing. Things to look for in this file include:


 * Did the seed image build? (this is a disk-image-create command to create the seed image)
 * Did the seed image launch and get configured successfully?
 * Did the Overcloud images build? (right now we build two Overcloud images: one for compute and one for the controller)
 * Did the Overcloud 'heat stack-create' command complete successfully? (this is where most of the Puppet work happens, and a failure here may indicate a Puppet issue)
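
One way to orient yourself in a long console.log is to locate the lines where the commands named above appear. The sketch below does only that. It assumes the log echoes the command lines it runs, which may not hold for every job; the function name `find_stage_lines` and the sample log content are hypothetical.

```python
def find_stage_lines(log_text, markers):
    """Map each marker string to the 1-based line numbers where it appears."""
    found = {marker: [] for marker in markers}
    for number, line in enumerate(log_text.splitlines(), start=1):
        for marker in markers:
            if marker in line:
                found[marker].append(number)
    return found


# Command names taken from the checklist above.
STAGE_MARKERS = ("disk-image-create", "heat stack-create")

# Hypothetical log excerpt; real console.log content will differ.
SAMPLE_CONSOLE_LOG = """\
+ disk-image-create <seed image arguments elided>
+ disk-image-create <overcloud image arguments elided>
+ heat stack-create <overcloud arguments elided>
"""

if __name__ == "__main__":
    for marker, lines in find_stage_lines(SAMPLE_CONSOLE_LOG, STAGE_MARKERS).items():
        print(marker, lines)
```

If a marker has no matching lines, the job most likely failed before reaching that stage.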

If you notice that the Heat stack for the Overcloud failed to get created successfully, the next step is to look at the output of the 'heat event-list overcloud' and 'heat resource-list overcloud' commands in console.log. These should help indicate which node/role had configuration errors at deployment time. For example, if you see a Heat event fail for one of the Compute nodes, you might then look at the specific logs for that node to learn more about the failure (for example, whether a service failed to start).
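
As a rough illustration of reading that output, the sketch below parses an ASCII table of the kind the heat CLI prints and picks out the resources whose status ends in FAILED. The column names, resource names, and sample rows are assumptions made for this example and may not match what your job prints; `parse_table` and `failed_resources` are hypothetical helpers.

```python
def parse_table(text):
    """Parse a '|'-delimited ASCII table into a list of row dictionaries."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip '+----+' borders and any surrounding log noise
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def failed_resources(text, status_column="resource_status"):
    """Return the names of resources whose status ends in FAILED."""
    return [
        row.get("resource_name", "")
        for row in parse_table(text)
        if row.get(status_column, "").endswith("FAILED")
    ]


# Hypothetical excerpt modelled on 'heat resource-list overcloud' output.
SAMPLE_RESOURCE_LIST = """
+---------------+------------------+-----------------+
| resource_name | resource_type    | resource_status |
+---------------+------------------+-----------------+
| controller0   | OS::Nova::Server | CREATE_COMPLETE |
| compute0      | OS::Nova::Server | CREATE_FAILED   |
+---------------+------------------+-----------------+
"""

if __name__ == "__main__":
    print(failed_resources(SAMPLE_RESOURCE_LIST))
```

The names returned point at the node/role whose logs are worth opening next.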

The 'logs' directory contains subdirectories for each role that was created. NOTE: if you don't see logs for a specific role, it likely means the Ironic deployment didn't complete successfully (probably not a Puppet-specific issue), or perhaps there is an issue with capturing log output in our CI job (it happens).

So you've identified what you think is a Puppet error and you've found the relevant logs for that node. The log to check is probably the os-collect-config.log file. Since os-collect-config collects metadata and runs the os-refresh-config Puppet hooks, this log contains the direct output of the 'puppet apply' commands that get executed with each Heat deployment. This output should help indicate exactly where the problem occurred. You may also need to look at other OpenStack service logs on this node to diagnose specific configuration issues.
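
A quick way to narrow down a large os-collect-config.log is to pull out only the lines Puppet reported as errors. The sketch below assumes that 'puppet apply' prefixes such lines with "Error: " and that the captured output may contain ANSI colour codes. The helper name `puppet_errors` and the sample log lines are made up for illustration.

```python
import re

# Matches ANSI colour escape sequences that terminal output may contain.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Assumption: Puppet marks failures with an "Error: " prefix.
PUPPET_ERROR = re.compile(r"\bError: ")


def puppet_errors(log_text):
    """Return (line_number, text) pairs for lines that look like Puppet errors."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        clean = ANSI_ESCAPE.sub("", line)
        if PUPPET_ERROR.search(clean):
            hits.append((number, clean.strip()))
    return hits


# Hypothetical log excerpt; real output will differ.
SAMPLE_LOG = "\n".join([
    "Notice: Compiled catalog for overcloud-compute0 in environment production",
    "\x1b[1;31mError: Could not start Service[example-service]: exit code 1\x1b[0m",
    "Notice: Finished catalog run",
])

if __name__ == "__main__":
    for number, text in puppet_errors(SAMPLE_LOG):
        print(f"line {number}: {text}")
```

The line numbers make it easy to jump to the surrounding context, which usually names the resource and manifest involved.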

How do I make changes to the Puppet CI
We have our own CI project called tripleo-ci. The toci_gate_test.sh script in this project is used to drive our CI configuration directly and corresponds to the TripleO CI jobs we configure in openstack-infra via Zuul.

Useful links

 * JobTemplate
 * JobParameters
 * Zuul