TripleO/TripleOCloud

= Overview =

TripleO developers run a production-ready, TripleO-deployed cloud for development and testing of OpenStack.

We treat this as a production CD environment. The Kanban boards for the folk working on it are at https://trello.com/b/0jIoMrdo/tripleo and https://trello.com/tripleo. We iterate and evolve in response to OpenStack as a whole, while improving the deployment mechanisms and code. Note: Trello is just one kanban tool used to store the operational status; it is an arbitrary choice to let us experiment with the setup, and subject to change if this works.

= What is it used for =

There are four primary uses for the TripleO cloud:
 * Gating tests of Nova baremetal and Ironic: actual bare metal deployment testing. This is not yet online, but will come online once the cloud is big enough. We need to be able to run 40 tests an hour before this can go into the gate.
 * Gating tests of the deployment of OpenStack via Heat + Nova (KVM / Xen etc.) + Neutron. This will include a run of Tempest against the deployed cloud to validate hypervisor-specific functionality. Again, this is not yet online and depends on having sufficient capacity.
 * Check + gating tests of Nova baremetal / OpenStack deployments in emulated baremetal environments. This is being brought online at the moment. It is faster than physical metal, but cannot test with the same fidelity, and cannot test all hypervisors.
 * Development / testing / integration of TripleO itself: we deploy a persistent cloud offering VMs for use by OpenStack-infra, TripleO ATCs and other uses as the TripleO PTL deems fit.

= Regions =

The TripleO cloud is made up of regions contributed to the community. For more information, including how to contribute a region, please see the regions docs.

= tripleo-cd-admins team =

Direct access to the hardware needs to be kept to a balanced team: too few members and they become a bottleneck, too many and it becomes a security / reliability concern. Any TripleO ATC can apply to be a member, and existing members will vote, with the PTL having a veto (but not an override). The tripleo-cd-admins team is the determining factor for all access that is equivalent to 'run arbitrary code outside of KVM/Xen'. Keeping this set decoupled and manually reviewed is part of the security policy on the machines, and we can't trivially change it: the machines were provisioned with the expectation that non-HP staff would administer them, but that the set of administrators would be known quantities.

The current list of members is maintained in the incubator tree. Any TripleO ATC can request access by submitting a review adding themselves, but access won't take effect immediately, as nova keypairs are not synced out automatically. Only existing tripleo-cd-admins should approve changes to the tripleo-cd-admins list.

== Access rules ==
We expect http://ci.openstack.org/sysadmin.html#ssh-access (or some to-be-defined subset) to be followed: this is a production, operational environment.

https://docs.google.com/spreadsheet/ccc?key=0AlLkXwa7a4bpdERqN0p5RjNMQUJJeDdhZ05fRVUxUnc&usp=sharing has the undercloud admin credentials; please only use in emergencies - normal operation should use dedicated user codes. Access to the spreadsheet is given when joining the team.

= Workflow =

Basic principles: visibility of the work needed and its relevance to the product *right now* is crucial information for people to make good choices. (Similarly, Mark McLoughlin was asking about which bugs to work on, which is a symptom of me/us failing to provide clear visibility.) (Datacentre ops? Continuous deployment story?)
 * unblock bottlenecks first, then unblock everyone else.
 * folk are still self directed - it's open source - but clear
 * clear communication about TripleO and plans / strategy and priority

== Reviews are king ==
Since we want to minimise the number of folk with root access to the cluster, we need to ensure that most/all changes can be done entirely via code review or non-root-equivalent APIs. Latency in review will drive WIP higher.

== Roadmap ==
The TripleO roadmap should live in the roadmap Kanban board: it's where new initiatives will be drawn down from. At any point in time there should be just one conceptual goal on the roadmap, at least until we've eliminated stalls and driven inventory down.

== TripleO WIP ==
The current TripleO team (== folk working on TripleO, regardless of organisation) tracks what is being worked on concurrently in the TripleO Kanban board. The following guidelines help the team understand which things are progressing and which are stalled at any point in time:

 * Any one individual should only ever be working on one card: if you end up yak shaving, clearly show what is actually being worked on.
 * Multiple people can be working on the same card: dogpiling and collaboration are fine, and a good thing.
 * Lane priority is right-to-left: moving the cards that are closest to the right side (done) is more useful than moving cards that are further away. The goal is to reduce work in progress and deliver rapidly.
 * Cards can be any size/scope, but the ideal, similar to code review, is that each should be a single conceptual goal: there may be multiple steps in a single card.
 * Anyone can split/combine cards as needed to make the board communicate our WIP more usefully.

== Bugs ==
TripleO still uses bugs: issues in the code, and things that users would reasonably like to achieve that involve code changes, are best represented as bugs. Kanban cards can and should link to bugs. We don't require a bug for every code change, but if there is a code change that you aren't going to do right now and which should be captured, a bug is the right place.

Use bugs for things which need code changes.

== Blueprints ==
TripleO still uses blueprints: they are the approval mechanism for new items on the roadmap. Design work for them should still be done in etherpads.

Use blueprints to get new items into the roadmap.

== Etherpads ==
TripleO still uses etherpads: they are great for collaborative editing of plans/designs, and for tracking work being done on a card in a lightweight, accessible fashion.

Use etherpads for the status of, and design for, cards.

= Seed VM host =

The seed VM host is a manually installed machine, reachable over SSH. Until we've done an audit for prior data, access to the machine is restricted to HP employees who are TripleO ATCs. (It's also got some weird networking shit, avoid it!)

= Seed =

The seed is a VM provisioned via boot-seed-vm on the Seed VM host, reachable over SSH from other machines in the cloud. Access is available to any tripleo-cd-admins member. The VM host is 138.35.77.2 and the seed is 10.10.16.168.

= Undercloud =

The API endpoint for the undercloud that deploys the cloud is https://cd-undercloud.tripleo.org:13000/v2.0. SSH credentials are available from Robert Collins (and are limited to tripleo-cd-admins).

== API Access ==
API access to the Undercloud is available to tripleo-cd-admins, as API access permits deploying arbitrary code to the physical machines. Credentials are set up when joining tripleo-cd-admins, and removed when leaving the team.

The undercloud can deploy machines to the range 10.10.16.171 to 10.10.16.188 today (limited due to hardware availability).
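
The deployable range above can be checked with a few lines of Python. This is only an illustrative sketch: the range bounds come from the text, while the helper names (deployable_addresses, is_deployable) are made up for this example and are not part of any TripleO tooling.

```python
import ipaddress

# Bounds taken from the text above; they will change as hardware is added.
FIRST = ipaddress.IPv4Address("10.10.16.171")
LAST = ipaddress.IPv4Address("10.10.16.188")


def deployable_addresses(first=FIRST, last=LAST):
    """Return every address in the inclusive range first..last."""
    return [ipaddress.IPv4Address(n) for n in range(int(first), int(last) + 1)]


def is_deployable(address, first=FIRST, last=LAST):
    """Return True if address falls inside the inclusive range."""
    return first <= ipaddress.IPv4Address(address) <= last


if __name__ == "__main__":
    print(len(deployable_addresses()), "addresses available")
    print("seed in range?", is_deployable("10.10.16.168"))
```

Treating the range as inclusive at both ends is an assumption; the text does not say whether .188 itself is usable.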

== Updates to software ==
We can't deploy the undercloud automatically yet, so it's basically static.

To update e.g. nova, run the following (git review -d takes the number of the change to download):

 cd /opt/stack/nova
 git review -d
 /opt/stack/venvs/nova/bin/pip install .
 os-collect-config --one --force

== Hardware configuration ==
The machines have 24 cores, 96 GB of RAM and 2x2TB hard disks with a hardware RAID controller. There is a 10G Mellanox dual-port card as well, which shows up as eth2: this is the only wired-up network port. As far as we can tell, the BIOS defaults to not enabling/net-booting the 10G card, so it needs to be both enabled (via PCI devices) and have net boot turned on for it in the system BIOS setup.

Note that you'll need to watch the machine try to boot to get the MAC: assume unenrolled machines have the wrong MAC in machine-information-tab.txt. Press 'F12' right after the Smart Array output scrolls by to force a network boot, so that you don't end up with a bad enrollment in nova. The Mellanox nic0 address (e.g. 78:e7:d1:03:00:23:94:cd) contains the first three and last three octets of the nic1 address, which is the one we want (78:e7:d1:23:94:cd). Alternatively, configure PXE booting with 'set system1/bootconfig1/bootsource5 bootorder=1' and then watch for DHCP in tcpdump.
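
The nic0-to-nic1 relationship described above can be written down as a small helper. This is a sketch based only on the single example in the text; it assumes the nic0 address always has eight colon-separated octets, and the function name nic1_mac_from_nic0 is hypothetical.

```python
def nic1_mac_from_nic0(nic0_address):
    """Derive the nic1 MAC by keeping the first and last three octets of nic0."""
    octets = nic0_address.strip().lower().split(":")
    if len(octets) != 8:
        # Assumption: nic0 addresses look like the 8-octet example in the text.
        raise ValueError("expected 8 octets, got %d" % len(octets))
    return ":".join(octets[:3] + octets[-3:])


if __name__ == "__main__":
    print(nic1_mac_from_nic0("78:e7:d1:03:00:23:94:cd"))
```

Always confirm the result against what the machine actually sends when it net boots before enrolling it in nova.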

== Current ad hoc variance in the undercloud ==

= Overcloud =

The overcloud is deployed from OpenStack trunk in a loop, currently taking 40 minutes per commit. Open access is available to any TripleO ATC: they may use the cloud as desired, since having users on the cloud helps validate that the cloud is usable! File all bugs on the tripleo bug tracker. Access may also be granted to people who are not TripleO ATCs, at the TripleO PTL's discretion.

== Location ==
The overcloud API and SSH endpoint is cd-overcloud.tripleo.org.

== API Access ==
Submit a review proposal to https://git.openstack.org/cgit/openstack/tripleo-incubator to add yourself to the list of tripleo-cloud users. Once merged, contact anyone in tripleo-cd-admins and they will retrieve your initial API credentials to the overcloud. The rc file for using the overcloud will look something like:

 export NOVA_VERSION=1.1
 export OS_PASSWORD=
 export OS_AUTH_URL=http://cd-overcloud.tripleo.org:5000/v2.0
 export OS_USERNAME=
 export OS_TENANT_NAME=
 export COMPUTE_API_VERSION=1.1
 export OS_NO_CACHE=True
 export OS_CLOUDNAME=cd-overcloud-
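
Since the rc file is handed over with several values left blank, a quick check for unfilled variables can save a confusing authentication failure. The following is a minimal sketch, not an official tool: the function name unset_exports is hypothetical, and it only understands plain "export NAME=value" lines like those shown above.

```python
def unset_exports(rc_text):
    """Return the names of exported variables whose value is empty."""
    missing = []
    for line in rc_text.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # ignore comments, blank lines and anything else
        name, _, value = line[len("export "):].partition("=")
        if not value.strip():
            missing.append(name.strip())
    return missing


if __name__ == "__main__":
    sample = "\n".join([
        "export NOVA_VERSION=1.1",
        "export OS_PASSWORD=",
        "export OS_AUTH_URL=http://cd-overcloud.tripleo.org:5000/v2.0",
        "export OS_USERNAME=",
        "export OS_TENANT_NAME=",
    ])
    print(unset_exports(sample))
```

It does not handle quoting or shell expansion; for anything beyond a sanity check, source the file in a shell and inspect the environment instead.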

== SSH Access ==
This is granted via membership in tripleo-cd-admins, as bare metal access permits running anything at all in that environment. The technical implementation is that SSH keys are provisioned via the 'admin/default' key-pair in the undercloud (which is manually synced with tripleo-cd-admins).