Large Scale SIG/ScaleUp

The third stage in the Scaling Journey is Scale Up.

As you monitor your cluster at scale, you will eventually see it hit the scaling limits of a single cluster. All hope is not lost, though! There are things you can put in place to push back how much a single cluster can handle, before having to resort to a more complex deployment configuration. This page aims to answer common questions about that stage.

Once you are past that stage, you are ready to proceed to the next stage of the Scaling Journey: Scale Out.

FAQ
'''Q: Cleaning up deleted entries in my database is a bit of a hassle. Is there a tool I could use to help me with that?'''

A: The OSarchiver tool, developed by OVH, can help you there: see https://github.com/ovh/osarchiver/. We are working on getting it maintained upstream as part of the OSops tooling.

Q: How many compute nodes can a typical OpenStack cluster contain?

A: It is hard to give a single answer to this question, as it depends on the power of each compute node and the usage pattern in your cloud (amount of churn, oversubscription, I/O vs. CPU...). The answer ranges from 100 nodes (for high-density nodes with lots of churn) to 2,000 nodes (simpler nodes with longer-lived workloads).

Q: What are the typical issues with a cluster with too many compute nodes?

A: Load on the control plane (database and RabbitMQ) increases a lot and becomes a bottleneck. Higher-density nodes tend to experience Neutron scaling issues. "Burst" load becomes harder to manage (e.g. restarting all Neutron agents or Nova computes puts high pressure on the control plane). Last but not least, the failure domain becomes bigger.

Q: What can be done to push back the limit once it's reached?

A: Suggestions include creating a separate RabbitMQ cluster for Neutron-related queues (if you are using ml2/ovs, ml2/linux-bridge or ml2/sriov-nic-agent), or using the Python binding for OVS.
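As an illustration of the first suggestion, a dedicated RabbitMQ cluster is normally selected through the oslo.messaging `transport_url` option in the Neutron server and agent configuration. The sketch below only builds such a URL for several brokers. The helper name, host names, credentials and virtual host are all hypothetical, and the resulting value should be checked against the oslo.messaging documentation for your release.

```python
from urllib.parse import quote


def build_transport_url(user, password, hosts, vhost="", port=5672):
    """Build a rabbit:// transport URL listing several brokers.

    Hypothetical helper: every broker entry repeats the credentials,
    and the virtual host is appended as the URL path.
    """
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    netlocs = ",".join(f"{creds}@{host}:{port}" for host in hosts)
    return f"rabbit://{netlocs}/{quote(vhost, safe='')}"


# Hypothetical brokers reserved for Neutron RPC traffic.
neutron_url = build_transport_url(
    "neutron", "s3cret", ["rabbit-neutron-1", "rabbit-neutron-2"], vhost="neutron"
)
print(neutron_url)
```

The printed value would go into the `[DEFAULT]` section of the Neutron configuration on both the servers and the agents, so that they all use the same dedicated cluster while the other services keep using the original one.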

Q: How do you decide to add a new node for control plane?

A: If you find that the RabbitMQ queues for a certain service keep piling up, it usually means that it is time to add more control plane workers for that service to consume the queue.
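One way to make "keeps piling up" concrete is to sample queue depths at regular intervals and flag queues whose backlog never shrinks. The following is a minimal sketch under stated assumptions: the function name, the threshold and the sample data are hypothetical, and the depths are assumed to have been collected beforehand (for example from the output of `rabbitmqctl list_queues name messages`).

```python
from typing import Dict, List


def growing_queues(samples: Dict[str, List[int]], min_growth: int = 100) -> List[str]:
    """Return the names of queues whose depth never decreased between
    consecutive samples and grew by at least min_growth overall."""
    flagged = []
    for name, depths in samples.items():
        if len(depths) < 2:
            continue  # not enough data points to see a trend
        non_decreasing = all(b >= a for a, b in zip(depths, depths[1:]))
        if non_decreasing and depths[-1] - depths[0] >= min_growth:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical message counts per queue, oldest sample first.
samples = {
    "q-plugin": [10, 250, 900, 2400],  # backlog keeps growing
    "conductor": [5, 0, 3, 1],         # consumers keep up
    "scheduler": [0, 0, 0, 0],         # idle
}
print(growing_queues(samples))  # ['q-plugin']
```

A queue flagged over several consecutive checks is a candidate for adding workers to the consuming service. A single spike, such as one caused by a mass agent restart, usually drains on its own.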

Resources

 * A curated collection of scaling stories, as we collect them
 * Evaluation of internal messaging
   * https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21007/openstack-internal-messaging-at-the-edge-in-depth-evaluation
   * https://ieeexplore.ieee.org/document/8590992
   * https://www.openstack.org/summit/berlin-2018/summit-schedule/events/22115/rabbitmq-or-qpid-dispatch-router-pushing-openstack-to-the-edge
   * Old but still relevant/interesting: https://www.youtube.com/watch?v=bpmgxrPOrZw
 * Evaluation of databases
   * https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21212/keystone-in-the-context-of-fogedge-massively-distributed-clouds
   * https://beyondtheclouds.github.io/blog/openstack/cockroachdb/2018/06/04/evaluation-of-openstack-multi-region-keystone-deployments.html
 * Scaling Neutron: https://www.youtube.com/watch?v=5WL47L1P5kE (https://www.slideshare.net/moreirabelmiro/evolution-of-openstack-networking-at-cern)
 * Scaling Nova/Ironic: https://techblog.web.cern.ch/techblog/post/nova-ironic-at-scale/
 * Scheduling Performance: https://techblog.web.cern.ch/techblog/post/scheduling-optimizations/
 * Global scaling: https://www.openstack.org/summit/barcelona-2016/summit-schedule/events/15977/chasing-1000-nodes-scale

Other SIG work on that stage

 * Collecting scaling stories
   * Submit scaling stories on https://etherpad.openstack.org/p/scaling-stories
   * Curate them on Large_Scale_Scaling_Stories