Jump to: navigation, search

Neutron Trunk API Performance and Scaling

Revision as of 10:07, 5 December 2016 by Bence.romsics (talk | contribs) (surprise effect on filtered port listings)

Back to Neutron/TrunkPort

WORK IN PROGRESS

summary

TODO

testbed

underlying hardware

CPU Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz
RAM DDR3 Synchronous 1600 MHz

VM running devstack

vCPU count 2
RAM 6 GiB

software versions

2016 late november, cc. 6 weeks after the newton release

ubuntu 16.04.1
openvswitch 2.5.0-0ubuntu1
devstack a3bb131
neutron b8347e1
nova 2249898
python-neutronclient 36ad0b7
python-novaclient 6.0.0
python-openstackclient 3.3.0

local.conf

only the relevant parts

[[post-config|/etc/neutron/neutron.conf]]
[DEFAULT]
service_plugins = trunk,...
rpc_response_timeout = 600
[quotas]
default_quota = -1
quota_network = -1
quota_subnet = -1
quota_port = -1

[[post-config|/etc/neutron/plugins/ml2/ml2_conf.ini]]
[ml2]
mechanism_drivers = openvswitch,linuxbridge
tenant_network_types = vxlan,vlan
[ovs]
ovsdb_interface = native
of_interface = native
datapath_type = system
[securitygroup]
firewall_driver = openvswitch

[[post-config|/etc/neutron/dhcp_agent.ini]]
[DEFAULT]
enable_isolated_metadata = false

[[post-config|/etc/neutron/l3_agent.ini]]
[DEFAULT]
enable_metadata_proxy = false

[[post-config|/etc/nova/nova.conf]]
[DEFAULT]
# NOTE this could have probably stayed the default and not make a difference
vif_plugging_timeout = 300

workflow variations

pseudocode of test scenarios

metadata and dhcp services are turned off to avoid being memory bound

add subports before boot, all at once

for 1 parent port:
    network create
    subnet create
    port create

for N subports:
    network create
    subnet create
    port create

trunk create with N subports
server create with parent port

N = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512

add subports after boot, one by one

for 1 parent port:
    network create
    subnet create
    port create

trunk create with 0 subports
server create with parent port

for N subports:
    network create
    subnet create
    port create
    trunk set (add one subport)

port list
port list --device-id server-uuid
server show

N = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512

add subports after boot, in batches

for 1 parent port:
    network create
    subnet create
    port create

trunk create with 0 subports
server create with parent port

for N batches of subports:
    for M subports in a batch:
        network create
        subnet create
        port create
    trunk set (add M subports)

port list
port list --device-id server-uuid
server show

N*M = 8*64(=512), 8*128(=1024)

results

add subports before boot, all at once

trunk create with subports

01-before-boot-trunk-create.svg The time to create a trunk with N subports is clearly a linear function of N. No errors in the measured range.

server boot with subports

02-before-boot-server-create.svg The time to boot a server with a trunk with N subports is also a linear function of N. Errors appear above cc. 150 subports if using default configuration.

Error symptoms:

  • trunk remains in BUILD state
  • subports remain in DOWN state

Example q-agt.log:

2016-11-25 14:42:56.571 ERROR neutron.agent.linux.openvswitch_firewall.firewall [req-75355670-e83b-42ac-8e99-08e50a352609 None None] Initializing port 30c12b8f-cfae-4090-bfb9-8257b4278300 that was already initialized.
2016-11-25 14:43:09.765 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Got messaging error while processing trunk bridge tbr-fd003eb7-3: Timed out waiting for a reply to message ID 781bd8f7778941ddaf3e3b7feb166011

Example neutron-api.log:

2016-11-25 14:41:43.900 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 781bd8f7778941ddaf3e3b7feb166011 reply to reply_d020591a17164210af4973e8432edbef from (pid=7717) __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:194
2016-11-25 14:43:15.977 DEBUG oslo_messaging._drivers.amqpdriver [req-75355670-e83b-42ac-8e99-08e50a352609 None None] sending reply msg_id: 781bd8f7778941ddaf3e3b7feb166011 reply queue: reply_d020591a17164210af4973e8432edbef time elapsed: 92.0763713859s from (pid=7717) _send_reply /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:73

Where the elapsed time (eg. 92s above) is longer than the default 90s of:

/etc/neutron/neutron.conf:
[DEFAULT]
rpc_response_timeout = 90

Workarounds:

  • neutron-ovs-agent after a restart seems to bring up all subports and the trunk correctly without an rpc timeout.
  • rpc_response_timeout can be raised. The rest of the tests were performed with rpc_response_timeout = 600

The rpc method timing out:

2016-11-25 15:55:42.370 DEBUG neutron.services.trunk.rpc.server [req-5a3498ad-efcb-400d-a2e7-cfe19db946a9 None None] neutron.services.trunk.rpc.server.TrunkSkeleton method update_subport_bindings called with arguments ...

Of course rpc_response_timeout does not directly apply to vm boot time, so the chart is slightly misleading.

Relevant code: https://github.com/openstack/neutron/blob/b8347e16d1db62bf79b125d89e1c7863246fcd0d/neutron/services/trunk/rpc/server.py

def update_subport_bindings
    _process_trunk_subport_bindings()
def _process_trunk_subport_bindings
    for port_id in port_ids:
        _handle_port_binding()
def _handle_port_binding
    core_plugin.update_port()

That is the binding update to the parent port "cascades" to all subports. The cascading update that has to complete in a single update_subport_bindings rpc call.

add subports after boot, one by one

add subports to trunk, one by one

03-after-boot-trunk-set.svg The chart is clearly bimodal. The starting section is linear with low deviation. The second section is also linear but with a higher slope and much higher deviation. The second section starts around 150-200 subports, however the reapperance of 150 is just a coincidence.

TODO What changes between the two sections?

The bimodality may or may not have to do anything with the trunk feature, because the same can be observed on a sequence of port creations: 04-port-create.svg

cumulative time of adding subports to trunk, one by one

For the end-user the total time spent in adding subports is more significant. Based on the previous chart (measuring the time of adding one subport to a trunk) we can expect this to be roughly quadratic. There were no errors observed in the measured range.

05-after-boot-one-by-one-total-trunk-set.svg

surprise effect on filtered port listings

An unexpected slowdown can be observed on filtered port listings as in:

  • neutron port-list --device-id ID, or
  • openstack port list --server ID

It only occurs, if:

  • the ports are used as subports
  • if the port listing is filtered by the device_id of the parent port (listing all ports is fast)
  • if the VM ID used in the filtering refers to an existing VM using the parent port (filtering for non-existing IDs is fast, filtering for the ID of an unrelated VM is fast)

However the slowdown effect is present even when there are no concurrent operations on the controller. It's enough if the subports exist.

nova makes similar calls to neutron in various places (eg. openstack server show), with the consequence of some nova operations being slow.

TODO What happens if we boot a VM with no trunk and subports and we filter for its ID?
TODO What's the difference on the wire between neutron and openstack CLI?
TODO Investigate how sqlalchemy is involved.

2016-11-30 12:16:41.010 DEBUG sqlalchemy.orm.path_registry [req-b4c835fd-853f-4be4-948b-369c2734aa49 neutron 7948a752c6004c0d9131184dc7162cc4] set 'memoized_setups' on path 'EntityRegistry((<Mapper at 0x7f3aa0469fd0; Port>, <RelationshipProperty at 0x7f3a9f754230; dns>, <Mapper at 0x7f3a9fe566d0; PortDNS>))' to '{}' from (pid=15131) set /usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/path_registry.py:63
2016-11-30 12:16:41.011 DEBUG sqlalchemy.orm.path_registry [req-b4c835fd-853f-4be4-948b-369c2734aa49 neutron 7948a752c6004c0d9131184dc7162cc4] set 'eager_row_processor' on path 'Mapper|Port|ports -> Port.dhcp_opts' to '<sqlalchemy.orm.util.ORMAdapter object at 0x7f3a9df678d0>' from (pid=15131) set /usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/path_registry.py:63

06-port-list.svg

add subports after boot, in batches

TODO N*M = 8*128(=1024)

1024 subports

N*M = 8*128(=1024)

2016-12-02 14:27:25.591 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp [req-d8340ea4-43a1-486e-9b46-affc8cc49355 None None] Agent main thread died of an exception
...
2016-12-02 14:27:25.591 TRACE neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     self.pid = os.fork()
2016-12-02 14:27:25.591 TRACE neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp OSError: [Errno 12] Cannot allocate memory