
L2population blueprint


Overview

The overlay implementations of the open source plugins could be improved to increase their scalability:

  • The current OVS GRE implementation replicates broadcasts to every agent, even those that don't host the corresponding network.
  • The VXLAN implementation can map broadcasts from all networks to a single multicast group (as proposed in vxlan-linuxbridge).

As a result, overlay broadcasts and unknown unicasts quickly end up being sent to all agents, resulting in traffic and processing overhead. The alternative is to use one multicast group per network (as proposed in openvswitch-kernel-vxlan), but since datacenter networks usually support only a limited number of multicast groups, this also limits the number of available virtual networks.

Proposed evolution

As broadcast emulation on an overlay is costly, it may be better to avoid relying on it for MAC learning and ARP resolution. This implies using proxy ARP on the agent to answer VM requests, and pre-populating the bridge forwarding tables.

While this kind of pre-population will probably dramatically limit L2 broadcasts in the overlay, it may still be necessary to provide broadcast emulation. This could be achieved by sending broadcast packets over unicast, only to the relevant agents (as proposed in ovs-tunnel-partial-mesh for the OVS plugin).

Driver commands

Linux bridge

As of Linux 3.8, the VXLAN kernel module supports proxy ARP and FDB population. For that purpose the interface must be set up with the following options:

# ip link add vx-NET_ID type vxlan id SEGMENTATION_ID proxy
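
For context, here is a hedged sketch of the surrounding setup an agent would typically perform around this command; LOCAL_TUNNEL_IP and the brqNET_ID bridge name are illustrative assumptions, not part of the proposal:

# ip link add vx-NET_ID type vxlan id SEGMENTATION_ID local LOCAL_TUNNEL_IP proxy  # LOCAL_TUNNEL_IP: assumed local tunnel endpoint
# ip link set vx-NET_ID up
# brctl addif brqNET_ID vx-NET_ID  # brqNET_ID: assumed name of the network bridge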

Thus, ARP requests will be answered according to neighbor table entries:

# ip neighbor add REMOTE_VM_IP lladdr REMOTE_VM_MAC dev vx-NET_ID nud permanent

And traffic will be sent straight to the relevant host thanks to FDB population:

# bridge fdb add REMOTE_VM_MAC dev vx-NET_ID dst REMOTE_HOST_IP

(Note: the ‘bridge’ command ships with recent versions of iproute2.)

Linux 3.9 is required to support edge replication using the following FDB commands:

# bridge fdb add ff:ff:ff:ff:ff:ff dev vx-NET_ID dst REMOTE_HOST_IP#1
# bridge fdb append ff:ff:ff:ff:ff:ff dev vx-NET_ID dst REMOTE_HOST_IP#2
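
The resulting proxy and forwarding entries can then be inspected, and pruned when a remote endpoint disappears, with the standard iproute2 commands (a hedged sketch reusing the placeholders above):

# ip neighbor show dev vx-NET_ID
# bridge fdb show dev vx-NET_ID
# bridge fdb del ff:ff:ff:ff:ff:ff dev vx-NET_ID dst REMOTE_HOST_IP#2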

As the distributions contemporary with Havana's timeline (Ubuntu 13.04, Fedora 18...) ship a 3.8 kernel, this last feature probably won't be available there. However, as explained earlier, broadcasts should be very limited in the overlay, so mapping them to a global multicast group as proposed in the current WIPs will probably provide a reasonable transition.

Open vSwitch

As OVS cannot answer ARP requests itself, we could use the following ‘ebtables’ rule on the HybridDriver Linux bridge:

# ebtables -t nat -A PREROUTING -i tapPORT_ID -p arp --arp-opcode Request --arp-ip-dst REMOTE_VM_IP -j arpreply --arpreply-mac REMOTE_VM_MAC --arpreply-target ACCEPT
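
When the remote VM is removed, the corresponding rule can be deleted with the matching ‘-D’ command (a hedged sketch mirroring the rule above):

# ebtables -t nat -D PREROUTING -i tapPORT_ID -p arp --arp-opcode Request --arp-ip-dst REMOTE_VM_IP -j arpreply --arpreply-mac REMOTE_VM_MAC --arpreply-target ACCEPT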

Populating FDB rules is quite straightforward in OVS:

# ovs-ofctl add-flow br-tun hard_timeout=0,idle_timeout=0,priority=3,dl_dst=REMOTE_VM_MAC,in_port=OVS_VM_PORT_ID,dl_vlan=LOCAL_VLAN_ID,actions=set_tunnel:SEGMENTATION_ID,TUNNEL_PORT_NAME

As well as broadcast handling:

# ovs-ofctl add-flow br-tun hard_timeout=0,idle_timeout=0,priority=3,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00,in_port=OVS_VM_PORT_ID,dl_vlan=LOCAL_VLAN_ID,actions=set_tunnel:SEGMENTATION_ID,TUNNEL_PORT_NAME#1,...,TUNNEL_PORT_NAME#n

OpenFlow rules can also be updated in place on OVS with the ‘mod-flows’ command, instead of deleting (‘del-flows’) and re-adding (‘add-flow’) them.
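
For instance, when a tunnel towards a new host is added for the network, the broadcast flow above could be updated in place, and a unicast entry dropped once the corresponding remote VM disappears (a hedged sketch reusing the placeholders above; TUNNEL_PORT_NAME#n+1 stands for the newly added tunnel port):

# ovs-ofctl mod-flows br-tun dl_dst=01:00:00:00:00:00/01:00:00:00:00:00,in_port=OVS_VM_PORT_ID,dl_vlan=LOCAL_VLAN_ID,actions=set_tunnel:SEGMENTATION_ID,TUNNEL_PORT_NAME#1,...,TUNNEL_PORT_NAME#n,TUNNEL_PORT_NAME#n+1
# ovs-ofctl del-flows br-tun dl_dst=REMOTE_VM_MAC,dl_vlan=LOCAL_VLAN_ID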

Implementation

The previous principles could be implemented within the upcoming ML2 plugin. A MechanismDriver could extend the default RPC callbacks to enable dissemination of reachability information among agents.

Proposed workflow #1

This workflow is quite straightforward: get_device_info would have to be enriched with the IPAllocations corresponding to the port. The agent is thus able to build the FDB entries corresponding to the port being added and to notify the plugin about these entries, which the plugin will in turn propagate to the other agents.

When a port is plugged on an agent that wasn't already handling the corresponding network, the agent must also propagate the FDB entry for broadcasts. In addition, it has to request from the plugin all the FDB entries corresponding to this network.

The workflows described above are quite chatty and could be reduced to a single RPC method. In this case, the agent would request a full update of the forwarding table using a “request_update” flag, and the plugin would call back the same “update_network” method on the agent to provide it with the full list of forwarding entries for the corresponding network.

Using the same method on agents and plugins could be useful for future use cases such as live migration or L2 service insertion: an agent could directly request another one to forward the traffic corresponding to a specific flow it wants to intercept.

Proposed workflow #2

In workflow #1, we'd be building an independent forwarding table at the plugin level, but the latter already has nearly all the required information:

  • the Port class already provides the port MAC address alongside its IP allocations
  • the ml2_network_segments table provides the network_type and segmentation_id
  • with portbinding_db, we'll be able to figure out on which agents the different ports of a network are located
  • as a result, the plugin just needs to know the tunneling IP used by each agent to build and disseminate the forwarding entries; this could easily be added to the “configuration” dict of the agent management extension DB

The only limitation is that, currently, only nova-provisioned ports will be saved in portbinding_db (once https://review.openstack.org/#/c/29767/ is merged). For now, this database therefore won't be populated for the ports used by L3 or DHCP agents. As we also need the forwarding entries corresponding to these ports, the proposal is to use the update_device_up/down RPC calls to update the port binding status, so that all ports managed by agents end up in portbinding_db.

Discussion of the two approaches

In the first approach we'd be building an independent forwarding table, loosely coupled with Quantum's internal state (the database and RPC callbacks could be hosted by a separate daemon, acting as a route reflector). This loose coupling may lead to consistency issues and complicate future use cases: as it only acts as a forwarding table repository, the plugin will hardly be able to set up more complex forwarding scenarios, such as steering traffic into an appliance for L2 service insertion, or mirroring port traffic.

While the second approach may put more pressure on the plugin, which dictates forwarding in the agents, it may make it easier to extend the API and propagate the corresponding forwarding information in the future. On the other hand, injecting external forwarding entries that wouldn't be managed by Quantum will be more complex (for example, plugging in a BGP speaker that exchanges forwarding information with the outside world to extend networks across the WAN).

Handling VXLAN with multicast

As explained earlier, multicast-based VXLAN is not really a target feature as it won't scale, but it would still be necessary to support it for two purposes:

  • Some networks may have to be interconnected with 3rd-party appliances which use multicast-based VXLAN. For that purpose, having the ability to specify a multicast group as an extended provider attribute could be a good solution.
  • To support broadcast emulation with the pre-3.9 Linux kernel VXLAN implementation, we'll have to rely on multicast (see the sketch below). For that purpose, providing the multicast group as an agent configuration parameter as proposed in the vxlan-linuxbridge blueprint could provide a good migration path, as this option could be removed once all the agents support edge replication for broadcast emulation.
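
On the Linux bridge side, this multicast fallback roughly amounts to creating the VXLAN interface bound to the configured group rather than relying only on static broadcast FDB entries (a hedged sketch; MCAST_GROUP and PHYS_DEV are assumptions standing for the configured multicast group and the underlay interface):

# ip link add vx-NET_ID type vxlan id SEGMENTATION_ID group MCAST_GROUP dev PHYS_DEV proxy  # MCAST_GROUP, PHYS_DEV: assumed configuration values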

As the OVS VXLAN implementation doesn't support multicast for now, one solution could be to use iptables rules to map a virtual tunnel IP to a multicast address:

Set up a tunnel with a martian endpoint IP address:

# ovs-vsctl add-port br0 vx0 -- set interface vx0 type=vxlan options:remote_ip=192.0.2.1

And DNAT it to a multicast address:

# iptables -t nat -A OUTPUT -d 192.0.2.1 -p udp --dport 4789 -j DNAT --to-destination 224.0.0.1
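
A quick way to check that the NAT rule is hit and that the encapsulated traffic actually leaves towards the multicast group (a hedged sketch; ethX is an assumption standing for the underlay interface):

# iptables -t nat -vnL OUTPUT
# tcpdump -ni ethX udp port 4789  # ethX: assumed underlay interface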

Additional notes

As L2 reachability information is propagated using fanout casts, we'll in effect be moving network broadcasts to RPC broadcasts. Of course, populated FDB and proxy ARP entries won't expire, so the latter are much less frequent.

Information dissemination could nevertheless be improved by only notifying the agents holding ports on a specific network. The following options could be envisaged:

  • using the portbinding DB to only cast to the agents holding ports in the network
  • using one topic per network, so that fanout casts would only be sent to the relevant agents. This second alternative could however put some pressure on the current RPC implementation; some enhancements could be pushed to improve that

Note: fanout casts to agents don't only concern reachability information dissemination, but also security groups, L3 routes, and so on.

Note 2: you may notice a parallel with BGP E-VPN; just assume:

  • edge replication =~ MP-BGP e-VPN ethernet auto-discovery route
  • ARP proxy =~ MP-BGP e-VPN mac advertisement route
  • per-network topic =~ RT-constraint Membership NLRI Advertisements

This is quite intentional, as the long-term plan could be to plug a BGP speaker into the forwarding DB, so that Quantum networks could be seamlessly extended across the WAN, using either MPLS encapsulation or VXLAN as proposed here.