Improving performance of GET operations in the Nicira Plugin
High level description
Currently the Nicira plugin synchronizes the operational status of most resources at each GET request, in order to always return up-to-date operational status. This, however, becomes a performance bottleneck, especially as REST/RPC traffic increases, because the NVP backend is hit on every GET request. Some clients, such as the DHCP agent, tend to execute a GET operation for every object they need to synchronize, so the number of accesses to the NVP backend grows linearly with the number of objects.
It is therefore advisable to move operational status synchronization to a distinct task that runs at a periodic interval. This task will synchronize the status of all relevant resources (networks, ports, and routers), which can be achieved with a very limited number of queries to NVP. The number of queries depends on several factors: the size of the data to retrieve from NVP, and the number of chunks into which the synchronization task divides this data. Our target is to ensure that each chunk gathers a reasonable number of resources, so that the number of requests sent to NVP is not too high, while ensuring that no single request asks for too many resources, as this would result in high response times as well as excessive load on the NVP backend.
No API change will be performed as part of this blueprint.
Data Model Changes
No data model change.
The following operations mapped to GET API operations will stop fetching entities from NVP to gather their operational status
The operational status will always be returned from the value stored in the database. This value will be updated periodically (the oslo-incubator LoopingCall will be used for this purpose).
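The split between the GET path and the periodic update can be sketched as follows. This is a minimal illustration, not the plugin's code: StatusCache, run_periodically, and the "UNKNOWN" placeholder are hypothetical names, an in-memory dict stands in for the database, and a plain standard-library loop stands in for LoopingCall.

```python
import threading


class StatusCache:
    """Hypothetical stand-in for the status values stored in the database."""

    def __init__(self, fetch_backend_status):
        # Callable that queries the backend and returns {resource_id: status}.
        self._fetch = fetch_backend_status
        self._status = {}
        self._lock = threading.Lock()

    def synchronize(self):
        """Periodic path: refresh the stored status from the backend."""
        fresh = self._fetch()
        with self._lock:
            self._status.update(fresh)

    def get_status(self, resource_id):
        """GET path: return the stored value without touching the backend."""
        with self._lock:
            # "UNKNOWN" is a placeholder for resources not yet synchronized.
            return self._status.get(resource_id, "UNKNOWN")


def run_periodically(task, interval, stop_event):
    """Run task, then wait `interval` seconds, until stop_event is set."""
    while not stop_event.is_set():
        task()
        stop_event.wait(interval)
```

With this arrangement, any number of calls to get_status costs a single backend access per synchronization run, which is the property the blueprint is after.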
Deployers should be able to tune the frequency of this periodic task using the following set of configuration variables:
- state_sync_interval - time, in seconds, between state synchronization runs;
- min_req_delay - minimum interval, in seconds, between fetching two chunks of data.
The aim of this parameter is to prevent NVP from being bombarded with status synchronization requests, some of which put a non-trivial load on the backend.
- min_chunk_size - minimum chunk of data to retrieve.
Note that retrieving a chunk of data might involve up to 3 requests to NVP (one for lswitches, one for lrouters and one for lswitchports). The size of a data chunk might increase in order to ensure all resources are fetched in up to state_sync_interval/min_req_delay chunks.
The optimal chunk size is calculated with the following formula:
chunk_size = max((total_data_size / min_chunk_size) / (state_sync_interval / min_req_delay), 1) * min_chunk_size
total_data_size = 200, min_chunk_size = 300, state_sync_interval = 100, min_req_delay = 25
chunk_size = max((200/300) / (100/25), 1) * 300 = max(1/6, 1) * 300 = 300
total_data_size = 1500, min_chunk_size = 300, state_sync_interval = 100, min_req_delay = 25
chunk_size = max((1500/300) / (100/25), 1) * 300 = max(5/4, 1) * 300 = 375
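The formula can be expressed as a small function. This is a sketch only: the function name optimal_chunk_size is hypothetical, and rounding a fractional result up to a whole number of resources is an assumption not stated in the blueprint.

```python
import math


def optimal_chunk_size(total_data_size, min_chunk_size,
                       state_sync_interval, min_req_delay):
    """Return the chunk size needed to fetch all data within one sync interval."""
    # Maximum number of chunks that fit in one synchronization interval.
    max_chunks = state_sync_interval / min_req_delay
    # How many times larger than min_chunk_size each chunk has to be.
    ratio = (total_data_size / min_chunk_size) / max_chunks
    # Never go below min_chunk_size; rounding up is an assumption.
    return math.ceil(max(ratio, 1) * min_chunk_size)


# First worked example: the data fits in a single minimum-size chunk.
print(optimal_chunk_size(200, 300, 100, 25))   # 300
# A case where the ratio is 5/4, so the chunk grows beyond the minimum.
print(optimal_chunk_size(1500, 300, 100, 25))  # 375
```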
When the chunk size increases, this is discovered only after total_data_size has been calculated, which happens once the first chunk of data has been retrieved. A correction is therefore applied to the subsequent chunk:
First chunk: 300 -> actual chunk size: 375
Second chunk: 375 + 75 = 450
3rd and subsequent chunks: 375
ratio = ((float(sp.total_size) / float(sp.chunk_size)) /
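The correction applied after the first chunk can be sketched as follows. The function name chunk_schedule and its parameters are hypothetical; the sketch only reproduces the arithmetic described above, where the second chunk makes up for the shortfall of the first.

```python
def chunk_schedule(first_chunk, actual_chunk, num_chunks):
    """Return the sizes of the chunks requested during one sync run.

    first_chunk:  size used before total_data_size is known
    actual_chunk: size computed once total_data_size is known
    """
    if num_chunks < 1:
        return []
    sizes = [first_chunk]
    if num_chunks > 1:
        # The second chunk also covers what the first chunk fell short by.
        sizes.append(actual_chunk + (actual_chunk - first_chunk))
    # Remaining chunks use the computed size (empty when num_chunks <= 2).
    sizes.extend([actual_chunk] * (num_chunks - 2))
    return sizes


print(chunk_schedule(300, 375, 4))  # [300, 450, 375, 375]
```

Over four chunks the sizes add up to 1500, the same total as four chunks of 375, so the correction keeps the run within the planned number of requests.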
At each execution of this periodic task, the following queries will be executed on the NVP backend:
/ws.v1/lswitch/*/lport?fields=uuid,tags,link_status_up,admin_status_up,fabric_status_up&relations=LogicalPortStatus
/ws.v1/lrouter?fields=uuid%2Ctags%2Cfabric_status%2C%20&relations=LogicalRouterStatus
/ws.v1/lswitch?fields=uuid%2Ctags%2Cfabric_status%2C%20&relations=LogicalSwitchStatus
Field selection will reduce response size. The parameters _page_length and _page_cursor will be used to fetch the appropriate chunk at each execution. During synchronization we won't block access to the Quantum database. Status for newly created resources will be synchronized at the next execution of the synchronization task.
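Fetching one chunk per execution with _page_length and _page_cursor might look like the sketch below. The names build_query, ChunkReader, and make_fake_backend are hypothetical, and the assumption that a page fetch returns a (results, next_cursor) pair, with no cursor once the last page is reached, is made for illustration only and is not taken from the NVP API documentation.

```python
from urllib.parse import urlencode


def build_query(path, fields, page_length, page_cursor=None):
    """Build a query URI using field selection and paging parameters."""
    params = [("fields", ",".join(fields)), ("_page_length", page_length)]
    if page_cursor:
        params.append(("_page_cursor", page_cursor))
    return path + "?" + urlencode(params)


class ChunkReader:
    """Remember the cursor between executions of the sync task."""

    def __init__(self, fetch_page):
        # fetch_page(page_length, cursor) -> (results, next_cursor); assumed.
        self._fetch_page = fetch_page
        self._cursor = None  # None means "start from the beginning"

    def next_chunk(self, chunk_size):
        results, self._cursor = self._fetch_page(chunk_size, self._cursor)
        return results


def make_fake_backend(items):
    """In-memory stand-in for the backend, used only to exercise the reader."""
    def fetch_page(page_length, cursor):
        start = int(cursor) if cursor else 0
        end = start + page_length
        next_cursor = str(end) if end < len(items) else None
        return items[start:end], next_cursor
    return fetch_page
```

Once the last page has been read the cursor is reset, so the next execution starts again from the first resource.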
If a user needs up-to-the-moment information about an object's operational status, it is possible to include the field 'status' in Quantum's field selection query parameter. This will cause the status to be synchronized immediately.
Example: GET /v2.0/networks/<network_id>?fields=admin_state_up&fields=status
Another aspect which might affect the scalability of the synchronization task is the set of operations on the Quantum database. In theory one has to:
1) Retrieve all the information from the Network, Port, and Router tables
2) Update all the records with the new status information
In order to prevent the synchronization task from running during unit tests unless we explicitly want it to, the function that starts it will be stubbed out in all NVP unit tests. Tests that validate synchronization will instead use the synchronizer object directly, simulating the behaviour of the synchronization task.
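The testing approach can be illustrated with the standard library's unittest.mock. The Synchronizer and Plugin classes below are hypothetical stand-ins, not the plugin's actual classes; the sketch only shows the start function being stubbed while the synchronizer is driven directly.

```python
from unittest import mock


class Synchronizer:
    """Hypothetical synchronizer; `backend` returns {resource_id: status}."""

    def __init__(self, backend):
        self.backend = backend
        self.status = {}
        self.started = False

    def start(self):
        # In a real plugin this would launch the periodic task.
        self.started = True

    def synchronize_once(self):
        self.status.update(self.backend())


class Plugin:
    """Hypothetical plugin that starts synchronization on creation."""

    def __init__(self, backend):
        self.synchronizer = Synchronizer(backend)
        self.synchronizer.start()


def make_plugin_for_tests(backend):
    """Create a plugin with the periodic task start stubbed out."""
    with mock.patch.object(Synchronizer, "start") as start_stub:
        plugin = Plugin(backend)
    return plugin, start_stub
```

A test can then call plugin.synchronizer.synchronize_once() to simulate one run of the task and check the resulting status values, without any background task running across tests.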