User:Physcx
Contents
- 1 "Transfer-Encoding : chunked" Performance Problem
- 1.1 Summary
- 1.2 Master Branch Performance
- 1.2.1 Master Branch - NOT CHUNKED Big Objects
- 1.2.2 Master Branch - CHUNKED Big Objects
- 1.3 Proposed Solution
- 1.3.1 Solution Performance - NOT CHUNKED Big Objects
- 1.3.2 Solution Performance - CHUNKED Big Objects
- 1.3.3 Master Branch - NOT CHUNKED Small Objects (100KB multi-stream)
- 1.3.4 Master Branch - CHUNKED Small Objects
- 1.3.5 Solution Performance - NOT CHUNKED Small Objects
- 1.3.6 Solution Performance - CHUNKED Small Objects
- 1.4 Conclusions
"Transfer-Encoding : chunked" Performance Problem
- Author: Shaun McDowell
- Date: 2014-04-29 17:14:15 EDT
Summary
We are using Swift as an application backend to store objects larger than 2MB, and we use chunked transfer-encoding to upload many streams simultaneously without knowing the full object size ahead of time. To improve performance and reduce disk thrashing we wanted to increase the amount of data collected before writing to disk. We found that the =network_chunk_size= setting in the object-server config files determines the size of the buffers passed down to DiskFileWriter's write function. After increasing our =network_chunk_size= from the default 64KB to 2MB, we noticed an 8x speed difference between uploading with and without chunked transfer-encoding on our physical system. Further testing and profiling showed that the loop in eventlet's wsgi =Input._chunked_read= function on the object servers is the source of enormous overhead: to satisfy a 2MB read it must repeatedly query the socket and copy buffers to piece together many small chunks.
With the default 64KB =network_chunk_size= we did not see the problem, because the wsgi =_chunked_read= method tends to get all of the requested data off the socket in one or two iterations of its loop; while that still doubles the number of syscalls and adds extra copying, the cost is not significant enough to noticeably affect performance. The problem compounds as the network chunk size or the size of the object being uploaded increases. With a 5MB object-server =network_chunk_size= on a vagrant SAIO and 100MB objects, chunked uploads took twice as long as non-chunked; with a 10MB =network_chunk_size= the same upload took four times as long.
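The overhead is easiest to see in a simplified model of how a wsgi-style chunked reader assembles one large requested buffer out of many small network chunks. The functions below are an illustrative sketch, not eventlet's actual implementation:

```python
def chunked_read(recv, size):
    """Collect `size` bytes by repeatedly pulling small chunks off `recv`."""
    pieces = []
    remaining = size
    iterations = 0
    while remaining > 0:
        chunk = recv(remaining)  # each call models one socket read (syscall)
        if not chunk:
            break
        pieces.append(chunk)
        remaining -= len(chunk)
        iterations += 1
    return b"".join(pieces), iterations  # the join is one more full copy


def make_sender(total, max_chunk=16384):
    """Hand out at most max_chunk bytes per call, like a client sending
    small chunks under chunked transfer-encoding."""
    state = {"left": total}

    def recv(n):
        take = min(n, max_chunk, state["left"])
        state["left"] -= take
        return b"x" * take

    return recv


# Satisfying one 2MB network_chunk_size read takes 128 socket reads plus a
# full reassembly copy; with the 64KB default it would take only 4 reads.
data, iters = chunked_read(make_sender(2 * 1024 * 1024), 2 * 1024 * 1024)
print(len(data), iters)  # 2097152 128
```

Every iteration costs a socket read, and the final join copies the entire buffer again, which is why the cost grows with the requested chunk size.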
Results comparison table (all times in seconds):

                                   1 GB Obj (curl)   30 x 100MB Objs (swift-bench)   1,000 x 100KB Objs (swift-bench)
Master Branch - content_length     11.297            42.982                          12.712
Master Branch - chunked            24.776            123.583                         13.277
Solution Branch - content_length   11.903            42.982                          12.885
Solution Branch - chunked          15.660            49.398                          13.274
Reproducing the problem in Vagrant SAIO:
Due to a number of hardware differences between our SAIO VM and actual storage nodes, we had to crank some of the settings to reproduce the problem locally.
How to reproduce the problem in a clean Vagrant SAIO
- Increase the VM's CPUs to 4, Memory to 4096M, and Swift's Storage capacity to 10GB
- Change the =network_chunk_size= setting in the object-server config files to >= 5242880 (5MB)
- Restart the object-servers
- From outside the SAIO virtual machine, time an upload of a large object (>= 100MB) with and without chunked encoding
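The =network_chunk_size= change from step 2 is an object-server setting; the fragment below shows the value we used. The section name and file layout assume a stock SAIO (e.g. /etc/swift/object-server/1.conf) and may differ on other deployments:

```ini
[app:object-server]
use = egg:swift#object
# 5 MB read off the wire per loop iteration -- large enough to expose
# the chunked-read overhead even on VM hardware
network_chunk_size = 5242880
```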
While running chunked transfers we found that the storage node object-server processes were reaching 100% CPU utilization due to the overhead of =eventlet.wsgi.Input._chunked_read=.
Our Vagrant SAIO VM
Clean SAIO from [1] with the following changes:
- VM cpus = 4
- VM memory = 4096MB
- localrc =LOOPBACK_GB= = 15
- object-server/conf network_chunk_size = 5MB
Hosted on:
- Intel i7 4770
- 2x Corsair 16GB 1600 MHz
- Samsung 840 SSD
Master Branch Performance
We compared both single object upload and concurrent upload performance for chunked and not chunked transfer encoding. Single object upload was tested using curl and concurrent upload performance was tested using swift-bench. To force swift-bench to use chunked encoding we modified =swiftbench/bench.py='s BenchPUT.run function to use:
client.put_object(self.url, self.token, container_name, name, source,
                  # content_length=len(source),
                  chunk_size=16384, http_conn=conn)
for our chunked tests.
Master Branch - NOT CHUNKED Big Objects
Single Big Object - 1 GB:
time curl -H "$XAuthToken" -T 1000MB -XPUT $XStorageURL/cont/test

real 0m11.297s
Mid-upload top on SAIO (notice CPU usage here vs chunked CPU usage):
 PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 939 vagrant  20  0 87836  29m 2344 R 97.2  0.7 10:03.55 swift-proxy-ser
4653 vagrant  20  0 1544m  41m 2472 R 32.6  1.0  0:53.92 swift-object-se
4655 vagrant  20  0 1546m  43m 2496 R 32.3  1.1  0:36.45 swift-object-se
4656 vagrant  20  0 1546m  43m 2496 R 31.3  1.1  0:33.72 swift-object-se
Timings with swift-bench 30 100MB objects:
time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:05:06,088 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:06,337 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:17,885 INFO 1 PUTS [0 failures], 0.1/s
swift-bench 2014-04-29 11:05:33,076 INFO 15 PUTS [0 failures], 0.6/s
swift-bench 2014-04-29 11:05:47,932 INFO 30 PUTS **FINAL** [0 failures], 0.7/s
swift-bench 2014-04-29 11:05:47,932 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:48,471 INFO 30 DEL **FINAL** [0 failures], 56.0/s
swift-bench 2014-04-29 11:05:48,471 INFO Auth version: 1.0

real 0m42.982s
user 0m5.028s
sys  0m5.898s
Master Branch - CHUNKED Big Objects
Single Big Object - 1 GB:
time curl -H "$XAuthToken" -H "Transfer-Encoding: chunked" -T 1000MB -XPUT $XStorageURL/cont/test

real 0m24.776s
Mid upload top on SAIO (notice CPU usage):
 PID USER     PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
4653 vagrant  20  0 1547m  44m 2468 R 100.1  1.1  1:05.75 swift-object-se
4656 vagrant  20  0 1545m  42m 2492 R  99.4  1.1  0:45.59 swift-object-se
1939 vagrant  20  0 87836  29m 2340 S  42.6  0.7 10:23.02 swift-proxy-ser
4655 vagrant  20  0 1540m  37m 2492 S  13.3  0.9  0:43.31 swift-object-se
Final timings with swift-bench's =bench.py= modified to use chunked transfers by not setting =content_length=:
time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:02:38,191 INFO Auth version: 1.0
swift-bench 2014-04-29 11:02:38,450 INFO Auth version: 1.0
swift-bench 2014-04-29 11:03:19,351 INFO 1 PUTS [0 failures], 0.0/s
swift-bench 2014-04-29 11:03:42,994 INFO 12 PUTS [0 failures], 0.2/s
swift-bench 2014-04-29 11:04:01,695 INFO 13 PUTS [0 failures], 0.2/s
swift-bench 2014-04-29 11:04:41,287 INFO 30 PUTS **FINAL** [0 failures], 0.2/s
swift-bench 2014-04-29 11:04:41,287 INFO Auth version: 1.0
swift-bench 2014-04-29 11:04:41,477 INFO 30 DEL **FINAL** [0 failures], 160.8/s
swift-bench 2014-04-29 11:04:41,477 INFO Auth version: 1.0

real 2m3.583s
user 0m11.934s
sys  0m9.179s
Our physical cluster showed a much larger slowdown, but this should be enough to demonstrate the problem.
Proposed Solution
We propose decoupling =network_chunk_size= from the size of the buffers being written to disk. When DiskFileWriter's write receives a =network_chunk_size= piece of data, rather than immediately passing it down to =os.write=, we propose using the =disk_chunk_size= config parameter, currently used only for DiskFileReader's reads, to buffer new data until at least =disk_chunk_size= bytes are ready to be written. This lets =network_chunk_size= solely govern network transfers and =disk_chunk_size= govern both read and write interactions with the disk. With this change we can keep =network_chunk_size= small, avoiding the massive overhead of repeated copying and socket reads in =eventlet.wsgi.Input._chunked_read=, while reducing disk thrashing with larger disk reads and writes.
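A minimal sketch of the proposed buffering follows. The class and attribute names are illustrative, not Swift's actual DiskFileWriter code; the point is only that network-sized chunks accumulate and the disk is touched in =disk_chunk_size= pieces:

```python
import os
import tempfile


class BufferedDiskWriter:
    """Illustrative sketch: accumulate incoming network chunks and only
    hit the disk in disk_chunk_size pieces."""

    def __init__(self, fd, disk_chunk_size=2 * 1024 * 1024):
        self.fd = fd
        self.disk_chunk_size = disk_chunk_size
        self.buf = bytearray()
        self.disk_writes = 0  # counts os.write calls, for demonstration

    def write(self, chunk):
        # Buffer each network_chunk_size piece instead of writing it through.
        self.buf += chunk
        while len(self.buf) >= self.disk_chunk_size:
            os.write(self.fd, bytes(self.buf[:self.disk_chunk_size]))
            del self.buf[:self.disk_chunk_size]
            self.disk_writes += 1

    def flush(self):
        # On upload completion, write any tail shorter than disk_chunk_size.
        if self.buf:
            os.write(self.fd, bytes(self.buf))
            self.buf.clear()
            self.disk_writes += 1


# 160 incoming 64KB network chunks (10MB total) become five 2MB disk writes.
tmp = tempfile.TemporaryFile()
writer = BufferedDiskWriter(tmp.fileno())
for _ in range(160):
    writer.write(b"x" * 65536)
writer.flush()
print(writer.disk_writes)  # 5
tmp.close()
```

Without the buffer, the same upload would issue 160 =os.write= calls of 64KB each; with it, the disk sees a handful of large sequential writes.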
Below are timings from an implementation of our solution using the default =network_chunk_size= (64KB) and a =disk_chunk_size= of 2MB, with DiskFileWriter buffering up to 2MB before writing a 2MB sequential swath out to disk.
Solution Performance - NOT CHUNKED Big Objects
Single Big Object - 1 GB:
time curl -H "$XAuthToken" -T 1000MB -XPUT $XStorageURL/cont/test

real 0m11.903s
Mid-upload:
 PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
5540 vagrant  20  0 88880  27m 2448 D 92.2  0.7 0:40.18 swift-proxy-ser
5575 vagrant  20  0 1521m  18m 2444 S 32.3  0.5 0:09.66 swift-object-se
5577 vagrant  20  0 1523m  19m 2444 S 32.3  0.5 0:12.41 swift-object-se
5574 vagrant  20  0 1522m  19m 2436 D 31.6  0.5 0:08.10 swift-object-se
Run of 30 100 MB timings:
time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:05:06,088 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:06,337 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:17,885 INFO 1 PUTS [0 failures], 0.1/s
swift-bench 2014-04-29 11:05:33,076 INFO 15 PUTS [0 failures], 0.6/s
swift-bench 2014-04-29 11:05:47,932 INFO 30 PUTS **FINAL** [0 failures], 0.7/s
swift-bench 2014-04-29 11:05:47,932 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:48,471 INFO 30 DEL **FINAL** [0 failures], 56.0/s
swift-bench 2014-04-29 11:05:48,471 INFO Auth version: 1.0

real 0m42.982s
user 0m5.028s
sys  0m5.898s
Solution Performance - CHUNKED Big Objects
Single Big Object - 1 GB:
time curl -H "$XAuthToken" -H "Transfer-Encoding: chunked" -T 1000MB -XPUT $XStorageURL/cont/test

real 0m15.660s
Mid-upload:
 PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
5540 vagrant  20  0 89400  27m 2448 R 96.8  0.7 2:12.30 swift-proxy-ser
5709 vagrant  20  0 1522m  19m 2436 S 26.6  0.5 0:19.29 swift-object-se
5710 vagrant  20  0 1521m  18m 2436 S 26.6  0.5 0:17.82 swift-object-se
5707 vagrant  20  0 1521m  18m 2452 S 26.3  0.5 0:19.72 swift-object-se
Final timings:
time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:32:48,323 INFO Auth version: 1.0
swift-bench 2014-04-29 11:32:48,585 INFO Auth version: 1.0
swift-bench 2014-04-29 11:33:03,184 INFO 1 PUTS [0 failures], 0.1/s
swift-bench 2014-04-29 11:33:19,122 INFO 13 PUTS [0 failures], 0.4/s
swift-bench 2014-04-29 11:33:37,256 INFO 30 PUTS **FINAL** [0 failures], 0.6/s
swift-bench 2014-04-29 11:33:37,257 INFO Auth version: 1.0
swift-bench 2014-04-29 11:33:37,410 INFO 30 DEL **FINAL** [0 failures], 199.7/s
swift-bench 2014-04-29 11:33:37,410 INFO Auth version: 1.0

real 0m49.398s
user 0m6.969s
sys  0m5.949s
Master Branch - NOT CHUNKED Small Objects (100KB multi-stream)
Just to show that we haven't broken small object transfers:
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 12:21:15,315 INFO Auth version: 1.0
swift-bench 2014-04-29 12:21:15,566 INFO Auth version: 1.0
swift-bench 2014-04-29 12:21:17,572 INFO 255 PUTS [0 failures], 127.3/s
swift-bench 2014-04-29 12:21:23,190 INFO 1000 PUTS **FINAL** [0 failures], 131.2/s
swift-bench 2014-04-29 12:21:23,190 INFO Auth version: 1.0
swift-bench 2014-04-29 12:21:25,199 INFO 475 DEL [0 failures], 236.7/s
swift-bench 2014-04-29 12:21:27,560 INFO 1000 DEL **FINAL** [0 failures], 229.0/s
swift-bench 2014-04-29 12:21:27,560 INFO Auth version: 1.0

real 0m12.712s
user 0m3.469s
sys  0m0.707s
Master Branch - CHUNKED Small Objects
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 12:23:23,550 INFO Auth version: 1.0
swift-bench 2014-04-29 12:23:23,859 INFO Auth version: 1.0
swift-bench 2014-04-29 12:23:25,865 INFO 250 PUTS [0 failures], 124.8/s
swift-bench 2014-04-29 12:23:31,905 INFO 1000 PUTS **FINAL** [0 failures], 124.3/s
swift-bench 2014-04-29 12:23:31,905 INFO Auth version: 1.0
swift-bench 2014-04-29 12:23:33,908 INFO 468 DEL [0 failures], 233.9/s
swift-bench 2014-04-29 12:23:36,360 INFO 1000 DEL **FINAL** [0 failures], 224.6/s
swift-bench 2014-04-29 12:23:36,360 INFO Auth version: 1.0

real 0m13.277s
user 0m3.934s
sys  0m0.843s
Solution Performance - NOT CHUNKED Small Objects
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:46:24,642 INFO Auth version: 1.0
swift-bench 2014-04-29 11:46:24,893 INFO Auth version: 1.0
swift-bench 2014-04-29 11:46:26,899 INFO 260 PUTS [0 failures], 129.8/s
swift-bench 2014-04-29 11:46:32,706 INFO 1000 PUTS **FINAL** [0 failures], 128.0/s
swift-bench 2014-04-29 11:46:32,706 INFO Auth version: 1.0
swift-bench 2014-04-29 11:46:34,717 INFO 473 DEL [0 failures], 235.4/s
swift-bench 2014-04-29 11:46:37,065 INFO 1000 DEL **FINAL** [0 failures], 229.5/s
swift-bench 2014-04-29 11:46:37,065 INFO Auth version: 1.0

real 0m12.885s
user 0m3.566s
sys  0m0.664s
Solution Performance - CHUNKED Small Objects
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:42:27,571 INFO Auth version: 1.0
swift-bench 2014-04-29 11:42:27,825 INFO Auth version: 1.0
swift-bench 2014-04-29 11:42:29,833 INFO 246 PUTS [0 failures], 122.7/s
swift-bench 2014-04-29 11:42:35,872 INFO 1000 PUTS **FINAL** [0 failures], 124.3/s
swift-bench 2014-04-29 11:42:35,872 INFO Auth version: 1.0
swift-bench 2014-04-29 11:42:37,886 INFO 441 DEL [0 failures], 219.3/s
swift-bench 2014-04-29 11:42:40,391 INFO 1000 DEL **FINAL** [0 failures], 221.4/s
swift-bench 2014-04-29 11:42:40,391 INFO Auth version: 1.0

real 0m13.274s
user 0m3.887s
sys  0m0.811s
Conclusions
The motivation for increasing =network_chunk_size= was to increase the amount of consecutive data written to disk and thereby mitigate thrashing. On our physical storage nodes, which do not use SSDs, larger read/write sizes gave much better performance. When we switched to chunked transfer encoding we hit a massive bottleneck on the storage nodes, which we traced to eventlet's wsgi =Input._chunked_read= reading from the socket. The proposed change allows larger reads and writes to disk while keeping network chunk read sizes optimized. It also lets =disk_chunk_size= determine both read and write sizes to disk, whereas it currently affects only reads. The change improved large object chunked performance without affecting performance at default settings or for small objects.