User:Physcx

="Transfer-Encoding : chunked" Performance Problem=
 * Author: Shaun McDowell
 * Date: 2014-04-29 17:14:15 EDT

Summary
We are using Swift as an application backend to store > 2MB objects and we are using chunked encoding to upload many streams simultaneously without knowing the full object size ahead of time. To improve performance and reduce disk thrashing we want to increase the amount of data collected before writing to disk. We found the =network_chunk_size= in the object-server config files determines the size of buffers being sent down to DiskfileWriter's write function. After increasing our =network_chunk_size= from the default 64KB to 2MB we noticed an 8x speed difference between uploading with and without chunked transfer-encoding on our physical system. Further testing and profiling shows the looping in eventlet's wsgi =Input._chunked_read= function on the object servers is the source of enormous overhead when it needs to repeatedly query the socket and copy buffers to piece together many small chunks to get the requested 2MB chunk.

With default settings of 64k network chunk size we did not see the problem because the wsgi =_chunked_read= method tends to get the entire requested data off the socket in 1 or 2 iterations of its loop and while there is still double the amount of syscalls and extra copying being done it is not significant enough to noticeably affect performance. We also found that the problem seems to compound as network chunk size is increased or object size being uploaded increases. With a 5MB vagrant saio object-server =network_chunk_size= and 100MB objects we found performance to be halved. With a 10MB =network_chunk_size= the same upload took 4 times as long using chunked encoding compared to not chunked.

Results Comparison Table: 1 GB Objs  Swift Bench 30 100 MB Objs   SB 1,000 100 KB Objs Master Branch - content_length       11.297                       42.982                 12.712 Master Branch - chunked              24.776                      123.583                 13.277 Solution Branch - content_length     11.903                       42.982                 12.885 Solution Branch - chunked            15.660                       49.398                 13.274

Reproducing the problem in Vagrant SAIO:
Due to a number of hardware differences between our SAIO VM and actual storage nodes, we had to crank some of the settings to reproduce the problem locally.

How to reproduce the problem in a clean Vagrant SAIO


 * 1) Increase the VM's CPUs to 4, Memory to 4096M, and Swift's Storage capacity to 10GB
 * 2) Change the =network_chunk_size= setting in the object-server config files to >= 5242880 (5MB)
 * 3) Restart the object-servers
 * 4) From outside the SAIO virtual machine, time an upload of a large object (>= 100MB) with and without chunked encoding

While running chunked transfers we found the Storage node object-server processes were reaching 100% cpu utilization due to the overhead from eventlet.wsgi.Input._chunked_read.

Our Vagrant SAIO VM
Clean SAIO from with the following changes:
 * VM cpus = 4
 * VM memory = 4096MB
 * localrc =LOOPBACK_GB= = 15
 * object-server/conf network_chunk_size = 5MB

Hosted on:
 * Intel i7 4770
 * 2x Corsair 16GB 1600 MHz
 * Samsung 840 SSD

Master Branch Performance
We compared both single object upload and concurrent upload performance for chunked and not chunked transfer encoding. Single object upload was tested using curl and concurrent upload performance was tested using swift-bench. To force swift-bench to use chunked encoding we modified =swiftbench/bench.py='s BenchPUT.run function to use:

client.put_object(self.url, self.token,      container_name, name, source,       #content_length = len(source),       chunk_size = 16384,       http_conn=conn)

for our chunked tests.

Master Branch - NOT CHUNKED Big Objects
Single Big Object - 1 GB: time curl -H "$XAuthToken" -T 1000MB -XPUT $XStorageURL/cont/test real   0m11.297s

Mid-upload top on SAIO (notice CPU usage here vs chunked CPU usage): PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND 939 vagrant  20   0 87836  29m 2344 R  97.2  0.7  10:03.55 swift-proxy-ser 4653 vagrant  20   0 1544m  41m 2472 R  32.6  1.0   0:53.92 swift-object-se 4655 vagrant  20   0 1546m  43m 2496 R  32.3  1.1   0:36.45 swift-object-se 4656 vagrant  20   0 1546m  43m 2496 R  31.3  1.1   0:33.72 swift-object-se

Timings with swift-bench 30 100MB objects: time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0 swift-bench 2014-04-29 11:05:06,088 INFO Auth version: 1.0 swift-bench 2014-04-29 11:05:06,337 INFO Auth version: 1.0 swift-bench 2014-04-29 11:05:17,885 INFO 1 PUTS [0 failures], 0.1/s swift-bench 2014-04-29 11:05:33,076 INFO 15 PUTS [0 failures], 0.6/s swift-bench 2014-04-29 11:05:47,932 INFO 30 PUTS **FINAL** [0 failures], 0.7/s swift-bench 2014-04-29 11:05:47,932 INFO Auth version: 1.0 swift-bench 2014-04-29 11:05:48,471 INFO 30 DEL **FINAL** [0 failures], 56.0/s swift-bench 2014-04-29 11:05:48,471 INFO Auth version: 1.0

real   0m42.982s user   0m5.028s sys 0m5.898s

Master Branch - CHUNKED Big Objects
Single Big Object - 1 GB:

time curl -H "$XAuthToken" -H "Transfer-Encoding: chunked" -T 1000MB -XPUT $XStorageURL/cont/test real   0m24.776s

Mid upload top on SAIO (notice CPU usage): PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND 4653 vagrant  20   0 1547m  44m 2468 R 100.1  1.1   1:05.75 swift-object-se 4656 vagrant  20   0 1545m  42m 2492 R  99.4  1.1   0:45.59 swift-object-se 1939 vagrant  20   0 87836  29m 2340 S  42.6  0.7  10:23.02 swift-proxy-ser 4655 vagrant  20   0 1540m  37m 2492 S  13.3  0.9   0:43.31 swift-object-se

time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0 swift-bench 2014-04-29 11:02:38,191 INFO Auth version: 1.0 swift-bench 2014-04-29 11:02:38,450 INFO Auth version: 1.0 swift-bench 2014-04-29 11:03:19,351 INFO 1 PUTS [0 failures], 0.0/s swift-bench 2014-04-29 11:03:42,994 INFO 12 PUTS [0 failures], 0.2/s swift-bench 2014-04-29 11:04:01,695 INFO 13 PUTS [0 failures], 0.2/s swift-bench 2014-04-29 11:04:41,287 INFO 30 PUTS **FINAL** [0 failures], 0.2/s swift-bench 2014-04-29 11:04:41,287 INFO Auth version: 1.0 swift-bench 2014-04-29 11:04:41,477 INFO 30 DEL **FINAL** [0 failures], 160.8/s swift-bench 2014-04-29 11:04:41,477 INFO Auth version: 1.0
 * Final timings with modified swift-bench's =bench.py= to use chunked transfers by not setting =content_length=:

real   2m3.583s user   0m11.934s sys 0m9.179s

The performance on our physical cluster showed a much larger performance slow down but this should be enough to show a problem.

Proposed Solution
We propose decoupling the =network_chunk_size= and the size of the buffers being written to disk. When a =network_chunk_size= of data is received by DiskFileWriter's write, rather than immediately passing it down to the =os.write= call we propose utilizing the =disk_chunk_size= config parameter, currently being used for DiskFileReader's reads, to buffer new data until at least =disk_chunk_size= is ready to be written. This will allow the =network_chunk_size= config parameter to solely affect the network transfers and the =disk_chunk_size= to affect the read and the write interactions with the disk. With this change we can keep =network_chunk_size= small to prevent massive overhead of repeated copying and socket reads from =eventlet.wsgi.Input._chunked_read= while reducing disk thrashing with larger disk reads and writes.

Below are timings with an implementation of our solution using the default =network_chunk_size= (64KB) and =disk_chunk_size= = 2MB with DiskFileWriter buffering up to 2MB before writing out to disk a 2MB sequential swath.

Solution Performance - NOT CHUNKED Big Objects
Single Big Object - 1 GB: time curl -H "$XAuthToken" -T 1000MB -XPUT $XStorageURL/cont/test real   0m11.903s

Mid-upload: PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND 5540 vagrant  20   0 88880  27m 2448 D  92.2  0.7   0:40.18 swift-proxy-ser 5575 vagrant  20   0 1521m  18m 2444 S  32.3  0.5   0:09.66 swift-object-se 5577 vagrant  20   0 1523m  19m 2444 S  32.3  0.5   0:12.41 swift-object-se 5574 vagrant  20   0 1522m  19m 2436 D  31.6  0.5   0:08.10 swift-object-se

Run of 30 100 MB timings: time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0 swift-bench 2014-04-29 11:05:06,088 INFO Auth version: 1.0 swift-bench 2014-04-29 11:05:06,337 INFO Auth version: 1.0 swift-bench 2014-04-29 11:05:17,885 INFO 1 PUTS [0 failures], 0.1/s swift-bench 2014-04-29 11:05:33,076 INFO 15 PUTS [0 failures], 0.6/s swift-bench 2014-04-29 11:05:47,932 INFO 30 PUTS **FINAL** [0 failures], 0.7/s swift-bench 2014-04-29 11:05:47,932 INFO Auth version: 1.0 swift-bench 2014-04-29 11:05:48,471 INFO 30 DEL **FINAL** [0 failures], 56.0/s swift-bench 2014-04-29 11:05:48,471 INFO Auth version: 1.0

real   0m42.982s user   0m5.028s sys 0m5.898s

Solution Performance - CHUNKED Big Objects
Single Big Object - 1 GB: time curl -H "$XAuthToken" -H "Transfer-Encoding: chunked" -T 1000MB -XPUT $XStorageURL/cont/test real   0m15.660s

Mid-upload: PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND 5540 vagrant  20   0 89400  27m 2448 R  96.8  0.7   2:12.30 swift-proxy-ser 5709 vagrant  20   0 1522m  19m 2436 S  26.6  0.5   0:19.29 swift-object-se 5710 vagrant  20   0 1521m  18m 2436 S  26.6  0.5   0:17.82 swift-object-se 5707 vagrant  20   0 1521m  18m 2452 S  26.3  0.5   0:19.72 swift-object-se

Final timings: time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0 swift-bench 2014-04-29 11:32:48,323 INFO Auth version: 1.0 swift-bench 2014-04-29 11:32:48,585 INFO Auth version: 1.0 swift-bench 2014-04-29 11:33:03,184 INFO 1 PUTS [0 failures], 0.1/s swift-bench 2014-04-29 11:33:19,122 INFO 13 PUTS [0 failures], 0.4/s swift-bench 2014-04-29 11:33:37,256 INFO 30 PUTS **FINAL** [0 failures], 0.6/s swift-bench 2014-04-29 11:33:37,257 INFO Auth version: 1.0 swift-bench 2014-04-29 11:33:37,410 INFO 30 DEL **FINAL** [0 failures], 199.7/s swift-bench 2014-04-29 11:33:37,410 INFO Auth version: 1.0

real   0m49.398s user   0m6.969s sys 0m5.949s

Master Branch - NOT CHUNKED Small Objects (10KB multi-stream)
Just to show that we haven't broken small object transfers: time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0 swift-bench 2014-04-29 12:21:15,315 INFO Auth version: 1.0 swift-bench 2014-04-29 12:21:15,566 INFO Auth version: 1.0 swift-bench 2014-04-29 12:21:17,572 INFO 255 PUTS [0 failures], 127.3/s swift-bench 2014-04-29 12:21:23,190 INFO 1000 PUTS **FINAL** [0 failures], 131.2/s swift-bench 2014-04-29 12:21:23,190 INFO Auth version: 1.0 swift-bench 2014-04-29 12:21:25,199 INFO 475 DEL [0 failures], 236.7/s swift-bench 2014-04-29 12:21:27,560 INFO 1000 DEL **FINAL** [0 failures], 229.0/s swift-bench 2014-04-29 12:21:27,560 INFO Auth version: 1.0

real   0m12.712s user   0m3.469s sys 0m0.707s

Master Branch - CHUNKED Small Objects
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0 swift-bench 2014-04-29 12:23:23,550 INFO Auth version: 1.0 swift-bench 2014-04-29 12:23:23,859 INFO Auth version: 1.0 swift-bench 2014-04-29 12:23:25,865 INFO 250 PUTS [0 failures], 124.8/s swift-bench 2014-04-29 12:23:31,905 INFO 1000 PUTS **FINAL** [0 failures], 124.3/s swift-bench 2014-04-29 12:23:31,905 INFO Auth version: 1.0 swift-bench 2014-04-29 12:23:33,908 INFO 468 DEL [0 failures], 233.9/s swift-bench 2014-04-29 12:23:36,360 INFO 1000 DEL **FINAL** [0 failures], 224.6/s swift-bench 2014-04-29 12:23:36,360 INFO Auth version: 1.0

real   0m13.277s user   0m3.934s sys 0m0.843s

Solution Performance - NOT CHUNKED Small Objects
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0 swift-bench 2014-04-29 11:46:24,642 INFO Auth version: 1.0 swift-bench 2014-04-29 11:46:24,893 INFO Auth version: 1.0 swift-bench 2014-04-29 11:46:26,899 INFO 260 PUTS [0 failures], 129.8/s swift-bench 2014-04-29 11:46:32,706 INFO 1000 PUTS **FINAL** [0 failures], 128.0/s swift-bench 2014-04-29 11:46:32,706 INFO Auth version: 1.0 swift-bench 2014-04-29 11:46:34,717 INFO 473 DEL [0 failures], 235.4/s swift-bench 2014-04-29 11:46:37,065 INFO 1000 DEL **FINAL** [0 failures], 229.5/s swift-bench 2014-04-29 11:46:37,065 INFO Auth version: 1.0

real   0m12.885s user   0m3.566s sys 0m0.664s

Solution Performance - CHUNKED Small Objects
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0 swift-bench 2014-04-29 11:42:27,571 INFO Auth version: 1.0 swift-bench 2014-04-29 11:42:27,825 INFO Auth version: 1.0 swift-bench 2014-04-29 11:42:29,833 INFO 246 PUTS [0 failures], 122.7/s swift-bench 2014-04-29 11:42:35,872 INFO 1000 PUTS **FINAL** [0 failures], 124.3/s swift-bench 2014-04-29 11:42:35,872 INFO Auth version: 1.0 swift-bench 2014-04-29 11:42:37,886 INFO 441 DEL [0 failures], 219.3/s swift-bench 2014-04-29 11:42:40,391 INFO 1000 DEL **FINAL** [0 failures], 221.4/s swift-bench 2014-04-29 11:42:40,391 INFO Auth version: 1.0

real   0m13.274s user   0m3.887s sys 0m0.811s

5 Conclusions
The motivation for increasing =network_chunk_size= was to increase the size of consecutive data being written to disk to mitigate thrashing. On our physical storage nodes that operate without SSDs we found much better performance with larger read/write sizes. When we changed to chunked transfer encoding we encountered a massive bottleneck on the storage nodes which we traced to eventlet's wsgi =Input._chunked_read= from the socket. The proposed change will allow larger reads/writes from disk while maintaining optimized network chunk read sizes. This also allows =disk_chunk_size= to appropriately determine read and write sizes from disk while currently it only affects reads. The change positively impacted large object chunked performance without impacting default settings performance or small object performance.

Buffered Writes Tests