User:Physcx
Contents
- 1 "Transfer-Encoding : chunked" Performance Problem
- 1.1 Summary
- 1.2 Master Branch Performance
- 1.2.1 Master Branch - NOT CHUNKED Big Objects
- 1.2.2 Master Branch - CHUNKED Big Objects
- 1.3 Proposed Solution
- 1.3.1 Solution Performance - NOT CHUNKED Big Objects
- 1.3.2 Solution Performance - CHUNKED Big Objects
- 1.3.3 Master Branch - NOT CHUNKED Small Objects (100KB multi-stream)
- 1.3.4 Master Branch - CHUNKED Small Objects
- 1.3.5 Solution Performance - NOT CHUNKED Small Objects
- 1.3.6 Solution Performance - CHUNKED Small Objects
- 1.4 Conclusions
"Transfer-Encoding : chunked" Performance Problem
- Author: Shaun McDowell
- Date: 2014-04-29 17:14:15 EDT
Summary
We are using Swift as an application backend to store objects larger than 2MB, and we use chunked transfer-encoding to upload many streams simultaneously without knowing the full object size ahead of time. To improve performance and reduce disk thrashing we wanted to increase the amount of data collected before writing to disk. We found that the =network_chunk_size= setting in the object-server config files determines the size of the buffers passed down to DiskFileWriter's write function. After increasing our =network_chunk_size= from the default 64KB to 2MB, we noticed an 8x speed difference between uploading with and without chunked transfer-encoding on our physical system. Further testing and profiling showed that the loop in eventlet's wsgi =Input._chunked_read= function on the object servers is the source of enormous overhead: to satisfy a 2MB read it must repeatedly query the socket and copy buffers to piece together many small chunks.
With the default 64KB =network_chunk_size= we did not see the problem, because the wsgi =_chunked_read= method tends to get all of the requested data off the socket in one or two iterations of its loop; while that still doubles the number of syscalls and adds extra copying, the cost is not significant enough to noticeably affect performance. The problem compounds as the network chunk size or the size of the object being uploaded increases. With a 5MB object-server =network_chunk_size= on a vagrant SAIO and 100MB objects, chunked uploads took twice as long as non-chunked; with a 10MB =network_chunk_size= the same upload took four times as long.
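The overhead is easiest to see in a simplified model of how a wsgi-style chunked reader assembles one large requested buffer out of many small network chunks. The functions below are an illustrative sketch, not eventlet's actual implementation:

```python
def chunked_read(recv, size):
    """Collect `size` bytes by repeatedly pulling small chunks off `recv`."""
    pieces = []
    remaining = size
    iterations = 0
    while remaining > 0:
        chunk = recv(remaining)  # each call models one socket read (syscall)
        if not chunk:
            break
        pieces.append(chunk)
        remaining -= len(chunk)
        iterations += 1
    return b"".join(pieces), iterations  # the join is one more full copy


def make_sender(total, max_chunk=16384):
    """Hand out at most max_chunk bytes per call, like a client sending
    small chunks under chunked transfer-encoding."""
    state = {"left": total}

    def recv(n):
        take = min(n, max_chunk, state["left"])
        state["left"] -= take
        return b"x" * take

    return recv


# Satisfying one 2MB network_chunk_size read takes 128 socket reads plus a
# full reassembly copy; with the 64KB default it would take only 4 reads.
data, iters = chunked_read(make_sender(2 * 1024 * 1024), 2 * 1024 * 1024)
print(len(data), iters)  # 2097152 128
```

Every iteration costs a socket read, and the final join copies the entire buffer again, which is why the cost grows with the requested chunk size.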
Results comparison table (all times in seconds):

                                   1 GB Obj (curl)   30 x 100MB Objs (swift-bench)   1,000 x 100KB Objs (swift-bench)
Master Branch - content_length     11.297            42.982                          12.712
Master Branch - chunked            24.776            123.583                         13.277
Solution Branch - content_length   11.903            42.982                          12.885
Solution Branch - chunked          15.660            49.398                          13.274
Reproducing the problem in Vagrant SAIO:
Due to a number of hardware differences between our SAIO VM and actual storage nodes, we had to crank some of the settings to reproduce the problem locally.
How to reproduce the problem in a clean Vagrant SAIO
- Increase the VM's CPUs to 4, Memory to 4096M, and Swift's Storage capacity to 10GB
- Change the =network_chunk_size= setting in the object-server config files to >= 5242880 (5MB)
- Restart the object-servers
- From outside the SAIO virtual machine, time an upload of a large object (>= 100MB) with and without chunked encoding
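The =network_chunk_size= change from step 2 is an object-server setting; the fragment below shows the value we used. The section name and file layout assume a stock SAIO (e.g. /etc/swift/object-server/1.conf) and may differ on other deployments:

```ini
[app:object-server]
use = egg:swift#object
# 5 MB read off the wire per loop iteration -- large enough to expose
# the chunked-read overhead even on VM hardware
network_chunk_size = 5242880
```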
While running chunked transfers we found that the storage node object-server processes were reaching 100% CPU utilization due to the overhead of =eventlet.wsgi.Input._chunked_read=.
Our Vagrant SAIO VM
Clean SAIO from [1] with the following changes:
- VM cpus = 4
- VM memory = 4096MB
- localrc =LOOPBACK_GB= = 15
- object-server/conf network_chunk_size = 5MB
Hosted on:
- Intel i7 4770
- 2x Corsair 16GB 1600 MHz
- Samsung 840 SSD
Master Branch Performance
We compared both single object upload and concurrent upload performance for chunked and not chunked transfer encoding. Single object upload was tested using curl and concurrent upload performance was tested using swift-bench. To force swift-bench to use chunked encoding we modified =swiftbench/bench.py='s BenchPUT.run function to use:
client.put_object(self.url, self.token, container_name, name, source,
                  # content_length=len(source),
                  chunk_size=16384, http_conn=conn)
for our chunked tests.
Master Branch - NOT CHUNKED Big Objects
Single Big Object - 1 GB:
time curl -H "$XAuthToken" -T 1000MB -XPUT $XStorageURL/cont/test

real 0m11.297s
Mid-upload top on SAIO (notice CPU usage here vs chunked CPU usage):
 PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 939 vagrant  20  0 87836  29m 2344 R 97.2  0.7 10:03.55 swift-proxy-ser
4653 vagrant  20  0 1544m  41m 2472 R 32.6  1.0  0:53.92 swift-object-se
4655 vagrant  20  0 1546m  43m 2496 R 32.3  1.1  0:36.45 swift-object-se
4656 vagrant  20  0 1546m  43m 2496 R 31.3  1.1  0:33.72 swift-object-se
Timings with swift-bench 30 100MB objects:
time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:05:06,088 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:06,337 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:17,885 INFO 1 PUTS [0 failures], 0.1/s
swift-bench 2014-04-29 11:05:33,076 INFO 15 PUTS [0 failures], 0.6/s
swift-bench 2014-04-29 11:05:47,932 INFO 30 PUTS **FINAL** [0 failures], 0.7/s
swift-bench 2014-04-29 11:05:47,932 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:48,471 INFO 30 DEL **FINAL** [0 failures], 56.0/s
swift-bench 2014-04-29 11:05:48,471 INFO Auth version: 1.0

real 0m42.982s
user 0m5.028s
sys  0m5.898s
Master Branch - CHUNKED Big Objects
Single Big Object - 1 GB:
time curl -H "$XAuthToken" -H "Transfer-Encoding: chunked" -T 1000MB -XPUT $XStorageURL/cont/test

real 0m24.776s
Mid upload top on SAIO (notice CPU usage):
 PID USER     PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
4653 vagrant  20  0 1547m  44m 2468 R 100.1  1.1  1:05.75 swift-object-se
4656 vagrant  20  0 1545m  42m 2492 R  99.4  1.1  0:45.59 swift-object-se
1939 vagrant  20  0 87836  29m 2340 S  42.6  0.7 10:23.02 swift-proxy-ser
4655 vagrant  20  0 1540m  37m 2492 S  13.3  0.9  0:43.31 swift-object-se
Final timings with swift-bench's =bench.py= modified to use chunked transfers by not setting =content_length=:
time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:02:38,191 INFO Auth version: 1.0
swift-bench 2014-04-29 11:02:38,450 INFO Auth version: 1.0
swift-bench 2014-04-29 11:03:19,351 INFO 1 PUTS [0 failures], 0.0/s
swift-bench 2014-04-29 11:03:42,994 INFO 12 PUTS [0 failures], 0.2/s
swift-bench 2014-04-29 11:04:01,695 INFO 13 PUTS [0 failures], 0.2/s
swift-bench 2014-04-29 11:04:41,287 INFO 30 PUTS **FINAL** [0 failures], 0.2/s
swift-bench 2014-04-29 11:04:41,287 INFO Auth version: 1.0
swift-bench 2014-04-29 11:04:41,477 INFO 30 DEL **FINAL** [0 failures], 160.8/s
swift-bench 2014-04-29 11:04:41,477 INFO Auth version: 1.0

real 2m3.583s
user 0m11.934s
sys  0m9.179s
Our physical cluster showed a much larger slowdown, but this should be enough to demonstrate the problem.
Proposed Solution
We propose decoupling =network_chunk_size= from the size of the buffers being written to disk. When DiskFileWriter's write receives a =network_chunk_size= piece of data, rather than immediately passing it down to =os.write=, we propose using the =disk_chunk_size= config parameter, currently used only for DiskFileReader's reads, to buffer new data until at least =disk_chunk_size= bytes are ready to be written. This lets =network_chunk_size= solely govern network transfers and =disk_chunk_size= govern both read and write interactions with the disk. With this change we can keep =network_chunk_size= small, avoiding the massive overhead of repeated copying and socket reads in =eventlet.wsgi.Input._chunked_read=, while reducing disk thrashing with larger disk reads and writes.
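A minimal sketch of the proposed buffering follows. The class and attribute names are illustrative, not Swift's actual DiskFileWriter code; the point is only that network-sized chunks accumulate and the disk is touched in =disk_chunk_size= pieces:

```python
import os
import tempfile


class BufferedDiskWriter:
    """Illustrative sketch: accumulate incoming network chunks and only
    hit the disk in disk_chunk_size pieces."""

    def __init__(self, fd, disk_chunk_size=2 * 1024 * 1024):
        self.fd = fd
        self.disk_chunk_size = disk_chunk_size
        self.buf = bytearray()
        self.disk_writes = 0  # counts os.write calls, for demonstration

    def write(self, chunk):
        # Buffer each network_chunk_size piece instead of writing it through.
        self.buf += chunk
        while len(self.buf) >= self.disk_chunk_size:
            os.write(self.fd, bytes(self.buf[:self.disk_chunk_size]))
            del self.buf[:self.disk_chunk_size]
            self.disk_writes += 1

    def flush(self):
        # On upload completion, write any tail shorter than disk_chunk_size.
        if self.buf:
            os.write(self.fd, bytes(self.buf))
            self.buf.clear()
            self.disk_writes += 1


# 160 incoming 64KB network chunks (10MB total) become five 2MB disk writes.
tmp = tempfile.TemporaryFile()
writer = BufferedDiskWriter(tmp.fileno())
for _ in range(160):
    writer.write(b"x" * 65536)
writer.flush()
print(writer.disk_writes)  # 5
tmp.close()
```

Without the buffer, the same upload would issue 160 =os.write= calls of 64KB each; with it, the disk sees a handful of large sequential writes.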
Below are timings from an implementation of our solution using the default =network_chunk_size= (64KB) and a =disk_chunk_size= of 2MB, with DiskFileWriter buffering up to 2MB before writing a 2MB sequential swath out to disk.
Solution Performance - NOT CHUNKED Big Objects
Single Big Object - 1 GB:
time curl -H "$XAuthToken" -T 1000MB -XPUT $XStorageURL/cont/test

real 0m11.903s
Mid-upload:
 PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
5540 vagrant  20  0 88880  27m 2448 D 92.2  0.7 0:40.18 swift-proxy-ser
5575 vagrant  20  0 1521m  18m 2444 S 32.3  0.5 0:09.66 swift-object-se
5577 vagrant  20  0 1523m  19m 2444 S 32.3  0.5 0:12.41 swift-object-se
5574 vagrant  20  0 1522m  19m 2436 D 31.6  0.5 0:08.10 swift-object-se
Run of 30 100 MB timings:
time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:05:06,088 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:06,337 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:17,885 INFO 1 PUTS [0 failures], 0.1/s
swift-bench 2014-04-29 11:05:33,076 INFO 15 PUTS [0 failures], 0.6/s
swift-bench 2014-04-29 11:05:47,932 INFO 30 PUTS **FINAL** [0 failures], 0.7/s
swift-bench 2014-04-29 11:05:47,932 INFO Auth version: 1.0
swift-bench 2014-04-29 11:05:48,471 INFO 30 DEL **FINAL** [0 failures], 56.0/s
swift-bench 2014-04-29 11:05:48,471 INFO Auth version: 1.0

real 0m42.982s
user 0m5.028s
sys  0m5.898s
Solution Performance - CHUNKED Big Objects
Single Big Object - 1 GB:
time curl -H "$XAuthToken" -H "Transfer-Encoding: chunked" -T 1000MB -XPUT $XStorageURL/cont/test

real 0m15.660s
Mid-upload:
 PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
5540 vagrant  20  0 89400  27m 2448 R 96.8  0.7 2:12.30 swift-proxy-ser
5709 vagrant  20  0 1522m  19m 2436 S 26.6  0.5 0:19.29 swift-object-se
5710 vagrant  20  0 1521m  18m 2436 S 26.6  0.5 0:17.82 swift-object-se
5707 vagrant  20  0 1521m  18m 2452 S 26.3  0.5 0:19.72 swift-object-se
Final timings:
time swift-bench --saio -s 109715200 -n 30 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:32:48,323 INFO Auth version: 1.0
swift-bench 2014-04-29 11:32:48,585 INFO Auth version: 1.0
swift-bench 2014-04-29 11:33:03,184 INFO 1 PUTS [0 failures], 0.1/s
swift-bench 2014-04-29 11:33:19,122 INFO 13 PUTS [0 failures], 0.4/s
swift-bench 2014-04-29 11:33:37,256 INFO 30 PUTS **FINAL** [0 failures], 0.6/s
swift-bench 2014-04-29 11:33:37,257 INFO Auth version: 1.0
swift-bench 2014-04-29 11:33:37,410 INFO 30 DEL **FINAL** [0 failures], 199.7/s
swift-bench 2014-04-29 11:33:37,410 INFO Auth version: 1.0

real 0m49.398s
user 0m6.969s
sys  0m5.949s
Master Branch - NOT CHUNKED Small Objects (100KB multi-stream)
Just to show that we haven't broken small object transfers:
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 12:21:15,315 INFO Auth version: 1.0
swift-bench 2014-04-29 12:21:15,566 INFO Auth version: 1.0
swift-bench 2014-04-29 12:21:17,572 INFO 255 PUTS [0 failures], 127.3/s
swift-bench 2014-04-29 12:21:23,190 INFO 1000 PUTS **FINAL** [0 failures], 131.2/s
swift-bench 2014-04-29 12:21:23,190 INFO Auth version: 1.0
swift-bench 2014-04-29 12:21:25,199 INFO 475 DEL [0 failures], 236.7/s
swift-bench 2014-04-29 12:21:27,560 INFO 1000 DEL **FINAL** [0 failures], 229.0/s
swift-bench 2014-04-29 12:21:27,560 INFO Auth version: 1.0

real 0m12.712s
user 0m3.469s
sys  0m0.707s
Master Branch - CHUNKED Small Objects
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 12:23:23,550 INFO Auth version: 1.0
swift-bench 2014-04-29 12:23:23,859 INFO Auth version: 1.0
swift-bench 2014-04-29 12:23:25,865 INFO 250 PUTS [0 failures], 124.8/s
swift-bench 2014-04-29 12:23:31,905 INFO 1000 PUTS **FINAL** [0 failures], 124.3/s
swift-bench 2014-04-29 12:23:31,905 INFO Auth version: 1.0
swift-bench 2014-04-29 12:23:33,908 INFO 468 DEL [0 failures], 233.9/s
swift-bench 2014-04-29 12:23:36,360 INFO 1000 DEL **FINAL** [0 failures], 224.6/s
swift-bench 2014-04-29 12:23:36,360 INFO Auth version: 1.0

real 0m13.277s
user 0m3.934s
sys  0m0.843s
Solution Performance - NOT CHUNKED Small Objects
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:46:24,642 INFO Auth version: 1.0
swift-bench 2014-04-29 11:46:24,893 INFO Auth version: 1.0
swift-bench 2014-04-29 11:46:26,899 INFO 260 PUTS [0 failures], 129.8/s
swift-bench 2014-04-29 11:46:32,706 INFO 1000 PUTS **FINAL** [0 failures], 128.0/s
swift-bench 2014-04-29 11:46:32,706 INFO Auth version: 1.0
swift-bench 2014-04-29 11:46:34,717 INFO 473 DEL [0 failures], 235.4/s
swift-bench 2014-04-29 11:46:37,065 INFO 1000 DEL **FINAL** [0 failures], 229.5/s
swift-bench 2014-04-29 11:46:37,065 INFO Auth version: 1.0

real 0m12.885s
user 0m3.566s
sys  0m0.664s
Solution Performance - CHUNKED Small Objects
time swift-bench --saio -s 102400 -g 0 -A http://192.168.8.80:8080/auth/v1.0
swift-bench 2014-04-29 11:42:27,571 INFO Auth version: 1.0
swift-bench 2014-04-29 11:42:27,825 INFO Auth version: 1.0
swift-bench 2014-04-29 11:42:29,833 INFO 246 PUTS [0 failures], 122.7/s
swift-bench 2014-04-29 11:42:35,872 INFO 1000 PUTS **FINAL** [0 failures], 124.3/s
swift-bench 2014-04-29 11:42:35,872 INFO Auth version: 1.0
swift-bench 2014-04-29 11:42:37,886 INFO 441 DEL [0 failures], 219.3/s
swift-bench 2014-04-29 11:42:40,391 INFO 1000 DEL **FINAL** [0 failures], 221.4/s
swift-bench 2014-04-29 11:42:40,391 INFO Auth version: 1.0

real 0m13.274s
user 0m3.887s
sys  0m0.811s
Conclusions
The motivation for increasing =network_chunk_size= was to increase the amount of consecutive data written to disk and thereby mitigate thrashing. On our physical storage nodes, which do not use SSDs, larger read/write sizes gave much better performance. When we switched to chunked transfer encoding we hit a massive bottleneck on the storage nodes, which we traced to eventlet's wsgi =Input._chunked_read= reading from the socket. The proposed change allows larger reads and writes to disk while keeping network chunk read sizes optimized. It also lets =disk_chunk_size= determine both read and write sizes to disk, whereas it currently affects only reads. The change improved large object chunked performance without affecting performance at default settings or for small objects.