In part 1 of this series, we covered the goals, test parameters, and a summary of the results from the network-constrained tests we ran on streaming robot data. In this post, we walk through the detailed observations and the improvements we made based on the outcome of those tests.
TCP Connection Churn
We instrument all of our HTTP clients using go-conntrack, a library that provides TCP dialer metrics (connections attempted, failed, established, and closed). The graphs below, which showcase these metrics, are from the 0% packet loss, 20 ms latency experiment.
TCP connections are being established and closed at a rate of ~40/sec. This is unexpected: we should be reusing TCP connections across cloud object storage requests, especially in this healthy-network experiment. Inefficient use of TCP connections hurts performance at many layers. For example, we end up performing more TLS handshakes, which increases bandwidth and CPU utilization. The extra TCP and TLS handshakes also increase upload latency, causing our in-memory buffer to fill at a faster rate; if this buffer overflows, we end up writing data to disk.
Long Fat Network (LFN) Performance
By increasing the network latency in our experiments, we increase the bandwidth-delay product. As the amount of unacknowledged data in flight at any given time grows, we rely on the TCP receive and congestion windows being sufficiently large. If these windows are too small, we spend too much time waiting for acknowledgements and cannot utilize all of the available bandwidth.
When a TCP connection is established, the congestion window starts out small and grows over time (assuming a stable network). Combining this with the churn observed in the previous section (TCP Connection Churn), we can conclude that many short-lived connections hurt our overall TCP performance: each new connection pays the handshake cost and then starts again from a small congestion window.
Starting at 180 ms of latency, we are no longer able to maintain a steady upload state, as observed via the Uploader Item Store Usage and Uploader Cloud Object Storage Asset Buffer Fill Percent graphs. These graphs, copied below, are from the 0% packet loss, 180 ms experiment.