Speedbump: TCP proxy for simulating variable, yet predictable network latency (github.com/kffl)
73 points by nateb2022 on July 31, 2022 | 14 comments



This is easy to do on Linux if you want it system-wide, or for protocols other than TCP.

  # tc qdisc add dev eth0 root netem delay 100000
gives you a 100ms delay (a bare time value is interpreted as microseconds). It can also be made non-system-wide by using network namespaces or creating additional network adapters.

Replace `add` with `del` to remove it.
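
For example, to clear the qdisc added above:

  # tc qdisc del dev eth0 root netem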

See `man netem` for more. It also lets you specify a delay distribution and simulate other network issues such as packet loss, reordering, and data corruption. For example, to limit the rate to 1 Mbit/s:

  # tc qdisc add dev eth0 root netem rate 1mbit
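
And a sketch combining a few of those knobs in one qdisc (the specific values are arbitrary; `distribution normal` relies on the distribution tables shipped with iproute2):

  # tc qdisc add dev eth0 root netem delay 100ms 20ms distribution normal loss 0.5%

That adds roughly 100ms of delay with 20ms of jitter drawn from a normal distribution, and drops 0.5% of packets.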


Sure, tc can be used for adding a fixed delay (or a random per-packet delay following a given distribution) to network traffic. In the use case that prompted me to develop speedbump (testing application metrics collection and visualization), I wanted the added artificial latency to change predictably over time (i.e. forming a sawtooth wave or a sine wave), so as to give me immediate visual feedback when testing/debugging PromQL queries visualized in Grafana dashboards. I don't think that's possible to achieve with tc alone.

On top of that, I usually work on metrics collection, visualization and load testing in a staging k8s cluster, in which it would be rather tedious to set up tc rules on the worker nodes themselves (or even impossible in the case of managed K8s services with no direct access to the VMs acting as worker nodes).


> which I don't think is possible to achieve with tc alone.

How you do it with tc (a rough command sketch follows the list):

- set up a dedicated network namespace, with a veth pair configured properly

- the veth on host should do forwarding and NAT/masquerade

- Add TC either on the veth inside the ns or on the host veth, or both (depending on the direction in which you want to add latency)

- then wait

- then change the tc parameters whenever you want the latency to change
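
A rough command sketch of the above (interface names, addresses and delay values are placeholders):

  # ip netns add slowns
  # ip link add veth-host type veth peer name veth-slow
  # ip link set veth-slow netns slowns
  # ip addr add 10.10.0.1/24 dev veth-host
  # ip link set veth-host up
  # ip -n slowns addr add 10.10.0.2/24 dev veth-slow
  # ip -n slowns link set veth-slow up
  # ip -n slowns link set lo up
  # ip -n slowns route add default via 10.10.0.1
  # sysctl -w net.ipv4.ip_forward=1
  # iptables -t nat -A POSTROUTING -s 10.10.0.0/24 -j MASQUERADE
  # tc qdisc add dev veth-host root netem delay 100ms
  # ip netns exec slowns ping 10.10.0.1
  # tc qdisc change dev veth-host root netem delay 250ms

Anything run via `ip netns exec slowns ...` then sees the extra latency (the qdisc on veth-host delays traffic heading into the namespace), and the last command shows the in-place parameter change; making it follow e.g. a sine wave would still require an external loop adjusting it over time.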

Now, the big problem with tc netem variable latency is that it does reorder packets, which is often undesirable.

So, if your project does _not_ result in packet reordering, it could be useful. Having said that, it seems to focus on layer 7, which is slightly too high-level for the stuff I'm doing.

Anyway, nice tool.


Wow, I didn't expect my project to get posted here. Thanks a lot!

I can see that many people are drawing comparisons with tc/netem, which speedbump is not intended to replace, so I think it's best if I provide some additional context regarding the use case that prompted me to develop this project.

When setting up application metrics collection and visualization (e.g. via Prometheus + Grafana), I've often found myself trying to introduce artificial latency within the instrumented system for the purpose of generating more interesting timeseries data to test a given monitoring solution. Even when running load tests against an instrumented system, the data plotted on Grafana dashboards was often rather boring, making it difficult to catch bugs in PromQL queries due to the lack of immediate visual feedback. I figured that one way of adding predictable variability to the instrumented application's metrics would be to introduce variable latency between it and its upstream services (e.g. databases, message brokers or other services called synchronously).

Let's say that you have instrumented your app using a Prometheus client so that it collects the latency of DB queries in a histogram, and that you are now building a Grafana dashboard to visualize these metrics. If you knew that the DB query latency over time should form a sine wave with a period of 2 minutes and an amplitude of 10 ms, it would be much easier to validate the correctness of metrics collection and visualization (you would know exactly what to look for on the latency histogram/graph). Speedbump allows me to achieve just that by introducing latency that is variable, yet predictable over time.
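
To make that concrete, an invocation along these lines (see the README or `speedbump --help` for the exact flag names; the ports below are just placeholders):

  $ speedbump --latency=100ms --sine-amplitude=10ms --sine-period=2m --port=5433 localhost:5432

The app would then connect to localhost:5433 instead of the database directly, and its query latency histogram should show a 2-minute sine wave on top of the base latency.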


Are you aware of a tool that supports, or are you planning to add support for, more advanced delays? E.g. letting through the HTTP response headers and then slowly drizzling out the content? Or delaying the last byte? What about packet drops at a certain point in the response?

I’ve found some interesting behaviour in our applications when delay isn’t uniform during the response.


That's an interesting use case. I am not aware of any tools that would allow for such content-aware proxying of HTTP traffic, with artificial latency introduced separately for the response headers and the content.

As of now, speedbump is protocol agnostic - it just proxies the data over TCP without analyzing it (I've used it to proxy HTTP, Postgres, Mongo and AMQP traffic). While the idea of delaying the last byte seems promising at first glance, especially in the context of retaining protocol agnosticism, the issue is that the proxy doesn't really know which byte is the last one (unless it actually understands the traffic flowing through it - i.e. detects the end of the HTTP response).

EDIT:

You may want to look into echo-nginx-module, as it allows for adding delays between chunks of output flushed to the client. Perhaps it could be used in conjunction with NGINX's proxy module.


Thanks. Sidecars like Envoy and linkerd-proxy already do protocol detection, even if it is just for metrics. Maybe their stack can be augmented to do latency and fault simulation. For plain HTTP, NGINX might indeed be an option.


Since there's a comment about Linux - if you ever need to play around with latency/throttling/packet loss/etc and you're developing on Windows, you can also check out Clumsy! [0] It was quite useful a few times, highly recommend it!

[0] https://jagt.github.io/clumsy/


A Gaussian would be a more natural way to model latency variation, IMO.


Yes, but I think the goal is to create repeatable results, so that if your application stops working properly you can debug it.


How does this compare with the built-in netem in Linux?


netem allows for traffic shaping at the kernel level, whereas speedbump operates at the TCP connection level.

While netem can be used for adding a fixed delay to network traffic (or a randomly generated one for each packet, following a given distribution), speedbump can be used for introducing latency that changes predictably over time (i.e. forming a sawtooth wave or a sine wave), which was particularly desirable in my use case (testing app instrumentation). On the flip side, netem can simulate packet loss, which speedbump can't do.


This is something they built that they can put on their promo doc.


I did something like this about 15 years ago here: https://github.com/ThomasHabets/dejitun

It only does fixed end-to-end latency, because what I wanted to do was just remove jitter (hence "de-jitter-tunnel"), at the cost of some latency.

I think I find it more useful than what the author has done here, because I did it using tunnel interfaces, so it's not just limited to TCP sessions, where the kernel would fix the order anyway.

But maybe the author here is trying to do something I don't immediately see. When latency and jitter really matter, isn't it the packet level that matters (hence building a TAP or TUN), not a TCP proxy?

My solution also can't be replaced with tc, because to remove jitter I need to mark every single packet with a timestamp and have the receiving end sequence it correctly.




