Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Exponential backoff: I don't know if this is a published algorithm.

Basically, background processes need to keep retrying an operation until it succeeds. (Make an API call to a server, upload a file, ect, ect.) If the retry interval is too small, you can DOS the remote server. (Server returns a 5xx error because there's a corner case that hits a defect.) But, if the retry interval is too large, your background process can run too slowly when it encounters transient errors. (Temporary network glitch, database under high load because of a cache flush.)

So, we pick an ideal small retry interval, and the maximum tolerable interval. (In my case, it's usually 30 seconds and 15 minutes.) Then, when errors happen, the retry interval keeps doubling until it hits the largest interval. So, the operation is retried at 30 seconds, 1 minute, 2 minutes, 4 minutes, 8 minutes, and then every 15 minutes until it succeeds.

The result is that transient errors don't cause large delays. Situations where a defect prevents a request from submitting don't DOS, because the retry interval throttles itself longer and longer.



Ideally, you also want to implement "jitter" in the retries. AWS has a good article about it: https://aws.amazon.com/blogs/architecture/exponential-backof...

The summary (from the article) is that:

    sleep_time = random_between(0, min(2 ** retries, max_sleep_time))
so that retries are "spread across" the delay window.


Reminds me of the original Ethernet algorithm to deal with collisions on coax.



This is also used in TCP with timeouts and ACKs. The idea is you have an expected timeout (ETO) and a retransmission time out (RTO). Initially, you set RTO to ETO and then you send a packet. If you don't get an ACK back, and the timer for the RTO expires, you then set RTO = 2 * RTO. Otherwise, reset RTO = ETO. I think they referred to this as 'exponential averaging' however.


I tend to use a min/max equation that acts similar to a fibonacci backoff, allowing for a couple rapid retries before the exponential backoff kicks in. Provides a little extra speed for the transient problems.


This can't be an algorithm? I implemented it on a loop and a limit duration ( an hour) and a Max interval of a minute ( starting from 5 seconds) in 5 lines of code


A short algorithm doesn't make it any less of an algorithm :)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: