Good question. First I noticed in the metrics dashboard (so it is important to have metrics) that the receiver never seemed to get the expected number of messages.

Then I noticed the message count was being reset, and suspected something was restarting the node. Browsing through the metrics, I saw both that memory usage was spiking too high and that the node was indeed restarting (uptime kept climbing and then dropping back down).

Focused on memory usage. We have a small function which returns the top N memory-hungry processes. One particular process stood out: its mailbox was tens of gigabytes.
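A helper like that could look roughly like the sketch below. This is not our actual function; the module name mem_top and the shape of the result are made up for illustration. It only uses erlang:processes/0 and erlang:process_info/2, and reports the mailbox length next to the memory figure since that is what gave the culprit away here.

```erlang
-module(mem_top).
-export([top/1]).

%% Hypothetical helper: return the N processes using the most memory as
%% {Pid, MemoryBytes, MessageQueueLen} tuples, largest first.
top(N) when is_integer(N), N >= 0 ->
    Infos = [{Pid, Mem, QLen}
             || Pid <- erlang:processes(),
                %% process_info/2 returns undefined for a process that
                %% died in the meantime; the pattern silently skips those.
                [{memory, Mem}, {message_queue_len, QLen}]
                    <- [erlang:process_info(Pid, [memory, message_queue_len])]],
    %% keysort/2 sorts ascending on the memory field, so reverse it.
    Sorted = lists:reverse(lists:keysort(2, Infos)),
    lists:sublist(Sorted, N).
```

Calling mem_top:top(10) from a remote shell gives the ten biggest processes. If recon is already on the node, recon:proc_count(memory, 10) does a similar job without any custom code.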

Then used recon_trace to trace that process and see what it was doing. recon_trace is written by the author of Erlang in Anger. So did something like:

    %% Trace at most 100 calls into my_module, including local (non-exported) functions.
    recon_trace:calls({my_module, '_', '_'}, 100, [{scope, local}]).

The dbg module is built in and you can use that, but it doesn't have rate limiting, so it can kill a busy node if you trace the wrong thing (it floods you with messages).

The trace showed that a particular operation was taking way too long, because it was doing an O(n) operation instead of an O(1) one on each message. On a smaller scale it wasn't noticeable, but at 1M-plus instances it was bringing everything down.
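To illustrate the kind of mistake (this is a made-up example, not the actual code involved, and count_demo and both function names are hypothetical): calling length/1 on a growing list for every message walks the whole list each time, while carrying a counter alongside the list keeps the per-message cost constant.

```erlang
-module(count_demo).
-export([slow_add/2, fast_add/2]).

%% O(n) per message: length/1 traverses the entire list on every call.
slow_add(Item, Items) ->
    NewItems = [Item | Items],
    {length(NewItems), NewItems}.

%% O(1) per message: the count travels with the list and is just incremented.
fast_add(Item, {Count, Items}) ->
    {Count + 1, [Item | Items]}.
```

With a handful of items the two are indistinguishable. With a million items, the slow version does a million steps per message, the sender keeps sending, and the mailbox grows faster than it can be drained.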

After fixing the problem, to test it I compiled a .beam file locally, scp-ed it to the test machine, hot-patched it (code:load_abs("/tmp/my_module")) and saw that everything was working fine.
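One thing to watch for when hot-patching the same module more than once: the VM keeps only two versions of a module, so a second load fails with {error, not_purged} while old code is still around. A minimal sketch of a reload helper, assuming the .beam file name matches the module name (hot_patch, reload/1 and the old_code_in_use error are names invented for this example):

```erlang
-module(hot_patch).
-export([reload/1]).

%% Path is the path to the .beam file WITHOUT the extension,
%% e.g. "/tmp/my_module", which is what code:load_abs/1 expects.
reload(Path) ->
    %% Assumption: the file is named after the module it contains.
    Mod = list_to_atom(filename:basename(Path)),
    %% soft_purge/1 removes old code only if no process still runs it;
    %% it returns false (and purges nothing) otherwise.
    case code:soft_purge(Mod) of
        true  -> code:load_abs(Path);
        false -> {error, old_code_in_use}
    end.
```

hot_patch:reload("/tmp/my_module") returns {module, my_module} on success. Using soft_purge instead of purge is the cautious choice, since code:purge/1 kills any process still executing the old version.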

On whether to pick a message queue vs regular processes: it depends. They are very different. Regular processes are easier, simpler and cheaper, but they are not persistent. So if your messages are like "add $1M to my account" and the sender wants to just send the message and not worry about whether it was acknowledged, then you'd want something with very good persistence guarantees.