
> When you have thousands of metrics, Uptime Kuma and a bunch of friends won't help you.

This is fair! Actually, Uptime Kuma still doesn't support a multi-user mode (e.g. one admin user, multiple users that can edit values/setup, maybe some that can only view data).

That said, I'm also at the scale where it makes perfect sense to use something this simplistic, and there are few things that give me more joy than running/building a container image and getting working software in less than an hour, which at my scale also covers most if not all "day 2" concerns.

> Detailed App/Infra metrics can also run on your own infrastructure, unlike status pages that should use something independent. In your case, if your local Mattermost fails, you will get 0 notifications.

Another fair point! That said, there's very little preventing you from choosing the most boring and stable multi-cloud setup that you can find: a Docker container for the software, behind a reverse proxy and connected to the aforementioned infrastructure monitoring.

Has the Docker service failed? I'll get a notification. Docker bridge network down? I'll get a notification. Containers failing health checks? I might still need to work on this, but it's totally doable with minimal effort.
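
For the health-check part, a tiny script on the host is honestly enough. Here's a minimal sketch in Python, assuming the Docker CLI is on the box; the webhook URL and the JSON payload shape are placeholders for whatever notification channel you'd actually use (Mattermost webhook, mail gateway, ...):

  #!/usr/bin/env python3
  """Sketch: poll Docker container health and push a webhook alert."""
  import json
  import subprocess
  import urllib.request

  ALERT_URL = "https://chat.example.internal/hooks/placeholder"  # hypothetical webhook

  def container_health() -> dict:
      """Return {container_name: health_status} for every running container."""
      ids = subprocess.run(
          ["docker", "ps", "-q"], capture_output=True, text=True, check=True
      ).stdout.split()
      if not ids:
          return {}
      fmt = "{{.Name}} {{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}"
      out = subprocess.run(
          ["docker", "inspect", "-f", fmt, *ids],
          capture_output=True, text=True, check=True,
      ).stdout
      return {name.lstrip("/"): status
              for name, status in (line.split() for line in out.splitlines() if line)}

  def notify(message: str) -> None:
      """POST a plain-text alert to the (placeholder) webhook."""
      req = urllib.request.Request(
          ALERT_URL,
          data=json.dumps({"text": message}).encode(),
          headers={"Content-Type": "application/json"},
      )
      urllib.request.urlopen(req, timeout=10)

  if __name__ == "__main__":
      for name, status in container_health().items():
          if status == "unhealthy":
              notify(f"Container {name} failed its health check")

Run it from cron or a systemd timer every minute and you're most of the way there.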

Of course, there's also a lot of variability in how you can lay everything out - for example, I run some of my personal infrastructure on nodes that are in another room at my place, and most other parts off rented VMs from a semi-local company. My homepage, for example, has both Uptime Kuma and an external monitoring service connected to it, just to compare how believable those uptime figures are.

At work, though? For development/test environments, Uptime Kuma on a separate server is enough (say, if you have one that controls the container cluster or aggregates other metrics, you might as well spin up a simple container there), or whatever other software package is necessary, like Apache SkyWalking.

For production? Frankly, depending on what you're running, you might as well get a team of people together and come up with something that has proper redundancies in place, as well as a multi-cloud strategy.




> Has the Docker service failed? I'll get a notification. Docker bridge network down? I'll get a notification.

If you rely on cloud services, yes. If you run your own infra, then no, you will have to set up metrics/alerts for that in a custom manner, as with everything else. So the thing you mention is NOT boring technology (which should be promoted) but outsourcing (which should NOT be promoted in general).

> For development/test environments, Uptime Kuma on a separate server is enough

It doesn't matter, as your network will fail. There is nothing worse than a status page showing false positives.


> If you rely on cloud services, yes. If you run your own infra, then no, you will have to set up metrics/alerts for that in a custom manner, as with everything else.

Consider this example:

  I have Zabbix on server A.
  I have an e-mail server on server B.
  I have Uptime Kuma on server C.
  I have an instance of Mattermost on server D.
  I have the application that I want to monitor on server E.
In a zero trust model (or even just running WireGuard) there is very little preventing you from having each of these on a different cloud provider. There's also very little preventing you from having A-C on a few boxes that sit under your desk/colocated somewhere, but keeping D in the cloud.

Thus, one can reason about the potential failure states:

  If servers C-E run into issues (say, Docker issues), I'll get a notification thanks to A and B (Zabbix sending an e-mail).
  If servers C-E are utterly unreachable (say, network interface problems), I'll get a notification thanks to A and B (Zabbix sending an e-mail).
  If servers A-B or E run into issues, I'll get a notification thanks to C and D (Uptime Kuma sending a message).
  In the current configuration, I wouldn't be protected against a compound failure of A-D (both Zabbix and Uptime Kuma down), but those might as well run on different clouds, with different orchestrators.
Of course, you can set up failover and redundancy options, but by that point you're probably also looking into distributed file systems like GlusterFS or Ceph for any backing storage, and right now I don't need that complexity.
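
To make the cross-checking concrete, here's a rough sketch of what the A side could run against C (and, mirrored, what C could run against A): probe the other monitor over HTTP and, if it doesn't answer, alert through your own channel. Hostnames and addresses below are placeholders for the servers in the example, not a real deployment:

  #!/usr/bin/env python3
  """Sketch: A probes C over HTTP and alerts by e-mail through B if C is down."""
  import smtplib
  import urllib.error
  import urllib.request
  from email.message import EmailMessage

  OTHER_MONITOR = "https://uptime-kuma.server-c.example.internal"  # placeholder (C)
  SMTP_HOST = "mail.server-b.example.internal"                     # placeholder (B)

  def is_reachable(url: str, timeout: int = 10) -> bool:
      """Treat any HTTP response (even an error page) as 'the host is alive'."""
      try:
          urllib.request.urlopen(url, timeout=timeout)
          return True
      except urllib.error.HTTPError:
          return True   # the server answered, just not with a 2xx
      except OSError:
          return False  # connection refused, DNS failure, timeout, ...

  def alert_by_mail(subject: str, body: str) -> None:
      """Send a plain e-mail through server B, the channel independent of C-D."""
      msg = EmailMessage()
      msg["Subject"] = subject
      msg["From"] = "zabbix@server-a.example.internal"
      msg["To"] = "ops@example.internal"
      msg.set_content(body)
      with smtplib.SMTP(SMTP_HOST) as smtp:
          smtp.send_message(msg)

  if __name__ == "__main__":
      if not is_reachable(OTHER_MONITOR):
          alert_by_mail(
              "Uptime Kuma unreachable",
              f"{OTHER_MONITOR} did not answer; the C-D pair may be down.",
          )

The mirrored direction is just the same kind of check pointed at Zabbix, with the alert going out through Mattermost instead.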

Furthermore, as you said, you can also rely on cloud services in addition to what you already have, so should A-D go down, E will still be monitored by another solution, though that's also hardly necessary for most things.

Hell, for all I care, I might as well have a Raspberry Pi on my desk that pings the servers, checks SSH connections, checks running Docker images, does a curl call, and blinks and beeps aggressively when something isn't okay on servers that sit in a data center somewhere. It's not like there's any shortage of options. Of course, you can also go in the opposite direction and pick whatever is good enough, such as having A-B as a single server (or VM) and C-D as a single server (or VM), so as not to overcomplicate things.
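
If anyone actually wants to build that desk-side watchdog, a rough sketch along those lines (ping, SSH port check, HTTP probe, and a terminal bell standing in for the blinking/beeping hardware; every host name below is made up) could look like this:

  #!/usr/bin/env python3
  """Sketch of the 'Raspberry Pi on the desk' watchdog."""
  import socket
  import subprocess
  import urllib.error
  import urllib.request

  # placeholder hosts and health-check URLs
  HOSTS = {
      "server-a.example.internal": "https://server-a.example.internal/",
      "server-e.example.internal": "https://app.example.internal/health",
  }

  def ping(host: str) -> bool:
      """One ICMP echo request via the system ping binary."""
      return subprocess.run(
          ["ping", "-c", "1", "-W", "2", host],
          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
      ).returncode == 0

  def ssh_port_open(host: str, port: int = 22) -> bool:
      """Plain TCP connect to the SSH port; enough to see whether sshd answers."""
      try:
          with socket.create_connection((host, port), timeout=5):
              return True
      except OSError:
          return False

  def http_ok(url: str) -> bool:
      """The 'curl call': any HTTP response counts as reachable."""
      try:
          urllib.request.urlopen(url, timeout=10)
          return True
      except urllib.error.HTTPError:
          return True
      except OSError:
          return False

  if __name__ == "__main__":
      for host, url in HOSTS.items():
          if not (ping(host) and ssh_port_open(host) and http_ok(url)):
              print(f"\a*** {host} is not okay ***")  # the beeping part

Swap the print for GPIO calls and you have the blinking LED too.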


I know you can have all that :) All I'm saying is that you must rely on externals if A-E are all on the same network, as it may go down. Then your e-mails or other notification channels won't work.

Be that as it may, I think people generally tend to overdo redundancy. One can usually tolerate most regular services going down for an hour or two once every couple of years...


> All I'm saying is that you must rely on externals if A-E are all on the same network, as it may go down.

Thankfully, it's not too hard to take advantage of multiple networks in a hybrid/multi-cloud setup nowadays! Though, depending on the necessary access controls and auditing, such a setup might require slightly more work.

You do bring up an excellent point, though: that shared network is a serious single point of failure in many systems out there. I've personally seen plenty of setups like that (the majority of them, actually), and I suspect that in many cases it's done for ease of use/convenience, even if it may lead to downtime.

Of course, in some cases downtime is acceptable, so I can't deny that it can also make sense to choose such a simpler setup - for example, having your own company's applications and monitoring for development environments all on the same network.

Though if this topology is retained at scale, things can get a bit interesting. On a similar note, I recall Bryan Cantrill giving an interesting presentation, "Debugging Under Fire: Keep your Head when Systems have Lost their Mind", which talked about restarting their whole data center and the implications of that: https://youtu.be/30jNsCVLpAE



