
> so pull model metrics is an antipattern

And sadly, somehow the whole Prometheus/cloud ecosystem is built on the idea of pulling GET /metrics. I personally also think it's an antipattern, yet that design is dominant. Streaming telemetry via gRPC is a rarity.



Pulling/polling isn't suitable for high throughput (e.g. network flows, application profiling, any sub-second sampling frequency), but it's totally fine for 99% of observability use cases. In fact, I would argue pushed metrics are an anti-pattern for most environments, where the performance upsides are not worth the added complexity & reduced flexibility.

There is real value to the observability system being observable by default. It is so nice to be able to GET the /metrics endpoint with curl and see real-time metrics in a human-readable format.

Pull by default, and consciously upgrade to push if you need more throughput.
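
As a minimal sketch of that workflow, assuming the official Go client (prometheus/client_golang) — the metric name, port, and handler are made up for illustration:

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // requestsTotal is a hypothetical counter; promauto registers it with
    // the default registry, so it shows up on /metrics automatically.
    var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "myapp_requests_total",
        Help: "Total requests handled.",
    })

    func main() {
        http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
            requestsTotal.Inc()
            w.Write([]byte("ok"))
        })
        // Expose the default registry in the plain-text exposition format.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":2112", nil)
    }

Then `curl localhost:2112/metrics` prints lines like `myapp_requests_total 42` next to the Go runtime metrics — exactly the human-readable view described above, with no collector in the loop.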


I think the issue for me is that "pull" requires me to open up lots of services/hosts/sidecars to allow inbound connections. That's a lot more things to monitor and test to see if they're broken.

Having a single DNS record that I can route based on location/traffic/load/uptime autonomously is, I think, super convenient.

For example, if I want to have a single metrics config for a global service I can say "server: metrics-host", and depending on the DNS search path it'll hit either the test, regional, or global metrics server (i.e. .local, .us-east-1, or *.company.tld).
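
A rough sketch of how that can hang together, assuming resolver search domains do the selection; the domains just mirror the hypothetical ones above and the config key is made up:

    # /etc/resolv.conf on a us-east-1 host (illustrative)
    search us-east-1.company.tld company.tld

    # metrics agent config, identical on every host
    server: metrics-host
    # resolves to metrics-host.us-east-1.company.tld here,
    # to metrics-host.local on a test box with "search local",
    # and to metrics-host.company.tld as the global fallback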

However, for most people it's a single DNS record with a load balancer. When a host stops pushing metrics, you check the host aliveness score and alert.
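
One way to wire up that aliveness check, assuming the pushes land in a Prometheus Pushgateway (which exposes a push_time_seconds timestamp per pushed group); the rule name, job label, and threshold are arbitrary:

    groups:
      - name: push-liveness
        rules:
          - alert: HostStoppedPushing
            # fire if a group hasn't pushed for roughly 5 minutes
            expr: time() - push_time_seconds{job="node"} > 300
            for: 5m
            labels:
              severity: warning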


I'd still argue that it's easier to scale pulling than it is a distributed push. It's kind of why Prometheus went that route in the first place.

Back in the days of Puppet or Nagios, which would take requests rather than pull, it was very common to hear about them cascading and causing huge denial-of-service issues, and even massive outages, for the simple fact that it's way harder to control thousands of servers sending out data on a schedule than a set of infrastructure designed to query them.

If I recall correctly, Facebook in the early days had a full-on datacenter meltdown due to their Puppet cluster pushing a bad release causing every host to check in. They were offline for a full day, I think; they couldn't update the thousands of hosts because things were so saturated.

However, in the case of polling, you dictate that cadence from the monitoring servers themselves, so you can control it centrally without causing sprawling outages and calls from everything.
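
Concretely, with Prometheus the cadence lives in one place on the scraping side; a minimal scrape config sketch (targets and intervals are illustrative):

    scrape_configs:
      - job_name: node
        scrape_interval: 30s   # dial this back centrally during an incident
        scrape_timeout: 10s
        static_configs:
          - targets: ['host-a:9100', 'host-b:9100']

No agent redeploy is needed to slow things down, which is exactly the control knob being described here.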

The pull model can obviously scale (a lot of the shortcomings are now addressed in Thanos/Cortex): https://promcon.io/2016-berlin/talks/scaling-to-a-million-ma...


> Puppet cluster pushing a bad release causing every host to check in

It was probably Chef, but yeah, I can totally see that happening.

In terms of scaling, nowadays everything either shards or can sit behind a load balancer, so partitioning is much simpler.

For network layout though, having hosts that can reach a large number of machines is something I really don't like. Traditional monitoring, where you have an agent running as root that can execute a bunch of functions, is also a massive security risk, and has largely been replaced by other forms of monitoring.


Most environments for which pushing is an anti-pattern (and I do agree those are most environments) should also avoid complex monitoring tools, complex cloud architectures, and most of the troublemakers in this entire discussion.

So, if you need to architect your metrics, the odds are much higher that you are one of the exceptions that also needs to think about pulling vs. pushing them. (Or you are doing resume-driven architecture and will ignore all the advice here anyway.)



