This architecture is how the big players do it at scale (i.e. Datadog, New Relic: the second data passes their edge it lands in a Kafka cluster). Also, OTel components lack rate limiting (1), meaning it's very easy to overload your backend storage (S3).
Grafana has some posts on how they softened the S3 blow with memcached (2, 3).
Grafana Labs employee here. On the linked articles: I'm not aware of any caching being used in the write-to-S3 part of the pipeline, other than some time-based/volume-based buffering at the ingester microservices before the chunks of data are written to object storage.
The linked Loki caching docs/articles are for optimising the read access patterns of S3/object storage, not for writes.
I've heard several times that Kafka was put in front of Elasticsearch clusters to handle traffic bursts. You can also use Redpanda, Pulsar, NATS, and other distributed queues.
One thing that is also very interesting with Kafka is that you can achieve exactly-once semantics without too much effort: keep track of the partition positions in your own database and carefully acknowledge them only when you are sure the data is safely stored in your DB. That's what we did with our engine Quickwit; so far it's the most efficient way to index data in it.
One obvious drawback with Kafka is that it's one more piece to maintain... and it's not a small one.
> exactly-once semantics without too much effort: keep track of the partition positions in your own database and carefully acknowledge them only when you are sure the data is safely stored in your DB
That's not really "exactly once". What happens when your system dies after it made sure the data is safely stored in the db and before ack-ing?
Depending on how you use the database, it is. If you write the data as well as the offset to the DB in the same transaction, you can then seek to the offset stored in the DB after an application restart and continue from there.
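A minimal sketch of that pattern, assuming the Python confluent-kafka client and SQLite; the broker address, topic, group ID, and table names are invented, and this is not Quickwit's actual code:

```python
# Hedged sketch: store the payload and the partition offset in one DB
# transaction, and on (re)start seek each assigned partition to the stored
# offset + 1. Broker, topic, group and table names are invented.
import sqlite3

from confluent_kafka import Consumer, OFFSET_BEGINNING

db = sqlite3.connect("sink.db")
with db:
    db.execute("CREATE TABLE IF NOT EXISTS events (payload BLOB)")
    db.execute("CREATE TABLE IF NOT EXISTS offsets "
               "(topic TEXT, part INTEGER, off INTEGER, PRIMARY KEY (topic, part))")

def on_assign(consumer, partitions):
    # Resume from the offsets we persisted ourselves, not Kafka's committed ones.
    for tp in partitions:
        row = db.execute("SELECT off FROM offsets WHERE topic = ? AND part = ?",
                         (tp.topic, tp.partition)).fetchone()
        tp.offset = row[0] + 1 if row else OFFSET_BEGINNING
    consumer.assign(partitions)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "indexer",
    "enable.auto.commit": False,  # the Kafka-side commit becomes optional bookkeeping
})
consumer.subscribe(["events"], on_assign=on_assign)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # One transaction: the payload and its offset are stored together or not at all.
    with db:
        db.execute("INSERT INTO events (payload) VALUES (?)", (msg.value(),))
        db.execute("INSERT OR REPLACE INTO offsets (topic, part, off) VALUES (?, ?, ?)",
                   (msg.topic(), msg.partition(), msg.offset()))
```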
You should drop the "(...) and carefully acknowledge them only when you are sure the data is safely stored in your DB (...)" part then, because it means it's not necessary; you don't rely on it.
At-least-once semantics + local deduplication gives exactly-once semantics.
In this case you're optimising local deduplication with a strictly monotonic index.
One downside is that you leak internals of the other system (partitions).
The other is that it implies serialised processing: you can't process anything in parallel, as you have a single index threshold that defines what has and what hasn't been processed yet.
I'm not the one who wrote the original comment, so I can't modify it. But one should still commit offsets because it is the happy path: DB transaction successful? Commit the offset. If the latter fails due to e.g. an application crash, and at startup you seek to the partition offset stored in the DB + 1, you get exactly-once semantics. There are some more details: e.g. you'd have to do the same during a consumer group rebalance, and topic configuration also plays a role, for example whether the topic is compacted, whether you write tombstones, and what its retention policy is.
edit: You added some more points to your comment after I posted this one, so I'll try to cover those as well:
> One downside is that you leak internals of the other system (partitions).
Yeah, sure.
> The other is that it implies serialised processing: you can't process anything in parallel, as you have a single index threshold that defines what has and what hasn't been processed yet.
It doesn't imply serialised processing. It depends on the use case: if every record in a topic has to be processed serially, you can't parallelize, full stop; the number of partitions equals 1. But if each record can be processed individually, you get parallelism equal to the number of partitions the topic is configured with.
You also achieve parallelism in the same way if only some records in a topic need to be processed serially: use the same key for the records that need serial processing and they will end up in the same partition. For example, recording the coordinates of planes: each plane can be processed in parallel, but an individual plane's coordinates need to be processed serially, so just use the plane's unique identifier as the key and the coordinates for the same plane will be appended to the log of the same partition.
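A minimal sketch of that keying pattern, assuming the Python confluent-kafka client; the broker address, topic name, and message shape are invented for illustration:

```python
# Hedged sketch: key each record by the plane's identifier so all records for
# one plane land in the same partition and keep their relative order.
# Broker address, topic name and message shape are invented for this example.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def record_position(plane_id: str, lat: float, lon: float) -> None:
    # The default partitioner hashes the key, so every message with the same
    # plane_id is appended to the same partition; different planes can still
    # be consumed in parallel from different partitions.
    producer.produce(
        "plane-positions",
        key=plane_id,
        value=json.dumps({"plane_id": plane_id, "lat": lat, "lon": lon}),
    )

record_position("N12345", 48.8566, 2.3522)
record_position("N12345", 48.8570, 2.3530)   # same key -> same partition, ordered
record_position("N67890", 51.5072, -0.1276)  # different key, possibly another partition
producer.flush()
```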
Presumably every exactly-once processing scenario needs you to squeeze things through a serial pipe at some point, or else you could have two messages with the same ID come in and be processed in parallel?
Yes, but the scope of blocking/serialisation can be narrow or wide, i.e. it can be per message ID (highly parallel, more state to persist, one entry per ID) or one for all messages of a certain type/partition (not parallel, less state required, a single last index for all messages of that kind).
> The other is that it implies serialised processing: you can't process anything in parallel, as you have a single index threshold that defines what has and what hasn't been processed yet.
Fortunately, Kafka is partitioned: you can work in parallel across partitions.
Also, you can pipeline your processing.
If you are running your data through operations (A, B, C), then (C on batch N) can run at the same time as (B on batch N+1) and (A on batch N+2).
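A rough sketch of that pipelining, with three stages connected by queues (the stage bodies here are just placeholders, not anything from the thread):

```python
# Hedged sketch: three stages (A, B, C) connected by bounded queues, so C can
# work on batch N while B works on batch N+1 and A on batch N+2.
# The stage bodies are placeholders, not real indexing operations.
import queue
import threading

def stage(fn, inbox, outbox):
    while True:
        batch = inbox.get()
        if batch is None:            # poison pill: shut down and propagate
            if outbox is not None:
                outbox.put(None)
            break
        result = fn(batch)
        if outbox is not None:
            outbox.put(result)

source, a_to_b, b_to_c = queue.Queue(4), queue.Queue(4), queue.Queue(4)
threads = [
    threading.Thread(target=stage, args=(lambda b: [x * 2 for x in b], source, a_to_b)),  # A
    threading.Thread(target=stage, args=(lambda b: [x + 1 for x in b], a_to_b, b_to_c)),  # B
    threading.Thread(target=stage, args=(lambda b: print("stored", b), b_to_c, None)),    # C
]
for t in threads:
    t.start()

for n in range(3):
    source.put([n, n + 10])          # batches flow through A -> B -> C concurrently
source.put(None)
for t in threads:
    t.join()
```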
Good point. First, you're right: we do ack on Kafka, but it's not necessary.
Second, this is not what I wanted to stress... and I should not have used the verb "acknowledge". What we do is upload the data to S3, then commit the partitions + positions in what we call the metastore. I can't edit my comment, unfortunately.
> One downside is that you leak internals of the other system (partitions).
True, but we generalized the concept of partitions to other data sources; it's pretty convenient to use it for distributing indexing tasks.
There has to be a retry system somewhere, otherwise you'd end up with a zero-or-more delivery system if the app crashes after picking up from the queue but before processing or ack-ing.
You have to do a bit more than that if you want exactly-once end-to-end (i.e. if Kafka itself can contain duplicates). One of my former colleagues did a good write-up on how Husky does it: https://www.datadoghq.com/blog/engineering/husky-deep-dive/
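For the general flavour only (the linked write-up describes what Husky actually does): deduplicating at write time by a stable event ID is one common way to turn at-least-once delivery into exactly-once storage. Table and column names below are invented:

```python
# Hedged sketch, not Husky's actual design: give every event a stable ID and
# make the store ignore repeats, so duplicate deliveries are harmless.
import sqlite3

db = sqlite3.connect("events.db")
with db:
    db.execute("CREATE TABLE IF NOT EXISTS events (event_id TEXT PRIMARY KEY, payload TEXT)")

def store_event(event_id: str, payload: str) -> bool:
    # INSERT OR IGNORE turns a redelivery of the same event_id into a no-op,
    # so at-least-once delivery upstream still yields exactly-once storage.
    with db:
        cur = db.execute("INSERT OR IGNORE INTO events (event_id, payload) VALUES (?, ?)",
                         (event_id, payload))
    return cur.rowcount == 1   # True only the first time this event is seen

store_event("evt-42", '{"msg": "hello"}')
store_event("evt-42", '{"msg": "hello"}')  # duplicate: ignored
```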
If you have distributed concurrent data streams that exhibit coherent temporal events, then at some point you pretty much have to implement a queuing balancer.
One simply trades latency for capacity and eventual coherent data locality.
It's almost an arbitrary detail whether you use Kafka, RabbitMQ, or Erlang channels. If you can add smart client application-layer predictive load balancing, then it is possible to cut burst traffic loads by an order of magnitude or two. Cost-optimized dynamic host scaling is not always a solution that solves every problem.
A similar idea [^1] has cropped up in the serverless OpenTelemetry world: collate OpenTelemetry spans in a Kinesis stream before forwarding them to a third-party service for analysis, obviating the need for a separate collector, reducing forwarding latency, and removing the cold-start overhead of the AWS Distro for OpenTelemetry Lambda layer.
Seems like overkill, no? OTel collectors are fairly cheap; why add expensive Kafka into the mix? If you need to buffer, why not just dump to S3 or a similar data store as a temporary storage area?
> If you need to buffer, why not just dump to S3 or a similar data store as a temporary storage area?
At that point it's very easy to sleepwalk into implementing your own database on top of S3, which is very hard to get good semantics out of: e.g. it offers essentially no ordering guarantees, and forget about atomicity. For telemetry you might well be OK with fuzzy data, but if you want exact traces every time, then Kafka could make sense.
Yeah, and to use S3 efficiently you also need to batch your messages into large blobs of at least tens of MB, which further complicates matters, especially if you don't want to lose those message buffers.
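A rough sketch of that kind of batching, assuming boto3; the bucket/prefix names and the 32 MB threshold are invented, and a real version would also flush on a timer and persist the buffer so a crash doesn't lose it:

```python
# Hedged sketch: buffer messages in memory and only PUT to S3 once the buffer
# reaches tens of MB, so each object is one large blob instead of many tiny ones.
# Bucket and prefix are invented placeholders.
import time
import uuid

import boto3

FLUSH_BYTES = 32 * 1024 * 1024   # aim for ~32 MB objects

class S3Batcher:
    def __init__(self, bucket: str, prefix: str):
        self.s3 = boto3.client("s3")
        self.bucket, self.prefix = bucket, prefix
        self.buffer = bytearray()

    def append(self, message: bytes) -> None:
        self.buffer += message + b"\n"
        if len(self.buffer) >= FLUSH_BYTES:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        key = f"{self.prefix}/{int(time.time())}-{uuid.uuid4()}.ndjson"
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=bytes(self.buffer))
        self.buffer.clear()

batcher = S3Batcher("my-telemetry-bucket", "raw/otel")
batcher.append(b'{"span": "..."}')
# ... many more appends ...
batcher.flush()   # final flush on shutdown
```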
If your OTel collector is being overwhelmed, you have a lot of backlogged data that can't be ingested, so you dead-letter-queue it to S3 to free up buffers.
The approach here is to only send data to S3 as a last resort.
If you're ok with losing some data when your collectors are overwhelmed, surely you'd just drop overflowing data in that case? Why go to all the effort of building a fallback ingestion path if it's not going to be reliable?
It's very hard to see S3 working as a buffer. Every datastore can cover almost all storage use cases (buffer/queue/DB) when the scale is low, but dedicated buffers/queues/DBs were designed to work at scale.
I expect it would be far cheaper to scale up Tempo/Loki than it would be to even run an idle Kafka cluster. This feels like spending thousands of dollars to save tens of dollars.
When handling surges on the order of 10x, it's much more difficult to scale the different components of Loki than to write the data to Kafka/Redpanda first and consume it at a consistent rate.
Are there any client-side dynamic samplers that can target a maximum event rate? Burstiness with OTel has been a thorn in everything that uses it, in my experience, and it's frustrating.
It'd be nice to have something simpler as an OTel processor. OTel could just dump events to local disk as sequential writes, then read them back, load permitting.
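Roughly the shape of that idea, sketched outside any real collector API; the spool directory, file layout, and function names are all made up:

```python
# Hedged sketch of "spill to local disk, drain when load permits"; this is not
# an actual OTel processor, and the spool layout is invented.
import os
import pickle

SPOOL = "/var/tmp/otel-spool"
os.makedirs(SPOOL, exist_ok=True)

def spill(batch, seq: int) -> None:
    # Sequential writes: one file per batch, named by an increasing sequence number.
    with open(os.path.join(SPOOL, f"{seq:012d}.bin"), "wb") as f:
        pickle.dump(batch, f)

def drain(export) -> None:
    # Replay spooled batches in order once the downstream can keep up, then delete them.
    for name in sorted(os.listdir(SPOOL)):
        path = os.path.join(SPOOL, name)
        with open(path, "rb") as f:
            export(pickle.load(f))
        os.remove(path)

spill([{"span": "a"}], seq=1)
spill([{"span": "b"}], seq=2)
drain(lambda batch: print("exported", batch))
```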
I'm curious how long things stay in Kafka on average and in the worst case. If it's more than a few minutes, I imagine it lowers the quality of tail-based sampling.
1. https://github.com/open-telemetry/opentelemetry-collector-co...
2. https://grafana.com/docs/loki/latest/operations/caching/
3. https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cl...
I know the post is about telemetry data and my comments on Grafana are about logs, but the architecture bits still apply.