“Dataflow” and the open source ecosystem in that neighborhood (Flink, Spark, Beam, Kafka, that family of tools) is a much more powerful way to look at logs in real time than indexing them into storage and then querying. There just isn’t anything off the shelf with that architecture as easy to deploy as ElasticSearch, that I’m aware of. (There should be!) Once you make the mental leap that logs are just events, and start looking at fun architectures like event sourcing, you start realizing streaming analysis is a pretty powerful way of thinking.

I’ve extracted insight from millions of log records per second on a single node with a similar setup, in real time, with much room to scale. The key to scaling log analysis is to get past textual parsing, which means using something structured, which usually negates the reason you were using ElasticSearch in the first place.
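To make “structured” concrete, here is a minimal Python sketch of the difference; the JSON shape and the field names are made-up examples, not anything from a real system:

    import json
    import re

    line = '{"status": 500, "latency_ms": 42, "path": "/checkout"}'

    # Textual parsing: a fragile regex over the rendered string,
    # paid for on every single event.
    m = re.search(r'"status":\s*(\d+)', line)
    status_from_text = int(m.group(1)) if m else None

    # Structured parsing: decode once, then read typed fields directly.
    event = json.loads(line)
    status_from_struct = event["status"]

    assert status_from_text == status_from_struct == 500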

Google’s first Dataflow talk from I/O and the accompanying paper should give you an idea of what log analysis can look like once you get past the full-text-indexing mindset. There’s nothing wrong with ELK, but you will run into scaling difficulty far sooner than you’d expect if you try to index every log event as it arrives. It’s also a bad thing to discover when you get slashdotted and your ES cluster whimpers trying to keep up. One thing streaming often buys you in that situation is latency instead of death, since you’re probably building atop a distributed log (and falling behind is less bad than falling over).

The key here is: are you looking for specific log events or are you counting? You’re almost always counting. Why index individual events, then?
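As a rough sketch of what “just count” looks like in the Beam model; the 60-second windows, the status-code key, and the in-memory stand-in source are illustrative assumptions, not anything specific from this thread:

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    # Stand-in for a real streaming source (Kafka, Pub/Sub, ...);
    # "ts" is seconds since the epoch.
    events = [
        {"ts": 0, "status": 200},
        {"ts": 10, "status": 500},
        {"ts": 70, "status": 200},
    ]

    with beam.Pipeline() as p:
        (p
         | beam.Create(events)
         | beam.Map(lambda e: TimestampedValue(e, e["ts"]))
         | beam.WindowInto(FixedWindows(60))      # one count per minute
         | beam.Map(lambda e: (e["status"], 1))   # key by status code
         | beam.CombinePerKey(sum)                # aggregate; nothing is indexed
         | beam.Map(print))

The output is a per-minute count per status code; the individual events are never stored, which is exactly the trade being described above.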




I don't think this is "always" true. The power of keeping the data as individual records is that you can trivially pivot and slice and dice it in different ways. Once it's aggregated into a count, I lose that ability. When trying to debug something that's happening, I find it far easier to have the entire records.

As for scaling, it scales very well (you can read through the Elastic blog and use cases for plenty of examples). That's not to say it will handle every level of scale, but I would venture that for 99% of the people out there, it will solve their problems very well.


Would it be accurate to say that your ability to count something new, starting from a previous point in time (i.e. not just from 'now'), depends on how easily you can replay your stream (I'm thinking in Kafka terms)? Or is there something in your architecture that lets you consume the stream from a (relative? absolute?) point in time (again, Kafka thinking leaking into my question)?
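For concreteness, this is the sort of thing I have in mind, as a minimal kafka-python sketch (the "logs" topic, the broker address, and the timestamp are placeholders, and seeking by absolute time assumes brokers that index records by timestamp):

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
    tp = TopicPartition("logs", 0)
    consumer.assign([tp])

    # Map an absolute point in time (epoch millis) to the earliest
    # offset whose record timestamp is >= that time, then seek there.
    start_ms = 1400000000000  # placeholder timestamp
    offsets = consumer.offsets_for_times({tp: start_ms})
    if offsets[tp] is not None:
        consumer.seek(tp, offsets[tp].offset)

    # Replay from that point forward and recompute the new count.
    for record in consumer:
        print(record.offset, record.value)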

Thanks.



