My biggest gripe with the 6.0.0 GA is the removal of multiple mapping types per index. This is a significant breaking change that will hurt community tools' adoption of 6.0. Their initial plan was to deprecate it and only remove it from v7 onwards, which imho would've been a better balance.
The removal of mapping types really kicks the bottom out of the app we're making; it's some serious docker-style "break all the things" behaviour. Seriously losing trust on this one.
Are there any good alternatives to ES? Has Solr moved forward?
If it helps at all, Kibana also had to be updated to go from multiple types to a single type. It was a big project, and a bunch of approaches were explored for dealing with existing data (which I assume you have).
Ultimately, we settled on continuing to have different "types" in Kibana, but we treated them as data concerns rather than architectural concerns of Elasticsearch. At a high level, this meant that we added a new "type" field to documents to track the type itself, and then we prefixed fields with the type as well to preserve the ability to use the same userland id on different "types" in the index and such. The type/id prefixing thing doesn't get exposed beyond the layer that queries Elasticsearch for the kibana index.
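A rough sketch of the resulting document shape, with illustrative names (not our exact schema):

    # Hypothetical helper showing the "type as a data concern" idea: the type
    # becomes an ordinary field, and the id is prefixed with it so the same
    # userland id can exist for different "types" in one single-type index.
    def wrap_doc(doc_type, user_id, attributes):
        doc_id = "{}:{}".format(doc_type, user_id)   # e.g. "dashboard:abc123"
        body = {
            "type": doc_type,          # replaces the old _type metadata
            doc_type: attributes,      # per-type fields nested under the type name
        }
        return doc_id, body

    doc_id, body = wrap_doc("dashboard", "abc123", {"title": "Traffic overview"})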
Once that change was ready in the code, we also had to consider the need for users to migrate their existing multiple-type index to this new format. The upgrade assistant in x-pack basic handles all of this automatically, but folks can certainly repurpose the same reindexing operation we perform on .kibana for their own indices.
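For a rough idea of what that reindex looks like (index names below are placeholders, and this is a sketch rather than the exact upgrade-assistant logic), it's essentially a _reindex with a script that copies the old _type into a regular field and folds it into the id:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Copy each document from the old multi-type index into a new single-type
    # index, preserving the old _type as a "type" field and in the new _id.
    es.reindex(body={
        "source": {"index": "old_multi_type_index"},
        "dest": {"index": "new_single_type_index"},
        "script": {
            "lang": "painless",
            "source": """
                ctx._source.type = ctx._type;
                ctx._id = ctx._type + '-' + ctx._id;
                ctx._type = 'doc';
            """
        }
    })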
I also really don't like this change, but it's worth understanding how much it actually matters for you.
If you just wanted to be able to store different types in an index, you still can; it's just that the type will not be stored in the meta "_type" field but in a document-level "type" (or whatever) field. Of course, the API does change since the type is no longer in the URL, and if you're creating custom doc ids you'll probably have to include the type in them (comment-123, post-123). So it's annoying, but I think mostly something that can be worked around.
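A minimal sketch of that workaround (Python client, made-up index and field names):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # One mapping type per index; the logical type lives in a regular field
    # and is baked into the document id to avoid id collisions across types.
    es.index(index="content", doc_type="doc", id="comment-123",
             body={"type": "comment", "post_id": 123, "text": "Nice article"})
    es.index(index="content", doc_type="doc", id="post-123",
             body={"type": "post", "title": "Mapping types in 6.0"})

    # Queries filter on the type field instead of the old _type metadata:
    es.search(index="content", body={
        "query": {"bool": {"filter": [{"term": {"type": "comment"}}]}}
    })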
If you're using multiple types for parent-child, the situation is more bleak. They are still going to have a "join field" but there can only be one type of relation. While often this is ok, there are definitely reasonable use cases where it's not.
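For reference, the 6.x join field looks roughly like this (made-up index and relation names); the relation is declared once in the mapping, and parents and children live in the same single-type index:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Declare one parent/child relation in the mapping.
    es.indices.create(index="qa", body={
        "mappings": {
            "doc": {
                "properties": {
                    "join_field": {
                        "type": "join",
                        "relations": {"question": "answer"}
                    }
                }
            }
        }
    })

    # Parent document:
    es.index(index="qa", doc_type="doc", id="q1",
             body={"title": "How do I upgrade?", "join_field": "question"})

    # Child document, routed to its parent's shard:
    es.index(index="qa", doc_type="doc", id="a1", routing="q1",
             body={"text": "Reindex first.",
                   "join_field": {"name": "answer", "parent": "q1"}})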
Currently the traditional parent-child hasn't been removed from the core because they need to support 5.x indices. But it will be phased out unless there is a big uproar.
Can you not use a separate index for each mapping type and just combine the indexes by aliasing them all to one "virtual" index? I think that would effectively be the same but I may be forgetting something.
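Something like this, with made-up names (one index per former type, all behind one alias):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # One index per former mapping type, combined under a single alias.
    for name in ("posts", "comments"):
        es.indices.create(index=name, ignore=400)          # ignore "already exists"
        es.indices.put_alias(index=name, name="content")

    # Searches against the alias fan out to all the underlying indices:
    es.search(index="content", body={"query": {"match_all": {}}})

The thing I might be forgetting: you can't index through an alias that points at more than one index, so writes still have to target the concrete per-type index.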
Oh, good, instead of getting my app finished this month, I guess I get to screw around with my ES indices and make no real progress on anything. Should have gone with Solr...
Ah, but if I don't go through the pain now, I will get to learn an entirely new definition of pain and suffering later on when I (hopefully) have a huge ES search index in production.
You can basically use 5.x in the same way as 6 (one "type" per index, only one parent-child relation). If you're not doing it that way now, you can start slowly while you're on 5.x and gradually migrate your methodology to the 6.x way (if it's even that different...). Then you can move to 6 with little change, if you even need to upgrade to 6 at all.
> The removal of mapping types really kicks the bottom out of the app we're making
Not to rub salt in the wound, but building a product around a feature provided by a single vendor (ie, not a standard feature or something developed in-house) means that you've just committed to maintaining that software or paying someone else good money to do so.
Anybody who's built products around eg Oracle has learned this at least twice!
I'm doing a migration ES => SolrCloud right now, and it kinda works, but it's not great. The JSON support is really a leaky mapping over XML and you have to tune the configs yourself to make it work, the SolrJ Java client is a piece of crap, and some things aren't documented well.
But apparently migrating is still simpler than making a BigCorp pay for the ES security package.
Yes, I believe Solr is better than when I used it 4 years ago (it now has partial JSON API support), but I don't think it has the analytics aggregations that ES gives us.
Yeah, we have the same problems with ES. I do think we "overused" it a bit by often using it as a secondary index/view-model store of sorts. In hindsight that probably wasn't the smartest thing to do. If we had only used it as a search index we would have been better off.
They haven't changed their plan: multiple mapping types per index are only unsupported for new indices; old ones that are migrated will continue to work until 7.
Currently, only read support is provided for indices created in 5.x. Speaking as someone who used one of their pre-GA releases: they previously had a flag to enable multiple mapping types.
ElasticSearch was never really meant for log storage anyway. It’s a full-text engine, and just happened to work reasonably well for that purpose at lower volume. ELK ran with it in an attempt to go after Splunk, but it is phenomenally difficult to scale an indexing pipeline like ELK to high volume. There are far better ways to handle log analysis, particularly when your primary query is counting things that happen over time instead of finding log entries that match a query (which it always is); streaming analysis is a much better fit than indexing, just less well known.
As someone who knows of several places using it for multi-petabyte log retention and log analysis, what is Elastic used for, if it's not good at log indexing and text search? If I need to find logs based on one snippet of data, then pivot on different fields to analyze events, what should I be using?
> ElasticSearch was never really meant for log storage anyway.
Indeed, it wasn't designed for log storage, though it happened to match this use scenario well (now less so with every release).
> There are far better ways to handle log analysis, particularly when your primary query is counting things that happen over T instead of finding log entries that match a query (which it always is)
Oh? This is the first time I've heard that my use case (storing logs from syslog for diagnostics at a later time) counts things over time. Good to know. I may ask you later for more insight about my environment.
> streaming analysis is a much better fit than indexing, just lesser known.
Well, I do this, too, and not only from logs. I still need arbitrary term search for logs.
The snark is totally unnecessary, since the vast majority of people deploy ELK to do reporting. Full term search is achievable with grep; what does ES give you for your non-reporting use case, since troubleshooting is an extremely low frequency event? Are you primarily leaning on its clustering and replication? Genuinely curious.
The per-record overhead, building multiple indexes at log-line rate: there are just so many reasons not to do your use case in ES that I don’t even consider it. I think it’s a poorer fit than reporting, to be honest.
ELK > grep for searching. As the other poster said, per-field filtering and rapid pivoting is a MUCH more effective workflow than grepping for string fragments and hoping they match the proper field in a syslog message.
And you keep talking about how much you know and how ELK is literally worse than grep for searching off fields in logs for troubleshooting, but offer no alternative setups or use cases. You're hand-waving.
I've seen some of the performance issues of ELK at scale, and I'd be interested in what's out there, because it's not my area of expertise. But you are just yelling "dataflow" and "streaming analytics".
> The snark is totally unnecessary, since the vast majority of people deploy ELK to do reporting.
You shouldn't have used such an authoritative universal quantifier. There are plenty of sysadmins who use ES for this use case; you apparently just happened to only be exposed to using it with websites.
Then, what do ES+Kibana give me over grep? Search over specific fields (my logs are parsed into a proper data structure), including the type of event (obviously, different types for different daemons), a query language, and a frontend with histograms.
Mind you, troubleshooting around a specific event is but one of the things sysadmins do with logs. There are also other uses, all landing in the realm of post-hoc analysis.
Kibana and histograms are reporting. Now the snark is even more confusing, since you’re doing exactly what I say is a poor fit, but claiming it’s not your use case. I spend what time I can trying to show those very same sysadmins you’re talking about why ES is a poor architecture for log work, particularly at scale.
As an SRE, I’ve built high volume log processing at every employer in multiple verticals, including web. I know what sysadmins do. Not a fan of the condescension and assumptions you’re making. I have an opinion. We differ. That’s fine. Let it be fine.
> Kibana and histograms are reporting. [...] you’re doing exactly what I say is a poor fit, but claiming it’s not your use case.
You must be from the species that can predict each and every report before it's needed. Good for you.
Also, I didn't claim that I don't use reports known in advance; I do use them. But there are cases when preparing such a report just to see one trend is overkill, and there's still troubleshooting, which is helped by the query language. Your defined-in-advance reports don't help with that.
> I spend what time I can trying to show those very same sysadmins you’re talking about why ES is a poor architecture for log work, particularly at scale.
OK. What works "particularly at scale", then?
Also, do you realize that "particularly at scale" is quite a rare setting? The "a dozen or fewer gigabytes a day" scale is much, much more common, and ES works (worked) reasonably well for that.
You should read the Dremel and Dataflow papers as examples of alternative approaches and dial down your sarcastic attitude by about four clicks. You don’t need to define reporting ahead of time when architected well; it’s quite possible to do ad-hoc and post-hoc without indexing per record. At small scale, your questions are quite infrequent and the corpus small, meaning waiting on a full scan isn’t the end of the world.
A dozen or less gigabytes a day means: use grep. This is just like throwing Hadoop at that log volume.
This was an opportunity to learn from someone with a different perspective, and I could learn something from yours, but instead, you’ve made me regret even saying anything. I’m sorry, I just can’t engage with you further.
(Edit: I’m genuinely mystified that discussing alternative architectures is somehow arrogant “pissing on” people. Why personalize this so much?)
So, basically, you have (or had) access to closed software designed specifically for working with system logs, and based on that you piss on everybody who uses what they have at hand on a smaller scale. Or at least this is how I see your comments here.
I may need to tone down my sarcasm, but likewise, you need to tone down your arrogance about working at Google or the equivalent.
But still, thank you for the search keyword ("dremel"). I certainly will read the paper (though I don't expect too many very specific ideas from a publication ten pages long), since I dislike the current landscape of only having ES, flat files, and paid solutions for storing logs at a rate of a few GB per day.
> A dozen or less gigabytes a day means: use grep. This is just like throwing Hadoop at that log volume.
No, not quite. I also use grep and awk (and App::RecordStream) for that. I still want to have a query language for working with this data, especially when it is combined with an easy-to-use histogram plotter.
“Dataflow” and the open source ecosystem in that neighborhood (Flink, Spark, Beam, Kafka, that family of stuff) is a much more powerful way to look at logs in real time, rather than indexing them into storage and then querying. There just isn’t something off the shelf as easy to deploy as ElasticSearch with that architecture, that I’m aware of. (There should be!) When you jump the mental gap of events being identical to logs, then start looking at fun architectures like event sourcing, you start realizing streaming analysis is a pretty powerful way of thinking.
I’ve extracted insight from millions of log records per second on a single node with a similar setup, in real time, with much room to scale. The key to scaling log analysis is to get past textual parsing, which means using something structured, which usually negates the reason you were using ElasticSearch in the first place.
Google’s first Dataflow talk from I/O and the paper should give you an idea of what log analysis can look like when you get past the full text indexing mindset. Note that there’s nothing wrong with ELK, but you will run into scaling difficulty far sooner than you’d expect trying to index every log event as it comes. It’s also bad to discover this when you get slashdotted, and your ES cluster whimpers trying to keep up. One thing streaming often gets you is latency in that situation instead of death, since you’re probably building atop a distributed log (and falling behind is less bad than falling over).
The key here is: are you looking for specific log events or are you counting? You’re almost always counting. Why index individual events, then?
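As a toy illustration of the counting-in-the-stream idea (framework-free for brevity; in practice this shape sits on top of Flink/Beam/Kafka Streams and a distributed log):

    from collections import Counter

    WINDOW = 60  # seconds per tumbling window

    def count_by_window(events):
        # events: iterable of (unix_timestamp, parsed_fields),
        # assumed roughly time-ordered, as they would arrive off a stream.
        current, counts = None, Counter()
        for ts, fields in events:
            w = int(ts) // WINDOW
            if current is not None and w != current:
                yield current * WINDOW, dict(counts)   # emit the finished window
                counts = Counter()
            current = w
            counts[fields.get("status", "unknown")] += 1
        if current is not None:
            yield current * WINDOW, dict(counts)

    sample = [(0, {"status": "200"}), (30, {"status": "500"}), (75, {"status": "200"})]
    for window_start, counts in count_by_window(sample):
        print(window_start, counts)   # counts per minute, no per-event index anywhere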
I don't think this is "always" true. The power of having the data in records is that you can trivially pivot and slice and dice the data in different ways. When it's aggregated to a count - I don't have that ability. When trying to debug something that's happening, I find it far easier to have the entire records.
As for scaling, it scales very well (you can read through the Elastic blog/use cases to see plenty of examples). That's not to say there aren't levels of scale it won't handle. But I would venture to say that for 99% of the people out there, it will solve their problems very well.
Would it be accurate to say that your ability to count something new, starting from a previous point in time (i.e. not just from 'now') is dependent upon how easily you can replay your stream (I'm thinking in Kafka terms); or is there something in your architecture that allows you to consume from the stream from a (relative? absolute?) point in time (again, Kafka thinking leaking into my question)?
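(For context, the Kafka mechanism I have in mind is seeking by timestamp; broker and topic names here are placeholders:)

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    tp = TopicPartition("logs", 0)
    consumer.assign([tp])

    # Find the offset closest to an absolute point in time and rewind there.
    start_ms = 1510000000 * 1000
    offsets = consumer.offsets_for_times({tp: start_ms})
    if offsets[tp] is not None:
        consumer.seek(tp, offsets[tp].offset)

    for record in consumer:
        pass  # re-run the streaming computation over the replayed records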