The new sequence IDs are the most interesting things in this release I think. Having an officially supported cross datacenter replication strategy would be real nice.
Lots of folks will be mad, but removing multiple mapping types is a nice change too. It was a feature that never really made sense. Index-per-type was always the better strategy, even going back to the 0.23 days.
As others in this thread will no doubt point out though - the ES folks are moving awfully fast. I still support 1.7.5 clusters that will take heroic efforts to update. I'd love to use the new features on those clusters, but there simply isn't a valid business case to take on the risk. This isn't like small NPM packages that you can update on a whim - some of these systems require terabytes of re-indexing to upgrade :/
Cross data center replication is really a much needed feature.
The way Elasticsearch is going looks promising, though. With sorted indices, a single mapping type, and the other changes, we might give it another try after having switched to Algolia.
Is there a safe way now to query Elasticsearch directly without the need to go via proxy scripts on the server? This just adds so much overhead to the queries compared to Algolia.
Jason from Elastic here. We are actively developing cross datacenter replication (internally we are calling it "cross cluster replication", so you will likely see it referred to that way in the future, though of course this is subject to change).
I cannot give a timeframe, but it is one of the top features on the ES roadmap. :)
> Cross data center replication is really a much needed feature.
This works pretty well already if you are running on your own hardware and have a good network. We've been running a three data center setup across the US for four years. Next year we may extend it across the Atlantic.
You have always been able to use the query DSL to write queries, aggregations, etc. If you're referring to scripting, then yes, in 5.0 the "Painless" scripting language was introduced, which lets you send scripted queries via an API request without having to store them on the server. The language was designed to rule out the kinds of exploits that running scripts in other languages exposed Elasticsearch to.
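For example, a scripted query can be sent inline with the request, something like this (a rough sketch using the official Python client; the index and field names are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a node on localhost:9200

# Inline Painless script query: no server-side stored script required.
resp = es.search(index="products", body={
    "query": {
        "script": {
            "script": {
                "lang": "painless",  # the default language since 5.0
                "source": "doc['price'].value < params.max_price",
                "params": {"max_price": 50},
            }
        }
    }
})
print(resp["hits"]["total"])
```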
As in a publicly facing front end? If that's your case, you wouldn't ever want to expose Elasticsearch to your front end directly. If you have a private front end that is inside your firewall then just create HTTP requests to Elasticsearch - it has a RESTful API.
But querying from a publicly facing front end would be a poor idea - would you expose a database directly to the front end?
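(For the private-front-end case, a query really is just an HTTP request - a minimal sketch with python-requests, hypothetical index name:)

```python
import requests

# Plain HTTP search against the RESTful API; no client library needed.
resp = requests.get(
    "http://localhost:9200/articles/_search",
    json={"query": {"match": {"title": "elasticsearch"}}},
)
print(resp.json()["hits"]["total"])
```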
It is called Elasticsearch and not Elasticdatabase; at one point it sounded like a good idea to jump on the NoSQL bandwagon.
Calling the index directly from the frontend is a fantastic idea, and it could be made safe with a read-only kind of index or an API key with read-only scope.
The current design, with an unnecessary security layer outside of Elasticsearch, is a poor idea that adds too much administrative overhead and ridiculous latency.
They have what you are looking for with their X-Pack Security add-on, which requires a license, though under very favourable terms compared to others...
I wish after 1.x they had maintained backwards compatibility with upgrades as they had kind of promised... there is just so much money in making it hard to upgrade ;)
That consulting arm of Elastic needs something to do :)
More charitably, I can understand why they felt the need to make a hard break, but difficult upgrades plus fast release cycle means a fair bit of friction.
The no downtime upgrades will be nice, but on a big production system, I wouldn't feel comfortable upgrading major versions every 6 months.
They are supposed to implement it using a join field - but it's not as nice as what exists now (AFAIK you can't have multiple join types; e.g., I can't have grandchildren or two different kinds of children). It's really unfortunate, because the issues they had with multiple types could have been handled more elegantly while still keeping the immense power of parent-child fields.
It feels, from the discussion on the GitHub issues, that this was something that had caused some Elastic developers pain, and they decided to kill it.
The 6.x way is maybe better than the 5.x way - but there are problems with it, and they could have done 6.x better than they did.
There were two fundamental problems: 1) types had different mappings, which was confusing since internally it's the same index and there is only one mapping; 2) for the use case where you have one type per index, you still had to arbitrarily create a "type".
It could have been done by:
1) making the mapping definition only at the index level - there's no such thing as a mapping for a type (this is how it works internally anyway);
2) making a "type" field optional and specified via a query string instead of the URL path. This would have left all the internals alone. E.g., there could still have been an internal meta "_type" field with a default value of "default" or something. Those people who needed multiple types, specifically for more complex parent-child setups, could still have used them.
The current approach is far more complicated, because they have to change the internals to support both the new and old way during the transition, and deal with a lot of internal breakage because everything expects a "_type". You can check out the GitHub issues and see the work involved.
I detailed some of my issues/suggestions at https://discuss.elastic.co/t/parent-child-and-elastic-6/8572... but to no avail. After I looked through some of the GitHub issues, I realized how long this had been in the pipeline and how much inertia there was in that direction.
My biggest gripe with the 6.0.0 GA is the removal of multiple mapping types per index. This is a significant breaking change that will hurt community tools' adoption of 6.0. Their initial plan was to deprecate it and only remove it from v7 onwards, which imho would have been a better balance.
The removal of mapping types really kicks the bottom out of the app we're making; it's some serious docker-style "break all the things" behaviour. Seriously losing trust on this one.
Are there any good alternatives to ES? Has Solr moved forward?
If it helps at all, Kibana also had to be updated to go from multiple types to a single type. It was a big project, and a bunch of approaches were explored for dealing with existing data (which I assume you are).
Ultimately, we settled on continuing to have different "types" in Kibana, but we treated them as data concerns rather than architectural concerns of Elasticsearch. At a high level, this meant that we added a new "type" field to documents to track the type itself, and then we prefixed fields with the type as well to preserve the ability to use the same userland id on different "types" in the index and such. The type/id prefixing thing doesn't get exposed beyond the layer that queries Elasticsearch for the kibana index.
Once that change was ready in the code, we also had to consider the need for users to migrate their existing multiple-type index to this new format. The upgrade assistant in x-pack basic handles all of this automatically, but folks can certainly repurpose the same reindexing operation we perform on .kibana for their own indices.
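A rough sketch of that kind of reindex (this is not the actual upgrade-assistant code; index names are invented) - move _type into a document field, prefix the id with the type, and collapse everything onto a single type:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Copy a multi-type 5.x index into a single-type index, preserving the
# old type as ordinary document data and avoiding id collisions.
es.reindex(body={
    "source": {"index": "old-multitype-index"},
    "dest": {"index": "new-single-type-index"},
    "script": {
        "lang": "painless",
        "source": (
            "ctx._source.type = ctx._type;"          # keep old type as data
            "ctx._id = ctx._type + '-' + ctx._id;"   # prefix id with type
            "ctx._type = 'doc';"                     # single mapping type
        ),
    },
})
```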
I also really don't like this change, but it's worth understanding exactly how it affects you.
If you just want to be able to store different types in an index - you still can; it's just that the type will not be stored in the meta "_type" field but in a document-level "type" (or whatever) field. Of course, the API does change since the type is no longer in the URL, and if you're creating custom doc ids you'll probably have to include the type in them (comment-123, post-123). So it's annoying, but I think mostly something that can be worked around.
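Something like this, concretely (a sketch; index, ids, and fields are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# One real mapping type; the "type" lives in the id and in a normal field.
es.index(index="content", doc_type="doc", id="comment-123",
         body={"type": "comment", "post_id": 123, "text": "..."})

# Filter on the document-level type field at query time.
resp = es.search(index="content", body={
    "query": {"bool": {"filter": [{"term": {"type": "comment"}}]}}
})
```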
If you're using multiple types for parent-child, the situation is more bleak. They are still going to have a "join field" but there can only be one type of relation. While often this is ok, there are definitely reasonable use cases where it's not.
Currently the traditional parent-child hasn't been removed from the core because they need to support 5.x indices. But it will be phased out unless there is a big uproar.
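For reference, the 6.x join field is declared in the mapping roughly like this (a sketch; index and field names invented):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# A single join field defining one parent->child relation.
es.indices.create(index="qa", body={
    "mappings": {
        "doc": {
            "properties": {
                "qa_join": {
                    "type": "join",
                    "relations": {"question": "answer"},
                }
            }
        }
    }
})
```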
Can you not use a separate index for each mapping type and just combine the indexes by aliasing them all to one "virtual" index? I think that would effectively be the same but I may be forgetting something.
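Something like this, if I follow (sketch, hypothetical index names):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Point one alias at several per-type indices...
es.indices.update_aliases(body={
    "actions": [
        {"add": {"index": "posts", "alias": "content"}},
        {"add": {"index": "comments", "alias": "content"}},
    ]
})

# ...and search the alias as if it were one index.
resp = es.search(index="content", body={"query": {"match_all": {}}})
```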
Oh, good, instead of getting my app finished this month, I guess I get to screw around with my ES indices and make no real progress on anything. Should have gone with Solr...
Ah, but if I don't go through the pain now, I will get to learn an entirely new definition of pain and suffering later on when I (hopefully) have a huge ES search index in production.
You can basically use 5.x in the same way as 6 (one "type" per index, only one parent-child relation). If you're not doing it that way now, you can start slowly while you're on 5.x and gradually migrate your methodology to the 6.x way (if it's even that different). Then you can move to 6 with little change, if you even need to upgrade to 6.
> The removal of mapping types really kicks the bottom out of the app we're making
Not to rub salt in the wound, but building a product around a feature provided by a single vendor (ie, not a standard feature or something developed in-house) means that you've just committed to maintaining that software or paying someone else good money to do so.
Anybody who's built products around eg Oracle has learned this at least twice!
I'm doing a migration ES => SolrCloud right now, and it kinda works, but it's not great. The JSON support is really a leaky mapping over XML and you have to tune the configs yourself to make it work, the SolrJ Java client is a piece of crap, and some things aren't documented well.
But apparently migrating is still simpler than making a BigCorp pay for the ES security package.
Yes, I believe Solr is better than when I used it 4 years ago - partial JSON API support, for example - but I don't think it has the analytics aggregations that ES gives us.
Yeah, we had the same problems with ES. I do think we "overused" it a bit by often using it as a secondary index/viewmodel store of sorts. In hindsight that probably wasn't the smartest thing to do. If we had only used it as a search index we would have been better off.
They haven't changed their plan: mapping types per index are only unsupported for new indices, old ones that are migrated will continue to work, until 7.
Currently, only read support is provided for indexes created in 5.x. As someone who used one of the pre-GA releases, I can say they previously had a flag to enable multiple mapping types.
ElasticSearch was never really meant for log storage anyway. It's a full-text engine, and just happened to work reasonably well for that purpose at lower volume. ELK ran with it in an attempt to go after Splunk, but it is phenomenally difficult to scale an indexing pipeline like ELK to high volume. There are far better ways to handle log analysis, particularly when your primary query is counting things that happen over time rather than finding log entries that match a query (and it almost always is) - streaming analysis is a much better fit than indexing, just lesser known.
As someone who knows of several places using it for multi-petabyte log retention and log analysis, what is Elastic used for, if it's not good at log indexing and text search? If I need to find logs based on one snippet of data, then pivot on different fields to analyze events, what should I be using?
> ElasticSearch was never really meant for log storage anyway.
Indeed, it wasn't designed for log storage, though it happened to match this use scenario well (now less so with every release).
> There are far better ways to handle log analysis, particularly when your primary query is counting things that happen over T instead of finding log entries that match a query (which it always is)
Oh? This is the first time I hear that my use case (storing logs from syslog for diagnostics at a later time) counts things over time. Good to know. I may ask you later for more insight about my environment.
> streaming analysis is a much better fit than indexing, just lesser known.
Well, I do this, too, and not only from logs. I still need arbitrary term search for logs.
The snark is totally unnecessary, since the vast majority of people deploy ELK to do reporting. Full term search is achievable with grep; what does ES give you for your non-reporting use case, since troubleshooting is an extremely low frequency event? Are you primarily leaning on its clustering and replication? Genuinely curious.
The overhead per log record, building multiple indexes at log-line rate - there are just so many reasons not to do your use case in ES that I don't even think about it. I think it's a poorer fit than reporting, to be honest.
ELK > grep for searching. As the other poster said, per-field filtering and rapid pivoting is a MUCH more effective workflow than grepping for string fragments and hoping they match the proper field in a syslog message.
And you keep talking about how much you know and how ELK is literally worse than grep for searching off fields in logs for troubleshooting, but offer no alternative setups or use cases. You're hand-waving.
I've seen some of the performance issues of ELK at scale, and I'd be interested in what's out there, because it's not my area of expertise. But you are just yelling "dataflow" and "streaming analytics".
> The snark is totally unnecessary, since the vast majority of people deploy ELK to do reporting.
You shouldn't have used such an authoritative universal quantifier. There are plenty of sysadmins who use ES for this case; you apparently just happened to only be exposed to using it with websites.
Then, what do ES+Kibana give me over grep? Search over specific fields (my logs are parsed into a proper data structure), including the type of event (obviously, different types for different daemons), a query language, and a frontend with histograms.
Mind you, troubleshooting around a specific event is but one of the things sysadmins do with logs. There are also other uses, all landing in the realm of post-hoc analysis.
Kibana and histograms are reporting. Now the snark is even more confusing, since you’re doing exactly what I say is a poor fit, but claiming it’s not your use case. I spend what time I can trying to show those very same sysadmins you’re talking about why ES is a poor architecture for log work, particularly at scale.
As an SRE, I’ve built high volume log processing at every employer in multiple verticals, including web. I know what sysadmins do. Not a fan of the condescension and assumptions you’re making. I have an opinion. We differ. That’s fine. Let it be fine.
> Kibana and histograms are reporting. [...] you're doing exactly what I say is a poor fit, but claiming it's not your use case.
You must be from the species that can predict each and every report before it's needed. Good for you.
Also, I didn't claim that I don't use reports known in advance; I do use them. But there are cases when preparing such a report just to see one trend is overkill, and there's still troubleshooting, which is helped by the query language. Your defined-in-advance reports don't help with that.
> I spend what time I can trying to show those very same sysadmins you’re talking about why ES is a poor architecture for log work, particularly at scale.
OK. What works "particularly at scale", then?
Also, do you realize that "particularly at scale" is quite a rare setting, that the "a dozen or fewer gigabytes a day" scale is much, much more common, and that ES works (worked) reasonably well for that?
You should read the Dremel and Dataflow papers as examples of alternative approaches and dial down your sarcastic attitude by about four clicks. You don’t need to define reporting ahead of time when architected well; it’s quite possible to do ad-hoc and post-hoc without indexing per record. At small scale, your questions are quite infrequent and the corpus small, meaning waiting on a full scan isn’t the end of the world.
A dozen or less gigabytes a day means: use grep. This is just like throwing Hadoop at that log volume.
This was an opportunity to learn from someone with a different perspective, and I could learn something from yours, but instead, you’ve made me regret even saying anything. I’m sorry, I just can’t engage with you further.
(Edit: I’m genuinely mystified that discussing alternative architectures is somehow arrogant “pissing on” people. Why personalize this so much?)
So, basically, you have/had access to closed software designed specifically for working with system logs, and based on that you piss on everybody who uses what they have at hand on a smaller scale. Or at least this is how I see your comments here.
I may need to tone down my sarcasm, but likewise, you need to tone down your arrogance about working at Google or compatible.
But still, thank you for the search keyword ("dremel"). I certainly will read the paper (though I don't expect too many very specific ideas from a publication ten pages long), since I dislike the current landscape of only having ES, flat files, and paid solutions for storing logs at a rate of a few GB per day.
> A dozen or less gigabytes a day means: use grep. This is just like throwing Hadoop at that log volume.
No, not quite. I do also use grep and awk (and App::RecordStream) with that. I still want to have a query language for working with this data, especially if it is combined with an easily usable histogram plotter.
“Dataflow” and the open source ecosystem in that neighborhood (Flink, Spark, Beam, Kafka, that family of stuff) is a much more powerful way to look at logs in real time, rather than indexing them into storage and then querying. There just isn’t something off the shelf as easy to deploy as ElasticSearch with that architecture, that I’m aware of. (There should be!) When you jump the mental gap of events being identical to logs, then start looking at fun architectures like event sourcing, you start realizing streaming analysis is a pretty powerful way of thinking.
I’ve extracted insight from millions of log records per second on a single node with a similar setup, in real time, with much room to scale. The key to scaling log analysis is to get past textual parsing, which means using something structured, which usually negates the reason you were using ElasticSearch in the first place.
Google’s first Dataflow talk from I/O and the paper should give you an idea of what log analysis can look like when you get past the full text indexing mindset. Note that there’s nothing wrong with ELK, but you will run into scaling difficulty far sooner than you’d expect trying to index every log event as it comes. It’s also bad to discover this when you get slashdotted, and your ES cluster whimpers trying to keep up. One thing streaming often gets you is latency in that situation instead of death, since you’re probably building atop a distributed log (and falling behind is less bad than falling over).
The key here is: are you looking for specific log events or are you counting? You’re almost always counting. Why index individual events, then?
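To make the counting idea concrete, here's a toy sketch (kafka-python assumed installed; topic and field names are invented) of counting events per minute as they stream past, instead of indexing each one:

```python
import json
from collections import Counter

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer("app-logs", bootstrap_servers="localhost:9092")
counts = Counter()

for msg in consumer:
    event = json.loads(msg.value.decode("utf-8"))
    minute = event["timestamp"][:16]           # e.g. "2017-11-15T10:23"
    counts[(minute, event["status"])] += 1     # count it; don't index it
```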
I don't think this is "always" true. The power of having the data in records is that you can trivially pivot and slice and dice the data in different ways. When it's aggregated to a count - I don't have that ability. When trying to debug something that's happening, I find it far easier to have the entire records.
As for scaling, it scales very well (you can read through the Elastic blog/use cases to see plenty of examples). That's not to say there aren't levels of scale it won't handle. But I would venture to say that for 99% of the people out there, it will solve their problems very well.
Would it be accurate to say that your ability to count something new, starting from a previous point in time (i.e. not just from 'now') is dependent upon how easily you can replay your stream (I'm thinking in Kafka terms); or is there something in your architecture that allows you to consume from the stream from a (relative? absolute?) point in time (again, Kafka thinking leaking into my question)?
> The images are available in three different configurations or "flavors". The basic flavor, which is the default, ships with X-Pack Basic features pre-installed and automatically activated with a free licence. The platinum flavor features all X-Pack functionality under a 30-day trial licence. The oss flavor does not include X-Pack, and contains only open-source Elasticsearch.
The pre-installed X-Pack basic license sounds great as well: this at least allows one to use features like monitoring, dev tools and the upgrade assistant out of the box, without the 30-day trial kicking in as was the case for the 5.x Docker images.
I really wish they'd included the ElasticSearch SQL in 6.0, but I guess that feature still isn't fully baked.
Even if it didn't have the full power of the Elastic JSON queries, for a simple SELECT COUNT(*) ... GROUP BY it would have been a nice addition... oh well, back to counting open and closed brackets...
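For reference, the bracket-counting version of a simple GROUP BY looks something like this (a sketch; index and field names invented):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Roughly: SELECT status, COUNT(*) FROM requests GROUP BY status
resp = es.search(index="requests", body={
    "size": 0,  # we only want the aggregation, not the hits
    "aggs": {
        "by_status": {"terms": {"field": "status"}}
    }
})
for bucket in resp["aggregations"]["by_status"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```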
Agreed! Using something like elasticsearch-dsl for Python [1] makes writing queries a little more bearable, but it's not a good fit for manual ad-hoc querying (e.g. some analysis I'd be doing from Kibana).
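For comparison, the same kind of query via elasticsearch-dsl might look like this (a sketch; index and field names invented):

```python
from elasticsearch_dsl import Search
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=["localhost"])

# Filter to one endpoint, then bucket by status code.
s = Search(index="requests").query("match", path="/checkout")
s.aggs.bucket("by_status", "terms", field="status")
resp = s.execute()

for b in resp.aggregations.by_status.buckets:
    print(b.key, b.doc_count)
```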
The data is mainly used for dashboards (built on Highcharts), so the aggregation functions map to something called a "series", which is what you'd expect if you've ever used Highcharts. Anyway, I think it's quite cool how they did it.
We're just in the process of looking into ElasticSearch for an analytics use case; most of the queries we will be doing are simple group-by aggregations with counts/sums, etc.
A SQL interface would help a lot - even better if it came with a JDBC driver.
There is some ES integration in Apache Calcite, and it has a JDBC driver...
P.S. Writing the level of SQL-for-ES that you describe isn't very difficult - in my project we got a working implementation in 2-3 weeks. Take Calcite (recommended, but complex imho) or Facebook's Presto SQL parser (not recommended, but simpler).
It works fine in most cases - you can opt for CSV output and get a tabular format as well - but not all queries return what you'd expect; e.g., ordering combined with GROUP BYs can be challenging.
You should reach out to them or post on their forums - they are supportive, and I know ES is one of their priorities.
I'd be interested to hear about what worked and what didn't. It's also important to try the 1.2 version. I had played with 1.1 and there were problems (not failures - just inefficient ES queries).
In a large internal project we still make heavy use of Groovy for scripting. All scripting languages besides the new Painless language were deprecated in version 5, and now in version 6 they are removed. This hampers us in our migration efforts from 5 to 6. Does anyone know if it is still possible to use Groovy through a separate (third-party) plugin? I played around a bit with Painless, but can't say that I really like it. Documentation was/is kind of poor, and it seemed to me it somewhat assumed familiarity with Java and its frameworks/APIs.
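For what it's worth, many inline scripts port over almost mechanically; the main difference is that Painless makes you read parameters via params instead of binding them as bare variables. A sketch of one such port (field and parameter names invented):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Groovy era (2.x/5.x) script body: "doc['price'].value * factor"
# Painless port below: parameters must be accessed via params.*
resp = es.search(index="products", body={
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "script_score": {
                "script": {
                    "lang": "painless",
                    "source": "doc['price'].value * params.factor",
                    "params": {"factor": 1.2},
                }
            },
        }
    }
})
```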
They deprecated all scripting langs? OK, this is something new I learned in this thread. That is an extremely lame deprecation. Is ElasticSearch the new Apple? "Let's remove as much functionality as possible for the sake of being as minimal as possible."
Groovy is a fantastic language. It really is a hidden gem. It is my language of choice for cross-platform work, especially in enterprise.
Groovy was a security nightmare. The Groovy sandbox simply doesn't do enough to protect a server.
Our scripting language "Painless" is faster and more secure than anything we could achieve with Groovy, so in Elasticsearch 5.0 we made Painless the default and deprecated Groovy.
In 6.0, Groovy is gone.
We didn't do it to be minimalist, but we couldn't in good conscience continue to ship an insecure scripting language when we had an alternative.
We revisit the G1GC recommendation every once in a while. In fact, I am doing benchmarks and testing for G1GC versus CMS with Elasticsearch 6.0.0 right now, so that we have a better idea of where we stand.
Disclaimer: I'm an Elasticsearch dev employed by Elastic.
Cool, I've a pretty big cluster with some GC issues (p90 - 15s, p99 - 60s) during node failures, and would be super interested in those results! If there's anything a user can do to help, my email is on my user page :D
We observed in the past that long GCs are a cause of node failures. When a long GC happens, the node doesn't respond, and the master node decides that it has left the cluster :\
Ya, we often see a node die of natural causes, and then the garbage produced from recovering the node and relocating the data ends up bringing down the rest of the cluster via long GC pauses.
We run ES on G1GC, and have been doing so for the last 3 years. With heaps of more than 20 GB and a lot of churn, CMS just doesn't cut it. What helps is a high number of replicas to guard against any potential corruptions, and that we never treat ES as a primary store.
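(If anyone wants to experiment - at your own risk, since CMS is still the shipped default - the switch is just an edit to config/jvm.options; the three commented-out CMS lines below are what stock 6.0 ships with:)

```
## GC configuration in config/jvm.options: comment out the CMS defaults
#-XX:+UseConcMarkSweepGC
#-XX:CMSInitiatingOccupancyFraction=75
#-XX:+UseCMSInitiatingOccupancyOnly

## ...and enable G1 instead
-XX:+UseG1GC
```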
Lucidworks made Banana, a Kibana port for Solr, but it never took off like Kibana. I don't find the differences between Solr and Elasticsearch that big, and I always preferred the Solr way of configuring things to the JSON API of Elasticsearch. The Solr data-import-handler has also been a big timesaver.