The distinct on the date format is the problem. You should probably use a separate query on each table to fill in the date output as a new column first, and then join afterwards.
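Something along these lines, as an untested sketch using the schema from the post:

-- compute the week string in a derived table first, then join
select distinct pc.project_id, c.week_commit
from project_commits pc
join (
    select id, date_format(created_at, '%x%v1') as week_commit
    from commits
) c on c.id = pc.commit_id;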
"Wrong tool for the job" also comes to mind. MariaDB/MySQL are just not great at joins and query planning, and at this scale, especially with all the work involved in exporting and using Unix tools, why not use one of the columnar data warehouses that can handle this much faster?
Export it to BigQuery with CSVs and the query will probably finish in 30 seconds.
On PostgreSQL I've also had a really positive experience with cstore_fdw from Citus. It's an extension you have to compile yourself, but you end up with the ability to create compressed columnar tables.
It trimmed some heavy queries I was working with from about 4 hours to about 3 minutes, and compressed the data from 220 GB to about 20 GB.
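For reference, the setup is roughly this once the extension is built (a sketch from memory, with hypothetical table and column names):

-- load the extension and define a server for it
create extension cstore_fdw;
create server cstore_server foreign data wrapper cstore_fdw;

-- a compressed, columnar table
create foreign table commits_cstore (
    id bigint,
    created_at timestamp
) server cstore_server
    options (compression 'pglz');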
Yes, it's great. Part of the issue is that PostgreSQL isn't great at parallelism yet, so optimizing the storage greatly reduces compute time to begin with.
Unfortunately that extension has a bunch of limitations and issues that keep it from being production-ready. PostgreSQL could really use a proper columnstore table implementation, and there's a pluggable storage API on the roadmap but it hasn't gotten much traction yet (and is focused on an in-memory engine first).
Well yes, like I said, it's not really a true columnstore. It's a basic storage extension that writes table data as ORC files for good compression and, in the best case, can manage to skip segments. It's missing all the fancy processing features that real columnar warehouses use, so it'll never be as fast.
I highly recommend ClickHouse for this. It is blazing fast and can do over 100 GB/s on a single modern machine if you have enough RAM. And it is very easy to install and configure.
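For anyone curious, a minimal sketch of what that can look like (hypothetical schema mirroring the post's tables; ClickHouse speaks a SQL dialect):

-- MergeTree is the workhorse engine; data is stored column-wise, sorted by the order-by key
create table commits (id UInt64, created_at DateTime) engine = MergeTree order by id;
create table project_commits (project_id UInt32, commit_id UInt64) engine = MergeTree order by commit_id;

select project_id, toStartOfWeek(created_at) as week, count()
from project_commits
inner join commits on commits.id = project_commits.commit_id
group by project_id, week;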
On a single node, yes; unfortunately not beyond that. I really wish there were a bigger focus on operational features, and a bigger community, for a rather fantastic product.
This was my exact reaction to the OP's comment as well.
CH is an amazing piece of tech, but getting configuration just right on many nodes is quite complex, and all of the experts writing about how to do so are writing in Russian.
The documentation seems to be improving rapidly though.
The post references a StackExchange question [1] which has a simple analysis and four helpful suggestions (similar to yours), about which the author states:
"I tried the first suggestion, but the results weren't promising. As experimenting with each suggestion could easily take at least half a day, I proceeded with a way I knew would work efficiently and reliably"
Correct. When I see developers using 'distinct' and complaining about query performance, it's almost always the developer doing something incorrectly rather than an issue with the database.
Well it's hardly surprising given the emphasis on CS trivia and fad-devops that exists in this industry. Basics like this are often neglected because employees are too busy chasing Kafka and Kubernetes or re-implementing skip lists on whiteboards. After all, if you can recall your textbook CS surely you can just re-invent the thousands of PhD and engineering hours that went into this stuff previously on an ad hoc basis for our projects and scale it with K8s.
And at the risk of sounding like a clown, I'll go ahead and ask: ETL seems to be such a basic and generic thing that it's useless as a concept. What did those PhDs achieve in those thousands of hours that the author missed?
It's also a negative side-effect of the improvement of software and automation in this space.
In ye olden times, there would be a grumpy DBA in this story terminating the query and yelling at the user. Many, many people are consuming databases without understanding how to approach SQL optimization, because nobody is forcing them to!
> DBAs playing hardass without even trying to understand the need.
Although the parent posed a caricature with "yelling", this is an even more extreme one.
The "forcing" function merely requires a bit of "hardass" but not the willful ignorance you associate with it. At the risk of going true-scotsman, I believe that competent DBAs capable of the forcing function would already understand the need without trying.
As such, I don't think there needs to be any kind of "medium" between two extremes of ignorance, but, rather, back to the GP's point, a reduction of ignorance in "what we have now".
Of course, I don't believe hardass/yelling is the best way to achieve that, either, instead favoring (original) Devops culture, but that also doesn't work if Ops (including DBA) skills aren't recognized as valuable any more (and "Devops" just means "replace Ops with Devs").
Ah, yes. I interned in the Database Administration team of a Fortune 500 company. I definitely learned a lot as a developer under the tutelage of some old school mainframers (I was primarily targeting DB2 on OS/390).
I learned a lot about query optimization in my 3 years as an intern. I also caused a few issues along the way, including a number of problem reports to IBM. Some were benign, like incorrect documentation. Others were more serious, ranging from semantically identical queries returning different results (e.g. coalesce vs. case-when-null in a column spec) to knocking out the entire development LPAR on the mainframe with a simple select statement (it hit a bug in the query optimizer and crashed not just the DB, but the whole LPAR).
Still, those 3 years were invaluable to me. They developed both an intuition for how to write optimal queries and the knowledge of how to use the tools to identify, measure, and fix problems. It has served me well as a developer in the 15 years since I left my internship. I don't get into fights with DBAs, since they recognize I know as much as or more than them about optimizing queries on the datasets I work with.
As someone who doesn't know much about ETL, I don't really see why it's worth mentioning in this context.
Yes, what the OP is doing can be called extract/transform/load, but that's an extremely general set of operations. Do you have some specific tool in mind that would make solving their problem a breeze? Is there a specific process that one should follow for ETL which was ignored?
Note: I'm just a backend dev, but I've been looking into data engineering a bit recently so I'm not terribly experienced. Happy to be corrected
The problem here is that they're using a system designed for the typical webapp use case (OLTP). It's optimized to find a particular row or rows, and maybe update them, with low latency.
ETL tools are more optimized towards this sort of bulk data processing (OLAP) where we want to scan all the rows for a particular column and run some sort of analysis.
A simple optimization some databases make for this is to orient the data so that each whole column is stored together rather than storing each row together. This makes it much quicker to run queries over an entire column or two in a table.
Modern ETL is generally aimed at working with very large amounts of data, so there are many tools optimized for exactly this kind of bulk processing.
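To make the column-orientation point concrete in this thread's own ecosystem: MariaDB ships a columnar engine, ColumnStore, so (assuming it's installed) a column-oriented copy of a table is roughly:

-- rows are decomposed so each column's values are stored together
create table commits_columnar (
    id bigint,
    created_at datetime
) engine=ColumnStore;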
The number of times I reinvented something due to lack of knowledge, assistance, or training, or from being thrown into a tech stack I was unfamiliar with, etc., was quite maddening. I'm pretty sure that's how most CS courses are designed, really: "Figure this out. Did you get it? Great. That will never be useful, because you can just do this now." Teaching people to think, blah blah. Then complain when they don't know.
Code beauty is subjective. If the next developer that inherits your code yells “WTF!” then incredulously tells all the other devs what you did, then it’s hacky. Yes, invoking shell commands in a db would likely cause that.
I've run into situations like this where I was able to use a window function or a materialized view to get huge speed gains. It basically amounted to restructuring the tables into a new view, much like the shell script did in this case, and that view could then be optimized much more effectively by the query planner.
One was with an internal database we used for tracking employee time and generating invoices and reports. I forget the exact details but this technique took it down from an hour to a second, or similar.
Another time was more recently, in an interview, using the MySQL employees sample database. I had a naive query that did most of what was needed, but after quite a bit more work I was able to make a materialized view that took the query from using 64+ GB of RAM and dying from OOM after 4 hours to running in 45 minutes in 2 GB of RAM.
It usually takes me hours and hours to put together, though, because I do it so infrequently.
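For what it's worth, MySQL/MariaDB has no native materialized views; the usual emulation is a summary table you refresh yourself. A rough sketch of the trick against the employees sample schema (hypothetical, not the actual interview query):

-- build the "materialized view" once, with a key for later joins
create table max_salary_by_emp as
select emp_no, max(salary) as max_salary
from salaries
group by emp_no;

alter table max_salary_by_emp add primary key (emp_no);

-- refresh when the base data changes
truncate max_salary_by_emp;
insert into max_salary_by_emp
select emp_no, max(salary)
from salaries
group by emp_no;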
Isn't the Unix join essentially doing an inner join? Why is the original SQL doing a left join while the text processing with Unix join is doing an inner join?
Yes, the left join will scan the whole table. A left join means returning all the rows from the left-hand table, so it won't use the index on that side.
An inner join with an index is way faster than a left join.
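Side by side, with the post's schema (untested sketch):

-- left join: must keep every project_commits row, match or not
select pc.project_id, c.created_at
from project_commits pc
left join commits c on c.id = pc.commit_id;

-- inner join: only matching rows, which also frees the planner to reorder the tables
select pc.project_id, c.created_at
from project_commits pc
join commits c on c.id = pc.commit_id;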
If you find you're exporting and running via `uniq` or the like, `create temporary table as` is probably what you want. If you think about it, it's essentially what you're doing, without the additional filesystem involvement.
The optimizer should be handling this for you, though for very large datasets, constructing a new table from a query essentially avoids the locking issues you might otherwise run into.
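i.e., something along these lines (untested sketch against the post's schema):

-- stage the per-commit week inside the database instead of in text files
create temporary table commit_weeks as
select id, date_format(created_at, '%x%v1') as week_commit
from commits;

alter table commit_weeks add index (id);

select distinct pc.project_id, cw.week_commit
from project_commits pc
join commit_weeks cw on cw.id = pc.commit_id;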
I'd also like to see a breakdown of the time taken to complete the project.
RAM is cheap, and jumping from 16 GB to 64 GB (or even 128 GB) might have cost as much as the analysis time did.
The only clue that a memory upgrade might have been a quicker fix is that the merged file was 133 GB. Seems to me that an upgrade to 128 GB of RAM might have led to a dramatically shorter query execution time.
If you need to add hardware to solve a software optimization problem such as this, you've failed your job as a software engineer (if you should even be called an engineer, nay, a developer).
I'll probably get downvoted for this, but you should examine opportunities to optimize your SQL and other storage patterns before you assume anything else when your indicators say your DB is slow.
> If you need to add hardware to solve a software optimization problem such as this, you've failed your job as a software engineer
That's an interestingly narrow viewpoint with which I, narrowly, agree.
I suspect that the narrow viewpoint is just a form of the aphorism "if all you have is a hammer, everything looks like a nail."
> (if you should even be called an engineer, nay, a developer)
However, in the broader context of engineering as problem solving (or developing a product), I disagree.
Not all (computer) problems are best solved with software.
As a sysadmin (who can code but doesn't love to), I routinely struggle with managers who don't have an intuitive sense of even a first approximation of what modern hardware is capable of, because their entire background is programming.
The fact that (latency) "numbers every programmer should know" exists (and has for quite some time) is fairly telling. What's more telling is that there are no dollar values attached to any of them.
The use case is pretty straightforward; TFA says that the referenced columns are both indexed and the runtime is a function of how MariaDB executes that simple query. Do you have anything concrete to offer in terms of optimizing "storage patterns" in this case that the OP has missed?
I'm all for optimizing for minimal resource usage, but if plugging $200 worth of RAM into the machine allowed the original query to run in an acceptable timeframe, that may well easily offset the hours of analysis and transformation that resulted in the shorter runtime. Unless, of course, we're more concerned with engaging in a performative defense of our arbitrarily self-bestowed credentials as "engineers".
But you've succeeded in your job in being a practical person trying to get something done.
I would love to know where you work, where "let me spend some time on optimizing this query" could be justified over "let's put in a few more sticks of RAM". Everywhere I've worked, developer time is much more valuable and much more costly than a few RAM sticks (assuming it's just a server or two, and not N instances).
I think the date_format call forces a table scan; the query could be rewritten to be much faster. Optimizing the SQL statement seems less error-prone than extracting the data to text files.
Or precompute the date on insert. There are a dozen ways to refactor the data model and the query, all of which are better than extracting to txt files, a format with no schema and not based on any standard. That's the direction I would give if a junior dev were attempting this method.
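For instance, a stored generated column plus an index is one way to precompute it (assuming MariaDB 10.2+/MySQL 5.7+, and that your version accepts date_format in a generated column; otherwise populate a plain column from a trigger):

alter table commits
    add column week_commit char(7)
        as (date_format(created_at, '%x%v1')) stored;

alter table commits
    add index idx_commits_week (week_commit);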
As far as I understand, 'project_commits' and 'commits' have a one-to-one relationship (project_commits.commit_id = commits.id). From my point of view it's a strange design, since they could just add a 'project_id' column to the 'commits' table; the 'project_commits' table seems redundant here.
I'd write this query:
select pc.project_id, date_format(c.created_at, '%x%v1'), count(*)
from commits c
join project_commits pc
on c.id = pc.commit_id
group by pc.project_id, date_format(c.created_at, '%x%v1')
And I'd use a left join only at the stage where the 'projects' dictionary needs to be joined with the result of this query.
MariaDB can do sort/merge joins. It's rare that you want to force one, but you can. That's what IGNORE INDEX is for.
If you're testing, take a sample of the database and test on that.
First try the simple form:
explain
select
project_commits.project_id,
date_format(commits.created_at, '%x%v1') as week_commit
from commits, project_commits
where project_commits.commit_id = commits.id;
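Then, if the planner still insists on the per-row index lookups, put the hint on the table whose index you want skipped, e.g. (untested):

explain
select
    project_commits.project_id,
    date_format(commits.created_at, '%x%v1') as week_commit
from project_commits
join commits ignore index (primary)
    on project_commits.commit_id = commits.id;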
The query is simply very badly written. Who makes a left join from a 5-billion-row table to a smaller table and at the same time expects a distinct over the "left" table? If the distinct is what solves the problem, without the join needing to match, then why have a join in the first place?
I once had a huge performance problem for a simple self-join that I expected would run quickly.
I talked to our resident SQL expert (retired from IBM, now working as the lab's administrative assistant). She suggested sorting the data before inserting it.
That sped up the query.
It still bothers me that data order matters so much. But I can't reproduce the problem in MySQL anymore anyway.
This was a full self-join (SELECT * FROM blah a JOIN blah b ON a.id = b.id). I ran strace on it: for each a row, it was doing a seek to the matching b row and reading a whole page. By pre-sorting, I increased locality.
It's very possible. I was running the query ~14 years ago, and it wouldn't surprise me if any number of improved condition-pushdown mechanisms have been implemented since.
After reading the post, and considering the time it takes to export/import the data, I think it would be better to denormalize the data (i.e., precalculate the result).
I did something like this in the recent past, but with much heavier calculations. Denormalizing using triggers turned a query that took minutes into one that runs in less than a second.
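The shape of it, as a rough sketch (hypothetical summary table and trigger, with invented names):

-- precomputed weekly counts, maintained as rows arrive
create table weekly_commit_counts (
    project_id int not null,
    week_commit char(7) not null,
    n int not null default 0,
    primary key (project_id, week_commit)
);

create trigger project_commits_ai
after insert on project_commits for each row
    insert into weekly_commit_counts (project_id, week_commit, n)
    select new.project_id, date_format(c.created_at, '%x%v1'), 1
    from commits c
    where c.id = new.commit_id
    on duplicate key update n = n + 1;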
How long does it take to run the distinct on its own?
I'd be interested to know what happens if the distinct is in a subquery and the result is then joined with the commits table.
Ensure project_commits.project_id is in an index before running the test.
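i.e., something like this (sketch, untested):

select distinct pc.project_id, date_format(c.created_at, '%x%v1') as week_commit
from (select distinct project_id, commit_id from project_commits) pc
join commits c on c.id = pc.commit_id;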
Since there's nothing limiting the set of rows from project_commits, it might as well table-scan it. The primary key on commits will be used for each left join lookup.
Nested loops aren't that different, performance-wise, from sorted merge joins. Sorting takes O(n log n), whereas the nested loop does n lookups, each taking O(log n), for a similar O(n log n). Memory allowing, a hash join has more potential for speedup.
There should be a locality win from a sorted merge: depending on the correlation between scan order and foreign-key order, the index lookups in the nested loop may be all over the place. Usually this doesn't matter much, because you don't normally do 5+ billion row joins.
Postgres won't use an index if the statistics suggest doing so would be slower than a full table scan; I'd expect MySQL to be similar.
With 5bn+ rows and a memory constraint, the type of index begins to make a difference; e.g., in Postgres I would have tried using a bloom index.
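For reference, that looks like this in Postgres (sketch; bloom ships as a contrib extension, and its stock operator classes cover int4 and text, so this assumes an int4 key):

create extension if not exists bloom;

-- lossy but compact: candidate matches get rechecked against the heap
create index commits_id_bloom on commits using bloom (id);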
It can be very frustrating that, even with updated statistics, the optimizer decides to go a different way with a query. When I was using Sybase, this happened quite a lot once tables got rather large (corporation large, not Google large). The normal response is to force the indexes on the query. Due to log-space issues, I do remember having an awk script between queries back in the Ingres days.
Only for every returned row. The projection does not affect the underlying result set; it merely "shapes" the results you receive. WHERE clauses are always run before the SELECT projection.
Reading the article, the database did use an index for the lookup. However, if that index lookup is hitting spinning rust, then it can take 10ms or so per entry, which adds up.
In contrast, if you have sorted/indexed both tables on the join key, then the database is able to do a merge join. This is effectively what the command line implementation did, and is much faster, assuming you can get hold of sorted data quickly to begin with. If the tables are unsorted, then the extra cost of sorting them first needs to be added.
Most databases will evaluate several methods of running the query. Something like Postgres EXPLAIN will provide details of the method being used.
MySQL used the index on the joined table, but didn't use an index on the primary table because there isn't anything to filter on. The join condition would still cause one table to require a table scan to gather all of the records.
Possibly. The author at least knew enough to observe "Both join fields are indexed. However, MariaDB implements the join with a full scan of project_commits and an index lookup on commits."
That may have actually been a result of the ordering and poor query design.
Given that commits.id is a primary key, IIRC it must also be NOT NULL; thus the LEFT JOIN is uselessly equivalent to an inner join, which most database engines seem to prefer expressed as a WHERE clause (since that makes it trivial to re-order the operations).
Were it written as an inner-join-equivalent WHERE clause, or with the smaller table specified first, the full-table scan should at least have been on the smaller table.
Your database storage engine of choice might differ or have other options (e.g. in PostgreSQL, an index on the result of comparing the two source keys could be created, and a carefully structured query written to use /that/... I think. I haven't tested it, but it'd at least be worth the experiment).
In the original case, pretty much any real (as in NOT My/Maria) database would've worked (knowing how to use it would've helped, of course), but for columnar analytics Monet is pretty nice, too, and ever so slightly cheaper than Vertica.