How I slashed a SQL query's runtime with two Unix commands (spinellis.gr)
249 points by DSpinellis on Aug 6, 2018 | 83 comments



The distinct on the date format is the problem; they should probably use a separate query on each table to fill in the date output as a new column first, and then join afterwards.
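Something along these lines, for instance (a rough sketch reusing the table and column names from the article; the temporary table name is made up):

    create temporary table commit_weeks as
        select id, date_format(created_at, '%x%v1') as week_commit
        from commits;

    select distinct pc.project_id, cw.week_commit
    from project_commits pc
    join commit_weeks cw on cw.id = pc.commit_id;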

Also, "wrong tool for the job" comes to mind. MariaDB/MySQL are just not great at joins and query planning, and at this scale, especially with all the work involved to export and use Unix tools, why not use one of the columnar data warehouses that can handle this much faster?

Export it to BigQuery with CSVs and the query will probably finish in 30 seconds.


On PostgreSQL I've also had really positive experience with the cstore_fdw from Citus. It's an extension you have to compile yourself but you end up with the ability to create compressed columnar tables.

Trimmed some heavy queries that I was working with from about 4 hours to about 3 minutes. Also compressed the data from 220 GB to about 20 GB.
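For anyone curious, the setup is roughly this (a sketch; the table definition is invented, but the server/foreign-table boilerplate follows the cstore_fdw README):

    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE commits_columnar (
        id         bigint,
        created_at timestamp
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz');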


Yes, it's great. Part of the issue is that PostgreSQL isn't great at parallelism yet so optimizing the storage greatly reduces compute time to begin with.

Unfortunately that extension has a bunch of limitations and issues that keep it from being production-ready. PostgreSQL could really use a proper columnstore table implementation, and there's a pluggable storage API on the roadmap but it hasn't gotten much traction yet (and is focused on an in-memory engine first).


And it's... really not all that fast when compared to mature analytical databases. ClickHouse on identical hardware is ~ 100x faster than cstore_fdw.

http://tech.marksblogg.com/benchmarks.html

More interesting to me is the reverse: using FDW from the analytical DB to Postgres, e.g., https://aws.amazon.com/blogs/big-data/join-amazon-redshift-a...


This is also an older benchmark. I'd be curious to see this used against postgresql 11; there have been massive speed increases in 10 and now 11.

One of the most interesting things is you can shard postgres extremely easily now by using FDWs which can do native pushdown of optimizations.

Also, Citus itself has been improving in that time, and I'm curious to see how that performs.


Well yes, like I said, it's not really a true columnstore. It's a basic storage extension that writes table data as ORC files for good compression and in the best-case scenario can manage to filter out segments. It's missing all the fancy processing features that real columnar warehouses use, so it'll never be as fast.


This was doing some analysis on my laptop. What other free columnar options are out there for Postgres?

For the specific dataset I needed to take advantage of some of the additional datatypes that PG has available, which is how I found it.


I highly recommend Clickhouse for this. It is blazing fast and can do over 100GB/s on a single modern machine if you have enough RAM. And it is very easy to install and configure.


> And it is very easy to install and configure

On a single node, yeah; unfortunately not beyond that. I really wish there were a bigger focus on operational features, and a bigger community for a rather fantastic product.


This was my exact reaction to the OP's comment as well.

CH is an amazing piece of tech, but getting configuration just right on many nodes is quite complex, and all of the experts writing about how to do so are writing in Russian.

The documentation seems to be improving rapidly though.


+1 for ClickHouse. But it's the kind of software you need to figure out before you scale beyond a single node.

Fast as hell and surprisingly performant in low-memory conditions.


I also run into bad execution times on MySQL often enough, for queries that looked really simple to optimize.

The rewrite was easy and performance went through the roof.


I don't know MySQL/MariaDB that well, but I know that in MS SQL Server doing a DISTINCT on a Function([Column]) would not use an index on [Column].

In this case it would be much faster to have an intermediate table or column with that statistic that could then be properly indexed.
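In MariaDB terms that might look something like a persistent generated column (a sketch; I'm not sure DATE_FORMAT is accepted in generated columns on the version in question):

    ALTER TABLE commits
        ADD COLUMN commit_week CHAR(7)
            AS (DATE_FORMAT(created_at, '%x%v1')) PERSISTENT;

    CREATE INDEX idx_commits_commit_week ON commits (commit_week);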


It even has a name, SARGable: able to use an index on the [S]earch [ARG]uments. It's still easy to get tripped up, though.

https://www.definitions.net/definition/sargable

https://stackoverflow.com/questions/799584/what-makes-a-sql-...
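The usual illustration, on a hypothetical orders table with an index on created_at:

    -- not sargable: the function call hides the indexed column from the optimizer
    SELECT count(*) FROM orders WHERE YEAR(created_at) = 2018;

    -- sargable: a plain range predicate on the column can use the index
    SELECT count(*) FROM orders
    WHERE created_at >= '2018-01-01' AND created_at < '2019-01-01';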


Depending on the function. The query optimizer is smart enough to recognize many starts-with scenarios, as one example.


I'd like to see the runtime for the same query but (1) without 'distinct', (2) without the date formatting and (3) with an inner join.


My thoughts exactly.

The post references a StackExchange question [1] which has a simple analysis and four helpful suggestions (similar to yours), to which the author states:

"I tried the first suggestion, but the results weren't promising. As experimenting with each suggestion could easily take at least half a day, I proceeded with a way I knew would work efficiently and reliably"



Correct. When I see developers using 'distinct' and complaining about some query performance, it's almost always the developer doing it incorrectly rather than an issue with the database.


Am I missing something, or is this blog post just someone discovering that ETL exists?


Well, it's hardly surprising given the emphasis on CS trivia and fad devops in this industry. Basics like this are often neglected because employees are too busy chasing Kafka and Kubernetes or re-implementing skip lists on whiteboards. After all, if you can recall your textbook CS, surely you can just re-invent the thousands of PhD and engineering hours that went into this stuff, on an ad hoc basis for our projects, and scale it with K8s.


Methinks you're unfair to the author: https://en.m.wikipedia.org/wiki/Diomidis_Spinellis

And at the risk of sounding like a clown, I'll go ahead and ask: ETL seems to be basic and generic to the point that it's useless as a concept. What did those PhDs achieve in those thousands of hours that the author missed?


Lots of "modern" blog posts seem like people rediscovering that X exists.

A consequence of many people skipping traditional learning processes, and of developers now being forced into casting calls and portfolios.


It's also a negative side-effect of the improvement of software and automation in this space.

In ye olden times, there would be a grumpy DBA in this story terminating the query and yelling at the user. Many, many people are consuming databases without understanding how to approach SQL optimization, because nobody is forcing them to!


There's a happy middle ground between what we have now and those DBAs playing hardass without even trying to understand the need.


> DBAs playing hardass without even trying to understand the need.

Although the parent posed a caricature with "yelling", this is an even more extreme one.

The "forcing" function merely requires a bit of "hardass" but not the willful ignorance you associate with it. At the risk of going true-scotsman, I believe that competent DBAs capable of the forcing function would already understand the need without trying.

As such, I don't think there needs to be any kind of "medium" between two extremes of ignorance, but, rather, back to the GP's point, a reduction of ignorance in "what we have now".

Of course, I don't believe hardass/yelling is the best way to achieve that, either, instead favoring (original) Devops culture, but that also doesn't work if Ops (including DBA) skills aren't recognized as valuable any more (and "Devops" just means "replace Ops with Devs").


Ah, yes. I interned in the Database Administration team of a Fortune 500 company. I definitely learned a lot as a developer under the tutelage of some old school mainframers (I was primarily targeting DB2 on OS/390).

I learned a lot about query optimization in my 3 years as an intern. I also caused a few issues along the way, including a number of PRs to IBM. Some were benign, like incorrect documentation. Others were more problematic, ranging from semantically identical queries returning different results (e.g. coalesce vs. case-when-null in a column spec) to knocking out the entire development LPAR on the mainframe with a simple select statement (it hit a bug in the query optimizer and crashed not just the DB, but the whole LPAR).

Still, those 3 years were invaluable to me. They developed both an intuition for how to write optimal queries and the knowledge of how to use the tools to identify, measure, and fix problems. It has served me well as a developer in the 15 years since I left my internship. I don't get into fights with DBAs, since they recognize I know as much as or more than they do about optimizing queries on the datasets I work with.


https://en.wikipedia.org/wiki/Extract,_transform,_load for anyone who didn't catch the acronym.


As someone that doesn't know much about ETL, I don't really see why it's worth mentioning in this context.

Yes, what the OP is doing can be called extract/transform/load, but that's an extremely general set of operations. Do you have some specific tool in mind that would make solving their problem a breeze? Is there a specific process that one should follow for ETL which was ignored?


Note: I'm just a backend dev, but I've been looking into data engineering a bit recently so I'm not terribly experienced. Happy to be corrected

Problem here is they are using a system designed for the typical webapp use-case (OLTP). It's optimized to find a particular row or rows and then maybe update them with low latency.

ETL tools are more optimized towards this sort of bulk data processing (OLAP) where we want to scan all the rows for a particular column and run some sort of analysis.

A simple optimization some databases make for this is to orient the data so that each whole column is stored together rather than storing each row together. This makes it much quicker to run queries over an entire column or two in a table.

Generally, modern ETL is geared towards working with LARGE amounts of data, so there are many tools optimized for this.


Re-inventing it. (And yes, that's worse.)


The number of times I reinvented something due to lack of knowledge, assistance, or training, or from being thrown into a tech stack I was unfamiliar with, was quite maddening. I'm pretty sure that's how most CS courses are designed, really: "Figure this out. Did you get it? Great. That will never be useful because you can just do this now." Teaching people to think, blah blah. Then complain when they don't know.


If you initiate the shell commands from inside the database, does it count as less hacky? :) https://github.com/petere/plsh
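For reference, a plsh function is just a shell script wrapped in a function body; an untested sketch along these lines (file name and column list made up):

    CREATE EXTENSION plsh;

    CREATE FUNCTION export_sorted_commits() RETURNS text AS $$
    #!/bin/sh
    # dump and sort outside the database; the script's stdout becomes the return value
    psql -c "COPY commits (id, created_at) TO STDOUT" | sort -n > /tmp/commits_sorted.txt
    echo /tmp/commits_sorted.txt
    $$ LANGUAGE plsh;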


Code beauty is subjective. If the next developer that inherits your code yells “WTF!” then incredulously tells all the other devs what you did, then it’s hacky. Yes, invoking shell commands in a db would likely cause that.


I've run into situations like this where I was able to use a window function or a materialized view to get huge speed gains. It was basically restructuring the tables into a new view, much like the shell script did in this case, and that view could then be optimized by the query planner much more effectively.

One was with an internal database we used for tracking employee time and generating invoices and reports. I forget the exact details but this technique took it down from an hour to a second, or similar.

Another time was more recently in an interview, using the mysql employee sample database. I had a naive query that did most of what was needed, but after quite a bit more work I was able to make a materialized view that caused the query to go from using 64+GB of RAM and dying from OOM after 4 hours, to running in 45 minutes in 2GB of RAM.

It usually takes me hours and hours to put together, though, because I do it so infrequently.
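In Postgres terms the pre-aggregation looks something like the sketch below (names are made up; in MySQL, which has no native materialized views, you'd maintain an ordinary summary table instead):

    CREATE MATERIALIZED VIEW weekly_commit_counts AS
    SELECT project_id,
           date_trunc('week', created_at) AS commit_week,
           count(*) AS n_commits
    FROM   commits_with_projects
    GROUP  BY 1, 2;

    CREATE INDEX ON weekly_commit_counts (project_id, commit_week);

    -- re-run whenever the underlying data changes
    REFRESH MATERIALIZED VIEW weekly_commit_counts;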


Isn't the Unix join essentially doing an inner join? Why is the original SQL doing a left join while the text processing with Unix join is doing an inner join?

Yes, the left join will scan the whole table. A left join means returning all the rows from the left-hand side table, so it won't use the index on it.

Inner join with index is way faster than left join.


If you find you're exporting and running via `uniq` or the like, `create temporary table as` is probably what you want. If you think about it, it's essentially what you're doing, without the additional filesystem involvement.

The optimizer should be handling this for you, though for very large datasets, constructing a new table from a query essentially avoids the locking issues you might otherwise run into.
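In the article's terms, the in-database equivalent of the export-sort-uniq pipeline would be something like this sketch:

    CREATE TEMPORARY TABLE project_weeks AS
    SELECT DISTINCT pc.project_id,
           date_format(c.created_at, '%x%v1') AS week_commit
    FROM   project_commits pc
           JOIN commits c ON c.id = pc.commit_id;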


I'd also like to see a breakdown of the time taken to complete the project.

RAM is cheap and jumping from 16GB to 64GB (or even 128GB) might have cost as much as the analysis time.

The only clue that a memory upgrade might have been a quicker fix was that the merged file was 133 GB. Seems to me that an upgrade to 128GB of RAM might have led to a dramatically shorter query execution time.


Yes, having tables with 5B rows on a machine with 16GB of RAM is just stupid; going to disk will almost always happen.


Just because it's slow doesn't make it stupid; parsimonious maybe, but not stupid. Just my 2c.


5B rows is not that much; we blew past 32-bit UIDs in 2001, and 16GB was really expensive then.


If you need to add hardware to solve a software optimization problem such as this, you've failed your job as a software engineer (if you should even be called an engineer, nay, a developer).

I'll probably get downvoted for this, but you should examine opportunities to optimize your SQL/other storage patterns before you assume anything else if your indicators are that your DB is slow.


> If you need to add hardware to solve a software optimization problem such as this, you've failed your job as a software engineer

That's an interestingly narrow viewpoint with which I, narrowly, agree.

I suspect that the narrow viewpoint is just a form of the aphorism "if all you have is a hammer, everything looks like a nail."

> (if you should even be called an engineer, nay, a developer)

However, in the broader context of engineering as problem solving (or developing a product), I disagree.

Not all (computer) problems are best solved with software.

As a sysadmin (who can code but doesn't love to), I routinely struggle with managers who don't have an intuitive sense of even a first approximation of what modern hardware is capable of, because their entire background is programming.

The fact that (latency) "numbers every programmer should know" exists (and has for quite some time) is fairly telling. What's more telling is that there are no dollar values attached to any of them.


The use case is pretty straightforward; TFA says that the referenced columns are both indexed and the runtime is a function of how MariaDB executes that simple query. Do you have anything concrete to offer in terms of optimizing "storage patterns" in this case that the OP has missed?

I'm all for optimizing for minimal resource usage, but if plugging $200 worth of RAM into the machine allowed the original query to run in an acceptable timeframe, that may well easily offset the hours of analysis and transformation that resulted in the shorter runtime. Unless, of course, we're more concerned with engaging in a performative defense of our arbitrarily self-bestowed credentials as "engineers".


> you've failed your job as a software engineer

But you've succeeded in your job in being a practical person trying to get something done.

I would love to know where you work, where "let me spend some time on optimizing this query" could be justified over "let's put in a few more sticks of RAM". Everywhere that I've worked, developer time is much more valuable and much more costly than a few RAM sticks (assuming it's just a server or two, and not N instances).


I think the date_format call forces a table scan; the query could be rewritten to be much faster. Optimizing the SQL statement seems less error-prone than extracting the data to text files.


There's no predicate or limit or inner join that could reduce the set of rows, so the table might as well be scanned.


Or precompute the date on insert; there are a dozen ways to refactor the data model and the query, all of which are better than extracting to text files, a format with no schema and not based on a standard. That would be the direction I would give if a junior dev were attempting this method.


There is some more information in this Reddit thread:

https://www.reddit.com/r/programming/comments/94r6w6/how_i_s...


As far as I understand, 'project_commits' and 'commits' have a one-to-one relationship (project_commits.commit_id = commits.id). From my point of view it is a strange design, since they could just add a 'project_id' column to the 'commits' table. The 'project_commits' table seems to be redundant here.

I'd write this query:

    select pc.project_id, date_format(c.created_at, '%x%v1'), count(*)
    from commits c
        join project_commits pc 
            on c.id = pc.commit_id
    group by pc.project_id, date_format(c.created_at, '%x%v1')
And I'd use a left join only at the stage when the 'projects' dictionary needs to be joined with the result of the query.


Am I missing something fundamental like a WHERE clause? I see a join, but no limiting of rows beyond that.


You're right: this query doesn't make sense, because if he's left joining, he should at least do

where tableB.someCol is not null

or something. Otherwise, the join is useless (or he should use an inner join).


I think they just wanted all of the data.


Is there an actual schema and does MariaDB do index scans?


MariaDB can do sort/merge joins. It's rare that you want to force one, but you can. That's what IGNORE INDEX is for.

If you're testing, take a sample of the database and test on that.

First try the simple form:

    explain
    select 
      project_commits.project_id,
      date_format(commits.created_at, '%x%v1') as week_commit
      from commits, project_commits
      where project_commits.commit_id = commits.id;
That ought to call for a sort and merge.


The query is simply very badly written. Who makes a left join of a 5B-row table to a smaller table and at the same time expects a distinct from the "left" table? If the distinct matters and the join isn't expected to filter anything, then why have a join in the first place?


I once had a huge performance problem for a simple self-join that I expected would run quickly.

I talked to our resident SQL expert (retired from IBM, now doing a job as the lab administrative assistant). She suggested sorting the data before inserting it.

That sped up the query.

It still bothers me that data order matters so much. But I can't reproduce the problem in MySQL any more anyway.


I guess you were missing an index, as data order should only matter when doing full scans.


This was a full self-join (SELECT * FROM blah a JOIN blah b ON a.id = b.id). I ran strace on it: for each row of a, it was doing a seek to the matching row of b and reading a whole page. By pre-sorting, I increased locality.


MySQL only supported “nested loop” joins in 5.5 and below. Hash and merge joins were introduced in 5.6.

Depending on when you originally did the query, maybe you were using 5.5 or below, which would explain why you now can't reproduce it?


It's very possible. I was running the query ~14 years ago and it wouldn't surprise me if any number of improved condition pushdown mechanisms were implemented.


After reading the post, and considering the time to export/import the data, I think it would be better to denormalize the data (i.e. precalculate the result).

I did something like this in the recent past, but with much heavier calculations. Denormalizing using triggers turned queries that took minutes into ones that run in less than 1 second.
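A minimal sketch of that idea against the article's schema (the summary table and trigger are hypothetical, MariaDB-style syntax):

    CREATE TABLE weekly_commit_counts (
        project_id  INT NOT NULL,
        week_commit CHAR(7) NOT NULL,
        n_commits   INT NOT NULL DEFAULT 0,
        PRIMARY KEY (project_id, week_commit)
    );

    -- keep the summary up to date as project_commits rows arrive
    CREATE TRIGGER project_commits_ai
    AFTER INSERT ON project_commits
    FOR EACH ROW
        INSERT INTO weekly_commit_counts (project_id, week_commit, n_commits)
        SELECT NEW.project_id, DATE_FORMAT(c.created_at, '%x%v1'), 1
        FROM commits c
        WHERE c.id = NEW.commit_id
        ON DUPLICATE KEY UPDATE n_commits = n_commits + 1;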


How long does it take to run the distinct on its own? I'd be interested to know what happens if the distinct is in a subquery and the result is then joined with the commits table.

Ensure project_commits.project_id is in an index before running the test.


Sorry, disregard; I misread the query.


The question I'm interested in is why MySQL did not use the index for the lookup.


Since there's nothing limiting the set of rows from project_commits, it might as well table-scan it. The primary key on commits will be used for each left join lookup.

Nested loops aren't that different, performance-wise, than sorted merge joins. Sorting takes O(n log n); whereas the nested loop does n lookups, each taking O(log n), for a similar O(n log n). Memory allowing, a hash join has more potential for speedup.

There should be a locality win from a sorted merge - depending on the correlation between scan order and foreign key order, the index lookups in the nested loop may be all over the place. Usually this doesn't matter much because you don't normally do 5+ billion row joins.


Not sure if MySQL is the same, but Postgres won't use an index if the statistics suggest doing so will be slower than a full table scan; I'd expect MySQL to behave similarly.

With 5bn+ rows and a memory constraint, the type of index begins to make a difference - e.g. in Postgres I would have tried using a bloom index.
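For reference, bloom indexes come from a contrib extension and are created roughly like this (a sketch; the column list and options are made up, and the stock extension only ships operator classes for int4 and text):

    CREATE EXTENSION bloom;

    -- a signature-based index, useful for equality searches across several columns
    CREATE INDEX commits_bloom_idx ON commits
        USING bloom (project_id, author_id)
        WITH (length = 80, col1 = 2, col2 = 2);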


It can be very frustrating that, even with updated statistics, the optimizer decides to go a different way with a query. While using Sybase, this happened quite a lot when tables got rather large (corporation-large, not Google-large). The normal response is to force the indexes on the query. Due to log space issues, I do remember having an awk script between queries back in the Ingres days.


I believe that is the case; the old default was that if more than 30% of the table was being returned, the index wouldn't be used.

I think that is configurable, however.


The query includes `date_format(created_at, '%x%v1') as week_commit` - meaning it would need to be computed for every row.


That's only in the SELECT statement, which only needs to be performed after the join has been done.


Only for every returned row. The projection does not affect the underlying result set; it merely "shapes" the results you have received. WHERE clauses are always run before the SELECT list is evaluated.


The query in the article has no WHERE clause. It must calculate that function for every row in `commits`.


Reading the article, the database did use an index for the lookup. However, if that index lookup is hitting spinning rust, then it can take 10ms or so per entry, which adds up.

In contrast, if you have sorted/indexed both tables on the join key, then the database is able to do a merge join. This is effectively what the command line implementation did, and is much faster, assuming you can get hold of sorted data quickly to begin with. If the tables are unsorted, then the extra cost of sorting them first needs to be added.

Most databases will evaluate several methods of running the query. Something like Postgres EXPLAIN will provide details of the method being used.


MySQL used the index on the joined table, but didn't use an index on the primary table because there isn't anything to filter on. The join condition would still cause one table to require a table scan to gather all of the records.


Naive question: are your tables partitioned?


TLDR: author doesn’t know how to use database indexes properly.


Possibly. The author at least knew enough to observe "Both join fields are indexed. However, MariaDB implements the join with a full scan of project_commits and an index lookup on commits."


That may have actually been a result of the ordering and poor query design.

Given that commits.id is a primary key, IIRC it must also be NOT NULL; thus the LEFT JOIN is effectively just an inner join, which most database engines seem to prefer expressed as a WHERE clause (since it's easier to re-order those operations).

Were it written as an inner-join-equivalent WHERE clause, or with the smaller table specified first, the full-table scan should at least have been on the smaller table.

Your database storage engine of choice might differ or have other options (e.g. in PostgreSQL an index on the result of comparing the two source keys COULD be created, and a carefully structured query written to use /that/... I think; I haven't tested it, but it'd at least be worth the experiment).

MySQL (mariadb presumably?) can't. https://stackoverflow.com/questions/8509026/is-cross-table-i...


[flagged]


What is the definition of a real analytical database?



By "analytical database" I think the person meant a columnar database. For MySQL-compatible examples, that means Infobright or InfiniDB.


In the original case, pretty much any real (as in NOT My/Maria) database would've worked (knowing how to use it would've helped, of course), but for columnar analytics Monet is pretty nice, too, and ever so slightly cheaper than Vertica.



