Hacker News
PG-Strom: SSD-To-GPU Direct SQL Execution (github.com/pg-strom)
124 points by blopeur on June 4, 2017 | 37 comments




Could OpenCL (AMD, FPGA) be supported in the future?


A bit off-topic on my side here, but I've been aware of both OpenCL and FPGAs for quite a while and never thought of them being combined in any way.

Would anyone care to comment on their experience with regards to the following questions I have about this?

1. Are you using an FPGA implementation of OpenCL, or are you synthesizing a specific piece of OpenCL code into the FPGA?

2. How does performance compare between OpenCL running on a high-end GPU and a high-end FPGA?

3. How does performance compare between implementing something in FPGA directly and implementing it in OpenCL and running that on an FPGA?

4. How does working with OpenCL for FPGA compare to working with the FPGA directly? In other words, what advantages and disadvantages do you find in OpenCL as opposed to Verilog or VHDL (or other ways of programming an FPGA)?

5. Based on the above, if you had X amount of dollars that could buy you either a number of GPUs or a (probably different) number of FPGAs, which would you choose? Consider also the electricity cost of running these as I've heard that FPGAs generally waste less electricity because they produce less heat.


I've seen research (at Columbia?) on FPGA accelerated queries.


Really loved the slides. I'm in no position to need PG-Strom at this point, but it's very interesting work and I hope it keeps progressing.


Interesting work indeed. It has the potential to beat Oracle Exadata.


Interesting slides. I would, however, like to see how performance relates to cost.


This seems like a perfect fit for AMD's SSG architecture (1TiB flash storage directly on the GPU board).


It's Nvidia-only; it uses CUDA directly.


I love the concept. But it's hard to tell whether this is ready for production. Anyone have any hands-on experience with it?


This is the first time the concept has ever been shown publicly. You must be exceptionally brave to use it in production. Most CTOs are cynical, and always wait for the .1 version


Hm? I have not followed PG-Strom closely, but the project has been around for at least 5 years, probably more.


Early versions of PG-Strom didn't plan for SSD-to-GPU Direct SQL Execution; however, some users were concerned not only about CPU performance but about I/O as well. So I began implementing a feature to accelerate I/O too. Its development started in December 2015.


Note that GPUDirect is only available on nVidia Quadro and Tesla series GPUs.


Does that mean that this library[1] requires GPUDirect? That's sad.

[1]: https://github.com/pg-strom/nvme-strom


Yes, Tesla or Quadro are required. In addition, Tesla P40/P100 are recommended due to the size of the I/O-mappable region.


It's great to see this type of evolution in database performance, maybe this will become the standard someday.

The only thing missing is that there isn't a comparable project for MySQL.


Not trying to be a snoot here, but what is the point of using MySQL today? I can't think of anything I'm aware of that it does better than Postgres.


Interoperating with code and data files that were made for/by MySQL.


Code I can understand, data a bit less. I'm just thinking that no MySQL program which has already been written is going to have any reliance on the performance characteristics of this flash-based database system, since it wasn't a thing when they were written.

If you could change the shape of your tables to suit this approach, then surely you could adopt Postgres.


Postgres can have some write-amplification issues for update-heavy loads where at least one of the updated fields is covered by an index. Famously, this triggered Uber to switch to MySQL. Its impact is noticeable in one of my projects as well -- and it isn't that large (a few dozen million rows).
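A toy model of what that amplification looks like (hypothetical numbers, not Postgres's actual page layout): every UPDATE writes a new row version, and when the update isn't "HOT" (e.g. an indexed column changed), every index on the table may need a new entry too.

```python
# Toy model of Postgres-style write amplification on UPDATEs.
# Assumption (simplified): every UPDATE writes a new heap tuple; if
# the update is not HOT (an indexed column changed), every index on
# the table also gets a new entry.

def writes_per_update(n_indexes: int, hot: bool) -> int:
    """Count logical writes caused by one UPDATE."""
    heap_writes = 1                         # new row version in the heap
    index_writes = 0 if hot else n_indexes  # non-HOT updates touch all indexes
    return heap_writes + index_writes

# A table with 5 indexes: a HOT update costs 1 logical write,
# a non-HOT update costs 6 -- a 6x amplification.
print(writes_per_update(5, hot=True))   # 1
print(writes_per_update(5, hot=False))  # 6
```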


Not trying to doubt your experience, just collecting datapoints for improvements. I'd appreciate it if you could describe what your problems around write amplification are. Is it

- heap bloat

- index bloat

- CPU overhead of index manipulations

- Write-ahead-log overhead of updates

- increased overhead / frequency of VACUUM

- total amount of disk writes

Thanks!

Edit: formatting.


This is a great question. I have to say I haven't profiled Postgres to that level of granularity, so I am making an assumption about why my updates are slow. What I notice is relatively slow UPDATE performance, with high total disk writes and a write-ahead-log that seems to be large. The bloat on disk isn't an issue for me (or, rather, I have started using lower fillfactors to reduce the variability in DB size as the rows are updated).

Do you have a favored resource that you can direct me to that would help me identify the factors that you cited and ultimately allow me to make more nuanced criticisms (or to find a fixable error in my approach, which would of course be the ideal result)?


> I have to say I haven't profiled Postgres to that level of granularity, so I am making an assumption about why my updates are slow. What I notice is relatively slow UPDATE performance, with high total disk writes and a write-ahead-log that seems to be large.

Hm, that could be quite independent things. Could you check what your checkpoint_segments (< 9.5)/max_wal_size & checkpoint_timeout settings are? Having very frequent checkpoints will trigger a lot of superfluous writes; both the checkpointing itself and increased percentage of full page writes (to prevent torn pages).
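For reference, the settings in question live in postgresql.conf; something like this (the values below are illustrative placeholders, not recommendations):

```
# postgresql.conf -- illustrative values only, tune for your workload
max_wal_size = 4GB            # 9.5+; pre-9.5 used checkpoint_segments
checkpoint_timeout = 15min    # time-based checkpoint trigger
log_checkpoints = on          # log every checkpoint for diagnosis
```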

When you say your write-ahead-log is large, do you mean the pg_xlog directory, or the amount of writes to it?

Another common reason for slowness can be foreign keys, if you've not indexed both sides of it. E.g. if there's a foreign key referencing the table you're updating, and the referencing columns aren't indexed, you'll end up with seqscans...

> Do you have a favored resource that you can direct me to that would help me identify the factors that you cited and ultimately allow me to make more nuanced criticisms (or to find a fixable error in my approach, which would of course be the ideal result)?

I'd strongly suggest enabling log_checkpoints - that'll log information about every checkpoint, allowing at least some diagnosis.

Another useful thing would be to run pg_xlogdump --stats, that'd give you information about what takes up space in your WAL.

I'd also look at EXPLAIN (ANALYZE, BUFFERS) one_of_your_updates; - that'll tell you a bit more about the timing, including trigger runtime.

Edit: brevity


Wondering how much the "moving data to the GPU" overhead eats into the speedup, and whether things still end up significantly faster.


This was posted in an older HN thread: http://kaigai.hatenablog.com/entry/2016/09/08/003556

The major performance gain comes from doing SSD-to-GPU direct memory access, avoiding the need for data to needlessly transfer over PCI-Express twice.
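A back-of-the-envelope illustration of why one PCIe crossing instead of two matters (the bandwidth and data-size numbers below are made-up round figures, not measurements):

```python
# Rough model: without direct DMA, a scan crosses PCIe twice
# (SSD -> host RAM, then host RAM -> GPU); with SSD-to-GPU DMA it
# crosses once. Numbers are illustrative, not benchmarks.

PCIE_GBPS = 12.0   # assumed usable PCIe 3.0 x16 bandwidth, GB/s
DATA_GB = 100.0    # assumed size of the scanned table, GB

def transfer_seconds(hops: int) -> float:
    """Seconds spent on PCIe transfers alone for `hops` crossings."""
    return hops * DATA_GB / PCIE_GBPS

two_hop = transfer_seconds(2)  # staged through host RAM
one_hop = transfer_seconds(1)  # direct SSD-to-GPU DMA
print(f"{two_hop:.1f}s vs {one_hop:.1f}s")  # the direct path halves bus time
```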


I think that, with the emergence of OpenCAPI and the likes of CCIX, we will see more and more offloading to specialised HW, as generic CPUs are not able to cope with the required performance and amount of data. If you look at the InfiniBand roadmap, they plan to hit 1 Tb/s within a couple of years, and no amount of core dedication à la DPDK will allow a generic CPU to handle that. Heterogeneous hardware platforms are becoming the norm really quickly.


Really depends on the shape of the data and the query at hand. If you're doing aggregate computations over millions/billions of scalar-sized rows, it could be quite fast. However, joining two cold tables will still be limited more by schema and disk reads than anything else.


Unless you're doing skipping and indexing properly....


This seems pretty nice. Does anyone know if there is anything similar in the oracle camp? This seems like a big win over them if not.


I suspect it might cut into their Exadata appliance market if they did, but I'd love to hear anything from Oracle indicating they were adding something like this to a new version. If you try to go massively parallel for workloads, they seem to push you towards Exadata. Parallel table/index scans may be your best bet on Oracle at the moment.

As far as I know you can't do anything like custom scan providers in Oracle. I've tried searching a few times for functionality like that and always came up empty. Even if that was an available feature, I'm not sure how you'd even begin costing it for the optimizer to take that into account, unless it was something you'd just enable via hint.

I have OpenCL code running from within Oracle stored procedures via Java. You can at least write your own logic from within the DB to utilize your GPUs. I've done things like push 2 tables into GPU memory and let the GPU do the joins once you manually load the columns to join on. Return the set to SQL and use an indexed column to return the data you want from the table. It's not something I'd consider running in production, but it was a neat exercise and you can get it to work.


I still am hoping for affordable SSD to FPGA direct execution someday!


Direct execution? With GPUDirect from NVIDIA you could have an FPGA get a memcpy() from graphics RAM onto its data bus. Still, not sure how you plan on using this. I would hope NVIDIA could give us programmable fabric on the GPU silicon that could somehow access each compute unit's instruction execution pipeline. That would be interesting indeed. Or just massive, fast FPGAs with highly parallel compute IP built in. Or all new NVIDIA cards could ship with a very fast FPGA that runs 3D graphics by default but can be reprogrammed to do CUDA-compatible computation, and have it open source so users can add new instructions, modify existing ones, or even reconfigure the number or arrangement of cores.


Thinking of something along the lines of implementing/modelling SQL operators in FPGA hardware, using massive parallelization through the FPGA. That needs direct I/O access to very fast storage to be optimal, and probably still additional processing via a CPU or GPU.

Expensive custom data warehouse appliances like Netezza already do something like this, but I'm hoping for an implementation that is less expensive and more generally available, that can be exploited to accelerate any database's operations. Bonus points if it can scale up to a distributed model somehow.


A WHERE clause is the worst possible use case for a GPU: GPUs don't tolerate branches or non-coalesced memory access all that well, and WHERE executes a ton of branches. Even so, things seem to be faster. Go figure.
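One reason it can still win: the predicate can be evaluated for every row in lockstep without divergent branches, then the survivors compacted in a second pass. A pure-Python stand-in for that data-parallel pattern (a sketch, not PG-Strom's actual kernel):

```python
# Branchless-style WHERE evaluation: compute a 0/1 selection mask for
# all rows uniformly (what every thread in a warp would do), then
# compact the matching rows in a separate pass.

rows = [3, 17, 8, 42, 5, 29]

# Phase 1: every "thread" evaluates the predicate -- a uniform
# computation producing a mask, with no control-flow divergence.
mask = [1 if r > 10 else 0 for r in rows]

# Phase 2: compaction (on a GPU, typically a parallel prefix sum)
# gathers only the selected rows.
selected = [r for r, m in zip(rows, mask) if m]

print(selected)  # [17, 42, 29]
```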


I think it hugely depends on what your WHERE clause is.


I guess if your where clause is very selective, you could avoid the hit since the same branch is taken by all threads in a warp. But then there are better approaches such as BRIN indexes, RLE and dictionary encoding with which you don't even touch the data that doesn't match your clause.
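Run-length encoding is a good example of how a scan can skip data entirely: each run is tested once, not once per row. A minimal sketch (simplified for illustration; real column stores combine this with dictionary encoding and zone maps/BRIN):

```python
# Minimal run-length encoding and a scan over it, showing how a
# predicate is evaluated once per run instead of once per row.

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_count_matching(runs, predicate):
    """Count rows matching `predicate`, testing each run only once."""
    return sum(n for v, n in runs if predicate(v))

col = ["us", "us", "us", "eu", "eu", "us"]
runs = rle_encode(col)                                 # [('us', 3), ('eu', 2), ('us', 1)]
print(rle_count_matching(runs, lambda v: v == "us"))   # 4
```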

