I found this super interesting - especially as all the data I've written code to manipulate has been small enough that I haven't needed to optimize my code, so I've never had to think in this direction.
I think my favorite part was the very first section, where he got baseline measurements with `cat`, `wc`, and friends. I wouldn't have thought to do that and it's such an easy way to get a perspective on what's "reasonable".
It also underscores just how insane raw disk bandwidth is on modern SSDs. Most software is not designed around a world where you have gigabytes a second of sequential scan on a laptop.
Yes, though on a laptop it's unlikely the entire file is being cached in DRAM. Also, SSDs with multiple GiB/sec of bandwidth are indeed common, which was my basic point in comparing against how much slower sed/awk-style processing is.
A few months ago, I had to quickly bang out a script to output about 20 million lines of text, each the output of a hash function. My naive solution took more than a few minutes - simple optimizations such as writing every 10k lines cut the time significantly. Threading would have helped quite a bit as well.
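For what it's worth, a minimal Go sketch of that batching idea (the original script's language and hash aren't specified, so the hex-encoded SHA-256 rows here are just an assumption); the big win is usually buffering output instead of issuing one write per line:

    package main

    import (
        "bufio"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "os"
    )

    func main() {
        // Buffer output so we don't pay for a write syscall per line.
        w := bufio.NewWriterSize(os.Stdout, 1<<20)
        defer w.Flush()

        for i := 0; i < 20_000_000; i++ {
            sum := sha256.Sum256([]byte(fmt.Sprintf("row-%d", i)))
            w.WriteString(hex.EncodeToString(sum[:]))
            w.WriteByte('\n')
        }
    }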
I’m kind of interested in the opposite problem, what is the simplest solution using a well known library/db that approaches the fastest hand optimized solution to this problem?
Java is often slightly faster than Go; it has similar constructs (perhaps an older, better-optimized Map), perhaps a better GC (older, more optimized) - though I don't think the GC is a challenge here - and slower startup times. So I'd say roughly the same as the idiomatic Go version?
Java can be faster than Go, but it comes at the same cost as most "faster" things in software: Java uses significantly more memory.
Java is also more mature, which means you are entering a massive package-bloat setup that has evolved over the years to work for everyone's wild and varied needs. By the time you have your database, cache, http/other handlers, tests, fixtures, metrics, logging, tracing, etc... set up, you're looking at a scary pile of dependencies spanning thousands of classes that would make even NPM jealous.
Java's GC is just incomparably better; literally every piece of research on the topic is written for Java. Though it's true that, without value types, Java does rely somewhat more on it.
Sounds very reasonable. In the blog post, about 20s were shaved off by assuming we don't need complicated string parsing. An off-the-shelf library can't make that assumption, so it will always have to pay the extra cost.
True in general, but some (especially libs aimed at larger datasets processed in batch) are taking advantage of benchmarks like this to do things like:
- Try the fast way; if it works, great.
- Fall back to the slow way if the fast way fails.
This makes the slow path 2x slower at worst (and you can advise always using the slow way with optional params), but the fast path can be 10x faster (rough sketch below).
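A minimal sketch of that shape in Go (a hypothetical semicolon-separated record parser, not from any real library): check cheaply whether the fast path's assumption holds, and only fall back to the general machinery when it doesn't.

    package fastpath

    import (
        "bytes"
        "encoding/csv"
    )

    // splitFields parses one semicolon-separated record. Fast path: assume no
    // quoting or escaping and do a plain split; slow path: fall back to the
    // standard library's CSV state machine when a quote shows up.
    func splitFields(line []byte) ([]string, error) {
        if !bytes.ContainsRune(line, '"') {
            parts := bytes.Split(line, []byte{';'})
            out := make([]string, len(parts))
            for i, p := range parts {
                out[i] = string(p)
            }
            return out, nil
        }
        r := csv.NewReader(bytes.NewReader(line))
        r.Comma = ';'
        return r.Read()
    }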
For anyone looking for more examples of 1BRC in Go, we had a friendly competition at work and collected the results here: https://github.com/dhartunian/1brcgo/
In addition to the loop-unrolling and bit-twiddling tricks that also show up in the fastest Java and C++ versions, some Go-specific things I learned were (a few of these are sketched after the list):
- unsafe.Pointer can be used to read memory without bounds checks
- many functions in the bytes and bits packages in the standard library are written in assembly
- debug.SetGCPercent and SetMemoryLimit to turn off GC
- runtime.LockOSThread to lock a goroutine to a thread
- print is slightly faster than fmt.Printf (but writes to stderr)
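A rough sketch of a few of those knobs used together (illustrative only, not from the repo; the unsafe load assumes amd64, where unaligned 8-byte reads are fine):

    package main

    import (
        "fmt"
        "math"
        "runtime"
        "runtime/debug"
        "unsafe"
    )

    func main() {
        debug.SetGCPercent(-1)              // disable GC pacing entirely
        debug.SetMemoryLimit(math.MaxInt64) // keep the soft memory limit out of the way (MaxInt64 is the default)
        runtime.LockOSThread()              // pin this goroutine to its OS thread

        buf := []byte("Hamburg;12.0\n")
        // unsafe.Pointer: load 8 bytes in one go, with no per-byte bounds checks.
        word := *(*uint64)(unsafe.Pointer(&buf[0]))
        fmt.Printf("%#016x\n", word)
    }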
Oh, I'd missed those solutions, thanks. You guys got way more hard core than I did -- nice work! Looking forward to reading the code for those solutions this week.
I think we all ended up using unsafe, though there were some solutions without mmap. It would have been interesting if we had adhered to the same constraints you did!
Well, I guess it's more that the standard library doesn't have a cross-platform way to access them, not that memory-mapped files themselves can't be done on (say) Windows. It looks like there's a fairly popular 3rd party package that supports at least Linux, macOS, and Windows: https://github.com/edsrzf/mmap-go
Not all filesystems support memory-mapped files equally, and for some that do, the support comes with caveats and could be slower than non-memory-mapped access.
> I find this report confusing. Why does if items[hashIndex].key == nil show as taking 5.01s, but the call to bytes.Equal shows as only 390ms. Surely a slice lookup is much cheaper than a function call? If you are a Go performance expert and can help me interpret it, I’m all ears!
These two lines are both conditionals, so the time reported is sensitive to branch mispredictions. If the timings are not intuitive based on the complexity of the associated lines, then it may be explained by the data being not very predictable and the branch predictor having a bad time.
Yeah, someone emailed me privately after they'd dug into this. They mentioned that "items[hashIndex]" was a significant source of cache misses. They used "perf record -e cache-misses" and found it was the largest source of cache misses. They also found (by digging into the assembly) that the "bytes.Equal" line has some sort of source-line attribution issue.
I love the nerdery around 1BRC. My axe to grind is that unless you do dangerous stuff DBs are just as fast, less complicated, and more resilient to data updates than application code [0]. Do more in the database!
I agree with doing more in the database, you're closer to the data (disk/disk cache/L2 cache) than the application code is. At the same time I get really nervous around doing work in the database because you have to be really disciplined that the in-database code (functions, views, etc) match the code in source control. Also that your testing/QA database contains all the same code and enough test data to actually exercise the performance bounds of that code.
With application code I can easily step through it in a debugger and verify the deployed code matches what's in the repo. It's more difficult to do in the database because it requires more organizational discipline.
Yeah you need some different tools when working in data. I've been recommending dbt [0] as kind of the on-ramp for SREs to data work. Among other things it allows you to keep your work in a VCS and has a testing framework.
I'm not sure this part of the TOS is valid in many jurisdictions. But there is a better reason to not publish benchmarks: they do not deserve free advertisement. We should just collectively forget they exist and use other tools.
I wanted to see how the temperature was changing over time for specific regions using a map-based interface. One chart in particular was eye-opening.
The software won a couple of awards and was heavily optimized to produce reports in under a minute. Kudos to the author for getting a parse time of a billion records down to mere seconds.
It's worth noting that if you're messing around with large text files from the CLI, awk, grep, etc. will be an order of magnitude faster if you opt out of Unicode parsing.
I'm pretty confident adding LC_ALL=C to the awk solution would get it easily under a minute.
I just want to go on record that, given the simplistic (i.e. fun) problem, a shell developer would have had the answer for the first, specific set of a billion rows done while all the other languages were still putting on their shoes.
That's only a valid comparison if the "fastest Java" and "fastest Go" implementations are either the same or at the limit of what each language allows.
The more interesting comparison anyway is performance of the straightforward, idiomatic code, since that's what we all write 99% of the time.
Here's the key insight from the article: "Processing the input file in parallel provides a huge win over r1, taking the time from 1 minute 45 seconds to 24.3 seconds. For comparison, the previous “optimised non-parallel” version, solution 7, took 25.8 seconds. So for this case, parallelisation is a bit faster than optimisation – and quite a bit simpler."
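The structure behind that win is roughly the following (my own simplified sketch, not the article's code: the chunk splitting and fast parsing are elided, and the stats/processChunk names are made up). Each worker aggregates into a private map, so the hot loop never touches shared state or locks.

    package brc

    import (
        "bytes"
        "math"
        "strconv"
    )

    type stats struct {
        min, max, sum float64
        count         int64
    }

    // processChunk aggregates one chunk of "Station;temp\n" lines into a
    // private map (simplified: strconv instead of a hand-rolled parser).
    func processChunk(chunk []byte) map[string]*stats {
        out := make(map[string]*stats)
        for _, line := range bytes.Split(chunk, []byte{'\n'}) {
            i := bytes.IndexByte(line, ';')
            if i < 0 {
                continue
            }
            temp, err := strconv.ParseFloat(string(line[i+1:]), 64)
            if err != nil {
                continue
            }
            name := string(line[:i])
            s, ok := out[name]
            if !ok {
                out[name] = &stats{min: temp, max: temp, sum: temp, count: 1}
                continue
            }
            s.min = math.Min(s.min, temp)
            s.max = math.Max(s.max, temp)
            s.sum += temp
            s.count++
        }
        return out
    }

    // process fans the chunks out to goroutines and merges their private
    // maps at the end.
    func process(chunks [][]byte) map[string]*stats {
        results := make(chan map[string]*stats, len(chunks))
        for _, chunk := range chunks {
            go func(c []byte) { results <- processChunk(c) }(chunk)
        }
        merged := make(map[string]*stats)
        for range chunks {
            for name, s := range <-results {
                m, ok := merged[name]
                if !ok {
                    merged[name] = s
                    continue
                }
                m.min = math.Min(m.min, s.min)
                m.max = math.Max(m.max, s.max)
                m.sum += s.sum
                m.count += s.count
            }
        }
        return merged
    }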
The Java version is written by the lead researcher and founder of GraalVM at Oracle Labs. It is really native AOT-compiled code, comparable to the best in C++/Rust.
It is a Java entry because the language used is Java, but the final compiled artifact is far, far away from a typical compiled Java artifact.
The Java version does additional optimizations that his Go version doesn't do and he mentions that at the end of the post. The Java version is really optimized and is an interesting read.
The more accurate statement would be that the Go implementation is incapable of accessing optimizations that exist in Java, and in turn Java is incapable of optimizations performed by the C# and C++ implementations.
Until some time ago, Go did not even have inlining-profitability logic in the compiler and could only inline leaf functions - something worse than what JVM implementations could do in... 2005 or so, I think?
Synthetic benchmarks aside, I think as far as average (Spring Boots of the world) code goes, Go beats Java almost every time, often in fewer lines than the usual pom.xml.
And yet, people will refer to some other synthetic benchmark when arguing that the average Java project, which doesn't disable GC nor uses unsafe reads et al, is comparable.
Um, GraalVM (JIT) has both GC and memory safety. It's a drop-in replacement for HotSpot. It takes literally five seconds to change JAVA_HOME and you're done.
The benchmark literally uses Graal, something from Java, for Ruby and Python, but inexplicably doesn't use it for Java itself.
Still loses to .NET. On the reference host, Java is still closer to the 1.7-2s ballpark (and has to use awkward SWAR to get there), while the fastest solution in C# is 1.2s, beating C++ (the code can be ported, however).
But yes, "I expected Go to win..." is exactly the core of the problem here. Same as with e.g. Swift, which people expect to perform on the level of Rust, when it is even slower than Go. The intuition caused by common misconceptions just does not correspond to reality sadly.
No. What it actually demonstrates is that people didn’t read the source material properly.
The Java and Go versions use different optimisations. There’s nothing stopping either language from using the same optimisations as the other. It just wasn’t something their respective authors cared to try in their respective exercises.
There, however, is something stopping Go from using optimizations present in Java or C#/C++/Rust: the lack of a SIMD API (short of dropping down to hand-written assembly) and an overall much weaker compiler. This puts a much greater burden on the programmer to match the performance while staying with Go.
> There, however, is something stopping Go from using optimizations present in Java or C#/C++/Rust
...
> This puts much greater burden on the programmer to match the performance while staying with Go.
Your second statement contradicts your first. You're not stopped from using SIMD in Go. There are in fact several third-party libraries out there for using SIMD; it's just not part of the standard library. So you can still use SIMD in Go without writing Go's dialect of assembly.
It's also worth noting that SIMD isn't even due to drop into std in C++ until C++26. At the moment you either have to use experimental or a 3rd party library.
You're also missing the point of these examples. Nobody writing these exercises is trying to claim that all languages are equal, and they're certainly not trying to write idiomatic code either. They're just fun little exercises demonstrating the degree of optimisation one can go through. You wouldn't want to write code like the examples in all but the tiniest of scenarios.
It's silly the number of people over-analysing what is essentially just a game, and then arguing it's "proof" of their biases towards different programming languages.
I think the reason why this misconception is so widespread is that there is a grain of truth in it: almost everyone sees Java as synonymous with gigantic frameworks like Spring, Quarkus, etc.
In Go you've got your standard libraries; these are generally quicker than the Java equivalents, simply because they do less in the lifecycle of the operation.
This lets Java do funky stuff like enabling full JVM/code tracing just by adding a jar file at runtime. But it does come with a performance penalty.
One has to differentiate here a bit. Java's JIT technology has become really great: highly optimized native code generation which hugely benefits from the ability to use live profiling data to optimize the code. This is why it often beats static compilers at generating faster code. The static compilers can only optimize for the range of possible data; the JIT can optimize based on the data presented to the program.
On the downside, there are quite a few features of the Java language and the JVM which often make programs slow - like a lot of details of the object model, the lack of value classes, JIT compilation which takes time on startup, etc. Also, a lot of Java libraries are pretty heavyweight.
Go is quite different here. It is statically compiled, which allows for fast program startup, and the language model makes it quite easy to naively write programs which perform reasonably fast. The downside is that the compiler is static and not as heavily optimizing as other static compilers, in exchange for fast compilation speed. However, recently the ability was added to use profiling data to optimize the compilation.
> JVM has always been on par if not often faster than hand-written C code.
I have never found this to be true.
There are a few programs (Talend, MySQL workshop) written in Java that I sometimes have to use, and I avoid them as much as I can, because they are slow, bloated, and eat lots of memory.
Java's new operator is faster than malloc, because Java allocates all the memory it needs from the OS before starting the program, and malloc is extremely slow. C++'s new operator uses malloc, so it is also slow.
Stack allocation in a C program is hundreds of times faster than using new in Java.
And that's just one feature; there are many other features, like C++ compile-time calculations in templates, that simply have no Java equivalent.
Please, you've been reading too much PR from the Java side and not looking at benchmarks and real-world performance enough. What you're claiming is inherently not possible, cherry-picked benchmarks notwithstanding.
Can you explain why it's not technically possible?
JVM has had decades of experience at optimally translating bytecode to machine code and can take advantage of SIMD, AVX etc when needed. Most hand-written C code is far from optimal.
A couple of weeks ago I managed to get a nice setup for viewing Java JIT disassembly[1], and did a bunch of ad-hoc tests on OpenJDK 21 (and some on whatever build of 23 was on openjdk.org). They do manage to vectorize a decent amount of vectorizable loops, but semi-often miss some improvements, and some trivial things didn't vectorize at all (most amusingly, a loop summing an int[] to an int didn't get vectorized, while summing an int[] to a long did). Scalar codegen is also quite mediocre.
GCC and clang, in comparison, often produce assembly which I'd consider optimal, or close to it, for vectorizable loops, and are noticeably better than OpenJDK for scalar code.
GCC and clang produce far from optimal code when vectorizing. Anyone doing serious work is unlikely to rely on the autovectorizer without consulting the output religiously.
There are certainly a lot of things they may do suboptimally (in my experience: when there's register pressure; things that aren't nicely elementwise (bit arrays); preparation & tail handling can get pretty excessive; basically no tuning for making better use of ports) and of course you should always be checking the output (this is true for any optimization effort, vectorized or not), but for most things IME they do just fine (assuming the thing in question is vectorized at all). At the very least it's still miles above OpenJDK.
I've only really seen decent results on the simplest of examples, like a bounded loop with a basic operation in it (like, adding all the numbers in an array). As soon as you go beyond that I see much worse code: reloading things inside the loop body that really should have been hoisted out, spilling for no good reason, weird contortions to shove certain things into scalar registers in the middle only to take them out again. And, of course, if your control flow is not trivial the compiler will often not recognize the loop as vectorizable at all.
Unknown-bound loops are not vectorizable in general (at least not sanely; you'd need masked stores, aligned writes, and a bunch of handling for different cases of bound computation). I haven't seen many cases of missed hoisting (outside of register pressure, where it's not really "missed") or pointless moving to scalar (at least with clang; I've looked at GCC much less), nor has control flow been much of an issue; both compilers can handle conditional loads & stores where available.
C compilers also have decades of experience optimally translating C code into machine code, and they are arguably more capable of emitting SIMD (good luck trying to use cutting edge AVX-512 intrinsics like vpopcntdq with the JVM). The fact is that there is nothing a JIT compiler can do that an AOT compiler can't do, but in the case of AOT, the resources spent compiling the code are amortized to effectively 0, whereas that resource cost is borne upon every program startup for a JIT engine.
That's not necessarily true on the JIT vs AOT split. I'm mostly going off of how the divergence in available optimizations is starting to look in .NET after the introduction of Native AOT, with light research into LLVM and various optimization-adjacent Rust crates.
In particular, with JIT, you are able to initialize certain readonly data once, and then, on recompilation to a more optimized version, bake such data as JIT constants right into emitted machine code. This is not possible with AOT. Same applies for all kinds of in-runtime profiling/analysis and recompilation to incorporate a collected profile according to this exact run of an application. JIT also offers the ability to load modules dynamically in the form of bytecode without having to have a strict machine-level ABI, only the bytecode one, which allows for efficient generics that cross modules, as well as cross-module function inlining. And last but not least - there is no need to pick the least common denominator in supported hardware features as the code can be compiled to use the latest features provided by hardware like AVX512.
On the other hand, pure AOT means a frozen world, which allows the compiler to know the exact types and paths the code can take, performing exact devirtualization and much more aggressive preinitialization of code that accepts constant data. It also means bigger leeway in the time the compiler can spend on optimizing code. Historically, GCC and LLVM have been more advanced than their JIT counterparts because of different tradeoffs favouring absolute performance of the emitted code, as well as simply a higher number of man-hours invested in developing them (e.g. .NET punches above its weight class despite being worked on by a smaller team vs OpenJDK or LLVM).
C compilers only have one opportunity to do that, at compile time; if the developer was lucky with the data set used to train the PGO output, maybe the outcome is greatly improved.
Modern JVMs not only have a JIT able to use actual production data, they are also able to cache PGO data between execution runs and reach an optimal set of heuristics over the execution time.
And on Android, those PGO files are even shared between devices via Play Store.
Of course it's inherently possible. The code is running on the same chip, it is inherently possible for a Foo compiler to emit the same machine code as a Bar compiler for the same algorithm. Foo being Java and Bar being C doesn't change this.
You might mean it's impractical? Or that it happens to not be true in the general case?
Your expectation is correct. The Java version is more tuned. Here https://github.com/dhartunian/1brcgo/?tab=readme-ov-file#lea..., you can find a version that runs in 1.108s and is almost 3x better than the one you quoted. I think it can be reduced even further. In the end, it depends on how fast the executable can boot up and execute the code. At some point, the JVM will lose because it takes quite some time just to initialize the JVM, whereas Go executables can boot up very fast. Here you can see a comparison of Hello World programs: https://github.com/qznc/hello-benchmark?tab=readme-ov-file#h.... The JVM takes a whopping 596 ms to boot up and print Hello World, whereas Go requires just 149 ms.
I think that's true with the JVM, but the fastest Java solutions are using GraalVM and its ahead-of-time compilation mode to avoid startup time. In addition, while "go run" might take 149ms to compile and run a program, a compiled Go program starts in just a couple of milliseconds.
Thank you for the insights. I would like to add that native compilation with GraalVM sounds excellent in theory, but in all the real-world Java codebases that I have tested for native compilation, it wasn't possible due to some limitations in the code or some dependency (it is not a drop-in replacement).
> the fastest, heavily-optimised Java solution runs in just under a second on my machine
I don't understand how this is possible. The file in question is 13GB, while the fastest commonly available SSDs do 12400 MB/s. Am I missing something?
I think this bit in the baseline section applies to the Java one too:
>Note that that’s a best-of-five measurement, so I’m allowing the file to be cached. Who knows whether Linux will allow all 13GB to be kept in disk cache, though presumably it does, because the first time it took closer to 6 seconds.
Yea, I assumed that. That makes the parallel-version improvements still interesting, but surely it's very artificial. You can't process all the data at the same time if you don't have it all yet.
I love the author’s step-by-step approach as very often it so happens that a hyper-optimized solution may be overfitted to the exact dataset it’s operating on. In each step, the tradeoffs are being explained: what we gain, but also what we lose by stripping the functionality away.
Second article I'm reading on implementing this in Go. Since the temperatures are in the range [-99.9, 99.9] with a tenth of a degree of precision (~2k values), I am surprised that no one has implemented parsing of the numbers using a prepopulated lookup table. It should probably speed things up.
I submitted a github issue on this for the other implementation I looked at here[1].
You could have much simpler parsing than ordinary parsing if you know/assume that you definitely have a valid number from -99.9 to 99.9.
For example, you could find whether it starts with a '-' and where the delimiter is, to know the length of the number's string representation (that's the "simpler parsing" part), and then do no work at all to decode the number: simply use those 1-5 bytes of the string (without sign and separator) directly as an index into a very large, very sparse memory region in which all the valid values are pre-populated with the proper result.
You'd need to allocate 4 or 8 terabytes of virtual memory address space for that lookup table, but you'll touch only a tiny fraction of the memory pages, so it doesn't require an unacceptable amount of physical memory.
If that is faster would seem to depend on if you can get most lookups from the L2 cache. Otherwise you're waiting for main memory, which is a few hundred cycles. Even with multiple loads in parallel, it would be hard to beat arithmetic.
You don't need to cover all bits of the values, just the 10 numeric values that can pass a bounds check. That reduces the space to only 10K elements. With some bit shifting (and pre-validation) that should easily reduce the search space.
Creating a LUT index with bit-shifting is essentially the same as parsing into an integer. Even if the LUT fits in L1 cache, I doubt it would be faster. If it doesn't fit, it's certainly slower.
I'd start here: the ASCII for '9' is 0b00111001, a uint8 9 is 0b00001001 (this was of course deliberate). So ((A & 0b00001111) << 4) + (B & 0b00001111) to get the low byte; the high byte is an exercise for the reader. Then a 16-bit jump table to the value, and if there's a '-' you negate it.
I did it with custom parsing[0] and treated the numbers as 16-bit integers; the representation in the file is not a constant number of bytes, which complicates the table approach. If you end up computing a hash, I think it might be slower than just doing the equivalent parsing I do, and a constant table keyed on four raw bytes will be very large and mostly empty. Maybe a trie would be good.
You can create a perfect hash based on the presence of at least 4 characters. The perfect hash is precalculated based on the possible inputs (-99.9 to 99.9 in bytes). The hash is the usual byte*seed+hash. The "seed" is chosen so that there is no clash (you can find a static seed with a single brute force from 1 to 1M in < 1 min).
I thought the lookup would just be a byte tree, not a hash. Wouldn't a hash, with its calculation, defeat the purpose of being faster than parsing a number?
The idea would be: you have a tree of all values from 0.0 to 99.9 and then just use the bytes to iterate the tree (e.g. in an array) to come up with the int value of e.g. 45.5.
Parsing a number involves an (addition + multiplication) * (number of digits) for each row. If you do a precalculated perfect hash, then the multiplication for each row can be avoided. (You need to read each byte anyway.)
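For what it's worth, here is a rough Go sketch of the nibble/table idea from this subthread (my own illustration, assuming the 1BRC format of one fractional digit and at most two integer digits; whether it actually beats plain add-and-multiply parsing is exactly the open question above):

    package brc

    // tempLUT maps the BCD-packed digits of a temperature ("d.d" or "dd.d")
    // to its value in tenths of a degree. Max index is 0x999 (for "99.9").
    var tempLUT [0x99A]int16

    func init() {
        for hi := 0; hi <= 9; hi++ {
            for mid := 0; mid <= 9; mid++ {
                for lo := 0; lo <= 9; lo++ {
                    tempLUT[hi<<8|mid<<4|lo] = int16(hi*100 + mid*10 + lo)
                }
            }
        }
    }

    // lookupTemp assumes b is a valid "-?d?d.d" temperature and converts it
    // to tenths of a degree using only masks, shifts and one table load.
    func lookupTemp(b []byte) int16 {
        neg := b[0] == '-'
        if neg {
            b = b[1:]
        }
        var idx int
        if len(b) == 3 { // "d.d"
            idx = int(b[0]&0x0F)<<4 | int(b[2]&0x0F)
        } else { // "dd.d"
            idx = int(b[0]&0x0F)<<8 | int(b[1]&0x0F)<<4 | int(b[3]&0x0F)
        }
        v := tempLUT[idx]
        if neg {
            return -v
        }
        return v
    }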
“Very memory constrained” would be a massive factor here, 1BRC is not really constrained (let alone very much so), it has 1 billion rows on a 32GB machine.
People keep forgetting Java is like C and C++, plenty of implementations to choose from, each with its own approach to JIT, AOT, GC and escape analysis.
Regarding Java,
It probably could be done with arrays and object reuse (arenas).
But it's slightly less ergonomic.
And the ecosystem isn't designed for it, so you'd have to implement your own memory-efficient parser.
> I’m in the same ballpark as Alexander Yastrebov’s Go version. His solution looks similar to mine: break the file into chunks, use a custom hash table (he even uses FNV hashing), and parse temperatures as integers. However, he uses memory-mapped files, which I’d ruled out for portability reasons – I’m guessing that’s why his is a bit faster.
I am curious, can it be made even faster than this?
A few highlights, as can be seen in some of the blog posts mentioned by other replies:
- 'Span<T>' to represent chunks of either managed OR unmanaged memory without using unsafe pointers throughout [0][1]
- Not relevant to this task necessarily, but a lot of machinery has been added to allow reuse of objects for tasks like queuing thread pool work, or waiting for an asynchronous result.
- Lots of intrinsics helpers for SIMD workloads, and increased usage of such intrinsics in internal parsers/etc.
- Generally improving a lot of the internal IO to take advantage of other improvements in the runtime.
- PGO (Profile-Guided Optimization) on the JIT side; essentially helps with things like better devirt [2] and other improvements.
- AOT compilation, if that's your thing (I do believe the fastest C# 1BRC submissions use this)
[0] - To be clear, unsafe can still be faster, however for most cases Span is fine and gives you a little more runtime safety.
[1] - You can also grab a Span<T> of a primitive (i.e. int, char) within a method; so long as you don't blow up the stack, this is very nice when you need a small buffer for parsing but don't want to thrash the GC or deal with volatile or locks on some sort of pool.
[2] - Devirt was historically a problem in 'call-heavy' .NET apps when interfaces are used; before PGO, there was more than one library I worked on where we intentionally used abstract base classes rather than interfaces due to the need to squeeze out as much as we could.
The .NET team has been doubling down on performance improvements; people forget the CLR also has features to support C-like languages (hence Managed C++ and C++/CLI), and many of those capabilities are now surfaced in C# as well.
Instead of a hash table, I'd try a sort of "eager" trie inside stack-allocated memory, so I can find the slot for a given station's stats after parsing the minimal number of characters that differentiate this station from the others.
Is there a collection of all the languages that have attempted this challenge? I know comparing and competing languages is somewhat useless, but it is still interesting to me.
The C implementation runs in ~1.7 seconds, but so does Swift [1]. The Dart [2] implementation achieved 1.17 seconds, but the record now looks to be ~0.57 seconds [3].
But I believe looking at this like I do is wrong because they all run on different machines with different performances.
Five minutes! With some silliness like groupby("name").agg(the math stuff). So their testbed (Coiled, a cluster manager?) is meant to do exactly what the challenge requires. And it was data out of S3.
One thing that I caught there was that Parquet serves the same purpose as CSV but for much bigger datasets, and now I want to learn more about it.
My first instinct would be to spin up a local Postgres and keep station data there. A lot of the solutions assume we have enough RAM to keep the stats per station; however, that's a bad assumption when dealing with a lot of data.
> A lot of the solutions assume we have enough ram to keep the stats per station
I'll give this example in Ruby, but you'll get the point. What you are mentioning is an issue if you choose to use File.read(), because it opens the file and reads its whole content into RAM. But this can be solved by using File.foreach, because it streams each row instead, which uses much less RAM.
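The same distinction in Go, as a rough sketch: os.ReadFile would slurp the whole file into memory, while a bufio.Scanner streams it a buffer at a time.

    package main

    import (
        "bufio"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("measurements.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Streaming: reads a buffer at a time instead of 13GB in one go.
        scanner := bufio.NewScanner(f)
        scanner.Buffer(make([]byte, 1<<20), 1<<20) // room for long lines
        for scanner.Scan() {
            line := scanner.Bytes() // only valid until the next Scan call
            _ = line                // aggregate stats per station here
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }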
What I am saying is that the solutions presented assume we have N stations which are way less than 1_000_000_000.
Worst case scenario for RAM would be if each line contained a unique station. In this case we'd have to allocate 1_000_000_000 * sizeof(stats) in RAM to contain the result of the computation.
So most of the solutions assume sufficient RAM ahead of reading the file.
In the first solution:
type stats struct {
min, max, sum float64
count int64
}
would take up 32GB for the stats alone for all 1E9 stations, and that's ignoring space needed for each station's name!
You are optimizing for a hypothetical edge case. There are a few hundred thousand weather stations in the world, or 50,000 airports, or 2 million hotels. Building a system assuming they'll grow to billions is impractical.
Also, even for this, original problem statement sets maximum to 10,000 different station names.
I thought this was an illustrative example of how to process big datasets. We could easily have a statistic per e.g. bitcoin transaction in a different problem, see https://github.com/afiodorov/bitcoin_ancestries .
I struggle a lot with this toy problem. Without constraints it's too trivial to pay attention to, yet no one seems to agree on potential real-world constraints (stdlib only? no mmap?).
If we have to solve just this problem as is, shouldn't we be timing simple solutions using various frameworks (polars, pandas, spark, bigquery[^1], go, awk) to compare various frameworks? Once you have the answer, why would you try to get the same answer again but in 4 seconds the second time round?
Comparing frameworks would at least indicate if a data practitioner should upskill and pick up yet another data framework.
Using Go's profiling tool with its "source" view: I used "go tool pprof -http=: cpu.prof", where cpu.prof was generated by "go-1brc -cpuprofile=cpu.prof -revision=9 measurements.txt".
Unfortunately it doesn't seem to help at all, I think mainly because (at present) Go's PGO basically inlines hot functions, and the important code here is all in one big function.
It is only mildly effective because of how anemic the Go compiler is. And even then it's extremely limited. If you want to see actually good implementations, look into what the OpenJDK HotSpot and .NET JIT compilers do with runtime profiling and recompilation (.NET calls it Dynamic PGO).
One "trick" not tried in the article that I have used is to gzip the data (so you have a .csv.gz) and stream that instead of the raw file. I find it reduces a large amount of the disk read (as you have 6-10x less to read).
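A rough sketch of that setup in Go (assuming a pre-compressed measurements.txt.gz; whether it's a net win depends on how fast your cores can inflate versus how fast your disk can read):

    package main

    import (
        "bufio"
        "compress/gzip"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("measurements.txt.gz")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Decompress on the fly: trades CPU for a 6-10x smaller disk read.
        zr, err := gzip.NewReader(f)
        if err != nil {
            log.Fatal(err)
        }
        defer zr.Close()

        scanner := bufio.NewScanner(zr)
        scanner.Buffer(make([]byte, 1<<20), 1<<20)
        for scanner.Scan() {
            _ = scanner.Bytes() // process each line as before
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }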
Performance is great, but I would imagine the parallelized version requires a significantly higher minimum amount of RAM than the non-parallelized ways... He claims that each more performant solution costs less compute than the one before it, but in the case of parallelization it's the same amount of compute in a shorter amount of time, right?
There are a few rust solutions in the "Show and Tell" linked above, for example this fairly readable one at 15.5s: https://github.com/gunnarmorling/1brc/discussions/57
A comment above referencing Python "polars" actually has Rust polars, std, and SIMD solutions as well (SIMD was fastest, but less readable for a hobbyist like me).
I double-checked and he mentioned a couple of other points. One was incrementally hashing the keys to reduce double reads. The other was about storing pointers to the value structs more efficiently.
I encourage implementing maps and other data structures btw, I was just curious.
The author has already converted the code to using a pointer to the value struct for storing in the standard Go hash table in Solution 2.
Solution 7 contains the code and description for the custom hash table.
I can see how interleaving/inlining the hash generation of the station name/key with the search for the separator reduces the number of scans over the bytes from 2-3x to just 1x.
The second point in Solution 7 was using a byte slice into the underlying buffer when the station name is found in the buffer, instead of creating a new string. This saves a memory allocation.
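That interleaving looks roughly like this (my own sketch using FNV-1a, the same family as the article's custom hash table; the real solution's details may differ):

    package brc

    // hashStation scans for the ';' separator and hashes the station name
    // in the same pass, so the name bytes are only read once.
    func hashStation(line []byte) (name []byte, hash uint64, rest []byte) {
        const (
            fnvOffset = 14695981039346656037 // standard FNV-1a 64-bit offset basis
            fnvPrime  = 1099511628211        // standard FNV-1a 64-bit prime
        )
        hash = fnvOffset
        for i, b := range line {
            if b == ';' {
                return line[:i], hash, line[i+1:]
            }
            hash ^= uint64(b) // FNV-1a: XOR the byte, then multiply
            hash *= fnvPrime
        }
        return line, hash, nil
    }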
You can't just throw hardware at this one to get to 4s. At least not in 2024.
The author's naive single-threaded Go solution took 1m45s on an "amd64 laptop with fast SSD drive and 32GB of RAM."
So, you'd need something 25x faster than his setup in terms of single-threaded performance. Let us know when you've upgraded your hardware to the equivalent of a 75GHz AMD processor with memory and SSD bandwidth to match!
The nice thing about a GPU solution (e.g. Python dataframes in cuDF, just a few LOC) is that these generally come down to your IO bandwidth: a single 2GB/s SSD, to 16-32 GB/s PCIe, to 1-2 GPUs running crazy fast. And then buy more cheap SSDs to chain together before buying more/better GPUs :)
I guess it depends on what we mean by "throwing hardware at it."
GPUs aren't magic. You still need to come up with a parallelizable algorithm.
The TL;DR is that the fastest solutions are basically map/reduce with a bunch of microoptimizations for parsing each line.
But before you do that, you need to divide up the work. You can't just give each core `file_size_bytes/core_count` chunks of the file because those chunks won't align with the line breaks. So, you need to be clever about that part somehow.
Once you've done that, you have a nice map/reduce that should scale linearly up to at least 20 or 30 cores. So in that sense, you can "throw hardware at it."
Whether or not any of that is a good fit for GPU acceleration, I don't know.
You should try the challenge. It's trickier than you think but surprisingly fun.
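On the chunk-alignment point above, the usual trick is to cut at approximate offsets and then nudge each boundary forward to the next newline. A rough sketch (my own, not from any particular submission):

    package brc

    import (
        "bytes"
        "fmt"
        "io"
        "os"
    )

    // splitOffsets returns one [start, end) byte range per worker, where every
    // internal boundary is moved forward to just after a '\n', so no line is
    // ever split across two workers.
    func splitOffsets(f *os.File, size int64, workers int) ([][2]int64, error) {
        chunk := size / int64(workers)
        offsets := make([][2]int64, 0, workers)
        buf := make([]byte, 1024) // assumed longer than the longest possible line
        start := int64(0)
        for i := 0; i < workers; i++ {
            end := start + chunk
            if i == workers-1 || end >= size {
                offsets = append(offsets, [2]int64{start, size})
                break
            }
            // Peek just past the approximate boundary and extend to the next newline.
            n, err := f.ReadAt(buf, end)
            if err != nil && err != io.EOF {
                return nil, err
            }
            j := bytes.IndexByte(buf[:n], '\n')
            if j < 0 {
                return nil, fmt.Errorf("no newline within %d bytes of offset %d", len(buf), end)
            }
            end += int64(j) + 1
            offsets = append(offsets, [2]int64{start, end})
            start = end
        }
        return offsets, nil
    }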
Your intuition about mapping to kernels is good. Basically all SQL, Polars, DuckDB, Pandas, etc. operators are pretty directly mappable to optimized GPU operators nowadays. This includes GPU-accelerated CSV/Parquet parsing. This was theoretically true starting maybe 10 years ago, and implemented in practice about 3-5 years ago. These systems allow escape hatches via Numba JIT etc. to do custom kernels, but it's better to stay within the pure SQL/pandas/etc. subsets, which are already mapped to more careful kernels.
To get a feel for times, I like to think about 2 classes: constant overheads and throughput
Constant overhead:
- JIT'ing. By using pure SQL/pandas/etc, you can avoid most CUDA JIT costs
- GPU context creation etc.: similar - after startup, an initial memory pool is allocated and then gets reused
- Instruction passing: The pandas API is 'eager', so "df1() + df2()" may have a lot of back-and-forth of instructions between CPU<>GPU even if the data doesn't move. Dask & Polars introduce lazy semantics that allow fusion, but GPU implementations haven't leveraged that yet AFAICT.
Bandwidth limits:
- SSD is the biggest killer. Even "Expensive" SSDs are still < 10GB/s, so you need to chain a bunch to get 100B/s ingest
- CPU pathways throttle things down again (latency+bandwidth): GDS/GDN lets you skip them
- PCIe cards are surprisingly fast nowadays. With PCIe 5+, the bottleneck quickly gets pushed back to the storage, and it's probably easier to buy more PCIe+GPU pairs than to make individual ones go faster for most workloads
- Once things hit the GPU, things are fast :)
4s is a LOT of time relative to what even commodity GPU hardware can do, so benchmarks showing software failing to saturate it are fascinating to diagnose
There aren't any generally-available CPUs that are substantially faster today than were available ten years ago. Maybe double the speed per core, triple at best.
After that, throwing more cores at it also rapidly runs out of steam because parallel code has its own overheads. Any shared state instantly kills performance, no matter the language. Very clever tricks have to be used to get decent scaling past 64 hardware threads (32 cores), and going past 256 is surprisingly difficult. You start having to worry about NUMA, IRQ steering, and core pinning. Bandwidth gets to be an issue, even to L3 and L4 cache, let alone out to main memory.
This notion that you can just "dial up" hardware performance to infinity as a fix for any amount of developer laziness needs to die.
The effort and long time it took Go to get to something that is 3-6x slower than other, better languages should be an important reminder to everyone assuming it belongs in the same weight class as Rust, C# or Java.
Oh, and what does the basic research you speak of constitute? Surely you looked at the assembly emitted by the compilers for these languages and the HPC-adjacent APIs each of them offers? No? Then let me tell you: Go is pretty much consigned to using its special flavour of non-portable, bespoke, hand-written assembly, which is the only way to access the SIMD instructions necessary to achieve optimal hardware utilization in this benchmark. This takes a lot of effort and skill, so, as you may have noticed, if you can't do it, Go simply cannot come close to the better options you can see on the benchmark chart.
And yet, this is something that can be trivially done in C#, C++ and Rust (albeit C# has the best UX, with a cross-platform SIMD API introduced in .NET 7, and C++ a close second with its own take on this still in preview). Java OTOH manages to stay in the same category by having an extremely advanced JIT that gives it comparable codegen quality even though it lacks a comparable SIMD API for now (Panama vectors are problematic currently), so benchmark implementations using it are forced to do SWAR.
My main gripe is of course the extremely common misconception about Go's speed, which it just does not have the moment you write anything sufficiently advanced or want to express a particular problem more tersely than writing thousands of open-coded loops.