I found this super interesting - especially as all the data I've written code to manipulate has been small enough that I haven't needed to optimize my code, so I've never had to think in this direction.
I think my favorite part was the very first section, where he got baseline measurements with `cat`, `wc`, and friends. I wouldn't have thought to do that and it's such an easy way to get a perspective on what's "reasonable".
It also underscores just how insane raw disk bandwidth is on modern SSDs. Most software is not designed around a world where you have gigabytes a second of sequential scan on a laptop.
Yes, though on a laptop it's unlikely the entire file is being cached in DRAM. Also, SSDs with multiple GiB/sec of bandwidth are indeed common, which was my basic point in comparing against how much slower sed/awk-style processing is.
A few months ago, I had to quickly bang out a script to output about 20 million lines of text, each the output of a hash function. My naive solution took more than a few minutes - simple optimizations such as writing every 10k lines cut the time significantly. Threading would have helped quite a bit as well.
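For what it's worth, a minimal Go sketch of that batching idea (the original script's language and hash aren't specified, so the hex-encoded SHA-256 rows here are just an assumption); the big win is usually buffering output instead of issuing one write per line:

    package main

    import (
        "bufio"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "os"
    )

    func main() {
        // Buffer output so we don't pay for a write syscall per line.
        w := bufio.NewWriterSize(os.Stdout, 1<<20)
        defer w.Flush()

        for i := 0; i < 20_000_000; i++ {
            sum := sha256.Sum256([]byte(fmt.Sprintf("row-%d", i)))
            w.WriteString(hex.EncodeToString(sum[:]))
            w.WriteByte('\n')
        }
    }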
I’m kind of interested in the opposite problem, what is the simplest solution using a well known library/db that approaches the fastest hand optimized solution to this problem?
Java is often slightly faster than Go; it has similar constructs (perhaps an older, better-optimized Map), perhaps a better GC (older, more optimized) - though I don't think the GC is a challenge here - and slower startup times. So I'd say roughly the same as the idiomatic Go version?
Java can be faster than Go, but it comes at the same cost as most "faster" things in software: Java uses significantly more memory.
Java is also more mature, which means you are entering a massive package-bloat setup that has evolved over the years to work for everyone's wild and varied needs. By the time you have your database, cache, http/other handlers, tests, fixtures, metrics, logging, tracing, etc... set up, you're looking at a scary pile of dependencies spanning thousands of classes that would make even NPM jealous.
Java's GC is just incomparably better; literally every piece of research on the topic is written for Java. Though it's true that, without value types, Java does rely somewhat more on it.
Sounds very reasonable. In the blog post, about 20s were shaved off by assuming we don't need complicated string parsing. An off-the-shelf library can't make that assumption, so it will always have to pay the extra cost.
True in general, but some (especially libs aimed at larger datasets processed in batch) are taking advantage of benchmarks like this to do things like:
- Try the fast way; if it works, great.
- Fall back to the slow way if the fast way fails.
This makes the slow path 2x slower at worst (and you can advise always using the slow way with optional params), but the fast path can be 10x faster (rough sketch below).
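A minimal sketch of that shape in Go (a hypothetical semicolon-separated record parser, not from any real library): check cheaply whether the fast path's assumption holds, and only fall back to the general machinery when it doesn't.

    package fastpath

    import (
        "bytes"
        "encoding/csv"
    )

    // splitFields parses one semicolon-separated record. Fast path: assume no
    // quoting or escaping and do a plain split; slow path: fall back to the
    // standard library's CSV state machine when a quote shows up.
    func splitFields(line []byte) ([]string, error) {
        if !bytes.ContainsRune(line, '"') {
            parts := bytes.Split(line, []byte{';'})
            out := make([]string, len(parts))
            for i, p := range parts {
                out[i] = string(p)
            }
            return out, nil
        }
        r := csv.NewReader(bytes.NewReader(line))
        r.Comma = ';'
        return r.Read()
    }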
For anyone looking for more examples of 1BRC in Go, we had a friendly competition at work and collected the results here: https://github.com/dhartunian/1brcgo/
In addition to the loop-unrolling and bit-twiddling tricks that also show up in the fastest Java and C++ versions, some Go-specific things I learned were (a few of these are sketched after the list):
- unsafe.Pointer can be used to read memory without bounds checks
- many functions in the bytes and bits packages in the standard library are written in assembly
- debug.SetGCPercent and SetMemoryLimit to turn off GC
- runtime.LockOSThread to lock a goroutine to a thread
- print is slightly faster than fmt.Printf (but writes to stderr)
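A rough sketch of a few of those knobs used together (illustrative only, not from the repo; the unsafe load assumes amd64, where unaligned 8-byte reads are fine):

    package main

    import (
        "fmt"
        "math"
        "runtime"
        "runtime/debug"
        "unsafe"
    )

    func main() {
        debug.SetGCPercent(-1)              // disable GC pacing entirely
        debug.SetMemoryLimit(math.MaxInt64) // keep the soft memory limit out of the way (MaxInt64 is the default)
        runtime.LockOSThread()              // pin this goroutine to its OS thread

        buf := []byte("Hamburg;12.0\n")
        // unsafe.Pointer: load 8 bytes in one go, with no per-byte bounds checks.
        word := *(*uint64)(unsafe.Pointer(&buf[0]))
        fmt.Printf("%#016x\n", word)
    }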
Oh, I'd missed those solutions, thanks. You guys got way more hard core than I did -- nice work! Looking forward to reading the code for those solutions this week.
I think we all ended up using unsafe, though there were some solutions without mmap. It would have been interesting if we had adhered to the same constraints you did!
Well, I guess it's more that the standard library doesn't have a cross-platform way to access them, not that memory-mapped files themselves can't be done on (say) Windows. It looks like there's a fairly popular 3rd party package that supports at least Linux, macOS, and Windows: https://github.com/edsrzf/mmap-go
Not all filesystems support memory-mapped files equally, and for some that do, the support comes with caveats and could be slower than non-memory-mapped access.
> I find this report confusing. Why does if items[hashIndex].key == nil show as taking 5.01s, but the call to bytes.Equal shows as only 390ms. Surely a slice lookup is much cheaper than a function call? If you are a Go performance expert and can help me interpret it, I’m all ears!
These two lines are both conditionals, so the time reported is sensitive to branch mispredictions. If the timings are not intuitive based on the complexity of the associated lines, then it may be explained by the data being not very predictable and the branch predictor having a bad time.
Yeah, someone emailed me privately after they'd dug into this. They mentioned that "items[hashIndex]" was a significant source of cache misses. They used "perf record -e cache-misses" and found it was the largest source of cache misses. They also found (by digging into the assembly) that the "bytes.Equal" line has some sort of source-line attribution issue.
I love the nerdery around 1BRC. My axe to grind is that unless you do dangerous stuff DBs are just as fast, less complicated, and more resilient to data updates than application code [0]. Do more in the database!
I agree with doing more in the database, you're closer to the data (disk/disk cache/L2 cache) than the application code is. At the same time I get really nervous around doing work in the database because you have to be really disciplined that the in-database code (functions, views, etc) match the code in source control. Also that your testing/QA database contains all the same code and enough test data to actually exercise the performance bounds of that code.
With application code I can easily step through it in a debugger and verify the deployed code matches what's in the repo. It's more difficult to do in the database because it requires more organizational discipline.
Yeah you need some different tools when working in data. I've been recommending dbt [0] as kind of the on-ramp for SREs to data work. Among other things it allows you to keep your work in a VCS and has a testing framework.
I'm not sure this part of the TOS is valid in many jurisdictions. But there is a better reason to not publish benchmarks: they do not deserve free advertisement. We should just collectively forget they exist and use other tools.
I wanted to see how the temperature was changing over time for specific regions using a map-based interface. One chart in particular was eye-opening.
The software won a couple of awards and was heavily optimized to produce reports in under a minute. Kudos to the author for getting a parse time of a billion records down to mere seconds.
It's worth noting that if you're messing around with large text files from the CLI, awk, grep, etc. will be an order of magnitude faster if you opt out of Unicode parsing.
I'm pretty confident adding LC_ALL=C to the awk solution would get it easily under a minute.
I just want to go on record that, given the simplistic (i.e. fun) problem, a shell developer would have had the answer for the first, specific set of a billion rows done while all the other languages were still putting on their shoes.
That's only a valid comparison if the "fastest Java" and "fastest Go" implementations are either the same or at the limit of what each language allows.
The more interesting comparison anyway is performance of the straightforward, idiomatic code, since that's what we all write 99% of the time.
Here's the key insight from the article: "Processing the input file in parallel provides a huge win over r1, taking the time from 1 minute 45 seconds to 24.3 seconds. For comparison, the previous “optimised non-parallel” version, solution 7, took 25.8 seconds. So for this case, parallelisation is a bit faster than optimisation – and quite a bit simpler."
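The structure behind that win is roughly the following (my own simplified sketch, not the article's code: the chunk splitting and fast parsing are elided, and the stats/processChunk names are made up). Each worker aggregates into a private map, so the hot loop never touches shared state or locks.

    package brc

    import (
        "bytes"
        "math"
        "strconv"
    )

    type stats struct {
        min, max, sum float64
        count         int64
    }

    // processChunk aggregates one chunk of "Station;temp\n" lines into a
    // private map (simplified: strconv instead of a hand-rolled parser).
    func processChunk(chunk []byte) map[string]*stats {
        out := make(map[string]*stats)
        for _, line := range bytes.Split(chunk, []byte{'\n'}) {
            i := bytes.IndexByte(line, ';')
            if i < 0 {
                continue
            }
            temp, err := strconv.ParseFloat(string(line[i+1:]), 64)
            if err != nil {
                continue
            }
            name := string(line[:i])
            s, ok := out[name]
            if !ok {
                out[name] = &stats{min: temp, max: temp, sum: temp, count: 1}
                continue
            }
            s.min = math.Min(s.min, temp)
            s.max = math.Max(s.max, temp)
            s.sum += temp
            s.count++
        }
        return out
    }

    // process fans the chunks out to goroutines and merges their private
    // maps at the end.
    func process(chunks [][]byte) map[string]*stats {
        results := make(chan map[string]*stats, len(chunks))
        for _, chunk := range chunks {
            go func(c []byte) { results <- processChunk(c) }(chunk)
        }
        merged := make(map[string]*stats)
        for range chunks {
            for name, s := range <-results {
                m, ok := merged[name]
                if !ok {
                    merged[name] = s
                    continue
                }
                m.min = math.Min(m.min, s.min)
                m.max = math.Max(m.max, s.max)
                m.sum += s.sum
                m.count += s.count
            }
        }
        return merged
    }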
The Java version is written by the lead researcher and founder of GraalVM at Oracle Labs. It is really native AOT-compiled code, comparable to the best in C++/Rust.
It is a Java entry because the language used is Java, but the final compiled artifact is far, far away from a typical compiled Java artifact.
The Java version does additional optimizations that his Go version doesn't do and he mentions that at the end of the post. The Java version is really optimized and is an interesting read.
The more accurate statement would be that the Go implementation is incapable of accessing optimizations that exist in Java, and in turn Java is incapable of optimizations performed by the C# and C++ implementations.
Until some time ago, Go did not even have inlining-profitability logic in the compiler and could only inline leaf functions - something worse than what JVM implementations could do in... 2005 or so, I think?
Synthetic benchmarks aside, I think as far as average (Spring Boots of the world) code goes, Go beats Java almost every time, often in fewer lines than the usual pom.xml.
And yet, people will refer to some other synthetic benchmark when arguing that the average Java project, which doesn't disable GC nor uses unsafe reads et al, is comparable.
Um, GraalVM (JIT) has both GC and memory safety. It's a drop-in replacement for HotSpot. It takes literally five seconds to change JAVA_HOME and you're done.
The benchmark literally uses Graal, something from Java, for Ruby and Python, but inexplicably doesn't use it for Java itself.
Still loses to .NET. On the reference host, Java is still closer to the 1.7-2s ballpark (and has to use awkward SWAR to get there), while the fastest solution in C# is 1.2s, beating C++ (the code can be ported, however).
But yes, "I expected Go to win..." is exactly the core of the problem here. Same as with e.g. Swift, which people expect to perform on the level of Rust, when it is even slower than Go. The intuition caused by common misconceptions just does not correspond to reality sadly.
No. What it actually demonstrates is that people didn’t read the source material properly.
The Java and Go versions use different optimisations. There’s nothing stopping either language from using the same optimisations as the other. It just wasn’t something their respective authors cared to try in their respective exercises.
There, however, is something stopping Go from using optimizations present in Java or C#/C++/Rust: the lack of a SIMD API (short of dropping down to hand-written assembly) and an overall much weaker compiler. This puts a much greater burden on the programmer to match the performance while staying with Go.
> There, however, is something stopping Go from using optimizations present in Java or C#/C++/Rust
...
> This puts much greater burden on the programmer to match the performance while staying with Go.
Your second statement contradicts your first. You're not stopped from using SIMD in Go. There are in fact several third-party libraries out there for using SIMD; it's just not part of the standard library. So you can still use SIMD in Go without writing Go's dialect of assembly.
It's also worth noting that SIMD isn't even due to drop into std in C++ until C++26. At the moment you either have to use experimental or a 3rd party library.
You're also missing the point of these examples. Nobody writing these exercises is trying to claim that all languages are equal, and they're certainly not trying to write idiomatic code either. They're just fun little exercises demonstrating the degree of optimisation one can go through. You wouldn't want to write code like the examples in all but the tiniest of scenarios.
It's silly the number of people over-analysing what is essentially just a game, and then arguing it's "proof" of their biases towards different programming languages.
I think the reason why this misconception is so widespread is that there is a grain of truth in it: almost everyone sees Java as synonymous with gigantic frameworks like Spring, Quarkus, etc.
In Go you've got your standard libraries; these are generally quicker than the Java equivalents, simply because they do less in the lifecycle of the operation.
This lets Java do funky stuff like enabling full JVM/code tracing just by adding a jar file at runtime. But it does come with a performance penalty.
One has to differentiate here a bit. Java's JIT technology has become really great: highly optimized native code generation which hugely benefits from the ability to use live profiling data to optimize the code. This is why it often beats static compilers at generating faster code. The static compilers can only optimize for the range of possible data; the JIT can optimize based on the data presented to the program.
On the downside, there are quite a few features of the Java language and the JVM which often make programs slow - like a lot of details of the object model, the lack of value classes, JIT compilation which takes time on startup, etc. Also, a lot of Java libraries are pretty heavyweight.
Go is quite different here. It is statically compiled, which allows for fast program startup, and the language model makes it quite easy to naively write programs which perform reasonably fast. The downside is that the compiler is static and not as heavily optimizing as other static compilers, in exchange for fast compilation speed. However, recently the ability was added to use profiling data to optimize the compilation.
> JVM has always been on par if not often faster than hand-written C code.
I have never found this to be true.
There are a few programs (Talend, MySQL workshop) written in Java that I sometimes have to use, and I avoid them as much as I can, because they are slow, bloated, and eat lots of memory.
Java's new operator is faster than malloc, because Java allocates all the memory it needs from the OS before starting the program, and malloc is extremely slow. C++'s new operator uses malloc, so it is also slow.
Stack allocation in a C program is hundreds of times faster than using new in Java.
And that's just one feature; there are many other features, like C++ compile-time calculations in templates, that simply have no Java equivalent.
Please, you've been reading too much PR from the Java side and not looking at benchmarks and real-world performance enough. What you're claiming is inherently not possible, cherry-picked benchmarks notwithstanding.
Can you explain why it's not technically possible?
JVM has had decades of experience at optimally translating bytecode to machine code and can take advantage of SIMD, AVX etc when needed. Most hand-written C code is far from optimal.
A couple of weeks ago I managed to get a nice setup for viewing Java JIT disassembly[1], and did a bunch of ad-hoc tests on OpenJDK 21 (and some on whatever build of 23 was on openjdk.org). They do manage to vectorize a decent amount of vectorizable loops, but semi-often miss some improvements, and some trivial things didn't vectorize at all (most amusingly, a loop summing an int[] to an int didn't get vectorized, while summing an int[] to a long did). Scalar codegen is also quite mediocre.
GCC and clang, in comparison, often produce assembly which I'd consider optimal, or close to it, for vectorizable loops, and are noticeably better than OpenJDK for scalar code.
GCC and clang produce far from optimal code when vectorizing. Anyone doing serious work is unlikely to rely on the autovectorizer without consulting the output religiously.
There are certainly a lot of things they may do suboptimally (in my experience: when there's register pressure; things that aren't nicely elementwise (bit arrays); preparation & tail handling can get pretty excessive; basically no tuning for making better use of ports) and of course you should always be checking the output (this is true for any optimization effort, vectorized or not), but for most things IME they do just fine (assuming the thing in question is vectorized at all). At the very least it's still miles above OpenJDK.
I've only really seen decent results on the simplest of examples, like a bounded loop with a basic operation in it (like, adding all the numbers in an array). As soon as you go beyond that I see much worse code: reloading things inside the loop body that really should have been hoisted out, spilling for no good reason, weird contortions to shove certain things into scalar registers in the middle only to take them out again. And, of course, if your control flow is not trivial the compiler will often not recognize the loop as vectorizable at all.
Unknown-bound loops are not vectorizable in general (at least not sanely; you'd need masked stores, aligned writes, and a bunch of handling for different cases of bound computation). I haven't seen many cases of missed hoisting (outside of register pressure, where it's not really "missed") or pointless moving to scalar (at least with clang; I've looked at GCC much less), nor has control flow been much of an issue; both compilers can handle conditional loads & stores where available.
C compilers also have decades of experience optimally translating C code into machine code, and they are arguably more capable of emitting SIMD (good luck trying to use cutting edge AVX-512 intrinsics like vpopcntdq with the JVM). The fact is that there is nothing a JIT compiler can do that an AOT compiler can't do, but in the case of AOT, the resources spent compiling the code are amortized to effectively 0, whereas that resource cost is borne upon every program startup for a JIT engine.
That's not necessarily true on the JIT vs AOT split. I'm mostly going off of how the divergence in available optimizations is starting to look in .NET after the introduction of Native AOT, with light research into LLVM and various optimization-adjacent Rust crates.
In particular, with JIT, you are able to initialize certain readonly data once, and then, on recompilation to a more optimized version, bake such data as JIT constants right into emitted machine code. This is not possible with AOT. Same applies for all kinds of in-runtime profiling/analysis and recompilation to incorporate a collected profile according to this exact run of an application. JIT also offers the ability to load modules dynamically in the form of bytecode without having to have a strict machine-level ABI, only the bytecode one, which allows for efficient generics that cross modules, as well as cross-module function inlining. And last but not least - there is no need to pick the least common denominator in supported hardware features as the code can be compiled to use the latest features provided by hardware like AVX512.
On the other hand, pure AOT means a frozen world, which allows the compiler to know the exact types and paths the code can take, performing exact devirtualization and much more aggressive preinitialization of code that accepts constant data. It also means bigger leeway in the time the compiler can spend on optimizing code. Historically, GCC and LLVM have been more advanced than their JIT counterparts because of different tradeoffs favouring absolute performance of the emitted code, as well as simply a higher number of man-hours invested in developing them (e.g. .NET punches above its weight class despite being worked on by a smaller team vs OpenJDK or LLVM).
C compilers only have one opportunity to do that, at compile time; if the developer was lucky with the data set used to train the PGO output, maybe the outcome is greatly improved.
Modern JVMs not only have a JIT able to use actual production data, they are also able to cache PGO data between execution runs and reach an optimal set of heuristics over the execution time.
And on Android, those PGO files are even shared between devices via Play Store.
Of course it's inherently possible. The code is running on the same chip, it is inherently possible for a Foo compiler to emit the same machine code as a Bar compiler for the same algorithm. Foo being Java and Bar being C doesn't change this.
You might mean it's impractical? Or that it happens to not be true in the general case?
Your expectation is correct. The Java version is more tuned. Here https://github.com/dhartunian/1brcgo/?tab=readme-ov-file#lea..., you can find a version that runs in 1.108s and is almost 3x better than the one you quoted. I think it can be reduced even further. In the end, it depends on how fast the executable can boot up and execute the code. At some point, the JVM will lose because it takes quite some time just to initialize the JVM, whereas Go executables can boot up very fast. Here you can see a comparison of Hello World programs: https://github.com/qznc/hello-benchmark?tab=readme-ov-file#h.... The JVM takes a whopping 596 ms to boot up and print Hello World, whereas Go requires just 149 ms.
I think that's true with the JVM, but the fastest Java solutions are using GraalVM and its ahead-of-time compilation mode to avoid startup time. In addition, while "go run" might take 149ms to compile and run a program, a compiled Go program starts in just a couple of milliseconds.
Thank you for the insights. I would like to add that native compilation with GraalVM sounds excellent in theory, but in all the real-world Java codebases that I have tested for native compilation, it wasn't possible due to some limitations in the code or some dependency (it is not a drop-in replacement).
> the fastest, heavily-optimised Java solution runs in just under a second on my machine
I don't understand how this is possible. The file in question is 13GB, while the fastest commonly available SSDs do 12400 MB/s. Am I missing something?
I think this bit in the baseline section applies to the Java one too:
>Note that that’s a best-of-five measurement, so I’m allowing the file to be cached. Who knows whether Linux will allow all 13GB to be kept in disk cache, though presumably it does, because the first time it took closer to 6 seconds.
Yea, I assumed that. That makes the parallel-version improvements still interesting, but surely it's very artificial. You can't process all the data at the same time if you don't have it all yet.
I love the author’s step-by-step approach as very often it so happens that a hyper-optimized solution may be overfitted to the exact dataset it’s operating on. In each step, the tradeoffs are being explained: what we gain, but also what we lose by stripping the functionality away.
Second article I'm reading on implementing this in Go. Since the temperatures are in the range [-99.9, 99.9] with a tenth of a degree of precision (~2k values), I am surprised that no one has implemented parsing of the numbers using a prepopulated lookup table. It should probably speed things up.
I submitted a github issue on this for the other implementation I looked at here[1].
You could have much simpler parsing than ordinary parsing if you know/assume that you definitely have a valid number from -99.9 to 99.9.
For example, you could find whether it starts with a '-' and where the delimiter is, to know the length of the number's string representation (that's the "simpler parsing" part), and then do no work at all to decode the number: simply use those 1-5 bytes of the string (without sign and separator) directly as an index into a very large, very sparse memory region in which all the valid values are pre-populated with the proper result.
You'd need to allocate 4 or 8 terabytes of virtual memory address space for that lookup table, but you'll touch only a tiny fraction of the memory pages, so it doesn't require an unacceptable amount of physical memory.
If that is faster would seem to depend on if you can get most lookups from the L2 cache. Otherwise you're waiting for main memory, which is a few hundred cycles. Even with multiple loads in parallel, it would be hard to beat arithmetic.
You don't need to cover all bits of the values, just the 10 numeric values that can pass a bounds check. That reduces the space to only 10K elements. With some bit shifting (and pre-validation) that should easily reduce the search space.
Creating a LUT index with bit-shifting is essentially the same as parsing into an integer. Even if the LUT fits in L1 cache, I doubt it would be faster. If it doesn't fit, it's certainly slower.
I'd start here: the ASCII for '9' is 0b00111001, a uint8 9 is 0b00001001 (this was of course deliberate). So ((A & 0b00001111) << 4) + (B & 0b00001111) to get the low byte; the high byte is an exercise for the reader. Then a 16-bit jump table to the value, and if there's a '-' you negate it.
I did it with custom parsing[0] and treated the numbers as 16-bit integers; the representation in the file is not a constant number of bytes, which complicates the table approach. If you end up computing a hash, I think it might be slower than just doing the equivalent parsing I do, and a constant table keyed on four raw bytes will be very large and mostly empty. Maybe a trie would be good.
You can create a perfect hash based on the presence of at least 4 characters. The perfect hash is precalculated based on the possible inputs (-99.9 to 99.9 in bytes). The hash is the usual byte*seed+hash. The "seed" is chosen so that there is no clash (you can find a static seed with a single brute force from 1 to 1M in < 1 min).
I thought the lookup would just be a byte tree, not a hash. Wouldn't a hash, with its calculation, defeat the purpose of being faster than parsing a number?
The idea would be: you have a tree of all values from 0.0 to 99.9 and then just use the bytes to iterate the tree (e.g. in an array) to come up with the int value of e.g. 45.5.
Parsing a number involves an (addition + multiplication) * (number of digits) for each row. If you do a precalculated perfect hash, then the multiplication for each row can be avoided. (You need to read each byte anyway.)
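For what it's worth, here is a rough Go sketch of the nibble/table idea from this subthread (my own illustration, assuming the 1BRC format of one fractional digit and at most two integer digits; whether it actually beats plain add-and-multiply parsing is exactly the open question above):

    package brc

    // tempLUT maps the BCD-packed digits of a temperature ("d.d" or "dd.d")
    // to its value in tenths of a degree. Max index is 0x999 (for "99.9").
    var tempLUT [0x99A]int16

    func init() {
        for hi := 0; hi <= 9; hi++ {
            for mid := 0; mid <= 9; mid++ {
                for lo := 0; lo <= 9; lo++ {
                    tempLUT[hi<<8|mid<<4|lo] = int16(hi*100 + mid*10 + lo)
                }
            }
        }
    }

    // lookupTemp assumes b is a valid "-?d?d.d" temperature and converts it
    // to tenths of a degree using only masks, shifts and one table load.
    func lookupTemp(b []byte) int16 {
        neg := b[0] == '-'
        if neg {
            b = b[1:]
        }
        var idx int
        if len(b) == 3 { // "d.d"
            idx = int(b[0]&0x0F)<<4 | int(b[2]&0x0F)
        } else { // "dd.d"
            idx = int(b[0]&0x0F)<<8 | int(b[1]&0x0F)<<4 | int(b[3]&0x0F)
        }
        v := tempLUT[idx]
        if neg {
            return -v
        }
        return v
    }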
“Very memory constrained” would be a massive factor here, 1BRC is not really constrained (let alone very much so), it has 1 billion rows on a 32GB machine.
People keep forgetting Java is like C and C++, plenty of implementations to choose from, each with its own approach to JIT, AOT, GC and escape analysis.
Regarding Java,
It probably could be done with arrays and object reuse (arenas).
But it's slightly less ergonomic.
And the ecosystem isn't designed for it, so you'd have to implement your own memory-efficient parser.
> I’m in the same ballpark as Alexander Yastrebov’s Go version. His solution looks similar to mine: break the file into chunks, use a custom hash table (he even uses FNV hashing), and parse temperatures as integers. However, he uses memory-mapped files, which I’d ruled out for portability reasons – I’m guessing that’s why his is a bit faster.
I am curious, can it be made even faster than this?
A few highlights, as can be seen in some of the blog posts mentioned by other replies:
- 'Span<T>' to represent chunks of either managed OR unmanaged memory without using unsafe pointers throughout [0][1]
- Not relevant to this task necessarily, but a lot of machinery has been added to allow reuse of objects for tasks like queuing thread pool work, or waiting for an asynchronous result.
- Lots of intrinsics helpers for SIMD workloads, and increased usage of such intrinsics in internal parsers/etc.
- Generally improving a lot of the internal IO to take advantage of other improvements in the runtime.
- PGO (Profile-Guided Optimization) on the JIT side; essentially helps with things like better devirt [2] and other improvements.
- AOT compilation, if that's your thing (I do believe the fastest C# 1BRC submissions use this)
[0] - To be clear, unsafe can still be faster, however for most cases Span is fine and gives you a little more runtime safety.
[1] - You can also grab a Span<T> of a primitive (i.e. int, char) within a method; so long as you don't blow up the stack, this is very nice when you need a small buffer for parsing but don't want to thrash the GC or deal with volatile or locks on some sort of pool.
[2] - Devirt was historically a problem in 'call-heavy' .NET apps when interfaces are used; before PGO, there was more than one library I worked on where we intentionally used abstract base classes rather than interfaces due to the need to squeeze out as much as we could.
The .NET team has been doubling down on performance improvements; people forget the CLR also has features to support C-like languages (hence Managed C++ and C++/CLI), and many of those capabilities are now surfaced in C# as well.
Instead of a hash table, I'd try a sort of "eager" trie inside stack-allocated memory, so I can find the slot for a given station's stats after parsing the minimal number of characters that differentiate this station from the others.
Is there a collection of all the languages that have attempted this challenge? I know comparing and competing languages is somewhat useless, but it is still interesting to me.
The C implementation runs in ~1.7 seconds, but so does Swift [1]. The Dart [2] implementation achieved 1.17 seconds, but the record now looks to be ~0.57 seconds [3].
But I believe looking at this like I do is wrong because they all run on different machines with different performances.
Five minutes! With some silliness like groupby("name").agg(the math stuff). So their testbed (Coiled, a cluster manager?) is meant to do exactly what the challenge requires. And it was data out of S3.
One thing that I caught there was that Parquet serves the same purpose as CSV but for much bigger datasets, and now I want to learn more about it.
My first instinct would be to spin up a local Postgres and keep station data there. A lot of the solutions assume we have enough RAM to keep the stats per station; however, that's a bad assumption when dealing with a lot of data.
> A lot of the solutions assume we have enough ram to keep the stats per station
I'll give this example in Ruby, but you'll get the point. What you are mentioning is an issue if you choose to use File.read(), because it opens the file and reads its whole content into RAM. But this can be solved by using File.foreach, because it streams each row instead, which uses much less RAM.
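The same distinction in Go, as a rough sketch: os.ReadFile would slurp the whole file into memory, while a bufio.Scanner streams it a buffer at a time.

    package main

    import (
        "bufio"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("measurements.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Streaming: reads a buffer at a time instead of 13GB in one go.
        scanner := bufio.NewScanner(f)
        scanner.Buffer(make([]byte, 1<<20), 1<<20) // room for long lines
        for scanner.Scan() {
            line := scanner.Bytes() // only valid until the next Scan call
            _ = line                // aggregate stats per station here
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }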
What I am saying is that the solutions presented assume we have N stations which are way less than 1_000_000_000.
Worst case scenario for RAM would be if each line contained a unique station. In this case we'd have to allocate 1_000_000_000 * sizeof(stats) in RAM to contain the result of the computation.
So most of the solutions assume sufficient RAM ahead of reading the file.
In the first solution:
type stats struct {
min, max, sum float64
count int64
}
would take up 32GB for the stats alone for all 1E9 stations, and that's ignoring space needed for each station's name!
You are optimizing for a hypothetical edge case. There are a few hundred thousand weather stations in the world, or 50,000 airports, or 2 million hotels. Building a system assuming they'll grow to billions is impractical.
Also, even for this, original problem statement sets maximum to 10,000 different station names.
I thought this was an illustrative example of how to process big datasets. We could easily have a statistic per e.g. bitcoin transaction in a different problem, see https://github.com/afiodorov/bitcoin_ancestries .
I struggle a lot with this toy problem. Without constraints it's too trivial to pay attention to, yet no one seems to agree on potential real-world constraints (stdlib only? no mmap?).
If we have to solve just this problem as is, shouldn't we be timing simple solutions using various frameworks (polars, pandas, spark, bigquery[^1], go, awk) to compare various frameworks? Once you have the answer, why would you try to get the same answer again but in 4 seconds the second time round?
Comparing frameworks would at least indicate if a data practitioner should upskill and pick up yet another data framework.
Using Go's profiling tool with its "source" view: I used "go tool pprof -http=: cpu.prof", where cpu.prof was generated by "go-1brc -cpuprofile=cpu.prof -revision=9 measurements.txt".
Unfortunately it doesn't seem to help at all, I think mainly because (at present) Go's PGO basically inlines hot functions, and the important code here is all in one big function.
It is only mildly effective because of how anemic the Go compiler is. And even then it's extremely limited. If you want to see actually good implementations, look into what the OpenJDK HotSpot and .NET JIT compilers do with runtime profiling and recompilation (.NET calls it Dynamic PGO).
One "trick" not tried in the article that I have used is to gzip the data (so you have a .csv.gz) and stream that instead of the raw file. I find it reduces a large amount of the disk read (as you have 6-10x less to read).
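A rough sketch of that setup in Go (assuming a pre-compressed measurements.txt.gz; whether it's a net win depends on how fast your cores can inflate versus how fast your disk can read):

    package main

    import (
        "bufio"
        "compress/gzip"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("measurements.txt.gz")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Decompress on the fly: trades CPU for a 6-10x smaller disk read.
        zr, err := gzip.NewReader(f)
        if err != nil {
            log.Fatal(err)
        }
        defer zr.Close()

        scanner := bufio.NewScanner(zr)
        scanner.Buffer(make([]byte, 1<<20), 1<<20)
        for scanner.Scan() {
            _ = scanner.Bytes() // process each line as before
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }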
Performance is great, but I would imagine the parallelized version requires a significantly higher minimum amount of RAM than the non-parallelized ways... He claims that each more performant solution costs less compute than the one before it, but in the case of parallelization it's the same amount of compute in a shorter amount of time, right?
There are a few rust solutions in the "Show and Tell" linked above, for example this fairly readable one at 15.5s: https://github.com/gunnarmorling/1brc/discussions/57
A comment above referencing Python "polars" actually has Rust polars, std, and SIMD solutions as well (SIMD was fastest, but less readable for a hobbyist like me).
I double-checked and he mentioned a couple of other points. One was incrementally hashing the keys to reduce double reads. The other was about storing pointers to the value structs more efficiently.
I encourage implementing maps and other data structures btw, I was just curious.
The author has already converted the code to using a pointer to the value struct for storing in the standard Go hash table in Solution 2.
Solution 7 contains the code and description for the custom hash table.
I can see how interleaving/inlining the hash generation of the station name/key with the search for the separator reduces the number of scans over the bytes from 2-3x to just 1x.
The second point in Solution 7 was using a byte slice into the underlying buffer when the station name is found in the buffer, instead of creating a new string. This saves a memory allocation.
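That interleaving looks roughly like this (my own sketch using FNV-1a, the same family as the article's custom hash table; the real solution's details may differ):

    package brc

    // hashStation scans for the ';' separator and hashes the station name
    // in the same pass, so the name bytes are only read once.
    func hashStation(line []byte) (name []byte, hash uint64, rest []byte) {
        const (
            fnvOffset = 14695981039346656037 // standard FNV-1a 64-bit offset basis
            fnvPrime  = 1099511628211        // standard FNV-1a 64-bit prime
        )
        hash = fnvOffset
        for i, b := range line {
            if b == ';' {
                return line[:i], hash, line[i+1:]
            }
            hash ^= uint64(b) // FNV-1a: XOR the byte, then multiply
            hash *= fnvPrime
        }
        return line, hash, nil
    }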
You can't just throw hardware at this one to get to 4s. At least not in 2024.
The author's naive single-threaded Go solution took 1m45s on an "amd64 laptop with fast SSD drive and 32GB of RAM."
So, you'd need something 25x faster than his setup in terms of single-threaded performance. Let us know when you've upgraded your hardware to the equivalent of a 75GHz AMD processor with memory and SSD bandwidth to match!
The nice thing about a GPU solution (e.g. Python dataframes in cuDF, just a few LOC) is that these generally come down to your IO bandwidth: a single 2GB/s SSD, to 16-32 GB/s PCIe, to 1-2 GPUs running crazy fast. And then buy more cheap SSDs to chain together before buying more/better GPUs :)
I guess it depends on what we mean by "throwing hardware at it."
GPUs aren't magic. You still need to come up with a parallelizable algorithm.
The TL;DR is that the fastest solutions are basically map/reduce with a bunch of microoptimizations for parsing each line.
But before you do that, you need to divide up the work. You can't just give each core `file_size_bytes/core_count` chunks of the file because those chunks won't align with the line breaks. So, you need to be clever about that part somehow.
Once you've done that, you have a nice map/reduce that should scale linearly up to at least 20 or 30 cores. So in that sense, you can "throw hardware at it."
Whether or not any of that is a good fit for GPU acceleration, I don't know.
You should try the challenge. It's trickier than you think but surprisingly fun.
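On the chunk-alignment point above, the usual trick is to cut at approximate offsets and then nudge each boundary forward to the next newline. A rough sketch (my own, not from any particular submission):

    package brc

    import (
        "bytes"
        "fmt"
        "io"
        "os"
    )

    // splitOffsets returns one [start, end) byte range per worker, where every
    // internal boundary is moved forward to just after a '\n', so no line is
    // ever split across two workers.
    func splitOffsets(f *os.File, size int64, workers int) ([][2]int64, error) {
        chunk := size / int64(workers)
        offsets := make([][2]int64, 0, workers)
        buf := make([]byte, 1024) // assumed longer than the longest possible line
        start := int64(0)
        for i := 0; i < workers; i++ {
            end := start + chunk
            if i == workers-1 || end >= size {
                offsets = append(offsets, [2]int64{start, size})
                break
            }
            // Peek just past the approximate boundary and extend to the next newline.
            n, err := f.ReadAt(buf, end)
            if err != nil && err != io.EOF {
                return nil, err
            }
            j := bytes.IndexByte(buf[:n], '\n')
            if j < 0 {
                return nil, fmt.Errorf("no newline within %d bytes of offset %d", len(buf), end)
            }
            end += int64(j) + 1
            offsets = append(offsets, [2]int64{start, end})
            start = end
        }
        return offsets, nil
    }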
Your intuition about mapping to kernels is good. Basically all SQL, Polars, DuckDB, Pandas, etc. operators are pretty directly mappable to optimized GPU operators nowadays. This includes GPU-accelerated CSV/Parquet parsing. This was theoretically true starting maybe 10 years ago, and implemented in practice about 3-5 years ago. These systems allow escape hatches via Numba JIT etc. to do custom kernels, but it's better to stay within the pure SQL/pandas/etc. subsets, which are already mapped to more careful kernels.
To get a feel for times, I like to think about 2 classes: constant overheads and throughput
Constant overhead:
- JIT'ing. By using pure SQL/pandas/etc, you can avoid most CUDA JIT costs
- GPU context creation etc.: similar - after startup, an initial memory pool is allocated and then gets reused
- Instruction passing: The pandas API is 'eager', so "df1() + df2()" may have a lot of back-and-forth of instructions between CPU<>GPU even if the data doesn't move. Dask & Polars introduce lazy semantics that allow fusion, but GPU implementations haven't leveraged that yet AFAICT.
Bandwidth limits:
- SSD is the biggest killer. Even "Expensive" SSDs are still < 10GB/s, so you need to chain a bunch to get 100B/s ingest
- CPU pathways throttle things down again (latency+bandwidth): GDS/GDN lets you skip them
- PCIe cards are surprisingly fast nowadays. With PCIe 5+, the bottleneck quickly gets pushed back to the storage, and it's probably easier to buy more PCIe+GPU pairs than to make individual ones go faster for most workloads
- Once things hit the GPU, things are fast :)
4s is a LOT of time relative to what even commodity GPU hardware can do, so benchmarks showing software failing to saturate it are fascinating to diagnose
There aren't any generally-available CPUs that are substantially faster today than were available ten years ago. Maybe double the speed per core, triple at best.
After that, throwing more cores at it also rapidly runs out of steam because parallel code has its own overheads. Any shared state instantly kills performance, no matter the language. Very clever tricks have to be used to get decent scaling past 64 hardware threads (32 cores), and going past 256 is surprisingly difficult. You start having to worry about NUMA, IRQ steering, and core pinning. Bandwidth gets to be an issue, even to L3 and L4 cache, let alone out to main memory.
This notion that you can just "dial up" hardware performance to infinity as a fix for any amount of developer laziness needs to die.
The effort and long time it took Go to get to something that is 3-6x slower than other, better languages should be an important reminder to everyone assuming it belongs in the same weight class as Rust, C# or Java.
Oh, and what does the basic research you speak of constitute? Surely you looked at the assembly emitted by the compilers for these languages and the HPC-adjacent APIs each of them offers? No? Then let me tell you: Go is pretty much consigned to using its special flavour of non-portable, bespoke, hand-written assembly, which is the only way to access the SIMD instructions necessary to achieve optimal hardware utilization in this benchmark. This takes a lot of effort and skill, so, as you may have noticed, if you can't do it, Go simply cannot come close to the better options you can see on the benchmark chart.
And yet, this is something that can be trivially done in C#, C++ and Rust (albeit C# has the best UX, with a cross-platform SIMD API introduced in .NET 7, and C++ a close second with its own take on this still in preview). Java OTOH manages to stay in the same category by having an extremely advanced JIT that gives it comparable codegen quality even though it lacks a comparable SIMD API for now (Panama vectors are problematic currently), so benchmark implementations using it are forced to do SWAR.
My main gripe is of course the extremely common misconception about Go's speed, which it just does not have the moment you write anything sufficiently advanced or want to express a particular problem more tersely than writing thousands of open-coded loops.