JIT-optimized Ruby can outperform a C extension (railsatscale.com)
169 points by mooreds on Sept 9, 2023 | 106 comments



People say the title of the article, "Ruby Outperforms C: Breaking the Catch-22", is misleading, which is true: this is about Ruby code optimized by the JIT outperforming an extension written in C.

But to give some context: the author, Aaron Patterson, is a Ruby and Rails core team member. The article and headline are clearly targeting the Ruby community, where this article has been very well received. I think it's a good title for the intended audience.

The post clarifies in the first section:

> In this post I’d like to present one data point in favor of maintaining a pure Ruby codebase, and then discuss some challenges and downsides of writing native extensions. Finally we’ll look at YJIT optimizations and why they don’t work as well with native code in the mix.

edit: added the original title of the Hacker News post / article


This is specifically about breaking the myth that performing expensive self-contained operations (e.g., parsing GraphQL) in a native extension (C, Rust, etc.) is always faster than the interpreted language.

The JS ecosystem has the same problem; people think rewriting everything in Rust will be a magic fix. In practice, there's always the problem highlighted in the post (transitioning is expensive, causes optimization bailouts), as well as the cost of actually getting the results back into Node-land. This is why SWC abandoned the JS API for writing plugins - constantly bouncing back and forth while traversing AST nodes was even slower than Babel (e.g. https://github.com/swc-project/swc/issues/1392#issuecomment-...)


Interesting that both of the points you state completely contradict my experience with LuaJIT.

Parsing has always been one of the things its tracing JIT struggled with; it is still faster than the (already fairly fast) interpreter, but in this kind of branch- and allocation-heavy code it gets nowhere near the famed 1.25x to 1.5x of GCC (or so) that you can get by carefully tailoring inner-loopy code.

(But a tracing JIT like LuaJIT is different from a BBV JIT like YJIT, even if I haven’t yet grokked the latter.)

LuaJIT’s FFI calls, on the other hand, are very very fast. They are still slower than not going through the boundary at all, naturally, but that’s about it. On the other hand, going through the Lua/C API inherited from the original, interpreted implementation—which sounds similar to what the Ruby blog post is comparing pure-Ruby code to—can be quite slow.

The SWC situation I can’t understand quickly, but apart from the WASM overhead it sounds to me like they have a syntax tree that the JS plugin side really wants to be GCed in the GC’s memory but the Rust-on-WASM host side really wants to be refcounted in WASM memory, and that is indeed not a good situation to be in. It took a decade or more for DOM manipulation in JS to not suck, and there the native-code side was operating with deep (and unsafe) hooks into the VM and GC infrastructure as opposed to the WASM straitjacket. Hopefully it’ll become easier when the WASM GC proposal finally materializes and people figure out how to make Rust target it.

In any case, it annoys me how hard it is in just about any low-level language to cheaply integrate with a GC. Getting a stack map out of a compiler in order to know where the references to GC-land are and when they are alive is like pulling teeth. I don’t think it should be that way.


There is a very big difference between a simple FFI system and the sort of C interface offered by Ruby and Node. Those interfaces allow objects to be passed to the native code, and the native code can then do pretty much anything to the language runtime state. This is great if you want a C library that can do anything your higher-level language could do, but it also means the JIT has to treat all those calls as impenetrable barriers that cannot be optimised through, so even a small C call can prevent the rest of your application from being optimised.

We got round this in TruffleRuby by running C extensions through an LLVM Bitcode interpreter that was part of the same framework as the Ruby interpreter and allowed them to be JITted together, but that had other downsides, and wasn’t great for things like parsers which had huge switch statements.


Yes, but in this case the TruffleRuby approach would fix the Shopify issue, I think? And if by downside you mean longer warmup times, that's an issue for YJIT or any other JIT too, so how much of a downside it is depends a lot on the nature of the deployment.


Shopify's bigger repos are deployed pretty much every 30 minutes. As you point out, most JITs struggle in these conditions.

But YJIT warms up extremely fast, and is able to provide real world speedup to these services almost immediately.


I think a parser is perhaps the prime example of a case where resorting to a compiled extension can be beaten by something more JIT-favourable.

In a language like Ruby, parsing tends to be heavily dominated by scanning text and creating objects, and 1) you can often speed it up drastically by reducing object creation (e.g. here is Aaron writing about speeding up the GraphQL parser partly by doing that [1]), 2) creating Ruby objects and building up complex structures is going to be almost exactly as slow in a C extension as in Ruby, and 3) the scanning of the text mostly hits the regexp engine, which is already written in C.

(That said, I heavily favour not resorting to C extensions unless you really have to; even without going as far as some of Aaron's more esoteric tricks for that parser you can often get a whole lot closer than you think, and the portion you need to rewrite if you still have to might well turn out to be much smaller than you'd expect.)

[1] https://tenderlovemaking.com/2023/09/02/fast-tokenizers-with...


I think 'magic fix' could be replaced with 'fun thing to do'


I spent a while rewriting a tiny bit of some useful Ruby in Rust and integrating it via Wasmer, and I definitely think my time expenditure is better classified as "fun thing to do" than "magic fix".

https://ossna2023.sched.com/event/1K55z/exotic-runtime-targe...

https://www.youtube.com/watch?v=EsAuJmHYWgI

My goal was to determine: how will I use Wasm as a Rubyist? Spoiler: I did not intend to use Rust, but embedding Ruby in a Wasm module and running it from within Ruby proved to be a fool's exercise. I have a feeling that some of the claims I made in this talk about there being "no theoretical benefit" to running Ruby in Wasm in Ruby will quickly be proven incorrect. But I'm not expecting it to be faster.

If it's ever faster, that will be filed under "surprising results"

Michael Yuan, who obviously knows a lot more about Wasm than I do, addressed this in his talk as well, though not necessarily from the perspective of a JIT. I guess you could consider compile-target optimization performed on the target machine at runtime a type of "just-in-time optimization", but it's not really; it's just the regular ahead-of-time type of optimization, except without the usual fumbling where the majority of the benefits are immediately lost.

https://www.youtube.com/watch?v=kOvoBEg4-N4

The spoiler from that talk (for me anyway) was finding out that you do see surprising results sometimes, and sometimes it's due to a pathological case that (a) happens all the time, and (b) does not have a readily obvious solution in the form the problem regularly takes. Like Linux distros shipping generic binaries "target-optimized" for the lowest common denominator of any given architecture target, because the binaries they ship obviously have to run anywhere.

(Recap of the conversation we had off-stage after the Q&A ended: We don't know when we say "x86-64" what modern opcode targets that means we really have available, so we have to ship a binary with only the oldest opcodes that are guaranteed to be available on any similar chip that we intend to support. WebAssembly uses a compiler on the target machine, Cranelift, to translate from a platform-independent binary to a platform-specific one. Running the compiler takes a while, but we can do it ahead of time. Will we come out ahead in the end?)

All of this stuff is certainly very fun to reason about :D


>to present one data point in favor of maintaining a pure Ruby codebase

Chris Seaton has been stating this for over 5 years. It is unfortunate this mental model has never caught on in Rails.


Evan Phoenix has been saying it for even longer, and created Rubinius as a means to prove it :) (it has since been abandoned, but Chris's TruffleRuby ported the Rubinius core and stdlib implementations, so it's great that they've all been feeding each other for quite a while)


Oh yes. It is unfortunate that even when something is "right" or "correct" it doesn't mean the world will move in that direction.

Hopefully now Shopify has enough resources to push this through.


This article outlines well the paradox of what JITs require to be truly efficient: if more of the target language is available to optimize, it'll get waaay more optimized than if you drop down to the layer below and try to hand-stitch it.

Of course, there is massive overhead in doing so. Just look at Go, which had to rewrite in Go practically everything already available elsewhere, and must always have a native (pure-Go) implementation (protobuf for example shares the underlying interface across Ruby, Python, PHP... but then has a full separate implementation in Go, and Java I think). And they have the budget for it at least; Google won't let Go die under the overhead it created for itself.

So definitely, write more Ruby - enough of those "fast-C gems rewritten as C extensions" - but still keep using low-level libraries like libpq.


I'm just toying with a Ruby terminal talking directly to X (no Xlib) running a pure-Ruby TrueType renderer (and running a pure-Ruby editor in it...). Ruby is still not "fast" (but I haven't tested it with yjit yet). I put up with it because I know I can make it significantly faster later (and the quick and dirty first approximation is to memoize what I can, like glyph shapes).
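
(Concretely, the memoization I have in mind is nothing fancier than this sketch; parse_glyph_outline here is a made-up stand-in for the expensive TrueType work:)

    def glyph_outline(char)
      @glyph_cache ||= {}
      # Parse the outline from the font only the first time a glyph is seen;
      # every later occurrence of the same character is just a Hash lookup.
      @glyph_cache[char] ||= parse_glyph_outline(char)
    end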

As it turns out, for a whole lot of things Ruby is slow for people because one of the nice things about programming in Ruby is that a lot of the time you express the problem the most readable way first rather than paying attention to performance. When people are aware of this, it makes it easy to iterate fast and fix performance later. The problem is that a lot of the time it leads to pathological cases of people not thinking it through at all.

You'll notice most of his performance increase came from writing better Ruby. Clearly the original GraphQL parser was nowhere near optimal. It's worth keeping in mind that writing a C extension ought to be a last resort after making the Ruby as fast as you can first, because often that means you don't need to.

E.g. another of Aaron's recent articles (EDIT: [1]) is about speeding up a parser by cutting down on object creation, and one specific example he gave was to return just the token type from the lexer instead of [token_type, token_value]. The latter forced creation of an Array object to hold each token and a string object for the token value (though for fixed tokens you could avoid that by returning the same frozen string literal), and for a whole lot of tokens the parser had no interest in the token string (e.g. if you see :lparen, or :rparen, getting "(" and ")" is entirely uninteresting). When people run into slow Ruby code like that it's tempting to resort to C right away, rather than understand why their Ruby is slow.
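
To make the shape of that change concrete, here's a minimal sketch (a hypothetical lexer built on StringScanner, not the actual GraphQL tokenizer):

    require 'strscan'

    # Allocation-heavy: every token allocates an Array, and usually a String too.
    def next_token_slow(scanner)
      return [:lparen, "("]          if scanner.scan(/\(/)
      return [:rparen, ")"]          if scanner.scan(/\)/)
      return [:int, scanner.matched] if scanner.scan(/\d+/)
      nil
    end

    # Allocation-light: return only the token type (a Symbol); the parser asks the
    # scanner for the matched text only for the few token types where it matters.
    def next_token_fast(scanner)
      return :lparen if scanner.scan(/\(/)
      return :rparen if scanner.scan(/\)/)
      return :int    if scanner.scan(/\d+/)
      nil
    end

    scanner = StringScanner.new("(42)")
    p next_token_fast(scanner) # => :lparen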

I love that yjit makes it easier to get to a point where you don't need to reach for C, though.

EDIT: [1] https://tenderlovemaking.com/2023/09/02/fast-tokenizers-with... It's actually about the GraphQL parser mentioned in the original article.


Great point. Not just writing Ruby, but writing more performant Ruby. I find we have several "here's how to implement this GoF pattern in Ruby" guides to "thank" for a lot of the unoptimized code we see around.


It’s funny you mention libpq, because my first thought upon reading the article was "I wonder if a pure Ruby implementation of the Postgres wire protocol could possibly lead to performance improvements?"


Probably, but you'd have to reimplement a lot of things you get for free out of libpq (prepared statements, pooler support, other things I'm forgetting about). But fwiw, there's already this: https://github.com/mneumann/postgres-pr


IIRC Node.js started with a libpq binding, but it was moved to pure JS since, as in the article, the overhead of going in and out of JIT-land penalized it (as well as missing out on object optimizations).


Do you mean this one: https://node-postgres.com/features/native ? I don't know the Node ecosystem well enough to judge how many use this instead of the libpq bindings, but judging by the number of features mentioned on that page as incompatible, it doesn't look like a free lunch.


This pure Python library claims quite fabulous performance: https://github.com/MagicStack/asyncpg

I believe it because that team has done lots of great stuff, but I haven't used it; I just remember thinking it was interesting that the performance was so good. Not sure how related it is to running on the asyncio loop (or which loop they used for benchmarks).


I'd assume that the benchmark is showcasing the gains of the async model, more than making a point that native Python is faster than C. The issue is ultimately: how much of the functionality available through libpq is not yet backported to asyncpg (I imagine it's nonzero, but I don't know the answer). Which is why the pragmatic in me still prefers a middle layer like libpq in between.


> Of course, there is massive overhead in doing so.

Some concrete numbers for Go:

• Calling C from Go: ~40-50ns [1] [2]

• Calling Go from C: ~100-200ns [3]

[1] "Cgo calls take about 40ns, about the same time encoding/json takes to parse a single digit integer. On my 20 core machine Cgo call performance scales with core count up to about 16 cores, after which some known contention issues slow things down." (https://shane.ai/posts/cgo-performance-in-go1.21)

[2] "In response to the "cgo is slow" questions, this shows that cgo calls are about 50ns on my four-year-old x86 MacBook Pro. Is that fast or slow? It depends on what the cgo call is doing. If it's executing a single add instruction, an extra 50ns is slow. If it's doing something more substantial, an extra 50ns may be nothing at all." – rsc (https://news.ycombinator.com/item?id=36006347)

[3] "Calls from C to Go on threads created in C require some setup to prepare for Go execution. On Unix platforms, this setup is now preserved across multiple calls from the same thread. This significantly reduces the overhead of subsequent C to Go calls from ~1-3 microseconds per call to ~100-200 nanoseconds per call." (https://go.dev/doc/go1.21)


>Of course, there is massive overhead in doing so. Just look at Go, which had to rewrite in Go practically everything already available elsewhere, and must always have a native (pure-Go) implementation (protobuf for example shares the underlying interface across Ruby, Python, PHP... but then has a full separate implementation in Go, and Java I think)

Both Java and Go could just as well use FFI for those things "already available" in C.

They don't because native (to the language) is better, and can be even faster (due to GC or lack of crossing the boundary).

Python, PHP and co just take the easy way out. Plus, given their speed, if stuff like JSON and co were native-to-the-language they would be much slower than C extensions.


That's true, but I think it is due to a few factors: JNI is considered a big penalty in the Java world and treated as a "last resort" (libvips is an example); the Go community has been advocating against C bindings since the runtime was itself rewritten in Go, and they're treated as an antipattern; the protobuf and gRPC teams are composed mostly of Googlers, who have more incentive to optimize Java and Go (it's known that a big chunk of their services are Java and C++, and most new stuff is Go).

And the rest might as well just use the bindings, as it's probably easier for them to maintain, and low on their priority list. There's actually a pure Ruby protobuf library (I believe Square employees use and maintain it), but it probably won't ever be merged into google/protobuf for those reasons.


The point here is simplifying the codebase by removing an extra language.

Rubyists like writing Ruby, not C extensions. And using the JIT allows us to simplify and keep enough performance.


Can Numba and typed Python have a similar effect?


Sure - in cases where Numba works well, it certainly is faster than a badly written C extension.

It cannot, however, be faster than a well-written C extension. They're all just LLVM frontends.


Ruby outperforms C when that C is a Ruby extension, solving a problem by wrangling Ruby types.

When you write a run-time for a dynamic, high level language, and then library functions in C which use the API's and objects of that run time, the resulting code is reasonably fast, but never as fast as solving the problem using direct C idioms.

There are ways to make improvements in that type of code, but they tend to be tedious to write. For instance, you can split the code into cases by type, and then use faster, more type-specific routines.

Sometimes it's possible to get all the inputs out of the objects, work with lower, less encapsulated level C representations to do the bulk of the problem, and then put results back into the run time's objects again. Or do that partially at certain steps.

In the TXR Lisp mapcar implementation, alloca is used to create a native C array of stack-allocated iterator objects. (For the parallel iteration over multiple sequences: mapcar is an N-ary function!). The element values are pulled from all these iterators and stuffed into a stack allocated args structure destructively; that same structure is re-used for each call to the projection function.

That's not as fast as it could be. Beyond that it would be possible to separately handle the case when there is only one sequence argument, and to split that by type, and other such approaches.

It's faster, though, than if it allocated a dynamic vector of dynamic iterators, and/or if it built a dynamic list of arguments in each iteration which were applied to the function as a list.


Hacker News is going to nitpick at the bold claim of the title instead of focusing on the fact that it's a well-written technical article that underlines some neat tricks that allow a rather slow, highly dynamic language such as Ruby to provide great performance under some circumstances, which is all that engineers should care about.


Criticizing the title is not exactly nitpicking. The title is important. And it’s misleading.

I disagree that this article is well written. It is hard to identify what is being compared exactly, at least for someone not deeply familiar with the Ruby ecosystem. The article doesn’t do a great job of explaining the caveats and limitations of the comparison either.


>Criticizing the title is not exactly nitpicking. The title is important.

Only if one doesn't bother to read the article. Else the title is just an insignificant detail, or, at worst, a redundant piece of promotional material attached to the article.


And how do you decide if you're going to read the article? Ah yes, the title.

I guess we will always find someone who tries to word-wrangle their way out of an obvious fact. The title is the first thing you see. It's printed in a bigger font than the rest. Most people won't even read anything else. In short: It's important. It's not a detail. Hence it's not nitpicking when I criticize it.


>And how do you decide if you're going to read the article? Ah yes, the title.

Nope: the subject. This is about Ruby and execution speed, for example.

Then you open it and give it a quick cursory glance. Most articles have non-descriptive titles to begin with, or clickbait titles. The title could say any kind of non-descriptive BS like "Ruby's new JIT rocks!" or "Ruby smoking C in benchmarks". That's even the case for "serious" media, like the NYT.

If you go by the title you're in for a bad experience.


Exactly. This is why without the original title nobody who's not into Ruby would even consider reading it. And only a fraction of those people who are into Ruby and upvoted it would consider upvoting it. Thanks for confirming it.


Well, it's not meant for people not into Ruby.


Well, maybe read the original title again: "Ruby Outperforms C: Breaking the Catch-22"


Still not meant for people not interested in Ruby.


Got it. The title of an article is a detail and this title in particular doesn't draw any attention from people interested in C or the claims made by performance comparisons between different types of programming languages in general.

That's probably also the reason why the title was changed on HN after this criticism was made and the submission disappeared from the front page a short while after that.


I get your point, but it's a blog post by a Ruby & Rails core member, published on a Ruby-oriented blog named "Rails at Scale", with a particular audience in mind, and is written as a response to an ongoing discussion in the Ruby community. In context, I think the title is fine. But, for those without the context the title may be misleading. At no point is the goal to convince people to throw away C and switch to Ruby for everything.


This title in particular encourages sharing the article out of context, because it is a bit sensational.

That's why the title is so important.

If an article is aimed at a particular audience only and could be misunderstood otherwise, that's a good indication that the title should reflect that fact. (As the revised one does.)

As a sidenote: This submission has been tried multiple times on HN and it would very likely not have been successful here without the sensational title. [0]

0. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


The fastest parser in the post works at 35 MB/s, which is ~100x slower than necessary, improved from a baseline of ~1000x slower than necessary.


The title is misleading, just as other commenters mentioned. Just check how much indirection rb_iv_get() has to go through (in the end it will call [1], which isn't a "light" call). Now check the generated JIT code (in the blog post) for the same action, where the JIT knows how to shave off unnecessary indirection.

We are comparing apples and oranges here.

[1] https://github.com/ruby/ruby/blob/b635a66e957e4dd3fed83ef1d7...


Yes, it may well be misleading, yet you're making the point of the commenter you replied to while pointing out why the article is interesting. If you're in a position where you're using Ruby and considering whether writing a C extension will speed things up, it has valuable advice on what other things to also try. That is a lot more interesting to me than the title.


The article's findings apply to any managed language: don't call external libs often, as the call overhead is non-trivial and the call cannot be inlined. There's nothing new about it.


That one part is nothing new. That does not mean the specific mechanisms for taking advantage of this in Ruby aren't of interest to Ruby developers, who are the main audience for this given it's posted on the blog of Shopify's Ruby/Rails team.


If we are keeping the nitpicking to this thread, I think adding a "when" to the beginning would keep it just as interesting a title and be good enough in terms of correctness.


> Hacker News is going to nitpick at the bold claim of the title instead of focusing on the fact that it's a well-written technical article (...)

I think that's an unfair characterisation of HN readers.

They aren't criticising the article after all, they're criticising a clickbait and misleading title.


Clickbait titles get criticism because some people won't read beyond the title.


Clickbait titles get criticism because they make bold claims, drawing the attention of people who, when reading the article, find that their time was unfairly wasted because said claims are a pile of ...

Excuse my French.


Talking about technically well written:

When showing microbenchmarks, show the exact hardware used and make sure the CPU runs at a fixed frequency (no boost; most laptops can't control the frequency, especially Macs).
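
A sketch of what that might look like in Ruby (benchmark-ips is a third-party gem, and the two workloads here are placeholders for whatever is actually being compared):

    require "benchmark/ips"   # third-party gem: gem install benchmark-ips

    INPUT = "{ user(id: 1) { name } }"  # placeholder input

    puts RUBY_DESCRIPTION  # record the exact Ruby version and whether YJIT is enabled

    Benchmark.ips do |x|
      x.config(warmup: 5, time: 10)  # let the JIT warm up before measuring
      # Placeholder workloads; substitute the real "pure Ruby" vs "C extension" code paths.
      x.report("String#split") { INPUT.split(/\s+/) }
      x.report("String#scan")  { INPUT.scan(/\S+/) }
      x.compare!
    end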


If the writer of the article wanted people to actually read the article and judge it by its overall quality, they wouldn't have used such a blatantly clickbait-y title.


I agree; the title is doing a lot of work in terms of getting people to read the article, and I'm sure the author is aware. This is a trade-off, and when the article doesn't quite deliver on the promise of the title you are bound to get a little backlash.


> Hacker News is going to nitpick at the bold claim of the title instead of focusing on the fact that (...)

...that the claims are somewhere between false and cherry-picked?


I think this is a somewhat misleading title.

It does not outperform C.

This article appears to compare a suboptimal native C implementation to a JIT optimized version of some similar Ruby code. The JIT only outperforms that particular hand written C implementation in this case.

Still, it's great that the Ruby JIT can do this now, but the title would be better phrased as ‘Ruby JIT in some cases outperforms suboptimal hand-written C code‘. That wouldn’t be good clickbait though.


Yeah, it appears that in the original benchmark an off the shelf C parser outperformed an off the shelf Ruby parser. Then they wrote their own parser in Ruby, which was faster. But for a fair comparison, you should also write the same parser in C and compare it to that.

It's possible that writing it in Ruby is preferable if it avoids FFI overhead and can do more aggressive inlining. But this doesn't seem like an apples to apples comparison.


"X outperforms C/C++/Rust/Golang".

Only if the code design in C/C++/Rust/Golang is bad.


Golang's compiler is not great. The main way to achieve acceptable performance in Go is to use their weird asm dialect.


Somehow I can't find a source for this but it's well known that Go prefers compilation speed over runtime performance. Doesn't help that they implement the entire compiler stack themselves so it definitely misses optimisations that other compilers can already do.

Also, I've never heard of anyone using assembly for performance optimisations. It would never happen where Go is commonly used (server programming); maybe if you're doing graphics or something nonstandard?


> Also, I've never heard of anyone using assembly for performance optimisations. It would never happen where Go is commonly used (server programming); maybe if you're doing graphics or something nonstandard?

It's in the standard library. It's also how you write parsers and formatters with acceptable performance if you don't have SIMD intrinsics.


I've done it for a C++ parser. I grabbed the compiled C++ and disassembled it into assembly, then reworked it. Got a massive increase in speed when I tweaked the register use.


As well as differences in code gen, as I understand it Go also has worse tail latencies than servers written in C/C++/Rust because of the GC.


In this context, they are not talking about the GC but the compiler.


Golang's compiler is so fast, though, that I can iterate much faster, so I'm willing to accept slightly worse performance than C++. With C++ I waste so much time just waiting (especially if I'm using a lot of templatized header-only libraries).


But by that logic Ruby is faster still, since the compile time is 0. Less than zero really, because it can be modified in flight.


Only if the compilation speed difference is worth the loss of performance. That seems unlikely for most use cases because the speed gain is small while the performance loss is severe.


The trade offs between the two are huge.

Go programs compile in less than a second, or several seconds for huge projects, and idiomatic Go gets you within the same order of magnitude as C(++), with some memory overhead. Concurrency is first-class; GC pauses exist but are seldom an issue.

Idiomatic Ruby runs several orders of magnitude slower, with a huge memory overhead, and is rife with runtime gymnastics. Concurrency is an expensive hack, with stop-the-world GC.


Not at all. The main way is to reduce pressure on the GC by reducing allocations, which is a fairly straightforward process thanks to good support from the tooling.

Past that, you either call out to C via FFI or give up and rewrite in C/C++/Rust. Nobody really bothers with the asm.


If this was 4 years ago I would say "fair enough", but now, with Go 1.21? This claim is no longer valid, just like saying cgo is slow. Just because a language uses LLVM or GCC doesn't mean it has a state-of-the-art compiler.


Yeah it's not fair to put Go in the same bracket as fast langs tbh.

It's a lang to ship MVPs quickly without much thought & cheaper devs - not a high perf one.


Perhaps gcc-go is better?


gcc-go, gollvm - there's actually no difference compared to the standard Go compiler; heck, in some cases gcc-go performs worse. Go's internals are built with a different mindset and architecture, and there are certain features that can perform worse than just hard-coded assembly.


For the time being. But C on x86 hasn't been a good model of what the CPU is actually doing for a while now and eventually we'll have a better way to harness the computing power available to us.


I want to be optimistic and agree that ‘we’ll have a better way to harness the computing power available to us’, but at this point it’s been years or maybe decades of propping up the x86 ISA while actual operation of the computing device drifts further away. I certainly understand the desires of processor manufacturers to want to be able to maintain an ISA and then be able to implement a machine that can change underneath it to realize performance gains. However, this has left the status quo utterly entrenched and I can not see what would make them change this behavior.

I really would get on board with an alternative ISA that was more in line with the actual architectural features of modern processors, and I often wonder why we could not have an extra compilation (JIT or AOT) step from x86 to some other format. But the trend for all the large-scale processor manufacturers/designers seems to be going in the opposite direction, toward a more abstracted interface at the lowest level.

I’ve seen various comments that basically boil down to “you can’t change the ISA because then C/C++/Rust programs would lose performance in benchmarks in the short term”, and this resistance only reinforces maintaining current trends. So… what are you assuming happens that prompts the shift toward a better way to harness computing power?


ARM seems like it's destined to only grow in importance.


I don’t know for certain, or in any great specificity, but in this context (i.e. the ISA representing a more accurate or concrete mapping to hardware functionality/capabilities) I don’t know that ARM is particularly different from x86. They both present a relatively single-core, flat memory architecture to programmers, while the actual processor is absolutely not that.


You may be right about that, though I had the impression that less wasteful parallelism was possible because of fixed-size instructions.


The memory model is a bit different on arm64, but it is not Alpha.

Other than that they are very similar


What architectural features would you like to surface that are currently papered over in x86?


My particular interests would see me appreciating access to cache structure, cache behavior, prefetching, cache coherency, memory management unit control, instruction pipelining, and maybe register renaming control.

I am not particularly qualified to try and design a better ISA; I just know there are some things I would like the option to control at times (or at least I imagine I would). A lot of the list above is about exposing more of the architecture to user control as a general principle. However, I’d rather see some areas of hardware design change to facilitate something other than coding paradigms and practices that stretch back nearly 40 years. Chisnall’s ‘C Is Not a Low-Level Language’ article discusses several architectural directions I would like to see happen, but also talks about how, as long as the ISA remains the same, moving to new programming styles/philosophies continues to be a difficult proposition.


I see. These things sound swell, though for stuff like controlling register renaming it sounds hard for the instruction decode overhead to be worth it. Currently you can use prefetch intrinsics for prefetching and the level of cache management available is mostly non-temporal stores and explicit allocation of L3 space to specific cores (which ends up meaning specific processes if you are careful).

The world has mostly not adopted existing tools for writing 10-100x faster software (SIMD, huge pages, computing both branches then using a blend or cmov when this is sufficiently cheap, software prefetching, collection APIs that support bulk updates so that they can use all of these things instead of adding totally unnecessary dependency chains to the resulting program, etc.), so I'm not optimistic about things getting better if CPUs grow a thousand times more memory bandwidth and cores and begin to support very wide scatter/gather like GPUs.


C is a pretty good model in part because all ISAs were designed with C in mind.


immintrin.h seems tough to beat.


C was never intended to be a model of what the CPU is actually doing.

This is purely an optimization issue and the article compares apples to oranges for clicks.


(Edit: my mistake, I misread the comment I was replying to, assuming it claimed these languages themselves were equivalent.)

Why would the performance of these languages be identical? These languages aren’t the same in lots of ways.

Golang has a GC and gives you much less control over memory layout than any of the other languages you mentioned. It also has a very different concurrency model, which will vastly change the performance characteristics of multithreaded code.

Rust is limited to LLVM, while C can be compiled with many other compilers - some of which produce better code in many cases (e.g. ICC).

Rust has much better noalias analysis - which allows the compiler to make optimizations that aren’t available in general purpose C or C++ code. (Since nobody adds noalias to all their variables in C, for a variety of reasons).

Unlike the others, C doesn’t support monomorphization (without disgusting macro tricks). Monomorphization cuts both ways with performance - since it bloats executables (and thus hurts the branch predictor and cache) while generating more efficient code.

These languages aren’t the same. There’s lots of reasons they will perform differently both at the limits of what we can optimize and in the average case of everyday code. It’s wild that you would consider them to all be the same.


> Rust is limited to LLVM, while C can be compiled with many other compilers - some of which produce better code in many cases (e.g. ICC)

There's a Rust front-end for GCC as well.


Is it ready for use? Last I saw it was far from feature complete.


Afaik the borrow-checker is missing - please correct me if I am wrong - but that doesn't stop you from using it. You could very well use `rustc` for the borrow-checking and compile with GCC or any other backend. The borrow-checker only expresses that your code is sound according to the Rust spec.


> The borrow-checker only expresses that your code is sound according to the Rust spec.

Doesn’t the borrow checker also track when things fall out of scope, and thus control when Drop is called?


No. Drop has straightforward rules.

Find the scope you can refer to the variable in, i.e. the curly brackets it is contained in. That variable is dropped right before the closing curly bracket. Things are dropped in reverse order of creation.

(Similar rules apply to dropping temporaries)

https://doc.rust-lang.org/reference/destructors.html


Afaik it tracks when objects go out of scope, as that deals with ownership and lifetimes, but that's also part of the Rust spec.

The drop is there for stuff that you can't just release the memory for, such as closing files and connections.


I don't understand why we have to pretend Ruby could ever be as fast as C. Or faster. That's like saying a jockey is faster than their horse. Ruby is riding on C, so it cannot be faster.

Also, why even bother? Ruby is fast enough for the job it does. It's higher level and makes things that are "harder" in C easier. Dev time is less, and that's where we win. We take the win and accept what it is.


I think the point of the essay is that we shouldn’t immediately assume that going to C is the best way to speed up part of a Ruby system.

It might be conventional wisdom to break out to C for the speed gain, but a purpose-made, hand-written piece of Ruby, with JIT optimizations, can be a better/faster solution.


Some Ruby implementations are "riding on C". Others are not, e.g. TruffleRuby.

While beating C overall would be very hard because of the dynamism of Ruby, most Ruby code is not very dynamic after a short warmup (loading, for most code; connecting to a database etc., for code using a dynamic ORM), and on specific code it's at least theoretically possible for a Ruby JIT to beat C by optimising for the actual runtime configuration and evaluating away all the dynamism that costs Ruby performance.

TruffleRuby goes part-way there, in terms of guarded sections that make assumptions about types but where the guards shunt code that breaks those assumptions onto a slow path, but a Ruby JIT, or a combination of a bytecode compiler and JIT, can at least in theory go much further. E.g. a whole lot of Ruby library code contains type checks that a compiler could hoist and/or elide.
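
A contrived example (mine, not from the article) of the sort of defensive check that is everywhere in Ruby library code:

    def scale(values, factor)
      raise TypeError, "factor must be Numeric" unless factor.is_a?(Numeric)
      values.map do |v|
        # At a call site that only ever passes Floats, a sufficiently smart JIT could,
        # in principle, hoist this per-element check out of the loop, or drop it entirely
        # behind a guard that deoptimizes if the element types ever change.
        raise TypeError, "element must be Numeric" unless v.is_a?(Numeric)
        v * factor
      end
    end

    scale([1.0, 2.0, 3.0], 2.0) # => [2.0, 4.0, 6.0]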

Yjit is a baby step, but a good start.


TL;DR

This is about pure Ruby code optimized by JIT vs manual optimization by writing C extensions (native code). There is Ruby code around those extensions but

> The JIT cannot know what the native code will do to parameters (possibly mutate them), it cannot know what type of Ruby object will be returned, if the native code will look at the stack, or if it will raise an exception, etc. Thus the JIT is forced to have a conservative approach when dealing with native code, essentially forgetting much of what it had learned, and falling back to the most conservative implementation possible.

Hence the JIT often makes pure Ruby code run faster than the code with C extensions.

I think I heard the same argument many times with other JITed languages, interpreted or compiled.


Any managed language - Java and Go included - pays for trips to external, non-intrinsic calls. The standard way to deal with it is to pass large portions/buffers to amortize the costs. This has been the case since the late '90s.
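
A toy Ruby illustration of the amortization idea (the ffi gem and libc's strlen are just stand-ins for "some external call"):

    require "ffi"  # third-party gem

    module LibC
      extend FFI::Library
      ffi_lib FFI::Library::LIBC
      attach_function :strlen, [:string], :size_t
    end

    words = ["alpha", "beta", "gamma"] * 10_000

    # Chatty: one boundary crossing per element, paying the call overhead 30,000 times.
    total_chatty = words.sum { |w| LibC.strlen(w) }

    # Amortized: cross the boundary once, over a single large buffer.
    total_batched = LibC.strlen(words.join)

    total_chatty == total_batched # => true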


"faster than C" - TBH, I kinda dread these clickbait posts because faster than C doesn't matter in 2023, where speed is dictated by multi-core and GPU.

Python is abysmally slow, but provides access to GPU and (soon) multi-core.


One should remember that not all problems can be solved faster with concurrency - not all problems are "embarrassingly parallel". Besides, even when they are, one gets diminishing returns by adding more cores. Raw performance on a single core is still relevant.

That said, support for concurrency is relevant, but more for "masking" I/O latency than for MIPS, a problem which is apparently more common. For a language created after 2010, it is nearly a fault not to have something less primitive than threads to handle concurrency. It is like still managing memory manually, as C does.

I agree that "faster than C" claims too often hide some kind of cheating. But we have HN "reviewers" for that ;-).


Did you see the post yesterday about the people who deployed a Kubernetes cluster because it took 24 hours to parse a few GB of CSVs in JavaScript?


I'd like to read that but can't seem to find it. Do you remember the title or have a link?


Yep! I wish I was joking, but https://news.ycombinator.com/item?id=37415812


The best stories are always those that end with a rewrite in rust. This one is no exception, thanks for sharing.


uhm, isn't the CPU access in python written in C (mostly)?


Without control, power is nothing.

C gives you control.


It doesn’t though.


Well, it gives you the belief that you are in control, because undefined behavior is implementation-dependent, and the "modern processor" not only predicts but can also decide the execution order of your nice code.

That said, save for the Basic language, C is probably the lowest common denominator for system programming and for all the programming languages to interface with each other (with relative efficiency).


Every time you would have direct control over the hardware it’s considered UB and the compiler throws the code away.



