Going faster by duplicating code (voidstar.tech)
158 points by voidstarcpp on Oct 10, 2023 | hide | past | favorite | 68 comments



> Compilers try to hoist constant conditions outside of loops but they're bad at it. Even in the trivial example above, on -O2 gcc does a redundant check with every loop iteration.

GCC is actually good at that, -O3 has no issue recognizing it. In fact there even is a very explicit option (-funswitch-loops) responsible for extracting loop invariants. It is not enabled on -O2 because it has a space-speed tradeoff. If this optimization is truly desirable even on -O2, `#pragma GCC optimize("-funswitch-loops")` can be used to force it.
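A minimal sketch of what this looks like in practice (function and variable names are mine, not from the article); the guarded pragma asks GCC to unswitch loops even when the surrounding build uses -O2, and is ignored elsewhere:

```cpp
#include <vector>

// Ask GCC to apply loop unswitching to the functions below even at -O2.
// (Applies only to GCC; clang and MSVC ignore or reject this pragma.)
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC optimize("-funswitch-loops")
#endif

// The condition `twice` is loop-invariant, so the unswitched version
// becomes two specialized loops with no per-iteration branch.
void scale(std::vector<int>& v, bool twice) {
    for (int& x : v) {
        if (twice) x *= 2;
        else       x *= 3;
    }
}
```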


I don't know why people are so reluctant to just use -O3


Because -O3 enables optimizations that break code with some lesser known cases of UB[1]. So people just don't enable -O3 because they can't be bothered to fix UB in their codebase, or because they think it's a "compiler bug".

There are other reasons for not using -O3 which people already mentioned.

As a matter of fact, Gentoo specifically says in their docs that -O3 breaks some packages[2].

[1] https://stackoverflow.com/questions/57889116/different-evalu...

[2] https://wiki.gentoo.org/wiki/GCC_optimization#-O


UB-breaking optimizations are already enabled at O1. I don't think -O3 is particularly noticeable for that: your first link shows code that breaks between O3 and no optimization at all.

O3 does more aggressive inlining, generally expensive optimizations, and optimizations that are not necessarily profitable.


I was only saying that -O3 enables _even more_ optimizations, which can lead to UB code that works with -O2 but misbehaves on -O3. And because -O3 is seldom used (for reasons other than code breakage), the code patterns that are actually UB but don't get compiled to bad code at -O2 (by e.g. gcc) are lesser known. It all depends on how much a given compiler is willing to make assumptions based on UB and on what the C spec permits in order to optimize. At -O3 it probably makes a lot more of those assumptions, along the lines of "the spec says this is undefined behavior so I will assume it can never happen and optimize based on that". The more aggressive inlining in conjunction with that probably exposes even more UB.

Edit: found example where code works on -O2 but breaks on -O3, related to using abnormal amounts of stack space https://stackoverflow.com/questions/47058978/g-optimization-...


Isn’t the warning that -O3 might break stuff horribly obsolete? It was an issue in 2.95, but that was many years ago.


Because i) not all programs benefit tremendously from -O3 and ii) it adds to the compile time and binary size? It would be great to have some optimization level between -O2 and -O3 so that only portions that have a potential to be improved more than, say, 5% are compiled using -O3. In fact the existence of `#pragma GCC optimize` does suggest that this might be possible today with some heuristics... (Or use PGO, which will have the same effect. But PGO is still a novelty in 2023.)



you can manually set the optimizations (and order of optimizations) you want. -O3 is just a predefined set of optimizations.


But it is not portable among compilers (for example, both GCC and Clang support SLP-based autovectorization but with different flags). -O# is one of the few flags that have roughly the same meaning across many compilers.


Because in the real world code size is often more important than throughput on large inputs in microbenchmarks. There is a point of diminishing returns, and in the opinion of many -O3 is beyond it. If it does the job and the job is worth doing, use it.


Using -O3 used to cause correctness problems in my work codebase. Nowadays the compiler (ifort) crashes when using -O3 for some reason.


> Using -O3 was resulting in correctness problems

Likely because your codebase had UB in it that didn't show itself until a certain optimization level. The solution is to fix all instances of UB. See my comment above.


I believe in our case it was a compiler bug, because it only happened for a few versions of ifort and was eventually fixed. But it scared us off from using -O3.


How often does O3 vs O2 enable the sort of optimizations that break code that is technically incorrect?


I just gave up and use tcc.

Forces me to use a better algorithm when it matters. Which more often than not, does not.

;-)


>GCC is actually good at that, -O3 has no issue recognizing it.

imo, if you have to go to O3 or enable a pragma to get an "obvious" optimization then this is undesirable and probably something the programmer still needs to be conscious of, especially since we spend so much time doing development builds that are not at the maximum release optimization level.


"Misaligned" would be a better word than "obvious". It is well known and understood that a higher optimization level means more compilation time, potential speed gain and more subtle breakage in case of presence of UB, and being the maximum level, -O3 also implies potential binary size increase as well. In this understanding I believe `-funswitch-loops` is correctly placed on -O3. But ideally we want less compilation time, potential speed gain, less subtle breakage and insignificant binary size increase at once, and you can argue that all existing optimization levels are too far from that ideal.


IMO the reason the compiler doesn't add special cases for the simplest version is that it doesn't know which of its _many_ special cases to use. If you actually use the unoptimised version of the code like

    void withSwitch(vector<int>& Values, bool v) {
      if (v) {
        multiply1(Values, 2.0);
      } else {
        multiply1(Values, 3.0);
      }
    }
Then it actually inlines the code and optimises each one correctly, as it has context about which special cases are available. (Doesn't even need the `inline` keyword for this at `-O2`)

You can see the code here: https://godbolt.org/z/5beeYe77a


This is possible if the call site can see the implementation, but you can't count on it for separate translation units or larger functions.

My goal was to not rely on site-specific optimization and instead have one separately compiled function body that can be improved for common cases. Certainly, once the compiler has a full view of everything it can take advantage of information as it pleases but this is less controllable. If I were really picky about optimizing for each use I would make it a template.

>Doesn't even need the `inline` keyword for this at `-O2`

The inline keyword means little in terms of actually causing inlining to happen. I would expect the majority of inlining compilers do happens automatically on functions that lack the "inline" keyword. Conversely, programmers probably add "inline" as an incantation all over the place not knowing that compilers often ignore it.


>Conversely, programmers probably add "inline" as an incantation all over the place not knowing that compilers often ignore it.

Funnily, the inline keyword actually has a use, but that use isn't to tell the compiler to inline a function. The use is to allow a function (or variable) to be defined in multiple translations units without being an ODR violation.


MSVC treats inline as a hint [0], GCC is ambiguous [1] but I read it as utilising the hint. My understanding of clang [2] is that it tries to match GCC, which would imply that it applies the hint too.

[0] https://learn.microsoft.com/en-us/cpp/build/reference/ob-inl...

[1] https://gcc.gnu.org/onlinedocs/gcc/Inline.html

[2] https://clang.llvm.org/compatibility.html


What the OP is saying is that it has a use to satisfy (https://en.cppreference.com/w/cpp/language/definition) and in that case it is not optional and not ignored by the compiler. Whether it will "actually" inline it is another matter, that is optional.


would it have made a difference if the function was static? The compiler would then be able to deduce that it isn't used anywhere else, and thus could do this inline optimization.


GCC actually totally does that. It even does it when the function is not static, by copy/pasting the function and enabling further optimizations on the copy.

to me modern compilers are like voodoo magic


You can also force it by using extensions like `[[gnu::always_inline]]` or `__forceinline`. I've actually used this technique to generate an auto-vectorizable function whenever it's possible, without any code duplication [1].

[1] https://github.com/lifthrasiir/j40/blob/252e7987d36d50f617f2...
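A hedged sketch of that idea (illustrative names, not the linked code): force-inline one kernel into branches that each pin down a constant, so every instantiation can be specialized without textual duplication. `[[gnu::always_inline]]` is a GCC/Clang extension; the fallback macro just uses plain `inline`.

```cpp
#include <cstddef>

#if defined(__GNUC__)
#  define FORCE_INLINE [[gnu::always_inline]] inline
#else
#  define FORCE_INLINE inline
#endif

// One kernel, written once; force-inlining guarantees each call site
// gets its own copy that the optimizer can specialize (and vectorize).
FORCE_INLINE void scale_kernel(float* p, std::size_t n, float f) {
    for (std::size_t i = 0; i < n; ++i) p[i] *= f;
}

void scale(float* p, std::size_t n, float f) {
    if (f == 2.0f)      scale_kernel(p, n, 2.0f); // constant visible here
    else if (f == 0.5f) scale_kernel(p, n, 0.5f); // and here
    else                scale_kernel(p, n, f);    // generic fallback
}
```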


On modern hardware, it's also not clear that a special case that makes things faster from a purely CPU-oriented perspective makes things faster overall. Adding additional code to handle special cases makes the code larger, and making the code larger can make it slower because of the memory hierarchy.

There's a CPPCast interview with one of the people who worked on the Intel C Compiler where he talks about the things they do to get it producing such fast binaries. At least from what he said, it sounds like the vast majority of the unique optimizations they put into the compiler were about minimizing cache misses, not CPU cycles.


> Doesn't even need the `inline` keyword for this at `-O2`

Not surprising, since inline has nothing to do with inlining. :)


> the reason the compiler doesn't add special cases for the simplest version is that it doesn't know which of its _many_ special cases to use

This is a fact claim. I'm having trouble parsing the "IMO" that's a preface to your comment.


The "IMO" here is used instead of "I think" or "I believe" to indicate that this is not a fact claim but a (presumably educated) guess. Not a very correct use of "IMO", technically, but a fairly common one nonetheless.


Are you sure this is common? I've never seen it before.


For all sorts of reasons, I think that being pedantic about the meanings of acronyms probably does more to inhibit understanding each other than it does to facilitate it.

In this particular case, regardless of how often anyone has personally seen the term used to mean something more like "I think that...", taking that to be the intended meaning is the most friendly interpretation, and therefore the preferable one. It's an informal forum, we type fast, and don't necessarily carefully proof-read comments before hitting reply. Nor should we -- this place is a more pleasant space to occupy for everyone involved if we give each other the benefit of the doubt about choice of words.


> being pedantic about the meanings of acronyms[...] taking that to be the intended meaning is the most friendly interpretation

Why are you responding as if it's a foregone conclusion that my intent was to be uncharitable/unfriendly?

Someone wrote a comment with unusual phrasing, and I asked for help understanding the message. It contains neither an explicit nor an underhanded attempt to be an asshole to someone. The adversarial reading you're going with is exactly that—something that you're bringing to it, and is (perversely) wholly uncharitable in and of itself...


Whoa, that’s wild! IMO == “In My Opinion”, and it’s all over the internet, along with its cousin IMHO (“In My Humble Opinion”). Are you sure you haven’t seen it before? Try searching for it on HN or Google. HN search shows me almost 100k comments with IMO, and the first page of hits is all this meaning.


"IMO" is common. "IMO" used this way, on the other hand, is uncommon enough that I've literally never encountered it before.


So, by making your code harder to read ("Duh - both these branches do the same thing, so I'll refactor it"), you can give the optimiser actionable information.

But it's not explicit in the code; you'd need a comment in the code to explain your reasons. Comments rot. And your tuned "high-level" code (yeah, C++ isn't generally considered a high-level language) is now dependent on some specific optimizer, which might change in the future; code that appears to be portable, and compiles on multiple compilers, only performs properly if you use the right optimizer.

I don't like this advice; but I'm not a C++ programmer. I don't like macros, and I'm not at all keen on the merciless optimizers that are shipped with modern compilers.


I'm generally opposed to unnecessary optimization, but I'm also opposed to this tendency to start flaming as soon as people even talk about optimization.

Because sometimes this kind of careful performance tuning, even at the cost of readability, is necessary and desirable. And, for the times when that happens, this kind of information needs to be out there, available, and being discussed, so that people can learn the techniques and understand how to use them properly and responsibly.

Yes, that does include knowing that a change to the compiler's optimizer could mean that things that used to make the code faster now make it slower. For that matter, changing the CPU microarchitecture could have the same effect. But that's something we all learn in Optimizing 101. It's so well-known that the very first setting on Godbolt's output pane is a toggle to select which compiler and version it should use to generate said output.


> this kind of information needs to be out there

Agreed; I think TFA is a useful article. Highly-tuned code is going to be hard to read (it used to be written in assembler, which is at least explicit).

I was simply commenting on the opacity of having two branches in the source that appear to do the same thing, and rely on something outside the code (or the language specification) to achieve the desired performance.


The language specification covers the semantics and correctness of the code. Excepting some very specific things such as tail call elimination that also touch on semantics, optimizations do not belong in the language specification.

If we did have a principle saying that you can't get clever with optimization or take advantage of your language's optimizer, that would imply that the past 15-20 years of growth of Web technology was based on an invalid foundation: the v8 JavaScript engine and people's efforts to learn how to use it effectively and push what's possible in a browser.

You'd also have much less impressive video games, and your smartphone would have worse battery life.


OP — If this comp_nearest is still a hot path for you or if you want to generate more articles, consider testing:

1. using `restrict` to tell the compiler that src and dest are sure not to overlap

2. Converting your two increment and test blocks to add+mod to allow for uninterrupted pipelining

Neither might make a difference, but either could.
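A sketch of suggestion 1 with a toy pixel operation (names are mine; in C++ the standard `restrict` keyword doesn't exist, but GCC, Clang, and MSVC accept `__restrict`):

```cpp
#include <cstddef>
#include <cstdint>

// __restrict promises the compiler that dest and src never alias, so it
// need not re-read source pixels after every store to dest.
void composite(std::uint32_t* __restrict dest,
               const std::uint32_t* __restrict src,
               std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = src[i] | 0xFF000000u; // toy op: force alpha opaque
}
```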


Addressing the aliasing concern would be the easiest improvement. I observed in the assembly that the source pixel is being re-read all four times it is used, which could be fixed.

Writing an optimal composite function is of course not really the goal, nor of much educational/entertainment value. For any additional speed I already have a function which slices up compositing tasks into chunks and puts them on a thread pool.


> This function doesn't know the value being multiplied until it is called. The compiler (gcc, O2) emits a generic integer multiply instruction in its loop body.

Well, that's because they did it wrong. A tiny math function like this should be defined in the header file and force-inlined, problem solved. No need for that template monstrosity suggested at the end.


The "tiny math function" is used for expository purposes of the generic application, to fit a trivial example on one page. Obviously it's not how you would literally write a function that transforms elements in a vector.


"Programmers often put in speed hacks based on performance knowledge which is outdated or inapplicable to the target platform. Separating cases lets the programmer supply their high-level knowledge about what values are likely to be encountered, which the compiler can optimize to the platform based on its low-level information."

But if our performance knowledge is outdated, how do we know which cases to separate the code into? Using the toy example here, how do we know that multiplying by 2 is a special case we should add a branch for? What about multiplying by 3 or 4 or 5? It seems like you still need to use up-to-date performance knowledge for this technique to work, and in that case just write the left shift into your code so that when this code gets read later it will be more clear why it's there.


Reminds me of a point in the interview with Donald Knuth in Coders at Work where he criticized overuse of abstraction in code, and said that he thinks it's more important for code to be easy to read and edit than for it to be easy to reuse.

He didn't say it explicitly, but the implication that I took was that things that are meant, in essence, to make code more configurable can sometimes (often?) do more harm than good. (SOLID, I'm looking at you.) Using them successfully often requires oracular knowledge about how the code might evolve in the future.

I've found that doing my job got a lot easier after I started following this idea. The code might be more repetitive and boilerplate-y, but there's actually less of it, so it's still easier to read, understand, and maintain.


That. YAGNI >> DRY.

It should be obvious when to refactor for DRY, not done out of some compulsive behavior.

And then once you need more than 2 branches, please break the DRY abstraction, which otherwise tends to stay forever.


I really like the way functional programmers tend to think about factoring code. There's more of a tendency to think in terms of creating algebras. That forces you to think a lot more carefully about what abstractions you're producing, why, and how they relate to your domain model. And to be more cautious about it.

DRY, by comparison, is an almost offensively crude heuristic.


> how do you know which cases to separate the code into?

It takes practice. For this you really just have to read the assembly, and profile the code. Then try separating it one way and see if it helps, and if not then another. In my experience, it takes many tries to update your perf knowledge, and optimizations sometimes don’t work even when your knowledge is current. All this is why the article noted people should be “Giving the compiler an opportunity to do something is useful, but measure carefully before forcing it.”


>But if our performance knowledge is outdated, how do we know which cases to separate the code into?

Case separation is less of a commitment than deciding on the specific optimization yourself. You know empirically which cases are used in your application and can separate those to see if there's a benefit without any manual optimization knowledge, like I did in the image scaling example, without having much of a theory in mind as to why it might be faster.


TL;DR: If you copy paste the same implementation code in different branches, you give the compiler opportunities to generate faster code for each case it wouldn't have otherwise generated, without you having to do any manual optimization work.
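A minimal illustration of that TL;DR (names are mine, not from the article): the two branch bodies are textually identical, but inside the first branch the compiler knows `factor == 2` and can specialize, e.g. turning the multiply into a shift.

```cpp
#include <vector>

// Same code in both branches; the duplication exists only to give the
// optimizer a branch in which `factor` is a known constant.
void multiply(std::vector<int>& v, int factor) {
    if (factor == 2)
        for (int& x : v) x *= factor;   // compiler sees factor == 2 here
    else
        for (int& x : v) x *= factor;   // generic multiply
}
```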


This is really an excellent programming technique article.

I think what makes it so great is that your examples hit a sweet spot of simplicity while still being motivating and you show how to get a "good enough" solution very quickly.


If that's the case, then I imagine the inline keyword would have the same effect?


You need to use a compiler specific "always inline" directive if you want macro-like functionality of actually inlining code.

On its own, the C++ "inline" keyword does not cause inlining to happen, although compilers may treat it like a hint depending on optimization level. GCC does not inline an "inline" function on O0.

"Inline" in the C++ standard means "multiple definitions permitted" so the same entity can exist in multiple translation units without upsetting the linker. This is why C++17 added "inline" variables, which can be initialized in a header that's included in multiple places, even though the inlining concept has no applicability to a variable. The keyword was adopted for this purpose because of the primary association with affecting linkage behavior.


This isn’t an optimization you should consider without insight into your actual bottlenecks, but compilers can be even more aggressive with code that’s strictly local to one translation unit (i.e. inside an anonymous namespace in a cpp file) than they typically would be when seeing an inline hint elsewhere.

It’s not quite the same.

Plus, another benefit of duplication is that you can more freely hand-tune your implementation once you’ve decided its private. Memory alignment, pointer aliasing hints, clever loop structures, SSE stuff, etc can all be used more freely when you know nothing else needs to use this version.

The article is a good teaser around how unintuitive optimization can be, but it only scratches the surface.


yeah, languages which can do inlining optimizations do this for you, sometimes even without you knowing.


thank you


I've used this before for memcpy with arbitrary values that are likely to be one of a small set of known small sizes.

Calling memcpy with an arbitrary value always has to call out to libc but a specific value can be inlined into a single instruction or two for smaller values, so you can have an if-else chain or switch with a few common ones for your program and it's a noticeable speedup.
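A sketch of that technique (illustrative names): each case passes memcpy a compile-time-constant size, which compilers routinely inline down to a load/store pair, while the default keeps the generic library call.

```cpp
#include <cstring>
#include <cstddef>

// Special-case the sizes common in this program; the constant-size calls
// can be inlined to a couple of instructions, the rest fall back to libc.
void copy_small(void* dst, const void* src, std::size_t n) {
    switch (n) {
    case 4:  std::memcpy(dst, src, 4);  break; // constant size -> inlined
    case 8:  std::memcpy(dst, src, 8);  break;
    case 16: std::memcpy(dst, src, 16); break;
    default: std::memcpy(dst, src, n);  break; // generic call
    }
}
```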


I'm reminded of http://number-none.com/blow/john_carmack_on_inlined_code.htm...

Sometimes it really is faster just to do what a rookie would, dump all the work in a single function.


In summary, programming in a statically compiled language is really about giving your compiler hints as to what you want to happen. Sometimes it listens, sometimes it doesn't. Sometimes you can force it with some kind words, and sometimes you need to beat it into submission.


This is well known and I thought GCC did it already. This is a form of ipcp (interprocedural constant propagation) with cloning. Whether it is worth it is hard to determine without profiling info, except in the case where an argument is always constant vs partially constant.


I kind of wish we had a form of preconditions/hints for the C compiler. There are lots of attributes, but those all look weird. Imagine annotating a calculation to say that 1 is a common value, for example, or that a function is never called with a null pointer.


With GCC/clang you can write an ASSERT/ASSUME macro that does this. Basically: if (!cond) __builtin_unreachable();
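A minimal version of that macro (the `ASSUME` name and `divide` example are mine). On GCC/Clang, reaching `__builtin_unreachable()` is UB, so the optimizer may assume the condition always holds; on other compilers this sketch degrades to a no-op.

```cpp
// In optimized builds the compiler may plan code as if `cond` is true;
// if it ever isn't, behavior is undefined.
#if defined(__GNUC__)
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define ASSUME(cond) ((void)0)
#endif

int divide(int a, int b) {
    ASSUME(b != 0);   // lets the compiler drop zero-divisor handling
    return a / b;
}
```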


Honest question, would `V *= Factor` be faster?


In almost all cases: no.

"A compound assignment of the form E1 op= E2 differs from the simple assignment expression E1 = E1 op (E2) only in that the lvalue E1 is evaluated only once." (C99, 6.5.16.2p3)

It only matters when evaluating E1 has side effects. For example, `a[i++] += 1;` which is equivalent to `a[i] = a[i] + 1; i++;` rather than `a[i++] = a[i++] + 1;`.
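A tiny demonstration of that single-evaluation difference (function name is mine): with `+=`, the index expression `i++` runs exactly once, so one element is modified and `i` advances by one.

```cpp
// Returns true iff the compound assignment evaluated the lvalue once.
bool single_eval() {
    int a[3] = {10, 20, 30};
    int i = 0;
    a[i++] += 1;   // same as: a[i] = a[i] + 1; i++;
    return a[0] == 11 && a[1] == 20 && a[2] == 30 && i == 1;
}
```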


> `a[i++] = a[i++] + 1;`

This is probably not what you want, as "i" is increased twice.


That's why it matters in that situation, because it changes what actually happens (not just the performance).


As @andersa correctly pointed it out, put the function in a header file and refrain from inventing useless complications.

How could this post get to the front page of Hacker News?


My guess is it's the combination of:

1. A pushback against code standards a lot of organizations have

2. The ego a lot of individuals have that is incompatible with #1

again, just guessing



