Deep Bug

bluelu · on April 10, 2024

I had a similar bug more than 10 years ago, also while building a search index through Lucene. It would crash after hours of running with an impossible nullpointer exception. It always appeared after running for hours, and running that specific iteration would not trigger the exception, so hard to reproduce.

Turned out it was a java jvm bug which was triggered when the jvm decided to recompile that code part since it was used more frequently.

Try running your code with the -server flag and see if that makes a difference.

blauditore · on April 10, 2024

Reminds me of a time where I almost certainly found a bug in Safari Mobile's JS implementation. Some variable ended up undefined in a place where it was provably impossible. It didn't happen consistently and was prone to right timing, so extremely hard to pin down and debug. Given the fact that all debugging had to be done tediously through a phone emulator and remote dev tools (it was inside a Cordova app), I eventually just gave up and added some sort of if-else for that case.

darepublic · on April 10, 2024

Cannot hear the name cordova app without a rush of negative memories

blauditore · on April 17, 2024

Why? I had fairly good experiences with Cordova, once I figured out the right way to use it (documentation tends to lack a bit IIRC).

NobleExpress · on April 10, 2024

-server is only a thing for HotSpot. They mention that HotSpot works perfectly fine. There is no option of "-server" for Graal native-image. It has "-O{0,1}" though for turning optimizations off and on respectively.

marginalia_nu · on April 10, 2024

It's not a native image, it runs on the (JIT) Graal JDK.

NobleExpress · on April 10, 2024

Ah right. I assumed it was a native image.

marginalia_nu · on April 10, 2024

I think the project does have a bit of a naming problem. They've gotten the part out where it's very fast and pretty good, but everything is named GraalVM-something and it's not always entirely clear what's being referred to.

wwilson · on April 10, 2024

If you would like us to repro this bug in Antithesis and help you debug it there, email me and I’ll hook you up: will@antithesis.com

yellow_lead · on April 11, 2024

If this is a JVM bug, would using antithesis be able to repro it?

NobleExpress · on April 10, 2024

Interesting. Perhaps you can inspect the disassembly of the function in question when using Graal and HotSpot. It is likely related to that.

Another debugging technique we use for heisenbugs is to see if `rr` [1] can reproduce it. If it can then that's great as it allows you to go back in time to debug what may have caused the bug. But `rr` is often not great for concurrency bugs since it emulates a single-core machine. Though debugging a VM is generally a nightmare. What we desperately need is a debugger that can debug both the VM and the language running on top of it. Usually it's one or the other.

> In general I’d argue you haven’t fixed a bug unless you understand why it happened and why your fix worked, which makes this frustrating, since every indication is that the bug exists within proprietary code that is out of my reach.

Were you using Oracle GraalVM? GraalVM community edition is open source, so maybe it's worth checking if it is reproducible in that.

[1]: https://github.com/rr-debugger/rr

postatic · on April 10, 2024

I had a bug in my code during my research degree. I was able to get some results and write it up as a thesis, submit it and get my degree.

That was almost 15 years ago. But to this day, every now and then I feel a little guilty about getting that degree because I haven’t truly solved that bug.

taneq · on April 10, 2024

Research code doesn’t have to be bug free, it just has to work correctly long enough to generate the graphs for your paper.

joncrocks · on April 10, 2024

Had a similar bug in a production system about 10 years ago. Indexes got corrupted after some length of time. Only in production, couldn't replicate. A distributed system as well.

We (mainly a collegue of mine) eventually traced it down to a smaller part of the code until we noticed a subtle bug.

Multi reader + multi writer and a atomic operation that miscounted the number of readers.

Took us a couple of weeks to track down if I remember right.

https://issues.redhat.com/browse/ISPN-4777

marginalia_nu · on April 10, 2024

Unrelated but related: Doing systems programming in Java actually makes me appreciate all the little things C does to ensure your program run correctly.

vsnf · on April 10, 2024

Hearing that C has assists for program correctness is, shall I say, unexpected to say the least. What kinds of things are you referring to?

eddd-ddde · on April 10, 2024

Probably the things C does not do? No messy vtables, no exceptions, no vm, etc. If you just do less stuff it's less likely you do something wrong.

elteto · on April 10, 2024

You trade that off by a healthy dose of foot guns and UB. And a non-existent standard library that encourages NIH write-my-own for the most common data structures. C barely does anything and it has a stark beauty of its own, agreed. And yet it is incredibly hard to write secure code with it.

af3d · on April 11, 2024

Developing robust programs in C just requires a clear understanding of the language itself (and of course a good overall skill set as a programmer). I have been writing (mostly) bug-free code in C for over twenty years. My only real complaint is that it forces you to write lots of "boilerplate". Beyond that, it is a fine language for systems programming (provided you know what you are doing). I have had many more issues working with scripting languages (I'm lookin' at you, Java and Ruby!) where the bugs existed not in my program, but in the VM itself. Having said that, I do enjoy working with higher-level abstractions.

ignoramous · on April 10, 2024

> What kinds of things are you referring to?

Pointer arithmetic, I imagine. Elementary math after all.

elteto · on April 10, 2024

You found one compiler bug after moving to a new version of your compiler/runtime. Not unheard of, to be honest. Now think of all the common bugs you _didn’t_ get in all these years because you wrote your code in Java instead of C.

marginalia_nu · on April 10, 2024

I'm mostly referring to the fact that half the debugging was spent on a wild goose chase combing through integer types.

With unsigned integers, type aliases and pointer types this would have been a lot easier to keep track of. The limited range of primitive integers also means static analysis is basically impossible. Assing a pointer to an integer or vice versa and that's a code smell. Assign an integer to an integer and the compiler is none the wiser.

elteto · on April 10, 2024

I think you went on a wild goose chase _because_ it was a compiler bug. You never assume "it's the compiler" on the first try, so you go looking for bugs in your code that don't exist. Java's integer types maybe didn't help in this situation, agreed. And the lack of unsigned integers is harebrained, agreed too. But I think you wouldn't have had an easier time in C.

You say that the lack of unsigned types in Java is in part responsible for some of your wasted time. But C has notoriously bad integer types and promotion rules, especially for interactions between signed and unsigned! If you had found this same bug in C code (a compiler bug), your first instinct as a C programmer might have been "it's probably some signed/unsigned interaction issue". And you'd also have wasted time trying to find a non-existent issue in your code.

kragen · on April 10, 2024

entirely by coincidence, i have spent a substantial part of the night finding signed/unsigned bugs in my c code by carefully reading through the disassembly, so i endorse this message

int_19h · on April 11, 2024

C# is perhaps a better example, as it does have unsigned integer types, but it also does the sensible thing by e.g. promoting mixed int/uint operands to long. If you try to assign the result of such an expression to either int or uint without casting, it's a compile-time error.

virgilp · on April 10, 2024

Like, what? Honest question. I can think of a few things that Java does (bounds checking, for one), but very few things that C does.

(also, not entirely sure what you meant. There isn't a canonical C compiler, or toolset, and the language itself really doesn't do anything to prevent you from shooting yourself in the foot)

marginalia_nu · on April 10, 2024

Like having unsigned integers and pointers. Java makes you do pointer arithmetic with all signed integers that all have the same type as every other integer in the program.

afandian · on April 10, 2024

What do you mean by pointer arithmetic in Java? Real pointers in FFI? Or array indexes?

marginalia_nu · on April 10, 2024

Offsets in files and mapped memory. None of this is done through Java objects or arrays, but rather by reading data directly from memory or disk and manually keeping track the data's type and structure.

deely3 · on April 10, 2024

And how exactly C is better in these situations?

marginalia_nu · on April 10, 2024

It has pointer types and unsigned integer types.

mike_hearn · on April 10, 2024

Does the new Panama memory API help? There are new pointer-like classes for indexing into memory mapped files.

Also, Kotlin has unsigned integers. So you could use that for the parts of your code that benefit from it.

marginalia_nu · on April 10, 2024

It helps in the sense that it's no longer necessary to slice up the file into individual 2 GB byte buffers, though I'm not really using MemoryLayout/StructLayout much yet to do structured access. Haven't made sense of how to use that part of the API well yet.

shrx · on April 10, 2024

If you're using pointers in Java you're doing something wrong

marginalia_nu · on April 10, 2024

If you're going to do this sort of low level database-adjacent work in Java, this is basically how you need to do it. It's pretty unpleasant and the language resists you in every way, but luckily in most cases it's only a small core of the program that needs to be written this way.

Well, that or JNI/FFI in some C code I guess.

kamov · on April 10, 2024

Sorry, I don't want to be that guy, but I'm curious why use Java and not something like Rust? Is it because of the Lucene ecosystem?

marginalia_nu · on April 10, 2024

The short story is I'm productive in Java, and I enjoy its mature ecosystem and stable APIs. Systems programming is awkward, but it's also a very small part of the project and for what it is, an interesting challenge.

What I'm building is what Lucene does (i.e. document indexing) and then the rest of the search engine as well including crawling and serving traffic.

int_19h · on April 11, 2024

Have you tried C# for those kinds of things? It's normally the same level of abstraction as Java, and has similarly mature ecosystem with stable APIs, but it also has unsigned types, unmanaged data and function pointers (with pointer arithmetic), unions, the equivalent of alloca etc for when you need to go low-level.

YoshiRulz · on April 10, 2024

Have you ever tried Kotlin? For non-system-level stuff on the JVM, it's IMO a direct upgrade to Java. `value class` for type aliases, `when`, and better null handling. But then, modern Java is doing all that stuff too, I hear.

mrkeen · on April 10, 2024

I did this.

I previously worked at a search engine which did its index building and querying in C, with its higher-level stuff (web-apps, scheduling, tooling, etc.) in Java.

Later when I built my own version, I started with C for the low-level and Haskell for the high-level. I made a few iterations in C, but eventually rewrote it in Rust, and I was pretty happy with that choice.

I was more familiar with C, and it was a really good fit for what I was writing. Terms and Documents just become termIds and docIds (numbers) sitting in indexes (arrays). Memory-mapping is a really comfortable way to do things: files are just arrays; let the OS sort out when to page things in and out of disk.

But where C fell down for me was in the changing the code. The meaning of the data&code were lost in nested for-loops and void-function-pointers, etc. Rust gave me a better shot at both writing and rewriting the code.

Java for the low-level was a non-starter for me for a few reasons, but the biggest two were startup time and difficulty of the mmap api (31-bits of address-space for a file? c'mon!).

marginalia_nu · on April 10, 2024

> Java for the low-level was a non-starter for me for a few reasons, but the biggest two were startup time and difficulty of the mmap api (31-bits of address-space for a file? c'mon!).

Neither of these are issues anymore though. Especially not with graal's native images.

gaogao · on April 10, 2024

Ooo, I also had a graalvm migration bug, where it was corrupting a deep aes256 call which was costing us ~1M dollars a day or so until we figured it out.

mastermedo · on April 10, 2024

Yikes, can you explain a bit more? It sounds like a nightmare to debug.

gaogao · on April 10, 2024

Yeah, I think it was that an AES function/instructions and some array clearing were getting reordered unsafely, but only under some very specific distributed modes. We were mostly able to get repo after a couple of days, but it was complicated by having a pretty deep stack and the fact that the corrupting change was being canaried out, not released.

gergo_barany · on April 22, 2024

Hello, visitor from the future coming across this old discussion! In case you're wondering, marginalia_nu did come up with a usable reproducer for the bug, filed a bug report, and the problem is fixed: https://github.com/oracle/graal/issues/8747

tuyiown · on April 10, 2024

I had a similar problem on a nth program start, in seemingly the same conditions, was a disk cache memory hardware failure, some memory block return 1, whatever was written into it. The bug trigger was highly erratic, because corrupt reads only was on cache read, and crashed my program as it was one of the single thing running on the system most of the time, and one of the single thing that was re-read enough to be kept in cache despite large disk traffic.

The program had a large number of hour without incident, was thoroughly tested with unit test, and had 0 valgrind warning. I'd been intermittently tracking the problem for months, and the fact that it was an hardware issue was a huge relief, as I really started to loose footing on the necessary minimum of confidence in predictability you need to write software.

dangoodmanUT · on April 10, 2024

I think this is an interesting metaphor about "compatible" runtimes. I see the same a lot with Bun and Node. Node works, Bun doesn't, completely unable to explain why.

joshstrange · on April 10, 2024

This is my biggest concern with bun and deno, they both look awesome and provide some really cool features out of the box, but I’m terrified that I won’t be able to fix a low level problem if one arises. I’m already dealing with the extra layer of AWS lambda, I find the trade-off to be worth it, but adding more complexity seems like a bad idea.

kazinator · on April 10, 2024

I see this might be where he poked at the code in relation to this problem:

https://github.com/MarginaliaSearch/MarginaliaSearch/commit/...

Directory: https://github.com/MarginaliaSearch/MarginaliaSearch/tree/ma...

marginalia_nu · on April 11, 2024

Yeah that commit would be when I thought I'd perturbed the bug out of existence.

koliber · on April 10, 2024

Love a good debugging story. Curious if there is a way to narrow the bug down.

Looking at the code, it seems it is copying val to counts. It does analogous arithmetic on val an on counts, and since they both should contain the same data, it's surprising that the result of the arithmetic does not match.

Can you reorder the code to see if the error is reproducible? Do it only for a diagnostic run, and don't change the actual code. For example, change the first loop into two loops:

  for (int i = 0; i < length; i++) {
    counts[i] = val[i];
  }

  long offset = 0;
  for (int i = 0; i < length; i++) {
    offset += val[i];
  }

Perhaps some combination of writing val[i] and immediately reading it causes an issue.

Another variant to try is to add another running counter that mimics the size arithmetic, but in the first loop, and compare the three values to see if you get 2 out of three to match:

  long offset = 0;
  long size_debug = 0;
  for (int i = 0; i < length; i++) {
    counts[i] = val[i];
    offset += val[i];
    size_debug += counts[i];
  }

If this does not change the behavior, see if there is a region of the file where size and offset diverge. Maybe it is in a random place. Maybe it's always near the end. Maybe it's localized somehow. How to do this would involve writing a value of offset every X iterations of i, as well as writing the value of size every X iterations of i, and seeing if there is a pattern where they diverge. Do it every 1Gb or so.

If I had to place a long bet, my guess is that somewhere in the first loop, after assigning counts[i] = val[i]; reading from val[i]; does not return the same value. You said it's in the same thread, but we need to admin this is a deep bug, so all bets are off.

rc_kas · on April 10, 2024

> “Doesn’t work” is famously not an error description

This guy is extremely self-aware

neerajsi · on April 10, 2024

Seems like it could be a compiler or GC bug (or a bug with the safe pointing interaction between compiler and GC).

It's possible that a write to the counts array is misdirected so it doesn't occur.

Is the sum of counts always less than the sum of vals?

On second thought, if counts is mmapped, GC might not be in play. But a similar bug might occur due to state being lost during an On-stack-replacement of one of the loops.

I'm not familiar with deep jvm stuff, but is there a way to ask the jvm for a compilation log to see if JIT OSR is happening?

KingOfCoders · on April 10, 2024

"The Bug" by Ellen Ullman

dwattttt · on April 10, 2024

They're a little off when looking for 32bit overflows: unsigned 32bit integers top out at ~4gb, that kind of bug would've shown up much smaller than 32gb.

marginalia_nu · on April 10, 2024

The indexed data consists of 128 bit records, not bytes.

dwattttt · on April 10, 2024

Apologies, I was confused by the reference to 32bit integer overflows for file sizes over 32gb.

marginalia_nu · on April 10, 2024

Yeah, file size is 2 GB x record size (2 and not 4 because signed ints.)

pixelmonkey · on April 10, 2024

Great post. I posted about my real-world experience of a deep bug like this on lobste.rs. Here is that comment/explainer:

https://lobste.rs/s/qjrdss/computers_can_be_understood#c_hqu...

For any fellow sufferers, the needle in the haystack is "skb rides the rocket." Same bug was also hit by Brendan Gregg around the same time.

variadix · on April 10, 2024

Seems like inspecting the JVM byte code for errors would be the next step.

justinclift · on April 10, 2024

Which architecture (and preferably) what models of cpu is it occurring on?

marginalia_nu · on April 10, 2024

I've reproduced both in bare metal on an Epyc 7543 and a qemu-virtualized Ryzen 9 3900X; so x86_64.

justinclift · on April 10, 2024

Interesting. Might be worth a try on something non-AMD if possible, just to rule out some weird microcode bug hanging around in the Zen family?

_wire_ · on April 10, 2024

"You can prove the presence of bugs, but not their absence" —A Programmer

The systems employ concurrency at multiple levels, where the system design lacks a unified paradigm for establishing its correctness.

So concurrency "errors" should be expected.

I always start wondering why such bugs aren't more common. But then I realize a dilemma.

For concurrency errors, you can know an error occurs but have no way to work backwards to the specific conditions of its cause (incorrect design). All you can do is perturb the system into not exhibiting the error.

Lacking a formal way to establish correctness, we are left to an engineering of attrition, in which the author is engaged.

If the failure becomes common, system parameters will be adjusted to perturb behavior back into obscurity with a black art called "debugging" that approximates correctness.

The contravening property of the system is the underlying logic runs so fast with respect to human attention that failure modes are pressed into high likelihood of being observed and therefore "corrected" (approximately). Bugs which are common enough to merit attention are perturbed out of "existence" by "version charges."

This seems to imply that such failures should be both expected as unavoidable and rare according to limits of human attention according to engineering by attrition.

Welcome to the author's world. We are all facing this sunk cost.

In the bug picture (haha) we simply abide systems that "work" according to a distribution of our tolerance for the nuisance of their inevitable failure in modes that are too rare to be "corrected."

As the author explains he's working with newer versions of a machine. Huzzah!

Meanwhile, the ubiquity of deployments is scaling towards infinity, implying that a small clique of individual humans can expect to be driven mad by faults that appear demonic while the hoard of humanity lumbers on enduring the costs of "good enough" design.

Except maybe for the contingency that the strategic nuclear deterrent is be placed under control of an AI.

Luckily for the individual human there is death.

Unluckily for humanity, someone is likely trying to place the strategic nuclear deterrent under control of an AI

pyinstallwoes · on April 10, 2024

Related the forbidden "glitch" - not allowed to discuss.

https://www.youtube.com/watch?v=47lcjbyqF_k&t=4066s

1. Measuring Deep Metastability and Its Effect on Synchronizer Performance -

2. He Who Hesitates is Lost: Decisions and free will in men and machines - http://async.org.uk/David.Kinniment/DJKinniment-He-Who-Hesit...

"Buridan's ass is an illustration of a paradox in philosophy in the conception of free will. It refers to a hypothetical situation wherein an ass (donkey) that is equally hungry and thirsty is placed precisely midway between a stack of hay and a pail of water. Since the paradox assumes the donkey will always go to whichever is closer, it dies of both hunger and thirst since it cannot make any rational decision between the hay and water.

A common variant of the paradox substitutes the hay and water for two identical piles of hay; the ass, unable to choose between the two, dies of hunger.

The paradox is named after the 14th-century French philosopher Jean Buridan, whose philosophy of moral determinism it satirizes.

Although the illustration is named after Buridan, philosophers have discussed the concept before him, notably Aristotle, who put forward the example of a man equally hungry and thirsty, and Al-Ghazali, who used a man faced with the choice of equally good dates.

A version of this situation appears as metastability in digital electronics, when a circuit must decide between two states based on an input that is in itself undefined (neither zero nor one).

Metastability becomes a problem if the circuit spends more time than it should in this "undecided" state, which is usually set by the speed of the clock the system is using. Interesting"

_wire_ · on April 10, 2024

Great references! Thank you for commenting.

I dislike the figure of speech "A lot to unpack," but the Catt dialog on "the glitch" and electric current has a lot going on and I found it well worth the listen.

In interview 2/2 on electric current the "Demystifying Science" pair strangely fall into the orthodoxy of wanting to block and control dialog in order to manage the apparent controversy illuminated by Catt's perspective. They fight back even as they clearly express curiosity about, and sympathy with\ Catt's views and laments. I came away noting there's a profound natural block in the discourse of science towards a commonsense which is obviously insufficient to accommodate the world as we now find it.

It so happens that N. Chomsky gives a lucid presentation on the history of science that pertains directly to Ivor Catt's laments on blocking of science by an engineering orthodoxy.

Chomsky - Mind, Language and the Limits of Thought https://www.youtube.com/watch?v=2NffP06zqkw

The talk begins with a proposal for 3 problems — "Plato's, Orwell's and Descartes'" — supported by a review of modern scientific thought, then segues into a supporting illustration of modern effects of these problems in the orthodoxy of NATO policy in Serbia. It's the tightest package of criticism of contemporary thought I've come across and will not be a waste of time for viewers at any level of interest and familiarity with the history of thought.

There are two versions of this presentation on Youtube made to different audiences on different dates. I prefer the clarity of this one.

99112000 · on April 10, 2024

One word: fuzzing.

marginalia_nu · on April 10, 2024

Fuzzing finds logic errors though. This appears to be a compiler error.