Debugging a weird 'file not found' error (jvns.ca)
145 points by ingve on Nov 18, 2021 | 58 comments



Misleading error messages are the worst.

As a software developer, it's bad enough that I have to constantly challenge my assumptions to try to debug something. It's even worse when I have to think "What part of the system is lying to me?"

Recently I upgraded Jira Service Desk (new name: Service Management) to a new version. After the upgrade I got an error in the administrative section of the app. It popped up an error message saying "Something is wrong. Contact administrator." Ugh... but I'm the administrator! And I don't know what's wrong.

Maybe it's not a misleading error message, but it's certainly an unhelpful one.

Since software can be a lying devil, I am happy that I can make use of so much open source software. I've often delved into the source code to understand what's going on. Once I do that I can either fix the issue, work around it, or even submit a PR upstream.

It's much harder to figure out what's going wrong with closed source software. And even if I do figure it out, I'm powerless to change the situation.

Now if you'll excuse me, I'll be contacting Atlassian support to help me fix the "Something is wrong" message.


One of my "favorite" examples is Docker's CMD Dockerfile instruction. I often build static Go binaries and deploy them on a "scratch" base image, which just means there's nothing there but the kernel and your app. But when I set `CMD "/bin/app"` or whatever, it throws an error complaining that `/bin/sh` doesn't exist. Apparently `CMD "/bin/app"` compiles to `/bin/sh -c /bin/app` or whatever, and to get it to work properly you instead have to make it an array: `CMD [ "/bin/app" ]`. I only run into this once every ~9 months so I tend to forget the details, apologies if this description isn't quite correct.


Ah, yes, Docker uses the shell depending on what form of CMD you use. I do find that to be poor design. I would much prefer that they have separate instructions for those, perhaps something like SHELLCMD vs RAWCMD. Then it's immediately clear which is which.
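
To make the two forms concrete, here is a minimal Python sketch of the rule as described above. It is only a model, not Docker's code: `container_argv` is a hypothetical helper, and the fallback to shell form whenever the value is not a JSON array of strings is my understanding of the documented behaviour.

```python
import json


def container_argv(cmd_value):
    """Model how a Dockerfile CMD value turns into the process argv.

    Hypothetical helper for illustration; not Docker's implementation.
    """
    text = cmd_value.strip()
    if text.startswith("["):
        try:
            parsed = json.loads(text)
        except ValueError:
            parsed = None
        # Exec form: a JSON array of strings is used as argv directly.
        if isinstance(parsed, list) and all(isinstance(a, str) for a in parsed):
            return parsed
    # Shell form: anything else is handed to /bin/sh -c as a single string.
    return ["/bin/sh", "-c", text]


shell_form = container_argv('"/bin/app"')
exec_form = container_argv('[ "/bin/app" ]')
print(shell_form)  # needs /bin/sh, which a scratch image does not contain
print(exec_form)   # runs the binary directly
```

The point of the sketch is that the only thing separating the two behaviours is whether the value parses as an array, which is easy to miss when reading a Dockerfile.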


Microsoft is the worst for this. If you have a cryptic Linux error message it's usually because something "interesting" has happened, whereas on Mac they tend to be fairly descriptive (particularly in dmesg). Microsoft's offerings inevitably are a world of "Sorry, something went wrong! ID $RANDOM_HEX" and "An unexpected error occurred". I cannot stand it.


I have to disagree there. Yes they're only writing the id of the error, but if you just Google that you usually get a super in-depth description with steps on how to resolve it.

Just printing error ids is fine if there is a good knowledge base which addresses them.


> Yes they're only writing the id of the error, but if you just Google that you usually get a super in-depth description with steps on how to resolve it.

This is a lot of fun to do when you're trying to figure out why your network driver is crashing. Had that happen before and had to resort to using my mobile phone, trying to retype the code from my computer exactly. Again, lots of fun. Extra fun ensues when the solution involves having to download a new driver, which you then have to download on the phone, figure out how to transfer (good luck if you have an iPhone) without using WiFi since your network card doesn't work, and then finally apply the update.

Maybe, just maybe, they can add a "Details" button that would actually print out the error itself.


Sure, most would of course like that but then they would also need to get all N highly technical and detailed error messages professionally translated into all Y languages they support, right?

That is like a separate project, just as collecting/authoring the messages initially would be.

Or maybe this part of Windows is English-only? Not sure. Still, it would be quite the project to manage only for English.


> Sure, most would of course like that but then they would also need to get all N highly technical and detailed error messages professionally translated into all Y languages they support, right?

Not sure. If the error messages are translated on the website, they could as well put those translations directly into the OS. If they are not translated on the website, no need to translate them in the OS.

AFAIK, languages have to be separately installed/shipped with the installation, so if you have EN-US installed, you'll just have the error messages for EN-US.


Translating error messages just makes them less searchable. Don't. (if we're talking about drivers/kernel at least :))


Yeah, MS already has a huge knowledge base, so why can't they ship the errors they already know of with a detailed fix? W11 already requires 64GB or more of storage to install. What is so bad about another GB of text, if it came to that?


Furthermore, there's no way it would be a gig of text.

War and Peace is about three megabytes uncompressed.

So, even if you shipped all the translations, I doubt you'd have a full gigabyte's worth. Especially not if you compressed them while at rest.
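
As a rough back-of-the-envelope check in Python, taking the "about three megabytes" figure above at face value and using decimal units (both of which are assumptions):

```python
MEGABYTE = 1000 ** 2
GIGABYTE = 1000 ** 3

novel_size = 3 * MEGABYTE  # "about three megabytes", per the comment above
novels_per_gigabyte = GIGABYTE // novel_size
print(novels_per_gigabyte)  # 333
```

So a gigabyte of uncompressed error text would be on the order of a few hundred very long novels.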


As a user, all the cloud services errors seem to do is generate a huge internal debugging string that I can't find by googling. My university's MS administrators have outright told me that they are useless. Why can't I authenticate to this service? Why do I get "Sorry, something went wrong" – they don't know. The error comes and goes periodically. It's infuriating.

As the other poster said, error codes made a lot of sense when bytes were expensive (you could fit 255 kinds of fail in a single one!). Bytes are no longer expensive. Brain time is more expensive now. Optimise for that – and be descriptive.


I think the worst for this is probably just about any C++ compiler when you make a mistake trying to use a template. You can easily get an error message that is several kilobytes and dozens of lines long.


One that always annoyed me was running a 32 bit binary on a 64 bit Linux system - it says file not found, and when I first encountered that it threw me for a little while.


Yep, that's the downside of syscalls only giving you a finite set of errors they can return, instead of being able to say "couldn't find DT_INTERP '/lib/ld.so.1'"


On the bright side, syscalls have manpages that accurately describe their errors. In this case, "man 2 execve" says:

> ENOENT The file pathname or a script or ELF interpreter does not exist.


Would you think to run "man 2 execve" when getting "file not found" with a random binary though?


No, but I'd think to do the ls and the strace as the author did. Then I'd check the manpage instead of stackoverflow. Obviously her way worked, too, but I prefer manpages. /shruggie

(Well, I just knew the answer in this case because I've hit this particular error a bunch of times, but that's not really the point.)

I actually agree with jmgao's point that this inscrutability is the downside of syscalls' limited error return; just pointing out the good as well as the bad.


Yes, you'd sooner reach for strace, and when that also fails obscurely you'd then run ldd on the binary...


Salesforce is like this too! Cryptic error messages that say "contact admin" when I am the admin.


Playing around with a mainframe OS from the 70s (I'm weird like that), I realized that IBM has that wonderful thing where nearly every message has a code that you can use for lookup. In manuals or online, for example.

If you ever used OS/2 you might have seen that, too.

The problem is, the original handbooks are hard to come by[1], and while another wonderful thing (in some aspects) is that a modern, current mainframe is still compatible with that old code from the 70s, and so you can use their resources, you sometimes encounter error codes where the description is basically just "contact IBM". Most likely because IBM really does not expect you to encounter that error on a mainframe delivered since, say, the Berlin Wall fell.

Even if I wanted to, I doubt IBM wants to sell me a support contract for their 4 decades old last public domain version of their mainframe OS on an emulator.

[1] ... and especially on that old OS the message itself can be cryptically short, likely because even a few bytes still cost serious money in the beginning. You still see that in bootloaders.


writing good error messages is hard* and a potential security issue if it leads to information leakage. for this reason in my team (not jira) we opt for "sorry, something is wrong" per default in the UI (with maybe the real error message in the logs) and only provide the real message if deemed necessary.

* and good developers aren't necessarily good writers


Yes, it is hard to write a good error message. Sometimes it's not necessary to write a good error message if you can provide something that the user can Google or at least something that's useful to someone who can investigate further.

For example, the old "Error occurred, check the logs" is useful because it tells me that there is info in the logs.

It's even better if it says "Error 4824 occurred, check the logs" because it gives me an error code to try to look up.

Better yet is "Error 4824 occurred, check the foobar-error.log file for transaction_id nv3kd32dx9" which also gives you which error log to check AND gives you a unique ID to grep for. Sadly, I've NEVER seen that type of error message in any software.
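
For what it's worth, that last style is not much work to produce. A minimal Python sketch, assuming a hypothetical `report_failure` helper and an in-memory stream standing in for foobar-error.log (the error code and file name are just the ones from the example message above):

```python
import io
import logging
import secrets

# In a real service this would be a FileHandler on foobar-error.log;
# an in-memory stream keeps the sketch self-contained.
log_stream = io.StringIO()
logger = logging.getLogger("foobar-error")
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(log_stream))


def report_failure(code, exc, log_name="foobar-error.log"):
    """Log the real error and return a safe message that points at it."""
    transaction_id = secrets.token_hex(5)  # 10 hex characters to grep for
    logger.error("transaction_id=%s code=%d detail=%r", transaction_id, code, exc)
    return (f"Error {code} occurred, check the {log_name} file "
            f"for transaction_id {transaction_id}")


message = report_failure(4824, FileNotFoundError("/lib/ld.so.1"))
print(message)
```

The user-facing string leaks nothing beyond a code and an opaque ID, while the log line carries the real detail, which also addresses the information-leakage concern raised above.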


> It's much harder to figure out what's going wrong with closed source software. And even if I do figure it out, I'm powerless to change the situation.

Actually, the source can lie since the compiler could be buggy. The disassembly won't, and if you see the equivalent of 1+1=3 then you know you probably have stumbled upon a hardware bug. A hex editor is your friend; you can spend hours or even days on trying to figure out how to compile the thing correctly with all its dependencies and not change something else, or a few seconds to minutes finding what bytes to change.


Even as a system level programmer, who actually sees bugs in compilers, CPUs, and anything else, I would postulate that the vast majority of programmers in the vast majority of cases would be totally sufficiently equipped with the source code, with no need to know assembly and how to reverse engineer a binary.

In fact, the chance of running into a compiler bug is so low at higher levels, that you're better off assuming that it is not a compiler or hardware bug. If it is, you're either doing something so special that you a) might want to be doing it differently if inadvertently, or b) know exactly what you're dealing with already. Or you encountered something worth bragging about, and finding such a corner case unicorn in assembly when you don't even know the original source to know what is supposed to happen, gosh...

Otherwise, it's a bit akin to expect from a car mechanic that they can run a debugger on the ECU, or doing chemical analysis on exhaust buildup...


> with no need to know assembly and how to reverse engineer a binary.

That's precisely the problem. People not knowing the fundamentals --- because the industry doesn't like for it to be widely known that a lot of DRM and other user-hostility can be disabled with changing a single bit, for example --- thinking open-source is somehow a panacea, then either wondering why it's so difficult to fix what should really be a simple thing that is "open source", or spending a ton of time trying to do it anyway. Then there's the stuff that's "open source" but effectively "look but don't touch" because they are so massive or otherwise locked out in some other way. If you've tried to change something in Android or one of its apps, you might have experienced that. (Fortunately, Java decompiles very easily.)

In other words, RMS was only partially right about open-source. Don't let that be a distraction from the real issue of software freedom which is closer to "right to repair"/modify.

> Otherwise, it's a bit akin to expect from a car mechanic that they can run a debugger on the ECU, or doing chemical analysis on exhaust buildup...

No, it's more like expecting that one will look for spark, fuel, and air on a no-start before blindly swapping parts, or indeed know how to tell whether the mixture is rich or lean from the deposits on the spark plugs...


Expecting everyone to know how to read a hexdump is a very unusual take on CS education; I would not rank platform-specific assembly knowledge above control flow statements, command-line semantics, or other Intro to Programming-type material.

There's also whole categories of user-hostile software that can't be fixed with a simple binary patch -- ASM can still be obfuscated, or DRM might be enforced by a secondary processor. Dealing with those at a technical level is what I would definitely consider above the basics.

Similarly, right to repair advocacy has focused on cultural and legislative changes rather than greater technical competence. It's a good and useful skill but addressing the ills of the software industry won't come from assembly knowledge.


I guess I struck a nerve or some vested interests... but that's how it is: industry-brainwashed half-competent and unquestioningly docile developers unfortunately make up the bulk of them. Open-source is merely a distraction while they slowly destroy freedom. Dig deeper and you'll see the uncomfortable truth.


> Actually, the source can lie since the compiler could be buggy. The disassembly won't ...

...unless the disassembler is buggy.


This is at the top of my list of "least helpful error messages" and isn't just a golang thing: any dynamically linked executable will throw this error in an unexpected execution environment.

Cheatsheet (standard issue hello world -- hello.c)

   # gcc -o hello ./hello.c

   # file ./hello
   ./hello: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=1e41ccb72d7a36ab5ac0c2c11bcd0e994e5eb013, not stripped

This bit is what throws that error if unavailable:

   "interpreter /lib64/ld-linux-x86-64.so.2"

Also, if ldd / file are unavailable but readelf is:

   # readelf -a ./hello | grep interpreter
         [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

   # readelf -a ./hello | grep NEEDED
   0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
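
If none of those tools are around, the interpreter path can also be read straight out of the file. A minimal Python sketch, assuming a 64-bit little-endian ELF (the `elf_interpreter` name is made up for illustration; it just finds the PT_INTERP program header, which is what readelf reports as "Requesting program interpreter"):

```python
import struct

PT_INTERP = 3  # program header type that names the dynamic loader


def elf_interpreter(data):
    """Return the PT_INTERP path from ELF bytes, or None if there is none."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    # ELF64 header: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for index in range(e_phnum):
        base = e_phoff + index * e_phentsize
        # ELF64 program header: p_type at 0, p_offset at 8, p_filesz at 0x20.
        p_type, _p_flags, p_offset = struct.unpack_from("<IIQ", data, base)
        (p_filesz,) = struct.unpack_from("<Q", data, base + 0x20)
        if p_type == PT_INTERP:
            raw = data[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\0").decode()
    return None  # no PT_INTERP: statically linked, or not an executable
```

Calling it as `elf_interpreter(open("./hello", "rb").read())` on the hello world above should return the same /lib64/ld-linux-x86-64.so.2 path; checking whether that path exists in the target environment is then a one-liner.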


Go is dynamically linked in order to use glibc's name resolution functions, as Go's own resolver doesn't support all of glibc NSS's (arguably bloated and mostly unneeded) functionality.

Anyway, I too found the "file not found" error to be terribly unhelpful the first time I encountered it half a decade ago. I thought I was going crazy.


"step 4: use readelf"

   "$ ldd  ./serve"
   "not a dynamic executable"

No need to use ldd, can use readelf, as in step 4

   $ readelf -d ./serve
   There is no dynamic section in this file.

http://www.catonmat.net/blog/ldd-arbitrary-code-execution/

"And statically linking in this case doesn't even produce a bigger binary (for some reason it seems to produce a slightly smaller binary?? I don't know why that is)"

To force use of built-in net libs instead of system ones, and remove debugging info to make the binary a little smaller, something like this

   $ CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -tags netgo -ldflags '-w -extldflags "-static"' -o serve *.go

re: "File not found": I encounter this "weird bug" all the time because I like to build smaller, self-contained, no-dependency systems and started doing so long before Docker existed. Static binaries are one way to solve the missing linker problem. Chroots and the like are another. I will sometimes create an alias that runs the program using chroot. Then I can keep the chroot directory as small as possible, populating it with only those resources the program actually needs. This is particularly fun when it is an older program, no source available, and requires older versions of glibc and other libraries.

Dynamic linking in a time of GBs of core memory and TBs of storage is something I still question. People say dynamic linking is useful for the library "version control" it allows, but this just becomes a PITA for someone wanting to use older programs. Lots of folks compiling static binaries today for their Github repos only do so with glibc, which is hostile toward static linking. For someone using musl, this is no better than a dynamically-linked binary.


Something I didn’t realize I would love about static languages after years of Python: programs more often crash at the error site. Python often keeps working for a while until the duck isn’t shaped correctly.

The number of times I’ve had to unravel a bug where we iterated on a string rather than the list containing strings…
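
A tiny Python illustration of that string-versus-list trap (the function names are made up for the example):

```python
def shout_all(names):
    """Upper-case every name; expects a list of strings."""
    return [name.upper() for name in names]


def shout_all_checked(names):
    """Same, but fail at the call site when handed a bare string."""
    if isinstance(names, (str, bytes)):
        raise TypeError("expected a list of strings, got a single string")
    return [name.upper() for name in names]


print(shout_all(["ann", "bob"]))  # ['ANN', 'BOB']
print(shout_all("ann"))           # ['A', 'N', 'N']: no error, wrong result
```

The unchecked version happily iterates over the characters, so the failure only shows up wherever the single-letter "names" finally cause trouble. The checked version is the manual equivalent of what a static type checker would flag up front.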


That is exactly one of the reasons why people advocate for static languages. What's more, programs often don't even compile if you mixed up your types somewhere (the "more static" and "more strict" the language is, the more that is true).


I heard this referred to as a language's tendency to "keep on truckin'".

Instead of causing an error by dividing by 0, JavaScript, for example, will return Infinity which corrupts any further calculations.

It can be caused by dynamic typing but it also creeps into any language that allows implicit type conversion. Java, for example, is statically typed but allows some implicit type conversion in primitive types.

The problem is that an implicit type conversion is easy for a programmer to miss, and thus they don't realize that the variable doesn't have the value or type they're expecting.


AFAIK, division by zero in IEEE 754 floating point shall return infinity, which is a valid floating point value. For once, javascript does the right thing.
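
Both points can be seen in a few lines of Python. Note that Python itself raises on float division by zero, so this sketch starts from `math.inf` as a stand-in for what `1 / 0` evaluates to in JavaScript:

```python
import math

try:
    1.0 / 0.0
    python_raises = False
except ZeroDivisionError:
    python_raises = True  # Python stops at the error site instead

quotient = math.inf               # stand-in for JavaScript's 1 / 0
print(quotient + 1)               # inf: arithmetic carries on without complaint
difference = quotient - quotient
print(math.isnan(difference))     # True: inf - inf has no defined value
print(difference == difference)   # False: NaN is not equal to itself
```

Infinity is a well-defined IEEE 754 value, but once it meets another infinity the result is NaN, and that is usually where the "keep on truckin'" behaviour starts to hurt.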


Fair enough. I was trying to come up with something weird that JavaScript does with numbers and I guess I chose the wrong thing.


Javascript is truly weird to me (I’ve programmed mostly in python and Fortran). From what I understood though, js uses IEEE 754 floating point for everything, simply because it works the same on pretty much any hardware platform. Arm, intel, power, whatever. Int/uint/… not so much. The numbers are therefore 100% sane. The insanity stems from converting «0.0» to float then guessing at how to format it (maybe «int» is what he meant?). Etc.


> I’ve run into this “file not found” error a couple of times, and it feels kind of mind bending because it initially seems impossible

I've seen this error often enough to know that if you're doing anything with execution, "file not found" = "the file, or one of its dependencies, was not found". On the other hand, "file not found" on an open() call when you seem to see that it exists --- which I have actually seen --- should take you down a different path of troubleshooting, such as making sure you're in the right path and on the right machine.


I work on iSH, which ships Alpine Linux for iPhone, so I know this error by heart: people constantly try to run software that relies on glibc, which fails on Alpine because of the missing interpreter. I guess it’s certainly a way to teach people about the differences in libcs, but it’s generally hard for people to figure out for exactly the reason given in the blog post. I wish that shells did a bit more here and tried to figure out if the ENOENT was from a missing interpreter or binary, but maybe I’m asking too much…


I've been dealing with POSIX (or more generally "unix-y") environments for so long, when I see "No such file or directory" I automatically tend to think "ENOENT" and not actually "file not found", as in something did not exist.

Many system calls (sysctl as just one example) happily return ENOENT when they did not find a thing no matter whether it is even remotely related to a file. I picked sysctl because it also can return ENOTDIR and EISDIR for something not related to directories at all.

It's commonly not that much of an issue when you handle the error in code, but if the error gets "helpfully" bubbled up with its strerror description (instead of just a number), it can sure confuse people.

Somewhat simplified: errnos were at some distant point in the past divined into existence, and system calls and system level C APIs have had to make do with what they got since then.
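
To illustrate the "bubbled up with its strerror description" part, a short Python sketch (the `describe` helper is made up; the exact wording of the descriptions comes from the platform's C library):

```python
import errno
import os


def describe(err):
    """Render an errno value the way a bubbled-up error usually looks."""
    return f"{errno.errorcode[err]}: {os.strerror(err)}"


for err in (errno.ENOENT, errno.ENOTDIR, errno.EISDIR):
    print(describe(err))
```

Each number maps to exactly one fixed string, so the text cannot say which thing was missing, or whether a file was involved at all.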


I'm intimately familiar with this particular error, having debugged it a few times on NixOS. Binaries tend to assume that /lib/ld-linux.so exists


Had the same because of using Nix. A file was mounted from an Amazon Linux 2 VM into a Nix-based container, and I ran into this exact problem.

One of the more interesting debugging adventures I had.


Reminds me of a copy-paste shell script error, between Windows and Mac, where the first line of the shell script turned out to be

#!/bin/bash^M

That is, a non-printing carriage return after "bash"

and running the shell script "helpfully" kept telling me: "/bin/bash: no such file or directory"
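
A small Python sketch of why the lookup fails. It assumes Linux-style `#!` parsing, where only space and tab separate the interpreter path from its argument, and a script whose first line ends in CR followed by LF; `shebang_interpreter` is a hypothetical helper, not anything the kernel or bash exposes:

```python
def shebang_interpreter(first_line):
    """Return the interpreter path named by a #! line (bytes), or None.

    Assumption: only space and tab count as separators, so a trailing
    carriage return stays part of the path.
    """
    if not first_line.startswith(b"#!"):
        return None
    rest = first_line[2:].split(b"\n", 1)[0].strip(b" \t")
    path = rest.replace(b"\t", b" ").split(b" ", 1)[0]
    return path or None


unix = shebang_interpreter(b"#!/bin/bash\n")
dos = shebang_interpreter(b"#!/bin/bash\r\n")
print(unix)                 # b'/bin/bash'
print(dos)                  # b'/bin/bash\r'
print(dos.endswith(b"\r"))  # True: the lookup is for a file named "bash\r"
```

Printing the path with `repr`-style escaping, as above, is what makes the stray byte visible. Written raw to a terminal, it is not.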


Does the shell filter the carriage return out in the error for some reason? I would more expect something like:

  /bin/bash
           : No such file or directory

(Not that that's more helpful to a user, but at least a trained eye would immediately get very suspicious.)


I guess it depends greatly on what version of bash and such, but with something recent, trying to execute the script will escape the problem characters when printing, if it prints the interpreter name at all, and possibly also say "bad interpreter: no such file or directory"

...since there's more than 2 valid line ending configurations?


Yeah that sounds well plausible, especially because there seem to be many (minor and major) versions of bash still around in different contexts. Not to speak of any other shell. Escaping seems best!

The "bad interpreter" being there or not might be OS-specific, though. Unless some bash versions use fork+execve vs. posix_spawn or similar and that makes a difference there. I haven't checked. Overall I can see how even some current stuff might give the confusing message...


That would be a "line feed" (0x0A); carriage return is 0x0D (ASCII values 10 and 13, respectively). The Mac/Windows conversion is the difference between CR and CR/LF.

All I can guess is that in terminal, when that particular version of bash and OS interacted, it didn't go back to the start of line, and the CR was unprintable, so it all ended up on the same line.

I'd expect to have seen something like what you describe with the line break, if it were a bare LF, but in my case, it was a bare CR.


Sorry, yes, I did mix up CR and LF in my example output that I would expect. So then I would actually expect:

   : No such file or directory

(And the rest still applies.)


This reminds me of k8s’ «out of memory error» which turns out to be «kill -9» / SIGKILL which can also happen by k8s’ own «kill process when health check fails».

I have no idea who thought this was a good idea, but this needs to be fixed. Somehow.
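
One thing that helps a little when staring at these: a killed process typically surfaces as exit status 137, which is just the shell convention of 128 plus the signal number. A minimal Python sketch of that decoding (the helper name is made up, and that an OOM kill shows up as 137 in your particular setup is an assumption):

```python
import signal


def describe_exit_code(code):
    """Decode a shell-style exit status, where 128 + N means signal N."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            pass  # not a signal number on this platform
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
print(describe_exit_code(1))    # exited with status 1
```

It tells you a SIGKILL happened, but not who sent it, which is exactly the ambiguity complained about above.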


  $ echo > /nonexistent-dir/filename
  zsh: no such file or directory: /nonexistent-dir/filename

  $ echo > /nonexistent-dir/filename
  bash: /nonexistent-dir/filename: No such file or directory

In the middle of a shell script or makefile recipe, this can be hard to spot, and it may take a while to realise why it’s actually failing: it’s the directory that doesn’t exist, not the file (which of course doesn’t exist, you’re trying to create it!). I recently hit a messy race condition related to this: https://github.com/alexforster/libbpf-sys/commit/e65b4962e7d... (described in the commit message).

I’ve been wondering about filing bugs against zsh and bash to be a little bit cleverer about their error message in this specific case, to only mention the nonexistent directory rather than the full path.
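
A rough Python sketch of what such a cleverer message could be based on, assuming POSIX paths (`first_missing_component` is a hypothetical helper, not something either shell provides):

```python
import errno
import os
import tempfile


def first_missing_component(path):
    """Return the shortest leading part of path that does not exist."""
    parts = os.path.abspath(path).split(os.sep)
    current = os.sep
    for part in parts[1:]:
        current = os.path.join(current, part)
        if not os.path.lexists(current):
            return current
    return None


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "nonexistent-dir", "filename")
    try:
        open(target, "w").close()
        caught = None
    except OSError as exc:
        caught = exc.errno  # ENOENT, reported against the full path
    culprit = first_missing_component(target)
    expected = os.path.join(os.path.abspath(root), "nonexistent-dir")

print(caught == errno.ENOENT)
print(culprit == expected)
```

The walk is racy by nature (the directory could appear between the failed open and the check), so it is only good for improving the message, not for deciding what to do next.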


Why does strace not show the kernel looking for, and failing to find, the rtld? (/lib64/ld-linux-x86-64.so.2)

When fooling around running linux binaries on FreeBSD, ktrace/kdump show the (FreeBSD) kernel looking for the rtld..


strace just watches syscalls, like truss. execve(2) returns ENOENT and that's it.


ldd is always of assistance in seeing why an executable file can't be loaded.


Yeah I had this error once and ldd solved it iirc.


> It works! I checked, and that’s an alternative way to fix this bug – if I just set the CGO_ENABLED=0 environment variable in my build container, then I can build a static binary and I don’t need to switch to the golang:alpine container for my builds. I kind of like that fix better.

With a static binary, you can even start your Dockerfile with `from scratch`. Feels good to have 5kb docker images


Such a good blog post, from start to finish.

There aren’t a lot of expertly-written in-the-weeds-tech blog posts, but this one was crafted so well. I guess that’s why Julia Evans’ name is so recognizable.


Suppose there was no Google result, and you were the first to ask about this. What would have been your next step?



