Woah. This is something that would cause me to switch over from Linux.
Reproducible builds are so important: they prevent a build server or developer laptop from being a single point of failure, since a tainted build can now be detected by others.
Debian developers have been working on Reproducible Builds for a number of years[0]. Currently the Debian amd64 distribution is ~93.7% reproducible[1]. It's possible that Debian amd64 systems will be fully reproducible in Stretch+1 (~2-3 years).
The Debian wiki page appears to be about reproducible builds for packages.
The NetBSD wiki entry is about progress on reproducible builds for tools, kernel and userland.
As for packages, reproducible builds under pkgsrc are still a WIP.
My use of third-party binary packages and pkgsrc is minimal, but I am continually building custom kernels and crunched binaries with build.sh, and I do make use of the binary userlands from releng.
For me at least, having reproducible builds for these alone is quite useful.
I should have guessed. I use their libc6 package for q)k) under /emul/linux, and in the past occasionally the live.debian.org images, but that is about as much as I know about Debian. Apologies for my ignorance.
93% of the distribution could be enough for almost everybody or for nearly nobody if some core packages are not reproducible yet. Is there any data on that? Maybe using the Debian popcon data?
No, the build system is completely unique to NetBSD. To my knowledge no other OS has the ability to cross-build any platform from any POSIX OS. NetBSD can build the whole OS from source tree to distribution media with a single command.
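For the curious, a typical invocation looks roughly like this (flags from memory; ./build.sh -h lists the exact operations, and the paths here are just placeholders):

    # Cross-build a full NetBSD release for an ARM board from any POSIX host,
    # as an unprivileged user, keeping tools and objects outside the source tree.
    cd /usr/src
    ./build.sh -U -m evbarm -O ../obj -T ../tools -j 4 release
    # add "iso-image" or "install-image" after "release" to also produce
    # installation media in the same run.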
The article specifically mentions that they borrowed the reproducible-builds.org analysis infrastructure that was developed by Debian.
"I would also like to acknowledge the work done by the Debian folks who have provided a platform to run, test and analyze reproducible builds. Special mention to the diffoscope tool that gives an excellent overview of what's different between binary files, by finding out what they are (and if they are containers what they contain) and then running the appropriate formatter and diff program to show what's different for each file."
> NetBSD can build the whole OS from source tree to distribution media with a single command.
This one, at least, can be done in NixOS/Guix once you check out the source -- the Nix package manager can technically be installed on any Linux distro (and there are ports to Cygwin/FreeBSD/Mac, etc.), and you can run a single command to get the ISO, or any other build product you want.
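For example, roughly as documented in the NixOS manual (attribute paths and module locations shift between releases, so treat this as a sketch):

    # Build a minimal NixOS installer ISO from a nixpkgs checkout, using the Nix
    # package manager installed on an existing Linux (or other supported) system.
    git clone https://github.com/NixOS/nixpkgs.git
    cd nixpkgs/nixos
    nix-build -A config.system.build.isoImage \
        -I nixos-config=modules/installer/cd-dvd/installation-cd-minimal.nix default.nix
    # the resulting ISO ends up under ./result/iso/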
The carefully tested and maintained portability/cross-compilation is another thing, though: NetBSD has fantastic support here that is not easily replicated without doing a ton of work. Its universal, basically-always-works cross compilation, everywhere, is rather unique. You can't build NixOS ISOs natively on, e.g., Nix-on-Darwin, which is rather unfortunate.
I don't quite understand the real-world use case Veriexec is designed to solve (for context, a sketch of its basic setup follows this list).
1) Prevent tampering by making part of the system immutable? The fingerprint isn't necessary; unconditionally prevent modification to the relevant files instead.
2) Prevent tampering by using trusted files? Normally this should be done by having a set of trusted keys, not hardcoded hashes. That way you can still securely upgrade the system.
3) Accessing files from a remote untrusted filesystem? This doesn't seem to work either; see the caveats section in veriexec(9).
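For context on the list above, the basic setup is roughly this (sketched from memory of veriexecgen(8) and rc.conf(5); check the NetBSD man pages for the exact entry syntax and strict levels):

    # Generate a fingerprint database for the default system directories;
    # it ends up in /etc/signatures with lines roughly like
    #   /usr/bin/ssh SHA256 <hex digest> direct
    veriexecgen
    # Enable it at boot via /etc/rc.conf:
    #   veriexec=YES
    #   veriexec_strict=1    # IDS-style: flag/deny executables whose fingerprints don't match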
They really don't. The compiler can still be malicious, since developers might be malicious or just screw up. You're just increasing the probability that the binary has the same vulnerabilities as the source, with nothing added during compilation.
You can just as easily download the source over one or more links to your OS and then compile it locally. That was what TCSEC required for trusted distribution. TCSEC was partly designed by the guy (Karger) who invented the compiling-compiler attack, if you're wondering about credibility. If your existing compiler is subverted, then your system is probably subverted with a rootkit, given that's easier than sniping one program you may never use in the intended way with an exotic attack. If you can't trust the repos for source or a local compiler, then how did you trust your kernel or anything else to begin with?
The real solution is secure SCM, ability to build from source, and a trusted point to start with. You have the latter unless your system is pre-backdoored. Reproducible builds might have other benefits such as during debugging. It's bringing you little security benefit over the stronger alternative I mentioned.
Agree that secure source is required but the compiler trust issues become different with reproducible builds: you only need to trust the initial operating system binary and intermediate source code. The next compiler is built reproducibly from the old compiler.
Any maliciousness in the compiler would have to be introduced in the source code (which is harder) or already be present in the initial operating system binary (disregarding bad hardware, which is bad for all systems testing builds).
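A minimal sketch of that bootstrap argument, with placeholder names (cc-old, build-compiler.sh and the output paths are hypothetical), assuming the compiler's own build is deterministic:

    # stage 1: build the new compiler's source with the old, already-trusted compiler
    CC=cc-old      ./build-compiler.sh --out stage1/cc
    # stage 2: rebuild the same source with the stage-1 result
    CC=stage1/cc   ./build-compiler.sh --out stage2/cc
    # Assuming cc-old compiled the source faithfully, stage2's machine code is
    # determined by the new compiler's source alone, so independent parties
    # bootstrapping from different old compilers should arrive at the same hash.
    sha256sum stage2/cc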
"you only need to trust the initial operating system binary and intermediate source code. The next compiler is built reproducibly from the old compiler."
That's kind of what I'm saying. In your model, you trust the binaries and files you received in the OS to run, with a bunch of them in kernel mode. You don't trust the compiler. In my model, I trust the semi-privileged compiler I received with the OS, since I already trust all the OS code. An attacker that could subvert the compiler would prefer subverting the OS instead of, or in addition to, the compiler. Then they could possibly avoid any solutions I have, such as double compilation. If I use a stock scheme, they might even subvert it.
If you're not trusting the repo, you have much bigger problems than reproducible builds can solve. At the least, you will be doing reproducible builds plus a bunch of downloads from mirrors, hash/sig checks, comparisons, etc.
> The real solution is secure SCM, ability to build from source, and a trusted point to start with. You have the latter unless your system is pre-backdoored. Reproducible builds might have other benefits such as during debugging. It's bringing you little security benefit over the stronger alternative I mentioned.
Even if you have all that, how can you be sure no one tampered with your system at some point? Reproducibility lets you compare your entire system against known-good states.
" Reproducibility lets you compare your entire system against known-good states."
If you have a known, good state, then you don't have to. You simply use the source or binaries in the repo after checking hashes or signatures on a few mirrors. If your state is that bad, then you're already compromised, possibly with a rootkit.
"Even if you have all that, how can you be sure no one tampered your system at some point?"
That takes way more than reproducible builds. You need source code for a compiler that doesn't have 0-days in it in terms of backdoors, optimizations that eliminate security checks, or passes that create exploitable faults. That's called a verified compiler. Only one exists for C (CompCert), which is proprietary, with quite a few open ones for ML languages. The SCMs distributing that source code must be secure so they aren't adding modifications to the source or to the OS binary you started with. During transfer, the files have to be signed or the transport has to be secure. You then need some kind of local tool you can trust to bootstrap it. If you trust nothing, you're going to be using ancient hardware you bought in person with cash to run an interpreter you wrote by hand that executes something equivalent to a small C compiler, which you then use to compile the others that you inspected by eye or trusted via 3rd-party review with hash/sig confirmation.
If exotic attacks like Karger's are in your threat model, so are 0-days, accidental or deliberate, along with endpoint and SCM attacks. Reproducible builds don't protect you from much in that model. The stronger methods, from endpoint protection to SCM to local tooling, will pay off over time in more situations given they're independently critical. It's why Karger et al put them in the first standards for INFOSEC to begin with.
The other problem is that making all binaries come out equal creates more of a monoculture that aids attacks on endpoints. Many good tools in the field automatically harden C programs by making them memory or data-flow safe. Softbound+CETS comes to mind. Others obfuscate the hell out of them on a per-user basis to reduce the odds of a one-size-fits-all attack. You can't use these techniques with everyone having the same binary. So it's a step back from existing compiler and system security methods in the total assurance you can achieve.
> The other problem is that making all binaries come out equal creates more of a monoculture that aids attacks on endpoints.
This is a red herring, for two reasons. First, even if the build is not reproducible, the distribution produces packages from some source code and those packages are fixed. If one is installing a distribution's binary packages, there's no more or less of a monoculture whether or not that package can be reproduced.
Second, the sources of non-reproducibility we're trying to fix here come from things like timestamps embedded in the binaries, arbitrary directory file ordering, etc. These provide little to no meaningful diversity in the resulting binaries anyhow.
Having the ability to reproducibly build packages does not preclude the application of tools or techniques to introduce diversity in the compilation process or otherwise obfuscate built binaries on one's own system.
"This is a red herring, for two reasons. First, even if the build is not reproducible, the distribution produces packages from some source code and those packages are fixed. "
That's not a red herring. It means the first installs all start out the same if they come from a binary, or else are built from source (e.g. Gentoo). For the binary case, they can build from source on the local machine to immediately diversify, then do the rest with source. This means the monoculture risk is only present when the system is first set up. The system can even be programmed to connect exclusively to the repos until that risk is removed, or warn the user if they turn that off manually. One could go further and have multiple binaries available, each using a different security technique or combination of them, all built into an automated build system.
"Second, the sources of non-reproducibility we're trying to fix here come from things like timestamps embedded in the binaries, arbitrary directory file ordering, etc. These provide little to no meaningful diversity in the resulting binaries anyhow."
That's a red herring. The diversity I referred to came from compiler transformations that provably enhance security, with site-specific obfuscations on top of that. Those transformations couldn't happen if everyone's compile resulted in the same binary, with hashes they're checking against each other. Those other things don't affect security much, as you said. It's why I didn't bring them up.
"Having the ability to reproducibly build packages does not preclude the application of tools or techniques to introduce diversity in the compilation process or otherwise obfuscate built binaries on one's own system."
That much is true. However, solving the SCM problem plus local build tools already eliminates the need for the security aspect of it. The other techniques can then be included by default. A reproducible build can also be done for any of it for the extra benefits it brings, but the compiler subversion is already knocked out by the former techniques plus a certifying compiler.
> That's a red herring. The diversity I referred to came from compiler transformations that provably enhance security
No. The work to get to a state where packages build reproducibly, in general, consists of removing timestamps, providing stable sort order for inputs, and similar cases. In the vast majority of cases the only diversity in a distro's package set comes from these sorts of differences. If addressing these cases results in a reproducible build, there was no meaningful diversity to begin with.
You keep focusing on this one set of changes you do in the process, when it has absolutely nothing to do with what I'm saying about the security argument for doing reproducible builds, diverse compiles, binary checks, etc. What matters is the big picture of what you're doing versus what attacks are likely to come in.
If the goal is stopping subversion, I identified a bunch of other things you have to do. Some conflict with reproducible binaries, where you avoid them or throw them away immediately. Some of the strongest security measures... memory-safe languages, certified compilation, or highly-assured SCM... you aren't doing at all, that I'm aware. Your attackers will try to hit all of this, though, rather than just do a compiler-compiler subversion thing in a MITM scenario. Hence the need for strong, holistic stuff instead of tactical hacks.
I'm sorry that we seem to be talking past each other. I don't disagree with what you write; my point is only that the reproducible builds effort has no real effect on aspects of it (such as binary diversity).
Of course there's a lot more that needs to be done to prevent or detect malfeasance, and while it's related, it's beyond the scope of the reproducible builds effort.
We could be talking past each other a bit. I'll end the tangent in that case. I agree the diversity effect is contentious (a smaller claim) and possibly out of scope for the people doing these projects.
The main threat model that reproducible builds are meant to guard against is simple: an attacker, who has write access to a binary distribution of some piece of software, uploads a malicious binary rather than the true output of compiling the corresponding source.
With a reproducible build, anyone can prove that a given binary is non-malicious by re-running the build and verifying that the output is the same as the binary they downloaded. This leaves the problem of ensuring that everyone gets the same binary, i.e. the server distributing binaries hasn't been modified to serve different files to different IP addresses or something like that. But there are ways to solve that: for example, you could have multiple independent parties who each verify all the builds and sign them with their keys, and end users could check for signatures from N different trusted parties before installing.
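A rough sketch of what that end-user check could look like (the file layout, attestation format, builder names, and the N=2 threshold are all made up for illustration):

    # Require at least 2 independent rebuilders to have attested to the same hash
    # before installing a downloaded package.
    pkg=foo-1.2.3.tgz
    want=$(sha256sum "$pkg" | awk '{print $1}')
    count=0
    for builder in alice bob carol; do
        # each rebuilder publishes a signed statement containing the hash they got
        if gpgv --keyring "./keys/$builder.gpg" "attest-$builder.sig" "attest-$builder.txt" &&
           grep -q "$want" "attest-$builder.txt"; then
            count=$((count + 1))
        fi
    done
    [ "$count" -ge 2 ] && echo "enough independent attestations" || echo "do not install"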
Maybe you understand this, but I have a hard time seeing how many of the things you propose in any way substitute for reproducibility.
> If you have a known, good state, then you don't have to. You simply use the source or binaries in the repo after checking hashes or signatures on a few mirrors. If your state is that bad, then you're already compromised, possibly with a rootkit.
Checking signatures on a few mirrors is nice, but you still have to trust the single machine where the package was originally built. Reproducible builds let you avoid that.
If you use no binaries at all, i.e. you start with some existing (trusted) OS with a compiler already installed, and from that OS you build the entire target OS from scratch, then sure, you don't need reproducible builds. But nobody installs operating systems that way. The vast majority of OSes don't even support being built from any OS other than themselves. As has been mentioned in this thread, NetBSD is a partial exception as it can build from any POSIX system, but that still rules out building from, say, Windows. Much easier to just verify some GPG signatures on your OS of choice, at least if you trust that the N verifiers won't collude or all be hacked, etc.
> That takes way more than reproducible builds. You need source code for a compiler that doesn't have 0-days in it in terms of backdoors, optimizations that eliminate security checks, or passes that create exploitable faults. That's called a verified compiler.
A verified compiler would be nice to have in many cases (with the drawback that the output binaries are usually rather slow), but that's pretty much orthogonal to reproducible builds, as the threat models of "0-day via bad compiler optimization" and "binary distribution compromised" are completely separate. Compiler backdoors are also orthogonal, since a binary purportedly corresponding to a verified compiler can still be backdoored (the verification is usually done on the compiler's source)...
> The other problem is that making all binaries come out equal creates more of a monoculture that aids attacks on endpoints. Many good tools in the field automatically harden C programs by making them memory or data-flow safe. Softbound+CETS comes to mind. Others obfuscate the hell out of them on a per-user basis to reduce the odds of a one-size-fits-all attack. You can't use these techniques with everyone having the same binary.
If you want to customize binaries per-user, or use different compiler settings/passes from upstream, then that's great. In that case, reproducible binaries are still useful for the initial bootstrap step, as I mentioned above.
But I don't know what that has to do with hardening passes. There is nothing inherently nondeterministic about those. They can be part of the upstream build, in which case they should be reproducible, or they can be not part of it, in which case you need to build from source if you want them. But in that respect they don't differ from any other compiler option or compiler variant.
"The main threat model that reproducible builds are meant to guard against is simple: an attacker, who has write access to a binary distribution of some piece of software, uploads a malicious binary rather than the true output of compiling the corresponding source."
This isn't the only attack reproducible builds are about. They started as part of Wheeler's Countering Trusting Trust paper on how to beat the Karger attack of modifying compilers to backdoor themselves. So there's a MITM problem and a compiler-subversion problem they're about in most places doing them. Most of the threads on the topic (including this one) also have people bringing up stuff from the Thompson paper or Wheeler's technique. That threat model requires more security, esp. against compiler vulnerabilities or malicious source.
My other argument was that users are trusting an image of their OS from a repo that comes in binary form. If they trust that, then why not trust the binary of the compiler or other apps in that repo? If they don't trust the repo, they shouldn't be downloading privileged software from it that runs in kernel mode. Bit of a contradiction. One counter is that it might be compromised later on. Well, do they not do software updates either, then? The benefits of reproducible binaries over source-based distribution and updates from a secure repo are slim to none, outside of saving compilation time.
By 'repo that comes in binary', are you referring to some preexisting trusted system, or the desired OS?
If the former, then as I said, it's theoretically possible to build the entire desired OS from source on the existing system, but that's often not supported by OS build systems, and very uncommon in practice as an installation method.
If the latter, with reproducible builds you don't have to trust the repo - even for the initial install. You only have to trust that at least one of the people who signed the packages in the repo, ideally after verifying using entirely separate infrastructure, is honest. There's no single point of failure.
Well, you also have to trust that the source isn't backdoored, of course, but that's at least somewhat easier to detect than backdoors in binaries.
> If you have a known, good state, then you don't have to. You simply use the source or binaries in the repo after checking hashes or signatures on a few mirrors. If your state is that bad, then you're already compromised, possibly with a rootkit.
If you download a binary package, that still leaves you trusting the packager who built and signed it, and it makes it much harder for other packagers to cross-check their work. To what extent does anyone currently disassemble the binary packages in, say, Debian?
At the moment every packager has the keys to the kingdom. Cutting that down to only the packagers of core system binaries like the compiler would be a win. Having multiple independent packagers double-checking each other's work would be a win. It's not everything, but it's valuable.
"If you download a binary package that still leaves you trusting the packager who built and signed it"
I'm agreeing with that. I'm also saying you're already doing it for a whole distro, so why not for the compiler they can ship with it on top of that? If you don't trust the repo, then you need to be doing more than reproducibly building a compiler's source.
I fully intend to do more, in the long term. But reducing the scope of what I trust is a good start. If I can go from trusting every binary packager working for the distro to trusting only the binary packagers of packages in build-essential, that's a huge win.
"If I can go from trusting every binary packager working for the distro to trusting only the binary packagers of packages in build-essential, that's a huge win."
This is true unless the repo you're using can be accessed by malicious parties or contain code from them. As in, if you're avoiding packages from person A but not B, and both have write access to the repo, its server, or its network, then your security is unchanged. If different source/binaries are completely isolated by different people, then your choice might reduce your risk in the event one becomes malicious.
> This is true unless the repo you're using can be accessed by malicious parties or contain code from them. As in, if you're avoiding packages from person A but not B, and both have write access to the repo, its server, or its network, then your security is unchanged. If different source/binaries are completely isolated by different people, then your choice might reduce your risk in the event one becomes malicious.
A and B sign the packages they build with their own PGP keys, no?
(Cool thought: with reproducible builds, multiple independent packagers could perform the build locally and upload the package signature - only the first packager would have to upload the actual package, but we could check the others' signatures to increase our confidence that no funny business was going on)
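A sketch of that flow from a second packager's side, using Debian tooling as an example (the package name and attestation filename are made up, and the real Debian workflow differs in detail):

    # Rebuild the package from the distro's source, check that the result is
    # bit-identical to what's already in the archive, and publish only a signature.
    apt-get source foo
    cd foo-*/
    dpkg-buildpackage -us -uc          # assumes the documented reproducible build environment
    sha256sum ../foo_*_amd64.deb       # must match the hash of the .deb already in the archive
    sha256sum ../foo_*_amd64.deb | gpg --armor --detach-sign > ../foo.attestation.asc
    # upload only foo.attestation.asc; users count how many independent signatures agree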
That's a good idea. It's even in my distributed, subversion-resistant SCM scheme from years ago. ;) As I told Wheeler, you can get extra benefits with reproducible builds on top of what I recommend. I just encourage people to solve the root problems first. He did too with his SCM security page, but did this stuff before even high-assurance security (esp. SCM). That sadly was what happened.
I don't know about "solving the root problems". There are a lot of things that need to be done to get us to secure software, and in many cases there are multiple "blockers". I consider a security effort valuable when it both a) provides some improvement today, even if a fairly trivial one, and b) will be useful in the end state we're aiming for, and I think reproducible builds qualify.
At Cygnus we had a customer from the telecom industry. They had SLAs with their own customers that included terms like "no more than 10 minutes of downtime per decade". They paid a LOT of money to have one, consistent release (no upgrades, only bug fixes); when they reported a bug and got a fix, they would diff the new binary and require that every change could be traced solely to the patch issued and nothing else.
I always dread dealing with build systems, mostly in the C land.
Deterministic behaviour, especially in this rigorous fashion, is probably very helpful in many more cases than just trust.
The longstanding assumption that make executes pure functions to produce its outputs could actually become true. Then it really would suffice for make to re-run a target only when one of its inputs changed.
Indeed, there are a load of QA benefits for reproducible builds. Let alone the CO2 savings that result from cache hits instead of pointlessly rebuilding dependencies.
"Reproducible builds of Debian as a whole are still not a reality, though individual reproducible builds of packages are possible. So while we are making very good progress, it is a stretch to say that Debian is reproducible."
In my experience, if you have a single home-made package in C, it is pretty easy to make it reproducible.
It's funny that, to prove your argument, you're quoting a paragraph that says exactly the opposite. Not to mention, NetBSD builds are about the OS (kernel, tooling, base and so on), not about packages, as in Debian's case.
From their website:
NetBSD is a free, fast, secure, and highly portable Unix-like Open Source operating system. It is available for a wide range of platforms, from large-scale servers and powerful desktop systems to handheld and embedded devices. Its clean design and advanced features make it excellent for use in both production and research environments, and the source code is freely available under a business-friendly license.
Of all the great OSes, NetBSD seems to get the least fanfare but it's always been my favorite. It's fast, secure, and due to a well reasoned architecture it works on everything. pkgsrc is amazing. They also have a very helpful, friendly community. It especially warms my heart to run `ps -ax` on a new install and see all of about 10 processes. The OS feels lean, neat, and organized, and I feel like I know exactly what is going on, where to find a given file, etc.
These special strengths -- vast hardware compatibility, rump kernels, and now fully reproducible builds -- are all enabled by a greater underlying (and seemingly underrated) technical excellence.
Also run "top" on the fresh install to see it uses about 10MB :D . I learned how to use Unix when I installed and used NetBSD for some years, so for me it is "Mother-UNIX". It's a pity that general hardware support is not that great anymore nowadays. But it runs quite well on the Lenovo X240/T440 range of laptops which can still be had new in some places.
The post itself has a pretty good list of 10 sources of non-determinism (a sketch of common workarounds follows the list):
1. timestamps
2. dates/times/authors etc. embedded in source files
3. timezone sensitive code
4. directory order/build order
5. non-sanitized data stored into files
6. symbolic links/paths
7. general tool inconsistencies
8. toolchain
9. build information / tunables / environment
10. making sure that the source tree has no local changes
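A sketch of the usual workarounds for several of these, as seen in reproducible-builds efforts generally (exact option support depends on your compiler, tar, and make versions):

    export SOURCE_DATE_EPOCH=$(git log -1 --format=%ct)   # pin embedded timestamps (item 1) to the last commit
    export TZ=UTC LC_ALL=C                                # neutralize timezone/locale-sensitive output (item 3)
    find obj -name '*.o' | sort | xargs ld -r -o all.o    # don't depend on readdir()/glob order (item 4)
    gcc -frandom-seed=foo.c -c foo.c                      # stabilize gcc's "random" internal symbol names
    tar --sort=name --mtime="@$SOURCE_DATE_EPOCH" \
        --owner=0 --group=0 --numeric-owner -cf dist.tar dist/   # strip uid/gid/mtime noise from archives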