Distributed Cloud Builds for Everyone

georgewfraser · on June 3, 2021

sccache[1] is a similar project, which supports remote execution and caching for C, C++, and Rust. Unfortunately the remote-execution mechanism is designed for Mozilla's internal environment and doesn't support cloud backends like Lambda or Google Cloud Build. But the code is well-structured Rust and not too big, adding a cloud backend would be a nice project.

[1] https://github.com/mozilla/sccache

maccam94 · on June 3, 2021

I was considering buying a new dev machine a few months ago. I'd been hearing great things about the new Ryzen chips with 16+ cores, but then I saw the price tag and thought about what percentage of the time I'd even be fully utilizing that many cores: probably less than an hour or two a day, on average, if that. And then there's the cost of powering it. It would be much less expensive up front to just spin up a VM in GCP and suspend it when I'm not coding. The trick would just be making the experience as seamless as possible (maybe build an LSP proxy that automatically suspends/resumes the VM?). Nelson seems to be solving the build performance issue with llama, but I'm curious about running LSP servers in the cloud.

nsm · on June 4, 2021

The thing about LSP is that you really want to minimize latency for the user, which local is still the fastest at. Particularly with newer incremental completion engines that don't need to re-read the entire giant project.

While you do want your build to be as fast as possible, the constraint is often throughput and not latency.

EricBurnett · on June 4, 2021

Very nice! I really like the ease-of-use of this, as well as the scale-to-zero costs. That's a tricky thing to achieve. Seems like it could become a standard path to ease the migration from local to remote builds.

If the author is interested in standardizing the same, I'd suggest implementing the REAPI protocol (https://github.com/bazelbuild/remote-apis). It should be amenable to implementing on a Lambda-esque back-end, and is already standard amongst most tools doing Remote Execution (including Bazel! Bazel+llama could be fun). And equally, it's totally usable by a distcc-esque distribution tool (recc[1] is one example) - that's also what Android is doing before they finish migrating to Bazel ([2], sadly not yet oss'd).

The main interesting challenge I expect this project to hit is going to be worker-local caching: for compilation actions it's not too bad to skip assuming the compiler is built into the container environment, but if branching out into either hermetic toolchains or data-heavy action types (like linking), fetching all bytes to the ephemeral worker anew each time may prove to be prohibitive. On the other hand, that might be a nice transition point to switch to persistent workers: use a lambda backed solution for the scale-to-0 case, and switch execution stacks under the hood to something based on reused VMs when hitting sufficient scale that persistent executors start to win out.

(Disclaimer: I TL'd the creation of this API, and Google implementation of the same).

[1] https://gitlab.com/BuildGrid/recc

[2] https://opensource.googleblog.com/2020/11/welcome-android-op...

Game_Ender · on June 4, 2021

This is a pretty interesting project, a low setup distributed build cluster is cool. That said I find the arguments against Bazel pretty weak. Yes with a distcc model you can scale out the compilation stage of C++ project quickly, but what about the test execution, linking, ProtoBuf code generation, or caching? In either case, Bazel or Llama you have to make your build system aware of how these actions work in some way in order to distribute or cache them.

The difference is with Bazel you also gain a suite of solutions for large software projects like container creation, target visibility control, isolation, dependency fetching, and build graph analysis. Also with the larger community behind Bazel adopting it for a large project is getting easier over time.

nsm · on June 4, 2021

The author has plenty of experience with Bazel, as Sorbet, the Ruby type checker they helped create uses Bazel. I agree that Bazel is almost always the right solution beyond a certain scale or multi-language builds.

The unfortunate reality is that open source developers have decided that they don't want the state of the art because "it's written in Java and is like a 50mb binary". Yes, I know there are certain things about Bazel that don't fit certain open source projects, like it may not be easy to bootstrap a build environment for OSes that want everything to be built from scratch, or want to support 10 different architectures, and I agree that Bazel may not make sense there. At the same time, if you see comments from both HN and on the internet, "it's in Java" is often how FOSS projects are making these decisions :(

I'll always be bitter about this. In particular I feel like Cargo was new enough that Bazel existed and if they had at least built a Bazel compatible tool in Rust (so use Starlark etc.) it would have been very much in line with the Rust ethos of safety and speed.

0xbadcafebee · on June 3, 2021

I haven't checked, but I'm willing to bet that the costs for a one-time Fargate task and Lambda are similar enough that Fargate will be vastly simpler. Both can use containers, but Fargate is more explicitly built to just run a container without managing any server. And there's EKS Fargate, too.

Fargate Spot pricing is $0.012144 (vCPU Hour) + $0.0013335 (GB hour) + $0.000111 (Ephermeral Storage hour). Pricing is per second with a 1-minute minimum. So for one hour of compiling on 1 vCPU, 1GB ram, 1GB storage, that's $0.0135885, to run any old Docker container.

Lambda cannot execute for longer than 900000 milliseconds, or 15 minutes. For 15 minutes at 1GB for 1 request, that's $0.02. For 4 requests (1 hour) that's $0.06.

Scramblejams · on June 5, 2021

forty cents buys you about 10 minutes of on-demand compute on a 96-core c5 instance

This is incorrect. 48 cores, 96 threads.

vCPUs mostly refer to threads, not cores. There are some exceptions. More here...

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance...

...and here (which is inexplicably missing the c5.24xlarge):

https://aws.amazon.com/ec2/physicalcores/

the_arun · on June 3, 2021

Is Llama a typo for Lambda? OR is there a service which uses Lambda for doing remote builds? Sorry, unable to figure it.

wgjordan · on June 3, 2021

> Is Llama a typo for Lambda? OR is there a service which uses Lambda for doing remote builds? Sorry, unable to figure it.

"Recently, I blogged about building LLVM in 90 seconds [1] using AWS Lambda and my Llama project."

[1] https://blog.nelhage.com/post/building-llvm-in-90s/

banana_giraffe · on June 3, 2021

llama is the project being discussed, a platform for pushing work to the cloud, using Lambda to do the work.

https://github.com/nelhage/llama

My question is how much upload bandwidth this needs. The article linked here mentions it in passing, but that's always been the pain point for things like this for me.

hathawsh · on June 3, 2021

From what I can tell, Llama uploaded 1.8 GB. An Internet connection that can upload 10 megabits per second, sustained, should be able to complete that in a half hour. If that's too long, it might make sense to run Llama from a small EC2 VM, eliminating any concerns about bandwidth.

IMHO Llama looks so good that it makes me wonder if AWS is undercharging to attract developers.

rejectedandsad · on June 4, 2021

I'm skeptical of that, if anything everyone complains about Lambda being too expensive for general purpose API tasks.

tryagain112 · on June 4, 2021

iOS builds?

johnklos · on June 3, 2021

I think there's a bit of oversimplification going on. Sure, free, or almost free, stuff is available everywhere. But at what cost?

How much work do we have to do to get this "free" thing? What privacy do we give up? What if we're working on something that's part of a product, or otherwise proprietary?

The idea that everyone is doing it, so we should stop being sticks in the mud and just jump on the bandwagon, is harmful and ignorant.

"My data has never been stolen from the cloud" is about as ridiculous as the fortune quote, "As far as we know, our computer has never had an undetected error." How would you know until it's way too late?

Or, even better, you find out the data you've used in the cloud has been leaked on the Internet. Was it AWS? Was it crappy software on your laptop? Or was it a breach on your network? You've just made finding out that much harder, particularly because the cloud doesn't have logs that you can audit, nor real, understanding humans with whom you can correspond.

People who don't understand security can't really be faulted for eschewing the idea of using the cloud for everything, but people who know better really shouldn't be pushing these unhealthy ideas.

n_jd · on June 3, 2021

If this paradigm catches on there's no reason you won't be able to do the same thing with your own infrastructure.

I know this is a straw man because I can't know what sort of infra you're imagining, but I suspect a lot of people would be more comfortable with their builds stored on S3 under the scrutiny of their security team rather than on their engineers' laptops.