This is really great! I wrote and operate a pipeline framework for analyzing data from multiple astronomical observatories in low-latency [1], and I'm currently adding this sort of burst-performance scaling for things like background simulations. Things like Kubernetes spin-up time or packaging huge libraries for AWS Lambda are, indeed, challenges. Getting those startup times down and doing autoscaling in a relatively platform-agnostic way with low boilerplate overhead would be really game-changing for these sorts of analyses (among other applications).
If you can run an analysis in low-latency and detect a joint gravitational wave/EM source, for example, you can quickly follow it up with other telescopes, like we did with the first direct kilonova observation [2]. Though the gamma-ray detection localization wasn't good enough to really aid counterpart searches for that event (GW170817), there are other types of candidates (like GW+neutrinos, my project) for which this rapid localization would be a huge improvement. And if you can do really hard things like estimate the GW source parameters before merger (e.g. whether you're looking at a binary neutron star merger, which is likely to emit detectable light if it's close enough), you can try to get fast-slewing telescopes pointing on source at merger time. Not to mention that burst processing would let you do more clever things with your statistics (getting better sensitivity) in low-latency as well. So stuff that makes this easier is really exciting!
Maybe if the API is amenable to my own solution I'll be able to implement this as a backend :)
It is! My response to the other comment describes some of the challenges, but overall it's wonderful! It feels a lot like running a startup, though of course the compensation is more spiritual than financial. But it has all of the great and exciting parts of a startup combined with the bonus of getting to think about stars crashing into each other for your actual job.
I have, and they're great! But limited. The real issue is the size of the dependency stack for scientific computing; I need to use a lot of monolithic, project-wide libraries (read: minimal Python installs that run to many gigabytes), far beyond Lambda's limits.
The solution is pretty obviously to split up my own monolithic architecture so that subcomponents can run in limited environments, allowing some steps in my pipeline DAG to be implemented on Lambda; this is the direction I'm moving in (it also has certain development advantages). That said, leaving my own monolith behind requires some careful planning. For one thing, reproducibility is easier if you only need to track a single code version; it's more work to have a fully-reproducible infrastructure (something I'm working on with versioned containers, detailed file-generation metadata, and automated build/test processes; see the sketch below for what I mean by that metadata).

It's also sometimes necessary to hot-fix things in production: this is science, so the APIs I pull data from are not versioned and are subject to breaking changes during production, announced with nothing more than a short email thread. (I get hundreds of emails a day from both major collaborations I'm in, LIGO and IceCube, so tracking these changes is its own hell.) This means being able to develop and fix things extremely quickly is paramount, which is also easier on a properly-factored monolith unless you have a proper microservices architecture and deployment system.

We also get crazy demands for new features during production from other collaborators and scientists that make my software development friends gasp. And nobody understands anything I do because documentation isn't prioritized, so I have to do all this in a way that's maintainable and extensible by one person. (Seriously, I sometimes need to fight to document and properly refactor/maintain my code because people don't want to allocate time for these tasks.)
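To make the "detailed file-generation metadata" concrete: the idea is just a provenance sidecar written next to every derived file, recording input hashes, parameters, and the code version that produced it. A minimal sketch (the function names and sidecar layout here are my own invention, not an existing tool's format):

```python
import hashlib
import json
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

def sha256(path):
    """Content hash of a file, used to pin down exactly which inputs were read."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_provenance(output_path, input_paths, params):
    """Write a JSON sidecar next to an output file recording how it was made."""
    record = {
        "output": str(output_path),
        "output_sha256": sha256(output_path),
        "inputs": {str(p): sha256(p) for p in input_paths},
        "params": params,
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "python": sys.version,
        "created": datetime.now(timezone.utc).isoformat(),
    }
    Path(str(output_path) + ".prov.json").write_text(json.dumps(record, indent=2))
```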
This is all hard fundamentally because scientific software development is a garbage fire of overworked developers and misaligned incentives (you don't get credit for code no matter how many man-hours it saves or how much science it enables; only papers count). It's like we're running in constant do-or-die startup mode, with MVPs developed by 22-year-olds who have just learned Python for the first time ending up as mission-critical, undocumented abandonware after they finish their PhDs. All of our code has a third of the manpower it should, so we're all prioritizing ruthlessly with varying degrees of success. Some people have done really amazing, inspiring work using best-practices devops and the rest, and it's slowly getting better in many spots, but the systemic problems of the dev environment mean that unless you're a 10x person, your scientific code is going to be pretty rough around the edges.
All this is to say that Lambda is great but not exactly what I can use at the moment, though my plan is to support it once I've Balkanized my code a bit :)
Is there a use case for this where it becomes uneconomic? I was working with AWS last year and found that in some cases it really was better to just have a single EC2 instance (or auto-scaling group) rather than ~9999999999999 lambdas to handle each individual task.
(Co-author here) It really depends (see Figure 2).
On the question of Lambda vs. EC2, EC2 instances take much longer to start. So depending on the job, to get the performance you can get with a "burst parallel" flock of Lambda workers, you would need to keep a warm cluster of EC2 instances ready to take your job. At which point, the cost comparison depends on how often you have work to execute. EC2 is cheaper if you have a 100% duty cycle (but you probably don't).
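As a rough illustration of that duty-cycle trade-off (the prices below are placeholder assumptions, not quotes; only the shape of the comparison matters):

```python
# Back-of-the-envelope break-even between "keep an EC2 instance warm" and
# "pay per Lambda invocation". Prices are illustrative assumptions.
LAMBDA_PRICE_PER_GB_S = 0.0000166667   # $/GB-second (assumed)
EC2_PRICE_PER_HOUR = 0.17              # $/hour for an ~8 GB instance (assumed)
MEMORY_GB = 8.0                        # memory provisioned either way

def hourly_cost(duty_cycle):
    """Cost per wall-clock hour of (lambda, ec2) at a given busy fraction."""
    lam = duty_cycle * 3600 * MEMORY_GB * LAMBDA_PRICE_PER_GB_S
    ec2 = EC2_PRICE_PER_HOUR            # paid whether or not it's busy
    return lam, ec2

break_even = EC2_PRICE_PER_HOUR / (3600 * MEMORY_GB * LAMBDA_PRICE_PER_GB_S)
print(f"break-even duty cycle ~= {break_even:.0%}")
for d in (0.01, 0.10, 0.35, 1.00):
    lam, ec2 = hourly_cost(d)
    print(f"duty={d:>4.0%}  lambda=${lam:.3f}/h  ec2=${ec2:.3f}/h")
```

With numbers in that ballpark the crossover lands somewhere around a one-third duty cycle: below it, paying per invocation wins; above it, the always-on instance does.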
The gg tool, though, is mostly agnostic to the backend -- you can take a job that's expressed in gg IR (e.g., "compile this program") and then execute it with any of the gg back-ends. We have one for Lambda and one for a cluster of warm VMs. The performance of gg-to-EC2 is generally better than outsourcing methods that leave your laptop in the driver's seat (e.g. bazel-to-icecc) and give less semantic information about data- and control-flow of the job to the remote execution engine. (E.g. in Figure 9, you can see that gg-on-EC2 is much faster than icecc-on-EC2 for compiling GIMP and Inkscape.)
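For intuition, a step in that kind of IR (a "thunk") is essentially a content-addressed description of one invocation: which executable to run, with which arguments, on which hashed inputs, producing which named outputs. Roughly this shape (my paraphrase for illustration, not gg's actual serialization format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thunk:
    """One deterministic step: everything it reads is named by content hash."""
    function: str   # hash of the executable/container image to run
    args: tuple     # command line passed to that executable
    inputs: tuple   # hashes of input files (or of other thunks' outputs)
    outputs: tuple  # names of the files this step is expected to produce

compile_step = Thunk(
    function="sha256:<gcc-binary-hash>",          # placeholder
    args=("gcc", "-c", "main.c", "-o", "main.o"),
    inputs=("sha256:<main.c-hash>", "sha256:<headers-hash>"),  # placeholders
    outputs=("main.o",),
)
```

Because every reference is a hash, any backend that can fetch blobs, run the step, and store the results by hash can execute the graph.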
Sure -- we actually already have an OpenWhisk backend. The IR is sufficiently stupid that it's pretty easy to write a new backend.
At least a low-performing one -- it gets harder if you want to use (a) persistent workers [instead of invoking a new platform worker for every "thunk"] and (b) direct inter-worker networking [instead of putting every intermediate result in a storage medium].
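To show what the simple variant looks like -- one stateless platform invocation per thunk, with every intermediate parked in object storage -- here is the rough shape in Python with boto3. The worker function name and payload fields are hypothetical; this is not gg's internal API, just a sketch of a minimal backend:

```python
import json
import boto3

lam = boto3.client("lambda")  # one fresh invocation per thunk

def run_thunk(thunk, bucket, worker_function="gg-worker"):  # hypothetical names
    """Invoke a stateless worker for one thunk; inputs/outputs go via S3."""
    payload = {
        "bucket": bucket,               # intermediates live here, keyed by hash
        "function": thunk["function"],  # hash of the executable to run
        "args": thunk["args"],
        "inputs": thunk["inputs"],      # worker downloads these by hash
        "outputs": thunk["outputs"],    # worker uploads results by hash
    }
    resp = lam.invoke(FunctionName=worker_function,
                      Payload=json.dumps(payload).encode())
    return json.loads(resp["Payload"].read())  # e.g. hashes of produced outputs
```

Persistent workers and direct worker-to-worker transfer replace that one-shot invoke and the storage round trips, which is where the extra engineering lives.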
Where is the DAG scheduler in the source? About to take a look. I'm interested in labeling thunks with file size and CPU time and then feeding them into a solver I'm writing to optimize cost within a runtime budget.
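To sketch the sort of thing I mean (a toy greedy heuristic with made-up numbers: each thunk gets a few candidate (seconds, dollars) placements, and the chain is treated as serial; a real DAG would need critical-path handling):

```python
def plan(thunks, budget_s):
    """thunks: {name: [(seconds, dollars), ...]}; minimize cost within budget_s."""
    # Start everything on its cheapest placement.
    choice = {t: min(opts, key=lambda o: o[1]) for t, opts in thunks.items()}
    total_time = sum(sec for sec, _ in choice.values())
    while total_time > budget_s:
        best = None  # (seconds saved per extra dollar, thunk, option)
        for t, opts in thunks.items():
            cur_s, cur_c = choice[t]
            for s, c in opts:
                if s < cur_s and c > cur_c:  # strictly faster, strictly pricier
                    ratio = (cur_s - s) / (c - cur_c)
                    if best is None or ratio > best[0]:
                        best = (ratio, t, (s, c))
        if best is None:
            raise RuntimeError("budget unreachable with these placement options")
        _, t, opt = best
        total_time -= choice[t][0] - opt[0]
        choice[t] = opt
    total_cost = sum(c for _, c in choice.values())
    return choice, total_time, total_cost

# Made-up (seconds, dollars) estimates: run locally for free, or burst out.
thunks = {
    "sim_background": [(600, 0.00), (40, 0.30)],
    "fit_params":     [(120, 0.00), (15, 0.05)],
    "make_plots":     [(30, 0.00)],
}
print(plan(thunks, budget_s=120))
```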
The startup time isn't always that important. Whether a task takes 20 seconds or 60 seconds isn't always something the user cares that much about. And as always, we have to weigh the developer headache against the performance improvement. The customer always wins, but the fewer headaches the devs have, the more responsive they are to issues that are more pressing than marginal performance improvements.
Basically, there are parts of every workload that need to run sequentially, so there will be a point where running more, smaller functions is less efficient than running one large function.
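A quick toy model of that point (numbers invented; the sequential fraction plus per-invocation overhead is what flattens the curve):

```python
# Amdahl's-law-style sanity check: past some fan-out, extra workers buy nothing.
def runtime(n_workers, total_work_s=300.0, seq_fraction=0.1, overhead_s=2.0):
    seq = total_work_s * seq_fraction            # can't be parallelized
    par = total_work_s * (1 - seq_fraction) / n_workers
    return seq + par + overhead_s                # startup paid once per wave

for n in (1, 10, 100, 1000, 10000):
    print(f"{n:>5} workers -> {runtime(n):6.1f} s")
```

With these made-up numbers the job never gets below ~32 s no matter how many functions you launch.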
Could this be adapted to work with an open-source serverless framework like Knative? It would be more useful if I could choose between a local serverless cluster and a public one depending on data volume, security, and other workload-specific requirements.
Whenever I have a boring cron-style or one-off task that needs to be done, I bring out an Apache Airflow cluster to schedule it for me. Nomad is another option, but we haven't productionised it yet.
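For reference, one of those cron-style DAGs is only a few lines; the dag_id and command below are made up, and the BashOperator import path differs between Airflow 1.x and 2.x (this is the 2.x form):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator  # airflow.operators.bash_operator in 1.x

with DAG(
    dag_id="nightly_cleanup",              # hypothetical
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 3 * * *",         # every night at 03:00
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_job",
        bash_command="python /opt/jobs/cleanup.py",  # hypothetical script
    )
```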
[1] http://multimessenger.science
[2] https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.11...