Hacker News
Launch HN: Ploomber (YC W22) – Quickly deploy data pipelines from Jupyter/VSCode (github.com/ploomber)
126 points by edublancas on Feb 3, 2022 | hide | past | favorite | 23 comments
Hi HN, we’re Eduardo & Ido, the founders of Ploomber (https://ploomber.io). We’re building an open-source framework (https://github.com/ploomber/ploomber) that helps data scientists quickly deploy the code they develop in interactive environments (Jupyter/VSCode/PyCharm), eliminating the need for time-consuming manual porting to production platforms.

Jupyter and other interactive environments are the go-to tools for most data scientists. However, many production data pipeline platforms (e.g. Airflow, Kubernetes) drag them into non-interactive development paradigms. Hence, when moving to production, the data scientist’s code has to be ported from the interactive environment into a more traditional software environment (e.g. declaring workflows as Python classes). This process creates friction since the code needs to cross this gap every time the data scientist deploys their work. Data scientists often pair with software engineers to work on the conversion, but this is time-consuming and costly. It’s also frustrating because it’s just busy work.

We encountered this problem while working in the data space. Eduardo was a data scientist at Fidelity for a few years. He deployed ML models and always found it annoying and wasteful to port the code from his notebooks into a production framework like Airflow or Kubernetes. Ido worked as a consultant at AWS and constantly found that data science projects would allocate about 30% of their time to convert a notebook prototype into a production pipeline.

Interactive environments have historically been used for prototyping and are considered unsuitable for production; this is reasonable because, in our experience, most of the code developed interactively exists in a single file with little to no structure (e.g., a gigantic notebook). However, we believe it’s possible to bring software engineering best practices and apply them to the interactive development world so data scientists can produce maintainable projects to streamline deployment.

Ploomber allows data scientists to quickly develop their code in modular pipelines rather than a giant single file. When developed this way, their code is suitable for deployment to production platforms; we currently support exporting to Kubernetes, AWS Batch, Airflow, Kubeflow, and SLURM with no code changes. Our integration with Jupyter/VSCode/PyCharm allows them to iteratively build these modular pipelines without moving away from the interactive environment. In addition, modularizing the work enables them to create more maintainable and testable projects. Our goal is ease of use, with minimal disturbance to the data scientist’s existing workflows.
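For a concrete picture, a Ploomber pipeline is typically declared in a pipeline.yaml file that lists each script/notebook as a task. The sketch below is illustrative only: the task names, file paths, and products are made up, not from the post.

```yaml
# pipeline.yaml (sketch) - each task is a script the data scientist
# edits as a notebook in Jupyter/VSCode/PyCharm
tasks:
  - source: scripts/clean.py
    product:
      nb: output/clean.ipynb      # executed notebook, doubles as a report
      data: output/clean.csv
  - source: scripts/features.py
    product:
      nb: output/features.ipynb
      data: output/features.csv
  - source: scripts/train.py      # consumes upstream products
    product:
      nb: output/train.ipynb
      model: output/model.pickle
```

Because each task declares its products, Ploomber can wire up the dependency graph and run only the tasks whose sources changed.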

Users can install Ploomber with pip, open Jupyter/VSCode/PyCharm, and start building in minutes. We’ve made a significant effort to create a simple tool so people can get started quickly and learn the advanced features when they need them. Ploomber is available at https://github.com/ploomber/ploomber under the Apache 2.0 license. In addition, we are working on a cloud version to help enterprises operationalize models. We’re still working on the pricing details, but if you’d like us to let you know when we open the private beta, you can sign up here: https://ploomber.io/cloud. However, the core of our offering is the open-source framework, and it will remain free.
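The getting-started flow looks roughly like this (a transcript sketch; `scaffold` and `build` are Ploomber CLI commands, but check the docs for the current usage):

```
pip install ploomber
ploomber scaffold   # create a base project layout (pipeline.yaml + scripts)
ploomber build      # run the whole pipeline end to end
```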

We’re thrilled to share Ploomber with you! If you’re a data scientist who has experienced these endless cycles of porting your code for deployment, an ML engineer who helps data scientists deploy their work, or you have any feedback, please share your thoughts! We love chatting about this domain since exchanging ideas always sheds light on aspects we haven’t considered before! You may also reach out to me at eduardo@ploomber.io.




Congrats on the launch. I'm an MLOps consultant who helps enterprises productionize their models on cloud platforms. I was previously also a startup founder who iterated in the same space, so we could probably exchange notes.

The problem is definitely a time-consuming and costly one and I'm intrigued to play around with Ploomber. How does Ploomber handle collaboration/code sharing across data scientists?


We allow people to write pipeline tasks in .py files but open them as notebooks in Jupyter. So they keep the same workflow they're used to, but under the hood they're writing .py files, so they can do code reviews (jupytext handles the .py to .ipynb conversion). Also, when executing the pipeline, Ploomber generates an output report for each script, so teams can use this to review any outputs generated by the code.
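To make the .py-as-notebook idea concrete, here's a small sketch of jupytext's "percent" format: a plain .py file where `# %%` markers delimit notebook cells, so Git diffs and code reviews work on ordinary Python (the file name and data are made up):

```python
# clean.py - a plain .py file that Jupyter opens as a notebook via jupytext

# %% [markdown]
# # Data cleaning
# Markdown cells round-trip between .py and .ipynb.

# %%
# raw input with inconsistent casing/whitespace and a missing value
raw = ["  Alice ", "BOB", None, "carol"]

# %%
# drop missing values and normalize names
clean = [name.strip().title() for name in raw if name is not None]
print(clean)
```

Since the file is just Python, it runs the same from the command line as it does cell-by-cell in Jupyter.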

Finally, since the pipeline is modularized, it's easier to split the work. Some people may work on data cleaning, others on feature engineering, and they can all orchestrate the pipeline with "ploomber build".

You can read more about our approach in this guest blog post we published a few months ago on the Jupyter blog: https://blog.jupyter.org/ploomber-maintainable-and-collabora...

We'd love to hear about your experience! Please send me an email at eduardo@ploomber.io


We took the approach of keeping the notebook interface/IDE; behind the scenes, Ploomber converts it into .py files so you can collaborate with teammates through Git. Users can still open those as notebooks and interact with the files as usual.


Hey congrats on the launch! This is definitely a useful concept.

I haven't dug deep, but are code reviews possible? A big point of the whole data-as-code movement is to enable easier review of the data generation process, along with abstraction and versioning. Being able to generate pipelines from Jupyter notebooks sounds exciting in theory, but I'd imagine code reviewing the generated pipelines could be a pain.


We allow users to open .py files as notebooks in Jupyter, so you can get the best of both worlds: interactivity with Jupyter and nice code reviews. jupytext does the heavy lifting for us (it's a great package!), and we add some extra things to improve the experience.

More in the docs: https://docs.ploomber.io/en/latest/user-guide/jupyter.html


I think this is a good idea. Decoupling seems like an interesting approach. When I worked in this space as an engineer, bridging the notebook - production-ization divide was annoying. I'd be interested to see if this solves it.


Thanks for your feedback! Do you have any stories to share? I'd love to hear about your experience with the notebooks-production gap


Yeah, a bunch. I worked at Amazon, but I'm sure it's similar everywhere else. Basically, the scientists were way more familiar with notebooks, and they'd code their models there, but when we needed to deploy, we needed a proper Python package that we could store in Git, build, test, run in a container, integrate with data engineering tools, and deploy on some internal tools and AWS SageMaker later. So we'd usually end up converting it to a Python package once it was ready, which worked OK, but you could tell the scientists were more comfortable in notebooks.

Funnily, there were a bunch of internal MLOps type frameworks there (at least 4) that tried to let the scientists deploy to production w/o engineers, but they all failed or semi-failed. I've heard Netflix made it work and I follow MLFlow so I'd be curious what sticks here.

I don't work in the space anymore but it was an interesting space, definitely could use more standardized tooling.


That totally resonates with me! I spent 6 years working as a data scientist, and notebooks just make it a lot simpler to explore and interact with the data, so I totally understand why my data science peers stick with notebooks.

Having said that, the challenge now is to hit a sweet spot between keeping the Jupyter interactive experience and providing some features to help data scientists develop modular work. That's where most frameworks fail, so we want to keep our eyes open and get feedback from both scientists and engineers to develop something that works for everyone.


I've had a great experience using DVC for both data versioning and pipelines before. Can you tell us why Ploomber is a better solution than DVC?


I'm not a DVC user, so I'll speak to what I've seen in the documentation and the couple of examples I ran a while ago. DVC's core is data versioning, and the pipeline features are an extension of it. The main difference is that DVC's pipeline feature is agnostic: you define the command, inputs, and outputs, and DVC executes the pipeline. On the other hand, Ploomber has a deeper integration between your code and the pipeline. For example, our SQL integration allows you to tell Ploomber how to connect to a database and then list a bunch of SQL files as stages in your pipeline (example: https://github.com/ploomber/projects/blob/master/templates/s...). This reduces the boilerplate a lot since you only have to write SQL; if you wanted to do the same thing with DVC, you'd have to manage the connections and create bash scripts to submit the queries.
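A rough sketch of what that SQL integration looks like in a pipeline.yaml. The key names and files here are illustrative (the `db.get_client` dotted path is a placeholder for a function returning a database client); see the linked example for the real layout:

```yaml
# pipeline.yaml (sketch) - SQL files as pipeline stages
clients:
  SQLScript: db.get_client     # tells Ploomber how to connect to the database

tasks:
  - source: sql/clean.sql
    product: [my_schema, clean_table, table]      # [schema, name, kind]
  - source: sql/aggregate.sql
    product: [my_schema, agg_table, table]
```

The point is that you only write the .sql files; Ploomber manages the connections and submits the queries in dependency order.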

The other important difference is that, AFAIK, DVC can only run your pipelines locally, while Ploomber can export your pipelines to run in other environments (Kubernetes, Airflow, AWS Batch, SLURM, Kubeflow). This lets you run experiments locally but easily move to a distributed environment when you need to train models at a larger scale or want to deploy an ML pipeline.
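For the curious: exporting is handled by Ploomber's companion soopervisor package. Roughly like this, though the environment name ("training") is made up and the exact flags and backend names may differ by version, so check the soopervisor docs:

```
pip install soopervisor
soopervisor add training --backend aws-batch   # configure a target environment
soopervisor export training                    # export and submit the pipeline
```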


DVC also runs pipelines in the cloud so you can run experiments remotely at different scales and share results with others.


Interesting, can you share a link to the docs? I'd like to learn more about their approach.


Congrats on the launch! Do you guys by any chance know Deepnote? They're in YC as well, also in the tools for data scientists space. I lead a small team of DSs in a big corp and we'd happily pay for a single tool that would be Deepnote+Ploomber in terms of features (collaboration + deployment)


Thanks! Yes, we know Deepnote! We want to focus on pipelines and deployment so we currently integrate with Jupyter distributions (as long as they keep the standard format); we have users that run Ploomber on JupyterHub, Domino, SageMaker and others. I don't think any of our users runs on Deepnote but it should work as well. We are thinking of being the "backend": Users can develop their notebooks from whatever distribution they use; Ploomber will provide the tools to help them build modular pipelines and we'll help them deploy those pipelines. I'd love to learn more about your use case, please ping me at eduardo@ploomber.io


Hey, the founder of Deepnote here. Happy to chat about this.


The audio in the landing page video is hard to understand. Is this only my broken speakers?

Also the video cannot be made fullscreen on my phone. Is this by design?


Thanks for letting us know! The audio seems OK on the laptops and phones we tested. On the video, you're right: it's a bug we need to fix. Since it's within an iframe, you can't expand it into full-screen mode.


Really helpful for keeping notebooks tidy :)


Yes! We want to help people keep enjoying Jupyter and produce tidy pipelines!


Congrats on the launch! It’s great to see validation of the usefulness of notebooks in data workflows even when moving beyond the proof of concept/exploration stage into production type workloads and deployments. Once deployed, iteration is often still necessary or desirable and that’s where having notebooks available for continued iteration is a big advantage.

For those who’d like to compare and contrast different solutions that support the use of notebooks in the (batch) deployment context, you can also check out Orchest (https://github.com/orchest/orchest). I’d say a meaningful point of difference between Ploomber and Orchest is that we are more container-oriented, as we’ve found that gives robust units to deploy in production with isolated and well-defined dependencies. In addition, we have a more GUI-first approach, which might be more familiar to those who come from RStudio, JupyterLab, Spyder, MATLAB, etc.

Disclaimer: I’m one of the Orchest creators.


Nothing in the Orchest docs looked to me like a deployment setup, whereas I came across a few Ploomber cloud deployment-related docs, e.g.: https://soopervisor.readthedocs.io/en/latest/tutorials/aws-b...

Could you point me in the right direction?


Glad you asked.

Simply deploy the Orchest OSS to a VM of your choice (https://docs.orchest.io/en/latest/getting_started/installati...) or use https://cloud.orchest.io, and then create scheduled Jobs.

https://docs.orchest.io/en/latest/fundamentals/jobs.html

If you’re looking for something more than that I’d be happy to open up a conversation about your use case at rick@orchest.io



