Hacker News
Launch HN: Ploomber (YC W22) – Quickly deploy data pipelines from Jupyter/VSCode (github.com/ploomber)
126 points by edublancas on Feb 3, 2022 | hide | past | favorite | 23 comments
Hi HN, we’re Eduardo & Ido, the founders of Ploomber (https://ploomber.io). We’re building an open-source framework (https://github.com/ploomber/ploomber) that helps data scientists quickly deploy the code they develop in interactive environments (Jupyter/VSCode/PyCharm), eliminating the need for time-consuming manual porting to production platforms.

Jupyter and other interactive environments are the go-to tools for most data scientists. However, many production data pipeline platforms (e.g. Airflow, Kubernetes) drag them into non-interactive development paradigms. Hence, when moving to production, the data scientist’s code has to be ported from the interactive environment into a more traditional software environment (e.g. declaring workflows as Python classes). This process creates friction since the code needs to cross this gap every time the data scientist deploys their work. Data scientists often pair with software engineers to work on the conversion, but this is time-consuming and costly. It’s also frustrating because it’s just busy work.

We encountered this problem while working in the data space. Eduardo was a data scientist at Fidelity for a few years. He deployed ML models and always found it annoying and wasteful to port the code from his notebooks into a production framework like Airflow or Kubernetes. Ido worked as a consultant at AWS and constantly found that data science projects would allocate about 30% of their time to convert a notebook prototype into a production pipeline.

Interactive environments have historically been used for prototyping and are considered unsuitable for production; this is reasonable because, in our experience, most of the code developed interactively exists in a single file with little to no structure (e.g., a gigantic notebook). However, we believe it’s possible to bring software engineering best practices and apply them to the interactive development world so data scientists can produce maintainable projects to streamline deployment.

Ploomber allows data scientists to quickly develop their code in modular pipelines rather than a giant single file. When developed this way, their code is suitable for deployment to production platforms; we currently support exporting to Kubernetes, AWS Batch, Airflow, Kubeflow, and SLURM with no code changes. Our integration with Jupyter/VSCode/PyCharm allows them to iteratively build these modular pipelines without moving away from the interactive environment. In addition, modularizing the work enables them to create more maintainable and testable projects. Our goal is ease of use, with minimal disturbance to the data scientist’s existing workflows.
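For a concrete picture, a Ploomber pipeline is typically declared in a pipeline.yaml file that lists each script/notebook as a task. The sketch below is illustrative only: the task names, file paths, and products are made up, not from the post.

```yaml
# pipeline.yaml (sketch) - each task is a script the data scientist
# edits as a notebook in Jupyter/VSCode/PyCharm
tasks:
  - source: scripts/clean.py
    product:
      nb: output/clean.ipynb      # executed notebook, doubles as a report
      data: output/clean.csv
  - source: scripts/features.py
    product:
      nb: output/features.ipynb
      data: output/features.csv
  - source: scripts/train.py      # consumes upstream products
    product:
      nb: output/train.ipynb
      model: output/model.pickle
```

Because each task declares its products, Ploomber can wire up the dependency graph and run only the tasks whose sources changed.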

Users can install Ploomber with pip, open Jupyter/VSCode/PyCharm, and start building in minutes. We’ve made a significant effort to create a simple tool so people can get started quickly and learn the advanced features when they need them. Ploomber is available at https://github.com/ploomber/ploomber under the Apache 2.0 license. In addition, we are working on a cloud version to help enterprises operationalize models. We’re still working on the pricing details, but if you’d like us to let you know when we open the private beta, you can sign up here: https://ploomber.io/cloud. However, the core of our offering is the open-source framework, and it will remain free.
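The getting-started flow looks roughly like this (a transcript sketch; `scaffold` and `build` are Ploomber CLI commands, but check the docs for the current usage):

```
pip install ploomber
ploomber scaffold   # create a base project layout (pipeline.yaml + scripts)
ploomber build      # run the whole pipeline end to end
```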

We’re thrilled to share Ploomber with you! If you’re a data scientist who has experienced these endless cycles of porting your code for deployment, an ML engineer who helps data scientists deploy their work, or you have any feedback, please share your thoughts! We love chatting about this domain since exchanging ideas always sheds light on aspects we haven’t considered before! You may also reach out to me at eduardo@ploomber.io.




Congrats on the launch. I'm an MLOps consultant who helps enterprises productionize their models on cloud platforms. I was previously also a startup founder who iterated in the same space, so we could probably exchange notes.

The problem is definitely a time-consuming and costly one and I'm intrigued to play around with Ploomber. How does Ploomber handle collaboration/code sharing across data scientists?


We allow people to write pipeline tasks in .py files but open them as notebooks in Jupyter. So they keep the same workflow they're used to, but under the hood they're writing .py files, so they can do code reviews (jupytext handles the .py to .ipynb conversion). Also, when executing the pipeline, Ploomber generates an output report for each script, so teams can use this to review any outputs generated by the code.
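To make the .py-as-notebook idea concrete, here's a small sketch of jupytext's "percent" format: a plain .py file where `# %%` markers delimit notebook cells, so Git diffs and code reviews work on ordinary Python (the file name and data are made up):

```python
# clean.py - a plain .py file that Jupyter opens as a notebook via jupytext

# %% [markdown]
# # Data cleaning
# Markdown cells round-trip between .py and .ipynb.

# %%
# raw input with inconsistent casing/whitespace and a missing value
raw = ["  Alice ", "BOB", None, "carol"]

# %%
# drop missing values and normalize names
clean = [name.strip().title() for name in raw if name is not None]
print(clean)
```

Since the file is just Python, it runs the same from the command line as it does cell-by-cell in Jupyter.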

Finally, since the pipeline is modularized, it's easier to split the work. Some people may work on data cleaning, others on feature engineering, and they can all orchestrate the pipeline with "ploomber build".

You can read more about our approach in this guest blog post we published a few months ago on the Jupyter blog: https://blog.jupyter.org/ploomber-maintainable-and-collabora...

We'd love to hear about your experience! Please send me an email at eduardo@ploomber.io


We took the approach of keeping the notebook interface/IDE; behind the scenes, Ploomber converts it into .py files so you can collaborate with teammates through Git. Users can still open those as notebooks and interact with the files as usual.


Hey congrats on the launch! This is definitely a useful concept.

I haven't dug deep, but are code reviews possible? A big point of the whole data-as-code movement is to enable easier review of the data generation process, along with abstraction and versioning. Being able to generate pipelines from Jupyter notebooks sounds exciting in theory, but I'd imagine code reviewing the generated pipelines could be a pain.


We allow users to open .py files as notebooks in Jupyter, so you can get the best of both worlds: interactivity with Jupyter and nice code reviews. jupytext does the heavy lifting for us (it's a great package!), and we add some extra things to improve the experience.

More in the docs: https://docs.ploomber.io/en/latest/user-guide/jupyter.html


I think this is a good idea. Decoupling seems like an interesting approach. When I worked in this space as an engineer, bridging the notebook - production-ization divide was annoying. I'd be interested to see if this solves it.


Thanks for your feedback! Do you have any stories to share? I'd love to hear about your experience with the notebooks-production gap


Yeah, a bunch. I worked at Amazon, but I'm sure it's similar everywhere else. Basically, the scientists were way more familiar with notebooks, and they'd code their models there, but when we needed to deploy, we needed a proper Python package that we could store in Git, build, test, run in a container, integrate with data engineering tools, and deploy on some internal tools and AWS SageMaker later. So we'd usually end up converting it to a Python package once it was ready, which worked OK, but you could tell the scientists were more comfortable in notebooks.

Funnily, there were a bunch of internal MLOps type frameworks there (at least 4) that tried to let the scientists deploy to production w/o engineers, but they all failed or semi-failed. I've heard Netflix made it work and I follow MLFlow so I'd be curious what sticks here.

I don't work in the space anymore but it was an interesting space, definitely could use more standardized tooling.


That totally resonates with me! I spent 6 years working as a data scientist, and notebooks just make it a lot simpler to explore and interact with the data, so I totally understand why my data science peers stick with notebooks.

Having said that, the challenge now is to hit a sweet spot between keeping the Jupyter interactive experience and providing some features to help data scientists develop modular work. That's where most frameworks fail, so we want to keep our eyes open and get feedback from both scientists and engineers to develop something that works for everyone.


I've had a great experience using DVC for both data versioning and pipelines before. Can you tell us why Ploomber is a better solution than DVC?


I'm not a DVC user, so I'll speak to what I've seen in the documentation and the couple of examples I ran a while ago. DVC's core is data versioning, and the pipeline features are an extension of it. The main difference is that DVC's pipeline feature is agnostic: you define the command, inputs, and outputs, and DVC executes the pipeline. On the other hand, Ploomber has a deeper integration between your code and the pipeline. For example, our SQL integration allows you to tell Ploomber how to connect to a database and then list a bunch of SQL files as stages in your pipeline (example: https://github.com/ploomber/projects/blob/master/templates/s...). This reduces the boilerplate a lot since you only have to write SQL; if you wanted to do the same thing with DVC, you'd have to manage the connections and create bash scripts to submit the queries.
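A rough sketch of what that SQL integration looks like in a pipeline.yaml. The key names and files here are illustrative (the `db.get_client` dotted path is a placeholder for a function returning a database client); see the linked example for the real layout:

```yaml
# pipeline.yaml (sketch) - SQL files as pipeline stages
clients:
  SQLScript: db.get_client     # tells Ploomber how to connect to the database

tasks:
  - source: sql/clean.sql
    product: [my_schema, clean_table, table]      # [schema, name, kind]
  - source: sql/aggregate.sql
    product: [my_schema, agg_table, table]
```

The point is that you only write the .sql files; Ploomber manages the connections and submits the queries in dependency order.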

The other important difference is that, AFAIK, DVC can only run your pipelines locally, while Ploomber can export your pipelines to run in other environments (Kubernetes, Airflow, AWS Batch, SLURM, Kubeflow). This lets you run experiments locally but easily move to a distributed environment when you need to train models at a larger scale or want to deploy an ML pipeline.
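For the curious: exporting is handled by Ploomber's companion soopervisor package. Roughly like this, though the environment name ("training") is made up and the exact flags and backend names may differ by version, so check the soopervisor docs:

```
pip install soopervisor
soopervisor add training --backend aws-batch   # configure a target environment
soopervisor export training                    # export and submit the pipeline
```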


DVC also runs pipelines in the cloud so you can run experiments remotely at different scales and share results with others.


Interesting, can you share a link to the docs? I'd like to learn more about their approach.


Congrats on the launch! Do you guys by any chance know Deepnote? They're in YC as well, also in the tools for data scientists space. I lead a small team of DSs in a big corp and we'd happily pay for a single tool that would be Deepnote+Ploomber in terms of features (collaboration + deployment)


Thanks! Yes, we know Deepnote! We want to focus on pipelines and deployment so we currently integrate with Jupyter distributions (as long as they keep the standard format); we have users that run Ploomber on JupyterHub, Domino, SageMaker and others. I don't think any of our users runs on Deepnote but it should work as well. We are thinking of being the "backend": Users can develop their notebooks from whatever distribution they use; Ploomber will provide the tools to help them build modular pipelines and we'll help them deploy those pipelines. I'd love to learn more about your use case, please ping me at eduardo@ploomber.io


Hey, the founder of Deepnote here. Happy to chat about this.


The audio in the landing page video is hard to understand. Is this only my broken speakers?

Also the video cannot be made fullscreen on my phone. Is this by design?


Thanks for letting us know! The audio seems OK on the laptops and phones we tested. On the video, you're right: it's a bug we need to fix. Since it's within an iframe, you can't expand it into full-screen mode.


Really helpful for keeping notebooks tidy :)


Yes! We want to help people keep enjoying Jupyter and produce tidy pipelines!


Congrats on the launch! It’s great to see validation of the usefulness of notebooks in data workflows even when moving beyond the proof of concept/exploration stage into production type workloads and deployments. Once deployed, iteration is often still necessary or desirable and that’s where having notebooks available for continued iteration is a big advantage.

For those who’d like to compare and contrast different solutions that support the use of notebooks in the (batch) deployment context, you can also check out Orchest (https://github.com/orchest/orchest). I’d say a meaningful point of difference between Ploomber and Orchest is that we are more container-oriented, as we’ve found that gives robust units to deploy in production with isolated and well-defined dependencies. In addition, we have a more GUI-first approach, which might be more familiar to those who come from RStudio, JupyterLab, Spyder, MATLAB, etc.

Disclaimer: I’m one of the Orchest creators.


Nothing in the Orchest docs looked to me like a deployment setup, whereas I came across a few Ploomber cloud deployment-related docs, e.g.: https://soopervisor.readthedocs.io/en/latest/tutorials/aws-b...

Could you point me in the right direction?


Glad you asked.

Simply deploy the Orchest OSS to a VM of your choice (https://docs.orchest.io/en/latest/getting_started/installati...) or use https://cloud.orchest.io, and then create scheduled Jobs.

https://docs.orchest.io/en/latest/fundamentals/jobs.html

If you’re looking for something more than that I’d be happy to open up a conversation about your use case at rick@orchest.io



