Hey HN! Alan and Albert here, cofounders of Serra. Serra is end-to-end dbt—we make building reliable, scalable ELT/ETL easy by replacing brittle SQL scripts with object-oriented Python. It’s open core:
https://github.com/Serra-Technologies/serra, and our docs are here:
https://docs.serra.io/documentation/.
I stumbled into this idea as a data engineer on Disney+'s subscriptions team. We were "firefighters for data," ready to debug huge pipelines that always crashed and burned. The worst part of my job at Disney+ was the graveyard on-call rotations, where pages from 12am to 5am were guaranteed and you'd have to dig through thousands of lines of someone else's SQL.
SQL is long-winded: 1,000 lines of SQL can often be boiled down to 10 key transforms. We take that SQL and express those transforms as reusable, testable, scalable Spark objects.
Serra is written in PySpark and modularizes every component of ETL through Spark objects. Similar to dbt, we apply software engineering best practices to data, but we aim to do it not just for transformations but for data connectors as well.
We accomplish this with a YAML configuration file: if we have a pipeline with that same 1,000-line SQL script plus third-party connectors, we can summarize all of it in a 12-block config file, 10 blocks for the transforms and 2 for the in-house connectors, which gives a high-level overview and easy debugging. Then we can add tests and custom alerts to each of these objects/blocks so we know exactly where the pipeline breaks and why.
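To make the decomposition concrete, here's a minimal sketch in plain PySpark (not Serra's actual API or config format; the column names and source path are made up for illustration) of how a couple of those blocks might look as small, independently testable transform steps:

    from pyspark.sql import SparkSession, DataFrame
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    def dedupe_subscribers(df: DataFrame) -> DataFrame:
        # One transform "block": drop duplicate subscriber rows.
        return df.dropDuplicates(["subscriber_id"])

    def map_plan_names(df: DataFrame) -> DataFrame:
        # Another "block": map raw plan codes to readable names.
        return df.withColumn(
            "plan_name",
            F.when(F.col("plan_code") == "P1", "monthly").otherwise("annual"),
        )

    # A connector "block" (hypothetical source), then the transform chain.
    raw = spark.read.parquet("s3://my-bucket/subscriptions/")
    result = raw.transform(dedupe_subscribers).transform(map_plan_names)

Each step is a named unit you can test in isolation, which is roughly what the config blocks map to.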
We're open core to make it easy to customize Serra to whatever flavor you like with custom transformers/connectors. The connectors we support out of the box are Snowflake, AWS, BigQuery, and Databricks, and we're adding more based on feedback. The transforms we support include mapping, pivoting, joining, truncating, imputing, and more.
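To give a feel for how small a custom transformer can be, here's a purely illustrative sketch in plain PySpark; Serra's real base class and method names may differ, so check the docs for the actual interface:

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    class TruncateColumn:
        # Hypothetical transformer: keep the first `length` characters of a column.
        def __init__(self, column: str, length: int):
            self.column = column
            self.length = length

        def transform(self, df: DataFrame) -> DataFrame:
            # Configured once (e.g. from a config block), applied to any DataFrame.
            return df.withColumn(
                self.column, F.substring(F.col(self.column), 1, self.length)
            )

The same shape works for connectors: a small object that owns one read or write and can be swapped out or tested on its own.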
We're doing our best to make Serra as easy to use as possible. If you have Docker installed, you can run the Docker command below (next to the demo link) to instantly get set up with a Serra environment for creating modular pipelines.
We wrap up our functionality with a command-line tool that lets you create your ETL pipelines, test them locally on a subset of your data, and deploy them to the cloud (currently we only support Databricks, but we will soon support others and plan to host our own clusters too). It also has an experimental "translate" feature, which is still a bit finicky: the idea is to take your existing SQL script and get suggestions on how to chunk up and modularize the job with our config. It's still an early suggestion feature that is definitely not fleshed out, but we think it's a cool approach.
Here's a quick demo that walks through retooling a long-winded SQL script into an easily maintainable, scalable ETL job: https://www.loom.com/share/acc633c0ec03455e9e8837f5c3db3165?.... (docker command: docker run --mount type=bind,source="$(pwd)",target=/app -it serraio/serra /bin/bash)
We don't see or store any of your data: we're a transit layer that helps you write ETL jobs you can send to your warehouse of choice with your actual data. Right now we're helping customers retool their messy data pipelines, and we plan to monetize by hosting Serra in the cloud, charging when you run jobs on our own clusters and per API call for the translate feature (once it's mature).
We’re super excited to launch this to Hacker News. We’d love to hear what you think. Thanks in advance!
Interesting project in a space that I am pretty certain is going to change a lot in the coming years. Here is a bit of random feedback and questions.
* Some of your messaging around Python vs. YAML is a bit confusing, which left me unclear on the value prop at first. After digging through the docs and code, I now understand that the YAML is a declarative pipeline calling the underlying Python code, which can include user-defined transformations. Nifty! As someone who has led data platform teams, I understand that this would be a big win for any data platform team trying to better support data engineers/scientists. But you don't tell me any of that. I would look at giving more context on what this is and adding more of these use cases and values to your marketing (even if they are pretty nascent at this stage).
* From the Loom, the play you are making is clear and makes a lot of sense: build a cloud service to easily run these jobs... but that makes me wonder if your licensing choice is maybe a bit too restrictive? IMHO, the most important thing when building dev tools is to be very deliberate about your end-to-end user -> customer journey and to design your open-source and commercial strategies to dovetail nicely. For a product like this, I would think the faster and bigger you can build a community, the better, and that may mean "giving away" a lot of the initial core innovation, but with a clear plan for the innovation you can drive through integrated services, which would imply as open a license as possible. As is, I think you might find it much harder to get people to take it seriously since, unlike other source-available companies (Elastic, Cockroach, etc.), you aren't yet proven to be worth the effort of getting this approved vs. a fully open-source alternative.
* On a similar note, what is in the repo right now seems to be a relatively thin wrapper around Spark. That isn't a criticism. Many technologies and communities have started as a "remix" of a lower-level tool that offers simplified UX/DX or big workflow improvements. What sets those apart, though, imho, is that they drastically lower the barrier to entry to the underlying technology and come to be seen as leaders and experts in the space they operate in. I am guessing you probably have lots of features planned, but I would give a soft suggestion to invest as much in learnability as a feature (via features, interactive docs, etc.) as in almost anything else, as that is really where a lot of the value of a higher-level interface like this comes in.
* My past experience with really large and complex ETL jobs that essentially required dropping into Spark to represent them has me wondering how much actual complexity can be represented by the transformers. I would be curious to know what your most complex pipeline is. There doesn't seem to be an API limitation keeping these pipelines from getting quite a bit larger and representing many SQL statements, other than big, long Spark pipelines getting kind of ugly, and in some cases this could even remove the need for quite a few Airflow jobs. I am curious whether and how you see Serra addressing those types of ETL jobs.
Once again, congrats on launching! Happy to give more context/thoughts in a thread, or reach out to me via my profile.