Data Build Tool (dbt) makes it easier to get data analysts involved, which helps bring more analytics code into version control. This is a really good thing: most data analysts are not software engineers, and they rightfully prioritize putting their analytical skills to use as efficiently as possible, which, without tooling like this, tends to result in analysis getting lost in a data swamp.
I made a little animated slide deck (https://dbt.picturatechnica.com/) a while back about dbt for data-analyst-type folks who want an intro. You can step through each slide slowly to get a feel for what dbt is going to build in your data warehouse.
I use dbt to build a small Dremio data warehouse. dbt applies software engineering ideas to building data, which is quite awesome. I can version control it with GitHub and add some unit tests (mainly data quality checks).
Another thing is its macro system. Macros are not a great programming language, but they make integrating a new DBMS much easier, because most third-party plugins are written mainly as macros. So macros are good too, as long as someone else provides them (and I don't have to write them myself :) )
The issue with dbt is that folks now end up needing to spin up the entire repository just to run a bunch of SQL queries. Since it's a tool targeted at teams without rigorous engineering backgrounds, it's bound to be used in suboptimal ways. If the dbt team can do something about that, it could be even better.
Before dbt, I didn't really think about structuring how I'd be running my queries. My experience with it is that it's made clear to me what was sub-optimally laid out: organize my fact tables by subject, adopt a common schema before applying business logic, use re-usable views instead of copy-pasting CTEs, and so on.
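For example, a CTE that used to get copy-pasted between queries can live as its own model and be referenced everywhere. A minimal sketch, with hypothetical model and source names:

-- models/staging/stg_orders.sql: the cleanup that used to be a copy-pasted CTE
select
    id as order_id,
    customer_id,
    created_at::date as order_date,
    status
from {{ source('app_db', 'orders') }}

Downstream models then select from {{ ref('stg_orders') }} instead of repeating that logic.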
And I like its tight git integration. The alternative, quite often, is a folder of scripts on an analyst's desktop.
My team has been working to solve this problem (and more). We recently released our CLI tool, palm, as OSS, along with a plugin, palm-dbt, which we designed specifically for working with dbt projects in an optimal way.
Palm requires running code in containers, so teams can ensure dependencies and versions are correct. Then the plugin provides sensible defaults, with solutions for common problems (like idempotent dev/ci runs) and an abstraction in front of the most common commands you need to run in the container.
It's been a huge boon for our analysts' productivity, improved the reliability and quality of our models, and decreased onboarding time for new people joining our data team.
I'm the CTO of a 15 person startup and dbt (along with Mode, Stitch and BigQuery) has been a superpower allowing me to moonlight as a CDO (or some shitty pastiche of one) and push forward our data usage and competence in ways I just didn't think possible. Modern data stack is a marvel in what it unlocks.
I can't argue that dbt isn't great -- it is. It is, however, unfortunate that Python is still the dominant lingua franca here. Between all the Python data tools and the JSON being shuffled around à la singer.io, I just can't help but think there is a huge opportunity awaiting the engineers who build a company around an open-source set of products using rigorous languages and schemas.
Does it matter that dbt is written in Python? dbt models are still SQL at heart. Sure, there's the addition of Jinja to enable references for data lineage and configuration, but it's all compiled to SQL with fine-grained control over what's produced.
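For instance, a model like this (names are hypothetical) is just SQL plus a ref():

-- models/fct_orders.sql
select o.order_id, c.customer_name, o.amount
from {{ ref('stg_orders') }} o
join {{ ref('stg_customers') }} c on c.customer_id = o.customer_id

dbt compiles the ref() calls down to fully qualified relation names (something like analytics.stg_orders) and ships plain SQL to the warehouse.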
Forgive me if I come across as combative, but I don't understand generic appeals to things like a language being rigorous. Rigorous to what end? What problem is it solving where that is required? If you know something specific in this domain where that level of rigor is needed, why not share what it is?
There are a lot of problems in the analytics space (and a lot of opportunity for new practices, tools, and businesses), but I would argue that at the end of the day the primary issue is whether data producers choose to model data so that it is legible outside the system that produced it, much more than any particular language or methodology.
For typical datasets (~95% of companies have medium data, on the order of gigabytes), you are 100% correct that data modeling and formatting are the biggest challenge.
Having well modeled data that matches the business domain is a massive (2-10x) productivity boost for most business analysis.
Well, part of the benefit is rapid development; it's mind-boggling how quickly someone can stand up a dbt project and begin to iterate on transforms. Using Python/SQL/JSON at small/medium scales keeps the data stack consistent and lowers the barrier to entry. There's no reason to prematurely optimize when your bottleneck is the modeling and not the actual data volume.
A vast majority of companies are working with < 1TB of data that sits neatly in a single cloud database. Python and tools like dbt are fantastic for a huge class of problems without compromising workflow velocity, and pushing transformations into SQL removes most Python-bound performance constraints.
Changing singer.io to require Thrift or protobuf schemas isn't going to add the value you think it is. How data is shuffled between systems is considerably less important and time consuming than figuring out how to put that data to work.
Singer and dbt are different: dbt orchestrates things (evaluates macros and runs SQL), while Singer actually has the data flow through it. So rewriting Singer in something fast (I guess Java, since even the most obscure database has a JDBC driver) would definitely help.
It would only help once you start shuttling around terabytes.
These tools were not built for multi-TB workloads, and pointing that out as a deficiency when it was clearly not a design goal is a misleading argument.
Our own approach is to keep singer.io et al. and JSON (flexible! streaming capable!), but deeply invest in JSON Schema to a) statically infer and map into other kinds of schema on your behalf -- TypeScript, Elasticsearch, and SQL DDL so far -- and b) optimize validation so it's fast enough to sit in the high-scale critical path. If you validate every document before it's read or written by your hacky transform, you've gone a long way towards limiting the bug blast radius.
It uses Java for its connectors and looks great, but it has issues importing a massive dataset into S3, as there is a limit of 10k chunks and each chunk is 5 MB :).
I am a user, and what sets them apart even beyond the product is the amazing community they have built that celebrates the unsexy aspects of data science. Just watch some videos from their annual conference; people are so excited to talk about rationalizing business logic.
One of the better ways I've seen dbt described is as something very similar to an MVC framework for data (I didn't see that mentioned yet; if it has been, sorry I missed it).
Where a rough mapping would be:
model >> source
controller >> ephemeral model
view >> materialized (view, table, incremental) model
Like many MVCs, it provides abstractions and helpers for common tasks such as performing incremental updates, snapshotting time series, and common manipulations like creating date spines. Like many MVCs, it supports a robust ecosystem of plugins that make it easy to re-use stable transforms for common datasets. Where Django passes you the request object instead of making you hand-parse HTTP requests, dbt lets you loop through a set of derived schemas and DRY out your SQL code.
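As a rough sketch of the looping idea (model and column names are hypothetical), a pivot that would otherwise be a wall of repeated CASE expressions:

{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}
select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by 1

Adding a new payment method is then a one-line change to the list rather than another copy-pasted CASE.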
You would generally use dbt as a transform framework for the same reason you'd use Ruby on Rails or Django etc as a web framework - because it provides you with a ton of otherwise repetitive non-differentiating code pre-baked and ready to go. You could keep a folder of sql files you arbitrarily run, and you could roll your own web framework from scratch. Personally I wouldn't do those things.
I recently applied for a job on a data warehousing team that was using this. They were using it to replace the transformation stage built on legacy tools like SSIS, DataStage, and Informatica. It was certainly impressive and looked great to work with. I can see dbt becoming a lot more common going forward.
Is there any competition to dbt? I know of https://dataform.co, which looked pretty similar, but that got bought by Google, so I don't know if it would be a good investment now.
dbt and ELT in general are such a game-changer for allowing rapid iteration on business logic by data analysts; the feedback loop feels much more like "normal" software engineering compared to legacy systems.
Having used it for a little over a year now, I can say that its strengths may lie in getting junior developers' code more free of bugs and dependency issues. It's just additional overhead when you're trying to do anything more complex, like a Kimball type II slowly changing dimension -- then it's just a blocker. Unfortunately, as it becomes a de facto build environment, its limitations start getting applied to everyone.
Yeah, dbt snapshots do a row hash and update if anything in the row changes. We had a source table with a bunch of daily-changing timestamps, e.g. "load date", that we needed to ignore in favour of a business key. dbt was an utter torment to get this working with. We ended up building a more elegant framework without it.
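For anyone hitting the same wall, dbt snapshots also have a 'check' strategy where you list the columns to compare instead of hashing the whole row, which lets you leave out volatile fields like a load date. It may not fit every case, but a minimal sketch (table and column names hypothetical):

{% snapshot orders_snapshot %}
{{ config(
    target_schema='snapshots',
    unique_key='order_business_key',
    strategy='check',
    check_cols=['status', 'amount', 'customer_id']
) }}
-- load_date is simply left out of check_cols, so daily reloads
-- don't create new snapshot rows
select * from {{ source('erp', 'orders') }}
{% endsnapshot %}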
dbt is the quickest way to watch your cloud costs skyrocket.
I really do hope people don’t think this is a way to not have dedicated data engineers on staff and instead “empower” analysts to do things for themselves.
Cloud costs increasing could be a sign of more data being used productively. In any case, a mix of data engineers and analysts is a healthy way to scale a dbt project, balancing speed of delivery of analytics requests against cloud costs. We at dbt Labs encourage the "analytics engineering" mindset to bring software engineering best practices into the dbt analytics workflow (git version control, code reviews), and cloud cost considerations should be part of mature dbt development practices.
I'm still a little confused about what dbt does after reading the article. Is it something like Trino that generates materialized views as output, but with built-in version control and other features?
In your case, dbt would be the tool to manage your Trino views programmatically:
* You define the SQL queries that select the input data as models, plus dbt scripts (also SQL) to combine/manipulate those models.
* On running, dbt will generate the Trino SQL queries to join/transform these models into the final result (also a model).
* Depending on your configuration, any model can be stored as a Trino view or materialized as a data table that's updated every time you re-run the dbt scripts (see the sketch below).
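The choice is just a config flag on the model. A minimal sketch with a hypothetical model name (which materializations are available depends on the adapter):

-- models/marts/daily_revenue.sql
{{ config(materialized='table') }}  -- or 'view' to keep it virtual
select order_date, sum(amount) as revenue
from {{ ref('stg_orders') }}
group by 1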
I'd list the DAG and testing as core features as well, but yeah, basically. In a very transparent and non-magical way (I despise magical data solutions). If that summary doesn't resonate with you, it's probably not the tool for you. No need to force it.
I'm not familiar with Trino at all, but that sounds like a specific database. dbt is not a database, it is tooling for working with databases.
We’ve recently introduced dbt, fivetran and snowflake and it has been amazing on many levels. Speed of development along with CI/CD has been fantastic. Clarity of thought about organizing our work has helped with the more junior and platform specific people.
Before I got into it, I was sceptical because I thought it's just SQL, and SQL is pretty easy and straightforward. But it breaks bad habits in DW development.
I'm not sure I fully get why, but it has been one of the few things I've seen in a long time in this area that is a big jump in capabilities. There is so much more confidence on my team. We are moving much faster. We can just get on with delivering without having to undo things or treat everything as precious or dangerous to change.
It took some hands-on work and getting over the initial mind-bend to get there, but I wouldn't go back to what we had. I would likely only make my next move to a place that had a similar setup or was open to change.
Exactly what we've been doing for 2+ years at my job.
It's amazing.
Snowflake is amazing, but watch out for search optimization costs (it's great for append-only tables), and for left joins taking FOREVER (avoid left joins as much as possible on large datasets).
It has to join each part of the previous join to the next join, and if you have a lot of joins this can get out of hand.
We have a lot of joins in our final fct_orders model, built from our intermediate tables, and it looks like this:
from foo
left join bar on bar.common_id = foo.common_id
left join baz on baz.common_id = foo.common_id
left join qux on qux.common_id = foo.common_id
left join waldo on waldo.common_id = foo.common_id
So waldo joins onto the result of the qux join, which joins onto the result of the baz join, and so on. I call it a "staircase join", as that's what it looks like in the SF profiler.
The beautiful (and scary thing) about cloud data warehouses is that you can scale them significantly, which means bad SQL mostly becomes a cost issue as opposed to a speed issue.
Also, dbt makes it easier to persist tables instead of views, which can be a massive performance improvement.
Edit: if you didn't mean it dismissively and want to clarify that, I'd be happy to remove this comment. We unfortunately see quite a few comments along the lines of "so basically, this is just super-simple $thing and therefore this is dumb" and I interpreted yours through that lens.
I actually really like dbt and advocated for it. It is, however, quite a simple tool: just some wrappers and a workflow around SQL transformations. Maybe its simplicity is its quality, but the OP you replied to isn't as far off the mark as similar comments.
In theory, you could write dbt in make. Actually, whenever I introduce workflow concepts (DAGs) to others, I tell them it is similar to how make works.
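The analogy holds because dbt infers the DAG from ref() calls, much like make infers build order from prerequisites. A tiny sketch with hypothetical model names:

-- models/stg_orders.sql
select * from {{ source('app_db', 'orders') }}

-- models/fct_orders.sql (depends on the model above via ref)
select order_id, amount from {{ ref('stg_orders') }}

dbt run then builds stg_orders before fct_orders, the way make builds a prerequisite before its target.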
I have been trying to migrate our legacy ETLs, which are mostly SQL strings peppered around Python code, to dbt.
I like the macros, the templates, the visualisation tool... and would like to make our code as dbt-compatible as possible.
However, the DB we currently use (Vertica) does not have official support, and there are dependency problems installing all of dbt. All I would like for my first milestone is to use refs and macros and to better organise the SQL. It would be good enough if it generated standard SQL that I could run outside dbt.
My wish:
I wish I could install just the dbt pieces I need (templates, macros, refs, dependency resolution) and still be able to gradually work towards full dbt compatibility. At the moment it looks like all or nothing.
They are essentially SQL transformations at the reporting stage. Python is just the glue code; a lot of attention has gone into the connectors but not so much into the complexity within the SQL.
The problem I would like to address is complex SQL written as strings.
Some parts of these repeat across multiple reports: column transformations, lookups, joins with dimensions...
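dbt macros are a decent fit for exactly that kind of repetition. A minimal sketch, with a hypothetical macro and column names:

-- macros/cents_to_dollars.sql: a column transformation that used to be repeated in every report
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- used in any model:
-- select {{ cents_to_dollars('amount_cents') }} as amount_usd from ...

Lookups and dimension joins can get the same treatment, either as macros or as shared models referenced with ref().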