What Is Dbt and Why Are Companies Using It? (seattledataguy.substack.com)
93 points by mooreds on Dec 2, 2021 | 70 comments



Data Build Tool makes it easier to get data analysts involved, which helps bring more analytics code into version control. This is a really really good thing given that most data analysts are not software engineers and rightfully prioritize putting their analytical skills to use in the most efficient way possible, which tends to result in analysis getting lost in a data swamp.


I made a little animated slide deck (https://dbt.picturatechnica.com/) a while back about dbt for data analyst type folks who want an intro. You can slowly step through each slide to get a feel for what dbt is going to build in your data warehouse.

P.S. I currently work for dbt Labs.


I use dbt to build a small Dremio data warehouse. dbt applies software engineering ideas to building data, and it is quite awesome. I can use GitHub to version control it and do some unit testing (mainly data quality checks).

Another thing is its macro system. Macros are not a great programming language, but they make integrating a new DBMS much easier, because most third-party plugins are mostly written in macros. So macros are good too, as long as they are provided by other people (and don't need to be written by myself :) )
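For anyone who hasn't seen one, here's a rough sketch of what a dbt macro looks like (the macro and model names are invented, not from any particular plugin): a Jinja macro wraps a SQL snippet so every model, or an adapter plugin, can reuse or override it.

    -- macros/cents_to_dollars.sql (hypothetical example)
    {% macro cents_to_dollars(column_name, precision=2) %}
        round({{ column_name }} / 100.0, {{ precision }})
    {% endmacro %}

    -- used inside a model:
    select
        order_id,
        {{ cents_to_dollars('amount_cents') }} as amount_dollars
    from {{ ref('stg_orders') }}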


Not to steal the OP's thunder, but I have also created a few resources on DBT if they help.

A demo of integrating DBT and Snowflake https://timeflow.systems/connecting-dbt-to-snowflake

How DBT DevOps enables data teams: https://timeflow.academy/blog/how-dbt-devops-enables-data-te...

I am also writing a tutorial on DBT, though there are a few rough edges still: https://timeflow.academy/dbt-introduction


>A demo of integrating DBT and Snowflake https://timeflow.systems/connecting-dbt-to-snowflake

This link returns a 404


The issue with dbt is that folks now end up needing to spin up the entire repository just to run a bunch of SQL queries. And since it's a tool targeted at teams that don't have rigorous engineering backgrounds, it's bound to be used in suboptimal ways. If the dbt team can do something about that, it could be even better.


Before dbt, I didn't even really think about structuring how I'd be running my queries. My experience with it has been that it's made clear to me what was sub-optimally laid out: organize my fact tables by subject, adopt a common schema before applying business logic, use re-usable views instead of copy-pasting CTEs (rough sketch below), etc.
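To make the re-usable-views point concrete, here's a minimal sketch of how that tends to look in dbt (the model and column names are made up): the cleanup logic lives in one staging model, and downstream models just ref() it instead of repeating the CTE.

    -- models/staging/stg_payments.sql (hypothetical staging model)
    select
        id as payment_id,
        order_id,
        lower(status) as status,
        amount_cents / 100.0 as amount
    from {{ source('app', 'payments') }}

    -- models/marts/fct_orders.sql (reuses the staging view instead of a copy-pasted CTE)
    select order_id, sum(amount) as order_total
    from {{ ref('stg_payments') }}
    where status = 'completed'
    group by order_id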

And I like its tight git integration. The alternative, quite often, is a folder of scripts on an analyst's desktop.


Yeah "forcing version control" seems like a feature (pertaining to data governance), not a bug.


Shameless, but very relevant plug:

My team have been working to solve this problem (and more). We recently released our CLI tool, palm, as OSS. Along with a plugin, palm-dbt, which we designed specifically for working with dbt projects in an optimal way.

Palm requires running code in containers, so teams can ensure dependencies and versions are correct. Then the plugin provides sensible defaults, with solutions for common problems (like idempotent dev/ci runs) and an abstraction in front of the most common commands you need to run in the container.

It's been a huge boon for our analysts' productivity, improved the reliability and quality of our models, and decreased onboarding times for new people joining our data team.

Links: https://github.com/palmetto/palm-cli https://github.com/palmetto/palm-dbt


I'm the CTO of a 15 person startup and dbt (along with Mode, Stitch and BigQuery) has been a superpower allowing me to moonlight as a CDO (or some shitty pastiche of one) and push forward our data usage and competence in ways I just didn't think possible. Modern data stack is a marvel in what it unlocks.


I can't argue that dbt isn't great -- it is. It is, however, unfortunate that Python is still the dominant lingua franca here. Between all the Python data tools and JSON being shuffled around à la singer.io, I just can't help but think there is a huge opportunity awaiting the engineers who build a company around an open source set of products using rigorous languages and schemas.

It's a goddamn madhouse.


Does it matter that dbt is written in Python? dbt models are still SQL at heart. Sure there's the addition of Jinja to enable references for data lineage and configuration, but it's all compiled to SQL with fine-grained control over what's produced.
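To illustrate that (with invented relation and schema names): the only non-SQL part of a model is the Jinja, and ref() is what dbt uses both to build the lineage graph and to resolve the concrete relation at compile time.

    -- models/orders_enriched.sql, as written:
    select o.*, c.segment
    from {{ ref('stg_orders') }} o
    join {{ ref('stg_customers') }} c on c.customer_id = o.customer_id

    -- roughly what gets compiled and sent to the warehouse:
    select o.*, c.segment
    from analytics.dbt_dev.stg_orders o
    join analytics.dbt_dev.stg_customers c on c.customer_id = o.customer_id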

Forgive me if I come across as combative, but I don't understand generic appeals to things like a language being rigorous. Rigorous to what end? What problem is it solving where that is required? If you know something specific in this domain where that level of rigor is needed, why not share what it is?

There are a lot of problems in the analytics space (and a lot of opportunity for new practices, tools, and businesses), but I would argue that at the end of the day the primary issue is whether data producers choose to model data such that it is legible outside of the system that produced it, much more than any particular language or methodology.


For typical datasets (~95% of companies have medium data on the order of gigabytes), you are 100% correct that the data modeling / formatting is the biggest challenge.

Having well modeled data that matches the business domain is a massive (2-10x) productivity boost for most business analysis.


Well, part of the benefit is rapid development; it's mind-boggling how quickly someone can stand up a dbt project and begin to iterate on transforms. Using Python/SQL/JSON at small/medium scales keeps the data stack consistent and lowers the barrier to entry. No reason to prematurely optimize when your bottleneck is the modeling and not the actual data volume.


> using rigorous languages and schemas

And what value does it add?

A vast majority of companies are working with < 1TB of data that sits neatly in a single cloud database. Python and tools like dbt are fantastic for a huge class of problems without compromising workflow velocity, and pushing transformations into SQL removes most Python-bound performance constraints.

Changing singer.io to require Thrift or protobuf schemas isn't going to add the value you think it is. How data is shuffled between systems is considerably less important and time consuming than figuring out how to put that data to work.


Singer and dbt are different - dbt orchestrates stuff (evaluates macros and runs SQL) while Singer actually has the data flow through it. So rewriting Singer in something fast (I guess Java, since even the most obscure database has a JDBC driver) would definitely help.

It would only help once you start shuttling around terabytes.


These tools were not built for multi-TB workloads, and pointing that out as a deficiency when it was clearly not a design goal is a misleading argument.


There are only languages people hate and ones no one uses


and golang


Our own approach is to keep singer.io et al and JSON (flexible! streaming capable!), but deeply invest in JSON Schema to a) statically infer and map into other kinds of schema on your behalf -- TypeScript, Elastic Search, and SQL DDL so far -- and b) optimize validation so it's fast enough to be in the high-scale critical path. If you validate every document before it's read or written by your hacky transform, you've gone a long way towards limiting the bug blast radius.


I think this is what meltano is trying to help with?

https://www.meltano.com/


Not really - meltano uses singer (ie extracts/loads data in JSON form) and dbt (for transformation, in the ELT pattern).

It’s a good tool (I use it), but the concerns GP is raising are very much its weaknesses.


On the other hand, it's convenient to use the same language as Jupyter Notebooks.


Can you give a before/after of what the desired state would look like?


Might be worth looking into Airbyte!


Airbyte looks great, and the UI is fantastic.

It uses Java for its connectors, but it has issues importing a massive dataset into S3, as there is a chunk limit of 10k and each chunk size is 5 MB :).


We've used https://github.com/estuary/connectors/pkgs/container/source-... to load data sets in the many terabytes. Caveat that, while it's implemented to Airbyte's spec, we've only used it with Flow.


That’s the current focus of the team. Consolidating those connectors :)


Something like Databricks?


I am a user, and what sets them apart even beyond the product is the amazing community they have built that celebrates the unsexy aspects of data science. Just watch some videos from their annual conference; people are so excited to talk about rationalizing business logic.


Evidently, they have a free conference next week[0].

[0] https://coalesce.getdbt.com


One of the better ways I've seen dbt described is as something very similar to an MVC framework for data (I didn't see that mentioned yet; if it has been, sorry I missed it). A rough mapping would be:

model >> source

controller >> ephemeral model

view >> materialized (view, table, incremental) model

Like many MVCs it provides abstractions and helpers for common tasks such as performing incremental updates, snapshotting time series, and common manipulations like creating date spines. Like many MVCs it supports a robust ecosystem of plugins that make it easy to re-use stable transforms for common datasets. Where Django passes you the request object instead of making you hand-parse HTTP requests, dbt allows you to loop through a set of derived schemas and DRY out your SQL code (quick sketch below).
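A small, hypothetical example of that last point (the column list is invented): a Jinja loop generates the repetitive pivot columns instead of you hand-writing them.

    {% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}

    select
        order_id,
        {% for method in payment_methods %}
        sum(case when payment_method = '{{ method }}' then amount else 0 end)
            as {{ method }}_amount
        {% if not loop.last %},{% endif %}
        {% endfor %}
    from {{ ref('stg_payments') }}
    group by order_id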

You would generally use dbt as a transform framework for the same reason you'd use Ruby on Rails or Django etc as a web framework - because it provides a ton of otherwise repetitive, non-differentiating code pre-baked and ready to go. You could keep a folder of SQL files you arbitrarily run, just as you could roll your own web framework from scratch. Personally I wouldn't do either of those things.


I recently applied for a job on a data warehousing team that was using this. The way they were using it to replace their transformation stage, moving off legacy tools like SSIS, DataStage and Informatica, was certainly impressive and looked great to work with. I can see dbt becoming a lot more common going forward.


Is there dbt competition? I know of https://dataform.co that looked pretty similar, but that got bought by Google so I don't know if it would be a good investment now.

Any other?


dbt and ELT in general are such a game-changer for allowing rapid iteration on business logic by data analysts; the feedback loop feels much more like "normal" software engineering compared to legacy systems.


How long before Amazon incorporates this as one of their services?


I’m sure they’re working on it. Databricks, however, also has Delta Live Tables coming out soon I believe.


Having used it for a little over a year now, I can say that its strengths may lie in getting junior developers' code more free of bugs and dependency issues. It's just additional overhead when you're trying to do anything more complex, like a Kimball type II slowly changing dimension - then it's just a blocker. Unfortunately, as it becomes a de facto build environment, its limitations start getting applied to everyone.


The snapshots feature should handle the slowly changing dimensions. But are you saying it's not flexible enough for certain edge cases?


Yeah, dbt snapshots do a row hash and update if anything in the row changes. We had a source table with a bunch of daily-changing timestamps, e.g. "load date", that we needed to ignore so we could focus on a business key. Dbt was an utter torment to try to get this going. Ended up building a more elegant framework without it.


So apparently the check_cols parameter didn't help either?

I'm trying to learn about the critical pain points of dbt, and this case seems interesting.
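For reference, this is roughly what I'd expect a check-strategy snapshot to look like for the case you describe (table, key and column names invented): check_cols limits change detection to the listed columns, so a noisy load_date column would be ignored.

    {% snapshot orders_snapshot %}
    {{
        config(
            target_schema='snapshots',
            unique_key='order_business_key',
            strategy='check',
            check_cols=['status', 'amount', 'customer_id']
        )
    }}
    select * from {{ source('erp', 'orders') }}
    {% endsnapshot %}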


> 15,000 members of a Slack community.

That's an impressive number of highly paid people. Nice work by dbt's people.


dbt is the quickest way to watch your cloud costs skyrocket.

I really do hope people don’t think this is a way to not have dedicated data engineers on staff and instead “empower” analysts to do things for themselves.


[dbt Labs employee here]

Cloud costs increasing could be a sign of more data being utilized productively. In any case, a mixture of data engineers and analysts is a healthy way to scale a dbt project, balancing speed of delivery of analytics requests against cloud costs. We at dbt Labs encourage the "analytics engineering" mindset of bringing software engineering best practices into the dbt analytics workflow (git version control, code reviews), and cloud cost considerations should be incorporated into mature dbt development practices.


Could you expand on this point?


I’m still a little confused on what dbt does after reading the article. Is it like Trino that generates materialized views as output, with built-in version control and other features?


In your case, dbt would be the tool to manage your Trino views programmatically:

* You define the SQL queries that select the input data as models, plus the dbt scripts (also SQL) that combine/manipulate those models.

* On running, dbt will generate the Trino SQL queries to join/transform these models into the final result (also a model).

* Depending on your configuration, any model can be stored as a Trino view, or it can be materialized as a data table that's updated every time you re-run the dbt scripts (rough sketch below).
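A minimal sketch of what that looks like in practice (model name and columns are invented, and this assumes your adapter is already configured): the same SQL file can be shipped as a view, a table, or an incremental model just by changing the materialization config.

    -- models/daily_revenue.sql (hypothetical)
    {{ config(materialized='table') }}  -- or 'view' / 'incremental'

    select
        order_date,
        sum(amount) as revenue
    from {{ ref('stg_orders') }}
    group by order_date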


I'd list the DAG and testing as core features as well, but yeah, basically. In a very transparent and non-magical way (I despise magical data solutions). If that summary doesn't resonate with you, it's probably not the tool for you. No need to force it.

I'm not familiar with Trino at all, but that sounds like a specific database. dbt is not a database, it is tooling for working with databases.


DBT is really nice, but not a game changer since it's basically a SQL generator and only as good as the underlying SQL engine of your data warehouse.


We’ve recently introduced dbt, fivetran and snowflake and it has been amazing on many levels. Speed of development along with CI/CD has been fantastic. Clarity of thought about organizing our work has helped with the more junior and platform specific people.

Before I got into it I was sceptical, because I was thinking it's just SQL, and SQL is pretty easy and straightforward. But it breaks bad habits in DW development.

I'm not sure I still fully get why, but it has been one of the few things I've seen in a long time in this area that has been a big jump in capabilities. There is so much more confidence on my team. We are moving much faster. We can just get on with delivering without having to undo things or treat everything as precious or dangerous to change.

It took some hands on work and getting over the initial mind bend to get there but I wouldn’t go back to what we had. I would likely only make my next move if they had a similar setup in place or were open to change.


Exactly what we've been doing for 2+ years at my job.

It's amazing.

Snowflake is amazing, but watch out for search optimization costs (it's great for append-only) and left joins taking FOREVER (avoid left joins as much as possible for large datasets).


What makes left joins perform poorly in Snowflake?


It has to join the result of each previous join to the next join, and if you have a lot of joins this can get out of hand.

We have a lot of joins in our final fct orders model from our intermediate table, and it looks like this:

    from foo
    left join bar   on bar.common_id   = foo.common_id
    left join baz   on baz.common_id   = foo.common_id
    left join qux   on qux.common_id   = foo.common_id
    left join waldo on waldo.common_id = foo.common_id

So waldo joins to qux, which joins to baz, and so on... I call it a "staircase join", as that's what it looks like in the SF profiler.


The beautiful (and scary thing) about cloud data warehouses is that you can scale them significantly, which means bad SQL mostly becomes a cost issue as opposed to a speed issue.

Also, dbt makes it easier to persist tables instead of views, which has a massive performance improvement.


The dbt community will frequently run Snowflake, and its SQL engine is very good.

https://medium.com/@jthandy/how-compatible-are-redshift-and-...


So, it's like a makefile with some SQL to do ELT?


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html

Edit: if you didn't mean it dismissively and want to clarify that, I'd be happy to remove this comment. We unfortunately see quite a few comments along the lines of "so basically, this is just super-simple $thing and therefore this is dumb" and I interpreted yours through that lens.


Is it a dismissal or is it someone asking for affirmation on their understanding of the product? I think it can be read both ways.


If it is for affirmation, the answer is yes. I even wrote a similar tool using SCons. However, dbt is far better.


I wavered about that for a little while and then settled on the dismissive interpretation.

If I misread the comment and elchief wants to clarify, I'd be happy to apologize and correct the mistake.


I actually really like DBT and advocated for it. It is, however, quite a simple tool: just some wrappers and a workflow around SQL transformations. Maybe its simplicity is its quality, but the OP you replied to isn't as far off the mark as similar comments usually are.


Nope, it has lots of other fantastic features such as a DAG, macros, etc.


make is the OG DAG tool. And it has macros.


In theory, you could write dbt in make. Actually, whenever I introduce workflow concepts (DAGs) to others, I tell them it is similar to how make works.


I have been trying to migrate our legacy ETLs, which are mostly SQL strings peppered around in Python code, to DBT. I like the macros, templates, the visualisation tool... and would like to make our code as DBT-compatible as possible.

However, the DB we currently use (Vertica) does not have official support, and there are dependency problems installing all of DBT. All I would like for my first milestone is to use refs and macros and better organise the SQL. It is good enough if it generates standard SQL that I can run outside DBT.

My wish:

I wish I could install select DBT packages just for what I need (templates, macros, refs, dependency) and still make sure I can gradually achieve DBT compatibility. At the moment it looks like all or nothing.


If your transformations are already in Python scripts wouldn’t you be better off with Airflow or Luigi? dbt seems to be aimed at SQL users primarily.


They are essentially SQL transformations at the reporting stage. Python is just the glue code; a lot of attention has gone into the connectors but not so much into the complexity within the SQL.

The problem I would like to address is complex SQL written as strings. Some parts of these repeat over multiple reports: column transformations, lookups, joins with dimensions...


Unfortunate typo: It’s about to be a decade since the Harvard article touting data science as the sexist job in the 21st century.


> It’s about to be a decade since the Harvard article touting data science as the sexist job in the 21st century.

<facepalm>


psychedelic in the tryptamine family



