The main use cases we've seen: 1) You made a change to some code that transforms data (SQL/Python/Spark) and want to make sure the changes in the data output are as expected.
2) Same as (1) but there is also some code review process. In addition to checking someone's source code diff, you can see the data diff.
3) You copy datasets between databases, e.g. PostgreSQL to Redshift and want to validate the correctness of the copy (either ad-hoc or on a regular basis).
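For (3), a common first check is to run the same aggregate query on both sides and compare the results; it catches row-count and sum-level drift but not row-level differences, which is where a proper diff helps. A rough sketch (table and column names are made up for illustration):

    -- Run the same query on PostgreSQL (source) and Redshift (destination)
    -- and compare the outputs. "events" and its columns are hypothetical.
    SELECT
        COUNT(*)                 AS row_count,
        COUNT(DISTINCT event_id) AS distinct_ids,
        SUM(amount)              AS total_amount,
        MIN(created_at)          AS min_created_at,
        MAX(created_at)          AS max_created_at
    FROM events;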
Most folks I know are doing this (1 and 2) by testing against a replica (or, in the case of Snowflake, just copying the DB or schema) ... then running data tests locally and downstream (Great Expectations, dbt tests, or some Airflow-driven tests).
Is the value prop “you don’t need all the grunt work” as opposed to the approach above?
Data testing methods can perhaps be broken down into two main categories:
1. "Unit testing" – validating assumptions about the data that you define explicitly and upfront (e.g. "x <= value < Y", "COUNT(*) = COUNT(DISTINCT X)" etc.) – what dbt and great_expectations helps you do. This is a great approach for testing data against your business expectations. However, it has several problems: (1) You need to define all tests upfront and maintain them going forward. This can be daunting if your table has 50-100+ columns and you likely have 50+ important tables. (2) This testing approach is only as good as the effort you put to define the tests, back to #1. (3) the more tests you have, the more test failures you'll be encountering, as the data is highly dynamic, and the value of such test suites diminishes with alert fatigue.
2. Diff – identifies differences between datasets (e.g. prod vs. dev or source DB vs. destination DB). Specifically for code regression testing, a diff tool shows how the data has changed without requiring manual work from the user. A good diff tool also scales well: it doesn't matter how wide/long the table is – it'll highlight all differences. The downside of this approach is the lack of business context: e.g. is the difference in 0.6% of rows in column X acceptable or not? So it requires triaging.
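To make both approaches concrete in plain SQL (the schema, table, and column names below are hypothetical, not from any particular stack): an assertion-style test is essentially a query that returns violating rows, so an empty result means the test passes.

    -- Assertion-style test (dbt/great_expectations flavor): select the rows
    -- that break the expectation; zero rows returned = test passed.
    SELECT *
    FROM orders
    WHERE amount < 0
       OR amount >= 100000
       OR order_id IS NULL;

And a naive row-level diff between two builds of the same table can be expressed as a full outer join on the primary key (PostgreSQL-flavored SQL):

    -- Compare the prod and dev builds of the same table on the primary key.
    SELECT
        COALESCE(p.order_id, d.order_id) AS order_id,
        p.amount AS prod_amount,
        d.amount AS dev_amount
    FROM prod.orders p
    FULL OUTER JOIN dev.orders d ON p.order_id = d.order_id
    WHERE p.order_id IS NULL                   -- row exists only in dev
       OR d.order_id IS NULL                   -- row exists only in prod
       OR p.amount IS DISTINCT FROM d.amount;  -- value changed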
Ideally, you have both at your disposal: unit tests to check your most important assumptions about the data, and diff to detect anomalies and regressions during code changes.
I think doing a deeper analysis of why this is a good tool in addition to dbt would be useful for me to understand. Locally Optimistic [] has a Slack channel and does vendor demos, with a _very_ competent data analytics/engineering membership. I think you'd do well to join and do a demo!
We have a signup-free sandbox where you can see the exact views we provide, including schema diff: https://app.datafold.com/hackernews