The main use cases we've seen: 1) You made a change to some code that transforms data (SQL/Python/Spark) and want to make sure the changes in the data output are as expected.
2) Same as (1) but there is also some code review process. In addition to checking someone's source code diff, you can see the data diff.
3) You copy datasets between databases, e.g. PostgreSQL to Redshift and want to validate the correctness of the copy (either ad-hoc or on a regular basis).
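For (3), a common first check is to run the same aggregate query on both sides and compare the results; it catches row-count and sum-level drift but not row-level differences, which is where a proper diff helps. A rough sketch (table and column names are made up for illustration):

    -- Run the same query on PostgreSQL (source) and Redshift (destination)
    -- and compare the outputs. "events" and its columns are hypothetical.
    SELECT
        COUNT(*)                 AS row_count,
        COUNT(DISTINCT event_id) AS distinct_ids,
        SUM(amount)              AS total_amount,
        MIN(created_at)          AS min_created_at,
        MAX(created_at)          AS max_created_at
    FROM events;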
Most folks I know are doing this (1 and 2) by testing against a replica (or, in the case of Snowflake, just copying the DB or schema) ... then running data tests locally and downstream (Great Expectations, dbt tests, or some Airflow-driven tests).
Is the value prop “you don’t need all the grunt work” as opposed to the approach above?
Data testing methods can perhaps be broken down into two main categories:
1. "Unit testing" – validating assumptions about the data that you define explicitly and upfront (e.g. "x <= value < Y", "COUNT(*) = COUNT(DISTINCT X)" etc.) – what dbt and great_expectations helps you do. This is a great approach for testing data against your business expectations. However, it has several problems: (1) You need to define all tests upfront and maintain them going forward. This can be daunting if your table has 50-100+ columns and you likely have 50+ important tables. (2) This testing approach is only as good as the effort you put to define the tests, back to #1. (3) the more tests you have, the more test failures you'll be encountering, as the data is highly dynamic, and the value of such test suites diminishes with alert fatigue.
2. Diff – identifies differences between datasets (e.g. prod vs. dev or source DB vs. destination DB). Specifically for code regression testing, a diff tool shows how the data has changed without requiring manual work from the user. A good diff tool also scales well: it doesn't matter how wide/long the table is – it'll highlight all differences. The downside of this approach is the lack of business context: e.g. is the difference in 0.6% of rows in column X acceptable or not? So it requires triaging.
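To make both approaches concrete in plain SQL (the schema, table, and column names below are hypothetical, not from any particular stack): an assertion-style test is essentially a query that returns violating rows, so an empty result means the test passes.

    -- Assertion-style test (dbt/great_expectations flavor): select the rows
    -- that break the expectation; zero rows returned = test passed.
    SELECT *
    FROM orders
    WHERE amount < 0
       OR amount >= 100000
       OR order_id IS NULL;

And a naive row-level diff between two builds of the same table can be expressed as a full outer join on the primary key (PostgreSQL-flavored SQL):

    -- Compare the prod and dev builds of the same table on the primary key.
    SELECT
        COALESCE(p.order_id, d.order_id) AS order_id,
        p.amount AS prod_amount,
        d.amount AS dev_amount
    FROM prod.orders p
    FULL OUTER JOIN dev.orders d ON p.order_id = d.order_id
    WHERE p.order_id IS NULL                   -- row exists only in dev
       OR d.order_id IS NULL                   -- row exists only in prod
       OR p.amount IS DISTINCT FROM d.amount;  -- value changed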
Ideally, you have both at your disposal: unit tests to check your most important assumptions about the data, and diff to detect anomalies and regressions during code changes.
I think doing a deeper analysis of why this is a good tool in addition to dbt would be useful for me to understand. Locally Optimistic [] has a Slack channel and does vendor demos, with a _very_ competent data analytics/engineering membership. I think you'd do well to join and do a demo!
We have a signup-free sandbox where you can see the exact views we provide, including schema diff: https://app.datafold.com/hackernews