Launch HN: Artie (YC S23) – Real time data replication to data warehouses
123 points by tang8330 on July 24, 2023 | 56 comments
Hey, I’m Robin and I’m the founder/CTO at Artie (https://www.artie.so/). We solve the problem of stale production data in the data warehouse. We’re open source, and you can try it for free here: https://github.com/artie-labs/transfer.

Specifically, we do real time data replication from databases to data warehouses. We leverage change data capture (CDC) and stream processing to perform data transfers efficiently, enabling sub-minute latency and significant cost savings. Here’s a quick demo (jumping straight to the important part): https://www.youtube.com/watch?v=uAi1tm4gd9U#t=81s.

I encountered this problem when I was a heavy data warehouse user at prior jobs. The data in our data warehouse was super lagged and analytics were always stale. Imagine a fintech company performing anti-fraud/transaction monitoring with batched ETLs and finding out that fraudulent transactions occurred 24 hours ago. This was very frustrating to me! Since my background was in distributed systems and database architecture, I knew that there was a better way to perform data transfers.

The more teams I spoke with, the more I realized this was a real pain point. People wanted real time data for analytics, fraud alerting, transaction monitoring, and training AI/ML models; but there wasn’t an easy out-of-the-box solution. Companies were constrained on either latency, schema integrity/data usability, or data accuracy. Companies started telling me that if I built a tool that was robust and near real time, but also maintained schema integrity and data accuracy, they would very happily pay for it.

So I built Artie, a real time, open-source data streaming solution to transfer data from databases to data warehouses and handle schemas automatically in-flight (DMLs and DDLs).

Typical ETL solutions leverage batched processes that run on a schedule (DAGs, Airflow), which cannot achieve real time data syncs. This means that when companies aggregate production data into their data warehouse, the underlying data is always stale.

Streaming change data capture (CDC) logs is a more efficient way to transfer data, and helps lower networking/data ingestion costs as well. However, building data pipelines with CDC streaming is complicated. I wanted Artie to be the tool that abstracts away that complexity so that any company can benefit from having real time data.

A big challenge is implementing CDC streaming for stateful data (i.e. row updates/deletes) and schema changes (DDLs), which most streaming solutions just don’t address, meaning that complexity is passed down to the customer. A lot of in-house streaming solutions leverage some combination of Debezium + Kafka/Kinesis + Apache Flink and are able to achieve near real time syncs, but they only handle append-only events (inserts) and don't handle updates/deletes (DML) or schema changes/schema evolution (DDL). Not handling these means the data at the destination doesn't look exactly like the production database, which obscures the source of truth. You end up having to do additional work to make the data warehouse tables match the source DB.

So how do we offer a robust CDC streaming solution? We grab CDC logs using Debezium and/or our custom connector (which solves for certain edge cases that Debezium doesn’t handle) and push them into Kafka (or Google Pub/Sub). Kafka helps ensure ordering and ease of recovery upon an outage - we use one topic per table, and the partition key is the primary key(s) to ensure no out-of-order writes. Artie then consumes these events from Kafka; an in-memory store backed by our typing library infers schemas (DML and DDL), performs optimizations like deduplication, and then flushes data to the data warehouse. When the data warehouse confirms a successful merge, we then commit the offset within Kafka. This all happens with sub-minute data latency even at high volumes (several TBs or billions of rows).
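To make the flush step concrete: per batch, the end result is essentially a single MERGE keyed on the primary key(s). A rough Snowflake-flavored sketch (table/column names and the is_deleted flag are illustrative, not our actual generated SQL):

    MERGE INTO analytics.orders AS t
    USING staging.orders_batch AS s   -- deduped batch: latest event per primary key
      ON t.id = s.id
    WHEN MATCHED AND s.is_deleted THEN DELETE
    WHEN MATCHED THEN UPDATE SET status = s.status, amount = s.amount, updated_at = s.updated_at
    WHEN NOT MATCHED AND NOT s.is_deleted THEN
      INSERT (id, status, amount, updated_at) VALUES (s.id, s.status, s.amount, s.updated_at);

Only once the warehouse acknowledges that merge do we commit the Kafka offset, so a failed flush simply gets retried from the same offset.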

However the data looks in your database, it should look exactly the same in your data warehouse. We also strive to handle all data types with no exceptions - i.e. supporting TOAST columns, composite keys as primary keys, arrays, large rows (>1 MB), etc. (I say “strive” because I’m sure we haven’t seen all possible data types yet!).

We’ve been live for a while now. Several companies use us to update all their analytic dashboards in real time (with dashboards built on top of their data warehouse). Fintech platforms use us to perform financial transaction monitoring. A utilities software platform uses us to grab video/photo data to perform risk/hazard assessment against ML models.

Artie is live and supports PostgreSQL, MySQL, MongoDB sources and Snowflake, BigQuery, Redshift destinations. We make money from our hosted service. We charge based on usage (# of rows transferred per month, not including initial snapshots).

We’d love for you to try it out! You can get started with the open source version here: https://github.com/artie-labs/transfer. We have a small OSS Slack community (www.artie.so/slack) – feel free to ping us for help or any other requests.

For our hosted version, we need to ensure that we have enough storage and compute capacity provisioned, so we’re asking cloud users to hop on a quick call with us before we activate your account. Eventually we’ll have easy self-serve functionality but that’s not 100% built yet, so for now we set up a Slack channel to ensure smooth deployments. If you’re willing to work with us on that, we’ll be super excited to show you what we’ve got. Just email us at founders@artie.so or request access from https://www.artie.so.

We’d love for you to try the OSS or hosted solution and give us feedback! We’re eager to improve the product and test it against various workloads and data types :)




(None of the below is meant to diminish the work done by the author/poster.)

> This means that when companies aggregate production data into their data warehouse, the underlying data is always stale.

This is intentional and desirable.

The classic piece on this is this one by Dan McKinley https://mcfunley.com/whom-the-gods-would-destroy-they-first-...

Something McKinley doesn't address is that it's quite advantageous if the values in your data warehouse don't change intra-day, because this lets business users reach consensus. Whereas if Bob runs a report and gets $X, and Alice runs the same report 5 minutes later and gets $Y, that creates confusion (much more than you would expect). I recall a particular system I built that refreshed every 6 hours (limited by upstream), which Marketing eventually asked me to dial back to every 24 hours because they couldn't stand things changing in the middle of the day.

Now of course I see you're targeting more real-time use cases like fraud detection. That's great! But why would you run a fraud detection process out of your data warehouse, which likely doesn't even have a production-grade uptime SLA? Run it out of your production database, that's what it's for!


Snowflake and BigQuery support as-of queries. Redshift supports snapshots. One could easily get a report that's consistent over time.
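Rough examples (table names made up):

    -- Snowflake time travel
    SELECT SUM(amount) FROM orders AT(TIMESTAMP => '2023-07-24 00:00:00'::timestamp_tz);

    -- BigQuery
    SELECT SUM(amount)
    FROM `project.dataset.orders`
      FOR SYSTEM_TIME AS OF TIMESTAMP('2023-07-24 00:00:00 UTC');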


Thanks for your feedback!

> Something McKinley doesn't address is that it's quite advantageous if the values in your data warehouse don't change intra-day, because this lets business users reach consensus. Whereas if Bob runs a report and gets $X, and Alice runs the same report 5 minutes later and gets $Y, that creates confusion (much more than you would expect). I recall a particular system I built that refreshed every 6 hours (limited by upstream), which Marketing eventually asked me to dial back to every 24 hours because they couldn't stand things changing in the middle of the day.

If they want to see a consistent view in the report, you could bound this. A few options:

1/ SELECT * FROM FOO WHERE DATE_TRUNC('day', updated_at) < DATE_TRUNC('day', DATEADD(day, -1, CURRENT_DATE()));

If your dataset doesn't contain such a column, you can turn on `artie_updated_at`, which will provide an additional column with the updated_at field to support incremental ingestion.

2/ If you have stateful data, you could also explore creating a Snowflake task and leveraging the time travel functions to create a "snapshot" if your workload depends on it (rough sketch after this list).

3/ Also, if you _did_ want this to be more lagged, you can actually increase the flushIntervalSeconds [1] to 6h, 24h, whichever time interval you fancy. You as the customer should have maximum flexibility when it comes to when to flush to DWH.

4/ You can also set the analytical report in Looker / Mode to refresh daily. [2]
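For 2/, a rough sketch of what that could look like in Snowflake (names and schedule are illustrative, not something Artie sets up for you):

    -- materialize a copy once a day so intraday reports don't move
    CREATE OR REPLACE TASK daily_orders_snapshot
      WAREHOUSE = reporting_wh
      SCHEDULE = 'USING CRON 0 0 * * * UTC'
    AS
      CREATE OR REPLACE TABLE analytics.orders_snapshot AS SELECT * FROM analytics.orders;

    ALTER TASK daily_orders_snapshot RESUME;

    -- or, ad hoc, query the live table as of an hour ago via time travel
    SELECT * FROM analytics.orders AT(OFFSET => -3600);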

> Now of course I see you're targeting more real-time use cases like fraud detection. That's great! But why would you run a fraud detection process out of your data warehouse, which likely doesn't even have a production-grade uptime SLA? Run it out of your production database, that's what it's for!

You can certainly do this in the production DB (that was our original hypothesis as well!). However, after talking to more companies, it has become clear to us that folks running fraud algos actually want to join this data across various data sets. Further, a DWH provides a nice visualization layer on top.

Of course, you could go with something even more bespoke by utilizing real-time DBs such as Materialize / Rockset / RisingWave. That just comes with trade-offs, such as increased architectural complexity.

There are also plenty of additional use cases this can unlock: since the DWH is a platform, any post-DWH application (such as reverse ETL) can benefit from less lag.

[1] https://docs.artie.so/running-transfer/options

[2] https://mode.com/help/articles/report-scheduling-and-sharing...


I think you missed the parent's point - your USP is real-time replication. So everything you're proposing makes it not real time. Your USP is now worthless (in that context) and your competitors are numerous.


Hm, perhaps I wasn't being clear, apologies for that.

What I am proposing above are ways to provide a view for teams that do not want real-time data, while keeping the underlying dataset real-time.


Huh? The parent's point was that your underlying dataset is always real-time. There's no issue querying a data warehouse when all you're doing is looking for a simple transactional report.


I think their point is they have a real-time warehouse that can also be used in “stale snapshot” mode.


I agree with the parent's point. I also don't think the DWH is the primary use case for your platform.

I have seen architectures where databases are siloed within departments and data has to be replicated across departments' physical databases, on the same network or different ones - mostly in banks, insurance companies, and old-school industries. In that scenario, a daily batch would run that replicates and populates the tables and kick-starts business processes. A platform like this would make sense there. Another use case I can think of is reverse ETL, but there are many tools custom-made for that.

As for fraud analysis, there are many vendor tools that do exactly that; asking people to visualize and implement a full-blown use case is hard.

I might be naive, but from a distance I don't see the USP separating Artie from Airbyte, Hevo Data, Fivetran, Stitch, and the others.


I’ve been in this space and can appreciate your design decisions. They are meaningful, but a couple of comments on the differentiating factors. DML is tackled very similarly in traditional pipelines, i.e. execution of the log at the target, whether that is materialised or as a view. DDL is a nice-to-have from a technical achievement perspective, but changes to a data schema are not live: they have to be deployed, approved, etc. So the flexibility of accommodating them in the pipeline is a benefit from an ease-of-deployment perspective, but it could also be alarming for data officers who may feel like they are losing a control lever by not having the pipeline block when a schema changes outside normal deployment processes. Finally, the biggest issue with CDC always ends up being the seed loads, recoveries and the incremental snapshot strategies.


Thanks for the comment!

Your comment regarding DDL is interesting.

Today, this is what happens:

1/ If a column doesn't exist in the destination, we'll create it based on our typing inference from the data type (important: not the data value) - rough sketch below.

2/ Certain tools will handle automatic column data type conversion if a change like this is detected at the source. We do not do this. We will simply hard fail and cause head-of-line blocking. The reasoning: this is an anti-pattern and should be rare, so it's okay to raise an error and require manual intervention for a breaking change like this.

3/ If the column has been dropped from the source, you as the end user can decide whether this column should also be dropped in the destination, or not. The default is not to drop it.

^ We hear from more customers that they explicitly don't want columns to be dropped, because it could cause downstream errors, such as other views / tables not compiling due to referencing a non-existent column.

We haven't heard much from folks that don't even want columns to be added. If there is a need, we can definitely add that as a config option to provide maximum configurability.
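To make 1/ concrete, the net effect on the destination is just an additive DDL, roughly like this (illustrative names; the actual type comes from our inference on the source column type):

    ALTER TABLE analytics.users ADD COLUMN signup_source VARCHAR;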

> Finally, the biggest issue with CDC always ends up being the seed loads, recoveries and the incremental snapshot strategies.

Yep totally. On the recovery bit, this is exactly why we are leveraging Kafka. If there are any particular issues, we simply don't commit the offset and cause head-of-line blocking.

On the incremental snapshot and recoveries bit, we primarily leverage Debezium's DDD-3 high watermark strategy [1] for MySQL and MongoDB. Postgres has a different issue in that replication slots can grow really fast, esp on AWS! [2]. We ended up writing our own custom snapshotter for Postgres that is Debezium compatible to onboard customers that have a massive dataset and cannot afford to have a read lock on their WAL.
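For anyone who wants to keep an eye on this on their own Postgres, a stock catalog query (nothing Artie-specific) shows how much WAL each slot is retaining:

    SELECT slot_name,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;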

[1] https://github.com/debezium/debezium-design-documents/blob/m... [2] https://www.morling.dev/blog/insatiable-postgres-replication...


The custom snapshotter sounds interesting, potentially a good selling point. On the recovery end, in my designs I have also found it useful to have synthetic events, so that when CDC breaks I can stitch logs together and not just start from scratch and lose history. I can see you are in the depths of it, more than I’ve been for a while. Wish you the best.


> Postgres has a different issue in that replication slots can grow really fast, esp on AWS! [2]. We ended up writing our own custom snapshotter for Postgres that is Debezium compatible to onboard customers that have a massive dataset and cannot afford to have a read lock on their WAL.

Does Debezium's DDD-3 watermark (DBLog) implementation for Postgres not process the WAL quickly enough? We don't use it ourselves either, but architecturally it appears it would reasonably bound how long the WAL can remain un-read?

Agreed that many production DBs people care about have pretty severe limitations here! Managed Supabase is another good example.


On a single unbounded (CPU + mem) Debezium instance running on a VM extracting from Postgres, I was able to clock about 7-10M rows/hr. You could increase the # of tasks, but then it'll hinder your DB perf. Also, this runs against your primary DB.

We found it far more efficient and less risky to do CDC streaming and snapshotting without a read lock in parallel, to two different topics. Once the snapshot is done and drained, we then move on to drain the CDC topic.


I found in integration/warehousing that if a source system “suddenly” has a new column — it’s best if we can automatically just bring it in. We would tend to do this in a way that it can’t break the warehouse, doesn’t affect people downstream of us. We can then choose to make it available to others, or not — but the moment that data is available we start hoovering it in.


That’s the logical thing in the context of this domain. In a broader data domain there are other considerations, not always logical, so if that ends up configurable it’s a feature that broadens application scope.


Hey, brief feedback, but "5,000,000,000+" rows processed might be a red flag for some. Many individual customers might do this per day, so to say that's all the company has done so far might put them off.

You've said that initial imports are free but, anecdotally, initial imports ended up being a somewhat regular occurrence for us as we hit issues that required a re-import, table rewrites, breaking schema changes, and so on. Do you pay full cost for subsequent full imports?

Lastly, pricing. I feel like you really need a dollar figure on this even just to start the discussion. "Call us for pricing" is fine for enterprise plans, but in a ~100 person startup I would have just passed immediately on to others that provide pricing details. I'm not going to spend time trialling something without knowing if the pricing is in the right ballpark as it's a waste of my time if it isn't. My perception is that this would be substantially more expensive than batch replication competitors such as Stitch, but I don't know.


> Hey, brief feedback, but "5,000,000,000+" rows processed might be a red flag for some. Many individual customers might do this per day, so to say that's all the company has done so far might put them off.

Appreciate the feedback! Totally agree and it's something we are working towards :). We learned a ton going from nothing -> billions; certain functions, backfill strategies, etc. just don't work at that scale. I'm sure we'll have another equally exciting learning curve when scaling from B/month -> B/day!

> You've said that initial imports are free but, anecdotally, initial imports ended up being a somewhat regular occurrence for us as we hit issues that required a re-import, table rewrites, breaking schema changes, and so on. Do you pay full cost for subsequent full imports?

Thanks for pointing this out. We plan to make all backfills free of charge.

> Lastly, pricing. I feel like you really need a dollar figure on this even just to start the discussion. "Call us for pricing" is fine for enterprise plans, but in a ~100 person startup I would have just passed immediately on to others that provide pricing details. I'm not going to spend time trialling something without knowing if the pricing is in the right ballpark as it's a waste of my time if it isn't. My perception is that this would be substantially more expensive than batch replication competitors such as Stitch, but I don't know.

Makes sense. A lot of folks think that streaming is more expensive than batched; what we've found with our initial customers is the opposite. Streaming is able to distribute the load to the DWH, and as a result customers on Snowflake can use a much smaller vDWH when working with Artie. On the "call us for pricing" front, we are planning to provide a calculator / graph to estimate costs before talking to a human. However, we are learning that a lot of companies have bespoke 1yr/2yr contracts with their data provider, so we are still figuring out our own pricing and packaging at the moment.


I just wanted to second the GP's point on pricing - like them, I see "call me" pricing as a hard no. IMO it's fine for the highest tier (Enterprise or whatever), but lower tiers should have fixed and public pricing.


How many of these tools are out there? This is like the 4th CDC to Data warehouse tool I've seen this year and I'm struggling to understand why they all exist in parallel.

I just found Estuary with a colleague yesterday, which didn't work, and have previously seen at least two other tools promising the same CDC-to-data-warehouse in-a-box pipelines. What makes this tool different?


Hi, I'm Estuary's CTO (https://estuary.dev). Mind speaking a bit more about what didn't work?

We put quite a bit of effort into our CDC connectors, as it's a core competency. We have numerous customers using them at scale successfully, but they can be a bit nuanced to get configured. We're constantly trying to make our onboarding experience more intuitive and seamless... it's a hard problem.


For us, what we are striving to do differently is:

1/ DFS vs. BFS. We are planning on rolling out connectors slowly vs. building tens / hundreds / thousands of connectors to try to attract a broader audience. Those that have tried to replicate data between OLTP and OLAP know how painful it is, and we really want to solve this pain point before we move on to new sources. In addition, we're planning on providing more value than just replicating data from source to destination. We're planning on integrating our telemetry library [1] with Datadog so that customers can:

* Centralize all metrics - see Artie metrics without coming to our dashboard; it's integrated with your existing tools.

* Get cookie-cutter monitors for anomaly detection.

* Get better table quality checks.

2/ We do not want to change user behavior. We are using pretty standard tools to solve this problem, such as Kafka, Pub/Sub and Red Hat Debezium. If you are already using one of these tools, we can just integrate vs. standing up a whole suite of services just for the data pipeline.

* If you already have CDC events being emitted for an event-driven architecture, we'll skip deploying Debezium and just deploy our consumer.

* If you have Kafka enabled and want to also consume CDC events, we'll just deploy Debezium to publish to your Kafka vs. ours.

3/ Ease of use. This goes without saying, but plenty of tools out there have broken links in their documentation, extremely confusing UI, etc. We really look up to Fivetran in this regard and we try to make the onboarding process for new connectors as simple as possible.

Do you think there is anything missing from the other CDC or data replication tools out in the market? Let me know and I'm happy to see how we can help!

[1] https://docs.artie.so/telemetry/overview


Many of them really just suck (depending on your workload)


Hi Robin, this looks like a very cool project.

Note though that the Elastic License 2.0 is not an open source license, so if this is your license choice I'd avoid calling the code you have on GitHub OSS.


You mean AWS can't deliver this as a service so it's not open source? That's an interpretation and not gospel


In 2016, I wrote a pretty detailed answer to a "Spark vs. Redshift" question. This was in the very early days of what today I guess is called "the modern data stack".

The core of the answer was that cloud warehouses are not suitable for real-time use cases, because the batch processing and transformations take too long. If you want real-time, you need to pay up - hence Databricks / Spark. I did call out the fraud use case in that answer.

There were 1st generation ETL tools like Alooma that tried to go into the direction of streaming, and they pushed the limits.

Back then (we had built a cloud warehouse monitoring tool), the closest to real-time I'd ever seen any company get was IronSource. The time between an event and when that event was available in a dashboard was five minutes (they were using Redshift).

I'll stick my neck out and say that certain industries will be all over Artie, whereas others will shrug their shoulders.

The industries that I think will be all over Artie:

- FinTech
- Insurance
- AdTech
- Gaming
- Publishing
- Logistics / Delivery

These are industries where a couple of minutes of difference in data recency can make a difference of millions of dollars. And you'll probably cost less than existing streaming solutions, which is obviously nice. But I think the real advantage will be simplicity.

I know you can't support all destinations at once, and need to go with where demand is. But I would expect that the Materialize and Clickhouse crowds are good target users for you.

Good luck, this is an exciting product!


Thank you so much for the support!


We've previously used Stitch Data (https://www.stitchdata.com/) and found it super easy to stream CDC to Redshift and Snowflake.

Will give Artie a go.


Congrats on your project. Reliable CDC is hard to set up.

I worked a bit in the space, focusing on APIs instead of DBs, which is in the end a quite different problem. My goal was to leverage LLMs to auto-build connectors on the fly.

I started by building a lightweight ETL, that I open-sourced (https://github.com/BenderV/universal-data).

I left the space because I realized that I didn't want to work on this problem, even though I believe in the "AI" approach and think simplifying data transfer (or distributing compute) is one of the key factors to scaling data usage.


Good to see more entrants in the real-time analytics space. I've been waiting years for most databases to embrace streaming. I'm betting on these real-time data platforms needing a faster dashboard than Tableau... that's what I'm working on: https://www.timestored.com/pulse/. You may find it useful to show off demos. We are partnering with some startups like yourself. Email me if interested. And good luck!


This looks really good - and easy to configure! And love that it's open source.

For those using Elixir and who want to work with Postgres CDC at a lower level, I've put together a little CDC library: https://github.com/cpursley/walex


> the partition key is the primary key(s) to ensure no out of order writes.

Can you get out of order writes if the pk for a row changes?


Great question! Say this happened:

* CREATE TABLE user_test (id int primary key, name text);

* INSERT INTO user_test (id, name) VALUES (1, 'foo');

* UPDATE user_test set id = 2 where id = 1;

When the UPDATE call is invoked, there will be 2 events emitted to Kafka:

1/ A DELETE event for id = 1

2/ A CREATE event for id = 2

Here's the detailed events that get emitted: https://gist.github.com/Tang8330/7a9450f95fbae486a4393abdd49...


Interesting! I'm both surprised and also not surprised it's modeled that way.


Eventual consistency FTW!


Congrats!! Check out also CloudQuery (https://github.com/cloudquery/cloudquery) an open source ELT that support DB->DB without Kafka or Debezium (Founder here) :)


Thank you!!


Is the value here primarily in low latency updates? i.e. the fraud detection scenario

My main pain points with ETLs are with systems that don't provide good CDC mechanisms to begin with - not Postgres et al., but the ones stuck behind a bolt-on ODBC driver or rate-limited APIs.


Main value props are low latency OLTP data in your DWH and lower DWH ingestion costs.


Very cool, are you guys focused on databases as sources or do you plan to add API-based sources too (e.g. Stripe Customers)? Currently using Airbyte, but something more real-time would be beneficial to us.


For now, we're super focused on databases as sources. We really want to do this well before we move on to other data sources such as APIs.


estuary.dev may be a fit for you (am CTO).

(Competitive product and I feel weird about replying in Artie's thread, but tang8330 has said they're not serving this segment)


Take a look at Nango.dev


This will be a game changer. I love this so much - it will increase the speed and reliability of my data! Sent over an email to book a meeting.


The blocker for testing this out and using it at my place of employment is the lack of binary data type support for MySQL; any plans to add this?


Definitely. Do you expect the resulting data to also be in binary / bytes format in your DWH?

I ask because there's a workaround by setting `binary.handling.mode` to a STRING type [1].

Transfer will then automatically pick this up and write this as a B64 string to the DWH.

[1] https://debezium.io/documentation/reference/stable/connector...
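And if you do need real bytes back on the warehouse side, Snowflake can decode that string, e.g. (column/table names are just an example):

    SELECT BASE64_DECODE_BINARY(payload_b64) AS payload FROM my_table;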


Thanks for the reply! We'll be keeping an eye on it


Hi, great project! I was wondering if you are planning on extending support to DynamoDB?


Hi! Yes, our next 2 connectors are going to be S3 and DynamoDB.

GitHub Issue: https://github.com/artie-labs/transfer/issues/157

If you'd like to be another design partner for us on this, do reach out. I'm at robin@artie.so.


Can this also be used with a data lake or the Databricks setup I use to build a data mart?


Definitely. For our Redshift destination, we already stage temporary tables in S3 before merging.

We will be creating an S3 destination with TSV, Avro and Parquet format support very soon.


Do you support PostGIS? This seems to be an issue with a lot of solutions I've seen...


Not at the moment, but no reason(s) why we shouldn't. Debezium already supports this [1] and it'd be a minor refactor on our end to support. If that's something you'd want, LMK!

[1] https://debezium.io/documentation/reference/stable/connector...


Seconding this! We're in the short-term rental space, and while properties don't change locations frequently (though mistakes get corrected, and also https://arkup.com/ exists...), the polygon definitions and hierarchical relationships of regions-of-interest for search and discovery are often being updated, and being able to have real-time geometry intersections rather than rerunning scripts periodically to rebuild our analyses would be incredibly powerful. And certainly for use cases like trucking fleet management etc., real-time analytics on location data across the entire gamut is vital.


That's fascinating! Thanks for providing more color; support for geometric shapes is coming! https://github.com/artie-labs/transfer/issues/155


How are you performing initial sync?


For Postgres, we have our own custom snapshotter that is capable of doing parallel snapshots against your read replica without incurring WAL growth. More details here: https://news.ycombinator.com/item?id=36855338

For MySQL and MongoDB, we rely on Debezium to perform the initial snapshots.



