Hacker News new | past | comments | ask | show | jobs | submit login
Debezium to olake.io – PhysicsWallah switch for CDC
3 points by pkhodiyar 9 days ago | hide | past | favorite | 1 comment
We recently hosted a small online meetup at OLake where a Data Engineer at PhysicsWallah, walked through why his team dropped Debezium and moved to OLake’s “MongoDB → Iceberg” pipeline.

Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM

If you are someone who prefer text, here’s the quick TLDR;

Why Debezium became a drag for them: 1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch 2. Kafka and Connect infrastructure felt heavy when the end goal was “Parquet/Iceberg on S3” 3. Handling heterogeneous arrays required custom SMTs 4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows 5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned

What changed with OLake? -> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between

-> Two modes: full load for the initial dump, then CDC for ongoing changes — exposed by a single flag in the job config -> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later

-> Resumable, chunked full loads: a pod crash resumes instead of restarting

-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file.

Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.

Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.

(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)

Check out github repo - https://github.com/datazip-inc/olake






This looks great!! Do you guys support schema evolution as well, because Synching MongoDB data comes with that challenge?



Consider applying for YC's Summer 2025 batch! Applications are open till May 13

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: