We recently hosted a small online meetup at OLake where a Data Engineer at PhysicsWallah walked through why his team dropped Debezium and moved to OLake's "MongoDB → Iceberg" pipeline.
Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM
If you're someone who prefers text, here's the quick TL;DR:
Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch
2. Kafka and Connect infrastructure felt heavy when the end goal was “Parquet/Iceberg on S3”
3. Handling heterogeneous arrays required custom SMTs (Kafka Connect Single Message Transforms)
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned
What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between
-> Two modes: full load for the initial dump, then CDC for ongoing changes — exposed by a single flag in the job config
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later (see the Spark example further down)
-> Resumable, chunked full loads: a pod crash resumes instead of restarting (toy sketch of the idea after this list)
-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file (minimal Airflow sketch below).
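To make the resumable-chunk idea concrete, here's a toy Python sketch of the pattern: page through a collection in _id order, checkpoint each finished chunk, and skip already-done chunks on restart. This illustrates the concept only, it is not OLake's actual code; the database, collection, and chunk size are made up.

```python
import json
import os

from pymongo import MongoClient

CHECKPOINT = "chunks_done.json"  # records which chunk numbers already landed
CHUNK_SIZE = 50_000

client = MongoClient("mongodb://localhost:27017")
coll = client["shop"]["orders"]  # hypothetical database/collection

done = set(json.load(open(CHECKPOINT))) if os.path.exists(CHECKPOINT) else set()

last_id, chunk_no = None, 0
while True:
    # Page in _id order so chunk boundaries are deterministic across restarts.
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    docs = list(coll.find(query).sort("_id", 1).limit(CHUNK_SIZE))
    if not docs:
        break
    if chunk_no not in done:
        # Stand-in for the Parquet/Iceberg write: dump the chunk locally.
        with open(f"chunk_{chunk_no}.json", "w") as f:
            json.dump([str(d["_id"]) for d in docs], f)
        done.add(chunk_no)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)  # a crash now costs at most this chunk
    last_id = docs[-1]["_id"]
    chunk_no += 1
```

The point is that the checkpoint file, not the process, is the unit of progress, so a pod crash loses at most one chunk instead of the whole load.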
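And here's roughly what scheduling a sync from Airflow could look like. The DAG structure is standard Airflow 2.x API, but the `olake sync` command, flags, and config path are placeholders, not the documented CLI; the full-load vs CDC mode switch would live in that config file. Check the repo docs for the real invocation.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="mongo_to_iceberg_sync",  # hypothetical DAG name
    schedule="*/30 * * * *",         # e.g. pull CDC changes every 30 minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Placeholder command: the actual CLI and config schema are in the OLake docs.
    sync = BashOperator(
        task_id="olake_sync",
        bash_command="olake sync --config /opt/olake/mongo_to_iceberg.json",
    )
```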
Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.
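On the query side, since complex sub-docs arrive as JSON strings, a downstream Spark job can parse them with an explicit schema at read time. A minimal PySpark sketch; the table and column names are hypothetical and assume an Iceberg catalog is already configured:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("parse-subdocs").getOrCreate()

# Hypothetical Iceberg table written by the pipeline.
orders = spark.read.table("iceberg.analytics.orders")

# A nested `address` sub-doc landed as a JSON string column, e.g. `address_json`.
address_schema = StructType([
    StructField("city", StringType()),
    StructField("zip", StringType()),
])

parsed = orders.withColumn("address", F.from_json("address_json", address_schema))
parsed.select("order_id", "address.city").show()
```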
Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.
(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)
Check out the GitHub repo: https://github.com/datazip-inc/olake