We recently hosted a small online meetup at OLake where a Data Engineer at PhysicsWallah walked through why his team dropped Debezium and moved to OLake's "MongoDB → Iceberg" pipeline.
Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM
If you're someone who prefers text, here's the quick TL;DR:
Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch
2. Kafka and Connect infrastructure felt heavy when the end goal was “Parquet/Iceberg on S3”
3. Handling heterogeneous arrays required custom SMTs (Kafka Connect Single Message Transforms)
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned
What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between
-> Two modes: full load for the initial dump, then CDC for ongoing changes — exposed by a single flag in the job config
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later (see the Spark example further down)
-> Resumable, chunked full loads: a pod crash resumes instead of restarting (toy sketch of the idea after this list)
-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file (minimal Airflow sketch below).
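To make the resumable-chunk idea concrete, here's a toy Python sketch of the pattern: page through a collection in _id order, checkpoint each finished chunk, and skip already-done chunks on restart. This illustrates the concept only, it is not OLake's actual code; the database, collection, and chunk size are made up.

```python
import json
import os

from pymongo import MongoClient

CHECKPOINT = "chunks_done.json"  # records which chunk numbers already landed
CHUNK_SIZE = 50_000

client = MongoClient("mongodb://localhost:27017")
coll = client["shop"]["orders"]  # hypothetical database/collection

done = set(json.load(open(CHECKPOINT))) if os.path.exists(CHECKPOINT) else set()

last_id, chunk_no = None, 0
while True:
    # Page in _id order so chunk boundaries are deterministic across restarts.
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    docs = list(coll.find(query).sort("_id", 1).limit(CHUNK_SIZE))
    if not docs:
        break
    if chunk_no not in done:
        # Stand-in for the Parquet/Iceberg write: dump the chunk locally.
        with open(f"chunk_{chunk_no}.json", "w") as f:
            json.dump([str(d["_id"]) for d in docs], f)
        done.add(chunk_no)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)  # a crash now costs at most this chunk
    last_id = docs[-1]["_id"]
    chunk_no += 1
```

The point is that the checkpoint file, not the process, is the unit of progress, so a pod crash loses at most one chunk instead of the whole load.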
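And here's roughly what scheduling a sync from Airflow could look like. The DAG structure is standard Airflow 2.x API, but the `olake sync` command, flags, and config path are placeholders, not the documented CLI; the full-load vs CDC mode switch would live in that config file. Check the repo docs for the real invocation.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="mongo_to_iceberg_sync",  # hypothetical DAG name
    schedule="*/30 * * * *",         # e.g. pull CDC changes every 30 minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Placeholder command: the actual CLI and config schema are in the OLake docs.
    sync = BashOperator(
        task_id="olake_sync",
        bash_command="olake sync --config /opt/olake/mongo_to_iceberg.json",
    )
```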
Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.
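On the query side, since complex sub-docs arrive as JSON strings, a downstream Spark job can parse them with an explicit schema at read time. A minimal PySpark sketch; the table and column names are hypothetical and assume an Iceberg catalog is already configured:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("parse-subdocs").getOrCreate()

# Hypothetical Iceberg table written by the pipeline.
orders = spark.read.table("iceberg.analytics.orders")

# A nested `address` sub-doc landed as a JSON string column, e.g. `address_json`.
address_schema = StructType([
    StructField("city", StringType()),
    StructField("zip", StringType()),
])

parsed = orders.withColumn("address", F.from_json("address_json", address_schema))
parsed.select("order_id", "address.city").show()
```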
Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.
(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)
Check out the GitHub repo: https://github.com/datazip-inc/olake