
This is a bit overblown.

Is Iceberg "easy" to set up? No.

Can you get set up in a week? Yes.

If you really need a data lake, spending a week setting it up is not so bad. We have a guide[0] that will get you started in under an hour.

For smaller data (e.g. under 10 TB) where you don't need real-time, DuckDB is becoming a really solid option. Here's one setup[1] we've played around with using Arrow Flight.

If you don't want to mess with any of this, we[2] spin it all up for you.

0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws

1 - https://www.definite.app/blog/duck-takes-flight

2 - https://www.definite.app/



I think Iceberg can work in real time but the current implementations make it impossible.

I have a vision for a way to make it work. I made another comment here. Your blog posts were helpful; I dug a bit into the Duck Takes Flight code in Python and Rust.


Heads up: the logo on your site needs to be 2x'd in pixel density; it comes across as blurry on hidpi displays. Or convert it to an SVG/vector.


fixed!


If you're already in AWS, why wouldn't you use AWS Glue Catalog + AWS SDK for pandas + Athena?

You can set up a data lake, save data, and start running queries in about 10 minutes with this setup.
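For anyone who hasn't seen it, a minimal sketch of that flow with AWS SDK for pandas (the `awswrangler` package). The bucket, database, and table names are hypothetical placeholders, and it assumes AWS credentials with S3, Glue, and Athena permissions are already configured. Treat it as an outline, not something verified against a live account.

```python
def save_and_query(df, bucket, database="demo_lake", table="events"):
    """Write a pandas DataFrame to S3 as Parquet, register it in Glue, then query it via Athena.

    `bucket`, `database`, and `table` are hypothetical names for illustration.
    """
    # Imported lazily so this sketch can be loaded without the AWS dependencies installed.
    import awswrangler as wr  # AWS SDK for pandas

    # Create the Glue database if it does not exist yet.
    wr.catalog.create_database(database, exist_ok=True)

    # Write Parquet files to S3 and create/update the Glue table in the same call.
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{bucket}/{table}/",
        dataset=True,
        database=database,
        table=table,
        mode="overwrite",
    )

    # Run a query through Athena and return the result as a DataFrame.
    return wr.athena.read_sql_query(
        f'SELECT COUNT(*) AS n FROM "{table}"', database=database
    )
```

Calling it would look something like `save_and_query(my_df, "my-example-bucket")`, where the bucket is one you own.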


These days you can 'just' create an S3 tables bucket. https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...


Athena is really expensive though and you will often run into a hard limit on the size of your query.


Like most things serverless, Athena is cheap as long as you don't use it.

My company has 100s of data pipelines that are executed infrequently.

For this use case Athena is ridiculously cheap and easy to use vs most other solutions.


I never found Athena expensive. Compared to employment cost it will be minuscule.

And sometimes, if your query is CPU-intensive but the queried data size is not huge, you can get ridiculous value for money, like many CPU-days in 10 minutes for just $5 if your query covers 1 TB after partitioning.

Query size limits are also configurable.

Obviously it depends on what data you are working on, but not having to set up and pay for a computational cluster is a huge cost saving.
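To make the arithmetic concrete, here is a rough sketch of the pay-per-scan model, assuming a flat $5 per TB scanned (the figure above) and decimal terabytes. Both are assumptions, and the sketch ignores any per-query minimums or rounding, so check current pricing before relying on it. The function names are made up for illustration.

```python
# Assumption: flat on-demand rate, taken from the "$5 for 1 TB" figure in this comment.
PRICE_PER_TB_USD = 5.0
# Assumption: decimal terabytes; the actual billing unit may differ.
BYTES_PER_TB = 10**12


def athena_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimated cost in USD of one query that scans `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def monthly_cost(runs_per_month, bytes_scanned_per_run):
    """Estimated monthly cost of a pipeline that runs the same query repeatedly."""
    return runs_per_month * athena_query_cost(bytes_scanned_per_run)


if __name__ == "__main__":
    # One query over 1 TB of partitioned data.
    print(athena_query_cost(10**12))
    # A hypothetical pipeline: 30 runs a month, 20 GB scanned per run.
    print(monthly_cost(30, 20 * 10**9))
```

The point is that cost scales with bytes scanned, not with CPU time, which is why a CPU-heavy query over a modest amount of data comes out so cheap.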


Agreed.

A lot of people would worry about "vendor lock-in" here, but it's certainly convenient.



