
Yes, and I also believe:

Experienced Spark / data engineering teams would not assume S3 is readily usable as a filesystem.

This [1] seems like a good guide on how to configure Spark for working with cloud object stores, while recognizing the limitations and pitfalls.

[1]: https://spark.apache.org/docs/latest/cloud-integration.html
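
Roughly, that guide boils down to using an S3A committer instead of the default rename-based committer. A minimal sketch of what that setup looks like (assumes the spark-hadoop-cloud module is on the classpath; the bucket name is made up):

    from pyspark.sql import SparkSession

    # Use the S3A "magic" committer rather than rename-based commits,
    # per the cloud-integration guide.
    spark = (
        SparkSession.builder
        .appName("s3a-example")
        .config("spark.hadoop.fs.s3a.committer.name", "magic")
        .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .getOrCreate()
    )

    # Reads and writes then go through s3a:// URIs.
    df = spark.read.parquet("s3a://my-bucket/input/")
    df.write.parquet("s3a://my-bucket/output/")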

---

Amazon EMR offers a managed way to run Hadoop or Spark clusters, and it implements EMRFS [2], its own filesystem layer for using S3 as storage.

[2]: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-fs.h...
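
In practice that mostly means a Spark job on an EMR cluster can point at plain s3:// URIs and EMRFS handles the object-store interaction underneath. A trivial sketch (bucket and paths made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("emrfs-example").getOrCreate()

    # On EMR, s3:// paths resolve through EMRFS rather than a POSIX filesystem.
    df = spark.read.json("s3://my-bucket/raw/events/")
    df.groupBy("event_type").count() \
      .write.parquet("s3://my-bucket/curated/event_counts/")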

AWS Glue is another option, a "serverless" ETL service. Sources and destinations can be S3 data lakes read through a data catalog (Hive or the Glue Data Catalog). During processing, AWS Glue can optionally use S3 [3,4,5] for shuffle storage; a rough sketch of enabling that follows the links.

[3]: https://aws.amazon.com/blogs/big-data/introducing-amazon-s3-...

[4]: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shu...

[5]: https://aws.amazon.com/blogs/big-data/introducing-the-cloud-...
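
If I'm remembering [4] correctly, S3 shuffle is switched on via job parameters; check the doc for the exact names. A sketch of starting an existing Glue job with those parameters via boto3 (job name and bucket are made up):

    import boto3

    glue = boto3.client("glue")

    # Kick off a Glue job with shuffle files/spills written to S3.
    # Parameter names are from memory of [4]; verify against the docs.
    glue.start_job_run(
        JobName="my-etl-job",
        Arguments={
            "--write-shuffle-files-to-s3": "true",
            "--write-shuffle-spills-to-s3": "true",
            "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://my-shuffle-bucket/prefix/",
        },
    )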



I think we're talking about two different things. I was addressing the section in the article about running databases backed by S3. It's less about S3 needing to act as a filesystem, and more about all of the RDBMS features that come along with the various types of DB transactions. That's a solved problem with the libraries I mentioned, and not something I'd ever recommend building on your own. Been there, done that when those solutions were still nascent; it wasn't worth the effort vs. just using an RDBMS.

The problem EMRFS is trying to solve doesn't cover the RDBMS scenarios like row-level updates and deletes.
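
For example, with Delta Lake (standing in here for whichever of those libraries you pick), row-level deletes and updates against an S3-backed table look like ordinary DML, and the library's transaction log, not EMRFS, provides the guarantees. A sketch, assuming a Spark session already configured for Delta, with table path and columns made up:

    from delta.tables import DeltaTable
    from pyspark.sql.functions import col, lit

    # Open an existing Delta table stored on S3.
    users = DeltaTable.forPath(spark, "s3a://my-bucket/tables/users")

    # Row-level delete and update; Delta's transaction log makes these atomic.
    users.delete(col("account_status") == "closed")
    users.update(
        condition=col("country") == "UK",
        set={"country": lit("GB")},
    )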




