
Crazy that the project took almost 4 years end-to-end, and it's still ongoing.

I had no idea anything at AWS had that long of an attention span.

It's funny and telling that in the end, it's all backed by CSVs in s3. Long live CSV!




Hi mannyv - one of the devs that worked on the migration here. It has been a pretty long project - approached with caution due to the criticality of keeping our BI datasets healthy - but the preliminary results produced year-over-year kept looking promising enough to keep after it. =)

Also, we mostly have Parquet data cataloged in S3 today, but delimited text is indeed ubiquitous and surprisingly sticky, so we continue to maintain some very large datasets natively in this format. However, while a table's data producer may prefer to write delimited text, that data is almost always converted to Parquet during compaction to produce a read-optimized table variant downstream.
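For anyone curious what that kind of conversion looks like in miniature (this is just a sketch with made-up file names, not BDT's actual compactor), pyarrow does it in a few lines:

    # Minimal sketch, not BDT's compactor: convert a delimited text file
    # into a compressed Parquet copy for downstream readers.
    # "events.csv" is a hypothetical file with a header row.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    table = pv.read_csv(
        "events.csv",
        parse_options=pv.ParseOptions(delimiter=","),
    )

    # Columnar layout + zstd compression makes downstream scans cheaper.
    pq.write_table(table, "events.parquet", compression="zstd")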


Are you all shifting over to storing as Iceberg-enriched Parquet yet and letting it (within, say, Athena) manage compaction, or at least thinking about it? Or is it not worth it since this new Ray+Parquet setup is working for you?


As alluded to in the blog post, Ray+Parquet+Iceberg is the next frontier. We'd like to make our compactor and similar procedures available in open source so that the community can start bringing similar benefits to their Iceberg workloads. Stay tuned. =)


They reference Parquet files; I'm not sure it's only CSV, or that CSV even figures in that heavily beyond the first iteration before migrating to Spark.


Is it really AWS? I don’t recall any service called BDT.


The second paragraph explains that BDT is an internal team at Amazon Retail. They used AWS and Ray to do this.


Most people are moving away from CSV for big datasets, except in exceptional cases involving linear reads (append-only ETL). CSV has one big upside: human readability. But it has many downsides: poor random access, no typing, no compression, and a complex parser that needs to handle edge cases.


Most people don't directly query or otherwise operate on raw CSV, though. Large source datasets in CSV format still reign in many enterprises, but these are typically read into a dataframe, manipulated and stored as Parquet and the like, then operated upon by DuckDB, Polars, etc., or modeled (e.g., with dbt) and pushed to an OLAP target.
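A tiny sketch of that flow with DuckDB (file and column names are made up for illustration): convert the raw CSV once, then run analytics against the columnar copy so only the referenced columns get read:

    # Sketch only; 'sales.csv', 'region', and 'amount' are hypothetical names.
    import duckdb

    con = duckdb.connect()

    # One-time conversion: ingest the raw CSV and persist a Parquet copy.
    con.execute("""
        COPY (SELECT * FROM read_csv_auto('sales.csv'))
        TO 'sales.parquet' (FORMAT PARQUET)
    """)

    # Subsequent queries hit the Parquet file, so only the referenced
    # columns are scanned rather than the whole file.
    print(con.execute("""
        SELECT region, sum(amount) AS total
        FROM read_parquet('sales.parquet')
        GROUP BY region
    """).fetchall())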


There are folks who still directly query CSV in a data lake using a query engine like Athena, Spark, or Redshift Spectrum, which ends up being much slower and consuming more resources than necessary due to full table scans.

CSV is only good for append-only workloads.

But so is Parquet, and if you can write Parquet from the get-go, you save on storage and have a directly queryable column store from the start.

CSV still exists because of legacy data-generating processes and a dearth of Parquet familiarity among software engineers. CSV is simple to generate and easy to troubleshoot without specialized tools (compared to Parquet, which requires tools like Visidata). But you pay for it elsewhere.
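To be fair, peeking at a Parquet file doesn't strictly require a dedicated viewer; a few lines of pyarrow (file name is a stand-in) already get you the schema, row count, and a sample:

    # Quick Parquet inspection without a dedicated viewer.
    # "events.parquet" is a hypothetical file name.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("events.parquet")
    print(pf.schema_arrow)                   # column names and types
    print(pf.metadata.num_rows)              # row count straight from the footer
    print(pf.read_row_group(0).slice(0, 5))  # peek at a few rows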


How about using SQLite database files as an interchange format?


I haven't thought about sqlite as a data interchange format, but I was looking at deploying sqlite as a data lake format some time ago, and found it wanting.

1. Dynamically typed (with type affinity) [1]. This causes problems when there are multiple data-generating processes. Newer sqlite versions have a STRICT table type that enforces types, but only for the few basic types it has.

2. Doesn't have a date/time type [1]. This is problematic because you can store dates as TEXT, REAL or INTEGER (it's up to the developer) and if you have sqlite files from > 1 source, date fields could be any of those types, and you have to convert between them.

3. Isn't columnar, so complex analytics at scale is not performant.

I guess one can use sqlite as a data interchange format, but it's not ideal.
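A quick sqlite3 sketch of points 1 and 2 (table and column names made up), showing how a non-STRICT INTEGER column happily stores text and how two sources can encode the "same" date differently:

    import sqlite3

    con = sqlite3.connect(":memory:")

    # 1. Type affinity: the string is stored as-is in a non-STRICT table.
    con.execute("CREATE TABLE loose (id INTEGER, qty INTEGER)")
    con.execute("INSERT INTO loose VALUES (1, 'not a number')")
    print(con.execute("SELECT qty, typeof(qty) FROM loose").fetchall())
    # -> [('not a number', 'text')]

    # A STRICT table (requires SQLite >= 3.37) rejects the same insert.
    con.execute("CREATE TABLE strict_t (id INTEGER, qty INTEGER) STRICT")
    try:
        con.execute("INSERT INTO strict_t VALUES (1, 'not a number')")
    except sqlite3.Error as e:
        print("rejected:", e)

    # 2. No date type: one source writes ISO text, another an epoch int.
    con.execute("CREATE TABLE events (ts_text TEXT, ts_epoch INTEGER)")
    con.execute("INSERT INTO events VALUES ('2024-01-01T00:00:00', 1704067200)")
    print(con.execute(
        "SELECT typeof(ts_text), typeof(ts_epoch) FROM events").fetchall())
    # -> [('text', 'integer')]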

One area sqlite does excel in is as an application file format [2], and that's where it is mostly used [3].

[1] https://www.sqlite.org/datatype3.html

[2] https://www.sqlite.org/appfileformat.html

[3] https://en.wikipedia.org/wiki/SQLite#Notable_uses


Exactly. Parquet is good for append-only too: stream mods to Parquet in new partitions, compact, repeat.
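Roughly that pattern, sketched with pyarrow datasets (paths and the "dt" partition column are hypothetical): land small files per partition, then periodically rewrite each partition as fewer, larger files:

    import uuid
    import pyarrow as pa
    import pyarrow.dataset as ds

    part_scheme = ds.partitioning(pa.schema([("dt", pa.string())]), flavor="hive")

    # Mods stream in as small Parquet files under dt=YYYY-MM-DD/ directories;
    # a unique basename per batch keeps appends from clobbering earlier files.
    batch = pa.table({"dt": ["2024-06-01"] * 3, "value": [1, 2, 3]})
    ds.write_dataset(batch, "lake/events", format="parquet",
                     partitioning=part_scheme,
                     basename_template=f"batch-{uuid.uuid4()}-{{i}}.parquet",
                     existing_data_behavior="overwrite_or_ignore")

    # Periodic compaction: read one partition's small files and rewrite
    # them as fewer, larger files in a read-optimized copy of the table.
    src = ds.dataset("lake/events", format="parquet", partitioning="hive")
    compacted = src.to_table(filter=ds.field("dt") == "2024-06-01")
    ds.write_dataset(compacted, "lake/events_compacted", format="parquet",
                     partitioning=part_scheme,
                     existing_data_behavior="delete_matching")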


our plans are measured in centuries



