> The intuition is that for datasets commonly and frequently joined on a known key, e.g., user events with user metadata on a user ID, we can write them in bucket files with records bucketed and sorted by that key. By knowing which files contain a subset of keys and in what order, shuffle becomes a matter of merge-sorting values from matching bucket files, completely eliminating costly disk and network I/O of moving key–value pairs around.
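Roughly the trick, as a minimal sketch (the tab-separated file layout, the field names, and the assumption of one metadata row per user are mine, not the paper's): if both datasets are written into the same number of buckets by the join key, and each bucket file is sorted by that key, then joining bucket i of one dataset with bucket i of the other is just a streaming merge, with no shuffle required.

    def sorted_bucket_join(events_path, users_path):
        """Merge join of two bucket files already sorted by user_id.
        events: many rows per user_id; users: at most one row per user_id.
        Each line is assumed to be 'user_id<TAB>payload'."""
        def rows(path):
            with open(path) as f:
                for line in f:
                    uid, payload = line.rstrip("\n").split("\t", 1)
                    yield uid, payload

        events, users = rows(events_path), rows(users_path)
        user = next(users, None)
        for uid, event in events:
            # advance the (sorted) metadata side until it catches up
            while user is not None and user[0] < uid:
                user = next(users, None)
            if user is not None and user[0] == uid:
                yield uid, event, user[1]   # matched: emit joined record
            # unmatched events are dropped (inner join)

Each matching pair of bucket files can be joined independently like this, and in parallel.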
I'm actually surprised that this should be regarded as "novel" in data science.
It reminds me of something in Eric Raymond's "The Art of Unix Programming" (I don't have time to find the link right now) where he discusses an approach from the earlier days of Linux filesystems, where you had a limit on the number of inodes that could exist in a single directory, with a corresponding performance hit. The workaround was to create a subdirectory structure and store files based on the filename. But then you tended to get many files starting with the same characters all in the same directories. What turned out to be a better way to distribute the files evenly in the directory structure was to take the first and _last_ character of the file name and use those to create the subdirectories. This way you were more likely to spread the files evenly across the structure.
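Something like this, as a rough sketch from memory (the exact scheme and paths are illustrative, not necessarily what the book describes):

    import os

    def bucketed_path(root, filename):
        # key the subdirectory on the first and last character of the name,
        # so names sharing a common prefix don't all pile up in one directory
        return os.path.join(root, filename[0], filename[-1], filename)

    def store(root, filename, data):
        path = bucketed_path(root, filename)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    # e.g. 'xterm-256color' lands under root/x/r/ instead of sitting in
    # root/x/ next to every other 'xterm-*' entry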
>It reminds me of something in Eric Raymond's "The Art of Unix Programming" (I don't have time to find the link right now) where he discusses an approach from the earlier days of Linux filesystems, where you had a limit on the number of inodes that could exist in a single directory, with a corresponding performance hit. The workaround was to create a subdirectory structure and store files based on the filename. But then you tended to get many files starting with the same characters all in the same directories. What turned out to be a better way to distribute the files evenly in the directory structure was to take the first and _last_ character of the file name and use those to create the subdirectories. This way you were more likely to spread the files evenly across the structure.
Interesting. I have been pondering filesystem performance and inode limits on servers/home servers for a long time. This seems like useful information.
I think the bit I was remembering was the Terminfo Case Study on page 149 of The Art of Unix Programming - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62... - I read it a long time ago, though, and re-reading it now, it's not exactly what I was remembering, but that's memory for you...
I read it and I am like ... we already do this. This is common and obvious. Maybe I am missing something.
Before worker nodes had as much memory as they have now, almost everything needed to use small buffers and spill to disk. BDB (Berkeley DB) was an extremely common tool for doing out-of-core data operations. Because the ETL tools I was writing needed to run on machines with 512MB of RAM, they required out-of-core algorithms. We easily had jobs processing 10-20GB with only 512MB of RAM.
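Not BDB specifically, but the general pattern looks something like this (a sketch only; the buffer size and line-based format are arbitrary):

    import heapq, tempfile

    def external_sort(lines, buffer_size=100_000):
        # Out-of-core sort: sort small in-memory runs, spill each run to a
        # temp file, then lazily k-way merge the sorted runs from disk.
        runs, buf = [], []

        def spill():
            buf.sort()
            f = tempfile.TemporaryFile(mode="w+")
            f.writelines(line + "\n" for line in buf)
            f.seek(0)
            runs.append(f)
            buf.clear()

        for line in lines:
            buf.append(line)
            if len(buf) >= buffer_size:
                spill()
        if buf:
            spill()

        # stream the merged output without loading everything into memory
        return heapq.merge(*((l.rstrip("\n") for l in f) for f in runs))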
I am sure I am missing something, reading the paper now.
> The intuition is that for datasets commonly and frequently joined on a known key, e.g., user events with user metadata on a user ID, we can write them in bucket files with records bucketed and sorted by that key.
Index-organized tables in Oracle, clustered tables in MSSQL. "Intuition" in the modern big data world :)