I came here expecting to read about the tech in the article or how others do big...

harryf · on Feb 12, 2021

Agreed, so I'll bite...

> The intuition is that for datasets commonly and frequently joined on a known key, e.g., user events with user metadata on a user ID, we can write them in bucket files with records bucketed and sorted by that key. By knowing which files contain a subset of keys and in what order, shuffle becomes a matter of merge-sorting values from matching bucket files, completely eliminating costly disk and network I/O of moving key–value pairs around.
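To make the quoted idea concrete, here is a minimal sketch in Python. It is not the article's implementation: the function names (`bucket_id`, `bucketize`, `merge_join`, `bucketed_join`), the CRC32 hash, and the in-memory lists standing in for bucket files are all assumptions for illustration. The point it shows is that once both datasets are bucketed with the same function and sorted by key, each pair of matching buckets can be joined with a single forward merge, with no shuffle step.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def bucket_id(key, num_buckets):
    # Hypothetical bucketing function; both datasets must use the same one.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(records, num_buckets):
    """Split (key, value) records into buckets, each sorted by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))  # stable: keeps value order per key
    return buckets


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs, each sorted by key."""
    lgroups = groupby(left, key=itemgetter(0))
    rgroups = groupby(right, key=itemgetter(0))
    lkey, lvals = next(lgroups, (None, None))
    rkey, rvals = next(rgroups, (None, None))
    while lvals is not None and rvals is not None:
        if lkey < rkey:
            lkey, lvals = next(lgroups, (None, None))
        elif lkey > rkey:
            rkey, rvals = next(rgroups, (None, None))
        else:
            # Same key on both sides: emit every left/right combination.
            right_values = [value for _, value in rvals]
            for _, left_value in lvals:
                for right_value in right_values:
                    yield lkey, left_value, right_value
            lkey, lvals = next(lgroups, (None, None))
            rkey, rvals = next(rgroups, (None, None))


def bucketed_join(left_buckets, right_buckets):
    """Join bucket i of one dataset only against bucket i of the other."""
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        yield from merge_join(left_bucket, right_bucket)


events = [(1, "play"), (3, "skip"), (1, "pause")]
users = [(1, "alice"), (2, "bob"), (3, "carol")]
joined = sorted(bucketed_join(bucketize(events, 4), bucketize(users, 4)))
```

In a real system each bucket would be a file written once at ingestion time, so the sort cost is paid once and every later join on that key only reads and merges.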

I'm actually surprised that this should be regarded as "novel" in data science.

It reminds me of something in Eric Raymond's "The Art of Unix Programming" (I don't have time to find the link right now) where it discussed an approach from the earlier days of Linux filesystems where you had a limit on the number of inodes that could exist in a single directory, with corresponding performance problems. The workaround was to create a subdirectory structure to store files based on the filename. But then you tended to get many files starting with the same characters all in the same directories. What turned out to be a better way to distribute the files evenly in the directory structure was to take the first and _last_ character of the file name and use those to create the subdirectories. This way you were more likely to spread the files evenly across the structure.
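A rough sketch of the two directory layouts described above, in Python. The function names and the `/data` root are hypothetical, and this only illustrates the scheme as recalled in the comment, not anything taken from the book.

```python
from collections import Counter
from pathlib import PurePosixPath


def shard_by_prefix(root, filename):
    """Naive layout: one subdirectory keyed on the first character."""
    if not filename:
        raise ValueError("filename must be non-empty")
    return PurePosixPath(root) / filename[0] / filename


def shard_by_first_last(root, filename):
    """Layout from the comment: subdirectories keyed on first and last character."""
    if not filename:
        raise ValueError("filename must be non-empty")
    return PurePosixPath(root) / filename[0] / filename[-1] / filename


# Names sharing a common prefix, which is the problem case described above.
names = [f"log{i}" for i in range(10)]
prefix_dirs = Counter(shard_by_prefix("/data", n).parent for n in names)
first_last_dirs = Counter(shard_by_first_last("/data", n).parent for n in names)
```

Here all ten names land in a single directory under the prefix-only layout, while the first-and-last layout spreads them over ten directories. If most names end in the same extension, the last character stops helping, so a variant might key on the last character before the extension instead.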

mandis · on Feb 12, 2021

> It reminds me of something in Eric Raymond's "The Art of Unix Programming" (I don't have time to find the link right now) where it discussed an approach from the earlier days of Linux filesystems where you had a limit on the number of inodes that could exist in a single directory, with corresponding performance problems. The workaround was to create a subdirectory structure to store files based on the filename. But then you tended to get many files starting with the same characters all in the same directories. What turned out to be a better way to distribute the files evenly in the directory structure was to take the first and _last_ character of the file name and use those to create the subdirectories. This way you were more likely to spread the files evenly across the structure.

Interesting. I have been pondering filesystem performance and inode limits on servers and home servers for a long time. This seems like useful information.

harryf · on Feb 12, 2021

I think the bit I was remembering was the Terminfo Case Study on page 149 of the Art of Unix Programming - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62... - read it a long time ago though and re-reading now, it's not exactly what I was remembering but that's memory for you...

nerpderp82 · on Feb 12, 2021

I read it and I am like ... we already do this. This is common and obvious. Maybe I am missing something.

Before worker nodes had as much memory as they have now, almost everything needed to use small buffers and spill to disk. BDB (Berkeley DB) was an extremely common tool for doing out-of-core data operations. Because the ETL tools I was writing needed to run on machines with 512 MB of RAM, they required out-of-core algorithms. We easily had jobs processing 10-20 GB with only 512 MB of RAM.
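As an illustration of the "small buffers and spill to disk" pattern, here is a minimal external merge sort in Python. It is a stand-in built on the standard library, not the Berkeley DB tooling described above; the function names and the `max_in_memory` parameter are assumptions, and it assumes the input lines contain no newline characters.

```python
import heapq
import tempfile


def _spill(sorted_lines):
    """Write one sorted run to a temporary file and rewind it for reading."""
    run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    run.writelines(line + "\n" for line in sorted_lines)
    run.seek(0)
    return run


def _read(run):
    """Stream lines back from a spilled run without loading it whole."""
    for line in run:
        yield line.rstrip("\n")


def external_sort(lines, max_in_memory):
    """Yield lines in sorted order while holding at most max_in_memory in a buffer."""
    if max_in_memory < 1:
        raise ValueError("max_in_memory must be at least 1")
    runs, buffer = [], []
    try:
        for line in lines:
            buffer.append(line)
            if len(buffer) >= max_in_memory:
                runs.append(_spill(sorted(buffer)))  # buffer full: spill to disk
                buffer = []
        if buffer:
            runs.append(_spill(sorted(buffer)))
        # k-way merge keeps only one pending line per run in memory.
        yield from heapq.merge(*(_read(run) for run in runs))
    finally:
        for run in runs:
            run.close()


result = list(external_sort(["d", "a", "c", "b", "e"], max_in_memory=2))
```

The buffer size bounds memory during the run-building phase, and the merge phase needs memory proportional to the number of runs rather than the size of the data, which is how a small machine can get through an input much larger than its RAM.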

I am sure I am missing something, reading the paper now.

http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT0...

trhway · on Feb 12, 2021

> The intuition is that for datasets commonly and frequently joined on a known key, e.g., user events with user metadata on a user ID, we can write them in bucket files with records bucketed and sorted by that key.

Index-organized tables in Oracle, clustered indexes in MS SQL Server. "Intuition" in the modern big data world :)

jpitz · on Feb 12, 2021

It shouldn't be novel. I dive into this topic when I interview data engineers.

johncena33 · on Feb 12, 2021

This has become a huge problem on HN lately. Lots of discussions are nothing but complaining. Now the technical discussions are starting to get infested with off-topic whining. The mods don't do anything about off-topic rants. If you point it out you'll get downvoted [1][2][3].

[1] https://news.ycombinator.com/item?id=25839399 [2] https://news.ycombinator.com/item?id=25064636 [3] https://news.ycombinator.com/item?id=24699908

pedroaraujo · on Feb 12, 2021

Absolutely, there has been a dramatic shift in the type of people who visit HN in the past years.

I used to think that Reddit was bad in this regard, but to be honest it mostly affects the big subreddits; the small, niche ones still have high-quality communities. HN became pretty much like the biggest subs on Reddit.

chishaku · on Feb 12, 2021

Back in my day...

This comment thread is in its own category of low quality discussion.

Negativity bias prevents you from seeing that 95% of the homepage right now is technical/nerdy with a lot of high quality corresponding discussion.

When political/social issues hit the homepage, they often slide off quickly if the corresponding discussion is of low quality (has many downvoted comments).

HN is certainly not perfect but just focusing on the parts you don't like prevents you from seeing the bigger picture.

adventured · on Feb 12, 2021

> The mods don't do anything about off-topic rants.

Mod. There is a question of how much one moderator can do against the tide. HN really needs a couple of full-time paid moderators, with their salaries covered by the zillion-dollar YC bank account.

jlouis · on Feb 12, 2021

The article isn't easy to read unless you have some knowledge up front about the used technologies and what the problem is.

This definitely drives people to comment on other things.

My gut feeling screams they made a problem themselves in the first place which they then "solved". Similar to a "solution running around looking for a problem" type of deal.

chishaku · on Feb 12, 2021

[flagged]

matsemann · on Feb 12, 2021

I agree that my post is just additional noise. But it's noise not disturbing a good signal.

Let me just point out that I think discussing side aspects of what's linked can be interesting and relevant. In this instance I think discussing privacy around the amount of data Spotify stores is a relevant subdiscussion worth exploring. But complaints about their UI don't feel very relevant. Etc.

Now that's only my opinion. What made me write my comment was that there was literally no one discussing the concepts of the article in the first 15 or so comments, which I found a bit disappointing as I think the tech is interesting.

chishaku · on Feb 12, 2021

> But it's noise not disturbing a good signal.

The signal will always be weak early on.

Look at the comments now and it's clear that the upvote/downvote mechanisms are working sufficiently to address your concern.

This should not be surprising. It's typically faster to not read the article and spew superficial off-topic comments than to write on-topic, substantive, and technical comments.

matsemann · on Feb 12, 2021

Good point, although I felt it was more than normal in this case. Having my meta comment on top is also no good; hopefully it can be demoted somehow. Will try to flag this thread.