Spotify Optimized the Largest Dataflow Job Ever for Wrapped 2020 (atspotify.com)
196 points by SirOibaf on Feb 12, 2021 | 125 comments



I came here expecting to read about the tech in the article or how others do big data processing. Instead I get off-topic Spotify rants. Did you read the article, or did you just see Spotify in the headline and decide your gripes were therefore relevant?


Agreed too. I'll bite...

> The intuition is that for datasets commonly and frequently joined on a known key, e.g., user events with user metadata on a user ID, we can write them in bucket files with records bucketed and sorted by that key. By knowing which files contain a subset of keys and in what order, shuffle becomes a matter of merge-sorting values from matching bucket files, completely eliminating costly disk and network I/O of moving key–value pairs around.

I'm actually surprised that this should be regarded as "novel" in data science.

It reminds me of something in Eric Raymond's "The Art of Unix Programming" (I don't have time to find the link right now) which discussed an approach from the earlier days of Linux filesystems, where you had a limit on the number of inodes that could exist in a single directory, with performance suffering accordingly. The workaround was to create a subdirectory structure to store files based on the filename. But then you tended to get many files starting with the same characters all in the same directories. What turned out to be a better way to distribute the files evenly in the directory structure was to take the first and _last_ character of the file name and use those to create the subdirectories. This way you were more likely to spread the files evenly across the structure.
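
Roughly, the trick looks like this (a toy Python sketch; the root path and filenames are made up):

    import os

    def shard_path(root, filename):
        # Bucket by first AND last character, so names that share a
        # long common prefix still land in different subdirectories.
        first, last = filename[0], filename[-1]
        return os.path.join(root, first + last, filename)

    # 'terminfo_a' and 'terminfo_z' share a prefix but are spread out:
    print(shard_path("db", "terminfo_a"))  # db/ta/terminfo_a
    print(shard_path("db", "terminfo_z"))  # db/tz/terminfo_z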


>It reminds me of something in Eric Raymond's "The Art of Unix Programming" [...] What turned out to be a better way to distribute the files evenly in the directory structure was to take the first and _last_ character of the file name and use those to create the subdirectories.

Interesting. I have been pondering filesystem performance and inode limits on servers/home servers for a long time. This seems like useful information.


I think the bit I was remembering was the Terminfo case study on page 149 of The Art of Unix Programming - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62... - I read it a long time ago though, and re-reading it now, it's not exactly what I was remembering, but that's memory for you...


I read it and I am like ... we already do this. This is common and obvious. Maybe I am missing something.

Before worker nodes had as much memory as they have now, almost everything needed to use small buffers and spill to disk. BDB (Berkeley DB) was an extremely common tool for doing out-of-core data operations. Because the ETL tools I was writing needed to run on machines with 512 MB of RAM, they required out-of-core algorithms. We easily had jobs processing 10-20 GB with only 512 MB of RAM.
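
The pattern is roughly this - a minimal sketch of the merge phase of an external sort (assuming pre-sorted run files on disk, not the actual BDB-based tooling):

    import heapq

    def merge_sorted_runs(run_paths, out_path):
        # Merge many individually sorted files while holding only one
        # record per input in memory, so RAM usage stays flat no
        # matter how large the total dataset is.
        files = [open(p) for p in run_paths]
        try:
            with open(out_path, "w") as out:
                # heapq.merge streams lazily from each input.
                for line in heapq.merge(*files):
                    out.write(line)
        finally:
            for f in files:
                f.close()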

I am sure I am missing something, reading the paper now.

http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT0...


> The intuition is that for datasets commonly and frequently joined on a known key, e.g., user events with user metadata on a user ID, we can write them in bucket files with records bucketed and sorted by that key.

Index-organized tables in Oracle, clustered tables in MSSQL. "Intuition" in the modern big data world :)


It shouldn't be novel. I dive into this topic when I interview data engineers.


This has become a huge problem on HN lately. Lots of discussions are nothing but complaining. Now the technical discussions are starting to get infested with off-topic whining. The mods don't do anything about off-topic rants. If you point it out you'll get downvoted [1][2][3].

[1] https://news.ycombinator.com/item?id=25839399 [2] https://news.ycombinator.com/item?id=25064636 [3] https://news.ycombinator.com/item?id=24699908


Absolutely; there has been a dramatic shift in the type of people who visit HN over the past few years.

I used to think that Reddit was bad in this regard, but to be honest it mostly affects the big subreddits; the niche and small ones still have high-quality communities. HN has become pretty much like the biggest subs on Reddit.


Back in my day...

This comment thread is in its own category of low quality discussion.

Negativity bias prevents you from seeing that 95% of the homepage right now is technical/nerdy with a lot of high quality corresponding discussion.

When political/social issues hit the homepage, they often slide off quickly if the corresponding discussion is of low quality (has many downvoted comments).

HN is certainly not perfect but just focusing on the parts you don't like prevents you from seeing the bigger picture.


> The mods don't do anything about off-topic rants.

Mod, singular. There is a question of how much one moderator can do against the tide. HN really needs a couple of full-time paid moderators, with their salaries covered by the zillion-dollar YC bank account.


The article isn't easy to read unless you have some up-front knowledge about the technologies used and what the problem is.

This definitely drives people to comment on other things.

My gut feeling screams that they created the problem themselves in the first place and then "solved" it. Similar to a "solution running around looking for a problem" type of deal.


[flagged]


I agree that my post is just additional noise. But it's noise not disturbing a good signal.

Let me just point out that I think discussing aspects of what's linked can be interesting and relevant. In this instance, I think discussing privacy around the amount of data Spotify stores is a relevant subdiscussion worth exploring. But complaints about their UI don't feel very relevant. Etc.

Now, that's only my opinion. What made me write my comment was that literally no one was discussing the concepts of the article in the first 15 or so comments, which I found a bit disappointing, as I thought the tech was interesting.


> But it's noise not disturbing a good signal.

The signal will always be weak early on.

Look at the comments now and it's clear that the upvote/downvote mechanisms are working sufficiently to address your concern.

This should not be surprising. It's typically faster to not read the article and spew superficial off-topic comments than to write on-topic, substantive, and technical comments.


Good point, although I felt there was more of it than normal in this case. Having my meta comment on top is also no good; hopefully it can be demoted somehow. I will try to flag this thread.


I'm having a hard time understanding this article. It seems to be a bit too low level on the specifics of Beam for general consumption.

From what I understand, Spark has the same feature built in. If the planner knows that the source data is partitioned and/or sorted appropriately, it can skip shuffling/sorting it, instead having each executor directly request the one file it needs.
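
For the curious, the Spark-side mechanism is bucketed tables. A minimal PySpark sketch - the paths, table names, and bucket count here are all made up, and you'd tune them to your data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Pay the shuffle once at write time: bucket and sort both sides
    # by the join key, with matching bucket counts.
    events = spark.read.parquet("/data/user_events")      # hypothetical input
    profiles = spark.read.parquet("/data/user_profiles")  # hypothetical input
    for df, name in [(events, "events_bkt"), (profiles, "profiles_bkt")]:
        (df.write
           .bucketBy(256, "user_id")
           .sortBy("user_id")
           .mode("overwrite")
           .saveAsTable(name))

    # Later joins on user_id can then skip the exchange/sort stages;
    # the plan should show no Exchange operator on either side.
    joined = spark.table("events_bkt").join(spark.table("profiles_bkt"), "user_id")
    joined.explain()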

It's a nice optimization, but it's not game-changing. You often end up having to shuffle anyway, because you are joining on a different key, or for performance reasons you need more executors than the set number of partitions, or the shuffle needed to write the data doesn't justify the savings on the readers.

Maybe it's better with their additional optimizations? Spark mostly does not do those.


Hive also has had this optimization for as long as I can remember. As others have noted it's not particularly new or novel, it's just not part of the Beam SDK.


50% cost reduction though


I wonder if they could publish the dollar cost of that job before and after the optimization, as reported by GCP billing. I know it could be a bit unfair (some costs may be static regardless of job size, etc.), but it would improve decision making for others if discussions of public cloud usage optimizations also included the cost.


They do refer to savings in percentages. I feel like giving away actual dollar costs would potentially break contractual agreements (because I doubt Spotify pays the public list price), and potentially give away competitive information about how much data they have on their customers, etc.

I agree with you that it would be interesting to know, I just don't think it's realistic for them to release that information.


Regarding the title, how do we know this is THE largest Dataflow job? The article body only mentions that this is THEIR largest Dataflow job. The post doesn't make any quantifiable claims one could use to support the stronger reading either; this is all I found:

> "We estimate around a 50% decrease in Dataflow costs this year compared to previous years’ Bigtable-based approach. Additionally, we avoided scaling the Bigtable cluster up two to three times its normal capacity (up to around 1,500 nodes at peak"

The official Spotify Engineering tweet similarly only mentions that this is Spotify's largest Dataflow job ever: https://twitter.com/SpotifyEng/status/1359887825047613442.

I'm fairly sure a similar accidental unsourced exaggeration was made last year.

Maybe the title should be "Spotify Optimized Their Largest Dataflow Job Ever for Wrapped 2020"?


Hi, not sure if I'm completely off here, but I'm wondering how this relates or compares to processing things with Kafka and Kafka Streams.

If I'm reading things correctly, with Kafka the workflow equivalent to what's written in the article would be to have your producer write into some topic via the default hash-based partitioner, keyed on the field you're interested in. Your consumer would then just read it, and your data would already be sorted for the given keys (because within a partition Kafka has sorting guarantees), and also be co-partitioned correctly if you need to read in some other topic with the same number of partitions and the same logical keys produced via the same algorithm. No?
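
Something like this with the kafka-python client (the broker address and topic name are placeholders):

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # placeholder broker
        key_serializer=str.encode,
        value_serializer=str.encode,
    )

    # With an explicit key, the default partitioner hashes the key, so
    # every event for user-42 lands in the same partition, in produce order.
    for event in ["play", "skip", "play"]:
        producer.send("user-events", key="user-42", value=event)
    producer.flush()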


This is the most basic pattern for distributed joins - you hash on the join key in both tables and shuffle data based on hash ranges. In some systems like Redshift you can designate the key for distribution so that "related" records are already co-located on a single shard.
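
Concretely, the routing rule on both sides is just (toy sketch):

    import zlib

    def partition_for(key: str, num_partitions: int) -> int:
        # A stable hash (crc32 here) so every worker, on either side
        # of the join, routes equal keys to the same partition.
        return zlib.crc32(key.encode()) % num_partitions

    # Rows for "user-42" from both tables land on the same shard:
    assert partition_for("user-42", 8) == partition_for("user-42", 8)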

> your data would already be sorted for the given keys (because within a partition Kafka has sorting guarantees)

It's been a while since I used Kafka but I don't remember "sorting guarantees". Consumers see events "in order" based on when they were produced, because each partition is a queue.


Yes, I guess my point is that when you use Kafka in combination with Kafka Streams and produce things partitioned the way you need them for consumption, you don't need to do any shuffling when you want to join, because the data is already partitioned correctly.


You seem to know what you're talking about. Any recommendations on learning resources for this type of flow? Or for really understanding which platform works in each situation?

I'm learning proper data flow in real time as I look to transition ETL of product data into Postgres to a more applicable system.

Finding the right learning resources is difficult! Cheers.



Wrapped works so well because it panders to you. Everyone likes to be acknowledged for listening to the weird indie band they discovered earlier this year.

I enjoy it anyway, and Spotify is still a great service for now - I wonder if it'll meet the same fate as Netflix at some point, with publishing houses going for their own streaming services instead.


> I wonder if it'll meet the same fate as Netflix at some point, with publishing houses going for their own streaming services instead.

I don't think so, because the way people consume music and the way they consume films and television are very different. With a film, you might block out a few hours to watch something from that specific provider. With music, you're more likely to want to interleave content from several providers at the same time (e.g., in a playlist). Unless all the providers are available on the same platform, it wouldn't work well.


A considerable amount of music is distributed by a small subset of providers. So in that regard it's not that different from the TV/film situation, and you could theoretically still interleave different artists.

There's also a common use case where people will just play a specific artist for an hour. Or even an album.

Frankly, I hope services like Spotify don't disappear. It's a great loss to consumers just how fragmented video streaming services have become. I'd hate to see the same happen to music as well.


I don't think that behavior is all too common. Around here, I see people listening to a very wide range of artists and almost never reaching for stuff outside that range.


I use music streaming services to play specific albums (as I tend to play older artists who created albums designed to be played as a whole entity rather than singles)

My wife uses them to shuffle singles by specific artists (she’s more into pop music).

I’d wager that if my wife and I both coincidentally follow the same pattern, despite doing so for different reasons, then it’s likely a more common pattern than first assumed. Please also bear in mind that I’m not suggesting our use cases are how the majority of people consume music, but I’d be surprised if the share was small enough to be a rounding error.


Speaking of interleaving, I wish Spotify understood how to interact with concept albums. Jumping into the middle of The Wall for one song is jarring, and I usually skip it.

And then I worry skipping it is going to train the algorithm that I don't like Pink Floyd.


What does this mean? I don't use spotify. Are you saying when you go to listen to an album it will put a random song into the queue?


I'll be listening to a theme or genre of music and it adds songs that don't really work outside the context of their album.


Spotify allows both: listening to random songs (either based on a particular genre, seeded from an existing playlist, or generated from the user’s listening habits), or choosing specific songs or albums to listen to.

They probably meant listening in random mode, and Spotify randomly choosing a track that doesn’t make much sense outside the context of its album.


Only if you listen on shuffle (like every music app, I guess) or in their AI playlists. It would be nice if the AI ones respected that, for sure; I've been wondering why that's not a thing since like 2016.


I never understood why for albums like this they even bother splitting up the tracks.


If anything the top-5 format doesn't capture the long tail you might be listening to.


[deleted]


Do you have anything more terse? A 30-minute, ad-filled video isn't exactly a light read.


This article fails to make a clean problem statement for a general audience. It jumps right into jargon and names from some framework/library. It reads like an internal report from a programmer to their team that someone decided to make public with no changes.


And they’re still less useful than the data last.fm makes available to you.


Wrapped is a marketing product, not a data product.

Spotify itself is not a data product.

The first rule of avoiding disappointment is managing your own expectations to align with reality.

That said, you can export the more granular data Spotify has on you from your settings page[0] if you want to do your own deeper analysis of your trends and usage, etc.

0: https://www.spotify.com/account/privacy/


Useful how? What actionable insights do you get from Last.fm? It's vanity metrics optimized for sharing on social networks and there's nothing wrong with it.


Well, for one it gives me the yearly summary once the year is actually over.

It's not particularly actionable to know you listen to more music on a Saturday than a Sunday, but it is mildly interesting for those who are curious about such things.


For those who would like to dig more into what SMB is, here's the link to the paper from the article: http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT0...
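
The core merge step is simple enough to sketch. A toy version, assuming two already-sorted inputs with unique keys per side (the real thing handles duplicates, many buckets, and spilling):

    def merge_join(left, right):
        # Join two (key, value) streams, each sorted by key, with no
        # hashing or shuffling: just advance whichever side is behind.
        left, right = iter(left), iter(right)
        l, r = next(left, None), next(right, None)
        while l is not None and r is not None:
            if l[0] < r[0]:
                l = next(left, None)
            elif l[0] > r[0]:
                r = next(right, None)
            else:
                yield l[0], l[1], r[1]
                l, r = next(left, None), next(right, None)

    events = [("u1", "play"), ("u2", "skip"), ("u3", "play")]
    users = [("u1", "Alice"), ("u3", "Carol")]
    print(list(merge_join(events, users)))
    # [('u1', 'play', 'Alice'), ('u3', 'play', 'Carol')]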


I'd be curious how this compares in load to Google's internal applications. I'm also curious how much of Google's infrastructure capacity goes to Google vs. GCE - has combined GCE usage even passed the compute needs of Google internally yet?


The only even remotely concrete information in this post is that their input was 1 PB, and that they typically have 500 Bigtable tablet servers. In 2008, Google said they processed 20 PB per day through MapReduce jobs. For the last ten years, the only thing they've said about the size of their public web index is that it is over 100 PB.


This technique of using distributed storage for large joins instead of shuffling between compute nodes also helps make your job robust to spot instance kills. Until disaggregated shuffle services are widely adopted, it can be really handy.


Largest data flow job ever? I’m sure Google would beg to differ. At Quantcast we process 50PB every day, and that’s nothing compared to real scale like Google.

And merge joins from sorted data? Joins have been done that way since the punched-card days on mainframes (and by any scaled data system).


It's the largest Dataflow[1] job ever, with a capital D, not "data flow".

[1]: https://cloud.google.com/dataflow


Surely you read the article before posting, right?

From literally the first sentence:

> from our largest Dataflow job


Surely you have read the guidelines before posting, right?

> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."

https://news.ycombinator.com/newsguidelines.html

The Hacker News title and the article title say "the". Criticizing this clickbait is more than warranted.


Fair point, I could have worded it better.

But I think it's a reach to call this "clickbait".

Article titles are shortened all the time, and you can't expect them to have all the context in the title.

However, one should reasonably expect people participating in a discussion about an article (particularly when posting criticism) to have actually read it.

Complaining about something that is provably false in the first sentence of the article is the bigger sin here, is it not?


Misleading headline, the first sentence is "how Spotify optimized and sped up elements from *our largest Dataflow job*". Surely it's not the largest ever run, even on Dataflow.


Yet I, as a user, still cannot see my play counts via the app or API.


That might be explained as guarding their information, yet I still have trouble believing that they are unable to distinguish two artists with the same name from each other. Each week, my Release Radar playlist is basically invaded by zero-listener rappers who "feat." defunct 70s rock bands that do not manage their artist pages... Response from Spotify: please report them individually, from the desktop app.


That's some interesting product hacking in a way.

I've run into artist naming collisions enough to know it's a thing, but I've never seen anyone (ab)use it intentionally.

Neat, but still annoying I'm sure. =)


This. They don't even do an exact string match on the artist name; why, I can't fathom. I could make a new artist account named "drake" (lowercase d) today and probably show up in a million Release Radar playlists by next week.


This is one of the reasons why I still keep using last.fm. Stats in Spotify are next to useless, and even then Spotify would not track my local listening.


You can, to some extent, using the play history API. https://output.jsbin.com/ribat
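
For anyone curious, it's roughly this (you need an OAuth token with the user-read-recently-played scope; note the endpoint only returns recent plays, not lifetime play counts):

    import requests

    TOKEN = "..."  # OAuth token with the user-read-recently-played scope

    resp = requests.get(
        "https://api.spotify.com/v1/me/player/recently-played",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"limit": 50},  # 50 is the max per request
    )
    resp.raise_for_status()
    for item in resp.json()["items"]:
        print(item["played_at"], item["track"]["name"])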


This is a massive gripe of mine. I remember being able to toggle so many stats in iTunes back in the day, which gave me such great insight into my listening habits and other cool tidbits. Spotify, for all its ease of listening, has removed some of the magic there.


Perhaps a bit off-topic, but a lot of users (myself included) reported wildly inaccurate data in Spotify Wrapped this year, with seemingly no explanation: no shared accounts, no re-used passwords, no weird listening history, etc.

I wonder if some of the data in the "We worked with the maintainer of these data sets to convert a year’s worth of data to SMB format." step got corrupted or just wrongly converted/lost.

I'm not sure how else to explain that I have to google artists in my top 10 because I've never heard of them.


ListenBrainz has gotten pretty good lately at keeping track of your music stats. It's got a nice weekly stats page, etc. It's also not ad-laden like last.fm.


Glad that you like it! I'm part of the (small) team working on ListenBrainz. If you have any feedback or feature requests, I'd be happy to hear it.

My email is me [at] param.codes.


[flagged]


Spotify is no longer a startup.


Good point. That changes the substance of the post.


Whatever they're doing with data means nothing when their client apps are absolute dogshit.


Could be done with a bash script...


And paper and pencil

What’s your point?


Anyone wanna know how much Spotify wants to know about you?

https://twitter.com/steipete/status/1025024813889478656


I wonder whether they use the Bluetooth device logging purely to develop their social graph. A person streaming music to their own Bluetooth headphones using a friend's computer and Spotify account can be detected this way. I can't really come up with another purpose for it.

Other than that, I'm not surprised by what they log. Virtually every company stores search queries, oauth grants, play history, ad interactions etc. Doesn't make it right of course.


I already assume that every service hoovers up as much as it can. Nice to see proof that it is actually happening!


Woo. That's incredible. Seriously!


What's scary about this?

They're complying with GDPR. Isn't that a good thing?

The "scary" thing in that tweet is that they store the manufacturer of their bluetooth headphones?


That's not what one could call complying. He had to hound them for a long while just to get the data he was entitled to.


That was from 2018. Almost 3 years ago.


Sorry if it's off-topic, but if anyone's interested, I'm launching volt.fm next week.

It connects to your Spotify account and generates a nice public page with your stats (top artists, top tracks), playlists, etc.

You can reserve your username now: https://volt.fm


Hi, how is this different/better than last.fm?


It helps you promote your playlists and discover new ones.


Why would I want to promote my playlists?


My off-topic rant: I really wish Spotify would focus on improving the core player experience. It has barely seen any improvements in years.

* Not overwrite/delete my listening history every time I switch devices

* Allow tabs, or some way to resume what I've been listening to in different contexts

* Option to open only one instance, instead of having multiple instances that mess with each other

* Playing local files crashing/not working on Linux

* Change playback speed, not just for podcasts

* Jump back/forward, not just for podcasts

* Have some visibility when the song was last played / play count

* Liked songs not always appearing in search results

* Sorting search results not working

* Add basic functionality to the dbus interface (e.g. seeking)

* Ability to report songs (e.g. wrong titles/badly split tracks/etc.)


I have been begging them for years to add a 'resume playlist' feature but they won't do it.


You are absolutely right. What's more, people are using it for free via mods like these: https://bestforandroid.com/apk/spotify-premium-mod-apk/


Yup, well said.


Almost all of these strike me as only benefiting a very small sliver of users, like well under 1%. An infinitesimal portion of Spotify listeners even know what dbus is. How much engineer time is it worth to improve something like that?


Spotify is a company with over 7 billion USD in revenue in 2020. Linux is about 2% of desktop users, or about 1% of users in total.

Nevertheless, they decided not to spend a dime on the Linux client ("Spotify for Linux is a labor of love") [1]. The Linux app is an Electron app, so there's relatively little effort required to maintain it.

Instead, they paid 100 million USD for Joe Rogan and added video features that nobody wanted or asked for.

Take a look at /r/JoeRogan [2] and see just how much reputation damage it caused.

Take a look at /r/Spotify [3], most of the complaints are about basic functionality not working.

MPRIS [4] is a D-Bus API standard for controlling music players. End users don't know or care about APIs. End users care about having music widgets, but you can't build those without it. End users care about keyboard accessibility when they need it, but you can't provide it without those APIs. And Linux users specifically care about programmability.
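
For instance, the parts of MPRIS that do work with the Spotify client, sketched with pydbus (assuming the desktop app is running on the session bus):

    from pydbus import SessionBus

    bus = SessionBus()
    player = bus.get("org.mpris.MediaPlayer2.spotify", "/org/mpris/MediaPlayer2")

    # Basic transport controls and metadata work:
    player.PlayPause()
    meta = player.Metadata
    print(meta.get("xesam:title"), "-", ", ".join(meta.get("xesam:artist", [])))

    # Seek/SetPosition are in the MPRIS spec, but Spotify's client has
    # historically ignored them - which is exactly the gripe above:
    # player.SetPosition(meta["mpris:trackid"], 30 * 1_000_000)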

Limiting the API means that not only do I have to deal with the buggy app, it also prevents anyone else from building something better.

It's been 10 years since the Linux client came out. One engineer could fix these issues in a couple of months, but it's just not their priority.

The only saving grace for Spotify is that all their competitors are worse.

1. https://www.spotify.com/us/download/linux/

2. https://www.reddit.com/r/JoeRogan/

3. https://www.reddit.com/r/Spotify/

4. https://wiki.archlinux.org/index.php/MPRIS


Great! Now will they stop scanning my entire hard drive with their desktop app? Also stop opening sockets directly to advertiser IP addresses. And stop paying off data thieves instead of disclosing to their users that their passwords were leaked. And stop being sellouts too!


Do you have links to further readings about this?


One thing I can never understand about Spotify is that despite its insane budget and huge amount of employees/talent, they still can't create better personalized playlists than either Pandora OR last.fm.

To this day, when I want a recommended playlist based on my taste/history, I always use last.fm because it's just plain better. Why? The "Discover" etc. playlists on Spotify are just crap.


Really? I find my 'Discover Weekly' playlist to be astounding. I have to confess that for a long time I thought it was a human-curated playlist from someone who just happened to have exactly the same tastes as me.


I have to second this. I can't count the number of artists I have discovered over the last few years using Discover Weekly alone. It's really incredible.


It's been the opposite for me. It regularly surfaces songs I've listened to in the past from artists I listen to frequently. As a music discovery algo Spotify has been nothing but poor in my opinion.


Remember the Netflix Prize fiasco? It's pretty clear that this problem is either ridiculously hard or fundamentally "unsolvable" (because, perhaps, it depends on highly variable mood).


Spotify is a company that feels like they want to be a "big tech company" when in reality they do not need to. All they need to do is provide a great service with as much music as possible.


That's true – it's kind of like how Google keeps pretending to be a big tech company when all they really need to do is provide a great service with as many search results as possible.


I for one love discover weekly and hope they will continue working on improving it. I consider it a core feature :)


Agree 100%. I always find great songs there. A cool feature for discovering new music. But I guess we are not that sophisticated; predicting which songs I'm gonna like is probably not that difficult.


Sure, that is a great feature, but it doesn't really need that much data beyond the listening data they need to save anyway (I am assuming).


> All they need to do is provide a great service with as much music as possible.

I wonder if adding 'as much music as possible' would drive any growth without the AI music discovery stuff. There has to be a diminishing return to adding new songs - if Spotify adds an artist that only a few thousand people have heard of, then that's only going to attract a few thousand new customers at most. No one else is going to listen to that artist unless Spotify recommends the songs to people who might like them.


At some point they need to try to differentiate themselves from Amazon or Apple, who could easily out-spend them on licensing music catalogues.

I think Spotify's "moat" is largely the analysis they've done on all the listening data they have, and their ability to provide that to the music industry as a product.

They share rudimentary summaries with customers via stuff like Wrapped but I have to imagine they have much more detailed and robust data products for the industry...

If they're just a giant Dropbox for MP3s with a music player sitting on top, they don't really stand a chance...


I'm not sure I get the point in relation to the article. They have a huge amount of data, so they need to handle it in a "big tech" way, and it's not like they're doing this stuff just because they can.

The Spotify end of year wrap is amazing marketing for them that people absolutely love and share widely.


Wouldn’t surprise me if they shuffle more data than Twitter. Is it because it’s not a Silicon Valley darling that it gets all the snide remarks?


They clearly value the free+recommendations+ads stack to drive their business more than adding basic features.


They don’t add music. Artists do. And they get paid fuck all by Spotify.


When this report came out it was the straw that broke the camel's back for me in terms of my data privacy. Most people seem to have found Wrapped 2020 entertaining but I found it creepy.

I miss being able to do something simple like listen to music or watch a movie without all my actions being recorded and saved. So I'm back to buying physical media and DRM-free downloads.

I'm convinced that it is now important to hold on to older appliances that work without internet access or data collection. This, plus right to repair, gives me hope for the future.


Seems strange that Wrapped 2020 was the straw for you. Ever since launch, the top reasons for starting to use Spotify have been availability (across devices) and the provided radio/discover playlists that automatically find music based on your previous history. This has always been a core proposition of Spotify, not something that happened just now.

Granted, it's not always good, right, or even close sometimes, but it's not like it's a hidden feature.


I joined Spotify in 2011, and the core offer was simply access to all music across your devices. However, that's not really relevant.

2020 was the year I really started to question why I was taking such care with my data in some ways but not others. Wrapped 2020 was a stark reminder that if I want to take my privacy seriously, I need to look at everything in my life that collects data. Simple as that.

I don't think spotify is wrong or evil to collect the data. I actually think it is a great product. I have just decided that I want to leave as little data around about my daily activities as possible.

As it happens I don't use the Discovery features very much and it turns out there are still some enjoyable FM stations where I live. When I want background music I turn on the radio. When I want something more specific I play an album or playlist from my collection.


Discover Weekly (the first algorithmic playlist) launched in 2015, Spotify itself launched in 2008. It did have radio from the start, but that was (at least sold as) a fairly generic genre-and-decade-based affair, rather than something personalized.


I do know the history of Spotify; I became a user in 2009 (back when it was invite-only), as my family got a subscription together with our broadband connection from Bredbandsbolaget in Sweden. The radio was always a feature they advertised for finding music similar to what you like, although you're correct that Discover Weekly et al. weren't available until later. I guess you could argue that because the radio was initially just for artist pages, then playlists, it was kind of manual and not automatic like today's generated playlists.

Although it is strange to leave Spotify in 2020 citing concerns about Spotify using your data, as that has been going on since at least 2015, possibly earlier.


Spotify acquired the vast majority of their current users since 2015.


What’s your concern, that your listening history will be linked to your personal identity and somehow used against you?

If that’s the case, why not just sign up using a single-use email address, pay via gift cards purchased with cash, and if you’re really concerned, use a VPN?


Because going through all that just to listen to music is bonkers


but also far less effort and money than buying physical copies of music


Oh, I totally agree; for me this is a matter of an absolutely backwards industry. Very often I WANT to give them my money but can't find where to buy a DRM-free track, preferably in FLAC. And registering for 8 different stores is not a valid solution.


This is how relentless back-and-forth made someone realise how much Spotify loves him.

https://twitter.com/steipete/status/1025024813889478656


Ironically, I've been trying to get that data and cannot. I got a 400 KiB zip file with about 10 JSON files inside, the only relevant ones being listening history (timestamp and ms played per song). Support suggested there might be more data, and they even re-did the export, but nothing.


Why create a throwaway account just to post this?

I don’t think it’s a surprise to anyone that Spotify collects telemetry data on its users.

The main reason I (and many others) use Spotify is that they use this data to recommend new music that I will like.


Creating throwaways to post is a feature of HN. I didn't abuse it; I don't have an HN account.

It's not meant as slander. It's just to show others: here is something that's true that you might not have thought of, or have thought of but haven't seen in all its 250 MB glory. Decide for yourself.

Folks who wanna try something different might like Radio Paradise. It's human-curated music. It has wide platform support [1] but also regular stream URLs [2].

[1] https://radioparadise.com/listen/options [2] https://radioparadise.com/listen/stream-links


Because your music taste is essentially a unique fingerprint that could be used to track you in the future even across accounts or platforms, and the only way to escape would be to literally give up on all your favorite songs and listen to something totally unrelated.


So what? Your voice is a unique fingerprint. Your writing style is a unique fingerprint. Your gait is a unique fingerprint. Who you are friends with is a unique fingerprint.

Why should I be worried about a company knowing what music I listen to?


I haven't used Spotify in over a year so I don't know about this. What in Wrapped 2020 was so different from previous years that made it creepy for you?


> I'm convinced that it is now important to hold on to older appliances that work without internet access or data collection

Or run modern, up-to-date FOSS equivalents on machines you control.

I’ve migrated more and more services like that and I’m slowly but surely building my own “cloud”.


I understand why people don't constantly think about their digital footprint, but it's not that different from when they order the same thing at a stall for a month, and the clerk begins to ask "Do you want your usual?".


But the digital footprint is infinitely copyable, permanent, etc.





