Personal and social information of 1.2B people discovered in data leak (dataviper.io)
1439 points by bencollier49 on Nov 22, 2019 | 419 comments



I was at an Elasticsearch meetup yesterday where we had a good laugh about several similar scandals in Germany recently involving completely unprotected Elasticsearch running on a public IP address without a firewall (e.g. https://www.golem.de/news/elasticsearch-datenleak-bei-conrad..., in German). This beats any of that.

Out of the box it does not even bind to a public internet address. Somebody configured this to 'fix' that and then went on to make sure the thing was reachable from the public internet on a non standard port that on most OSes would require you to disable the firewall or open a port. The ES manual section for network settings is pretty clear about this with a nice warning at the top: "Never expose an unprotected node to the public internet."

Giving read access is one thing. I bet this thing also happily processes curl -X DELETE "http://<ip>:9200/*" (deletes all indices). Does it count as a data breach when somebody from the general public cleans up your mess like that?

In any case, Elasticsearch is a bit of a victim of its own success here and may need to act to protect users against their own stupidity, since clearly masses of people who arguably should not be making technical decisions now find it easy enough to fire up an Elasticsearch server and put some data in it (given the number of companies that seem to be getting caught with their pants down).

It's indeed really easy to set up. But setting it up properly still requires RTFMing, dismissing the warning above, and having some clue about what IP addresses and ports are and why having a database with full read/write access on a public IP and port is a spectacularly bad idea.
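To make the point about bind addresses concrete, here is a minimal Python sketch that classifies an address by how far it can be reached. The function name `classify_bind_address` and the sample addresses are made up for illustration; it uses only the standard-library `ipaddress` module and says nothing about how Elasticsearch itself resolves its network settings.

```python
import ipaddress


def classify_bind_address(addr: str) -> str:
    """Roughly classify a bind address by how widely it is reachable."""
    ip = ipaddress.ip_address(addr)
    if ip.is_unspecified:
        # 0.0.0.0 or :: means "listen on every interface", public ones included.
        return "all-interfaces"
    if ip.is_loopback:
        # Only processes on the same machine can connect.
        return "loopback"
    if ip.is_private:
        # Reserved, non-routable ranges such as 10.0.0.0/8 or 192.168.0.0/16.
        return "private"
    # Anything else is assumed to be reachable from the internet.
    return "public"


for candidate in ("127.0.0.1", "0.0.0.0", "10.0.0.5", "8.8.8.8"):
    print(candidate, "->", classify_bind_address(candidate))
```

A service bound to anything other than the loopback class depends entirely on firewalls or authentication for protection, which is the situation the warning in the manual is about.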


I've been using ES off and on since before 1.0 came out. It has always baffled me that ES doesn't require a username and password by default.

ES is a database that has to exist on a network to be usable. Heck, it expects that you have multiple nodes, and will complain if you don't. So one of the first things you do is expose it to the network so you can use it.

Yes, it takes some serious incompetence to not realize you need to secure your network, but why in the world would you not add basic authentication into ES from the start? I'd never design a tool like a database without including authentication.

I am serious about my question. Could anyone clue me in?


It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only. Running things on a public ip address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).

If you are running mysql or postgres on a public ip address it would be equally stupid and irresponsible regardless of the useless default password that many people never change unless you also set up TLS properly (which would require knowing what you are doing with e.g. certificates). The security in those products is simply not designed for being exposed on a public ip address over a non TLS connection. Pretending otherwise would be a mistake. Having basic authentication in Elasticsearch would be the pointless equivalent. Base64 (i.e. basic authentication over http) encoded plaintext passwords is not a form of security worth bothering with. Which is why they never did this. It would be a false sense of security.
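As a rough illustration of the Base64 point, the Python sketch below builds an HTTP Basic `Authorization` header value and then recovers the credentials from it without any key. The helper names and the `alice` / `s3cret` credentials are invented for the example.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def recover_credentials(header_value: str) -> tuple:
    """Decode the header again, as anyone sniffing plain HTTP could."""
    encoded = header_value.split(" ", 1)[1]
    decoded = base64.b64decode(encoded).decode("utf-8")
    # The username cannot contain ':', so split on the first one only.
    username, _, password = decoded.partition(":")
    return username, password


header = basic_auth_header("alice", "s3cret")  # made-up credentials
print(header)
print(recover_credentials(header))
```

Base64 is an encoding, not encryption, so without TLS the password travels in what is effectively plaintext.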

At some point you just have to call out people for being utter morons. The blame is on them, 100%. The only deficiency here is with their poor decision making. Going "meh http, public IP, no password, what could possibly go wrong?! let's just upload the entirety of LinkedIn to that." That level of incompetence, negligence, and indifference is inexcusable. I bet MS/LinkedIn is considering legal action against the individuals and companies involved. IMHO they'd be well within their rights to sue these people into bankruptcy.


Software should be secure by default. Don't blame the user.

mySQL in comparison won't even let you install without setting a root password. And it only listens on localhost/unix-socket by default. Then you need to explicitly add another user if you want to allow it to log in from a non-local IP. I don't think it's even possible to both set a blank root password and allow it to log in from a public IP.

So you really think the solution is to blame some low level worker, and sue him/her? The blame should always be on the people in charge, usually the CEO, who set the bar for engineering practices, proper training, etc, or the lack of.


While I don't think blaming labor is constructive or ethical, it seems like most tools pose danger to users in proportion to utility. For example, cars can squish people, electricity can fry people, and power tools can remove limbs.

Typically, people start out using knives and bicycles as children, learn through experience that crashing and getting cut hurt, and carry those lessons forward when they start using tablesaws and cars later in life. How does this apply to elasticsearch? I have no idea.


We could teach our children that software is very dangerous, especially databases. Or we could make software secure by default. But we also need to teach the user how to use the software properly. Learning by getting hurt is effective, but then we also need to have playgrounds.


That MySQL stuff is all quite recent... up until 5.7 (?, one of the most recent releases, anyway) there's no root password by default and running `mysql_secure_installation` is a common (but not mandatory) step to, well, secure the installation and set a root password. I think MariaDB still works this way? Not sure.

I'm not aware of "bind to localhost" being the default, either. The skip-networking setting to only allow local socket connections is definitely not the default, and I'm pretty sure the default is still to bind to all interfaces.


I installed mySQL a couple of months ago on an Ubuntu server, and got asked to set a root password. I've also installed mySQL many times on Windows. Secure install is the default. And it doesn't annoy me a bit. I like my software to be secure by default.


This is ridiculous.

Software should be built in the best method of delivering maximum value to its users. A trade-off for usability can be made for certain cases like ease-of-use for new software. Redis was part of this a while ago http://antirez.com/news/96.

Engineers should know their tools before using them. It's a huge part of our jobs. You could introduce a ton of other vulnerabilities in software: XSS, SQL injections, insecure cryptography. Security is part of our job and matters we must know.

You don't blame a plane for a pilot mistake that was meant to be part of his training. Engineers in every other sector are responsible for their mistakes, we should be too.

Also, you don't sue the worker, you sue the company.


"Software should be built in the best method of delivering maximum value to its users."

Yes, and defaulting to insecure, thus repeatedly causing huge data breaches, is the exact opposite of delivering maximum value to users. It's delivering maximum liability.


I would argue that the single command to begin using the application and the ease of on boarding / querying data was a huge factor in expanding its usage. Elastic optimized for initial spin-up and getting things running fast. It works really well! Until you load it full of data on a public IP, that is.


That single command to spin up the application can easily generate and show a copyable random secret required to use it, so that you can use easily but there's no option to use it that insecurely.
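A minimal Python sketch of that idea, using the standard-library `secrets` module: generate a random secret once at first start, print it for the operator to copy, and compare presented secrets in constant time. The function names are hypothetical, and this is not how any particular product does it.

```python
import secrets


def generate_bootstrap_secret(num_bytes: int = 32) -> str:
    """Create a URL-safe random secret to show once at first start."""
    return secrets.token_urlsafe(num_bytes)


def check_secret(expected: str, presented: str) -> bool:
    """Compare secrets in constant time to avoid leaking timing information."""
    return secrets.compare_digest(expected.encode("utf-8"), presented.encode("utf-8"))


bootstrap_secret = generate_bootstrap_secret()
print("Copy this secret; it is required for every request:", bootstrap_secret)
```

The quick-start stays a single command, but an instance that ends up on a public IP is no longer open to whoever finds the port.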


Onions. You need layers and defense in depth. Because even the best humans make mistakes and it is inhuman to assume perfectionism. Never rely on just one engineering feature.


> You don't blame a plane for a pilot mistake that was meant to be part of his training

Did you miss that Boeing is right now risking bankruptcy for doing exactly this?


Honestly a lot of the problem is: people aren’t studying systems engineering OR security. Look at all the “learn to code in 21 days” BS and all the code academies.

There’s so much emphasis on abstracting away the systems with cloud-this and elastic-that and developers don’t know much about general systems engineering.

My recommendation to software developers: take the Network+ and Security+ exams at the bare minimum.

Honestly as much as people complain about process getting in the way of things, there should be checks and balances at any business that deals with personal information. Finance institutions are heavily regulated—these fkers should be held accountable.


> "Engineers"

Maybe the hint is right there in your comment. Nearly all the people deploying these nodes aren't engineers in the slightest, despite someone having given them such a title.


It's not always engineers that use them.

Sometimes software managers have the sudden need to show statistics and other things.

Yeah, that was fun...


If security is so important, why should we accept database developers who don't understand that?


Because... they dance the devops dance with their devop hats on! Security problems can be swiftly danced around until they actually surface, and can then be handled in the next round of "continuous delivery". It's also smart to postpone solving most issues until after they occur, so sales can continue bragging about "continuous improvement".


So, after some thought, here's why I don't consider it pointless to have basic auth built in.

It would keep ES from being completely open. If you wanted to get in, you'd have to compromise some part of the network that would let you read the username and password.

The way it is now, anyone can do a scan for port 9200 and get full access right away.

It is also important to have a username and password, even on secured networks. My test instance is on an internal network, and protected by both network and host firewalls, but I still make sure to secure it beyond that.

Basic auth would not provide a false sense of security. It is simply a very basic part of overall security. Not having it is a mistake.


> At some point you just have to call out people for being utter morons. The blame is on them, 100%. [...]

Your attitude is a symptom of a broader issue that plagues this industry: Indifference to risk*probability. If you don't ship software with "secure defaults" (depending on the threat/attack model), you essentially are handing out loaded shotguns, then blaming the "dumb" user when they inevitably point it at their foot and click the trigger. Easy solution: Don't hand out the gun loaded -- make the user do specific actions that enable the usage. Yeah, it creates some friction to first time deployment, but that's a secondary concern to having your freaking DB leaking all over the place.


But ES doesn't hand over a loaded gun. Someone went out of their way to load the gun up.


Bullshit.

If firing up a piece of software creates an unauthenticated, unprotected (non-TLS) endpoint to read-write data, that's a loaded gun. That is PRECISELY the default behavior of ES.

ES has jacked around for years by making TLS and other standard security features premium. To that, I say this: Screw ES and their bullshit business model. Their business model is a leading cause to dumbasses dumping extremely sensitive PII data into a DB that is unprotected - those same folks aren't going to go the extra mile to secure the DB, either by licensing or 3rd party bolt-ons.

Thus, why it must be shipped secure by default. Anything less is a professional felony, in my eyes. Also, screw ES again, in case I wasn't clear.


Is it a secondary concern, though? As a startup, uptake is as vital as oxygen


Tort law is going to catch up to software soon enough and people will be held accountable for negligently creating or deploying software that they should have known would cause harm.

The fact that someone else down the chain should have known better is not a perfect defense. If that misuse was foreseeable and you didn’t do enough to prevent or discourage it, then you can still be held liable.


If startups prioritize their growth over the good of society, isn't the logical conclusion that startups are a threat to society?


They're not a startup.


maybe. but there's always this....

http://www.team.net/mjb/hawg.html


There's something called defense in depth.

Even with ES deployed in an environment with proper network firewall rules...etc, I'd still want some sort of authentication/RBAC


"Defense in depth" sounds, to me, like a phrase to justify multiple layers of imperfect security.

A single layer of cloth might not hold water, adding more layers of cloth may hold water for longer, but it's probably more cost effective to start with the right material.


> "Defense in depth" sounds, to me, like a phrase to justify multiple layers of imperfect security.

That’s absolutely correct! But you seem to be missing the fact that _all_ layers of security are always imperfect.


This is a fallacy of distributed systems. Never trust the network. Best case, you get packets destined for somewhere else; worst case, your segmented network wasn't actually segmented.


i agree with GP here. ES is to blame here. not long ago apache airflow had a similar vulnerability discovered about not having sensible authentication defaults. the reasoning on their mailing list was eerily similar to those defending ES here. same arguments (iirc)

history is our greatest teacher. i think ES will end up doing what that team did: they agreed to provide sensible & secure defaults.


Security in depth. If I compromise one part of your network, I shouldn't compromise it all.


PostgreSQL does the following things by default to prevent this:

    1. Only listen to localhost and unix sockets
    2. Not generate any default passwords
So the only way to connect to a default configured fresh installation of PostgreSQL is via UNIX sockets as the postgres unix user. Where PostgreSQL is lacking is that it is a bit more work than it should be to use SSL.


> It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only.

Have you ever heard of the end-to-end principle, IPv6, or number 4 of the eight fallacies? http://nighthacks.com/jag/res/Fallacies.html


> It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only. Running things on a public ip address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).

I've met at least one cloud provider in the past (small Dutch thing) that provides _only_ public IP addresses. They do have customers, though one less now. Clustering over the public Internet is a thing. It shouldn't be, but I could say the same thing about this website and yet here we are.


Heroku does the same in non-enterprise tiers. Their databases are accessible by the public internet with no option to limit it to your own dynos.


Well, let's agree it's a sad thing. Very sad.


Oh sure, but sad things happen. And they can be even messier: I had a Jenkins instance "made" public because a sysadmin new to a hosting provider forgot to remove the public IP that gets automatically assigned to new things. We were lucky, being fairly sure nothing found it before I realised, but it was a strong lesson learned:

Any network may become public by accident unless you go to great lengths to make sure it doesn't. Configurations change and mistakes are made even by seasoned people. People bring devices. Unless there's an air gap, people's devices may be hacked and let stuff through. Put authentication and anti-CSRF on _all_ your stuff, always.


> Clustering over the public internet is not a thing with Elasticsearch

It is, sort of, https://www.elastic.co/guide/en/elasticsearch/reference/curr...

But it's not a feature you'd be using without a really good reason IMO.


That does give me some food for thought. Not sure I agree a username and password is pointless though.


>Having basic authentication in Elasticsearch would be the pointless equivalent.

Instead of that they could implement a PAKE. That would provide security with no certificates.


Honestly, I as a user don't give a shit what a good engineer should do. All I see is that my personal data gets leaked left and right by elasticsearch and not mysql or postgres. But its fanbois just keep shifting blame instead of reflecting about reality and going "hey yeah maybe we should try to do something about it on our end". So fuck ES.


I agree. Every anti-moronic default adds friction. I love that I can play with ES quickly via simple URL without any auth.


That's how we got PHP, Javascript, Visual Basic, MySQL (before version 5), Mongo.

You'd think that at some point we'd understand that there's way more morons out there than sensible people.


It can still bind to localhost or a local socket without auth.


> It has always baffled me that ES doesn't require a username and password by default.

because auth was a part of their paid service (and by paid i mean 'very goddamned expensive') until like half a year ago, when they made it free because of Amazon's freshly emerged Open Distro free auth plugin


They offer security as a paid feature.


Actually it comes for free now with the standard ES distribution. https://www.elastic.co/blog/security-for-elasticsearch-is-no...


>Security for Elasticsearch is now free

What a horrific title. Even simply typing that should have been a blinking neon sign to them that they had their priorities in the wrong order.


That's incorrect.

The usual way of using this service is to have backend network configured that connects your services that is not available from outside (ie you have to traverse through services to reach it).

The so called "security" is just a paid feature for companies that want to use ElasticSearch but want to use it in "legacy" way because, presumably, they don't have people to design it correctly.


That's still really insecure, because it means that as soon as someone manages to gain any access to that network or any of the services on that network has a security issue your database is wide open.

That means that if someone manages to get access to the. I'd say public internet with proper (encrypted) password auth is more secure than that.


If attacker has access to app server it is already game over. App server typically already has access to all of the data.

The pods are akin to localhost networking where there is only one externally available application with multiple networked components.


That's true, but there are usually multiple ways to compromise protected networks. You still need to protect the database against attacks that don't go through the app server.


If an attacker gets a hold of your app server, they will be able to get the connection details for that DB, including the username/password.

Having a password adds a small layer of protection to databases that the affected app wasn't meant to connect to.

It adds some protection in that case, but the user should use best judgement if it's worth doing.


If you set up elasticsearch on a cloud service like AWS, by default your firewall will prevent the outside world from interacting with it, and no authentication is really necessary. If you do use authentication, you probably wouldn't want username+password, you would probably want it to hook into your AWS role manager thing. So to me, username+password seems useful, but it isn't going to be one of the top two most common authentication schemes, so it seems reasonable that it should not be the default.

MongoDB also by default does not have username+password authentication turned on.

I think defaulting to username+password is a relic of the pre-cloud era, and nowadays is not optimal.


I don't see why, though. It's much safer to start with a secure setup and then have the user disable the security explicitly (hopefully knowing what they're doing). Yes, username/password auth is not that common, but isn't it better than having no auth at all?


Ok, let's say username/password is mandatory and enabled by default. I see two options.

Option one, they generate a unique password for every installation. That is non-trivial to do, because at which point do you do it? It can't be before a cluster is formed, as you'll have a split brain generating a bunch of credentials. If you do it afterwards, then there is a period of time when your cluster is not yet protected. Worse yet, unprotected and handshaking authentication. So you don't do that.

You could make the user input the credentials. What is to prevent them from creating weak credentials? And worse, they have to do that for every node (or at least the masters). Not a good experience and lost credentials will probably be the subject of a good many support calls.

So most products don't do that. What they do is default passwords. Which is arguably no security at all and doesn't protect anything. It may make it just a tiny bit easier to do the right thing afterwards (by changing to better credentials). Still, there's a period of time while the cluster is unprotected (default credentials are as good as no credentials).

Authentication does little to protect against the sort of people who are exposing databases to the public. If it is easily disabled, then they will be doing just that. Because they are already doing that by forcing databases to bind to publicly accessible interfaces.


I'd say option two is the only one viable. You deny access to the service until credentials are set by the user. You print huge warning labels while the credentials are set by the user to remind them of the possible consequences of setting weak credentials.

Yes, lost credentials will be subject of many support calls. Then, it boils down to your priorities. If you care about minimizing support calls, then sure, leave everything open to everyone. It will surely result in fewer access problems.

On the other hand, if your motivation is actually preventing your end-users from doing stupid things, it makes sense to just do the most conservative thing as default. Let the user change to the more liberal option, but not before informing them of all dangers that might befall them in that case.

I refuse to believe in this narrative of the end-user just being a stupid automaton who does not have any agency, and that any default imposed upon them will just result in them overriding the default with their terrible practices and ideas. I think there is a possibility of education and risk reduction.
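The "deny access until credentials are set" behaviour can be sketched in a few lines of Python. Everything here is hypothetical: the class name, the status strings, and the 12-character minimum are arbitrary choices for illustration, and a real clustered database would also have to solve the credential-distribution problem raised in the parent comment.

```python
import hashlib
import hmac
import os


class LockedUntilConfigured:
    """Toy service that refuses every request until credentials exist."""

    MIN_PASSWORD_LENGTH = 12  # arbitrary threshold, chosen only for illustration

    def __init__(self):
        self._username = None
        self._salt = None
        self._password_hash = None

    @staticmethod
    def _hash(password, salt):
        # Store a salted PBKDF2 hash rather than the password itself.
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def set_credentials(self, username, password):
        if len(password) < self.MIN_PASSWORD_LENGTH:
            raise ValueError("password too short")
        self._salt = os.urandom(16)
        self._username = username
        self._password_hash = self._hash(password, self._salt)

    def handle_request(self, username, password):
        if self._password_hash is None:
            # Conservative default: nothing is served before setup.
            return "503 credentials not configured"
        candidate = self._hash(password, self._salt)
        if username == self._username and hmac.compare_digest(candidate, self._password_hash):
            return "200 ok"
        return "401 unauthorized"
```

The trade-off is the one described above: more friction and more support calls at first start, in exchange for there being no window in which the service answers anonymous requests.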


I'd argue that the "pre-cloud" era is still going strong. And that is a good thing. My workplace has its own data center. There are some downsides, but I prefer it.

So username+password really is needed. And should be included by default.

Also, I'd expect the same of something like MongoDB. That it doesn't have that by default is just baffling.


Password auth over HTTP is horrible. Short of binding a public IP address to your instance, basic auth without HTTPS setup is probably the worst thing you can do.


It's a marketing ploy by ES.

They aggregated the data and published it so that the viral breach would spread their name around because all publicity is good publicity.

Just riffing of course.


This addresses entirely the wrong question. By looking at it as a technical problem you're completely missing the broader ethical problem. Why was anyone allowed by law to amass this amount of data? And why did PDS not take the security and privacy concerns of 1.2 billion people seriously enough to ensure the data was handled correctly? They obviously thought it was valuable enough to amass a huge database. Do they sell this to just anyone? If not, who can buy access to this data? How much does it cost, and what steps are involved in doing so?

This makes me want to talk to a lawyer.


> Out of the box it does not even bind to a public internet address.

Bind to all interfaces used to be the default in 1.x - it changed pretty much because people were footgunning themselves.

Coupled with lack of security in the base/free distribution, that made for a dangerous pitfall. At least now security is finally part of the free offering, but the OSS version still comes with no access control at all.


You typically use these in pods which share networking but are not available from outside.

It doesn't matter then if you bind it to 0.0.0.0.


At the time it was common to deploy on bare hosts. Deploying ES into a network namespace isn't even the most common use case today.


That still puts you a single firewall mistake away from disaster. It also places a lot of trust into the applications and hosts that can access ES on a network level: They get full access with no control at all.

To add to that: No security also means no TLS, neither in the cluster communication nor when speaking to the client, etc.


I've come across several such ES instances that are 100% exposed to the world without even trying, and ES is by no means the first tool to have this problem. People are never going to stop doing this. Making it annoyingly difficult within ES just weakens them such that some other "wow it's so easy" search product will be better positioned to eat their lunch.


ES, Mongo, Redis used to be some of the easiest targets for production data (security vuln wise). They were usually deployed by SWEs, as early versions of the products, and didn't have access control by default.


ES's practice of making its security a proprietary paid for product is the cause for these kinds of things. It's a shitty practice, and this is one of the reasons I'm glad AWS forked it.


Other databases learned that not requiring a user/password upon install is completely irresponsible. ES and other dbs need to catch up ASAP, it's ridiculous.

Documentation is not security. If you need to "RTFM" to not be in an ownable state it's ES's fault.


Trusting software you install to be secure is ridiculous and completely irresponsible, especially if you did not pay for someone else to take the blame.

The only thing you can do to secure your software is to restrict its communication channels. Once you've secured the communication channels, the software auth is decorative at best.


That doesn't absolve ES of providing basic security defaults.


Wasn't this exact same thing a huge scandal just a few years ago for Mongo on Shodan?

I can't believe anyone shipping a datastore could let it happen after that. Doesn't postgresql still limit the default listen_addresses to local connections only? Seems like the best approach. On a distributed store, consistency operations between nodes should go on a different channel than queries and should be allowed on a node-by-node basis at worst. At least at that point, it requires someone who should know better to make it open to the world. Even when just listening for local connections, passwordless auth should never be a default.


Yes, and similar issues still exist with public MongoDB instances even though the defaults are secure.


This assumes it was incompetence and not done intentionally.

My understanding is neither company is owning this data set and there is an assumption that it is a third company that has either legally or illegally obtained the data and is using it for their own services.

Another option is that the data was exfiltrated by a loose group of people who wanted this to be freely available on a random ip. Know the ip, get sick access to a trove of PII. No logins, no accounts, no trace.

Welcome to the early 90s internet.


> It's indeed really easy to setup. But setting it up properly still requires RTFMing, dismissing the warning above

I would bet that in a lot of cases, people that configure their servers like in the OP just don’t read the official docs at all.

Stack Overflow, Quora, etc. are great places to get answers, because of the huge amount of questions that have already been asked and answered there.

But when people rely solely on SO, Quora, blog posts and other secondary, tertiary, ..., nth-ary sources of information, Bad Stuff will result, because of all the information that is left unsaid on Q&A sites and in blog posts. (Which is fine on its own – the problem is when the reader is ignorant about the unsaid knowledge.)

> and having some clue about what ip addresses and ports are and why having a database with full read write access on a public ip & port is a spectacularly bad idea.

Again, not necessarily, for the same reason as above.

But even if they did, it is a sad fact that a lot of people dismiss concerns over security with the kinds of “counter-arguments” that I am sure we are all too familiar with. :(

Thankfully though, we are beginning to see a shift in legislation being oriented towards protecting the privacy of people whose data is stored by companies.

Ideally, the fines should cause businesses to go bankrupt if they severely mishandle data about people. Realistically that is not what happens. For the most part they will get but a slap on the wrist. But it’s a start.

Companies that can’t handle data securely, have no business handling data at all.


My favourite was Bitomat.pl's loss of 17k bitcoins in 2011 because they restarted their EC2 instance.

I understand that the "ephemeral" nature of EC2 was in the documentation, but ESL speakers may have glossed over the significance of a word they didn't fully comprehend.

https://siliconangle.com/2011/08/01/third-largest-bitcoin-ex...


Not to say this is what people are doing, but I don't think it requires much knowledge to run under Docker, and it's pretty easy to expose it to the public internet that way.


Incompetence and indifference will be the ruin of us all.

This is just another symptom of the Principal-agent problem writ large.


It's a tragedy that all of this data was available to anyone in a public database instead of.... checks notes... available to anyone who was willing to sign up for a free account that allowed them 1,000 queries.

It seems like PDL's core business model is irresponsible regarding their stewardship of the data they've harvested.


If you're in Europe or California, I suggest sending both companies an erasure request: https://yourdigitalrights.org/?company=peopledatalabs.com https://yourdigitalrights.org/?company=oxydata.io

Disclaimer: I'm one of the creators of yourdigitalrights.org.


Can I use this on behalf of my @company users HIBP has just emailed me about?


This is great. Thanks


Would it be better if this was a paid service? If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.


> If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.

Outlawing the collection of data would be hard and is unlikely to work, but the fact that companies like AT&T are allowed to sell your data, as they did with OP's (where else would that unused phone number come from), is an angle new legislation can use.

The EU now already has a piece of legislation aimed at stifling these practices. The US and other economies just need to follow suit.


I'm more thinking that not all data is equal. We really treat it like it is, at least from the public perspective (it clearly isn't from the perspective of those gathering data, but there's a clear disparity in how these groups view things). Some data is actually necessary to give up to have a well functioning internet (what browser you're using) and some data is not (canvas fingerprinting). There's a tough question here because the people making the decision of what data to be used is not us. It is the websites we visit. I would argue that there is no consent being given here and all is assumed to be "common consent" (which I'm using as a lack for better terms. Things like that if you walk out in public people can see you. But conversely, someone can't run up to you and measure your height with a tape measure). There has to be some balance here. What that is, I don't know. But really the only people that can figure that out are us computer nerds who at least kinda understand these things. We have to be having these discussions, or else it becomes "fuck silicon valley" (a conversation that is becoming national). So if we don't think about these things, then we clearly live in a bubble and bubbles burst. If we do think about these things, maybe we don't live in a bubble.


I was recently told how private detectives from a national agency would actually go door-to-door (over a minimal area) under the pretext of AT&T store / sales employees. They’d try to convince their target (and some incidental neighbors as cover) to switch their bundled services to AT&T.

The private agents were armed with the latest available discounts (which you could find for yourself if you tried). But their skills made them particularly more successful than a typical front-line sales employee.

The catch? It wasn’t a scam, and they really were trying to get their targets to switch. It seems that AT&T was more willing to sell consumer data than the general public is aware of. Converting their targets to AT&T granted their agency access to additional data, which they then passed on to their clients. And the target gets a discount, too. Win-Win-Win? :)


It seems like that is starting to happen with California's new data privacy law. I'm starting to get a lot of privacy policy update emails like I did when GDPR took effect.


That is OP's point.


I found a vulnerability in LinkedIn a few years back that allowed anyone to access a private profile (because client side validation was enough for them I guess..?)

They didn't take my report seriously (still not completely patched) and I feel like that told me all I needed to know about their security practices.


I reported an issue to the LinkedIn competitor https://about.me two years ago where signing in with my Google credentials gives me access to the account of some random other person with a similar name to me. I think that during registration, I attempted to register about.me/johnradio (except it's not "johnradio"), but he was already using it, and then the bug occurred that gave me this access.

I randomly check every 6 months or so and yep, still not fixed.


My gmail is my first initial followed by my last name. There are other people on this planet with same first initial and last name, some of whom seem to think that must be their email too, because I keep on getting emails where they used it to sign up for things.


I had a lady send me a zip file that contained a VPN client, certificate and a word document with usernames and passwords to the VPN and a number of industrial control systems at the factory she was a manager of.

She sent it religiously, every 90 days.


Every few months I get scans of X-rays from random clients' teeth from some dentist in South America. I've tried so many times to respond and/or unsubscribe but never hear anything back.


Do you have any clue who she thought you were?


Oh yes, she was emailing a copy of her stuff to “herself”.


Seriously?

How the hell could she think that your email address was hers? I mean, wouldn't she notice that she never got the messages?


Totally serious. There are about a dozen people who regularly do this. One guy has missed 4-5 job interviews.


So is it typos? Like one letter off?

I can imagine someone mistyping an address, and then reusing the "to" link.


I faced the same problem (though my name is not at all very common). Banks, mobile companies never did anything even after I repeatedly told them on phone and Twitter (and have kept a record of it).

One day, after I had received a person's bank and mobile statements and many other bills for a few months, I decided to call him (his number was easily visible in many emails) and inform him of his mistake. He turned out to be a lawyer and he said he would "decide" what to do about it. And the next thing I know, he sent a carefully drafted email (as a legal notice) saying that I should hand over my email address to him without further delay and all that.

I didn't do that. I talked to a lawyer friend and he just told me to reply with a "G F Y" card. I didn't do that either. But that pushed me to finally move my emails to my personal domain as it was/is a Gmail account and if someone complained Google would have just terminated my account and I don't know anyone who works at Google.


That lawyer sounds like a douchebag. I super agree with your point too: I'm also slowly moving all my emails to my personal domain and it feels liberating.


I get several on a weekly basis. It's amazing how many services do not verify emails and just trust their users to own the email they claim to own.


It’s a common “growth hack” to postpone email verification.


Even more baffling are the ones who use it to fill out job applications.


I get bank statements, job offers, party invitations, and lately a bunch of, let's say, very questionable email verifications from euro 'dating' sites. I've identified the guy in the UK but it's too much (and getting embarrassing now) to keep forwarding his stuff to him.

Downside of getting in early on popular email services.


I went through several rounds of conversation with somebody's wedding planner over email.


> but its too much (and getting embarrassing now) to keep forwarding his stuff to him

What amazes me is when I get misaddressed email, and I reply to say it's misaddressed (and I'm not talking about automated services, I'm talking about obviously manually sent stuff), and my reply just gets ignored and the misaddressed email just keeps on coming.


Somebody keeps phoning me and leaving messages. They don't answer their own phone (or messages clearly). I even have a sarky voicemail now, you'd think they'd notice. Nope!

Lady, whoever you think is going to be at that funeral isn't getting that message.

I've no idea if they'll get disconnected now as I've blocked their number. Hope so maybe they'll notice then.


That's the most surreal, when you try to fix it and the behavior never changes.


My gmail is two initials and last name, so theoretically less susceptible to such errors. Yet I get misaddressed mail all the time—and a surprising amount of it is job applications!


Trust me, I used my full first name, it's not enough to stop these people. One is a UK doctor, one is a US teacher, and I think there are one or two more. Been sent a few baby pictures from their relatives too.


This happened to me and I keep getting the guy's notifications on instagram and all. So annoying!



I actually had a similar thing happen with facebook, though we didn't share names.


For a while, our Comcast billing account accessed some other person’s account. Comcast didn’t take it seriously, and just told us to create a new account and not use the old one. (!!!)

We had full access. I could have signed this person up for the most expensive package, or even canceled their service.


Let's be realistic here. Everyone knows it's not possible to cancel Comcast service.


I managed to cancel my dad's after he died. They STILL tried to upsell me! One of my favorite phrases ever uttered: "He's dead, you asshole, he doesn't need more channels!" And that actually did it. Felt sorry for the salesperson, who didn't have much of a choice in the matter...


Surely by making it difficult to cancel they’re really just making it easier for people to get discounts. If I were a Comcast customer I’d be calling up to cancel every few months.


He's dead, he doesn't need discounts.


Obviously. Which is why I used a plural—I was referring to Comcast’s overall customer base.


Nice one. However, I cancelled in person a couple years ago (because I had equipment to return).

The first thing I said at the counter was "I know it's really hard to cancel Comcast, and I'm not going to accept anything but a cancel."

The girl at the counter smiled and said "We know ..." and immediately cancelled my account.


"Ah yes, cancelling requires a call because of security. A feature for the user!"


To be fair, the internet would have been equally outraged if there weren't such a requirement, because sure as hell somebody would have found an exploit and cancelled a bunch of accounts, just for funzies.


That sounds like white hat hacking from all I've heard of Comcast...

Maybe that's how we drive their customer count and revenue down and put them out of business.


I signed up for a disposable Gmail account using my real name at one point, and accepted the randomly suggested address it offered. Gmail loaded with someone else's obviously in use mailbox

IIRC I logged out again and back in, same thing, my credentials worked. Went back to it a few days later and the password no longer worked


Hash collisions most likely.


Have heard this so many times about Gmail...

How have they not resolved this?


I think it's like EC2 instance IDs. When they first came up with it, they never thought there would be literally billions of unique email addresses/EC2 instances eventually.


I can only imagine about.me mass-creating profiles for names found on other web pages, and opening a way for someone to "claim" those profile with a matching Google account sign-in.

About.me's business model was quite unsettling to me and they have made little to no effort to protect the user data from scrapers.


I had a similar experience. In 2014 I reported an issue where you could take over someone's account by adding an email you control to it and having them complete the flow by sending them a link (which, unless they looked very carefully, looked exactly like the regular log-in flow at the time - especially if they used a public email service and you registered a similar-looking account).

I tried it on a friend and it worked, but LinkedIn's response was basically "meh".

My life has only gotten better since I deleted LinkedIn a few years ago. I know I'm in a privileged position to be able to do that, but I strongly recommend everyone here consider whether what they gain from their account is worth the crap and spam they have to put up with.


LI is terrible if you actually try to use it, but it's harmless enough if you just use it as a profile hosting service, where people are likely to look. I just auto-archive their emails and only visit the site a couple of times per year.


While not good, what's the connection to this story?

The article says some LinkedIn data was scraped, but I don't see anywhere that it specifically says a LinkedIn security flaw was used in the scraping. Although it is vague about what data was scraped and how, so it doesn't preclude that either.

In other words, are you saying a LinkedIn vulnerability was exploited here, or suggesting that it probably was, or are you just mentioning LinkedIn because it's tangentially related?


I signed up for an API key to see what they have on me, and the data it returned looks awfully close to what I have on LinkedIn.


A few years of heads up is sufficient to disclose publicly. Full disclosure helps keep companies honest about security.


I deleted my linkedin a few years back when they had some bug where I would randomly get page views as some other person, with all their connections and account details and whatnot. It would only last a few minutes then switch me back to my account, but they aggressively ignored my attempts to reach out to them about this bug so I just gave up.


[flagged]


Could you please stop posting unsubstantive comments to Hacker News? We're trying for a bit better than internet default here.


No it is not.


The number in the HN headline was changed from 1.2 billion to 1 billion (despite the original source's headline saying 1.2). It is kind of amazing that leaking the personal data of 200 million people is now just a rounding error that can be dropped from headlines.


Imho, it's more impressive that it's basically a non-story outside of IT security news.


The general public just shrugs upon hearing such news. They still think there is nothing dangerous if their data gets leaked.


I think the solution here is laws which require anonymity, and that includes in banking (where it will never happen).

That is because a couple days ago, I got a text message from tmobile (which seemed genuine) basically saying that my account was one of a larger subset of prepaid phone accounts which had been compromised and that my personal information had been potentially taken by "hackers".

To which I got a good chuckle, because tmobile is one of the few phone companies that will let you create completely anonymous prepaid accounts using cash and without filling out any information. AKA you buy a sim card for $$$ and that is it. So, basically the only information they lost of mine as far as I can tell, is the phone number and type of phone I'm using (which they gather from their network). If they got the "meta" data about usage/location/etc that would have been different but it didn't sound like the hacker got that far.

Had this been a post-paid account they would have my name/address/SSN/etc.


Do you think it’s reasonable to believe your name / address / SSN / DOB / etc is already out there?

I’m of the opinion it’s too late for prevention and we need, instead, mitigation.


Exactly. The very reason for existence of the two companies, pdl and oxy, is to tie n pieces of data with m pieces of data.

So depending on how the "anonymous" phone number was used, it's plausible that the number can be connected with other PII.

In fact I wonder if there is any such thing as non-PII, given the existence of such companies.


Companies need to stop treating knowledge of this information as proof that you are who you say you are. I would have no problem publicly posting my name, social security number, birthday, mother's maiden name, etc., if not for the fact that someone can actually use this information to open a bank account or take out a loan in my name. It's ridiculous that this is all it takes in most cases.


> Companies need to stop treating knowledge of this information as proof that you are who you say you are.

If we assume that isn't happening in the very immediate future due to the latency of introducing new legislation...

Do we have any other options to protect ourselves?

I've personally worked myself into a bad credit rating. I have a home loan and a credit card, but any new credit applications auto-reject. Not the ideal scenario though!


> Analysis of the “Oxy” database revealed an almost complete scrape of LinkedIn data, including recruiter information.

"Oxy" most likely stands for Oxylabs[1], a data mining service by Tesonet[2], which is a parent company of NordVPN.

It is probably safe to assume, that LinkedIn was scraped using a residential proxy network, since Oxylabs offers "32M+ 100% anonymous proxies from all around the globe with zero IP blocking".

[1] https://oxylabs.io/

[2] https://litigation.maxval-ip.com/Litigation/DetailView?CaseI...


The article says it is "Company 2: OxyData.Io (OXY)"* (http://oxydata.io)


OxyData and OxyLabs seem to be sister companies[1]: the former sells data as a product, the latter sells scraping as a service.

[1] https://vpnscam.com/wp-content/uploads/2018/08/2018-08-24-09...


Tesonet is true cancer. I am amazed how unethical (and successful) they are.

Knowing how quickly it's expanding, are the employees just as unethical, or do they not connect the dots (company got too big)?

I hate fb et al. as much as anyone here, but most people know that "if it's free - you are the product". Though with NordVPN users are paying money and are getting stabbed in the back.


> are the employees just as unethical

Most people's ethics are easily bought. Does working for a company that operates with questionable integrity outweigh providing a stable income for your family?

Remember Facebook is still a very highly desirable company to work at.


> NordVPN users are paying money and are getting stabbed in the back.

could you please expand on this claim?


From the comment they replied to: https://vpnscam.com/


"My name is Ripoff Reporter." For all that their schtick is about how they're "educating" the public about how shady VPN services are this could be anyone, including a front for a VPN service that isn't mentioned on the site.


How is that possible? LinkedIn blocked mining the data this way several years ago.

Is it still possible if you pay LinkedIn enough? Or is this old data?


It is strictly impossible to "block mining data" on the public web. Double that if the miner has free access to a pool of residential IPs.

[source: experience]


A large number of residential proxies and fake LinkedIn accounts would look the same to LinkedIn as normal browsing.


There's information in the leak that wouldn't be widely available without accessing LinkedIn data using their APIs. Phone numbers and emails, for example.


The article mentions it is a blend of data from http://oxydata.io/ and https://www.peopledatalabs.com/

Both are aggregators that get data from many sources, correlate them, and sell it. The phone numbers and emails could have come from anywhere.

See this screenshot from PeopleDataLabs: https://d1ennknj6q36vm.cloudfront.net/images/cblead.png


I'm a nordvpn user. Practices like this scare me though. I guess it's time to switch to a new vpn?



Ah... but that is very inconvenient :( I guess comfort comes at a cost.

Is there at least a less shady provider if I would like to compromise myself but a bit less than nordvpn? How far do we go in assuming all are bad?


Mullvad seems trustworthy (I used to share an office with one of their IT infrastructure staff), but it is impossible to say for sure.


You could set up your own VPN on a server you run.


Yes. This. And it's free to set up on big cloud services. Like free 24/7 with whatever amount of data. Guides are online.


All the way. It isn’t as if all VPN providers are part of a shadowy cabal to steal your data from an otherwise valuable service; the very premise of commercial VPNs is flawed. Any VPN service is inherently harmful.


Out of curiosity how do you guys think they managed to scrape LinkedIn on such a large scale?

I've been wanting to do some social graph experimentation on it (small scale - say 1000 people near me) but concluded I probably couldn't scrape enough via raw scraping without freaking out their anti-scraping. (And API is a non-starter since that basically says everything is verboten).


I've crawled a popular social network on a large scale, and I'm currently doing the same for dating services as a hobby. God, I wish I still got paid for webscraping.

Here are some tricks which may or may not work today:

- Have an app where user logs in through said website, then scrape their friends using this user's token. That way you get exponential leverage on the number of API calls you can make, with just a handful of users.

- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

- Scrape the mobile website. Even Facebook still has a non-js mobile version. This single WAP/mobile website defeats every anti-scraping measure they may have.

- From a purely practical perspective, start with a baremetal transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql", those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in hurry.

- Don't be too kind on the big websites. They can afford to keep all their data in hot pages, and as one person you will never exhaust them.
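
For anyone on the receiving end wondering what a "proper, ipv6 subnet-based rate limiter" might look like, here is a minimal Python sketch. It assumes a /64 is the right unit to bucket IPv6 clients by (a common allocation size, but an assumption on my part), and the names `rate_limit_key` and `SlidingWindowLimiter` are made up for illustration. The point is only that rotating addresses inside one subnet should land in the same bucket.

```python
import ipaddress
import time
from collections import defaultdict, deque
from typing import Optional


def rate_limit_key(addr: str, v6_prefix: int = 64) -> str:
    """Return the bucket key for a client address.

    IPv4 clients are keyed by their full address. IPv6 clients are keyed by
    their enclosing network (assumed /64), so cycling through addresses in
    the same subnet does not create fresh buckets.
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)
    # strict=False lets us pass a host address and get its network back.
    network = ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False)
    return str(network)


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per bucket key."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, addr: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        recent = self.hits[rate_limit_key(addr)]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True
```

A limiter keyed on the full 128-bit address would treat every address in a client's /64 as a brand new visitor, which is the gap the tip above relies on.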


> - Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

Nice tip!!

> - From a purely practical perspective, start with a baremetal transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql", those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in hurry.

Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).


>Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).

Sure, it will work, but I personally don't like Elasticsearch for anything high-intensity because of its HTTP REST API and the overhead it carries. Take a look at Cassandra's [1] "CQL binary protocol": it's simple and always on point.

[1] https://github.com/apache/cassandra/blob/trunk/doc/native_pr...


You forgot the part about exposing your finished database to unprotected elasticsearch http endpoint ;)

In all seriousness does anyone know why you can even host an elasticsearch database as http and without credentials? Seems to be the default. What is the use case for this?


Tbh I'm still selling that data.

For a while I've had reoccurring nightmares that my DB had been stolen and published together with an article on how stupid and incompetent I am.


If I've understood you right, you break the TOS on other websites to collect users' personal info, and then you have nightmares about people taking that data from you? Doesn't that raise ethical concerns in your eyes?


>You forgot the part about exposing your finished database to unprotected elasticsearch http endpoint ;)

I'll cut straight to the chase and post it on hn. This intermediate step of waiting for someone to discover it takes too long


The use case is in a local datacenter, with a NAT-ed IP not exposed to the main web


A firewalled IP would be much more appropriate, and NAT is not a firewall or a security mechanism.


Same thing, more-or-less. And NAT is effectively a firewall for inbound traffic, even if a lot of people say it isn't.


> Have an app where user logs in through said website, then scrape their friends using this user's token.

That's some extremely shady thing to do.


Welcome to the internet!


> Don't be too kind on the big websites.

I usually recommend latency-based dynamic load control for that. Once the website starts to reply 500-1000ms longer than the average one-thread latency, it is time to take a bit of it back. It is also a co-operative strategy between fellow scrapers, even if they don't know about the other ones pushing larger load on the servers.
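
A rough Python sketch of that kind of latency-based load control, under my own assumptions: the baseline is the measured one-thread latency, the 500ms threshold is the low end of the range above, and the doubling/decay factors and the `LatencyThrottle` name are arbitrary choices for illustration, not a tuned policy.

```python
class LatencyThrottle:
    """Adjust the pause between requests based on observed response latency."""

    def __init__(self, baseline_ms: float, slowdown_ms: float = 500.0,
                 min_delay: float = 1.0, max_delay: float = 60.0):
        self.baseline_ms = baseline_ms    # average one-thread latency
        self.slowdown_ms = slowdown_ms    # extra latency that signals overload
        self.min_delay = min_delay        # seconds; never go faster than this
        self.max_delay = max_delay        # seconds; cap on the back-off
        self.delay = min_delay

    def record(self, latency_ms: float) -> float:
        """Feed in one observed latency; return the delay before the next request."""
        if latency_ms > self.baseline_ms + self.slowdown_ms:
            # The site is answering much slower than usual: back off hard.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Latency looks normal: creep back toward the minimum delay.
            self.delay = max(self.delay * 0.9, self.min_delay)
        return self.delay
```

The caller would sleep for the returned number of seconds before sending the next request. Because the signal is the server's own latency, several independent clients doing this end up backing off together without coordinating.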


1000ms is a massive slowdown when revenue-noticeable impacts are far, far smaller. I don't know the legality, but hitting a site hard enough to cause 1000ms slowdowns seems like it's approaching DOS legality issues.


Don't you consider this unethical -- if not against the site itself, then against the other users of the site whose data you're scraping?


Wow these are some hot tips!

YMMV, and cloud providers would hate you for this, but you can automate the IP rotation with a cloud provider that bills you by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then another in Singapore for the second hour.

Pretending to be Googlebot also helps.


>- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

Clever. VMs with IPV6 are cheap as a bonus :)

Same for non-js mobile. Thanks for the tips


- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

How would someone do that using node.js? Asking for a friend.


So far, the answers have contained non-technical answers like "Distributed Scraping." Well, yes, obviously.

A more useful answer is: I did this once, many years ago. Back then it was a matter of hooking up PhantomJS and making sure your user string was set correctly. Since PhantomJS was – I think – essentially the same as what headless chrome is today, the server can't determine that you're running a headless browser.

Now, it's not so easy nowadays to do that. There are mechanisms to detect whether the client is in headless mode. But most websites don't implement advanced detection and countermeasures. And in the ideal case, you can't really detect that someone is doing automated scraping. Imagine a VM that's literally running chrome, and the script is set up to interact with the VM using nothing but mouse movements and keyboard presses. You could even throw in some AI to the mix: record some real mouse movements and keyboard presses over time, then hook up some AI to your script such that it generates movements and keyboard presses that are impossible to distinguish from real human inputs. Such a system would be almost impossible to differentiate vs your real users.

The other piece of the puzzle is user accounts. You often have to have "aged" user accounts. For example, if you tried to scrape LinkedIn using your own account, it wouldn't matter if you were using 500 IPs. They would probably notice.

It's hard to counter a determined scraper.


I wrote a chrome headless framework that types using semi-realistic key presses (timing, mistakes, corrections) and does semi-realistic scrolling / swiping and clicking / tapping.

It's not very hard to get something that would be too hard for almost every website beside Google and Facebook to bother with. If it's a 1 on a 0-9 scale in difficulty, most websites just don't have the resources to detect it

It took me like ~3 hours to write it, but I guarantee it would take months for someone to detect it, and even then, they'd have a lot of false positives and negatives.


I think there's also a lot of bot-detection-as-a-service around here that can be used by sites smaller than Google and Facebook, like WhiteOps or IAS anti-fraud.


These are highly questionable under GDPR, many of them rely on tracking users wherever they go (e.g. Recaptcha is known for this).


> These are highly questionable under GDPR

How many fines has GDPR resulted in?


Not many yet, general consensus is to first warn and get companies to implement better compliance - only those who really openly shit on GDPR get the fines.


then release it!

Headless chrome cat and mouse game is a lot of fun. We need more players.


LinkedIn's protection doesn't seem to be that sophisticated at the moment. Someone I know maintains ~weekly up-to-date profiles of a few million users via a headless scraper that uses ~10 different premium accounts and a very low number of different IPs.


That is a violation of ToS (using registered accounts for scraping) and could carry potential legal implications.


So is leaking PII? ToS isn't a legal contract: it's not signed by anyone and it's changed every other week without consent of users. ToS is just a formal excuse why someone's account may be suspended.


As long as you are able to source more than one provider, this can work well enough. If you're dependent on a single data source, e.g., because that source is the only possible source of said data, you'll get nuked from orbit by legal rather than technical means.

I had a business that was generating more money than my full-time job for a while. We helped and greatly simplified matters for several thousand independent proprietors while having a positive effect on the load of the data source, since we were able to batch/coalesce requests, make better use of caches, and take notification responsibilities on ourselves.
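
To illustrate how coalescing plus caching can lower the load on a source, here is a minimal Python sketch. Everything in it is an assumption for illustration (the `CoalescingCache` name, the string keys, the TTL policy); it is not how our system actually worked, just the general shape of the idea: many clients ask for the same record, and only one upstream request goes out per TTL window.

```python
import time
from typing import Callable, Dict, Tuple


class CoalescingCache:
    """Serve repeated lookups from a TTL cache so the upstream is hit once per key."""

    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch            # function that queries the upstream source
        self.ttl = ttl                # seconds a cached value stays fresh
        self.clock = clock            # injectable clock, handy for testing
        self.entries: Dict[str, Tuple[float, str]] = {}
        self.upstream_calls = 0       # how many requests actually went upstream

    def get(self, key: str) -> str:
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]          # fresh enough: no upstream traffic
        self.upstream_calls += 1
        value = self.fetch(key)
        self.entries[key] = (now, value)
        return value
```

With a thousand users polling the same key inside one TTL window, `upstream_calls` stays at 1, versus a thousand requests if each of them hit the source directly.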

Once in a while someone would get worried and grumpy at the data source and there were a couple of cat-and-mouse games, but we easily outwitted their scraping detection each time. When they got tired of losing the technical game, they sent out the lawyers, which was far more effective. We were acquiring facts about dates and times from the place that issued/decided those dates and times, so there wasn't really any reliable alternative data source, and we had to shut down.

The glimmer of hope on the horizon is LinkedIn v. HiQ, which seems poised to potentially finally overturn 4 decades of anti-scraping case law, but not holding my breath too hard there.


The US courts decided that scraping is legal, even if against EULA:

> In a long-awaited decision in hiQ Labs, Inc. v. LinkedIn Corp., the Ninth Circuit Court of Appeals ruled that automated scraping of publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). This is an important clarification of the CFAA’s scope, which should provide some relief to the wide variety of researchers, journalists, and companies who have had reason to fear cease and desist letters threatening liability simply for accessing publicly available information in a way that publishers object to. It’s a major win for research and innovation, which will hopefully pave the way for courts and Congress to further curb abuse of the CFAA.

https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...


That is a blatant misrepresentation of that decision. That decision was upholding a lower court's preliminary injunction that prevents LinkedIn from blocking hiQ while the main case between the two is litigated. It is not a final decision and it doesn't purport to say that scraping is legal (it even points out other laws besides the CFAA that might be used to prohibit scraping.)


LinkedIn Sales Navigator is a paid tool which allows you to search their whole database. Then depending on how much you pay you can get all their personal details (Email address, phone number, even their address sometimes.) https://business.linkedin.com/sales-solutions/sales-navigato...


I've always been a little confused how this works. If I got all that info for free, it's a "data leak", but if I pay to get the same detailed personal information it's...

In either case my personal data is given away without my consent, but there's this implication that it's only an issue when someone doesn't pay for it.


You're right, my take on this is that a company scraped a bunch of publicly available information, that people left open (consciously or not.) That's why only a subset have phone numbers. The profile URLs, emails, most people don't even try to protect those.

Normally the company sells this data, but now they've given it away. It's not good this data got out because the curation has some value to spammers or whoever. But using the word "leak" here undermines the severity of a real leak where passwords and social security numbers are exposed. Data that was never meant by anyone to be open.

Everyone likely has (technically) provided consent for every piece of information here being shared with partners. Buried in fine print that it wasn't really expected they'd read, of course. It's the cost of being online, and that sucks, but it seems only a leak of what had already been given out.


> In either case my personal data is given away without my consent

You gave that consent when you put your info in Linkedin in the first place, according to their ToS.


I think everyone is confused. Everyone just wants their slice of the pie (aka $$$).


If you get drivers' info by hacking a DMV database, it's prison. If you got the same details by paying a few million for FOIA requests, you're a good citizen and a model tax payer.


Unless you're the state of Florida, and you make millions by selling the DMV database to private buyers... [0]

[0] https://www.abcactionnews.com/news/local-news/i-team-investi...


Jokes aside, can you really file FOIA requests to get personal driver details from DMV? I thought FOIA would only apply for stuff that is meant to be public, but isn't due to difficulties of hosting, putting it up, etc.

Mind you, I didn't research the topic of what can or cannot be requested with FOIA, so I might be totally wrong.


LinkedIn gives away email id and phone number (even if you had given just for 2FA) to all your contacts. I checked PDL, it has all the information from LinkedIn except for phone number, which I promptly removed once I identified the 2FA issue (now TOTP is available).


'Mobile Proxies' like https://oxylabs.io/mobile-proxies (no affiliation) allow you to use large pools of mobile or domestic IPs to scrape. It's expensive, but not prohibitively so. Once you've got a mobile IP you become incredibly hard to throttle, since you're behind a mobile NAT gateway.


You probably have to be highly distributed. At least that’s what I did when I tried to scrape a large site some years ago. I had around 100 machines in different countries and gave each of them random pages to scrape.


Distributed bot and scraper networks. Thousands of IPs geographically dispersed throughout the world. There is only so much you can do with rate limiting.


They asked about LinkedIn, where the content is gated behind a login. If it was a rate limiting problem, that would be trivial.

Needing to be logged in as the same user defeats the purpose of proxying to hide your physical origin.

Registering thousands of different users to use in a distributed way is hard now that they require a text message verification for new accounts.


Public LinkedIn profiles (which is many of them) are open to scrapers and they lost a court case about it.

https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...


I go to LinkedIn without being logged in and nearly always get a login gate instead of the profile.

They were ordered to unblock hiQ specifically, they were not ordered to open up content to scrapers generally.

They can still throttle high volume traffic and put up captchas. I think the only specific thing the court ordered was for them to unblock hiQ IP ranges.


Proxies can also work well for cheaper than buying distributed compute.


Scraping LinkedIn is so common you can usually hire people with years of experience in it. It is not as complicated as you might think. There are at minimum hundreds of companies that sell LinkedIn data they have scraped.


You use a proxy botnet and route your scraping requests through that. Use something like hola proxy or crawlera for example.


I scraped 10 million records from linkedin a few years ago from a single ip by using their search function. I got a list of the top 1000 first names and top 1000 last names and wrote a script to query all combinations and scrape the results.

This may or may not still work.
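
For anyone wondering what "all combinations" means in practice, here is a minimal sketch of just the enumeration step. The name lists are hypothetical three- and two-item stand-ins for the real top-1000 lists, and nothing here talks to any site; it only shows how quickly the query count grows.

```python
from itertools import product

# Hypothetical stand-ins for the "top 1000" first/last name lists.
first_names = ["james", "mary", "john"]
last_names = ["smith", "johnson"]


def name_queries(firsts, lasts):
    """Yield one 'first last' search string per (first, last) pair."""
    for first, last in product(firsts, lasts):
        yield f"{first} {last}"


queries = list(name_queries(first_names, last_names))
print(len(queries))  # 3 * 2 = 6 combinations for the toy lists
print(1000 * 1000)   # the full lists would give 1,000,000 queries
```

With 1,000 names on each side that is a million searches, so 10 million records works out to roughly ten results per query on average.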


It looks like the purpose was data enrichment, so maybe it was pieced together over time from multiple sources. My linkedin from PDL only had 1 bit of wrong info. I wasn't able to find anything on my personal email addresses which is good.


I once worked on a project that tried to do just that, but at the time the LinkedIn API was already limited to seeing the authenticated user's connections' connections, which was too limited for what we wanted to do. I can only imagine it got worse. It's also the reason recruiters really want to connect with you on LinkedIn: even if you are not interested, your connections might be.


A very large distributed network of machines.


Hey - not related to your comment (apologies) but wanted to get in touch . You left a note on a previous post of mine about wanting to simplify FTP. I'd love to work on this project and wanted to see if you'd be willing to connect so I can understand the problem better. Feel free to email me at kunal@mightydash.com, and thanks in advance!


People data labs's data is pretty accurate. Here is mine: https://api.peopledatalabs.com/v4/person?api_key=9c6a1382204...

You can try it for yourself by changing the email. All of the information is public, so I don't mind. They are basically doing data integration.


Haha, when I was a kid and scared to use my real name for things, for some reason I used my email... which had my real name in it, to open a Github account with a fake name

So the api knows me as the famous architect, Art Vandelay


Reminds me of when I used to get free magazine subscriptions (and the subsequent junk mail/robocalls) addressed to Santos L. Halper.


There is a way to get every developer’s email on github thanks to git commits adding it :))


In your github account you can add a new email address that doesn't even exist or have a valid TLD, like "name@mail.fake". Don't use it as your primary email and it won't require confirmation. You can now set your git user.email to this fake address and any commits you make will be attributed to your account without exposing your actual email address.


You can use yourgithubusername@users.noreply.github.com instead of adding a fake email, and your commits will still show up on your contribution graph and be linked to your username.


That must have been a long time ago, Boorish Bears.


Wow.. I checked with an email address I use for disposable purposes. The only thing they had on it was a blank LinkedIn profile -- meaning that LinkedIn cancer has trawled some pretty questionable sites, harvesting email addresses as placeholders for their accounts. WTF.


Ah, looks like everyone's using that API key, I got 2 queries for my addresses and got a "rate limit exceeded" message.

Strangely it only says I work in real estate (no I don't) when I looked up the email address I use for LinkedIn...


You, and others can use my api key, just signed up.

e75ac28b25480e60071b24d819d4692a0b315c037046b9ff6ec9dfb1e99a895c


Status 429, Rate limit error.


yours gone too now. very curious about this API lol


Try changing v4 to v3 in the URL.


Yup, that worked for me.

Indeed they do have a profile on me - a bare minimum, scraped from GitHub. That makes sense, since that's about the only social platform I use, aside from HN.

EDIT: My GMail address has the most amount of information gathered, which makes sense. It's gathered Facebook, LinkedIn, Pinterest, GitHub..

It lists my skills as: firefighting and emergency planning/management/services. I suppose, with a stretch of imagination..


Here's mine eaca37c25ca1a9c5d85efb8cbaf1742b4fbfeee0054d713961176ab9500c2f2b


It returned a 404 for my personal email account, so that appears to be sufficiently protected.

More surprisingly it had data such as my name, title and work email address which was connected to old work email account (Okta managed - GSuite) that I never associated with external services, and absolutely never used on a social networking site like LinkedIn.


That API key is now public, too! Rate limited.


Yeah no kidding. Though if you wait until it flips to a new minute and refresh, that helps. Though it takes all of a minute to register a free key, so probably no big deal.


Your API key is now permanently public. After a few days, people will still be able to use it for their own purposes.


a few days? it's already hit its limit :)


I'm actually a bit surprised at how little data they have on me. They've associated my main email with an old junk email, they've got my first and last name, and know that I'm male, but there's little more.


Nothing for most of my accounts, except one which somehow was falsely attributed to someone else. Odd given I do have a LinkedIn profile; Their scraping must be far from perfect.


Wait, so is this mostly just Linkedin data in JSON form?


My personal email seems to be based on Github and Gravatar, while my job search and work emails got linked together and appear to be based on LinkedIn.


This seems exceptionally unethical


Displaying public information publicly, or sharing your API key?


I would be really surprised if this were compliant with the GDPR. I live in the US but I tried email accounts of relatives in Europe and they had data in there.


It looks like it's a US-based company without enough of a European presence to fall under their jurisdiction.


https://gdpr.eu/companies-outside-of-europe/ it looks like it would? I'm no expert though.


Right, they can say it applies... but if a company does no business in Europe, how can a judgement be enforced?


> The whole point of the GDPR is to protect data belonging to EU citizens and residents. The law, therefore, applies to organizations that handle such data whether they are EU-based organizations or not, known as "extra-territorial effect."

They can say this all they want, but if you have no presence in the EU, and your jurisdiction does not have any agreement to apply GDPR regulations to you, then this is at most a strongly worded request.

Barring explicit agreements to the contrary (treaties, extradition agreements, etc), by definition a country's laws are only enforceable there.

If PDL has no business in Europe, no plans to expand there, and there's no treaty or other agreement making the provisions enforceable against them, the EU can say whatever it wants but PDL has no legal obligation to do anything about it.


One obvious answer in that case would be to establish who is buying the data from them and treat any PDL data as potentially tainted. If you find a downstream customer who does have a presence, then investigate accordingly. You might not be able to fine PDL directly, but you could certainly make the offending data risky or unprofitable...


Sure, but how do you propose doing that? Send another strongly worded letter to PDL demanding their customer list?


Usually you'd either track known errors in the dataset (implying that the companies had either bought it from PDL or copied the leak), or you'd ask the banks (who do have a presence) which accounts were paying them and who owned the accounts. If Bitcoin's involved at all, you assume there's something fishy going on and investigate accordingly.

(Assuming anyone were bothered enough to actually do this, of course.)


I’m also not an expert, but my understanding is that it applies but would be hard for the EU to take action against them


A law isn't a law if you can't enforce it, so "applies" has kind of a strange meaning in this context then, doesn't it?


A law always has a jurisdiction. EU laws generally don't apply to the US, even if the EU wants them to. There are exceptions, of course.


Theoretically, if it were egregious enough, the EU could say to the owners or management of the company that if they went to the EU they would be arrested. That’s enough of a threat that it might convince them.


Legal jurisdiction is a separate matter than the specific text of laws. The "this applies to non-European companies" things just means that if you fall under the jurisdiction of European courts, you can't absolve yourself of responsibility of complying with this law simply by being a foreign-registered company.

On the other hand, if you never fall under European jurisdiction in the first place, you're free to ignore them, just as you can ignore Thai laws against insulting their king. One very important thing to note is that setting foot in European soil will expose you to their jurisdiction, so you've significantly limited your freedom of movement, but if GDPR compliance is a bigger deal than that then "just never go to Europe" can be a viable strategy.


Oh yes, I'm going to try and see if they have data on me and send a number of GDPR requests if they do. For others from the EU, it's very easy to do using: https://www.mydatadoneright.eu/request


So... if the owner is known, it will be quite costly ;-)


It's no secret who is behind that website [1].

Good luck to the EU on enforcing their law against an American company, though.

[1] https://angel.co/company/peopledatalabs/people


I don't know how accurate the coordinates of your address in India are, but it's 5 minutes away from me. Small world, huh?


I'm glad they don't have jack shit on me besides my email, is there a list of their data source(s) ?


It should be illegal for any company to store my private information like this. The 'anonymous' sharing of my information is easily de-anonymized. Sites asking for your phone number for "security purposes" are a joke.

You just have to accept that absolutely everything you've done online is public information. If it isn't now, it is being stored and future tools / databases will make what is either difficult to access or difficult to interpret very easy to use in the future.


Using phone number as an example of private information is pretty hilarious. Remember when the phone company used to literally print your name and phone number in a book and send it to everyone in your town? Man, their security was terrible!

But it works perfectly fine as a two-factor auth mechanism to prove that whoever setup the account is the same person trying to log into it at some later time.


Birthday is commonly used to verify people despite the practice of broadcasting it to people on Facebook.


What private information? If you give a random website your email address or phone number, it's not private anymore and you're the one who released the secret. Unless they promised to keep it private in a legally binding way, in which case, your wish is already true.


Such a cavalier attitude to the storage of EC citizens' personal data is illegal under Article 32 of the GDPR: https://www.gdpr.org/regulation/article-32.html

Citizens of the US really need similar protections.


Firefox monitor can tell you if your information was leaked in data breaches. I don't think they have this data set though.

https://monitor.firefox.com/


Highly recommended. You can put in multiple email addresses, so you can help monitor your non-technical family members’ info as well.


At this point practically everything about me's available either for free or a few dollars. The only interesting thing left is whether a given password has been compromised. The answer to everything else is "yes, it's been leaked". Been that way for most of a decade at this point, guessing it's the same for most other folks with any modern digital or banking presence whatsoever.


I'm sure there are search engines for it too but I noticed that credit karma can tell you which of your passwords have been associated with your email addresses in data breaches.

Credit Karma is free but the CEO appears to be transparent in how they make money (recommending financial products to you based on what they see in your credit profile).


I am a very suspicious and wary internet user, hardly sign up for any services, but been using Credit Karma for my taxes and light financial monitoring for the last 3 years. Tax Filing was totally free and I got the tax refunds I was expecting. No issues with them whatsoever. I have never gotten any email or other spam as a result of using their service. I am a happy customer, though technically speaking I have never actually given them any money directly.


I agree. At first I got a few emails over a long period of time recommending financial products (credit cards, savings accounts, etc) but I unsubscribed and haven't seen any of those since. The only emails I get now are when something changes on my credit profile (new account, closed account, etc).


This looks like a wrapper around Have I Been Pwned.


That's precisely what it is. From[0]:

> Through our partnership with Troy Hunt’s “Have I Been Pwned,” your email address will be scanned against a database that serves as a library of data breaches. We’ll let you know if your email address and/or personal info was involved in a publicly known past data breach.

https://blog.mozilla.org/blog/2018/09/25/introducing-firefox...


It works with that service. They are pretty transparent about that in their documentation.


Mozilla are now reporting on this data set. Top marks for getting it online so quickly.


Does this cover more leaks than haveibeenpwned.com?


Maybe in the future it will, but it uses Have I Been Pwned. From the FAQ[0]:

How does Firefox Monitor know I was involved in these breaches?

Firefox Monitor gets its data breach information from a publicly searchable source, Have I Been Pwned. If you don’t want your email address to show up in this database, visit the opt-out page.

[0] https://support.mozilla.org/en-US/kb/firefox-monitor-faq#w_h...



IIRC, they use that.


1Password has a similar feature too.


chrome://flags/#password-leak-detection - same thing


can't I search for my phone number?


> 400 million+ phone numbers. 200 million+ US-based valid cell phone numbers.

Sounds like a nightmare in the making for those cell phone users and their carriers when those begin to get SIM jacked.


I think cold calling could be an even bigger nuisance. My DNS provider published my phone number by mistake on a whois when I registered a domain, I spotted it immediately and it was corrected within hours. Over a year later I still receive cold calls from India to sell me web services at least once or twice a week.

Imagine if you can match everyone’s position with a mobile phone, a dream for tele marketers, tailors, scammers, etc...


Is that all you need to SIM jack a phone? The phone number?


Yes and no. You need a phone number, but you still need to carry out a variation of an attack that replaces the SIM associated with that phone number. Sometimes this is carrier-specific. Sometimes it's trivial, sometimes it requires a menial amount of work, and in extreme cases you might have to access an actual network. Most of the time there is nothing stopping the attack if they have your personal information.


Yet another Elasticsearch server wide open. This is going to make the flurry of open mongodb servers look trivial.


I wouldn't be surprised if the starting point for this vulnerability wasn't ES, but Docker. Docker by default modifies iptables and if you hack together a system that uses both software running directly on the host and in containers, it's going to expose the forwarded containers to the Internet - which you might not be expecting, since a bind to localhost would be enough to expose a service. It's always a good idea to have a separate firewall running outside of your system - this is the one Docker can't fool.


I had this issue last week. I was pulling my hair out as to how my brand new Linode got hacked even though I had setup ufw within minutes.

It is downright ridiculous that this was ever approved as a default behavior.


They mentioned this is Google Cloud, which blocks almost all incoming ports by default. They had to have chosen to expose this through the project firewall, and not put in a source filter.


No. It's not Docker's fault you did not read the manual and exposed the ports wrong: you can bind the port to specific IPs for export, and that address should be 127.0.0.1.


I see where you're coming from, but I disagree. I believe that good software and abstractions should take little training to use - everything unintuitive is a design failure and should be fixed. "Reasonably secure" should be the implicit default, not something you need to explicitly add. E.g., it's better to force authentication and force the administrator to add an account than let everyone in by default. Or it's better to bind to 127.0.0.1 than to 0.0.0.0 by default, like most web servers built into frameworks I saw do.
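
To make the 127.0.0.1 vs 0.0.0.0 point concrete, here is a rough sketch using only Python's standard `socket` module. The helper name `open_listener` is made up for illustration; the idea is just that loopback is the default and you have to ask for anything wider.

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Open a TCP listener; defaults to loopback only.

    Passing host="0.0.0.0" would listen on every interface instead,
    which is the risky choice the caller has to make explicitly.
    port=0 asks the OS for any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


listener = open_listener()
bound_host, bound_port = listener.getsockname()
print(f"listening on {bound_host}:{bound_port}")
listener.close()
```

The Docker counterpart is publishing with an explicit host address, e.g. `-p 127.0.0.1:9200:9200` rather than `-p 9200:9200`, which publishes on all interfaces.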

Unfortunately, instead of good intuition, Docker is built on caveats, be it networking, storage, caching, image sharing, container/image distinction, authentication, deployment or building a cluster. Every subsystem I experimented with "works", but fails in weird ways in some situations. In my opinion, that means that Docker is a good idea, but has terrible UX/functionality/error handling. I kind of think the same way of Git.


Great point. Depressing how such large profile projects can have such insane defaults.


I believe Elasticsearch doesn't allow restricting access by requiring login unless you pay for the enterprise version, which is just straight up stupid.


The basic license is free now, so you can get basic authentication.

But I still wonder why that isn't part of the open source version, and why it isn't turned on by default....


They're everywhere. Just ask Shodan.


I remember there was some brouhaha a while back about how Shodan was able to discover services on IPv6 since the address space was so sparse. Apparently they were running enough of their own NTP servers to reliably map out lots of devices on IPv6.


Not being able to map ipv6 space is a myth. There's plenty of workarounds.


Such as? Not too familiar with the subject, would love to know how.


It's pretty old news these days, guessing it's mostly what's leaked out of private to public sector stuff, is probably just the beginning really.

Would suggest starting at arxiv. This is not a hidden field for the active and/or keen researcher.


> Shodan is a search engine that lets the user find specific types of computers connected to the internet using a variety of filters. Some have also described it as a search engine of service banners, which are metadata that the server sends back to the client.

Interesting.


Since I learned about Shodan, I'm convinced that the (subjective) increase in reported data breaches is just due to an increasing number of people looking through Shodan results, and doesn't have anything to do with any trends in security.

Security standards at any company have always been low, but now it's easy even for a layman to find leaked data.


Not sure about Google Cloud, but Elasticsearch on AWS doesn't support x-pack security. You can only secure your instance via IP restriction, otherwise you have to sign your requests, which is not always supported on Elasticsearch DSL libraries that are commonly used.


It's not hard to secure ES. If people need help, please just ask, happy to help.


This is why I lie about my birthdate by a couple of days on anything where it's not something like a medical record or where I am required to tell the truth for whatever reason. I also never provide my social security number unless it is required by law.


One of my coworkers generates a fake middle name for every service they sign up with. According to him, this serves as a unique identifier allowing them to determine when a service is selling their data to a third party (or data is being leaked).


Fastmail has subdomain addressing, so if your email is johndoe@example.com, you can use hn@johndoe.example.com to sign up for HN.

That way you'll know for sure who leaks your data, and nobody's going to strip it away like some services would strip away plus addressing (as in, johndoe+hn@example.com).
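
A small sketch of the two schemes, in case the difference isn't obvious. All three function names are invented for this example, and `strip_plus` is only a guess at how a service might normalise addresses, not anything a particular site is known to do.

```python
def subdomain_alias(address, service):
    """johndoe@example.com + 'hn' -> hn@johndoe.example.com"""
    local, domain = address.split("@", 1)
    return f"{service}@{local}.{domain}"


def plus_alias(address, service):
    """johndoe@example.com + 'hn' -> johndoe+hn@example.com"""
    local, domain = address.split("@", 1)
    return f"{local}+{service}@{domain}"


def strip_plus(address):
    """Drop everything from the first '+' in the local part (assumed normalisation)."""
    local, domain = address.split("@", 1)
    return f"{local.split('+', 1)[0]}@{domain}"


base = "johndoe@example.com"
print(strip_plus(plus_alias(base, "hn")))       # tag is gone, base address recovered
print(strip_plus(subdomain_alias(base, "hn")))  # unchanged, the tag survives
```

The plus tag can be removed mechanically without knowing anything about you, while the subdomain form just looks like an ordinary address on a different domain.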


I have excellent results with a subdomain. Even though PDL probably has a lot of data on me, they have (not yet?) been able to glue it to my primary mail address. That one only has my name, gender, github, country and name of my employer. They can't seem to map the remainder to anything else.


From what I could see the data returned on me was all derived from publicly available sources (eg: my "public" LinkedIn page, my public github page etc). Perhaps others have more but this looks more like an aggregator of public information than a breach of non-public information.

Having said that, I find these companies unspeakably evil - their intent is to make money by harming people (eroding their privacy by making otherwise private personal information easier to get, obviously a gold mine for identity thieves etc).


In retrospect, it would have been interesting to have a bunch of accounts each containing a unique "map trap", at all of the larger services. Then years later, when the aggregator/broker guys get hacked/sold/leaked, you'd have some picture of the genealogy involved.


The problem is that you often can’t find access to the actual “password” used in the breach. Does anyone know where I can see if it was an actual password or just some made up thing?


I was suggesting something different. Specifically open an account on every service as a tracking canary, say with the same email to help them tie them all together. But on each one, vary something slightly like phone. Then years later, when looking at a leaked aggregator entry, all the phones on the record should tell you all the places they bought/stole data from.
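
Roughly what I have in mind, as a toy sketch. The service names and the 555-01xx numbers are made up, and a real aggregator record would need far messier matching than an exact string comparison.

```python
# Hypothetical canary table: one unique phone number planted per service.
canaries = {
    "service-a": "555-0101",
    "service-b": "555-0102",
    "service-c": "555-0103",
}


def sources_for(leaked_phones):
    """Return the services whose planted number appears in a leaked record."""
    leaked = set(leaked_phones)
    return sorted(svc for svc, phone in canaries.items() if phone in leaked)


# A made-up aggregator entry that merged two of the planted numbers.
leaked_record = {"email": "canary@example.com", "phones": ["555-0103", "555-0101"]}
print(sources_for(leaked_record["phones"]))  # ['service-a', 'service-c']
```

Every planted number that shows up on the merged record points back to one place the data passed through.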


There was no password on the original ES instance it was open to the web.


I meant my password.


There’s a torrent going around


Do you know where I can find a torrent for this leak?


I don't think it's wise to add a magnet link here.

But as I recall looking for Breach Compilation may help finding the requisite gist on GitHub.


This is all scraped public social media data. No credentials or govt information. It's very easy to download or buy this data legally.


It appears to also contain information possibly acquired from other companies. For example, the author notes that he had attributed to him a phone number he had assigned to him by AT&T that he never used or shared.


Guarantee the fine print in the AT&T contract authorized them to share information with a third party.


Back in the day (maybe it is still this way) your landline was default "listed" and you had to pay a monthly fee to be an unlisted number. So AT&T most likely listed his number in some kind of phone book / directory.


Yeah, I'm wondering if this is all scraped public data or a breach of some kind. Are land line phone numbers published in a directory (like a phone book) in the USA?


Yeah. Email addresses? Phone numbers? All of that is practically public at this point anyway. This article is crying wolf. Someday there is going to be a massive credential compromise.


are you sure? how did you come to that conclusion. thanks for the info though, very glad to hear it.


Yes, there are dozens of these data enrichment companies. They scrape public sites and use browser extensions, SaaS tools, inbox addons, etc. They mix it together into profiles, and pretty much have the same dataset by now.

Clearbit is one of them and even a YC company.


Yep! And as someone who has worked with these data sets and worked on the scraping tools on services like LinkedIn, a lot of the data is outdated, incorrect, or mixing together different entities with the same name into one person or splitting the same person into separate entities incorrectly.


Look at the personal record that was in the article. It looks like aggregated public information. And look at what the companies referenced in the DB do.

It's possible there's someone selling them some not-quite-public info, too, but it's probably more like phone numbers and less like private messages on Facebook or LinkedIn.

The title reads like data from 1.2B profiles was leaked by Facebook and Linkedin, but this looks like scraping public profiles from them.


I mean, this is literally just a leak of data that People Data Labs is selling to anyone who signs up to their service. The 'leak' is just bypassing their payment requirements, so by definition all the data leaked is available for purchase.


Genuinely hope somebody goes to prison for this, but not gonna hold my breath.


This data is accessible at small scales just by registering for a free api key at People Data Labs and making a GET request, and if you want more robust access you could just pay PDL for it.


Sorry, I should have been clearer, I'm talking about whoever is responsible for leaving it completely open to the public internet.


I mean it is INTENTIONALLY exposed to the public... the only mistake is they are giving it away instead of charging for it. If you don't like it when they give out all the information for free, it doesn't make it better if they charge money.


The only person harmed directly seems to be PDL since they may find it harder to charge subscriptions for bulk access to the data.

I am not sure why this stuff being online in bulk is so much worse than being online behind a paywall that someone should actually go to jail for it.


Depending on the countries the data is hosted in and the attacker lives in, it's unclear any law has been broken that would land a person in jail.

If PDL had a flaw in their implementation that allowed someone to scrape them (or they didn't and someone did the hard work of creating 1.2 million fake accounts to register for 1,000 free API calls), it might be an uphill battle to prove even "unauthorized access."
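
The 1.2 million figure is just the record count divided by the free quota. A quick check, assuming one lookup per record and the 1,000 free lookups per account quoted from the article:

```python
import math

records = 1_200_000_000          # people records reported in the leak
free_lookups_per_account = 1_000  # free tier quoted in the article

accounts_needed = math.ceil(records / free_lookups_per_account)
print(accounts_needed)  # 1200000
```

That quota is per month, so fewer accounts would do if someone were willing to spread the scraping over a longer period.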


Where can i download the data?


LinkedIn is the last social media membership I have. I’ve been mulling over whether to delete my account because I’m not sure how it will look to prospective employers.



Thank you for writing this. Much like the fear you expressed, I'm going to delete my account as soon as I lock in my next job.


Good thing I just updated my LinkedIn profile. Wouldn’t want hackers to think I have gaps in my resume.


I've gotten some strange spam phone calls this last week, including like 3 from Egypt. Wonder if this is why.


Probably unrelated. These security researchers found this open database, it doesn't necessarily mean someone else found it.


I guess it's time to start leaking billions of records of junk data to pollute the waters.


There's estimated to be 4.4 billion internet users in 2019, so this is over 25% of people on the internet.


Maybe I am missing something here, but I do not really see the scandal with the "leak", and I rather think the term is misleading in this context.

What happened?

As far as I understand, there are companies who search the web for public data of people like me, without my consent.

Then they sell that data. Also without my consent.

So that data was available anyway, almost for free. If this data contained sensitive information, then I would see this business practice as a scandal.

But the mere fact that all this data which was gathered without consent is now available for free because of a possible db misconfiguration .. is not a scandal to me.

And a leak is usually when a company loses sensitive data of its customers, who expected that data to remain confidential, like emails. Not what happened here. Feels more like PR.


I don't know about other people, but I have zero personal info with LinkedIn and Facebook.

The only info they have about me is info I don't mind being public. If I want something to be private I don't tell it to them. It's as simple as that.

Google on the other hand, knows lots of private things.


Through shadow profiles, third-party submissions, cross-site cookie tracking, and integration of offline data records, this almost certainly is absolutely false.

Unless you've directly pursued all legal (or otherwise) mechanisms to ascertain this directly, the best you can say is that you're unaware of any information that's been acquired, and that you didn't knowingly or intentionally contribute any yourself.

The article here describes precisely this practice, in its fourth paragraph and following, in the section titled "Data Enrichment":

For a very low price, data enrichment companies allow you to take a single piece of information on a person (such as a name or email address), and expand (or enrich) that user profile to include hundreds of additional new data points of information. As seen with the Exactis data breach, collected information on a single person can include information such as household sizes, finances and income, political and religious preferences, and even a person’s preferred social activities.

Please let's put this canard to rest.


Facebook has a lot of personal information about you even if you have never had a Facebook account. For example: your GPS location data, approximate age, gender, ethnicity....

Welcome to the future komrade. Sadly, it's not a matter of just "not giving them" your location data. Your devices supply it.


And your friends too. I dutifully kept a new number out of FB until a friend messaged me with, is this your number right? Xxx-xxx-xxx. They can also tag you and auto tag you through face recognition.


Cyber alarmists would call a telephone directory: 'A verified threat incident'. Yet these are the same companies selling OSINT data. These alarmist groups need to put down the buzzwords, step off from their white horse and take a look at the hypocrite in the mirror.

If you use social networks, you don't have a reasonable expectation of privacy. You've published your data publicly. If you want to keep this information private, then don't publish it on the Internet.

From: http://www.dmlp.org/legal-guide/publication-private-facts

>2. Private Fact: The fact or facts disclosed must be private, and not generally known.


> In order to test whether or not the data belonged to PDL, we created a free account on their website which provides users with 1,000 free people lookups per month.

Well that's very generous of them. Now I know what I'm gonna do next.


Is there a way for an individual to check if he's in the dataset? I am curious about what kind of data they'd have aggregated around me.


When this type of leak happens, where does this data actually appear? On the dark web? Who has access to this and how does one get it?


Seems like the ball is with Google at the moment, the exposed data is on their GCP servers. So, they can figure out next steps.


Imagine the equivalent in another industry:

“Hello, Bank of America? There’s an ATM machine of yours that’s spitting out cocaine.

Yes, I understand that it’s probably not your cocaine and that’s not your business, but don’t you think you should maybe shut it down?”


But would you call VendingMachinesCo because there is a vending machine outside the local supermarket, operated by said supermarket, that spits out cocaine? Pretty sure that whatever you put in there is the machine owner's responsibility, not the manufacturer. GCP does not put content in their VPSes themselves the way that a bank operates an ATM.

I think it's more like the responsibility of an ISP to poke their noses in what they transfer, since it might be illegal content (similar to whether Google should poke their noses into people's VPSes). I'm not sure if we should want to require them to do that.


Get a court order. No infrastructure company on its own should be making value judgements about what it hosts.


Why are there people running anything publicly accessible?

If you are running on the cloud, there is no need for any VMs to have any public IPs at all. Exception for your Bastion host, and even that should be restricted to known networks.

All incoming traffic needs a layer of indirection. On cloud providers that's usually their load balancers.
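To illustrate the point, here is a minimal sketch of a pre-deploy check that flags a bind address which could be reachable from the internet. The function name `is_publicly_exposed` and the sample addresses are assumptions made up for this example, and a private address can still be exposed through NAT or port forwarding, so treat it as a first sanity check only.

```python
import ipaddress


def is_publicly_exposed(bind_host: str) -> bool:
    """Return True if binding to bind_host could make a service internet-reachable.

    Hypothetical helper for illustration; it only inspects the address itself.
    """
    addr = ipaddress.ip_address(bind_host)
    # 0.0.0.0 (or ::) means "listen on every interface", including any public one.
    if addr.is_unspecified:
        return True
    # Loopback and private ranges are not globally routable; everything that
    # the standard library classifies as global is treated as exposed.
    return addr.is_global


if __name__ == "__main__":
    # Sample addresses chosen for the example only.
    for host in ["127.0.0.1", "10.0.0.5", "0.0.0.0", "8.8.8.8"]:
        print(host, is_publicly_exposed(host))
```

A check like this would not replace a firewall or authentication; it only catches the "bound to everything" mistake before it ships.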


I wonder whether FB/LinkedIn can manipulate the timing of negative news like this, for strategic reasons...


Facebook/LinkedIn are not implicated in the breach at all; it was some random third-party data enrichment service. The Facebook/LinkedIn in the title refers to the fact that people's FB/LI accounts were among the fields in the database. So were their GitHub accounts, and basically any other public-facing account that these scrapers can gather.


ES is the new Mongo. If you make software this easy to use, then people with little or no experience are going to use it. Just have secure defaults, like authentication, how many times do we have to learn this lesson...


Isn't it creepy that People Data Labs, "a data aggregator and enrichment company", collected data on 1.2 billion people?

Isn't it exactly what GDPR came to prevent? Are there no Europeans among this group?


Welp, time to change all my passwords, maiden names, and friendships.


The IP address in question does not seem to be working at this time. Clearly whoever runs the server has shut off access. I wonder if someone managed to save a data dump somewhere?


Unless we go after every customer who used the services of PDL, nothing is going to change. We will see a $3 fine per individual after 1 or 2 years of talking about this.


When will people start going to prison for stuff like this?


For what? For scraping public data?


When there's a law against it and there's evidence of breaking that law.


"including close to 260 million in the US."

So basically, _everyone_ in the USA minus those not online. And I bet this will go unreported by the mainstream media.


Is there any way to see what data they had on me?


"1B" is a surprisingly bad abbreviation here, considering its resemblance to a much ... less impressive number.


Ugh. To whomever is currently wasting their time and effort on differential privacy, take a good long look.


Why? Interested in why you think differential privacy would make any difference... The fault here seems to be an open ES server.


That is precisely my point. Differential privacy would NOT make any difference, and I was pointing the many folks who are working on it to the much simpler issues that are in fact being encountered in the field. This past IEEE S&P had quite a few theoretical privacy talks.


Is this public facing information that's been crawled, collected, and categorized?


Is it illegal to download/scrape data from a wide open database like this one?


I just had a look at “my” data on this and it is almost hilariously wrong.


Where can we look up our data?


To check if you're affected use haveibeenpwned.com


Does anyone know how we can search the data to find info about our (more than likely) entries in this database? Or did they simply find it but not release the info?


Not the aggregate data set, but one of the two data sources (People Data Labs) offers free access for under 1,000 searches per month.


Does this mean we don't need to do a census any more?


It would be a shame if someone corrupted these ES indexes.


>According to their website, the PDL application can be used to search: Over 1.5 Billion unique people, including close to 260 million in the US. Over 1 billion personal email addresses. Work email for 70%+ decision makers in the US, UK, and Canada. Over 420 million Linkedin urls. Over 1 billion facebook urls and ids. 400 million+ phone numbers. 200 million+ US-based valid cell phone numbers.

Too bad there aren't any laws regulating this sort of private data aggregation and sale. Well, besides GDPR (which apparently isn't enforced) and CCPA (which won't be enforced either.)


Let me make sure I understand: If I take gigabytes of “enriched” personal information and make it available to the public for free, then I’m an irresponsible, idiotic, incompetent buffoon. But if I put a paywall in front of it and sell that same data for a fair price, then I’m a business genius?

Seems to me that if the data is legally acquired and can be legally distributed, doing so at a cost of zero does not constitute a data leak. It may be bad business, but since when is that a crime?


that IP:9200 address is down, any mirrors?


Where can I download the leaked data?


Data Enrichment Companies. Marketing speak for highly vulnerable privacy eradication service.

Vote #1 for some sort of global GDPR where these businesses are no longer profitable.


Can I do a GDPR request for the data about myself? How?


It's weird because for oxydata you have to contact their sales team... but peopledatalabs has an opt out form.

https://www.peopledatalabs.com/opt-out-form

People Data Labs privacy policy: 3. ACCESS TO AND CONTROL OVER INFORMATION A person may do any of the following at any time by contacting People Data Labs at support@peopledatalabs.com. People Data Labs will reply to a person’s request within five business days.

A. Access any information we have on them, if any.

B. Change, correct, or delete any information we have them, if any.

C. Express any concerns about People Data Labs using their information.

People Data Labs' team will act swiftly upon a person’s email request to change, correct, provide, delete, or explain anything a person query.

People Data Labs understands if a person would like to opt out of People Data Labs' database. Opting out will stop all data sharing and enriching of all PII in People Data Labs servers for that person. Click here, if you would like to opt-out, or choose to have all data about you removed from People Data Labs' database.

For https://www.oxydata.io/: Review and changes to your information Contact us at sales@oxydata.io to find out what information we have collected about you, and to request any changes to or deletion of it.


I want someone to start an opt-out service, where I send them $20, and they send a book of names by registered mail for opt-outs every month.

An online opt-out system is too easy for them. I want each one to get a phone-book sized list of opt-outs every month.

And the same for data requests. Someone that curates the data collectors, and sends them requests every month.

Do you know which country’s “do not call” list I want to be on? All of them! Get my number on the AU list, the UK list, the DE list...

Let’s crash the system.


Would be a great idea and I bet it could be successful, but maybe at a lower price point and using lots of automation of opt-out forms. As far as opting out of many credit reporting agencies, check out https://www.consumer.ftc.gov/articles/0262-stopping-unsolici... https://www.optoutprescreen.com/?rf=t

Also, it seems like there's a service like this called Delete Me, but it also seems like they're a manual opt-out shop. Would be cool if you could find a way to not have humans doing it. Bet they're just having people on Amazon Mechanical Turk fill these out or something like that. https://joindeleteme.com/how-we-work/


Easier to just send them your own template on paper instead of using theirs.

It should be like a doctor’s prescription in a lot of places: as long as it’s on paper and has the right elements, it’s valid.


Well then, that's the trick: a legal research team that develops the form for as many sites as you can find, and then a mechanism to send that form, filled in with each user's data, to those sites.
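The "fill the form with each user's data" half could look roughly like the sketch below. Everything in it is hypothetical: the template wording, the field names, and the broker names are placeholders, and real requests would need whatever wording the legal side comes up with.

```python
from string import Template

# Hypothetical opt-out wording; a real template would come from legal research.
OPT_OUT_TEMPLATE = Template(
    "To: $broker\n"
    "I, $name (date of birth $dob), am opting out of your database. "
    "Please delete all data you hold about me and stop sharing it.\n"
)


def build_requests(user: dict, brokers: list) -> list:
    """Render one opt-out letter per broker for a single user."""
    return [
        OPT_OUT_TEMPLATE.substitute(broker=broker, name=user["name"], dob=user["dob"])
        for broker in brokers
    ]


if __name__ == "__main__":
    # Placeholder user and broker names, for illustration only.
    user = {"name": "Jane Doe", "dob": "1970-01-01"}
    for letter in build_requests(user, ["Example Broker A", "Example Broker B"]):
        print(letter)
```

Delivery (email, web form, or registered mail) is the hard part and is left out here.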


Like, what more do they need than disambiguating identity info and a declaration that I'm opting out? E.g. name and DOB?

My only fear is that you're now sending this all to them, but in 2019, we can safely say your name+DOB+address isn't a secret. Or national identity number if that's a thing in your jurisdiction.

It's the metadata around it we want wiped out.


> I want someone to start an opt-out service, where I send them $20, and they send a book of names by registered mail for opt-outs every month

This exists but it's not cheap: https://www.abine.com/deleteme/


I'm not rich but $129/year isn't bad. I'd hesitate mostly because I assume such services are scams.


It's a legit service. I use them and they did ensure that my data was removed from the services they specified. Obviously I'm just some person on the internet so my statement has no intrinsic credibility, but I believe they were also validated in a NYT article a while back.


“DeleteMe experts find and remove your personal information.”

Blargh, let the data broker figure out if I’m in their DB or not.

Trying to determine that myself seems risky. Better to send the request to every broker in existence.


Actually working on that project right now - www.thekanary.com. Super early stage but have a big list of brokers and opt out links that I'm automating. Would love early feedback.


Thanks a bunch for compiling those links/emails. I've unsubscribed myself and alerted my family.


The article lists 2 companies as owners of the data. I guess those would be a good place to start. I'd love to do that, but I'm not in the EU.

But the article says it's possible that the actual leak comes from a customer or former customer of these companies, and the actual ownership is so far a mystery.


>Can I do a GDPR request for the data about myself? How?

And send it where? It's unclear who owns this server


Google are jointly liable for this service, so if you can't find a contact point, then you can email google with the service IP. They will more than happily point you on to the customer to avoid being taken to court.


Seems like OxyData and PDL directly have more up-to-date records anyway.

Could the server owner (Google) have to fulfill the request? Probably not, but interesting to think about.


Start with Google, they will need to figure out and know who the actual owner is and who paid for the resources to host it.


People Data Labs?


From the article it seems that you can just create a free account and query your own name.

> In order to test whether or not the data belonged to PDL, we created a free account on their website which provides users with 1,000 free people lookups


I wonder how high the GDPR fine will be.


Is Elastic going to be punished under GDPR especially given that it's a Dutch company?


Really interesting legal question - "Seems like the ball is with Google at the moment, the exposed data is on their GCP servers. So, they can figure out next steps." is a comment above. How will the chain of insecure infrastructure + the data scrapers + the people responsible for configuration react?


That is a terrifying thought with terrible chilling effect should somebody official would even voice this thought in any way.


Was this an AI-generated sentence?


There's a video at https://www.youtube.com/watch?v=VNLEEogFo18 where People Data Labs' chief executive speaks at an insurance conference this year about their business.

They describe the data as being sourced from a 'data co-op' of over 1k companies which share data. It wasn't clear whether that means that those companies are collaborating and pooling data, or whether it's a roundabout/wordy way of saying that they scrape public personal information from thousands of sites.

They also claim that they're GDPR and CCPA compliant; I'm no expert but I do find one or two references that seem to suggest that scraping EU citizens' personal data without consent hasn't been GDPR-compliant for some time.

It does also raise another question: even if PDL themselves aren't GDPR-compliant, would any resulting fines against them reclaim a significant portion of the utility captured from the distribution of that data? As per comments on this thread, PDL API keys seem to be free to create.

Hypothetically speaking it could be within the interests of a group of businesses to provide a small amount of funding towards operation(s) that harvest and redistribute personal data: if the revenue base is low, the operation(s) can eventually fail (once legal proceedings catch up with them) and the group as a whole incurs little cost.

The speaker also takes a question from the audience regarding potential use-cases for this kind of personal data, and answers that knowing about an individual's life events (such as marriage) can be an opportunity to sell products to them, as can differentiating pricing if they'd just started smoking cigarettes.

Although I'm no expert, my understanding of insurance has been that risk is spread across a large pool of customers, allowing them each to pay similar premiums despite potentially slightly different backgrounds, with the understanding that they mutually benefit by paying into a shared fund so that the (random, potentially high-cost) risk of loss to each member is greatly softened.

We're seeing a situation here where more precise, per-individual data is being collected across large populations and could potentially be used for price differentiation.

If the insurance industry doesn't defend itself, this could lead to premiums which are essentially calculations based on 'pre-existing data' -- information which the consumer may not have consented to sharing, and which an insurance company might not be able to collect from application forms.

We don't seem to be particularly good, collectively, at escaping from cycles which seem to introduce or further wealth disparity at the moment and I worry that this kind of tech-driven attempt to optimize revenue efficiency of the insurance industry would only lead to further inequality.



