
Well let's look at how this actually played out.

  - Defendant was in fact sending CP through his gmail.
  - gmail correctly detects and flags it based on hash value
  - Google sends message to NCMEC based on hash value
  - NCMEC sends it to police based on hash value
Now police are facing the obvious question: is this actually CP? They open the image, determine it is, then get a warrant to search his gmail account, and (later) another warrant to search his home.

The court here is saying they should have got a warrant to even look at the image in the first place. But warrants only issue on probable cause. What's the PC here? The hash value. What's the probability of hash collisions? Non-zero but very low.

The practical upshot of this is that all reports from NCMEC will now go through an extra step of the police submitting a copy of the report with the hash value and some boilerplate document saying 'based on my law enforcement experience, hash values are pretty reliable indicators of fundamental similarity', and the warrant application will then be rubber stamped by a judge.

An analogous situation would be where I send a sealed envelope with some documents to the police, writing on the outside 'I believe the contents of this envelope are proof that John Doe committed [specific crime]', and the police have to get a warrant to open the envelope. It's arguably more legally consistent, but in practice it just creates an extra stage of legal bureaucracy/delay with no appreciable impact on the eventual outcome.

Recall that the standard for issuance of a warrant is 'probable cause', not 'mathematically proven cause'. Hash collisions are a possibility, but a sufficiently unlikely one that it doesn't matter. Probable cause means 'a fair probability' based on independent evidence of some kind - testimony, observation, forensic results, or the like. Even a shitty hash function that's only 90% reliable is going to meet that threshold. In the 10% of cases where the opened file turns out to be a random image with no pornographic content, it's a 'no harm no foul' situation.

For reference, a primer on hash collision probabilities: https://preshing.com/20110504/hash-collision-probabilities/

and a more detailed examination of common perceptual hashing algorithms (skip to table 3 for the collision probabilities): https://ceur-ws.org/Vol-2904/81.pdf
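If you want to plug numbers into the first link's math yourself, the birthday-bound approximation it walks through is a few lines of Python (a rough sketch; the bit lengths and image counts below are purely illustrative, not any provider's actual parameters):

  import math

  def collision_probability(n_items: int, hash_bits: int) -> float:
      """Approximate chance of at least one collision among n_items values
      drawn uniformly from a space of 2**hash_bits, via the birthday bound:
      p ~= 1 - exp(-n(n-1) / (2 * 2^b))."""
      space = 2.0 ** hash_bits
      return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))

  # Illustrative only: a 64-bit hash over a billion images vs. a 256-bit one.
  print(collision_probability(10**9, 64))   # ~0.027 - collisions are plausible
  print(collision_probability(10**9, 256))  # ~0.0   - effectively impossible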

I think what a lot of people are implicitly arguing here is that the detection system needs to be perfect before anyone can do anything. Nobody wants the job of examining images to check if they're CP or not, so we've outsourced it to machines that do so with good-but-not-perfect accuracy and then pass the hot potato around until someone has to pollute their visual cortex with it.

Obviously we don't want to arrest or convict people based on computer output alone, but how good does it have to be (in % or odds terms) in order to begin an investigation - not of the alleged criminal, but of the evidence itself? Should companies like Google have to submit an estimate of the probability of hash collisions using their algorithm and based on the number of image hashes that exist on their servers at any given moment? Should they be required to submit source code used to derive that? What about the microcode of the silicon substrate on which the calculation is performed?

All other things being equal, what improvement will result here from adding another layer of administrative processing, whose outcome is predetermined?




> Recall that the standard for issuance of a warrant is 'probable cause', not 'mathematically proven cause'. Hash collisions are a possibility, but a sufficiently unlikely one that it doesn't matter. Probable cause means 'a fair probability' based on independent evidence of some kind - testimony, observation, forensic results, or the like. Even a shitty hash function that's only 90% reliable is going to meet that threshold. In the 10% of cases where the opened file turns out to be a random image with no pornographic content, it's a 'no harm no foul' situation.

But do we actually know that? Do we know what thresholds of "similarity" Google and others are using, and how many false positives they trigger? Billions of photos are processed daily by Google's services (Google Photos, chat programs, Gmail, Drive, etc.), and very few people actually send such stuff via Gmail, so what if the reality is that 99.9% of the matches are actually false positives? What about intentional matches, like someone intentionally creating some random SFW meme image that (when hashed) matches some illegal image hash, and that photo is then sent around intentionally... should police really be checking all those emails, photos, etc., without warrants?
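To make the base-rate worry concrete (all numbers below are invented for illustration, nobody outside Google knows the real ones):

  def match_precision(prevalence: float, hit_rate: float, false_match_rate: float) -> float:
      """Fraction of flagged images that are actually illegal, via Bayes' rule:
      P(illegal | match) = P(match | illegal) * P(illegal) / P(match)."""
      p_match = hit_rate * prevalence + false_match_rate * (1.0 - prevalence)
      return hit_rate * prevalence / p_match

  # Hypothetical: 1 in 10 million uploads is illegal, the hash catches 99% of
  # those, and falsely matches 1 in 100,000 innocent images.
  print(match_precision(1e-7, 0.99, 1e-5))  # ~0.01 -> roughly 99% of matches would be false positives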


Well, that's why I'm asking what threshold of certainty people want to apply. The hypotheticals you cite are certainly possible, but are they likely?

> what if the reality is that 99.9% of the matches are actually false positives

Don't you think that if Google were deluging the cops with false positive reports that turned out to be perfectly innocuous 999 times out of 1000, that police would call them up and say 'why are you wasting our time with this?' Or that defense lawyers wouldn't be raising hell if there were large numbers of clients being investigated over nothing? And how would running it through a judge first improve that process?

> What about intentional matches, like someone intentionally creating some random SFW meme image [...]

OK, but what is the probability of that happening? And if such images are being mailed in bulk, what would be the purpose other than to provide cover for CSAM traders? The tactic would only be viable for as long as it takes a platform operator to change up their hashing algorithm. And again, how would the extra legal step of consulting a judge alleviate this?

> should police really be checking all those emails, photos, etc., without warrants?

But that's not happening. As I pointed out, police examined the submitted image evidence to determine if it was CP (it was). Then they got a warrant to search the gmail account, and following that another warrant to search his home. They didn't investigate the criminal first; they investigated an image file submitted to them to determine whether it was evidence of a crime.

And yet again, how would bouncing this off a judge improve the process? The judge will just look at the report submitted to the police and a standard police letter saying 'reports of this kind are reliable in our experience' and then tell the police yes, go ahead and look.


> Don't you think that if Google were deluging the cops with false positive reports that turned out to be perfectly innocuous 999 times out of 1000, that police would call them up and say 'why are you wasting our time with this?' Or that defense lawyers wouldn't be raising hell if there were large numbers of clients being investigated over nothing? And how would running it through a judge first improve that process?

Yes, sure... they send them a batch of photos, thousands even, and someone from the police skims the photos... a fishing expedition would be the right term for that.

> OK, but what is the probability of that happening? And if such images are being mailed in bulk, what would be the purpose other than to provide cover for CSAM traders? The tactic would only be viable for as long as it takes a platform operator to change up their hashing algorithm. And again, how would the extra legal step of consulting a judge alleviate this?

You never visited 4chan?

> But that's not happening. As I pointed out, police examined the submitted image evidence to determine if it was CP (it was). Then they got a warrant to search the gmail account, and following that another warrant to search his home. They didn't investigate the criminal first; they investigated an image file submitted to them to determine whether it was evidence of a crime.

It's as if they first entered your home illegally, found a joint on the table, and then got a warrant for the rest of the house. As pointed out in the article and in the title... they should need a warrant for the first image too.

> And yet again, how would bouncing this off a judge improve the process? The judge will just look at the report submitted to the police and a standard police letter saying 'reports of this kind are reliable in our experience' and then tell the police yes, go ahead and look.

Sure, if it brings enough results. But if they issue 200 warrants and get zero results, things will have to change, both for police and for Google. This is like saying "that guy has long hair, he's probably a hippy and has drugs, let's get a search warrant for his house". Currently we don't know the numbers, and most people (you excluded) believe that police shouldn't search private data of people just because some algorithm thinks so, without a warrant.


The idea that police are spending time just scanning photos of trains, flowers, kittens and so on in hopes of finding an occasional violation seems ridiculous to me. If nothing else, you would expect NCMEC to wonder why only 0.1% of their reports are ever followed up on.

> a fishing expedition would be the right term for that

No it wouldn't. A fishing expedition is where you get a warrant against someone without any solid evidence and then dig around hoping to find something incriminating.

> You never visited 4chan?

I have been a regular there since 2009. What point are you attempting to make?

> It's as if they first entered your home illegally, found a joint on the table, and then got a warrant for the rest of the house. As pointed out in the article and in the title... they should need a warrant for the first image too.

This analogy is flat wrong. I already explained the difference.

> most people (you excluded) believe that police shouldn't search private data of people just because some algorithm thinks so, without a warrant.

That is not what I believe. I think they should get a warrant to search any private data. In this case they're looking at a single image to determine whether it's illegal, as a reasonably reliable statistical test suggests it to be.

You're not explaining what difference it makes if a judge issues a warrant on the exact same criteria.


As many others have said, Google isn’t using a cryptographic hash here. It’s using perceptual hashing, which isn’t collision-safe at all.
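For anyone unfamiliar with the distinction, here is roughly what a toy perceptual hash looks like (an average-hash sketch using Pillow; not Google's actual algorithm, which isn't public, but it shows why different images can land on the same or nearby hashes):

  from PIL import Image  # pip install Pillow

  def average_hash(path: str, size: int = 8) -> int:
      """Toy 64-bit perceptual hash: shrink to 8x8 grayscale, then set one bit
      per pixel according to whether it is brighter than the image's mean.
      Images with similar overall structure produce identical or nearby hashes."""
      img = Image.open(path).convert("L").resize((size, size), Image.LANCZOS)
      pixels = list(img.getdata())
      mean = sum(pixels) / len(pixels)
      bits = 0
      for p in pixels:
          bits = (bits << 1) | (1 if p > mean else 0)
      return bits

  def hamming_distance(a: int, b: int) -> int:
      """Bits that differ; matchers flag pairs below some similarity threshold."""
      return bin(a ^ b).count("1")

  # Hypothetical file names: two unrelated photos can still fall within the
  # match threshold, which is exactly the collision risk under discussion.
  # print(hamming_distance(average_hash("cat.jpg"), average_hash("meme.jpg")))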


Did you read the whole thing?

> and a more detailed examination of common perceptual hashing algorithms (skip to table 3 for the collision probabilities): https://ceur-ws.org/Vol-2904/81.pdf

And there was a whole lot of explanation of how probable cause works and how it's different from programmers' aspirations to perfection.


The table only proves the point. The lowest probability in the table is 1 in 100,000. Most others are 1 in 100.

28 billion photos are uploaded every week to Google Photos[1]. Even at the best-case 1-in-100,000 rate, that's at least 280k false positives per week.

Should we really be executing roughly 30 search warrants on innocent people per minute?

[1] https://blog.google/products/photos/storage-changes/
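Spelling out the arithmetic behind those figures (same upload number and Table 3 rates as above; the per-minute figure is just the weekly total spread evenly):

  uploads_per_week = 28e9  # Google Photos uploads per week, per [1]
  collision_rates = {"best case (1 in 100,000)": 1e-5,
                     "more typical (1 in 100)": 1e-2}

  for label, rate in collision_rates.items():
      per_week = uploads_per_week * rate
      per_minute = per_week / (7 * 24 * 60)
      print(f"{label}: {per_week:,.0f} false matches/week (~{per_minute:,.0f}/minute)")

  # best case (1 in 100,000): 280,000 false matches/week (~28/minute)
  # more typical (1 in 100): 280,000,000 false matches/week (~27,778/minute)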


Do you have any evidence that this is happening? You don't think someone would have noticed by now if it were?

And as I pointed out, we're not talking about a search warrant on a person, we're talking about whether it's necessary to get a search warrant to look at a picture to determine if it's an illegal image.



