A prefix of a URL hash more or less gives away the site you are visiting, given a large enough database. And in China, an IP address can be tracked down to a person, given that regulations require most SIM cards to be registered to a real identity.
>A prefix of a URL hash more or less gives away the site you are visiting, given a large enough database.
That database would be ridiculously huge. If they truly are using the same method for Tencent as they are for Google, the prefix hash could correspond to several thousand different domains. According to Google's docs, the full URL is hashed, not just the domain (https://developers.google.com/safe-browsing/v4/urls-hashing).
The top result for a Google search for "how many pages does Google have in its index" says Google has "30 trillion web pages" (that's a 2013 article, but let's roll with that number for now).
The Google Safe Browsing docs say to send the first 32 bits of a 256-bit SHA-256 hash. So there are 2^32 ≈ 4e9 possible prefixes (and 2^224 ≈ 2.7e67 possible full hashes for each prefix).
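For concreteness, here's a minimal Python sketch of what that prefixing looks like: SHA-256 the URL expression and keep the first 4 bytes. (Real clients also canonicalize the URL and hash several host-suffix/path-prefix expressions per the docs linked above; this skips all of that.)

```python
import hashlib

def hash_prefix(url_expression: str, prefix_bytes: int = 4) -> bytes:
    """Return the first `prefix_bytes` bytes (32 bits by default) of the
    SHA-256 hash of a URL expression."""
    full_hash = hashlib.sha256(url_expression.encode("utf-8")).digest()
    return full_hash[:prefix_bytes]

# Two different pages only share a prefix by chance (roughly 1 in 2^32).
print(hash_prefix("example.com/path/to/page").hex())
```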
30 trillion = 3e13 web pages. If you assume they're evenly distributed across all the hash prefixes (a reasonable assumption for a cryptographically strong hash function), there are about 7,000, call it 1e4, URLs matching each prefix. (And that's a lower bound, using 2013-vintage estimates of the number of known URLs...)
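Redoing that arithmetic:

```python
pages = 3e13              # ~30 trillion indexed pages (the 2013 figure)
prefixes = 2 ** 32        # distinct 32-bit hash prefixes
print(pages / prefixes)   # ≈ 6,985 URLs sharing each prefix, i.e. ~1e4
```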
I _think_ that means visiting 13-14 unique URLs from a site in Tencent's lists would be enough to guarantee they could tell which site and pages you'd just visited? (since 2^14 > 1e4)
1: I'm pretty sure browsers keep a local list of known malicious hash prefixes, and only send a request when a URL's prefix matches one of them (see the sketch after this list). So those 13-14 URLs would all have to happen to have a prefix on the safe-browsing list (presumably quite unlikely). I guess Tencent could just advertise every prefix as being on the list, but I feel like that would've been discovered if that were the case.
2: Even with 13-14 hashes I don't think that _guarantees_ a match; it's just the average number you'd expect to need in order to find a match.
3: This becomes significantly harder when a user browses multiple websites at a time (which most people do, even if it's just because they click a link from one website to go to another).
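Here's a rough sketch of the lookup flow described in point 1, assuming the same hash-prefix scheme as above; the helper names and the empty local list are hypothetical stand-ins, not the actual browser or Safe Browsing API:

```python
import hashlib

# Hypothetical local prefix set, shipped/updated by the provider.
LOCAL_PREFIXES: set[bytes] = set()

def fetch_full_hashes(prefix: bytes) -> set[bytes]:
    """Stand-in for the network request: a real client would send the
    matching prefix to the provider and get back the full hashes."""
    return set()

def url_is_flagged(url_expression: str) -> bool:
    full_hash = hashlib.sha256(url_expression.encode("utf-8")).digest()
    prefix = full_hash[:4]
    if prefix not in LOCAL_PREFIXES:
        # Most lookups end here: no local match, so nothing is sent
        # and the provider learns nothing about this URL.
        return False
    # Only on a local match does the prefix leave the machine.
    return full_hash in fetch_full_hashes(prefix)
```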
2) Right. I was imagining some sort of binary search through the hash space, which might be way off base. It might actually take all 10,000 pages to guarantee it? That seems more wrong in my head than binary search?
And those 10,000 URLs per prefix are not equally likely. Chances are one or two of those URLs actually get visited and the other 9,999 are so obscure they can almost certainly be ruled out. Sending a 32-bit hash prefix isn't that far off from sending the URL.
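To make that concrete, a sketch of how a provider holding an observed prefix could narrow it back down, assuming it has precomputed hashes for known/popular pages (the URL list here is purely illustrative):

```python
import hashlib

def prefix_of(url: str) -> bytes:
    return hashlib.sha256(url.encode("utf-8")).digest()[:4]

# Hypothetical precomputed index: prefix -> known URLs with that prefix,
# built from a crawl of popular pages.
KNOWN_URLS = ["example.com/", "example.com/login", "example.org/article/1"]
INDEX: dict[bytes, list[str]] = {}
for url in KNOWN_URLS:
    INDEX.setdefault(prefix_of(url), []).append(url)

def candidates(observed_prefix: bytes) -> list[str]:
    """All known URLs colliding with the observed prefix. If one of them
    is vastly more popular than the rest, the prefix identifies it almost
    as well as the URL itself would."""
    return INDEX.get(observed_prefix, [])
```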