
A prefix of URL hash more or less gives away the site you are visiting, given a large enough database. And in China, an IP address can be tracked down to a person, given that most sim cards can be identified due to regulation.


>A prefix of URL hash more or less gives away the site you are visiting, given a large enough database.

That database would be ridiculously huge. If they truly are using the same method for Tencent as they are for Google, a single prefix hash could match several thousand different domains. According to Google's docs, the full URL is hashed, not just the domain. (https://developers.google.com/safe-browsing/v4/urls-hashing)
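Per those docs, the client hashes the full canonicalized URL with SHA-256 and sends only a short prefix. A minimal sketch of that truncation (skipping Safe Browsing's real canonicalization and host/path suffix-prefix expressions; the example URL is illustrative):

```python
import hashlib

def full_hash(url: str) -> bytes:
    # SHA-256 over the (already canonicalized) full URL, not just the domain
    return hashlib.sha256(url.encode("utf-8")).digest()

def hash_prefix(url: str, n_bytes: int = 4) -> bytes:
    # Clients send only a short prefix -- 4 bytes (32 bits) is the minimum
    return full_hash(url)[:n_bytes]

print(hash_prefix("https://example.test/page").hex())  # 8 hex chars = 32 bits
```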


I just did some ballpark math...

The top result for a Google search "how many pages does google have in its index" says Google has "30 trillion web pages" (that's a 2013 article, but let's roll with that number for now).

The Google Safe Browsing docs say to send the first 32 bits of the 256-bit SHA-256 hash. So there are 2^32 = 4e9 possible prefixes (and 2^224 = 2e67 possible full hashes for each prefix).

30 trillion = 3e13 webpages. If you assume they're evenly distributed across all the hash prefixes (a reasonable assumption for a cryptographically strong hash function), there are about 1e4 or 10,000 URLs matching each prefix. (And that's a lower bound, using 2013-vintage estimates of the number of known URLs...)
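Spelling out that division (same numbers as above):

```python
total_urls = 3e13            # "30 trillion web pages" (2013 estimate)
num_prefixes = 2 ** 32       # possible 32-bit hash prefixes
urls_per_prefix = total_urls / num_prefixes
print(f"{urls_per_prefix:.0f}")  # ~7000, i.e. on the order of 1e4
```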

I _think_ that means visiting 13-14 unique URLs from a site on Tencent's lists would be enough to guarantee they could tell which site and pages you'd just visited? (since 2^14 > 1e4)


I think there are a few flaws in your reasoning:

1: I'm pretty sure browsers keep a local list of known malicious hash prefixes, and only send a request if a URL's prefix matches one. So those 13-14 URLs would all have to happen to have a prefix on the safe browsing list (presumably quite unlikely). I guess Tencent could just advertise every prefix as being on the list, but I feel like that would've been discovered if it were the case.
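For what it's worth, the local-list flow described here looks roughly like this (a sketch, not the real v4 Update API; `fetch_full_hashes` is a hypothetical stand-in for the network round trip):

```python
import hashlib

def fetch_full_hashes(prefix: bytes) -> set[bytes]:
    # Stand-in for the server request that only fires on a prefix match;
    # the real API returns all full hashes sharing this 4-byte prefix.
    return set()

def is_flagged(url: str, local_prefixes: set[bytes]) -> bool:
    h = hashlib.sha256(url.encode("utf-8")).digest()
    if h[:4] not in local_prefixes:
        return False  # no local match -> no network request, nothing leaks
    # Only on a local hit does the client reveal the 32-bit prefix upstream.
    return h in fetch_full_hashes(h[:4])

print(is_flagged("https://example.test/", set()))  # False: empty local list
```

So an attacker who controls the list distribution could pad the local list with the prefixes they want to observe, which is exactly the scenario discussed below.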

2: Even with 13-14 hashes I don't think that _guarantees_ a match, it's just the average you'd expect to need in order to find a match.

3: This becomes significantly harder when a user browses multiple websites at a time (which most people do, even if it's just because they click a link from one website to go to another).


1) Yeah - this assumes the safe browsing list is under control of the attacker. They could ensure that https://free-hk.org/ and https://free-hk.org/about and https://free-hk.org/blog and https://free-hk.org/news and https://free-hk.org/donate and https://free-hk.org/next-protest are all "on the list".

2) Right. I was imagining some sort of binary search through the hash space - which might be way off base. It might actually take all 10,000 pages to guarantee it? That seems more wrong in my head than binary search, though.

3) I don't think that matters so much; the order or sequence doesn't matter, only a cluster of visits within a chosen timeframe. I don't mind if you browse to https://bank.tld/ in between https://free-hk.org/news and https://free-hk.org/next-protest...


And those 10,000 URLs per hash prefix are not equally likely. Chances are one or two of those URLs actually get visited and the other 9,999 are so obscure they can almost certainly be ruled out. Sending a 32-bit hash isn't that far off sending the URL.
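In information-theoretic terms (using the ~3e13 known-URL figure from upthread): uniquely naming one URL out of 3e13 takes only about 45 bits, so a 32-bit prefix already hands over most of a URL's identity, even before popularity skew is taken into account. A quick check:

```python
import math

total_urls = 3e13
bits_to_identify = math.log2(total_urls)
print(f"{bits_to_identify:.1f} bits")  # ~44.8 bits; the prefix leaks 32 of them
```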


Not if you restrict the dataset to the undesirable domain names, it would be orders of magnitude smaller.


It’s a prefix of (URL hash), not a (prefix of URL) hash.
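The distinction matters: hashing the whole URL and then truncating the hash gives a completely different (and unrelated) value than truncating the URL first and hashing that fragment. With an illustrative URL:

```python
import hashlib

url = "https://example.test/some/long/path"

# Prefix of (URL hash): hash the *whole* URL, then keep the first 4 bytes.
prefix_of_hash = hashlib.sha256(url.encode()).hexdigest()[:8]

# (Prefix of URL) hash: truncate the URL first, then hash the fragment.
hash_of_prefix = hashlib.sha256(url[:16].encode()).hexdigest()[:8]

print(prefix_of_hash, hash_of_prefix)  # two unrelated values
```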



