A prefix of a URL hash more or less gives away the site you are visiting, given a large enough database. And in China, an IP address can be tracked down to a person, given that regulations require most SIM cards to be registered to a real identity.
>A prefix of a URL hash more or less gives away the site you are visiting, given a large enough database.
That database would be ridiculously huge. If they truly are using the same method for Tencent as they are for Google, the prefix hash could correspond to several thousand different domains. According to Google's docs, the full URL is hashed, not just the domain (https://developers.google.com/safe-browsing/v4/urls-hashing).
The top result for a Google search for "how many pages does Google have in its index" says Google has "30 trillion web pages" (that's a 2013 article, but let's roll with that number for now).
The Google Safe Browsing docs say to send the first 32 bits of a 256-bit SHA-256 hash. So there are 2^32 ≈ 4e9 possible prefixes (and 2^224 ≈ 2.7e67 possible full hashes for each prefix).
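For concreteness, here's a minimal Python sketch of what that prefixing looks like: SHA-256 the URL expression and keep the first 4 bytes. (Real clients also canonicalize the URL and hash several host-suffix/path-prefix expressions per the docs linked above; this skips all of that.)

```python
import hashlib

def hash_prefix(url_expression: str, prefix_bytes: int = 4) -> bytes:
    """Return the first `prefix_bytes` bytes (32 bits by default) of the
    SHA-256 hash of a URL expression."""
    full_hash = hashlib.sha256(url_expression.encode("utf-8")).digest()
    return full_hash[:prefix_bytes]

# Two different pages only share a prefix by chance (roughly 1 in 2^32).
print(hash_prefix("example.com/path/to/page").hex())
```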
30 trillion = 3e13 web pages. If you assume they're evenly distributed across all the hash prefixes (a reasonable assumption for a cryptographically strong hash function), there are about 7,000, call it 1e4, URLs matching each prefix. (And that's a lower bound, using 2013-vintage estimates of the number of known URLs...)
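Redoing that arithmetic:

```python
pages = 3e13              # ~30 trillion indexed pages (the 2013 figure)
prefixes = 2 ** 32        # distinct 32-bit hash prefixes
print(pages / prefixes)   # ≈ 6,985 URLs sharing each prefix, i.e. ~1e4
```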
I _think_ that means visiting 13-14 unique URLs from a site in Tencent's lists would be enough to guarantee they could tell which site and pages you'd just visited? (since 2^14 > 1e4)
1: I'm pretty sure browsers keep a local list of known malicious hash prefixes, and only send a request when a URL's prefix matches one of them (see the sketch after this list). So those 13-14 URLs would all have to happen to have a prefix on the safe-browsing list (presumably quite unlikely). I guess Tencent could just advertise every prefix as being on the list, but I feel like that would've been discovered if that were the case.
2: Even with 13-14 hashes I don't think that _guarantees_ a match; it's just the average number you'd expect to need in order to find a match.
3: This becomes significantly harder when a user browses multiple websites at a time (which most people do, even if it's just because they click a link from one website to go to another).
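Here's a rough sketch of the lookup flow described in point 1, assuming the same hash-prefix scheme as above; the helper names and the empty local list are hypothetical stand-ins, not the actual browser or Safe Browsing API:

```python
import hashlib

# Hypothetical local prefix set, shipped/updated by the provider.
LOCAL_PREFIXES: set[bytes] = set()

def fetch_full_hashes(prefix: bytes) -> set[bytes]:
    """Stand-in for the network request: a real client would send the
    matching prefix to the provider and get back the full hashes."""
    return set()

def url_is_flagged(url_expression: str) -> bool:
    full_hash = hashlib.sha256(url_expression.encode("utf-8")).digest()
    prefix = full_hash[:4]
    if prefix not in LOCAL_PREFIXES:
        # Most lookups end here: no local match, so nothing is sent
        # and the provider learns nothing about this URL.
        return False
    # Only on a local match does the prefix leave the machine.
    return full_hash in fetch_full_hashes(prefix)
```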
2) Right. I was imagining some sort of binary search through the hash space, which might be way off base. It might actually take all 10,000 pages to guarantee it? That seems more wrong in my head than binary search?
And those 10,000 URLs per prefix are not equally likely. Chances are one or two of those URLs actually get visited and the other 9,999 are so obscure they can almost certainly be ruled out. Sending a 32-bit hash prefix isn't that far off from sending the URL.
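To make that concrete, a sketch of how a provider holding an observed prefix could narrow it back down, assuming it has precomputed hashes for known/popular pages (the URL list here is purely illustrative):

```python
import hashlib

def prefix_of(url: str) -> bytes:
    return hashlib.sha256(url.encode("utf-8")).digest()[:4]

# Hypothetical precomputed index: prefix -> known URLs with that prefix,
# built from a crawl of popular pages.
KNOWN_URLS = ["example.com/", "example.com/login", "example.org/article/1"]
INDEX: dict[bytes, list[str]] = {}
for url in KNOWN_URLS:
    INDEX.setdefault(prefix_of(url), []).append(url)

def candidates(observed_prefix: bytes) -> list[str]:
    """All known URLs colliding with the observed prefix. If one of them
    is vastly more popular than the rest, the prefix identifies it almost
    as well as the URL itself would."""
    return INDEX.get(observed_prefix, [])
```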