Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

An error condition. I decided to do away with it and take a small hit on the error by assuming the chances of the trimmed set being equal to the threshold are very small and that the error condition is effectively doing nothing.

I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:

    from random import random

    def estimate_uniques(iterable, window_size=100):
        p = 1
        seen = set()

        for i in iterable:
            if i not in seen:
                seen.add(i)
            if random() > p:
                seen.remove(i)
            if len(seen) >= window_size:
                seen = {s for s in seen if random() < 0.5}
                p /= 2
        return int(len(seen) / p)
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: