An error condition. I decided to do away with it and take a small hit on the error by assuming the chances of the trimmed set being equal to the threshold are very small and that the error condition is effectively doing nothing.
I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:
from random import random
def estimate_uniques(iterable, window_size=100):
p = 1
seen = set()
for i in iterable:
if i not in seen:
seen.add(i)
if random() > p:
seen.remove(i)
if len(seen) >= window_size:
seen = {s for s in seen if random() < 0.5}
p /= 2
return int(len(seen) / p)
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.
I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.