
> You're ignoring real-world aspects of the problem.

No, I'm not. The most important real world aspect of this problem is that I can pull the complex part, the binary search, off the standard library shelf. All I need to implement is a random access iterator over the lines of an open file and I'm set.
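To make that concrete, here's roughly what I have in mind. This is only a sketch, not anything I've run against Reddit's data: it assumes the log is sorted and every line starts with a numeric Unix timestamp, and it hand-rolls the bisection over byte offsets rather than dressing a line iterator up for std::lower_bound.

    #include <cstdint>
    #include <fstream>
    #include <string>

    // Parse the sort key from a log line.  Assumption: each line begins with a
    // numeric Unix timestamp ("<epoch-seconds> <rest of line>"); swap this out
    // for whatever the real lines are actually sorted by.
    static long long key_of(const std::string& line) {
        return std::stoll(line);
    }

    // Return the byte offset of the first line whose key is >= target, or
    // file_size if there is none.  The file must be sorted by that key and
    // opened in binary mode.
    std::uint64_t lower_bound_offset(std::ifstream& f, std::uint64_t file_size, long long target) {
        std::uint64_t lo = 0, hi = file_size;
        while (lo < hi) {
            std::uint64_t mid = lo + (hi - lo) / 2;
            f.clear();
            f.seekg(static_cast<std::streamoff>(mid));
            std::string skip, line;
            std::getline(f, skip);                 // discard the (possibly partial) line we landed in
            if (std::getline(f, line) && key_of(line) < target)
                lo = mid + 1;                      // everything up to and including this line is too early
            else
                hi = mid;                          // the answer starts no later than the first full line after mid
        }
        // lo is now at most one line before the answer; finish with a short scan.
        f.clear();
        f.seekg(static_cast<std::streamoff>(lo));
        std::uint64_t pos = lo;
        std::string line;
        if (lo != 0 && std::getline(f, line))      // realign to the next line boundary
            pos = lo + line.size() + 1;            // +1 for the newline getline consumed
        while (pos < file_size && std::getline(f, line) && key_of(line) < target)
            pos += line.size() + 1;
        return pos < file_size ? pos : file_size;
    }

Seek to the returned offset and stream lines forward until you pass the end of the time window you care about. Each loop iteration is one seek plus one short read, so even on 70GB you're done in a few dozen probes.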

> For example, you're assuming that the data points are evenly distributed across time

No, I'm not. I just don't care how they're distributed across times. 70GB is between 2^36 and 2^37 bytes. It's probably around 2^28 or 2^29 lines. Binary search over the characters would take at most 37 probes; over the lines it would take at most 29 probes. Since each probe is so cheap (one disk seek and one disk read, essentially) there's absolutely no reason to optimize any further, especially when I can use an off-the-shelf component like C++'s std::lower_bound to do my binary search for me.
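If you want to sanity-check those figures, here's the back-of-the-envelope version (the ~256 bytes per line is just my guess at an average):

    #include <cmath>
    #include <cstdio>

    int main() {
        const double bytes = 70e9;                 // ~70 GB of log
        const double lines = bytes / 256.0;        // guessing ~256 bytes per line
        std::printf("probes over bytes: %.0f\n", std::ceil(std::log2(bytes)));  // 37
        std::printf("probes over lines: %.0f\n", std::ceil(std::log2(lines)));  // 29
        return 0;
    }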

> Reddit will most likely have regular busy and quiet periods, which will impact the density of the log at certain times

If you think that impacts the runtime of binary search at all, you've got a serious misconception about how binary search works.

> Thus, one can tune the algorithm to work better on Reddit's specific visitor pattern, if you can learn it.

Why bother? We're talking about a savings of milliseconds on an interactive process. An ordinary human like myself would never even notice the difference. The network lag from SSH alone would be greater than the difference in runtime.



This is absolutely correct. The worst case in any binary search is exactly floor(lg n) + 1 probes, which you hit when the element isn't present. There is no reason to optimize any further than this.


To be clear, interpolation search can cut the expected runtime from `O(log n)` down to `O(log log n)`, but it's just not worth the effort, especially in this case, precisely because the lines aren't uniformly distributed.
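For reference, this is the textbook version over an in-memory sorted array (just a sketch; pointing the same idea at byte offsets in the log file works the same way):

    #include <cstdint>
    #include <vector>

    // Interpolation search over a sorted vector of timestamps.  Instead of
    // probing the midpoint, it guesses where `target` should fall by assuming
    // the values are roughly evenly spread between a[lo] and a[hi].
    // Expected O(log log n) probes on uniformly distributed data, degrading
    // badly when the distribution is skewed.
    std::int64_t interpolation_search(const std::vector<long long>& a, long long target) {
        std::int64_t lo = 0, hi = static_cast<std::int64_t>(a.size()) - 1;
        while (lo <= hi && target >= a[lo] && target <= a[hi]) {
            if (a[hi] == a[lo])
                return a[lo] == target ? lo : -1;
            // Estimate the position by linear interpolation between the endpoints.
            std::int64_t pos = lo + static_cast<std::int64_t>(
                (static_cast<double>(target - a[lo]) / (a[hi] - a[lo])) * (hi - lo));
            if (a[pos] == target) return pos;
            if (a[pos] < target)  lo = pos + 1;
            else                  hi = pos - 1;
        }
        return -1;   // not present
    }

The catch is that the interpolated guess is only good when the keys really are spread evenly, which is exactly the distribution issue above.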


Well clearly they're more likely to hire someone who does go to the effort!

> To be clear, interpolation search can cut the expected runtime from `O(log n)` down to `O(log log n)`, but it's just not worth the effort, especially in this case, precisely because the lines aren't uniformly distributed.

It doesn't matter what the distribution is as long as you can learn it. A uniform distribution helps when you're cutting the search space into regular chunks, sure. The Reddit data won't be uniform, so don't cut the search space uniformly.

As I said, something faster could be devised if one studied the average number of visitors over a 24-hour period (though I don't know where that data could be gathered from).

If you can make your first seek accurate enough, it may be the case that a simple linear scan from that point gets you there faster than Binary Search.
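Something like this is what I'm picturing, with a completely made-up traffic curve standing in for real measurements (traffic_cdf, the 12:00 split, and the two-thirds figure are all invented for illustration):

    #include <cstdint>

    // Hypothetical model of the site's daily traffic: the fraction of a day's
    // log lines written before hour-of-day h (UTC).  The shape below is made up
    // for illustration (it puts roughly two thirds of the traffic in the
    // 12:00-24:00 half of the day); a real version would be fitted to measured
    // visitor counts.
    double traffic_cdf(double hour_of_day) {
        return hour_of_day < 12.0 ? hour_of_day / 36.0
                                  : 1.0 / 3.0 + (hour_of_day - 12.0) / 18.0;
    }

    // Turn the target time into a first-guess byte offset within one day's
    // worth of log.  A linear scan (or a narrowed binary search, as a fallback
    // for when the guess is off) would start from here.
    std::uint64_t guess_offset(std::uint64_t day_size_bytes, double target_hour) {
        return static_cast<std::uint64_t>(traffic_cdf(target_hour) * static_cast<double>(day_size_bytes));
    }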

You can still report average and worst-case complexities under whatever assumption you've made about how the log data is distributed.

Sorry to keep going on about this, but I just wanted to make my point clear! Good discussion.


OK, you are simply proposing the most obvious solution to the problem, which is entirely correct - I must have misunderstood your reasoning.

I assumed Reddit would want something a bit more special than the hundreds of Binary Search implementations they are going to receive. There are always faster algorithms that can be designed for specific problems, even if realistically they only save (in this case) 100ms.

It's a competition - you're supposed to stand out.



