
This article is essentially just complaining that DDG and Google don't have special parsing for reddit pages ("How come it doesn't know that thread didn't get many upvotes?", "How come it thinks some change to the site's layout was an update to the page?")

Maybe if you want to search reddit, the best search engine is the search bar on reddit.com.




But those paraphrased complaints aren't his actual complaints. His complaint is: why does this archived reddit page from six years ago, without any updates, come up in search results for 'things within the past month'?

Which is... reasonable.


It is reasonable. It is also likely that whatever meta information reddit sends back (in headers or tags) is not dated correctly for the time of the original post.

Google COULD offer more time-machine features and perform diffing on pages. But a reddit "page" will always have content changes, since everything is generated from a database and kept fresh. The ONLY signal Google could therefore use is whatever meta tag or header reddit provides.


> It is also likely that whatever meta information reddit sends back (in headers or tags) is not dated correctly for the time of the original post.

That doesn't explain why Google lists the old search results as being from this month, while Duck correctly lists them as being from years past.


Google does cache pages, and by comparing against the cache it could notice changes and claim the page was updated sometime in between.

I've always wondered how search engines get hold of timestamps. By comparing against a locally cached copy, like I described above? By parsing a page's content or some metadata? It's not like the HTTP protocol sends me a "file created / last modified" date along with the payload, does it?
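
(For what it's worth, HTTP does have a Last-Modified response header, but database-rendered pages often set it to the render time rather than the content's age, so crawlers also look at in-page metadata. Below is a minimal sketch of those two sources; it assumes Node 18+ for the global fetch plus the jsdom package, and the function name and URL are purely illustrative.)

    // Sketch: the timestamp signals a crawler can actually get from a page.
    import { JSDOM } from "jsdom";

    async function guessPageDate(url: string): Promise<void> {
      const res = await fetch(url);

      // 1. HTTP layer: the Last-Modified header, if the server sets it.
      //    For pages generated fresh from a database this is often just "now",
      //    which is exactly how an old thread can look recently updated.
      console.log("Last-Modified:", res.headers.get("last-modified"));

      // 2. Document layer: Open Graph / semantic markup, if the site emits it.
      const doc = new JSDOM(await res.text()).window.document;
      const published = doc
        .querySelector('meta[property="article:published_time"]')
        ?.getAttribute("content");
      const timeEl = doc.querySelector("time[datetime]")?.getAttribute("datetime");
      console.log("article:published_time:", published);
      console.log("first <time datetime>:", timeEl);
    }

    guessPageDate("https://www.reddit.com/r/programming/comments/example/");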


> It is reasonable. It is also likely that whatever meta information reddit sends back (in headers or tags) is not dated correctly for the time of the original post.

That could explain the first screenshot, but definitely not the second, where google has it tagged as years old.


That's DDG, not google.


Seems like an easy solution to this problem would be to use two functions.

One function takes the output of the page and renders it so that only what's user-visible actually gets indexed. So no headers, no JSON data, nothing, unless it actually appears in the final rendered page. This would require jsdom or some other DOM implementation. Hardly hard for Google (the makers of Chrome) to achieve, and it's been done multiple times.

The second function makes the same call twice, passing the page to the first function each time, then compares the two results. If you make two calls right next to each other and some data differs, you discard it from your search index. Instead you only index data that appears in both calls.

Now you don't have the issue of "dynamic content" anymore...
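
A minimal sketch of that two-function idea, assuming jsdom for the "render to visible text" step (the function names are made up, and splitting visible text on newlines is a simplification of what a real crawler would do):

    import { JSDOM } from "jsdom";

    // Function one: reduce a page to the text a user would actually see.
    function visibleBlocks(html: string): string[] {
      const doc = new JSDOM(html).window.document;
      // Script/style/noscript content never renders as visible text.
      doc.querySelectorAll("script, style, noscript").forEach((el) => el.remove());
      return (doc.body?.textContent ?? "")
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0);
    }

    // Function two: fetch the same URL twice, keep only the blocks that are
    // identical across both fetches, and treat everything else as "dynamic".
    async function stableBlocks(url: string): Promise<string[]> {
      const [a, b] = await Promise.all([
        fetch(url).then((r) => r.text()),
        fetch(url).then((r) => r.text()),
      ]);
      const second = new Set(visibleBlocks(b));
      return visibleBlocks(a).filter((block) => second.has(block));
    }

One caveat with comparing back-to-back fetches: as the reply below notes, plenty of "dynamic" content only changes every few minutes or hours, so the two snapshots may need to be spaced further apart than a single crawl pass.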


Typically dynamic content doesn't change from second to second; it changes after 5 minutes, an hour, or a day. It's also extremely site-specific.

But I do like your idea.

To go a bit further with your idea: you could apply machine learning to analyse the changes. For example, ML could determine what is probably the "content area" of the page simply by building out a NN for each website that self-expires its training data after about 1 month (to account for redesigns over time).

The major problem will still be ads in the middle of the content, especially odd scroll-driven ads that show a different "picture" at each scroll position, as well as video ads that are likely to be different in each screenshot.

Another form of ad is the "linked" words, where random words in a paragraph become links that go to shitty websites that define the word but show a bunch of other ads.

I suppose Google could simply install uBlock in its training data collector harness to help with that stuff. >()


Yes, admittedly I do expect a search engine to be able to parse one of the biggest websites in the world, just as it used to for roughly a decade.

Obviously, it doesn't need to consider the upvotes directly, but maybe the text inside the page. Or the date.


Does Reddit use semantics like <time> [or a microformat like <span class="dtstart"><span class="value">...] to allow proper parsing? If not, then they should share at least half, probably most, of the blame IMO.


The search bar on reddit.com only searches through post titles and not comments. Using Google/DDG/<any other search engine> searches through all content posted on Reddit, not just post titles. Until Reddit implements proper search, people will keep using search engines to search for content on it.


Appending site:reddit.com to your search is very useful.


It doesn't have to be special parsing. Upvotes are just a heuristic for content quality. Good content generates upvotes, not the other way around. And site layout changes shouldn't fool the algorithms, or is that something that only Reddit does?


Everyone knows the best way to search reddit is via Google. The search bar on reddit is for optics only.


I prefer https://redditsearch.io/ but Google works too


Thanks, never heard of it. Already returning better results for me.


But does anyone know why search on Reddit is broken? Perhaps intentionally? I don't want to get tin-foil-hatty, but perhaps more (not readily apparent) false positives = more user clicks = more revenue via ad serving?


I often wonder why some fairly large companies that rely heavily on their own website don't seem to put more than a single web developer's worth of resources into it. Reddit fits into that category for me (Reddit has 400 employees).

Initially I had the impression that search was hard to implement. However, spending a work week figuring it out with Elasticsearch, Solr and Sphinx changed my mind. Getting the solution to work at the scale of the website would take more work, but all the know-how is there, and they could put a whole team on the task for a month.
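
To make "the know-how is there" a bit more concrete, here is roughly what the moving parts look like against a stock Elasticsearch node. This is only a sketch: it assumes an instance on localhost:9200, the index name reddit-comments and the field names are made up, and relevance tuning at Reddit's scale is where the real work would be.

    // Minimal sketch: index a comment, then full-text search the comment bodies.
    const ES = "http://localhost:9200";

    async function indexComment(id: string, body: string, subreddit: string) {
      await fetch(`${ES}/reddit-comments/_doc/${id}`, {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ body, subreddit, indexed_at: new Date().toISOString() }),
      });
    }

    async function searchComments(query: string) {
      const res = await fetch(`${ES}/reddit-comments/_search`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ query: { match: { body: query } } }),
      });
      const json = await res.json();
      return json.hits.hits.map((hit: any) => hit._source);
    }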


I wouldn't say it's a trivial ask, but yeah, if you have 400 employees, at least assign some resources to get it right. Unless it's intentionally broken. Facebook's prioritization, but also randomization, of the feed is a feature, not a bug.


Because search is difficult to get right. So most sites just implement a basic feature and then assume users will use Google.


Simple: because Google has a ton more data on what content is relevant on Reddit than Reddit itself does.


Given how relevant forums and discussion spaces are, one would think there would be some standardized structure for them, so you could search for comments or posts by criteria across the whole internet.


There are 'standards' (and an XKCD comic) for that. See schema.org for example.

Google does use it; last time I used it there was even tooling for it in the "check how crawlable my site is" console, whatever it's called.


https://schema.org/upvoteCount

Yet I haven't seen even one instance of this anywhere. :(
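
For illustration, this is roughly what it could look like if a forum emitted that markup as JSON-LD in a <script type="application/ld+json"> tag. The type and property names (DiscussionForumPosting, datePublished, comment, upvoteCount) are real schema.org vocabulary; the values, and the idea that Reddit would emit this, are hypothetical.

    const jsonLd = {
      "@context": "https://schema.org",
      "@type": "DiscussionForumPosting",
      headline: "Example thread title",
      datePublished: "2015-03-14T09:26:53Z",
      comment: [
        {
          "@type": "Comment",
          text: "Example top comment",
          upvoteCount: 42,
          dateCreated: "2015-03-14T10:02:11Z",
        },
      ],
    };

    // A crawler (or Google's structured-data tooling) could read the dates and
    // vote counts straight out of this instead of guessing from the layout.
    const scriptTag =
      `<script type="application/ld+json">${JSON.stringify(jsonLd)}</script>`;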


BRB, need to put `upvoteCount="10000000"` on all my blog articles. :)


Google is a TRILLION dollar company. Indexing Reddit properly would take what, 2-3 engineers? C'mon.


It would take 2-3 engineers how long? I can't really see it being more than a couple-of-weeks project. But why would they do it? Seems they'd only do it if there's an ROI. Is there?

If Google gives you what you want straight away, then you leave; sure, you come back, but they want to be bad enough to keep you on there and good enough to be better than other search engines. Their reach and resources cure the latter.


Google adding dedicated optimizations for popular sites seems like a bad trend.


They do it all the time.



