
This article is essentially just complaining that DDG and Google don't have special parsing for reddit pages ("How come it doesn't know that thread didn't get many upvotes?", "How come it thinks some change to the site's layout was an update to the page?")

Maybe if you want to search reddit, the best search engine is the search bar on reddit.com.




But those paraphrased complaints aren't his actual complaints. His complaint is: why does this archived reddit page from six years ago, without any updates, come up in search results for 'things within the past month'?

Which is... reasonable.


It is reasonable. It is also likely that whatever meta information reddit sends back (in headers or tags) is not dated correctly for the time of the original post.

Google COULD offer more time-machine features and perform diffing on pages. But a reddit "page" will always have content changes, since everything is generated from a database and kept fresh. The ONLY signal Google could therefore use is whatever meta tag or header reddit provides.


> It is also likely that whatever meta information reddit sends back (in headers or tags) is not dated correctly for the time of the original post.

That doesn't explain why Google lists the old search results as being from this month, while Duck correctly lists them as being from years past.


Google does cache pages, and by comparing against the cache it could notice changes and claim the page was updated sometime in between.

I've always wondered how search engines get hold of timestamps. By comparing against a locally cached copy, like I described above? By parsing a page's content or some metadata? It's not like the HTTP protocol sends me a "file created / last modified" date along with the payload, does it?
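
(For what it's worth, HTTP does have a Last-Modified response header, but database-rendered pages often set it to the render time rather than the content's age, so crawlers also look at in-page metadata. Below is a minimal sketch of those two sources; it assumes Node 18+ for the global fetch plus the jsdom package, and the function name and URL are purely illustrative.)

    // Sketch: the timestamp signals a crawler can actually get from a page.
    import { JSDOM } from "jsdom";

    async function guessPageDate(url: string): Promise<void> {
      const res = await fetch(url);

      // 1. HTTP layer: the Last-Modified header, if the server sets it.
      //    For pages generated fresh from a database this is often just "now",
      //    which is exactly how an old thread can look recently updated.
      console.log("Last-Modified:", res.headers.get("last-modified"));

      // 2. Document layer: Open Graph / semantic markup, if the site emits it.
      const doc = new JSDOM(await res.text()).window.document;
      const published = doc
        .querySelector('meta[property="article:published_time"]')
        ?.getAttribute("content");
      const timeEl = doc.querySelector("time[datetime]")?.getAttribute("datetime");
      console.log("article:published_time:", published);
      console.log("first <time datetime>:", timeEl);
    }

    guessPageDate("https://www.reddit.com/r/programming/comments/example/");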


> It is reasonable. It is also likely that whatever meta information reddit sends back (in headers or tags) is not dated correctly for the time of the original post.

That could explain the first screenshot, but definitely not the second, where google has it tagged as years old.


That's DDG, not google.


Seems like an easy solution to this problem would be to use two functions.

One function takes the output of the page and renders it so that only what's user-visible actually gets indexed. So no headers, no JSON data, nothing, unless it actually appears in the final rendered page. This would require jsdom or some other DOM implementation. Hardly hard for Google (the makers of Chrome) to achieve, and it's been done multiple times.

The second function makes the same call twice, passing the page to the first function each time, then compares the two results. If you make two calls right next to each other and some data differs, you discard it from your search index. Instead you only index data that appears in both calls.

Now you don't have the issue of "dynamic content" anymore...
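
A minimal sketch of that two-function idea, assuming jsdom for the "render to visible text" step (the function names are made up, and splitting visible text on newlines is a simplification of what a real crawler would do):

    import { JSDOM } from "jsdom";

    // Function one: reduce a page to the text a user would actually see.
    function visibleBlocks(html: string): string[] {
      const doc = new JSDOM(html).window.document;
      // Script/style/noscript content never renders as visible text.
      doc.querySelectorAll("script, style, noscript").forEach((el) => el.remove());
      return (doc.body?.textContent ?? "")
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0);
    }

    // Function two: fetch the same URL twice, keep only the blocks that are
    // identical across both fetches, and treat everything else as "dynamic".
    async function stableBlocks(url: string): Promise<string[]> {
      const [a, b] = await Promise.all([
        fetch(url).then((r) => r.text()),
        fetch(url).then((r) => r.text()),
      ]);
      const second = new Set(visibleBlocks(b));
      return visibleBlocks(a).filter((block) => second.has(block));
    }

One caveat with comparing back-to-back fetches: as the reply below notes, plenty of "dynamic" content only changes every few minutes or hours, so the two snapshots may need to be spaced further apart than a single crawl pass.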


Typically dynamic content doesn't change from second to second; it changes after 5 minutes, an hour, or a day. It's also extremely site-specific.

But I do like your idea.

To go a bit further with your idea: you could apply machine learning to analyse the changes. For example, ML could determine what is probably the "content area" of the page simply by building out a NN for each website that self-expires its training data after about 1 month (to account for redesigns over time).

The major problem will still be ads in the middle of the content, especially odd scroll-driven ads that show a different "picture" at each scroll position, as well as video ads that are likely to be different in each screenshot.

Another form of ad is the "linked" words, where random words in a paragraph become links that go to shitty websites that define the word but show a bunch of other ads.

I suppose Google could simply install uBlock in its training data collector harness to help with that stuff. >()


Yes, admittedly I do expect a search engine to be able to parse one of the biggest websites in the world, just as it used to for roughly a decade.

Obviously, it doesn't need to consider the upvotes directly, but maybe the text inside the page. Or the date.


Does Reddit use semantics like <time> [or a microformat like <span class="dtstart"><span class="value">...] to allow proper parsing? If not, then they should share at least half, probably most, of the blame IMO.


The search bar on reddit.com only searches through post titles and not comments. Using Google/DDG/<any other search engine> searches through all content posted on Reddit, not just post titles. Until Reddit implements proper search, people will keep using search engines to search for content on it.


Appending site:reddit.com to your search is very useful.


It doesn't have to be special parsing. Upvotes are just a heuristic for content quality. Good content generates upvotes, not the other way around. And site layout changes shouldn't fool the algorithms, or is that something that only Reddit does?


Everyone knows the best way to search reddit is via Google. The search bar on reddit is for optics only.


I prefer https://redditsearch.io/ but Google works too


Thanks, never heard of it. Already returning better results for me.


But does anyone know why search on Reddit is broken? Perhaps intentionally? I don't want to get tin-foil-hatty, but perhaps more (not readily apparent) false positives = more user clicks = more revenue via ad serving?


I often wonder why some fairly large companies that rely heavily on their own website don't seem to put more than a single web developer's worth of resources into it. Reddit fits into that category for me (Reddit has 400 employees).

Initially I had the impression that search was hard to implement. However, spending a work week figuring it out with Elasticsearch, Solr and Sphinx changed my mind. Getting the solution to work at the scale of the website would take more work, but all the know-how is there, and they could put a whole team on the task for a month.
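
To make "the know-how is there" a bit more concrete, here is roughly what the moving parts look like against a stock Elasticsearch node. This is only a sketch: it assumes an instance on localhost:9200, the index name reddit-comments and the field names are made up, and relevance tuning at Reddit's scale is where the real work would be.

    // Minimal sketch: index a comment, then full-text search the comment bodies.
    const ES = "http://localhost:9200";

    async function indexComment(id: string, body: string, subreddit: string) {
      await fetch(`${ES}/reddit-comments/_doc/${id}`, {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ body, subreddit, indexed_at: new Date().toISOString() }),
      });
    }

    async function searchComments(query: string) {
      const res = await fetch(`${ES}/reddit-comments/_search`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ query: { match: { body: query } } }),
      });
      const json = await res.json();
      return json.hits.hits.map((hit: any) => hit._source);
    }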


I wouldn't say it's a trivial ask, but yeah, if you have 400 employees, at least assign some resources to get it right. Unless it's intentionally broken. Facebook's prioritization, but also randomization, of the feed is a feature, not a bug.


Because search is difficult to get right. So most sites just implement a basic feature and then assume users will use Google.


Simple: because Google has a ton more data on what content is relevant on Reddit than Reddit itself does.


Given how relevant forums and discussion spaces are, one would think there would be some standardized structure for them, so you could search for comments or posts by criteria across the whole internet.


There are 'standards' (and an XKCD comic) for that. See schema.org for example.

Google does use it; last time I used it there was even tooling for it in the "check how crawlable my site is" console, whatever it's called.


https://schema.org/upvoteCount

Yet I haven't seen even one instance of this anywhere. :(
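
For illustration, this is roughly what it could look like if a forum emitted that markup as JSON-LD in a <script type="application/ld+json"> tag. The type and property names (DiscussionForumPosting, datePublished, comment, upvoteCount) are real schema.org vocabulary; the values, and the idea that Reddit would emit this, are hypothetical.

    const jsonLd = {
      "@context": "https://schema.org",
      "@type": "DiscussionForumPosting",
      headline: "Example thread title",
      datePublished: "2015-03-14T09:26:53Z",
      comment: [
        {
          "@type": "Comment",
          text: "Example top comment",
          upvoteCount: 42,
          dateCreated: "2015-03-14T10:02:11Z",
        },
      ],
    };

    // A crawler (or Google's structured-data tooling) could read the dates and
    // vote counts straight out of this instead of guessing from the layout.
    const scriptTag =
      `<script type="application/ld+json">${JSON.stringify(jsonLd)}</script>`;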


BRB, need to put `upvoteCount="10000000"` on all my blog articles. :)


Google is a TRILLION dollar company. Indexing Reddit properly would take what, 2-3 engineers? C'mon.


It would take 2-3 engineers how long? I can't really see it being more than a couple-of-weeks project. But why would they do it? Seems they'd only do it if there's an ROI. Is there?

If Google gives you what you want straight away, then you leave; sure, you come back, but they want to be bad enough to keep you on there and good enough to be better than other search engines. Their reach and resources cure the latter.


Google adding dedicated optimizations for popular sites seems like a bad trend.


They do it all the time.



