
I have some questions about information retrieval and SLOs:

* Is there a metric of search quality which is appropriate here -- specifically, "when I search for [site:tbray.org rock roll], and receive a set of results, that set includes Tim's article"? What do we call this metric? The metric would be lower when the result set is empty (no relevant results returned) and higher when the result set contains the desired article (a relevant result was returned).

* How would you assess the quality of this particular search against a metric?

* How would you measure the overall quality of "all searches in the past hour, including the [site:tbray.org rock roll] search"? How would this one failure to find a page contribute to an overall success rate?

* Is there any possible automation that would notice whether Tim's article has started to be missing from indexes and say "hey, this represents a loss of a kind of quality"?

* Suppose the index were to (say) discard all pages created before 1999 but simultaneously improve the relevance of all queries that find more recent results. If (say) 99.99% of queries have users who are happy getting only post-1999 links and (say) only 0.01% are unhappy because they specifically wanted a pre-1999 result, but things get way way better for the 99.99%, was that a bad change? Would any metric show a problem?

I don't see super satisfying answers to this at e.g. https://www.quora.com/How-does-Google-measure-the-quality-of... or https://www.quora.com/How-can-search-quality-be-measured . If I'm reading right, it sounds like part of the state of the art for search quality recently involved human raters manually running sample queries… That seems kinda crazy / totally unlikely to catch certain obscure issues. But then again:

* What is the service level objective for search quality? If search is getting way better for 99.99% of users because of various optimizations, is it a problem if a particular 0.01% of queries, such as Tim's old review query (which he expected to find one specific page), instead find no results at all?

And then I guess I wonder:

* According to whatever metric correctly captures Tim's review being missing as a problem, what is the current search quality of Google web searches and how has it been changing over time?




This won't answer all of your questions, but the measures you're looking for are called 'recall' and 'precision':

- recall: number of relevant documents retrieved / number of relevant documents

- precision: number of relevant documents in result set / number of documents in result set
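For concreteness, here's a minimal Python sketch of both formulas for a single query; the document IDs and relevance judgments are made up for illustration, not taken from any real index:

    # Set-based precision/recall for one query.
    def precision_recall(retrieved, relevant):
        retrieved_set = set(retrieved)
        hits = retrieved_set & relevant
        precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    relevant = {"doc_7", "doc_42"}            # hypothetical ground-truth judgments
    retrieved = ["doc_3", "doc_42", "doc_9"]  # hypothetical result set
    print(precision_recall(retrieved, relevant))  # -> (0.333..., 0.5)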


Yeah, you know, it's funny: the last time I worked on question-answering code, we were trying really hard to find algorithms that could improve a particular metric (F-score, a synthetic agglomeration of precision and recall; F1 is their harmonic mean) ... I don't remember hearing very many conversations at all about whether we were measuring the right thing.

Given a query like [site:tbray.org "rock n roll animal"], and knowing that the 1 relevant document we actually want is the review at https://www.tbray.org/ongoing/When/200x/2006/03/13/Rock-n-Ro... , I think we can say that

* if Google search returns 4 results for the query, not including the review: precision is 0/4, recall is 0/1 (so p=0, r=0)

* if Google search returns 5 results for that query, including the review: precision is 1/5, recall is 1/1 (so p=0.2, r=1)
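Plugging those numbers into the F-score mentioned upthread (F1 is the harmonic mean of precision and recall), the two cases come out to 0 and roughly 0.33. A quick sketch, nothing Google-specific:

    # F1 = harmonic mean of precision and recall.
    def f1(p, r):
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    print(f1(0.0, 0.0))  # review missing:  0.0
    print(f1(0.2, 1.0))  # review included: 0.333...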

But while I _kind of_ understand how we can use these measures to assess the outcome of a single query, I'm really not sure what meaningful ways are available to aggregate them. Suppose we're going to get 1M queries in the next hour. Do we prefer the algorithm with the highest mean F-score per query? The highest median F-score per query? Or the highest 1st-percentile F-score per query (i.e. the one whose worst 1% of queries still score as well as possible)?
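One way to see how much the choice of aggregate matters is to compute all three summaries over the same batch of per-query scores. This is a toy sketch with invented numbers (98% of queries score well, 2% fail outright the way Tim's did), not a model of real traffic:

    import random
    import statistics

    random.seed(0)

    # Hypothetical per-query F-scores for one hour of traffic.
    scores = [random.uniform(0.7, 1.0) for _ in range(9800)] + [0.0] * 200
    scores.sort()

    print("mean  :", round(statistics.mean(scores), 3))
    print("median:", round(statistics.median(scores), 3))
    print("p01   :", round(scores[len(scores) // 100], 3))  # ~1st percentile

    # The mean and median barely register the 200 hard failures,
    # while the 1st percentile drops to zero.

And with a failure rate as small as the 0.01% in the earlier hypothetical, even the 1st percentile wouldn't move, which is part of why a single aggregate number has trouble capturing this kind of regression.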

If there is published literature on how search quality is measured I'd love to see it. Would be especially interesting to see real-time data -- e.g. what is the impact of 1 data shard outage on overall user-experienced quality according to some metric?
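As a crude back-of-the-envelope for the shard question (assuming documents are spread uniformly across shards, which is an assumption about a generic architecture, not a claim about how Google actually shards), a one-shard outage turns roughly 1/N of single-answer queries like Tim's into total misses:

    # Toy model: each document lives on exactly one of n_shards, chosen uniformly.
    def p_single_answer_lost(n_shards, shards_down=1):
        # Probability that a query's only relevant document sits on a downed shard.
        return shards_down / n_shards

    print(p_single_answer_lost(1000))      # 0.001 -> 0.1% of such queries lose their answer
    print(p_single_answer_lost(1000, 10))  # 0.01  -> 1% of them do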


"Modern Information Retrieval" by Baeza-Yates / Ribeireo Neto a few years ago used to be a good standard work.

I'm not sure, though, how well it has kept up with aspects like real-time search and graph search, both of which are fairly recent developments.



