> - reputation/karma that isn't apparent or chased
The more I think about those problems the more I'm convinced up- and downvotes are a mistake in general. They can only cause damage and are completely useless, since different people don't even use them to mean the same thing. For example, when someone gets 5 downvotes it's probably for at least 2 different reasons, none of which are communicated to the poster. When I get a random downvote I'd really like to know if it was warranted, but there's no way to find out.
If any kind of rating system had to exist, I'd vote for something like tags: users could tag any content with any 1-2 words, and the most frequent tags would be visible without extra clicks. For one, this would give more nuanced information, and at the same time it would make tons of content much easier to find or filter.
That reminds me of the Slashdot moderation system - when a user gets mod points they get to spend them on posts as they browse, indicating why they spent each point the way they did (e.g. 'Troll' or 'Flamebait').
Having used many broken moderation systems and designed a few (also broken) myself, a few observations.
- Popularity itself is a very poor metric for quality. It's mostly a metric for ... popularity. Which is to say: broad appeal, simplicity, emotive appeal (or engagement), and brevity. This does however correspond reasonably well to sales and advertising metrics.
- My own goal tends toward maximising overall quality, weighted heavily toward truth value and relevance.
- There's some value to a multi-point rating scale. This is called a "Likert scale", typically with an odd number of points (3, 5, 7, ...), most commonly encountered as a star-rating system. Amazon and Uber are the most familiar of these today, and they highlight the failure modes. If users' ratings are rebalanced based on their own average rating, at least some of the issues go away (e.g., a very positive rater handing out 5/5 will have those ratings discounted, while a conservative rater offering 3/5 on average would see theirs uprated). The adjusted average becomes the rebalanced rating (a minimal sketch of such an adjustment follows after this list).
- Note that a capped cumulative score is not the same as an averaged Likert score. Slashdot's moderating system is an example of the former. It ... kind of works but mostly doesn't. Highly-ranked content tends to be good, but much content deserving higher ratings is utterly ignored.
- Taking the number of interactions and applying a logarithmic function tends to give a renormalised popularity score. That is, on a log-log basis, you'll tend to see a linear scaling from "1 person liked this" to "10 billion people liked this" (roughly the range of any current global-scale ratings system). See also: power-law distributions, the Zipf distribution. (A small numeric sketch follows after this list.)
- Unbiased and uncorrupted expertise should rate more strongly. In averaging the inputs of 300 passengers + 2 pilots for an airplane's flight controls, my preferred weighting is roughly 300*0 and 2*1. Truth or competence are not popularity games.
- Sometimes a distinct "experts" vs. "everyone" scoring is useful. I've recently seen an argument that film reviews accomplish this: the expert reviewers' scores set expectations for "what kind of film is this?", while the popular rating answers "how well did this film meet those expectations?" There are very good bad films, and very bad good films, as well as very bad bad films.
- "The wisdom of crowds" starts failing rapidly where the crowd is motivated, gamed, bought, or otherwise influenced. Such behaviour must* be severely addressed if overall trust in a ratings system is to remain.
- Areas of excellence ("funny", "informative", "interesting", etc.) are somewhat useful, but very often the cost of acquiring that information is excessively high. Indirect measures of attributes may be more useful, and there's some research in this area (Microsoft conducted studies in the 2000s on classifying Usenet threads based on their "shape"). Simply from the structure of reply chains, there were useful classifications: "dead post", "troll", "flamewar", "simple question everybody can answer", "hard question many can guess at but one expert knows the answer", etc.
- Actual engagement with content, even just a vote or other small action, comes from a small fraction of total views. Encouraging more rating behaviour often backfires; make do with the data that occurs naturally, as incentivised contributions skew results.
- Sortition in ratings may be useful. It greatly increases the costs of gaming.
- As is sortition of the presented content. Where it's not certain what is (or isn't) highly-ranked content, presenting different selections to different reader cohorts can help minimise popularity bias effects.
- Admitting that any achieved ratings score is at best a rough guess of the ground truth is tremendously useful. Fuzzing ratings according to their likely error can help balance out low-information states when assessing content (a sketch of one such approach is below).
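To make the rater-rebalancing idea concrete, here's a minimal Python sketch. It assumes the simplest possible adjustment rule (shift each rating by the gap between the rater's personal average and the global average, then average per item); the data layout and function names are illustrative, not any particular site's implementation.

```python
from collections import defaultdict
from statistics import mean

def rebalanced_scores(ratings):
    """ratings: list of (rater_id, item_id, score) tuples on, say, a 1-5 scale."""
    global_mean = mean(score for _, _, score in ratings)

    # Each rater's personal average, used to spot habitual 5/5 or 3/5 raters.
    by_rater = defaultdict(list)
    for rater, _, score in ratings:
        by_rater[rater].append(score)
    rater_mean = {r: mean(s) for r, s in by_rater.items()}

    # Shift every rating by the rater's deviation from the global baseline,
    # then average per item: the "adjusted average" becomes the item's score.
    adjusted = defaultdict(list)
    for rater, item, score in ratings:
        adjusted[item].append(score - rater_mean[rater] + global_mean)
    return {item: mean(scores) for item, scores in adjusted.items()}

# Rater "a" hands out 5s to everything; rater "b" averages around 3.5.
ratings = [
    ("a", "post1", 5), ("a", "post3", 5),
    ("b", "post2", 4), ("b", "post3", 3),
]
print(rebalanced_scores(ratings))
# post2 now outranks post1: b's 4 carries more signal than a's habitual 5.
```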
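And a tiny sketch of the log renormalisation of raw interaction counts; the 0-10 output range and the 10-billion ceiling are arbitrary choices for illustration.

```python
import math

MAX_INTERACTIONS = 10_000_000_000  # ~10 billion: rough ceiling of any current global-scale system

def popularity_score(interactions: int) -> float:
    """Map a raw interaction count onto a 0-10 logarithmic scale."""
    if interactions < 1:
        return 0.0
    return 10 * math.log10(interactions) / math.log10(MAX_INTERACTIONS)

for n in (1, 1_000, 1_000_000, 10_000_000_000):
    print(f"{n:>14,} interactions -> {popularity_score(n):.1f}")
```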
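Finally, one possible reading of "fuzzing ratings based on the likely error": treat positive/total ratings as Bernoulli trials and rank by a draw from the Beta posterior rather than the raw average, so low-information items get wide, noisy draws instead of a falsely confident score. This Thompson-sampling-style ordering is my own interpretation of the idea, not necessarily what was meant above.

```python
import random

def fuzzed_rating(positive: int, total: int) -> float:
    """Sample a plausible 'true' positive rate given the observed counts."""
    # Beta(1 + successes, 1 + failures): wider spread when data is scarce.
    return random.betavariate(1 + positive, 1 + (total - positive))

items = {"new_post": (2, 2), "established_post": (180, 200)}
for name, (pos, tot) in items.items():
    draws = [round(fuzzed_rating(pos, tot), 2) for _ in range(3)]
    print(name, draws)  # new_post's draws vary widely; established_post's cluster near 0.9
```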