What am I missing here? Unless all the devs were really bad at maths (unlikely if they're game devs) then this seems like a really easy bug to find all things considered?
I thought maybe it was going to be some weird DB glitch or something far upstream from the algorithm selecting which player to attack, but it was literally the logic of the very algorithm you would first look at if you were aware of such an issue.
This was a fun read though. Finding and fixing bugs like this is some of the most satisfying work we do imo, and no one outside of tech understands =)
I would speculate the main hurdle was probably believing the players in the first place. Humans are notoriously prone to seeing patterns in properly random data. And statistical bugs like this require more effort and careful attention to detail to reproduce than deterministic bugs.
Another hurdle is likely that game developer culture strongly favors integration testing over unit testing. Games are optimized for fun, not correctness, and you can't unit test fun. This specific roulette selection function would have been straightforward to unit test, and a unit test would have caught the distortion. But now imagine people keep varying how important distance is to the calculation in order to make it "feel right". Updating those unit tests is suddenly a noticeable slowdown on how quickly you can iterate on game feel.
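For what it's worth, here is a minimal sketch of the kind of unit test I have in mind. Everything in it is hypothetical: `roulette_select`, the example weights, the seed, and the tolerance are my own stand-ins, not the game's actual code. It just draws many samples from a plain roulette (weight-proportional) selection and checks that the observed frequencies land close to the intended ones.

```python
import random
from collections import Counter


def roulette_select(weights, rng):
    """Pick an index with probability proportional to its weight.

    Hypothetical stand-in for the game's target selection.
    """
    total = sum(weights)
    pick = rng.random() * total
    running = 0.0
    for index, weight in enumerate(weights):
        running += weight
        if pick < running:
            return index
    return len(weights) - 1  # guard against floating-point round-off


def selection_frequencies(weights, trials, seed=12345):
    """Run the selection many times and return the observed share per index."""
    rng = random.Random(seed)  # fixed seed keeps the test repeatable
    counts = Counter(roulette_select(weights, rng) for _ in range(trials))
    return [counts[i] / trials for i in range(len(weights))]


def test_roulette_matches_weights():
    weights = [1.0, 2.0, 3.0]  # assumed example weights
    freqs = selection_frequencies(weights, trials=60_000)
    total = sum(weights)
    for freq, weight in zip(freqs, weights):
        # 0.01 is several standard deviations wide at this sample size
        assert abs(freq - weight / total) < 0.01
```

The catch described above still applies: as soon as someone retunes the weighting, the expected shares in a test like this have to be recomputed too.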
Yeah, WoW had so many problems with people not understanding probabilities that they added explicit code to track all drop rates and compare them to the intended rates, and actually found a bug or two that way.
But mostly it was to explain to people that a 1 in 100 chance doesn’t mean you’ll get it even after 200 goes.
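The arithmetic behind that is short enough to sketch. Assuming every attempt is independent with the same drop rate (my assumption, and `chance_of_no_drop` is just a name I made up), the chance of still having nothing is the miss probability raised to the number of attempts:

```python
def chance_of_no_drop(drop_rate, attempts):
    """Probability of zero drops in `attempts` independent tries."""
    return (1.0 - drop_rate) ** attempts


for attempts in (100, 200, 500):
    miss = chance_of_no_drop(0.01, attempts)
    print(f"{attempts} attempts: {miss:.1%} chance of still having nothing")
```

For a 1 in 100 drop that works out to roughly a 37% chance of nothing after 100 goes and roughly 13% after 200, so with a large player base plenty of people will hit exactly that streak.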
Systematic debugging flaws probably; and lack of tooling to easily isolate.
Systematic flaws: a mix of groupthink, early flawed assumptions, deference to team leads, an 'I'll just look for 1hr and move on if I can't find it' attitude (which leads to not really looking), or just plain "reading" the code instead of searching it.
Lack of tooling: many game engines are infamous for lack of control over tooling. I haven't used many, but I understand it would be quite an effort to run meaningful parameterised or structured fuzz testing on most systems. This makes it hard to artificially confirm suggestions. That said, there is practically no excuse for them not to just add a bunch of counters to the game - even with only their internal testers it would very quickly become clear there was a bias.
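By "a bunch of counters" I mean something as small as the sketch below. The class name, the `record` hook, and the idea of passing in intended shares are all my own assumptions about how you might wire it in, not anything from the actual game. It tallies who gets attacked and computes a chi-square statistic against the intended distribution.

```python
from collections import Counter


class AttackCounters:
    """Hypothetical tally of which player each monster attack targeted."""

    def __init__(self):
        self.counts = Counter()

    def record(self, player_id):
        # Call this wherever the game commits to a target.
        self.counts[player_id] += 1

    def chi_square(self, expected_share):
        """Compare observed counts to intended shares.

        expected_share maps player_id -> intended probability (summing to 1).
        Larger values mean the observed targeting is further from intended.
        """
        total = sum(self.counts.values())
        if total == 0:
            return 0.0
        stat = 0.0
        for player_id, share in expected_share.items():
            expected = total * share
            stat += (self.counts[player_id] - expected) ** 2 / expected
        return stat


counters = AttackCounters()
for player_id, hits in (("a", 200), ("b", 50), ("c", 50)):
    for _ in range(hits):
        counters.record(player_id)

even = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
print(counters.chi_square(even))
```

With three equally weighted players there are two degrees of freedom, so anything well above about 5.99 (the usual 5% cutoff) says the targeting is not what was intended.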
Most of my 'should have caught it earlier' bugs are of the 'deference to lead' variety. I looked, didn't see it immediately, handed off with some notes, and then the follow-up debugger(s) took my notes or thoughts as gospel. This is really hard to fight - I write something along the lines of "my hunch is there is a problem in code x because it handles y and is poorly structured/tested. I checked z and found i, j - queries as follows" and then find the debuggers effectively refuse to look anywhere past x. This is particularly true for a group of debuggers, who play Chinese whispers with groupthink and invent reasons it must be x.
Yes I came to post exactly this, and found your comment, so will reply instead. The bug doesn't seem to be obscure. It is there in the right place. Someone thinking about checking "why some players are attacked more often?" would probably choose this as the first place to double check, since it is directly related to selecting the player for attack.
Maybe the most occult part of this is figuring out that the unique IDs assigned to the players play a role.
> The bug doesn't seem to be obscure. It is there in the right place.
Well, there is a larger bug -- the entire algorithm, functioning properly, still won't behave the way this letter says that it should. It's not clear what it's designed to do, but it's very obvious that it doesn't do what the description says it does.
All characters would have been affected. If they had picked any 3 characters, and put them in a room with monsters repeatedly, they would have observed that the same character was attacked most times.