Today we’re going to look at a piece of video game history: a bug introduced six years ago to the game Dungeon Crawl: Stone Soup (abbreviated DC:SS or just Crawl, as distinct from the game named simply Crawl). It persisted for two weeks before it was removed, and it has a lot to teach us about just how easily domain rationality can make you vulnerable to certain kinds of ignorance. Very much including myself - I was an active Crawl player at the time and fell prey just as much as everyone else.
Before explaining the bug, some background on Crawl is needed for those who haven’t played. (You could also just play it - it’s free and available both in the browser and as a local download.) Crawl is a roguelike, a genre of game that’s undergone some linguistic expansion in the last decade, so we’ll focus only on the two attributes that are core to our story. The first is that the levels are procedurally generated every time - you can’t memorize the layout of the dungeon, because each run takes place in an entirely new version of it. The second is that consequences are permanent - you can’t ever reload a save to a previous turn, and if your character dies, your save file is deleted. Combining these factors means that there’s no way to guarantee yourself a victory in Crawl. You can’t endlessly grind to improve your level, nor can you look up exactly what steps to take or memorize a route. And if you die, all of your progress is gone - the only difference run to run is what you learn.
So beating Crawl requires some degree of self-reflection. After all, if you kill off a character with eight hours invested in them (something I’ve done more than my fair share of!), those hours were completely wasted unless you manage to learn something from what happened. It naturally forces you to take responsibility and update your internal model, and it does so in a rigorous, no-nonsense way that’s hard to come by in everyday life. You can make all sorts of incorrect predictions in real life and simply forget about your misses and remember your hits. But if you predict that you can beat that ancient lich and you can’t, then you are Dead With No Do-overs, and It Sure Looks Like You Don’t Know What You’re Talking About With Ancient Liches, and Maybe You Should Work On That.
Of course, I could also just conclude that this particular lich was simply some sort of super-lich, or that his crystal spear was an unavoidably extra-crispy spear, or that some other unique-to-this-run factor means that nothing that happened was my fault. This lets me assuage my ego at the expense of invalidating the chance to learn from the run. I can decide that it was the game’s fault, really, and go into the next run with exactly the same mental model as before. Winning reliably at Dungeon Crawl: Stone Soup requires you to suppress this impulse. The reward you get for suffering short-term damage to your pride is a long-term improvement in outcomes that you know is directly tied to the lessons you learned. Always assuming that the system is consistent and fair, and that changes in outcomes are tied to your actions, is the path to victory. But once you’ve gotten good at making this assumption, a new question arises: what happens when the system really is inconsistent?
It is March 6th, 2015, and a bunch of nerds are about to fail to notice something.
The bug itself is one of those hilariously simple mistakes that computers are great at enabling. A small refactor in the melee damage code ended up causing the outgoing damage to be added to itself. This isn’t a bug that’s highly contingent or complicated: all player melee damage was simply doubled, every attack. However, the actual damage numbers are not exposed in a game of Crawl. (Charmingly, the game uses successive numbers of exclamation points to give a vague sense of how much damage you did. You dice the ogre like an onion!!!!) And the nature of Crawl combat is already swingy, since your armor essentially rolls to reduce damage, meaning it can have very inconsistent value from attack to attack. Still, doubling player damage is huge, powering you up on a level that outweighs most of the tactical concerns of the entire game. Surely a signal this loud can pierce through the noise, even if we can’t immediately look at the numbers?
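Before answering, it’s worth appreciating just how quiet this class of bug is at the source level. Here’s a hypothetical C++ sketch - emphatically not the actual Crawl commit, whose details I’m only paraphrasing from the description above - of how a refactor can leave behind a line that adds damage to itself:

```cpp
// Hypothetical sketch of the failure mode, not the real Crawl source.
// A refactor moves the damage calculation into a helper, but an old
// accumulation line survives in the caller.
#include <iostream>

// Stand-in for Crawl's melee damage roll.
int roll_melee_damage()
{
    return 7; // pretend the dice came up 7 on this attack
}

int attack_damage()
{
    int damage = roll_melee_damage();
    damage += damage; // leftover line: damage is added to itself
    return damage;
}

int main()
{
    std::cout << attack_damage() << "\n"; // prints 14, not 7
}
```

Nothing here crashes, throws, or logs a warning; the game plays on exactly as before, just twice as generously. So, with no error messages to help, could the players feel the difference?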
Well...no. My source here is the Something Awful Dungeon Crawl discussion thread(s) (the incident spans the .15 and .16 threads), chosen because it’s a single linear record of the discussion at the time, rather than requiring me to rewind Reddit or the official forum six years to follow the various strands of conversation. And as the days tick on from March 6th, conversation proceeds mostly as normal, with the occasional nod to the blessings of luck. Floodkiller writes, celebrating a win:
I need to stop playing until (the .16 release tournament) starts or I’ll waste all my good luck!
LogicNinja with a comment that’s heartbreaking in retrospect:
AW YISS WON THAT KoBe [Kobold Berserker]
MY FIRST STREAK
I’M REAL GOOD AT CRAWL NOW
The victory posts come in a bit more frequently than usual, but it’s a high energy time for the thread, and who can be too surprised if a lot of players successfully improved themselves with all of the helpful advice they got? In fact, only a single player, Brannock, cottoned on before the bug was formally announced:
I don't even fucking know what to think. Suddenly I've become a legendary shining avatar of Cheibriados. Before this day last week my best streak was two. Then I knock off three and then six in a row. It's starting to make me feel a little [tinfoil hat]-ish. Maybe something really did go wrong in an unrelated commit and Crawl is unintentionally easier now???
(It’s worth noting that Cheibriados is a god that grants you substantial physical power in exchange for making you slower, and so someone streaking through Cheibriados-worshipping characters would benefit more from the bug than most players - and have more cause to notice it - since the numbers being doubled would be higher.)
Internet Kraken has a ready answer for Brannock:
You've been playing basically the same kind of character for over 30 games now. The game isn't easier, you just know exactly what to do in every situation with these combos. Still, good work.
This comment is a crucial one for understanding what went wrong, and we’ll return to it shortly - but let’s finish the story first. For a bit longer the thread continued on, with nods to luck or self-improvement but no more guesses that it might be a bug. Here’s Razzled:
Guys. GUYS. I am so fucking happy. I don't know what changed but after 3 years of playing in these tournaments I finally got not 1 but 2 wins for the first and second time ever!! ...I know these are just easy baby class/race combos and god but dang feels good to finally win
The commit causing the bug was reverted on March 21st, after two weeks of havoc. It fell to the appropriately named Can Of Worms to inform the thread:
So, it turns out a bug was accidentally introduced in one of the commits that's caused all melee damage dealt by players to be doubled (approximately.) Fun!
The responses were mostly about the clues that we missed in retrospect. Parthenocarpy:
Well that explains me going from a 2% winrate to 17.95% in the span of two weeks
Hawkperson:
Ahahah, I thought there were an awful lot of victory posts lately, but I just chalked it up to clustering or something. This is obviously why my gargoyle monk is still alive...
Fhqwhgads:
Well that explains how a shit player like me was able to streak with a bunch of melee dudes, but as soon as I tried something other than melee I got my ass handed to me repeatedly (like normal). Kinda sad about it now.
“Kinda sad about it now” was my reaction, too. The point of playing Crawl is getting good at Crawl, and successes are only meaningful insofar as the context stays continuous. We play to get unbiased feedback on how capable we are at learning; instead, for two weeks, that feedback was disconnected from all of our prior experience and made close to meaningless. (This is not meant as a criticism of the programmer who introduced the bug. Computers are warm rocks we tricked into doing math and it’s a miracle they do anything.)
In exchange for our in-game feedback becoming less relevant, we got a different kind of feedback: how good we are at noticing large-scale changes. We did really, really poorly. And I think a big reason why is that being the kind of person who’s good at Crawl makes you especially susceptible to these sorts of errors.
What kind of person is that exactly?
Researcher Sarah Constantin wrote an article called “Do Rational People Exist?”, which speculates about the “cognitive decoupling elite”:
Stanovich talks about “cognitive decoupling”, the ability to block out context and experiential knowledge and just follow formal rules, as a main component of both performance on intelligence tests and performance on the cognitive bias tests that correlate with intelligence. Cognitive decoupling is the opposite of holistic thinking. It’s the ability to separate, to view things in the abstract, to play devil’s advocate.
Cognitive flexibility, for which the “actively open-minded thinking scale” is a good proxy measure, is the ability to question your own beliefs. It predicts performance on a forecasting task, because the open-minded people sought more information. [21] Less open-minded individuals are more biased towards their own first opinions and do less searching for information.[22] Actively open-minded thinking increases with age (in middle schoolers) and correlates with cognitive ability.[23]
Under this model, people with high IQs, and especially people with training in probability, economics, and maybe explicit rationality, will be better at the cognitive bias skills that have to do with cognitive decoupling, but won’t be better at the others.
Speculatively, we might imagine that there is a “cognitive decoupling elite” of smart people who are good at probabilistic reasoning and score high on the cognitive reflection test and the IQ-correlated cognitive bias tests.
To which I would add: the cognitive decoupling elite are also people who can reliably beat Dungeon Crawl: Stone Soup.
But here’s Constantin again:
I’d expect [the cognitive decoupling elite] not to be much better than average at avoiding the cognitive biases uncorrelated with intelligence. The cognitive decoupling elite would be just as prone to dogmatism and anchoring as anybody else.
To which I would add: the cognitive decoupling elite are also people who can play a game very intently for two weeks without noticing that they’re doing double damage. In fact, I’ll go further: seeing the objective power of cognitive decoupling in systems that reward it can foster dogmatism and anchoring.
I promised I’d come back to this quote from Internet Kraken:
You've been playing basically the same kind of character for over 30 games now. The game isn't easier, you just know exactly what to do in every situation with these combos. Still, good work.
This statement is mostly true. Even if the double damage bug hadn’t been introduced, Brannock probably would have become better at winning with Cheibriados characters. Plenty of people still lost during the double damage weeks, because even with a massive advantage, Crawl is an awfully hard game. (At one point a few days after .16 was released, the winrate was 2.89% - and that’s with every melee attack dealing double damage!) Brannock still had to learn enough to let the double damage carry him to victory, and Internet Kraken was correct to note that experience is a hugely important factor in achieving victories.
But buried in Internet Kraken’s analysis is an unconscious proposition that we’ll call the “Systemic Stability Principle”:
If the change in the system’s outputs can be explained by a change to the system’s inputs, then the system itself didn’t change - only the inputs.
The mistake Internet Kraken (and the rest of us, implicitly) made was assuming that because experience with Cheibriados characters could explain Brannock’s sudden improvement in winrate, it must be the whole explanation - when, in fact, it was experience plus the double damage bug. The Systemic Stability Principle is clearly false. But why did we make this unspoken assumption?
Answer: because believing in the Systemic Stability Principle makes you good at Dungeon Crawl: Stone Soup, and many other things besides.
In fact, it’s almost a prerequisite to improving in highly formal domains! If you die to more damage than you were expecting and tell yourself “that one must have been double damage”, then you can’t learn anything. All of that stuff I said earlier about rigor and self-improvement starts with you holding the system constant enough that you can evaluate your changes over time. In a well-designed roguelike, it’s hardly an exaggeration to say that the more you can internalize the Systemic Stability Principle, the better you’ll be.
This is just another face of cognitive decoupling; the superpower comes from what you’re blocking out. Instead of designing experiments to turn every single little detail into data, you take the system at its word on what elements should be discretized as data, and focus on performing interesting higher-level logic with those pieces. This makes it hard to separate the gains of cognitive decoupling from its vulnerability. How do you set a threshold for following up on anecdotal observations that’s high enough to reap the efficiency rewards of a good system, but low enough that you still catch it when the system’s correspondence to reality genuinely breaks down?
There’s probably not a single comprehensive answer, but there are certainly some tricks that can help. The first is the easiest - remember that the Systemic Stability Principle is completely false, and just happens to be a false belief that it’s often useful to hold. A system being stable in the past doesn’t mean it’s stable now, and evidence that the Systemic Stability Principle works is not the same thing as evidence that it’s right.
It’s also important to avoid refuting observations by appealing to system definitions. If someone says “This feels like it’s doing more damage than it says”, resist the impulse to reply “Nope, it does exactly this value; it’s written right here.” Instead, try to design an experiment that would prove whether the value written in the system is correct. If designing that experiment is very hard, that should be interpreted as a risk factor - if no one can check that the system is doing what it ought to, then maybe it really is wrong! Play yes-and with system skeptics, letting them invest their time into correspondence work if they think something is wrong. Or be the system skeptic yourself, if a certain observation sits wrong with you.
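To make that concrete, here’s a sketch of what such an experiment might look like, again in C++ and again entirely hypothetical - the names claimed_mean and observe_attack are stand-ins for whatever the real system documents and whatever you can actually log, not any real Crawl interface. The idea is simply to gather enough samples that even very swingy individual results can’t hide a systematic discrepancy:

```cpp
// A hedged sketch of a system-correspondence check: log many attacks
// and compare their average against the documented value.
#include <iostream>
#include <random>

// Stand-in for one logged attack. Individual hits are wildly swingy
// (uniform over 0-10), and a hidden bug doubles every result.
double observe_attack(std::mt19937 &rng)
{
    std::uniform_real_distribution<double> roll(0.0, 10.0);
    return 2.0 * roll(rng);
}

int main()
{
    const double claimed_mean = 5.0; // what the system says it does
    const int trials = 10000;

    std::mt19937 rng(42);
    double total = 0.0;
    for (int i = 0; i < trials; ++i)
        total += observe_attack(rng);

    const double observed_mean = total / trials;
    std::cout << "claimed " << claimed_mean
              << ", observed " << observed_mean << "\n";
    // With this many samples, an observed mean near 10 instead of 5
    // stands out far beyond the noise of any single swingy hit.
}
```

The per-hit noise that made the bug invisible to intuition is exactly what averaging washes out; the experiment doesn’t need to be clever, it just needs to exist.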
And more than anything, we need more humility from the cognitive decoupling elite. We’re hard at work turning the world into metrics and dashboards and systems, and obviously those of us who are good at systems are happy to have things be more personally legible. But before we get too excited about turning the world into a video game, let’s remember how stupid we all looked when we tried treating a video game like a video game.
(I’m going on vacation, so barring an urgent burst of inspiration, Desystemize will be on break until sometime early September. Don’t forget to subscribe if you’d like to know when it’s back!)