16 Comments
Doctrix Periwinkle

Came here from Freddie de Boer's subscriber writing post, and I am so glad I did. What a superlative article! This has given me a lot to think about.

My background is in infectious disease research, and I now teach things about infectious diseases and immunology. Something that my students always find troubling is the amount of randomness and death that a working immune system entails. How the system works is: I make a bunch of B-cells and T-cells, and they randomly put together their B-cell and T-cell receptors to recognize pathogens from a library of parts. Then, for good measure, they mutate the DNA for those parts a ton of times, just to throw in some more randomness. Then, they are exposed to self-proteins, and if they respond to self, they die. Oh, also if they don't respond to signals from other cells, they die. Or if they respond too aggressively to signals from other cells, yep, dead too. So practically all the B- and T-cells ever made in your body just die almost as soon as they're made.
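
If it helps to see the shape of that pipeline, here is a back-of-the-envelope Python sketch. Everything in it is invented for illustration (the alphabet, the toy binding test, the numbers), and it only models the "responds to self, dies" step, not the signaling checks:

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # amino-acid letters, purely for flavor

def random_receptor(length=8):
    """Assemble a receptor from random parts (stand-in for V(D)J recombination)."""
    return "".join(random.choice(ALPHABET) for _ in range(length))

def mutate(receptor, rate=0.2):
    """Throw in some more randomness (stand-in for somatic hypermutation)."""
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in receptor)

def binds(receptor, protein, threshold=2):
    """Toy binding test: enough matching positions counts as recognition."""
    return sum(a == b for a, b in zip(receptor, protein)) >= threshold

def surviving_repertoire(self_proteins, n_cells=100_000):
    """Make a huge random repertoire, then kill every cell that recognizes self."""
    survivors = []
    for _ in range(n_cells):
        cell = mutate(random_receptor())
        if any(binds(cell, p) for p in self_proteins):
            continue  # responds to self -> dies
        survivors.append(cell)
    return survivors

self_proteins = [random_receptor() for _ in range(50)]
repertoire = surviving_repertoire(self_proteins)
print(f"{len(repertoire)} of 100000 cells survived selection")
```

With these made-up numbers most of the repertoire gets culled, which is the point: the waste is the price of covering an unpredictable space of future pathogens.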

But that's terrible! my students say. Why isn't there a more *efficient* system, without so much waste? Why all these random mutations and recombinations? Wouldn't it be better to design the perfect antibody?

Ah, but our legions of ancestors evolved this system because what was needed to survive wasn't efficient design; it was having the right tool on hand for some future infectious environment. What might that environment be? No one can predict, so let's just make all the tools we can and destroy the ones that don't work.

This is something I was thinking about in association with this excellent article.

Frank Lantz

This was brilliant. I totally agree that there is something important in the KataGo adversarial policy paper that most people haven't really contended with. Knowing where you are in the metagame stack is a hard problem, and by definition, you can't solve all hard problems equally well. It almost feels like there is a law here somewhere, something like "Every system must always trade off between doing something effective and making itself invulnerable to exploitation."

Certainly no guarantee of safety, but perhaps a good direction for safety-minded people (like Zvi) to be exploring, rather than only weighing its pro-worry or anti-worry content.

Because I can't leave well enough alone, let me take one small stab at defending car meta in Geoguessr.

I think your distaste for car meta, on aesthetic grounds, is completely natural, understandable, and relatable. So much of the beauty of Geoguessr is in seeing the world, the actual world, not as tourist postcards, but in its simple reality - a random street corner in Gdansk, a boring pharmacy in Dundee, a grimy gas station in Dakar - mundane, but also sublime. Who wouldn't resent the intrusion of car & camera metas into this beauty? Shouldn't the game be about grokking the deep patterns of this beauty? Instead of memorizing the arbitrary artifacts of these framing protocols?

And yet.

As an insufferable game snob, I can't help but notice that this reaction, while perfectly sensible, resembles a common refrain that I hear as a reaction to many competitive games...

Scrabble - the fun part is the anagramming, memorizing the dictionary is a grueling chore

Fighting Games - the fun part is reading and responding to your opponent, the precise technical input requirements are unnecessary friction

Chess - the fun part is the improvisational problem-solving of the middle game, studying the opening book is a tedious bore

Go - the fun part is intuiting the deep strategic flow, reading out a ladder, one step at a time, is a drag

I think all deep competitive games force us to confront this same unpleasant fact - that you can never escape the brute fact of the ordinary, the repetitive, the arbitrary details of whatever framing device we use to demarcate the edges of the game, and the rote practice and memorization required to master them. The fact that the things we find most beautiful - our intuition, our imagination, our creative epiphanies - are inextricably linked to these boring, ordinary things, maybe even made up of them. But confronting this fact doesn't make the beauty go away, it just makes it more complex, more poignant. The world, the actual world, with its rolling hills and craggy mountains, its trees and bollards and traffic signs, and its cars and cameras, is mundane, but also sublime.

collin

Thanks much for this detailed and wonderful comment!

"It almost feels like there is a law here somewhere, something like "Every system must always trade off between doing something effective and making itself invulnerable to exploitation."

I think it's even a bit more extreme than that. Notice that if a zebra wants to avoid dying due to a cheetah, the cheetah isn't eating the grass really fast so the zebra starves to death. The cheetah is testing the zebra on axes unrelated to its grass-eating capacity - how well do you observe, how fast do you run? Handling the cheetah will certainly cost at least some grass-eating optimization by adding another constraint, but it's not the case that the zebra has easy access to a continuum of tradeoffs. It has to positively create capability at observing and running; it'll need to eat less grass to do that, but that doesn't mean it gets those capabilities simply by choosing to eat less grass.

And the value of the tradeoffs is partially stored in the predators, not the prey. If a zebra decides "I will only eat half as much grass to spend more time in super high readiness", then is it eating the grass during the day or at night? Well, that depends on the gap between its nocturnal capabilities and its predators'; perhaps night feeding is good in some ecosystems and dangerous in others. So it's a lot harder than just choosing to trade effectiveness for resilience; you need a sense of who is trying to exploit you and how, which either takes time or requires perfectly imagining the entire ecosystem in advance. That imagining is a lot harder than just eating grass, and advances in grass eating don't indicate capacity improvements in ecosystem imagining.

As for your other point - I completely agree that the value of competitive games comes in taking them as they are, accepting their ontologies at face value. I agree that a "no car meta" tournament would be unenforceable in practice and aesthetically even more ugly than using it. It does have a strange kind of beauty that thousands of invested, diligent students of Geoguessr could see a supply bag for water that goes on your car and say "I know that bag - keeps you hydrated in the Mongolian Steppe, that one does."

But all of this beauty is dependent on engaging the game on its own terms. Which is the point of games and a good thing to do! But if you want to use Geoguessr to learn about the *world*, that's when an aesthetic of the real is important, and optimizing on the game as-is eventually means stepping away from the world, not towards it. My point here is that playing a little Geoguessr will teach you a little bit about the world, but trying as hard as you can eventually has negative returns while the dataset is held constant. But I certainly wouldn't want every game subject to an aesthetic of the real! Games should have the aesthetic of games.

Frank Lantz

well said!

Dylan Black

A friend of mine summarized the essence of the problem you so eloquently point out as the difference between “kind” and “wicked” learning environments. Our environment is wicked; the LLM’s is kind (fast feedback, mistakes not deadly, well-defined objective functions).

Related to this, I recently wrote a (short) article on a similar theme you might enjoy:

https://maximumeffort.substack.com/p/destroying-the-world-is-a-difficult

I also have an (indirect) question for you: given that LLMs are dominantly trained on free internet content, I wonder how much the p(doom) arguments are poisoning the training set against well-aligned AIs? An AI that has no concept of an evil-world destroying AI is surely less likely to come up with that as a plan, right? I wonder if the best thing we can do is inject “adversarial” (in the sense given in your article) nice happy AI stories, like the Minds from Iain Banks’ Culture series.

1123581321

“I wonder if that fake invoice story inspired a lot of copycat criminals or not. It’s not something you hear about because companies obscure information about these adversarial dynamics.”

I asked an accountant friend. He looked at me like I was a baby. This happens all the time, he said; every AP dept is on the lookout for switched banking info.

Anthony Bailey

Thanks for responding to Zvi (https://thezvi.substack.com/p/ai-104-american-state-capacity-on/comment/94896861) - I hope you'll continue the dialog.

collin

Well, probably not again in that thread - I think an author defending themselves in a comment thread more than once usually just irritates people without changing any minds and I didn't see any of the curiosity signals that might override that. But yes, it was clarifying to see where the dismissive attitude is coming from to help tailor my critique in future writing, which will hopefully reach him however this one did.

Mostly it seems like the meaningful difference is how much you believe in the power of simulation. I think having done all the work to parcel out Representation and Uncertainty, Fractal Ratchet, and Thought Strewn All Around Us, I'm probably in a good position to go further on the bear case for simulation.

And I guess I need to go into more detail on Prop 20 because I think it's a type error to suppose you're "giving up" efficiency for robustness as though they're both numerical quantities; the point is you don't know what to give up until you've learned what the threats are, so you can become more arbitrary unilaterally but you can't become more robust unilaterally. Time to crack open Antifragile again :)

Kshitij Parikh

For points 1 and 2, Francois Chollet differentiated intelligence and skills: https://arxiv.org/abs/1911.01547

Vermora

You're completely right about needing to take things back to the real.

When thinking about how to protect an AI against adversarial inputs, I ask myself, "If someone could control every one of my sensory inputs -- everything I saw, heard and touched, like the Matrix -- could they make me believe anything?" Of course they could. There's simply no defense against having all your sensory inputs controlled by an adversary.

But that's the constraint chatbots are operating under. They have only one sense, text input, and the user is in control of it. Unless the chatbot is also running some other sensory input that cannot be controlled by the user, there is simply no defense.

Jeremy Côté

Thank you, Collin, for this awesome essay. As someone outside (but curious about) the AI space, I loved your examples.

Aaron Weiss

Let's say I really cared about KataGo being mostly immune to being preyed on.

I spin up a factory for training predator models

Have the predator models play KataGo a few times, all in parallel

Have KataGo train on those tricks, then self-play a few times

Set a cron job to go through these steps daily

How long till KataGo gets too good for the factory?
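
For concreteness, here is a toy sketch of that cron cycle in Python. None of this is KataGo's real training code; the Model class, the exploit space, and the search budget are made-up stand-ins, just to show the shape of the loop:

```python
import random

class Model:
    """Made-up stand-in for a trained Go model: a skill number plus the set of
    exploit patterns it has already been patched against."""
    def __init__(self):
        self.known_exploits = set()
        self.skill = 0

def train_predator(champion, search_budget=1_000, exploit_space=10_000):
    """The factory: search for an exploit the champion hasn't been patched against."""
    for _ in range(search_budget):
        candidate = random.randrange(exploit_space)
        if candidate not in champion.known_exploits:
            return candidate
    return None  # the factory came up empty this round

def daily_cycle(champion, n_predators=8):
    """One cron run: spin up predators, harvest their exploits, patch, self-play."""
    exploits = {e for e in (train_predator(champion) for _ in range(n_predators))
                if e is not None}
    champion.known_exploits |= exploits  # fine-tune on the games it lost
    champion.skill += 1                  # plain self-play keeps improving too
    return len(exploits)

champion = Model()
for day in range(30):
    found = daily_cycle(champion)
    print(f"day {day}: {found} new exploits found, "
          f"{len(champion.known_exploits)} patched so far")
```

In this toy version, how long it takes is just a race between the size of the exploit space and the factory's daily search budget.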

collin

I think in practice this all comes down to how you operationalize "factory for training predator models".

Think of our own immune system. You train on lots of potential pathogens. How long until your immune system gets too good for diseases? Well, I mean, what about the new ones? What does it mean to be immunized to all new diseases before they happen? If your "factory" is doing something like "enumerate every possible disease" and you get through them all, that would do it - but that approach isn't compatible with states of any real complexity. And are you sure there will never be a Red Queen loop where your fix for disease N makes you re-vulnerable to any previous disease?

So the question is how well you can approximate "every disease" with your factory and how badly it hurts the model to accommodate them. And my contention here is that making the good disease factory is just a fundamentally different thing than optimizing the immune system. That "optimize under these conditions" and "generate all the conditions there could be" diverge farther and farther as you try to get less abstract and more real, because there are just many, many more ways to be than there is time to write down.

Aaron Weiss

Your abstractions are leaking

The immune system evolves to defeat invaders, which evolve to beat immune systems, so these are the same as predatory Go models, right? Nope.

Go is symmetrical

Immune systems are asymmetrical

Immune systems are functioning under a handicap: they need to kill the invaders without killing the body.

That is, there are costs associated with getting good at killing all pathogens, so just optimizing being good at killing pathogens isn't going to optimize the whole system.

KataGo just needs to play better; there is no "killing the body" here - this makes optimizing far more straightforward, at least to a certain degree (perhaps you get some local maxima which pure self-play would've eventually transcended with the same amount of resources at some limit, though I don't think so)

Perhaps you could argue that the cases are similar in regards to attention/short-term memory/production ratios of cell specialists, but again the actual model can simply improve over time

collin

It's true that even if perfect inoculation worked in symmetric games, that wouldn't necessarily mean it does in asymmetric games. But my point in including the KataGo example is that I think the distinction between symmetrical games and asymmetrical games is much less straightforward than it appears. Because in practice there are simply too many states to deal with exhaustively, you need to compress them into some sort of higher order ontology. And if two sides have different goals they can end up with different compressions, and one side can exercise representational privilege over the other even if they're looking at the same stream of public information. (See https://tis.so/wine-in-front-of-me for a bit more on this.)

And when you think of it this way you see that even in symmetric games the one who is trying to win against all comers is at a huge disadvantage relative to the predator. The victim needs to see the board the way every predator might; each predator only needs to see it two ways (how does the victim see the world, and how ought I see it to take advantage?). In practice we solve this by waiting to respond to a predator until they've proven they can also stay alive. But if you're trying to speedrun through anti-inductive dynamics with simulated self-play, you don't get this privilege.

Aaron Weiss

Higher ontologies can be entirely psychological, but they can also be more and more accurate representations of reality.

In strategy games, real-time games, and combat, up to a certain point you can beat more skilled opponents by trying weird things that are outside their expectations. At a certain Elo, weird tricks are just mistakes which get punished.

Your point about brittleness is well made in regards to AI; the question is what happens when you optimize for resilience as well as self-play.

The defending champion need only see their own weaknesses on top of self-play, hence the factory creating models specifically designed to defeat the champion.

I suspect you very rapidly run out of new tricks at low levels of play, and end up needing to use very strong models to predate on other very strong models.

How many predator models do you think it would take before they stop succeeding?
