Conor Griffin recently sat down for a discussion with Been Kim and Neel Nanda, two leading lights of the AI explainability field. We touch on the history and goals of explainability research and the usefulness of different approaches, including mechanistic interpretability and monitoring models’ Chain-of-Thought. We then explore how humans, including leading chess players, can learn new knowledge from AI systems. We end with Been and Neel’s top AI explainability policy ideas and dream experiments.
Please see an edited summary below. Thanks to Nick Swanson for his support on this. Enjoy!
The origin story
Conor: Let's start at the beginning. How did you both start working on AI explainability, and what motivates you to work on it today?
Neel: I originally got into this field out of a concern for AGI safety. It seems we are on track to produce human-level AI within my lifetime. My core motivation is that if we can truly understand these systems, we are more likely to achieve better outcomes. Several years later, here I am.
Been: In 2012, I was starting my PhD when AlexNet achieved a huge performance boost on ImageNet. It was a massive deal. But when I asked people how it worked, nobody could answer. That felt wrong to me. At the time, people actively discouraged me from working on interpretability, saying no one cared and it wouldn't lead to a good research job. But I was fascinated by the intellectual question: how do we understand these systems? There have been moments when I've considered working on other topics, but I always come back to this question. I can't escape it. Of course, as Neel says, its importance is now becoming clearer than ever, given how prevalent machine learning is becoming in our society.
Conor: AI explainability research looks quite different today than it did 40 years ago, or even 5 years ago. Been, how would you paint this history? What notable eras, or epochs, would you call out?
Been: In the 1980s, folks were really excited about expert or rules-based systems. Despite what it may sound like today, people built some very capable systems that could, for example, outperform doctors at tasks like predicting patients’ complications in an ICU. One of my thesis committee members, Randy Davis, designed a system called TEIRESIAS that had interpretability built in. It could explain its reasoning to a human expert and, because the machine surfaced the exact rules it used, the human could then modify the knowledge base. So it was a two-way interaction.
After the AI winter, the neural network era began. One branch of explainability research focused on developing models that were inherently interpretable, like Bayesian models or decision trees. Another branch focused on post-hoc interpretation methods, like saliency maps, that are applied on top of an existing neural network; you don't touch the model, but you try to distill some intuition out of it. My own work, in 2017, on Testing with Concept Activation Vectors (TCAV) fits here. At the time, people were using saliency maps to try to explain a model's prediction by highlighting which individual pixels in an input image were most important to the model's output. TCAV instead used real human concepts, like explaining that a picture was a dog by pointing to its ears and snout.
Now, we are in the post-LLM era and people are pursuing various approaches to explainability research. We have mechanistic interpretability, which Neel will discuss. We have methods for training data attribution, which tries to trace an LLM’s output back to its source data. Some people think Chain-of-Thought reasoning has solved interpretability - I disagree. And my own work focuses on teaching humans how to extract new knowledge from these models so we can keep up and evolve with them.
The goals of AI explainability research
Conor: We’re going to come back to the merits of Chain-of-Thought reasoning traces for explainability. As well as to the idea of extracting new knowledge from AI systems. But taking a step back, Neel, if you had to list the main reasons why people work on explainability today, what would you call out?
Neel: There are many goals. I can try to cluster them. First, we have the AGI Safety community that wants to ensure that we can safely control future, more powerful models that could potentially deceive us. To do that we need to be able to look inside them. Then there is the goal of avoiding harms that are already possible with current AI systems, such as issues of fairness and bias, for example in hiring or lending decisions. There are also fields like finance and medicine, where practitioners are often uninterested in AI systems that can’t explain their decisions. Finally, there is the scientific motivation. There are so many rich mysteries of intelligence and computation that may be answerable by studying AIs.
Been: Those are great. I would add one more: we need to keep up and evolve with these models. Imagine you run a factory and hire an amazing employee who eventually runs all the critical operations. One day, she quits or makes an unreasonable demand. You have no choice but to comply because you are no longer in control. You failed to maintain transferability and oversight. I make the same argument for LLMs. Even if a model is having a positive impact, we must know how it does what it does to ensure that humanity benefits and stays in control.
Neel: I strongly second that. It’s the extreme example of a problem we face today: empowering users. To customise and control AI in an intuitive way, non-experts need to understand why an AI model is doing what it’s doing. Been’s point is that even the experts not knowing could be catastrophic in the long term. It’s all a continuum.
The merits of mechanistic interpretability
Conor: Let’s go a bit deeper on the AGI safety perspective and how mechanistic interpretability could support that. I feel like a growing number of us in AI policy and governance circles will have seen mechanistic interpretability work. But it’s not always, itself, that interpretable to non-experts. From an AGI safety perspective, why is mechanistic interpretability useful, and what are the main approaches to it?
Neel: First, let me explain what mechanistic interpretability is. I find it helpful to contrast it to the standard ML paradigm, which I call ‘non-mechanistic non-interpretability’. Normal machine learning trains a neural network on inputs and outputs, and nudges the model to be better at producing the outputs. The inside of the model is just lists of numbers. We can see those numbers but we don’t know what they mean.
Mechanistic interpretability tries to engage with those numbers and a model’s ‘internals’ to help us understand how it works. Think of it like biology: You can find intermediate states like hormones. Figuring out what these mean is really hard, but can be incredibly useful. A lot of the core concerns in AGI safety boil down to: we will have systems capable of outsmarting us. But it’s much harder to deceive someone if they can see your thoughts, not just your words.
As for the approaches, I find it useful to divide the field into three buckets, even if people may disagree about where their own work fits:
Applied Interpretability: The low-risk, low-reward approach of just picking a task and trying to solve it better by using a model’s internals. For example, simple techniques like “probes” are very effective for detecting misuse. (A probe is a simple AI model that is trained to find the ‘signature’ of a concept, like sentiment, within a separate AI model’s internals).
Model Biology: This is the middle ground, where you try to understand the high-level properties of how a model is “thinking”. For example, there was a lovely bit of work on auditing models for hidden objectives from Samuel Marks and colleagues. They trained a model to have a hidden objective, where it would exhibit whatever behaviours it believed its training reward model would like, even if they were unhelpful to humans. They then used a bunch of mechanistic interpretability techniques to try to understand what that goal was. And several of the techniques were successful.
Basic Science: This is the extreme end, where people try to reverse-engineer a model into its source code or figure out the meaning of individual neurons, like some of the older ‘Circuits’ work. Historically this was the dominant approach, but now there’s more interest in the other camps.
I think all three groups contain potentially promising, potentially useless, approaches. It’s ‘research’ after all. I used to be very much in the ‘basic science’ camp, but I became a bit disillusioned. Now I largely view myself as being in the ‘model biology’ camp. But I also think ‘applied interpretability’ is important. About a third of my team is directly working on how we can use interpretability in production, with Gemini, to help make things safer.
Conor: In mechanistic interpretability, there has been a lot of focus on ‘sparse autoencoders’. Can you explain what SAEs are and how useful you think they are?
Neel: A sparse autoencoder tries to create a brain-scanning device for an LLM. It takes the confusing mess of internal signals - the model’s “brain waves” - and tries to identify meaningful concepts. Imagine putting a brain scanning device up to somebody’s head. If you know what you’re doing, you might say: “this particular squiggle means that Neel is talking, while this particular squiggle means Neel is feeling happy.” Anthropic famously developed an SAE to find a concept for the ‘Golden Gate Bridge’. They then manually amplified this so that Claude became obsessed by it - it started adding ‘by the Golden Gate Bridge’ to a spaghetti recipe.
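To make that a little more concrete, here is a minimal sketch of the idea in Python (PyTorch). The layer sizes, the ReLU encoder, and the L1 sparsity penalty are illustrative assumptions rather than the design of any production SAE:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over a model's internal activations."""

    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        # Encoder maps each activation vector into a much wider feature space.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder tries to reconstruct the original activation from those features.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse "concept" activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction term: don't lose the information in the activation.
    mse = ((reconstruction - activations) ** 2).mean()
    # Sparsity term: push most features to zero on any given input, so each
    # feature has a chance of corresponding to one human-recognisable concept.
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```

Amplifying one feature before reconstructing the activation is, roughly speaking, the kind of intervention behind the Golden Gate Bridge demo.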
My team pivoted to focus on SAEs in early 2024, but we found a problem. It's easy to find cool-looking examples. But the real question is whether they make systems safer. We tested SAEs on the task of classifying whether a user had harmful intent. It turns out that the simple, decades-old linear probe technique, from my 'applied interpretability' bucket, worked dramatically better. So I would say that SAEs are a useful tool for getting a general sense of what's happening in a model, especially if you have no idea what you're looking for. But they are not a game-changer for everything. If you have a specific goal, like detecting harmfulness, you're better off using simpler techniques tailored to that specific use case.
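For contrast, a linear probe of the kind Neel describes can be as simple as logistic regression on a layer's activations. A minimal sketch using scikit-learn, where the `activations` and `labels` arrays are random stand-ins for real model internals and human annotations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: in practice `activations` would be hidden activations collected
# from one layer of the model, and `labels` whether each prompt was harmful.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))
labels = rng.integers(0, 2, size=1000)

# The probe itself is just a linear classifier on those activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:800], labels[:800])

# At deployment time, scoring a new prompt is one cheap dot product per input,
# which is why probes are practical to run in production.
print("held-out accuracy:", probe.score(activations[800:], labels[800:]))
```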
Conor: Overall, how optimistic should we be about mechanistic interpretability? We see very different opinions on this, from the optimism of Dario Amodei to the pessimism of Dan Hendrycks.
Neel: In a sentence: ‘Big if true, but not yet production-ready, so watch this space’. More concretely, we’re starting to see real evidence of interpretability being the best tool for the job on certain tasks. Linear probes are both effective and cheap enough that they can be run on a model in production, which is pretty awesome. But these are not perfectly reliable techniques and there are cases where really boring baselines, like just asking the model its thoughts, are actually more effective.
The main reason to invest a lot in interpretability is its potential to be incredibly useful for future risks that existing techniques will not be able to solve. But even there, it’s one potentially promising approach. It is one tool in our toolkit. Combining multiple techniques will be much better than over-relying on one. For example, there’s this really important field of evaluating models for dangerous capabilities, but we also have evidence that models can tell when they’re being evaluated. There’s some preliminary interpretability work trying to find the “I’m being tested right now“ concept inside the model and just deleting it. This is not yet production-ready, but if we could find it - which I think is an attractive research problem - then we would have so much more faith in the entire field of dangerous capability evaluations.
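To illustrate what 'deleting' such a concept could look like mechanically, here is a sketch that projects a learned direction out of a layer's activations. It assumes the hard part - finding a direction that actually tracks 'I am being evaluated' - has already been done; the `eval_direction` below is just a random placeholder:

```python
import torch

def project_out(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation vector that lies along `direction`.

    `activations` has shape (..., d_model); `direction` has shape (d_model,).
    """
    unit = direction / direction.norm()
    coeffs = activations @ unit                        # how strongly each vector expresses the concept
    return activations - coeffs.unsqueeze(-1) * unit   # keep only the orthogonal part

# Hypothetical usage inside a forward hook on one transformer layer.
d_model = 768
eval_direction = torch.randn(d_model)                  # placeholder for a learned direction
layer_activations = torch.randn(4, 16, d_model)        # (batch, tokens, d_model)
edited = project_out(layer_activations, eval_direction)
```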
Debating the usefulness of Chain-of-Thought
Conor: Let’s return to reasoning models and their Chain-of-Thought. How useful might these models’ ‘reasoning traces’ be, as an explainability technique?
Neel: I'm pretty positive on this. Maybe I can give the optimistic case and then Been can tear it apart. Let's first unpack what's going on at a technical level. Language models take in some text and produce the next token, or word. You can use them to produce many words by repeatedly running them on the word they just produced, and everything that came before.
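As a concrete picture of that loop, here is a minimal greedy-decoding sketch using the Hugging Face `transformers` library; the choice of `gpt2` and of greedy decoding are just assumptions to keep the example small:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The whole trick: predict one token, append it, and feed everything back in.
with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits       # scores for every possible next token
        next_id = logits[0, -1].argmax()       # greedily pick the most likely one
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```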
The main insight with reasoning models was to use new techniques to run them for a really long time and give them access to a 'scratchpad'. It's sometimes called Chain-of-Thought, but I think that's a misleading term. It's more like this: what happens inside the model in a single step is the real 'thinking', and what the model puts down on the scratchpad is the equivalent of what I might write down when I'm trying to solve a problem.
With this framing, the safety benefit becomes clearer. If I want to form a really careful plan, it's useful for me to write down my thoughts. This means that if you're reading what I'm writing, it is harder for me to get away with plotting against you. In AI terms, the first models capable of doing dangerous things without us reading their scratchpad will likely not be capable of doing those things if we do read their scratchpad.
Therefore, having ‘monitors’ (separate AI models) that look at a model’s scratchpad is useful and important. But it’s not a panacea. If tasks are easy, I can just do them in my head. And maybe sufficiently capable models can also do very dangerous things ‘in their head’. So we would still need to find more fundamental solutions. But I think Chain-of-Thought monitoring is pretty great and lots of people hate on it way too much.
Been: I agree that Chain-of-Thought and looking at a model's scratchpad could be useful for some problems. However, here's my argument against just doing this and stopping there. I strongly believe that machines think and work in a very different way to humans, and so our language is a ridiculously limited tool to express how machines work. Imagine a machine 'born' in a little room and given tons of documents - the entire history of everything humanity has written. It has no experience, no sensing, no embodiment, no idea of life and death, survival, or human feelings. It's hard to convince ourselves it will reason in a way that's similar to us.
Human language evolves quickly because it’s useful to our lives. We have ideas of family, life, and death. Machines are a weird animal, and their thinking is completely different because they were brought up differently. Now they have to squeeze their complex thinking into the tiny little words that humans brought to them. It’s difficult to assume these machines can even express what they think using human language. And if we can’t understand the risks, we can’t be prepared for them, and that could be dangerous. For that reason, Chain-of-Thought is a fundamentally limited tool.
Neel: I just wanted to add another brief caveat. While I think Chain-of-Thought monitoring is great for frontier safety issues, like models acting in sophisticated ways, I think it is much less useful for more traditional problems. For example, if I wanted to be racially biased against someone, I would not need a scratchpad, so there's no guarantee I'd write it down. If a problem genuinely benefits from being broken down and written out step by step, then Chain-of-Thought monitoring can be extremely effective.
Learning from how AI systems learn
Conor: Been, your point about human language not necessarily being the best way to understand AI systems is a nice segue to your other work that explores how humans can learn from AI systems. Can you start by walking us through the work you did with AlphaZero and the chess grandmasters?
Been: As I touched on in the history of explainability research above, and Neel touched on with his own examples, we repeatedly see excitement about certain explainability methods and then swings towards skepticism. The motivation for this work started when some of us discovered that some popular ‘post-hoc’ explainability methods didn’t work very well.
We proved mathematically that methods like SHAP and Integrated Gradients may tell you that a certain 'feature' is important to a model's output, like your 'salary' is important to a model's hiring prediction. But our work showed that whether the feature was actually important was 50/50. So it was essentially random. This made the community question what we had done with interpretability research for the last decade. It also impacted my thinking about my own work: how can we avoid being fooled by delusions of progress? So I just really wanted to show, as quantitatively as I could, that we can use explainability methods to communicate something new from a machine to humans.
That gave birth to the chess work, where we taught superhuman chess concepts to four grandmasters, who generously volunteered their time. Along with other colleagues, I worked on this research with Lisa Schut, who was herself a professional chess player in a previous life. We used AlphaZero, the world's best chess-playing machine, to extract "superhuman" chess strategies that were very different from what human grandmasters had done in the past. Then, we designed a teaching session for the four grandmasters. We quizzed them on board positions and asked them what moves they would make. Then we showed them the moves AlphaZero would have made, to teach them the new concepts. Finally, we tested them on different board positions that invoked the same superhuman chess strategy, to see if their own strategy had changed.
Chess was the perfect playground, because we knew who the experts were. And we could roll out an entire game to clearly verify which strategy was better. The results were remarkable. All four grandmasters changed the way they play the game. One of our participants, Gukesh Dommaraju, who was really open-minded about it, became the youngest World Chess Champion last year. Chess coaches sometimes spend a whole year trying to teach a grandmaster one new thing, because the players already know so much. The fact that we could steer their thinking about what the best move is showed me that this approach worked. This was a proof-of-concept that we can use AI to teach something to human experts who are at the frontier of human knowledge.
Conor: Yes, the discussion is typically about whether AI can reach the level of human experts, and about designing evaluations to assess this. But this is a nice reminder that experts can also learn from AI systems. Your recent work on 'neologisms' takes this idea a step further?
Been: The neologism work asks the natural next question: how can we generalise this teaching process for anyone, not just chess masters? We wanted to use language, the most natural interface. The idea is simple: if the current set of words is not enough, let's create new words ("neologisms") to describe the complex concepts that AI models are learning.
Think about the word "skibidi", which Gen Alpha uses to mean something like "good, bad, and boring" all at once. Humans create new words like this all the time because they are useful. Gen Alpha learns this word in the right context and so they know how to use it. But it's complex, so there are articles online to explain it to everybody else.
The idea is to do the same thing with machines. In our recent paper, we create neologisms to bridge the communication gap between AIs and humans. As I touched on above, we start from the premise that machines conceptualize the world differently from us. So we create new words to give us more precise handles for concepts that the AI model is using. First, we developed a method to teach the model new words. We showed that this works in practice: a 'length' neologism gave us precise control over the model's response length, whereas using normal human language had previously failed to achieve that. We then reversed the process to learn machine concepts that are new to humans. We created a neologism - we call these tilde (~) tokens - like ~goodM, which represents what the AI model considers a 'good' response to be. Here, we learned something new about its internal values: the model's notion of 'good' is effusive, detailed, and often avoids directly challenging a user's premise.
Once you have a new word, you can have a conversation with a model about it. “Hey model, what is ~goodM? What are some examples of it? Ok, give me a pop quiz about ~goodM to ensure I understand it right”. You can imagine extending this to better understand other complex concepts an AI model has learned.
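As a rough sketch of what teaching a model a new word might look like in code, the snippet below adds a `~goodM` token to a small open model and trains only that token's embedding while everything else stays frozen. This is a simplified assumption about the recipe, written against the Hugging Face `transformers` API rather than the paper's actual implementation, and the training prompt is a made-up placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 1. Add the neologism to the vocabulary and grow the embedding matrix to match.
tokenizer.add_tokens(["~goodM"])
model.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("~goodM")

# 2. Freeze the whole model except the embedding table, so training only
#    shapes what the new word means to the model.
for param in model.parameters():
    param.requires_grad = False
embeddings = model.get_input_embeddings()
embeddings.weight.requires_grad = True
optimizer = torch.optim.Adam([embeddings.weight], lr=1e-3)

# 3. One (placeholder) training step on text that uses the new word.
batch = tokenizer("Give a ~goodM answer to: what is a transformer?",
                  return_tensors="pt")
loss = model(**batch, labels=batch.input_ids).loss
loss.backward()

# 4. Zero the gradient everywhere except the new token's row, then update,
#    so only the ~goodM embedding ever changes.
with torch.no_grad():
    mask = torch.zeros_like(embeddings.weight)
    mask[new_id] = 1.0
    embeddings.weight.grad *= mask
optimizer.step()
```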
Agentic interpretability: a new paradigm
Conor: One thing that strikes me is that AI is almost always framed as a ‘negative’ or a ‘risk’ to explainability. But millions of people now use LLMs and explaining things is one of the main use cases. This is logical, because LLMs have lots of the characteristics that people say they want in an explanation - you can double-check things, ask an LLM to make the language in the explanation simpler, or more visual etc. So, it feels that we are entering a paradoxical world where we may increasingly rely on AI to explain things, often better than humans can, while AI’s internals remain relatively unexplainable, at least to the public at large. This distinction is somewhat related to a recent paper that you were both involved in that introduces the idea of “agentic interpretability” and distinguishes it from more traditional “inspective interpretability”?
Been: Yes, the crux of the distinction is the role of the subject. You can think of more traditional ‘inspective’ interpretability as taking place in a lab-style environment. You, the researcher, have your gloves on and you are opening up an AI model and doing all the analysis. The model just sits there. However, in ‘agentic’ interpretability, the model you are trying to understand is an active participant in the loop. You can ask it questions, probe it, and it is incentivised to help you understand how it works.
Think of a teacher-student relationship; a teacher teaches better when they know what the student doesn't know. Agentic interpretability proposes that, because machines can now talk, we can have them build a mental model of us - of what we know and what we don't know. They can then use that mental model to help us understand them.
There’s a trade-off. Inspective interpretability aims for a complete and exhaustive understanding, which is critical for high-stakes situations. Agentic interpretability strikes a balance between interactivity and completeness. You may not understand everything about the model, but you, the user, can steer the conversation to learn the aspects that are most useful to you.
Neel: I think it’s also useful to think of the two approaches as answering very different questions. Inspective interpretability asks: “How did the AI do this?” Agentic interpretability is a bit more like a standard machine learning task: you’re trying to build a system that is good at explaining complex things, which is difficult but not as fundamentally different to standard machine learning as inspective interpretability is.
Policy implications and dream experiments
Conor: Let’s wrap up. If you had one policy recommendation for governments on explainability, and one “dream experiment” that you could run, without any resource restrictions, what would they be?
Been:
Policy: I'm co-authoring an ACM Tech Brief on this that will highlight what is possible now and in the future. The summary is: do simple things now. For high-stakes decisions, it could be worth considering standards that require simple, statistically solid validation methods, like ablation tests that check whether a certain feature actually mattered to, say, a hiring decision. We can do that today.
Dream Experiment: Interpretability is a means to an end - the end is making AI work for humanity. My dream scenario would be to have an infinite feedback loop with experts at the frontier of human knowledge - mathematicians, scientists, and others. We need their insights on what they want from AI in their workflow to build tools that truly help them.
Neel:
Policy: I was recently part of a position paper on Chain-of-Thought monitoring. We had people from lots of the AI labs, as well as many other institutions. There was a broad consensus that Chain-of-Thought monitoring is valuable for safety - I was actually surprised we ended up there, but we did. But it's also a fragile opportunity. There are a bunch of things that developers could do, accidentally or deliberately, that might have the side effect of breaking this. While I wouldn't ban developers from ever breaking it, I think measuring 'monitorability', and providing an adequate replacement if it degrades, is worth further research.
Dream Experiment: AGI safety research is hard because we’re trying to fix future problems in AI systems that don’t exist yet. But we do have good techniques for giving models properties that they didn’t learn by themselves during training. So, if you wanted to, you could train a model to only speak French when it is speaking to a man. We could get much more creative with this. My dream study would be to create an enormous, open-source library of models with different kinds of carefully engineered safety problems (e.g. models that fake alignment), that we then deploy our leading interpretability techniques against. We could run a global competition where people compete to identify and, ideally, fix these problems. I think this is technically possible today and would be an immense public good.