Mechanistic interpretability is all about peering inside a neural network’s mind and asking, ‘How does this thing actually think?’ The usual way is to run billions of words through the model, collect what makes each feature light up, and then ask another AI to explain what’s going on. That’s slow, expensive, and often gives you answers you can’t really trust. Transcoders help by breaking the model’s computations down into simple, sparse features and by separating what comes from the data you feed in from what comes from the model’s own wiring.
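
If you haven’t met transcoders before, here’s a minimal, generic sketch of the idea in PyTorch. The names (d_model, n_features) and the exact architecture are illustrative assumptions, not the paper’s implementation:

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Toy transcoder sketch: learns to imitate an MLP block's input-to-output
    map through a wide, sparse feature layer (illustrative only)."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # features = ReLU(W_enc x + b_enc)
        self.decoder = nn.Linear(n_features, d_model)  # output approx. = W_dec f + b_dec

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse: most entries end up near zero
        return self.decoder(features), features

# Training sketch: minimize ||transcoder(x) - mlp(x)||^2 plus an L1 penalty on
# `features`, so each feature ideally captures one simple, reusable computation.
```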

Now, researchers at the Fraunhofer Heinrich Hertz Institute have come up with two new tools: WeightLens and CircuitLens. Instead of just looking at what lights up inside the model when you give it data, these methods dig into the model’s actual wiring and circuits to figure out what’s really going on.

WeightLens is like reading the model’s diary instead of watching its reactions. It looks straight at the model’s learned weights, so there’s no need for big datasets or extra explainer AIs. With this alone, it can describe a good chunk of the model’s features: about a third in Gemma-2-2b, more than half in GPT-2 Small, and a quarter in Llama-3.2-1B. For instance, it can spot features that pick out words like ‘the,’ ‘this,’ or ‘that,’ or ones that help the model write phrases like ‘the basis of’ or ‘based on.’
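
The paper’s actual procedure is more involved, but a rough sketch of the weights-only idea looks something like this: take one feature’s decoder direction and project it through the model’s unembedding matrix to see which output tokens it pushes up. Every name below (W_U, tokenizer, the indexing into the transcoder’s decoder) is an assumption for illustration, not the paper’s API:

```python
import torch

def top_output_tokens(w_dec_feature: torch.Tensor, W_U: torch.Tensor, tokenizer, k: int = 10):
    """Weights-only readout (rough sketch, not the paper's exact method):
    project a single transcoder feature's decoder direction through the
    unembedding matrix and report the output tokens it promotes most.

    w_dec_feature: (d_model,)        decoder column for one feature
    W_U:           (d_model, vocab)  the model's unembedding matrix
    """
    logits = w_dec_feature @ W_U                 # (vocab,) score per output token
    top_ids = torch.topk(logits, k).indices
    return [tokenizer.decode([int(i)]) for i in top_ids]

# Hypothetical usage: if a feature's top tokens come back as ' the', ' this',
# ' that', you can label it straight from the weights, with no dataset at all.
# top_output_tokens(transcoder.decoder.weight[:, i], W_U, tokenizer)
```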

CircuitLens takes a different approach. Instead of grouping a feature’s activating examples by what they seem to mean, it groups them by the circuits inside the model that actually made the feature fire, asking, ‘Which parts of the model are causing this to happen?’ This surfaces patterns you’d miss if you only looked at the text on the surface. For example, a feature that looked random at first turned out to light up whenever the model referred to something using words like ‘the’ or ‘this.’
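
Here is a hypothetical sketch of that circuit-based grouping idea, under the simplifying assumption that each upstream feature’s contribution can be scored as its activation times a precomputed direct-path weight; the real method attributes causes through the model’s actual circuits, so treat this only as an intuition pump:

```python
import torch
from sklearn.cluster import KMeans

def group_activations_by_cause(upstream_acts: torch.Tensor,
                               w_direct: torch.Tensor,
                               n_groups: int = 5):
    """Hypothetical sketch (not the paper's exact algorithm): for each example
    where a target feature fires, score how much each upstream feature
    contributed (activation * direct connection weight), then cluster the
    examples by those contribution patterns rather than by surface meaning.

    upstream_acts: (n_examples, n_upstream) upstream feature activations
    w_direct:      (n_upstream,) assumed precomputed direct weights onto the target feature
    """
    contributions = upstream_acts * w_direct                        # per-example attribution vectors
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(contributions.numpy())
    return labels  # examples sharing a label are driven by the same upstream circuit
```

Reading off which upstream features dominate each cluster is what turns a seemingly random feature into an interpretable one, as in the ‘the’/‘this’ example above.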

The best part? This circuit-based method needs just 24 million tokens, instead of the usual 3.6 billion. That’s 150 times less data. And it still matches—or even beats—the old methods when it comes to making sense of what the model is doing.

So why does this matter? If you’re building tools to check what big language models are really doing, or you just want to dig deep into how they work, these new methods give you a big advantage. You can look at what the model has learned without needing to gather huge datasets or run extra explainer AIs. All you need is the model itself.

And if you care about features that depend on context, you get that same 150x boost in data efficiency, plus you get to see how different parts of the model work together—something the other methods just can’t show you. Both tools are open source, so you can try them out yourself.

Of course, there’s a catch. You’ll need to be comfortable with transcoders, have direct access to the model, and be ready to dive into the world of mechanistic interpretability. This isn’t a quick fix for everyday debugging. But if you’re working on AI safety or alignment, the circuit-based approach helps untangle features that respond to lots of different things by showing which circuits are really behind each behavior.

Read the paper on arXiv