Imagine you’re trying to understand what’s going on inside an AI’s mind. Sparse Autoencoders, or SAEs, are a tool that breaks a model’s internal activations down into features we can actually make sense of. Developers use these features to guide what the model does, whether that’s keeping things safe, moderating content, or making sure the AI behaves the way we want. By turning certain features up or down, we can nudge the AI’s responses in the right direction.
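To make “turning a feature up or down” concrete, here’s a minimal sketch of the usual steering recipe: take the SAE’s decoder direction for the chosen feature and add a scaled copy of it to the model’s hidden state at the layer where the SAE was trained. The function name, tensor shapes, and steering strength below are illustrative assumptions, not the paper’s exact setup.

```python
import torch

def steer_activation(h: torch.Tensor, decoder_direction: torch.Tensor,
                     strength: float) -> torch.Tensor:
    """Nudge a hidden activation along one SAE feature's decoder direction.

    h                 : model activation at some layer, shape (d_model,)
    decoder_direction : the SAE decoder column for the chosen feature, shape (d_model,)
    strength          : how far to turn the feature up (+) or down (-)
    """
    return h + strength * decoder_direction

# Toy example with made-up dimensions (real models use d_model in the thousands).
d_model = 8
h = torch.randn(d_model)                        # a hidden state from the model
feature_dir = torch.randn(d_model)
feature_dir = feature_dir / feature_dir.norm()  # unit-norm decoder direction
steered = steer_activation(h, feature_dir, strength=4.0)
```

In practice this edit is applied at every token position during generation, typically via a forward hook on the chosen layer.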
But here’s the twist: researchers in Hong Kong put 90 different SAEs to the test across three language models, and found that picking features just because they’re easy to interpret doesn’t actually help much when you want to steer the model. Instead, they came up with something called Delta Token Confidence. This method looks at how much turning up a feature actually changes what the AI is likely to say next.
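Based on the description above, a Delta-Token-Confidence-style score can be sketched as: run the same prompt with and without the feature turned up, and measure how much the model’s confidence in its next token moves. The function name and the exact formula here are assumptions; the paper’s precise definition may differ.

```python
import torch

def delta_token_confidence(logits_base: torch.Tensor,
                           logits_steered: torch.Tensor) -> float:
    """Rough sketch of a Delta-Token-Confidence-style score.

    Compares next-token distributions for the same prompt with and without
    the feature amplified, and reports how much the model's confidence in
    its single most likely token shifted. (Assumed formula, not verbatim
    from the paper.)
    """
    p_base = torch.softmax(logits_base, dim=-1)
    p_steered = torch.softmax(logits_steered, dim=-1)
    return (p_steered.max() - p_base.max()).item()

# Toy usage with random logits standing in for a real model's outputs.
vocab = 50_000
base = torch.randn(vocab)
steered = base.clone()
steered[123] += 5.0          # pretend steering sharply boosted one token
print(delta_token_confidence(base, steered))
```

Features that barely move this score are unlikely to steer anything, no matter how clean their explanations look.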
The result? This new approach boosted steering performance by over 50% compared to the previous best selection method. They also found that an SAE architecture called BatchTopK, which keeps only the most strongly activated features across each batch, gave the most reliable improvements no matter how big the model was.
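For intuition, here’s a rough sketch of the BatchTopK idea: instead of keeping the top k features separately for every example, keep the top k × batch_size activations pooled across the whole batch and zero out the rest. This is an illustrative reconstruction of the mechanism, not the authors’ implementation.

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Sketch of a BatchTopK-style activation.

    pre_acts : SAE feature pre-activations, shape (batch, n_features)
    k        : average number of active features kept per example
    """
    batch, n_features = pre_acts.shape
    keep = k * batch                              # total activations kept across the batch
    threshold = pre_acts.flatten().topk(keep).values.min()
    # Zero everything below the batch-wide cutoff; strong features survive.
    return torch.where(pre_acts >= threshold, pre_acts, torch.zeros_like(pre_acts))

# Toy usage: 4 examples, 16 features, roughly 3 active features per example.
acts = torch.relu(torch.randn(4, 16))
sparse = batch_topk(acts, k=3)
```

Pooling the cutoff over the batch lets some examples use more features and others fewer, which is part of why the sparsity it enforces is comparatively gentle.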
Maybe the most surprising part: once you pick features that actually help steer the model, how easy they are to interpret doesn’t matter anymore. In other words, just because you can explain a feature doesn’t mean it’s useful for controlling the AI.
So why does this matter? If you’re building systems that need to keep AI on track, whether for safety, content moderation, or just making sure it behaves, you should pick features using Delta Token Confidence rather than picking the ones that are easiest to explain. The score directly measures how much the model’s next prediction shifts when you turn a feature up.
And if you’re wondering which architecture to use, BatchTopK is your best bet for steady improvements, no matter the size of your model. The old idea that more interpretable features make for better steering just doesn’t hold up when you look at the data.
The takeaway? If you want reliable control over your AI, use feature selection methods built for steering, not just for interpretability. It makes a big difference in practice.