System feedback in an interpretability game

October 26, 2025 · by CriaFaar · #AI #Interpretability #Machine Learning

I shared the concept/build of my game with a few subreddits yesterday, and got a very heartwarming and fascinating comment that I wanted to dive deep into.

The very-appropriately-titled user "InterestingSystems" shared this:

This is a real mind-bender, congrats on one of the most unique concepts for a game I've ever come across.

I feel like the game needs to give me some feedback about WHY a particular sentence sent me to a particular planet (triggered a particular neuron). Otherwise, with 3072 planets/neurons, it's just going to be blind random luck where I go. I feel like I need a tool or something that at least lets me think - ah, that's probably why those two sentences took me to the same place, but that one didn't. Or maybe it just needs to be less planets/neurons, 3072 is an insane amount.

Firstly, I want to thank them for the kind words, it does a lot to keep me motivated and builds confidence this is worth pursuing! ❤️

I wanted to dive into their feedback because it is bang on. It also touches right on, I guess, the "big problem" of a non-expert like me trying to do this, and maybe even on the problem of anyone trying to do this inside a gamification framework, because...

I feel like the game needs to give me some feedback about WHY a particular sentence sent me to a particular planet (triggered a particular neuron).

That's interpretability! :D This is the central question of the entire field. They want interpretability, but it's not on the menu quite yet. We're serving like, appetizers of it or something. If this game could give that kind of feedback it would be a significant thing - OpenAI and Anthropic would be playing it.

Welcome to Two: Vectorspace.

That's the "catch" of building an interpretability game: you will probably never get the kind of clear feedback mechanics most games have, because you as the dev are not in total control of the world's environment, and we as humans do not understand the "black box" AI that is driving whatever game mechanics we've relinquished to it. The challenge, I think, is building a system around this whole mess that can still make -some- kind of sense of it. And that's where an expert behind this would fare so much better than me. I share this stuff in detail in the hopes an expert will see the idea and run with it.

To say "we don't have answers" captures my point, but perhaps oversimplifies a little. We do have some answers, and with further work, the game could definitely do better at giving the kind of feedback you're talking about, perhaps by drawing on past studies: the specific areas where we do have a statistically grounded interpretation of what's happening. Less "here be dragons" and more "these neurons all co-activate for US phone numbers" type thing. The benefit of a model like GPT-2 here, relative to newer ones, is that this is a well-trodden path, so many such examples exist. Interpretability is strong with this model.
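
As a rough illustration of what "co-activating" could mean in practice, here is a minimal Python sketch. It assumes activations have already been recorded, one vector per prompt, and stacked into a matrix. The function name `coactivating_pairs`, the 0.9 threshold, and the toy numbers are all hypothetical; none of them come from the game or from any GPT-2 study. It flags pairs of neurons whose activations rise and fall together across prompts.

```python
import numpy as np

def coactivating_pairs(acts, threshold=0.9):
    """Return (i, j, corr) for neuron pairs whose activations correlate >= threshold.

    acts: array of shape (n_prompts, n_neurons), one row per prompt.
    """
    # Neurons that never vary have an undefined correlation, so skip them.
    variable = np.flatnonzero(acts.std(axis=0) > 0)
    if len(variable) < 2:
        return []
    corr = np.corrcoef(acts[:, variable], rowvar=False)
    # Look only at the upper triangle so each pair is reported once.
    iu, ju = np.triu_indices(len(variable), k=1)
    keep = corr[iu, ju] >= threshold
    return [
        (int(variable[i]), int(variable[j]), float(corr[i, j]))
        for i, j in zip(iu[keep], ju[keep])
    ]

# Toy data (made up): 6 prompts x 5 neurons. Neurons 1 and 4 are built to
# fire together on three "phone-number-like" prompts.
toy_acts = np.array([
    # n0   n1   n2   n3   n4
    [0.1, 2.0, 0.3, 0.0, 1.9],  # phone-number-like
    [0.0, 2.2, 0.1, 0.5, 2.1],  # phone-number-like
    [0.9, 0.1, 0.2, 0.4, 0.0],
    [0.2, 0.0, 0.8, 0.1, 0.1],
    [0.5, 1.9, 0.0, 0.9, 2.0],  # phone-number-like
    [0.3, 0.1, 0.6, 0.2, 0.0],
])

pairs = coactivating_pairs(toy_acts)
for i, j, c in pairs:
    print(f"neurons {i} and {j} co-activate (correlation {c:.2f})")
```

Correlation only says two neurons move together on the prompts you happened to try; it does not say why, which is where the published findings would come in.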

This approach could create a game with systems/tutorials/"campaigns" where it's not just blind luck, as you say. It would help build intuition. I've been poking around GPT-2 for around a year now so I am operating on a level of intuition 99% of players will not have - it doesn't feel like total blind luck to me, but it does sometimes feel bewildering and random, if that makes sense. That's high-dimensional topology for you - it's a supremely weird I-can't-believe-it's-Euclidean space that is wildly counter-intuitive to conceptualize.

As of right now, my game is committing more than a few sins in this regard, too, so that's another area where I can definitely tutorialize and improve player feedback. For example, I stack neurons sequentially according to planet type: lava planets all cluster around neurons 0-512. But in reality, Neuron 1 and Neuron 3 are not "close" to each other. They don't exist in physical space and they are not "neighbours" in any meaningful sense whatsoever. A neuron index is just an arbitrary address in a list. There is no inherent spatial or functional relationship between neuron N and N+1. For now, I've deliberately abstracted away from that complexity, and many others. Building them back in could hopefully build intuition.
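
One way to see that an index is only an address: shuffle the hidden neurons of an MLP layer, shuffle the matching weights the same way, and the layer computes exactly the same thing. The sketch below is a toy stand-in, not the game's code and not GPT-2 itself. It assumes random weights, tiny sizes, and ReLU in place of GPT-2's GELU (the argument holds for any activation applied neuron by neuron).

```python
import numpy as np

def mlp(x, w_in, b_in, w_out, b_out):
    """Tiny MLP block: hidden = relu(x @ w_in + b_in); out = hidden @ w_out + b_out."""
    hidden = np.maximum(x @ w_in + b_in, 0.0)
    return hidden @ w_out + b_out, hidden

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32  # GPT-2 small uses 768 and 3072; small sizes keep this quick

# Random stand-in weights (an assumption; these are not GPT-2's weights).
w_in = rng.normal(size=(d_model, d_hidden))
b_in = rng.normal(size=d_hidden)
w_out = rng.normal(size=(d_hidden, d_model))
b_out = rng.normal(size=d_model)
x = rng.normal(size=(4, d_model))  # four made-up input vectors

# Give every hidden neuron a new index, moving its weights along with it.
perm = rng.permutation(d_hidden)
out_a, hidden_a = mlp(x, w_in, b_in, w_out, b_out)
out_b, hidden_b = mlp(x, w_in[:, perm], b_in[perm], w_out[perm, :], b_out)

same_output = np.allclose(out_a, out_b)
same_neurons_new_addresses = np.allclose(hidden_a[:, perm], hidden_b)
print(same_output, same_neurons_new_addresses)
```

The layer's output is unchanged and every neuron still does the same job under its new number, so "neuron N sits next to neuron N+1" carries no meaning on its own.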

And yet if you type faux-Elizabethan m'lady language at GPT-2, you'll notice you're landing inside what looks like some kind of cluster:

L5-N1028: "Would'st thou seek something delicious"

L5-N1066: "Wouldst thou cross the threshold, knowing not what fate awaiteth thee"

L5-N1074: "STOP TAKING ME THERE I WANNA GO SOMEWHERE NEW WHALE AHOY MATEY" (Note: This is me getting annoyed I keep landing on N-1888 aka Planet Mellybean).

L5-N1206: "Leavest thou from thine own language and partake in the old ways of speaking unto the voids"

But it's entirely coincidental, most likely, and the existence of the N1074 prompt shows how this is never clean and precise, but messy and confusing. My frustrated, pirate-themed outburst still landed nearby, lol. Perhaps the model latched onto the slightly archaic 'matey' or the dramatic tone, who knows! That mystery is part of the discovery. Most likely, nobody has ever probed this specific interpretability question before. We're doing real-time discovery/interpretability. The problem is, as you note, we're doing it with zero safety net - and I can fix that in time, for sure. But yeah, if I can find ways to cultivate that deeper understanding of how this all works, I believe people will appreciate what's going on and ideally build up their own mastery over time and engagement.

Yet using those metrics, those papers, those findings about GPT-2's "co-activating" neurons and other things would require me to increase the amount of backend experimentation (moving from PeakN measurements to cosine similarity and other measures, which are more computationally expensive). It's a difficult question: what to focus on and why.
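
To make that trade-off concrete, here is a small sketch contrasting the two kinds of measurement. It assumes "PeakN" means the index of the single most active neuron in a layer (an assumption for this sketch, not a confirmed definition), and it uses synthetic activation vectors, not real GPT-2 output. The neuron indices are borrowed from the examples above purely as labels.

```python
import numpy as np

def peak_neuron(acts):
    """Index of the single most active neuron (assumed meaning of "PeakN")."""
    return int(np.argmax(acts))

def cosine_similarity(a, b):
    """Cosine of the angle between two activation vectors, in [-1, 1]."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

n_neurons = 3072
rng = np.random.default_rng(0)

# Synthetic data: both prompts share one background pattern across the layer...
shared = rng.normal(size=n_neurons)
prompt_a = shared.copy()
prompt_b = shared.copy()
# ...but each spikes hardest at a different neuron.
prompt_a[1028] = 10.0
prompt_b[1206] = 10.0

peak_a, peak_b = peak_neuron(prompt_a), peak_neuron(prompt_b)
similarity = cosine_similarity(prompt_a, prompt_b)
print(f"peaks: N{peak_a} vs N{peak_b}, cosine similarity: {similarity:.2f}")
```

A peak measurement sends these two prompts to different planets, while cosine similarity sees that the rest of the layer looks almost the same. The cost is that a peak is one pass over one vector, whereas comparing every prompt against every other prompt grows with the square of the number of prompts.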
