TraCR-Supported Mechanistic Interpretability

Results
| Criteria | Rank | Score* | Raw Score |
| --- | --- | --- | --- |
| ML Safety | #1 | 3.750 | 3.750 |
| Novelty | #1 | 4.500 | 4.500 |
| Generality | #5 | 3.000 | 3.000 |
| Mechanistic interpretability | #5 | 4.250 | 4.250 |
| Reproducibility | #5 | 4.000 | 4.000 |

*Ranked from 4 ratings. Score is adjusted from the raw score by the median number of ratings per game in the jam.
Judge feedback
Judge feedback is anonymous.
Interesting work - comparing the compiled model to a learned model is a natural thing to try, but I'm not aware of anyone actually trying it before! The actual analysis is pretty limited and doesn't really engage with understanding what's going on inside the learned model beyond inspecting neuron activations and attention patterns (which ARE the correct things to look at, at least!). I think this work would have been significantly cooler with some attempt to reverse engineer the learned circuit, or at least explore its behaviour on multiple inputs (though this is probably harder than can be achieved in a weekend hackathon!).

Hypothesis 1 seems very obviously true (so I'm glad your results bear it out!). Hypothesis 2 seems very likely to be true. Hypothesis 3 is unclear and interesting to me; I don't feel like your work gives strong evidence either way, since I still don't know how things are implemented before the final attention layer.

> We find that the only attention heads that are relevant seem to be in layer four, while the first three layers' attention appear redundant.

I don't think this statement is supported? Unless I'm missing something (e.g. you investigating what happens if you delete those attention layers), all you show is that they aren't OBVIOUSLY doing anything relevant.

Did you vary the length of the input list when training? Reversing a fixed-length list is very, very easy - it just needs to match up the positional embeddings - so this is an important detail re what to learn from your project.

Some broader feedback - I think interpreting the learned model is significantly harder because it's so big! I think it could be much smaller while still able to do the task (1L might work, 2L attn-only is also worth a shot). I was pleasantly surprised by how sparse the learned MLP layer was, and it's totally worth trying to interpret the neurons, especially the ones that are sharply on for some inputs and off for others!
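As a concrete illustration of the ablation check the judge suggests (deleting the early attention layers and seeing whether reversal still works), here is a minimal sketch. It assumes the learned model is loaded as a TransformerLens `HookedTransformer` - an assumption, since the write-up may use a different framework - and the function names, layer indexing, and usage variables below are hypothetical.

```python
import torch
from transformer_lens import HookedTransformer  # assumed framework for the learned model


def zero_attn_output(z, hook):
    # Zero the per-head attention outputs (hook_z) of this layer, removing the
    # layer's attention contribution to the residual stream.
    return torch.zeros_like(z)


def accuracy_with_attn_ablated(model: HookedTransformer, tokens, targets, layers):
    """Zero-ablate attention in the given (0-indexed) layers and measure how often
    the model still predicts the reversed token at each position."""
    hooks = [(f"blocks.{layer}.attn.hook_z", zero_attn_output) for layer in layers]
    logits = model.run_with_hooks(tokens, fwd_hooks=hooks)
    preds = logits.argmax(dim=-1)
    return (preds == targets).float().mean().item()


# Hypothetical usage: if the first three layers' attention really is redundant,
# accuracy should stay high when layers 0-2 are ablated and collapse when only
# the final attention layer is ablated instead.
# acc_without_early_attn = accuracy_with_attn_ablated(model, tokens, targets, layers=[0, 1, 2])
# acc_without_final_attn = accuracy_with_attn_ablated(model, tokens, targets, layers=[3])
```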
What are the full names of your participants?
Bart Bussmann, John Litborn, Esben Kran, Elliot Davies
What is your team name?
o TrAcCeR o
What is your and your team's career stage?
Early career
Does anyone from your team want to work towards publishing this work later?
Maybe
Where are you participating from?
Copenhagen
Comments
That's gold! I (Itay) would love to know if you are planning to do more work on this, and whether you are looking for collaboration, especially on "The Road to Automation".