TraCR-Supported Mechanistic Interpretability

Results
| Criteria | Rank | Score* | Raw Score |
| --- | --- | --- | --- |
| ML Safety | #1 | 3.750 | 3.750 |
| Novelty | #1 | 4.500 | 4.500 |
| Generality | #5 | 3.000 | 3.000 |
| Mechanistic interpretability | #5 | 4.250 | 4.250 |
| Reproducibility | #5 | 4.000 | 4.000 |

*Ranked from 4 ratings. Score is adjusted from the raw score by the median number of ratings per game in the jam.
Judge feedback
Judge feedback is anonymous.
Interesting work - comparing the compiled model to a learned model is a natural thing to try, but I'm not aware of anyone actually trying it before! The actual analysis is pretty limited and doesn't really engage with understanding what's going on inside the learned model beyond inspecting neuron activations and attention patterns (which ARE the correct things to look at, at least!). I think this work would have been significantly cooler with some attempt to reverse engineer the learned circuit, or at least explore its behaviour on multiple inputs (though this is probably harder than can be achieved in a weekend hackathon!).

Hypothesis 1 seems very obviously true (so I'm glad your results bear it out!). Hypothesis 2 seems very likely to be true. Hypothesis 3 is unclear and interesting to me; I don't feel like your work gives strong evidence either way, since I still don't know how things are implemented before the final attention layer.

> We find that the only attention heads that are relevant seem to be in layer four, while the first three layers' attention appear redundant.

I don't think this statement is supported? Unless I'm missing something (e.g. you investigating what happens if you delete those attention layers), all you show is that they aren't OBVIOUSLY doing anything relevant.

Did you vary the length of the input list when training? Reversing a fixed-length list is very, very easy - it just needs to match up the positional embeddings - so this is an important detail re what to learn from your project.

Some broader feedback - I think interpreting the learned model is significantly harder because it's so big! I think it could be much smaller while still able to do the task (1L might work, 2L attn-only is also worth a shot). I was pleasantly surprised by how sparse the learned MLP layer was, and it's totally worth trying to interpret the neurons, especially the ones that are sharply on for some inputs and off for others!
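As a concrete illustration of the ablation check the judge suggests (deleting the early attention layers and seeing whether reversal still works), here is a minimal sketch. It assumes the learned model is loaded as a TransformerLens `HookedTransformer` - an assumption, since the write-up may use a different framework - and the function names, layer indexing, and usage variables below are hypothetical.

```python
import torch
from transformer_lens import HookedTransformer  # assumed framework for the learned model


def zero_attn_output(z, hook):
    # Zero the per-head attention outputs (hook_z) of this layer, removing the
    # layer's attention contribution to the residual stream.
    return torch.zeros_like(z)


def accuracy_with_attn_ablated(model: HookedTransformer, tokens, targets, layers):
    """Zero-ablate attention in the given (0-indexed) layers and measure how often
    the model still predicts the reversed token at each position."""
    hooks = [(f"blocks.{layer}.attn.hook_z", zero_attn_output) for layer in layers]
    logits = model.run_with_hooks(tokens, fwd_hooks=hooks)
    preds = logits.argmax(dim=-1)
    return (preds == targets).float().mean().item()


# Hypothetical usage: if the first three layers' attention really is redundant,
# accuracy should stay high when layers 0-2 are ablated and collapse when only
# the final attention layer is ablated instead.
# acc_without_early_attn = accuracy_with_attn_ablated(model, tokens, targets, layers=[0, 1, 2])
# acc_without_final_attn = accuracy_with_attn_ablated(model, tokens, targets, layers=[3])
```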
What are the full names of your participants?
Bart Bussmann, John Litborn, Esben Kran, Elliot Davies
What is your team name?
o TrAcCeR o
What is your and your team's career stage?
Early career
Does anyone from your team want to work towards publishing this work later?
Maybe
Where are you participating from?
Copenhagen
Comments
That's gold! I (Itay) would love to know if you are planning to do more work on this, and whether you are looking for collaboration, especially on "The Road to Automation".