The Mechanistic Interpretability Hackathon

Hosted by Esben Kran, Neel Nanda, Apart Research, Zaki, fbarez · #alignmentjam

A jam submission

Automated Identification of Potential Feature Neurons

Submitted by lomichelle42 — 8 hours, 34 minutes before the deadline

Results

Criteria	Rank	Score*	Raw Score
Judge's choice	#3	n/a	n/a
Reproducibility	#3	4.250	4.250
Mechanistic interpretability	#5	4.250	4.250
Generality	#5	3.000	3.000
Novelty	#8	3.000	3.000
ML Safety	#12	2.750	2.750

Ranked from 4 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.

Judge feedback

Judge feedback is anonymous and shown in a random order.

This is a wonderful project and plays right into mechanistic interpretability! This novel 3-step method is great for making neurons more interpretable, and it enables quite a bit of deeper analysis. I also recommend reading Alex Foote's winning submission from the last interpretability hackathon, which echoes some of your comments at the end: https://alexfoote.itch.io/investigating-neuron-behaviour-via-dataset-example-pruning-and-local-search. Great work!
Cool project! I'm excited to see Neuroscope being used like this (and I'm sorry you had to scrape the data - I need to get round to making the dataset available!). I liked the creativity and diversity of your methods, and I like the spirit of trying to automate things! Using GPT-3 and FastText are cool ideas. My main criticism is that these descriptions tend not to be specific enough and miss nuance. E.g., neuron 134 in layer 6 of solu-8l-pile is actually a neuron that activates on the "1" in "Page: 1" in a specific document format in the Pile, which seems way more specific than the description given (https://neuroscope.io/solu-8l-pile/6/134.html). I also think that tokenization is a massive pain that breaks up the semantic meaning of words into semi-arbitrary tokens, and I don't see how your method engages with that properly - it seems like it mostly doesn't involve the surrounding context of the word? I really liked the idea of substituting in synonym tokens for the current token. I'd love to see that done for the 5 tokens before the current token, and to try to figure out if we can find "similar tokens" in a principled way when the token is not just a word or clear conceptual unit. But yeah, overall, nice work!
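
The suggestion above, substituting similar tokens at the current position and at each of the 5 positions before it, can be sketched roughly as follows. This is only an illustration of the idea, not the submission's code: `substitution_variants`, `activation_shifts`, the `synonyms` lookup table, and `toy_activation` are all hypothetical names, and the toy activation function is a hand-written stand-in for a real neuron, loosely modelled on the "Page: 1" pattern the judge describes. In practice `activation_fn` would wrap a forward pass through the model, and the synonym table would come from something like FastText neighbours.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical signature: (tokens, index) -> activation of one neuron at that index.
ActivationFn = Callable[[List[str], int], float]


def substitution_variants(
    tokens: List[str],
    target_idx: int,
    synonyms: Dict[str, List[str]],
    window: int = 5,
) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (position, substitute, new_tokens), swapping one token at a time.

    Positions cover the target token and up to `window` tokens before it.
    """
    start = max(0, target_idx - window)
    for pos in range(start, target_idx + 1):
        for alt in synonyms.get(tokens[pos], []):
            variant = list(tokens)  # copy so the original prompt is untouched
            variant[pos] = alt
            yield pos, alt, variant


def activation_shifts(
    tokens: List[str],
    target_idx: int,
    synonyms: Dict[str, List[str]],
    activation_fn: ActivationFn,
    window: int = 5,
) -> List[Tuple[int, str, float]]:
    """Return (position, substitute, activation change) for every substitution."""
    baseline = activation_fn(tokens, target_idx)
    return [
        (pos, alt, activation_fn(variant, target_idx) - baseline)
        for pos, alt, variant in substitution_variants(
            tokens, target_idx, synonyms, window
        )
    ]


def toy_activation(tokens: List[str], idx: int) -> float:
    """Stand-in neuron (an assumption): fires only on the "1" in "Page : 1"."""
    fires = idx >= 2 and tokens[idx - 2 : idx + 1] == ["Page", ":", "1"]
    return 1.0 if fires else 0.0


if __name__ == "__main__":
    prompt = ["See", "Page", ":", "1"]
    table = {"Page": ["Sheet"], "1": ["2"]}  # hypothetical synonym table
    for pos, alt, shift in activation_shifts(prompt, 3, table, toy_activation):
        print(f"position {pos}: -> {alt!r}, activation change {shift:+.1f}")
```

A large drop when a context token is swapped suggests the neuron depends on that surrounding token rather than only on the current one, which is the kind of nuance the feedback says a single-token description can miss.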

What are the full names of your participants?

Michelle Wai Man Lo

Does anyone from your team want to work towards publishing this work later?

Yes

Where are you participating from?

Online
