The Mechanistic Interpretability Hackathon

Hosted by Esben Kran, Neel Nanda, Apart Research, Zaki, fbarez · #alignmentjam

A jam submission

Interactive Layerscope

We expand Neel Nanda's Interactive Neuroscope to view an entire layer.

Submitted by chris-lons, victorlf4 — 17 minutes, 27 seconds before the deadline

Results

Criteria	Rank	Score*	Raw Score
Generality	#1	4.333	4.333
Mechanistic interpretability	#3	4.333	4.333
Novelty	#7	3.333	3.333
ML Safety	#8	3.000	3.000
Reproducibility	#9	3.667	3.667

Ranked from 3 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.

Judge feedback

Judge feedback is anonymous and shown in a random order.

Interesting project! Sadly I looked at this after the tool timed out. This would have been significantly improved by including the Colab! I'd encourage you to make a public version of it. I broadly buy that this tool is a cool thing to exist, and that looking at all neurons in a layer is a useful view, though obviously this would be a cooler tool with clear and compelling use cases! One question I have is about the speed and usability of the tool: my experience is that Plotly is pretty slow with massive plots. On the other hand, being able to hover and see the top neuron indices is pretty useful. I agree with the take that neurons that strongly activate sparsely are more interesting than neurons that activate everywhere. Another thing worth tracking is that some neurons activate much more than others in general, and maybe this would be improved by normalising by that? My main suggestions for improvement would be a public version, and just exploring a bunch more and seeing if you find anything interesting! One idea would be to pick some phenomena that you want to investigate and stare at neurons for that. -Neel
This is a great example of creating a live application that can be utilized for mechanistic interpretability work and it's nice to see its use case in the report as well.
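
The normalisation idea raised in the feedback above (some neurons activate much more than others in general) can be sketched roughly as follows. This is only an illustration, not the submission's code: the function names `normalise_per_neuron` and `active_fraction`, the toy activation values, and the choice to scale by each neuron's maximum over the displayed tokens are all assumptions. A dataset-wide statistic per neuron would probably be a better scale in practice.

```python
from typing import List


def normalise_per_neuron(acts: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Scale every neuron by its own largest absolute activation.

    acts[t][n] is the activation of neuron n at token position t.
    Assumption: the scale is taken over the tokens shown, not over a dataset.
    """
    n_neurons = len(acts[0])
    scales = [max(abs(row[n]) for row in acts) for n in range(n_neurons)]
    return [[row[n] / (scales[n] + eps) for n in range(n_neurons)] for row in acts]


def active_fraction(norm_acts: List[List[float]], threshold: float = 0.5) -> List[float]:
    """Per neuron, the fraction of tokens whose normalised activation exceeds threshold.

    A low value means the neuron fires strongly on only a few tokens (sparse).
    """
    n_tokens = len(norm_acts)
    n_neurons = len(norm_acts[0])
    return [
        sum(1 for row in norm_acts if row[n] > threshold) / n_tokens
        for n in range(n_neurons)
    ]


# Toy layer: 3 tokens x 2 neurons. Neuron 1 is large everywhere,
# neuron 0 is small except on the last token.
acts = [
    [0.1, 5.0],
    [0.2, 4.0],
    [2.0, 6.0],
]
norm = normalise_per_neuron(acts)
fractions = active_fraction(norm)
# Rank neurons from sparsest to densest firing.
ranking = sorted(range(len(fractions)), key=lambda n: fractions[n])
print(fractions, ranking)
```

After normalisation both neurons peak near 1, so a heatmap of a whole layer is no longer dominated by the neurons with the largest raw values, and the sparsely firing neuron 0 is ranked first.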

What are the full names of your participants?
Víctor Levoso, Alejandro González, Chris Lonsberry

What is your team name?
We can read the matrix (actually we can't)

What is your and your team's career stage?
Multiple

Does anyone from your team want to work towards publishing this work later?

Maybe

Where are you participating from?

Online

Comments

victorlf4 (Developer) · 1 year ago (2 edits) (+2)

https://colab.research.google.com/drive/1F5SoDy1JPvZe5lf6dnGHSkB4C6vPIFcW?usp=sh...

Dunno why we didn't end up linking the Colab.

Here it is anyway, in case anyone is interested.

I might fix some things and make a pull request to add it to TransformerLens later if Neel thinks that's a good idea.

Maybe modifying the original Neuroscope so it has a selector to switch between showing a whole layer and showing a specific neuron would be a better idea, though.

Also, about speed: it's not that bad.

It still takes a moment to load in Colab, which can be annoying, but it was still usable. (And for whatever reason it was near instant when Alex ran the notebook locally.)

That said, we were trying with GPT small; maybe it's worse with bigger models.
