In search of linguistic concepts: investigating BERT's context vectors

Results
| Criteria | Rank | Score* | Raw Score |
| --- | --- | --- | --- |
| ML Safety | #5 | 3.333 | 3.333 |
| Generality | #5 | 3.000 | 3.000 |
| Novelty | #11 | 2.667 | 2.667 |
| Mechanistic interpretability | #11 | 3.333 | 3.333 |
| Reproducibility | #14 | 2.000 | 2.000 |
Ranked from 3 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.
Judge feedback
Judge feedback is anonymous and shown in a random order.
- This work is nicely done as a traditional machine learning task. However, using BERT visualization on the fine-tuned models may not be very useful. It would be beneficial to include more interpretability methods to support the conclusions and investigate fine-tuning, as this area is still under-studied.
- There's actually been a fair amount of prior work on this kind of thing! Two relevant papers: https://arxiv.org/abs/1906.02715 and https://arxiv.org/abs/1905.05950. More generally, there's a whole subfield called BERTology on these kinds of questions: https://arxiv.org/abs/2002.12327

  I think your motivation section is mostly false - as far as I know, there's been very little interp work on vision transformers, and attention patterns are, if anything, easier to interpret for language models than for image models. There's been a fair amount of interp work on classic image models like ConvNets and ResNets. But generally we don't interpret image models by "averaging over" inputs; other techniques like feature visualization are used: https://distill.pub/2017/feature-visualization/

  The actual method used here was fairly legit, and is analogous to what's known in the literature as probing; here's a review: https://arxiv.org/pdf/2102.12452.pdf It's generally easier to classify e.g. "anger vs not anger" than a 7-way categorical problem like this, though you need to e.g. have the same number of anger and non-anger data points (or scale the loss for the anger ones to get comparable gradients).

  I'm pretty surprised that a two-layer BERT model could do such good fake news classification! Honestly this makes me suspect that the dataset is badly made or too easy. It wasn't clear to me where the two-layer BERT model came from - was it part of BERTVis? I'm impressed that you managed to fine-tune a language model in a weekend hackathon! That's a fair amount of effort. - Neel
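A minimal sketch of the probing setup the feedback refers to: train a small linear classifier on frozen context vectors and check whether a concept (here, binary "anger" vs. "not anger") is linearly decodable. The synthetic 2-D vectors below stand in for real BERT hidden states; all names and data are illustrative assumptions, not the team's actual pipeline.

```python
import math
import random

random.seed(0)

def make_data(n=200):
    # Two Gaussian clusters standing in for context vectors of
    # "anger" (label 1) vs. "not anger" (label 0) tokens.
    data = []
    for _ in range(n):
        y = 1 if random.random() < 0.5 else 0
        mu = (1.5, 1.5) if y else (-1.5, -1.5)
        x = (random.gauss(mu[0], 1.0), random.gauss(mu[1], 1.0))
        data.append((x, y))
    return data

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(data, lr=0.1, epochs=50):
    # Plain SGD logistic regression: the "probe" is just w, b.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            g = p - y  # gradient of log-loss w.r.t. the logit
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

data = make_data()
w, b = train_probe(data)
acc = sum(
    (sigmoid(w[0] * x[0] + w[1] * x[1] + b) > 0.5) == (y == 1)
    for x, y in data
) / len(data)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy suggests the concept is linearly encoded in the vectors; with real hidden states one should also balance the classes (or weight the loss), exactly as the feedback notes, so that gradients from both classes are comparable.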
What are the full names of your participants?
Roksana Goworek, Paul Martin, Jonathan Frennert
What is your team name?
teamEd
What is your and your team's career stage?
UG students
Does anyone from your team want to work towards publishing this work later?
Yes
Where are you participating from?
Edinburgh
Comments
Loved the research question!! Try having a look at TCAV and our results from the previous hackathon (where we looked for concepts in a Connect Four RL agent).
Thank you very much! I will check those out! Btw, would you like to connect for potential future collaborations? Here is my LinkedIn if you do: https://www.linkedin.com/in/roksana-goworek-0b6072154