Automated Identification of Potential Feature Neurons's itch.io page
Results
Criteria | Rank | Score* | Raw Score |
Judge's choice | #3 | n/a | n/a |
Reproducibility | #3 | 4.250 | 4.250 |
Mechanistic interpretability | #5 | 4.250 | 4.250 |
Generality | #5 | 3.000 | 3.000 |
Novelty | #8 | 3.000 | 3.000 |
ML Safety | #12 | 2.750 | 2.750 |
Ranked from 4 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.
Judge feedback
Judge feedback is anonymous and shown in a random order.
- This is a wonderful project and plays right into mechanistic interpretability! This novel 3-step method is great for making neurons more interpretable, and it enables quite a bit of deeper analysis. I also recommend reading Alex Foote's winning submission for the last interpretability hackathon, which echoes some of your comments at the end: https://alexfoote.itch.io/investigating-neuron-behaviour-via-dataset-example-pruning-and-local-search. Great work!
- Cool project! I'm excited to see Neuroscope being used like this (and I'm sorry you had to scrape the data; I need to get round to making the dataset available!). I liked the creativity and diversity of your methods, and I like the spirit of trying to automate things! Using GPT-3 and FastText are cool ideas. My main criticism is that these descriptions tend not to be specific enough and miss nuance. For example, neuron 134 in layer 6 of solu-8l-pile actually activates on the 1 in "Page: 1" in a specific document format in the Pile, which seems far more specific than the description given (https://neuroscope.io/solu-8l-pile/6/134.html). I also think that tokenization is a massive pain that breaks up the semantic meaning of words into semi-arbitrary tokens, and I don't see how your method engages with that properly; it seems like it mostly doesn't involve the surrounding context of the word? I really liked the idea of substituting in synonym tokens for the current token. I'd love to see that done for the 5 tokens before the current token, and to try to figure out whether we can find "similar tokens" in a principled way when the token is not just a word or clear conceptual unit. But yeah, overall, nice work!
What are the full names of your participants?
Michelle Wai Man Lo
Does anyone from your team want to work towards publishing this work later?
Yes
Where are you participating from?
Online