A jam submission

The Start of Investigating a 1-Layer SoLU Model

We found a behavior on which a toy SoLU model succeeds, as well as a few edge cases where it fails.
Submitted by jakub151 — 9 minutes, 19 seconds before the deadline

Results

Criteria                     | Rank | Score* | Raw Score
Novelty                      | #5   | 3.674  | 4.500
Reproducibility              | #8   | 3.674  | 4.500
Generality                   | #9   | 2.858  | 3.500
ML Safety                    | #9   | 2.858  | 3.500
Mechanistic interpretability | #10  | 3.674  | 4.500

Ranked from 2 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.

Judge feedback

Judge feedback is anonymous and shown in a random order.

  • This builds on a really interesting idea - updating the activation patches most relevant to the task itself - and might prove a fruitful path towards a next-level fine-tuning methodology that we can use to update the models' behaviour in a more direct and interpretable way. I'm quite interested in seeing this taken further and in diving into this idea more, though the present project of course didn't get too many results.
  • Thanks for the project! Some feedback:
      • It wasn't clear exactly which clean prompts you patched to which corrupted prompts when showing the figures in the write-up - this changes a lot! I personally would have picked two prompts that the model gets correct (e.g. swim -> pool, pray -> church) and patched from one to the other (a minimal patching sketch in that spirit follows this feedback).
      • I think the task was cool; mapping verbs to sensible nouns seems like a common and important task, and it seems pretty easy to study since you can isolate the effect of one word on another word.
      • It would have been nice to see something in the write-up about how good the model actually was at the task (ideally the probability it put on the correct final word).
      • My guess is that head 6 was implementing skip trigrams, which are a rule like "swim ... the -> pool", meaning "if I want to predict what comes after 'the' and I can see 'swim' anywhere earlier in the prompt, then predict that 'pool' comes next".
      • This might have been cleaner to study in a 1L attention-only model, where the heads are ALL that's going on.
      • Given that you're in a 1L model with MLPs, it would be cool to figure out which model components are doing the work here! In particular, the logits are a sum of the logit contributions from the MLP layer, from each attention head, and from the initial token embedding, so you can directly look at which ones matter the most. The section with `decompose_resid` in the Exploratory Analysis Demo demonstrates this (a rough sketch also follows below).
      • It's interesting to me whether head 6's output directly improves the correct output logit, or whether it's used by the MLPs - I'd love to see a follow-up looking at that! And if it is used by the MLPs, I'd be super curious to see what's going on there - can you find specific neurons that matter and boost the correct output logit, etc.?
      • And thanks for the TransformerLens feedback! I'd love to get more specifics on what confused you, what docs you looked for and couldn't find, and anything else that could help it be clearer - I wrote the library, so these things are hard to notice from the inside. -Neel
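
As a concrete starting point for the first suggestion, here is a minimal sketch of patching each head's output from one prompt the model gets right onto another, using TransformerLens. The model name (`solu-1l`), the two prompts, and the answer tokens are illustrative assumptions, not the exact setup from the write-up.

```python
# Minimal activation-patching sketch (assumed: model name, prompts, answer tokens).
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("solu-1l")

source_prompt = "He went for a swim in the"  # model should predict " pool"
dest_prompt = "He went to pray at the"       # model should predict " church"

source_tokens = model.to_tokens(source_prompt)
dest_tokens = model.to_tokens(dest_prompt)

# Cache all activations on the source ("swim") prompt.
_, source_cache = model.run_with_cache(source_tokens)

pool_token = model.to_single_token(" pool")
church_token = model.to_single_token(" church")

def patch_head_z(z, hook, head):
    # z: [batch, pos, head_index, d_head]; copy the cached source activation
    # for one head at the final position into the destination run.
    z[:, -1, head, :] = source_cache[hook.name][:, -1, head, :]
    return z

# Patch each head of layer 0 in turn and see how much it pushes " pool" over " church".
for head in range(model.cfg.n_heads):
    patched_logits = model.run_with_hooks(
        dest_tokens,
        fwd_hooks=[(utils.get_act_name("z", 0),
                    lambda z, hook, head=head: patch_head_z(z, hook, head))],
    )
    diff = (patched_logits[0, -1, pool_token] - patched_logits[0, -1, church_token]).item()
    print(f"head {head}: pool-minus-church logit diff after patching = {diff:.3f}")
```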
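And for the direct logit attribution suggestion, here is a rough sketch of what `decompose_resid` gives you, in the spirit of the Exploratory Analysis Demo; again, the model name, prompt, and answer token are placeholders rather than the project's actual setup.

```python
# Rough direct-logit-attribution sketch (assumed: model name, prompt, answer token).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("solu-1l")
tokens = model.to_tokens("He went for a swim in the")
logits, cache = model.run_with_cache(tokens)

# Split the residual stream at the final position into per-component pieces:
# token embedding, positional embedding, attention output, and MLP output.
components, labels = cache.decompose_resid(layer=-1, pos_slice=-1, return_labels=True)
components = cache.apply_ln_to_stack(components, layer=-1, pos_slice=-1)

# Project each component onto the unembedding direction for the correct answer " pool".
pool_token = model.to_single_token(" pool")
pool_direction = model.W_U[:, pool_token]  # [d_model]

contributions = components[:, 0, :] @ pool_direction  # batch index 0
for label, value in zip(labels, contributions):
    print(f"{label:>12}: {value.item():+.3f}")
```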

What are the full names of your participants?
Carson Ellis, Jakub Kraus, Itamar Pres, Vidya Silai

What is your team name?
the ablations

What is your and your team's career stage?
students

Does anyone from your team want to work towards publishing this work later?

Maybe

Where are you participating from?

Online


Comments

Host (Submitted)

Great project! Thank you for the feedback as well <3