How to find the minimum of a list - Transformer Edition
Results
Criteria | Rank | Score* | Raw Score |
Interpretability | #11 | 3.182 | 3.375 |
Reproducibility | #13 | 3.536 | 3.750 |
Novelty | #14 | 2.711 | 2.875 |
Generality | #17 | 2.475 | 2.625 |
ML Safety | #22 | 1.650 | 1.750 |
Ranked from 8 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.
Where are you participating from?
London
What are the names of your team members?
Ole, Stefan, Ayham, Devesh, Joe
What are the email addresses of all your team members?
ojorgensen1417@gmail.com, sh2061@cam.ac.uk, ayham.saffar@gmail.com, josephmiller101@gmail.com, devesh.joshi22@ic.ac.uk
What is your team name?
BugSnax
Comments
Good documentation of the experimental setup.
The proposed algorithm seems really interesting, but the argumentation is hard to follow at points. For example, I don’t understand how the model goes from attending most strongly to the smallest numbers, to outputting them in the correct order - it makes sense that the model could do this, but it feels like either there is a missing step in the argument explaining how this actually occurs, or I am not understanding the explanation (very possible)! Could it be via the MLP layer? Would be interesting to test if a 1L attention-only transformer can also perform this task. What about a 2L attention-only transformer?
Line plots 1 and 2 were very interesting, and showed pretty convincing evidence that the model was sorting the numbers by attending to the smaller numbers most strongly. I wonder if they could have been presented in a clearer way - perhaps sorting the x-axis would have helped, as we would then hope to see a monotonically decreasing line.
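To make the suggestion about sorting the x-axis concrete, here is a minimal sketch in Python. The function names and the numbers are hypothetical and purely illustrative; they are not taken from the project's notebook or results. It reorders (token value, attention weight) pairs by token value and checks whether the attention weights then fall as the token value rises.

```python
def sort_by_token_value(tokens, attention):
    """Reorder (token, attention) pairs so that token values ascend."""
    pairs = sorted(zip(tokens, attention), key=lambda p: p[0])
    sorted_tokens = [t for t, _ in pairs]
    sorted_attention = [a for _, a in pairs]
    return sorted_tokens, sorted_attention


def is_non_increasing(values):
    """True if each value is no larger than the one before it."""
    return all(b <= a for a, b in zip(values, values[1:]))


# Made-up illustrative numbers, NOT results from the project.
tokens = [7, 2, 9, 4]
attention = [0.10, 0.55, 0.05, 0.30]

xs, ys = sort_by_token_value(tokens, attention)
print(xs, ys, is_non_increasing(ys))
```

Plotting `ys` against `xs` would then give a line that should slope downwards if the model really attends most strongly to the smallest numbers.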
I see there are plots of the train and test loss in the notebook. These would have been great to include, along with accuracy metrics, to confirm the model can do the task! The loss histories have a very interesting shape: an initial sharp decrease, then a flattening, then another sharp decrease (and maybe another small sharp decrease later on). It would be interesting to find out why this happened, and what occurred at these points, by inspecting model checkpoints saved around them. Perhaps there is an interesting phase change in the model here!
Also good to see null results reported in the final section on grokking, and you raise the interesting question of what conditions are necessary for grokking to occur.
Overall, well done, this was a cool project! The algorithm you hypothesise makes sense and is very interesting, and there is some convincing evidence for parts of it. It felt like the argumentation and experiments could have been expanded to provide more clarity on other parts of the algorithm. I also think there are some really interesting follow-up questions!
In any case, very cool idea!
The group limits are mostly symbolic so it is in fact federally legal! We'll accept it :)