This jam is now over. It ran from 2022-11-11 17:00:00 to 2022-11-13 13:15:00.

🧑‍🔬 Join us for this month's alignment jam! Get answers to all your questions in this FAQ. See all the starter resources here. Go to the GatherTown here.

Join this AI safety hackathon to find new perspectives on the "brains" of AI!

48 hours of intense research in interpretability, the modern neuroscience of AI.

We will work with multiple research directions, and one of the most relevant is mechanistic interpretability: working towards a mechanistic understanding of neural networks and why they do what they do.
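As a toy illustration of the kind of inspection this involves, the sketch below builds a single attention head with random weights (purely illustrative, not any particular library's API; real interpretability work inspects a trained model's weights) and asks which token each token attends to:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head attention: 3 tokens, 4-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat"]
X = rng.normal(size=(3, 4))      # token embeddings
W_q = rng.normal(size=(4, 4))    # query projection
W_k = rng.normal(size=(4, 4))    # key projection

Q, K = X @ W_q, X @ W_k
scores = Q @ K.T / np.sqrt(4)        # scaled dot-product scores
pattern = softmax(scores, axis=-1)   # attention pattern: each row sums to 1

# The interpretability question: which token does each position attend to?
for i, tok in enumerate(tokens):
    j = int(pattern[i].argmax())
    print(f"{tok!r} attends most to {tokens[j]!r} ({pattern[i, j]:.2f})")
```

In a real project you would extract these attention patterns from a trained transformer and look for heads whose patterns implement a recognizable algorithm (e.g. induction heads).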

We provide you with the best starter templates that you can work from so you can focus on creating interesting research instead of browsing Stack Overflow. You're also very welcome to check out some of the ideas already posted!  


The schedule runs from 6PM CET / 9AM PST Friday to 7PM CET / 10AM PST Sunday. We start with an introductory talk and end with an awards ceremony. Join the public iCal here.

Local groups

If you are part of a local machine learning or AI safety group, you are very welcome to set up a local in-person site to work together with people on this hackathon! We will have several across the world (list upcoming) and hope to increase the number of local sites.

You will work in groups of 1-6 people within our hackathon GatherTown and in the in-person event hubs.

How to participate

Create a user on this website and click participate. We will assume that you are going to participate, so please cancel if you won't be part of the hackathon.


You will work on research ideas you generate during the hackathon and you can find more inspiration below.


Everyone helps rate the submissions on a shared set of criteria. During the 4 hours of judging, you will be assigned 5 projects to rate before you can judge projects of your own choosing (this avoids selectivity in the judging).

Each submission will be evaluated on the criteria below (the number in parentheses is the criterion's weight):

ML Safety (2): How good are your arguments for how this result informs the long-term alignment and understanding of neural networks? How informative are the results for the field of ML and AI safety in general?
Interpretability (1): How informative is it for the field of interpretability? Have you come up with a new method or found revolutionary results?
Novelty (1): Are the results new, and are they surprising compared to what we would expect?
Generality (1): Do your research results show that your hypothesis generalizes? E.g. if you expect language models to overvalue evidence in the prompt compared to in their training data, do you test more than just one or two different prompts, and do you do a proper interpretability analysis of the network?
Reproducibility (1): Can we easily reproduce the research, and do we expect the results to replicate? A high score here might combine high Generality with a well-documented GitHub repository that reruns all experiments.


We have many ideas available for inspiration on the Interpretability Hackathon ideas list. A lot of interpretability research is available on distill.pub, Transformer Circuits, and Anthropic's research page.

Introductions to mechanistic interpretability

See also the available tools for interpretability:

Digestible research

Below is a talk by Esben on a principled introduction to interpretability for safety:


All submissions


Jason Hoelscher-Obermaier, Oscar Persson, Jochem Hölscher
An investigation on backup heads in GPT-2 for the Indirect Object Identification task.
Can we make neural network structures searchable?
Analysing the semantic meanings of transformer's embedding space
Observing and Validating Induction heads in SOLU-8l-old
Interpretability hackathon project, block multiverses & probabilities
Localizing concepts within transformer networks