The Mechanistic Interpretability Hackathon

Ratings

A jam submission

$B$ Confident Bro: Discovering Latent Knowledge In Language Models Without Supervision

Submitted by fbarez

Criteria	Rank	Score*	Raw Score
ML Safety	#2	3.674	4.500
Reproducibility	#13	2.041	2.500
Generality	#14	1.633	2.000
Mechanistic interpretability	#14	1.225	1.500
Novelty	#14	2.041	2.500

Ranked from 2 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.
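
The page does not spell out the adjustment. As a minimal sketch, assume the formula commonly described for itch.io jams: adjusted = raw * sqrt(min(median, ratings) / median). Also assume a jam-wide median of 3 ratings per game. The median is not stated on this page; 3 is an inferred value, chosen because it reproduces the Score column above from the Raw Score column with 2 ratings. The function and variable names are illustrative only.

```python
import math

def adjusted_score(raw: float, ratings: int, median_ratings: int) -> float:
    """Scale a raw score down when an entry has fewer ratings than the jam median.

    Assumed formula (not confirmed by this page):
        raw * sqrt(min(median, ratings) / median)
    """
    factor = math.sqrt(min(median_ratings, ratings) / median_ratings)
    return raw * factor

# Raw scores copied from the table above; this entry received 2 ratings.
raw_scores = {
    "ML Safety": 4.5,
    "Reproducibility": 2.5,
    "Generality": 2.0,
    "Mechanistic interpretability": 1.5,
    "Novelty": 2.5,
}

ASSUMED_MEDIAN = 3  # assumption: inferred from the table, not stated on the page

for criterion, raw in raw_scores.items():
    print(f"{criterion}: {adjusted_score(raw, 2, ASSUMED_MEDIAN):.3f}")
```

Under these assumptions, an entry with at least the median number of ratings keeps its raw score unchanged, so the penalty only applies to under-rated entries such as this one.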

Judge feedback is anonymous.

This project is intriguing and I am looking forward to seeing its potential real-world application. The implementation of confidence modulation is noteworthy as it seems to bridge the gap between machine knowledge and human understanding, an important consideration for machine learning safety. However, I would appreciate more concrete results and data in the report. Overall, I am excited to see the progress and further development of this work.

What are the full names of your participants?
Faz, Hugo, Sam, Bodgan, Elias

What is your team name?
Ox

What career stage are you and your team at?
meh

Does anyone from your team want to work towards publishing this work later?

Yes

Where are you participating from?

Oxford
