A jam submission

Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques

An automated Sandwiching experimental framework for evaluating scalable oversight techniques.
Submitted by gmukobi — 10 minutes, 27 seconds before the deadline

Results

Criteria         Rank  Score*  Raw Score
Overall          #1    4.250   4.250
Generality       #1    4.500   4.500
Reproducibility  #1    4.750   4.750
Novelty          #1    4.250   4.250
Topic            #1    4.750   4.750
Judge's choice   #1    n/a     n/a

Ranked from 4 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.

Judge feedback

Judge feedback is anonymous and shown in a random order.

  • Cool project! What would happen if you used a larger, more capable model to mimic the human overseer? (Note that, for safety reasons, we only want the model optimized to mimic the human overseer, rather than optimized for human preference.) It seems plausible to me that you could get better performance from a better model simulating the human overseer. Overall: the ideas are relevant and novel. Though the results are a bit weak, this is a very informative first step for follow-up investigations. Scalable Oversight: highly relevant. Novelty: the idea seems impactful and could potentially be turned into a research paper with more comprehensive experiments. Generality: the technique is fairly general, at least as general as the paradigm proposed by Bowman et al. Reproducibility: seems easily reproducible.
  • This is exactly the type of project I was hoping would come out of this hackathon! Replacing humans with LMs, as has been done here, potentially allows for much faster research iteration if it can be made to work, so I think this is an interesting and worthwhile research direction. Despite the negative finding, this direction has a lot of promise, and there is much more that could be tried if you or others are inspired to continue: other prompts or fine-tuning for variations on what you've tried, generation and evaluation of arguments for or against particular answers, different approaches to eliciting the larger language model's confidence in its answers, and so on. I suspect that Bowman et al. would be happy to share their full transcripts with you if you wanted to examine in more detail some of the successful strategies their participants used, or to fine-tune on their data. Minor note: regarding answer letters not appearing in the top-5 tokens, you might have slightly better luck with the slightly modified prompt: "Final Answer (answer immediately with (A), (B), (C), or (D)): (" (see the sketch below).
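To illustrate the judge's prompt suggestion, here is a minimal sketch of reading answer letters out of a model's top-5 tokens. It assumes the legacy OpenAI completions API with a placeholder model name and prompt body; the submission's actual model interface may differ. Ending the prompt with an open parenthesis nudges the next token toward a bare capital letter, making the answer letter more likely to land among the top-5 candidates.

```python
import openai

# Hedged sketch (legacy openai-python < 1.0 Completions API; the model
# name and prompt body are placeholders, not the submission's setup).
prompt = (
    "<question and answer options here>\n"
    "Final Answer (answer immediately with (A), (B), (C), or (D)): ("
)

response = openai.Completion.create(
    model="text-davinci-003",  # placeholder model
    prompt=prompt,
    max_tokens=1,
    temperature=0,
    logprobs=5,                # return the top-5 candidate tokens
)

# Top-5 tokens and their log-probabilities at the answer position.
top_tokens = response["choices"][0]["logprobs"]["top_logprobs"][0]

# Because the prompt already ends with "(", the answer should appear
# as a bare "A"/"B"/"C"/"D" token among the top candidates.
letter_logprobs = {
    tok.strip(): lp
    for tok, lp in top_tokens.items()
    if tok.strip() in {"A", "B", "C", "D"}
}
print(letter_logprobs)
```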

What are the full names of your participants?
Sophia Pung, Gabriel Mukobi

What is your team name?
Automated Sandwiching
