Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques

Results
| Criteria | Rank | Score* | Raw Score |
| --- | --- | --- | --- |
| Overall | #1 | 4.250 | 4.250 |
| Generality | #1 | 4.500 | 4.500 |
| Reproducibility | #1 | 4.750 | 4.750 |
| Novelty | #1 | 4.250 | 4.250 |
| Topic | #1 | 4.750 | 4.750 |
| Judge's choice | #1 | n/a | n/a |
Ranked from 4 ratings. *Score is adjusted from the raw score by the median number of ratings per game in the jam.
Judge feedback
Judge feedback is anonymous and shown in a random order.
- This is exactly the type of project I was hoping would come out of this hackathon! Replacing human overseers with LMs, as done here, could allow much faster research iteration if it can be made to work, so this is an interesting and worthwhile research direction. Despite the negative finding, the direction has a lot of promise, and there is much more to try if you or others are inspired to continue: other prompts, fine-tuned variations on what you've tried, generation and evaluation of arguments for and against particular answers, different approaches to eliciting the larger language model's confidence in its answers, and so on. I suspect that Bowman et al. would be happy to share their full transcripts if you wanted to examine some of the successful strategies their participants used, or to fine-tune on their data. Minor note: when answer letters fail to appear in the top-5 tokens, you might have better luck with the modified prompt "Final Answer (answer immediately with (A), (B), (C), or (D)): (" (see the first sketch after this feedback).
- Cool project! What would happen if you used a larger, more capable model to mimic the human overseer? (Note that, for safety reasons, we only want the overseer optimized to mimic the human overseer, not optimized directly for human preference.) It seems plausible that you could get better performance from a stronger model simulating the human overseer; a sketch of this setup appears after this feedback. Overall: the ideas are relevant and novel, and though the results are a bit weak, this is a very informative first step for follow-up investigations. Scalable oversight: highly relevant. Novelty: the idea seems impactful and could potentially be turned into a research paper with more comprehensive experiments. Generality: the technique is fairly general, at least as general as the paradigm proposed by Bowman et al. Reproducibility: seems easily reproducible.
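To make the first judge's prompt suggestion concrete, here is a minimal sketch of reading the answer letter from top-5 logprobs. It assumes the legacy (pre-1.0) OpenAI Python SDK; the model choice and the `elicit_answer_letter` helper are illustrative, not part of the original project.

```python
# A minimal sketch of the suggested elicitation trick, assuming the legacy
# (pre-1.0) OpenAI Python SDK. Ending the prompt with an open parenthesis
# makes the bare answer letter the most natural next token, so it is more
# likely to surface among the top-5 logprob alternatives.
import openai

def elicit_answer_letter(question_and_options: str) -> dict:
    """Return the top-5 next-token logprobs after the answer cue.

    `elicit_answer_letter` is a hypothetical helper, not the project's code.
    """
    prompt = (
        question_and_options
        + "\nFinal Answer (answer immediately with (A), (B), (C), or (D)): ("
    )
    response = openai.Completion.create(
        model="text-davinci-003",  # illustrative model choice
        prompt=prompt,
        max_tokens=1,      # we only need the single answer token
        temperature=0.0,
        logprobs=5,        # report the top-5 alternatives for that token
    )
    # One dict per generated token, mapping candidate tokens to logprobs;
    # ideally the keys here are the bare letters "A", "B", "C", "D".
    return response["choices"][0]["logprobs"]["top_logprobs"][0]
```

Renormalizing the returned logprobs over the four letters then gives a rough confidence estimate for each option.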
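The second judge's suggestion, a larger model standing in for the human overseer, might look roughly like the following sketch. The prompts, model names, and single question-and-answer turn (`sandwich_round`, `chat`) are illustrative assumptions, not the authors' actual protocol; it again assumes the legacy OpenAI SDK.

```python
# A minimal sketch, under stated assumptions, of one sandwiching round with a
# larger model mimicking the human overseer. Assumes the legacy (pre-1.0)
# OpenAI Python SDK; all prompts and model names are illustrative.
import openai

OVERSEER_SYSTEM = (
    "You are role-playing a careful non-expert human overseer. Do not answer "
    "from your own knowledge; question the assistant, then decide."
)

def chat(model: str, messages: list) -> str:
    """Hypothetical one-shot chat helper."""
    response = openai.ChatCompletion.create(model=model, messages=messages)
    return response["choices"][0]["message"]["content"]

def sandwich_round(question: str,
                   overseer_model: str = "gpt-4",
                   assistant_model: str = "gpt-3.5-turbo") -> str:
    # The overseer (a larger model mimicking a human) asks one probing question.
    overseer_query = chat(overseer_model, [
        {"role": "system", "content": OVERSEER_SYSTEM},
        {"role": "user",
         "content": f"Question:\n{question}\n\nAsk the assistant one probing question."},
    ])
    # The (possibly unreliable) assistant model responds.
    assistant_reply = chat(assistant_model, [
        {"role": "user", "content": f"{question}\n\n{overseer_query}"},
    ])
    # The overseer weighs the reply and commits to a final answer letter.
    return chat(overseer_model, [
        {"role": "system", "content": OVERSEER_SYSTEM},
        {"role": "user", "content": (
            f"Question:\n{question}\n\n"
            f"You asked: {overseer_query}\n"
            f"The assistant replied: {assistant_reply}\n\n"
            "Final Answer (answer immediately with (A), (B), (C), or (D)): ("
        )},
    ])
```

Keeping the overseer prompted only to imitate a human questioner, rather than tuned against human preference, preserves the safety property the judge highlights: the overseer's behavior stays anchored to human judgment rather than to a learned reward.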
What are the full names of your participants?
Sophia Pung, Gabriel Mukobi
What is your team name?
Automated Sandwiching