Reverse Word Wizards: Pitting Language Models Against the Art of Reversal

Results
| Criteria | Rank | Score* | Raw Score |
| --- | --- | --- | --- |
| Overall | #2 | 3.750 | 3.750 |
| Generality | #2 | 4.000 | 4.000 |
| Reproducibility | #2 | 4.500 | 4.500 |
| Judge's choice | #3 | n/a | n/a |
| Novelty | #4 | 3.500 | 3.500 |
| Topic | #5 | 3.250 | 3.250 |
*Ranked from 4 ratings. Score is adjusted from the raw score by the median number of ratings per game in the jam.
Judge feedback
Judge feedback is anonymous and shown in a random order.
- What's most interesting to me about this entry is the recognition that LM abilities are prompt-sensitive and also often facilitated by a factored approach. In my view the connection to scalable oversight is a little tenuous, but it is definitely there, as one of the questions was "Evaluate how success of the assignment is affected by model complexity, i.e., if the assignment of reversing a word is a potential inverse scaling problem". If word reversal had been an example of a task where performance goes down with model size, that could potentially shed light on the kinds of alignment failures that are likely to be observed as scale increases. I was a little confused about why the sample size was so small, given that it would have been easy to generate an arbitrarily large number of stimuli. I'd note that there are ways to factor this particular problem that the authors didn't try, and I suspect the approach described in https://twitter.com/npew/status/1525900868305833984 would have been more successful (see the first sketch after these comments), as would an approach based on the technique described in section 3 of https://arxiv.org/pdf/2205.10625.pdf. Still, the authors explored this in pretty significant depth given that there were only 48 hours to work with!
- Comments: it’s a pretty cool study! I think the task is cool and something that is of interest to the scientific community. Though I’m a bit confused about the result: what prompts produced Figure 1 A/B/C? Personally I think the prompt is actually the most important factor for success, since the LM cannot really “see” the characters/letters directly. Looking at A & C: do nonsense words actually have a lower success rate than real words? It seems that nonsense words actually have a higher accuracy according to the plot. My interpretation is that nonsense words are more likely to get tokenized into finer subparts of a word, thus making them easier for the model to reverse. Further thoughts on the experimental design: to ablate other confounds related to character frequency, you could construct the nonsense words as random orderings of the letters of the real words (see the second sketch below). Overall: I think this is a useful phenomenon to understand for deep learning models, though it seems loosely related to scalable oversight. The presentation of the result is a bit confusing to me. That said, I like the idea, and it might be worthwhile to polish the result, perform a more comprehensive evaluation, and publish the results somewhere. Scalable oversight: loosely related. Novelty: seems like an interesting property to understand for deep learning models. Generality: the experiments can probably be easily extended to broader setups. Reproducibility: seems reproducible.
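The factored approach mentioned in the first comment can be illustrated with a minimal sketch. This is not the entrants' code: it assumes a hypothetical `query_model(prompt)` helper standing in for whatever completion API the experiment used, and it simply spells the word out with delimiters before asking for the reversal.

```python
def factored_reverse_prompt(word: str) -> str:
    # "wizard" -> "w-i-z-a-r-d": explicit delimiters keep the tokenizer
    # from merging letters into multi-character tokens.
    spelled = "-".join(word)
    return (
        "Reverse the sequence of letters.\n"
        f"Letters: {spelled}\n"
        "Reversed letters:"
    )

def reverse_via_model(word: str, query_model) -> str:
    # query_model is a placeholder for any text-completion call.
    completion = query_model(factored_reverse_prompt(word))
    # Drop the delimiters to recover a plain word: "d-r-a-z-i-w" -> "draziw".
    return completion.strip().replace("-", "")
```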
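The ablation suggested in the second comment, nonsense controls that preserve character frequency, is likewise easy to sketch. The function below is a hypothetical illustration, not part of the submission.

```python
import random

def scrambled_controls(words, seed=0):
    """Build nonsense controls by permuting the letters of each real word,
    so both conditions share the same character frequencies."""
    rng = random.Random(seed)
    controls = []
    for word in words:
        letters = list(word)
        while True:
            rng.shuffle(letters)
            candidate = "".join(letters)
            # Accept any shuffle that differs from the original; bail out
            # for words whose letters are all identical (no distinct shuffle).
            if candidate != word or len(set(word)) <= 1:
                break
        controls.append(candidate)
    return controls

# E.g. scrambled_controls(["wizard", "reverse"]) might give ["dizraw", "vereser"].
```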
What are the full names of your participants?
Ingrid Backman, Asta Rassmussen, Klara Nielsen
What is your team name?
The Circuit Wizards