Play project
Automated Model Oversight Using CoTP's itch.io pageResults
Criteria | Rank | Score* | Raw Score |
Reproducibility | #2 | 4.500 | 4.500 |
Topic | #2 | 4.000 | 4.000 |
Generality | #3 | 3.500 | 3.500 |
Overall | #3 | 3.500 | 3.500 |
Novelty | #5 | 3.000 | 3.000 |
Ranked from 4 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.
Judge feedback
Judge feedback is anonymous and shown in a random order.
- It's a cool project to use chain of thoughts and step-by-step reasoning to improve the model's capability to identify harmful outputs. More concrete examples and standard metrics (e.g., precision/recall) would help me understand the results better.
- This is a solid attempt to us LMs to oversee LMs and definitely in the spirit of the challenge. It could have benefited from a clearer articulation of what exaactly the research question was, although implicitly it seemed to be "does using CoT enable better automated evaluation for this task and dataset". The conclusions that can be drawn are somewhat limited as the imbalanced and small dataset does not allow for statistically significant comparisons, but it seems like it was valuable to the authors as a learning pilot to help identify issues and considerations that could help prepare for more substantial research of this kind in the future.
What are the full names of your participants?
Adam Khoja, Rishi Khare, John Wang
Leave a comment
Log in with itch.io to leave a comment.
Comments
No one has posted a comment yet