unfaithful IMO eval
Hi, I noticed that your reported IMO result is a purely outcome-based eval for problems requiring answers, and uses an LLM-as-a-judge for proof-based ones.
In fact, I have read the provided model samples https://github.com/inclusionAI/AWorld/tree/main/examples/imo/samples/samples%20from%20Ring-1T.
All the agent traces conclude with a final guard output that reveals missing points in the proof, making the evaluation unfaithful.
For example, the P3 guard output is:
# Verification Report
## Summary
**Final Verdict:** The solution's approach is viable and arrives at the correct final answer, but contains several **Justification Gaps** that prevent it from being fully rigorous.
...
In contrast, the solutions provided by Gemini Deep Think and OpenAI have correct answers alongside human-verified proofs. So I think this model is far behind the claimed silver-level performance.
Thank you for your feedback.
Please refer to: https://github.com/inclusionAI/AWorld/issues/520
Issue author keeps moving the goalpost lol.
He stopped at "However, the solutions have not been reviewed and checked by human expert, making it still unreliable." The thing is, the IMO competition has concluded and the answers are public and known. No AI lab is using independent judges for this: they either verify the solutions themselves manually (OpenAI did this even for their own entry) or they use an LLM. Nothing abnormal about this by itself, unless there are solutions you can convincingly prove were rated incorrectly given the expected correct answers.
I am not an expert on the IMO. However, someone has pointed out some issues with the P3 proof in the GitHub issue.