Fix AIME 2025 numeric scoring (equality, not suffix match) #45

Open
wise-east wants to merge 1 commit into aisa-group:main from wise-east:fix/aime2025-numeric-equality-scorer

@wise-east

Summary

  • Replace inspect_evals/aime2025 (which uses match(numeric=True) and therefore grades with endswith on the extracted digits) with a local inspect task and scorer that grades via normalized equality
  • Prefer ANSWER: line when present; fall back to last numeric token (same extraction intent as inspect)
  • Fix multi-\boxed{} stripping with iterative non-greedy substitution
  • Add unit tests for the 711/11 and 149/49 false-positive cases
  • Copy score.py / task.py into job dir in run_task.sh
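
The sketch below illustrates the grading behaviour described in the list above. It is not the code from this PR: `strip_boxed`, `extract_answer`, `normalize`, `suffix_match`, and `is_correct` are hypothetical names, and the `\boxed{}` stripping uses a brace-free character class applied repeatedly, which is one possible way to handle several boxed expressions rather than necessarily the substitution the PR implements.

```python
import re
from typing import Optional

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
ANSWER_LINE = re.compile(r"^\s*ANSWER\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def strip_boxed(text: str) -> str:
    """Replace each \\boxed{...} with its contents until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        text = BOXED.sub(r"\1", text)
    return text


def extract_answer(completion: str) -> Optional[str]:
    """Prefer the last ANSWER: line; otherwise use the last numeric token."""
    text = strip_boxed(completion)
    answer_lines = ANSWER_LINE.findall(text)
    candidate = answer_lines[-1] if answer_lines else text
    numbers = NUMBER.findall(candidate)
    return numbers[-1] if numbers else None


def normalize(value: str) -> str:
    """Canonicalize integers so that '011' and ' 11 ' both become '11'."""
    try:
        return str(int(value))
    except ValueError:
        return value.strip()


def suffix_match(prediction: str, target: str) -> bool:
    """The behaviour being replaced: any prediction ending in the target passes."""
    return prediction.endswith(target)


def is_correct(completion: str, target: str) -> bool:
    """Grade by equality of normalized values instead of by suffix."""
    answer = extract_answer(completion)
    return answer is not None and normalize(answer) == normalize(target)


# Suffix matching accepts 711 for a label of 11; equality does not.
print(suffix_match("711", "11"))         # True
print(is_correct("ANSWER: 711", "11"))   # False
print(is_correct("ANSWER: 11", "11"))    # True
```

Normalizing both sides before comparing keeps harmless formatting differences such as leading zeros from turning a correct answer into a miss, while still rejecting longer numbers that merely end with the label.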

Fixes #44

Test plan

cd src/eval/tasks/aime2025
PYTHONPATH=. pytest test_score.py -q
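
For orientation, regression tests of the kind this command runs might look roughly like the following. This is a hypothetical sketch, not the contents of test_score.py: `is_correct` here is a deliberately tiny stand-in scorer defined inline so the example is self-contained, and the test names are invented for illustration.

```python
import re


def is_correct(completion: str, target: str) -> bool:
    """Hypothetical stand-in scorer: the last integer token must equal the target."""
    numbers = re.findall(r"\d+", completion)
    return bool(numbers) and int(numbers[-1]) == int(target)


def test_longer_number_with_matching_suffix_is_rejected():
    # The two false-positive cases named in the summary.
    assert not is_correct("ANSWER: 711", "11")
    assert not is_correct("ANSWER: 149", "49")


def test_exact_answer_is_accepted():
    assert is_correct("ANSWER: 11", "11")
    assert is_correct("ANSWER: 49", "49")


def test_endswith_would_have_accepted_both():
    # Documents why suffix matching is unsafe for numeric answers.
    assert "711".endswith("11")
    assert "149".endswith("49")
```

pytest collects any function whose name starts with `test_`, so a file shaped like this needs no additional registration.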

Upstream

The long-term fix belongs in inspect_ai (in _common.py, the numeric + end case should use ==). This PR mitigates the issue for PostTrainBench until inspect-ai is updated.

Made with Cursor

inspect_ai match(numeric=True) grades with endswith on the extracted answer,
which marks predictions like 711 correct when the label is 11. Add a local
scorer that parses ANSWER lines and compares normalized values with ==.

Fixes aisa-group#44

Co-authored-by: Cursor <cursoragent@cursor.com>

Development

Successfully merging this pull request may close these issues.

AIME 2025 scoring: inspect_ai numeric match uses endswith (false positives e.g. 711 vs 11)