Fix AIME 2025 numeric scoring (equality, not suffix match) by wise-east · Pull Request #45 · aisa-group/PostTrainBench

wise-east · 2026-05-20T21:43:31Z

Summary

Replace inspect_evals/aime2025, whose match(numeric=True) scorer applies endswith to the extracted digits, with a local inspect task and scorer that grade by normalized equality.
Prefer the ANSWER: line when present; otherwise fall back to the last numeric token (the same extraction intent as inspect).
Fix stripping of multiple \boxed{} expressions by using iterative non-greedy substitution.
Add unit tests for the 711/11 and 149/49 false-positive cases (prediction/label).
Copy score.py and task.py into the job directory in run_task.sh.
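
The extraction and grading steps listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of the PR's score.py: the names strip_boxed, extract_answer, normalize, and is_correct are hypothetical, and the regular expressions are assumptions about what "ANSWER: line" and "numeric token" mean for AIME-style non-negative answers.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# All names below are hypothetical; this is not the PR's score.py.
_BOXED = re.compile(r"\\boxed\{(.*?)\}")  # non-greedy: stops at the first "}"
_ANSWER_LINE = re.compile(r"^\s*ANSWER\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)
_NUMBER = re.compile(r"\d+(?:\.\d+)?")  # assumption: answers are non-negative


def strip_boxed(text: str) -> str:
    """Unwrap \\boxed{...} repeatedly until a pass changes nothing.

    Limitation of this sketch: arguments containing other braces
    (for example \\frac{1}{2}) are not handled correctly.
    """
    previous = None
    while previous != text:
        previous = text
        text = _BOXED.sub(r"\1", text)
    return text


def extract_answer(completion: str) -> Optional[str]:
    """Prefer the last ANSWER: line; otherwise use the whole completion.

    In both cases the last numeric token of the chosen text is returned.
    """
    text = strip_boxed(completion)
    answer_lines = _ANSWER_LINE.findall(text)
    candidate = answer_lines[-1] if answer_lines else text
    numbers = _NUMBER.findall(candidate)
    return numbers[-1] if numbers else None


def normalize(value: str) -> Optional[Decimal]:
    """Parse to Decimal so that "011", "11" and "11.0" compare equal."""
    try:
        return Decimal(value.strip())
    except InvalidOperation:
        return None


def is_correct(completion: str, target: str) -> bool:
    """Grade by equality of normalized values, never by suffix."""
    predicted = extract_answer(completion)
    if predicted is None:
        return False
    p, t = normalize(predicted), normalize(target)
    return p is not None and p == t
```

With this sketch, is_correct("ANSWER: 711", "11") is False, while is_correct("First \\boxed{7}, finally \\boxed{011}", "11") is True because both \boxed{} wrappers are removed and the last numeric token is compared after normalization.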

Fixes #44

Test plan

cd src/eval/tasks/aime2025
PYTHONPATH=. pytest test_score.py -q

Upstream

The long-term fix belongs in inspect_ai: in _common.py, numeric matching at the end location should compare with ==. This PR mitigates the problem in PostTrainBench until inspect_ai is updated.
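
To make the difference concrete, the two comparison rules can be set side by side. The sketch below only illustrates the failure mode described in this PR; it is not inspect_ai's actual code, and suffix_match and equality_match are hypothetical names.

```python
def suffix_match(predicted: str, target: str) -> bool:
    # Illustrative stand-in for suffix-based grading (not inspect_ai's code).
    return predicted.endswith(target)


def equality_match(predicted: str, target: str) -> bool:
    # Compare the parsed integer values instead of string suffixes.
    return int(predicted) == int(target)


# (prediction, label) pairs: the two false positives from this PR, plus a true match.
cases = [("711", "11"), ("149", "49"), ("11", "11")]
results = [(p, t, suffix_match(p, t), equality_match(p, t)) for p, t in cases]

for predicted, target, by_suffix, by_equality in results:
    print(predicted, target, by_suffix, by_equality)
# prints:
# 711 11 True False
# 149 49 True False
# 11 11 True True
```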

Made with Cursor

inspect_ai match(numeric=True) grades with endswith on the extracted answer, which marks predictions like 711 correct when the label is 11. Add a local scorer that parses ANSWER lines and compares normalized values with ==.

Fixes aisa-group#44
Co-authored-by: Cursor <cursoragent@cursor.com>

wise-east wants to merge 1 commit into aisa-group:main from wise-east:fix/aime2025-numeric-equality-scorer
