We used o4-mini and this prompt to transform the BigCodeBench problems to use standard I/O. However, ~300 of the translated problems fail their own tests.
Let's look into them. The attached file has all the problems. Any problem with a task_id that does not appear in the dataset is one where the tests fail.
P.S. Note that the prompt is slightly wrong. We should regenerate at some point, but I doubt this caused the failures.
unfiltered_stdio_bcb.jsonl.zip
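
To list the failing problems, something like the following sketch should work. It assumes the attachment unzips to a JSONL file with one JSON object per line, each carrying a `task_id` field, and that the filtered dataset has been exported to a local JSONL file with the same field. The function names and both file paths are hypothetical, not part of any existing tooling.

```python
import json
from pathlib import Path


def load_task_ids(path):
    """Collect the task_id of every record in a JSONL file."""
    ids = set()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                ids.add(json.loads(line)["task_id"])
    return ids


def failing_task_ids(unfiltered_path, dataset_path):
    """Return task_ids present in the unfiltered file but absent from the dataset.

    Assumption: a problem is missing from the dataset only because its tests failed.
    """
    return sorted(load_task_ids(unfiltered_path) - load_task_ids(dataset_path))
```

For example, `failing_task_ids("unfiltered_stdio_bcb.jsonl", "dataset.jsonl")` would return the sorted list of failing ids, where `dataset.jsonl` stands in for wherever the filtered dataset was exported.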