Why do we fail to translate ~300 problems? #5

@arjunguha

We used o4-mini and this prompt to transform the BigCodeBench problems to use standard I/O. However, ~300 of the translated problems fail their own tests.

Let's look into them. The attached file has all the translated problems, including the failing ones. Any problem whose task_id does not appear in the dataset is one where the tests fail.
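As a starting point, here is a minimal sketch of how the failing problems could be listed from the attachment. It assumes the unzipped file is JSON Lines with one object per line carrying a `task_id` field, and that the set of task_ids present in the dataset has already been collected by some other means. The function name `failing_task_ids` and the `passing_ids` parameter are hypothetical, not part of any existing tooling.

```python
import json
from typing import Iterable, List, Set


def failing_task_ids(unfiltered_lines: Iterable[str], passing_ids: Set[str]) -> List[str]:
    """Return task_ids found in the unfiltered file but absent from the dataset.

    Assumes each non-empty line is a JSON object with a "task_id" key.
    """
    failing = []
    for line in unfiltered_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        problem = json.loads(line)
        task_id = problem["task_id"]
        if task_id not in passing_ids:
            failing.append(task_id)
    return failing


# Hypothetical usage, assuming the zip has been extracted next to the script
# and passing_ids holds the task_ids that appear in the dataset:
#
# with open("unfiltered_stdio_bcb.jsonl") as f:
#     for task_id in failing_task_ids(f, passing_ids):
#         print(task_id)
```

Keeping the comparison as a plain set lookup means the same function works however the dataset's task_ids are loaded.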

P.S. Note that the prompt is slightly wrong. We should re-generate at some point, but I doubt this caused the failures.

unfiltered_stdio_bcb.jsonl.zip
