Describe the bug
When running batched generation with `VLLMModel._greedy_until`, context-length checks were based only on the first prompt in the batch (`len(inputs[0])`) instead of the longest prompt.
If the first prompt was short but another prompt in the same batch was longer, truncation could be skipped incorrectly, causing some samples to exceed `max_length`.
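A minimal sketch of the problematic pattern (function and variable names are illustrative, not the actual lighteval source):

```python
def should_truncate_buggy(inputs: list[list[int]], max_new_tokens: int, max_length: int) -> bool:
    # BUG: only inspects the first prompt in the batch; longer prompts
    # later in the same batch are never considered.
    context_size = len(inputs[0])
    return context_size + max_new_tokens > max_length
```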
To Reproduce
- Use lighteval with the vLLM backend and configure a finite `max_length`.
- Create a batch with prompts of different lengths, where:
  - the first prompt is short,
  - at least one later prompt is long enough that `prompt_len + max_new_tokens > max_length`.
- Run a generation call that reaches `_greedy_until` (e.g. a normal evaluation batch with `max_new_tokens` set).
- Observe that truncation/logging decisions are made from the first prompt's length, so longer prompts in the same batch may not be truncated as required.
Minimal example logic (conceptual):
- `max_length = 1024`
- `max_new_tokens = 200`
- prompt lengths in the same batch: `[100, 950]`
- old behavior checks `100 + 200`, decides no truncation, but the second sample actually needs truncation (`950 + 200 > 1024`).
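The same arithmetic as a runnable check (illustrative values from above):

```python
max_length = 1024
max_new_tokens = 200
prompt_lengths = [100, 950]  # token counts of the prompts in one batch

# Old behavior: decide from the first prompt only.
first_fits = prompt_lengths[0] + max_new_tokens <= max_length
print(first_fits)  # True -> truncation skipped for the whole batch

# Per-sample reality: the second prompt overflows (950 + 200 = 1150 > 1024).
overflows = [p + max_new_tokens > max_length for p in prompt_lengths]
print(overflows)  # [False, True]
```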
Expected behavior
Truncation decisions should use the worst-case prompt length in the batch (the maximum prompt length), so all samples remain within `max_length`.
Warnings should clearly indicate batch-aware length handling.
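One possible batch-aware fix, as a sketch rather than the exact patch (helper name and warning text are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def truncate_batch(inputs: list[list[int]], max_new_tokens: int, max_length: int) -> list[list[int]]:
    """Hypothetical helper: decide truncation from the longest prompt in the
    batch so every sample stays within max_length after generation."""
    longest = max(len(tokens) for tokens in inputs)
    if longest + max_new_tokens > max_length:
        budget = max_length - max_new_tokens
        logger.warning(
            "Longest prompt in batch (%d tokens) + max_new_tokens (%d) exceeds "
            "max_length (%d); truncating over-long prompts to %d tokens.",
            longest, max_new_tokens, max_length, budget,
        )
        # Keep the last `budget` tokens of each over-long prompt.
        inputs = [tokens[-budget:] if len(tokens) > budget else tokens for tokens in inputs]
    return inputs
```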
Version info
- lighteval main/33acf35f02c41d234c7df5cbdf1fd3e9d33ecd76