Paper • Code • Dataset • Model (4B) • Leaderboard • Project Page
VEFX-Bench is a comprehensive benchmark for evaluating text-driven video editing and visual effects. It includes 5,049 annotated examples spanning 9 categories and 32 subcategories, evaluated by VEFX-Reward, a VLM-based reward model that scores edits along three dimensions on a 1–4 scale:
| Dimension | What it measures |
|---|---|
| Instructional Following (IF) | Does the edit accurately reflect the editing instruction? |
| Render Quality (RQ) | Visual clarity, temporal consistency, and physical plausibility |
| Edit Exclusivity (EE) | Were only the intended regions modified, without side-effects? |
VEFX-Reward scores are on a 1–4 scale. Models are ranked by GeoAgg (α=2 for IF, β=1 for RQ, γ=1 for EE). Higher is better.
Updated: May 2, 2026. For the latest results & submissions, visit the live leaderboard →
| Rank | Model | Type | IF ↑ | RQ ↑ | EE ↑ | GeoAgg ↑ |
|---|---|---|---|---|---|---|
| 🥇 | Kling o3 Omni | Commercial | 3.033 | 3.588 | 3.043 | 3.057 |
| 🥈 | Kling o1 | Commercial | 3.040 | 3.534 | 2.976 | 2.985 |
| 🥉 | Runway Gen-4.5 | Commercial | 2.817 | 3.319 | 2.923 | 2.912 |
| 4 | Seedance 2.0 | Commercial | 2.811 | 3.421 | 3.088 | 2.766 |
| 5 | Grok Imagine | Commercial | 2.606 | 3.346 | 3.376 | 2.723 |
| 6 | Luma Ray 3 | Commercial | 2.702 | 3.403 | 2.705 | 2.717 |
| 7 | UniVideo | Open-source | 2.294 | 3.266 | 3.091 | 2.516 |
| 8 | Wan 2.6 | Commercial | 2.012 | 3.317 | 2.446 | 2.146 |
| 9 | Luma Ray 2 | Commercial | 2.038 | 2.532 | 1.363 | 1.804 |
| 10 | VACE | Open-source | 2.027 | 3.172 | 1.180 | 1.775 |
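The GeoAgg column can be read as a weighted geometric mean of the three dimension scores, with IF weighted twice as heavily (α=2). The exact formula is defined by the benchmark, so treat the sketch below as one plausible reading rather than the published implementation; the leaderboard numbers may be computed differently.

```python
def geo_agg(if_score, rq, ee, alpha=2, beta=1, gamma=1):
    """Weighted geometric mean with per-dimension exponents (assumed GeoAgg form)."""
    total = alpha + beta + gamma
    return (if_score**alpha * rq**beta * ee**gamma) ** (1.0 / total)

# Because IF carries the largest exponent, GeoAgg tracks instruction
# following more closely than render quality or edit exclusivity.
print(geo_agg(3.033, 3.588, 3.043))
```

A geometric mean also penalizes a near-zero score on any single dimension more sharply than an arithmetic average would, which matches the intent of ranking by all three dimensions jointly.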
Each demo shows the original video (left) alongside the edited video (right).
| 5,049 Annotated Examples | 1,419 Source Videos |
| 9 Categories / 32 Subcategories | 10 Editing Systems |
| 3 Quality Dimensions (IF, RQ, EE) | 300 Benchmark Test Pairs |
| Model | Backbone | Params | HuggingFace | Status |
|---|---|---|---|---|
| VEFX-Reward-4B | Qwen3-VL-4B-Instruct | 4B | xiangbog/VEFX-Reward-4B | Available |
| VEFX-Reward-32B | Qwen3-VL-32B-Instruct | 32B | TBD | Coming soon |
```bash
conda create -n vefx-bench python=3.10 -y
conda activate vefx-bench

# Install PyTorch first (match your CUDA version)
# See https://pytorch.org/get-started/locally/ for the right command
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

# Install remaining dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
```

Requirements: Python ≥ 3.10, a CUDA GPU with ~10 GB VRAM (bfloat16). Make sure your PyTorch CUDA version matches your driver.
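A quick, dependency-free sanity check of the requirements above. This only inspects the local environment (Python version and whether the NVIDIA driver tools are on `PATH`); it does not verify that your PyTorch build matches your CUDA driver.

```python
import shutil
import sys

# Python >= 3.10 is required by the instructions above.
print("python >= 3.10:", sys.version_info >= (3, 10))

# nvidia-smi on PATH is a rough proxy for a usable NVIDIA GPU + driver.
print("nvidia-smi on PATH:", shutil.which("nvidia-smi") is not None)
```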
```python
from vefx_reward import VEFXReward

model = VEFXReward("xiangbog/VEFX-Reward-4B", device="cuda")
scores = model.score(
    original_video="examples/sample_videos/object_removal_original.mp4",
    edited_video="examples/sample_videos/object_removal_edited.mp4",
    instruction="Remove the woman with the grey backpack walking on the right side of the frame.",
)
print(scores)
# {'IF': 2.34, 'RQ': 1.93, 'EE': 1.82, 'Overall': 6.09}
```

Or run the bundled quick-start script:

```bash
python examples/quick_start.py \
    --original examples/sample_videos/object_removal_original.mp4 \
    --edited examples/sample_videos/object_removal_edited.mp4 \
    --instruction "Remove the woman with the grey backpack walking on the right side of the frame."
```

The repo includes 4 sample video pairs with prompts. Score them all:
```python
import json

from vefx_reward import VEFXReward

model = VEFXReward("xiangbog/VEFX-Reward-4B", device="cuda")

with open("examples/sample_videos/prompts.json") as f:
    samples = json.load(f)

for sample in samples:
    scores = model.score(
        original_video=f"examples/sample_videos/{sample['original']}",
        edited_video=f"examples/sample_videos/{sample['edited']}",
        instruction=sample["instruction"],
    )
    print(f"[{sample['category']}] IF={scores['IF']:.2f} RQ={scores['RQ']:.2f} EE={scores['EE']:.2f}")
```

Prepare a CSV with columns `original_video`, `edited_video`, `instruction`:
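For example, the CSV can be generated with the standard library. The paths and instruction below are placeholders; substitute your own video pairs.

```python
import csv

# Illustrative rows only; replace with your own video paths and edit prompts.
rows = [
    {
        "original_video": "videos/clip1_original.mp4",
        "edited_video": "videos/clip1_edited.mp4",
        "instruction": "Remove the red car parked in the background.",
    },
]

with open("edits.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["original_video", "edited_video", "instruction"]
    )
    writer.writeheader()
    writer.writerows(rows)
```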
```bash
python examples/batch_scoring.py --csv edits.csv --output results.csv
```

For large-scale evaluation across multiple GPUs:
```bash
python examples/multi_gpu_scoring.py --csv edits.csv --num_gpus 4 --output results.csv
```

```python
VEFXReward(
    model_path="xiangbog/VEFX-Reward-4B",  # HuggingFace ID or local path
    device="cuda",                         # "cuda", "cuda:0", "cpu"
    dtype=torch.bfloat16,                  # torch.bfloat16 or torch.float16
    fps=4.0,                               # Video sampling rate
    max_frame_pixels=399360,               # Max pixels per frame
)
```

`score(...)` scores a single video edit and returns `{'IF': float, 'RQ': float, 'EE': float, 'Overall': float}`.
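A back-of-envelope sketch of what the `fps` and `max_frame_pixels` defaults imply. The sampling semantics assumed here (frames drawn at `fps`, each frame capped at `max_frame_pixels`) are an interpretation of the parameter names, not documented behavior.

```python
fps = 4.0
max_frame_pixels = 399360  # equals 832 * 480, i.e. roughly a 480p frame
clip_seconds = 10

# At 4 frames/second, a 10 s clip contributes about 40 sampled frames.
sampled_frames = int(fps * clip_seconds)
print("frames for a 10 s clip:", sampled_frames)
print("pixel cap matches 832x480:", 832 * 480 == max_frame_pixels)
```

Longer clips therefore grow the visual context linearly, which is why the per-frame pixel cap matters for staying within the ~10 GB VRAM budget noted above.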
Score multiple edits sequentially. Each sample is processed independently to avoid OOM.
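That sequential pattern can be sketched generically. `score_all` and `fake_score` below are illustrative names only, with the stub standing in for `model.score` so the control flow runs without a GPU or the package installed.

```python
def score_all(samples, score_fn):
    """Score samples one at a time so a failure on one clip cannot sink the batch."""
    results = []
    for sample in samples:
        try:
            # One sample at a time: an oversized clip or runtime error only
            # affects this entry, not the whole run.
            results.append(score_fn(**sample))
        except RuntimeError as err:
            results.append({"error": str(err)})
    return results


def fake_score(original_video, edited_video, instruction):
    """Stub scorer used only to demonstrate the control flow."""
    return {"IF": 3.0, "RQ": 3.0, "EE": 3.0}


print(score_all(
    [{"original_video": "a.mp4", "edited_video": "b.mp4", "instruction": "x"}],
    fake_score,
))
```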
```bibtex
@article{gao2025vefxbench,
  title={VEFX-Bench: Benchmarking Generic Video Editing and Visual Effects},
  author={Xiangbo Gao and Sicong Jiang and Bangya Liu and Xinghao Chen and Minglai Yang and Siyuan Yang and Mingyang Wu and Jiongze Yu and Qi Zheng and Haozhi Wang and Jiayi Zhang and Jared Yang and Jie Yang and Zihan Wang and Qing Yin and Zhengzhong Tu},
  journal={arXiv preprint arXiv:2604.16272},
  year={2026}
}
```

This project is licensed under the Apache License 2.0. See LICENSE for details.



