AVE-Compass: Towards Holistic Evaluation for Audio-Video Editing Abilities
Submitted to NeurIPS 2026 (under review), 2026
AVE-Compass is a holistic evaluation framework for audio-video editing abilities. Key contributions include:
Benchmark Construction: Built a comprehensive audio-video dataset covering 33 fine-grained categories and 190 instructions, addressing the gap in cross-modal physical alignment evaluation.
Evaluation Design: Designed a decoupled evaluation pipeline that combines objective physical metrics with a modality-separated MLLM checklist to quantify Instruction Following (IF), Fidelity Preservation (FP), and Realism (R).
Experimental Analysis: Ran the full evaluation on state-of-the-art models, including Wan2.7 and LTX2. Human-machine consistency and robustness analyses revealed core weaknesses in multimodal LLMs, such as “cross-modal intention failure”.
Recommended citation: Lin, Y. et al. (2026). "AVE-Compass: Towards Holistic Evaluation for Audio-Video Editing Abilities." NeurIPS 2026.
