VIVID: Backbone Training-Free Text-to-Image Video Editing via Variational Latent
Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao, KDD, 2026, Research Track.
Backbone training-free video editing built on pre-trained text-to-image (T2I) diffusion models enables lightweight, prompt-driven edits without additional finetuning. A critical yet often overlooked factor is cross-frame latent selection during DDIM inversion, which largely determines spatiotemporal coherence in the subsequent denoising process. Existing pipelines typically rely on static, heuristic keyframe policies and temperature-softmax responsibilities. These scale poorly: they are numerically sensitive and scale-biased, which degrades generalization across diverse scenes. In this paper, we propose VIVID (Variational Inference for Video editing with Image Diffusion), an uncertainty-aware variational latent anchoring module that dynamically selects informative frames and compresses cross-frame latents into a compact set of semantic anchors. VIVID learns stable assignments via a variational objective with contrastive alignment and prior regularization, producing anchors that preserve spatial details while enforcing temporal continuity. It can be plugged into existing backbone training-free T2I-based video editing frameworks as a drop-in replacement for heuristic selection. Extensive experiments on standard benchmarks and in-the-wild videos demonstrate that VIVID achieves state-of-the-art inversion fidelity, editing quality, and temporal consistency, while reducing memory and runtime compared with prior backbone training-free baselines.
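
To make the scale-bias concern concrete, the following is a minimal sketch of a generic temperature-softmax responsibility computation. It is not the VIVID module or any specific baseline's code; the function name `softmax_responsibilities` and the score values are hypothetical and chosen only for illustration. It shows that rescaling the similarity scores, with the ranking unchanged, sharply changes the resulting assignment unless the temperature is retuned to match.

```python
import math


def softmax_responsibilities(scores, temperature=1.0):
    """Generic temperature-softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical frame-to-keyframe similarity scores (illustrative values only).
scores = [1.0, 2.0, 3.0]

# Baseline responsibilities at temperature 1.
base = softmax_responsibilities(scores, temperature=1.0)

# Same ranking, but scores rescaled by 10x (e.g., latents with a larger norm):
# the assignment collapses almost entirely onto the top-scoring entry.
rescaled = softmax_responsibilities([10.0 * s for s in scores], temperature=1.0)

# Matching the temperature to the new scale recovers the baseline,
# which is why a fixed temperature does not transfer across scenes.
retuned = softmax_responsibilities([10.0 * s for s in scores], temperature=10.0)
```

In this sketch, a single fixed temperature yields a soft assignment at one score scale and a nearly one-hot assignment at another, which is the kind of sensitivity the abstract attributes to heuristic temperature-softmax responsibilities.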
