The Data Science Lab
since 2005
KDD26-Research: Text-to-Image Video Editing

VIVID: Backbone Training-Free Text-to-Image Video Editing via Variational Latent
Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao, KDD, 2026, Research Track.

Backbone training-free video editing built on pre-trained text-to-image (T2I) diffusion models enables lightweight, prompt-driven edits without additional fine-tuning. A critical yet often overlooked factor is cross-frame latent selection during DDIM inversion, which largely determines spatiotemporal coherence in the subsequent denoising process. Existing pipelines typically rely on static, heuristic keyframe policies and temperature-softmax responsibilities, which scale poorly: they suffer from numerical sensitivity and scale bias that degrade generalization across diverse scenes. In this paper, we propose VIVID (Variational Inference for Video editing with Image Diffusion), an uncertainty-aware variational latent anchoring module that dynamically selects informative frames and compresses cross-frame latents into a compact set of semantic anchors. VIVID learns stable assignments via a variational objective with contrastive alignment and prior regularization, producing anchors that preserve spatial details while enforcing temporal continuity, and can be plugged into existing backbone training-free T2I-based video editing frameworks as a drop-in replacement for heuristic selection. Extensive experiments on standard benchmarks and in-the-wild videos demonstrate that VIVID achieves state-of-the-art inversion fidelity, editing quality, and temporal consistency, while reducing memory and runtime compared with prior backbone training-free baselines.
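
The scale bias attributed to temperature-softmax responsibilities can be illustrated with a minimal sketch in plain Python. This is not the VIVID method, nor the implementation of any existing pipeline: the dot-product scoring, the two toy anchors, the 2-D latent, and the function names `softmax` and `responsibilities` are all assumptions made for illustration. It only shows that rescaling a latent changes its soft assignment even though its direction is unchanged.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def responsibilities(latent, anchors, temperature=1.0):
    """Soft assignment of one latent to each anchor (assumed dot-product scores)."""
    scores = [sum(x * a for x, a in zip(latent, anchor)) for anchor in anchors]
    return softmax(scores, temperature)


# Hypothetical toy data: two orthogonal anchors and one 2-D "frame latent".
anchors = [[1.0, 0.0], [0.0, 1.0]]
latent = [0.6, 0.4]

r_small = responsibilities(latent, anchors)
# Same direction, 10x the magnitude: the assignment becomes much sharper.
r_large = responsibilities([10 * x for x in latent], anchors)

print("original scale:", r_small)
print("10x scale:     ", r_large)
```

In this toy setting the fixed temperature interacts with the latent magnitude, so the same frame content is assigned softly at one scale and almost exclusively to the first anchor at another, which is the kind of sensitivity a heuristic temperature has to be tuned around.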

Contact us
AI Lab, School of Computing, Faculty of Science and Engineering, Macquarie University
Macquarie University Frontier AI Research Centre
Level 3, 3 Innovation Road, Macquarie University, NSW 2109, Australia
Tel: +61-2-9850 9583
Staff: firstname.surname(a)mq.edu.au
Students: firstname.surname(a)students.mq.edu.au
Contacts@datasciences.org