A new AI model called EditVerse, developed by researchers from the Chinese University of Hong Kong, Adobe Research, and Johns Hopkins University, attempts to close the wide gap between traditional image editing and video editing. Its biggest breakthrough is a unified framework that lets users edit videos and generate complex details with intuitive operations similar to editing pictures in Photoshop.
The research team pointed out that AI video editing has long been held back by two obstacles: architectural barriers (models are typically image-specific or video-specific) and data scarcity (high-quality annotated video data is far scarcer than image data). EditVerse aims to address both challenges at once.
Core technologies: Universal visual language and contextual learning
EditVerse's core methodology includes:
• Creating a "universal visual language": the model converts text, images, and videos into a unified one-dimensional token sequence, enabling the AI to understand and process visual information from different modalities in the same way.
• Powerful in-context learning: built on a Transformer architecture with full self-attention, EditVerse concatenates the entire token sequence, including the instruction and the source material, so it can accurately model the relationships among its components (such as the instruction text, specific objects in the video, and the style of a reference image). This design also lets it flexibly handle inputs of varying resolutions and durations.
• Building a "knowledge transfer bridge": thanks to the unified framework, EditVerse can seamlessly transfer knowledge learned from massive image editing data (such as styles and special effects) to video editing tasks, greatly alleviating the scarcity of video data.
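The first two ideas above can be illustrated with a minimal NumPy sketch. The tokenizer names, dimensions, and weights below are all hypothetical stand-ins, not EditVerse's actual implementation: each modality is mapped to embeddings of a shared width, concatenated into one flat sequence, and a single full self-attention pass relates every token to every other token.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding dimension (illustrative)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-modality tokenizers: each maps its input to (n_tokens, D).
def embed_text(n_words):                 # e.g. the editing instruction
    return rng.normal(size=(n_words, D))

def embed_image(h, w, patch=8):          # e.g. a reference image, patchified
    return rng.normal(size=((h // patch) * (w // patch), D))

def embed_video(frames, h, w, patch=8):  # source video, per-frame patches
    return rng.normal(size=(frames * (h // patch) * (w // patch), D))

# "Universal visual language": one flat 1-D token sequence for everything.
seq = np.concatenate([
    embed_text(12),           # instruction tokens
    embed_image(32, 32),      # reference-image tokens
    embed_video(4, 32, 32),   # video tokens (any length/resolution would do)
], axis=0)

def full_self_attention(x, Wq, Wk, Wv):
    """Every token attends to every other token, so instruction words can
    directly relate to specific video patches or reference-image patches."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]))
    return attn @ v

Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.05 for _ in range(3))
out = full_self_attention(seq, Wq, Wk, Wv)
print(out.shape)  # one output per input token, regardless of modality mix
```

Because the sequence is just a concatenation, changing the video's length or resolution only changes the number of tokens, not the architecture, which is the property the article attributes to this design.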
Overcoming Data Scarcity and Establishing the EditVerseBench Benchmark
To address the shortage of training data, the research team built an automated data pipeline that uses a variety of dedicated AI models to generate large numbers of video editing samples. These were then filtered by a vision-language model (VLM), ultimately yielding 232,000 high-quality video editing samples.
This batch of data was mixed with 600 million image editing samples and 390 million video generation samples during training, which strengthened the model's knowledge transfer capabilities.
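A mixed-data training loop of this kind is often implemented by sampling each batch from several pools with fixed weights. The sketch below is a hypothetical illustration; the pool sizes are placeholders and the mixing weights are assumptions, not values from the paper. The point is that the scarce video-editing pool can be over-sampled relative to its size so the model still sees it regularly.

```python
import random

# Hypothetical training pools (sizes are placeholders, not the real counts).
pools = {
    "image_edit": [f"img_edit_{i}" for i in range(600)],   # image editing pairs
    "video_gen":  [f"video_gen_{i}" for i in range(390)],  # video generation clips
    "video_edit": [f"video_edit_{i}" for i in range(23)],  # curated video edits
}

# Assumed mixing weights: over-sample the scarce video-editing data so it
# appears far more often than its raw share of the corpus would allow.
weights = {"image_edit": 0.5, "video_gen": 0.3, "video_edit": 0.2}

def sample_batch(rng, batch_size=8):
    """Draw a training batch whose composition follows the mixing weights."""
    names = list(pools)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        pool = rng.choices(names, weights=probs)[0]
        batch.append((pool, rng.choice(pools[pool])))
    return batch

rng = random.Random(0)
batch = sample_batch(rng)
print(len(batch))  # 8 mixed-source training examples
```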
To evaluate the model rigorously, the team also released the first comprehensive benchmark for instruction-based video editing, "EditVerseBench." The benchmark includes 100 videos of varying resolutions, covering 20 editing tasks.
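A benchmark like this typically reports each model's per-metric average over all videos. The record structure, task names, and scores below are invented for illustration; only the metric names echo those mentioned later in the article.

```python
from statistics import mean

# Hypothetical benchmark records: one per (video, task) pair, each scored
# on several automatic metrics (values here are made up for illustration).
results = [
    {"task": "object_removal", "video_quality": 0.82, "text_alignment": 0.74,
     "temporal_consistency": 0.88, "vlm_score": 0.79},
    {"task": "style_transfer", "video_quality": 0.78, "text_alignment": 0.81,
     "temporal_consistency": 0.85, "vlm_score": 0.83},
]

def aggregate(records):
    """Average each metric over all benchmark videos, leaderboard-style."""
    metrics = [k for k in records[0] if k != "task"]
    return {m: round(mean(r[m] for r in records), 3) for m in metrics}

print(aggregate(results))
```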
Outperforming Runway and demonstrating "emergent abilities"
On EditVerseBench, EditVerse leads existing open-source models (such as TokenFlow and InsV2V) on multiple automated metrics, including video quality, text alignment, temporal consistency, and VLM score.
More notably, EditVerse even outperformed the closed-source commercial model Runway Aleph on the VLM score (judged by GPT-4o), the metric closest to human preference. In the human evaluation phase, EditVerse was preferred by 51.7% of users, again surpassing Runway Aleph.
The researchers also found that EditVerse exhibits surprising "emergent abilities." Even though its video training data contained no examples of "material transformation" or "special-effects addition" (e.g., turning a turtle into crystal or adding a time-lapse effect to the sky), the model still understood such instructions and completed the tasks successfully.
Ablation experiments, in which the model's capabilities dropped significantly once image editing data was removed, showed that this "self-taught" ability stems mainly from deep visual principles the model learned from massive amounts of image data and successfully transferred to video editing.
A new era of creation
The emergence of EditVerse not only provides a powerful new tool but may also herald a new content creation paradigm, one that moves from separation to unification and from cumbersome to simple, and that is expected to bring professional-level video editing capabilities to many more creators.
The related paper, project homepage, and test code have all been made public.