VideoComposer

VideoComposer: Compositional Video Synthesis
with Motion Controllability

Xiang Wang¹*, Hangjie Yuan¹*, Shiwei Zhang¹*, Dayou Chen¹*, Jiuniu Wang¹,
Yingya Zhang¹, Yujun Shen², Deli Zhao¹, Jingren Zhou¹

¹Alibaba Group, ²Ant Group

The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis. However, achieving controllable video synthesis remains challenging due to the large variation of temporal dynamics and the requirement of cross-frame temporal consistency. Based on the paradigm of compositional generation, this work presents VideoComposer, which allows users to flexibly compose a video with textual conditions, spatial conditions, and, more importantly, temporal conditions. Specifically, considering the characteristics of video data, we introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics. In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs, with which the model can make better use of temporal conditions and hence achieve higher inter-frame consistency. Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as text description, sketch sequence, reference video, or even simply hand-crafted motions. Code and models will be made publicly available.
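To make the idea of composing optional conditions through one shared interface more concrete, here is a toy sketch in plain Python. It is not the VideoComposer implementation: the function names (`repeat_single`, `temporal_smooth`, `compose`), the use of flat per-frame feature vectors, the neighbor averaging that stands in for temporal modeling, and fusion by summation are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

Frames = List[List[float]]  # F frames, each a feature vector of length D


def repeat_single(feature: List[float], num_frames: int) -> Frames:
    """Broadcast a single-frame condition (e.g. one image) to every frame."""
    return [list(feature) for _ in range(num_frames)]


def temporal_smooth(frames: Frames) -> Frames:
    """Toy stand-in for temporal modeling: average each frame with its neighbors."""
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - 1): t + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def compose(conditions: Dict[str, Optional[Frames]], num_frames: int, dim: int) -> Frames:
    """Sum all provided conditions; conditions set to None are skipped."""
    fused = [[0.0] * dim for _ in range(num_frames)]
    for name, frames in conditions.items():
        if frames is None:
            continue  # this condition was not supplied by the user
        if len(frames) != num_frames or any(len(f) != dim for f in frames):
            raise ValueError(f"condition {name!r} has the wrong shape")
        for t, frame in enumerate(temporal_smooth(frames)):
            for d, value in enumerate(frame):
                fused[t][d] += value
    return fused


# Hypothetical usage: a single image plus a motion sequence, with depth omitted.
fused = compose(
    {
        "image": repeat_single([1.0, 2.0], 3),
        "depth": None,
        "motion": [[0.0, 0.0], [3.0, 3.0], [0.0, 0.0]],
    },
    num_frames=3,
    dim=2,
)
```

Because every condition is brought to the same per-frame shape before fusion, any subset of conditions can be supplied, which mirrors the combinations listed in the gallery (for example Text + Single Image, with or without Depth).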

Overview: Summary of the Generated Videos

The real source of art is your imagination, and VideoComposer is the solution for bringing it to life.

Ability 1: Compositional Video Synthesis

You can generate videos flexibly in any style that you can imagine.

Text + Sketch

“Rotation view of a beautiful long haired woman standing in the forest”

Style + Hand-crafted Sketch + Hand-crafted motions

"A moving golden moon"

Ability 2: Compositional image-to-video synthesis

Give me a picture, and I will describe a beautiful, vivid and living world for you.

Text + Single Image

"Smiling woman in cowboy hat with wheat ears"

Text + Single Image + Motions

"Smiling woman in cowboy hat with wheat ears"

Text + Single Image

"Bouquet of the yellow leaves. Autumn girl walking in city park"

Text + Single Image + Depth

"Bouquet of the yellow leaves. Autumn girl walking in city park"

Text + Single Image + Depth

“Honey bee collecting pollen on a blooming sunflower”

Text + Single Image

“The Matterhorn with flowers in the foreground”

Text + Single Image + Sketch

“The Matterhorn with flowers in the foreground”

Text + Single Image

“Aerial HA track over Bosphorus in Istanbul”

Text + Single Image + Depth

“Aerial HA track over Bosphorus in Istanbul”

Ability 3: Compositional video inpainting

Give me a broken time, and I will restore it to its most beautiful moment for you.

Text + Mask

“Swan lazily drifting and trimming itself on a calm river”

Text + Mask + Depth

“Swan lazily drifting and trimming itself on a calm river”

Text + Mask + Sketch

“Black swan lazily drifting and trimming itself on a calm river”

Text + Mask

“Coastal view of Cleveland Road and Robin Hood's Bay, blue sky and sea with white clouds”

Text + Mask + Depth

“Coastal view of Cleveland Road and Robin Hood's Bay, blue sky and sea with white clouds”

Text + Mask + Sketch

“Coastal view of Cleveland Road and Robin Hood's Bay, blue sky and sea with white clouds”

Ability 4: Compositional sketch-to-video generation

Give me a little hint, and I will showcase the beauty within your heart.

Text + Single Sketch

“Pigeon sits on a stone with an iron net behind it”

Text + Single Sketch + Depth

“Pigeon sits on a stone with an iron net behind it”

Text + Single Sketch

“A beautiful woman looking at camera in office”

Text + Single Sketch + Style

“A beautiful woman looking at camera in office”

Text + Single Sketch

“Red-backed Shrike lanius collurio”

Text + Single Sketch + Style

“Red-backed Shrike lanius collurio”

Ability 5: Versatile motion control using hand-crafted motions

Tell me how beauty moves with simple strokes, and I will bring it to life for you.

Text + Hand-crafted Motions

“A moving golden moon”

Text + Hand-crafted Motions

“A tiger walking on the grassland”

Text + Hand-crafted Motions

“A moving box”

Text + Hand-crafted Motions

“A golden five-pointed star”
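As a rough illustration of how a hand-drawn stroke could be turned into a motion signal, the sketch below converts a sequence of stroke points into one sparse 2D displacement field per frame. This is an assumption made for illustration, not the procedure used by VideoComposer: the function name `stroke_to_motion_fields`, the `radius` parameter, and the nested-list field layout are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Field = List[List[Tuple[float, float]]]  # field[y][x] = (dx, dy)


def stroke_to_motion_fields(
    stroke: List[Point], height: int, width: int, radius: float = 1.5
) -> List[Field]:
    """Turn a stroke with F+1 points into F sparse motion fields.

    Frame t gets the displacement stroke[t+1] - stroke[t], written to every
    pixel within `radius` of stroke[t]; all other pixels stay at (0, 0).
    """
    fields = []
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        dx, dy = x1 - x0, y1 - y0
        field = [[(0.0, 0.0)] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                if (x - x0) ** 2 + (y - y0) ** 2 <= radius ** 2:
                    field[y][x] = (dx, dy)
        fields.append(field)
    return fields


# Hypothetical usage: an object moves right, then down, on a 5x5 grid.
fields = stroke_to_motion_fields([(1, 1), (3, 1), (3, 3)], height=5, width=5, radius=1.0)
```

Each field is zero everywhere except near the current stroke point, so the stroke only constrains the region the user drew over and leaves the rest of the frame free.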

Ability 6: Motion Transfer

If it is beautiful, I will create an even more beautiful one for you.

Ability 7: Video Translation

If you want it to be better and become the way you want it to be, just tell me in a nutshell.

Ability 8: Video Style Transfer

Seemingly nothing is impossible, if you dare to imagine and dare to act.

Ability 9: Compositional sketch sequence-to-video generation

If it is beautiful, then we can also have it.

“Woman designer making pattern for new dress”

“Luxury expensive table serving for a romantic dinner with candles”

Ability 10: Compositional depth sequence-to-video generation

Give me some simple sketches, and I will turn them into the most beautiful moments for you.

“Yellow daffodils open up their blossoms”

“Ripe cherry tomatoes fall into a bowl of water slow motion”


If you are seeking an exhilarating challenge and the chance to work on AIGC and large-scale pretraining, then you have come to the right place. We are searching for talented, motivated, and imaginative researchers to join our team. If you are interested, please don't hesitate to send us your resume via email: yingya.zyy@alibaba-inc.com

Controllable technology for video content creation.

Video panel labels: Text + Sketch; Image + Depth; Style + Depth + Sketch + Motions; Style + Hand-crafted Sketch + Hand-crafted motions; Text + Single Image (with optional Motions, Depth, or Sketch); Input Masked Videos with Text + Mask (with optional Depth or Sketch); Input Hand Drawings with Text + Single Sketch (with optional Depth or Style); Text + Hand-crafted Motions; Image + Hand-crafted Motions; Input Video and Output Video pairs.