Abstract

Video customization based on Text-to-Video (T2V) models aims to learn specific features from reference data to generate controllable videos. While significant strides have been made in image stylization and video motion customization, simultaneously controlling multiple concepts, such as content, style, and motion, remains a major challenge. In this work, we give the first systematic definition of the multi-concept video customization task. To facilitate research in this area, we construct a comprehensive benchmark and propose Disco-LoRA, a unified framework that disentangles and flexibly recombines different concepts in two stages: (1) We decompose the objective into two sub-tasks, Content-Style and Content-Motion. Each sub-task is addressed with our Iterative Dual-LoRA Disentanglement Framework, which separates the distinct concepts present in the data. (2) We find that layer-wise weight trends are crucial for LoRA identity, while weight magnitudes dictate composability. To harmonize magnitudes across LoRAs, we propose a Z-score-based statistical regularization that aligns weight distributions, preserving layer-wise trends while minimizing interference between different LoRAs. Extensive experiments show that Disco-LoRA excels in multi-concept video customization, effectively preserving appearance, style, and motion for controllable text-to-video generation.
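The idea in stage (2), aligning magnitudes while keeping layer-wise trends, can be illustrated with a small sketch. This page does not give the actual regularization formula, so everything below is an assumption made for illustration: the function name `zscore_align`, the use of per-layer weight norms as the statistic, the shared `target_mean` and `target_std`, and the example numbers are all hypothetical. The sketch only shows that standardizing and then rescaling is a positive affine map, so the ordering of layers is unchanged while the overall scale becomes comparable across LoRAs.

```python
from statistics import mean, pstdev


def zscore_align(layer_norms, target_mean, target_std, eps=1e-8):
    """Standardize per-layer magnitudes, then map them onto a shared scale.

    Hypothetical helper: illustrates Z-score alignment in general,
    not the exact regularizer used by Disco-LoRA.
    """
    mu = mean(layer_norms)
    sigma = pstdev(layer_norms)
    # Positive affine transform: relative layer-wise trends are preserved.
    return [target_mean + target_std * (x - mu) / (sigma + eps) for x in layer_norms]


def rank_order(values):
    """Indices of the layers sorted by magnitude (the 'trend' we want to keep)."""
    return sorted(range(len(values)), key=values.__getitem__)


# Made-up per-layer weight norms for two independently trained LoRAs.
content_norms = [0.2, 0.4, 0.9, 0.5]
motion_norms = [2.0, 6.0, 3.0, 1.0]

aligned_content = zscore_align(content_norms, target_mean=1.0, target_std=0.25)
aligned_motion = zscore_align(motion_norms, target_mean=1.0, target_std=0.25)
```

After alignment both sequences share the same mean and spread, yet `rank_order(aligned_content)` equals `rank_order(content_norms)`, which is the property the abstract attributes to preserving layer-wise trends.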

Overall Framework of Disco-LoRA

Overview of Disco-LoRA. We train the Content, Style, and Motion LoRAs independently using our Iterative Dual-LoRA Disentanglement Framework. For each reference sample, we simultaneously train a Target LoRA alongside a second LoRA that is to be disentangled from it, and use only the Target LoRA for the final output. Furthermore, we apply Z-Score-Based Statistical Regularization to constrain parameter distributions and prevent concept bleeding. This design enables free-form multi-concept video customization at inference time.
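One possible reading of the dual-LoRA setup is sketched below: two low-rank updates are added to a frozen base weight during training, and only the target update is kept afterwards. This is an assumption for illustration, not the authors' implementation; the helper names (`matmul`, `add`, `lora_delta`, `effective_weight`), the additive combination of the two branches, and the toy 2x2 matrices are all hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_delta(B, A, scale=1.0):
    """Low-rank weight update: scale * (B @ A)."""
    return [[scale * v for v in row] for row in matmul(B, A)]


def effective_weight(W, target, auxiliary=None):
    """Base weight plus the target LoRA, plus the auxiliary LoRA if given.

    `target` and `auxiliary` are (B, A) pairs. Passing `auxiliary=None`
    mimics inference, where only the target branch is used (assumed behavior).
    """
    out = add(W, lora_delta(*target))
    if auxiliary is not None:
        out = add(out, lora_delta(*auxiliary))
    return out


# Toy frozen base weight and two rank-1 adapters (made-up values).
W = [[1.0, 0.0], [0.0, 1.0]]
target_lora = ([[1.0], [0.0]], [[0.5, 0.0]])
auxiliary_lora = ([[0.0], [1.0]], [[0.0, 0.25]])

train_W = effective_weight(W, target_lora, auxiliary_lora)  # both branches active
infer_W = effective_weight(W, target_lora)                  # target branch only
```

In this toy setup the auxiliary branch changes only the entry it owns, so dropping it at inference leaves the target branch's contribution intact.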

Task 1: Content + Material + Object-motion Customization

Customizing videos with content, material appearance, and object motion.

Comparisons with baselines

Task 2: Content + Artstyle + Object-motion Customization

Customizing videos with content, art style appearance, and object motion.

Comparisons with baselines

Task 3: Content + Material + Camera-move Customization

Customizing videos with content, material appearance, and camera motion.

Comparisons with baselines

Task 4: Content + Artstyle + Camera-move Customization

Customizing videos with content, art style appearance, and camera motion.

Comparisons with baselines
