Research Project

PixelWizard

Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution

Wenxue Li*, Jingjing Ren*, Peng Zhang*, Tian Ye, Daiguo Zhou, Jian Luan, Lei Zhu

[Figure: PixelWizard teaser]

Abstract

High-resolution video generation is limited by two coupled bottlenecks: unstable optimization and prohibitive computational cost. As the token sequence grows, optimization becomes biased toward local textures, which weakens global coherence, while training and inference costs both rise.

PixelWizard addresses this by hierarchically decoupling global structure modeling from high-resolution detail synthesis. It first establishes a compact spatiotemporal anchor that concentrates dense structural priors, then uses this anchor to guide fine-grained synthesis at native high resolution.
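The coarse-to-fine idea can be illustrated with a deliberately tiny 1-D toy. This is a sketch of the general principle only, not the PixelWizard architecture: the function names `generate_anchor`, `synthesize_detail`, and `downsample` are hypothetical, and a random walk and uniform noise stand in for the two learned generators. The first stage fixes the global shape on a short sequence, and the second stage only adds zero-mean local variation, so pooling the output recovers the anchor.

```python
import random

def generate_anchor(length, seed=0):
    """Stage 1 stand-in: a short sequence that fixes the global shape."""
    rng = random.Random(seed)
    value, anchor = 0.0, []
    for _ in range(length):
        value += rng.uniform(-1.0, 1.0)  # random walk = coarse global structure
        anchor.append(value)
    return anchor

def synthesize_detail(anchor, factor, amplitude=0.1, seed=1):
    """Stage 2 stand-in: expand each anchor cell into `factor` samples,
    adding local texture that is zero-mean within the cell."""
    rng = random.Random(seed)
    out = []
    for a in anchor:
        noise = [rng.uniform(-amplitude, amplitude) for _ in range(factor)]
        mean = sum(noise) / factor
        out.extend(a + n - mean for n in noise)
    return out

def downsample(signal, factor):
    """Average-pool the signal back to the anchor resolution."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]

anchor = generate_anchor(8)                 # compact global structure
video = synthesize_detail(anchor, 16)       # 16x more samples, same layout
```

In this toy, `downsample(video, 16)` matches `anchor` up to floating-point error, which is the sense in which the detail stage leaves global structure untouched.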

To break the inference bottleneck, PixelWizard introduces Noise-Span Aligned Shortcut Training, together with Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration. This enables robust few-step generation without memory-heavy teacher-student distillation.

Method

From compact anchors to high-resolution detail

01. Spatial-Temporal Anchor Modeling

PixelWizard models global motion and layout in a compact high-density latent space, where long-range spatiotemporal structure is easier to optimize.

02. Anchor-Guided High-Res Synthesis

The generated anchor is injected into the DiT backbone as a structural prior, allowing high-resolution synthesis to focus on local textures and fine details.

03. Noise-Span Aligned Shortcut Training

A step-size-aware shortcut objective lets the model traverse the generation trajectory with large denoising steps while remaining stable on high-resolution grids.
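To make "step-size-aware" concrete, the sketch below uses the generic shortcut-model formulation, in which the predicted direction depends on the step size d and a step of size 2d is trained to match two chained steps of size d. It is an assumption that PixelWizard builds on this formulation. The Noise-Span Aligned variant, Exponential Index-Biased Sampling, and Adaptive Noise-Span Calibration are not reproduced here. A scalar linear ODE with a known closed-form flow stands in for the learned network, and the names `shortcut`, `consistency_target`, and `sample` are hypothetical.

```python
import math

A = -1.5  # toy ODE dx/dt = A * x, chosen because its flow is known exactly

def shortcut(x, t, d):
    """Ideal step-size-aware direction: x(t + d) = x + d * shortcut(x, t, d).
    `t` is unused here only because the toy ODE does not depend on time."""
    if d == 0.0:
        return A * x  # instantaneous velocity, the small-step limit
    return x * (math.exp(A * d) - 1.0) / d

def consistency_target(x, t, d):
    """Self-consistency target for a 2d step: the average of two chained d steps."""
    s1 = shortcut(x, t, d)
    x_mid = x + d * s1
    s2 = shortcut(x_mid, t + d, d)
    return 0.5 * (s1 + s2)

def sample(x0, num_steps, step_aware=True):
    """Integrate from t = 0 to t = 1 in `num_steps` equal steps."""
    d = 1.0 / num_steps
    x, t = x0, 0.0
    for _ in range(num_steps):
        s = shortcut(x, t, d if step_aware else 0.0)
        x, t = x + d * s, t + d
    return x

exact = 2.0 * math.exp(A)                                    # true endpoint
few_step_shortcut = sample(2.0, num_steps=2)                 # matches `exact`
few_step_euler = sample(2.0, num_steps=2, step_aware=False)  # large error
```

With only two steps, the step-size-aware direction lands on the exact endpoint, while plain Euler integration of the instantaneous velocity does not. In a trained model the target would come from the network itself, typically with gradients stopped, rather than from a closed-form flow.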

Results

High visual quality with practical inference speed

2K
2560 × 1440

121-frame videos with strong subject consistency and substantially lower latency.

4K
3840 × 2144

Ultra-large spatial resolution with preserved coherent structures and fine local details.

Speed
>10× faster

Accelerated native 2K/4K sampling through few-step inference without distillation.
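For a sense of scale, the following back-of-the-envelope sketch estimates the transformer sequence length at the two resolutions above. The compression factors are assumptions, not values reported on this page: 8× spatial and 4× temporal VAE downsampling with a 2 × 2 patch size, which are common in latent video diffusion models. The frame count for 4K is assumed to match the 121 frames stated for 2K, and `token_count` is a hypothetical helper.

```python
def token_count(width, height, frames,
                spatial_down=8, temporal_down=4, patch=2):
    """Sequence length of a latent video transformer.
    All compression factors are illustrative assumptions."""
    # Causal video VAEs commonly keep the first frame and compress the rest.
    latent_frames = (frames - 1) // temporal_down + 1
    stride = spatial_down * patch  # pixels covered by one token per axis
    if width % stride or height % stride:
        raise ValueError("resolution must be divisible by spatial_down * patch")
    return latent_frames * (height // stride) * (width // stride)

tokens_2k = token_count(2560, 1440, 121)  # 31 * 90 * 160
tokens_4k = token_count(3840, 2144, 121)  # 31 * 134 * 240

ratio = tokens_4k / tokens_2k   # growth in sequence length
attention_ratio = ratio ** 2    # growth in full self-attention pairs
```

Under these assumptions the 2K setting yields 446,400 tokens and the 4K setting 996,960. That is about 2.2× more tokens and roughly 5× more pairwise attention interactions, which is why few-step sampling matters at these sizes.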

Citation

@misc{li2026pixelwizard,
  title={PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution},
  author={Li, Wenxue and Ren, Jingjing and Zhang, Peng and Ye, Tian and Zhou, Daiguo and Luan, Jian and Zhu, Lei},
  year={2026}
}