Hyper Diffusion Avatars: Dynamic Human Avatar Generation using Network Weight Space Diffusion



University of Bonn,

Max Planck Institute for Informatics


Abstract

Creating human avatars is a highly desirable yet challenging task. Recent advancements in radiance field rendering have achieved unprecedented photorealism and real-time performance for personalized dynamic human avatars. However, these approaches are typically limited to person-specific rendering models trained on multi-view video of a single individual, which prevents them from generalizing across identities. On the other hand, generative approaches that leverage prior knowledge from pre-trained 2D diffusion models tend to produce cartoonish, static human avatars, which are animated through simple skeleton-based articulation. As a result, the avatars generated by these methods suffer from lower rendering quality than person-specific rendering methods and fail to capture pose-dependent deformations such as cloth wrinkles. In this paper, we propose a novel approach that unites the strengths of person-specific rendering and diffusion-based generative modeling to enable dynamic human avatar generation with both high photorealism and realistic pose-dependent deformations. Our method follows a two-stage pipeline: first, we optimize a set of person-specific UNets, with each network representing a dynamic human avatar that captures intricate pose-dependent deformations. In the second stage, we train a hyper diffusion model over the optimized network weights. During inference, our method generates network weights for real-time, controllable rendering of dynamic human avatars. Using a large-scale, cross-identity, multi-view video dataset, we demonstrate that our approach outperforms state-of-the-art human avatar generation methods.


Dynamic 3D Human Representation

Dynamic human representation learning based on a UNet. Given a specific human pose, pose-dependent position and normal maps are generated via inverse texture mapping. These maps serve as input to the UNet, which predicts pose-dependent 3D Gaussians for rendering.
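To make the mapping concrete, the following is a minimal PyTorch sketch of such a person-specific UNet. The channel counts, texture resolution, and the exact Gaussian parameterization (position offset, rotation, scale, opacity, color) are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a person-specific UNet mapping pose-dependent texture maps
# to 3D Gaussian parameters. Channel counts and the Gaussian parameterization
# are assumptions for illustration, not the authors' exact architecture.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with ReLU, used on both encoder and decoder paths."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class AvatarUNet(nn.Module):
    """Maps pose-dependent position + normal maps (6 channels in texture space)
    to per-texel 3D Gaussian parameters:
    3 position offsets + 4 rotation (quaternion) + 3 scales + 1 opacity + 3 colors = 14 channels.
    """
    def __init__(self, in_ch=6, out_ch=14, base=32):
        super().__init__()
        self.enc1 = ConvBlock(in_ch, base)
        self.enc2 = ConvBlock(base, base * 2)
        self.bottleneck = ConvBlock(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = ConvBlock(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = ConvBlock(base * 2, base)
        self.head = nn.Conv2d(base, out_ch, 1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        e1 = self.enc1(x)                   # (B, base,   H,   W)
        e2 = self.enc2(self.pool(e1))       # (B, 2*base, H/2, W/2)
        b = self.bottleneck(self.pool(e2))  # (B, 4*base, H/4, W/4)
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                # per-texel Gaussian parameters

# Usage: position and normal maps from inverse texture mapping, stacked channel-wise.
pose_maps = torch.randn(1, 6, 256, 256)     # (position map, normal map)
gaussians = AvatarUNet()(pose_maps)         # (1, 14, 256, 256)

Each texel of the output map can then be interpreted as one 3D Gaussian, so the map resolution directly controls the number of Gaussians used for rendering.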

Network weight space diffusion

Diffusion process on the network weight space. During the forward diffusion process, standard Gaussian noise at time step t is added to the network weights, and the transformer takes the noisy weights as well as the time step t to predict the denoised weights.
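Below is a minimal PyTorch sketch of this weight-space diffusion step. The tokenization of the UNet weights into fixed-size chunks, the linear noise schedule, and the choice to regress the clean weights directly are illustrative assumptions rather than the authors' exact setup.

# Minimal sketch of diffusion over network weights. Token layout, noise schedule,
# and the x0-regression objective are assumptions for illustration.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward_diffuse(w0, t):
    """q(w_t | w_0): add standard Gaussian noise to the tokenized UNet weights."""
    noise = torch.randn_like(w0)
    a = alphas_cumprod[t].view(-1, 1, 1)
    return a.sqrt() * w0 + (1.0 - a).sqrt() * noise

class WeightDenoiser(nn.Module):
    """Transformer that takes noisy weight tokens and the time step t,
    and predicts the denoised weights."""
    def __init__(self, token_dim=512, n_tokens=256, depth=6, heads=8):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, token_dim), nn.SiLU(),
                                        nn.Linear(token_dim, token_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, token_dim))
        layer = nn.TransformerEncoderLayer(token_dim, heads, 4 * token_dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(token_dim, token_dim)

    def forward(self, w_t, t):
        # w_t: (B, n_tokens, token_dim) -- UNet weights chunked into fixed-size tokens
        temb = self.time_embed(t.float().view(-1, 1, 1) / T)  # (B, 1, token_dim)
        h = self.blocks(w_t + self.pos_embed + temb)
        return self.out(h)                                    # predicted clean weights

# One training step: diffuse the optimized person-specific weights, regress them back.
w0 = torch.randn(4, 256, 512)               # batch of tokenized UNet weights
t = torch.randint(0, T, (4,))
w_t = forward_diffuse(w0, t)
loss = nn.functional.mse_loss(WeightDenoiser()(w_t, t), w0)

At inference time, the same denoiser would be applied iteratively starting from pure noise, and the resulting weights would be loaded into the UNet above for real-time, pose-controllable rendering.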

Citation

@misc{cao2025hyperdiffusionavatarsdynamic,
  title={Hyper Diffusion Avatars: Dynamic Human Avatar Generation using Network Weight Space Diffusion},
  author={Dongliang Cao and Guoxing Sun and Marc Habermann and Florian Bernard},
  year={2025},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2509.04145}
}