HyperStyle3D: Text-Guided 3D Portrait Stylization via Hypernetworks

Zhuo Chen1, Xudong Xu2, Yichao Yan1, Zhengqin Xu1, Ye Pan1, Wenhan Zhu1, Wayne Wu2, Bo Dai2, Xiaokang Yang1
1MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, 2Shanghai AI Laboratory

Abstract

Portrait stylization is a long-standing task with extensive applications. Although 2D-based methods have made great progress in recent years, real-world applications such as the metaverse and games often demand 3D content. However, the requirement for 3D data, which is costly to acquire, significantly impedes the development of 3D portrait stylization methods. In this paper, inspired by the success of 3D-aware GANs that bridge the 2D and 3D domains by using 3D fields as the intermediate representation for rendering 2D images, we propose a novel method, dubbed HyperStyle3D, based on 3D-aware GANs for 3D portrait stylization. At the core of our method is a hyper-network learned to manipulate the parameters of the generator in a single forward pass. It not only offers a strong capacity to handle multiple styles with a single model, but also enables flexible fine-grained stylization that affects only the texture, shape, or a local part of the portrait. While the use of 3D-aware GANs bypasses the requirement for 3D data, we further alleviate the need for style images by using the CLIP model as the stylization guidance. We conduct extensive experiments across styles, attributes, and shapes, and also measure 3D consistency. These experiments demonstrate the superior capability of our HyperStyle3D model in rendering 3D-consistent images in diverse styles, deforming the face shape, and editing various attributes.

Video

Method Overview

Overview of our full pipeline. (a) Our hyper-module consists of three trainable hyper-networks and a fixed text encoder. Text prompts at three levels (shape, attribute, and style) are encoded into coarse, medium, and fine direction features, which are then fed into the corresponding hyper-networks. The hyper-networks predict three groups of parameter offsets for the coarse, medium, and fine layers of the pre-trained 3D-aware generator. (b) Our hyper-module is trained under the supervision of a CLIP loss and an ID loss. Text prompts of all three levels are integrated into training simultaneously, enabling the hyper-module to handle the combined manipulation of diverse styles, attributes, and shapes. We pre-define the source text as a description related to the current training target text, e.g., “Face” for the target “Bearded face”.
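To make the pipeline concrete, below is a minimal PyTorch sketch, under our own assumptions, of how a hyper-network can map a text-direction feature to parameter offsets for one group of generator layers, together with a directional CLIP loss of the kind described above. The class and function names (HyperNetwork, apply_offsets, directional_clip_loss), the MLP head design, and the offset scale are illustrative assumptions, not the released implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperNetwork(nn.Module):
    """Maps a CLIP text-direction feature to parameter offsets for one
    group (coarse, medium, or fine) of generator layers."""

    def __init__(self, text_dim, layer_shapes, hidden_dim=256):
        super().__init__()
        self.layer_shapes = layer_shapes  # e.g. {"conv0.weight": (512, 512, 3, 3)}
        # One small MLP head per target tensor ('.' is not allowed in ModuleDict keys).
        self.heads = nn.ModuleDict({
            name.replace(".", "_"): nn.Sequential(
                nn.Linear(text_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, math.prod(shape)),
            )
            for name, shape in layer_shapes.items()
        })

    def forward(self, direction):
        # direction: (text_dim,) feature of the target-minus-source text direction.
        return {
            name: self.heads[name.replace(".", "_")](direction).view(shape)
            for name, shape in self.layer_shapes.items()
        }


def apply_offsets(generator, offsets, scale=0.1):
    """Returns a copy of the frozen generator's parameters with offsets added,
    for use with torch.func.functional_call; the generator itself stays fixed."""
    params = {name: p for name, p in generator.named_parameters()}
    for name, delta in offsets.items():
        params[name] = params[name] + scale * delta
    return params


def directional_clip_loss(clip_model, img_src, img_tgt, tok_src, tok_tgt):
    """CLIP-direction loss: the shift between the source rendering and the
    stylized rendering should align with the shift from the source prompt
    (e.g. "Face") to the target prompt (e.g. "Bearded face").
    tok_src / tok_tgt are pre-tokenized text tensors."""
    d_img = clip_model.encode_image(img_tgt) - clip_model.encode_image(img_src)
    d_txt = clip_model.encode_text(tok_tgt) - clip_model.encode_text(tok_src)
    return 1.0 - F.cosine_similarity(d_img, d_txt, dim=-1).mean()
```

In this sketch, three such hyper-networks (one per layer group) would be optimized jointly under the CLIP loss above plus an identity-preservation loss, while the pre-trained generator and the text encoder remain frozen.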



BibTeX

@ARTICLE{10542240,
  author={Chen, Zhuo and Xu, Xudong and Yan, Yichao and Pan, Ye and Zhu, Wenhan and Wu, Wayne and Dai, Bo and Yang, Xiaokang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology}, 
  title={HyperStyle3D: Text-Guided 3D Portrait Stylization via Hypernetworks}, 
  year={2024},
  volume={},
  number={},
  pages={1-1},
  keywords={Three-dimensional displays;Shape;Generators;Solid modeling;Semantics;Deformation;Deformable models;3D-aware GAN;Style Transfer;Hyper-network},
  doi={10.1109/TCSVT.2024.3407135}}
