FlashTex: Fast Relightable Mesh Texturing with LightControlNet

Kangle Deng² Timothy Omernick¹ Alexander Weiss¹ Deva Ramanan² Jun-Yan Zhu² Tinghui Zhou¹ Maneesh Agrawala¹,³

¹ Roblox ² Carnegie Mellon University ³ Stanford University

ECCV 2024 (Oral)

FlashTex textures an input 3D mesh given a user-provided text prompt. Notably, our generated texture can be relit properly in different lighting environments. The following results are rendered using Blender.

Indoor Studio

Street night

Fireplace

Anime sky

Outdoor

Golden Bay

"Wooden goblet with grain patterns"

"Stone Goblet carved with runes and symbols"

"Marble goblet with white base color and red veins"

"Metal goblet intricately designed to reflect a Van Gogh painting"

Light Probe

Abstract

Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. Our method introduces LightControlNet, a new text-to-image model based on the ControlNet architecture, that allows the specification of the desired lighting as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to increase the texture quality while disentangling surface material from lighting. We show that this pipeline is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures.

Method Pipeline

Our method efficiently generates relightable textures for an input 3D mesh and text prompt. In stage 1 (top left), we use multi-view visual prompting with our LightControlNet diffusion model to generate four visually consistent canonical views of the mesh under fixed lighting, which we concatenate into a reference image \( I_{\text{ref}} \). In stage 2, we apply a novel texture optimization procedure that uses \( I_{\text{ref}} \) as guidance in combination with a multi-resolution hash-grid representation of the texture \( \Gamma(\beta(\cdot)) \). In each iteration of the optimization, we render two batches of images using \( \Gamma(\beta(\cdot)) \): one using the canonical views and lighting of \( I_{\text{ref}} \), from which we compute a reconstruction loss \( \mathcal{L}_{\text{recon}} \), and one using randomly sampled views and lighting, which we use together with LightControlNet to compute an SDS loss \( \mathcal{L}_{\text{SDS}} \).
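
As a rough sketch of how the two losses in stage 2 might fit together, one can write a weighted objective. The weights \( \lambda_{\text{recon}} \) and \( \lambda_{\text{SDS}} \) are hypothetical placeholders; this page does not state how the losses are balanced. The gradient is the standard SDS form from DreamFusion, written under the assumption that the LightControlNet noise prediction \( \epsilon_{\phi} \) is conditioned on the text prompt \( y \) and on a conditioning image \( c \) rendered for the sampled view and lighting. Here \( \theta \) denotes the texture parameters, \( x \) a rendered image, \( x_t \) its noised version at timestep \( t \), \( \epsilon \) the injected noise, and \( w(t) \) a timestep weighting.

\[
\mathcal{L} = \lambda_{\text{recon}}\,\mathcal{L}_{\text{recon}} + \lambda_{\text{SDS}}\,\mathcal{L}_{\text{SDS}},
\qquad
\nabla_{\theta}\mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\epsilon_{\phi}(x_t;\, y, c, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
\]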

LightControlNet

(a) Our LightControlNet generates images with controlled lighting via a conditioning image that specifies the desired lighting \( L \) for a view \( C \) of a 3D mesh. To form the conditioning image, we first render the mesh with the desired \( L \) and \( C \) using three different materials: (1) non-metal, not smooth, (2) half-metal, half-smooth, and (3) pure metal, smooth. We then combine the renderings into a single three-channel image. (b) Our LightControlNet is trained using paired data derived from Objaverse. Each training object is rendered twice using the same pose and lighting: once with its original texture and material, and once with the original texture replaced by our condition materials to obtain the conditioning image. We use the object's associated name and tags as the text prompt. (c) To illustrate the generalizability of our LightControlNet, we present several examples of the same shape under varying lighting conditions and text prompts.
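
The channel-packing step in (a) can be illustrated with a minimal NumPy sketch. It assumes each of the three material renders is first collapsed to a single luminance channel and that the channel order is (non-metal, half-metal, metal); neither detail is stated on this page. The function names `to_luminance` and `make_conditioning_image` are hypothetical, and the constant arrays stand in for real renders.

```python
import numpy as np

def to_luminance(rgb: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, 3) render with values in [0, 1] to one (H, W) channel."""
    weights = np.array([0.2126, 0.7152, 0.0722])  # Rec. 709 luma weights (assumed choice)
    return rgb @ weights

def make_conditioning_image(non_metal, half_metal, metal):
    """Stack three single-material renders into one three-channel image.

    Channel order is an assumption made for illustration only.
    """
    channels = [to_luminance(r) for r in (non_metal, half_metal, metal)]
    return np.stack(channels, axis=-1)

# Constant-valued stand-ins for the three renders of the same view and lighting.
h, w = 4, 4
non_metal = np.full((h, w, 3), 0.2)
half_metal = np.full((h, w, 3), 0.5)
metal = np.full((h, w, 3), 0.9)

cond = make_conditioning_image(non_metal, half_metal, metal)
print(cond.shape)
```

Packing the three renders into one image keeps the conditioning input the same shape as an ordinary RGB control image, which is what a ControlNet-style model expects.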

Multi-View Visual Prompting

In stage 1 of our pipeline, we use our LightControlNet diffusion model to generate four visually consistent canonical views of the mesh under fixed lighting. (a) However, when we independently input four canonical conditioning images to LightControlNet, it generates four very different appearances and styles, even with a fixed random seed. (b) When we concatenate the four images into a \( 2\times 2\) grid and pass the grid to LightControlNet as a single image, it produces a far more consistent appearance and style. We suspect this property arises from the presence of similar samples (grid-organized sets depicting the same object) in Stable Diffusion's training set.
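
The grid trick in (b) amounts to tiling four images into one and later cutting the generated result back into four views. The following NumPy sketch shows only that bookkeeping; the row-major layout and the helper names `make_grid` and `split_grid` are assumptions for illustration, and the constant arrays stand in for real conditioning images.

```python
import numpy as np

def make_grid(views):
    """Tile four (H, W, C) images into one (2H, 2W, C) image in row-major order."""
    if len(views) != 4:
        raise ValueError("expected exactly four views")
    top = np.concatenate(views[:2], axis=1)      # views 0 and 1 side by side
    bottom = np.concatenate(views[2:], axis=1)   # views 2 and 3 side by side
    return np.concatenate([top, bottom], axis=0)

def split_grid(grid):
    """Cut a (2H, 2W, C) grid back into its four (H, W, C) tiles."""
    h, w = grid.shape[0] // 2, grid.shape[1] // 2
    return [grid[:h, :w], grid[:h, w:], grid[h:, :w], grid[h:, w:]]

# Four constant-valued stand-ins for the canonical conditioning images.
views = [np.full((8, 8, 3), i, dtype=np.float32) for i in range(4)]
grid = make_grid(views)
recovered = split_grid(grid)
print(grid.shape)
```

In a real pipeline the grid would be passed through the diffusion model before splitting, so the recovered tiles would be the generated views and not the inputs.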

More results rendered by Blender

Our exported textures are directly compatible with widely used rendering applications, such as Blender. The results below are rendered using Blender, with a rotating indoor studio lighting environment.

"Horse saddle, leather, craft, sewing, tanning, 20th-century"

"Pine Cone"

"Nimbus 2000, harry potter broom, magic"

"Peony flower"

"Rotten apple"

"Wooden Boat"

"A medieval steel helmet"

Our Results (Fixed object, Rotating lighting)

Below we show results rendered with nvdiffrast. Specifically, we show a grid of input text prompts (top) paired with their corresponding generated textures (bottom). In these examples, the lighting rotates while the object is fixed.

"Doc Martens Boot"

"Wooden Boat"

"Ballet costume"

"Fairy lantern, adventure"

"A sculpture of horse without rider"

"A dog head statue"

"Stylish sweater"

"Futuristic helmet"

"Suede Womens Heeled Boot"

"Casual gothic outfit, fashion, stylish, clothes"

"Horse saddle"

"Purple scallop"

"Hiking boot"

"Camera Super 8 Braun Nizo S8T"

"Thorsberg tunic"

Our Results (Fixed lighting, Rotating object)

We also render the results with rotating objects and fixed lighting.

"MOPED 1978 Puch Moped Hero, motorcycle"

"Hiking boot"

"Medieval windmill"

"Nimbus 2000, harry potter broom, magic"

"Pruning shears"

"Fairy lantern, adventure"

"A sculpture of horse without rider"

"doll boots, leather, heels"

"perisphinctes, cretaceous, fossils, ammonite"

"futuristic helmet, cyborg, metal, robot"

"Suede Womens Heeled Boot"

"Stylish_Boot"

"horse saddle, leather, craft, sewing, tanning, 20th-century"

"purple scallop, purple shell, aquatic, shellfish"

"vintage cash register"

"a sculpture of monkey"

"LAV-25, tank"

"Doc Martens Boot"

"Wooden boat"

"Ballet costume, dance dress, woman, ballerina"

"Russian Antitank rifle, PTRD-1941, 14.5_mm, rifle, gun"

"Peony flower"

"Ranger bow"

"Camra 16 mm, Paillard Bolex, H16 REX-5, cinema, camera"

"Abandoned Jeep Gladiator, crashed, jeep, vehicle"

"Casual gothic outfit, fashion, stylish, clothes"

"Moon necklace"

"Pine cone"

"Thorsberg tunic, clothes"

"A stylish jacket"

"A vintage space explorer jacket with a matching helmet, weathered and covered in cosmic dust"

"Jacket made from the fabrics of a ghost ship phantasmal and shimmering"

"Jacket that gives the impression of a swirling nebula"

"an astronaut fishman"

"coral reef guardian"

"deep sea diver"

"mermaid warrior, trending on artstation"

"cave dweller, high detail"

"Hylian goblin soldier from legend of zelda"

"Dystopian rebel leader with intricate tattoos"

"Space-faring explorer in an advanced suit"

"an astronaut wolf"

"Hylian wolf soldier from legend of zelda"

"Japanese samurai wolf"

"Pirate tribal wolf"

Comparison with relightable baseline

We render our results and Fantasia3D's results under the same fixed lighting while the object rotates. We also render an untextured mesh (half-metal, half-smooth material) under the same lighting to show the reference lighting environment.

Ours
[4 mins]

Fantasia3D
[30 mins]

Untextured Mesh (reference lighting)

"A sculpture of horse without rider"

"Camra 16 mm, Paillard Bolex, H16 REX-5, cinema, camera"

"Doc Martens Boot"

"MOPED 1978 Puch Moped Hero, motorcycle"

"Medieval windmill"

"Stylish Boot"

"Thorsberg tunic, clothes"

"a sculpture of monkey"

"ballet costume, dance, dress, woman, ballerina"

"doll boots, leather, heels"

"futuristic helmet, cyborg, metal, robot"

"hiking boot"

"horse saddle, leather, craft, sewing, tanning, 20th-century"

"peony flower"

"perisphinctes, cretaceous, fossils, ammonite"

"purple scallop, purple shell, aquatic, shellfish"

"vintage cash register"

"wooden boat"

Comparison with non-relightable baselines

We compare two variants of our method with the non-relightable baselines. The first uses our proposed LightControlNet to generate relightable textures, while the second, non-relightable variant employs a standard depth-guided ControlNet. All results are placed in the same lighting environment, where the lighting is fixed while the object rotates.

Ours (relightable)
[4 mins]

Ours (Non-relightable)
[2 mins]

Text2Tex
[15 mins]

TEXTure
[6 mins]

Latent-Paint
[10 mins]

"ballet costume, dance, dress, woman, ballerina"

"horse saddle, leather, craft, sewing, tanning, 20th-century"

"hiking boot"

"MOPED 1978 Puch Moped Hero, motorcycle"

"vintage cash register"

"pine cone"

"Thorsberg tunic, clothes"

"LAV-25, tank"

"Fairy lantern, adventure"

"Stylish Boot"

"Medieval windmill"

Citation


@inproceedings{deng2024flashtex,
  title={FlashTex: Fast Relightable Mesh Texturing with LightControlNet},
  author={Deng, Kangle and Omernick, Timothy and Weiss, Alexander and Ramanan, Deva and Zhu, Jun-Yan and Zhou, Tinghui and Agrawala, Maneesh},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024},
}

Related and Concurrent Works

Rui Chen, Yongwei Chen, Ningxin Jiao, Kui Jia. Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation. ICCV 2023.

Xudong Xu, Zhaoyang Lyu, Xingang Pan, Bo Dai. MATLABER: Material-Aware Text-to-3D via LAtent BRDF auto-EncodeR. ArXiv 2023.

Yuan-Chen Guo, Ying-Tian Liu, Vikram Voleti, Ruizhi Shao, Chia-Hao Chen, Guan Luo, Zixin Zou, Chen Wang, Christian Laforte, Yan-Pei Cao, Song-Hai Zhang. threestudio: A unified framework for 3D content generation.

Acknowledgements

We thank Benjamin Akrish, Victor Zordan, Dmitry Trifonov, Derek Liu, Sheng-Yu Wang, Gaurav Parmar, Ruihan Gao, Nupur Kumari, and Sean Liu for their discussion and help. This work was done when Kangle was an intern at Roblox. The project is partly supported by Roblox. JYZ is partly supported by the Packard Fellowship. KD is supported by the Microsoft Research PhD Fellowship. The website template is taken from Custom Diffusion (which was built on DreamFusion's project page).
