FlashTex: Fast Relightable Mesh Texturing with LightControlNet

1 Roblox    2 Carnegie Mellon University    3 Stanford University


FlashTex textures an input 3D mesh given a user-provided text prompt. Notably, our generated texture can be relit properly in different lighting environments.


Abstract

Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. Our method introduces LightControlNet, a new text-to-image model based on the ControlNet architecture, which allows the desired lighting to be specified as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to increase the texture quality while disentangling surface material from lighting. We show that this pipeline is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures.


Method Pipeline

Our method efficiently generates relightable textures for an input 3D mesh and text prompt. In stage 1 (top left), we use multi-view visual prompting with our LightControlNet diffusion model to generate four visually consistent canonical views of the mesh under fixed lighting, which we concatenate into a reference image \( I_{\text{ref}} \). In stage 2, we apply a novel texture optimization procedure that uses \( I_{\text{ref}} \) as guidance in combination with a multi-resolution hash-grid representation of the texture \( \Gamma(\beta(\cdot)) \). In each iteration of the optimization, we render two batches of images using \( \Gamma(\beta(\cdot)) \). The first batch uses the canonical views and lighting of \( I_{\text{ref}} \), and we use it to compute a reconstruction loss \( \mathcal{L}_{\text{recon}} \). The second batch uses randomly sampled views and lighting, and we use it together with LightControlNet to compute an SDS loss \( \mathcal{L}_{\text{SDS}} \).
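The two-batch structure of a stage 2 iteration can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the function name `stage2_loss`, the callables `render`, `recon_loss`, `sds_loss`, and `sample_view`, the batch size, and the weights `w_recon` and `w_sds` are all hypothetical placeholders, and the weighted sum is an assumed way of combining the two losses.

```python
from typing import Any, Callable, List, Sequence, Tuple

# A (camera, lighting) pair; both are treated as opaque objects here.
View = Tuple[Any, Any]


def stage2_loss(
    render: Callable[[Any, Any], Any],
    recon_loss: Callable[[List[Any], Any], float],
    sds_loss: Callable[[List[Any], List[View]], float],
    canonical_views: Sequence[View],
    reference: Any,
    sample_view: Callable[[], View],
    random_batch_size: int = 4,  # assumed value, not from the source
    w_recon: float = 1.0,        # assumed weight, not from the source
    w_sds: float = 1.0,          # assumed weight, not from the source
) -> float:
    """Combine the two per-iteration losses described in the text."""
    # Batch 1: render with the canonical views and lighting of I_ref,
    # then compare against the reference to get L_recon.
    canonical_images = [render(cam, light) for cam, light in canonical_views]
    l_recon = recon_loss(canonical_images, reference)

    # Batch 2: render with randomly sampled views and lighting,
    # then score the renders with the (placeholder) SDS loss.
    random_views = [sample_view() for _ in range(random_batch_size)]
    random_images = [render(cam, light) for cam, light in random_views]
    l_sds = sds_loss(random_images, random_views)

    # Assumed combination: a simple weighted sum of the two terms.
    return w_recon * l_recon + w_sds * l_sds
```

In a real pipeline, `render` would be a differentiable renderer reading from the texture representation and `sds_loss` would query LightControlNet with conditioning images for the sampled views and lighting; here they are left abstract so that only the control flow is shown.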


LightControlNet

(a) Our LightControlNet generates images with controlled lighting through a conditioning image that specifies the desired lighting \( L \) for a view \( C \) of a 3D mesh. To form the conditioning image, we first render the mesh with the desired \( L \) and \( C \) using three different materials: (1) non-metal, not smooth, (2) half-metal, half-smooth, and (3) pure metal, smooth. We then combine the renderings into a single three-channel image. (b) Our LightControlNet is trained using paired data derived from Objaverse. Each training object is rendered twice using the same pose and lighting: once with its original texture and material, and once with the original texture replaced by our condition materials to obtain the conditioning image. We use the object's associated name and tags as the text prompt. (c) To illustrate the generalizability of our LightControlNet, we present several examples of the same shape under varying lighting conditions and text prompts.
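As a rough illustration of how three material renderings could be packed into one three-channel conditioning image, consider the sketch below. It assumes that each rendering has already been reduced to a single grayscale channel and that the channels are ordered (non-metal, half-metal, pure metal); the text only states that the renderings are combined into a three-channel image, so both choices, along with the function name `combine_material_renders`, are assumptions.

```python
from typing import List, Tuple

Gray = List[List[float]]            # H x W grayscale image as nested lists
Pixel = Tuple[float, float, float]  # one three-channel pixel


def combine_material_renders(
    non_metal_rough: Gray,
    half_metal_half_smooth: Gray,
    pure_metal_smooth: Gray,
) -> List[List[Pixel]]:
    """Stack three same-sized grayscale renders into one 3-channel image."""
    height = len(non_metal_rough)
    width = len(non_metal_rough[0])
    renders = (non_metal_rough, half_metal_half_smooth, pure_metal_smooth)

    # All three renders must come from the same view at the same resolution.
    for img in renders:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all three renderings must share one resolution")

    # Channel order is an assumption: (non-metal, half-metal, pure metal).
    return [
        [
            (
                non_metal_rough[y][x],
                half_metal_half_smooth[y][x],
                pure_metal_smooth[y][x],
            )
            for x in range(width)
        ]
        for y in range(height)
    ]
```

With an array library, the same operation would typically be a single stack along the channel axis; nested lists are used here only to keep the sketch dependency-free.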


Multi-View Visual Prompting

In stage 1 of our pipeline, we use our LightControlNet diffusion model to generate four visually consistent canonical views of the mesh under fixed lighting. (a) When we independently input four canonical conditioning images to LightControlNet, however, it generates four very different appearances and styles, even with a fixed random seed. (b) When we concatenate the four images into a \( 2\times 2\) grid and pass them as a single image into LightControlNet, it produces a far more consistent appearance and style. We suspect this property arises from the presence of similar training samples (grid-organized sets depicting the same object) in Stable Diffusion's training set.
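The grid trick in (b) amounts to tiling four same-sized images into one image and splitting the result back into views afterwards. A minimal sketch is given below, assuming a row-major layout (views 0 and 1 on top, views 2 and 3 below); the layout and the helper names `tile_2x2` and `split_2x2` are illustrative assumptions rather than details from the source.

```python
from typing import Any, List, Sequence

Image = List[List[Any]]  # H x W image as nested lists of pixel values


def tile_2x2(views: Sequence[Image]) -> Image:
    """Tile four same-sized images into a single 2x2 grid image."""
    if len(views) != 4:
        raise ValueError("expected exactly four views")
    a, b, c, d = views
    # Concatenate rows side by side, then stack the two halves vertically.
    top = [row_a + row_b for row_a, row_b in zip(a, b)]
    bottom = [row_c + row_d for row_c, row_d in zip(c, d)]
    return top + bottom


def split_2x2(grid: Image) -> List[Image]:
    """Split a 2x2 grid image back into its four views (row-major order)."""
    half_h = len(grid) // 2
    half_w = len(grid[0]) // 2
    return [
        [row[:half_w] for row in grid[:half_h]],  # top-left
        [row[half_w:] for row in grid[:half_h]],  # top-right
        [row[:half_w] for row in grid[half_h:]],  # bottom-left
        [row[half_w:] for row in grid[half_h:]],  # bottom-right
    ]
```

Under this sketch, the four conditioning images would be tiled with `tile_2x2` before the single diffusion pass, and the generated grid would be cut with `split_2x2` to recover the per-view outputs.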


Our Results (Fixed object, Rotating lighting)

Below we show a grid of input text prompts (top) paired with their corresponding generated textures (bottom). In these examples, the lighting rotates while the object is fixed.

"Doc Martens Boot"

"Wooden Boat"

"Ballet costume"

"Fairy lantern, adventure"

"A sculpture of horse without rider"

"A dog head statue"

"Stylish sweater"

"Futuristic helmet"

"Suede Womens Heeled Boot"

"Casual gothic outfit, fashion, stylish, clothes"

"Horse saddle"

"Purple scallop"

"Hiking boot"

"Camera Super 8 Braun Nizo S8T"

"Thorsberg tunic"


Our Results (Fixed lighting, Rotating object)

We also render the results with rotating objects and fixed lighting.

"MOPED 1978 Puch Moped Hero, motorcycle"

"Hiking boot"

"Medieval windmill"

"Nimbus 2000, harry potter broom, magic"

"Pruning shears"

"Fairy lantern, adventure"

"A sculpture of horse without rider"

"doll boots, leather, heels"

"perisphinctes, cretaceous, fossils, ammonite"

"futuristic helmet, cyborg, metal, robot"

"Suede Womens Heeled Boot"

"Stylish_Boot"

"horse saddle, leather, craft, sewing, tanning, 20th-century"

"purple scallop, purple shell, aquatic, shellfish"

"vintage cash register"

"a sculpture of monkey"

"LAV-25, tank"

"Doc Martens Boot"

"Wooden boat"

"Ballet costume, dance dress, woman, ballerina"

"Russian Antitank rifle, PTRD-1941, 14.5_mm, rifle, gun"

"Peony flower"

"Ranger bow"

"Camra 16 mm, Paillard Bolex, H16 REX-5, cinema, camera"

"Abandoned Jeep Gladiator, crashed, jeep, vehicle"

"Casual gothic outfit, fashion, stylish, clothes"

"Moon necklace"

"Pine cone"

"Thorsberg tunic, clothes"

"A stylish jacket"

"A vintage space explorer jacket with a matching helmet, weathered and covered in cosmic dust"

"Jacket made from the fabrics of a ghost ship phantasmal and shimmering"

"Jacket that gives the impression of a swirling nebula"

"an astronaut fishman"

"coral reef guardian"

"deep sea diver"

"mermaid warrior, trending on artstation"

"cave dweller, high detail"

"Hylian goblin soldier from legend of zelda"

"Dystopian rebel leader with intricate tattoos"

"Space-faring explorer in an advanced suit"

"an astronaut wolf"

"Hylian wolf soldier from legend of zelda"

"Japanese samurai wolf"

"Pirate tribal wolf"



Comparison with relightable baseline

We render our results and Fantasia3D's results under the same fixed lighting, with the object rotating. We also render an untextured mesh (half-metal, half-smooth material) under the same lighting to show the reference lighting environment.

Ours [4 mins]

Fantasia3D [30 mins]

Untextured Mesh (reference lighting)

"A sculpture of horse without rider"

"Camra 16 mm, Paillard Bolex, H16 REX-5, cinema, camera"

"Doc Martens Boot"

"MOPED 1978 Puch Moped Hero, motorcycle"

"Medieval windmill"

"Stylish Boot"

"Thorsberg tunic, clothes"

"a sculpture of monkey"

"ballet costume, dance, dress, woman, ballerina"

"doll boots, leather, heels"

"futuristic helmet, cyborg, metal, robot"

"hiking boot"

"horse saddle, leather, craft, sewing, tanning, 20th-century"

"peony flower"

"perisphinctes, cretaceous, fossils, ammonite"

"purple scallop, purple shell, aquatic, shellfish"

"vintage cash register"

"wooden boat"


Comparison with non-relightable baselines

We compare two variants of our method with the non-relightable baselines. The first variant uses our proposed LightControlNet to generate a relightable texture, while the second, non-relightable variant employs a standard depth-guided ControlNet. All results are rendered in the same lighting environment, with the lighting fixed and the object rotating.

Ours (relightable) [4 mins]

Ours (Non-relightable) [2 mins]

text2tex [15 mins]

TEXTure [6 mins]

latent-paint [10 mins]

"ballet costume, dance, dress, woman, ballerina"

"horse saddle, leather, craft, sewing, tanning, 20th-century"

"hiking boot"

"MOPED 1978 Puch Moped Hero, motorcycle"

"vintage cash register"

"pine cone"

"Thorsberg tunic, clothes"

"LAV-25, tank"

"Fairy lantern, adventure"

"Stylish Boot"

"Medieval windmill"


Citation

@article{deng2024flashtex,
  title={FlashTex: Fast Relightable Mesh Texturing with LightControlNet},
  author={Deng, Kangle and Omernick, Timothy and Weiss, Alexander and Ramanan, Deva and Zhu, Jun-Yan and Zhou, Tinghui and Agrawala, Maneesh},
  journal={arXiv preprint arXiv:2402.13251},
  year={2024},
}


Acknowledgements

We thank Benjamin Akrish, Victor Zordan, Dmitry Trifonov, Derek Liu, Sheng-Yu Wang, Gaurav Parmar, Ruihan Gao, Nupur Kumari, and Sean Liu for their discussions and help. This work was done while Kangle was an intern at Roblox. The project is partly supported by Roblox. JYZ is partly supported by the Packard Fellowship. KD is supported by the Microsoft Research PhD Fellowship. The website template is taken from Custom Diffusion (which was built on DreamFusion's project page).