Differentiable rendering has paved the way for training neural networks to perform “inverse graphics” tasks such as predicting 3D geometry from monocular photographs. To train high-performing models, most current approaches rely on multi-view imagery, which is not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be changed simply by manipulating the latent codes. However, these latent codes often lack further physical interpretation, and thus GANs cannot easily be inverted to perform explicit 3D reasoning. In this paper, we aim to extract and disentangle the 3D knowledge learned by generative models by utilizing differentiable renderers.

The key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN’s latent code into interpretable 3D properties. The entire architecture is trained iteratively using cycle consistency losses. We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and via user studies. We further showcase the disentangled GAN as a controllable 3D “neural renderer”, complementing traditional graphics renderers.
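As a rough illustration of what a cycle consistency loss measures in this kind of setup, the sketch below uses toy stand-ins: `toy_generator` plays the role of the disentangled GAN that maps 3D properties to an “image”, and `toy_inverse_graphics` plays the role of the network that maps the image back to properties. Both functions, their linear form, and the squared-error loss are assumptions made purely for illustration; the abstract does not specify the actual losses or architectures.

```python
from typing import List


def toy_generator(props: List[float]) -> List[float]:
    """Hypothetical stand-in for the GAN: maps properties to a toy 'image'."""
    return [2.0 * p + 1.0 for p in props]


def toy_inverse_graphics(image: List[float], gain: float = 0.5) -> List[float]:
    """Hypothetical stand-in for the inverse graphics network.

    With gain=0.5 it exactly inverts toy_generator; any other gain
    models an imperfectly trained network.
    """
    return [gain * (x - 1.0) for x in image]


def cycle_consistency_loss(props: List[float], gain: float = 0.5) -> float:
    """Mean squared error between properties and their round trip
    properties -> image -> predicted properties."""
    recon = toy_inverse_graphics(toy_generator(props), gain)
    return sum((r - p) ** 2 for r, p in zip(recon, props)) / len(props)


if __name__ == "__main__":
    props = [0.5, -1.0, 2.0]
    print(cycle_consistency_loss(props))            # perfect inverse
    print(cycle_consistency_loss(props, gain=0.4))  # imperfect inverse
```

The loss vanishes only when the two toy mappings are mutual inverses on the sampled properties, which is the property an iterative training scheme of this kind would push both networks toward.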

Figure 1: We employ two “renderers”: a GAN (StyleGAN in our work), and a differentiable graphics renderer (DIB-R in our work). We exploit StyleGAN as a synthetic data generator, and we label this data extremely efficiently. This “dataset” is used to train an inverse graphics network that predicts 3D properties from images. We use this network to disentangle StyleGAN’s latent code through a carefully designed mapping network.
Figure 2: We show examples of cars (first two rows) synthesized in chosen viewpoints (columns). To get these, we fix the latent code $w_v^*$ that controls the viewpoint (one code per column) and randomly sample the remaining dimensions of (Style)GAN’s latent code (to get rows). Notice how well aligned the two cars are in each column. In the third row we show the same approach applied to horse and bird StyleGAN.
Figure 3: A mapping network maps camera, shape, texture and background into a disentangled code that is passed to StyleGAN for “rendering”. We refer to this network as StyleGAN-R.
Figure 4: 3D Reconstruction Results: Given input images (1st column), we predict 3D shape, texture, and render them into the same viewpoint (2nd column). We also show renderings in 3 other views in remaining columns to showcase 3D quality. Our model is able to reconstruct cars with various shapes, textures and viewpoints. We also show the same approach on harder (articulated) objects, i.e., bird and horse.
Figure 5: Comparison on Pascal3D test set: We compare inverse graphics networks trained on Pascal3D and on our StyleGAN dataset. Notice the considerably higher prediction quality when training on the StyleGAN dataset.
Figure 6: Ablation Study: We ablate the use of the multi-view consistency loss. Both texture and shape are worse without this loss, especially in the invisible parts (rows 2, 5, denoted by “w.o M. V.”, i.e., no multi-view consistency used during training), showcasing the importance of our StyleGAN-multiview dataset.
Figure 7: Dual Renderer: Given input images (1st column), we first predict mesh and texture, and render them with the graphics renderer (2nd column), and our StyleGAN-R (3rd column).
Figure 8: Latent code manipulation: Given an input image (col 1), we predict 3D properties and synthesize a new image with StyleGAN-R, by manipulating the viewpoint (col 2, 3, 4). Alternatively, we directly optimize the (original) StyleGAN latent code w.r.t. the image; however, this leads to a blurry reconstruction (col 5). Moreover, when we try to adjust the style for the optimized code, we get low-quality results (col 6, 7).
Figure 9: Camera Controller: We manipulate azimuth, scale, elevation parameters with StyleGAN-R to synthesize images in new viewpoints while keeping content code fixed.
Figure 10: 3D Manipulation: We sample 3 cars in column 1. We replace the shape of all cars with the shape of Car 1 (red box) in the 2nd column. We transfer the texture of Car 2 (green box) to the other cars (3rd col). In the last column, we paste the background of Car 3 (cyan box) onto the other cars. Examples indicated with boxes are unchanged. Zoom in to see details.
Figure 11: Real Image Manipulation: Given input images (1st col), we predict 3D properties and use our StyleGAN-R to render them back (2nd col). We swap out shape, texture & background in cols 3–5.
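
The viewpoint-fixing procedure described in the Figure 2 caption (one fixed viewpoint code per column, randomly sampled remaining dimensions per row) can be sketched as follows. The layout is an assumption: the latent code is modeled as a list of per-layer vectors in which the first `NUM_VIEW_LAYERS` layers act as the viewpoint code. The layer counts, dimensions, and helper names are hypothetical and are not taken from StyleGAN's API; the viewpoint codes are drawn at random here only as placeholders for codes that would be chosen in practice.

```python
import random
from typing import List

Latent = List[List[float]]  # one vector per generator layer

# Hypothetical sizes, chosen small for readability.
NUM_LAYERS = 8
NUM_VIEW_LAYERS = 2
LAYER_DIM = 4


def sample_layers(n: int, rng: random.Random) -> Latent:
    """Draw n per-layer vectors from a standard normal distribution."""
    return [[rng.gauss(0.0, 1.0) for _ in range(LAYER_DIM)] for _ in range(n)]


def compose(view_code: Latent, content_code: Latent) -> Latent:
    """Concatenate a viewpoint code with a content code into a full latent."""
    return [list(v) for v in view_code] + [list(c) for c in content_code]


def build_grid(num_objects: int, num_views: int, seed: int = 0) -> List[List[Latent]]:
    """Return grid[row][col]: rows share content, columns share viewpoint."""
    rng = random.Random(seed)
    # Placeholder viewpoint codes, one per column.
    view_codes = [sample_layers(NUM_VIEW_LAYERS, rng) for _ in range(num_views)]
    # Randomly sampled remaining dimensions, one set per row.
    content_codes = [
        sample_layers(NUM_LAYERS - NUM_VIEW_LAYERS, rng) for _ in range(num_objects)
    ]
    return [[compose(v, c) for v in view_codes] for c in content_codes]


if __name__ == "__main__":
    grid = build_grid(num_objects=2, num_views=3)
    same_view = grid[0][1][:NUM_VIEW_LAYERS] == grid[1][1][:NUM_VIEW_LAYERS]
    print("column shares viewpoint code:", same_view)
```

Each latent in the grid would then be passed to the generator; the sketch stops at constructing the codes because no generator interface is given in the text.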
