Apple develops GAUDI, an "AI architect" that generates ultra-realistic 3D scenes based on text

5 Aug 2022

However, Google's Dream Fields can only generate 3D views of a single object, and extending it to fully unconstrained 3D scenes is difficult. The biggest obstacle is that camera positions become heavily restricted: for a single object, every reasonable camera position can be mapped onto a dome, but in a 3D scene the camera is constrained by objects and obstacles such as walls. If these constraints are not taken into account, generating coherent 3D scenes becomes very hard.


3D rendering specialist GAUDI

To deal with this problem of restricted camera positions, Apple's GAUDI model introduces three specialised decoder networks.


GAUDI has a camera pose decoder that disentangles the camera pose from the 3D geometry and appearance of the scene, predicts possible camera positions, and ensures that each predicted pose is valid given the scene's layout.


The scene decoder predicts a tri-plane representation of the scene, which acts as a 3D canvas.


The radiance field decoder then applies the volume rendering equation to this canvas to render the individual images.
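To make this division of labour concrete, here is a minimal PyTorch sketch of such a three-decoder factorisation. It is an illustration under assumptions, not Apple's code: the module names, layer sizes, tri-plane resolution and pose parameterisation are all placeholders, and the projection of ray samples onto the tri-planes is omitted.

```python
import torch
import torch.nn as nn

class CameraPoseDecoder(nn.Module):
    """Maps a pose latent and a normalised time step along the trajectory to a camera pose."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 12),                     # a 3x4 camera-to-world matrix
        )

    def forward(self, z_pose, t):                   # z_pose: (B, latent_dim), t: (B, 1)
        return self.mlp(torch.cat([z_pose, t], dim=-1)).view(-1, 3, 4)

class SceneDecoder(nn.Module):
    """Decodes a scene latent into a tri-plane feature grid (the '3D canvas')."""
    def __init__(self, latent_dim=128, res=64, feat=32):
        super().__init__()
        self.res, self.feat = res, feat
        self.net = nn.Linear(latent_dim, 3 * feat * res * res)

    def forward(self, z_scene):                     # z_scene: (B, latent_dim)
        return self.net(z_scene).view(-1, 3, self.feat, self.res, self.res)

class RadianceFieldDecoder(nn.Module):
    """Maps interpolated tri-plane features at a 3D point to density and colour."""
    def __init__(self, feat=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * feat, 128), nn.ReLU(), nn.Linear(128, 4))

    def forward(self, point_features):              # point_features: (..., 3 * feat)
        out = self.mlp(point_features)
        return out[..., :1], torch.sigmoid(out[..., 1:])   # density, RGB

def composite(sigmas, rgbs, deltas):
    """Standard NeRF-style volume-rendering compositing along each ray."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)                       # (..., N)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[..., :1]), 1.0 - alphas + 1e-10], dim=-1),
        dim=-1,
    )[..., :-1]
    weights = alphas * trans
    return (weights[..., None] * rgbs).sum(dim=-2)                   # (..., 3)
```

The composite helper is just the usual volume-rendering weighting; in a full pipeline the features at each ray sample would be gathered by projecting the point onto the three planes and interpolating before being passed to the radiance field decoder.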


GAUDI's 3D generation consists of two stages.

The first is the optimisation of the latents and the network parameters: learning a latent representation that encodes the 3D radiance field and the corresponding camera poses for thousands of trajectories. Unlike the single-object case, the set of valid camera poses varies from scene to scene, so the valid poses have to be encoded for each scene.
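A hedged sketch of this first stage, assuming an auto-decoder setup in which one latent per trajectory is optimised jointly with the shared decoder weights against a reconstruction loss on rendered views and predicted poses. The tiny decoder, image size and random stand-in "data" below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

num_trajectories, latent_dim = 100, 64
views_per_traj, hw = 4, 8 * 8 * 3               # toy image size: 8x8 RGB, flattened

# Per-trajectory latents, optimised directly (no encoder), plus a shared decoder.
latents = nn.Parameter(0.01 * torch.randn(num_trajectories, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, views_per_traj * (hw + 12)))  # views + 3x4 poses

# Stand-in "dataset": ground-truth renders and camera poses for each trajectory.
gt_images = torch.rand(num_trajectories, views_per_traj, hw)
gt_poses = torch.randn(num_trajectories, views_per_traj, 12)

opt = torch.optim.Adam([latents, *decoder.parameters()], lr=1e-3)
for step in range(200):
    idx = torch.randint(0, num_trajectories, (16,))
    out = decoder(latents[idx]).view(16, views_per_traj, hw + 12)
    pred_images, pred_poses = out[..., :hw], out[..., hw:]
    loss = ((pred_images - gt_images[idx]) ** 2).mean() \
         + ((pred_poses - gt_poses[idx]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Because each latent has to reproduce both the renders and the camera poses of its own trajectory, the valid viewpoints end up encoded per scene, which is exactly the constraint described above.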


The second is training a diffusion model over these latent representations, which enables both conditional and unconditional inference. The former generates 3D scenes from text or image prompts, while the latter generates 3D scenes along sampled camera trajectories.
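The sketch below illustrates this second stage under the assumption of a standard DDPM-style denoising objective over the stage-1 latents, with an optional conditioning vector (a text or image embedding) concatenated to the noisy latent. The noise schedule, network size and conditioning interface are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

latent_dim, cond_dim, T = 64, 32, 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(latent_dim + cond_dim + 1, 256), nn.ReLU(),
                         nn.Linear(256, latent_dim))        # predicts the added noise
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

z0 = torch.randn(512, latent_dim)        # stage-1 latents (random stand-ins here)
cond = torch.randn(512, cond_dim)        # conditioning embeddings; zeros = unconditional

for step in range(200):
    idx = torch.randint(0, 512, (64,))
    t = torch.randint(0, T, (64, 1))
    a = alphas_cumprod[t]                                    # (64, 1)
    noise = torch.randn(64, latent_dim)
    zt = a.sqrt() * z0[idx] + (1 - a).sqrt() * noise         # forward diffusion
    pred = denoiser(torch.cat([zt, cond[idx], t.float() / T], dim=-1))
    loss = ((pred - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

At sampling time the learned denoiser is run in reverse from pure noise to produce a new latent, which the stage-1 decoders then turn into camera poses and a renderable radiance field.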


For 3D indoor scenes, GAUDI can generate new camera paths. As the team's examples show, the text descriptions contain information about both the scene and the navigation path. Here the team used a pre-trained RoBERTa-based text encoder and used its intermediate representation to condition the diffusion model.
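A minimal sketch of such text conditioning, using the Hugging Face "roberta-base" checkpoint as a stand-in for the pre-trained RoBERTa encoder mentioned above. Which intermediate layer is used, and how it is pooled into a single conditioning vector, are assumptions here.

```python
import torch
from transformers import AutoTokenizer, RobertaModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_encoder = RobertaModel.from_pretrained("roberta-base").eval()

prompt = "walk through the living room towards the kitchen"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**tokens, output_hidden_states=True)

# Use an intermediate hidden state, mean-pooled over tokens, as the conditioning
# vector fed to the latent diffusion model sketched earlier.
cond = out.hidden_states[6].mean(dim=1)       # (1, 768); layer choice is illustrative
```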


In addition, using a pre-trained ResNet-18 as the image encoder, GAUDI can sample a radiance field conditioned on a given image viewed from a random viewpoint, creating a 3D scene from an image prompt.
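A hedged sketch of this image conditioning, assuming a torchvision ResNet-18 with its classification head removed so that the resulting 512-dimensional feature serves as the conditioning vector; the pooling and any later projection are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights="IMAGENET1K_V1")       # torchvision >= 0.13
image_encoder = nn.Sequential(*list(resnet.children())[:-1]).eval()  # drop the fc layer

image = torch.rand(1, 3, 224, 224)                      # placeholder input image
with torch.no_grad():
    cond = image_encoder(image).flatten(1)              # (1, 512) conditioning vector
```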


The researchers ran experiments on four datasets, including the indoor scanning dataset ARKitScenes, and showed that GAUDI can reconstruct the learned views and match the quality of existing methods. Even in the large-scale setting of modelling thousands of indoor scenes from hundreds of thousands of images, GAUDI does not suffer from mode collapse or orientation issues.


The advent of GAUDI will not only affect many computer vision tasks; its ability to generate 3D scenes will also benefit research areas such as model-based reinforcement learning and planning, SLAM, and 3D content creation.


