GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds


We present GANcraft, an unsupervised neural rendering framework for generating photorealistic images of large 3D block worlds such as those created in Minecraft. Our method takes a semantic block world as input, where each block is assigned a semantic label such as dirt, grass, or water. We represent the world as a continuous volumetric function and train our model to render view-consistent photorealistic images for a user-controlled camera. In the absence of paired ground truth real images for the block world, we devise a training technique based on pseudo-ground truth and adversarial training. This stands in contrast to prior work on neural rendering for view synthesis, which requires ground truth images to estimate scene geometry and view-dependent appearance. In addition to camera trajectory, GANcraft allows user control over both scene semantics and output style. Experimental results with comparison to strong baselines show the effectiveness of GANcraft on this novel task of photorealistic 3D block world synthesis. The project website is available at https://nvlabs.github.io/GANcraft/.



Imagine a world where every Minecrafter is a 3D painter!



Advances in 2D image-to-image translation [3, 22, 50] have enabled users to paint photorealistic images by drawing simple sketches similar to those created in Microsoft Paint. Despite these innovations, creating a realistic 3D scene remains a painstaking task, out of the reach of most people. It requires years of expertise, professional software, a library of digital assets, and a lot of development time. In contrast, building 3D worlds with blocks, whether physical LEGOs or their digital counterpart, is so easy and intuitive that even a toddler can do it. Wouldn't it be great if we could build a simple 3D world made of blocks representing various materials (like the insets in Fig. 1), feed it to an algorithm, and receive a realistic-looking 3D world featuring tall green trees, ice-capped mountains, and the blue sea (like Fig. 1)? With such a method, we could perform world-to-world translation to convert the worlds of our imagination into reality. Needless to say, such an ability would have many applications, from entertainment and education to rapid prototyping for artists.



In this paper, we propose GANcraft, a method that produces realistic renderings of semantically-labeled 3D block worlds, such as those from Minecraft (www.minecraft.net). Minecraft, the best-selling video game of all time with over 200 million copies sold and over 120 million monthly users [2], is a sandbox video game in which a user can explore a procedurally-generated 3D world made up of blocks arranged on a regular grid, while modifying and building structures with blocks. Minecraft provides blocks representing various building materials: grass, dirt, water, sand, snow, etc. Each block is assigned a simple texture, and the game is known for its distinctive cartoonish look. While one might discount Minecraft as a simple game with simple mechanics, it is, in fact, a very popular 3D content creation tool. Minecrafters have faithfully recreated large cities and famous landmarks, including the Eiffel Tower! The block world representation is intuitive to manipulate, which makes it well-suited as the medium for our world-to-world translation task. We focus on generating natural landscapes, which have also been studied in several prior works on image-to-image translation [3, 50].



At first glance, generating a photorealistic 3D world from a semantic block world seems to be a matter of translating a sequence of projected 2D segmentation maps of the 3D block world, i.e., a direct application of image-to-image translation. This approach, however, immediately runs into several serious issues. First, obtaining paired ground truth training data of the 3D block world, segmentation labels, and corresponding real images is extremely costly, if not impossible. Second, existing image-to-image translation models [21, 50, 62, 72] do not generate consistent views [36]: each image is translated independently of the others.



While the recent world-consistent vid2vid (wc-vid2vid) work [36] overcomes the issue of view consistency, it requires paired ground truth 3D training data. Even the most recent neural rendering approaches based on neural radiance fields, such as NeRF [39], NSVF [31], and NeRF-W [37], require real images of a scene with associated camera parameters, and are best suited for view interpolation. As there is no paired 3D and ground truth real image data, as summarized in Table 1, none of the existing techniques can be applied to this new task directly. Comparing with them therefore requires ad hoc adaptations that bring their requirements as close to our setting as possible, e.g., training them on real segmentation maps instead of Minecraft segmentations.



In the absence of ground truth data, we propose a framework to train our model using pseudo-ground truth photorealistic images for sampled camera views. Our framework uses ideas from image-to-image translation and improves upon work in 3D view synthesis to produce view-consistent photorealistic renderings of input Minecraft worlds as shown in Fig. 1. Although we demonstrate our results using Minecraft, our method works with other 3D block world representations, such as voxels. We chose Minecraft because it is a popular platform available to a wide audience.



Our key contributions include:



• The novel task of producing view-consistent photorealistic renderings of user-created 3D semantic block worlds, or world-to-world translation, a 3D extension of image-to-image translation.



• A framework for training neural renderers in the absence of ground truth data. This is enabled by using pseudo-ground truth images generated by a pretrained image synthesis model (Section 3.1).



• A new neural rendering network architecture, trained with adversarial losses (Section 3.2), that extends recent work in 2D and 3D neural rendering [20, 31, 37, 39, 45] to produce state-of-the-art results and can be conditioned on a style image (Section 4).



2 Related Work

2D image-to-image translation. The GAN framework [16] has enabled multiple methods to successfully map an image from one domain to another with high fidelity, e.g., from input segmentation maps to photorealistic images. This task can be performed in the supervised setting [22, 26, 35, 50, 62, 73], where example pairs of corresponding images are available, as well as in the unsupervised setting [14, 21, 32, 33, 35, 54, 72], where only two sets of images are available. Methods operating in the supervised setting use stronger losses, such as the L1 or perceptual loss [23], in conjunction with the adversarial loss. As paired data is unavailable in the unsupervised setting, works typically rely on a shared latent space assumption [32] or cycle-consistency losses [72]. For a comprehensive overview of image-to-image translation methods, please refer to the survey of Liu et al. [34].



Our problem setting naturally falls into the unsupervised setting as we do not possess real-world images corresponding to the Minecraft 3D world. To facilitate learning a view-consistent mapping, we employ pseudo-ground truths during training, which are predicted by a pretrained supervised image-to-image translation method.



Pseudo-ground truths were first explored in prior work on self-training, or bootstrap learning [38, 67] (see https://ruder.io/semi-supervised/ for an overview). More recently, this technique has been adopted in several unsupervised domain adaptation works [13, 27, 56, 61, 65, 70, 74]. They use a deep learning model trained on the ‘source’ domain to obtain predictions on a new ‘target’ domain, treat these predictions as ground truth labels, or pseudo labels, and finetune the deep learning model on such self-labeled data.



In our problem setting, we have segmentation maps obtained from the Minecraft world but do not possess the corresponding real images. We use SPADE [50], a conditional GAN model trained to generate landscape images from input segmentation maps, to produce pseudo-ground truth images. This yields a pseudo pair: an input Minecraft segmentation mask and its corresponding pseudo-ground truth image. The pseudo pairs enable us to use stronger supervision, such as the L1, L2, and perceptual [23] losses, in our world-to-world translation framework, resulting in improved output image quality. The idea of using pretrained GAN models to generate training data has also been explored in the very recent works of Pan et al. [48] and Zhang et al. [71], which use a pretrained StyleGAN [24, 25] as a multi-view data generator to train an inverse graphics model.



3D neural rendering. A number of works have explored combining the strengths of the traditional graphics pipeline, such as 3D-aware projection, with the synthesis capabilities of neural networks to produce view-consistent outputs. By introducing differentiable 3D projection and using trainable layers that operate in the 3D and 2D feature space, several recent methods [4, 18, 43, 44, 57, 63] are able to model the geometry and appearance of 3D scenes from 2D images. Some works have successfully combined neural rendering with adversarial training [18, 43, 44, 45, 55], thereby removing the constraint of training images having to be posed and from the same scene. However, the under-constrained nature of the problem limited the application of these methods to single objects, synthetic data, or small-scale simple scenes. As shown later in Section 4, we find that adversarial training alone is not enough to produce good results in our setting. This is because our input scenes are larger and more complex, the available training data is highly diverse, and there are considerable gaps in the scene composition and camera pose distribution between the block world and the real images.



Most recently, NeRF [39] demonstrated state-of-the-art results in novel view synthesis by encoding the scene in the weights of a neural network that produces the volume density and view-dependent radiance at every spatial location. The remarkable synthesis ability of NeRF has inspired a large number of follow-up works which have tried to improve the output quality [31, 69], make it faster to train and evaluate [30, 31, 42, 52, 60], extend it to deformable objects [15, 28, 49, 51, 64], account for lighting [9, 6, 37, 58] and compositionality [17, 45, 47, 68], as well as add generative capabilities [11, 55, 45].



Most relevant to our work are NSVF [31], NeRF-W [37], and GIRAFFE [45]. NSVF [31] reduces the computational cost of NeRF by representing the scene as a set of voxel-bounded implicit fields organized in a sparse voxel octree, which is obtained by pruning an initially dense cuboid of voxels. NeRF-W [37] learns image-dependent appearance embeddings, allowing it to learn from unstructured photo collections and produce style-conditioned outputs. These works on novel view synthesis learn the geometry and appearance of scenes given ground truth posed images. In our setting, the problem is inverted: we are given coarse voxel geometry and segmentation labels as input, without any corresponding real images.



Similar to NSVF [31], we assign learnable features to each corner of the voxels to encode geometry and appearance. In contrast, we do not learn the 3D voxel structure of the scene from scratch, but instead implicitly refine the provided coarse input geometry (e.g. the shape and opacity of trees represented by blocky voxels) during the course of training. Prior work by Riegler et al. [53] similarly used a mesh obtained by multi-view stereo as coarse input geometry. Similar to NeRF-W [37], we use a style-conditioned network. This allows us to learn consistent geometry while accounting for the view inconsistency of SPADE [50]. Like neural point-based graphics [4] and GIRAFFE [45], we use differentiable projection to obtain features for image pixels, and then use a CNN to convert the 2D feature grid into an image. Like GIRAFFE [45], we use an adversarial loss in training. We, however, learn on large, complex scenes and produce higher-resolution outputs (1024×2048 original image size in Fig. 1, vs. 64×64 or 256×256 pixels in GIRAFFE), in which case the adversarial loss alone fails to produce good results.



3 Neural Rendering of Minecraft Worlds



Our goal is to convert a scene represented by semantically-labeled blocks (or voxels), such as the maps from Minecraft, into a photorealistic 3D scene that can be consistently rendered from arbitrary viewpoints (as shown in Fig. 1). In this paper, we focus on landscape scenes that are orders of magnitude larger than the single objects or scenes typically used in the training and evaluation of previous neural rendering works. In all of our experiments, we use voxel grids of 512×512×256 blocks (512×512 blocks horizontally and 256 blocks vertically). Given that each Minecraft block is considered to have a size of 1 cubic meter [1], each scene covers an area equivalent to 262,144 m² (65 acres, or the size of 32 soccer fields) in real life. At the same time, our model needs to learn details that are much finer than a single block, such as tree leaves, flowers, and grass, and to do so without supervision. As the input voxels and their labels already define the coarse geometry and semantic arrangement of the scene, it is necessary to respect and incorporate this prior information into the model. We first describe how we overcome the lack of paired training data by using pseudo-ground truths. Then, we present our novel sparse voxel-based neural renderer.



3.1 Generating pseudo-ground truth training data



The most straightforward way of training a neural rendering model is to utilize ground truth images with known camera poses. A simple L2 reconstruction loss is sufficient to produce good results in this case [31, 37, 39, 46, 66]. However, in our setting, ground truth real images are simply unavailable for user-generated block worlds from Minecraft.



An alternative route is to train our model in an unpaired, unsupervised fashion like CycleGAN [72] or MUNIT [21]. This would use an adversarial loss and regularization terms to translate Minecraft segmentations to real images. However, as shown in the ablation studies in Section 4, this setting does not produce good results for either prior methods or neural renderers. This can be attributed to the large domain gap between the blocky Minecraft world and the real world, as well as the label distribution shift between worlds.



To bridge the domain gap between the voxel world and our world, we supplement the training data with pseudo-ground truth that is generated on the fly. For each training iteration, we randomly sample a camera pose from the upper hemisphere and randomly choose a focal length. We then project the semantic labels of the voxels into the camera view to obtain a 2D semantic segmentation mask. The segmentation mask, along with a randomly sampled style code, is fed to a pretrained image-to-image translation network, SPADE [50] in our case, to obtain a photorealistic pseudo-ground truth image that has the same semantic layout as the camera view, as shown in the left part of Fig. 2. This enables us to apply reconstruction losses, such as the L2 and perceptual [23] losses, between the pseudo-ground truth and the rendered output from the same camera view, in addition to the adversarial loss. This significantly improves the results.
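The pseudo-ground truth step is simple enough to sketch in code. The following is a minimal illustration, assuming a hypothetical `project_semantics` helper that rasterizes voxel labels into a 2D label map for a camera and a pretrained `spade_generator`; these names, the style-code dimension, and the camera parameterization are not taken from the paper or its released code.

```python
import torch

def sample_pseudo_ground_truth(voxel_labels, project_semantics, spade_generator,
                               image_size=256, style_dim=256):
    """Sketch of one pseudo-ground truth sample (Sec. 3.1). All interfaces
    here are assumptions made for illustration."""
    # Randomly sample a camera on the upper hemisphere and a focal length.
    camera = {
        "elevation": torch.rand(1) * (torch.pi / 2),   # 0 = horizon, pi/2 = top-down
        "azimuth": torch.rand(1) * 2 * torch.pi,
        "focal": 0.5 + torch.rand(1),                  # arbitrary focal-length range
    }

    # Project the voxel semantics into the sampled view -> 2D segmentation mask.
    seg_map = project_semantics(voxel_labels, camera, image_size)   # (1, H, W) int64

    # A random style code yields a different plausible appearance each time.
    style = torch.randn(1, style_dim)

    # Pretrained SPADE turns (label map, style) into a photorealistic image
    # with the same semantic layout, which we treat as pseudo ground truth.
    with torch.no_grad():
        pseudo_gt = spade_generator(seg_map, style)                 # (1, 3, H, W)
    return camera, seg_map, style, pseudo_gt
```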



The generalizability of the SPADE model trained on large-scale datasets, combined with its photorealistic generation capability, helps reduce both the domain gap and the label distribution mismatch. Sample pseudo-pairs are shown in the right part of Fig. 2. While this provides effective supervision, it is not perfect, as can be seen especially in the last two columns in the right part of Fig. 2. The blockiness of Minecraft can produce unrealistic images with sharp geometry. Certain camera pose and style code combinations can also produce images with artifacts. We thus have to carefully balance the reconstruction and adversarial losses to ensure successful training of the neural renderer.



3.2 Sparse voxel-based volumetric neural renderer



Voxel-bounded neural radiance fields. Let $K$ be the number of occupied blocks in a Minecraft world, which can also be represented by a sparse voxel grid with $K$ non-empty voxels, $\mathcal{V} = \{V_1, \ldots, V_K\}$. Each voxel is assigned a semantic label from $\{l_1, \ldots, l_K\}$. We learn a neural radiance field per voxel. The Minecraft world is then represented by the union of voxel-bounded neural radiance fields, given by



$$F(\mathbf{p}, \mathbf{z}) = \begin{cases} F_i(\mathbf{p}, \mathbf{z}), & \text{if } \mathbf{p} \in V_i,\ i \in \{1, \cdots, K\}, \\ (\mathbf{0}, 0), & \text{otherwise,} \end{cases} \qquad (1)$$

where $F$ is the radiance field of the whole scene and $F_i$ is the radiance field bounded by $V_i$. Querying a location in the neural radiance field returns a feature vector (or a color in prior work [31, 37, 39]) and a density value. At locations where no block exists, we have the null feature vector $\mathbf{0}$ and zero density. To model diverse appearances of the same scene, e.g. day and night, the radiance fields are conditioned on a style code $\mathbf{z}$.
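As a concrete reading of Eq. (1), the sketch below evaluates the union of per-voxel fields for a batch of query points, returning the null feature and zero density outside occupied voxels. The dense occupancy/label grids, the 64-dimensional feature size, and the `evaluate_Fi` callable standing in for the shared MLP are assumptions made for illustration, not the paper's implementation.

```python
import torch

def query_radiance_field(p, z, occupancy, labels, evaluate_Fi, feat_dim=64):
    """Minimal sketch of Eq. (1): the scene is a union of per-voxel radiance
    fields, and queries outside occupied voxels return a zero feature and
    zero density.

    Assumed interfaces: `p` is an (N, 3) tensor of points in voxel-grid
    coordinates with unit voxels, `occupancy` a bool grid, `labels` an int
    grid of the same shape, and `evaluate_Fi(points, labels, z)` the shared
    per-voxel field (G_theta below).
    """
    idx = torch.floor(p).long()                            # voxel containing each point
    inside = ((idx >= 0).all(dim=-1)
              & (idx[:, 0] < occupancy.shape[0])
              & (idx[:, 1] < occupancy.shape[1])
              & (idx[:, 2] < occupancy.shape[2]))
    occupied = torch.zeros_like(inside)
    occupied[inside] = occupancy[idx[inside, 0], idx[inside, 1], idx[inside, 2]]

    feat = p.new_zeros(p.shape[0], feat_dim)               # null feature outside blocks
    density = p.new_zeros(p.shape[0])                      # zero density outside blocks
    if occupied.any():
        lab = labels[idx[occupied, 0], idx[occupied, 1], idx[occupied, 2]]
        feat[occupied], density[occupied] = evaluate_Fi(p[occupied], lab, z)
    return feat, density
```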



The voxel-bounded neural radiance field $F_i$ is given by



$$F_i(\mathbf{p}, \mathbf{z}) = G_\theta(g_i(\mathbf{p}), l_i, \mathbf{z}) = \big(\mathbf{c}(\mathbf{p}, l(\mathbf{p}), \mathbf{z}),\ \sigma(\mathbf{p}, l(\mathbf{p}))\big),$$

where $g_i(\mathbf{p})$ is the location code at $\mathbf{p}$ and $l_i \equiv l(\mathbf{p})$ is shorthand for the label of the voxel that $\mathbf{p}$ belongs to. The multi-layer perceptron (MLP) $G_\theta$ predicts the feature $\mathbf{c}$ and volume density $\sigma$ at location $\mathbf{p}$. We note that $G_\theta$ is shared among all voxels. Inspired by NeRF-W [37], $\mathbf{c}$ additionally depends on the style code, while the density $\sigma$ does not. To obtain the location code, we first assign a learnable feature vector to each of the eight vertices of a voxel $V_i$. The location code at $\mathbf{p}$, $g_i(\mathbf{p})$, is then derived through trilinear interpolation. Here, we assume that each voxel has a size of 1×1×1 and that the coordinate axes are aligned with the voxel grid axes. Vertices and their feature vectors are shared between adjacent voxels. This allows for a smooth transition of features when crossing voxel boundaries, preventing discontinuities in the output. We compute Fourier features from $g_i(\mathbf{p})$, similar to NSVF [31], and also append the voxel class label. Our method can be interpreted as a generalization of NSVF [31] with style and semantic label conditioning.
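A minimal reference implementation of the location code and its positional encoding is given below. It assumes, purely for illustration, that the shared learnable corner features live in a dense (X+1, Y+1, Z+1, C) grid; GANcraft only needs features for the corners of occupied voxels.

```python
import torch

def location_code(p, corner_features):
    """Trilinearly interpolate the eight corner features of the unit voxel
    containing each point p (shape (N, 3)) to obtain g_i(p)."""
    base = torch.floor(p).long()                     # voxel's lower corner
    frac = p - base.float()                          # position inside the unit voxel
    code = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Trilinear weight for this corner.
                w = ((frac[:, 0] if dx else 1 - frac[:, 0])
                     * (frac[:, 1] if dy else 1 - frac[:, 1])
                     * (frac[:, 2] if dz else 1 - frac[:, 2]))
                corner = corner_features[base[:, 0] + dx,
                                         base[:, 1] + dy,
                                         base[:, 2] + dz]   # (N, C)
                code = code + w.unsqueeze(-1) * corner
    return code


def fourier_features(code, num_freqs=4):
    """NeRF/NSVF-style positional encoding applied to the location code."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi
    angles = code.unsqueeze(-1) * freqs              # (N, C, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                 # (N, C * 2 * num_freqs)
```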



Neural sky dome. The sky is an indispensable part of photorealistic landscape scenes. However, as it is physically located much farther away than the other objects, it is inefficient to represent it with a layer of voxels. In GANcraft, we assume that the sky is located infinitely far away (no parallax). Thus, its appearance depends only on the viewing direction. The same assumption is commonly used in computer graphics techniques such as environment mapping [8]. We use an MLP $H_\phi$ to map the ray direction $\mathbf{v}$ to a sky color, or feature, $\mathbf{c}_\mathrm{sky} \equiv H_\phi(\mathbf{v}, \mathbf{z})$, conditioned on the style code $\mathbf{z}$. This representation can be viewed as covering the whole scene with an infinitely large sky dome.
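A toy version of the sky MLP might look like the sketch below; the layer widths, the style-code dimension, and the output feature size are illustrative guesses rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SkyDome(nn.Module):
    """Toy H_phi(v, z): maps a view direction and style code to a sky feature."""

    def __init__(self, style_dim=256, feat_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + style_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, view_dir, style):
        # view_dir: (R, 3) ray directions, style: (1, style_dim) style code.
        view_dir = nn.functional.normalize(view_dir, dim=-1)  # direction only, no parallax
        style = style.expand(view_dir.shape[0], -1)
        return self.net(torch.cat([view_dir, style], dim=-1))
```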



Volumetric rendering. Here, we describe how a scene represented by the above-mentioned neural radiance fields and sky dome can be converted into 2D feature maps via volumetric rendering. Under a perspective camera model, each pixel in the image corresponds to a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{v}$, originating from the center of projection $\mathbf{o}$ and advancing in direction $\mathbf{v}$. The ray travels through the radiance field while accumulating features and transmittance,



$$C(\mathbf{r}, \mathbf{z}) = \int_0^{+\infty} T(t)\,\sigma\big(\mathbf{r}(t), l(\mathbf{r}(t))\big)\,\mathbf{c}\big(\mathbf{r}(t), l(\mathbf{r}(t)), \mathbf{z}\big)\,dt + T(+\infty)\,\mathbf{c}_\mathrm{sky}(\mathbf{v}, \mathbf{z}), \qquad (2)$$

$$\text{where}\quad T(t) = \exp\left(-\int_0^t \sigma(\mathbf{r}(s))\,ds\right). \qquad (3)$$

$C(\mathbf{r}, \mathbf{z})$ denotes the accumulated feature of ray $\mathbf{r}$, and $T(t)$ denotes the accumulated transmittance after the ray has traveled a distance of $t$. As the radiance field is bounded by a finite number of voxels, the ray will eventually exit the voxels and hit the sky dome. We thus consider the sky dome as the last data point on the ray, which is totally opaque. This is realized by the last term in Eq. (2). The above integral can be approximated using discrete samples and the quadrature rule, a technique popularized by NeRF [39]. Please refer to NeRF [39] or our supplementary material for the full equations.
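For reference, a minimal discrete version of Eqs. (2) and (3) using the NeRF-style quadrature rule is sketched below; the tensor shapes are assumptions, and the paper's exact discretization is given in its supplementary material.

```python
import torch

def composite_ray(sigmas, feats, deltas, sky_feat):
    """Alpha-composite per-sample features along rays and hand the leftover
    transmittance to the sky feature.

    Assumed shapes: sigmas (R, S), feats (R, S, C), deltas (R, S) sample
    spacings along each ray, sky_feat (R, C).
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)                # per-sample opacity
    # Transmittance reaching each sample: cumulative product of (1 - alpha).
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas], dim=1), dim=1)
    weights = trans[:, :-1] * alphas                          # contribution of each sample
    accumulated = (weights.unsqueeze(-1) * feats).sum(dim=1)  # volume-rendered feature
    leftover = trans[:, -1:]                                  # T(+inf): what reaches the sky
    return accumulated + leftover * sky_feat, leftover
```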



We use the stratified sampling technique from NSVF [31] to randomly sample valid (voxel-bounded) points along the ray. To improve efficiency, we truncate the ray so that it stops after a certain accumulated distance through the valid region is reached. We regularize the truncated rays to encourage their accumulated opacities to saturate before reaching the maximum distance. We adopt a modified Bresenham method [5] for sampling valid points, which has a very low complexity of O(N), where N is the longest dimension of the voxel grid. Details are in the supplementary material.
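The sketch below shows one way to draw stratified samples from the voxel-bounded portion of a single ray while truncating the accumulated valid distance. It assumes the (t_enter, t_exit) intervals of occupied voxels have already been produced by the voxel traversal and that at least one interval exists; the sample count, truncation distance, and interval representation are illustrative, not the paper's.

```python
import torch

def stratified_valid_samples(intervals, num_samples=24, max_dist=8.0):
    """Stratified sampling over the union of voxel-bounded ray segments.

    `intervals` is an (M, 2) tensor of (t_enter, t_exit) spans where the ray
    crosses occupied voxels, sorted by t and assumed non-empty.
    """
    lengths = (intervals[:, 1] - intervals[:, 0]).clamp(min=0.0)
    # Truncate the ray: keep only the first `max_dist` of accumulated valid distance.
    cum = torch.cumsum(lengths, dim=0)
    lengths = torch.where(cum <= max_dist, lengths,
                          (max_dist - (cum - lengths)).clamp(min=0.0))
    total = float(lengths.sum())

    # Stratified positions in the accumulated-length parameterization.
    u = (torch.arange(num_samples) + torch.rand(num_samples)) / num_samples
    s = u * total

    # Map each accumulated distance back to a ray parameter t inside the intervals.
    ends = torch.cumsum(lengths, dim=0)
    starts = ends - lengths
    idx = torch.searchsorted(ends, s.clamp(max=total - 1e-6))
    return intervals[idx, 0] + (s - starts[idx])
```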



Hybrid neural rendering architecture. Prior works [31, 37, 39] directly produce images by accumulating colors using the volumetric rendering scheme described above instead of accumulating features. Unlike them, we divide rendering into two parts: 1) We perform volumetric rendering with an MLP to produce a feature vector per pixel instead of an RGB image, and 2) We employ a CNN to convert the per-pixel feature map to a final RGB image of the same size. The overall framework is shown in Fig. 3. We perform activation modulation [20, 50] conditioned on the input style code for both the MLP and CNN. The individual networks are described in detail in the supplementary.



Apart from improving the output image quality as shown in Section 4, this two-stage design also helps reduce the computational and memory footprint of rendering. The MLP modeling the 3D radiance field is evaluated on a per-sample basis, while the image-space CNN is only evaluated after the multiple samples along a ray are merged into a single pixel. The number of queries to the MLP scales linearly with the output height, width, and number of points sampled per ray (24 in our case), while the size of the feature map depends only on the output height and width. However, unlike the MLP, which operates before blending, the image-space CNN is not intrinsically view-consistent. We thus use a shallow CNN with a receptive field of only 9×9 pixels to constrain its scope to local manipulations. A similar idea of combining volumetric rendering and image-space rendering has been used in GIRAFFE [45]. Unlike us, however, GIRAFFE also relies on the CNN to upsample a low-resolution 16×16 feature map.
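A minimal sketch of this second stage is shown below: four 3×3 convolutions give exactly a 9×9 receptive field and map the volume-rendered feature map to RGB. The channel counts and activations are illustrative assumptions, and the paper's actual CNN additionally applies style-conditioned modulation, which is omitted here.

```python
import torch.nn as nn

class FeatureToImageCNN(nn.Module):
    """Shallow image-space renderer: per-pixel features -> RGB.

    Four 3x3 convolutions stack to a 9x9 receptive field, keeping the CNN's
    influence local so view consistency is largely preserved.
    """

    def __init__(self, feat_dim=64, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, feat_map):          # feat_map: (B, feat_dim, H, W)
        return self.net(feat_map)         # RGB in [-1, 1]
```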



Losses and regularizers. We train our model with both reconstruction and adversarial losses. The reconstruction loss is applied between the predicted images and the corresponding pseudo-ground truth images; we use a combination of the perceptual [23], L1, and L2 losses. For the GAN loss, we treat the predicted images as ‘fake’, and both the real images and the pseudo-ground truth images as ‘real’. We use a discriminator conditioned on the semantic segmentation maps, based on Liu et al. [35] and Schönfeld et al. [54], and the hinge loss [29] as the GAN training objective. Following previous work on multimodal image synthesis [3, 21, 73], we also include a style encoder that predicts the posterior distribution of the style code given a pseudo-ground truth image. The reconstruction loss, in conjunction with the style encoder, makes it possible to control the appearance of the output image with a style image.



As mentioned earlier, we truncate rays during volumetric rendering. To avoid artifacts due to the truncation, we apply an opacity regularization term on the truncated rays, $\mathcal{L}_\mathrm{opacity} = \sum_{\mathbf{r} \in \mathbf{R}_\mathrm{trunc}} T_\mathrm{out}(\mathbf{r})$. This discourages leftover transmittance after a ray reaches the truncation distance.
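Putting these terms together, the generator-side objective could be assembled as in the sketch below. The loss weights, the `perceptual_fn` callable (e.g. a frozen VGG feature loss), and the exact hinge formulation are assumptions rather than the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred, pseudo_gt, d_fake_logits, leftover_trans, perceptual_fn,
                   w_rec=1.0, w_perc=1.0, w_gan=1.0, w_opa=1.0):
    """Sketch of the full generator objective: reconstruction against the
    pseudo ground truth, a hinge GAN term, and the opacity regularizer."""
    # L1 + L2 reconstruction and perceptual losses against the pseudo ground truth.
    rec = F.l1_loss(pred, pseudo_gt) + F.mse_loss(pred, pseudo_gt)
    perc = perceptual_fn(pred, pseudo_gt)
    # Hinge GAN loss, generator side: push discriminator logits on fakes up.
    gan = -d_fake_logits.mean()
    # L_opacity: penalize transmittance left over when a truncated ray ends,
    # so opacity saturates before the truncation distance is reached.
    opacity = leftover_trans.mean()
    return w_rec * rec + w_perc * perc + w_gan * gan + w_opa * opacity
```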



4 Experiments

The previous section described how we obtain pseudo-ground truths in the absence of paired Minecraft-real training data, as well as the architecture of our neural renderer. Here, we validate our framework by comparing against prior work on multiple large and diverse Minecraft worlds.



Datasets. We collected a dataset of ~1M landscape images from the internet, each with a shorter side of at least 512 pixels. For each image, we obtained 182-class COCO-Stuff [10] segmentation labels using DeepLabV2 [12, 41]. This formed our training set of paired real segmentation maps and images. We set aside 5000 images as a test set. We generated 5 different Minecraft worlds of 512×512×256 blocks each. We sampled worlds with various compositions of water, sand, forests, and snow to show that our method works correctly under significant label distribution shifts.



Baselines. We compare against the following, which are representative methods under different data availability regimes.



• MUNIT [21]. This is an image-to-image translation method trainable in the unpaired, or unsupervised setting. Unlike CycleGAN [72] and UNIT [32], MUNIT can learn multimodal translations. We learn to translate Minecraft segmentation maps to real images.



• SPADE [50]. This is an image-to-image translation method that is trained in the paired ground truth, or supervised setting. We train this by translating real segmentation maps to corresponding images and test it on Minecraft segmentations.



• wc-vid2vid [36]. Unlike the above two methods, this can generate a sequence of images that are view-consistent. wc-vid2vid projects the pixels from previous frames to the next frame to generate a guidance map. This serves as a form of memory of the previously generated frames. This method also requires paired ground truth data, as well as the 3D point clouds for each output frame. We train this to translate real segmentation maps to real images, while using the block world voxel surfaces as the 3D geometry.



• NSVF-W [31, 37]. We combine the strengths of two recent works on neural rendering, NSVF [31], and NeRF-W [37], to create a strong baseline. NSVF represents the world as voxel-bounded radiance fields, and can be modified to accept an input voxel world, just like our method. NeRF-W is able to learn from unstructured image collections with variations in color, lighting, and occlusions, making it well-suited for learning from our pseudo-ground truths. Combining the style-conditioned MLP generator from NeRF-W with the voxel-based input representation of NSVF, we obtain NSVF-W. This resembles the neural renderer used by us, with the omission of the image-space CNN. As these methods also require paired ground truth, we train NSVF-W using pseudo-ground truths generated by the pretrained SPADE model.



MUNIT, SPADE, and wc-vid2vid use perceptual and adversarial losses during training, while NSVF, NeRF-W, and thus NSVF-W use the L2 loss. Details are in the supplementary material.



Implementation details. We train our model at an output resolution of 256×256 pixels. Each model is trained on 8 NVIDIA V100 GPUs with 32 GB of memory each. This enables us to use a batch size of 8 with 24 points sampled per camera ray. Each model is trained for 250k iterations, which takes approximately 4 days. All baselines are also trained for an equivalent amount of time. Additional details are available in the supplementary material.



Evaluation metrics. We use both quantitative and qualitative metrics to measure the quality of our outputs.



• Fréchet Inception Distance (FID) [19] and Kernel Inception Distance (KID) [7]. We use FID and KID to measure the distance between the distributions of generated and real images, using Inception-v3 [59] features. We generate 1000 images for each of the 5 worlds from arbitrarily sampled camera viewpoints and style codes, for a total of 5000 images. For a fair comparison, we generate outputs from each method using the same viewpoint and style code pairs. We use a held-out set of 5000 real landscape images to compute the metrics. For both metrics, a lower value indicates better image quality.
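For reference, FID can be computed from two sets of Inception features with the generic formula below (fit a Gaussian to each feature set and take the Fréchet distance between them); this is a standard implementation sketch, not the paper's exact evaluation script.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    """Generic FID between two (num_images, feature_dim) arrays of
    Inception-v3 activations (e.g. 2048-D pooled features)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Frechet distance between the two fitted Gaussians.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real          # discard tiny numerical imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```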



• Human preference score. Using Amazon Mechanical Turk (AMT), we perform a subjective visual test with top-qualified turkers to gauge the relative quality of the generated videos. We ask turkers to choose 1) the more temporally consistent video, and 2) the video with better overall realism. For each of the two questions, a turker is shown two videos synthesized by two different methods and asked to choose the superior one according to the criterion. We generate 64 videos per world, for a total of 320 per method, and each comparison is evaluated by 3 workers.



Main results. Fig. 4 shows output videos generated by different methods. Each row is a unique world, generated using the same style-conditioning image for all methods. We can observe that our outputs are more realistic and view-consistent when compared to the baselines. MUNIT [21] and SPADE [50] demonstrate a lot of flickering as they generate one image at a time, without any memory of past outputs. Further, MUNIT also fails to learn the correct mapping of segmentation labels to textures as it does not use paired supervision. While wc-vid2vid [36] is more view-consistent, it fails for large motions as it incrementally inpaints newly explored parts of the world. NSVF-W [31, 37] and GANcraft are both inherently view-consistent due to their use of volumetric rendering. However, due to the lack of a CNN renderer and the use of the L2 loss, NSVF-W produces dull and unrealistic outputs with artifacts. The use of an adversarial loss is key to ensuring vivid and realistic results, and this is further reinforced in the ablations presented below. Our method is also capable of generating higher-resolution outputs, as shown in Fig. 1, by sampling more rays.



We sample novel camera views from each world and compute the FID and KID against a set of held-out real images. As seen in Table 2, our method achieves FID and KID close to those of SPADE, which is a very strong image-to-image translation method, while beating the other baselines. Note that wc-vid2vid uses SPADE to generate the output for the first camera view in a sequence and is thus excluded from this comparison. Further, as summarized in Table 3, users consistently preferred our method, choosing its predictions as the more view-consistent and realistic videos. More high-resolution results and comparisons, as well as some failure cases, are available in the supplementary material.



Ablations. We train ablated versions of our full model on one Minecraft world due to computational constraints. We show example outputs from them in Fig. 5. Using no pseudo-ground truth at all and training with just the GAN loss produces unrealistic outputs, similar to MUNIT [21]. Directly producing images from volumetric rendering, without using a CNN, results in a lack of fine detail. Compared to the full model, skipping the GAN loss on real images produces duller images, and skipping the GAN loss altogether produces duller, blurrier images resembling NSVF-W outputs. Qualitative analysis is available in the supplementary.



5 Conclusion

We introduced the novel task of world-to-world translation and proposed GANcraft, a method to convert block worlds to realistic-looking worlds. We showed that pseudo-ground truths generated by a 2D image-to-image translation network provide an effective means of supervision in the absence of real paired data. Our hybrid neural renderer, trained on both real landscape images and pseudo-ground truths with adversarial losses, outperformed strong baselines.



There still remain a few exciting avenues for improvement, including learning a smoother geometry in spite of the coarse input geometry, and using non-voxel inputs such as meshes and point clouds. While our method is currently trained on a per-world basis, we hope future work can enable feed-forward generation on novel worlds.



Acknowledgements. We thank Rev Lebaredian for challenging us to work on this interesting problem. We thank Jan Kautz, Sanja Fidler, Ting-Chun Wang, Xun Huang, Xihui Liu, Guandao Yang, and Eric Haines for their feedback during the course of developing the method.
