RetinaGS: Scalable Training for Dense Scene Rendering with Billion-Scale 3D Gaussians

MThreads AI



Our approach enables training a native large-scale 3D Gaussian Splatting (3DGS) model with over 1 billion Gaussian primitives. The videos showcase reconstruction results on a 2.7 km² city dataset with 140,000 images; even when zooming in on a specific location, the details remain rich and realistic.

Abstract

In this work, we explore the possibility of training high-parameter 3D Gaussian Splatting (3DGS) models on large-scale, high-resolution datasets. We design a general model-parallel training method for 3DGS, named RetinaGS, which uses a proper rendering equation and can be applied to any scene and any distribution of Gaussian primitives. This lets us explore the scaling behavior of 3DGS across primitive counts and training resolutions that were previously difficult to reach, and it surpasses prior state-of-the-art reconstruction quality. We observe a clear positive trend: visual quality increases with the number of primitives. We also present the first attempt at training a 3DGS model with more than one billion primitives, on the full MatrixCity dataset, which attains promising visual quality.

Method


Datasets of different sizes demand different levels of computational power and different numbers of 3DGS primitives. Larger, higher-resolution datasets can no longer be trained on a single GPU, which limits the pursuit of scale and fidelity in 3DGS.

[Figure: RetinaGS distributed training pipeline]

The training pipeline is shown in the figure above. We denote each subset as a sub-model and assign it to a separate GPU, while a central manager is responsible for managing the subspaces S_n. The manager also parses incoming rendering requests and distributes rendering tasks to the relevant sub-models. The computational results from all sub-models, T_k and C_k, are sent back to the central manager. These results can be represented as 2D maps at the same resolution as the target image, so they consume only minimal communication bandwidth. The central manager then completes the final rendering, computes the loss, and sends gradients back to each sub-model for parameter updates.
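To make the merge step concrete, here is a minimal PyTorch sketch of what the central manager computes, assuming the sub-model results arrive as (T_k, C_k) maps already ordered front-to-back along the rays and using a simple L1 loss; the names and the loss choice are illustrative assumptions, not the exact implementation.

import torch

def merge_and_backprop(T_list, C_list, target):
    # T_list: per-sub-model [H, W] transmittance maps, ordered
    #         front-to-back along the rays (assumption).
    # C_list: per-sub-model [H, W, 3] partial color maps.
    # target: [H, W, 3] ground-truth image.
    color = torch.zeros_like(target)
    trans = torch.ones_like(T_list[0]).unsqueeze(-1)  # light surviving so far

    # Composite the subspace segments in depth order:
    #   C = sum_k (prod_{m<k} T_m) * C_k
    for T_k, C_k in zip(T_list, C_list):
        color = color + trans * C_k
        trans = trans * T_k.unsqueeze(-1)

    # Loss on the merged image; autograd carries gradients back through
    # each (T_k, C_k) map to the corresponding sub-model's Gaussians
    # (the cross-GPU transport of those gradients is elided here).
    loss = (color - target).abs().mean()
    loss.backward()
    return loss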

Analysis


Performance vs. Gaussian Primitives

[Figure: PSNR vs. number of Gaussian primitives]

Our distributed training approach enables the use of a large number of Gaussian primitives. We observe a strong positive correlation between the number of Gaussian primitives and the final model's PSNR. At similar primitive counts, our PSNR closely matches that of 3DGS trained on a single GPU; beyond that point, we can achieve higher PSNR simply by increasing the number of primitives.



Many-to-Many Relationship Between Rays and Traversed Subspaces

[Figure: frequency histogram of the number of subspaces hit by each ray]

Previous methods [2,3] skip the merging step between subspaces, implicitly assuming that a single ray is influenced primarily by one subspace and that Gaussians within that subspace contribute most of its color. This assumption holds for bird's-eye-view datasets: because perspective variation is limited, rays almost always pass downward through one, or at most two, subspaces. The figure above shows the frequency histogram of the number of subspaces hit by each ray. In more general datasets, such as street-view and indoor datasets, rays often intersect multiple subspaces, which undermines the basic assumption of previous works. Our distributed framework is computationally equivalent to the original 3DGS and relies on no assumptions about data distribution or viewpoint, making it a more general approach.
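The equivalence follows from factoring the standard 3DGS blending equation across subspaces. Below is a sketch of that factorization, using the T_k and C_k notation from the Method section and assuming each primitive's contribution is assigned to exactly one subspace along the ray:

% Standard 3DGS front-to-back blending over all primitives i on a ray,
% grouped by the subspaces S_1, ..., S_K the ray traverses in depth order:
C \;=\; \sum_{i} c_i\,\alpha_i \prod_{j<i} (1-\alpha_j)
  \;=\; \sum_{k=1}^{K} \Bigl(\prod_{m<k} T_m\Bigr) C_k,
\quad\text{where}\quad
C_k = \sum_{i \in S_k} c_i\,\alpha_i \prod_{\substack{j \in S_k \\ j<i}} (1-\alpha_j),
\qquad
T_k = \prod_{i \in S_k} (1-\alpha_i).

Because the factorization is an identity, merging the (T_k, C_k) maps on the central manager reproduces the single-GPU result up to floating-point ordering, no matter how many subspaces a ray crosses.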

The following video demonstrates the visual results from both a bird's-eye view and a street view in an urban scene.



Effectiveness of Partition Strategies

The figure below shows merge tests with only two subspaces for different partition approaches. The cell-independent methods used in previous works [2,3] can cause rendering artifacts at subspace boundaries. Our approach renders the boundaries seamlessly.

[Figure: boundary rendering in two-subspace merge tests under different partition strategies]
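As a minimal illustration of the difference, the sketch below contrasts a cell-independent per-pixel stitch with the transmittance-weighted composite used here, for rays passing first through subspace 1 and then subspace 2; all names are hypothetical.

import torch

def stitch_cell_independent(C1, C2, mask):
    # Each pixel is taken from a single subspace's independent render
    # (in the spirit of [2,3]), so colors can jump along the
    # boundary indicated by the [H, W, 1] boolean `mask`.
    return torch.where(mask, C1, C2)

def merge_exact(C1, C2, T1):
    # C1, C2: [H, W, 3] partial colors; T1: [H, W, 1] transmittance
    # of the front subspace. Light surviving subspace 1 illuminates
    # subspace 2, so the boundary is seamless by construction.
    return C1 + T1 * C2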

Results

Close-up Inspection of Indoor Objects

In the real world, the closer you are to an object, the more detail you can observe. However, the memory limit of a single GPU restricts the number of rendering units (Gaussian primitives), making it difficult to inspect objects closely at high detail. Distributed training significantly lifts this limitation.

[Image comparisons: Ours (32.19M and 47.59M primitives) vs. 3DGS (1.09M and 2.65M primitives)]

Visual Quality vs. Number of Gaussians

Urban scenes require a vast number of Gaussian primitives, more than a single GPU can handle.

[Image comparisons: 100M primitives trained on 8 GPUs vs. 1B primitives trained on 64 GPUs]

BibTeX

@article{li2024retinags,
  title={RetinaGS: Scalable Training for Dense Scene Rendering with Billion-Scale 3D Gaussians},
  author={Li, Bingling and Chen, Shengyi and Wang, Luchao and He, Kaimin and Yan, Sijie and Xiong, Yuanjun},
  journal={arXiv preprint arXiv:2406.11836},
  year={2024}
}

References

[1] Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics.

[2] Lin, J., Li, Z., Tang, X., Liu, J., Liu, S., Liu, J., Lu, Y., Wu, X., Xu, S., Yan, Y., & Yang, W. (2024). VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction. CVPR.

[3] Liu, Y., Guan, H., Luo, C., Fan, L., Peng, J., & Zhang, Z. (2024). CityGaussian: Real-Time High-Quality Large-Scale Scene Rendering with Gaussians. ECCV.