FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts

Tongyuan Bai1 Wangyuanfan Bai1 Dong Chen1 Tieru Wu1,3 Manyi Li2 Rui Ma1,3,*

1 School of Artificial Intelligence, Jilin University
2 School of Software, Shandong University
3 Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China
* Corresponding author



We introduce FreeScene, a user-friendly controllable indoor scene synthesis framework that allows convenient and effective control with free-form user inputs, including text and/or different types of images (top-view diagram on the left, realistic photograph in the middle, sketch on the right). Leveraging a multimodal agent to extract a partial graph prior from the user inputs, FreeScene then generates a complete and plausible scene that conforms to the graph via a generative diffusion model.

[Paper]      [Code]      [BibTeX]

Abstract

Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow coarse language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable knowledge and a cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs, including text descriptions and/or reference images, allowing users to express versatile design intentions. The user inputs are analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. MG-DiT not only excels at preserving graph structure but also applies broadly to various tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in both generation quality and controllability across a range of applications.

Approach

Given a text description, a reference image, or both, a carefully designed agent called the Graph Designer analyzes the objects and their spatial relationships, constructing a partial graph prior that captures the object categories and their pairwise relations. Next, leveraging the proposed MG-DiT, we apply a constrained sampling method that preserves the integrity of the graph prior throughout the noising and denoising processes while generating the remaining parts of the scene. Finally, we obtain the complete room layout and perform object retrieval from the 3D-FUTURE dataset, selecting for each object the model in the same category whose OpenCLIP feature is most similar to the feature derived from the generated fVQ-VAE indices.
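The overall flow can be summarized in a pseudocode-level sketch. All names below (GraphDesigner, MGDiT, constrained_sample, retrieve_from_3d_future, etc.) are hypothetical placeholders used only to illustrate the three stages described above; they are not the released API.

```python
# Pseudocode-level sketch of the FreeScene pipeline.
# All class/function names are hypothetical placeholders, not the released API.

def synthesize_scene(text=None, image=None, room_type="bedroom", num_steps=1000):
    # 1. Graph Designer: a VLM-based agent turns free-form inputs into a
    #    partial scene graph (object categories + pairwise spatial relations).
    designer = GraphDesigner(vlm="any-capable-VLM")
    partial_graph = designer.extract(text=text, image=image)

    # 2. MG-DiT with constrained sampling: nodes and edges fixed by the partial
    #    graph are kept intact (re-noised to the current timestep) at every
    #    denoising step, while the remaining objects, relations, bounding boxes
    #    and fVQ-VAE indices are generated.
    model = MGDiT.load_pretrained(room_type)
    layout = model.constrained_sample(
        condition_text=text,
        graph_prior=partial_graph,
        steps=num_steps,
    )  # -> per-object category, bounding box, fVQ-VAE shape indices

    # 3. Retrieval: for each generated object, pick the 3D-FUTURE model of the
    #    same category whose OpenCLIP feature is closest to the feature derived
    #    from the generated fVQ-VAE indices.
    return [retrieve_from_3d_future(obj.category, obj.clip_feature)
            for obj in layout.objects]
```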



Graph Designer: The Graph Designer uses one-shot Chain-of-Thought (CoT) prompting to ensure the accuracy of the extracted objects and relations. It then parses and preprocesses the VLM responses to ensure compatibility with the MG-DiT inputs.
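As a concrete illustration, the sketch below shows what one-shot CoT prompting and response preprocessing could look like. The prompt text, the call_vlm helper, and the truncated category/relation vocabularies are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch of the Graph Designer stage, assuming the VLM is asked
# to end its response with a JSON scene graph. The prompt, `call_vlm`, and the
# truncated vocabularies below are hypothetical.
import json

CATEGORIES = {"bed", "nightstand", "wardrobe", "desk", "chair", "ceiling_lamp"}
RELATIONS = {"left of", "right of", "in front of", "behind", "close by", "above"}

COT_PROMPT = """You are an interior-design assistant.
Step 1: list every furniture object visible or mentioned in the input.
Step 2: reason step by step about their pairwise spatial relationships.
Step 3: output only a JSON object of the form
{"nodes": ["bed", ...], "edges": [[0, 1, "left of"], ...]}.
Worked example: ...  (the single one-shot CoT exemplar goes here)
"""

def design_graph(text=None, image=None):
    response = call_vlm(prompt=COT_PROMPT, text=text, image=image)  # placeholder VLM call

    # Keep only the final JSON block of the (possibly verbose) CoT response.
    raw = json.loads(response[response.index("{"): response.rindex("}") + 1])

    # Preprocess: drop categories/relations outside the vocabulary MG-DiT expects,
    # and remap edge endpoints to the surviving node indices.
    keep = [i for i, n in enumerate(raw["nodes"]) if n in CATEGORIES]
    remap = {old: new for new, old in enumerate(keep)}
    nodes = [raw["nodes"][i] for i in keep]
    edges = [(remap[i], remap[j], r) for i, j, r in raw["edges"]
             if i in remap and j in remap and r in RELATIONS]
    return {"nodes": nodes, "edges": edges}
```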
Mixed Graph Diffusion Transformer: MG-DiT conditions on text and timesteps to jointly denoise continuous bounding-box features together with discrete graph and fVQ-VAE features. The network consists of input and output processing layers with a stack of Mixed Graph Transformer Blocks in between, as illustrated.
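For intuition, here is a rough PyTorch-style sketch of a single transformer block under this conditioning scheme. The AdaLN-style modulation, the fusion of each object's category/box/shape features into one token, and the omission of edge-token denoising and the discrete-diffusion transitions are simplifications and assumptions on our part, not the paper's exact block design.

```python
# Rough sketch of one Mixed Graph Transformer Block, only to illustrate joint
# conditioning on text + timestep. Many details of MG-DiT are omitted or assumed.
import torch
import torch.nn as nn

class MixedGraphTransformerBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # AdaLN-style modulation from the text + timestep conditioning vector
        # (an assumption; the actual conditioning mechanism may differ).
        self.ada = nn.Linear(dim, 4 * dim)

    def forward(self, node_tokens, cond):
        # node_tokens: (B, N, dim) fused category / bounding-box / fVQ-VAE features
        # cond:        (B, dim)    text embedding + timestep embedding
        s1, b1, s2, b2 = self.ada(cond).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(node_tokens) * (1 + s1) + b1
        node_tokens = node_tokens + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(node_tokens) * (1 + s2) + b2
        return node_tokens + self.mlp(h)
```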

Results


Result 1: Qualitative comparisons on text-to-scene generation.

Result 2: Qualitative comparisons on graph-to-scene generation.

Acknowledgements: This research was supported in part by the National Natural Science Foundation of China (No. 62202199, 62302269), the Excellent Young Scientists Fund Program (Overseas) of Shandong Province (No. 2023HWYQ-034) and the Fundamental Research Funds for the Central Universities.