Introduction
With the rapid development of computer vision, graphics and deep learning, converting two-dimensional images into accurate three-dimensional models has become a core technology in many fields, especially virtual reality, augmented reality, autonomous driving, game development and robotics1,2,3,4. This technology not only improves the accuracy of object recognition and scene understanding, but also greatly promotes innovation in digital design, production and manufacturing processes. Successful three-dimensional reconstruction requires not only restoring the geometric shape of the object but also retaining high-precision details and faithful textures, and in particular ensuring consistency across multiple viewpoints, so as to achieve a seamless connection between the real world and the digital world5,6,7,8,9.
Diffusion models have shown great potential in generation tasks, especially for high-quality sample generation. Through a gradual denoising process, diffusion models can efficiently reconstruct realistic images from random noise10,11. This potential is particularly prominent in the field of 3D reconstruction. By combining the diffusion process with 3D geometric information, researchers have overcome the limitations of traditional methods in recovering complex shapes and improved the detail and consistency of generated objects. For example, G3D (3D Diffusion Model) implements the diffusion process directly in 3D space and generates 3D objects with high detail accuracy, providing a more powerful framework for 3D reconstruction12. In addition, DreamFusion uses a pre-trained 2D diffusion model to generate 3D objects from text prompts, gradually recovering shape and texture during optimization and thereby overcoming the shortcomings of traditional generation methods in detail and consistency1. Meanwhile, Point-NeRF combines point-based representations with Neural Radiance Fields (NeRF) to produce fine 3D point clouds from multi-view images, improving the quality and detail of 3D object reconstruction13.
Although diffusion models have made significant progress in 2D to 3D image generation tasks, current techniques still face several challenges. First, many existing methods fail to maintain sufficient accuracy when generating the geometry of complex objects, especially in detail recovery and high-complexity scenes. Second, while diffusion models can generate relatively realistic 3D forms, issues with consistency remain when generating from multiple viewpoints. For example, when generating 3D models from multiple perspectives, lack of viewpoint consistency often leads to visual discrepancies in the generated objects, thereby impacting the model’s practical applicability. Therefore, improving the consistency of 3D models across multiple viewpoints and optimizing reconstruction accuracy remain key challenges in current research.
To address the aforementioned issues, we propose a novel 3D object reconstruction framework, NeuroDiff3D. This model is designed to significantly enhance the accuracy and viewpoint consistency of 3D object reconstruction by dividing the process into two main modules: the 3D Prior Pipeline and the Model Training Pipeline. In the 3D Prior Pipeline, we first utilize the 3D diffusion model (G3D)12 to generate a rough 3D prior. Through the diffusion process, noise is gradually eliminated, allowing us to recover the basic geometric structure and texture features of the object in 3D space, which serves as the foundation for subsequent training. Then, in the Model Training Pipeline, we input structural information, texture information, and semantic information as priors into the T2i-Adapter model14. This process further refines the relationship between textures and structures, ensuring that the generated 3D models exhibit superior visual quality and viewpoint consistency.
By combining advanced 3D diffusion modeling techniques with multimodal information fusion methods in deep learning, NeuroDiff3D is able to generate more detailed, accurate, and consistent 3D object models from multiple viewpoints, while achieving higher frame rates (FPS) and lower FLOPs (as shown in Figure 1). In summary, the main contributions of this paper are as follows:
Fig. 1
Performance comparison of different models in terms of FPS, CLIP-score, and FLOPs. The size of each bubble represents the FLOPs, with larger bubbles indicating higher computational cost.
Proposed NeuroDiff3D, which generates a rough 3D object prior in the 3D Prior Pipeline and further optimizes the structure, texture, and semantic information in the Model Training Pipeline, ensuring the generated 3D models maintain detail recovery and consistency across multiple viewpoints.
Combines image, structural, texture, and semantic information, enabling the model to better reconstruct the geometric shape and details of the 3D object, enhancing its applicability in various real-world scenarios.
On multiple datasets, NeuroDiff3D outperforms existing Text-to-3D and Image-to-3D methods, demonstrating its strong potential in handling complex scenes.
Related work
Text-to-3D generation
The Text-to-3D generation task, which involves generating 3D models from natural language descriptions, has made significant progress with the rapid development of deep learning13. Traditional 3D reconstruction methods, such as multi-view stereo (MVS) and structure from motion (SfM), build 3D models by extracting geometric information from multiple 2D images. Although these methods perform well in terms of geometric accuracy, they have significant limitations in texture recovery and multi-view consistency15. With the rise of deep learning techniques, neural network-based generation methods, such as voxel generation, point cloud generation, and mesh reconstruction, have gradually become mainstream16. For example, PointNet and DeepMVS focus on point cloud and depth-based reconstruction, improving the reconstruction accuracy of 3D objects to a certain extent. NeRF (Neural Radiance Fields) enhances texture and detail recovery through volumetric scene representations, further improving the quality of reconstructed 3D objects17,18. However, these methods still face issues with multi-view consistency and incomplete detail recovery when dealing with complex object shapes or occlusions.
Diffusion models, as an emerging generative technology, have shown great potential in 3D generation. Methods such as G3D and DreamFusion gradually convert random noise into high-quality 3D shapes through the diffusion process19,20. These methods perform well in generating detailed and consistent objects but still leave room for improvement in efficiency and generation accuracy. In addition, methods that combine text and image information, such as Text2Mesh and T2I-Adapter, enhance the precision and detail of 3D generation21,22. By exploiting multi-modal information, they overcome shortcomings of traditional methods and generate more refined and consistent 3D objects. Implicit representation methods such as NeRF have also been extended to Text-to-3D tasks; for example, Text2NeRF generates high-precision 3D scenes from text descriptions through implicit field modeling23,24. These approaches provide higher detail accuracy and texture consistency, but they still face challenges in complex scenes and detail recovery.
Image-to-3D generation
The task of Image-to-3D generation, which involves converting 2D images into corresponding 3D models, has become an important research direction in computer vision and graphics25. Traditional image-to-3D reconstruction methods, such as stereo vision and photogrammetry, rely on extracting depth information from multiple 2D images to generate 3D models. These methods typically require high-quality images from different viewpoints and tend to perform poorly in cases of occlusion, low-resolution images, or complex object shapes. An increasing number of methods based on convolutional neural networks (CNNs) and generative models have been applied to image-to-3D generation tasks26,27. 3D-R2N2 is one of the earlier deep learning methods that learns 3D object representations from single-view images through CNNs and predicts 3D shapes through voxel grids. Although this method has made significant progress in this field, it still faces challenges in maintaining detailed geometry and handling complex structures. The introduction of generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) aims to improve the generation quality of image-to-3D methods28. Pix2Vox and 3D-GAN use GANs to generate 3D models from one or more 2D images, where the generator learns to generate realistic 3D shapes and the discriminator ensures the consistency of the generated objects. These methods perform well in generating more refined reconstructed objects, but still face challenges in ensuring texture mapping accuracy and consistency across multiple viewpoints29,30. Additionally, with the advent of Neural Radiance Fields (NeRF), image-to-3D generation methods have seen new breakthroughs. NeRF-W and Mip-NeRF successfully generate high-quality 3D scenes by incorporating volumetric representation and combining texture and lighting information from input images. While these methods are capable of capturing realistic details, they often require a large number of input images to perform well and are computationally expensive31. Moreover, they still struggle with non-rigid objects or objects undergoing significant deformations. Some methods have also turned to point cloud generation, such as PointNet and PVCNet, directly generating 3D point clouds from images32. These methods generate sets of points in 3D space that represent the object’s surface. Although point cloud methods perform well in terms of computational efficiency and can handle scenes from arbitrary viewpoints, they often fail to generate smooth surfaces and are lacking in detail and resolution compared to voxel or mesh-based methods33. Multi-view image-to-3D generation methods have also emerged as a strong alternative, where multiple images from different viewpoints are used to generate more accurate 3D models. MVSNet (Multi-view Stereo Network) and DeepMVS use deep learning to combine depth maps from multiple 2D images and generate high-precision 3D models34. These methods perform well in terms of accuracy and detail but still require a large amount of input data and struggle in scenes with sparse or limited textures.
NeuroDiff3D, by integrating 3D diffusion modeling with multimodal information fusion, is capable of generating more accurate, consistent, and detailed 3D models from input images. This method enhances detail recovery accuracy by utilizing 3D prior knowledge and texture information, while also optimizing computational efficiency, overcoming the limitations of existing methods in multi-view consistency and generation accuracy.
Methods
Preliminaries
3D diffusion models have made significant progress in generation tasks in recent years, particularly in image generation and 3D shape reconstruction35. The core idea of diffusion models is to transform random noise into structured data through a forward process of gradually adding noise and a reverse process of progressively denoising. For 3D generation tasks, the diffusion model needs to not only recover the geometry of the object but also restore its texture information to ensure that the generated 3D model has both detail and consistency.
In the 3D diffusion model, the forward diffusion process is performed by gradually adding noise to the original 3D object representation. Its transition probability can be expressed as:
$$\begin{aligned} q(x_t | x_{t-1}) = \mathcal {N}\left( x_t; \sqrt{1 - \beta _t}\, x_{t-1}, \beta _t I\right) \end{aligned}$$
(1)
where (x_t) is the state of the 3D object at time step (t), (\beta _t) is the noise variance scheduling function, and (I) is the identity matrix. This process continues until, at the final step ((T)), the object is completely covered by noise.
On the other hand, the reverse diffusion process gradually denoises the 3D object through a neural network. In each step, the model predicts the denoised object based on the current noise state. The conditional probability of the reverse diffusion process can be expressed as:
$$\begin{aligned} p(x_{t-1} | x_t) = \mathcal {N}(x_{t-1}; \mu _\theta (x_t, t), \sigma _t^2 I) \end{aligned}$$
(2)
where (\mu _\theta (x_t, t)) is the denoised object representation predicted by the neural network, (\sigma _t^2) is the noise schedule at time step (t), and (\theta) are the parameters of the neural network, which are learned during training. The objective is to minimize the reconstruction error.
In order to train a 3D diffusion model, we need to minimize the variational lower bound of the data likelihood to ensure that the model can effectively generate 3D objects consistent with the target distribution. The objective function of training is usually expressed as:
$$\begin{aligned} L = \mathbb {E}_q \left[ \sum _{t=1}^T \beta _t \Vert x_t - x_{t-1} \Vert ^2 \right] \end{aligned}$$
(3)
where (x_t) is the noisy state at time step (t), and (x_{t-1}) is the denoised object predicted by the model. The training process minimizes the reconstruction error at each time step, allowing the model to accurately recover the original 3D object during the reverse diffusion process.
Once the model is trained, it can generate new 3D objects by starting from random noise and applying the reverse diffusion process. The final 3D object generated at time step (T) is obtained through denoising, recovering a high-quality 3D shape that is consistent with the target distribution.
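For concreteness, the following PyTorch sketch illustrates the forward noising of Equation (1) in its closed-form version and a single training update for the simplified objective of Equation (3). The schedule length, the linear beta schedule, and the noise-prediction parameterization of the denoiser are illustrative assumptions rather than the exact configuration used in NeuroDiff3D.

```python
import torch
import torch.nn as nn

T = 1000                                      # number of diffusion steps (assumed)
beta = torch.linspace(1e-4, 0.02, T)          # linear noise-variance schedule (assumed)
alpha_bar = torch.cumprod(1.0 - beta, dim=0)  # cumulative attenuation \bar{alpha}_t

def forward_diffuse(x0: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Sample x_t ~ q(x_t | x_0) in closed form (equivalent to iterating Eq. (1))."""
    eps = torch.randn_like(x0)
    a = alpha_bar.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * eps
    return xt, eps

def training_step(denoiser: nn.Module, x0: torch.Tensor, opt: torch.optim.Optimizer) -> float:
    """One denoising update; `denoiser(x_t, t)` predicts the noise added at step t."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    xt, eps = forward_diffuse(x0, t)
    loss = ((denoiser(xt, t) - eps) ** 2).mean()  # simplified reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```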
Overview of our network
As shown in Figure 2, the network architecture proposed in this paper consists of two main parts: the 3D Prior Pipeline and the Model Training Pipeline. In the 3D Prior Pipeline, the 3D diffusion model is first used for geometry modeling, generating a rough 3D object representation. The key task in this stage is to gradually recover the object’s geometric structure, texture details, and semantic information through the diffusion model, providing an initial 3D prior for subsequent training. In the Model Training Pipeline, input images and the structural, texture, and semantic information obtained from the 3D Prior Pipeline are processed through the T2i-Adapter. The model parameters are gradually optimized through the joint operation of trainable and frozen modules. After multiple backpropagation steps, a fine-grained 3D image is ultimately generated.
Fig. 2
NeuroDiff3D overall network architecture. The architecture consists of two main pipelines: (a) the 3D Prior Pipeline, which utilizes a 3D diffusion model to generate prior knowledge; and (b) the Model Training Pipeline, which generates 3D images by training T2I methods. Data source: Pix3D dataset36 and OmniObject3D dataset37.
3D prior pipeline
Coarse 3D prior generation
In the 3D Prior Pipeline, the first step is to generate a coarse 3D object prior using a 3D diffusion model (e.g., Shap-E). To address the limitations of traditional implicit 3D representation methods (e.g., NeRF) in extracting high-quality detailed surfaces, NeuroDiff3D integrates deformable tetrahedral meshes with differentiable Marching Tetrahedra (MT) layers to extract explicit 3D surface meshes. This design enables the model to restore the object's geometric structure more accurately while providing richer texture details.
The specific process is as follows: first, a preliminary 3D result in the form of a coarse geometric model is generated from the input image prompts. Then, a Multilayer Perceptron (MLP) is employed to query the Signed Distance Function (SDF) values of each vertex. This MLP adopts a “4-layer fully connected + residual connection” architecture. The input layer receives the vertex coordinates (3D) of the coarse model generated by Shap-E together with initial texture features (64D), concatenated into a 67D vector. The first three hidden layers each contain 1024 neurons with LeakyReLU activation (negative slope = 0.2), while the fourth layer compresses the representation to 256 dimensions with ReLU activation. To alleviate gradient vanishing in deep networks, residual paths are added between the 2nd–3rd and 3rd–4th layers. Finally, the output layer is split into two branches: one predicts the SDF value (1D) and the other outputs the displacement deviation (3D, corresponding to corrections along the x/y/z axes), with no activation function applied.
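A minimal PyTorch sketch of this refinement MLP is given below. The linear projection used to reconcile the 1024-to-256 residual path and the exact placement of the two output branches are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn

class SDFRefineMLP(nn.Module):
    """Sketch of the 4-layer refinement MLP; the skip projection for the 1024->256
    residual path is an assumption."""
    def __init__(self, in_dim: int = 67, hidden: int = 1024, bottleneck: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, hidden)
        self.fc4 = nn.Linear(hidden, bottleneck)
        self.skip34 = nn.Linear(hidden, bottleneck)   # projection for the 3rd->4th residual (assumed)
        self.act = nn.LeakyReLU(0.2)
        self.head_sdf = nn.Linear(bottleneck, 1)      # SDF branch, no activation
        self.head_disp = nn.Linear(bottleneck, 3)     # x/y/z displacement branch, no activation

    def forward(self, v_in: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        h1 = self.act(self.fc1(v_in))                    # 67 -> 1024
        h2 = self.act(self.fc2(h1))
        h3 = self.act(self.fc3(h2)) + h2                 # residual path between 2nd and 3rd layers
        h4 = torch.relu(self.fc4(h3)) + self.skip34(h3)  # 1024 -> 256 with projected residual
        return self.head_sdf(h4), self.head_disp(h4)
```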
The SDF describes the distance from any point in space to the object’s surface, calculated as:
$$\begin{aligned} \text {SDF}(x)=\left\Vert x-p_{\text {surface}} \right\Vert _2 \end{aligned}$$
(4)
where (x) is any point in space and (p_{\text {surface}}) is the point on the object's surface closest to (x). With this SDF, points near the object's surface can be identified, and the generated model can be further optimized.
The MLP optimization process based on Shap-E's initial results is divided into three stages. First, in the Feature Extraction stage, uniform sampling is performed on Shap-E's coarse 3D mesh, selecting one key vertex per 100 vertices, resulting in a total of 2048 sampled points. For each point, world coordinates ((x_{\text {coord}})) and initial texture features ((f_{\text {tex}}), 64D, output by Shap-E's texture decoder) are extracted and concatenated into the MLP input vector (v_{\text {in}} = [x_{\text {coord}}; f_{\text {tex}}]). In the Error Calculation stage, (v_{\text {in}}) is fed into the MLP to obtain the predicted SDF value and displacement deviation; taking the ground-truth SDF and the original sampled coordinates as benchmarks, the "SDF error term" and "displacement error term" are computed as follows:
$$\begin{aligned} \mathcal {L}_{\text {sdf}} = \Vert \text {SDF}(x_i) - \hat{\text {SDF}}(x_i)\Vert _2^2, \quad \mathcal {L}_{\text {disp}} = \Vert \Delta x_i\Vert _2^2 \end{aligned}$$
(5)
The total loss is then calculated as:
$$\begin{aligned} \mathcal {L}_{\text {total}} = \mathcal {L}_{\text {sdf}} + 0.3\mathcal {L}_{\text {disp}} \end{aligned}$$
(6)
where 0.3 is the displacement error weight, determined through cross-validation. Finally, in the Parameter Update stage, Shap-E’s parameters are fixed, and only the MLP weights and biases are updated using the Adam optimizer (learning rate = (10^{-4}), weight decay = (10^{-5})). After 50 iterations, the MLP-predicted displacement deviations are added to the original sampled points to generate an optimized 3D mesh model, completing the transformation from the coarse model to a fine-grained prior.
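The three stages can be summarized by the following sketch, in which `mlp` is the refinement network above, `v_in` holds the 2048 concatenated 67D input vectors, and `sdf_target` (the benchmark SDF values at the sampled points) is a hypothetical name; Shap-E itself is kept frozen and only the MLP is updated.

```python
import torch

def refine_prior(mlp, v_in, sdf_target, n_iters: int = 50):
    """Sketch of the MLP refinement stage with the combined loss of Eqs. (5)-(6)."""
    opt = torch.optim.Adam(mlp.parameters(), lr=1e-4, weight_decay=1e-5)
    for _ in range(n_iters):
        sdf_pred, disp_pred = mlp(v_in)                  # (2048, 1) and (2048, 3)
        loss_sdf = ((sdf_pred.squeeze(-1) - sdf_target) ** 2).mean()
        loss_disp = (disp_pred ** 2).sum(dim=-1).mean()  # penalise large corrections
        loss = loss_sdf + 0.3 * loss_disp                # Eq. (6)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Add the predicted deviations to the original sampled coordinates (first three input dims).
    return v_in[:, :3] + disp_pred.detach()
```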
To extract more details and maintain consistency from the coarse 3D object, a set of points is sampled from the 3D result, and their SDF values and displacement deviations are predicted. The optimization goal is to refine the model and improve its geometric accuracy and texture quality. To ensure result accuracy and consistency, SDF prediction is optimized by minimizing the following loss function:
$$\begin{aligned} \mathcal {L}_{\text {SDF}}=\sum _{i=1}^{N}\left( \left( \text {SDF}\left( x_{i}\right) -\hat{\text {SDF}}\left( x_{i}\right) \right) ^{2}+\lambda \left\Vert x_{i}-\hat{x}_{i}\right\Vert _{2}\right) \end{aligned}$$
(7)
where (\hat{SDF}(x_{i})) is the predicted SDF of sampled point (x_{i}), (\hat{x}_{i}) is the predicted position, (N) is the number of sampled points, and (\lambda) is a hyperparameter balancing geometric and SDF prediction errors. Here, (\lambda = 0.5)—determined by testing (\lambda \in [0.1, 1.0]) on the OmniObject3D validation set (CMMD as the metric)—which ensures optimal geometric consistency and avoids distortion or SDF deviation from unbalanced weights.
Multimodal knowledge generation
Geometric information is generated by querying each point’s SDF value using the MLP. The SDF (Equation (4)) is critical for recovering the object’s surface geometry. During optimization, the model refines the geometric shape by comparing predicted and actual SDF errors, gradually improving the accuracy of the generated geometric structure.
Texture information is generated by combining normal maps and noise perturbation. A normal map describes the orientation of the object's surface; texture information is produced by adding noise to this map, expressed as:
$$\begin{aligned} T(x)=\hat{T}(x)+\varepsilon \end{aligned}$$
(8)
where (\hat{T}(x)) is the network-predicted texture map, (\varepsilon \sim \mathcal {N}(0, I)) is noise sampled from a standard normal distribution, and (x) is a point in space. Adding noise incorporates surface details and simulates inevitable noise and subtle variations in the natural world.
Semantic information is generated by encoding the input text description into a semantic vector, which improves the accuracy and consistency of 3D object generation. The custom-designed Semantic Encoder in this paper adopts a “Transformer-based architecture + semantic adaptation layer”: a 6-layer Transformer encoder (hidden dimension of 512, 8 attention heads) processes the tokenized text sequence (maximum length of 64) to capture contextual semantics; the subsequent semantic adaptation layer contains a fully connected layer with Tanh activation (input 512D, output 256D) and a LayerNorm layer, completing feature compression and normalization.
In the pre-training phase, the first 4 layers of the CLIP model's text encoder are reused, while the remaining 2 Transformer layers and the semantic adaptation layer are randomly initialized. The encoder outputs a 256D vector (S), which the semantic gating mechanism (a 128×256 transformation matrix) maps to a 128D semantic prior, matching the dimensionality of the structure and texture priors so that the multimodal features can be concatenated directly.
The encoding process is formally expressed as:
$$\begin{aligned} S = \text {SemanticEncoder}(T_{\text {input}}) \end{aligned}$$
(9)
where (S) is the semantic vector and (T_{\text {input}}) is the input text description.
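A sketch of the semantic encoder under the stated configuration is shown below; the token embedding, positional encoding, vocabulary size, and mean pooling are assumptions, and in the actual model the first four Transformer layers are initialized from the CLIP text encoder rather than trained from scratch.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Sketch: 6-layer Transformer (512-d, 8 heads) + Tanh/LayerNorm adaptation layer."""
    def __init__(self, vocab_size: int = 49408, max_len: int = 64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, 512)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, 512))
        layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)  # first 4 layers reused from CLIP in the paper
        self.adapt = nn.Sequential(nn.Linear(512, 256), nn.Tanh(), nn.LayerNorm(256))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.tok_emb(token_ids) + self.pos_emb[: token_ids.shape[1]]
        h = self.encoder(h)
        return self.adapt(h.mean(dim=1))  # mean-pooled 256-d semantic vector S (pooling assumed)
```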
Model training pipeline
In the Model Training Pipeline, 3D prior information (geometric, texture, semantic) is transmitted to the T2i-Adapter module, which combines image and text information to generate latent 3D object representations.
Preprocessing of multimodal prior information
Three types of prior knowledge are preprocessed for subsequent fusion:
Structural Prior ((f_{\text {struc}})): Extracted from the optimized 3D mesh. Face sampling is performed (1 face per 10, totaling 512 sampled faces), and for each face, the normal vector (3D), area (1D), and center coordinates (3D) are calculated. These values are concatenated and fed into a 1-layer MeshConv (kernel size = 3) to output a 128D structural feature vector.
Texture Prior ((f_{\text {tex}})): Output by the 3D Prior Pipeline's texture generation module (Equation (8)), resulting in a 256×256×3 texture map. Global texture features are extracted using a pre-trained ResNet-18 (with the first 10 layers frozen) and then compressed to a 128D vector via a 1-layer fully connected layer.
Semantic Prior ((f_{\text {sem}})): Derived from the 256D vector of the semantic encoder (Equation (9)). Key semantic information is filtered using a "Semantic Gating" mechanism (a code sketch of the three prior branches follows this list), defined as
$$\begin{aligned} f_{\text {sem_gate}} = \sigma (W_g f_{\text {sem}} + b_g) \end{aligned}$$
where (\sigma) is the Sigmoid function, (W_g) is a 128×256 matrix, and (b_g) is a 128D bias. The final output is a 128D semantic feature vector.
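The following sketch combines the three prior branches. The exact form of the MeshConv layer (approximated here by a 1-D convolution over per-face features), the way the first ResNet-18 layers are frozen, and the pooling choices are assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class PriorPreprocessor(nn.Module):
    """Sketch of the structural, texture, and semantic prior branches (each 128-d)."""
    def __init__(self):
        super().__init__()
        # Structural branch: 7-d per-face features (normal, area, center) -> 128-d
        self.mesh_conv = nn.Conv1d(7, 128, kernel_size=3, padding=1)
        # Texture branch: pretrained ResNet-18 backbone, early layers frozen, then 512 -> 128
        resnet = torchvision.models.resnet18(weights="DEFAULT")
        self.tex_backbone = nn.Sequential(*list(resnet.children())[:-1])  # global-pooled 512-d features
        for p in list(self.tex_backbone.parameters())[:20]:               # "first 10 layers" frozen (approximation)
            p.requires_grad = False
        self.tex_fc = nn.Linear(512, 128)
        # Semantic gating: 256-d semantic vector -> 128-d gated prior (W_g is 128x256, b_g is 128-d)
        self.gate = nn.Linear(256, 128)

    def forward(self, face_feats, tex_map, f_sem):
        f_struc = self.mesh_conv(face_feats.transpose(1, 2)).mean(dim=-1)  # (B, 512 faces, 7) -> (B, 128)
        f_tex = self.tex_fc(self.tex_backbone(tex_map).flatten(1))         # (B, 3, 256, 256) -> (B, 128)
        f_sem_gate = torch.sigmoid(self.gate(f_sem))                       # semantic gating
        return torch.cat([f_struc, f_tex, f_sem_gate], dim=-1)             # 384-d concatenated prior
```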
Multimodal information fusion and processing
The CLIP text encoder converts input text descriptions into 512D text features, and the CLIP image encoder processes input images into 512D image features. Based on input semantic information and preprocessed 3D priors, a two-stage “feature alignment–cross-attention fusion” strategy is adopted for encoding/decoding. In the feature dimension alignment stage, the structural (128D), texture (128D), and semantic (128D) prior vectors are concatenated into a 384D vector (f_{\text {prior}}), and a 1-layer fully connected layer (input = 384, output = 512) aligns its dimension with text/image features, resulting in (f_{\text {prior_align}}). In the cross-attention fusion stage, the T2i-Adapter is used for multimodal cross-attention calculation. First, the text feature (f_{\text {text}}) (Query) and the image feature (f_{\text {img}}) (Key) are used to calculate the text-image attention weights:
$$\begin{aligned} A_{t-i} = \text {Softmax} \left( \frac{f_{\text {text}} f_{\text {img}}^T}{\sqrt{512}} \right) \end{aligned}$$
(10)
Then, the image features weighted by (A_{t-i}) ((f_{\text {img_attn}} = A_{t-i} f_{\text {img}})) serve as the new Key, and (f_{\text {prior_align}}) serves as the Value to calculate attention weights:
$$\begin{aligned} A_{t-p} = \text {Softmax} \left( \frac{f_{\text {text}} f_{\text {img_attn}}^T}{\sqrt{512}} \right) \end{aligned}$$
(11)
The final fused feature is:
$$\begin{aligned} f_{\text {fusion}} = 0.5 f_{\text {text}} + 0.3 f_{\text {img_attn}} + 0.2 f_{\text {prior_align}} \end{aligned}$$
(12)
where the weight coefficients are determined via validation set tuning, generating a high-precision latent 3D object representation.
This process can be described as:
$$\begin{aligned} Z_{T} = \text {T2i-Adapter}(S, T, N) \end{aligned}$$
(13)
where (Z_{T}) is the latent space representation after multi-layer neural network processing, (S) is semantic information, (T) is texture information, and (N) is geometric information (e.g., normal maps).
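The fusion stage of Equations (10)-(12) can be sketched as follows. The sequence shapes, the broadcasting of the aligned prior over the text tokens, and the use of the attention-weighted prior as the third term of Equation (12) (the running text describes the prior as the attention Value, while the printed equation uses the aligned prior directly) are interpretive assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_features(f_text: torch.Tensor,         # (B, L, 512) CLIP text features
                  f_img: torch.Tensor,          # (B, M, 512) CLIP image features
                  f_prior_align: torch.Tensor,  # (B, 512) aligned prior vector
                  d: int = 512) -> torch.Tensor:
    # Eq. (10): text-image attention weights
    a_ti = F.softmax(f_text @ f_img.transpose(-2, -1) / d ** 0.5, dim=-1)
    f_img_attn = a_ti @ f_img                                   # attention-weighted image features (new Key)
    # Eq. (11): text-prior attention using the weighted image features as the Key
    a_tp = F.softmax(f_text @ f_img_attn.transpose(-2, -1) / d ** 0.5, dim=-1)
    f_prior_seq = f_prior_align.unsqueeze(1).expand(-1, f_text.shape[1], -1)  # prior broadcast over tokens (assumed)
    f_prior_attn = a_tp @ f_prior_seq                           # prior features used as the Value
    # Eq. (12): fixed-weight fusion (0.5 / 0.3 / 0.2, tuned on the validation set)
    return 0.5 * f_text + 0.3 * f_img_attn + 0.2 * f_prior_attn
```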
Reverse diffusion optimization and model output
In the reverse diffusion process, the model gradually recovers the 3D object representation from noise data. At each time step, the latent representation is updated using:
$$\begin{aligned} Z_{t-1}=\mu _{\theta }\left( Z_{t}, t\right) +\sigma _{t} \cdot \varepsilon \end{aligned}$$
(14)
where (Z_{t}) is the latent representation at time step (t), (\mu _{\theta }(Z_{t}, t)) is the network-predicted denoised mean, (\sigma _{t}) is the time-step-(t) noise scheduling parameter (with a defined relationship to the forward diffusion noise variance (\beta _t): (\sigma _t^2 = \frac{\beta _t (1 - \bar{\alpha }_{t-1})}{1 - \bar{\alpha }_t}), where (\bar{\alpha }_t = \prod _{s=1}^t (1 - \beta _s)) is the cumulative noise attenuation coefficient of the first (t) forward steps—derived from the diffusion model’s probability chain rule to ensure noise distribution consistency), and (\varepsilon \sim \mathcal {N}(0, I)) is standard normal noise. After multiple backpropagation steps, the latent representation is optimized to generate a refined 3D object representation (Z_{0}).
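A sketch of the corresponding sampling loop is given below, using the stated posterior variance; disabling the noise term at the final step is a common convention and an assumption here, and `mu_model` stands in for the network that predicts the denoised mean.

```python
import torch

@torch.no_grad()
def reverse_diffusion(mu_model, z_T: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Sketch of the sampling loop of Eq. (14) with sigma_t^2 = beta_t (1 - abar_{t-1}) / (1 - abar_t)."""
    alpha_bar = torch.cumprod(1.0 - beta, dim=0)
    z_t = z_T
    for t in range(len(beta) - 1, -1, -1):
        alpha_bar_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        sigma_t = (beta[t] * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar[t])).sqrt()
        noise = torch.randn_like(z_t) if t > 0 else torch.zeros_like(z_t)  # no noise at the final step
        z_t = mu_model(z_t, t) + sigma_t * noise
    return z_t  # refined latent Z_0
```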
Finally, after reverse diffusion and optimization, the model generates the final 3D object model via a decoder:
$$\begin{aligned} \hat{X}=\text {Decoder}\left( Z_{0}\right) \end{aligned}$$
(15)
where (\hat{X}) is the final 3D model, obtained by decoding the latent representation (Z_{0}).
Through this stepwise optimization, the Model Training Pipeline combines text descriptions with 3D object priors to generate high-precision, consistent 3D models from noise. During training, the collaboration of trainable and frozen modules ensures efficient parameter optimization while maintaining computational efficiency.
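The trainable/frozen split can be expressed as a small utility; the learning rate and the exact grouping of modules into the frozen and trainable sets are assumptions.

```python
import torch
import torch.nn as nn

def build_optimizer(frozen: list[nn.Module], trainable: list[nn.Module], lr: float = 1e-4):
    """Freeze the backbone modules and optimize only the adapter/MLP parameters."""
    for module in frozen:
        for p in module.parameters():
            p.requires_grad = False
    params = [p for m in trainable for p in m.parameters() if p.requires_grad]
    return torch.optim.Adam(params, lr=lr)
```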
Experiment
Experimental details
Datasets
In this paper, we use two main datasets to support the task of generating 3D models from 2D images. First, the Pix3D dataset36 is a large-scale dataset of 2D-3D image pairs containing images from different object categories and their corresponding 3D models. Each image is accompanied by a detailed 3D model, usually in OBJ format. The dataset contains about 12,000 images covering 10 object categories and is suitable for tasks such as single-view 3D reconstruction, viewpoint estimation, and shape modeling. Second, the OmniObject3D dataset37 provides 6,000 3D objects covering 190 categories, including multi-view images, 3D meshes, point clouds, and other data types. This dataset supports 3D object reconstruction and viewpoint generation, and provides rich training data for multi-view image-to-3D-model generation.
Data preprocessing
In this paper, the original image resolution of the Pix3D dataset ranges from 256×256 to 1024×1024. To avoid uneven feature distribution and balance detail retention with computational cost, all images are resized to 512×512 resolution, which is adapted to the sampling accuracy of 2048 key vertices in the 3D prior generation stage. The original multi-view image resolution of the OmniObject3D dataset is 384×384. To align with the input specifications of the Pix3D dataset, these images are also resized to 512×512 resolution, ensuring consistency in input feature dimensions between the two datasets. The image normalization follows the same strategy as the pre-trained CLIP model: first, the pixel values are normalized from [0, 255] to [0, 1], and then standardized using the mean (0.48145466, 0.4578275, 0.40821073) and standard deviation (0.26862954, 0.26130258, 0.27577711) to match the feature distribution of the CLIP encoder’s pre-training data. This enhances the accuracy of feature alignment during the cross-modal information fusion phase. Additionally, during training, a random horizontal flip data augmentation strategy is applied to the input images to improve the model’s generalization ability.
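The preprocessing pipeline corresponds to the following torchvision sketch; the interpolation mode and the 0.5 flip probability are assumptions.

```python
import torchvision.transforms as T

# CLIP pre-training statistics, as used for normalization above.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

train_transform = T.Compose([
    T.Resize((512, 512)),              # unify Pix3D (256-1024 px) and OmniObject3D (384 px) inputs
    T.RandomHorizontalFlip(p=0.5),     # augmentation applied only during training
    T.ToTensor(),                      # [0, 255] -> [0, 1]
    T.Normalize(CLIP_MEAN, CLIP_STD),  # match the CLIP encoder's feature distribution
])
```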
Experimental environment
The experimental setup in this paper, including hardware and software configurations, is shown in Table 1.
Metrics
This paper uses multiple evaluation metrics to evaluate the performance of NeuroDiff3D. CMMD is used to measure the difference between the generated model and the real 3D model. The smaller the value, the closer the generated model is to the real model. FIDCLIP measures the similarity by calculating the distance between the generated data and the real data in the feature space. The smaller the value, the more similar the two are. CLIP-score measures the consistency between the text description and the image content. The higher the value, the more consistent the text and the image are. LPIPS evaluates the perceptual similarity between the generated image and the real image. The smaller the value, the more similar the images are.
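As an illustration, CLIP-score can be computed as the cosine similarity between CLIP text and image embeddings, as in the sketch below; the checkpoint name and any rescaling convention are assumptions, and this is not the paper's exact evaluation code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image, text: str) -> float:
    """Cosine similarity between the rendered image and its text description."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1).item()  # higher means better text-image consistency
```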
Quantitative comparisons
As shown in Table 2, we compared NeuroDiff3D with existing text-to-3D (Text-to-3D) and image-to-3D (Image-to-3D) generation methods, evaluating several key metrics, including CMMD, FIDCLIP, CLIP-score, and LPIPS, to comprehensively assess the performance of different methods in 3D generation tasks. These evaluation metrics cover the geometric consistency of the generated results, detail restoration accuracy, texture mapping quality, and semantic consistency with the input descriptions. In terms of geometric consistency (CMMD), NeuroDiff3D outperforms other methods by a significant margin. On the OmniObject3D and Pix3D datasets, NeuroDiff3D achieves CMMD values of 2.985 and 3.136, respectively, which are considerably lower than other comparative models (e.g., Shap-E with values of 4.115 and 4.252). This demonstrates that NeuroDiff3D excels in recovering the geometric shape of objects, maintaining higher geometric consistency, particularly in the reconstruction of complex shapes, significantly improving reconstruction accuracy. For FIDCLIP, NeuroDiff3D also yields outstanding results, with 27.354 (OmniObject3D) and 28.832 (Pix3D), significantly better than DreamFusion (31.651 and 32.702). This indicates that NeuroDiff3D generates 3D objects with better image quality and texture recovery, more closely resembling real-world objects, and performs excellently in visual consistency and detail recovery. In terms of CLIP-score, NeuroDiff3D achieves scores of 0.899 (OmniObject3D) and 0.898 (Pix3D), which are significantly higher than other methods, such as Shap-E (0.634 and 0.741). This result shows that NeuroDiff3D generates 3D models with significantly improved semantic consistency with the input text, exhibiting stronger text-to-3D generation capability. Although NeuroDiff3D performs slightly lower than InstantMesh and LRM in terms of the LPIPS metric, overall, by integrating 3D diffusion modeling and multimodal information fusion technology, NeuroDiff3D significantly outperforms existing methods in key dimensions such as geometric consistency, detail restoration, and semantic alignment. It effectively breaks through the limitations of traditional 3D generation methods and demonstrates great potential for generating high-quality 3D models.
Qualitative comparisons
As shown in Fig. 3, we compared NeuroDiff3D with several state-of-the-art image-guided 3D generation methods, including DreamFusion, Magic3D, SyncMVD, Paint3D, and Hunyuan3D-Paint. The figure displays the generated results for three types of objects: cartoon characters, dogs, and vases. From the comparison, it is evident that NeuroDiff3D outperforms the other methods in terms of detail generation and consistency. For the cartoon character, the 3D model generated by NeuroDiff3D showcases more refined details, particularly in the texture of the hair and clothing, appearing more natural and intricate. In contrast, other methods, such as DreamFusion and Magic3D, produce models with blurred details and some geometric distortions. For the dog model, NeuroDiff3D is able to clearly reproduce the facial expressions and body details, making the model appear more lifelike. In comparison, Hunyuan3D-Paint and SyncMVD exhibit distortions in the contours and surface textures of the body, leading to the loss of fine details. In the case of the vase, NeuroDiff3D not only maintains high geometric consistency but also excels in texture detail, particularly in the flower details and color transitions on the vase. Although the vase models generated by other methods also have some detail, there is still room for improvement in terms of texture accuracy and shape consistency. Overall, NeuroDiff3D demonstrates superior performance in all comparison tasks, particularly in texture recovery, geometric consistency, and detail representation, showing significant improvements over other existing methods.
Fig. 3
Qualitative results: comparison of NeuroDiff3D with state-of-the-art methods for image-guided 3D generation. Data source: Pix3D dataset36 and OmniObject3D dataset37.
Parameter analysis
As shown in Table 3, we compared the performance of different text- and image-based 3D generation methods in terms of model parameters (Params), floating point operations (FLOPs), frames per second (FPS), and inference time. From the table, it is evident that NeuroDiff3D performs well across several important metrics. First, in terms of model parameters, NeuroDiff3D has a parameter count of 475.9M, which is relatively moderate compared to other methods (e.g., Shap-E with 550.3M). This shows that NeuroDiff3D strikes a good balance between maintaining model capacity and reducing computing resource requirements. In terms of FLOPs, NeuroDiff3D requires 510.4G, significantly lower than most other methods (such as Shap-E's 760.7G), indicating that despite its high accuracy, its computational complexity and resource consumption are effectively controlled. In terms of FPS, NeuroDiff3D achieves a frame rate of 32.5, higher than methods such as ProlificDreamer (30.1 FPS) and Hunyuan3D-Paint (28.5 FPS), showing its advantage in processing speed and real-time performance. Most importantly, NeuroDiff3D achieves an inference time of 7 seconds, the fastest among all methods, ahead of Paint3D (8 seconds) and SyncMVD (12 seconds), which makes NeuroDiff3D more responsive and efficient in practical applications.
User analysis
To verify the practical applicability of NeuroDiff3D, this study conducted a subjective evaluation experiment involving 100 participants without professional backgrounds in computer vision or 3D modeling. These participants had not participated in any previous evaluations related to 3D generation models, so as to eliminate interference from prior cognition. The experiment adopted a blinded evaluation approach: the names of all models to be evaluated (including NeuroDiff3D and comparative models such as DreamFusion, Magic3D, Paint3D, and Hunyuan3D-Paint) were hidden and presented with random serial numbers. In each round, only the generation results of one type of object (cartoon characters, dogs, vases) from one model were displayed to avoid subjective bias. Participants were required to complete 3 rounds of independent scoring, and the average score was taken as the final individual score.
The display order of the models to be evaluated was randomly generated by a computer; for each type of