Background & Summary
The Industrial Revolutions, characterized by periods of rapid technological advancement, were sparked by society’s demand for more efficiency, improved quality, and higher throughput. The last three centuries have birthed substantial breakthroughs in industrial manufacturing (Fig. 1). The first industrial revolution pioneered mechanical innovations relying on steam and water, while the second leveraged electrification and advanced machine tooling, further boosting production output[1](https://www.nature.com/articles/s41597-025-06007-3#ref-CR1 “Angelopoulos, A. et al. Tackling faults in the Industry 4.0 era—a survey of machine-learning solutions and key aspects. Sensors (Switzerland) 20(1), 109, https://doi.org/10.3390/s20010109
(2020).“),[2](https://www.nature.com/articles/s41597-025-06007-3#ref-CR2 “Ghobakhloo, M. The future of manufacturing industry: a strategic roadmap toward Industry 4.0. Journal of Manufacturing Technology Management 29(6), 910–936, https://doi.org/10.1108/JMTM-02-2018-0057
(2018).“). Then, in the 1950s, the third industrial revolution adopted increased digitization using semiconductors and communication networks, paving the way for automated manufacturing[1](https://www.nature.com/articles/s41597-025-06007-3#ref-CR1 “Angelopoulos, A. et al. Tackling faults in the Industry 4.0 era—a survey of machine-learning solutions and key aspects. Sensors (Switzerland) 20(1), 109, https://doi.org/10.3390/s20010109
(2020).“),[2](https://www.nature.com/articles/s41597-025-06007-3#ref-CR2 “Ghobakhloo, M. The future of manufacturing industry: a strategic roadmap toward Industry 4.0. Journal of Manufacturing Technology Management 29(6), 910–936, https://doi.org/10.1108/JMTM-02-2018-0057
(2018).“). The ongoing fourth industrial revolution, also known as Industry 4.0, has introduced new digital technologies, including artificial intelligence (AI) and machine learning (ML), across a wide range of industries. Industry 4.0 is known for introducing “smart” devices, or the Internet of Things (IoT). It also encompasses cyber-physical systems (CPS), cloud computing, big-data analytics, modeling and simulation, automation, and additive manufacturing technology. Industry 4.0 advances have paved the way for ‘smart factories’, which are more intelligent, adaptable, and productive due to the implementation of smart sensors, IoT devices, and autonomous systems[2](https://www.nature.com/articles/s41597-025-06007-3#ref-CR2 “Ghobakhloo, M. The future of manufacturing industry: a strategic roadmap toward Industry 4.0. Journal of Manufacturing Technology Management 29(6), 910–936, https://doi.org/10.1108/JMTM-02-2018-0057
(2018).“),[3](https://www.nature.com/articles/s41597-025-06007-3#ref-CR3 “Zheng, P. et al. Smart manufacturing systems for Industry 4.0: Conceptual framework, scenarios, and future perspectives. Frontiers of Mechanical Engineering 13(2), 137–150, https://doi.org/10.1007/s11465-018-0499-5
(2018).“).
Fig. 1
Timeline of the industrial revolution.
Manufacturing, the most impactful sector of the United States economy, generates an additional $2.69 for the economy for each dollar invested[4](https://www.nature.com/articles/s41597-025-06007-3#ref-CR4 “National Association of Manufacturers. Facts About Manufacturing, Available at: https://nam.org/manufacturing-in-the-united-states/facts-about-manufacturing-expanded/
(2024).“). Moreover, manufacturing accounts for 12% of the U.S. GDP[5](https://www.nature.com/articles/s41597-025-06007-3#ref-CR5 “Bosman, L., Hartman, N. & Sutherland, J. How manufacturing firm characteristics can influence decision making for investing in Industry 4.0 technologies. Journal of Manufacturing Technology Management 31(5), 1117–1141, https://doi.org/10.1108/JMTM-09-2018-0283
(2020).“), demonstrating the importance of sustaining success and innovation in this area. Unfortunately, recent surveys reveal that an estimated 2.1 million out of 4 million manufacturing positions are projected to be unfilled by 2030[6](https://www.nature.com/articles/s41597-025-06007-3#ref-CR6 “Deloitte Insights. Creating pathways for tomorrow’s workforce today, Manufacturing Institute, Available at: https://www2.deloitte.com/content/dam/insights/articles/7048_DI_ER-I-Beyond-reskilling-in-manufacturing/DI_ER-I-Beyond-reskilling-in-manufacturing.pdf
(2021).“), while finding qualified workers was 1.4 times more difficult in 2020 versus 2018[6](https://www.nature.com/articles/s41597-025-06007-3#ref-CR6 “Deloitte Insights. Creating pathways for tomorrow’s workforce today, Manufacturing Institute, Available at: https://www2.deloitte.com/content/dam/insights/articles/7048_DI_ER-I-Beyond-reskilling-in-manufacturing/DI_ER-I-Beyond-reskilling-in-manufacturing.pdf
(2021).“). Limited access to human capital severely restricts the ability to embrace innovative Industry 4.0 technologies[5](https://www.nature.com/articles/s41597-025-06007-3#ref-CR5 “Bosman, L., Hartman, N. & Sutherland, J. How manufacturing firm characteristics can influence decision making for investing in Industry 4.0 technologies. Journal of Manufacturing Technology Management 31(5), 1117–1141, https://doi.org/10.1108/JMTM-09-2018-0283
(2020).“), implying a grim future for Industry 4.0 and manufacturing in the U.S. This is partly attributed to the pandemic, which caused the loss of 41% of manufacturing jobs, erasing nearly a decade of job creation[6](https://www.nature.com/articles/s41597-025-06007-3#ref-CR6 “Deloitte Insights. Creating pathways for tomorrow’s workforce today, Manufacturing Institute, Available at: https://www2.deloitte.com/content/dam/insights/articles/7048_DI_ER-I-Beyond-reskilling-in-manufacturing/DI_ER-I-Beyond-reskilling-in-manufacturing.pdf
(2021).“),[7](https://www.nature.com/articles/s41597-025-06007-3#ref-CR7 “U.S. Bureau of Labor Statistics. HOUSEHOLD DATA ANNUAL AVERAGES 18b. Employed persons by detailed industry and age, Available at: https://www.bls.gov/cps/cpsaat18b.htm
(2023).“). Furthermore, a shift in workplace values presents additional challenges. Manufacturing has long been sustained by an aging generation of workers who value production and are accustomed to shift-based employment, many of whom are projected to retire within the next decade[7](https://www.nature.com/articles/s41597-025-06007-3#ref-CR7 “U.S. Bureau of Labor Statistics. HOUSEHOLD DATA ANNUAL AVERAGES 18b. Employed persons by detailed industry and age, Available at: https://www.bls.gov/cps/cpsaat18b.htm
(2023).“). By contrast, the younger generation values work-life balance second only to an attractive salary[6](https://www.nature.com/articles/s41597-025-06007-3#ref-CR6 “Deloitte Insights. Creating pathways for tomorrow’s workforce today, Manufacturing Institute, Available at: https://www2.deloitte.com/content/dam/insights/articles/7048_DI_ER-I-Beyond-reskilling-in-manufacturing/DI_ER-I-Beyond-reskilling-in-manufacturing.pdf
(2021).“). Manufacturers must now compete with warehouse wholesalers offering comparable compensation with better work-life balance, while also overcoming disparities between workplace structures and employee desires and expectations.
In 2021, 57% of manufacturers reported using advanced technologies to redesign job tasks (e.g., automating previously manual tasks)[8](https://www.nature.com/articles/s41597-025-06007-3#ref-CR8 “Deloitte. Building The Resilient Organization, Available at: https://img.beverf.net/r5/j7/3z/2021-Resilience-Report.pdf
(2021).“). Concurrently, 75% of industrial organizations identified reskilling the workforce as important or very important for their success over the next year[8](https://www.nature.com/articles/s41597-025-06007-3#ref-CR8 “Deloitte. Building The Resilient Organization, Available at: https://img.beverf.net/r5/j7/3z/2021-Resilience-Report.pdf
(2021).“). Still, only 10% felt very ready to address this trend[9](https://www.nature.com/articles/s41597-025-06007-3#ref-CR9 “Deloitte Insights. The social enterprise at work: Paradox as a path forward, Deloitte, Available at: https://www2.deloitte.com/content/dam/insights/us/articles/us43244_human-capital-trends-2020/us43244_human-capital-trends-2020/di_hc-trends-2020.pdf
(2020).“). Fortunately, the government and corporations recognize these issues and have prepared economic and workplace initiatives to address them. First, the Biden administration awarded $1 Billion to twenty-one winners of the Build Back Better Regional Challenge[10](https://www.nature.com/articles/s41597-025-06007-3#ref-CR10 “The White House. President Biden to Announce 21 Winners of $1 Billion American Rescue Plan Regional Challenge, September Available at: https://bidenwhitehouse.archives.gov/briefing-room/statements-releases/2022/09/02/president-biden-to-announce-21-winners-of-1-billion-american-rescue-plan-regional-challenge/
(2022).“). This funding enables recipients to rebuild economies, promote inclusive and equitable recovery, and create thousands of good-paying jobs in clean energy, next-generation manufacturing, and biotechnology[10](https://www.nature.com/articles/s41597-025-06007-3#ref-CR10 “The White House. President Biden to Announce 21 Winners of $1 Billion American Rescue Plan Regional Challenge, September Available at: https://bidenwhitehouse.archives.gov/briefing-room/statements-releases/2022/09/02/president-biden-to-announce-21-winners-of-1-billion-american-rescue-plan-regional-challenge/
(2022).“),[11](https://www.nature.com/articles/s41597-025-06007-3#ref-CR11 “White House. FACT SHEET: The Biden-Harris Administration Launches the Talent Pipeline Challenge, Available at: https://bidenwhitehouse.archives.gov/briefing-room/statements-releases/2022/06/17/fact-sheet-the-biden-harris-administration-launches-the-talent-pipeline-challenge-supporting-employer-investments-in-equitable-workforce-development-for-infrastructure-jobs/
(2022).“). Second, General Motors’ Technical Learning University (TLU) helps employees stay competitive in the Industry 4.0 workplace. GM’s TLU develops employees through an electrical apprenticeship program, hands-on training with cutting-edge automation, and a controls engineering college[12](https://www.nature.com/articles/s41597-025-06007-3#ref-CR12 “General Motors. Building a future-ready workforce at General Motors. Available at: https://www.gm.com/stories/tech-learning-university
.“).
With a 67% reduction in the number of U.S. foundries since the year 2000, the U.S. Department of Defense (DoD) is encountering supply issues for domestically manufactured castings and forgings[13](https://www.nature.com/articles/s41597-025-06007-3#ref-CR13 “America Makes. AMERICAMAKE Public Participants, Available at: https://www.americamakes.us/wp-content/uploads/2023/07/SS-5536.pdf
(2023).“). Many components required by the DoD are high-mix, low-volume (HMLV), whereas today’s domestic casting and forging businesses prefer high-value, high-quantity items such as those found in the automotive, heavy machinery, and heavy equipment industries[13](https://www.nature.com/articles/s41597-025-06007-3#ref-CR13 “America Makes. AMERICAMAKE Public Participants, Available at: https://www.americamakes.us/wp-content/uploads/2023/07/SS-5536.pdf
(2023).“). Investment in Industry 4.0 technology and worker education would benefit struggling HMLV foundries. Many facilities use Industry 4.0 tools to assess part quality, optimize process parameters, manage maintenance and monitor processes[3](https://www.nature.com/articles/s41597-025-06007-3#ref-CR3 “Zheng, P. et al. Smart manufacturing systems for Industry 4.0: Conceptual framework, scenarios, and future perspectives. Frontiers of Mechanical Engineering 13(2), 137–150, https://doi.org/10.1007/s11465-018-0499-5
(2018).“),[14](https://www.nature.com/articles/s41597-025-06007-3#ref-CR14 “Zheng, T., Ardolino, M., Bacchetti, A. & Perona, M. The applications of Industry 4.0 technologies in manufacturing context: a systematic literature review. International Journal of Production Research 59(6), 1922–1954, https://doi.org/10.1080/00207543.2020.1824085
(2021).“). These tools can help companies fill personnel gaps caused by the labor shortage. This paper presents a novel and comprehensive manufacturing dataset composed of sand-cast parts. The dataset comprises three distinct data types: (1) real camera images, (2) synthetic images generated from three-dimensional (3D) scans, and (3) augmented images produced by systematically modifying these scans. Each image is paired with a corresponding segmentation mask and organized into labeled folders, facilitating the training of image segmentation neural networks. Additionally, the dataset includes objective image quality metrics intended to benchmark the images and to analyze how image quality influences the learning dynamics and generalization capabilities of these networks. The primary motivation behind creating this dataset is to support the development of an automated procedure for post-processing tasks, particularly the removal of sprues and risers. This application is especially relevant for labor-constrained HMLV foundries, where the variability and complexity of parts hinder the standardization of manual cutting operations.
Methods
The objective of this research is to develop a robust and versatile manufacturing dataset aimed at enhancing and evaluating the performance of image segmentation networks. The dataset comprises three types of images: (1) real images captured from actual cast parts, (2) synthetic images generated from 3D scan data, and (3) augmented images created by modifying the 3D scan-derived models. Each image is accompanied by objective quality metrics to support quantitative assessment of model performance. To address the notable absence of a comprehensive, labeled image dataset featuring sand-cast parts with attached sprues and risers, this work presents a purpose-built dataset tailored to that specific application.
The first step in developing this dataset involved designing an appropriate part for sand casting. A part measuring 1.25″ × 1.25″ × 0.5″ with 30-degree draft angles was created using 3D printing (Fig. 2). This particular design was chosen for its simple geometry, which minimizes the complexity and thus reduces the computational demands of the segmentation network, makes casting the part easier, and streamlines the data labeling process. This basic geometry allows for initial training of the network on fewer features, with plans to incorporate parts with more complex features in subsequent phases by augmenting existing photos with additional details. For the casting process, Petrobond red casting sand, a mixture of sand, oil, and clay, was used to form the molds. We chose 6061 aluminum for casting the parts, producing a total of 30 components. Each sample was distinctively varied by altering the orientation and placement of the sprues and risers to generate a diverse range of images for the dataset. This variability is critical for ensuring that the neural network can generalize well across different instances of sand-cast parts.
Fig. 2
Creating the sand cast.
Upon completion of casting the parts, we began capturing camera images and 3D scans for each component. For the image capture, we used an Arduino Nicla Vision camera operated using OpenMV IDE software. To enrich our dataset and provide varied representations of each part, we implemented a systematic approach to photograph each sample from multiple angles. To accomplish this, each sample was positioned on a turntable, as illustrated in Fig. 3. We captured a photograph, rotated the sample by 20 degrees, and repeated until a full 360-degree rotation was achieved. This process was then replicated with each part flipped 180 degrees (i.e., upside down), resulting in a set of 36 images per sample. Across all 30 parts, this approach produced a set of 1,080 camera images (Fig. 3). The diversity and volume of images are anticipated to enhance the training data for the neural network, potentially increasing its predictive accuracy. The turntable used for this process was designed using Creo Parametric software. It was manufactured using a Bambu X1 Carbon 3D printer and operated by a NEMA-17 stepper motor, which was driven by an A4988 stepper motor driver controlled via an Arduino Uno. This setup ensured both precise control over the sample orientation during image capture and consistent replication across all samples.
Fig. 3
Image capture setup with Arduino Nicla Vision Camera, Arduino Uno, and turntable with sample sitting on top.
The next steps required each sample to be 3D scanned and the resultant data to be imported into CAD software for augmentation purposes. We began by testing three different scanners to identify a device that offered adequate speed, accuracy, and repeatability. The first device, the Matter & Form (M&F) tabletop 3D scanner, is a red laser scanner priced at $750 with a reported accuracy of ±0.1 mm. A convenient feature of this scanner is its integration with BevelPix, an open-source online platform developed by M&F that enables users to design, create, and share 3D objects directly from their post-processing software, MF Studio. Despite these advantages, the M&F scanner presents challenges primarily due to its red laser technology, which is highly sensitive to ambient light and can compromise the repeatability of scans. Due to this sensitivity, scanning required approximately 30–45 minutes per sample. Moreover, the options for post-processing were somewhat restrictive, limited primarily to selecting mesh density and deleting data points. These limitations prompted the consideration of alternative scanning solutions to better meet our project’s needs.
In light of the shortcomings associated with the M&F scanner, we opted to transition to an alternative scanning system, the Faro Edge 8-axis FaroArm. This red laser scanner, priced at approximately $14,000, offers an accuracy and repeatability of ±25 μm and a high resolution of 2,000 points per line with 40 μm point spacing, and it is less sensitive to ambient lighting than the M&F. The FaroArm’s post-processing capabilities are facilitated by Geomagic software, which allows for thorough review and modification of the scanned data. Despite its superior precision and robust software, the FaroArm presents several operational challenges. One major limitation is its range of motion, restricted by a 6-degree-of-freedom (DOF) robotic arm with a handheld red laser scanner attached. This configuration limits the user’s ability to capture complex geometries in a single scan, often necessitating multiple scans. When additional scans are required, the user must manually align the point clouds, which is both difficult and time-consuming. Moreover, the scanning procedure demands precise user manipulation, requiring the operator to maintain an optimal distance from the object and ensure the correct orientation of the scanner throughout the process. The physical size of the arm itself can also obstruct operations. Consequently, the time required to complete a single scan can vary significantly, ranging from 10 minutes to over an hour. These limitations motivated the exploration of a more efficient scanning solution.
The Einscan Pro HD (Fig. 4), a structured white light scanner priced at approximately $7,500, offers a high level of precision, possessing an accuracy and repeatability of ±45 μm. It provides flexibility in operation, with configurations for both handheld and turntable modes. In its turntable configuration, the Pro HD can complete a scan in less than one minute, demonstrating significant efficiency. One of the distinctive advantages of the Einscan Pro HD is its compatibility with various post-processing software applications, including GeoMagic, Solid Edge, and EXScan Pro. The latter, EXScan Pro, is particularly notable for its extensive functionality. It supports mesh refinement, point deletion, and automatic point cloud alignment across multiple scans. Furthermore, it offers a unique feature for creating watertight models by filling in gaps where data may have been missed during scanning—a capability not found in the other software options evaluated. The decision to utilize the Einscan Pro HD was based on several criteria: ease of use, scanning speed, robust post-processing capabilities, cost-effectiveness, accuracy, and versatility. These characteristics collectively affirmed the scanner’s suitability for our project requirements, making it an optimal choice among the available options.
Fig. 4
Shining 3D Einscan Pro HD 3D scanner collecting a sample.
After selecting the Einscan Pro HD for its high-resolution capabilities, all 30 physical samples were digitized using 3D scanning (See Fig. 5). This scanning phase laid the foundation for generating both synthetic and augmented images, which are later utilized in training the image segmentation neural network. Post-scanning, the resulting 3D models were imported into Creo Parametric for targeted augmentations. The principal modification involved replacing the square component of each scan with a CAD model that matched the original part’s dimensions and features. This modification ensured consistency across samples and controlled variability in the features analyzed by the neural network. Additionally, it simulates the type of augmentation that HMLV foundries could implement to rapidly train models for recognizing their specific cast parts.
Fig. 5
A raw, unmodified 3-D scanned sample.
Following the augmentations in Creo, the models were imported into Blender to generate the final image sets as shown in Fig. 6. Synthetic images were rendered directly from the unmodified 3D scans and wrapped with an aluminum texture to simulate the appearance of cast metal as depicted in Fig. 7. Augmented images incorporated the CAD-based modifications, which were also textured identically to preserve visual consistency.
Fig. 6
Augmentation of a synthetic image.
Fig. 7
3D scanned part wrapped in aluminum texture.
To streamline image generation, a Python script was developed and executed within Blender to automate the rendering process. Each sample yielded 18 images per orientation, captured at 20-degree intervals in both upright and inverted orientations (36 images per sample), mirroring the approach used for capturing real camera images. Rather than replicating the fluorescent laboratory lighting, which resulted in poor visual fidelity in Blender, sunlight was selected as the primary light source, as it provided the most accurate and consistent visualization of surface features. The lighting angle was held constant and matched, as closely as possible, to the lighting angle used in the real image captures to ensure visual consistency across image types. Alternative lighting setups were evaluated but did not produce satisfactory image quality. This pipeline yielded 1,080 synthetic and 1,080 augmented images. The pipeline not only ensured consistency between image types but also produced a diverse and extensive dataset to support effective learning and generalization by the neural network.
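As an illustration of this automation, a minimal Blender scripting sketch is shown below. It assumes a hypothetical object name (scanned_part) and output directory, and it rotates the part rather than the camera; the actual script used for the dataset may differ in these details.

```python
import math
import bpy  # Blender's Python API; intended to run inside Blender's scripting environment

PART_OBJECT = "scanned_part"   # hypothetical name of the imported, textured scan
OUTPUT_DIR = "//renders/"      # Blender-relative output folder (assumed)
ANGLE_STEP_DEG = 20            # matches the 20-degree interval used for the real photographs

part = bpy.data.objects[PART_OBJECT]
scene = bpy.context.scene

for orientation, flip in (("upright", 0.0), ("inverted", math.pi)):
    part.rotation_euler[0] = flip                # flip 180 degrees about X for the inverted set
    for step in range(360 // ANGLE_STEP_DEG):    # 18 views per orientation, 36 per sample
        part.rotation_euler[2] = math.radians(step * ANGLE_STEP_DEG)
        scene.render.filepath = f"{OUTPUT_DIR}{PART_OBJECT}_{orientation}_{step * ANGLE_STEP_DEG:03d}"
        bpy.ops.render.render(write_still=True)  # render the current view and save it to disk
```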
Following image generation, class labels and segmentation masks were manually created using the Roboflow[15](https://www.nature.com/articles/s41597-025-06007-3#ref-CR15 “Dwyer, B. et al. Roboflow, Available at: https://roboflow.com/
. (2024).“) computer vision platform. The set contains a total of 3 classes: part, sprue, and riser (Fig. 8). Roboflow supports two annotation methods for image segmentation: the manual polygon tool and the AI-enabled Smart Polygon tool. With the polygon tool, users delineate the target object by clicking along its perimeter to form a closed polygon (Fig. 9). Once the outline is complete, a class label can be assigned via the interface menu shown in Fig. 10. A solid-colored overlay is then applied to indicate the mask’s creation.
Fig. 8
An annotated part in Roboflow.
Fig. 9
Using the polygon tool to generate segmentation mask in Roboflow.
Fig. 10
Assigning class labels to objects in scene using Roboflow.
To accelerate the labeling process, the Smart Polygon tool leverages an AI network, such as Segment Anything Model 2[16](https://www.nature.com/articles/s41597-025-06007-3#ref-CR16 “Ravi, N. et al. SAM 2: Segment Anything in Images and Videos. Facebook AI Research, https://doi.org/10.48550/arXiv.2408.00714
(2024).“) (SAM2), to generate segmentation masks. These models, often pre-trained on large datasets like MS COCO or SA-V, infer object boundaries from user interactions. As a user hovers over different image regions, the algorithm proposes potential mask regions by highlighting the detected object boundaries (Fig. 11). Clicking within a highlighted region confirms the selection (Fig. 12). After an initial selection, users may either accept the mask and assign a class label or refine the mask if it does not accurately capture the object. If the mask is too large and overlaps unintended areas, the user can click on the overlap to reduce its extent (Fig. 13). Conversely, if the mask is too small, additional clicks on exposed regions allow for expansion. Users can further fine-tune the mask by dragging, adding, or removing control points to achieve a precise fit. Additionally, the Smart Polygon tool features a complexity slider, enabling users to adjust the mask’s granularity according to the object’s contours. While the Smart Polygon tool generally performs well, it may encounter challenges when processing images with complex geometries or unique visual features, requiring additional manual refinement to ensure accurate segmentation.
Fig. 11
Highlighting a proposed mask region in Roboflow.
Fig. 12
Proposed mask region and slider to adjust mask complexity, where the green dot indicates where the user clicked to confirm the mask.
Fig. 13
Refined segmentation mask made with smart polygon tool, red dots indicate where the user clicked to indicate areas that do not contain the target class.
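For readers who want to reproduce this point-prompted workflow outside of Roboflow, the sketch below shows how a single click could be turned into a mask with the publicly released SAM2 image predictor. It is illustrative only: the model identifier, image path, and click coordinates are assumptions, and the API shown follows the facebookresearch/sam2 repository and may change between releases.

```python
import numpy as np
from PIL import Image
# Assumes the "sam2" package from the facebookresearch/sam2 repository is installed.
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Hypothetical pretrained checkpoint name; see the sam2 repository for available models.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.array(Image.open("casting_sample.jpg").convert("RGB"))  # hypothetical image path
predictor.set_image(image)

# One positive click (label 1) placed inside the riser, analogous to a Smart Polygon click.
click_xy = np.array([[420, 310]])  # hypothetical pixel coordinates of the user's click
masks, scores, _ = predictor.predict(point_coords=click_xy,
                                     point_labels=np.array([1]),
                                     multimask_output=True)
best_mask = masks[np.argmax(scores)]  # mask of the highest-scoring proposal
print(best_mask.shape, scores)
```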
Once annotation is complete, a dataset version must be created. When creating a version, Roboflow allows users to preprocess and augment images, with options to resize, rotate, crop, convert to grayscale, and modify the hue, tone, contrast, and brightness of images. Next, users can export labeled, preprocessed images in various formats, including but not limited to JSON, .txt, .xml, and .csv. These formats are compatible with a broad range of neural network architectures and training pipelines. For this work, annotations were exported in the YOLO format, where each image is associated with a .txt file containing the class ID and normalized bounding box and segmentation mask coordinates. The bounding box is represented as (x_center, y_center, width, height), with all values normalized to the range [0, 1]. The segmentation mask is included as a series of normalized (x, y) coordinate pairs outlining the object’s shape, appended to the bounding box entry in the same file.
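To make the exported format concrete, the sketch below shows what one annotation line might look like and how it could be parsed; the numeric values and the class-ID mapping are hypothetical examples, not taken from the dataset.

```python
# One YOLO-style line per object: <class_id> <x_center> <y_center> <width> <height> <x1> <y1> <x2> <y2> ...
# All coordinates are normalized to [0, 1]. The values below are hypothetical.
EXAMPLE_LINE = "1 0.512 0.430 0.210 0.180 0.42 0.35 0.60 0.36 0.61 0.52 0.43 0.51"

CLASS_NAMES = {0: "part", 1: "sprue", 2: "riser"}  # assumed class-ID ordering

def parse_yolo_segmentation_line(line: str) -> dict:
    """Split one annotation line into its class, bounding box, and polygon vertices."""
    values = line.split()
    class_id = int(values[0])
    bbox = tuple(map(float, values[1:5]))            # (x_center, y_center, width, height)
    coords = list(map(float, values[5:]))
    polygon = list(zip(coords[0::2], coords[1::2]))  # normalized (x, y) pairs outlining the object
    return {"class": CLASS_NAMES.get(class_id, str(class_id)), "bbox": bbox, "polygon": polygon}

print(parse_yolo_segmentation_line(EXAMPLE_LINE))
```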
Upon completing the dataset comprising real, synthetic, and augmented images, it was essential to conduct a comprehensive image quality analysis across all 3,240 images. This analysis aimed to quantify both the amount and the nature of information captured within each image type, offering critical insights into how these images may be interpreted and utilized by computer vision models. Understanding these distinctions is particularly important given that deep learning systems do not evaluate images as humans do; they rely on statistical and structural patterns within the data, which can vary significantly across different generation methods. Notably, this type of pre-training data quality assessment is largely absent from existing literature, despite its potential to inform model design, dataset refinement, and training efficiency.
The primary motivation for this evaluation was to investigate how various image quality metrics affect a model’s capacity to learn meaningful representations. In particular, assessing the fidelity of synthetic and augmented images relative to real camera images is vital for validating their effectiveness as training data. This step ensures that the synthetic pipeline does not introduce distortions or artifacts that could mislead the model or reduce generalization performance. Moreover, this approach represents a novel step toward addressing a fundamental question in data-driven machine learning: how much data is truly “good enough”? Rather than relying solely on dataset size, this analysis emphasizes dataset quality, exploring how different image characteristics—texture, structure, variability, and complexity—influence learning outcomes. As synthetic data becomes increasingly common in manufacturing and other industrial applications, such analyses are essential for building efficient, high-performing models while reducing dependence on time-consuming and labor-intensive real-world data collection.
To carry out this assessment, we evaluated a set of complementary image quality metrics: Shannon’s entropy, gray-level co-occurrence matrix (GLCM) features, intrinsic dimensionality (ID), and the structural similarity index (SSIM). Each of these metrics is described in detail in the following section.
Shannon’s Entropy
Shannon’s entropy quantifies the distribution of gray-level pixel intensities across an image, which is sometimes interpreted as the “amount of information contained” inside an image[17](https://www.nature.com/articles/s41597-025-06007-3#ref-CR17 “Shannon, C. E. A Mathematical Theory of Communication. The Bell System Technical Journal 27, 623–656, https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
(1948).“),[18](https://www.nature.com/articles/s41597-025-06007-3#ref-CR18 “Vajapeyam, S. Understanding Shannon’s Entropy Metric for Information, https://doi.org/10.48550/arXiv.1405.2061
(2014).“). In data science, it translates to the amount of storage space an encoded image occupies[18](https://www.nature.com/articles/s41597-025-06007-3#ref-CR18 “Vajapeyam, S. Understanding Shannon’s Entropy Metric for Information, https://doi.org/10.48550/arXiv.1405.2061
(2014).“). A higher entropy suggests that an image might be more complex and information-rich, while a lower entropy may suggest the opposite. Shannon’s Entropy for a grayscale image is defined as:
$$H=-\mathop{\sum }\limits_{i=0}^{n-1}{p}_{i}\log {p}_{i}$$
(1)
where n is the number of gray levels, whose intensities range from 0 to 255 for an 8-bit image, and $p_{i}$, which ranges from 0 to 1, is the probability of a pixel having a given gray level i, calculated from the distribution of pixels across the entire image. For example, if an image contains a total of 100 pixels and only one pixel has an intensity of 200, then $p_{200} = 0.01$.
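Equation (1) can be computed directly from an image histogram. The short sketch below, which uses base-2 logarithms so that entropy is expressed in bits, is a generic implementation under those assumptions rather than the exact code used for the dataset analysis.

```python
import numpy as np

def shannon_entropy(gray_image: np.ndarray) -> float:
    """Shannon entropy (in bits) of an 8-bit grayscale image, following Eq. (1)."""
    hist = np.bincount(gray_image.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()            # p_i: probability of each gray level i
    p = p[p > 0]                     # terms with p_i = 0 contribute nothing
    return float(-(p * np.log2(p)).sum())

# A constant (all-black) image is perfectly predictable, so its entropy is 0.
print(shannon_entropy(np.zeros((100, 100), dtype=np.uint8)))                # 0.0
# Uniformly distributed gray levels approach the 8-bit maximum entropy of 8.
rng = np.random.default_rng(0)
print(shannon_entropy(rng.integers(0, 256, (256, 256), dtype=np.uint8)))    # ~8.0
```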
Another way to conceptualize Shannon’s Entropy is through considering the predictability of a variable’s behavior[17](https://www.nature.com/articles/s41597-025-06007-3#ref-CR17 “Shannon, C. E. A Mathematical Theory of Communication. The Bell System Technical Journal 27, 623–656, https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
(1948).“),[18](https://www.nature.com/articles/s41597-025-06007-3#ref-CR18 “Vajapeyam, S. Understanding Shannon’s Entropy Metric for Information, https://doi.org/10.48550/arXiv.1405.2061
(2014).“). An encoded image is represented by a matrix or array of numerical values. If we consider a grayscale image, black pixels are typically assigned a value of 0, while white pixels are assigned 255, with all gray-level intensities in-between. An image containing all black pixels has $p_{0} = 1$, because every pixel has the same value and the occurrence of a black pixel is guaranteed. As we move through the image pixel by pixel, the probability of encountering a black pixel remains 100%, making the value entirely predictable. In effect, once we observe a few black pixels, we can confidently assume that the rest of the image will also be black. Therefore, $p_{0}$, the variable representing black pixels, is perfectly predictable. As a result, there is no need to transmit this information, because it can be perfectly reconstructed without any additional data. Thus, the entropy of the image, or the amount of information it contains, is zero[18](https://www.nature.com/articles/s41597-025-06007-3#ref-CR18 “Vajapeyam, S. Understanding Shannon’s Entropy Metric for Information, https://doi.org/10.48550/arXiv.1405.2061
(2014).“). On the other hand, if an image comprises 50% white pixels and 50% of all other intensities, then $p_{255} = 0.5$, because the probability of correctly predicting the next pixel to be white is only 50%. In such cases, additional information is necessary to improve prediction accuracy beyond random chance. High-entropy images like these are more complex and less predictable, which can place greater demands on computational resources during training. We hypothesize that this increased complexity may lead to longer training durations compared to low-entropy images. Understanding a variable’s behavior in this way is essential for developing robust models capable of efficiently learning from diverse datasets.
While Shannon’s entropy offers valuable insight into how much information an image contains, it provides no structural or organizational information. This limitation is evident in Fig. 14. Although the two grayscale images are clearly distinct, one an ordered gradient (Fig. 14a) and the other a random assortment of grayscale pixels (Fig. 14b), both have the same Shannon’s entropy value of 8 because they share the same distribution of pixel values. Consequently, Shannon’s entropy alone is insufficient for fully understanding an image’s content and should be used in conjunction with other image quality assessment tools for a more comprehensive analysis.
Fig. 14
Two grayscale images with Shannon’s Entropy value of 8.
Gray-Level Co-occurrence Matrix (GLCM)
Given the limitations of Shannon’s entropy in capturing spatial or structural information, we employed additional analytical tools to extract insights related to pixel relationships and image texture. The Gray-Level Co-Occurrence Matrix (GLCM) is particularly effective in quantifying textural features by analyzing the spatial relationships between pairs of pixels within an image[19](https://www.nature.com/articles/s41597-025-06007-3#ref-CR19 “NASA Ocean Biology Processing Group. Grey Level Co-Occurrence Matrix (GLCM). Available at: https://seadas.gsfc.nasa.gov/help-9.0.0/operators/GLCM.html
.“). Specifically, the GLCM evaluates how frequently combinations of pixel intensities occur at a given offset and orientation, thereby capturing patterns that reflect the image’s textural structure. From this matrix, several statistical measures, such as contrast, energy, homogeneity, and correlation, are derived and characterize properties like smoothness, roughness, and regularity[20](https://www.nature.com/articles/s41597-025-06007-3#ref-CR20 “MathWorks. Properties of gray-level co-occurrence matrix (GLCM), MATLAB. Available at: https://www.mathworks.com/help/images/ref/graycoprops.html
.“). These features, when compared across image sets, provide valuable insights into the textural variability that can impact neural network performance on image classification or segmentation tasks. Leveraging significant differences in these texture metrics may enhance model generalization by highlighting distinct structural cues that are more salient than other image features. Figure 15 visualizes the directional spatial relationships assessed during GLCM computation, emphasizing its role in spatial texture analysis.
Fig. 15
GLCM directional analysis to extract texture properties.
Contrast measures the gray-level variation between a reference pixel and its neighboring pixels[21](https://www.nature.com/articles/s41597-025-06007-3#ref-CR21 “Singh, S., Srivastava, D. & Agarwal, S. GLCM and Its Application in Pattern Recognition. In: International Symposium on Computational and Business Intelligence, IEEE, https://doi.org/10.1109/ISCBI.2017.8053537
(2017).“). It reflects the degree of local intensity variation, where higher values indicate greater disparities between adjacent pixel intensities. Contrast is also associated with the linear dependency between the gray levels of two neighboring pixels[22](https://www.nature.com/articles/s41597-025-06007-3#ref-CR22 “Haralick, R., Shanmugam, K. & Dinstein, I. Textural Features for Image Classification. Transactions on Systems, Man, and Cybernetics, IEEE, no. 6, https://doi.org/10.1109/TSMC.1973.4309314
(1973).“). Mathematically, its value ranges from 0 to (size(GLCM, 1) − 1)², where a contrast value of zero corresponds to a completely uniform image—one in which every pixel has the same gray-scale value[20](https://www.nature.com/articles/s41597-025-06007-3#ref-CR20 “MathWorks. Properties of gray-level co-occurrence matrix (GLCM), MATLAB. Available at: https://www.mathworks.com/help/images/ref/graycoprops.html
.“). The contrast for a gray-scale image is defined as:
$$Contrast(d,\theta )=\mathop{\sum }\limits_{i=0}^{{N}_{g}-1}\mathop{\sum }\limits_{j=0}^{{N}_{g}-1}{| i-j| }^{2}\,{P}_{d}^{\theta }(i,j)$$
(2)
where d represents the spatial distance between pixel pairs, θ is the angle of orientation between pixels (e.g., 0°, 45°, 90°, or 135°), $N_{g}$ indicates the number of gray levels present in the image, and ${P}_{d}^{\theta }(i,j)$ is the normalized GLCM value indicating the probability of observing a pixel with intensity i adjacent to a pixel with intensity j at distance d and angle θ.
Energy quantifies the number of repeated pixel pairs and thus reflects the textural uniformity of an image. High energy values indicate greater uniformity or regular patterns, while lower values suggest more variation and disorder in the texture[21](https://www.nature.com/articles/s41597-025-06007-3#ref-CR21 “Singh, S., Srivastava, D. & Agarwal, S. GLCM and Its Application in Pattern Recognition. In: International Symposium on Computational and Business Intelligence, IEEE, https://doi.org/10.1109/ISCBI.2017.8053537
(2017).“). Energy values range from 0 to 1, with a value of 1 corresponding to a completely uniform image[20](https://www.nature.com/articles/s41597-025-06007-3#ref-CR20 “MathWorks. Properties of gray-level co-occurrence matrix (GLCM), MATLAB. Available at: https://www.mathworks.com/help/images/ref/graycoprops.html
.“). The energy for a grayscale image is defined as:
$${\rm{Energy}}(d,\theta )=\mathop{\sum }\limits_{i=0}^{{N}_{g}-1}\mathop{\sum }\limits_{j=0}^{{N}_{g}-1}\left[{P}_{d}^{\theta }{(i,j)}^{2}\right]$$
(3)
where d is the spatial distance between pixels, θ is the orientation angle between pixels, $N_{g}$ is the number of gray levels in the image, and ${P}_{d}^{\theta }(i,j)$ denotes the normalized GLCM value representing the probability of a pixel with gray level i co-occurring with gray level j at the specified distance and angle.
Homogeneity, also known as the inverse difference moment (IDM), measures the similarity between a reference pixel and its neighboring pixel[23](https://www.nature.com/articles/s41597-025-06007-3#ref-CR23 “Kobayashi, T., Sundaram, D., Nakata, K. & Tsurui, H. Gray-level co-occurrence matrix analysis of several cell types in mouse brain using resolution-enhanced photothermal microscopy. Journal of Biomedical Optics 22(3), 036011, https://doi.org/10.1117/1.JBO.22.3.036011
(2017).“). It captures how close the gray levels of pixel pairs are to each other. Homogeneity values range from 0 to 1, with a value of 1 indicating a completely uniform image in which all pixels have identical gray-level intensities. The homogeneity for a grayscale image is defined as:
$$Homo(d,\theta )=\mathop{\sum }\limits_{i=0}^{{N}_{g}-1}\mathop{\sum }\limits_{j=0}^{{N}_{g}-1}\frac{1}{1+{(i-j)}^{2}}\,{P}_{d}^{\theta }(i,j)$$
(4)
where d is the spatial distance between pixels, θ is the orientation angle, $N_{g}$ is the number of gray levels in the image, and ${P}_{d}^{\theta }(i,j)$ represents the normalized co-occurrence probability of observing pixel intensities i and j at the specified distance and angle.
Correlation measures the statistical relationship between a pixel and its neighbor across the entire image. It quantifies how much the gray levels of one pixel predict the gray levels of its neighbor[20](https://www.nature.com/articles/s41597-025-06007-3#ref-CR20 “MathWorks. Properties of gray-level co-occurrence matrix (GLCM), MATLAB. Available at: https://www.mathworks.com/help/images/ref/graycoprops.html
.“). Correlation values range from −1 to 1, where −1 indicates perfect negative correlation, 0 denotes no correlation, and 1 represents perfect positive correlation. The correlation for a grayscale image is defined as:
$$Corr(d,\theta )=\mathop{\sum }\limits_{i=0}^{{N}_{g}-1}\mathop{\sum }\limits_{j=0}^{{N}_{g}-1}\frac{ij\,{P}_{d}^{\theta }(i,j)-{\mu }_{x}{\mu }_{y}}{{\sigma }_{x}{\sigma }_{y}}$$
(5)
where d is the spatial distance between pixels, θ is the orientation angle, $N_{g}$ is the number of gray levels in the image, and ${P}_{d}^{\theta }(i,j)$ denotes the normalized probability of a pixel pair (i, j) occurring at the specified distance and angle. The terms $\mu_{x}$ and $\mu_{y}$ are the means of the marginal distributions of the reference and neighboring pixel intensities, respectively, while $\sigma_{x}$ and $\sigma_{y}$ are their corresponding standard deviations.
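As a concrete illustration of how these texture features complement entropy, the sketch below computes the four properties with scikit-image for two equal-entropy images analogous to those in Fig. 14 (an ordered gradient and a shuffled copy of the same pixels). It is a generic example under those assumptions, not the exact analysis code used for the dataset.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_image: np.ndarray, distance: int = 1) -> dict:
    """Contrast, energy, homogeneity, and correlation averaged over the four standard orientations."""
    glcm = graycomatrix(gray_image, distances=[distance],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    return {prop: float(graycoprops(glcm, prop).mean())
            for prop in ("contrast", "energy", "homogeneity", "correlation")}

# Two images with identical histograms (hence identical entropy) but very different texture:
gradient = np.tile(np.arange(256, dtype=np.uint8), (256, 1))  # smooth left-to-right gradient
shuffled = gradient.copy().ravel()
np.random.default_rng(0).shuffle(shuffled)
shuffled = shuffled.reshape(256, 256)                          # same pixels, spatial order destroyed

print(glcm_features(gradient))   # low contrast, high homogeneity and correlation
print(glcm_features(shuffled))   # high contrast, low homogeneity and correlation
```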
Intrinsic Dimensionality
To supplement the previous analytical tools, we also consider the intrinsic dimensionality (ID), which refers to the minimum number of variables needed to describe a data distribution, making it a key metric for understanding dataset complexity[24](#ref-CR24 “Pope, P., Zhu, C., Abdelkader, A., Goldblum, M. & Goldstein, T. The Intrinsic Dimension of Images and Its Impact on Learning, https://doi.org/10.48550/arXiv.2104.08894
(2021).“),[25](#ref-CR25 “Weerasinghe, S., Alpcan, T., Erfani, S. M., Leckie, C. & Rubinstein, B. I. P. Local Intrinsic Dimensionality Signals Adversarial Perturbations. In: Conference on Decision and Control, IEEE, https://doi.org/10.48550/arXiv.2109.11803
(2022).“),[26](https://www.nature.com/articles/s41597-025-06007-3#ref-CR26 “Camastra, F. & Staiano, A. Intrinsic dimension estimation: Advances and open problems. Information Sciences 328, 26–41, https://doi.org/10.1016/j.ins.2015.08.029
(2016).“). To estimate the ID of a dataset, this work employs the Maximum Likelihood Estimation (MLE) method proposed by Levina and Bickel[26](https://www.nature.com/articles/s41597-025-06007-3#ref-CR26 “Camastra, F. & Staiano, A. Intrinsic dimension estimation: Advances and open problems. Information Sciences 328, 26–41, https://doi.org/10.1016/j.ins.2015.08.029
(2016).“),[27](https://www.nature.com/articles/s41597-025-06007-3#ref-CR27 “Levina, E. & Bickel, P. J. Maximum Likelihood Estimation of Intrinsic Dimension, Available at: https://proceedings.neurips.cc/paper_files/paper/2004/file/74934548253bcab8490ebd74afed7031-Paper.pdf
. (2004).“). This technique evaluates the ID locally by analyzing the neighborhood around each data point and computing the Euclidean distances to the kth nearest neighbors[28](https://www.nature.com/articles/s41597-025-06007-3#ref-CR28 “Lin, T. & Zha, H. Riemannian manifold learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(5), 796–809, https://doi.org/10.1109/TPAMI.2007.70735
(2008).“). The MLE approach assumes locally uniform data density and models the number of neighboring points within a radius using a Poisson process[24](https://www.nature.com/articles/s41597-025-06007-3#ref-CR24 “Pope, P., Zhu, C., Abdelkader, A., Goldblum, M. & Goldstein, T. The Intrinsic Dimension of Images and Its Impact on Learning, https://doi.org/10.48550/arXiv.2104.08894
(2021).“),[26](https://www.nature.com/articles/s41597-025-06007-3#ref-CR26 “Camastra, F. & Staiano, A. Intrinsic dimension estimation: Advances and open problems. Information Sciences 328, 26–41, https://doi.org/10.1016/j.ins.2015.08.029
(2016).“),[27](https://www.nature.com/articles/s41597-025-06007-3#ref-CR27 “Levina, E. & Bickel, P. J. Maximum Likelihood Estimation of Intrinsic Dimension, Available at: https://proceedings.neurips.cc/paper_files/paper/2004/file/74934548253bcab8490ebd74afed7031-Paper.pdf
. (2004).“). Under these assumptions, the ID at a given point x can be estimated using the following likelihood-based expression[24](https://www.nature.com/articles/s41597-025-06007-3#ref-CR24 “Pope, P., Zhu, C., Abdelkader, A., Goldblum, M. & Goldstein, T. The Intrinsic Dimension of Images and Its Impact on Learning, https://doi.org/10.48550/arXiv.2104.08894
(2021).“),[26](https://www.nature.com/articles/s41597-025-06007-3#ref-CR26 “Camastra, F. & Staiano, A. Intrinsic dimension estimation: Advances and open problems. Information Sciences 328, 26–41, https://doi.org/10.1016/j.ins.2015.08.029
(2016).“),[27](https://www.nature.com/articles/s41597-025-06007-3#ref-CR27 “Levina, E. & Bickel, P. J. Maximum Likelihood Estimation of Intrinsic Dimension, Available at: https://proceedings.neurips.cc/paper_files/paper/2004/file/74934548253bcab8490ebd74afed7031-Paper.pdf
. (2004).“):
$${\widehat{m}}_{k}(x)={\left[\frac{1}{k-1}\mathop{\sum }\limits_{j=1}^{k-1}\log \frac{{T}_{k}(x)}{{T}_{j}(x)}\right]}^{-1}$$
(6)
where $T_{j}(x)$ denotes the Euclidean ($\ell_2$) distance from point x to its jth nearest neighbor. The local intrinsic dimension estimate $\widehat{m}_{k}(x)$ is computed for each point in the dataset, and a global estimate is obtained by averaging over all n points:
$${\bar{m}}_{k}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{\widehat{m}}_{k}({x}_{i})$$
(7)
Estimating the minimum number of features required to represent the dataset provides a quantitative measure of its underlying complexity. Understanding intrinsic dimensionality is essential for informing the design of neural networks that are sufficiently expressive to capture relevant structures in the data without overfitting. Datasets with higher ID values may lead to longer training times, increased risk of overfitting, and greater sensitivity to hyperparameter tuning. Conversely, lower ID suggests a more compact data structure, potentially enabling faster convergence, improved generalization, and reduced computational cost. Comparing the ID of real, synthetic, and augmented image sets provides insight into how their complexity may affect model performance and training efficiency. These distinctions can guide dataset refinement and model architecture selection to align with the underlying structure of the data.
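A compact implementation of the estimator proposed by Levina and Bickel in Eqs. (6) and (7) is sketched below, using scikit-learn for the nearest-neighbor search. The sanity check at the end uses synthetic data with a known intrinsic dimension and is purely illustrative; it is not the dataset itself, and the neighborhood size k is an assumed setting.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dimension(X: np.ndarray, k: int = 20) -> float:
    """MLE of intrinsic dimensionality (Eqs. 6 and 7).

    X: (n_samples, n_features) array, e.g. flattened grayscale images.
    k: number of nearest neighbors used for each local estimate.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)       # Euclidean distances; column 0 is the point itself
    dists = dists[:, 1:]              # T_1(x), ..., T_k(x)
    log_ratios = np.log(dists[:, -1:] / dists[:, :-1])   # log(T_k / T_j) for j = 1..k-1
    m_local = 1.0 / log_ratios.mean(axis=1)              # Eq. (6): local estimate at each point
    return float(m_local.mean())                         # Eq. (7): average over all n points

# Sanity check: 5-dimensional data embedded linearly in a 100-dimensional ambient space.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 100))
print(mle_intrinsic_dimension(X, k=20))  # expected to be close to 5
```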
Structural Similarity Index (SSIM)
Considering that the human visual system (HVS) is particularly sensitive to structural information in images, some image quality assessment methods are designed to extract and evaluate this structure to better approximate perceived quality[29](https://www.nature.com/articles/s41597-025-06007-3#ref-CR29 “Zhai, G. & Min, X. Perceptual image quality assessment: a survey. Science China Information Sciences 63(11), 211301, https://doi.org/10.1007/s11432-019-2757-1
(2020).“). One prominent example is the Structural Similarity Index (SSIM), introduced by Wang et al.[30](https://www.nature.com/articles/s41597-025-06007-3#ref-CR30 “Wang, Z., Bovik, A. C., Sheikh, H. R. & Simoncelli, E. P. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612, https://doi.org/10.1109/TIP.2003.819861
(2004).“),[31](https://www.nature.com/articles/s41597-025-06007-3#ref-CR31 “Wang, Z., Simoncelli, E. P. & Bovik, A. C. Multi-scale structural similarity fo