Background & Summary
Building segmentation refers to the task of identifying and delineating building footprints at the pixel level from aerial or satellite imagery. It plays a crucial role in a wide range of geospatial applications, including remote sensing, urban planning, environmental monitoring, infrastructure management, and disaster response[1,2,3]. In these domains, AI-based tools for automatic building segmentation have become increasingly essential, offering fast, scalable, and cost-effective solutions for extracting building footprints and an opportunity to automate traditionally manual, resource-intensive mapping workflows[4].
In recent years, deep learning has emerged as a highly effective approach for building segmentation, with models based on convolutional neural networks (CNNs) achieving remarkable performance. These models learn hierarchical and contextual features directly from the input RGB imagery, allowing them to accurately delineate buildings across diverse environments and architectural styles[5]. Their effectiveness, however, depends strongly on the availability of large-scale, high-quality annotated training datasets[6]. Such datasets must have high spatial resolution to accurately capture fine architectural details, geographic diversity to ensure robustness across different urbanization levels and landscapes, and a sufficient volume of images to enable effective training of modern models and strong generalization performance.
Various datasets for building segmentation from aerial and satellite imagery have been proposed to support the training and evaluation of deep learning-based segmentation models. Table 1 compares the existing datasets.
Although many datasets exist, none combines large scale, high spatial resolution, and broad geographic diversity. For example, the ISPRS Potsdam dataset[7] offers very high spatial resolution but is geographically limited, covering only the single urban area of Potsdam, Germany. The INRIA dataset[8] provides imagery from multiple locations around the world, but it is limited to urban areas and does not include rural or suburban regions, while the Landcover.ai dataset[9] focuses only on rural areas across Poland. This geographic bias may limit model generalization across diverse urbanization levels and architectural styles. Datasets like SpaceNet 6[10], GF-7[11], and CBIS[12] provide broader geographic diversity, including many urban, suburban, and rural areas, but lack the spatial detail necessary for fine-grained segmentation. Similarly, the Massachusetts dataset[13] offers good geographic coverage and diversity, but its spatial resolution is relatively low (only 1 meter per pixel); this limited spatial detail may reduce the accuracy in capturing fine architectural features, such as small building footprints or complex roof structures. Other datasets, such as UBC[14], are relatively small in scale (only 66 km²), limiting their usability for training modern deep learning models. Recently, the Microsoft Building Footprints dataset (https://planetarycomputer.microsoft.com/dataset/ms-buildings) has been released: it includes over 999 million buildings worldwide, obtained from Bing Maps imagery collected between 2014 and 2021. However, it may suffer from temporal misalignment between imagery and annotations, leading to inconsistent footprints. In addition, it is generated entirely through automated deep learning methods without validation against authoritative cartographic sources, which may affect its accuracy and reliability in certain regions. For all these reasons, there is a need for a large-scale, high-resolution, and geographically diverse dataset that provides accurate and reliable pixel-level building annotations.
Given the limitations of existing datasets for building segmentation, in this work we introduce Segmentation Friuli Venezia Giulia (SegFVG)[15], a large-scale, high-resolution dataset containing 15,403 aerial image tiles, each of size 2000 × 2000 pixels, with a ground sampling distance of 0.1 meters. SegFVG includes precise pixel-level annotations of building footprints across 616 km² of the Friuli Venezia Giulia region in northeastern Italy. The region is particularly interesting for building segmentation as it encompasses a diverse range of environments, including alpine rural zones in the north, flat agricultural plains in the center, and densely populated coastal settlements along the Adriatic Sea. The geographical distribution of the tiles contained in SegFVG, together with tile examples, is shown in Fig. 1; each tile is represented by a black dot on the map. According to the classifications provided by the National Institute of Statistics (Istat) (https://www.istat.it/classificazione/principali-statistiche-geografiche-sui-comuni/), as shown in Fig. 2 (top row), SegFVG captures a diverse distribution of altimetric zones: 9.5% of the tiles are located in mountainous areas, 29.1% in hilly areas, and the remaining 61.5% in flat plains. In terms of urbanization levels, 11.5% of the tiles correspond to urban contexts, 45.8% to suburban areas, and 42.7% to rural settings. Finally, with respect to coastal proximity, 17.2% of the tiles are situated near the Adriatic Sea (coastal), while the remaining 82.8% are inland.
Fig. 1
Spatial distribution of SegFVG image tiles (left) across the Friuli Venezia Giulia region. Each black dot represents an image tile of 2000 × 2000 pixels (200 × 200 m, examples on the right) included in the dataset, illustrating the geographic coverage.
Fig. 2
Overview of the dataset composition based on altimetric zone, urbanization level, and coastal proximity classification. In the top row, the percentages refer to the tiles in each class, while in the bottom row, they refer to the number of buildings in each class.
In total, SegFVG includes approximately 357,000 annotated building structures. Figure 2 (bottom row) shows the building distribution according to altimetric zone, urbanization level, and coastal proximity, while Fig. 3 illustrates the distribution of these buildings grouped by municipality, highlighting the spatial variability in building density across the region, from dense urban centers to sparsely populated rural areas. The combination of varied landscapes, urbanization levels, and architectural styles makes SegFVG a representative and challenging benchmark for analyzing buildings across different geographic and urban contexts.
Fig. 3
Map of the Friuli Venezia Giulia region showing the number of annotated buildings in the SegFVG dataset. There are a total of 215 municipalities. Each area refers to a municipality and is color-coded according to the total count of buildings, showing the concentration of annotations across the region.
Overall, SegFVG is characterized by its large scale, high spatial resolution, and geographic diversity: it includes heterogeneous environments, such as coastal areas, plains, hills, and Alpine regions, which are associated with distinct settlement patterns and building typologies. These features make it particularly well suited for the development of deep learning models for building segmentation.
In addition to the dataset, we provide benchmark results using multiple deep learning models, which demonstrate the usability of SegFVG for the development of accurate segmentation models and can serve as a baseline for future research in this field. To the best of our knowledge, SegFVG[15] is the first publicly available building segmentation dataset focused on the Italian territory.
Methods
An overview of the framework we adopted to create SegFVG[15] is shown in Fig. 4. It involves three main steps: (1) data collection, (2) data pre-processing, and (3) data cleaning. In the following, we describe these steps in more detail.
Fig. 4
The framework used for the generation of the SegFVG dataset.
Data collection
We constructed the SegFVG dataset by leveraging official geospatial services provided by the Friuli Venezia Giulia (FVG) region. The data collection process begins with retrieving building shapes from the regional Web Feature Service (WFS) (https://serviziogc.regione.fvg.it/geoserver/EDIFICI/wfs). These layers provide building volumetric units for the entire region, derived from the Regional Technical Numerical Map (RTNM) (https://eaglefvg.regione.fvg.it/). A volumetric unit is defined as the smallest portion of a building that has a uniform elevation from the ground (https://geoportale.comune.milano.it/sit/tematiche/territorio/).
We first reprojected all vector layers to the EPSG:6708 coordinate reference system. We divided the FVG region into square tiles measuring 200 meters per side to organize the spatial data. We then discarded tiles containing fewer than 40 shapes (volumetric units) to prioritize areas with meaningful building density and to reduce uninformative background regions.
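As an illustration, the tiling and density filtering can be reproduced with a few lines of geopandas. This is a minimal sketch under stated assumptions: the WFS layer is assumed to have been downloaded locally, and the file name below is hypothetical.

```python
import geopandas as gpd
from shapely.geometry import box

TILE_SIZE = 200   # meters per tile side
MIN_SHAPES = 40   # minimum volumetric units per retained tile

# Hypothetical local copy of the WFS building layer.
buildings = gpd.read_file("buildings_fvg.gpkg").to_crs("EPSG:6708")
xmin, ymin, xmax, ymax = buildings.total_bounds

# Build a regular 200 m grid covering the region's bounding box.
tiles = [
    box(x, y, x + TILE_SIZE, y + TILE_SIZE)
    for x in range(int(xmin), int(xmax), TILE_SIZE)
    for y in range(int(ymin), int(ymax), TILE_SIZE)
]
grid = gpd.GeoDataFrame(geometry=tiles, crs="EPSG:6708")

# Keep only tiles intersecting at least MIN_SHAPES volumetric units.
joined = gpd.sjoin(grid, buildings, how="inner", predicate="intersects")
counts = joined.groupby(level=0).size()
retained = grid.loc[counts[counts >= MIN_SHAPES].index]
```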
For each retained tile, we downloaded true orthophotos with a ground sampling distance of 0.1 meters per pixel from the regional Web Map Service (WMS) (http://irdat-ortofoto.regione.fvg.it/geoserver/ortofoto/ows). These images were captured during aerial surveys conducted between 2017 and 2020 using aircraft equipped with Vexcel UltraCam Eagle and UltraCam Xp digital large-format cameras. We adopted a sliding window approach to manage the retrieval of images spanning large shapes. All resulting images are saved as GeoTIFFs to preserve accurate georeferencing metadata.
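For reference, a single tile request against a WMS endpoint can be issued with owslib. The sketch below is illustrative only: the WMS version, layer selection, output format, and bounding-box values are assumptions, not the exact parameters of our pipeline.

```python
from owslib.wms import WebMapService

WMS_URL = "http://irdat-ortofoto.regione.fvg.it/geoserver/ortofoto/ows"
wms = WebMapService(WMS_URL, version="1.1.1")  # version assumed

# Bounding box of one 200 m tile in EPSG:6708 (placeholder values).
xmin, ymin, xmax, ymax = 380000.0, 5090000.0, 380200.0, 5090200.0

response = wms.getmap(
    layers=[list(wms.contents)[0]],  # first advertised layer (assumed)
    srs="EPSG:6708",
    bbox=(xmin, ymin, xmax, ymax),
    size=(2000, 2000),               # 2000 px over 200 m -> 0.1 m GSD
    format="image/geotiff",
)
with open("tile.tif", "wb") as f:
    f.write(response.read())
```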
We clipped shapes to match tile boundaries and filtered them by a minimum area threshold of 20 m² to remove small, potentially noisy shapes. We converted each remaining shape into a polygon defined in image pixel coordinates using an affine transformation, applying a y-axis inversion to align with the image coordinate system. Finally, we serialized the polygon boundaries into JSON format.
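The world-to-pixel conversion can be sketched with rasterio, whose inverse affine transform maps projected coordinates to (column, row) pairs; for north-up rasters, the negative pixel height of the transform already encodes the y-axis inversion. A minimal, assumption-laden illustration:

```python
import rasterio

with rasterio.open("tile.tif") as src:
    inverse = ~src.transform  # inverse affine: world (x, y) -> pixel (col, row)

def polygon_to_pixel_coords(vertices):
    """Map EPSG:6708 vertices [(x, y), ...] into image pixel coordinates."""
    return [tuple(inverse * (x, y)) for x, y in vertices]
```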
For each tile, we generated a pair of files: a raster RGB image and a corresponding JSON annotation file containing the filtered volumetric units of each building represented in the image.
Data pre-processing
Following the data collection phase, the raw dataset consisted of raster RGB images paired with volumetric unit information provided as JSON files. As a first step, we discarded a subset of images that exhibited large black regions in the background. These areas, represented by zero-valued pixels, are likely the result of incomplete image tiles, which occur when a tile extends beyond the boundaries of the Friuli Venezia Giulia region. We then converted the vector polygons in the JSON files into rasterized binary masks, assigning a value of 1 to pixels within the polygons and 0 to all other pixels. During this process, we aggregated all connected volumetric units to produce a single unified footprint for each building. This step is essential because, in the RGB true orthoimages, these structures appear as continuous buildings without internal divisions, and treating each volumetric unit separately would create artificial splits that do not reflect the visual reality. The dataset obtained after these pre-processing operations consists of image/mask pairs. We refer to this preliminary version as SegFVG_v0.
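A hedged sketch of the mask generation step is shown below. The JSON layout assumed here (a list of polygons, each a list of [x, y] pixel-coordinate vertices) is purely illustrative and may differ from the released schema.

```python
import json
import cv2
import numpy as np

with open("tile.json") as f:
    polygons = json.load(f)  # assumed: [[[x, y], ...], ...] in pixel coords

mask = np.zeros((2000, 2000), dtype=np.uint8)
for poly in polygons:
    pts = np.round(np.array(poly)).astype(np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(mask, [pts], 1)
# Touching volumetric units fill into one connected region, effectively
# merging them into a single building footprint.
```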
Data cleaning
After manually inspecting a random subset of images from SegFVG_v0, we identified recurring annotation errors, such as missing buildings (false negatives) and incorrectly labeled structures (false positives). These issues are likely due to temporal discrepancies between the imagery and the RTNM data, as they were captured at different times, leading to mismatches caused by construction, demolition, or changes in land use. An example is presented in Fig. 7a, which shows both missing building footprints and erroneous annotations.
These inaccuracies can negatively impact both the training of deep learning models and the reliability of performance evaluations[16]. We thus implemented a data cleaning pipeline to improve the overall annotation quality, and we established a manually corrected reference set to quantitatively assess its effectiveness.
We first randomly selected 50 images from the SegFVG_v0 dataset and manually refined their corresponding masks using the CVAT tool (https://www.cvat.ai/). The goal was to ensure that all building footprints were accurately annotated, removing both false positives and false negatives. This manually corrected subset, referred to as the gold standard, is used as a reference for evaluating the effectiveness of the data cleaning procedure.
We then implemented a semi-automatic data cleaning procedure to address the annotation inconsistencies. In short, we trained deep learning models for building segmentation with different backbones on all the images in SegFVG_v0, excluding the 50 gold-standard images, and used the majority-voting consensus of their predictions to generate corrected segmentation masks. Using multiple backbones leverages the complementary strengths of different feature extractors, which can capture different semantic patterns from the training data even when it contains some inconsistencies[17]. Specifically, we used four models based on the U-Net architecture[18] with ResNet-50[19], EfficientNet-B4[20], DenseNet-201[21], and Xception[22] as backbone encoders. Each encoder is initialized with pre-trained weights from ImageNet[23]. During training, the batch size is set to 32 and the learning rate to 1e-4. We use the Dice loss $\mathcal{L}_{Dice}$, which is defined as:
$$\mathcal{L}_{Dice}=1-\frac{2\sum y_{pred}\cdot y_{true}}{\sum y_{pred}+\sum y_{true}+\epsilon},\qquad(1)$$

where $y_{pred}\in[0,1]$ is the predicted probability after sigmoid activation, $y_{true}\in\{0,1\}$ is the reference label, and $\epsilon$ is a small value to ensure numerical stability. We trained the models for 12,000 iterations. For data augmentation, we randomly cropped images at 256 × 256 pixel resolution and applied various strategies, including flips, rotations, and adjustments to brightness, contrast, hue, and saturation.
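A direct PyTorch transcription of Eq. (1) could look as follows; this is a sketch, and the exact value of $\epsilon$ used in our experiments is an assumption here.

```python
import torch

def dice_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
              eps: float = 1e-7) -> torch.Tensor:
    """Dice loss of Eq. (1): y_pred holds sigmoid probabilities in [0, 1],
    y_true holds binary reference labels in {0, 1}."""
    intersection = (y_pred * y_true).sum()
    return 1.0 - (2.0 * intersection) / (y_pred.sum() + y_true.sum() + eps)
```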
After all the models were trained, we used each of them to generate segmentation masks for the entire dataset. In addition to these four predictions, we considered the masks of SegFVG_v0, resulting in five candidate masks per image. We then applied majority voting across these five candidates to obtain a refined segmentation mask, where each pixel is labeled as building (i.e., 1) if at least three of the five masks agree. Formally, let $M=\{m_1,m_2,m_3,m_4,m_5\}$ be the set of binary masks, where each $m_i\in\{0,1\}^{H\times W}$ is a binary mask of height $H$ and width $W$, and $m_i(x,y)$ denotes the pixel value at location $(x,y)$ in the $i$-th mask, with 1 indicating a building pixel and 0 indicating the background. The refined mask $\widehat{m}\in\{0,1\}^{H\times W}$ is then defined as:

$$\widehat{m}(x,y)=\begin{cases}1 & \text{if }\sum_{i=1}^{5} m_i(x,y)\geq 3,\\ 0 & \text{otherwise.}\end{cases}\qquad(2)$$
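Eq. (2) amounts to a per-pixel vote count over the five candidate masks; a minimal NumPy sketch:

```python
import numpy as np

def majority_vote(masks):
    """Combine five binary (H, W) masks per Eq. (2): a pixel becomes
    building (1) when at least three of the five candidates agree."""
    votes = np.stack(masks, axis=0).sum(axis=0)
    return (votes >= 3).astype(np.uint8)
```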
We finally post-processed the obtained mask by applying morphological opening (erosion followed by dilation) using a 13 × 13 kernel, which helped remove small spurious regions.
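The post-processing step is a standard morphological opening, sketched with OpenCV below; a rectangular 13 × 13 structuring element is assumed.

```python
import cv2
import numpy as np

def postprocess(mask: np.ndarray) -> np.ndarray:
    """Morphological opening (erosion then dilation) with a 13x13 kernel,
    which removes small spurious regions from the refined mask."""
    kernel = np.ones((13, 13), dtype=np.uint8)  # rectangular element assumed
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```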
The dataset obtained after these data cleaning operations is called SegFVG_v1, which corresponds to the final version.
Data Record
The SegFVG[15] dataset is available for download at https://doi.org/10.17632/9kbc6zdn7b. It is distributed as a compressed ZIP archive of approximately 24 GB. The dataset presented and peer reviewed in this article corresponds to Version 2. Figure 5 illustrates the hierarchical directory structure.
Fig. 5
Directory structure of the SegFVG dataset.
The root folder, named SegFVG, contains three subdirectories: images, masks_v0, and masks_v1. They store the input RGB aerial images and their corresponding segmentation masks, where masks_v0 contains the preliminary annotations (SegFVG_v0), and masks_v1 contains the cleaned annotations produced by the data cleaning pipeline (SegFVG_v1). Each RGB image is associated with a segmentation mask that provides precise pixel-level annotations of building footprints. Some examples are shown in Fig. 6.
Fig. 6
Examples of aerial image tiles (top) and their corresponding building segmentation masks (bottom). The masks highlight building footprints in white.
All files are stored in GeoTIFF format and named consistently so that each image corresponds directly to its reference mask. Each image filename encodes the spatial extent of the corresponding tile using projected coordinate values. Specifically, the filename follows the structure xmin_ymin_xmax_ymax.tif, where the values represent the coordinates of the tile bounding box (upper-left and bottom-right corners) in the projected coordinate reference system EPSG:6708.
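For convenience, the tile bounding box can be recovered directly from a filename. The sketch below assumes the four coordinate values are separated by underscores and contain no underscores themselves:

```python
from pathlib import Path

def tile_bounds(filename: str):
    """Parse xmin_ymin_xmax_ymax.tif into EPSG:6708 bounding-box values."""
    xmin, ymin, xmax, ymax = map(float, Path(filename).stem.split("_"))
    return xmin, ymin, xmax, ymax
```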
The dataset is split into training, validation, and test partitions, which are provided as TXT files in the root directory: train.txt, val.txt, and test.txt. Each TXT file lists the filenames corresponding to the images in that split. Specifically, the training, validation, and test sets contain 10,748, 1,535, and 3,070 images, respectively. In addition, the root directory contains a gold_standard.txt file, which includes the filenames of the 50 manually corrected images and segmentation masks used to evaluate the data cleaning procedure. The corresponding masks before and after manual correction are placed in masks_v0 and masks_v1, respectively. The description of the dataset splits is reported in Table 2.
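A minimal sketch of consuming the split files, assuming each line of a TXT file stores one tile filename and that image and mask files share names, per the structure in Fig. 5:

```python
from pathlib import Path

root = Path("SegFVG")
train_files = (root / "train.txt").read_text().splitlines()

# Pair each RGB image with its cleaned (v1) mask.
pairs = [
    (root / "images" / name, root / "masks_v1" / name)
    for name in train_files if name.strip()
]
```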
Finally, following the classifications described in Section “Background & Summary”, we provide a CSV file named classification.csv, which contains information for each image tile, including the municipality it belongs to, its associated altimetric zone (mountain, hill, or plain), urbanization level (urban, suburban, or rural), and coastal proximity (coastal or inland). Table 3 shows some examples of records in the classification.csv file.
Technical Validation
We validated the quality of the segmentation masks obtained using the data cleaning pipeline described in Section “Data cleaning”. In addition, we conducted a series of experiments using standard deep learning models to assess the quality and usability of SegFVG[15] for building segmentation from aerial imagery. The results are evaluated using standard metrics for semantic segmentation, i.e., pixel-level Precision, Recall, F1-score, and Intersection over Union (IoU).
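For completeness, the four pixel-level metrics can be computed from the confusion counts as in the sketch below (a standard formulation; the epsilon guard against empty masks is our own addition):

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, ref: np.ndarray, eps: float = 1e-7):
    """Pixel-level Precision, Recall, F1-score, and IoU for binary masks."""
    tp = np.sum((pred == 1) & (ref == 1))  # true positives
    fp = np.sum((pred == 1) & (ref == 0))  # false positives
    fn = np.sum((pred == 0) & (ref == 1))  # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou
```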
Quality assessment of the data cleaning pipeline
With this evaluation, we aim to validate the data cleaning pipeline we used to correct the errors we identified in SegFVG_v0, as described in Section “Data cleaning”. Indeed, the annotations in SegFVG_v0 may contain errors that typically appear as false negatives, where buildings are missed, or false positives, where background regions are incorrectly labeled as buildings.
We validated the quality of the data cleaning pipeline using the gold standard. The performance of the models involved in this process is reported in Table 4. As shown, the direct comparison of the segmentation masks in SegFVG_v0 with the gold standard yields an F1-score of 0.856 and an IoU of 0.749. This highlights the presence of annotation errors and motivates the need to clean the data; we consider these results the baseline. The different deep learning models all improve over this baseline, confirming that model-based refinement is an effective way to identify and correct annotation errors. Finally, when the model predictions are combined and integrated with the annotations of SegFVG_v0, the resulting masks achieve even higher agreement with the gold standard: the performance improves to 0.947 F1-score and 0.900 IoU, a considerable gain in segmentation quality. Since the gold standard contains manually corrected, error-free annotations, higher performance on this set indicates that the cleaned annotations are more accurate and of better quality. This is also visually confirmed by Fig. 7, which shows an example of segmentation masks from SegFVG_v0 cleaned using the adopted pipeline; here, we polygonized the segmentation masks and display only the polygon boundaries for clarity. The cleaned masks are visibly closer to the gold standard.
Fig. 7
Example of the images obtained using the data cleaning process.
These results validate both numerically and visually the reliability of the proposed data cleaning pipeline, which significantly improved the quality of the annotations compared to those of SegFVG_v0.
Performance of deep learning models
The goal of this evaluation is to assess whether SegFVG[15] can support the training of deep learning models for building segmentation from aerial imagery.
We selected different architectures, including U-Net[18], a well-established encoder-decoder model; Pyramid Scene Parsing Network (PSPNet)[24], which aggregates context using pyramid pooling at different spatial resolutions; Feature Pyramid Network (FPN)[25], which enhances semantic segmentation by combining high-level semantic features with low-level spatial details across scales; DeepLabV3[26], which leverages atrous spatial pyramid pooling to capture multi-scale contextual information; Pyramid Attention Network (PAN)[27], which integrates multi-scale contextual information through spatial attention mechanisms; and Segmentation Transformer (SegFormer)[28], a transformer-based model known for its efficiency and accuracy. We followed the same training setup adopted for the data cleaning pipeline, as described in Section “Data cleaning”; the only difference is the number of training iterations, here set to 8,400. The hyperparameters used for these experiments are summarized in Table 5.
We report the experimental results obtained on the test set, using SegFVG_v1 masks as reference, in Table 6. These results can serve as a baseline for future research. In addition to model performance, the table reports the total number of model parameters, the training time required to complete 8,400 iterations using the previously described setup, and the inference time required to generate a segmentation map from an input image of size 2000 × 2000 pixels. Time is measured on an NVIDIA GeForce GTX 1080 GPU.
Looking at precision, SegFormer[28] achieves the highest value (0.955), showing its effectiveness in minimizing false positives. In terms of recall, U-Net[18] achieves the highest value (0.975), indicating a stronger ability to detect most pixels corresponding to buildings. Among all models, U-Net also achieves the highest F1-score (0.959), indicating a strong balance between precision and recall, and the highest IoU (0.921), confirming its robustness across different metrics. Overall, U-Net is the best-performing model. In contrast, PSPNet[24] exhibits the lowest performance, with an F1-score of 0.817 and F1-score and IoU values that are 0.142 and 0.230 lower, respectively, than those of U-Net. Figure 8 shows some examples of building segmentation masks obtained using the U-Net model. Specifically, Fig. 8d highlights the differences between the predicted masks and the reference ones, illustrating common errors made by building segmentation models. These include: (1) missed buildings (false negatives), where small or shadowed structures are not detected; (2) spurious detections (false positives), where non-building elements are incorrectly classified as buildings; and (3) boundary inaccuracies, where mask borders are overly smooth or slightly misaligned.
Fig. 8
Examples of building segmentation masks obtained using U-Net[18]. In Fig. 8d, white pixels are true positives, red pixels are false negatives, and blue pixels are false positives.
Using the U-Net model, we conducted a more detailed analysis of how segmentation performance varies across the different geographical and demographic conditions present in the dataset. For reference, we also report the performance of the other models.
As a first step, we evaluated model performance in regions with varying urbanization levels. Each image tile was categorized as urban, suburban, or rural according to the number of inhabitants per square kilometer. These categories often reflect different patterns in the distribution and structure of buildings: urban areas are typically characterized by compact layouts, where buildings are closely spaced and exhibit a wide variety of shapes and sizes, whereas rural areas contain only a few isolated structures, often dispersed across large expanses of natural or agricultural land. The results are summarized in Table 7.
The model demonstrates consistent performance across all urbanization levels, with F1-scores above 0.940 and IoU values above 0.900. Slightly lower performance in urban areas suggests challenges related to building overlap and structural complexity. In contrast, suburban and rural areas yield slightly better results, likely due to simpler spatial arrangements and clearer separation between buildings. Figure 9 shows some examples of segmentation results obtained in urban and rural areas.
Fig. 9
Example of UNet segmentation results on urban and rural areas.
We further analyzed the performance of the model in different types of landscapes by grouping image tiles into three categories according to their altimetric zone: mountains, hills, and plains. This geographic diversity often presents buildings with different characteristics: mountainous regions often contain scattered buildings reflecting the lower population density, hilly areas tend to feature moderately dense and varied structures, while plains typically exhibit more regular, densely arranged buildings. The results presented in Table 8 show that the model performs well in all types of landscapes.
The lowest performance is observed in mountainous areas, where the model achieves an F1-score of 0.949 and an IoU of 0.903. This is likely due to visual and topographic complexity in such regions, including steep slopes, shadows, and more irregular building patterns. In contrast, performance improves slightly in hilly regions, with an F1-score of 0.958 and an IoU of 0.919, and reaches its highest levels in plains, where the model achieves an F1-score of 0.960 and an IoU of 0.924. The more uniform terrain and regular building layouts in plain regions likely contribute to this improved performance. Figure 10 shows examples of segmentation results obtained in mountainous and plain areas.
Fig. 10
Example of UNet segmentation results on mountains and plains.
Finally, we investigated the model performance by separating coastal from inland areas. The results in Table 9 show that the model performs well in both contexts, but with slightly better results in inland areas. Specifically, the model achieves an F1-score of 0.961 and an IoU of 0.925 for inland areas, compared to an F1-score of 0.949 and an IoU of 0.903 in coastal areas. The lower performance in coastal zones may be attributed to the greater variability in building styles and densities, as well as visual interference from the sea, beaches, and other shoreline features.
These results demonstrate the capability of SegFVG[15] to support high-performance building segmentation from aerial imagery across diverse geographic and demographic contexts. The consistent performance across different population densities and landscapes validates the representativeness of the dataset and its utility for model development and downstream applications.
Data availability
SegFVG[15] is publicly available at https://doi.org/10.17632/9kbc6zdn7b under a CC BY 4.0 license, which permits reuse and modification with appropriate attribution. The dataset is distributed as the archive file SegFVG.zip.
Code availability
The source code for training deep learning models and performing inference, together with pretrained model checkpoints, is available in the archive Code.zip at https://doi.org/10.17632/9kbc6zdn7b. The code and checkpoints are released under a CC BY 4.0 license, permitting reuse and modification with appropriate attribution.
References
Wu, G. et al. Automatic building segmentation of aerial imagery using multi-constraint fully convolutional networks. Remote Sensing 10(3), 407, https://doi.org/10.3390/rs10030407 (2018).
Nielsen, M. M. Remote sensing for urban planning and management: The use of window-independent context segmentation to extract urban features in stockholm. Computers, Environment and Urban Systems 52, 1–9, https://doi.org/10.1016/j.compenvurbsys.2015.02.002 (2015).
Gupta, R. & Shah, M. Rescuenet: Joint building segmentation and damage assessment from satellite imagery. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 4405–4411, IEEE, https://doi.org/10.1109/ICPR48806.2021.9412295 (2021).
Li, Z., Xin, Q., Sun, Y. & Cao, M. A deep learning-based framework for automated extraction of building footprint polygons from very high-resolution aerial imagery. Remote Sensing 13(18), 3630, https://doi.org/10.3390/rs13183630 (2021).
Zhu, X. X. et al. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36, https://doi.org/10.1109/MGRS.2017.2762307 (2017).
Yu, A. et al. Deep learning methods for semantic segmentation in remote sensing with small data: A survey. Remote Sensing 15(20), 4987, https://doi.org/10.3390/rs15204987 (2023).