Main
Main
Image datasets have played a foundational role in the history of AI development, with ImageNet12 enabling the rise of deep learning methods in the early 2010s13. While AI technologies have made tremendous strides in their capabilities and adoption since then, bias in data and models remains a persistent challenge2,14. Inadequate evaluation data can result in fairness and robustness issues, making it challenging to identify potential harms1,10,15. These harms include the perpetuation of racist, sexist and physiognomic stereotypes2,4, as well as the exclusion or misrepresentation of entire populations3,5,16. Such data inadequacies therefore compromise the fairness and accuracy of AI models.
The large-scale scraping of images from the web without consent2,6,17 not only exacerbates issues related to data bias, but can also present legal issues, particularly related to privacy7,18,19 and intellectual property (IP)20. Consequently, prominent datasets have been modified or retracted8. Moreover, the lack of fair compensation for data and annotations presents critical concerns about the ethics of supply chains in AI development21,22.
Datasets made available by government agencies such as NIST23 or using third-party licensed images24 often have similar issues with the absence of informed consent and compensation. Many dataset developers mistakenly assume that using images with Creative Commons licences addresses relevant privacy concerns3. Only a few consent-based fairness datasets with self-reported labels exist25,26,27. However, these datasets have little geographical diversity. They also lack pixel-level annotations, meaning that they can be used for only a small number of human-centric computer vision tasks3.
Evaluating models and mitigating bias are key for ethical AI development. Recent methods such as PASS28, FairFaceVar29 and MultiFair30 aim to reduce demographic leakage or enforce fairness constraints through adversarial training and fairness-aware representations. Previous work has also shown that many face-recognition models and benchmarks encode structural biases, underscoring the need for fairness at every stage of development31. Yet, these methods remain constrained by the same dataset limitations that they seek to address, including a lack of consent, demographic self-identification and global representation. Most research in the computer vision fairness literature relies on repurposing non-consensual datasets that lack self-reported demographic information. As a result, researchers resort to guessing complex social constructs, such as the race and gender of image subjects, from images alone. These inferences can entrench stereotypes32,33, cause psychological harm to data subjects when inaccurate34,35 and compromise the validity of downstream tasks36.
The dearth of responsibly curated datasets creates an ethical dilemma for practitioners who would like to audit bias in their models. Their options are to use (1) diverse and densely annotated public datasets that carry legal or ethical risks; (2) one of the few publicly available consent-based but highly limited datasets (requiring them to add their own pixel-level annotations); (3) proprietary datasets that do not provide transparency to external parties; (4) datasets that have been quietly retracted due to ethical concerns but continue to circulate in derivative forms37; or (5) nothing—simply to not check for bias7,11,18.
To address these challenges, we introduce the Fair Human-Centric Image Benchmark (FHIBE), a publicly available, consensually collected, globally diverse fairness evaluation dataset for a wide range of vision-based tasks, from face verification to visual question answering (VQA). FHIBE comprises 10,318 images of 1,981 unique individuals from 81 countries/areas38. Current consent-based fairness datasets25,26,27 lack data from regions with stringent regulations, such as the European Union (EU), making FHIBE, to our knowledge, the first publicly available, human-centric computer vision dataset to include consensually collected images from the EU. FHIBE features the most comprehensive annotations to date, covering demographic and physical attributes, environmental conditions, camera settings and pixel-level labels. To assess FHIBE’s capabilities, we used it to evaluate bias in a wide variety of narrow models (designed for specific tasks) and foundation models (general purpose) commonly used in human-centric computer vision. Our analyses spanned eight narrow model tasks (pose estimation, person segmentation, person detection, face detection, face parsing, face verification, face reconstruction and face super-resolution), along with VQA for foundation models. We affirm previously documented biases, and we show that FHIBE can support more granular diagnoses of the factors leading to such biases. We also identify previously undocumented biases, including lower model performance for older individuals and strong stereotypical associations in foundation models based on pronouns and ancestry.
A large number of participants were involved in the data collection, annotation and quality assurance (QA) processes for our project (as described in Supplementary Information C). To collect a dataset as globally diverse as possible, we worked with data vendors to collect data from crowdsourced image subjects. Additional annotations were also collected from crowdsourced and vendor-employed annotators. We provided extensive guidelines to vendors and performed additional steps for QA, privacy preservation, IP protection and consent revocation to further protect the rights of those involved in the data-collection process (Methods). By creating FHIBE, we not only provide researchers with a new evaluation dataset, but we also show the possibilities and limitations of responsible data collection and curation in practice.
FHIBE
Overview
FHIBE comprises 10,318 images of 1,981 unique individuals, averaging six images per primary subject. We used a crowdsourcing approach, working with data vendors that operate globally to collect the dataset. We developed comprehensive data-collection guidelines and implemented a rigorous quality assessment protocol, which we discuss in detail in the Methods.
The dataset includes 1,711 primary subjects (individuals submitting images of themselves; Supplementary Information C) and 417 secondary subjects (individuals who appear alongside primary subjects, increasing the diversity and complexity of the images). Note that some primary subjects are also secondary subjects in other images. In total, 623 images contain both primary and secondary subjects. Captured between May 2011 and January 2024, the images span 81 countries/areas across 5 regions and 16 subregions38. To increase the diversity of the images (location, clothing, appearance, environmental conditions and so on), we permitted participants to submit images that they had previously taken of themselves. The images were taken with 785 distinct camera models from 45 manufacturers, and represent a wide range of real-world conditions, including 16 scene types, 6 lighting conditions, 7 weather scenarios, 3 camera positions and 5 camera distances. Example images with the accompanying subject, instrument and environment metadata are provided in Fig. 1.
Fig. 1: Annotations about the image subjects, instrument and environment are available for all images in FHIBE.
For visualization purposes, we display one type of metadata per image in this figure. Each annotation is linked to the annotators who made or checked the annotation. If the annotator disclosed their demographic attributes (age, pronouns, ancestry), that information is also provided. A full list of annotations is provided in Supplementary Information A. NA, not applicable.
FHIBE also features self-reported pose and interaction annotations, with predefined labels categorized into 16 body poses, 2 head poses and 47 distinct interactions—14 with other subjects and 33 with objects. The dataset offers a rich array of appearance characteristics, including 15 hair and 4 facial hair styles, 7 hair types, 13 hair and 12 facial hair colours, 9 eye colours and 11 types of facial marks.
There are also 6 pronoun categories, 56 integer ages (18 to 75 years) grouped into 5 age categories, 20 ancestry subregions within 5 regions and 6 Fitzpatrick skin tones39. There are 1,234 intersectional groups defined by age group, pronoun, ancestry subregion and Fitzpatrick skin tone, with the number of images per group ranging from 1 to 1,129, with a median of 9 images.
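To illustrate how such intersectional groups can be enumerated, the following minimal sketch (not the released tooling; the file and column names are assumptions) counts images per age group × pronoun × ancestry subregion × skin tone combination with pandas:

```python
# Minimal sketch, assuming a hypothetical metadata table with one row per
# image-subject pair and self-reported columns named as below.
import pandas as pd

metadata = pd.read_csv("fhibe_metadata.csv")  # hypothetical file name

group_cols = ["age_group", "pronouns", "ancestry_subregion", "fitzpatrick_skin_tone"]
group_sizes = metadata.groupby(group_cols).size().sort_values()

print(f"Intersectional groups: {len(group_sizes)}")
print(f"Images per group: min={group_sizes.min()}, "
      f"median={group_sizes.median():.0f}, max={group_sizes.max()}")
```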
FHIBE includes pixel-level annotations for face and person bounding boxes, 33 keypoints and 28 segmentation categories (Fig. 2). Annotator identifiers (an anonymized ID distinguishing each annotator) are provided for each annotation. Annotator demographic information is also included for transparency, if self-disclosed by the annotators. A complete list of annotations is provided in Supplementary Information A. Distribution plots showing the diversity of FHIBE are shown in Extended Data Figs. 1 and 2 and Supplementary Information B and D. The inter-rater reliability analysis, showing the high quality and consistency of FHIBE annotations, is shown in the Methods and Supplementary Information E.
Fig. 2: Example FHIBE images annotated with detailed pixel-level annotations, keypoints, segmentation masks and bounding boxes.
Pixel-level annotations include keypoint annotations (small red circles) indicating the geometric structure (white lines) of human bodies and faces (for example, right eye inner, left foot index); segmentation masks dividing the human body and face into segments, assigning a label to each pixel (for example, left arm, jewellery); and face and person bounding boxes (red and blue rectangles, respectively).
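As a rough illustration of how a single annotation record of this kind might be organized, the sketch below uses hypothetical field names and values; it is not the released FHIBE schema.

```python
# Hypothetical example record; field names and values are illustrative only.
annotation = {
    "image_id": "img_000123",
    "subject_id": "subj_0456",
    "face_bbox": [310, 120, 498, 355],        # [x_min, y_min, x_max, y_max] in pixels
    "person_bbox": [250, 80, 560, 900],
    "keypoints": {                            # up to 33 named keypoints
        "right_eye_inner": (402, 210),
        "left_foot_index": (331, 884),
    },
    "segmentation_mask": "img_000123_mask.png",   # 28-category pixel label map
    "annotator_id": "ann_789",                    # anonymized annotator identifier
    "annotator_demographics": {                   # included only if self-disclosed
        "age": None, "pronouns": None, "ancestry": None,
    },
}
```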
Furthermore, FHIBE includes two derivative face datasets: a cropped-only set with 10,941 images from 1,981 subjects, and a cropped-and-aligned set with 8,370 images from 1,824 subjects. Both face datasets include all annotations.
Comparison with existing datasets
In Extended Data Table 1, we compare FHIBE against 27 human-centric computer vision datasets that have been used in fairness evaluations, considering their collection methods, annotations and ethical dimensions.
The majority of the datasets were scraped from Internet platforms or derived from scraped datasets. Seven well-known datasets were revoked by their authors and are no longer publicly available. Reasons for their removal are typically not stated explicitly, but they point to growing criticism of, and ethical concerns around, scraping web data for AI development37. While a number of datasets have annotated bounding boxes, keypoints and segmentation masks, their pixel-level annotations do not match the density of FHIBE’s annotations. Datasets with dense pixel-level annotations, like COCO40, VQA2.041 and MIAP42, contain only limited demographic information, none of which is self-reported.
Only four datasets mention that data were collected after obtaining consent from data subjects. CCv226 and the Chicago Face Database27 are consent-based datasets, but provide no further details on how consent was obtained. While Dollar Street43 provides details on how consent was obtained, use in AI development was not stated as its purpose for collection, and there is no indication that the subjects consented to the processing of their biometric or other personal information. FHIBE stands out as the only dataset collected with robust consent for AI evaluation and bias mitigation.
FHIBE also has greater utility for diagnosing bias in AI compared with other consent-based datasets. CCv2 and Dollar Street have no pixel-level annotations. This makes them unsuitable for the diverse computer vision task evaluations that FHIBE enables. CCv2 and Chicago Face Database also only feature videos/images of individuals facing the camera, largely indoors, with only their head and shoulders shown. They lack full-body images and diverse backgrounds and poses, limiting their utility for many computer vision tasks, such as pose estimation, and for evaluating how models might perform in deployment contexts in which the individuals might not be looking at the camera.
Moreover, FHIBE stands out from other consent-driven datasets in terms of its detailed and self-reported demographic labels, which enable the investigation of model performance at complex intersections of demographic attributes (Table 1). Although CCv1 has 4.4 times more images and CCv2 has 2.8 times more subjects than FHIBE, FHIBE has 3.4 times more annotations and 16.9 times more attribute values (Table 2). FHIBE also has greater representation from regions that are under-represented in many computer vision datasets, such as Africa (44.7%) and lower-middle income economies (71.5%) (Table 3), making it uniquely suitable for bias evaluation.
Ethical considerations in FHIBE design
In developing FHIBE, we sought to implement best practices for ethical data collection recommended in the literature2,3,44. We focused particularly on consent, privacy protection, compensation, safety, diversity and utility. The design decisions discussed below can also provide a starting point for future responsible data collection and curation efforts, including those not focused on fairness evaluation. Detailed descriptions of how these ethical considerations were implemented are provided in the Methods.
Consent
Informed consent is central to research involving human participants, promoting participant safety and protection while supporting research integrity19,45. It involves the participants having sufficient information regarding the project and the potential risks before deciding to participate. Informed consent is also fundamental to data privacy protection, as encoded in various laws and regulations7,18,19,46.
Our consent processes were designed to comply with comprehensive data protection laws like the EU General Data Protection Regulation (GDPR)46. These processes included developing consent forms with clear language about the uses and disclosures of the collected data, the processing of biometric and sensitive data and the rights of data subjects with regard to their data. Policy considerations imbued in data privacy laws, such as respect for human dignity, also influenced other aspects of our data collection, including decisions regarding the types of attributes we collected (for example, pronouns rather than gender), participant recruitment guidelines (for example, no coercive practices) and restrictions on downstream uses of the dataset (for example, users are prohibited from attempting to reidentify subjects).
To ensure that consent is given on a voluntary basis46, data subjects retain control over their personal information and may withdraw their personal data from the dataset at any time, with no impact on the compensation they received from the project. In the event of consent withdrawal, we commit to maintaining dataset integrity by replacing withdrawn images and preserving the dataset’s size and diversity to the extent possible. This commitment makes FHIBE a first in computer vision—a living dataset designed to evolve responsibly.
Privacy and IP
In addition to obtaining informed consent, we took further measures to remove incidental personal information from the images. We used a state-of-the-art generative diffusion model47 to in-paint over non-consensual subjects (for example, individuals in the background of an image) and personally identifiable information (for example, licence plates and credit cards). We then manually checked each image to verify that the personal information had been removed, mitigating potential algorithmic biases in the automated methods48. This approach avoids the limitations of traditional privacy measures, such as automated face blurring49, which can still allow for reidentification through distinctive non-facial features (for example, tattoos and birthmarks)50. We further tested our method to ensure that it did not compromise the utility of the data for model evaluation. Moreover, we coarsened certain attributes and released others only in aggregate form.
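As a rough sketch of such an in-painting step (the pipeline and model below are assumptions, not necessarily those used for FHIBE), one could mask the region containing incidental personal information and have a diffusion in-painting model fill it with neutral background before manual review:

```python
# Illustrative sketch only. Requires: pip install diffusers torch pillow
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # assumed model choice
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("street_scene.jpg").convert("RGB")          # image under review
mask = Image.open("background_person_mask.png").convert("L")   # white = region to replace

# Replace the masked region (e.g. a bystander or a licence plate) with plausible
# background content, then queue the result for manual verification.
inpainted = pipe(prompt="empty street background", image=image, mask_image=mask).images[0]
inpainted.save("street_scene_inpainted.jpg")
```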
To secure appropriate rights to license the images for downstream users, the participants submitting images were also required to review and agree to terms affirming they had the rights to provide the images and understood the nature of their contribution. Furthermore, our instructions to data vendors and participants included requirements to minimize the presence of third-party IP, such as trademarks and landmarks. We also implemented automated checks with manual verification to detect and exclude images with prominent third-party IP, such as logos, from our dataset.
Compensation
Crowdworkers often contend with low wages and demanding working conditions21,22, while individuals whose images are included in web-scraped datasets receive no compensation. To address these concerns, we asked data vendors to report minimum payment rates per task per region and to compensate crowdworker participants—image subjects, annotators and QA annotators (definitions are provided in Supplementary Information C)—at no less than the applicable local minimum wage based on task-time estimates. Vendors’ reported minimum payment rates were cross-referenced against the International Labour Organization’s Global Wage Report51 or, where this was not applicable, against the minimum wage of a country with comparable GDP per capita. The median compensation for image subjects was 12× the applicable minimum wage (further information about project costs is provided in the Discussion and Methods).
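The wage-floor check itself reduces to simple arithmetic. The sketch below is a minimal illustration with assumed numbers, not project data:

```python
# Minimal sketch: check a vendor's reported per-task payment against a local
# minimum-wage floor, given a task-time estimate. All figures are hypothetical.
def meets_wage_floor(payment_per_task: float,
                     task_minutes: float,
                     hourly_minimum_wage: float) -> bool:
    """Return True if the per-task payment implies at least the hourly minimum wage."""
    implied_hourly_rate = payment_per_task / (task_minutes / 60.0)
    return implied_hourly_rate >= hourly_minimum_wage

# Example: a 10-minute annotation task paid at 2.50 against a 9.00/hour minimum.
print(meets_wage_floor(payment_per_task=2.50, task_minutes=10, hourly_minimum_wage=9.00))
# -> True (implied rate is 15.00/hour)
```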
Safety
Web-scraped datasets frequently include harmful and illegal content, ranging from derogatory annotations to instances of child sexual abuse material (CSAM)2,6,17. Although the risk of such content appearing in our dataset was low given our sourcing method, instructions to data subjects and vendor QA, we performed additional manual and automated checks to ensure safety. Each image was manually reviewed to identify and remove any harmful content, and the image hashes were cross-referenced against a database of known CSAM maintained by the National Center for Missing & Exploited Children (NCMEC). This dual approach—leveraging both technology and human judgement—helped to create a dataset that is both safe and respectful of human dignity.
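The hash cross-referencing step amounts to a set-membership test. The sketch below illustrates only that mechanism, with hypothetical file names; in practice, matching against NCMEC-maintained lists is done through dedicated hash formats and services (for example, perceptual hashes such as PhotoDNA) under agreement, not ad hoc code:

```python
# Illustrative only: membership test of image file hashes against a provided hash list.
import hashlib
from pathlib import Path

known_hashes = set(Path("known_hashes.txt").read_text().split())  # hypothetical hash list

def file_hash(path: Path) -> str:
    """Hex digest of the raw file bytes (real systems use robust perceptual hashes)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

flagged = [p.name for p in Path("candidate_images").glob("*.jpg")
           if file_hash(p) in known_hashes]
print(f"{len(flagged)} image(s) flagged for removal and escalation")
```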
Diversity
While diversity is a relevant consideration for data collection generally, the fact that FHIBE is a fairness evaluation set made it especially important to optimize for diversity across many dimensions: image subject demographics, appearance (for example, not wearing the same clothing in all images), poses, interactions between subjects and objects, and environment.
FHIBE contains detailed demographic information—such as age, pronouns and ancestry—making it possible to use FHIBE to evaluate model bias along many axes of interest. As FHIBE is a publicly available dataset, we sought to balance minimizing the disclosure of sensitive information against maximizing the availability of useful annotations for bias diagnosis. This led to our decision to collect pronouns, as pronouns are more likely to be public-facing information, whereas gender identity and sex can be quite sensitive, particularly for gender and sex minorities52. Moreover, while we collected information on data subjects’ disability status, pregnancy status, height and weight to measure the diversity of our dataset along these dimensions, we do not release these annotations with the dataset and only disclose the summary statistics in aggregate for transparency purposes (Supplementary Information B.1). Note that participant disclosures about pregnancy and disability status were optional.
Collecting pronouns rather than gender identity also reduced risks associated with misgendering3,53, and collecting ancestry offered a more stable alternative to country-specific racial categories3,54. We further describe the rationales for using pronouns and ancestry in Supplementary Information J.
We also collected annotations on phenotypic and performative markers to enhance bias analysis. Phenotypic attributes—like skin colour, eye colour and hair type—provide observable characteristics related to relevant demographic bias dimensions9, while performative markers—such as facial hair, cosmetics and clothing—help to identify social stereotypes and spurious correlations55. Moreover, FHIBE includes camera-level metadata and environmental annotations, capturing factors such as illumination, camera position and scene, which are important for understanding model performance across diverse conditions16,56.
With the exception of pixel-level annotations, head pose and camera distance, we focused on the collection of self-reported information to address the limitations (as discussed above) of previous data-collection efforts that used annotators to guess subjects’ attributes. Collecting self-reported attributes (as opposed to labelling them later) had the additional benefit of ensuring that the participants were well aware of the information about them that would be used in the project.
Utility
An evaluation set is valuable only insofar as it enables assessments of model performance on relevant tasks. FHIBE provides extensive annotations for analysing human-centric visual scenes, including face- and person-specific bounding boxes, keypoints and segmentation masks. As a result, FHIBE can be used to evaluate models across a much wider variety of tasks than previously possible using consent-based computer vision datasets. Its combination of pixel-level annotations and attribute labels makes FHIBE, to our knowledge, the most comprehensively annotated fairness dataset currently available.
Moreover, we compared the utility of FHIBE as a fairness evaluation set with existing datasets. As discussed in the Methods, for each of the eight narrow model computer vision tasks that FHIBE was designed for, we evaluated commonly used models using FHIBE and pre-existing evaluation datasets (Supplementary Information F). The findings are discussed in the ‘Evaluation results’ section below.
Evaluation results
Bias discovery in narrow models
FHIBE’s diverse and comprehensive annotations provide both breadth and depth in fairness assessments, enabling the evaluation of models across a range of demographic attributes and their intersections. We examined the performance of a variety of pretrained narrow models—across eight common computer vision tasks: pose estimation, person segmentation, person detection, face detection, face parsing, face verification, face reconstruction and face super-resolution—on FHIBE’s demographic groups and their intersections (that is, pronoun × age group × ancestry × skin tone). The exact methodology is described in the Methods.
Through our benchmarking analysis, we found that intersectional groups combining multiple sensitive attributes—including pronoun, age, ancestry and skin tone—experience the largest performance disparities (Supplementary Fig. 21). Notably, despite the fact that skin tone is often used as a proxy for ancestry/race/ethnicity in fairness evaluations57, we find that intersections featuring both skin tone and ancestry have much greater disparities than those with only one of these attributes.
For each task, we also examined the intersectional groups for which the models showed the highest versus lowest disparity in performance. Note that, for this particular analysis, we considered only groups with at least ten subjects, and pairwise group comparisons were filtered using the Mann–Whitney U-test for statistical significance. To control for multiple comparisons, we applied Bonferroni correction58 by adjusting the significance threshold based on the number of pairwise tests, therefore considering only pairs with a statistically significant difference (\(P < \frac{0.05}{\text{number of pairwise tests}}\)). Through this analysis (Extended Data Table 2 and Supplementary Information K), we found that younger individuals (aged 18–29 years), those with lighter skin tones and those with Asian ancestry were more frequently among the groups that models performed best on, whereas older individuals (aged 50–59 and 60+ years), those with darker skin tones and those with African ancestry appeared more often among the groups that models performed worst on. However, despite these high-level trends, there was variability across models and specific intersections. For example, for face detection, RetinaFace performed best for ‘she/her/hers × type I × Asia’ and worst for ‘he/him/his × type II × Africa’, whereas MTCNN performed best for ‘she/her/hers × type II × Africa’ and worst for ‘he/him/his × type IV × Europe’.
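A minimal sketch of this filtering step is shown below. It uses dummy per-subject scores (not our results) and assumes scores are already collected per intersectional group with at least ten subjects each:

```python
# Minimal sketch (not the exact analysis pipeline): pairwise Mann–Whitney U-tests
# between intersectional groups, filtered with a Bonferroni-adjusted threshold.
from itertools import combinations
from scipy.stats import mannwhitneyu

scores = {  # dummy per-subject performance scores, for illustration only
    "she/her/hers x type I x Asia":   [0.97, 0.95, 0.96, 0.94, 0.98, 0.95, 0.97, 0.96, 0.95, 0.96],
    "he/him/his x type II x Africa":  [0.90, 0.88, 0.92, 0.87, 0.91, 0.89, 0.90, 0.86, 0.88, 0.91],
    "he/him/his x type IV x Europe":  [0.93, 0.94, 0.92, 0.95, 0.93, 0.91, 0.94, 0.92, 0.93, 0.95],
}

pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni correction over all pairwise tests

for g1, g2 in pairs:
    stat, p = mannwhitneyu(scores[g1], scores[g2], alternative="two-sided")
    if p < alpha:
        gap = abs(sum(scores[g1]) / len(scores[g1]) - sum(scores[g2]) / len(scores[g2]))
        print(f"Significant disparity ({gap:.3f}): {g1} vs {g2} (p={p:.4g})")
```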
This variability highlights the importance of testing for intersectional biases on a case-by-case basis, as bias trends can vary depending on the specific model–task combination. Overall, disparities likely arise from a combination of systemic biases—such as demographic under-representation—and task- or model-specific interactions with sensitive attributes. While some patterns align with broader structural inequalities, others reflect localized effects, emphasizing the need for nuanced and intersectional fairness assessments, which FHIBE’s extensive demographic annotations facilitate.
FHIBE further enables in-depth analyses of model performance disparities by identifying the specific features contributing to bias trends with greater granularity than existing datasets facilitate. For example, we found that face-detection models showed consistently higher accuracy for individuals with she/her/hers pronouns compared with he/him/his pronouns (Supplementary Tables 14–16), a finding consistent with previous research59. Through our direct error modelling analysis, we used FHIBE’s extensive annotations to identify attribute