Abstract
Meaningful governance of any system requires the system to be assessed and monitored effectively. In the domain of Artificial Intelligence (AI), global efforts have established a set of ethical principles, including fairness, transparency, and privacy, upon which AI governance expectations are being built. The computing research community has proposed numerous means of measuring an AI system’s normative qualities along these principles. Current reporting of these measures is principle-specific, limited in scope, or otherwise dispersed across publication platforms, hindering the domain’s ability to critique its practices. To address this, we introduce the Responsible AI Measures Dataset, consolidating 12,067 data points across 791 evaluation measures covering 11 ethical principles. It is extracted from a corpus of computing literature (n = 257) published between 2011 and 2023. The dataset includes detailed descriptions of each measure, AI system characteristics, and publication metadata. An accompanying interactive visualization tool supports usability and interpretation of the dataset. The Responsible AI Measures Dataset enables practitioners to explore existing assessment approaches and critically analyze how the computing domain measures normative concepts.
Code availability
The project files are publicly available on the figshare repository, “Responsible_AI_Measures_Dataset”, at the following location: https://figshare.com/articles/dataset/_b_Responsible_Artificial_Intelligence_RAI_Measures_Dataset_b_/29551001?file=57701437. Two code files are included: a Jupyter notebook containing the Python code used to clean and process the data and to produce all plots, and a Python script that creates and deploys the interactive visualization. The repository is expected to remain active indefinitely. This dataset is published as Version 1.0. Updates will follow a major.minor versioning scheme, with major additions indicated by the first number (e.g., 2.0 or 3.0) and minor changes indicated by the second number (e.g., 1.1 or 1.2). The README file will be updated to catalog all major and minor changes to the dataset, and all updates will be reflected on the figshare repository. We are currently developing a method to enable users to propose and contribute additional measures, allowing the dataset to expand over time in a way that reflects emerging needs and uses. We are approaching this as a collaborative effort, consulting with practitioners and researchers from diverse backgrounds and disciplines to ensure the dataset remains relevant and inclusive across contexts.
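For readers who wish to work with the dataset programmatically, the following is a minimal sketch of loading and summarizing it with pandas after downloading it from the figshare repository. The file name and column names used here (responsible_ai_measures.csv, principle, measure_name, publication_year) are illustrative assumptions, not the repository’s actual schema; consult the README for the real file and field names.

```python
# Minimal sketch of loading and summarizing the dataset.
# NOTE: the file name and column names below are illustrative assumptions,
# not the repository's actual schema; see the README for the real fields.
import pandas as pd

df = pd.read_csv("responsible_ai_measures.csv")

# Count the distinct evaluation measures reported under each ethical principle.
measures_per_principle = (
    df.groupby("principle")["measure_name"]
    .nunique()
    .sort_values(ascending=False)
)
print(measures_per_principle)

# Restrict to data points extracted from publications in a given year range.
recent = df[df["publication_year"].between(2020, 2023)]
print(f"{len(recent)} data points from publications in 2020-2023")
```

In the same spirit as the repository’s visualization script, a quick interactive chart can be built with a library such as Plotly; this is a sketch under the same assumed columns, not a reproduction of the published tool’s implementation.

```python
# Minimal sketch of an interactive chart over the same assumed columns.
import pandas as pd
import plotly.express as px

df = pd.read_csv("responsible_ai_measures.csv")
fig = px.histogram(
    df,
    x="principle",
    color="publication_year",
    title="Evaluation measures by ethical principle and publication year",
)
fig.show()
```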
References
Buolamwini, J. & Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. in Proceedings of the 1st Conference on Fairness, Accountability and Transparency, PMLR 81, 77–91 (2018).
Noble, S. U. Algorithms of Oppression: How Search Engines Reinforce Racism. (New York University Press, New York, NY, 2018).
Barocas, S. et al. Designing disaggregated evaluations of AI systems: Choices, considerations, and tradeoffs. in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society 368–378, https://doi.org/10.1145/3461702.3462610 (ACM, New York, NY, USA, 2021).
EU Artificial Intelligence Act. Section 3: Obligations of Providers and Deployers of High-Risk AI Systems and Other Parties. https://artificialintelligenceact.eu/section/3-3/.
NIST. NIST AI RMF Playbook. https://airc.nist.gov/airmf-resources/playbook/ (2022).
OECD. Measuring well-being and progress. https://www.oecd.org/en/topics/measuring-well-being-and-progress.html.
Laufer, B., Jain, S., Cooper, A. F., Kleinberg, J. & Heidari, H. Four years of FAccT: A reflexive, mixed-methods analysis of research contributions, shortcomings, and future prospects. in 2022 ACM Conference on Fairness, Accountability, and Transparency, https://doi.org/10.1145/3531146.3533107 (ACM, New York, NY, USA, 2022).
Adams, R. et al. The Global Index on Responsible AI. https://www.global-index.ai/ (2024).
Smith, J. J., Beattie, L. & Cramer, H. Scoping fairness objectives and identifying fairness metrics for recommender systems: The practitioners’ perspective. in Proceedings of the ACM Web Conference 2023 3648–3659, https://doi.org/10.1145/3543507.3583204 (ACM, New York, NY, USA, 2023).
Pagano, T. P. et al. Bias and unfairness in machine learning models: A systematic review on datasets, tools, fairness metrics, and identification and mitigation methods. Big Data Cogn. Comput. 7, 15, https://doi.org/10.3390/bdcc7010015 (2023).
Carvalho, D. V., Pereira, E. M. & Cardoso, J. S. Machine learning interpretability: A survey on methods and metrics. Electronics (Basel) 8, 832, https://doi.org/10.3390/electronics8080832 (2019).
Gustafson, L. et al. FACET: Fairness in Computer Vision Evaluation Benchmark. in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 20313–20325, https://doi.org/10.1109/ICCV51070.2023.01863 (2023).
OECD. OECD AI Policy Observatory Portal. https://oecd.ai/en/ai-principles (2024).
Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R. & Wallach, H. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (eds. Zong, C., Xia, F., Li, W. & Navigli, R.) 1004–1015, https://doi.org/10.18653/v1/2021.acl-long.81 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2021).
Diaz, F. & Madaio, M. Scaling Laws Do Not Scale. in Proceedings of the 2024 AAAI/ACM Conference on AI, Ethics, and Society 341–357, https://doi.org/10.5555/3716662.3716692 (AAAI Press, London, England, 2025).
Liao, Q. V. & Xiao, Z. Rethinking model evaluation as narrowing the socio-technical gap. arXiv [cs.HC] https://arxiv.org/abs/2306.03100 (2023).
Jacobs, A. Z. & Wallach, H. Measurement and fairness. in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, https://doi.org/10.1145/3442188.3445901 (ACM, New York, NY, USA, 2021).
Xiao, Z., Zhang, S., Lai, V. & Liao, Q. V. Evaluating evaluation metrics: A framework for analyzing NLG evaluation metrics using measurement theory. in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 10967–10982, https://doi.org/10.18653/v1/2023.emnlp-main.744 (Association for Computational Linguistics, 2023).
Ackerman, M. S. The Intellectual Challenge of CSCW: The Gap Between Social Requirements and Technical Feasibility. Human–Computer Interaction 15, 179–203, https://doi.org/10.1207/S15327051HCI1523_5 (2000).
Black, E. et al. Toward operationalizing pipeline-aware ML fairness: A research agenda for developing practical guidelines and tools. in Equity and Access in Algorithms, Mechanisms, and Optimization 1–11, https://doi.org/10.1145/3617694.3623259 (ACM, New York, NY, USA, 2023).
Rismani, S. et al. Responsible Artificial Intelligence (RAI) Measures Dataset. figshare https://doi.org/10.6084/m9.figshare.29551001 (2025).
Arksey, H. & O’Malley, L. Scoping studies: towards a methodological framework. Int. J. Soc. Res. Methodol. 8, 19–32, https://doi.org/10.1080/1364557032000119616 (2005).
Jobin, A., Ienca, M. & Vayena, E. The global landscape of AI ethics guidelines. Nat. Mach. Intell. 1, 389–399, https://doi.org/10.1038/s42256-019-0088-2 (2019).
Levac, D., Colquhoun, H. & O’Brien, K. K. Scoping studies: advancing the methodology. Implement. Sci. 5, 69, https://doi.org/10.1186/1748-5908-5-69 (2010).
Peters, M. D. J. et al. Updated methodological guidance for the conduct of scoping reviews. JBI Evid. Synth. 18, 2119–2126, https://doi.org/10.11124/JBIES-20-00167 (2020).
Mökander, J., Axente, M., Casolari, F. & Floridi, L. Conformity assessments and post-market monitoring: A guide to the role of auditing in the proposed European AI regulation. Minds Mach. (Dordr.) 32, 241–268, https://doi.org/10.1007/s11023-021-09577-4 (2022).
Belter, C. W. Citation analysis as a literature search method for systematic reviews. J. Assoc. Inf. Sci. Technol. 67, 2766–2777, https://doi.org/10.1002/asi.23605 (2016).
Dwork, C., Hardt, M., Pitassi, T., Reingold, O. & Zemel, R. Fairness through awareness. in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, https://doi.org/10.1145/2090236.2090255 (ACM, New York, NY, USA, 2012).
Shelby, R. et al. Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction. Proc. 2023 AAAI/ACM Conf. AI, Ethics, Soc. 24, 723–741, https://doi.org/10.1145/3600211.3604673 (2023).
Rismani, S. et al. From plane crashes to algorithmic harm: Applicability of safety engineering frameworks for responsible ML. Proc. 2023 CHI Conf. Hum. Factors Comput. Syst. 2, 1–18, https://doi.org/10.1145/3544548.3581407 (2023).
Lewis, G. A., Echeverría, S., Pons, L. & Chrabaszcz, J. Augur: a step towards realistic drift detection in production ML systems. in Proceedings of the 1st Workshop on Software Engineering for Responsible AI 37–44 https://doi.org/10.1145/3526073.3527590 (ACM, New York, NY, USA, 2022).
Acknowledgements
This project is funded by a Mila-Google grant. We would like to thank Ava Gilmour for her contributions to the database searches, the upload of files to Covidence, and the screening process.
Author information
Authors and Affiliations
McGill University, Montréal, Canada
Shalaleh Rismani, Leah Davis & AJung Moon
Pennsylvania State University, State College, USA
Bonam Mingole
Google Research, Montreal, Canada
Negar Rostamzadeh
Google Research, San Francisco, USA
Renee Shelby
Authors
- Shalaleh Rismani
- Leah Davis
- Bonam Mingole
- Negar Rostamzadeh
- Renee Shelby
- AJung Moon
Contributions
Shalaleh Rismani served as the lead project researcher, spearheading the conceptualization, literature review, methodology development, and execution of the project. Shalaleh also served as a Co-Principal Investigator (Co-PI) on the Mila-Google grant and led the writing of the introduction, background, and methodology sections of this paper. Leah Davis contributed to the screening process and took the lead in the post-processing of the raw dataset, primarily in developing the figshare repository. In addition, Leah created and deployed the interactive visualization tool. She also led the writing of the post-processing, usage notes, and code availability sections, co-authoring the paper with Shalaleh. Bonam Mingole played a significant role in the data extraction process, serving as one of the main reviewers. Negar Rostamzadeh provided critical guidance in the conceptualization of the project and offered valuable feedback on the paper. Negar also served as a Co-PI on the Mila-Google grant. Renee Shelby contributed to the conceptualization of the project, conducted an expert review of a significant portion of the dataset, and provided substantial editorial support for this paper. Renee also served as a Co-PI on the Mila-Google grant. AJung Moon acted as the Principal Investigator (PI) on the Mila-Google grant, offering strategic guidance during the project’s conceptualization and contributing editorial revisions to the paper.
Corresponding author
Correspondence to Shalaleh Rismani.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Rismani, S., Davis, L., Mingole, B. et al. Responsible AI measures dataset for ethics evaluation of AI systems. Sci Data (2025). https://doi.org/10.1038/s41597-025-06021-5
Received: 06 January 2025
Accepted: 23 September 2025
Published: 20 December 2025
DOI: https://doi.org/10.1038/s41597-025-06021-5