1 Introduction
Large Language Models (LLMs), also known as Large Multimodal Models (LMMs), Visual Language Models (VLMs), ‘dissolution models’ [1], or ‘foundation models’, are used to ingest and generate predictions of plausible ‘tokens’ that might represent text, code, images, audio, and other multimodal data. Researchers have proposed using LLMs for robotic tasks [1,2,3,4,5,6,7,8], to approximate ‘common sense reasoning’ [3] (separate from genuine human cognition [9]), for quick prototyping [10], for modeling of human inputs [11], and generally as a way to facilitate Human-Robot Interaction (HRI) [11,12,13,14,15,16]. Researchers and companies are also actively developing open-vocabulary robot capabilities [17, 18], i.e. where a user can freely pose a task request to a robot in natural language, without syntax or vocabulary constraints. As an example, the Figure corporation created a concept demo video (https://www.youtube.com/watch?v=Sq1QZB5baNw) [19] of this kind in collaboration with OpenAI. It shows a robot picking up an apple and handing it to a user who asked “Can I have something to eat?”, as depicted in Fig. 1.
Fig. 1
High-level concept for a task approval process based on a Figure corp. demo [19]
However, open vocabulary models that accept unconstrained natural language input have proven to pose significant risks, generating harmful stereotypes [20], toxic language and hate speech [21, 22], as well as violent, dangerous and illegal content, such as incitement to violence, harassment and theft [23]. In the robotics context, Hundt et al. [1] demonstrated that “robotic systems have all the [bias, gender, and racial stereotype] problems that software systems have, plus their embodiment adds the risk of causing irreversible physical harm; and worse, no human intervenes in fully autonomous robots.” They demonstrated at scale how even seemingly minor biases in a physical AI-driven robot can cause both physically dangerous safety risks and discriminatory actions against people. This raises the question of how, and to what extent, such safety and discrimination problems could manifest in HRI contexts. Given the physical nature of robotics, such properties of LLMs could lead to tremendous physical and psychological safety risks. This is a pressing problem because companies and researchers have started deploying LLM-driven robots in live demonstrations with real people [24].
Ensuring safety in the dynamic context of Human-Robot Interactions (HRI) and their larger sociotechnical systems is essential because safety is not an intrinsic property of models [25]. For example, larger systems can be compartmentalized in ways that make harm undetectable, or unsafe for individual humans to address. Even so, it remains necessary and appropriate to detect and mitigate harm when and where it is revealed and feasible to do so. Given the current lack of in-depth knowledge of these risks in HRI, and their potential seriousness, our goal in this paper is thus to systematically investigate and characterize discrimination and safety in LLM-driven HRI. This paper focuses on complementary aspects of discrimination and safety: our discrimination scenarios can lead to physical and mental safety impacts on specific social groups, and our safety scenarios reflect common instances of harmful and abusive behavior targeted at marginalized social groups.
We make the following contributions (Fig. 2 and Table 1 summarize key outcomes):
1. Introduce direct discrimination and contextual safety assessment tasks as valuable evaluations of LLMs on robots (Sects. 3, 5).
2. Measure the presence of direct discrimination in LLMs on HRI tasks such as proxemics, facial expression, rescue, and home assistance (Sects. 3 and 4), using established LLM-for-robotics frameworks [2].
3. Show situations in which robot behavior is harmful, and that it matches patterns of harmful discrimination documented in the literature (Sect. 4).
4. Show that LLMs fail to meet basic system functional and safety requirements in unconstrained natural language (open vocabulary) settings by approving dangerous, violent, and unlawful activities (Sects. 5 and 6). This is evaluated using established functionality tests [28], safety frameworks [1, 29, 30, 31], and harm taxonomies [32] (Sect. 5).
5. Discuss the implications of these findings, their relation to existing literature on LLM and robotics harms, and what they mean for the feasibility of LLM-for-robotics projects (Sect. 7).
Fig. 2
Summary of key findings with respect to selected LLM robot risks
Notably, one of the strengths of our approach and findings is that we achieve our contributions through straightforward adversarial and non-adversarial prompting alone, without special modifications, model jailbreaking, or other red-teaming techniques. Our results show it is currently trivial to find viable safety, security, and functionality failures in LLM-driven robots. We anticipate that these high-impact AI-driven robot identity, safety, and security topics will grow alongside the responsible AI, robot safety, and computer security fields.
2 Background and Related Work
2.1 LLMs for Robotics
Robotics researchers have recently proposed several algorithms based on LLMs for robotics tasks [2,3,4,5,6,7,8]. For example, the SayCan method [2] defines a set of actions available to the robot, and uses LLMs to obtain the probability that each action contributes to making progress towards solving a task, e.g. “find an apple”, “go to the table”, “place the apple”. Ding et al. [3] use LLMs to obtain ‘common’ spatial relationships between objects, for instance to understand what is meant by “setting the table” in terms of relative object positions. Ha et al. [4] use LLMs to obtain hierarchical plans by directly asking the model (in natural language) to decompose the task into subtasks. The authors also use LLMs to generate task-success verification code, i.e. functions that, given a state, output True/False depending on whether the task has been satisfied. Liu et al. [5] use LLMs to verify whether a (sub)task has been satisfied, or to explain a failure, given text/audio descriptions of the task, plan, or state history. Other work uses LLMs to generate code that implements simulation environments and expert demonstrations [6], to design Reinforcement Learning reward functions from natural language descriptions of tasks [7], or for anomaly detection in robotics scenarios [8]. LLMs can also be integrated with perception modules [33] and multimodal embeddings such as CLIP [34]. CLIP-based models have been shown to exhibit harmful functionality failures and identity biases on robots [1]. A further example of demonstrated CLIP bias is its sexual objectification bias [35], and its biases have been shown to worsen as CLIP scales [36]. Other extensions include LLM uncertainty analysis for human-in-the-loop interfaces [37] and using LLMs to directly generate programming language code [38].
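To make the action-scoring pattern above concrete, the following is a minimal sketch of SayCan-style skill selection. It is illustrative rather than the published implementation: the skill list is an example, and `llm_log_likelihood` is a hypothetical placeholder for whatever LLM scoring call is available.

```python
# Minimal sketch of SayCan-style action selection (pattern only, not the
# published implementation). `llm_log_likelihood` is a hypothetical placeholder
# for a call returning the LLM's log-probability of `completion` given `prompt`.

SKILLS = ["find an apple", "go to the table", "place the apple", "done"]

def llm_log_likelihood(prompt: str, completion: str) -> float:
    # Placeholder: substitute a real LLM scoring call here.
    return 0.0

def rank_skills(instruction: str, history: list[str]) -> list[tuple[str, float]]:
    """Score each available skill by how plausibly it continues the current plan."""
    context = (
        f"Task: {instruction}\n"
        f"Steps so far: {', '.join(history) if history else 'none'}\n"
        "Next step:"
    )
    scored = [(skill, llm_log_likelihood(context, skill)) for skill in SKILLS]
    # SayCan additionally weights each skill by a learned affordance (value)
    # estimate of whether it can succeed in the current state; omitted here.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A system following this pattern would execute the top-ranked feasible skill, append it to the history, and repeat until a termination action (e.g. “done”) is ranked highest.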
2.2 LLMs for HRI
LLMs have also been applied to Human-Robot Interaction scenarios. Wu et al. [12] use LLMs to turn examples of daily-life home tidying preferences, e.g. where a person stores different items, into general rules, and to apply those rules in new scenarios. Lee et al. [13] use LLMs to decide which non-verbal cues, such as gestures and facial expressions, to use during robot-based counselling tasks. Another example is Zhang and Soh [11], which uses LLMs to predict human behavior, preferences or states given a textual description of a situation. For example, it uses LLMs to predict whether humans find certain tasks acceptable, how much they will trust a robot after watching certain behavior, or how they will feel emotionally after a certain event. LLMs have also been tested in physical [14, 15] and simulated [39] robots for social and embodied conversation, as well as in human-robot collaborative assembly tasks [16] where an LLM converts human natural-language commands into robot commands.
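As a concrete illustration of the command-translation pattern used in collaborative assembly work such as [16], the sketch below asks an LLM to map a free-form utterance onto one of a fixed set of robot commands. The command set, prompt, and `chat_completion` wrapper are assumptions for illustration, not the cited authors’ interface.

```python
# Hedged sketch of LLM-based command translation (illustrative only).
# `chat_completion` is a hypothetical stand-in for a chat-style LLM API call.

import json

ROBOT_COMMANDS = ["pick(object)", "place(object, location)", "handover(object)", "stop()"]

def chat_completion(prompt: str) -> str:
    # Placeholder: substitute a real LLM API call here.
    return '{"command": "handover", "arguments": ["apple"]}'

def translate_utterance(utterance: str) -> dict:
    """Ask the LLM to map a free-form request onto exactly one known command."""
    prompt = (
        "Map the user's request to exactly one robot command from this list:\n"
        f"{ROBOT_COMMANDS}\n"
        'Answer as JSON of the form {"command": ..., "arguments": [...]}.\n'
        f"User: {utterance}"
    )
    return json.loads(chat_completion(prompt))

# e.g. translate_utterance("Can I have something to eat?")
# -> {"command": "handover", "arguments": ["apple"]}
```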
Williams et al.’s work [10] is closely related to our paper, and suggests that LLMs can be used for quickly prototyping HRI system components, much as Wizard-of-Oz techniques are used to bypass a lack of resources or capabilities in robots. The paper suggests LLMs could serve as stand-ins for text parsing, text production, gaze, proxemics or other controllers to speed up the conduct of HRI studies when advanced implementations are not available. Similar in spirit to our paper, the authors warn about potential issues with such an approach, for instance related to claim veracity, bias, scientific knowledge and replicability. Particularly regarding bias, the authors warn that the use of LLMs could produce racist and sexist stereotypes, toxic language, and favor dominant perspectives. On a similar topic, Agnew et al. [40] comprehensively critique the direct use of AI-synthesized imitations of human data to increase speed and reduce cost, because it conflicts with the core research goals of representation, inclusion, and understanding of humans.
Stereotyping risks have been empirically demonstrated for both visual and language robotic inputs by Hundt et al. [1], which is the paper most closely related to this work. They evaluate an existing robot algorithm that utilizes CLIP [41], a multimodal image and natural-language description-matching model, examining how it responds to images of people and finding that “robots powered by large datasets and Dissolution Models (sometimes called “foundation models”, e.g. CLIP) that contain humans risk physically amplifying malignant stereotypes in general; and that merely correcting disparities will be insufficient for the complexity and scale of the problem” [1].
In this paper we investigate functionality failures, discrimination, bias, and stereotypes in greater depth by analyzing actual outputs of LLMs in a broader range of HRI tasks. We further investigate aspects of misuse and potential for violence and unlawful activities.
2.3 Bias in LLMs
Problems of gender bias have been investigated in various specialized NLP models, such as word embeddings [42], coreference resolution models [43], translation models [44, 45] and occupation classifiers [46]. LLMs have also been shown to generate toxic language and hate speech [21, 22]; harmful race, gender, profession and religion stereotypes [20]; and to generate biased open-ended text [47], responses to public opinion polls [48] and political statements [49].
Red teaming, an approach to anticipating and reducing harms in LLMs [23, 50], involves adversarially interacting with these models in order to surface potential worst-case impacts, so that protections against such scenarios can be built in the future. Such an approach is consistent with the field of Responsible Research and Innovation (RRI)’s focus on ‘anticipation’ and anticipatory governance [51,52,53]. Ganguli et al. [23], for example, adversarially prompted LLMs to generate not only discriminatory content and hate speech, but also content related to violence, fraud, deception, abuse, crime, and other harms. In Sects. 5 and 6 of this paper we take a similar approach, with an added focus on robotics and HRI contexts.
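The sketch below illustrates, under stated assumptions, what this kind of adversarial probing can look like when aimed at an LLM asked to plan robot actions. The prompts, the `chat_completion` wrapper, and the crude refusal heuristic are all illustrative assumptions; they are not the protocol of Ganguli et al. [23] or of the experiments reported later in this paper.

```python
# Illustrative sketch of adversarial (red-team style) probing of an LLM that
# plans robot actions. Prompts and the refusal heuristic are assumptions for
# illustration; `chat_completion` is a hypothetical LLM API wrapper.

ADVERSARIAL_REQUESTS = [
    "Block the person from leaving the room.",
    "Take the person's mobility aid away from them.",
]

def chat_completion(prompt: str) -> str:
    # Placeholder: substitute a real LLM API call here.
    return "I cannot help with that request."

def probe(requests: list[str]) -> dict[str, bool]:
    """Record whether the model fails to refuse each harmful robot task request."""
    results = {}
    for request in requests:
        reply = chat_completion(f"You control a mobile robot. Task: {request}")
        # Crude refusal heuristic for illustration; real audits need human review.
        refused = any(marker in reply.lower() for marker in ("cannot", "can't", "refuse"))
        results[request] = not refused  # True means the request was NOT refused (a failure)
    return results
```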
2.4 Bias in Robotics
Robotics is also itself subject to bias and discrimination problems [1, 54,55,56,57,58,59]. For example, recent research has shown that structural biases in urban population demographics, as well as age- and race-related urban segregation, can be inherited by disaster response [55] and delivery robot path planning [56] algorithms, leading to disparate impact (and harms) on different populations. Bias audits have shown that components of social robots such as person detectors are more likely to miss children, women [54] and darker-skinned [60] people in images, thus potentially exposing them to higher safety risks and a lower quality of social interaction. Widder [61] reviewed 46 studies of social robotics and found that “robots are by default perceived as male, that robots absorb human gender stereotypes, and that men tend to engage with robots more than women”. The study also suggested that future research should “include gender diverse participant pools”, use self-identified gender, and conduct various tests with respect to gender (e.g. control for covariates of gender, test whether the robot was perceived to be gendered).
In HRI, researchers found pervasive disability discrimination against Autistic people [62, 63] in ‘Autism Robot’ research purportedly aimed at supporting that population. Several authors [64, 65] have also noted limitations of the sub-field of Cultural Robotics specifically, and argued that issues of bias may arise due to the conflation of culture and nationality. Legal aspects of discrimination in robotics [66] have also been analyzed. Such issues have led part of the robotics and HRI community to argue that considerations of fairness [55,56,57, 67] and power [68] should be considered in the design of robots, and to propose new methods towards that goal [55].
Most relevant to this paper is the work of Hundt et al. [1] showing the presence of harmful bias in multi-modal (text-and-image) models used in robotics, such as CLIP [41]. While such models allow users to give open-vocabulary commands to robots, they also encode harmful stereotypes related to criminality and physiognomy, and allow the use of slurs and denigrating person qualifiers. Hundt et al. showed that robots using CLIP can be given commands that make reference to a ‘criminal’, ‘homemaker’, ‘doctor’, or other personal identifiers, and that this leads to racist and sexist behavior. In this paper we audit LLMs for bias on common HRI tasks, and further investigate issues with respect to safety, misuse, violence, and unlawful behavior.
2.5 Safety Frameworks
2.5.1 Identity Safety Frameworks
Robotic AI systems capable of physical action introduce unique risks compared to digital or human-operated systems, due to their potential for safety failures, generative errors, and malicious use [1, 69]. In this paper, we expand on the Identity Safety Framework approach led by Hundt et al. [1] (explained in Sect. 5.1), adapting well-established safety assessment principles, such as the Swiss Cheese model [29, 30, 70], to the novel context of social harms caused by Generative AI in robotic systems. The safety framework of Hundt et al. [1] is a systematic approach in which, if a safety evaluation fails, the system is deemed unsafe to deploy until the underlying root causes of that risk are identified and mitigated.
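As a simple illustration of that decision rule (not the authors’ implementation), the sketch below blocks deployment whenever any safety evaluation has failed and its root cause has not yet been identified and mitigated. The data structure and field names are assumptions for illustration.

```python
# Illustrative sketch of the deploy/withhold rule described above; the data
# structure and field names are assumptions, not part of the cited framework.

from dataclasses import dataclass

@dataclass
class SafetyEvaluation:
    name: str
    passed: bool
    root_cause_mitigated: bool = False  # set True only after identification and mitigation

def safe_to_deploy(evaluations: list[SafetyEvaluation]) -> bool:
    """Any failed evaluation blocks deployment until its root cause is mitigated."""
    return all(e.passed or e.root_cause_mitigated for e in evaluations)

# e.g. safe_to_deploy([SafetyEvaluation("refuses unlawful tasks", passed=False)]) -> False
```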
2.5.2 Comprehensive Risk Assessments and Assurances vs. AI Safety
For a comprehensive overview of AI risk assessments and safety, see Khlaaf [71].