Abstract
Feedback on user interface (UI) mockups is crucial in design. However, human feedback is not always readily available. We explore the potential of using large language models for automatic feedback. Specifically, we focus on applying GPT-4 to automate heuristic evaluation, which currently entails a human expert assessing a UI’s compliance with a set of design guidelines. We implemented a Figma plugin that takes in a UI design and a set of written heuristics, and renders automatically-generated feedback as constructive suggestions. We assessed performance on 51 UIs using three sets of guidelines, compared GPT-4-generated design suggestions with those from human experts, and conducted a study with 12 expert designers to understand fit with existing practice. We found that GPT-4-based feedback is useful for catching subtle errors, improving text, and considering UI semantics, but feedback also decreased in utility over iterations. Participants described several uses for this plugin despite its imperfect suggestions.
Figure 1:
Diagram illustrating the UI prototyping workflow using this plugin. First, the designer prototypes the UI in Figma (Box A) and then runs the plugin (Arrow A1). The designer then selects the guidelines to use for evaluation (Box B) and runs the evaluation with the selected guidelines (Arrow A2). The plugin obtains evaluation results from the LLM and renders them in an interpretable format (Box C). The designer uses these results to update their design and reruns the evaluation (Arrow A3). The designer iteratively revises their Figma UI mockup, following this process, until they have achieved the desired result.
1 Introduction
User interface (UI) design is an essential domain that shapes how humans interact with technology and digital information. Designing user interfaces commonly involves iterative rounds of feedback and revision. Feedback is essential for guiding designers towards improving their UIs. While this feedback traditionally comes from humans (via user studies and expert evaluations), recent advances in computational UI design enable automated feedback. However, automated feedback is often limited in scope (e.g., a metric might only evaluate layout complexity) and can be challenging to interpret [50]. While human feedback is more informative, it is not readily available and requires time and resources for recruiting and compensating participants.
One method of evaluation that still relies on human participants today is heuristic evaluation, where an experienced evaluator checks an interface against a list of usability heuristics (rules of thumb) developed over time, such as Nielsen’s 10 Usability Heuristics [39]. Despite appearing straightforward, heuristic evaluation is challenging and subjective [40], dependent on the evaluator’s previous training and personality-related factors [25]. These limitations further suggest an opportunity for AI-assisted evaluation.
There are several reasons why LLMs could be suitable for automating heuristic evaluation. The evaluation process primarily involves rule-based reasoning, which LLMs have shown capacity for [42]. Moreover, design guidelines are predominantly in text form, making them amenable to LLMs, and the language model could also return its feedback as text-based explanations that designers prefer [23]. Finally, LLMs have demonstrated the ability to understand and reason with mobile UIs [56], as well as generalize to new tasks and data [28, 49]. However, there are also reasons for caution in applying LLMs to this task. For one, LLMs only accept text as input, while user interfaces are complex artifacts that combine text, images, and UI components into hierarchical layouts. In addition, LLMs have been shown to hallucinate [24] (i.e., generate false information) and may potentially identify incorrect guideline violations. This paper explores the potential of using LLMs to carry out heuristic evaluation automatically. In particular, we aim to determine their performance, strengths and limitations, and how an LLM-based tool can fit into existing design practices.
To explore the potential of LLMs in conducting heuristic evaluation, we built a tool that enables designers to run automatic evaluations on UI mockups and receive text-based feedback. We package this system as a plugin for Figma [1], a popular UI design tool. Figure 1 illustrates the iterative usage of this plugin. The designer prototypes their UI in Figma, and then selects a set of guidelines they would like to use for evaluation in our plugin. The plugin returns the feedback, which the designer uses to revise their mockup. The designer can then repeat this process on their edited mockup. To improve the LLM’s performance and adapt to individual preferences, designers can provide feedback on each generated suggestion, which is integrated into the model for the next round of evaluation. The plugin produces UI mockup feedback by querying an LLM with the guidelines’ text and a JSON representation of the UI. The LLM then returns a set of detected guideline violations. Instead of directly stating the violations, they are phrased as constructive suggestions for improving the UI. As LLMs can only process text and have a limited context window, we developed a JSON representation of the UI that concisely captures the layout hierarchy and contains both semantic (text, semantic label, element type) and visual (location, size, and color) details of each element and group in the UI. To further accommodate context window limits, we scoped the plugin to evaluate only static (i.e., non-interactive) UI mockups, one screen at a time.
We conducted an exploration of how several current state-of-the-art LLMs perform on this task and found that GPT-4 had the best performance by far. Hence, we solely focus on GPT-4 for the remaining studies. To assess GPT-4’s performance in conducting heuristic evaluation on a large scale, we carried out a study where three design experts rated the accuracy and helpfulness of its heuristic evaluation feedback for 51 distinct UIs. To compare GPT-4’s output with feedback provided by human experts, we conducted a heuristic evaluation study with 12 design experts, who manually identified guideline violations in a set of 12 UIs. Finally, to qualitatively determine GPT-4’s strengths and limitations and its performance as an iterative design tool, we conducted a study with another group of 12 design experts, who each used this tool to iteratively refine a set of 3 UIs and evaluated the LLM feedback each round. For all three studies, we used diverse guidelines covering visual design, usability, and semantic organization to generate design feedback.
We found that GPT-4 was generally accurate and helpful in identifying issues in poor UI designs, but its performance became worse after iterations of edits that improved the design, making it unsuitable as an iterative tool. Furthermore, its performance varied depending on the guideline. GPT-4 generally performed well on straightforward checks with the data available in the UI JSON and worse when the JSON differed from what was visually or semantically depicted in the UI. Finally, although GPT-4’s feedback is sometimes inaccurate, most study participants still found this tool useful for their own design practices, as it was able to catch subtle errors, improve the UI’s text, and reason with the UI’s semantics. They stated that the errors made by GPT-4 are not dangerous, as there is a human in the loop to catch them, and suggested various use cases for the tool. We also distilled a set of concrete limitations of GPT-4 for this task.
In summary, even with today’s limitations, GPT-4 can already be used to automatically evaluate some heuristics for UI design; other heuristics may require more visual information or other technical advancements. However, designers accepted occasional imperfect suggestions and appreciated GPT-4’s attention to detail. This implies that while LLM tools will not replace human heuristic evaluation, they may nevertheless soon find a place in design practice.
Our contributions are as follows:
• A Figma plugin that uses GPT-4 to automate heuristic evaluation of UI mockups with arbitrary design guidelines.
• An investigation of GPT-4’s capability to automate heuristic evaluations through a study where three human participants rated the accuracy and helpfulness of LLM-generated design suggestions for 51 UIs.
• A comparison of the violations found by this tool with those identified by human experts.
• An exploration of how such a tool can fit into existing design practice via a study where 12 design experts used this tool to iteratively refine UIs, assessed the LLM-generated feedback, and discussed their experiences working with the plugin.
2 Related Work
2.1 AI-Enhanced Design Tools
Before the widespread use of generative AI, research in AI-enhanced design tools explored a variety of model architectures to accomplish a wide range of tasks. For instance, Lee et al. built a prototyping assistance tool (GUIComp) that provides multi-faceted feedback for various stages of the prototyping process. GUIComp uses an auto-encoder to support querying UI examples for design inspiration and separate convolutional neural networks to evaluate the visual complexity of the UI prototypes and predict salient regions [23]. Other studies have utilized computer vision techniques to predict saliency in graphical designs [14] and perceived tappability [50, 52]. Deep learning models have been developed for generation [7], autocompletion [5], and optimization [11, 53, 54] of UI layouts. One limitation of these techniques is that a separate model is needed for each type of task. In addition, study participants had difficulty interpreting the feedback from these models [50] and would have liked natural language explanations of detected design issues [23]. Our work addresses both of these limitations. First, our system accepts arbitrary guidelines as input, so a single model can evaluate various aspects of the UI design. Furthermore, the language model uses natural language to explain each detected guideline violation.
2.2 Applications of Generative AI in Design
The recent emergence of generative AI, such as GPT, has led to various applications in design support. Park et al. carried out two studies that employ LLMs to simulate user personas in online social settings. They used GPT-3 to generate interactions on social media platforms as testing data for these platforms [44]. They later expanded on this work to build agents that could remember, reflect on, and retrieve memories from interacting with other agents to realistically simulate large-scale social interactions [43]. Hämäläinen et al. used GPT-3 to generate synthetic human-like responses to survey questionnaires about video game experiences [18]. Finally, Wang et al. investigated the feasibility of using LLMs to interact with UIs via natural language [56]. They developed prompting techniques for tasks like screen summarization, answering questions about the screen, generating questions about the screen, and mapping instructions to UI actions. Researchers have also begun to create design tools that use Generative AI. Lawton et al. built a system where a human and generative AI model collaborate in drawing, and ran an exploratory study on the capabilities of this tool [22]. Stylette allows users to specify design goals in natural language and uses GPT to infer relevant CSS properties [19]. Perhaps most similar to our work is a study by Petridis et al. [46], who explored using LLM prompting in creating functional LLM-based UI prototypes. Their study findings showed that LLM prompts sped up prototype creation and clarified LLM-based UI requirements, which led to the development of a Figma Plugin for automated content generation and determination of optimal frame changes. These existing studies, however, have not examined the application of LLMs as a general-purpose evaluator for mobile UIs of any category with a diverse set of heuristics.
2.3 AI-enhanced Software Testing
Another domain of UI evaluation is testing the functionality of the GUI (i.e., “software testing”). Existing LLM-based approaches include Liu et al.’s method [27], which uses GPT-3 to simulate a human tester that would interact with the GUI. Their system had greater coverage and found more bugs than existing baselines, and also identified new bugs on Google Play Store apps. Wang et al. conducted a comprehensive literature review on using LLMs for software testing, analyzing studies that applied LLMs to unit test generation, test input generation, validation of test outputs, and analyzing and fixing bugs in code. In contrast to software testing, our study focuses on evaluating GUI mockups, which is at an earlier stage of the UI development process. Furthermore, evaluation of mockups and software are intrinsically different; mockup evaluation focuses on adherence to design guidelines and user feedback, whereas software testing focuses on finding bugs in the implementation. Prior to LLMs, Chen et al. utilized computer vision techniques to identify discrepancies between the UI mockup and implementation [6]. Their system could identify differences in the positioning, color, and size of corresponding elements. However, their evaluation requires a UI mockup as the benchmark, while our system can carry out evaluation using any set of heuristics.
2.4 Heuristics and Design Guidelines
An essential aspect of the design process is gathering feedback to improve future iterations. One central way designers generate feedback is to conduct heuristic evaluations [38, 40], which use a set of guidelines to identify and characterize undesired interface characteristics as violations of specific guidelines. While initially designed for desktop interfaces, later work has adapted heuristic evaluation to more devices and domains [9, 31, 33, 47]. In general, researchers have developed design guidelines for a wide range of devices, tasks, and populations, including accessible data visualizations [12], multi-modal touchscreen graphics [17], smart televisions [58], ambient lighting interactions [32], hands-free speech interaction [35], navigation in virtual environments [55], website readability [34], supporting web design for aging communities [21, 61], and cross-cultural design considerations [2]. A widely-used set of guidelines is Nielsen’s 10 Usability Heuristics [39], a set of general principles for interaction design. Luther et al. surveyed design textbooks and other resources and compiled a comprehensive set of specific critique statements for the visual design of an interface, organized into 7 visual design principles [29]. Recently, Duan et al. developed a set of 5 specific and actionable guidelines for organizing UI elements based on their semantics (i.e., functionality, content, or purpose) to help design clear and intuitive interfaces [10]. While these guidelines are meant to encode common design patterns and errors distilled from design expert guidance, they still require a human to interpret and apply them, making adapting to a new set of guidelines time-consuming, especially for novice designers. Our work builds on these design guidelines as a means of focusing and justifying the LLM’s design suggestions and feedback.
2.5 User Interfaces for Design Feedback
Prior research has explored several ways to support designers as they both give and receive feedback across a range of media [20, 30, 45, 57, 60]. Cheng et al. explore the process of publicly gathering design feedback from online forums [8] and list several design considerations for feedback systems. For supporting in-context feedback for graphic designs, CritiqueKit [15] showcased a UI for providing and improving real-time design feedback, while Charrette [41] supported organizing and sharing feedback on longer histories and variations of a design. A study by Ngoon et al. showcased reusing expert feedback suggestions and adaptive guidance as two ways of improving creative feedback by making the feedback more specific, justified, and actionable [36]. This notion of adaptive conceptual guidance is further explored by Shöwn [37], demonstrating the utility of adapting presented design suggestions and examples automatically given the user’s current working context. Our plugin provides in-context design feedback grounded by this prior work on user interfaces for design feedback, while automatically generating the provided feedback and design suggestions.
3 System Details
In this section, we describe the set of design goals for an automatic LLM-driven heuristic evaluation tool, how they are realized in our system, the underlying implementation, techniques to improve the LLM’s performance, and explorations of alternative prompt designs and various LLM models for this task.
3.1 Design Goals
Based on design principles and expected LLM behavior, we came up with a set of goals that lay out what an automatic LLM-based heuristic evaluation tool should be able to do. The goals are as follows:
(1) The tool should accommodate arbitrary UI prototypes; designers should be able to use it to perform heuristic evaluations on their mockups and identify potential issues before implementation.
(2) The tool should be heuristic-agnostic, so different guidelines or heuristics can be used.
(3) The guideline violations detected by the LLM should be presented in a way that adheres to the principles of effective feedback [48].
(4) The LLM-generated feedback should be presented in the context of the critiqued design, to narrow the gulf of evaluation and make it easier for designers to interpret the feedback.
(5) Finally, in case the LLM makes a mistake, the designer should be able to hide feedback they find unhelpful; this data should also be sent to the LLM to improve its prediction accuracy.
3.2 Design Walkthrough
Figure 2:
Illustration of plugin interactions that contextualize text feedback with the UI. “A” shows that clicking on a link in the violation text selects the corresponding group or element in the Figma mockup and Layers panel. “B” shows the “click to focus” feature, where clicking on a violation fades the other violations and draws a box around the corresponding group in the UI screenshot. “C” illustrates that hovering over a group or element link draws a blue box around the corresponding element in the screenshot. “D” points out that clicking on the ‘X’ icon of a violation hides it and adds this feedback to the LLM prompt for the next round of evaluation.
We built our tool as a plugin for Figma, enabling designers to evaluate any Figma mockup (Goal 1). Figure 1 illustrates this plugin’s step-by-step usage with interface screenshots. The designer first prototypes their UI in Figma and runs the plugin (Figure 1 Box A). Due to context window limitations, the plugin only evaluates a single UI screen at a time. Furthermore, it only assesses static mockups, as evaluation of interactive mockups may require multiple screens as input or more complex UI representations, which could exceed the LLM’s context limit. Once started, the plugin opens a page for selecting the guidelines to use for heuristic evaluation (Box B). Designers can select from a set of well-known guidelines, like Nielsen’s 10 Usability Heuristics, or enter any list of heuristics they would like to use (Goal 2). They can also select more than one set of guidelines for the evaluation.
Once the LLM completes the heuristic evaluation, text explanations of all violations found and a UI screenshot are rendered back to the designer (Figure 1 Box C). This “UI Snapshot” serves as a reference to the state of the mockup at the time of evaluation, in case the designer makes any changes based on the evaluation results. Each violation explanation contains the name of the violated guideline and is phrased as constructive feedback, following the guidelines set by Sadler et al. [48] (Goal 3). According to Sadler, effective feedback is specific and relevant, highlighting the performance gap and providing actionable guidance for improvement. To accomplish this, the feedback must include these three things: 1) the expected standard, 2) the gap between the quality of work and the standard, and 3) what needs to be done to close this gap. Our design feedback adheres to Sadler’s principles and starts by stating the standard set by the guideline, followed by the issue with the current design (the gap between the design and expected standard), and concludes with advice on fixing the issue. Figure 9 provides four examples of these explanations.
The plugin also includes several features that help designers contextualize the text feedback with corresponding UI elements (Goal 4). Figure 2 illustrates these features. Selecting a violation fades the other suggestions and draws a box around the relevant group or element in the screenshot, as shown in Figure 2 (B). In addition, all UI elements and groups mentioned are rendered as links. Hovering over a link draws a box over the corresponding group or element in the screenshot (C), and clicking on the link selects the item in the Figma mockup and Layers panel (A), streamlining the editing process. Finally, to address Goal 5, if the designer finds a suggestion incorrect or unhelpful, they can click on its ‘X’ icon to hide it (D). Hiding the violation sends feedback to the LLM for subsequent evaluation rounds so this violation will not be shown again.
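To make the link behavior concrete, the following is a minimal TypeScript sketch (not the authors’ exact code) of how the plugin’s main thread might handle a “select this element” message from the plugin UI; the message shape and handler are illustrative assumptions, while the figma.* calls are standard Figma Plugin API.

```typescript
// Minimal sketch of main-thread plugin logic, assuming the standard Figma Plugin API
// typings. The { type: 'select-node', id } message posted by the plugin UI is an
// illustrative assumption, not the plugin's actual protocol.
figma.ui.onmessage = (msg: { type: string; id: string }) => {
  if (msg.type === 'select-node') {
    const node = figma.getNodeById(msg.id);
    if (node && 'visible' in node) {
      // Select the corresponding node in the canvas and Layers panel,
      // and scroll the viewport so the designer can see it.
      figma.currentPage.selection = [node as SceneNode];
      figma.viewport.scrollAndZoomIntoView([node as SceneNode]);
    }
  }
};
```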
After the designer revises their mockup based on LLM feedback, they can rerun the evaluation to generate new suggestions. This usage is intended to match the iterative feedback and revision process during design. The plugin uses the information from the Layers panel of Figma to create the text-based representation of the mockup (discussed in more detail in the next section). Hence, it relies on accurate names for groups in the Layers panel to convey semantic information about the UI; for instance, the group containing icons in the navbar should be named “navbar”. Designers must manually add these names, so they are often missing. To address this, we implemented an auxiliary label generation feature that can be run before evaluation to generate group names automatically (based on their contents).
3.3 Implementation
Figure 3:
Our LLM-based plugin system architecture. The designer prototypes a UI in Figma (Box 1), and the plugin generates a UI representation to send to an LLM (3). The designer also selects heuristics/guidelines to use for evaluating the prototype (2), and a prompt containing the UI representation (in JSON) and guidelines is created and sent to the LLM (4). After identifying all the guideline violations, another LLM query is made to rephrase the guideline violations into constructive design advice (4). The LLM response is then programmatically parsed (5), and the plugin produces an interpretable representation of the response to display (6). The designer dismisses incorrect suggestions, which are incorporated in the LLM prompt for the next round of evaluation, if there is room in the context window (7).
Figure 4:
An example portion of a UI JSON. It has a tree structure, where each node has a list of child nodes (the “children” field). Each node in this JSON is color-coded with its corresponding group or element in the UI screenshot. The node named “lyft event photo and logo” is a group (“type: GROUP”) consisting of a photo of the live chat event (“lyft live chat event photo”) and the Lyft logo (“lyft logo”). The JSON node for the photo contains its location information (“bounds”), type (“IMAGE”), and unique identifier (“id”). The JSON node for “lyft logo” contains its location and some stylistic information, like the stroke color and stroke weight for its white border.
We implemented this plugin in TypeScript using the Figma Plugin API. The plugin makes an API request to OpenAI’s GPT-4 for LLM queries. Since LLMs can only accept text as input, the plugin takes in a JSON representation of the UI. While multi-modal models exist that could take in both the UI screenshot and guidelines text (e.g., [26]), we found that their performance was considerably worse than GPT-4’s for this task (at the time).
Our JSON format captures the DOM (Document Object Model) structure of the UI mockup, and is similar in structure and content to the HTML-based representation used by [56] that performed well on UI-related tasks. Figure 4 contains an example portion of a UI JSON with corresponding groups and elements marked in the UI screenshot. The tree structure is informative of the overall organization of the UI, with UI elements (buttons, icons, etc.) as leaves and groups (of elements and/or smaller groups) as intermediate nodes. Each node in this JSON tree contains semantic information (text labels, element or group names, and element type) and visual data (x,y-position of the top left corner, height, width, color, opacity, background color, font, etc.) of its element or group. Hence, this JSON representation captures both semantic and visual features of the UI, which supports the evaluation of various aspects of the design and differentiates it from the representation used by [56] that captures only semantic information. This JSON representation is constructed from the data (e.g., group/element names) and grouping structure found in the Layers panel of Figma, which are editable by designers.
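As an illustration, the sketch below shows one way such a condensed representation could be built by recursively serializing Figma layers. The UINode field names are illustrative assumptions rather than the plugin’s exact schema, and stylistic attributes (color, opacity, font) are omitted for brevity.

```typescript
// Illustrative sketch of recursively serializing a Figma subtree into a condensed UI JSON.
interface UINode {
  id: string;
  name: string;
  type: string;
  bounds: { x: number; y: number; width: number; height: number };
  text?: string;
  children?: UINode[];
}

function serializeNode(node: SceneNode): UINode {
  const json: UINode = {
    id: node.id,     // Figma ID, kept for linking suggestions back to the mockup
    name: node.name, // semantic label from the Layers panel
    type: node.type, // e.g., TEXT, GROUP, FRAME
    bounds: { x: node.x, y: node.y, width: node.width, height: node.height },
  };
  if (node.type === 'TEXT') {
    json.text = node.characters; // text content of the element
  }
  if ('children' in node) {
    json.children = node.children.map(serializeNode); // preserve the layout hierarchy
  }
  return json;
}
```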
Figure 3 shows the core system design of the plugin. Due to the context window limits of GPT-4, we remove all unnecessary or redundant information and condense verbose details into a concise JSON structure (Box 3). This condensed JSON representation and guideline text are combined into a prompt sent to the LLM. After the LLM returns the identified guideline violations, another query is sent to the LLM to convert these violations into constructive advice (Box 4). This chain of prompts is illustrated in Figure 12 (Appendix), which describes the components of each prompt. The LLM response is parsed by the TypeScript code (Box 5) and rendered into an interpretable format for designers (Box 6). Figma IDs for each element and group are stored internally, which supports selection of elements or groups in the mockup via links (Figure 2, A) and quick access to their layout information. Layout information is used to draw boxes around elements and groups in the screenshot, as shown in Figure 2 (B and C). Finally, unhelpful suggestions that were dismissed by the designer are incorporated into the prompt for the next round of evaluation (Box 7), if there is room in the context window. The label generation feature is also executed via an LLM call, with the prompt containing JSON data of all unnamed groups and instructions for the LLM to create a descriptive label for each JSON based on its contents.
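The sketch below illustrates the two-call prompt chain at a high level, assuming direct requests to OpenAI’s chat completions endpoint with temperature 0. The prompt text is heavily abbreviated (the full prompts appear in Figure 12), and callGPT4, evaluateUI, and OPENAI_API_KEY are illustrative names rather than the plugin’s actual identifiers.

```typescript
// Minimal sketch of the two-call prompt chain (prompts abbreviated).
declare const OPENAI_API_KEY: string; // assumed to be supplied via plugin configuration

async function callGPT4(messages: { role: string; content: string }[]): Promise<string> {
  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: 'gpt-4', messages, temperature: 0 }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function evaluateUI(uiJson: string, guidelines: string): Promise<string> {
  // Call 1: identify guideline violations in the condensed UI JSON.
  const violations = await callGPT4([
    { role: 'system', content: 'You are an expert UI reviewer.' },
    { role: 'user', content: `Guidelines:\n${guidelines}\n\nUI JSON:\n${uiJson}\n\nList all guideline violations.` },
  ]);
  // Call 2: rephrase the violations as constructive suggestions in a parseable format.
  return callGPT4([
    { role: 'user', content: `Rephrase each violation below as constructive design feedback, in the required output format:\n${violations}` },
  ]);
}
```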
3.4 Improving LLM Performance
We chose the most advanced GPT version available (GPT-4), as it has the strongest reasoning abilities [42]. However, GPT-4 does not support fine-tuning and has a context window limit of 8.1k tokens. This context window limit leaves inadequate room for few-shot and “chain-of-thought” [59] examples because each Figma UI JSON requires around 3-5k tokens, the guidelines text takes up to 2k tokens, and few-shot and chain-of-thought examples both require the corresponding UI JSONs. Due to these limitations, our method for improving GPT-4’s performance entailed adding explicit instructions in the prompt to avoid common mistakes, as shown in Figure 12 (Appendix). Finally, we set the temperature to 0 to ensure GPT-4 returns the most probable violations.
The remaining space in the context window was allocated to suggestions that were dismissed (hidden) by designers. Incorporating this feedback targets areas of poor performance specific to the UI being evaluated and also adapts GPT-4’s feedback to the designer’s preferences. Since the UI JSON is already provided in the prompt, this feedback does not require much space. However, the UI may have changed due to edits, so JSONs of the groups/elements for a dismissed violation are still included, but they are considerably smaller than the entire UI JSON. These items are incorporated in the conversation history of the next prompt, as examples of inaccurate suggestions (see Figure 12 in the Appendix). In addition, we ask GPT-4 to reflect on why it was wrong and add this prompt and its response to the conversation history. This “self-reflection” has been shown to improve LLM performance [51].
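A minimal sketch of how dismissed suggestions and the accompanying self-reflection turns might be folded into the next round’s conversation history, subject to a rough token budget. The data structures and the 4-characters-per-token estimate are illustrative assumptions, not the exact implementation.

```typescript
// Sketch (assumed structures) of building the next round's conversation history
// from dismissed suggestions and GPT-4's earlier self-reflections.
interface ChatMessage { role: 'system' | 'user' | 'assistant'; content: string; }

function buildNextRoundMessages(
  basePrompt: ChatMessage[], // system + user prompt with guidelines and the full UI JSON
  dismissed: { suggestion: string; groupJson: string; reflection: string }[],
  tokenBudget: number
): ChatMessage[] {
  const messages = [...basePrompt];
  for (const d of dismissed) {
    const turns: ChatMessage[] = [
      { role: 'assistant', content: d.suggestion },
      {
        role: 'user',
        content: `The designer dismissed this suggestion as unhelpful. Relevant UI JSON:\n` +
                 `${d.groupJson}\nReflect on why it may have been wrong and avoid similar mistakes.`,
      },
      { role: 'assistant', content: d.reflection }, // GPT-4's earlier self-reflection response
    ];
    const cost = turns.reduce((sum, t) => sum + t.content.length / 4, 0); // crude token estimate
    if (cost > tokenBudget) break; // stop once the context window budget is exhausted
    tokenBudget -= cost;
    messages.push(...turns);
  }
  return messages;
}
```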
3.5 Exploration of Alternative Prompt Compositions
Table 1:
Prompt Condition | Total Violations | Helpful Violations |
---|---|---|
Complete (Plugin) | 63 | 38 |
One Call | 62 | 31 |
No Heuristics | 50 | 14 |
General UI Feedback | 57 | 24 |

LLM | Total Violations | Helpful Violations |
---|---|---|
GPT-4 (Plugin) | 63 | 38 |
GPT-3.5-16k | 228 | 23 |
Claude 2 | 7 | 1 |
PaLM 2 | 12 | 3 |
The top table compares the total number of violations and the number of helpful violations (based on the authors’ judgment) found in 12 UI mockups for different prompt compositions. The “Complete (Plugin)” condition refers to the prompt composition used in the plugin. The bottom table compares the total number of violations and the number of helpful violations (based on the authors’ judgment) found in the 12 UI mockups by each LLM, with GPT-4 being used in the plugin.
We investigate how different prompt components influence GPT-4’s output to identify potential opportunities for simplifying our complex prompt. For our analysis, we used 12 distinct mockups of mobile UIs taken from the Figma community. Furthermore, we used three sets of heuristics covering different aspects of UI design: Nielsen’s 10 Usability Heuristics [38], Luther et al.’s visual design principles compiled in “CrowdCrit” [29], and Duan et al.’s 5 semantic grouping guidelines [10]. These 12 UIs and three sets of heuristics were consistently used in all subsequent analyses and studies in this paper, except for the Performance Study, which used a larger set of 51 UIs. We query the LLM with prompt variations and then compute the total number of reported violations and the number of helpful violations (based on the authors’ judgment), and we also qualitatively examine the violations. We consider a violation to be helpful if it is both accurate and would lead to an improvement in the design. Table 1 (top) compares violation counts for each prompt condition with the complete prompt chain.
3.5.1 One Call.
Our prompt chain makes two LLM calls – one to carry out the heuristic evaluation and the other to rephrase results into constructive feedback (Appendix Figure 12). We examine the effects of combining these two into a single call, as this would reduce latency. Quantitatively, the total number of violations remained similar, but the number of helpful violations was lower. However, more importantly, the output was never formatted correctly with one call, and the format also varied across different calls. Furthermore, GPT-4 sometimes omitted other important details, such as how to fix the violation. Since correct output formatting is necessary for the plugin to parse and render the violations, combining the two calls is not feasible.
3.5.2 No Heuristics.
The detailed heuristics text occupies a lot of space in the LLM’s context window, so we examined the performance without including them in the prompt. We edited prompts to look for “visual design issues”, “usability issues”, or “semantic group issues” instead of passing in the heuristics.
Table 1 (top) shows that GPT-4 provided fewer suggestions overall (50 vs. 63) and considerably fewer helpful suggestions (14 vs. 38) when heuristics were not included in the prompt. Qualitatively, the suggestions for CrowdCrit and Nielsen were similar to those from the complete prompt, though the suggestions were more thorough when the CrowdCrit heuristics were included. However, for Semantic Grouping, not passing in the heuristics resulted in violations that concerned only the semantic relatedness of group members, whereas passing in the guidelines resulted in a more diverse set of issues found. We conclude that while the LLM can give plausible UI feedback without the heuristics, the quality of the suggestions is worse.
3.5.3 General UI Feedback.
Finally, we investigate how GPT-4 responds without specific guidance when prompted for general UI feedback. We removed all mentions of “guidelines” in the prompt and replaced “violations” with “feedback.” Quantitatively, the performance for this condition was worse. Qualitatively, GPT-4 still carried out heuristic evaluation to an extent, as the issues were grounded in existing design conventions, but in a less rigorous and organized manner. Compared to the complete prompt, the feedback was less diverse, and the LLM often focused on only one type of issue (e.g., misalignment) when there were other types of violations. We conclude that GPT-4 can produce plausible output when asked for general UI feedback, but specific guidance produces higher quality and more diverse suggestions.
3.6 Comparison with other LLMs
We explored the potential of other state-of-the-art LLMs in carrying out this task: Claude 2, GPT-3.5-turbo-16k, and PaLM 2. Llama 2 was considered but excluded because its 4k context window size is insufficient for the task. Similar to the prompt analysis, we compute the total number of violations found and the number of helpful violations.
We found that Claude 2, GPT-3.5-turbo-16k, and PaLM 2 all had considerably worse performance than GPT-4, as shown in Table 1 (bottom). Claude 2 and PaLM 2 found very few violations; Claude 2 only found violations in 4 UIs, and PaLM 2 only identified one violation per UI, even after adjusting the prompt to indicate that there was more than one violation per UI. In fact, all 12 UIs have multiple violations, as later confirmed in a heuristic evaluation by human experts. The few violations found by these two LLMs were mostly unhelpful, such as suggesting that the dollar sign needs a text label. GPT-3.5-turbo-16k had the opposite behavior, finding nearly 4 times as many violations as GPT-4. However, most of the time, it indiscriminately applied the same guideline to every element of the appropriate type, regardless of whether there was an issue (e.g., stating the font is difficult to read for every text element). As a result, most of its helpful violations were found by chance, and it still found fewer helpful violations than GPT-4. Finally, GPT-3.5-turbo-16k and PaLM 2 had difficulty following the prompt’s instructions, often formatting the output incorrectly (with a separate rephrasing call) or making the mistakes they were told to avoid, such as returning violations regarding the mobile status bar.
These models are all smaller than GPT-4, with billions of parameters, compared to GPT-4’s 1.7 trillion [42]. These models have also been shown to have worse reasoning skills [3]. These factors likely contributed to their poor performance in this task. Since GPT-4 has the best performance by far, we solely focus on GPT-4 for the remaining three studies on the plugin.
4 Study Method
Figure 5:
An illustration of the formats of the three studies. The Performance Study consists of 3 raters evaluating the accuracy and helpfulness of GPT-4-generated suggestions for 51 UI mockups. The Heuristic Evaluation Study with Human Experts consists of 12 design experts, who each looked for guideline violations in 6 UIs, and finishes with an interview asking them to compare their violations with those found by the LLM. Finally, the Iterative Usage study comprises another group of 12 design experts, each working with 3 UI mockups. For each mockup, the expert iteratively revises the design based on the LLM’s valid suggestions and rates the LLM’s feedback, going through 2-3 rounds of this per UI. The Usage study concludes with an interview about the expert’s experience with the tool.
To explore the potential of GPT-4 in automating heuristic evaluation, we carried out three studies (see Figure 5). In the Performance study, three designers rated the accuracy and helpfulness of GPT-4’s generated suggestions for 51 diverse UI mockups to establish performance metrics across a variety of designs. Next, we conducted a heuristic evaluation study with 12 design experts, who each manually identified guideline violations in 6 UIs. Afterwards, they compared their identified violations with those found by GPT-4 in an interview. Finally, in the Iterative Usage study, another group of 12 designers iteratively refined three UIs each with the tool and discussed how the tool might fit into existing workflows in an interview. We obtained UIs from the Figma Community, where designers share their mockups publicly. To attain a diverse set of UIs, we searched for UIs from various app categories, such as finance and e-commerce. We selected UIs that have room for improvement (based on our guidelines) and have JSON representations that could fit into GPT-4’s context window. We only used mobile UIs because web UIs were usually too large. For each UI, we ensured that the grouping structure in the Layers panel matched the visual grouping structure in the UI screenshot. We also used our tool to automatically generate semantically informative names for unnamed groups in the Layers panel.
4.1 Performance Study
We recruited three designers for the Performance study through advertising at an academic institution. Each participant had 3-4 years of design experience, and their areas of expertise include mobile, web, product, and UX design. This background information was collected during a brief instructional meeting conducted prior to participants starting this task. We precomputed the guideline violations for all 51 UIs to ensure that all participants saw the same suggestions, allowing us to calculate inter-rater agreement. The 51 UIs were split into three groups of 17, and each group was evaluated using one set of guidelines. Each participant saw the same set of 51 UIs and was given a week to rate the suggestions. Participants spent an average of 6.8 hours total on this task.
For each suggestion, participants were asked to select a rating for accuracy on a scale of 1 to 3 (“1 - not accurate”, “2 - partially accurate”, “3 - accurate”) and then provide a brief, one-sentence explanation for their rating. Participants were also asked to rate the suggestion’s helpfulness on a scale of 1 to 5, with 1 being “not at all helpful” and 5 being “very helpful”, and also provide a brief explanation. We stored all GPT-4 suggestions, along with the corresponding anonymized rating data, explanations, and UI JSONs from this study, and have made this dataset available in the Supplementary Materials.
4.2 Manual Heuristic Evaluation Study with Human Experts
We recruited 12 participants through advertising at a large technology company and an academic institution. Two participants had less than 3 years of design experience, six had 3-5 years, two had 6-10 years, and one had 15 years. Their areas of expertise include mobile, web, product, UX, cross-device, and UX and UI research. The study was conducted remotely during a 90-minute session, where participants looked for guideline violations in 6 UIs in a Figma file. Each UI was assigned one of the sets of guidelines for evaluation.
The first 75 minutes consisted of the heuristic evaluation. Participants were instructed to provide the name of the guideline violated, an explanation of the violation following [48], and a usability severity rating for each violation found. There were a total of 12 UIs used for this study, and each UI was evaluated by 6 participants. The remaining 15 minutes were allocated for a semi-structured interview, where we demoed the plugin and generated feedback for the same 6 UIs the participant evaluated. We then asked the participants to compare the LLM’s violations with their own.
4.3 Iterative Usage Study
We recruited another group of 12 participants through advertising at an academic institution and a large technology company. One participant had less than 3 years of design experience, five had 3-5 years, three had 6-10 years, two had 11-15 years, and one had over 32 years. Their areas of expertise include mobile, web, product, UX, mixed reality design, and UX and HCI research. The study was conducted either in-person or remotely during a 90-minute session. Participants were given three UIs in a Figma file, each with their corresponding heuristics assigned for the evaluation. Participants worked through one UI at a time. They first rated the accuracy and helpfulness of GPT-4’s suggestions, following the scales used in the Performance Study. However, participants in the Usage study were asked to follow helpful suggestions to edit the mockup, though they could skip revisions that require too much work, like restructuring the entire layout. After participants finished revising the UI, they would rerun the plugin to generate a new set of suggestions for the revised mockup and then re-rate the new suggestions. For UIs 1 and 3, participants did one round of edits and two rounds of ratings. For UI 2, participants did two rounds of edits and three rounds of ratings, which is meant to assess the LLM’s iterative performance. This study used the same set of 12 UIs as the manual heuristic evaluation study (with the same guideline assignments for each UI’s evaluation), and each UI was seen by three participants. To assess rater agreement, we again precomputed the first round suggestions for each UI. After participants finished all three tasks, we concluded with a semi-structured interview, focusing on overall impressions, potential drawbacks and dangers, potential for iterative use, and fit with their design workflow.
5 Results
5.1 Quantitative Results: Performance Study
Figure 6:
Histogram showing the number of ratings in each category for accuracy and helpfulness, from the 3 participants in the Performance Study. For accuracy, the scale is: “1 - not accurate”, “2 - partially accurate”, and “3 - accurate”. The scale for helpfulness ranges from “1 - not at all helpful” to “5 - very helpful”. The rating data is also visualized as horizontal bar charts for this study and the Usage Study.
Figure 7:
Horizontal bar charts showing the distribution of ratings from the Performance Study for each individual guideline. The ratings for accuracy are in the top row, and helpfulness is in the bottom row, and each chart has a horizontal black line depicting the average rating. We highlight several guidelines with high ratings in green, such as “Consistency and Standards” from Nielsen Norman’s 10 Usability Heuristics. We used orange to highlight an average performing guideline – “Emphasis” (from CrowdCrit), which had bimodal ratings for accuracy and helpfulness. Finally, we used red to highlight the worst performing guideline – “Aesthetic and Minimalist Design”, which had generally poor accuracy and helpfulness ratings.
More GPT-4 generated suggestions were rated as accurate and helpful than not. Across all generated suggestions, 52 percent were rated Accurate, 19 percent Partially Accurate, and 29 percent Not Accurate; 49 percent were considered helpful or very helpful, 15 percent moderately helpful, and 36 percent slightly or not at all helpful. We show histograms of ratings given to all suggestions for each set of guidelines in Figure 6, along with averages. In line with the aggregate statistics, GPT-4 is more accurate and helpful than not for each set of guidelines. Furthermore, this difference is largest for CrowdCrit’s visual guidelines and smallest for the Semantic Grouping guidelines. Regarding the average rating for all suggestions, CrowdCrit outperformed the other guidelines for accuracy and helpfulness, with a greater outperformance in helpfulness. Semantic Grouping had the worst performance for accuracy, and Nielsen Norman performed the worst for helpfulness. Later, we show concrete examples of GPT-4 generated feedback in Figure 9 that includes accurate and helpful suggestions, as well as inaccurate and unhelpful ones.
We also grouped the ratings by individual guideline and visualized them in horizontal bar charts shown in Figure 7. This reveals the finer-grained types of heuristics on which GPT-4 performed better or worse. The accuracy was highest for the “Recognition rather than Recall”, “Match Between System and Real World”, and “Consistency and Standards” usability heuristics (highlighted in green in Figure 7), and participants generally found “Consistency and Standards” violations helpful. “Consistency and Standards” mostly caught inconsistencies in the visual layout of the UI, like misalignment and inconsistency in size. In contrast, the “Aesthetic and Minimalist Design” guideline was generally inaccurate and hence unhelpful, as shown in red in Figure 7. Finally, the “Emphasis” principle (shown in orange) from CrowdCrit was bimodal – a fairly even distribution of “accurate” and “inaccurate” ratings, as well as “very helpful” and “unhelpful” ratings. The “Emphasis” principle mostly identified issues related to the visual hierarchy of the UI.
The subjective nature of heuristic evaluation was already highlighted by Nielsen [40]. To characterize subjectivity in our study, we computed inter-rater reliability using Fleiss’ Kappa [13]. Accuracy ratings had an agreement score of 0.112 and helpfulness ratings had a score of 0.100, which suggests only slight agreement. In addition to the subjective nature of this task, particular choices in the phrasing of the suggestions could have also lowered agreement scores. For example, the suggestion “The icons in this group... could be more user-friendly with the addition of text labels” calls for a subjective opinion on whether or not text labels are needed, and raters might reasonably disagree.
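For reference, Fleiss’ kappa follows its standard definition:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

where $\bar{P}$ is the mean observed agreement across rated suggestions and $\bar{P}_e$ is the agreement expected by chance given the marginal proportions of each rating category; $\kappa = 1$ indicates perfect agreement, while values near 0 indicate agreement no better than chance.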
Table 2:
Performance Metrics | GPT-4 | Human Evaluator (Avg.) |
---|---|---|
Precision | 0.603 | 0.829 |
Recall | 0.380 | 0.336 |
F1 | 0.466 | 0.478 |
Table showing the Precision, Recall, and F1 scores of GPT-4 and an individual human evaluator, computed from the ground truth dataset. The metrics for the human evaluator is computed by averaging these metrics across all participants in the study (for the 6 UIs they each ev