1 Introduction
Imagine programming a robot to go to the grocery store to buy diapers and wine. If you had to program the steps in advance, unexpected scenarios such as the diaper delivery truck blocking the entrance would confuse your robot. Because testing the robot requires watching it perform, every failure would take time to uncover and would yield data for only one new workaround. Getting this robot to and from the store may be possible, but at what cost? How complex must the algorithm be, and how long would it take to code?
This robot represents the state of visual GUI automation until the emergence of model-based GUI automation, first implemented in the open-source Brobot framework appearing in January 2022. Model-based GUI automation makes it possible to reliably solve complex tasks, like the GUI automation equivalent of sending a robot to the grocery store for diapers and wine.
The remainder of the paper is organized as follows. Section 2 provides the foundational concepts of GUI automation and key concepts in adjacent domains. Section 3 discusses related work and defines the boundaries of model-based GUI automation as a domain. Section 4 looks at visual GUI automation in software testing, emphasizing the limitations of existing process-based techniques. Section 5 presents formal mathematical models of GUI automation, and Sect. 6 explores how these models can be implemented in a framework. The DoT app provides practical implementations of key concepts in the paper and is introduced in Sect. 7. Section 8 examines stochasticity in GUI automation and Sect. 9 looks at the transformation from implicit to explicit environment representation. Section 10 explores robustness, code complexity, and scalability through practical examples and mathematical analysis, demonstrating the real-world advantages of model-based GUI automation. Section 11 discusses the novel approach to integration and unit testing of GUI automation applications. Finally, Sect. 12 explores the transformative applications enabled by this model-based approach, illustrating its potential impact across various domains.
2 Foundations of GUI automation
This section lays the groundwork for understanding model-based GUI automation by introducing key concepts and definitions. It explores the nature of GUI environments and fundamentals of GUI automation.
The GUI environment for a specific activity is the set of all possible screens displayed on the monitor during this activity. When a person uses a computer to perform a task, such as booking a flight, they do so with an understanding of how to use the computer, its operating system, the web browser, and the specific website. This understanding represents a mental model of the GUI environment for that activity.
GUI automation is the process of manipulating the graphical user interface with software. GUI automation, whether for productivity, game playing, software testing, or any other activity, involves at least two applications: (1) the application controlling the automation, and (2) the application(s) acted on, or being automated. In software testing, the automated application is referred to as the AUT (Application Under Test) or SUT (Software Under Test) (Banerjee et al. 2013) [1].
In the context of GUI automation, a GUI action is an individual operation performed during automation. Examples of GUI actions include clicking a location, searching for an image, dragging the mouse, and typing (Yeh et al. 2009) [2]. A GUI process is a sequence of one or more GUI actions.
Process-based GUI automation tools use scripts or workflows to perform actions on the GUI. Scripts are sequences of commands or instructions and can be coded manually in a programming or domain-specific language, assembled using a GUI, or recorded for later replay. GUI automation is traditionally a process-based activity.
GUI automation can be divided into two categories: declarative and visual. An example of a tool for declarative GUI automation is Selenium, which accesses web elements based on their attributes or hierarchical position within the HTML document (Selenium 2023) [3].
Visual GUI automation acts as a human agent would: it uses machine vision to understand the GUI and performs keyboard and mouse actions (e.g., click, type, drag) to manipulate it. Visual GUI test automation tools can recognize GUI elements based on appearance (e.g., color, shape, text) instead of internal properties (e.g., name, ID, class), and they can work with any application regardless of its platform or technology. Perhaps the best-known visual GUI automation software is SikuliX (Yeh et al. 2009) [2].
Pattern matching is the type of image recognition primarily used in visual GUI automation and visual GUI testing. One implementation of pattern matching is the OpenCV library’s matchTemplate function, which slides an image across the search region and compares overlapping pixels. Typically, an image on file is compared to the pixels in the GUI. When a part of the screen meets a specified minimum similarity score, the observation is successful and returns the on-screen coordinates of the image match. This process is deterministic and does not depend on a trained neural net or latent spaces.
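As a concrete sketch, the sliding-window comparison described above can be written in a few lines of NumPy. This is a simplified, pure-Python illustration of the computation that OpenCV's matchTemplate performs far more efficiently (here using normalized cross-correlation, comparable to its TM_CCOEFF_NORMED mode); the arrays and the 0.95 threshold are illustrative, not taken from any particular tool.

```python
import numpy as np

def match_template(screen, template, min_similarity=0.95):
    """Slide `template` over `screen`, scoring each position by normalized
    cross-correlation; return the best match's (row, col) coordinates if it
    clears the minimum similarity score, else None (observation failed)."""
    sh, sw = screen.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -1.0, None
    for r in range(sh - th + 1):
        for c in range(sw - tw + 1):
            window = screen[r:r + th, c:c + tw]
            w = window - window.mean()
            denom = np.sqrt((w ** 2).sum()) * t_norm
            score = (w * t).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos if best_score >= min_similarity else None
```

Because the comparison is a fixed arithmetic procedure over pixels, the same inputs always yield the same match, which is what makes this form of recognition deterministic.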
3 Related work and domain boundaries
Although this paper focuses on automation rather than testing, the academic literature on GUI automation predominantly addresses testing applications. Understanding this landscape is crucial for positioning model-based GUI automation as a distinct domain. In this section, I analyze the distinctions between continuous and discrete processes, establish clear boundaries between model-based testing and model-based GUI automation, and clarify terminology to prevent confusion.
3.1 Continuous versus discrete processes: a fundamental mismatch
Software testing is a discrete activity [4]. Each test case operates as an independent unit designed to verify a specific functionality. When a test finishes—whether passing or failing—the system resets to its initial state before running the next test. This discreteness is essential for testing, where isolation between test cases ensures clear diagnostics.
Fig. 1
Traditional GUI automation differs from many manually performed activities, although AI-driven tools can also adopt a model-based approach
Process-based visual GUI automation tools align well with these discrete test cases, as both follow predetermined action sequences with clear start and end points. However, this creates a fundamental mismatch when applying these same tools to other GUI activities. Most real-world GUI interactions are continuous processes. For example, when someone buys a flight online and encounters an unexpected issue like a broken link, they do not abandon the entire task and restart from the beginning. Instead, they adapt, finding an alternative path forward from their current position.
This incompatibility explains why traditional visual GUI automation tools often struggle with non-testing automation tasks. Three key characteristics highlight this mismatch:
Situation Continuity: Continuous processes have fluid, non-resettable states that contrast sharply with the distinct, repeatable steps of discrete test cases.
Adaptive Goal Pursuit: They require dynamically adjusting actions based on current conditions, unlike discrete processes that follow predetermined paths.
Cumulative Historical Impact: Past actions shape future possibilities, a stark contrast to discrete processes where each step operates more independently.
Figure 1 illustrates this critical distinction. Software testing naturally aligns with process-based, discrete approaches whether performed manually or through automation. However, activities such as game playing and productivity tasks are fundamentally continuous. While humans naturally approach these continuous activities with a mental model of the GUI environment, the process-based approach of traditional automation tools creates an inherent limitation. This misalignment between tool capabilities and activity requirements explains why conventional GUI automation often fails in non-testing contexts.
Fig. 2
In model-based testing, test cases are generated from a model of the AUT
3.2 Model-based testing
Using a model of the AUT, a model-based testing tool can generate test cases that cover different paths in the AUT (Petrenko et al. 2012) [5]. The model in model-based testing is a map of possible GUI actions (Utting et al. 2011) [6]. Figure 2 demonstrates how model-based testing works.
Figure 3 shows a model of an AUT characteristic of model-based testing. The model is a directed graph composed of nodes (corresponding to actions taken on GUI elements) and edges (showing which actions are possible from the last action taken). For example, an edge from the blue node to the yellow node shows that the yellow action can be performed after the blue action. Model-based testing uses the model to describe all possible action chains within the AUT, referred to as the event-interaction graph (EIG) by Yuan and Memon (2010) [7]. Typically, the model is used to generate a list of test scripts that are then analyzed to find deficiencies in the AUT. The generated test scripts are linear processes, and the execution of each script remains rigid and sequential.
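To make the generation step concrete, the sketch below enumerates every action chain in a toy EIG and emits each one as a linear script. The graph and action names are hypothetical, not drawn from the cited work; a real model-based testing tool would additionally bound chain length and apply coverage criteria.

```python
# Hypothetical event-interaction graph: keys are GUI actions, values are the
# actions that may follow them. Terminal actions have no successors.
eig = {
    "open_menu": ["click_file", "click_edit"],
    "click_file": ["click_save", "click_exit"],
    "click_edit": ["click_copy"],
    "click_save": [],
    "click_exit": [],
    "click_copy": [],
}

def generate_scripts(graph, action):
    """Enumerate all action chains from `action` to a terminal action;
    each chain becomes one rigid, sequential test script."""
    successors = graph[action]
    if not successors:
        return [[action]]
    scripts = []
    for nxt in successors:
        for tail in generate_scripts(graph, nxt):
            scripts.append([action] + tail)
    return scripts

for script in generate_scripts(eig, "open_menu"):
    print(" -> ".join(script))
```

Each printed chain (e.g., `open_menu -> click_file -> click_save`) is one linear process: if any step in the middle fails, the remainder of that script cannot execute.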
Fig. 3
Nodes represent actions taken on GUI widgets, and edges show potential next actions
The number of paths in an EIG, and the time it takes to run all tests, increases exponentially with a linear increase in connected states (Moreira et al. 2017) [8]. This scalability issue is known as the state explosion problem (Clarke et al. 2012) [9].
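A back-of-envelope calculation illustrates the trend. Counting the simple paths (no repeated actions) in a fully connected directed graph shows how adding states one at a time multiplies the number of chains a test generator must cover; the numbers are illustrative of the growth, not of any particular AUT.

```python
from math import perm

def simple_path_count(n):
    """Simple paths (at least one edge) in a complete directed graph on n
    nodes: every ordered selection of 2..n distinct nodes is one path."""
    return sum(perm(n, k) for k in range(2, n + 1))

# Linear growth in states, explosive growth in paths.
for n in range(2, 8):
    print(n, simple_path_count(n))
```

Going from 4 to 7 connected states already moves the path count from 60 to 13,692, which is why exhaustive test generation quickly becomes impractical.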
3.3 Terminological clarifications
The term model-based automation is used interchangeably in common discourse with the terms model-based testing and model-based test automation.
In model-based automation and model-based GUI automation, the words model-based and automation have different meanings, as captured in Table 1.
In model-based automation, automation refers to the generation of test cases and not to their execution, which can be automated or performed manually. The term model-based in this context refers to the use of a model of the AUT to generate test cases. When automated, test cases employ process-based GUI automation (do A, then do B, etc.).
In contrast, in model-based GUI automation, automation refers specifically to GUI automation (manipulating the GUI without user input). The term model-based refers to the use of a comprehensive model of GUI automation, including the GUI environment, GUI action execution, pathfinding, path traversal, and state management.
3.4 Testing automation versus automated testing
One of the principal innovations of model-based GUI automation is the ability to systematically test the automation code itself. Given that GUI automation is a cornerstone of modern software testing, it is crucial to distinguish between two related but distinct concepts: automated testing and testing automation.
Automated testing is the familiar practice of using software to test another piece of software (the Application Under Test). To use a metaphor, this is like building a robot to test a car’s brakes. The focus is on the car’s performance.
Testing automation, a novel capability enabled by the model-based approach, refers to the process of testing the automation code itself. In the metaphor, this is akin to testing the robot to ensure its own arms, sensors, and logic function correctly before it ever touches the car.
Model-based GUI automation makes this second type of testing feasible for the first time in a structured way, applying standard software engineering practices to a domain where they were previously impractical. Integration and unit testing of GUI automation are discussed in depth in Sect. 11.
4 Challenges in traditional GUI automation
Traditional GUI automation approaches face persistent challenges that limit their effectiveness in complex, real-world applications. Despite decades of incremental improvements, visual GUI automation continues to suffer from fundamental problems that impede its broader adoption. This section examines the core limitations of process-based GUI automation approaches, with particular emphasis on script fragility and the relationship between robustness and code complexity. Understanding these challenges helps us to appreciate why a fundamentally different approach is necessary.
4.1 Script fragility: the fundamental problem
Compared to manual testing, automated testing improves accuracy and scalability but introduces complexities such as maintenance and coding challenges (Dobslaw et al. 2019) [10]. The Achilles heel of visual GUI test automation is script fragility, which refers to the tendency of automation to fail due to changes in the GUI environment. These failures are prominent in visual GUI testing and are discussed extensively in the academic literature. Script fragility is caused by various factors, such as:
Images not found or found erroneously: Pattern-matching algorithms can be sensitive to variations in resolution, scaling, color depth, brightness, contrast, etc., and can be confused by similar-looking or overlapping elements. Such factors can lead to image recognition failures classified as false negatives (the target images exist but are not found) or false positives (the images found are not the target images) (Garousi et al. 2017; Wiklund et al. 2017; Nass et al. 2021; Yandrapally et al. 2014) [11–14].
Unexpected delays in the process flow: Visual GUI automation tools execute scripts sequentially based on predefined timings or events. However, the execution speed and order of scripts can be affected by factors such as network latency, system load, and application responsiveness. These factors can lead to unexpected delays or interruptions that cause scripts to fail (Garousi et al. 2017; Nass et al. 2021) [11, 13].
Dynamic content generation: Many modern applications have dynamic or adaptive GUIs that can change based on user actions, data inputs, and context information, rendering automation scripts ineffective (Nass et al. 2021) [13].
GUI changes between versions: Any unanticipated change to the GUI can cause scripts to fail. An AUT with frequent updates easily breaks test scripts built for a static environment (Nass et al. 2021; Yandrapally et al. 2014) [13, 14].
Aho and Vos (2018) wrote that GUI tests are process flows and that process flows are innately fragile: "Usually a GUI test case contains a sequence of events or interactions...an incorrect GUI state in the middle of the sequence may lead to an unexpected screen, making further test case execution useless" [4].
Empirical evidence confirms the pervasiveness of this problem [4, 11–21]. Tests can fail despite a functional AUT, and false positives occur frequently due to automation failures (Alégroth et al. 2018; Memon and Soffa 2003) [16, 21]. Alégroth et al. (2018) found that 16% of 17,567 tests gave false positives due to script fragility, while only 3% identified actual defects [16]. More dramatically, Memon and Soffa (2003) reported that 72% of tests failed on new releases of Adobe Acrobat Reader due to changes in the GUI layout and appearance [21]. Despite 20 years of research, these challenges have seen very little progress (Nass et al. 2021) [13].
4.2 Complexity-robustness trade-off
The fundamental challenge in addressing script fragility lies in the inverse relationship between robustness and code complexity. Thummalapenta et al. (2013) observed this trade-off: "For pragmatic reasons, some amount of script brittleness might be acceptable when weighted against the coding effort required for increasing change-resiliency" [19].
Attempts to improve robustness in process-based approaches typically involve:
Adding exception handling for each potential failure point
Creating alternative paths to accommodate various GUI states
Implementing recognition redundancy (multiple ways to identify the same element)
Building in wait mechanisms and retry strategies
Each of these strategies multiplies the code complexity, often exponentially. For example, adding alternative paths for each step in a process with n steps can lead to a combinatorial explosion of paths that must be coded and maintained. Similarly, for each GUI element that might change appearance or position, developers must implement multiple recognition methods and fallback approaches.
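As an illustration of how these strategies accumulate in process-based code, the sketch below wraps a single element lookup with recognition redundancy, retries, and waits. The `find` callable, image names, and retry counts are hypothetical stand-ins, not any tool's API; the point is that in a real script, a wrapper like this (plus alternative paths and exception handling) is needed around every one of the n steps.

```python
import time

def find_with_fallbacks(find, image_variants, retries=3, wait_s=0.5):
    """Try each image variant in turn, retrying with a wait between rounds.
    `find` is a stand-in for a pattern-matching call that returns screen
    coordinates on success and None on failure."""
    for attempt in range(retries):
        for image in image_variants:    # recognition redundancy
            location = find(image)
            if location is not None:
                return location
        time.sleep(wait_s)              # wait mechanism before retrying
    # Exception handling for this failure point; the caller must now
    # choose among alternative paths, multiplying the script's branches.
    raise RuntimeError(f"none of {image_variants} found")
```

Every step now carries its own variants, timing parameters, and failure branch, which is precisely the complexity multiplication described above.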
As automation tasks grow more complex, this trade-off becomes increasingly problematic. The challenge of maintaining code that addresses all possible exceptions and variations quickly outweighs the benefits of automation itself. This creates a practical ceiling on the complexity of tasks that can be reliably automated using process-based approaches.
The complexity-robustness trade-off explains why, despite its potential benefits, GUI automation has seen limited adoption in areas where reliability is crucial: The cost of developing and maintaining sufficiently robust solutions becomes prohibitive as complexity increases. Model-based GUI automation offers solutions to this fundamental problem, and Sect. 12 explores the practical applications this unlocks.
5 Formal mathematical model of GUI automation
Having established the fundamental challenges of script fragility and the complexity-robustness trade-off inherent in traditional, process-based GUI automation, this section introduces a formal mathematical model designed to overcome these challenges. The core of this model-based approach is a shift from writing rigid, sequential procedures to creating an explicit, navigable map of the GUI environment, referred to as the State Structure. This explicit representation allows an automation framework to find its own paths and recover from errors dynamically, much as a human user would adapt to unexpected changes.
The formal model presented here captures the essential elements and dynamics of these GUI interactions, providing a rigorous foundation for this new automation paradigm. It formalizes key concepts such as GUI states, actions, transitions, and path traversal while directly accounting for the stochasticity of GUI environments. By offering precise definitions and relationships between these components, the model enables clearer reasoning about automation processes and lays the theoretical groundwork for practical implementation in frameworks like Brobot. This model serves as the basis for analyzing the advantages of model-based GUI automation in terms of robustness and code complexity, which are explored in subsequent sections.
5.1 General and applied models
This subsection describes the differences between general and applied models and illustrates how they contribute to a comprehensive understanding of GUI automation.
A general model provides a high-level, abstract framework for understanding GUI automation fundamentals. It intentionally omits implementation details specific to particular GUIs or tasks, focusing instead on essential features relevant across diverse scenarios. Section 5 defines several general models: the Overall Model of GUI automation, the Action Model, the state management model, the Transition Model, the path model, and the path traversal model.
In contrast, an applied model operates at a lower level of abstraction and represents a specific automation application. It integrates the implementation-specific details necessary to automate a particular GUI environment and fulfill specific automation objectives. The applied model provides all the components needed to build a model-based GUI automation application in practice.
The applied model in model-based GUI automation functions as a digital twin of the actual GUI environment. Digital twins are virtual representations that typically model physical systems or environments. In this context, model-based GUI automation presents an atypical case, as it involves creating a digital model of a digital environment. While general models provide the theoretical framework, the applied model instantiates this framework with specific details, creating a virtual representation that can predict and interact with the GUI environment in the same way a traditional digital twin would represent a physical system.
5.2 Theoretical foundations
Model-based GUI automation addresses unpredictable environments using principles from robotics and human cognition. A model-based GUI automation application comprises three key components: a map of the environment, a planning mechanism, and an automation agent. This structure mirrors human problem-solving, combining understanding, planning, and execution. Through its use of graph-based models and heuristics, it navigates digital interfaces the way robots traverse physical spaces, allowing for dynamic pathfinding and adaptability.
In their paper "Bounded Rationality in Problem Solving," Langley et al. (2014) discussed how humans approach problem-solving in scenarios with incomplete information [22]. Situations, understood by the brain as collections of symbols, are organized into a directed graph. Humans use this conceptual graph of situations to imagine the steps required to reach a goal. A different type of symbol, operators representing actions, is employed to move from one situation to another.
Langley et al. divided knowledge into two categories: domain knowledge and strategic knowledge. Situations and available actions define domain knowledge and are specific to the problem. In model-based GUI automation, this corresponds to the GUI environment specific to an automation task. Strategic knowledge is problem-agnostic and allows the brain to build, reorganize, and traverse the directed graph of situations and actions [22]. This is represented in model-based GUI automation as pathfinding, path traversal, state management, and action execution. Strategic knowledge allows for awareness and manipulation of the GUI environment. Since strategic knowledge is problem-independent, its implementation should be the responsibility of a framework.
In this context, the domain knowledge is represented by a problem-specific entity called the State Structure (see Fig. 4). The State Structure provides a complete map of the problem space. It describes the specific GUI environment to be automated—most importantly, what exists in this environment and how to move from point A to point B.
Fig. 4
A simple State Structure, the expression of domain knowledge for a specific problem
5.3 The overall model
The Overall Model serves as a comprehensive framework that unifies the fundamental components of model-based GUI automation into a cohesive system. It specifies how the automation system perceives the GUI, manages states, performs actions, navigates through the interface, and responds to dynamic changes.
Figure 5 shows the critical dependencies between components of the Overall Model, separated into domain knowledge and strategic knowledge. The Action Model (a) bridges both domains—it receives input from and modifies the visible GUI ((\Xi )), provides the building blocks for transitions in the State Structure ((\Omega )), and takes state elements from (\Omega ) as input. Actions and transitions provide information to the State Management system (M), which tracks active states. Given the set of active states, the Path Traversal Model ((\S )) navigates the GUI with help from (\Omega )’s knowledge of transitions and graph of the GUI environment.
This architecture separates environment representation (what exists) from interaction mechanisms (how to navigate), allowing the framework to handle strategic knowledge while applications focus on domain knowledge and business logic.
Fig. 5
The overall model: ((\Xi ,\Omega ,a,M,\tau ,\S ))
The Overall Model of GUI automation is a tuple ((\Xi ,\Omega ,a,M,\tau ,\S )) comprising:
(\Xi ), representing the visible GUI;
(\Omega ), the State Structure, an abstraction of the GUI environment;
a, the Action Model, for performing individual actions on the GUI;
M, the State Management system, responsible for maintaining the active states;
(\tau ), the Transition Model, for executing transitions between states;
(\S ), the Path Traversal Model, used for moving within the GUI environment.
The visible GUI (\Xi ) contains:
the scene, or the pixel output of the screen;
(E_{\Xi } = f(\Xi ) \subseteq E), the set of all GUI elements in the visible GUI.
The State Structure (\Omega = (E,S,T)) is defined by:
(E = {e_1,e_2,\ldots ,e_n}), the set of all GUI elements selected to model the environment (images, regions, locations, etc.);
S, the set of all GUI states:
Each state (s \in S) is a subset of E.
Multiple states can be active simultaneously; thus, the visible GUI at a particular time can be described as a set of active states (S_{\Xi } \subseteq S).
(s \in S_{\Xi } \text { if and only if } s \cap E_{\Xi } \ne \emptyset ).
T, the set of all transitions t between states.
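The activation condition above ((s \in S_{\Xi })) if and only if (s \cap E_{\Xi } \ne \emptyset )) can be expressed directly as a set operation. The sketch below is illustrative; the state and element names are hypothetical, not taken from any particular application:

```python
def active_states(states, visible_elements):
    """Return S_Xi: a state is active iff it shares at least one
    element with the visible GUI (s in S_Xi iff s ∩ E_Xi != ∅)."""
    return {name for name, elements in states.items()
            if elements & visible_elements}

# Illustrative State Structure fragment; element names are hypothetical.
states = {
    "Home":  {"toWorldButton", "homeBanner"},
    "World": {"islandTile", "searchBar"},
}
print(active_states(states, visible_elements={"homeBanner", "searchBar"}))
```

Note that the condition naturally allows multiple states to be active at once, as when both halves of the screen in Fig. 6 are visible.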
The GUI environment, the set of all possible screens during an activity, is represented in the model by the set of GUI elements (E) organized into GUI states (S). The current scene ((\Xi )), the screen at a specific time, provides the real-time data necessary for the automation system to perceive, interpret, and interact with the GUI. Since the GUI environment is conceptual and must be abstracted to finite sets of elements and states, only its proxies (elements, states, and scenes) are included in the model.
A state in model-based GUI automation is a collection of related GUI elements. State objects are often grouped spatially or appear at the same time, and objects used together in a process are likely candidates for belonging to the same state. However, these groupings are not absolute rules, and the definition of a state is subjective. A state has a meaning within the automated environment that can vary depending on the automation goals, and a state configuration should make sense in the context of the automation application.
In Fig. 6, the left side of the screen has a browser open to an AI chatbot, and the right side has a different browser with a spreadsheet. These spaces can be clearly defined as separate states since actions performed on one browser will not affect the other. The orange box around the open menu in the spreadsheet defines a third state.
Fig. 6
A GUI screenshot with three states
5.3.1 A practical example: the DoT app
To make the abstract components of the Overall Model ((\Xi , \Omega , a, M, \tau , \S )) more concrete, I use the DoT app, an experimental application designed to automate tasks in the mobile game Dawn of Titans. This app will serve as a running example to illustrate how the theoretical models are realized in practice. The following is a breakdown of how the app’s components map to the model.
5.3.2 Domain knowledge: the state structure (\Omega )
The State Structure represents the domain knowledge of the automation system. It provides a structured map of the problem space, defining the states, elements, and transitions within the GUI environment.
States (S): These are represented by the Home, World, and Island classes, which model discrete sections of the game’s interface;
Elements (E): These are the specific images and regions defined within state classes, such as the toWorldButton in the Home state or the nameRegion in the Island state;
Transitions (T): These are defined in the HomeTransitions, WorldTransitions, and IslandTransitions classes, which handle the logic for moving between states.
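The mapping above might be declared roughly as follows. This is a simplified illustrative sketch in Python, not the DoT app’s actual source; element names other than toWorldButton and nameRegion are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """A named collection of GUI elements (images, regions, ...)."""
    name: str
    elements: frozenset

home   = State("Home",   frozenset({"toWorldButton"}))
world  = State("World",  frozenset({"islandTile"}))    # hypothetical element
island = State("Island", frozenset({"nameRegion"}))

# Transitions: (source, target) -> ordered actions that perform the move.
transitions = {
    ("Home",  "World"):  ["click toWorldButton"],
    ("World", "Island"): ["click islandTile"],
    ("Island", "World"): ["click backButton"],         # hypothetical element
}
```

Declaring states and transitions as data, rather than embedding them in procedures, is what makes the State Structure an externalized, testable map of the environment.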
5.3.3 Strategic knowledge: the framework F
Strategic knowledge is provided by the framework, which implements problem-independent automation capabilities. The following components ensure efficient GUI interaction:
Action Model (a): Defines atomic operations for interacting with GUI elements;
State Management (M): Maintains the set of active states and their relationships;
Transition Model ((\tau )): Manages the execution of transitions between states;
Path Traversal Model ((\S )): Enables navigation through the GUI environment.
The application utilizes these components via API calls, eliminating the need to implement complex logic such as pathfinding.
5.3.4 Automation instructions ((\iota ))
Automation instructions define the application’s business logic, specifying the tasks to be executed. This logic is primarily implemented in the SaveLabeledImages class, which contains the main loop for navigating to islands and saving images. Additional functionality is provided by helper classes such as GetIslandType, which supports classification and retrieval tasks.
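The resulting division of labor can be illustrated with a schematic version of such a main loop. The callables goto_state, capture, and classify are hypothetical stand-ins for framework and helper calls, not the app’s actual API:

```python
def save_labeled_images(island_names, goto_state, capture, classify):
    """Business logic only: the framework's path traversal (goto_state)
    hides all navigation and recovery details from the application."""
    labeled = []
    for name in island_names:
        if not goto_state(name):   # framework finds and traverses a path
            continue               # island unreachable right now; move on
        image = capture()          # grab the current scene
        labeled.append((classify(image), image))
    return labeled
```

The loop contains no pathfinding, retries, or screen verification; those concerns live in the strategic-knowledge components of the framework.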
5.3.5 Design principles and component necessity
The Overall Model’s architecture addresses fundamental challenges in GUI automation through a separation-of-concerns principle. Each component serves a critical role; were any of them absent, the fragility and complexity problems of process-based approaches would resurface:
5.3.6 Visible GUI ((\Xi ))
The critical distinction of this model is its treatment of the Visible GUI ((\Xi )) as an explicit component and a verifiable ground truth. Unlike process-based approaches that use the screen as a transient input for sequential actions, this model enables a crucial feedback loop: it continuously compares its internal belief of active states ((S_{\Xi })) against the reality of the screen ((\Xi )). This verification allows the system to detect unexpected events and confirm transition outcomes. Lack of systematic verification is one of the main causes of fragility in process-based approaches, where automation often assumes an expected GUI state.
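This feedback loop amounts to a check of belief against observation. In the sketch below, states_on_screen is a hypothetical perception call that re-derives the active states from the current scene:

```python
def verify_belief(expected_states, states_on_screen):
    """Compare the internal belief S_Xi against the screen Xi.
    Returns the expected states that are missing, i.e. the evidence
    that the GUI has diverged from the model's assumption."""
    observed = states_on_screen()   # re-read ground truth from Xi
    return expected_states - observed

# A non-empty result signals an unexpected event and triggers recovery.
missing = verify_belief({"Home", "Menu"}, lambda: {"Home"})
```

A process-based script has no equivalent of `missing`: its only notion of state is the line it happens to be executing.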
5.3.7 State structure ((\Omega ))
Process-based approaches embed environment knowledge within action sequences, creating tight coupling. The State Structure externalizes this knowledge, enabling:
Independent testing of environment representation;
Reusable state definitions across multiple automation tasks;
Clear separation between “what exists” (states) and “how to navigate” (transitions).
Without (\Omega ), environment knowledge remains scattered across procedures, making maintenance exponentially complex as the automation grows.
5.3.8 Action model (a)
While process-based tools also perform actions, they lack a unified model for action execution and result interpretation. The Action Model establishes the following:
Interface Contract: The model defines a standardized result structure ((r_a)) for every action. This creates a consistent interface contract, ensuring that the success or failure of any action—regardless of its type—can be interpreted uniformly by other system components. It allows the state manager, for example, to reliably process the outcome without needing custom logic for each action.
Implementation Scope: The model abstracts environmental stochasticity ((\Theta )) from an action’s internal logic. This deliberately narrows the implementation scope of each action to its core task (e.g., clicking a coordinate, finding an image). The action’s code is not responsible for handling unpredictable events like pop-up windows or network lag; instead, the responsibility for handling the consequences of such events is elevated to the framework’s state management and path traversal components, preventing code duplication and centralizing error recovery logic.
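One way to realize such an interface contract is sketched below. The ActionResult fields and the find/send_click callables are illustrative assumptions, not the framework’s actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ActionResult:
    """Standardized result r_a shared by every action type."""
    success: bool
    matches: list = field(default_factory=list)  # elements located, if any

def click(element, find, send_click):
    """A click action narrowed to its core task: locate, then click.
    Pop-ups, network lag, etc. are NOT handled here; state management
    reacts to the uniform success/failure signal instead."""
    location = find(element)
    if location is None:
        return ActionResult(success=False)
    send_click(location)
    return ActionResult(success=True, matches=[location])
```

Because every action returns the same structure, a state manager can process outcomes uniformly without per-action logic.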
5.3.9 State management (M)
Process-based automation tracks the GUI’s state implicitly through the current position in the action sequence. For example, if line 15 of the script is executing, the GUI must be on the “Payment Details” screen. This method fails in complex scenarios because it lacks a formal, explicit representation of the GUI’s state. The State Management system (M) is necessary to overcome these challenges:
Handling Multiple Active States: A process-based tool lacks an architectural concept to represent the combination of ({StateA, StateB}) as a single, coherent “world view.” It only knows that two separate images were found. A dedicated State Management system maintains an explicit set of active states ((S_{\Xi })). This allows the framework to reason holistically, for example, by looking for transitions that are available from any of the currently active states, something that would require complex, custom, and brittle conditional logic to handle manually in a process-based script.
Recovering from Unexpected Events: The State Management system adjusts the active states using the results of transitions and certain actions. Having an updated set of active states gives the system an accurate understanding of its current context from which to attempt recovery.
Enabling Dynamic Pathfinding: Evaluating alternative paths requires knowing your current location. When a step in a rigid, sequential process fails, the script’s “location” becomes invalid. It has no reliable, independent knowledge of the GUI’s state to begin a new path. The State Management system acts as the reliable “You Are Here” marker on the map. After a failed transition, the Path Traversal Model can query M for the current, valid set of active states ((S_{\Xi })) and use that precise starting point to find a new, viable path to the target.
The State Management system maintains an explicit, dynamic awareness of all active states ((S_{\Xi })), enabling robust adaptation and recovery. Without M, the system cannot reliably reason about its current position in the GUI environment, a prerequisite for any complex or long-running automation task.
5.3.10 Transition model ((\tau )) and path traversal ((\S ))
The Transition Model manages the execution of defined action sequences ((\tau )) that move between states. The Path Traversal Model ((\S )) is responsible for the higher-level strategic task of finding a viable sequence of transitions to navigate from a current state to a target state. This separation of defining a transition from finding a path is a key architectural choice.
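This separation lets path traversal treat transitions purely as edges of a graph. A breadth-first sketch, in which any currently active state may serve as the starting point, might look like this (illustrative, not the framework’s actual algorithm):

```python
from collections import deque

def find_path(transitions, active_states, target):
    """BFS from the set of active states S_Xi to a target state.
    `transitions` is a set of (source, target) pairs."""
    frontier = deque((s, [s]) for s in active_states)
    visited = set(active_states)
    while frontier:
        state, path = frontier.popleft()
        if state == target:
            return path
        for src, dst in transitions:
            if src == state and dst not in visited:
                visited.add(dst)
                frontier.append((dst, path + [dst]))
    return None  # no viable path from any active state

path = find_path({("Home", "World"), ("World", "Island")}, {"Home"}, "Island")
# path == ["Home", "World", "Island"]
```

If a transition along the path fails, the same search can simply be rerun from the updated set of active states supplied by M.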
5.3.11 Design principle summary
In summary, the design’s core principle is the explicit separation of concerns. Domain-specific knowledge is externalized into an explicit map of the GUI environment (the State Structure, (\Omega )), while the problem-agnostic strategic knowledge (how to find, act, and navigate, implemented in (a, M, \tau , \S )) is encapsulated within the framework. This separation allows the framework to handle reusable, complex logic, while applications focus on defining the environment and their own business logic.
5.3.12 Addressing the robustness-code complexity trade-off
This architecture directly resolves the robustness-code complexity trade-off identified in Sect. 4.2:
5.3.13 Robustness through modularity
Robustness is achieved through the architecture’s modularity and its feedback mechanisms. Because the State Structure ((\Omega )) is decoupled from execution logic, GUI changes often require updating only a single, localized state or transition definition. Furthermore, the explicit state tracking provided by the State Management system (M) combined with verification against the Visible GUI ((\Xi )) allows the system to detect and recover from unexpected events, rather than failing when the GUI diverges from an implicit assumption.
5.3.14 Complexity reduction through abstraction
The path traversal component ((\S )) eliminates the need to explicitly code every possible path, fundamentally reducing development complexity. The process-based approach requires manually accounting for a number of paths that grows exponentially with path length. In contrast, the model-based developer must define only the n states and, assuming an average of m transitions per state, approximately (n \times m) transitions, reducing development effort to a manageable polynomial scale, (\mathcal {O}(nm)).
For the example in Sect. 10 with (n=30) states and (m=5) transitions per state, this is the difference between coding a number of paths on the order of trillions versus defining approximately (30 + (30 \times 5) = 180) states and transitions. This represents a shift from a practically impossible task to a well-defined and scalable one.
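The arithmetic behind this gap is easy to check. The path length of 18 below is an illustrative choice to show how quickly (m^k) passes the trillion mark, not a figure taken from Sect. 10:

```python
n, m = 30, 5                  # states, average transitions per state

model_effort = n + n * m      # states plus transitions to define
print(model_effort)           # 180

# Process-based: distinct paths of length k grow like m**k.
for k in (8, 12, 18):
    print(k, m ** k)
# m**18 = 3_814_697_265_625, already in the trillions
```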
5.4 The action model (a)
The Action Model serves as the fundamental building block of GUI automation, defining the atomic operations that can be performed on the graphical user interface. It encapsulates individual interactions, such as clicking a button or entering text, that collectively enable task automation.
An atomic action is a tuple (a = (o_a, E_a, \zeta )) comprising:
(o_a), parameters or options assoc