Introduction
Understanding the factors that affect the delivery of software at an organizational level offers businesses and engineering teams the knowledge to deliver value to end-users, maintain competitiveness, and improve developer experience. Given engineering teams’ fundamental role in software delivery, the velocity of their work—that is, the time it takes for task completion—has emerged as a focal point of empirical investigation, particularly through measures like cycle time, which captures the duration between ticket opening and ticket closing. Moreover, cycle time is seen by engineers as the most useful metric of engineering productivity, according to a prominent industry report (Carey 2024).
While cycle time is often treated as an indicator of productivity per se, the concept of productivity remains poorly specified in software engineering contexts, where outputs fundamentally differ from the more readily quantifiable measures used in traditional industrial production. Specific units of work are rarely identical across time for a person, within a team, or across teams. The interpretation of cycle time as a proxy for productivity therefore presents particular challenges because variations could reflect differences in work patterns, task assignment, task scoping, and organizational contexts rather than differences in some underlying rate of task completion.
Nevertheless, the intuitive appeal of cycle time and its widespread use in practice make it a valuable focus for empirical investigation. The above-mentioned complexities necessitate sophisticated statistical methods to detect the unique impact of multiple factors, while carefully characterizing the variability practitioners can expect in day-to-day and month-to-month observations of cycle time. Through rigorous statistical modeling of longitudinal data across multiple organizations, we can both characterize cycle time’s variability across real-world contexts and demonstrate methodological approaches for analyzing such complex operational metrics. This analysis also allows us to detect systematic influences from factors commonly believed to affect developer productivity: task scoping, focused work time, collaboration, and time of year.
Our investigation leverages a unique dataset comprising 11,398 contributors at 216 organizations across diverse industries. This work makes two primary contributions. First, we demonstrate a model for statistically investigating software activity data at both a larger and a more longitudinal scale than previous empirical research, allowing us to characterize how cycle time varies across software development contexts (i.e., individuals, organizations, and variable process factors). We do so using hierarchical modeling that appropriately separates individual and organizational variation, combined with careful disaggregation of within- and between-person effects. This approach gives us greater precision and nuance in describing effects, as well as the ability to highlight potential pitfalls in using such measurements to drive decision-making. Second, we incorporate multiple measures of process factors simultaneously to isolate unique effects, including a novel measure of collaboration operationalized as degree centrality, taking initial steps toward reflecting the interactive nature of software development in large-scale analyses of activity data.
Our research questions are:
RQ1. How do common workplace and software development process factors impact cycle time?
RQ2. How much between- and within-person variation is there in cycle time?
The paper proceeds as follows: We first review the literature on software productivity measurement, examining cycle time’s relationship to broader discussions of developer performance. We then present our methodology for analyzing cycle time variation using Bayesian hierarchical linear models. Our results examine both population-level effects and the substantial variation observed between individuals and organizations. We conclude by discussing implications for practice and future research directions.
Background
Productivity
The use of cycle time in the academic and industry literature is almost always as part of a discussion of productivity. This may be in part because cycle time and related metrics are one of the only so-called objective quantitative windows we have into the process of software production (but note that self-reports of perceived productivity are also potentially valid measures of this process). For this reason, it behooves us to discuss the literature on productivity, even as we position the analyses in this report as specifically analyzing what we consider to be at best a very distal indicator of whatever it is people mean when they use the word “productivity.”
Defining software team productivity and performance is a highly contentious exercise, and many different definitions are given by both practitioners and researchers (Fraser et al. 2007; C. Hicks, Lee, and Ramsey 2023; C. M. Hicks, Lee, and Ramsey 2024; Murphy-Hill et al. 2021; Sadowski, Storey, and Feldt 2019). Perceptions of what counts as successful software work can meaningfully differ across individuals and roles, as when engineering managers tend to focus on long-term outcomes and individual developers focus on activity, for example (C. Hicks, Lee, and Ramsey 2023; Storey, Houck, and Zimmermann 2022b). Across workplaces, measures of time have been frequently used to assess productivity even while the shortcomings of these measures are also widely acknowledged (Griffin 1993). Alternative measures include self-ratings or peer evaluations (Murphy-Hill et al. 2021; Ramírez and Nembhard 2004) and, in software engineering, operationalizations of code work such as lines of code (Blackburn, Scudder, and Van Wassenhove 1996; Maxwell, Van Wassenhove, and Dutta 1996). These have obvious limitations in that the meaning of a particular unit for any of these metrics may be different depending on context (Sadowski, Storey, and Feldt 2019). Some researchers have sought solutions to this problem by asking individuals to rate their own level of, or satisfaction with, productivity (C. Hicks, Lee, and Ramsey 2023; Storey et al. 2021). While it is plausible that perceived productivity could be a good indicator of productivity, it is still not free of the context effects that are often levied as critiques of more “objective” metrics, and self-reports, while perhaps overcoming some shortcomings of other methods, bring with them another set of measurement issues.
The difficulty of quantifying productivity arises even prior to the step of choosing one or several indicators. There is often a lack of clear distinction between production (quantity of output regardless of resources provided), productivity (quantity of output given the resources provided), and performance (flexibility, adaptability, dependability, sustainability, and quality of output over time) (C. Hicks, Lee, and Ramsey 2023). As any software developer will be aware, this conceptual complexity is likely the result of the various ways their work counts for professional development, for the success of the product, and for simply meeting deadlines. This piece of research does not aim to solve the issue of how we conceive of productivity but instead seeks to take a deep look at a single popular metric in order to showcase, first, the many factors (themselves, a subset of possible influences of productivity) that affect cycle time, and second, how observing this metric over time informs our view of the ways cycle time varies both within and between people. These views will be helpful both to illuminate specific properties of cycle time as a measure but also to demonstrate how one might approach an in-depth analysis of either “objective” or self-report metrics of productivity.
Evaluating individual developer performance
Given the difficulty of appropriately defining productivity, the many metrics that purport to measure it, and the potential cost to an individual (e.g., career, reputation) of being measured, it is understandable that software developers have an ambivalent stance about the measurement of both work activity and productivity, that metrics adoption can be fraught with failure (Bouwers, van Deursen, and Visser 2013), and that social or socio-technical affordances can be strongly associated with self-reported productivity and necessary to obtain a full picture of software team experience beyond project and technical metrics (C. Hicks, Lee, and Ramsey 2023; Murphy-Hill et al. 2021).
Developers whose teams use metrics generally see those metrics as helpful, and developers who report agreement about which team-level metrics are measured tend to report higher perceived productivity (C. Hicks, Lee, and Ramsey 2023; C. M. Hicks, Lee, and Ramsey 2024). However, paired with this are some indicators of uncertainty in whether and how metrics are being tracked or used (C. Hicks, Lee, and Ramsey 2023), there is often backlash against any attempt to define or popularize such metrics (Bruneaux 2024; Chhuneja 2024; Coté 2023; Finster 2023; Orosz 2024b, 2024a; Riggins 2023; Terhorst-North 2023b, 2023a; Walker 2023b, 2023a), and there is concern about mismeasurement by managers inside of organizations (C. Hicks, Lee, and Ramsey 2023), which is part of a broader discussion of surveillance and the discontent it can generate for workers (Ball 2010; Grisold et al. 2024; Mettler 2024). More troublingly, recent scholarship on sociocognitive experiences in the workplace has proposed that severe experiences of employees being treated by an organization as a “mere tool” or a resource may create organizational dehumanization, leading to many negative impacts on both well-being measures and work outcomes (Caesens et al. 2017; Lagios et al. 2022). Moreover, there is evidence that metrics might be used differently depending on a person’s visible identities (e.g., Quadlin 2018).
Likewise, scholarship on employee perceptions of organizational and procedural justice has long documented that when employees perceive a context of organizational injustice, this can exacerbate or redefine experiences of organizational decision-making and performance evaluations (Brockner et al. 1994, 2007). Given such larger organizational dynamics, it is likely that whether or not software metrics adoptions are successful is impacted not only by the choice of metric but also by larger contextual factors such as teams’ sociocognitive experiences and expectations around measurement, and the psychological affordances of their environments, which may or may not allow them to address measurement concerns (C. M. Hicks 2024).
We lack holistic evidence about what practitioners in software development believe about developer performance and ability; some reports from researchers with samples at large technology companies have suggested both that definitions of productivity can vary widely between managers and developers, and that software developers perceive many potential trade-offs between types of technical goals, e.g. that quality and speed may be unattainable together (Storey, Houck, and Zimmermann 2022a).
One “industry myth” which is referenced frequently in practitioner commentary is the idea of a “10x engineer”: this position alleges that some small outlier population of software developers consistently outperforms others on key development tasks. Potentially springing from small case studies examining a handful of developers’ time spent solving small laboratory tasks (Sackman, Erikson, and Grant 1968; discussed in Nichols 2019), this “law” was generalized from only twelve individuals, used time spent on the tasks as an estimate of both effort and cost, failed to replicate in larger examinations of developer performance on similar tasks, and failed to acknowledge large within-individual variation in task performance (Nichols 2019; Shrikanth et al. 2021).
Nevertheless, the idea that “10x engineers” exist and that some individuals in software engineering outperform others by a “rule” of 10x has been cited often and codified in industry commentary (e.g., Brooks 1975). Modern commentary on this idea frequently refers to it as a myth, but it is also discussed as a potentially real phenomenon1. In our previous work, we have noted that some software practitioners hold field-specific ability beliefs that software development success and productivity are attributable to a quality of “innate brilliance”, and that this belief among practitioners may create a higher likelihood of experiencing threat and anxiety in the face of rapid role change and technological shifts to developer workflows (C. M. Hicks, Lee, and Foster-Marks 2024). Broad reviews on drivers of software development outcomes, particularly frictions in the form of team “debt,” also suggest that social-psychological aspects of shared work processes may be a significant contributor to these outcomes separate from individual performance (Ahmad and Gustavsson 2024).
Despite some recognition that the 10x engineer is a problematic concept, the conflictual measurement of productivity and its use as a tool of surveillance and punishment contra the interests of individual contributors (but to the benefit, at least ostensibly, of a company’s profitability) continues with full-throated glee. A recent unpublished study claims that nearly 10% of engineers contribute almost no work; that is to say, it raises the boogeyman of the 0.1x engineer as the 10x engineer’s inverse2 (Obstbaum and Denisov-Blanch, n.d.). The measure of productivity used is something halfway between an objective measurement and self-report: an unspecified machine-learning model trained on expert ratings of the quality of, and work necessary to complete, 70 commits (Denisov-Blanch et al. 2024). Unlike prior work, this method lacks both the transparency of “objective” measures and the temperance of self-report measures.
In taking a deep dive into cycle time, this project does not address every implementation challenge and organizational affordance that may define whether organizations can ensure a healthy and sustainable practice around the measurement of work activity. However, we believe that a more robust understanding of the dynamics of cycle time may help practitioners avoid pitfalls in relying on velocity measures while evaluating software work. We hope to describe the complexity in a way that at least adds some clarity and aligns with the experience of software developers in practice.
Cycle Time
Because lower cycle times are thought to indicate faster delivery times and more efficient software processes, cycle time has long been taken as a key indicator of team health, developer productivity, and team efficiency (Clincy 2003; Agrawal and Chari 2007; Carmel 1995; Evers, Oehler, and Tucker 1998; Gupta and Souder 1998; Nan and Harter 2009; Ruvimova et al. 2022; Sadowski and Zimmermann 2019; Trendowicz and Münch 2009). This suggests that understanding the factors that influence cycle time may yield insights into productivity in general. At minimum, examining cycle time can provide a description of the complexity of factors that impact this popular metric.
Cycle time examines one aspect of the speed of software delivery by measuring the time between task start and task delivery. It has consistently been described by industry research as one of the best and most trusted metrics for software productivity (Carey 2024). In the same report, respondents also favored related metrics such as lead time, deploy frequency, and change failure rate. The broader software engineering community has emphasized similar constructs through the research program DevOps Research and Assessment (DORA), which identified four key measures of software delivery performance: lead time, deployment frequency, change failure rate, and mean time to recovery (Forsgren, Humble, and Kim 2018). While cycle time is not identical to these measures, it overlaps conceptually—particularly with lead time—in capturing aspects of delivery speed. Positioning cycle time alongside these measures situates it within a family of indicators concerned with the timeliness and reliability of software delivery, even though the operational definitions vary across contexts.
The common thread across these metrics is that the unit of work is defined by the team or company in relation to goals that serve the strategic interests of the project. While there is a good deal of nuance with respect to what goes into setting these units up, they are both discrete (and so “objective”-feeling) and defined, often collaboratively, with respect to the outcomes that matter. This is in contrast to lines of code, for example, which may or may not be relevant to the goals of the engineering teams, and which is avoided by 70% of respondents in the same industry report. Cycle time may also be considered an important part of developer experience as a component of what leads to a fluid-feeling development and release cycle (André N. Meyer et al. 2021).
In calls to re-examine the complexity of developer productivity, researchers have argued that velocity measures are highly task-dependent, and do not represent the quality of work done or other, longer-term measures of the impact of work (Sadowski, Storey, and Feldt 2019). It is also possible for velocity measures to have multiple directional relationships with desired outcomes depending on software developers’ larger context. For instance, hypothetically speaking, an increase in velocity may associate with more success for a software team when this increase arises because the team engages in process improvements, creating processes that help them to move more quickly through development tasks, and thereby meet a critical deadline for a product launch, leading to business outcomes which then lead to more resources for the team. However in a different scenario, an increase in velocity may be associated with more failures for a software team, for instance, if velocity changes arise because the team begins to eschew quality control processes, eventually leading to costly critical business failures.
Nevertheless, time and output-based measures are frequently used as an outcome measure to make recommendations for software engineering practices, e.g., in evaluating the perceived impact of technical debt (Besker, Martini, and Bosch 2018). These measures have the added benefit of a concrete referent that is simple, inexpensive, and convenient to collect for teams trying to track productivity.
The utility of cycle time has led numerous industry experts to recommend that engineering managers and leaders track their teams’ cycle times. However, leaders are given little guidance on how to analyze and decrease cycle time. As such, they are left with the dilemma of being aware of their cycle times but not understanding how to improve them in an evidence-based way.
In the literature that does directly address this question, four major areas have been proposed to impact cycle time: (1) organizational structure and climate, (2) reward system, (3) software development process, and (4) the use of software design and testing tools (Clincy 2003). We focus in this paper on factors from the third area, software development processes, in part because measurements of these processes continue to gather significant interest from the technology industry and are plausibly actionable levers that can be manipulated at the level of an engineering team. They are also themselves relatively easy to measure and track at the team level if a software team within a larger organization were to decide they wanted to shift their processes and take measurements to make sure they were successful. We have argued elsewhere that organizational structure and climate are also relatively easy to measure and are powerful levers that should be more often targeted (C. M. Hicks and Hevesi 2024; C. Hicks, Lee, and Ramsey 2023; C. M. Hicks, Lee, and Ramsey 2024), though for the present work we focus on (3), in part to keep the scope of this analysis manageable.
To reduce cycle times at the level of software development process, the software industry currently recommends strategies centered around three major themes:
- increased coding time
- improved task scoping
- improved collaboration
Industry convention holds that increased coding time increases the amount of code committed and the number of pull requests merged, thus moving tickets through their life cycle more quickly. Improved scoping can similarly yield more efficient teams by breaking work down into more manageable chunks and reducing the amount of unplanned work from bugs and defects. Finally, industry reports posit that improved collaboration can reduce the time it takes for developers to review PRs and increase review rates (Flow, n.d.; Gralha 2022; Waydev 2021). There has also been some empirical work supporting the idea that collaboration, under certain conditions, does improve productivity (Gousios, Pinzger, and Deursen 2014). We focus on these three areas as possible factors that impact cycle time.
Research design and methodology
Code for these analyses is available as `analyses.qmd` here: https://github.com/jflournoy/no-silver-bullets. Data are considered proprietary and are not available to be shared. This research used aggregated, anonymized GitHub activity data routinely collected through our company’s normal operations and permitted by our Terms of Service. No personal information was gathered specifically for this study, and strict protocols were followed to prevent re-identification of individuals or organizations. Because the dataset was pre-existing, fully anonymized, and did not involve direct interaction with human subjects, the research is exempt from IRB review under 45 CFR § 46.104(d)(4)(ii). All data was stored on secure systems with limited access, ensuring both data integrity and confidentiality.
Data Selection and Characteristics
To examine coding time, task scoping, and collaboration as predictors of cycle time over time, we centered our analysis on a large, real-world dataset of git and ticketing data. This dataset includes 55,619 observations across 12 months in 2022 from 11,398 users in 216 organizations of varying sizes and industries. We chose to use longitudinal data across 12 months because it allowed us to examine fluctuations within a person’s workflow as well as different stable tendencies between people. This data was available via partnerships between a software metrics tool3, which was incorporated into the workflows of real working software teams, and the 216 organizations which opted in to this tool at any point during the 12-month analytic window. Notably, because this tool was adopted at an organizational level (following partnership agreements that include organizational opt-in and security audits), users themselves did not have to be active users of the software metrics tool in order to be included in this dataset, and git and ticketing data were available retrospectively for dates prior to the implementation of the tool in the organization. In other words, the git and ticketing data included in this analysis are not predicated on being an individual user of the software metrics tool, nor on the tool being used at the organization: our dataset contains measures both before and after the tool’s implementation at each organization, and implementation dates vary across the 12-month period.
Data were selected for analysis based on whether users actively contributed code during the time frame of the study. The 216 organizations each had between 1 and 2,746 individuals in the dataset, with 90% of organizations being represented by more than 12 users (Median = 130; Figure 1). In previous pilot surveys used to inform the design of this project, professional software developer users from these organizations described their main industries as Technology, Finance, Government, Insurance, Retail, and others, indicating that a wide diversity of business use cases and engineering contexts were present in this sample.
Figure 1: Organization sizes clustered around 130 users, with a long tail of larger organizations. Note that “users” generally refers to developers or other individuals creating and closing tickets.
Computing study variables
Using the most complete data for each user, we used the mean to aggregate each variable at the month level and the year level (see below for more details specific to each variable). For each predictor, we then subtracted each person’s yearly average from their monthly data to produce a within-person deviation variable. This allowed us to disaggregate effects on the outcome due to yearly-level individual differences from effects due to within-person, month-to-month fluctuations (Curran and Bauer 2011). It also allowed us to avoid averaging between-person and within-person differences into a single effect estimate. These effects can differ even in sign: for example, a time-invariant factor may have a positive relationship with the outcome of interest at the between-person level, while the same factor measured across time has a negative relationship with within-person variation over time. A common example that is highly relevant to most technical and knowledge workers is typing speed and errors. Imagine someone trying to type as fast as they can; they will make more errors the faster they type, evincing a positive within-person association between speed and errors. However, if one simply measures the typing speed and error rate of many people, we would likely find that faster typists tend to make fewer errors, perhaps because of differences in typing experience. In this study, we want to be able to examine average differences between people’s cycle time aggregated at the year level while also examining what is associated with month-to-month deviations of cycle time from that yearly average. All year-level individual-differences variables were centered at their mean. Exceptions or addenda are mentioned below. See Table 1 for a brief list of variables.
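As an illustration, this disaggregation amounts to a simple person-mean centering step. The sketch below uses data.table (which the analysis also loads) and hypothetical column names; the actual computation is in `analyses.qmd`.

```r
library(data.table)

# A minimal sketch of the within/between disaggregation, with hypothetical
# column names (`user`, `month`, `coding_days`).
monthly <- data.table(
  user = rep(c("dev_a", "dev_b"), each = 3),
  month = rep(1:3, times = 2),
  coding_days = c(2.1, 3.4, 2.8, 4.0, 4.5, 3.9)
)

# Year-level (between-person) mean for each user ...
monthly[, coding_days_mean := mean(coding_days), by = user]
# ... and the month-level (within-person) deviation from that mean.
monthly[, coding_days_dev := coding_days - coding_days_mean]
# Grand-mean center the between-person variable (here weighting each
# observation equally, a simplification).
monthly[, coding_days_mean_c := coding_days_mean - mean(coding_days_mean)]
```

Entering `coding_days_mean_c` and `coding_days_dev` as separate predictors keeps the between-person and within-person associations from being collapsed into a single coefficient.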
Cycle Time
This is the dependent variable in these analyses. After computing the cycle time for each closed ticket in seconds, we found the median cycle time for each month for each user, using all tickets opened in that month. For example, a ticket opened on the 9th of April and closed on the 3rd of May would contribute 2,073,600 seconds (24 days) to the calculation of the median for April. Depending on how organizations actually use tickets in practice, work may already have begun before a ticket is opened.
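A sketch of this computation, assuming a ticket-level table with hypothetical columns `user`, `opened_at`, and `closed_at`:

```r
library(data.table)

# Hypothetical ticket-level data; the first row reproduces the example above
# (opened April 9, closed May 3; 24 days = 2,073,600 seconds).
tickets <- data.table(
  user = c("dev_a", "dev_a"),
  opened_at = as.POSIXct(c("2022-04-09 09:00", "2022-04-20 09:00"), tz = "UTC"),
  closed_at = as.POSIXct(c("2022-05-03 09:00", "2022-04-22 17:00"), tz = "UTC")
)
tickets[, cycle_time_s := as.numeric(difftime(closed_at, opened_at, units = "secs"))]
# Tickets are attributed to the month in which they were opened.
tickets[, month := format(opened_at, "%Y-%m")]

monthly_cycle <- tickets[, .(median_cycle_time = median(cycle_time_s)), by = .(user, month)]
```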
Unclosed Tickets
We were not able to observe the closing date for every ticket given our data collection cutoff of March 7, 2023, and so it is plausible that we underestimate the median cycle time in a way that depends in part on how many ticket closing times we do not observe. For this reason, we also computed the proportion of tickets opened in that month that had not been closed by the end of our data collection. For example, any ticket opened in April, 2022 but not closed by March 7, 2023 would count toward the proportion of unclosed tickets for that month. We transformed proportions from \([0,1]\) to \((-\infty, \infty)\) using the logistic quantile function (with minimum and maximum proportions forced to be .01 and .99 respectively). We use this in the regressions below as a control variable to adjust for this possibility.
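Concretely, this is a standard logit transform with clamping; a sketch with a hypothetical proportion vector:

```r
# Hypothetical monthly proportions of unclosed tickets, including the
# boundary values that motivate the clamping.
p_unclosed <- c(0, 0.12, 0.5, 1)

# Clamp to [.01, .99], then apply the logistic quantile function (logit).
p_clamped <- pmin(pmax(p_unclosed, 0.01), 0.99)
unclosed_logit <- qlogis(p_clamped)
```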
Time (Month, and within-quarter month)
We examined time in two ways: monthly and quarterly. Months were represented as numeric values (i.e., January = 1, February = 2) and centered at month 7, which allows us to interpret certain quantities, like the intercept, as the average cycle time in the middle of the year. Additionally, quarters provide meaningful business cadences that may impact engineering work: some organizations set quarterly goals at the beginning of each quarter and push to meet those goals at the end, and key product deadlines may occur systematically toward the end of quarters. We therefore accounted for any effects of quarterly cycles by using an indicator for the within-quarter month, centered at the middle of the quarter (e.g., -1 for the first month of the quarter, 0 for the middle month, and 1 for the last month of the quarter). This approach allowed us to capture a more stable and realistic trajectory of change over the course of the year.
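Both time variables can be derived directly from the month number; a sketch with hypothetical names:

```r
# Month as a numeric value (January = 1, ..., December = 12).
month <- 1:12

# Centered at July so the intercept reflects mid-year cycle time.
month_c <- month - 7

# Within-quarter month indicator: -1 (first), 0 (middle), 1 (last month of quarter).
within_quarter_month <- ((month - 1) %% 3) - 1
```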
Team Size
To control for any influence of team size on cycle time, we compute each individual’s team size as the average size of all teams that individual belongs to as defined by individuals’ co-located activity data. Specifically, in the database used, an individual contributor is given membership in any team that they have worked in, and this is updated retroactively. For each individual, we find all teams that person is a member of, compute the size of that team, and then average across those team sizes if an individual is a member of multiple teams. As such, this number is a very rough indicator of the size of teams an individual tends to be a part of and is static across the year. This is a limitation of the database. This is then entered as an individually-varying continuous variable to control for some of the effect of team size on an individual’s cycle time.
Coding days
Coding days was summarized as the average number of days per week that a developer made at least one commit. We divided the number of coding days in a month by the total number of days in that month and multiplied by seven to aid in interpretation. Based on conversations with software developers, we understand that making small commits frequently is often considered best-practice, but the fact that commits can be made independent of actual coding time means that our proxy measure for coding days is imperfect.
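A sketch of this measure, assuming a commit-level table with hypothetical columns `user` and `commit_date`, and using lubridate only to look up the number of days in the month:

```r
library(data.table)

# Hypothetical commit-level data for one developer.
commits <- data.table(
  user = "dev_a",
  commit_date = as.Date(c("2022-04-01", "2022-04-01", "2022-04-04", "2022-04-11"))
)
commits[, month := format(commit_date, "%Y-%m")]

# Distinct days with at least one commit, scaled to a per-week rate.
coding_days <- commits[, .(
  coding_days_per_week = uniqueN(commit_date) /
    as.integer(lubridate::days_in_month(commit_date[1])) * 7
), by = .(user, month)]
```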
Total Merged PRs
One frequently proposed best practice in software work, intended to lead to outcomes such as improved task scoping, involves breaking work into smaller and more manageable chunks or pull requests that can be finished more quickly (Kudrjavets, Nagappan, and Rastogi 2022; Lines 2023; Riosa 2019; Zhang et al. 2022). For a given software development goal, if we assume the set of commits necessary to accomplish that goal remains the same, more pull requests suggest that the task was broken down into smaller discrete goals in a way that groups closely related subtasks together, rather than into one large pull request. Making smaller, more frequent pull requests is also itself a way to break up the task for code reviewers in a way that is thought to improve productivity. As such, we used the number of total merged pull requests as one measure of task scoping. To calculate this, we counted the number of merged pull requests for each user for each month.
Percent Defect Tickets
Another potentially beneficial signal in software activity data is the reduction of unplanned work on bugs and defect tickets, which is also proposed as a bottleneck on improving cycle time (Paudel et al. 2024; Rosser and Norton 2021; Toxboe 2023). As such, we used the percentage of defect tickets as another measure of task scoping to represent unplanned work that may interfere with timely completion of planned work. This may also be a downstream signal of individuals’ opportunity for focused work time and code quality. To account for this possibility, for each user, for each month, we computed the percent of tickets that were defect tickets.
Degree centrality
We measured collaboration by calculating degree centrality. To evaluate degree centrality, a metric derived from network analysis and often used in the analysis of social networks (Watts 2004), we employed a framework where developers were treated as nodes within the network, and their interactions in the form of Pull Requests (PRs) were regarded as connections. In other words, any contribution of code to the same pull request constituted a collaboration edge between developers. We normalized each centrality value by dividing by the total number of developers constituting the organizational network. The calculations were executed using the Python package Networkx (Hagberg, Schult, and Swart 2008). This particular variable serves as an effective proxy for quantifying the extent of collaboration among developers. We multiply the normalized degree centrality, which is between 0 and 1, by 100.
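The calculation in the analysis was done in Python with NetworkX; purely for illustration, an equivalent sketch in R using igraph, with a hypothetical edge list of developers who contributed to the same PR:

```r
library(igraph)

# Hypothetical collaboration edges: each row links two developers who
# contributed code to the same pull request.
pr_edges <- data.frame(
  from = c("dev_a", "dev_a", "dev_b"),
  to   = c("dev_b", "dev_c", "dev_c")
)
g <- graph_from_data_frame(pr_edges, directed = FALSE)

# Degree centrality normalized by the number of developers in the
# organizational network (as described above), then scaled by 100.
centrality <- 100 * degree(g) / vcount(g)
```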
Analytic Approach
The models described below are fit using `brms` (v2.21.6, Bürkner 2018, 2017), which interfaces with the Stan probabilistic programming language for Bayesian sampling (v2.35.0, Team 2024), via the `cmdstanr` backend (v0.8.0, Gabry et al. 2024), in R (v4.3.2, R Core Team 2023).
We developed a model of monthly average ticket cycle time conditional on the following predictors: within-quarter month number, team size, proportion of unclosed tickets, month number, yearly means and month-level deviations for coding days per week, total merged PRs, defect ticket percentage, degree centrality, and comments per PR. Specifically, we modeled cycle time as distributed Weibull with two parameters, \(\lambda\) (scale) and k (shape). The Weibull distribution is often used to model time-to-event data (Harrell 2015; Rummel 2017), where k determines the change over time in the probability of an event occurring (often called the “hazard rate”), and where \(\lambda\) determines the time-to-event for some proportion of the cases (or in other words, how spread out the distribution is). For simplicity, we assume that the shape (hazard rate, k) is not influenced by the factors considered, and focus on how these factors affect the scale (time-to-event, \(\lambda\)) of ticket closures, though we did allow the shape, k, to vary across organizations. In short, the Weibull distribution provides flexibility for accurately describing cycle time data that tend to have a bulk of observations at the low end, with a very long tail of more extreme observations.
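For reference, under the standard scale–shape parameterization assumed in this description, the Weibull density and mean for a cycle time \(t > 0\) are

\[
f(t \mid \lambda, k) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} e^{-(t/\lambda)^{k}},
\qquad
\mathbb{E}[t] = \lambda\,\Gamma\!\left(1 + \tfrac{1}{k}\right),
\]

so that \(\lambda\) is the time by which approximately 63.2% of tickets are expected to close (since \(F(\lambda) = 1 - e^{-1}\)), regardless of k.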
The model for \(\lambda\) is

\[
\begin{aligned}
\log(\lambda) &= X\beta + \eta_{\text{org}} + \eta_{\text{org:user}} \\
\eta_{\text{org}} &\sim \mathcal{N}\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}\right) \\
\eta_{\text{org:user}} &\sim \mathcal{N}\left(\begin{bmatrix} \mu_3 \\ \mu_4 \end{bmatrix}, \begin{bmatrix} \sigma_{33} & \sigma_{34} \\ \sigma_{43} & \sigma_{44} \end{bmatrix}\right)
\end{aligned}
\]
where \(X\) is the matrix of predictors, \(\beta\) is the vector of coefficients, \(\eta_{\text{org}}\) comprises random intercepts (with mean \(\mu_1\)) and linear slopes of month (with mean \(\mu_2\)) for each organization, and \(\eta_{\text{org:user}}\) comprises random intercepts (with mean \(\mu_3\)) and linear slopes of month (with mean \(\mu_4\)) for each user nested within organization. The specific predictors in \(X\) are within-quarter month number, team size, proportion of unclosed tickets, month number, and yearly means and month-level deviations for coding days per week, total merged PRs, defect ticket percentage, degree centrality, and comments per PR. We also include interactions between month number and the following: team size, proportion of unclosed tickets, and each of the yearly mean predictors. This allows us to account as completely as possible for our control variables (team size and proportion of unclosed tickets) and to allow the effect of month on cycle time to vary by the individual-differences variables (e.g., to account for the possibility that someone with more coding days per week shows a less steep decrease in cycle time across the year than someone with fewer coding days per week).
The model for k is

\[
\begin{aligned}
\log(k) &= \zeta_{\text{org}} \\
\zeta_{\text{org}} &\sim \mathcal{N}(\mu_5, \sigma_{5})
\end{aligned}
\]

where \(\zeta_{\text{org}}\) is a random intercept with mean \(\mu_5\) for each organization.
Conceptually, this model allows a unique distribution of cycle times (as determined by the random intercepts for both \(\lambda\) and k) for each organization. It also allows the scale of the distribution of cycle times to vary for each user, due to the random intercept for \(\lambda\). The effect of time (month number) on the scale of the distribution of cycle times is also allowed to vary across organizations as well as users, due to the random slopes (with means \(\mu_2\) and \(\mu_4\)). This strategy provides two advantages: first, we account for multiple sources of variance, which allows our estimates of the effects of interest to be more precise; and second, we are able to provide estimates of this variation across organizations and users. This variation is itself of interest given the various myths about developer performance mentioned in the introduction.
We model the effects of proportion of unclosed tickets and month number as smooth functions of these covariates using thin-plate splines for increased flexibility (Wood 2017). Briefly, thin-plate splines (functions made up of smoothly connected segments) allow for flexible, non-linear relationships between predictors and the response variable. These splines are penalized to prevent overfitting, balancing model flexibility and complexity. The interactions between month number and our control variables are parameterized as additional smooth functions of month number multiplied by these variables. While our focal model parameterizes the interactions between year-level means and month number as linear coefficients on multiplicative combinations of the two variables, we also examined a model that uses additional smooth functions of month number multiplied by these variables to allow for additional complexity. We provide the model output for this sensitivity analysis in a supplement.
We set weakly-informative priors centered at zero for all parameters, except for the intercepts for \(\lambda\) and k, which were centered on their approximate values in the data (consistent with the default behavior of `brms`). We performed prior-predictive checks to ensure our prior specification generated data that covered and exceeded the space of our observations. Given the complexity of the model, we also specified initialization of parameters at small plausible values (e.g., zero for coefficients, .1 for standard deviations of random effects). Full prior and initialization specifications are available in the analysis code.
We sampled from 4 chains with 2,000 total iterations each, discarding the first 1,000 iterations as warmup. Inferences were made on 4,000 post-warmup draws from the posterior probability distribution from the 4 chains.
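For orientation, the following is a minimal sketch of this specification in `brms`, using assumed variable names rather than those in `analyses.qmd`. Note that the weakly-informative priors and initial values described above are omitted for brevity, and that brms’s `weibull()` family places the linear predictor on the distribution’s mean (log link) rather than on \(\lambda\) directly, so this is an approximation of the parameterization written out above.

```r
library(brms)

# A minimal sketch of the focal model; variable names are assumed and the
# exact specification (priors, initial values, full interaction structure)
# lives in analyses.qmd.
fit <- brm(
  bf(
    median_cycle_time ~
      within_quarter_month + team_size +
      s(month_c) + s(unclosed_logit) +
      s(month_c, by = team_size) + s(month_c, by = unclosed_logit) +
      coding_days_dev + coding_days_mean + coding_days_mean:month_c +
      merged_prs_dev  + merged_prs_mean  + merged_prs_mean:month_c +
      defect_pct_dev  + defect_pct_mean  + defect_pct_mean:month_c +
      centrality_dev  + centrality_mean  + centrality_mean:month_c +
      comments_pr_dev + comments_pr_mean + comments_pr_mean:month_c +
      (1 + month_c | org) + (1 + month_c | org:user),
    shape ~ (1 | org)            # shape (k) varies by organization
  ),
  family = weibull(),
  data = monthly,
  chains = 4, iter = 2000, warmup = 1000,
  backend = "cmdstanr"
)
```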
Inferences
We take a Bayesian approach to making claims about the sign of effects (i.e., whether an association between two variables is positive or negative) and to describing their magnitude. Instead of the common but fraught frequentist approach of describing whether an effect size is unlikely given the assumption of an unrealistic point-null hypothesis, we try to give the reader a sense of the actual probability that the sign of an effect is in a particular direction, and of the impact of the factor on cycle times in terms that are easy to interpret (Gelman and Carlin 2014).
In more precise statistical terms, unless otherwise stated we describe the posterior of parameters and predictions using the median of the distribution, and characterize its variation using the highest posterior density interval (HDI) which is defined as the interval that contains a specified percentage (usually 95%) of the most probable values of the parameter (Kruschke 2018). We make general descriptive inferences based on the probability that a parameter has the sign of the posterior density’s median value. For example, if 80% of the posterior density of the slope of the effect of month on cycle time is of the same sign as the density’s median, and that median is negative, we would say something like, “given the model and the data, there is an 80% chance that there is a decrease in cycle times across the year.”
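As a sketch of how these quantities can be computed from the posterior draws, assuming the fitted object from the sketch above and a hypothetical coefficient name `b_month_c`:

```r
library(posterior)
library(tidybayes)

# Extract posterior draws; `b_month_c` is a hypothetical coefficient name.
draws <- as_draws_df(fit)
x <- draws$b_month_c

# Posterior median and 95% highest-density interval.
median_hdi(x, .width = 0.95)

# Probability that the effect shares the sign of its posterior median.
mean(sign(x) == sign(median(x)))
```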
R packages
R packages explicitly loaded in this analysis and manuscript preparation include brms (v2.22.7, Bürkner 2018, 2021, 2017), cmdstanr (v0.8.0, Gabry et al. 2024), data.table (v1.15.4, Barrett et al. 2024), ggplot2 (v3.5.0, Wickham 2016), flextable (v0.9.5, Gohel and Skintzos 2024), knitr (v1.46, Xie 2015, 2014, 2024), marginaleffects (v0.19.0, Arel-Bundock 2024), mgcv (v1.9.0, Wood 2011, 2017, 2004, 2003; Wood, Pya, and Saefken 2016), parameters (v0.21.6, Lüdecke et al. 2020), patchwork (v1.2.0, Pedersen 2024), posterior (v1.5.0, Bürkner et al. 2023; Vehtari et al. 2021), rlang (v1.1.3, Henry and Wickham 2024), scales (v1.3.0, Wickham, Pedersen, and Seidel 2023), scico (v1.5.0.9000, Pedersen and Crameri 2025), showtext (v0.9.7, Qiu and details. 2024), StanHeaders (v2.36.0.9000, Stan Development Team 2020), and tidybayes (v3.0.6, Kay 2023).
Results
Results from the linear model reported below were highly similar to those in the more flexible non-linear model sensitivity analysis described above. Also note that parameters in the table are from a linear model for \(\log(\lambda)\) and \(\log(k)\), while model expectations are on the response scale and can therefore display curvature even though the model is linear.
The first section of the results concerns the population-level effects and showcases the expectations of cycle time conditional on the various co-varying factors we target. The second section explores the variability in these effects across time, across individuals, and across organizations.
Population-level effects
Table 2: Population-level effect estimates

| Parameter | Posterior Median¹ | Lower 95% HDI² | Upper 95% HDI² | Sign Probability³ |
|---|---|---|---|---|
| Intercept log(λ) | 14.3484 | 14.2727 | 14.4282 | 100% |
| Intercept log(k) | 0.1214 | 0.0807 | 0.1585 | 100% |
| Within-quarter month | -0.0085 | -0.0188 | 0.0013 | 95% |
| Team size | 0.0001 | -0.0644 | 0.0560 | 50% |
| Avg. coding days/week (within-person) | -0.0794 | -0.0911 | -0.0677 | 100% |
| Avg. coding days/week | -0.0839 | -0.1100 | -0.0587 | 100% |
| Total merged PRs (within-person) | -0.0127 | -0.0155 | -0.0097 | 100% |
| Total merged PRs | -0.0083 | -0.0139 | -0.0027 | 100% |
| Defect tickets % (within-person) | -0.0019 | -0.0023 | -0.0014 | 100% |
| Defect tickets % | 0.0060 | 0.0049 | 0.0070 | 100% |
| Degree centrality (within-person) | -0.0023 | -0.0040 | -0.0006 | 100% |
| Degree centrality | -0.0040 | -0.0063 | -0.0015 | 100% |
| Comments per PR (within-person) | 0.0046 | 0.0037 | 0.0054 | 100% |
| Comments per PR | 0.0098 | 0.0075 | 0.0120 | 100% |
| Avg. coding days/week × Month | -0.0047 | -0.0098 | -0.0001 | 97% |
| Total merged PRs × Month | -0.0007 | -0.0017 | 0.0003 | 90% |
| Defect tickets % × Month | -0.0001 | -0.0003 | 0.0001 | 78% |
| Degree centrality × Month | -0.0001 | -0.0005 | 0.0002 | 71% |
| Comments per PR × Month | 0.0001 | -0.0004 | 0.0005 | 59% |

¹ Median of the posterior distribution, used as point estimate. ² 95% Highest Density Interval, containing the most probable parameter values with 95% posterior probability mass. ³ Probability that the effect is in the reported direction, calculated as the proportion of posterior samples with the same sign as the point estimate.
We find that all measured factors, both individual-difference and within-person deviation