Everyone wants metrics nowadays. If you open the majority of Scrum Master job descriptions, you will see that most of them mention, most of the time, at least velocity. While velocity is nice... ok, no, I wouldn’t pretend it’s “nice” or meaningful, but that’s not the topic of this post.
So, metrics. There are some traditional metrics, like cycle time, and some emerging metrics, like DORA (fun to call the set of metrics “emerging” when they have been known for at least 10 years already), and there are highly recommended but rarely used metrics like flow efficiency. I don’t want to write yet another article enumerating the metrics a team can use, only to join the plethora of similar articles that get forgotten immediately after you read them because “these metrics are not supported by JIRA.” Let’s talk about the practical side instead: what decisions can we make purely on data? These are sample cases that can help Scrum Masters better understand and evaluate their teams. There is more to data than this, but time is precious, so I will show what I use first when I join a new team.
Most of these cases require data that is not available in JIRA out of the box. You will need to either install a plugin like eazyBI or write a script to pull the data from JIRA via the API. I recommend the second approach, as installing plugins is not universally possible for Scrum Masters, and a script is easy to write with rudimentary Python or JavaScript knowledge, with some help from ChatGPT. No, really, I’m not flexing, it’s THAT easy; I bet it will not take you more than a couple of hours to create one (and a couple of weeks to actually connect it to your Enterprise JIRA with SSO enabled).
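To make this concrete, here is a minimal stdlib-only sketch of such a pull script. Everything instance-specific is an assumption you must adapt: the base URL, the bearer token (Jira Data Center style auth), the JQL, and the story point field id (`customfield_10016` is a common default, but yours may differ). The business-day counter is the one reusable piece.

```python
# Sketch of a minimal JIRA pull script. BASE_URL, TOKEN, the JQL, and the
# story point field id (customfield_10016) are PLACEHOLDERS for your setup.
import json
import urllib.parse
import urllib.request
from datetime import date, timedelta


def fetch_issues(base_url: str, token: str, jql: str) -> list[dict]:
    """Fetch one page of issues matching the JQL (loop over startAt for more)."""
    query = urllib.parse.urlencode({
        "jql": jql,
        "maxResults": 100,
        "fields": "customfield_10016,created,resolutiondate",  # assumed field id
    })
    req = urllib.request.Request(
        f"{base_url}/rest/api/2/search?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["issues"]


def business_days(start: date, end: date) -> int:
    """Cycle time in business days: Mondays-Fridays from start to end, inclusive."""
    return sum(
        1
        for n in range((end - start).days + 1)
        if (start + timedelta(days=n)).weekday() < 5
    )
```

From there, mapping each issue to a `(size, cycle_time)` pair is a loop over the response; pagination and SSO quirks are where the “couple of weeks” part hides.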
There is an unspoken assumption about story points in every team: the size should correlate with complexity. There is one more unspoken assumption: the more complex it is, the more time it takes. The more challenging items should be bigger, and bigger items should be longer, right? Right?
The issue with these assumptions is not that “that’s not how Story Points are supposed to work!” (I reserve the right to write a separate ramble on how Story Points are supposed to work, how teams expect them to work, and how they actually work, but that will be a separate text). The issue is that we assume these things without verification and act on that belief. So let’s validate it.
Collecting the data first. We need a set of items from your team’s completed sprints: the cycle time (I recommend “business days” as the duration unit) and the size for each item. Six to seven sprints will be enough, and I recommend you select “typical,” stable sprints: ones where your team was not shaken by unexpected people changes, major production incidents, “we have something new to work on, so drop everything and jump right here” situations, etc. If you are within SAFe, do not include IP sprints. Don’t be too selective either; if none of your last three sprints fit the “stable” category, maybe this analysis is not what you should focus on right now. If your sprint bandwidth is low, aim for at least 75 completed items instead.
Visualizing the data. There are multiple ways to display data distribution, and my favorite is a box plot. A box plot is a somewhat complicated graph, but still readable:
“Whiskers” on the top and the bottom show the data boundaries, basically the minimum and maximum data points, with outliers cleaned out
“Box” is where the majority of the data is
“Dots” above and below the graph are the outliers, some data points that do not fit into the picture
The line in the middle of the box is a “median”: half of the values are smaller than it, half are bigger
Build a box plot for each story point size in your data set. Optionally, draw a line at 10 (the number of business days in your sprint) for visual reference. The simplest way to build this is Excel, since box plots are among the “statistical” charts available there. Create a spreadsheet with story point sizes as column headers and cycle time values as column contents, and you will get the chart (a PivotTable will help you here).
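If Excel is not your tool, the same numbers take a few lines of Python. This sketch computes the box plot parts described above (median, box, whiskers at 1.5 × IQR, outliers) with the standard library only; the `samples` dict is made-up illustration data, and `matplotlib.pyplot.boxplot` would draw the same thing from the raw lists.

```python
# Sketch: box-plot statistics per story point size, stdlib only.
# `samples` maps size -> cycle times in business days (illustration data).
from statistics import quantiles


def box_stats(cycle_times: list[float]) -> dict:
    q1, q2, q3 = quantiles(cycle_times, n=4)     # quartiles; q2 is the median
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # whisker limits (1.5 x IQR)
    inside = [x for x in cycle_times if lo <= x <= hi]
    return {
        "median": q2,                             # the line in the middle
        "box": (q1, q3),                          # where the majority of data is
        "whiskers": (min(inside), max(inside)),   # data boundaries sans outliers
        "outliers": [x for x in cycle_times if x < lo or x > hi],
    }


samples = {3: [2, 3, 3, 4, 5, 6, 14], 5: [4, 5, 6, 8, 9, 11, 12]}
stats = {size: box_stats(times) for size, times in samples.items()}
```

One box per size, side by side, and you have the chart the next sections interpret.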
First, do the boxes differ much? This is my favorite thing to check, because if your assumption about the relationship between complexity and time is right, you will see the story point boxes form something like a ladder on the graph. My experience screams: you will not see a ladder. In the best-case scenario, you will see the median lines forming a ladder, which indicates that at least some relationship between size and cycle time exists.
Let’s look at the differences between box sizes. Look at neighboring boxes and see whether there is any significant difference in size and segment distribution. If there is nothing significant, there is no significant difference in how your team works with these sizes, either. My experience again: you will see a difference between 1 and 5 and between 5 and 8, less between 1 and 3 and between 3 and 5, and no meaningful difference between 1 and 2 or between 2 and 3. Look at these boxes again and think hard: was the 20 minutes you last spent arguing over whether a story is a 3 or a 5 time well spent? (The answer is not that simple, btw; size is not ONLY duration.)
Let’s look at the distribution now. The “taller” the box, the greater the range of values in it. Large variance (a large range) means that items in the category are not that similar, at least in duration. That is not inherently good or bad, but it is something to investigate further: is something off with this size, for example, is it a “default” size the team picks when it does not actually know but does not want to decide? What is concerning is when all of your sizes have a large spread: if your 3-pointer can take 1 day or 20 days with equal probability, your sprint plans are unlikely to be reliable.
And one more thing: give the graph a quick glance; is anything off? Does your 1-pointer have a larger distribution and a higher “max” point than your 8-pointer? Is there a box so large that the other boxes disappear from the chart? Is there a box with a whole collection of “outliers”? These are indicators of irregularities that cannot be easily described as a “typical” pattern, and you need to verify them item by item.
Let’s work with the new assumption: the larger the story point size, the greater the risk associated with the item. To verify it, we will need the same data, one definition, and a new, much simpler chart.
Let’s define a carryover for this analysis: any item that exceeds the sprint duration is a carryover. If your sprint is 10 business days, any item with a cycle time longer than 10 days is a carryover. Yes, there might be cases where carryover items have a cycle time of less than 10 days, but “proper” carryover analysis requires a different dataset, and for this analysis, missing some carryover items is ok.
Calculate the carryover %. It’s straightforward: what % of items of a certain size has a cycle time greater than 10 days? Excel can do it quickly and easily.
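As a script, the calculation is a short grouping pass. In this sketch, `items` is a list of hypothetical `(size, cycle_time_in_business_days)` pairs from your export, and the 10-day sprint length is the assumption from the carryover definition above.

```python
# Sketch: carryover % per story point size. An item "carries over" when its
# cycle time exceeds the sprint length (10 business days assumed here).
from collections import defaultdict

SPRINT_DAYS = 10


def carryover_pct(items: list[tuple[int, int]]) -> dict[int, float]:
    """Map each size to the % of its items whose cycle time exceeds the sprint."""
    totals, over = defaultdict(int), defaultdict(int)
    for size, cycle_time in items:
        totals[size] += 1
        if cycle_time > SPRINT_DAYS:
            over[size] += 1
    return {size: round(100 * over[size] / totals[size], 1) for size in totals}
```

The output pairs (size, carryover %) are exactly the two columns the stacked bar chart below needs.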
Now, let’s build a chart: a simple stacked bar chart, with bar heights at 100%, “carryover” at the bottom, and “not carryover” on top. You can use Excel for this as well: you will need three columns (size, carryover, and remaining, where “remaining” is “100 - carryover”); select “stacked bar chart” as the chart type, and you’ll get it.
This one is easier because the chart has few parts. Do you see a “ladder” where smaller items have less carryover and bigger ones more? If yes, then the assumption “bigger item = more risk” holds.
But what if not? Well, there might be multiple deviations, like a spike in a random story-point-size bucket. There might be multiple reasons for such deviations, from data anomalies to process issues; you will have to investigate yourself, as this chart is just an indicator.
This chart serves a purpose beyond story point analysis; it can inform your team in adjusting its definition of ready and planning practices. If you see that, for example, 8-pointers have 35-40% carryover, it may be more beneficial to split them into smaller stories rather than push them into the sprint anyway. Or you can go even deeper and analyze what makes 8-pointers so prone to carryover, and maybe adjust how you write and slice your stories in general. This application becomes pretty unique to the team and requires more parameters than cycle time to identify.
This assumption is a bit more complicated: even if stories of different sizes have different completion times and different risks, their relative progression through statuses should still be similar. An 8-pointer will spend more time in development and testing in absolute terms, but it should spend roughly the same % of time in each status as a 3-pointer. Let’s check.
For this, your dataset is a bit more complicated: you need a time-in-status for each story, not just a cycle time.
Visualization: it will be a stacked bar chart again, with each segment being a status on your board. You have a choice here. Option A: a “relative” chart, where the whole bar equals 100% and each status segment is its share of the cycle time. Option B: use the average time-in-status for each segment, so the bar height represents the cycle time itself. Option A is more visually readable and easier to analyze; option B is easier to implement (no need to adjust for a missing 0.01% due to Excel rounding errors, for example) and adds an extra visual of the cycle time difference between sizes.
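Option A can be sketched as below. Each row in `rows` is a hypothetical story with its size and time-in-status in business days; the status names are examples, so substitute your board’s columns.

```python
# Sketch of option A: average % of cycle time spent in each status, per size.
# `rows` and the status names are illustration data, not a real board.
from collections import defaultdict


def status_share(rows: list[dict]) -> dict[int, dict[str, float]]:
    """For each size, the average share (in %) of cycle time per status."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        total = sum(row["statuses"].values())     # this story's cycle time
        counts[row["size"]] += 1
        for status, days in row["statuses"].items():
            sums[row["size"]][status] += 100 * days / total
    return {
        size: {st: round(acc / counts[size], 1) for st, acc in per_status.items()}
        for size, per_status in sums.items()
    }
```

Feed the resulting percentages into the stacked bars; every bar sums to (roughly) 100, rounding aside.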
Interpretation here is complex. First, look at your assumptions: do sizes have relatively the same workflow? I bet not; it is actually pretty rare for work of a different size to be simply “the same but more complicated”. The patterns vary, but common ones are “similar in development but highly different in testing” or “relatively similar in effort to develop and test, but extremely different in how long they take to be accepted”.
Second, look at what these numbers tell you about your flow. How long is the actual “touch” time versus the “wait” time? If your wait time is long (for example, 3 days in “ready for testing” versus 1 day “in testing”), it might indicate a bottleneck, and you can see it visually. Maybe bottlenecks form only for a particular size? Now you have a place to start digging deeper.
These charts make sense if you have enough data, and even more if you have enough data for each size. Build a simple pie chart that shows the % of each size in the overall dataset. Look at the sections: they should not be equal, but there might be anomalies, such as one size eating the majority of the pie or one size represented by just a few items. Such anomalies may make the analysis less reliable, as the charts rely on statistics and distributions, and a group of 5 items for one size might be compared to a group of 500 items for the next size.
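The pie chart is just per-size shares of the dataset, so a quick sanity check can be scripted too. In this sketch, `sizes` stands in for the story point column of your export.

```python
# Sketch: share of each story point size in the dataset; the pie chart
# is simply these numbers. `sizes` is illustration data.
from collections import Counter


def size_shares(sizes: list[int]) -> dict[int, float]:
    """Map each size to its % of the overall dataset."""
    counts = Counter(sizes)
    total = len(sizes)
    return {size: round(100 * n / total, 1) for size, n in sorted(counts.items())}
```

A size holding, say, over half the pie, or one backed by only a handful of items, is the kind of anomaly to note before trusting the other charts.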
Individual charts paint a picture for you, helping you build a better understanding of your process. Now you can verify this picture by placing these charts side by side and comparing whether they are telling the same story. Are there any relationships between the distribution size and carryover probability, for example?
Go the extra mile - build the same charts but for issue types instead of story points: do you see something new? Something similar?
What you can discover (this is an example list of typical things I see with this analysis):
Story points have less difference in cycle time than the team assumes
Big work fails often (duh)
Your items spend more time waiting for something than being actually worked on
Bugs carry the burden of carryover
This article presents only analyses based on time-in-status and cycle time, and only a few examples; there are many more applications, from Monte Carlo forecasting to real-time risk assessment.
There are more dimensions you can look at as well: how these numbers change over time, for example. Or how different epics behave? Different releases? What if we add semantic analysis from LLMs to the mix? How does the quality of your acceptance criteria influence the cycle time, for example? Or we can extend the source data: let’s measure the percentage of rework (how often items travel backwards on your board, or even from closed back to open). And what patterns emerge from the size, type, or semantic perspective?
You have an untapped well of data and the ability to generate a lot of insights from it; don’t miss the opportunity, and stop treating metrics as a reporting tool only. Good luck with your journey.
Before you start commenting, note that some explanations are simplified and lack nuance; this does not make them inaccurate, but I do not think many people will benefit from the details of how box plot boundaries are calculated (whiskers extend to 1.5 × IQR). This is based on my experience, so don’t jump on the “we don’t have it” wagon. The described assumptions about story points are practical assumptions held by different teams, not “the true and proper definition of story points.”
Text is written by me, and proofread by AI. Illustrations are generated by Gemini and Plotly based on a synthetic dataset.