A question I’m often asked is: “How do I measure the health and productivity of a product delivery team”. The question can cause a lot of stress as the people being considered often assume the person asking is trying to measure them by some objective measure and to compare one team to another this doesn’t lead to any useful conclusions but there are indicators that a team could use some help and these are the ones I use.
Before getting into the indicators I start by assessing some initial principles. If any of these aren’t true the chances of the team being healthy and productive are low:
- Do the team have an appropriate level of responsibility, accountability and authority over what they are building?
- Do the team have a good understanding of who they are providing solutions for and how well those solutions are working for those people?
- Are the team working with tools, processes, systems and technologies that they want to be working on OR are they happy that they are part of an initiative that resolves whatever problems they might have with the tools, processes, systems and technologies they are working with.
If all those things are true I can start looking at some more objective measurements to get a sense of whether the team is doing good work. If they aren’t true we need to fix those things before we measure anything.
The measures I care most about are:
- Time to onboard new developers
If it takes more than one or two days for a new team member to review part of a solution and contribute something useful to it then that is a sign that the systems or the solution is too complicated or poorly documented which is a sign of poor health.
- DORA Metrics
- Deployment Frequency
- Lead time for changes
- Mean time to recovery
- Change failure rate
The DORA metrics are useful because if you attempt to game the metrics you still end up with a good result. If you game Deployment Frequency and release even ridiculously tiny changes like config variable changes you still force your pipeline to run quickly or your lead time for changes increases and you have to have good testing or your change failure rate increases.
DORA tests processes, tools, systems and test capability. If all those things are good the team is in a good place to do good work. DORA elite status is a practical and useful goal: https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance
Estimation metrics are useful for a team to have a view about itself but they aren’t of much use to anyone else. The characteristics of their burndown charts are a useful indicator that a team might need support.
If the burndown is flat and then drops off a cliff, there’s a clue that the team is not managing their work well or are not well informed about their requirements.
Large downward vertical drops indicate that the team is tackling work that is too large and might have been passing a story from week to week for too long. Large upward vertical inclines suggest the team has been tackling too large a piece of work and have suddenly discovered they need to break something down in several smaller stories. Neither of these things are necessarily bad but they are indicators that a leader might want to get closer to the team to see if they need help.
Those are the most important measures, once I’m happy with those I then focus on the items below.
You’ll notice that there is no attempt to measure productivity objectively. That’s because it’s unecessary and impossible in software engineering. If you object to the idea that measuring productivity is unecassary then you should read about Theory X and Theory Y. Software engineers are internally motivated they don’t need someone telling them they aren’t productive enough, they know when they’re being productive and they will tell you when they aren’t! If productivity is inadequate it will be because something isn’t right and the answer will be exposed by one of the measures detailed on this page. The reason objective productivity is impossible to measure is because no two software problems are ever exactly the same. Even if the problems seem identical there will be some external factor that will make it different. The difference might be as trivial as a minor version upgrade of a dependency or it could be something as major as a combination of browser upgrades, cloud system upgrades and a different team with different experience. Either way comparing two teams trying to solve the same problem is fruitless.
If all the principals mentioned above are true and you have the measures in place for the metrics I mentioned and something still isn’t right then continue to read the additional measures below that can help pinpoint precisely where the problem is:
- Good understanding of their role and place in the strategy
Skip level 1:1s are a useful tool to determine if a team has a good understanding of the wider business strategy and how that has informed their product roadmap and the features they are building. If a team member can’t explain what they’re doing in the context of the business strategy there is a real chance that the team isn’t in a position to make good decisions about their solution.
- How long PRs wait for review, or how many PRs are waiting for review.
Charting these metrics by team can be useful to identify when a team is overburdened or when a team is working on something they are struggling with. A platform team with a lot of PRs or a long-lead time for reviewing PRs could indicate they are overworked or it might indicate training needs in the teams consuming the platform.
If a team of six is tackling six stories simultaneously there is a chance that it isn’t a team it’s six individuals. There’s some nuance to this, if they break their stories down well and split an epic between them and act as reviewers for each other then it could be a healthy sign but more often than not it’s more healthy for a team to tackle one or two objectives at once and to work together on them.
- Systems/Application scorecard
If a team’s systems and applications are meeting all the organisation’s security, compliance and engineering standards then the team is likely to be doing good work.
At Just Eat we implemented a Scorecard application. We were inspired by a similar project at Linkedin. This scorecard graded Just Eat’s microservices. You can read more about that on the Just Eat tech blog https://tech.justeattakeaway.com/2018/03/07/how-we-use-activity-oriented-metrics-to-drive-engineering-excellence/
This solution was so successful the teams were actually motivated to race each other to be first to get all A’s or who could get the Spectre badge we offered for the first team to patch the Spectre and Meltdown bugs.
This scorecard should include the four golden signals.
- The four golden signals (https://sre.google/sre-book/monitoring-distributed-systems/)
These notes are lifted straight from the Google article linked above.
The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.
Finally, saturation is also concerned with predictions of impending saturation, such as “It looks like your database will fill its hard drive in 4 hours.”
- Core web vitals and the OODA Loop (https://en.wikipedia.org/wiki/OODA_loop)
While Core Web Vitals are a web application performance metric they also provide a view into how well a team is able to observe, orient, decide and act (OODA) the speed of a team’s OODA loop is another insight into how healthy a team is. Each aspect of the OODA loop can provide insight into where a team might need help:
- Observe: Does a team have access to the right signals? Do they even know how their app performs?
- Orient: Is the team able to review those signals in context? Does the team know if their application performance is acceptable?
- Decide: Can the team use the information they have in context to make decisions? Can the team make the difficult decisions necessary to balance their application requirements against Google’s expectations expressed in the Core Web Vitals?
- Act: Having decided what to do can the team actually implement the change in a reasonable time frame?
The OODA loop is an essential component of the Lead time for changes in the DORA metrics.
- Measurable improvements to business Outcomes
The ultimate purpose of a product delivery team is to achieve some desired business outcomes such as customer acquisition, or conversion. This isn’t always easy to measure but some attempt must be made to describe what success means for that team. If success is building a feature with no link to the reason why that feature is being built the team will inevitably get stuck on requirements and may be accused of gold-plating their solution.
- Experimentation rate
The time taken for a team to design an experiment, agree the conditions for the experiment and time or results required for conclusion of the experiment and then generating insight is a useful metric to describe the health of a team, their understanding of their users and quality of their solution and the tools they have to manage it.
Team retrospectives are a good indicator of team health. I always tell my teams that I won’t attend their retros so they can feel free to criticise or comment on anything they want to and then I ask them to provide me with a summary of whatever they feel comfortable sharing. I can then use this to try and determine what issues might be impacting the team.
- Team health checks
Team health checks have become quite popular recently and there are a few different solutions that are summarised quite nicely in this Medium article written by Philip Rogershttps://medium.com/agile-outside-the-box/team-health-checks-b4874c15bd73: Measuring the health and productivity of a delivery team
NPS is a useful tool but needs to be used sparingly, once a quarter is about as often as you can ask for an NPS score before people get frustrated with the process.