PR Counts, Token Spend, and What They Actually Measure
PR counts and token spend do not measure contribution well, but early on they can show whether a team is building useful AI-era habits.

PR counts and token spend make a lot of people uneasy, and for good reason.
More PRs do not necessarily mean better work. More token usage does not necessarily mean better software. In companies selling AI tools, there is also an obvious temptation to treat usage as proof of value. None of this is subtle. The skepticism is deserved.
What gets lost, though, is that rough metrics can still be useful in a narrower way. They may not reveal much about quality or contribution, but they can say something about whether a team is actually changing how it works.
Writing has worked like this for a long time. Daily quotas never guaranteed anything about the final result. They did not say whether the argument held together or whether the sentences were worth keeping. What they did was make regular contact with the work more likely. Stephen King has described aiming for roughly 2,000 words a day, and Graham Greene was known for a smaller daily quota. Those routines did not produce quality on their own. They helped create the conditions in which quality might eventually emerge.
Running is similar. Marathon training does not begin with race pace, dialed-in fueling, and interval sessions several times a week. The early work is more basic than that. Build mileage. Add load gradually. Put some foundation in place before trying to optimize.
That seems like the fairest case for PR counts and token spend. On their own, they say very little about contribution. Early on, though, they can show whether new habits are actually taking shape.
The problem is not activity metrics by themselves. The problem is how long they tend to stick around.
Even a rough metric can tell something worth knowing.
Before AI, the better productivity frameworks already knew activity was only one slice
The old argument about developer productivity never really ended in a clean answer. Even before AI, the more serious research had already moved away from one-dimensional measurement. The SPACE framework mattered because it did not reduce productivity to output volume. It looked at several different things at once: satisfaction and well-being, performance, activity, communication and collaboration, and efficiency and flow.
That helps here because it gives PR counts a more reasonable place. Activity is not meaningless. It is just partial. A team can show a lot of visible motion because the work is moving cleanly. The same pattern can also show up when the work is fragmented, planning is sloppy, or people are spending a lot of effort pushing through avoidable friction. The number itself does not sort any of that out.
Here is a compact version of that older view.
The SPACE view before AI
| Dimension | What it was trying to capture | Useful examples | Common mistake |
|---|---|---|---|
| Satisfaction & well-being | Whether developers felt supported and able to do good work | Surveys, perceived efficacy, team health | Treating this as soft or irrelevant |
| Performance | Whether the work created meaningful results | Reliability, quality, customer or business outcomes | Confusing local output with real outcomes |
| Activity | Visible engineering motion | PRs, commits, tasks completed, code reviews | Treating motion as proof of value |
| Communication & collaboration | How work moved between people | Review responsiveness, coordination, documentation | Ignoring coordination costs |
| Efficiency & flow | How easily developers could move work forward | Cycle time, interruption cost, time to feedback | Mistaking speed for ease or clarity |
The acronym matters less than the basic reminder. Activity belongs in the picture, but it cannot stand in for the whole picture.
Why activity metrics still deserve a little more respect
A lot of teams do not run into trouble because the wrong advanced metric was chosen. The trouble often starts earlier.
Work does not get broken down well. Integration happens too slowly. New tools get discussed more than used. Opinions arrive before experience does: AI is transformative, AI is overhyped, agents are the future, agents are sloppy. Plenty of time can disappear into those arguments without much evidence that anyone has learned how to work with the tools in front of them.
At that stage, simple activity metrics can have some use. PR count can push a team toward smaller batches and faster feedback. Token spend can at least show whether AI tools are being used on real work instead of living mostly in presentations and meetings. Neither number says much about whether the output is good. What the numbers can sometimes show is whether enough repetition is happening for judgment to start forming.
That is why the writing analogy holds up. A daily word target is not a literary judgment. It is more like a routine that keeps the work alive. The running analogy works for the same reason. Base mileage is not race execution, but without that base the later refinements do not have much to build on.
So the case for PR counts and token spend is fairly limited. These are not strong measures of contribution. In the early phase, though, they can still be useful signs that a team is practicing in a way that may lead to better judgment later.
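As a concrete illustration, here is a minimal sketch of what that early-stage tracking could look like, assuming a team can export two hypothetical datasets: merged PRs with timestamps, and per-person token usage. The field names and the 10,000-token floor are illustrative assumptions, not a standard; the only question the numbers answer is whether repetition is happening week over week.

```python
# A minimal sketch of early-stage "habit" tracking, assuming two hypothetical
# exports: merged PRs (author, merged_at) and AI token usage (user, date,
# tokens). Nothing here judges quality; it only shows whether repetition is
# happening week over week.
from collections import defaultdict
from datetime import date, datetime


def week_key(d: date) -> str:
    """Bucket a date into an ISO year-week label, e.g. '2025-W14'."""
    iso = d.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def weekly_pr_counts(merged_prs: list[dict]) -> dict[str, int]:
    """Count merged PRs per ISO week across the whole team."""
    counts: dict[str, int] = defaultdict(int)
    for pr in merged_prs:
        counts[week_key(pr["merged_at"].date())] += 1
    return dict(counts)


def weekly_active_users(token_usage: list[dict], min_tokens: int = 10_000) -> dict[str, int]:
    """Count distinct people per week whose token spend clears a small floor.

    The floor is arbitrary; the point is 'used the tools on real work',
    not 'spent a lot'.
    """
    users: dict[str, set] = defaultdict(set)
    for row in token_usage:
        if row["tokens"] >= min_tokens:
            users[week_key(row["date"])].add(row["user"])
    return {week: len(people) for week, people in users.items()}


if __name__ == "__main__":
    prs = [
        {"author": "ana", "merged_at": datetime(2025, 4, 1, 14, 0)},
        {"author": "ben", "merged_at": datetime(2025, 4, 3, 9, 30)},
        {"author": "ana", "merged_at": datetime(2025, 4, 9, 16, 45)},
    ]
    usage = [
        {"user": "ana", "date": date(2025, 4, 1), "tokens": 42_000},
        {"user": "ben", "date": date(2025, 4, 2), "tokens": 3_000},   # below the floor
        {"user": "ben", "date": date(2025, 4, 10), "tokens": 55_000},
    ]
    print(weekly_pr_counts(prs))       # {'2025-W14': 2, '2025-W15': 1}
    print(weekly_active_users(usage))  # {'2025-W14': 1, '2025-W15': 1}
```

The output deliberately says nothing about whether any of those PRs were good. It only shows whether the reps are actually happening.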
PR counts and token spend do a poor job capturing contribution, but they can still show whether new habits are forming.
The role of leadership is less interesting than the dashboards make it look
This part is mostly about process.
Sometimes a team gets a target and not much else. “Use AI more” sounds clear until the work starts. Then the ambiguity shows up all at once. Where is AI actually useful? What still needs careful human review? How small should a PR be? What experiments are safe? How does what gets learned in one corner of the team become shared practice instead of isolated experience?
None of that is glamorous, but that is usually where the real difference is made. In the early phase, a concrete operating loop matters more than a polished theory. Teams usually do better with a process that is clear enough to run and revise than with vague pressure to adapt.
In that setting, metrics can work as temporary scaffolding. They make expectations visible. They give people something concrete to organize around. They help anchor a transition without pretending to measure the whole job.
That is often true in any period of team formation, and it feels especially true during a shift like this one. When new habits are still fragile, clarity around goals, roles, and operating rules matters more than elegant language about long-term value.
The point where the metric starts to mislead
The same metric can help in one phase and distort in another.
A writer who keeps staring at daily word count will eventually neglect revision, structure, and selection. A runner can keep adding mileage long after it stops being intelligent training. Engineering has the same problem. A high PR count can reflect smaller batches and faster iteration, or it can reflect fragmented work and cosmetic slicing. High token spend can reflect real experimentation, or it can reflect waste, confusion, and a growing verification burden that never gets counted.
Once a team has some base, the bottleneck usually moves. In AI-assisted work, generation gets cheaper. The harder part shifts toward framing the problem well, reviewing output carefully, integrating changes cleanly, and deciding what deserves trust. At that point, simple activity metrics start to sit too close to the tool and too far from the outcome.
That is where a lot of the current argument goes off track. One side dismisses PR counts and token spend as obviously foolish. The other side treats more usage and more output as if value naturally follows. Neither view gets very far. A metric only makes sense in relation to the stage of the team and the actual bottleneck in the work.
Activity metrics become misleading once the harder work has moved somewhere else.
What changes in the AI coding era
The AI era does not remove the old measurement problem. It makes it easier to hide inside a flood of output.
There is now much more exhaust to count. More code can be generated. More drafts can be produced. More suggestions can be accepted. More tokens can be spent. That makes activity look impressive very quickly, while making it less reliable as a signal on its own.
DORA’s recent work on AI-assisted software development points in a useful direction. The interesting part is not simply that AI can make teams faster. The more useful reading is that AI often acts as an amplifier. When the surrounding system is strong, throughput may improve in ways that hold up. When the surrounding system is weak, some of the gains come back in shakier form, with more pressure pushed into review, validation, cleanup, and coordination.
That is why the search for a replacement metric feels beside the point. The better approach is to use different measures at different stages. Early on, it may be enough to watch for signs that new habits are actually forming. Later, the question changes. The issue is no longer whether the tools are getting used. The issue is whether the work holds up under real conditions.
A SPACE-style table for the AI coding era
The original SPACE categories still make sense, but the center of gravity shifts a bit. Activity still matters. It just becomes much easier to overread. The more useful question is whether cheap generation is turning into reliable progress with limited rework.
| Dimension | What mattered before AI | What changes with AI coding tools | Better questions to ask now |
|---|---|---|---|
| Satisfaction & well-being | Whether developers felt supported and effective | Still relevant, but less central here than performance evaluation | Are tools helping or adding hidden friction? |
| Performance | Quality, reliability, and business outcomes | More important because raw output is cheaper | Did AI-assisted work hold up in production? |
| Activity | PRs, commits, tasks completed | Easier to inflate; easier to misread | Is this activity building useful habits or just generating exhaust? |
| Communication & collaboration | Review loops, handoffs, documentation | More important because AI changes handoffs and review load | Is shared context improving? Is review getting better or heavier? |
| Efficiency & flow | Cycle time, interruptions, time to feedback | Verification and cleanup matter more inside this category | Did AI reduce total effort, or did the effort just move to a later phase? |
A more explicit version for the AI era might look like this.
What to measure after the team has built the base
| Category | Early-stage metric | Why it can help early | Why it becomes weak later | Better later-stage measure |
|---|---|---|---|---|
| AI adoption | Token spend | Shows whether tools are actually being used | Can reward waste or vendor-friendly behavior | Task-level usefulness, verification cost, rework burden |
| Shipping rhythm | PR count | Encourages smaller batches and faster feedback | Can reward fragmentation or cosmetic slicing | Trusted throughput, cycle time to validated merge |
| Coding output | Accepted suggestions, generated code volume | Shows experimentation and rep-building | Can hide review burden and defect risk | Escaped defects, rollback rate, integration quality |
| Individual initiative | Number of experiments run | Encourages active learning | Can turn into motion without synthesis | What was learned, standardized, and reused |
| Team learning | Usage targets | Creates shared attention during transition | Stops meaning much once the habit exists | Better prompts, clearer guardrails, stronger norms, better tests |
| Human judgment | Rarely measured early | Hard to assess before enough reps exist | Becomes more important as generation gets cheaper | Problem framing, delegation choices, review quality, trust calibration |
The basic point is simple enough. Early metrics can help create motion. Later metrics have to say something about whether that motion is adding up to anything solid.
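To make the later-stage column a little less abstract, here is a rough sketch of two of those measures under one possible reading: trusted throughput as the share of merged PRs that survived a validation window without a revert or a linked defect, and cycle time counted only over merges that later proved trustworthy. The record fields (opened_at, merged_at, reverted, defect_linked) and the 14-day window are assumptions about what a team's tooling could export, not a standard schema.

```python
# A rough sketch of later-stage measures: the share of merged PRs that held
# up inside a validation window, and a cycle time that only counts merges
# that later proved trustworthy. Field names and the window length are
# assumptions, not a standard schema.
from datetime import datetime, timedelta
from statistics import median

VALIDATION_WINDOW = timedelta(days=14)


def trusted(pr: dict, now: datetime) -> bool:
    """A PR counts as trusted only after it has survived the full window."""
    aged_out = now - pr["merged_at"] >= VALIDATION_WINDOW
    return aged_out and not pr["reverted"] and not pr["defect_linked"]


def trusted_merge_share(prs: list[dict], now: datetime) -> float:
    """Share of merged PRs that held up, rather than the raw merge count."""
    merged = [pr for pr in prs if pr["merged_at"] is not None]
    if not merged:
        return 0.0
    return sum(trusted(pr, now) for pr in merged) / len(merged)


def cycle_time_to_validated_merge(prs: list[dict], now: datetime) -> timedelta | None:
    """Median time from opening a PR to a merge that later proved trustworthy."""
    durations = [
        pr["merged_at"] - pr["opened_at"]
        for pr in prs
        if pr["merged_at"] is not None and trusted(pr, now)
    ]
    return median(durations) if durations else None


if __name__ == "__main__":
    now = datetime(2025, 6, 1)
    prs = [
        {"opened_at": datetime(2025, 5, 1), "merged_at": datetime(2025, 5, 2),
         "reverted": False, "defect_linked": False},
        {"opened_at": datetime(2025, 5, 3), "merged_at": datetime(2025, 5, 6),
         "reverted": True, "defect_linked": False},   # fast, but it came back
        {"opened_at": datetime(2025, 5, 25), "merged_at": datetime(2025, 5, 28),
         "reverted": False, "defect_linked": False},  # too recent to trust yet
    ]
    print(trusted_merge_share(prs, now))             # 0.333...
    print(cycle_time_to_validated_merge(prs, now))   # 1 day, 0:00:00
```

The design choice worth noticing is that a PR only counts once it has aged past the window. That is exactly the kind of delayed, outcome-shaped signal that raw PR counts and token spend never capture.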
The strongest version of both sides
Both sides are getting at something real.
PR counts and token spend deserve a little more credit than critics usually give them. For a team still learning how to work in a different way, those metrics can function a bit like word-count goals for writers or mileage targets for runners. Not as judgments of quality, but as signs that repeated practice is happening.
At the same time, those metrics deserve much less authority than many companies want to give them. Once the basic habits are in place, the numbers become weak signals at best and distortions at worst. They remain attractive mostly because they are easy to collect, easy to graph, and easy to explain upward.
That is part of what makes them tricky. Metrics often become most misleading when they contain just enough truth to survive well past the stage where they were actually useful.
A runner can mistake base training for race preparation. A writer can mistake output quota for finished prose. Engineering organizations can do something similar when tool usage starts standing in for contribution.
So the interesting question is not whether PR counts and token spend are good or bad metrics in the abstract. The more useful question is whether they are being used for the right phase, for the right reason, and with some sense of when they stop saying much.
What performance evaluation probably needs instead
Probably not a new magic number. More likely a better sense of sequence.
Habit formation comes first. Then process improvement. Then a higher standard of evaluation, once the earlier groundwork is actually there.
A company that has not yet built the habit of smaller changes, repeated AI usage, and practical experimentation may have a reasonable case for tracking PR count or token spend for a while. A company that keeps using those as central measures of engineering value is usually revealing something else. The evaluation model never really grew past the early phase.
Put more simply, the warm-up never ended.
That seems like the part worth sitting with. Starting with simple metrics is not the danger. Refusing to move past them is.
References
- Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Jenna Butler, and Brian Houck. “The SPACE of Developer Productivity.” ACM Queue (2021). https://queue.acm.org/detail.cfm?id=3454124
- DORA. State of AI-assisted Software Development 2025. https://dora.dev/research/2025/dora-report/
- Stephen King. On Writing: A Memoir of the Craft. https://www.amazon.com/Writing-Memoir-Craft-Stephen-King/dp/1982159375