The measurement trap

23 June

Well-intentioned programs produce unintended consequences. Our evaluation systems make it worse.

There's a moment in almost every evaluation that most of us recognise. You're partway through your consultations, and someone says something that doesn't fit. A participant describes an experience unrelated to the specific outcomes you are assessing. Or a community member mentions a pattern that the theory of change didn't anticipate. Or a story surfaces that is clearly important to the person telling it but sits outside the boundaries of what you've been asked to measure.

What happens next varies. Sometimes you note it, label it an outlier or unrepresentative, and move on. It is at least acknowledged, but nothing more substantive comes of it. Sometimes you flag it in a limitations section. Sometimes you pursue it, but only if you have the space, the budget, and the institutional permission. But often, it falls away. Not because it doesn't matter, but because the system you're operating within wasn't designed to prioritise it.

What we mean by unintended consequences

An unintended consequence is an effect of your work that wasn’t envisioned in the original design. It can be positive, negative, or neutral. It can emerge quickly or take years to become visible. And it doesn't have to affect large numbers of people to matter. A consequence experienced by a handful of participants, or even a single community, still counts.

Consider a program designed to strengthen women's economic participation in a context of high unemployment. A conventional theory of change often goes something like this: providing skills training leads to improved small business development, which increases income and expands agency and decision-making power within the household. The intended outcomes may well be achieved. But alongside them, other things may happen that nobody planned for. For example, new income may disrupt dynamics within traditionally patriarchal households, which may result in increased gender-based violence. Women who participate in the program may be stigmatised by those who don't. Childcare may need to be organised, which may create additional burdens. These aren't fringe cases. They reflect the contextual complexities shaped by poverty, gender inequality, and social norms, which are often obvious to community members but less understood by others.

Unintended consequences don't have to be negative to matter. The same program might produce unexpected improvements as well. But it's the negative ones, the harms, that carry the sharpest ethical weight, because they tend to fall on the people who were already most vulnerable.

Three data problems

When we talk about evaluation missing important consequences, it's tempting to treat this as a single problem. But there are actually three distinct biases at work, and they reinforce each other.

The first is structural. Evaluation systems are designed to test a theory of change, which means they are oriented toward intended outcomes by default. If something wasn't in the framework, the evaluation probably won't detect it, regardless of whether that something is positive or negative.

The second is an incentive problem. Even for outcomes the system can see, including the intended ones, there is a strong privileging of positive findings. While there is a lot of genuine interest and investment in 'learning' and in leaning into 'what didn't work,' the reality is that there is an inbuilt bias towards finding the positive and proving that things worked. Funding depends on demonstrating results. Reputations are built on success stories.

The third is about whose experience counts. Programs don't affect everyone the same way. An intervention might be genuinely beneficial for the majority of participants while causing harm to a minority. Evaluation systems tend to prioritise the experiences of the majority, which often validate the original theory of change. The nuance that something works for some people and damages others is easily lost when the data is aggregated, when the sample doesn't include those on the margins, or when the questions asked don't leave room for divergent experiences to surface.
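To make the aggregation point concrete, here is a minimal sketch in Python. The numbers are invented purely for illustration and are not findings from any evaluation; the group labels, the "change" score, and the function names are all assumptions. It shows how an overall average can look positive while a smaller group is worse off.

```python
from collections import defaultdict
from statistics import mean

# Invented data, for illustration only: each record is (group, change in a
# hypothetical wellbeing score). Positive means better off, negative worse off.
changes = [("majority", 2)] * 8 + [("on the margins", -3)] * 2


def overall_mean(records):
    """Average change across everyone, ignoring group membership."""
    return mean(change for _, change in records)


def mean_by_group(records):
    """Average change computed separately for each group."""
    groups = defaultdict(list)
    for group, change in records:
        groups[group].append(change)
    return {group: mean(values) for group, values in groups.items()}


print("Overall:", overall_mean(changes))
print("By group:", mean_by_group(changes))
```

In this made-up case the overall average is positive, which would appear to validate the theory of change, while the disaggregated view shows the smaller group going backwards. Disaggregating only helps if people on the margins were in the sample to begin with, and if the questions left room for their experience to be recorded.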

These are all, at their core, data problems. The first means we collect the wrong (or at least incomplete) data. The second means we interpret data through a lens that favours good news. The third means we are selective about whose data we consider. Together, they produce a consistent pattern: evaluation practice systematically under-detects the harms that programs produce, particularly harms experienced by people whose voices are hardest to hear.

There's a further dimension here that is only going to become more pressing. As the sector increasingly turns to technology and artificial intelligence to support evaluation, the biases embedded in our current systems risk being amplified. AI-driven analysis is shaped by the information it's given. If the underlying data already excludes certain voices and certain outcomes, then algorithmic tools will reproduce and entrench those exclusions, faster, at greater scale, and with a veneer of objectivity that makes them harder to challenge.

What doing it differently looks like

If these are data problems, then the responses are about data too. The structural bias requires a design shift: building evaluation approaches that create space for discovery, not just verification. The incentive bias requires an institutional shift: changing what gets rewarded, what gets resourced, and what gets heard. And the third bias requires the most deliberate work of all: actively seeking out the experiences of people who are traditionally excluded, difficult to reach, or vulnerable, rather than assuming that majority experiences tell the full story.

In our work, we've found that this begins with how evaluations are designed, not how they're reported. The most important moment is before consultations begin, when questions are framed, tools are built, and decisions are made about who will be heard and what will be asked. Leaving unintended consequences to emerge organically is a gamble that rarely pays off.

This kind of design requires specific conditions. It requires evaluators with the skills and cultural competence to hold open-ended conversations in context, in the right languages, with the right relational proximity, in settings where people feel safe to speak honestly. It requires meaningful input from local communities in shaping evaluation questions and approaches, not just participation as respondents. It requires terms of reference that explicitly include unintended consequences as an evaluative focus. It requires budgets that fund the time it takes to do iterative, qualitative work. And it requires funders and implementing organisations that are willing to hear what emerges, even when it's uncomfortable.

None of these conditions are unreasonable. They're choices about how evaluations are commissioned, how they're resourced, and what institutional culture permits.

The learning we owe

Unintended consequences are, at their core, a learning signal. They tell us something about the system we're working in that our theory didn't capture. A household conflict triggered by a livelihood program isn't just a harm to be mitigated. It's a signal about gender norms, power dynamics, and economic stress that the program design didn't account for. A community divide between participants and non-participants isn't just an awkward finding. It's a signal about social cohesion, about who was chosen to participate and who wasn't, and about the way external resources reshape local relationships.

If we treat these signals as noise, or worse, as threats to a program's narrative of success, we lose the most valuable information an evaluation can produce. The programs that have handled this well, in our experience, are the ones that treated unintended consequences as adaptation opportunities: findings that triggered redesign, shifted the theory of change, and led to better outcomes in subsequent phases. Not because the first phase failed, but because it revealed something that mattered.

This is work that colleagues and I have been thinking about and practising for several years now, and we're still learning. We don't have a tidy framework that solves the problem. What we do have is a growing conviction that the sector's current approach to unintended consequences – acknowledging them in principle, under-resourcing them in practice, and treating them as marginal to the main evaluative story – is not good enough. Not for the people whose lives are affected by programs, and not for the integrity of the work we claim to do.

Unintended consequences will keep happening. The question is whether we'll keep choosing not to see them or whether we're willing to build systems that are genuinely oriented toward discovery, not just verification.

Niketa Kulkarni
