Eight Frameworks for Measuring AI ROI — And How to Use Each One

AI Value Acceleration's AI ROI Framework Atlas maps eight published measurement systems against the same grid: what each one actually measures, how to collect the data, and which question it answers.

Arpy Dragffy · 8 min read
Overview
  • MIT's GenAI Divide research found that 95% of enterprise AI pilots produce no measurable return within six months — not because AI fails, but because most programs were never instrumented to measure value.
  • Forrester TEI is a capital-defense instrument. The Microsoft-commissioned version claims 116% ROI; the Google-commissioned version, using the same methodology, claims 416%. Neither is an operational measurement tool.
  • The Anthropic Economic Index shows the real shape of enterprise consumption: the top 10% of users consume 60–70% of tokens. Lifting the spending cap without cross-team knowledge transfer produces a cost spike, not organizational learning.
  • AI investment should be structured as a VC portfolio — safe bets with documented returns, big bets that force structural change, and moonshots designed to generate organizational lessons regardless of production outcome.

The data on enterprise AI should force a rethink. MIT's GenAI Divide research found 95% of generative AI pilots produce no measurable return within six months. Kyndryl's 2026 research found 74% of senior leaders targeting revenue growth through AI are achieving it in only 20% of cases. Forrester's Q1 2026 tracker found 78% of organizations exceeded their 2025 AI budgets by 47%.

These are programs running with the wrong measurement tools — or none at all. AI Value Acceleration's AI ROI Framework Atlas maps eight published measurement systems on the same six-field grid. Here's what each one actually measures and how to apply it.


The Capital-Defense Layer

Forrester Total Economic Impact

Use this to justify an AI investment to a board or CFO before deployment begins.

What to measure:
- Total cost of ownership — sum license fees, implementation, training, support, and ramp-time productivity loss before projecting any return; most programs model gross benefit and skip costs, which produces inflated headline numbers
- Productivity time savings — use a time-diary method: users log actual minutes on defined tasks before and after the deployment, not estimates from memory; cut self-reported figures by 30–50% to account for perception bias
- 3-year NPV — discount projected benefits and costs back to present value using your organization's standard capital rate (typically 8–12%); this is the number boards compare against competing investment options
- Payback period — the month cumulative benefits exceed cumulative costs; most boards want this under 18 months; if your model shows 24 months or more, revise deployment scope before presenting
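
The NPV and payback arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not Forrester's TEI model: the function names, the 40% haircut default (an assumed midpoint of the 30–50% range), and every dollar figure are hypothetical.

```python
def discounted_self_report(minutes_saved, haircut=0.40):
    """Cut a self-reported time saving to allow for perception bias.
    haircut=0.40 is an assumed midpoint of the 30-50% range."""
    return minutes_saved * (1 - haircut)


def npv(rate, yearly_net_cash_flows):
    """Net present value; index 0 is year 0 (today) and is not discounted."""
    return sum(cf / (1 + rate) ** year
               for year, cf in enumerate(yearly_net_cash_flows))


def payback_month(monthly_benefits, monthly_costs):
    """First month (1-indexed) in which cumulative benefits exceed
    cumulative costs, or None if that never happens in the window."""
    cum_benefit = cum_cost = 0
    for month, (benefit, cost) in enumerate(
            zip(monthly_benefits, monthly_costs), start=1):
        cum_benefit += benefit
        cum_cost += cost
        if cum_benefit > cum_cost:
            return month
    return None


# Hypothetical program: 300k total cost up front, 150k net benefit per year.
three_year_npv = npv(0.10, [-300_000, 150_000, 150_000, 150_000])

# Hypothetical monthly view: 100k up front, 5k/month after, benefits from month 4.
costs = [100_000] + [5_000] * 23
benefits = [0] * 3 + [15_000] * 21
month = payback_month(benefits, costs)

print(f"3-year NPV at 10%: {three_year_npv:,.0f}")
print(f"Payback month: {month}")
```

With these made-up inputs the NPV comes out at roughly 73,000 and payback lands in month 15, inside the 18-month window most boards want. Change the discount rate to your organization's own capital rate before drawing any conclusion.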

BCG Frontier Firms / AI at Work

Use this to understand where your program sits relative to industry peers and why maturity isn't advancing.

What to measure:
- Workflow integration rate — ask all AI users quarterly: "Is AI embedded in how you do your daily work?" (yes/no); anything below 30% means surface-level adoption, not workflow change; this is the gap BCG's research consistently finds between headline adoption and real integration
- Manager active use rate — survey managers weekly: "Did you personally use an AI tool in your own work this week?" (behavior, not attitude); teams with active-user managers show 3–5× higher adoption rates; most programs never track this
- Maturity tier — run BCG's AI at Work benchmark or equivalent self-assessment annually with a defined industry peer set; this is an annual planning anchor, not a monthly KPI
- People-centricity audit — for each deployment, ask: was the design built around changing how people work, or around deploying technology? Deployments that score low here stall regardless of tool quality


Activity and Distribution

Microsoft Productivity Index / Copilot Telemetry

Use this to track who is actually using AI tools, at what depth, and where engagement is stalling.

What to measure:
- Weekly active users (WAU) as % of licensed seats — pull from your vendor admin console weekly, not monthly; monthly figures hide sharp weekly drop-offs; flag any cohort below 40% WAU at the three-month mark as a stall signal
- Feature depth score — track which features each cohort uses: basic prompts only, multi-step workflows, external data integrations; surface-level users don't generate productivity gains regardless of how high WAU appears
- Six-month retention rate — share of licensed users still active at month six; below 40% means deployment design needs review before expanding the program or adding seats; this is the metric that surfaces the idle-license problem
- Cohort-level segmentation — break engagement data by role, department, and manager adoption status; per-user averages hide whether your program has healthy spread or 10% power users and 90% inactive
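
The WAU and retention checks above might be computed like this once you have an export from the admin console. This is a sketch under assumptions: the input shapes (a user-to-cohort mapping and sets of active user IDs) and all names are hypothetical, not any vendor's actual export format.

```python
from collections import defaultdict

STALL_THRESHOLD = 0.40  # from the text: below 40% WAU at month three is a stall signal


def wau_by_cohort(seats, active_this_week):
    """seats: dict of user_id -> cohort label (role, department, ...).
    active_this_week: set of user_ids with any activity in the week.
    Returns dict of cohort -> share of licensed seats that were active."""
    licensed = defaultdict(int)
    active = defaultdict(int)
    for user, cohort in seats.items():
        licensed[cohort] += 1
        if user in active_this_week:
            active[cohort] += 1
    return {cohort: active[cohort] / licensed[cohort] for cohort in licensed}


def stalled_cohorts(wau):
    """Cohorts whose WAU share falls below the stall threshold."""
    return sorted(c for c, share in wau.items() if share < STALL_THRESHOLD)


def six_month_retention(licensed_at_start, active_at_month_six):
    """Share of originally licensed users still active at month six (both sets)."""
    return len(licensed_at_start & active_at_month_six) / len(licensed_at_start)


# Hypothetical data: five seats across two departments.
seats = {"u1": "sales", "u2": "sales", "u3": "sales", "u4": "eng", "u5": "eng"}
wau = wau_by_cohort(seats, active_this_week={"u1", "u4", "u5"})
print(wau)
print(stalled_cohorts(wau))
```

Segmenting before averaging is the point of the design: the blended figure here is 60% WAU, which hides that one cohort sits at a third.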

Anthropic Economic Index

Use this to understand whether AI capability is spreading organizationally or concentrating in a small group.

What to measure:
- Token consumption by user decile — pull monthly usage data from your vendor admin and rank users by total consumption; calculate what % the top 10%, middle 40%, and bottom 50% each drive; top 10% above 60% of total signals an enablement gap, not a scaling success
- Frontier cohort feature usage — for your top-decile users, audit which features and integrations they're using; this cohort is the organization's most advanced AI practice and the starting point for replication planning
- Consumption growth source — is new monthly volume coming from previously inactive users, or from existing heavy users consuming more? Growth that stays in the top cohort means knowledge isn't transferring across the organization
- Agentic workflow cost per outcome — for any multi-step autonomous pipeline, track cost per completed task (not cost per token); agentic pipelines can run $500–3,000+/month per active user; cost-per-outcome is the only figure that shows whether the spend is working
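
The decile split and the cost-per-outcome figure above can be sketched as follows. The input shape (a mapping of user to monthly tokens), the function names, and the sample numbers are all assumptions for illustration; the Anthropic Economic Index does not prescribe this code.

```python
def consumption_shares(tokens_by_user):
    """tokens_by_user: dict of user_id -> tokens consumed in the month.
    Returns the share of total consumption driven by the top 10%,
    middle 40%, and bottom 50% of users, ranked by consumption."""
    ranked = sorted(tokens_by_user.values(), reverse=True)
    total = sum(ranked)
    if not ranked or total == 0:
        raise ValueError("need at least one user with non-zero consumption")
    top_n = max(1, round(len(ranked) * 0.10))
    mid_n = round(len(ranked) * 0.40)
    top = sum(ranked[:top_n])
    middle = sum(ranked[top_n:top_n + mid_n])
    bottom = total - top - middle
    return {"top_10": top / total,
            "middle_40": middle / total,
            "bottom_50": bottom / total}


def cost_per_outcome(monthly_cost, completed_tasks):
    """Cost per completed task for an agentic pipeline; None if nothing completed."""
    if completed_tasks == 0:
        return None
    return monthly_cost / completed_tasks


# Hypothetical month: ten users, one of them very heavy.
usage = {f"user{i}": t for i, t in enumerate(
    [700, 60, 50, 40, 30, 30, 30, 20, 20, 20])}
shares = consumption_shares(usage)
print(shares)
print(cost_per_outcome(monthly_cost=2_400, completed_tasks=120))
```

In this made-up month the top decile drives 70% of tokens, above the 60% line the text treats as an enablement gap. The pipeline figure works out to 20 per completed task, which is the number to compare against what that task cost before.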


Program-Level Diagnostics

McKinsey Value Realization

Use this to identify which deployments are producing returns and build replication pathways from what's working.

What to measure:
- Outcome documentation rate — for each active deployment, can you name one specific, verified outcome: a decision made, output shipped, revenue attributed, or error rate reduced? Track this as % of total deployments; most programs discover it's below 20%
- Value concentration index — rank deployments by verified outcome evidence; calculate what % of total documented value your top three use cases represent; high concentration tells you where to focus replication efforts and where to cut
- Structural conditions inventory — for each top-performing deployment, document: the role type, specific task, the manager's adoption behavior, and how the output was integrated into existing workflow; this becomes the replication template for new deployments
- Replication attempt rate — have structural conditions from successful deployments been intentionally applied to new ones? Track yes/no per new deployment; low rates explain why programs don't scale even when individual pilots work well
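
The first two metrics above reduce to simple ratios once each deployment has a verified value attached. A minimal sketch, assuming a hypothetical mapping of deployment name to verified value (0 where no outcome has been documented); the names and figures are invented, not McKinsey's.

```python
def outcome_documentation_rate(verified_value):
    """Share of deployments with at least one verified outcome.
    verified_value: dict of deployment name -> documented value (0 if none)."""
    if not verified_value:
        raise ValueError("no deployments supplied")
    documented = sum(1 for value in verified_value.values() if value > 0)
    return documented / len(verified_value)


def value_concentration(verified_value, top_k=3):
    """Share of total documented value held by the top_k deployments."""
    total = sum(verified_value.values())
    if total == 0:
        return 0.0
    top = sum(sorted(verified_value.values(), reverse=True)[:top_k])
    return top / total


# Hypothetical portfolio: ten deployments, five with verified outcomes.
portfolio = {
    "support triage": 400_000,
    "contract review": 250_000,
    "sales email drafts": 150_000,
    "code review": 100_000,
    "meeting notes": 100_000,
    "pilot A": 0, "pilot B": 0, "pilot C": 0, "pilot D": 0, "pilot E": 0,
}
print(outcome_documentation_rate(portfolio))
print(value_concentration(portfolio))
```

Here half the deployments have a documented outcome and the top three hold 80% of documented value, which points replication effort at those three use cases.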

MIT GenAI Divide

Use this to diagnose what your program is actually capable of measuring right now and what needs to change first.

What to measure:
- Stage classification — list every metric your program currently tracks and classify each as activity (usage, logins, tokens) or outcome (verified result produced); if outcome count = 0, you're stage one; run this 30-minute audit at the start of every planning cycle
- Measurement infrastructure inventory — document what data your program produces automatically vs. what requires manual collection; stage one programs typically have only vendor activity dashboards with no outcome data at all
- Outcome-to-activity ratio — divide the number of deployments with documented outcomes by total active deployments; below 10% is stage one/two; above 50% is stage three/four; track this quarterly as a program health signal
- Stage advancement milestone — define what advancing one stage requires before setting any ROI targets (e.g., "stage one to two = outcome tracking live for at least three deployments"); make this the program's explicit planning cycle goal
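
The ratio and its thresholds above can be written down directly. This sketch is an assumption-laden reading of the text, not MIT's method: the function names are hypothetical, and because the text gives no band for ratios between 10% and 50%, the code reports that range as unclassified rather than guessing.

```python
def outcome_to_activity_ratio(deployments_with_outcomes, active_deployments):
    """Deployments with documented outcomes divided by total active deployments."""
    if active_deployments <= 0:
        raise ValueError("need at least one active deployment")
    return deployments_with_outcomes / active_deployments


def stage_band(ratio, outcome_metric_count):
    """Map the ratio onto the bands stated in the text.
    outcome_metric_count: how many tracked metrics are outcome metrics."""
    if outcome_metric_count == 0:
        return "stage one"          # no outcome metrics tracked at all
    if ratio < 0.10:
        return "stage one/two"
    if ratio > 0.50:
        return "stage three/four"
    return "unclassified (10-50%)"  # the text defines no band for this range


# Hypothetical program: 20 active deployments, 1 with a documented outcome.
ratio = outcome_to_activity_ratio(1, 20)
print(ratio, stage_band(ratio, outcome_metric_count=2))
```

Run it once per quarter with the same deployment list so the trend, not a single reading, drives the planning conversation.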


Behavioral Evidence

Wharton/Mollick Behavioral Cohort

Use this to understand whether the behavioral conditions in your organization will produce sustained AI adoption.

What to measure:
- Manager active use rate — survey weekly: "Did you personally use an AI tool in your own work this week?" (yes/no); compare WAU between teams whose managers answered yes vs. no; a difference of less than 2× means your manager enablement isn't working
- Team adoption rate by manager type — pull weekly active user data from your Productivity Index and segment by manager adoption status; this comparison is the clearest leading indicator of whether adoption will sustain or stall
- Disclosure safety score — run a quarterly anonymous survey with one question: "I feel comfortable telling my manager when I use AI tools" (agree/disagree); low scores explain the gap between reported and actual usage in your program
- 12-week output quality delta — select one task with a measurable output; run two matched cohorts for 12 weeks (AI-enabled vs. not); have outputs independently evaluated against a pre-agreed rubric; this is the defensible internal ROI evidence that survives board review
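
The manager comparison above comes down to one ratio. A minimal sketch, assuming a hypothetical list of per-team records (whether the manager reported using AI that week, active users, licensed users); pooling users across teams rather than averaging team rates is a design choice here, not something the source specifies.

```python
def manager_effect_ratio(teams):
    """teams: list of (manager_used_ai, active_users, licensed_users) tuples.
    Returns pooled WAU of teams with an active-user manager divided by
    pooled WAU of teams without one, or None if it cannot be computed."""
    def pooled_wau(manager_flag):
        active = sum(a for used, a, _ in teams if used == manager_flag)
        licensed = sum(n for used, _, n in teams if used == manager_flag)
        return active / licensed if licensed else None

    wau_yes = pooled_wau(True)
    wau_no = pooled_wau(False)
    if wau_yes is None or not wau_no:
        return None  # a group is missing, or the comparison group has zero WAU
    return wau_yes / wau_no


# Hypothetical week: four teams of 20 licensed users each.
teams = [(True, 18, 20), (True, 12, 20), (False, 5, 20), (False, 3, 20)]
effect = manager_effect_ratio(teams)
print(f"{effect:.2f}x")
```

In this invented week the ratio is 3.75×, above the 2× line. A ratio this size is a correlation between manager behavior and team usage; it does not by itself show that the manager caused the difference.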

Task-Level Outcome Measurement

Use this when you need peer-defensible evidence that AI improved performance on a specific workflow.

What to measure:
- Task-specific output metric — define before the study begins: accuracy rate, time-per-task, error rate, or quality score; if you can't define the metric in advance, the task isn't bounded enough for this framework to produce a defensible result
- Pre-study baseline — collect four weeks of performance data before any AI tool is introduced to the cohort; without a clean baseline you can't rule out that performance was already improving before AI arrived
- Independent evaluation rubric — outputs are scored by someone who doesn't know whether AI was used to produce them; agree on the rubric before the study starts; this is the methodological step that makes results board-defensible rather than self-reported
- Minimum study duration — 12 weeks minimum; studies shorter than this produce novelty-effect inflation because early engagement regresses; at 12 weeks the performance delta reflects genuine behavioral change, not initial enthusiasm
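
The design rules above (metric defined up front, four-week baseline, 12-week minimum) can be enforced in the same place the delta is computed. This is a sketch under assumptions: scores are taken to be blinded rubric scores already collected, all names and numbers are hypothetical, and a difference in means says nothing about statistical significance.

```python
from statistics import mean

MIN_BASELINE_WEEKS = 4   # from the text: four weeks of pre-study data
MIN_STUDY_WEEKS = 12     # from the text: shorter studies inflate the effect


def study_delta(baseline_scores, study_scores, baseline_weeks, study_weeks):
    """Compare blinded rubric scores before and during AI use.
    Raises ValueError if the study design is shorter than the stated minimums."""
    if baseline_weeks < MIN_BASELINE_WEEKS:
        raise ValueError("baseline shorter than four weeks")
    if study_weeks < MIN_STUDY_WEEKS:
        raise ValueError("study shorter than 12 weeks")
    baseline_mean = mean(baseline_scores)
    study_mean = mean(study_scores)
    delta = study_mean - baseline_mean
    return {"baseline_mean": baseline_mean,
            "study_mean": study_mean,
            "delta": delta,
            "relative_change": delta / baseline_mean}


# Hypothetical quality scores on a 0-100 rubric.
result = study_delta(baseline_scores=[70, 72, 68, 70],
                     study_scores=[77, 79, 75, 77],
                     baseline_weeks=4, study_weeks=12)
print(result)
```

With these made-up scores the mean moves from 70 to 77, a 10% relative change. Rejecting short studies in code keeps an eight-week pilot from being reported next to a 12-week one as if they were comparable.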


Why Token Volume Isn't a Measurement Framework

Forrester's Q1 2026 tracker found 78% of organizations exceeded their 2025 AI budgets by 47%. AI Value Acceleration's token economics analysis found 30–50% of enterprise AI token consumption is waste: tokens generated but never used to inform a decision or produce an acted-on output. Lifting spending caps without cross-team knowledge transfer infrastructure doesn't produce organizational capability — incremental consumption lands in the frontier cohort, not across the organization. Token volume confirms spend. It doesn't confirm learning.


Structuring Investment as a Portfolio

The AI Strategy Summit in May framed enterprise AI investment using a VC portfolio structure — and the evidence across all eight frameworks supports it.

Safe bets are bounded deployments with documented evidence bases. Run the full composite stack: Productivity Index for activity, Economic Index for distribution, task-level outcomes for attribution, Wharton for behavioral conditions.

Big bets require structural change to produce value. Use longer measurement horizons: BCG maturity staging to assess readiness, McKinsey use case audit to identify where value will concentrate, and Wharton to confirm manager adoption is driving the behavioral change the deployment requires.

Moonshots generate organizational learning regardless of production outcome. Use MIT GenAI Divide stage classification as the measurement design — the documented lessons are the deliverable, not the deployment itself.

Running all three tiers on the same six-month ROI horizon is how programs end up in MIT's 95%.


Work with AI Value Acceleration

The AI ROI Framework Atlas from AI Value Acceleration profiles all eight frameworks against the same six-field comparison grid — what each measures, what it excludes, its unit of analysis, data inputs, credibility profile, and where it breaks down. It's free to read.

If your program needs a direct diagnosis — which frameworks are missing from your current stack, where value concentration is happening, and what your measurement infrastructure actually supports — AI Value Acceleration works directly with enterprise teams to run this work. Start at aivalueacceleration.com.

Arpy Dragffy

Founder, PH1 Research · Co-host, Product Impact Podcast
