The premise
AI return on investment is the measurable change in business outcomes (time per case, cost per case, throughput, conversion rate, or quality) that can be attributed to an AI deployment, net of the cost of building and running it. The number is defensible when there is a baseline, a measurement cadence, and an explicit attribution model. Without those three, it is a story.
Most AI ROI numbers we see in board decks are stories. The pattern is consistent: the team picks the metric that moved, attributes the entire delta to the AI feature, ignores the seasonal and product-mix effects, and reports a percentage that is large enough to justify the next investment. The conversation then moves on. Three quarters later, when the next AI investment also needs to be justified, the original feature's actual impact has quietly stopped being measured.
This piece is the framework SDEN uses to make AI ROI measurable. It covers the four metrics that count, the baseline discipline that makes them defensible, the attribution failure modes that quietly destroy the case, and what good looks like at month one, month three, and month twelve.
If you do not measure before, you cannot measure after
The single biggest reason AI ROI numbers are not defensible is that nobody captured the before.
An AI deployment without a documented pre-deployment baseline is not measurable. It does not matter how sophisticated the post-deployment dashboards are: without a number from before, every comparison is to a remembered impression of how slow or expensive the old process was, and human memory of operational metrics is unreliable. We have audited deployments where the team was certain the AI feature saved 40% on time-per-case; the actual number, against the recovered baseline, was 12%. We have also seen the reverse: a team that felt the AI feature was disappointing, while the recovered baseline showed a real 25% improvement that nobody had credited because the new process felt the same.
The baseline is not difficult to capture. For most operational workflows, it is four measurements: time per case (median and p95), cost per case (fully loaded with human time), throughput (cases handled per person per week), and quality (a sampled audit of correctness, usually 30 to 50 cases). It takes a week, sometimes two if the data is scattered across tools, and it is the single highest-leverage step in any AI engagement.
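A minimal sketch of what capturing those four measurements could look like, assuming the pre-deployment data can be exported as a list of per-case handling times plus a sampled audit. The function name, parameter names, and example figures are all hypothetical, and mean handling time is used for cost per case as one reasonable choice among several.

```python
from statistics import mean, median, quantiles


def capture_baseline(case_minutes, hourly_cost, people, weeks, audit_results):
    """Summarise a pre-deployment process as the four baseline numbers.

    All parameter names are illustrative assumptions:
      case_minutes  - end-to-end handling time of each case, in minutes
      hourly_cost   - fully loaded hourly cost of the people doing the work
      people, weeks - team size and length of the observation window
      audit_results - True/False correctness verdicts from the sampled audit
    """
    return {
        "time_median_min": median(case_minutes),
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        "time_p95_min": quantiles(case_minutes, n=20, method="inclusive")[-1],
        # Labour cost only: there is no AI cost before deployment.
        "cost_per_case": mean(case_minutes) / 60 * hourly_cost,
        "cases_per_person_week": len(case_minutes) / (people * weeks),
        "quality": sum(audit_results) / len(audit_results),
    }


# Hypothetical example: five cases, one person, one week, a 30-case audit.
baseline = capture_baseline(
    case_minutes=[30, 40, 50, 60, 120],
    hourly_cost=60,
    people=1,
    weeks=1,
    audit_results=[True] * 27 + [False] * 3,
)
```

Storing the result alongside the date range and data sources it was computed from is what makes it usable as a comparison point later.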
We refuse to ship an AI feature without a captured baseline. Not because we want to look good, but because without it, the feature has no governance path. Nobody can roll it back when it stops working, because nobody can prove it ever started working.
Time, cost, throughput, quality, and the trap of the fifth
AI deployments move four metrics. Time per case is the most visible: how long does it take to handle one instance of the workflow, end to end. Cost per case is the fully loaded version: time per case multiplied by the cost of the people doing it, plus the cost of the AI itself. Throughput is the team-level view: how many cases does the team handle in a week, holding headcount constant. Quality is the discipline against optimization theatre: are the cases handled correctly, sampled against the same audit as before.
Most teams report on one of these and call it ROI. The honest version reports on all four, because optimizing one without the others is usually how AI deployments quietly fail. The classic pattern: the AI feature cuts time per case by 50%, the team handles 80% more cases per week, leadership reports a productivity win. Six months later, the quality audit shows that error rates have doubled: the team rushed, the model missed edge cases, and the cost of the errors landed downstream as customer churn or refund obligations. The actual ROI was negative; nobody measured it.
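The pattern above can be illustrated with a short sketch. It assumes downstream errors can be given an average cost per error (refunds, churn), which is a simplification; the function, its parameters, and every number below are hypothetical and chosen only to show how a labour-only view and a quality-adjusted view can disagree.

```python
def cost_per_case(minutes, hourly_cost, ai_cost_per_case=0.0,
                  error_rate=0.0, cost_per_error=0.0):
    """Fully loaded cost of one case, including expected downstream error cost."""
    labour = minutes * hourly_cost / 60
    expected_error_cost = error_rate * cost_per_error
    return labour + ai_cost_per_case + expected_error_cost


# Hypothetical figures: time per case halves, the error rate doubles.
before = cost_per_case(minutes=40, hourly_cost=60,
                       error_rate=0.04, cost_per_error=600)
after = cost_per_case(minutes=20, hourly_cost=60, ai_cost_per_case=1.0,
                      error_rate=0.08, cost_per_error=600)

# Labour plus AI cost alone looks like a clear win (40 -> 21 per case)...
labour_only_before = cost_per_case(minutes=40, hourly_cost=60)
labour_only_after = cost_per_case(minutes=20, hourly_cost=60, ai_cost_per_case=1.0)

# ...but with error cost included, the per-case cost goes up.
print(round(before, 2), round(after, 2))  # 64.0 69.0
```

The point of the sketch is only that the sign of the result depends on the quality term, which is why quality has to be measured against the same audit as the baseline.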
The fifth metric, the trap, is 'team satisfaction' or 'time saved' as reported in a survey. These are useful signals; they are not ROI metrics. People consistently overestimate the time AI tools save them, by factors of two to three in studies we trust. Use survey data for product feedback. Do not use it to justify the next AI investment.
Three ways the ROI number lies
The first failure mode is unattributed concurrent changes. The AI feature shipped in the same quarter as a UX redesign, a new training program, and a market-mix shift. The metric moved; the AI feature gets credit for the whole delta. The corrective is a holdout group, an A/B, or at minimum an explicit list of concurrent changes documented in the ROI memo. We default to a small holdout group on every deployment unless the workflow makes it impossible.
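As an illustration of the holdout approach, here is a minimal sketch for a rate metric such as conversion. It assumes cases were randomly assigned to the AI workflow or to the holdout, and that both groups were measured over the same period, so concurrent changes affect both. The interval uses the standard normal approximation for a difference of two proportions. The function name and the counts are hypothetical.

```python
from math import sqrt


def holdout_lift(treated_success, treated_n, holdout_success, holdout_n):
    """Difference in success rate (treated minus holdout) with a ~95% interval."""
    p_t = treated_success / treated_n
    p_h = holdout_success / holdout_n
    diff = p_t - p_h
    # Standard error of a difference between two independent proportions.
    se = sqrt(p_t * (1 - p_t) / treated_n + p_h * (1 - p_h) / holdout_n)
    return diff, (diff - 1.96 * se, diff + 1.96 * se)


# Hypothetical counts: 300 of 1000 convert with the AI workflow,
# 42 of 200 convert in the holdout.
lift, (low, high) = holdout_lift(300, 1000, 42, 200)
```

A small holdout gives a wide interval; whether that is acceptable depends on how large a difference the decision actually hinges on.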
The second failure mode is the seasonality glitch. The baseline was captured in a quiet quarter; the post-deployment measurement is from a peak quarter. The improvement looks real and is partly seasonal. Corrective: compare year-over-year if the cycle is annual, or use a rolling four-week baseline that controls for short-term variance.
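Both correctives are simple to compute. The sketch below assumes the metric is available as one value per week or per period; the function names and the sample values are hypothetical.

```python
from statistics import mean


def yoy_change(current, same_period_last_year):
    """Relative change against the same period one year earlier."""
    return (current - same_period_last_year) / same_period_last_year


def rolling_baseline(weekly_values, window=4):
    """Mean of the preceding `window` weeks, for each week with enough history."""
    return [
        mean(weekly_values[i - window:i])
        for i in range(window, len(weekly_values))
    ]


# Hypothetical weekly time-per-case values, in minutes.
weekly_minutes = [10, 12, 14, 16, 18, 20]
baselines = rolling_baseline(weekly_minutes)  # one baseline each for weeks 5 and 6

# Hypothetical: 90 minutes this quarter against 100 in the same quarter last year.
change = yoy_change(90, 100)
```

The year-over-year comparison addresses an annual cycle; the rolling window only smooths short-term noise and will not remove a seasonal effect on its own.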
The third failure mode is the silent quality drift. The model performs well at launch, performance erodes slowly over six months, nobody resets the baseline, and the reported ROI keeps using the launch-quarter quality number. The deployment looks healthy on the dashboard while customers are noticing the degradation. Corrective: quality is measured at the same cadence as cost and time, and the dashboard surfaces drift explicitly.
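One way a dashboard could surface drift explicitly is to compare each month's audited quality against the launch-quarter number and flag any month that falls short by more than a tolerance. This is a sketch under that assumption; the function name, the 3-point tolerance, and the sample scores are hypothetical.

```python
def drift_months(monthly_quality, launch_quality, tolerance=0.03):
    """Return the 1-based months where quality fell more than `tolerance` below launch."""
    return [
        month
        for month, quality in enumerate(monthly_quality, start=1)
        if launch_quality - quality > tolerance
    ]


# Hypothetical audited accuracy for the six months after launch.
flagged = drift_months(
    monthly_quality=[0.95, 0.94, 0.93, 0.91, 0.90, 0.88],
    launch_quality=0.95,
)
print(flagged)  # [4, 5, 6]
```

A fixed threshold is the simplest possible rule; it works only if the monthly audit uses the same sampling and scoring as the baseline audit.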
What disciplined ROI measurement actually changes
Four shifts we have seen when AI ROI moves from narrative to discipline. Each is described first as the narrative version, then as the disciplined one. None of them is sophisticated; all of them depend on doing the work of capturing the baseline.
Leadership reports a 40% productivity improvement from the new AI workflow. The number comes from a team survey; the underlying operational metric was never captured.
The actual time-per-case improvement is 18%, against a captured baseline. The number is smaller, defensible, and survives a board challenge. The next AI investment gets approved on the strength of a real number, not a contested one.
Takeaway · Smaller defensible numbers buy more credibility than larger contested ones.
An AI deployment is declared successful at launch based on first-month metrics. Six months later, the team has stopped using it; nobody has measured why.
Monthly ROI review surfaces the drift in month four. The team adjusts the prompt and the retrieval index; usage recovers. The feature is alive at month twelve, with documented quality.
Takeaway · ROI is a running discipline, not a launch event. The drop is what you measure for.
Sales AI workflow takes credit for a 25% lift in conversion. A new pricing change and a competitor's outage both landed in the same quarter.
A small holdout group shows the AI workflow accounts for 9 points of the 25% lift. The pricing change explains another 11, and the competitor's outage the remaining 5. The narrative is messier and more honest, and the next quarter's plan reflects what actually worked.
Takeaway · Attribution discipline keeps the next investment from chasing the wrong cause.
A support AI assistant shows 50% faster first-response times. Customer satisfaction scores drift down quietly over six months.
Quality audit is added to the monthly ROI review. The drift gets attributed to the AI assistant routing complex cases to junior reps. Routing is corrected; CSAT recovers. The 50% number stays, at the same quality bar as before.
Takeaway · Quality is the lagging metric that traps lazy ROI stories. Measure it from day one.
Three commitments on AI measurement
We do not ship an AI feature without all three. They are the bar for the engagement, not optional add-ons.
Baseline captured before launch
We do not ship until time, cost, throughput, and quality are measured for the pre-AI process. Without the four numbers, the deployment cannot be governed afterwards.
Holdout or documented confounders
Default to a holdout group. When that is not possible, the ROI memo names every concurrent change in the same quarter, and the attribution model that handles them.
Monthly review, twelve-month horizon
The same four metrics are reviewed monthly, dashboarded continuously, and re-baselined annually. The honest test is whether the feature is still working at month twelve, not at month one.
An AI portfolio with defensible numbers
Twelve months in, the leadership team can defend every AI investment with a number that survives a board challenge.
The companies that get AI ROI right are not the ones with the biggest numbers. They are the ones whose numbers survive scrutiny. The CFO can trace each percentage point of impact to a measurement methodology. The CEO can explain in a board meeting which AI investments worked and which did not, and what the company learned from the failures. The head of engineering can roll back an AI feature when the numbers stop moving, and has actually done so, at least once, without political cost.
The wider effect is that AI stops being a special category of investment and becomes a normal one. A new use case lands with a baseline, ships with an eval, gets reviewed monthly, and is killed when it stops paying back. It is the same discipline the company applies to ad spend or pricing experiments, applied to AI. That is what a mature AI portfolio looks like.
The numbers are also smaller. Companies with rigorous ROI measurement report 15 to 35% improvements on the workflows they target, not the 200 to 400% improvements that show up in vendor case studies. The smaller numbers are the real ones, they compound across the portfolio, and they survive the audit.
AI for founders: questions we get asked
Direct answers to the questions we get asked the most. If yours isn't covered, write to the team.