AI productivity papers: Goldman, MIT, BCG — what they actually show and don't

The productivity paradox is not a new phenomenon, but the generative AI era has accelerated its arrival. In 2023 and 2024, a wave of research emerged claiming that artificial intelligence would fundamentally alter economic output. By 2026, these claims are no longer treated as speculative; they are the baseline against which enterprise strategy and macroeconomic policy are measured. Yet a review of the most-cited literature from this period reveals a disconnect between headline numbers and operational reality. Three dominant studies provide distinct lenses: Goldman Sachs Global Investment Research (March 2023), the MIT/NBER working paper by Brynjolfsson et al. (first circulated April 2023), and Boston Consulting Group’s enterprise analysis (2023–2024). One offers a macro forecast, another a micro-level field study, and the third a view of procurement reality.

This essay synthesizes these sources to separate capability from adoption. It does not dispute that AI can improve task efficiency; it questions the aggregation of those gains into national or corporate productivity metrics. The evidence so far is mixed. Where tasks are bounded, gains are measurable. Where tasks require integration, ambiguity, or long-horizon planning, the data is thin. For leaders and investors, the critical distinction lies not in the model’s performance on a benchmark, but in the systemic friction required to convert a model into a workflow.

The Macro Forecast: Goldman Sachs and the 7% GDP Claim

The most widely cited macroeconomic projection comes from Goldman Sachs Global Investment Research in their March 2023 report, The Potentially Large Effects of Artificial Intelligence on Economic Growth. The headline figure is specific: generative AI could raise global GDP by 7% (almost $7 trillion) over a 10-year period. This estimate relies on a task-based modeling approach: the analysts score occupational task data for the US and Europe against AI capabilities and estimate that roughly two-thirds of current occupations are exposed to some degree of automation.

The methodology is transparent but carries inherent assumptions. Goldman Sachs analysts estimate that generative AI could expose the equivalent of 300 million full-time jobs to automation globally, with most exposed occupations more likely to be augmented than replaced. The 7% figure is conditional on widespread adoption and is not a floor; slower diffusion or regulatory friction would reduce it. Crucially, the report distinguishes between labor productivity (output per hour) and total factor productivity (TFP), noting that the former may rise faster than the latter in the short term due to capital deepening.

However, the 10-year horizon introduces significant uncertainty. The report acknowledges that historical technology waves—from electricity to the internet—took decades to fully permeate the economy. In the immediate term, the cost of implementation often offsets the gain in efficiency. For example, while a model might draft a legal brief in minutes, the legal team still requires hours to verify accuracy, manage liability, and integrate the output into a client strategy. Goldman’s model accounts for displacement costs but may underweight the training and integration costs required to retool workflows.
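To put the headline in annual terms, the following sketch converts a cumulative gain into the compound annual rate it implies. It assumes the 7% accrues smoothly over the full 10 years, which is a simplification made here for illustration and not something the report asserts; the function name is likewise illustrative.

```python
def implied_annual_rate(cumulative_gain: float, years: int) -> float:
    """Return the compound annual growth rate that produces
    `cumulative_gain` (e.g. 0.07 for 7%) after `years` years."""
    if years <= 0:
        raise ValueError("years must be positive")
    return (1.0 + cumulative_gain) ** (1.0 / years) - 1.0


# Headline figure from the text: 7% over 10 years.
# Assumption: the gain accrues evenly rather than arriving late in the decade.
annual_rate = implied_annual_rate(0.07, 10)
print(f"Implied annual uplift: {annual_rate:.2%}")
```

Under that assumption the uplift is well under one percentage point per year, which helps explain why early confirmation in aggregate statistics is difficult.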

Furthermore, the 7% figure is a global aggregate. It does not capture distributional effects. Sectors like software development, customer service, and content creation may see outsized gains, while heavy industry or regulated healthcare may see slower adoption. The report notes that knowledge workers are the primary beneficiaries, but it does not quantify the risk of productivity divergence between firms that adopt AI and those that do not. For a C-suite executive, the macro number is less useful than the micro implication: if the industry average rises 7%, but your firm lags due to legacy systems, your relative competitive position shrinks even if absolute output grows.

The Micro Experiment: MIT, NBER, and the 14% Gain

While Goldman looks at the economy, researchers at MIT and the National Bureau of Economic Research (NBER) looked at the individual worker. In April 2023, Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond first circulated the working paper Generative AI at Work. Their study analyzed customer support agents at a large technology company over a 10-month period. The result was specific and widely quoted: a 14% increase in productivity, measured as issues resolved per hour, for agents using generative AI assistance.

The study design is robust in its context. It measured issues resolved per hour alongside handle time and resolution rates, controlling for agent experience and shift time. The gains were not uniform; they were concentrated among less experienced and lower-skilled workers, who closed the performance gap with senior agents faster. This suggests AI functions as a skill equalizer in the short term, shortening the learning curve for complex queries.

However, the scope of the study limits its generalizability. Customer support tickets are bounded tasks with clear success criteria (resolution, sentiment, time). They do not require the creative synthesis or strategic judgment found in software engineering, management consulting, or clinical diagnosis. Evidence from other domains is suggestive but narrower. A separate experiment by Noy and Zhang (2023) on professional writing tasks found similar gains in speed, but on short, self-contained assignments scored by evaluators rather than on work embedded in a production workflow. In coding, a controlled experiment by Peng et al. (2023) on GitHub Copilot showed that developers completed a programming task faster, but completion time says nothing about defect rates in generated code, which still require careful monitoring.

The 14% figure in the MIT/NBER paper is an estimated treatment effect at a single firm, not an economy-wide parameter. It does not account for systemic bottlenecks. If a company adopts AI for drafting but keeps the same approval workflows, the time saved in drafting can be lost in review. The study’s authors caution against extrapolating their results beyond the setting studied, and any extrapolation must also account for complementary investments. For every hour saved on writing, an organization must invest in prompt engineering, data security, and quality assurance. If those investments are not made, the 14% gain evaporates into technical debt.
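The bottleneck argument can be made concrete with an Amdahl's-law style calculation. This is an illustrative sketch, not a model from the paper: the 30% drafting share, the 2x drafting speedup, and the added review time are hypothetical inputs chosen only to show the mechanics.

```python
def end_to_end_speedup(task_share: float, task_speedup: float,
                       added_overhead_share: float = 0.0) -> float:
    """Amdahl-style speedup of a whole workflow.

    task_share: fraction of the original cycle time spent on the accelerated step.
    task_speedup: how many times faster that step becomes (2.0 = twice as fast).
    added_overhead_share: new work such as extra review, expressed as a
        fraction of the original cycle time.
    """
    if not 0.0 <= task_share <= 1.0:
        raise ValueError("task_share must be between 0 and 1")
    if task_speedup <= 0.0:
        raise ValueError("task_speedup must be positive")
    if added_overhead_share < 0.0:
        raise ValueError("added_overhead_share must be non-negative")
    # New cycle time relative to an original cycle time of 1.0.
    new_cycle = (1.0 - task_share) + task_share / task_speedup + added_overhead_share
    return 1.0 / new_cycle


# Hypothetical inputs: drafting is 30% of the cycle and becomes 2x faster.
no_extra_review = end_to_end_speedup(0.30, 2.0)
# Same, but review grows by 15% of the original cycle time.
with_extra_review = end_to_end_speedup(0.30, 2.0, added_overhead_share=0.15)

print(f"Speedup without extra review: {no_extra_review:.2f}x")
print(f"Speedup with extra review:    {with_extra_review:.2f}x")
```

With these inputs, halving drafting time speeds up the whole cycle by less than 20%, and the gain disappears entirely if review absorbs the time that drafting freed up.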

The Enterprise Reality: BCG and the Pilot-to-Production Gap

If Goldman provides the ceiling and MIT provides the floor, Boston Consulting Group (BCG) provides the friction. In their 2023–2024 analysis of enterprise AI adoption, BCG researchers surveyed hundreds of organizations to measure the transition from pilot to production. Their finding is less about productivity gains and more about implementation failure. They report that roughly 50% of AI pilots do not scale beyond the initial department.

The BCG data highlights a specific bottleneck: data readiness. In their 2024 report, The Productivity Promise of Generative AI, BCG analysts found that organizations with mature data governance saw 2x higher ROI than those without. This aligns with the technical reality that models require clean, structured, and accessible data to function reliably. A model cannot retrieve a policy document if that document is trapped in a legacy PDF repository without indexing.

BCG also quantifies the cost of integration. Beyond the subscription fee or inference cost, the engineering hours required to connect a model to internal APIs, enforce access controls, and log outputs for compliance are substantial. In one case study cited in their 2024 analysis, a financial services firm spent 18 months moving a pilot from a sandbox environment to a production workflow. During that time, the model’s performance on the original task degraded as the underlying API changed, requiring retraining.

This “last mile” problem is the primary reason why enterprise productivity metrics often lag behind research benchmarks. A model might achieve a 90% accuracy rate in a lab setting, but if it requires a human to verify 30% of its output due to hallucination risks, the net gain is lower than the gross time saved. BCG’s research suggests that the ROI curve for AI is not linear; it is a step function. Organizations must cross a threshold of data quality and process redesign before gains compound. Until that threshold is reached, AI acts as a cost center rather than a productivity engine.
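A rough way to see how verification eats into the headline gain is to compute expected handling time per item once review is included. The sketch below is illustrative only: the 10 manual minutes, 4 assisted minutes, and 8 minutes per review are hypothetical inputs, and only the 30% review rate echoes the example above.

```python
def net_minutes_saved(manual_minutes: float, assisted_minutes: float,
                      review_rate: float, review_minutes: float) -> float:
    """Expected minutes saved per item after paying for human verification.

    review_rate is the share of AI outputs a human must check (0.0 to 1.0).
    """
    if not 0.0 <= review_rate <= 1.0:
        raise ValueError("review_rate must be between 0 and 1")
    expected_assisted = assisted_minutes + review_rate * review_minutes
    return manual_minutes - expected_assisted


# Hypothetical inputs: 10 min by hand, 4 min with AI, 8 min per human review.
gross_saving = net_minutes_saved(10.0, 4.0, review_rate=0.0, review_minutes=8.0)
net_saving = net_minutes_saved(10.0, 4.0, review_rate=0.30, review_minutes=8.0)

print(f"Gross saving per item: {gross_saving:.1f} min")
print(f"Net saving per item:   {net_saving:.1f} min")
```

Here a 60% gross time saving shrinks to 36% once review is counted, and a higher review rate or slower reviews can push the net figure to zero or below.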

Methodological Fractures: What the Numbers Hide

The three studies above share a common limitation: they measure output, not value. A 14% increase in tickets resolved per hour does not necessarily mean a 14% increase in the value delivered to customers. A 7% GDP boost does not account for the energy consumption or hardware costs required to generate that output. To understand the true productivity impact, we must look at the methodological gaps that these papers acknowledge but do not fully resolve.

First, selection bias in pilot studies is pervasive. The MIT/NBER study sampled a single company. The Goldman Sachs model assumes a uniform adoption curve. BCG’s survey relies on self-reported data from organizations that chose to participate, which skews toward those already confident in their AI strategy. There is a survivorship bias here: we see the successes because they publish, but we do not see the failures that are quietly abandoned.

Second, time horizons are mismatched. Macro studies look at 10 years; micro studies look at 10 months. The Total Factor Productivity (TFP) data that economists use to validate these claims is released with a lag of 12 to 18 months. By the time the Bureau of Labor Statistics, which publishes the official US productivity series, confirms an uptick, the technology landscape may have shifted. In 2024, the focus was on text generation; by 2026, the focus may be on autonomous agents. Measuring productivity based on the wrong task definition leads to false negatives.

Third, cost accounting is often incomplete. The Goldman Sachs report includes labor displacement costs but largely excludes compute costs. As models grow larger, the inference cost per token becomes a significant line item. If a company spends $100,000 on inference to save $50,000 in labor, the net return is negative by $50,000, even if the task is completed faster. BCG’s data suggests that many firms fail to track unit economics at the task level, aggregating costs at the department level where they are hidden.
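The difference between task-level and department-level accounting can be sketched in a few lines. All figures and task names below are hypothetical; the first line item is constructed to mirror the $100,000 versus $50,000 example above, and the second is an invented profitable task in the same department.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskLine:
    """Costs and savings for one task type (all figures hypothetical)."""
    name: str
    volume: int                      # tasks handled in the period
    labor_saved_per_task: float      # dollars of labor avoided per task
    inference_cost_per_task: float   # dollars of compute spent per task

    @property
    def net_per_task(self) -> float:
        return self.labor_saved_per_task - self.inference_cost_per_task

    @property
    def net_total(self) -> float:
        return self.volume * self.net_per_task


lines = [
    # 10,000 tasks: $100,000 of inference against $50,000 of labor saved.
    TaskLine("contract summaries", 10_000, 5.0, 10.0),
    # A second, profitable task type in the same department.
    TaskLine("ticket triage", 20_000, 6.0, 1.0),
]

department_net = sum(line.net_total for line in lines)
for line in lines:
    print(f"{line.name}: {line.net_total:+,.0f} USD")
print(f"department total: {department_net:+,.0f} USD")
```

The department total comes out positive even though one task type loses money on every call, which is the kind of loss a department-level view conceals.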

Finally, there is the issue of quality variance. In high-stakes domains like healthcare or law, a 14% speed gain can be outweighed by an error rate that rises by even one percentage point. The MIT study measured resolution rates, but it did not measure long-term liability. If an AI-assisted agent resolves a ticket faster but leaves a compliance gap that triggers a fine six months later, the productivity gain is illusory. The evidence so far is mixed on whether AI reduces tail risk or merely shifts it.
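The trade-off between speed and tail risk can be framed as a simple expected-value calculation. The sketch below is a deliberately crude illustration: the $7 of labor saved per case and the $5,000 average cost per error are hypothetical, and a real analysis would need a distribution of error costs rather than a single average.

```python
def net_value_per_case(value_of_time_saved: float,
                       added_error_probability: float,
                       cost_per_error: float) -> float:
    """Expected net value per case: time saved minus expected extra error cost."""
    return value_of_time_saved - added_error_probability * cost_per_error


def break_even_error_increase(value_of_time_saved: float,
                              cost_per_error: float) -> float:
    """Largest rise in error probability that the time saving can absorb."""
    if cost_per_error <= 0.0:
        raise ValueError("cost_per_error must be positive")
    return value_of_time_saved / cost_per_error


# Hypothetical inputs: $7 of labor saved per case, $5,000 average cost per error.
threshold = break_even_error_increase(7.0, 5_000.0)
one_point_rise = net_value_per_case(7.0, 0.01, 5_000.0)

print(f"Break-even rise in error rate: {threshold:.2%}")
print(f"Net value per case at +1 point: {one_point_rise:+.2f} USD")
```

With these inputs the speed gain is wiped out by an error-rate increase of roughly a seventh of a percentage point, so a full one-point rise leaves each case well in the red.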

What Changes the Picture: From Capability to Integration

The picture changes not when models get smarter, but when integration costs get cheaper. The current bottleneck is not intelligence; it is connectivity. For AI productivity to move from a pilot metric to a macroeconomic reality, three conditions must be met.

First, evaluation must shift from benchmark to workflow. The current reliance on MMLU or coding benchmarks measures capability, not utility. Organizations need task-specific evaluation suites that measure error rates in production, rework cycles, and customer satisfaction. If a model writes code that passes tests but fails security review, it is not productive. The metric must be net value, not gross output.

Second, data infrastructure must mature. The BCG finding on data readiness is the most actionable insight. Productivity gains are contingent on retrieval accuracy. If a model cannot find the correct internal document, it hallucinates. Investment in vector databases, access control, and data governance is not optional; it is the prerequisite for productivity. Until this infrastructure is in place, AI remains a feature, not a system.

Third, economic incentives must align. Currently, vendors sell tokens or seats, but enterprises need outcomes. Pricing models that align vendor revenue with task success (e.g., per resolved ticket, per deployed feature) would force better integration. Goldman Sachs’ 7% GDP forecast assumes full adoption; if adoption stalls due to misaligned incentives, the forecast will be revised downward.

The path forward requires measured skepticism. The Goldman, MIT, and BCG papers provide a foundation, but they are not a guarantee. The 7% GDP boost is a possibility, not a promise. The 14% productivity gain is a real effect in specific contexts, not a universal law. The 50% pilot failure rate is a warning, not a verdict.

For the 2026 observer, the lesson is clear: productivity is a system property, not a model property. It emerges from the interaction of technology, process, and people. The models are ready. The workflows are not. Until the workflows catch up, the productivity numbers will remain volatile. The evidence so far suggests that the real gains will come not from replacing workers, but from reorganizing work to leverage the specific strengths of generative AI—drafting, retrieval, and pattern matching—while keeping humans in the loop for judgment, liability, and strategy. That is where the durable productivity evidence lies, and that is where the next wave of research must focus.