Agentic Generation

15min

Generation Is Easy. Verification Is Hard.

The open problems in agentic website generation — why scaling models won't fix them, and where to set your bar to compete with the best.

Published: 20th June 2026

 

Load-bearing figures in this report are traced to primary sources and stated with their exact study conditions; forward projections are labeled as projections; vendor claims are flagged as claims.

Executive Summary

The defining problem in agentic website generation is not generation — it is verification. Today's tools can produce a plausible, polished, working-looking website from a sentence. What they cannot do is reliably know whether what they produced is actually good. And for the dimensions that matter most to a real product — design quality, brand fit, UX intent, and business-logic/authorization correctness — there is no automated oracle for "good" at all. For the dimensions that do have oracles — functional correctness (tests), performance (Core Web Vitals), security (static analysis), accessibility (automated WCAG checks) — those oracles sit dormant, because the generator authors the code but rarely authors the tests, checks, and annotations that would activate them. A system that cannot evaluate its own output cannot reliably self-correct, and this is precisely why scaling the model alone has not solved the problem.

The hard evidence for that last claim is stark: across two years of "revolutionary" model releases, the security pass rate of AI-generated code has stayed flat at roughly 55%, even as syntactic correctness climbed from ~50% to over 95% (Veracode, Spring 2026). Bigger models write code that compiles beautifully and is no more secure. As Veracode puts it, "a larger model is not a security control."

Two findings anchor this report. (I) The Verification Meta-Problem: the absence of a ground-truth oracle for the high-value dimensions is the root cause that makes most other problems unsolvable by scaling. (II) The Self-Improvement Paradox: the obvious fix — iterate more, add more agents — actively backfires. Asking a model to iteratively "improve" its code increased critical vulnerabilities by 37.6% after five rounds, and none of four prompting strategies (including a security-focused one) improved security with iteration (IEEE/arXiv, 2026). You cannot iterate your way to quality when there is no oracle to iterate against.

For a leader deciding where to set the bar against the best in the business, the single most important takeaway is this: the right benchmark depends on the dimension. For verifiable dimensions (functional correctness, performance, security, accessibility), the bar is movable — and the frontier has decisively shifted from the model to the harness: the scaffolding, tests, verifier loops, and evaluation infrastructure wrapped around the model. Compete there and you can win, because the models themselves are commoditizing. For judgment-bound dimensions (design taste, brand fit, UX intent, authorization correctness), the bar is not autonomy — it is the best human-in-the-loop workflow — and betting against the missing oracle is a losing strategy no amount of model progress will rescue.

The problems below fall into two groups: agentic (LLM-bound) and non-agentic (newly hard because of AI generation). A complete picture of the frontier requires examining both.

How to Read This Report

We apply three analytical lenses throughout:

  • Capability gap vs. fundamental gap — with explicit confidence levels. A capability gap plausibly closes as models and harnesses improve. A fundamental gap persists regardless of scale. We state our confidence per claim rather than presenting all "fundamentals" as equally certain — security's flat-since-2023 data is ironclad; long-horizon coherence is strong but directional.
  • The "newly-hard-because-of-AI" gate. Every problem here answers one question crisply: what is qualitatively new because of AI generation? If a problem is merely "web development was always hard," it does not qualify unless AI generation materially changes its character — its volume, its operator, or its opacity.
  • The failure-mode-at-scale lens. Many problems are invisible in a single generation and only emerge at 100 pages, 50 iterations, or when the output is handed to a non-coder.

Two operator personas recur, because the severity and the bar to compete differ sharply depending on who is driving:

Persona | Who | Reads the diff? | Representative tools | Primary failure exposure
--- | --- | --- | --- | ---
A: Non-Technical / Vibe Operator | founders, merchants, non-developers | No | Lovable, Bolt, Shopify merchants, v0-for-founders | Ships the polished shell unaware it's broken
B: Engineer-as-Accelerant | professional developers | Yes | Cursor, Claude Code, Devin, v0 in a real stack | Bypasses their own CI/SAST under time pressure

Our evidence standard: Load-bearing numbers are traced to primary sources and stated with their exact study conditions; forward projections are explicitly labeled as projections; vendor marketing claims are flagged as claims, not facts.

What's Already Solved (So We Don't Cry Wolf)

Before mapping what's unsolved, it's worth bounding what genuinely works today — both for honesty and because the unsolved problems cluster precisely where these capabilities run out.

  • Local, well-scoped code generation is strong. Single-function bug fixes, feature stubs, single-module refactors, and test generation are rated "very strong" across current agent evaluations. First-draft pull requests under a clear specification, with good existing tests, are reliably useful.
  • Component-level UI generation is genuinely good. Tools like Vercel v0 produce clean, idiomatic React/Next.js components that drop into a real codebase.
  • Tool-augmented navigation beats brute force. Agents equipped with grep, language servers, and test runners navigate large codebases far more reliably than agents that try to load everything into context.
  • Some real safety mitigations have shipped. Replit's Package Firewall, which automatically blocks known-malicious or compromised packages, is a concrete, working defense against one real supply-chain risk.

Bounding what's solved earns the right to be precise about what isn't. The rest of this report is about where these capabilities hit a wall — and why.

Two Anchoring Findings

These two findings are not items in a catalog; they are the lens through which every problem below should be read.

Finding I — The Verification Meta-Problem

The problem. Without ground truth, there is no automated way to know whether a generated website is "good." "Good" is multi-objective (correct, on-brand, usable, accessible, performant, secure, and conversion-worthy), and for the highest-value dimensions no oracle exists at all. Design quality, brand fit, and UX intent have no metric; business-logic and authorization correctness have no complete static oracle (logic and privilege-escalation flaws routinely evade static analysis and only surface at runtime with real data). The subtler half of the problem: oracles that do exist stay dormant. Tests can verify functional correctness, but only if someone writes them. Lighthouse measures performance. Static analysis catches a class of vulnerabilities. Automated tooling catches part of accessibility. But the generator produces the artifact without producing its own checks, so the latent oracles never fire.

Why it's hard. A human developer verifies implicitly as they write — they hold the intent, they notice when something feels wrong. An agent generates output decoupled from any internal model of "is this right," and the dimensions where humans add the most judgment are exactly the dimensions with no machine oracle to delegate to.

Why scaling won't fix it (high confidence). This is the report's most direct evidence. Veracode's longitudinal testing shows the security pass rate of AI-generated code has remained flat at approximately 55% since 2023 — "an essentially unchanged failure rate regardless of model generation, parameter count, or provider" — even as syntax correctness rose from ~50% to over 95%. Veracode is blunt: "parameter count shows minimal correlation with security performance… upgrading to a larger or more expensive model is not a security control." On the accessibility side, Deque's research finds automated tooling identifies 57.38% of accessibility issues (across 2,000+ audits, 13,000+ pages, ~300,000 issues) — the remaining ~43% require human judgment and cannot be automated away by a better model. And developer-survey data (CodeRabbit, Sonar, SmartBear, 2026) converges on the same theme: AI "often produces code that looks correct but isn't reliable."

Bridge: This meta-problem is why the harness exists. The harness's entire job, as the rest of this report shows, is to force the latent oracles to fire (to write the tests, run the scanners, gate on the checks) because the generator won't do it itself.

Finding II — The Self-Improvement Paradox

The problem. The intuitive way to make AI output better is to iterate — ask it to improve, add a second agent to review, run more passes. The evidence says this often makes things worse, not better.

The evidence (with exact conditions). A 2026 study, "Security Degradation in Iterative AI Code Generation" (IEEE/arXiv), found that iteratively asking a model to improve its own code increased critical vulnerabilities by 37.6% after five iterations, across 400 code samples and four distinct prompting strategies — and crucially, none of the four strategies, including an explicitly security-focused one, improved security as iteration continued. Separately, Google Research's "Towards a science of scaling agent systems" (180 agent configurations across three model families) found that independent, non-communicating multi-agent systems amplified errors 17.2×, while centralized orchestrated systems reduced that to 4.4× — and, importantly, the same study found multi-agent setups produced gains on parallelizable tasks. So the finding is not "multi-agent is bad"; it is architecture-dependent: naive parallel independence amplifies error, orchestration matters. Underlying both results is the "self-conditioning" effect — when a model's context includes its own prior errors, it becomes measurably more likely to produce further errors — and the compound-error effect, where small per-step error rates multiply across long chains.
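The compound-error effect can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every step in a chain succeeds independently with the same probability. Real agent steps are neither independent nor uniform, and the self-conditioning effect described above would push the true figure lower. The function name and the 2% error rate are hypothetical, not figures from either study.

```python
def chain_success_probability(per_step_error: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes independent steps with an identical error rate
    (an illustrative simplification, not a measured model).
    """
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_step_error) ** steps


if __name__ == "__main__":
    # 2% is an arbitrary illustrative error rate.
    for steps in (1, 10, 50, 100):
        p = chain_success_probability(0.02, steps)
        print(f"{steps:>3} steps at 2% error per step: {p:.1%} chance of a clean run")
```

Under these assumptions, a 2% per-step error rate leaves roughly an 82% chance of a clean 10-step run and roughly 13% for a 100-step run, which is why a deterministic check between steps matters more as chains get longer.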

Why it matters. This inverts the naive roadmap. "More iterations, more agents" is not a path to quality; without an oracle to steer toward, iteration is undirected drift, and the model's own bad output pollutes its future reasoning.

Bridge: This is the foundation of the report's central decision principle. You cannot iterate your way to quality without an oracle, which is exactly why some dimensions are movable by better harness engineering (where oracles can be built and made to fire) and others are structurally capped (where no oracle exists and a human must remain in the loop).

The Problem Catalog

Each problem below follows the same structure: what it is, why it's hard, why scaling alone won't fix it (with a confidence level), where the best-in-class sits today, what the bar to compete actually requires, and who it bites hardest.

Bucket 1 — Agentic / LLM-Bound Problems

1. The Generate-vs-Edit Asymmetry (Maintenance and Context Rot)

What it is. AI is impressive at greenfield creation and unreliable at editing a large existing codebase without introducing regressions. Creation is a demo; maintenance is the business — and the asymmetry between the two is one of the most consequential unsolved problems in the field.

Why it's hard. A human maintainer carries a persistent mental model of the system, built over months. An agent reconstructs that model from scratch each session, from a context window that is both too small and too noisy. Real monorepos run to several million tokens of code, plus far more in undocumented architectural and organizational knowledge that never made it into a file. The agent cannot see it, so it violates layering rules, misses call sites, and re-introduces previously fixed bugs.

Why scaling won't fix it (high confidence on the mechanism). "Context rot" is real and architectural: even with million-token windows, model reliability degrades as input grows, with a performance knee commonly observed around 16,000–32,000 tokens and a well-documented "lost-in-the-middle" effect where information buried mid-context is attended to least. Critically, more context can reduce quality — effective usable context is far shorter than the advertised window. This is a gradual degradation, not a hard cliff, and targeted retrieval and fine-tuning can push it back — hence "high confidence on the mechanism" rather than on permanence. The tell is in the changelogs: every major platform is grinding on exactly this — v0's diff view and branch-per-chat, Bolt's context-management push, Replit's per-project "skills" — which is the industry implicitly admitting it is unsolved.

State of best-in-class today. Diff-based editing, branch-per-chat workflows, scoped retrieval, and per-project instruction files (airules.md-style) that re-inject conventions.

Bar to compete. Best-in-class context engineering plus deterministic regression gates and scoped, bounded-blast-radius edits. You compete by making the harness preserve invariants the model cannot hold in its head — not by waiting for a bigger context window.
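One way to picture a "scoped, bounded-blast-radius edit" is a pre-merge check that rejects any change touching files outside a declared scope or exceeding a file-count limit. The sketch below is a minimal illustration under our own assumptions: the function name, the glob-based scope, and the limit are hypothetical, and none of the platforms named above is known to work this way.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def blast_radius_violations(changed_files: Iterable[str],
                            allowed_globs: Iterable[str],
                            max_files: int) -> List[str]:
    """Return human-readable reasons to reject an edit; empty means in scope.

    Note: fnmatch-style "*" also matches "/" so "src/checkout/*"
    covers nested directories as well.
    """
    files = sorted(set(changed_files))
    patterns = list(allowed_globs)
    problems = [
        f"out of scope: {path}"
        for path in files
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]
    if len(files) > max_files:
        problems.append(f"too many files: {len(files)} > {max_files}")
    return problems


if __name__ == "__main__":
    # Hypothetical task: "fix the cart total", scoped to the checkout module.
    changed = ["src/checkout/cart.ts", "src/auth/session.ts"]
    print(blast_radius_violations(changed, ["src/checkout/*"], max_files=5))
```

The point of the design is that the scope is declared before the model edits anything, so the check is deterministic and does not depend on the model's own account of what it changed.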

Who it bites hardest. Both personas — but Persona A cannot rescue a botched multi-file edit, while Persona B can manually intervene.

2. Long-Horizon Multi-File Coordination and Global Invariants

What it is. Coordinated change across many files — a refactor that must keep cross-file invariants intact (security boundaries, API backward-compatibility, performance contracts) — exceeds what current agents can do reliably and autonomously.

Why it's hard. The agent has no typed, checked internal graph of the program; it approximates control and data flow from snippets of text. It will update a function signature and miss a caller, or apply subtly different patterns across modules — maintaining local syntactic validity while breaking a global invariant nobody wrote down.

Why scaling won't fix it (medium-high confidence). METR's measurements quantify the ceiling: frontier agents complete software tasks of roughly 50 minutes of human-equivalent effort only about half the time (the 50% success horizon), success drops below 10% on tasks exceeding ~4 hours, and the 80% success horizon is roughly 5× shorter than the 50% horizon. The honest caveat, and the reason this is medium-high rather than high confidence, is that this horizon has been doubling roughly every seven months, so it is today's measured ceiling on a fast-moving trend, not a permanent wall. METR itself notes that measurements above 16 hours are currently unreliable. What looks fundamental is the coherence limit: the UltraHorizon benchmark finds agents consistently underperform humans on tasks requiring sustained planning and memory, and scaling has not closed that gap. The bounded horizon will lengthen; the structural fragility of long autonomous chains is the durable part.
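To see what a seven-month doubling implies, the arithmetic can be written out. This is a naive extrapolation and therefore a projection, not a measurement: it assumes the doubling rate stays constant, which nothing guarantees, and it says nothing about the coherence limit. The function names are our own; the 50-minute starting point and seven-month doubling period are the figures quoted above.

```python
import math


def projected_horizon_minutes(current_minutes: float,
                              months_ahead: float,
                              doubling_months: float = 7.0) -> float:
    """Naive projection: the horizon doubles every `doubling_months` months."""
    return current_minutes * 2.0 ** (months_ahead / doubling_months)


def months_until(target_minutes: float,
                 current_minutes: float,
                 doubling_months: float = 7.0) -> float:
    """Months until the projected horizon reaches `target_minutes`."""
    if target_minutes <= 0 or current_minutes <= 0:
        raise ValueError("horizons must be positive")
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    # PROJECTION ONLY: assumes the historical doubling rate continues unchanged.
    print(f"{months_until(240, 50):.1f} months for a 50-minute horizon to reach 4 hours")
```

Even if the trend holds, this describes the 50% success horizon, and the 80% horizon quoted above trails it by roughly 5×, so the point at which long tasks become dependable arrives later than the headline number suggests.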

State of best-in-class today. DAG-style decomposition into small, independently verifiable sub-tasks; verifier agents after each step; hard test/type-check gates; narrowly scoped pull requests.

Bar to compete. Architect for bounded blast radius and deterministic verification between steps. The winning systems treat the model as a strong local-change engine inside a scaffold that enforces global consistency — they do not ask it to hold the whole system in its head.
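The "deterministic verification between steps" pattern can be sketched as a small runner that executes sub-tasks in dependency order and stops at the first failed gate. This is a minimal sketch under stated assumptions: the step names, the `run_plan` function, and the `verify` callback are hypothetical stand-ins for real edits and a real test or type-check run. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

Step = Callable[[], None]
Check = Callable[[], bool]


def run_plan(dependencies: Dict[str, Set[str]],
             steps: Dict[str, Step],
             verify: Check) -> List[str]:
    """Run steps in dependency order; halt at the first failed verification.

    `dependencies` maps each step name to the set of steps it depends on.
    `verify` stands in for a deterministic gate (tests, type check).
    """
    completed: List[str] = []
    for name in TopologicalSorter(dependencies).static_order():
        steps[name]()
        if not verify():
            raise RuntimeError(
                f"verification failed after step {name!r}; "
                f"verified before it: {completed}"
            )
        completed.append(name)
    return completed


if __name__ == "__main__":
    log: List[str] = []
    # Hypothetical refactor split into three small, ordered sub-tasks.
    deps = {"update_callers": {"change_signature"},
            "update_docs": {"update_callers"}}
    steps = {name: (lambda n=name: log.append(n))
             for name in ("change_signature", "update_callers", "update_docs")}
    print(run_plan(deps, steps, verify=lambda: True))
```

Because the gate runs after every step, a failure is attributed to one small change instead of surfacing at the end of a long autonomous chain.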

Who it bites hardest. Both personas; the larger and older the codebase, the worse it gets.

3. Design Taste and Brand Homogenization

What it is. AI can produce plausible layouts but not reliably distinctive, on-brand ones — and at population scale, AI-generated sites are converging on a shared, generic aesthetic.

Why it's hard. Taste and brand are non-local emergent properties — thousands of micro-decisions informed by market positioning and tacit knowledge, much of it never written down. Models are trained on the open internet, not your brand book, and they sample from the densest region of that distribution: the median SaaS look. Shared component libraries (Tailwind, shadcn-style systems) baked into these tools reinforce the convergence.

Why scaling won't fix it (high confidence for originating taste; capability gap for enforcing a system). A 2026 study by Hintze, Proschinger Åström, and Schossau found that generative systems used autonomously and repeatedly collapse toward a narrow band of generic output — "visual elevator music," polished yet devoid of meaning. Adobe's 2025 design guidance is explicit that as AI scales output, most of it trends toward sameness, and "taste remains the true differentiator" — the judgment of "what to leave out… when something feels right versus just looks right." Google's own developer discussion names the failure mode "brand drift": ask for a "professional banner" and you get generic-corporate, because the model conflates your specific brand with the internet average. Originating taste is judgment-bound and has no oracle (high confidence it won't yield to scale); enforcing a pre-defined design system is a tractable capability problem.

State of best-in-class today. Design-Systems-Agent approaches (Bolt) that force the model to compose from your existing tokens and components rather than invent — constraining the output into a brand rather than generating one.

Bar to compete. Own the design-system-enforcement layer and keep human taste in the loop. Do not bet on autonomous taste — bet on making it trivial to express and enforce a human-defined system.
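Enforcing a human-defined system, as opposed to originating taste, is checkable. As one narrow illustration, the sketch below flags hex colors in generated CSS that are not in an approved palette. It is a deliberately crude assumption-laden example: the function name and palette are hypothetical, it only recognizes 3- and 6-digit hex literals (not `rgb()`, named colors, or CSS variables), and an ID selector made only of hex digits would be flagged as a false positive.

```python
import re
from typing import List, Set

# Matches 3- or 6-digit hex literals such as #fff or #0a0a0a.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def off_palette_colors(css: str, allowed: Set[str]) -> List[str]:
    """Return hex colors used in `css` that are not in the approved palette.

    Comparison is case-insensitive; short and long forms are NOT unified,
    so "#fff" and "#ffffff" count as different values in this sketch.
    """
    allowed_normalized = {color.lower() for color in allowed}
    found = {match.group(0).lower() for match in HEX_COLOR.finditer(css)}
    return sorted(found - allowed_normalized)


if __name__ == "__main__":
    # Hypothetical brand palette and generated stylesheet fragment.
    palette = {"#ff5500", "#0a0a0a"}
    generated = ".btn { color: #FF5500; background: #0a0a0a; border-color: #123456; }"
    print(off_palette_colors(generated, palette))
```

A check like this says nothing about whether the palette itself is good; that judgment stays with the human who defined it.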

Who it bites hardest. Both personas equally — taste is non-local regardless of operator skill.

4. The Polished-UI Trust Inversion

What it is. Better-looking AI output makes the trust problem worse, not better — a genuine inversion of how quality normally works.

Why it's hard. Humans evolved visual-review heuristics: confident, clean, idiomatic-looking work earns less scrutiny. AI generation produces exactly that surface polish decoupled from underlying correctness, so it defeats the very heuristic reviewers rely on. The polish is real; the substance may be absent.

Why scaling won't fix it (high confidence: it's a human-perception property). This is not a model capability that improves with scale; it is a property of the human reviewer. The evidence: developer surveys report that 58% of respondents trust AI output without testing it, and analyses note reviewers "under-scrutinize confident, idiomatic-looking code." The "polish trap" is well documented in practice: a beautiful, demo-ready front end misleads a team into thinking the app is near-done while the entire backend and data layer are missing. As generators get better at surface polish, the inversion gets stronger, which is why scale works against us here.

State of best-in-class today. Forcing function: harnesses that surface what's not done (missing tests, unhandled states, absent auth) rather than just showing the pretty preview; plan-approval steps that itemize scope before building.

Bar to compete. Build for "trust calibration" — make the harness expose the gap between how finished the output looks and how finished it is. The differentiator is honesty surfaces, not prettier previews.

Who it bites hardest. Persona A hardest (no ability to look past the polish), but it also bites Persona B via under-scrutiny of idiomatic-looking code.

5. Security: The Logic/Authz Verification Gap and Insecure-by-Default at Scale

What it is. Security here has two faces. One is fundamental: there is no complete oracle for business-logic and authorization correctness, so the most dangerous flaws evade automated detection. The other is the old set of bug classes (injection, broken auth, exposed secrets) made newly hard by AI's volume, its non-technical operators, and the conversational loop that skips traditional security gates.

Why it's hard. Logic and privilege-escalation flaws are context-dependent and only manifest at runtime with real data — static scanners can't catch them, and there's no automated ground truth for "is this authorization rule correct." Meanwhile, the velocity of generation overwhelms review, and in vibe-coding workflows the live application effectively becomes the security test environment.

Why scaling won't fix it (high confidence for the verification gap). Veracode's flat-since-2023 ~55% security pass rate is the anchor: 45% of AI-generated code introduces an OWASP Top 10 vulnerability, and this is "stubbornly flat" across model generations. Specific failure rates are severe — cross-site scripting defended in only ~14–15% of cases (≈85–86% failure), log injection ~12–13% pass (≈87–88% failure). At scale, Escape.tech's study of 5,600+ vibe-coded applications found 2,038 highly critical vulnerabilities, 400+ exposed secrets, and 175 instances of exposed PII. (A separate, frequently-cited figure that ~62% of AI-generated solutions contain flaws comes from a different source and should not be attributed to the Escape.tech study.) The bug classes are a capability problem that better defaults can improve; the authorization-correctness oracle gap is fundamental.

State of best-in-class today. Secure-by-default scaffolding, automatic package-vetting (Replit's Package Firewall), and — for Persona B — SAST/SCA gates in CI, when they aren't bypassed.

Bar to compete. Secure-by-default generation plus a verification layer that routes authorization and business-logic correctness to deterministic checks and human review, because that part cannot be fully automated. Treating "the model got bigger" as a security improvement is a category error.
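Routing authorization correctness to "deterministic checks and human review" can look like the following: a human writes the expected access table, and the harness runs the generated policy against it. This does not close the oracle gap, because the table itself is the human judgment; the code only makes that judgment executable and repeatable. The policy, roles, and function names below are hypothetical, and the example policy contains a deliberate bug.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str, bool]  # (role, action, expected_allowed)


def authz_mismatches(is_allowed: Callable[[str, str], bool],
                     expected: List[Case]) -> List[Case]:
    """Return every case where the policy disagrees with the human-written table."""
    return [(role, action, want)
            for role, action, want in expected
            if is_allowed(role, action) != want]


def example_policy(role: str, action: str) -> bool:
    # Hypothetical generated policy with a deliberate bug:
    # editors have been granted "delete_user".
    permissions = {
        "admin": {"read", "write", "delete_user"},
        "editor": {"read", "write", "delete_user"},
        "viewer": {"read"},
    }
    return action in permissions.get(role, set())


# Human-authored expectations: this table is the oracle, not the code.
EXPECTED: List[Case] = [
    ("admin", "delete_user", True),
    ("editor", "write", True),
    ("editor", "delete_user", False),
    ("viewer", "write", False),
    ("anonymous", "read", False),
]

if __name__ == "__main__":
    for role, action, want in authz_mismatches(example_policy, EXPECTED):
        print(f"MISMATCH: {role} / {action}: expected allowed={want}")
```

Cases missing from the table are never checked, which is exactly where human review still has to carry the load.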

Who it bites hardest. Persona A catastrophically (no CI, live app is the test environment); Persona B moderately (their gates help — if they don't skip them under pressure).

Bucket 2 — Non-Agentic-but-Newly-Hard Problems

These problems are not LLM-internal, but AI generation has made them qualitatively harder — through volume, non-technical operators, or new downstream consumers.

6. Accessibility-by-Default and the 43% Human-Judgment Residual

What it is. AI-generated interfaces ship inaccessible by default, and the hardest portion of accessibility cannot be automated at all.

Why it's hard. Accessibility requires correct semantics, focus management, keyboard navigation, meaningful alt text, and judgments about whether an experience actually works for a person using assistive technology — much of which depends on meaning and context, not pattern. AI tools default to non-semantic "div soup" unless explicitly instructed otherwise.

Why scaling won't fix it (partially fundamental). Deque's research is the anchor: automated tooling catches 57.38% of accessibility issues; the remaining ~43% require human judgment and are not automatable by a better model — that residual is fundamental. The automatable 57% is a capability problem, but in practice it stays unsolved because AI generates inaccessible markup by default; benchmarking shows a single sentence in the prompt about WCAG/ARIA meaningfully improves output, but that's fragile and easily omitted, especially by a non-technical operator who doesn't know to ask.

State of best-in-class today. AI-assisted triage tools plus mandatory human review for critical flows; accessibility rules injected into generation defaults.

Bar to compete. Make accessibility a non-optional default of the generator (not a prompt the user must remember), and build the human-review workflow for the irreducible 43%.
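The split between the automatable share and the human-judgment residual shows up even in a toy linter. The sketch below, built on the standard library's `html.parser`, flags two mechanical problems: images with no `alt` attribute and clickable `div`/`span` elements lacking `role` or `tabindex`. It is an illustrative assumption of what a generation-time default check might look like, not a WCAG-complete audit, and the class and function names are our own.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class A11yLint(HTMLParser):
    """Flags a few mechanically detectable accessibility problems."""

    def __init__(self) -> None:
        super().__init__()
        self.issues: List[Tuple[int, str]] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        line, _ = self.getpos()
        # A missing alt attribute is detectable; alt="" is valid for
        # decorative images, so only absence is flagged.
        if tag == "img" and "alt" not in attr:
            self.issues.append((line, "img without alt attribute"))
        # Clickable non-interactive elements need a role and keyboard focus.
        if tag in ("div", "span") and "onclick" in attr:
            if "role" not in attr or "tabindex" not in attr:
                self.issues.append(
                    (line, f"clickable <{tag}> lacks role and/or tabindex"))


def lint_html(markup: str) -> List[Tuple[int, str]]:
    parser = A11yLint()
    parser.feed(markup)
    parser.close()
    return parser.issues


if __name__ == "__main__":
    sample = '<img src="hero.png">\n<div onclick="buy()">Buy</div>\n<button>Buy</button>'
    for line, message in lint_html(sample):
        print(f"line {line}: {message}")
```

The linter can tell that `alt` is present; it cannot tell whether the text is meaningful for the person hearing it read aloud. That second question is the part that stays with a human reviewer.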

Who it bites hardest. Both personas — and end-users with disabilities hardest of all.

7. Performance, SEO, and Core Web Vitals — and the GEO Inversion

What it is. AI-generated frontends routinely ship bloated, JavaScript-heavy, non-semantic markup that harms performance and search — and, in a new twist, harms their legibility to the very AI agents that increasingly consume the web.

Why it's hard. Generators favor heavy SPAs and deeply nested "div soup" even when a lean static page would serve better, producing render-blocking scripts, poor Largest Contentful Paint and Interaction to Next Paint, and content that is invisible until client-side rendering completes.

Why scaling won't fix it (capability-leaning, but unsolved in practice). This is more tractable than the fundamental gaps — better defaults and constraints can fix most of it — but it remains unsolved in practice today, and the qualitatively new dimension is the hook: AI-generated frontends violate the very machine-readability the agentic web now demands. Google has published an official "Optimizing your website for generative AI features" guide, and a distinct discipline — Generative Engine Optimization — has emerged around making sites legible to LLMs and AI search. JavaScript-only SPAs that produce "invisible content," div-soup that "removes any chance of the agent grasping page hierarchy," and synthetic-Lighthouse-score-chasing that masks poor real-world Core Web Vitals all push in the opposite direction. The irony is sharp: the same tools generating the agentic future are producing output that future can't read.

State of best-in-class today. Performance budgets and semantic-HTML defaults in generation; server-side rendering by default; structured-data emission.

Bar to compete. Generate lean, semantic, server-visible HTML with structured data by default, and treat machine-readability as a first-class output requirement — not an afterthought a user must request.
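"Server-visible" can be tested crudely: parse the raw HTML response, without executing any JavaScript, and count the words a non-rendering consumer would actually see. The sketch below is a rough heuristic under our own assumptions (the names are hypothetical, and real crawlers and agents vary in whether and how they render scripts), but it separates an empty SPA shell from server-rendered content.

```python
from html.parser import HTMLParser
from typing import List


class _VisibleText(HTMLParser):
    """Collects text present in the raw HTML, outside script/style/template."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.words: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words.extend(data.split())


def server_visible_word_count(markup: str) -> int:
    """Count words readable without running JavaScript (crude heuristic)."""
    parser = _VisibleText()
    parser.feed(markup)
    parser.close()
    return len(parser.words)


if __name__ == "__main__":
    spa_shell = '<html><body><div id="root"></div><script>render()</script></body></html>'
    rendered = "<main><h1>Pricing</h1><p>Three plans available.</p></main>"
    print(server_visible_word_count(spa_shell), server_visible_word_count(rendered))
```

A harness could fail a build when this count falls below a threshold for pages that are supposed to carry content; the threshold itself would be a product decision.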

Who it bites hardest. Both personas, plus the site's own discoverability in both classic and AI-mediated search.

8. The Metered-Iteration Cost Economics

What it is. The economics of AI generation are structurally backloaded: the hardest last 20% of a build — debugging, edge cases, integration — consumes a wildly disproportionate share of cost, and it's unpredictable.

Why it's hard (and an honest caveat: this is not a fundamental problem — it is structurally unsolved at the product/pricing layer). Agents re-read large slices of context on every loop, so cost scales with session length and difficulty, not with the size of the change. A 2026 analysis (DigitalApplied) found per-task cost varies by roughly 100× across tools, driven mainly by loop count — a 10-loop refactor can cost ~40× a 2-loop fix on the same model. Because you can't know in advance how many loops a hard bug needs, the cost of "finish this feature" is unknowable upfront, and metered/credit pricing makes the bill volatile exactly when the work is hardest. (Specific per-tool plan prices are illustrative and move constantly; the structural pattern is the durable point.) DX's 2026 reporting adds that AI's upfront speed is often offset by downstream review and remediation, yielding near-net-zero productivity once quality is accounted for.
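A toy model shows why cost tracks loop count more than change size. Assume, for illustration only, that each loop re-reads a fixed base context plus everything accumulated in earlier loops, and that accumulation grows by a constant amount per loop. The function name and token figures are hypothetical and are not taken from the DigitalApplied analysis.

```python
def session_input_tokens(loops: int, base_context: int, growth_per_loop: int) -> int:
    """Total input tokens re-read across a session.

    Toy model: loop i (0-based) re-reads the base context plus
    i * growth_per_loop tokens of accumulated history.
    """
    if loops < 0 or base_context < 0 or growth_per_loop < 0:
        raise ValueError("arguments must be non-negative")
    return sum(base_context + growth_per_loop * i for i in range(loops))


if __name__ == "__main__":
    # Illustrative numbers only.
    short_fix = session_input_tokens(loops=2, base_context=2_000, growth_per_loop=8_000)
    long_refactor = session_input_tokens(loops=10, base_context=2_000, growth_per_loop=8_000)
    print(short_fix, long_refactor, f"{long_refactor / short_fix:.1f}x")
```

In this model the 10-loop to 2-loop cost ratio is (10b + 45g) / (2b + g) for base context b and growth g, which ranges from 5× when history is negligible to 45× when accumulated history dominates. Loop count multiplies cost far more than linearly once the session's own history becomes the bulk of what is re-read, which is the case for compaction and caching.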

State of best-in-class today. Plan-approval before execution (Replit Agent 4), model-routing to cheaper models for easy steps (Bolt), context compaction to limit per-loop token burn.

Bar to compete. Bound and cache iterative reasoning, price on successful outcomes rather than raw loops, and give targeted last-mile debugging instead of full-context reprocessing. The economics — not just the engineering — are a competitive surface.

Who it bites hardest. Both personas; Persona A often unknowingly, until the bill or the credit ceiling arrives mid-task.

The Frontier Is the Harness, Not the Model

Read the catalog and the platform changelogs together and a single conclusion emerges: the industry has implicitly conceded that the model alone will not solve these problems, and is now competing on the harness — the scaffolding, tools, verifier loops, and evaluation infrastructure wrapped around the model. This is the most important strategic shift for anyone setting a competitive bar.

The convergence is unmistakable. Vercel's "new v0" wraps the model in a production-mirroring sandbox, Git branch-per-chat, agentic plan/search/debug loops, and a diff editor. Replit Agent 4 adds parallel agents, plan-approval, the Package Firewall, and production alerts. Bolt v2 ships Cloud Code agents, a Design-Systems-Agent, automatic model-routing, and auto testing/refactor. Shopify pairs its AI with explicit "the merchant is responsible" disclaimers — an admission that the human is the verification layer. And the discourse has caught up: the framing "agent = model + harness" is now standard; Braintrust describes AI evaluation and observability as "the CI/CD of probabilistic systems"; Red Hat advocates "eval-driven development" as a first-class loop; Anthropic's 2026 Agentic Coding Trends Report prioritizes multi-agent coordination and AI-automated review. The recurring line across the stack analyses — "eval infrastructure beats shipping first" — is the thesis in five words.

The harness-capability checklist (which doubles as a competitive benchmark — if you're building to compete, this is the bar):

  • Forces latent oracles to fire: auto-generates and runs tests, gates merges on SAST/SCA, runs automated accessibility and performance checks rather than trusting the model.
  • Bounds blast radius: scoped edits, branch-per-task, deterministic regression gates before changes land.
  • Manages context deliberately: retrieval over context-dumping, compaction, persistent project/brand state, per-project convention files.
  • Orchestrates rather than parallelizes naively: centralized coordination (the 4.4× path) over independent multi-agent sprawl (the 17.2× path).
  • Inserts verifier/critic steps with deterministic pass/fail, not the model's self-opinion.
  • Surfaces honesty: exposes what's not done (missing tests, unhandled states, absent auth) instead of only showing the polished preview.
  • Closes the eval loop: production signals feed back into evaluation as a compounding flywheel.
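Several items on the checklist reduce to one mechanism: run every deterministic check, gate on the combined result, and report everything that failed instead of stopping at the first problem. The sketch below shows that shape. It is a minimal illustration, not any vendor's harness; `CheckResult`, `gate`, and the two example checks are hypothetical placeholders for real test, SAST, accessibility, or performance runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def gate(checks: List[Callable[[], CheckResult]]) -> Tuple[bool, List[CheckResult]]:
    """Run every check (no short-circuit) and return (ship_ok, all_results).

    Running all checks keeps the report complete, so the operator
    sees everything that is not done, not just the first failure.
    """
    results = [check() for check in checks]
    return all(result.passed for result in results), results


# Hypothetical checks; a real harness would invoke actual tools here.
def tests_exist() -> CheckResult:
    return CheckResult("tests", passed=False, detail="no test files were generated")


def typecheck() -> CheckResult:
    return CheckResult("typecheck", passed=True)


if __name__ == "__main__":
    ship_ok, results = gate([tests_exist, typecheck])
    for result in results:
        status = "PASS" if result.passed else "FAIL"
        print(f"[{status}] {result.name} {result.detail}".rstrip())
    print("ship" if ship_ok else "blocked")
```

The pass/fail decision comes from the checks, never from asking the model whether its output is good, which is the distinction the verifier/critic item above draws.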

The steelman — and where it's right. The strongest case against this report's framing is that the harness convergence means these problems are being solved — just at the system level rather than the model level — so calling them "unsolved" is too pessimistic. This is partly correct, and worth conceding plainly: for the verifiable dimensions (functional correctness, performance, security defaults, the automatable share of accessibility), harness engineering is closing the gap faster than model-skeptics expect. A harness that refuses to ship code failing a SAST gate makes Veracode's flat 55% model-level statistic far less relevant at the system level. The latent oracles are, for these dimensions, an engineering to-do list the best harnesses are steadily checking off.

Where the steelman fails. Three things the evidence won't let the optimist have. First, the Self-Improvement Paradox shows the harness cannot iterate its way to quality without an oracle. For design taste, UX intent, and authorization correctness, the oracle simply does not exist, so the harness can at best route to a human, not solve it. Second, those judgment-bound dimensions are not an engineering backlog; they are human-judgment-bound by definition, and no amount of scaffolding manufactures taste or a correctness check that has no ground truth behind it. Third, the harness convergence is itself the evidence that the model didn't solve these, which is precisely the report's point, restated. The optimist is right that the bar is movable where oracles can be built; wrong that it's movable everywhere.

Forward-Looking: What's Durable, What's Soon-Solved, and Where the Whitespace Is

Problem | Classification | Confidence | Likely trajectory
--- | --- | --- | ---
Verification meta-problem | Fundamental | High | Persists; harness mitigates verifiable dimensions, judgment-bound dimensions remain human-bound
Self-improvement paradox | Fundamental | High | Persists; mitigated only by oracles + orchestration, not by more iteration
Generate-vs-edit / context rot | Fundamental mechanism, capability mitigations | Medium-high | Eases with retrieval/harness; the coherence limit endures
Long-horizon coordination | Fundamental coherence limit | Medium-high | Horizon lengthens (~7-mo doubling) but long-chain fragility persists
Design taste / homogenization | Fundamental (originating) | High | Originating taste stays human; enforcing a system becomes routine
Polished-UI trust inversion | Fundamental (human-perception) | High | Worsens as polish improves; mitigated only by honesty surfaces
Security: authz/logic gap | Fundamental | High | Defaults improve; the authorization-correctness oracle gap endures
Accessibility 43% residual | Partially fundamental | High | Automatable 57% gets solved-by-default; ~43% stays human
Performance / SEO / GEO | Capability | Medium | Likely solved-by-default within a few harness generations
Metered-iteration economics | Product/pricing, structurally unsolved | Medium | Solvable with outcome-pricing + caching; not a deep barrier

The whitespace: the website's machine-readable surface. The most important forward-looking shift is that the consumer of a website is increasingly another AI agent, which redefines the design target. The Model Context Protocol has moved from an Anthropic spec to cross-vendor infrastructure, donated to the Linux Foundation's Agentic AI Foundation, with roughly 9,600 servers in the public registry. Adoption should be read soberly, though: a Stacklok survey puts production use at about 41% of surveyed organizations, and the previously circulated 78% figure was debunked. Shopify now ships Storefront MCP on-by-default for Plus stores created after March 2026, plus "MCP UI" so agents can return interactive components. A checkout-protocol race is underway: ACP (OpenAI + Stripe), AP2 and UCP (Google), x402 (Coinbase's crypto-settlement layer), with PayPal slated to auto-support ACP in 2026.

This points to a genuinely unsolved and barely-named design problem: generating for a dual surface — a human UX layer and an agent-legible layer (structured feeds, MCP endpoints, machine-readable semantics). The market framing is large but should be read as projection, not fact: eMarketer projects AI platforms will account for roughly $20.9B, about 1.5% of retail spending, in 2026, and Bain/McKinsey-style analyses project that 15–25% of e-commerce, or $3–5 trillion of global retail, could flow through agentic channels by 2030. Whether or not those projections land, the design problem is real today: almost no tool generates a first-class machine-readable surface by default, even as AI-generated frontends actively degrade that surface with div-soup (see Problem #7). This is the clearest early-mover opportunity in the space.
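As one small illustration of a dual surface, the sketch below renders the same product twice: once as human-facing HTML and once as a schema.org `Product` block in JSON-LD that a crawler or agent can parse. Choosing JSON-LD is an assumption made for the example (the report names structured feeds and MCP endpoints as other options), and `render_dual_surface` is a hypothetical helper, not an API from any tool discussed here.

```python
import html
import json


def render_dual_surface(name: str, price: str, currency: str) -> str:
    """Render one product as human-facing HTML plus a schema.org JSON-LD block."""
    structured = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {"@type": "Offer", "price": price, "priceCurrency": currency},
    }
    # Escape "<" so the JSON payload can never close the script element early.
    json_ld = json.dumps(structured).replace("<", "\\u003c")
    return (
        f"<article><h2>{html.escape(name)}</h2>"
        f"<p>{html.escape(price)} {html.escape(currency)}</p></article>\n"
        f'<script type="application/ld+json">{json_ld}</script>'
    )


if __name__ == "__main__":
    print(render_dual_surface("Desk Lamp", "49.00", "USD"))
```

Both surfaces are derived from the same data in one function, which is the point: the agent-legible layer is generated alongside the human layer rather than bolted on afterwards.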

Where to Set Your Benchmark

This is the decision-grade synthesis. For each problem: where the best-in-class sits today, what competing with the best concretely requires, and — most importantly — whether the bar is movable by engineering or structurally capped.

| Problem | Best-in-class today | Bar to compete | Movable or Capped |
|---|---|---|---|
| Verification meta-problem | Harness-enforced test/SAST/a11y gates; eval-driven development | Own eval infrastructure as a moat, not just a good model call | Movable (verifiable) / Capped (judgment) |
| Self-improvement paradox | Orchestrated agents + verifier loops; bounded iteration | Centralized orchestration + deterministic verifiers; never naive multi-agent | Movable (architecture) |
| Generate-vs-edit / context rot | v0 diff + branch-per-chat; Bolt context-mgmt; Replit skills | Best-in-class context engineering + scoped edits + regression gates | Movable, hard |
| Long-horizon coordination | DAG decomposition + verifier agents + hard gates | Bounded blast radius + deterministic inter-step verification | Movable, hard (coherence capped) |
| Design taste / brand | Bolt Design-Systems-Agent (compose, don't invent) | Own design-system enforcement + human taste in the loop | Capped (best: human-in-loop) |
| Polished-UI trust inversion | Plan-approval; "what's not done" surfaces | Trust-calibration / honesty surfaces, not prettier previews | Capped (human perception) |
| Security: authz/logic | Secure-by-default scaffolding; Package Firewall; CI gates | Secure defaults + human/deterministic authz verification | Movable (defaults) / Capped (authz) |
| Accessibility | AI-assisted triage + mandatory human review | Accessibility-by-default generation + workflow for the 43% | Movable (57%) / Capped (43%) |
| Performance / SEO / GEO | Perf budgets, SSR + structured-data defaults | Lean semantic HTML + machine-readability by default | Movable |
| Metered-iteration economics | Plan-approval; model-routing; context compaction | Outcome-pricing + caching + targeted last-mile debugging | Movable |
| Machine-readable surface (whitespace) | Shopify Storefront MCP on-by-default | Treat the agent-legible surface as a first-class design target | Open opportunity |

The close — set your bar by dimension type. The strategic instruction that falls out of this entire analysis is simple and durable:

  • For verifiable dimensions — functional correctness, performance, security defaults, the automatable share of accessibility — compete on harness and evaluation engineering. The models are commoditizing; the harness is where the moat is, and the bar is movable by teams who build the best scaffolding, the tightest verifier loops, and the strongest eval flywheel. This is where ambition should be highest.
  • For judgment-bound dimensions — design taste, brand fit, UX intent, and authorization/business-logic correctness — compete on the best human-in-the-loop workflow, not on autonomy. The oracle is missing and scaling will not supply it. The winning product here is not the one that removes the human; it is the one that makes the human's judgment fastest, cheapest, and most leveraged. Betting against the missing oracle is the most common — and most expensive — strategic error in this space.

The best in the business have already internalized this split. The bar to join them is to stop trying to make the model solve what only a verification system or a human can, and to build relentlessly where engineering actually moves the line.

Sources

Security & evaluation. Veracode, Spring 2026 GenAI Code Security Update — ~55% security pass rate flat since 2023; syntax >95%; XSS ~85–86% / log injection ~87–88% failure; "a larger model is not a security control." · Escape.tech, State of Security of Vibe-Coded Apps — 5,600+ apps; 2,038 highly critical vulnerabilities; 400+ exposed secrets; 175 PII instances. · "Security Degradation in Iterative AI Code Generation," IEEE/arXiv 2026 — +37.6% critical vulnerabilities after five iterations; 400 samples; four prompting strategies; none improved security. · Google Research, Towards a Science of Scaling Agent Systems — 17.2× error amplification (independent multi-agent) vs. 4.4× (centralized); 180 configurations; three model families. · CodeRabbit / Sonar / SmartBear 2026 developer-survey cluster — "looks correct but isn't"; 58% trust AI output without testing.

Accessibility. Deque — automated tooling identifies 57.38% of accessibility issues (2,000+ audits; 13,000+ pages; ~300,000 issues). · WebAIM 2026 predictions — AI assists but cannot replace human judgment for meaning, flow, and context.

Long-horizon & context. METR, Measuring AI Ability to Complete Long Tasks (+ Epoch tracking) — ~50-min 50% horizon; <10% on >4hr; 80% horizon ~5× shorter; ~7-month doubling; >16hr unreliable. · Chroma "Context Rot" / Together AI long-context analyses — gradual degradation; ~16–32K knee; lost-in-the-middle; effective context far below advertised window.

Design & homogenization. Hintze, Proschinger Åström & Schossau, 2026 — generative loops collapse to generic "visual elevator music." · Adobe, 2025 — "taste remains the true differentiator"; volumes of sameness. · Google developer discussion — "brand drift" from open-internet training.

Platforms & roadmaps. Lovable (changelog + April 2026 incident response); Vercel "the new v0"; Bolt v2 / Azure; Replit Agent 4 + June 2026 changelog; Shopify Summer '25 / Winter '26 editions; Google Firebase Studio (ex-Project IDX) & AI Studio "Build" (I/O 2026).

Agentic web & commerce. Model Context Protocol — Linux Foundation Agentic AI Foundation; ~9,600 registry servers; Stacklok ~41% production (78% figure debunked). · Shopify Storefront MCP (on-by-default for Plus post-March 2026) + MCP UI. · Checkout protocols — ACP (OpenAI + Stripe), AP2 (Google), UCP (Google), x402 (Coinbase); PayPal auto-ACP 2026. · eMarketer — projection: ~$20.9B / ~1.5% of retail, 2026. Bain/McKinsey-style — projection: 15–25% of e-commerce / $3–5T by 2030.

Harness & eval-as-moat. "Agent = model + harness" stack analyses; Braintrust ("CI/CD of probabilistic systems"); Red Hat ("eval-driven development"); Anthropic 2026 Agentic Coding Trends Report (multi-agent + AI-automated review).

All rights reserved. WBOS.