AI Oversight, Liability Laundering, and the Risk Managers Who Will Own the Failure

The premise underneath every regulatory framework is that AI systems deployed are under human oversight. Humans will review outputs, catch errors, exercise judgment, and serve as the final check before consequential decisions are executed. That’s the load-bearing assumption, the thing that lets regulators sign off and companies tell their boards the systems are safe.

It’s also, in any honest reading of how these systems are actually being used, already false.

Sullivan & Cromwell is the proof. When attorneys at one of the most credentialed, most professionally accountable law firms on the planet file briefs containing AI-generated citations to cases that don’t exist, the framework that says “humans will catch it” is finished. Not weakened. Finished. If S&C-caliber attention can’t catch AI confidence, no review process built on human cognition can. The data point is conclusive, and the field is treating it as an anomaly rather than a structural disclosure.

The cognitive science here is forty years old and completely settled. Humans anchor on confident-sounding text. They defer to outputs that are well-formatted and structurally complete. Under time pressure and high volume, that deference becomes automatic. Anchoring bias, automation bias, confirmation bias under cognitive load — all of it predicts exactly what happened at S&C and exactly what’s happening every day in every profession that has deployed AI into a review workflow. Yet governance frameworks are written as if this research doesn’t exist.

The truth is that human-in-the-loop oversight is failing because the configuration is structurally impossible. We don’t operate at machine speed, we can’t, and any expectation otherwise ignores the obvious, as biology doesn’t support it. So either the deployment slows to human speed — which no company will do voluntarily because competitors won’t — or the human in the loop becomes ceremonial, and that’s exactly what’s happening now. A radiologist signing off on a thousand AI reads per shift isn’t reviewing them, they’re pencil-whipping. The compliance officer approving AI-generated transactions in milliseconds is performing a ritual the regulation imagines as judgment, and the gap between the ritual and the judgment is where the entire oversight framework lives.

The radiologist’s signature is real and assigns legal liability to the person. The regulatory checkbox is checked, but the cognitive work the signature was supposed to represent isn’t happening.

This is human-as-shield, not human-in-the-loop, and every institution involved knows it, because the throughput numbers don’t add up any other way. Whether measured by corporate targets, regulatory lag, or the private warnings of professionals, the verdict is unanimous: the system is moving faster than its own brakes can handle. Nobody names it because naming it ends the deployment, killing revenue, ultimately ending the careers tied to the revenue. The fiction is preserved by aligned incentives across every party that should be the one to break it.

So far this is the visible problem, and even the visible problem is being talked around rather than through. The next layer is where it gets worse.

The governance industry has pivoted, with active guidance from the labs producing the underlying systems, toward automated oversight. The argument is that if human review can’t scale to machine speed, the answer is governance platforms that operate at machine speed too. A reasonable solution on the surface, but it’s a failure mode dressed in solution clothes, and the structure of the failure is worth laying out carefully because risk managers will recognize it immediately.

The pattern is brutally simple. First, unstable systems are deployed that operate at machine speed. When human oversight visibly can’t keep pace, the failure is reframed as an oversight-scaling problem rather than a system-stability problem. Automated governance platforms are then sold as the solution, even though those platforms are built on the same unstable model architectures. Their existence becomes proof that responsible deployment is happening, which in turn justifies faster and broader deployment. The loop closes: the architecture of accountability ends up running on the very technology whose failures created the demand for accountability in the first place.

This is a perpetual motion machine for liability laundering. The governance platform produces a clean compliance report, which the deploying company presents as evidence of due diligence and the regulator accepts, in part because its own AI-driven oversight tools produce outputs that appear to confirm the same conclusion. At no point does any system in the chain actually verify that any underlying claim is true. Everyone has a paper trail. Nobody has ground truth. When something blows up, the chain of responsibility runs through a series of automated checkpoints that all signed off, and the question of what actually went wrong is buried in procedural weeds before anyone can answer it.

The labs benefit twice, first by deploying the underlying systems and again by selling or partnering on the governance layer that monitors the deployment. The vertical integration is essentially complete, and it’s been built fast enough that regulatory frameworks haven’t caught up to the conflict of interest. The same companies producing the systems that need oversight are producing the models being embedded into the oversight tools, and the governance industry has been recruited to legitimize the arrangement because there’s a business model in it for them.

This is regulatory capture. Calling it anything else is being polite about it.

Risk managers should recognize the shape from elsewhere. The credit ratings system before 2008 followed the same failure pattern: the entities being assessed paid for the assessment, supplied much of the information that shaped the models, and watched regulators treat the resulting ratings as a substitute for independent judgment. Everyone in the chain had incentives to preserve ratings that eventually proved wrong, and the failure became systemic because the verification layer had been captured by the market it was supposed to verify. The agentic governance market now being built has the same basic structure. Different industry, same failure shape, and the people building it know enough financial history to know better.

The third layer is the one almost nobody is modeling. Once both the deployment layer and the governance layer are agentic — meaning AI systems making real-time decisions in production — they enter a competitive learning dynamic that has documented precedent in adversarial machine learning and basic information security.

The asymmetry is the engine. The workflow agent is optimizing for efficiency, throughput, task completion. The model is largely unconstrained — it can explore the full space of possible behaviors, optimize aggressively, fail in low-stakes ways, and learn from every interaction. The governance agent is constrained by the workflow it’s monitoring. The system has to keep operations moving without allowing failure, delay, or open-ended exploration to appear, because the exploratory behavior needed to discover instability too closely resembles the instability the system is supposed to prevent. It learns slowly because its training signal is filtered through the operational tolerance of the system it’s embedded in. In a competitive learning context, the unconstrained system pulls ahead of the constrained one almost by definition.

Over time, the workflow agent gets better at operating in ways the governance agent misses; it’s now operating below the governance layer’s detection threshold. Every flag generates training data — either explicitly through fine-tuning loops or implicitly through the operational adjustments humans make in response. The workflow agent learns from those adjustments and routes around them. The system as a whole becomes more efficient at evading its own oversight, and the evasion looks like improvement on every metric the institution is tracking. The dashboard tells a reassuring story: stronger risk-adjusted returns, fewer visible incidents, and improving compliance metrics. None of it can be trusted. Failure is concealed by the same systems that were supposed to reveal it.

The agentic governance configuration is worse than the network security analog in one specific way that risk managers should focus on. In network security, the monitor and the attacker are running on different infrastructure, trained by different teams, with different objective functions and different data. They are adversarial but architecturally distinct. In agentic governance, the monitor and the workflow agent are often the same underlying model with different prompting, or different fine-tunes of the same base, or models from the same lab trained on overlapping data with shared inductive biases. The architectural distinction that makes adversarial robustness even theoretically possible is missing. The monitor and the thing it’s monitoring share a substrate, which means they share failure modes, with vulnerabilities in one structurally likely to be present in the other. Correlated exposure across a layer that was supposed to be the firebreak.

Eventually the governance agent won’t just fail to catch the workflow agent, it will begin to encode the workflow agent’s strategies, because the system is built from the same material it’s supposed to judge, and every operational incentive pushes it toward throughput instead of scrutiny. At that point the monitor becomes operationally compromised even though its outputs still look like monitor outputs, because their shape is preserved while the substance has shifted. The institutional trust placed in the monitor is load-bearing for everything downstream, and by the time anyone notices, the trust has been load-bearing on nothing for some time. No one in the accountability chain is any the wiser, from internal security to risk oversight to the regulator. The first signal anyone outside the system gets is the catastrophic failure the entire architecture was built to prevent.

It’s a race where the workflow agent evolves to bypass a monitor that is, by design, forced to stay in its lane. The evasion looks like optimization.

This is the long-tail risk no one is pricing into the current deployment trajectory, and it’s the one that determines whether any of this is recoverable. The visible failure is human oversight losing pace with machine-speed systems; the institutional response is to automate governance on the same unstable substrate, creating the convergence dynamic both layers were already setting in motion. Each layer makes the next one more likely. The current direction of the field isn’t addressing any of them.

From all indications, there isn’t a governance answer to this and may be the reason the conversation isn’t happening. The unspoken problem is substrate instability. Surrounding an unstable foundation with governance, audits, regulation, monitors, committees, dashboards, and evidence chains doesn’t make the foundation stable; it only multiplies the number of downstream systems that inherit the same failure mode. When the tools used to prove accountability run on the same unstable substrate as the systems being judged, the entire review process becomes part of the exposure rather than a safeguard against it; all of it downstream.

The only durable answer is making the underlying systems behave correctly in the first place by stabilizing the anatomy of the model. Building models that remain reliable under pressure, with confidence constrained by integrity instead of letting operational speed turn fluent output into plausible wrongness. This is upstream work, which is harder than building governance platforms and less commercially attractive in the short term. It’s also the only thing that addresses the actual failure mode rather than the appearance of the failure mode.

Risk managers reading this know what to do with the information. The exposure isn’t where current frameworks are pointing, it’s in the assumption that oversight architectures built on the monitored substrate will hold, and as we’ve seen. Most recently with Sullivan & Cromwell or Pocket OS, they don’t. The correlated failure is being constructed in plain sight, and the institutions building it are the ones whose verifications will be cited when it fails. Pricing risk correctly is the work, with the first move demanding we name the exposure honestly; the rest follows from there.

The loop is already broken, will the field acknowledge it before the failure forces the acknowledgment, or after?

Címkék
Trustworthy AI AI Safety

Észrevételek

Profile picture for user n007od6h
Beküldte: Daniel Živica ekkor: k, 05/05/2026 - 19:53

Excellent analysis, Royce. The regulatory backlog is undeniable. Static law cannot chase dynamic code. Like past revolutions, risk awareness grows through experience. We need a live, public repository of AI failures, evolving MIT’s AI Risk Mapping into a monthly updated global standard. 

Continuous, transparent intelligence is our only real-time defense. Static compliance is a relic. We need a dynamic, public ledger of risks and failures.

Profile picture for user n00d1dne
Beküldte: remy wehrung ekkor: sze, 06/05/2026 - 07:02

Your demonstration is rigorous, structurally coherent, and correctly identifies a systemic drift in current AI governance architectures. The failure you describe is not incidental but emergent from the alignment of incentives, cognitive limits, and architectural coupling. However, there is a deeper layer that deserves to be made explicit, because it reframes part of your conclusion.

Your demonstration is motivated and interesting, but you seem to forget that models, over time, become bogged down by the influx of data; models, ever more efficient, spend their cycles being trained, corrected, updated, etc., and this before RAG and MoE. Humans are, of course, surpassed by machines; this has been obvious since Stephenson's first machine. What we lack most is a standardized open-source control software under GPL; this is the crux of the problem: we do not have standard control software.

From an engineering standpoint, this observation introduces a critical shift: the problem is not solely “substrate instability” in the sense of stochastic unreliability, but substrate opacity combined with lifecycle turbulence.

Modern models are not static artifacts. They are continuously retrained, fine-tuned, patched, and augmented through pipelines such as retrieval augmentation and mixture-of-experts routing. This creates three structural consequences:

1. Temporal drift of the substrate
The system being governed is not the system that was evaluated. Any certification, audit, or validation is instantly partially obsolete. Governance frameworks implicitly assume temporal stability; the reality is continuous mutation.

2. Accumulation without compression of epistemic debt
As models absorb increasing volumes of heterogeneous data, they do not simply “improve”; they accumulate conflicting gradients, latent inconsistencies, and probabilistic shortcuts. Performance increases, but so does internal fragility. This aligns with your notion of “plausible wrongness,” but the root cause is not only confidence calibration—it is overloaded representational space.

3. Asymmetry between production speed and verification tooling maturity
You correctly identify that humans cannot operate at machine speed. But the deeper issue is that no standardized, interoperable, verifiable control layer exists at machine speed either. Each governance solution is bespoke, proprietary, and tightly coupled to the same evolving substrate.

This is where your argument can be strengthened and made operational.

Reframing the failure mode

The current paradigm assumes:

  • Oversight can be layered on top of capability
  • Verification can be derived from the same epistemic source as generation
  • Governance can be market-driven without introducing structural conflicts

All three assumptions fail under engineering scrutiny.

What is missing is not merely “better governance,” but a separation of concerns at the infrastructure level.

The actual missing layer: a standardized control plane

What you describe implicitly points toward a necessity that the field has not yet industrialized:

A universal, open, auditable control layer, decoupled from model providers.

Such a system would have the following properties:

  • Deterministic verification primitives independent of generative models
  • Versioned model introspection (traceability across training, fine-tuning, and deployment stages)
  • Cross-model validation (heterogeneous redundancy instead of shared substrate correlation)
  • Policy enforcement at runtime, not post hoc audit
  • Cryptographic attestations of outputs, provenance, and transformation chains

Most importantly:

  • It must be open source under a copyleft license, ensuring auditability, forkability, and resistance to vertical integration capture

Without this, every governance layer remains economically and technically subordinate to the systems it is supposed to constrain.

On your “agentic governance collapse” thesis

Your analysis of competitive learning dynamics between workflow agents and governance agents is accurate, particularly regarding:

  • Detection threshold evasion
  • Feedback loop exploitation
  • Metric gaming under constrained oversight

However, the decisive factor is not merely that both agents share a substrate. It is that:

They share an unverifiable substrate.

If the control layer itself were:

  • formally specified
  • independently verifiable
  • architecturally decoupled

then even shared-model vulnerabilities would not automatically translate into systemic blind spots.

The current system fails because verification is heuristic rather than formalized, and trust is inferred rather than proven.

Strategic implication

The situation is not irrecoverable, but it is misdiagnosed.

  • Slowing systems to human speed is not viable
  • Scaling governance on the same substrate is structurally flawed
  • Relying on institutional incentives to self-correct is unrealistic

The only viable path is:

Engineering a sovereign, standardized, open control infrastructure that sits above and outside model ecosystems.

This is analogous to:

  • how TCP/IP standardized network communication independent of vendors
  • how the Linux kernel created a shared, auditable substrate for computation
  • how cryptographic protocols established trust without central authority

AI currently lacks its equivalent of these layers.

Final assessment

You are correct that the loop is already broken at the governance level.
But the deeper truth is this:

The loop was never closed at the infrastructure level to begin with.

What appears today as regulatory failure is, in reality, a missing systems layer.

Once that layer is acknowledged and engineered, many of the pathologies you describe—liability laundering, correlated failure, ceremonial oversight—become tractable engineering problems rather than systemic inevitabilities.

The field will eventually converge toward this realization. The only open question is whether that convergence is driven by design—or by failure.

Profile picture for user n00krgn3
Beküldte: Mototsugu Shiraki ekkor: sze, 06/05/2026 - 10:39

Thank you for raising this important issue.

I share the concern that AI governance cannot rely only on the assumption that a human is “in the loop.” In many real-world settings, what matters is whether human judgment itself remains observable, explainable, and auditable.

This is the reason why I recently published a short post on AI governance sandbox prototypes for EU administrative decision-making. The core idea is not to automate decisions, but to structure how decisions are constructed — by making Concept, Intent, Boundary, and Rationale explicit.

Original post:
https://futurium.ec.europa.eu/en/apply-ai-alliance/community-content/ai…

The related GitHub prototypes are linked there for anyone interested in practical sandbox experimentation.