Manifesto Documentation

The Agentic Engineering Manifesto

Principles for building systems where humans steer intent, agents execute within governed boundaries, and verified outcomes are the only measure that matters.

This is a living document. Agentic engineering is a fast-moving field, and this manifesto evolves continuously — informed by our own practices, what we witness in the field, and the new technologies, trends, and practices that emerge. Contributions are welcome.

Start Reading Open GitHub Repository

12principles

29linked sections

3named authors

Editorial Stance

Iterative steering and alignment

over

Rigid upfront specifications

Verified outcomes with auditable evidence

over

Fluent assertions of success

Right-sized agent collaboration

over

Monolithic god-agents

Curated, high-signal context and memory

over

Stateless sessions and noisy memory

Tooling, telemetry, and observability

over

Chat-based heroics

Resilience under stress

over

Performance in ideal conditions

The Agentic Loop

Specify→Design→Plan→Execute→Verify→Validate→Observe→Learn→Govern↻

Principles for building systems where humans steer intent, agents execute within governed boundaries, and verified outcomes are the primary measure that matters for agentic work.

The Agile Manifesto was written for a world where humans wrote all the code. That world no longer exists.

In agentic workflows, generation, verification, and deployment run at machine speed. Legacy ceremonies — sprint cadence, velocity scoring, manual review-first pipelines — become bottlenecks and blind spots. Early empirical evidence, including the SWE-CI benchmark showing regression rates above 75% per CI iteration across 18 models (arXiv:2603.03823), confirms that agentic systems require purpose-built engineering discipline, not retrofitted Agile ceremonies.

This repository provides a complete alternative: the case for change, the manifesto itself, a companion implementation guide, an organizational adoption playbook, and domain-specific regulatory alignment for six industries.

Read the published version →

The agentic governance stack

The Agentic Engineering Manifesto (AEM) is the root layer of a layered governance stack for agentic systems. The dependency direction is explicit:

Agentic Engineering Manifesto (AEM)
   ├─ Agentic SDLC (ASDLC) — engineering-side governance of agent-built code
   ├─ Agentic Product Lifecycle (APLC) — product-side governance of agent behavior
   ├─ Intelligence Governance Manifesto (IGM) — substrate that agents reason over
   └─ Agentic Enterprise Manifesto (AEnt-M) — enterprise coordination of multiple agents on a shared substrate
       ├─ depends on IGM (substrate)
       └─ inherits AEM principles

ASDLC, APLC, IGM, and AEnt-M extend AEM and inherit its twelve principles. Earlier framings of the relationship as "complementary" or "companion" are retired in favour of the layered model above. See agentic-governance-stack.md for the canonical one-page reference, the layer-by-layer scope, and the term-collision preface that resolves cross-stack vocabulary collisions (e.g. confidence, autonomy tier, governance, initiative).

Where to Start (Across the Stack)

The Agentic Engineering Manifesto, ASDLC, and APLC are a layered set, but each is independently adoptable. Pick the one that matches the pain you are feeling now — you do not need to adopt all three at once, and you do not need to read them in order.

If your pain is…	Start with…
An AI agent already in market or about to be, and you cannot describe its behavior, prove its drift, or govern foundation-model updates that change it without warning	APLC — the Agentic Product Lifecycle. Governs the agent product itself: behavioral specification, evaluation, drift, foundation-model update governance, regulated retirement.
Software delivery by teams using AI agents to write code, where the inner loop runs faster than your demand validation, release governance, or operational readiness can keep up	ASDLC — the Agentic Software Delivery Lifecycle. Governs the four-layer delivery lifecycle around agent-built software: demand, execution, release, operations.
Engineering practice itself — how humans steer intent, how agents execute within governed boundaries, what verified outcomes look like inside the inner loop	Manifesto — the Agentic Engineering Manifesto. Defines the inner engineering loop that both APLC Stage 3 and ASDLC Layer 2 reference.

Each framework is independently useful. Together they form a complete governance stack for organisations dealing with both agent-built software and agent products in market.

Six Values

We value more	over	We also value
Iterative steering and alignment		Rigid upfront specifications
Verified outcomes with auditable evidence		Fluent assertions of success
Right-sized agent collaboration		Monolithic god-agents
Curated, high-signal context and memory		Stateless sessions and noisy memory
Tooling, telemetry, and observability		Chat-based heroics
Resilience under stress		Performance in ideal conditions

While there is value in the items on the right, we value the items on the left more.

Twelve Principles

Outcomes are the unit of work
Specifications are living artifacts that evolve through steering
Architecture is defense-in-depth, not a document
Right-size the swarm to the task
Autonomy is a tiered budget, not a switch
Knowledge and memory are distinct infrastructure
Context is engineered like code
Evaluations are the contract; proofs are a scale strategy
Observability and interoperability cover reasoning, not just uptime
Assume emergence; engineer containment
Optimize the economics of intelligence
Accountability requires visibility

See full text in manifesto-principles.md.

The Agentic Loop

Specify → Design → Plan → Execute → Verify → Validate → Observe → Learn → Govern → Repeat

Any phase can trigger a return to an earlier one based on evidence. The loop is the system. The principles are how you keep it honest.

Who Is This For?

If you are	Start with
New to agentic engineering	Beyond Agile → The Manifesto → Adoption Playbook
A practitioner implementing now	Twelve Principles → Principle Guidance → Patterns → Adoption Path
An engineering leader or change owner	Beyond Agile Landscape → Adoption Roles → Metrics
In a regulated industry	Domain Overview → your domain document

Repository Map

1) Beyond Agile (Case for Change)

beyond-agile/main.md: The argument for why Agile is insufficient for agentic systems.
beyond-agile/failures.md: Ten structural failures in values, practices, and conceptual coverage.
beyond-agile/landscape.md: Critical comparison of competing manifestos, standards, and frameworks.
beyond-agile/sources.md: Twenty-three cited sources including academic benchmarks (SWE-CI, Feldt et al.), industry frameworks (AWS, P3 Group, ISO 5338), and practitioner perspectives.

2) The Manifesto (Normative Core)

manifesto.md: Core values, scope, Agentic Loop, and reading guide.
manifesto-principles.md: Twelve principles with minimum bars.
manifesto-done.md: Agentic Definition of Done (seven criteria plus evolvability) and Definition of Done for Hardening (vibe-to-prod path).
glossary.md: Canonical definitions for all terms used across the manifesto document set.

3) Implementation Guide

companion/principles.md: Extended guidance and tradeoffs by principle. Includes the Architect–Programmer pattern, evaluation holdout and probabilistic satisfaction, and behavioral vs. structural regression analysis.
companion/frameworks.md: Maturity spectrum, boundary conditions, and operational definitions.
companion/patterns.md: Worked patterns and failure patterns.
companion/re-framework.md: Requirements engineering framework — two-axes classification, behavioral envelopes, and probabilistic assurance targets.
companion/reference.md: Failure modes and skill requirements.

4) Adoption Playbook (Organizational Transition)

adoption/playbook.md: Playbook overview and new way of working.
adoption/roles.md: Role evolution and human-side transition guidance.
adoption/path.md: Incremental technical adoption path and phase transitions.
adoption/vmodel.md: V-model-specific adoption path for regulated and verification-heavy organizations.
adoption/pilot.md: Resistance management and first pilot execution.
adoption/metrics.md: Success metrics, quarterly review cadence, and failure modes.

5) Domain-Specific Regulatory Alignment

domains/README.md: Navigation and disclaimers.
domains/aviation.md: DO-178C, DO-330, DO-333, ARP 4754A.
domains/medical-devices.md: IEC 62304, ISO 14971, ISO 13485, FDA SaMD.
domains/pharma.md: GAMP 5, CSA, 21 CFR Part 11, ICH.
domains/financial-services.md: SR 11-7, DORA, EU AI Act, SOX, Three Lines of Defense.
domains/automotive.md: ISO 26262, SOTIF, ASPICE.
domains/defense-government.md: MIL-STD-882, DO-326A, NIST AI RMF.

This is a living document. Contributions are welcome — see CONTRIBUTING.md for guidelines on proposing changes, submitting worked patterns, or reporting issues. See AUTHORS.md for contributors. See LICENSE for terms.

The Agile Manifesto was written for a world where humans wrote all the code. That world no longer exists.

The Agile Manifesto Is Twenty-Five Years Old — and It Shows

In February 2001, seventeen software developers gathered at a ski lodge in Snowbird, Utah, and wrote a document that would reshape an industry. The Agile Manifesto was a rebellion against waterfall bureaucracy, and it won. Its four values — individuals and interactions, working software, customer collaboration, responding to change — liberated a generation of engineers from Gantt charts and heavyweight process.

But the Agile Manifesto was written with an unstated assumption so fundamental that nobody thought to say it aloud: humans write all the code.

Every practice built on the manifesto's assumptions — Scrum's two-week sprints, SAFe's velocity tracking, daily standups, story points, pair programming, retrospectives — is calibrated to the pace, cognition, and coordination needs of human teams. When autonomous agents can generate functional applications in hours, execute legacy migrations during a single flight, and run verification pipelines that dwarf what any human QA team could attempt in a quarter, these practices do not merely feel dated. They become structural liabilities ¹⁵.

Steve Jones, in his widely-circulated essay "AI Killed the Agile Manifesto," argues that the manifesto is "a great way to screw up in a big way when using Agentic SDLCs at scale" because "Agentic SDLCs are too fast for Agile" ¹. The P3 Group's From Sprints to Swarms white paper goes further, calling established agile frameworks "strategic liabilities" that throttle innovation when confronted with AI-driven workflows, and declaring the daily standup "an exercise in absurdity" when an AI orchestrator knows the precise status of every task at any given microsecond ⁵.

Not everyone agrees. Jon Kern, one of the original seventeen signatories, describes himself as "smitten" with vibe coding but insists the manifesto "will endure" — arguing that you "need to understand agility more than ever" and should "learn a little bit more about what constitutes the ability to create high-quality software at speed with responsibility" ¹⁰. Martin Fowler, hosting a 25th-anniversary workshop at Thoughtworks, said he doesn't "have a lot of time for manifestos" and that writing a new one is "way too early" — though the workshop itself concluded that test-driven development "has never been more important" and "produces dramatically better results from AI coding agents" ⁹¹⁶. According to InfoQ's summary of Forrester's 2025 State of Agile Development report (primary report not publicly available), 95% of surveyed professionals affirm Agile's critical relevance ¹⁴.

But relevance and sufficiency are not the same thing. A compass is relevant in a car, but it does not replace the steering wheel. The Agile Manifesto remains relevant as a philosophical compass. It is fundamentally insufficient as an operating system for agentic engineering.

Where Agile Breaks: Ten Structural Failures

The four Agile values challenged, the practices rendered obsolete, and the conceptual gaps — memory, non-determinism, self-improving systems — that Agile never addressed because it never needed to.

The Existing Manifestos: What They Get Right and What They Miss

A critical review of every competing framework: Casey West's Agentic Manifesto, the SASE Framework, the DEV Community manifesto, P3 Group's "From Sprints to Swarms," the AWS Prescriptive Guidance, and ISO/IEC 5338.

Sources

Sixty cited sources classified by type: press, blog, industry, academic, standard, and internal reference.

What Is Actually Needed: The Case for a New Agentic Engineering Manifesto

Every existing framework gets something right. None of them are complete. The gap is not philosophical — it is operational. The industry needs a manifesto that is simultaneously:

Philosophical — values that reframe priorities for a probabilistic world
Principled — concrete engineering principles with minimum bars
Operational — connected to real tooling and measurable outcomes
Evolutionary — a maturity spectrum, not a binary switch

The Agentic Engineering Manifesto — six core values, twelve principles mapped to operational tooling, an agentic definition of done, and a maturity spectrum from first adoption to recursive self-improvement — is one attempt to meet this standard. Whether it succeeds is for the engineering community to determine through practice, evidence, and iteration.

What recent agentic-AI research adds is a stronger explanation of why a new discipline is needed. If intelligence at frontier scale is increasingly plural, relational, and organized through internal or external societies of thought, then the engineering problem is no longer "how do we steer one smart assistant?" It becomes: how do we govern distributed cognition across agents, tools, and humans using explicit protocols, evidence, and institutional checks and balances? Likewise, if agents can improve by externalizing reusable skills and refining them through experience, then memory governance is no longer a nice- to-have optimization. It becomes part of the control plane for learning systems ⁵⁹⁶⁰.

The Urgency

Current industry forecasts and long-horizon maintenance benchmarks suggest that agentic delivery fails when teams force probabilistic systems through legacy SDLC workflows ⁸²². The signal is strongest for maintenance-heavy, multi-iteration work, not every software activity. Methodology mismatch is one failure mode among several; cost, governance, tool quality, and organizational incentives also matter.

The failure is not only technical — it is organizational. Enterprises are trying to adopt agentic technology using Agile governance structures designed for human teams. The Sprint Review is the governance checkpoint, but who reviews agent output at machine speed? The Scrum Master is the process guardian, but who governs a recursive feedback loop? The Product Owner is the requirements authority, but who owns the specification that constrains agent behavior across a swarm? These roles do not map to agentic engineering, and the P3 Group's observation that organizations face a choice between evolutionary and revolutionary adoption ⁵ understates the challenge: most organizations are attempting neither, clinging instead to methodologies calibrated for a world that no longer exists.

The industry is repeating the pattern it followed with every previous paradigm shift: rushing to adopt the new technology while clinging to the old methodology. The two-week sprint does not accommodate machine-speed execution. Story points do not measure probabilistic output. Human code review does not scale to agent-generated volume. Standups do not synchronize digital swarms. Early empirical evidence confirms the scale of the problem: the SWE-CI benchmark, testing 18 models across 100 tasks spanning an average of 233 days of real development history, found that most agents introduce at least one regression in three out of four CI iterations — many of them structural regressions that pass current tests but degrade the codebase's capacity for future change ²².

The question is not whether the Agile Manifesto needs a successor. The question is whether the successor will emerge from principled engineering or from the wreckage of failed projects.

The window is closing. Every month without a coherent engineering discipline for agentic systems is another month of "vibe coding" masquerading as engineering, another month of hallucination loops shipping to production, another month of technical debt accruing at machine speed. The organizations that adopt a rigorous agentic engineering manifesto now — with verified outcomes, governed autonomy, curated memory, defense-in-depth architecture, economics-aware routing, formal verification where risk warrants it, and human accountability at every tier — will define the next era of software. The rest risk becoming Gartner's statistic.

Exploration is a phase. Engineering is a discipline.

The four values challenged, the practices rendered obsolete, and the conceptual gaps Agile never addressed.

See Beyond Agile for the full argument. See the Existing Manifestos for what competing frameworks get right and miss. All references link to Sources.

The failures are not cosmetic. They are structural — rooted in assumptions that no longer hold. Some belong to the Agile Manifesto itself: its four values, written for a human-only world. Others belong to the practices that grew around it — Scrum's sprints, SAFe's velocity tracking, the ceremonies and metrics that became the operational expression of Agile but were never part of the original document. The distinction matters: the manifesto is a philosophical statement; the practices are an implementation. Both break, but for different reasons. These failures are failures of Agile in agentic systems; they are not a claim that Agile is obsolete for all human-led software work.

The Four Values — Challenged

1. "Individuals and interactions over processes and tools" — Inverted

This was Agile's most liberating principle: trust people, not bureaucracy. But in an agentic pipeline, the toolchain is the capability. The choice of orchestration platform, the choice of verification fleet, the choice of memory infrastructure — these are not implementation details. They are architectural decisions that determine what is possible. Using one tool versus another creates fundamentally different operational realities ¹.

In agentic systems, processes and tools are now fundamental to success, not obstacles to it. The human's role has shifted from writing code to architecting the environment in which agents write code. The Agile Manifesto's founding value has been inverted by the reality it never anticipated ¹⁶.

2. "Working software over comprehensive documentation" — Dangerous

This value was a corrective against waterfall's thousand-page specifications that nobody read. It made sense when humans wrote code deliberately and could explain their reasoning. It is actively dangerous when applied to autonomous agents.

In agentic systems, AI models excel at producing software that appears to work. Jones calls this out directly: "AI is spectacular at building software that looks like it works" but "can create technical debt at a rate that normal developers absolutely couldn't" ¹. The phenomenon — what Andrej Karpathy termed "vibe coding" ⁷ — generates code satisfying immediate tests while lacking modularity, architectural integrity, and scalability. Without documentation serving as the contractual boundary holding agents accountable, systems hallucinate from legacy training data and corrupt their own operational context ¹³.

In agentic engineering, documentation is the specification that constrains agent behavior. Architecture Decision Records, formal contracts, constraint files, capability definitions — these are not bureaucratic overhead. They are the machine-readable rules that prevent autonomous systems from optimizing for the wrong thing. The Agile Manifesto's suspicion of documentation becomes negligence when your workforce is probabilistic ³⁶.

3. "Customer collaboration over contract negotiation" — Reframed

Agile rightly elevated direct customer collaboration over adversarial contract negotiations. But in an agentic system, the "contract" is no longer a legal document between humans — it is the machine-readable specification that governs agent behavior. The agent does not collaborate; it executes within constraints. If those constraints are vague, the agent will fill the gaps with its own probabilistic inference — and the customer will receive something nobody specified ³.

Contract negotiation has been reborn as specification engineering: defining precise, testable, machine-enforceable boundaries. The collaboration happens between humans during specification. The contract happens between human intent and agent execution ⁴¹¹.

4. "Responding to change over following a plan" — Incomplete

Agile rightly valued adaptability over rigid upfront planning. But it assumed the entity responding to change was a human with judgment, context, and accountability. When agents respond to change, they do so probabilistically — and without the judgment to know when adaptation has become drift ³.

In agentic systems, specifications need to steer behavior and evolve through evidence — something the Agile Manifesto never contemplated. Not rigid plans, but not unconstrained adaptation either. Living specifications that tighten through iterative refinement — specify, execute, evaluate, adjust — with convergence criteria that distinguish productive evolution from scope drift ³⁴.

The four Agile values each assumed a human-only world. But the structural failures extend beyond the values themselves — into the practices, metrics, and ceremonies that Scrum, SAFe, and related frameworks built on top of those values.

The Practices — Obsolete

5. Sprint Cadences Are Irrelevant to Machine-Speed Execution

A two-week sprint assumes human pace. When agents can complete a full development cycle in hours ¹, the sprint boundary is not just arbitrary — it is a bottleneck that prevents the system from shipping validated increments the moment they are architecturally sound ⁵¹¹.

The replacement is continuous flow with verification gates: agents produce work continuously, and every increment passes through deterministic checks, evaluation harnesses, and proof generation before it advances. The cadence is not time-boxed — it is evidence-gated ⁴¹¹.

6. Estimation and Velocity Tracking Lose Meaning

Story points and velocity metrics assume human cognitive throughput as the constraint. When an agent can generate ten implementations in the time a human would estimate one, velocity tracking measures the wrong thing ⁵. The meaningful metric becomes total cost of correctness (the sum of inference spend, verification overhead, and incident remediation when failures escape) — not story points completed ³¹⁸.

This point is worth sharpening because even sophisticated frameworks miss it. McKinsey's AI Transformation Manifesto frames success in terms of EBITDA uplift and return on AI investment — business outcomes, not engineering activity. That framing is correct and the manifesto community should adopt it. But McKinsey's own framework contains no mechanism for how those outcomes are verified at the task level. The missing link is exactly total cost of correctness: the economics-aware routing, verification overhead, and incident cost that determine whether a given investment in AI delivery produces real return or just faster output that fails downstream.

The economics shift runs deeper than metrics. In Agile, cost is simple: developers cost X per sprint, multiply by sprints. In agentic engineering, the cost model is per-token, per-model, per-task — and it varies by orders of magnitude depending on which model is routed to which task. Sending every task to the most capable model is like flying first-class for a cross-town trip; sending every task to the cheapest is like taking a bicycle to the airport. The Agile Manifesto has no vocabulary for economics-aware routing — selecting which model handles which task based on the cost-quality tradeoff — because it never needed one ¹⁷. Reuven Cohen has described a "sudden flip in the cost curve" around mid-2025 — the moment long-horizon agentic swarms became economically feasible, making cost-quality routing not just desirable but essential ²¹. The Gartner prediction that 40% of agentic projects will be canceled cites "escalating costs" as a primary driver ⁸ — this is precisely the problem: organizations burning through inference budgets because they lack cost-quality routing discipline.

7. Human Code Review Becomes the Bottleneck

When agents produce code at machine speed, the human reviewer becomes the rate limiter. The Agile Manifesto has no answer for this because it never imagined a world where code generation was not the bottleneck ¹⁸. The answer requires tiered verification: deterministic checks filter autonomously, statistical evaluation filters semi-autonomously, and human review focuses exclusively on high-risk deltas and policy exceptions ¹³¹⁸.

The Conceptual Gaps — Missing Entirely

8. No Framework for Non-Deterministic Behavior

In agentic systems, this is the deepest category of failure — concepts that neither the manifesto nor its derivative practices ever addressed, because they did not need to. Agile assumes deterministic execution: write code, run tests, the same input produces the same output. Agents are probabilistic. The same specification can produce different implementations across runs. The same tool call can produce different results depending on context window contents, model temperature, and retrieved memory ³⁴.

The Agile Manifesto has no vocabulary for emergence, containment, hallucination loops, memory poisoning, or probability-compounding across multi-agent systems ⁴¹³. These are not edge cases — they are routine operating conditions in agentic engineering.

It is worth noting that even current enterprise guidance — including McKinsey's AI Transformation Manifesto — reproduces this gap at the strategic level. McKinsey's theme on agentic engineering (#11 of their twelve themes) describes the challenge as "ingesting unstructured data, extending AI platforms with agentic capabilities, automating guardrails and controls." This describes Agile-era configuration management dressed in agentic vocabulary. It has no concept of blast radius, swarm topology, correlated failure modes, or the verification/validation distinction that separates "the agent said it worked" from "we can prove it worked." The absence of non-determinism vocabulary in enterprise AI guidance is not a strategic oversight — it is a symptom of the same conceptual gap that limits Agile: the framework was designed for human executors, and the vocabulary has not caught up to probabilistic ones.

9. No Concept of Systems That Learn from Their Own Execution

In agentic systems, this may be the most consequential failure — the one that generates all the others. Agile's feedback loop is the retrospective: humans reflecting on what happened, deciding what to change, implementing changes in the next sprint. That loop runs on a two-week cadence because it requires human cognition. The feedback is soft — "we should try shorter sprints," "let's refine our definition of done."

Agentic systems have a qualitatively different feedback loop. Errors, logs, successes, and failures feed back into the system in real-time. The system does not wait for a retrospective. It does not require human reflection to adapt. The feedback is hard, unambiguous signals: passing tests, zero runtime errors, validated API responses, converging evaluation metrics ²¹⁷. This is not merely faster iteration — it is a different kind of learning. Reasoning consolidation cycles can compress chains of inference using reinforcement-learning algorithms in seconds. Meta-cognitive layers enable systems to monitor and modify their own operational parameters. Self-optimizing architectures adapt query and retrieval strategies in microseconds based on access patterns ¹⁹.

The Agile Manifesto has no concept of a system that improves its own process without human intervention. It assumes the learner is human, the cadence is weekly, and the feedback requires interpretation. In agentic systems, the learner is the system itself, the cadence is continuous, and the feedback is machine-readable. This is not an incremental improvement over retrospectives — it is a paradigm shift that Agile's vocabulary cannot express ².

10. No Treatment of Memory as Infrastructure

In agentic systems, Agile's concept of institutional memory — tribal knowledge and documentation that nobody reads — is fatally insufficient. There is no Agile practice for curating, governing, or versioning what the organization knows — let alone what it has learned. In an agentic system, this gap is fatal.

An agent without persistent memory — whether internalized or externalized through retrieval layers, episodic stores, or vector databases — must be reconstructed from scratch for every task. Some architectures externalize memory entirely, separating the agent from its state. But the memory still exists as infrastructure; it still requires curation, governance, and retrieval engineering. The question is not whether memory is needed but where it lives and who governs it. In practice, the distinction between a stateless tool and a memory-augmented agent is the ability to accumulate, curate, and act on context across invocations ¹⁹. And memory itself is not monolithic: knowledge (what was given) and learned memory (what was discovered through execution) are distinct infrastructure with different curation, governance, and retrieval requirements ¹⁷.

Context windows are finite. What goes into them determines what comes out. Memory governance — the discipline of deciding what to retain, what to forget, how to retrieve, and how to prevent poisoning of an agent's accumulated context — is an engineering discipline as consequential as database design. The Agile Manifesto treats memory as overhead. Agentic engineering treats it as infrastructure ³.

A critical review of every competing framework.

See Beyond Agile for the full argument. See Ten Structural Failures for how Agile breaks. All references link to Sources.

The industry has not been idle. Multiple manifestos and frameworks have emerged to fill the vacuum. But none of them are sufficient.

Casey West's Agentic Manifesto

What it gets right: The shift from verification ("did it do what I said?") to validation ("did it do what I wanted?"). The Agentic Delivery Lifecycle (ADLC) across five non-linear phases. The "Determinism Gap" — the fundamental difference between a system whose output is known in advance and one whose output is discovered in real-time. The emphasis on continuous flow over time-boxed sprints. The insistence that human engineers and agents must work together continuously, rejecting fully unsupervised delegation ⁴.

What it misses: No treatment of memory as infrastructure — West does not distinguish knowledge from learned memory or address memory governance. No economics-aware routing — no recognition that model choice is a runtime decision with cost implications ¹⁷. No framework for formal verification or proof generation — and this matters because when execution is non-deterministic, at least one layer of the verification pyramid must be provably correct; executable specification languages and model checkers are now production-viable for this purpose ¹²¹⁹. No treatment of swarm topology as an engineering decision. No recognition that alignment must move from the single-agent prompt layer toward institutional alignment across interacting agents, tools, and humans ⁵⁹. The manifesto reads as a philosophical reframe of Agile rather than a new engineering discipline.

The SASE Framework (Academic SE 3.0)

What it gets right: The dual modality of SE4H (Software Engineering for Humans) and SE4A (Software Engineering for Agents). The elevation of the developer from syntax author to "Agent Coach." The structured artifacts: BriefingScripts, Merge-Readiness Packs (MRPs), Consultation Request Packs (CRPs). The separation of Agent Command Environment (ACE) and Agent Execution Environment (AEE). The Plan-Do-Assess-Review (PDAR) loop with agent-initiated callbacks ¹².

What it misses: Overly academic — lacks operational tooling references. No treatment of cost-quality routing. No framework for memory governance beyond "institutional memory." No recognition that formal verification and statistical evaluation are complementary disciplines with different cost curves ¹⁹. No treatment of self-improving recursive systems or of skill memory as an external learning substrate that must itself be governed ⁶⁰.

The DEV Community "Agentic Manifesto"

What it gets right: Four values that correctly reframe priorities: human intent over exhaustive requirements, continuous flow over sprints, architectural integrity over feature output, automated validation over manual estimation ¹¹.

What it misses: Values without principles are aspirational, not operational. No definition of "done." No treatment of observability, memory, domain boundaries, or accountability. No framework for what happens when architectural integrity conflicts with continuous flow. No recognition that "automated validation" requires a multi-layered verification pyramid (deterministic → statistical → formal → human) rather than a single binary gate ¹³.

The P3 Group's "From Sprints to Swarms"

What it gets right: The most thorough deconstruction of how specific Agile practices (standups, sprints, estimation, retrospectives) fail under agentic workflows. The strategic framing of evolutionary versus revolutionary adoption paths. The recognition that the Agile Manifesto's values can survive as governance principles even as its practices become obsolete ⁵.

What it misses: Primarily diagnostic rather than prescriptive. Identifies what breaks but does not provide the replacement engineering discipline with sufficient depth. No treatment of formal verification, memory governance, or economics-aware routing. No operational tooling framework.

The AWS Prescriptive Guidance

What it gets right: "Zones of intent" — bounded operational spaces where agents have high autonomy within architectural constraints. The evolution from "Sprint Planning" to "Intent Design." The recognition that "done" must be redefined as runtime readiness with observability, explainable traces, and feedback mechanisms ³.

What it misses: Vendor-contextualized (AWS-centric). No treatment of multi-vendor swarm coordination. No framework for formal verification. Limited treatment of memory and learning systems.

ISO/IEC 5338:2023

What it gets right: Among the first comprehensive international frameworks for AI system lifecycle processes ²⁰. The integration of Model Engineering into standard Implementation processes. The mandate for Continuous Validation — acknowledging that AI agents can suffer from context drift, hallucination, and data staleness over time. The emphasis on bias mitigation, transparency, and purpose-binding for training data ¹⁵.

What it misses: Designed for AI systems broadly, not for agentic engineering specifically. No treatment of multi-agent coordination, swarm topologies, or inter-agent trust. No framework for memory governance, economics-aware routing, or self-improving systems. Compliance-oriented rather than engineering-oriented.

The Agentic AI Foundation and the Emerging Standards Stack

Since the frameworks above were published, the most significant structural development has been institutional: in December 2025, the Linux Foundation launched the Agentic AI Foundation (AAIF), co-founded by Anthropic, OpenAI, Google, Microsoft, AWS, and Block ³⁵. MCP, A2A, AGENTS.md, and goose were donated as founding projects.

This matters because the competing frameworks above all suffer from the same gap: they describe what agentic engineering needs without naming the protocols that implement it. As of the current ecosystem snapshot, the industry is actively standardizing into four complementary layers, all under neutral governance:

MCP (Model Context Protocol) — agent-to-tool connectivity. Defines typed schemas, auth boundaries, and replayable tool logs ³⁶.
A2A (Agent-to-Agent Protocol) — agent discovery, task delegation, and cross-framework collaboration ³⁷.
Agent Skills — capability definition via SKILL.md files consumed at runtime ³⁸.
AGENTS.md — repository-level machine-readable constraints for coding agents ³⁹.

None of the six frameworks reviewed above anticipated this convergence. The Agentic Engineering Manifesto does not prescribe specific protocols — its contribution is the governance model that sits across all four layers. But the existence of AAIF supports one of the manifesto's core theses: vendor-neutral, interoperable architecture is not aspirational but actively being built.

A parallel movement reinforces the shift: specification-driven development (SDD) frameworks are emerging as an increasingly common workflow pattern for agentic coding. Multiple widely-adopted open-source frameworks ⁴³⁴⁴⁴⁵⁴⁶⁴⁸ now enforce the same discipline — write the specification before the agent writes the code. This effectively inverts Agile's founding principle of "working software over comprehensive documentation." In agentic workflows, comprehensive specification is the precondition for working software. The documentation is not overhead; it is the control surface.

Sources are classified by type: [P] press/trade, [B] blog/opinion, [I] industry/vendor, [A] academic, [S] standard, [R] internal reference.

See Beyond Agile for the full argument.

Source Weighting

Not all sources carry equal evidentiary weight. When drawing conclusions, rely on the highest-weight source available for the claim:

Primary — standards, regulations, official documentation, peer-reviewed academic papers. Strongest evidentiary weight.
Secondary — industry and vendor guidance, white papers, practitioner frameworks. Useful for operational context; may reflect vendor perspective.
Tertiary — press, blogs, opinion pieces. Useful for framing and practitioner sentiment; treat as directional, not conclusive.

Conclusions about requirements, governance obligations, or regulatory constraints should rest on primary sources. Secondary and tertiary sources provide context and practitioner signal, not evidentiary grounding for technical claims.

[1] S. Jones, "AI Killed the Agile Manifesto," MetaMirror (blog), Jan 2026. [B] https://blog.metamirror.io/ai-killed-the-agile-manifesto-805ad9a639db

[2] Infosys, "How Is AI-Native Software Development Lifecycle Disrupting Traditional Software Development?" Infosys IKI TechCompass, 2025. [I] https://www.infosys.com/iki/techcompass/ai-native-software-development-lifecycle.html

[3] AWS, "Evolving Software Delivery for Agentic AI," AWS Prescriptive Guidance, 2026. [I] https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-operationalizing-agentic-ai/software-delivery.html

[4] C. West, "The Agentic Manifesto: Engineering in the Era of Autonomy," caseywest.com, Nov 2025. [B] https://caseywest.com/the-agentic-manifesto/

[5] P3 Group, "From Sprints to Swarms: Navigating the Post-Agile Future in the Age of AI," P3 Group White Paper, Sep 2025. [I] https://www.p3-group.com/en/p3-updates/navigating-the-post-agile-future-in-the-age-of-ai/

[6] D. Shortino, "The Software Development Lifecycle as We Know It Is Over," WebProNews, Jan 2026. [P] https://www.webpronews.com/the-software-development-lifecycle-as-we-know-it-is-over-and-ai-agents-are-writing-the-obituary/

[7] D. Rubinstein, "Is Agile Dead in the Age of AI?" SD Times, 2025. [P] https://sdtimes.com/agile/is-agile-dead-in-the-age-of-ai/

[8] Gartner, "Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027," Gartner Press Release, Jun 2025. [I] https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027

[9] L. Claburn, "Test-Driven Development Ideal for AI, Says Agile Workshop," The Register, Feb 2026. [P] https://www.theregister.com/2026/02/20/from_agile_to_ai_anniversary/

[10] L. Claburn, "Agile Manifesto Co-Author 'Smitten' with Vibe Coding," The Register, Feb 2026. [P] https://www.theregister.com/2026/02/19/jon_kern_vibe_coding/

[11] crywolfe, "The Agentic Manifesto: Why Agile Is Breaking in the Age of AI Agents," DEV Community, 2025. [B] https://dev.to/crywolfe/the-agentic-manifesto-why-agile-is-breaking-in-the-age-of-ai-agents-1939

[12] R. Feldt et al., "Agentic Software Engineering: Foundational Pillars and a Research Roadmap," arXiv:2509.06216v2, Sep 2025. [A] https://arxiv.org/html/2509.06216v2

[13] B. Linders, "From Prompts to Production: A Playbook for Agentic Development," InfoQ, 2026. [P] https://www.infoq.com/articles/prompts-to-production-playbook-for-agentic-development/

[14] B. Linders, "Does AI Make the Agile Manifesto Obsolete?" InfoQ, Feb 2026. Note: cites Forrester's 2025 State of Agile Development report; primary report not publicly available. [P] https://www.infoq.com/news/2026/02/ai-agile-manifesto-debate/

[15] Software Improvement Group, "ISO/IEC 5338: Get to Know the Global Standard on AI Systems," SIG Blog, 2024. [I] https://www.softwareimprovementgroup.com/blog/iso-5338-get-to-know-the-global-standard-on-ai-systems/

[16] T. Claburn, "From Agile to AI: Anniversary Workshop Says Test-Driven Development Ideal for AI Coding," DevClass, Feb 2026. [P] https://www.devclass.com/development/2026/02/21/should-there-be-a-new-manifesto-for-ai-development/4091612

[17] Y. Zhou, "2025 Overpromised AI Agents. 2026 Demands Agentic Engineering," Medium, Jan 2026. [B] https://medium.com/generative-ai-revolution-ai-native-transformation/2025-overpromised-ai-agents-2026-demands-agentic-engineering-5fbf914a9106

[18] Svngoku, "2026 Agentic Coding Trends — Implementation Guide," Hugging Face Blog, 2026. [B] https://huggingface.co/blog/Svngoku/agentic-coding-trends-2026

[19] L. Cabrera-Diego et al., "Toward Agentic Software Engineering Beyond Code: Framing Vision, Values, and Vocabulary," arXiv:2510.19692v2, Oct 2025. [A] https://arxiv.org/html/2510.19692v2

[20] ISO/IEC, "ISO/IEC 5338:2023 — Information technology — AI system life cycle processes," International Organization for Standardization, 2023. [S] https://www.iso.org/standard/81118.html

[21] The AI Native Dev podcast interview, "Can Agentic Engineering Really Deliver Enterprise-Grade Code?" (with R. Cohen), Sep 2025. [I] https://ainativedev.io

[22] Y. Pan et al., "SWE-CI: Evaluating LLM-based Agents in Continuous Integration Environments," arXiv:2603.03823, Mar 2026. [A] https://arxiv.org/abs/2603.03823

[23] D. Fretz, "The 5 Levels of AI Agentic Software Development," LinkedIn, Feb 2026. [B] https://www.linkedin.com/pulse/5-levels-ai-agentic-software-development-dominik-fretz-mba-pmp-xhvze/

[24] OpenAI, "Harness engineering: leveraging Codex in an agent-first world," OpenAI, Feb 2026. [I] https://openai.com/index/harness-engineering/

[25] Anthropic, "Demystifying evals for AI agents," Anthropic Engineering, Jan 2026. [I] https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents

[26] D. Bursztein and B. Lewis, "Building agents with the Claude Agent SDK," Claude Blog, Jul 2025. [I] https://claude.com/blog/building-agents-with-the-claude-agent-sdk

[27] C. Horne, "Writing effective tools for AI agents," Anthropic Engineering, Sep 2025. [I] https://www.anthropic.com/engineering/writing-tools-for-agents

[28] A. Zhang et al., "Building a C compiler with a team of parallel Claudes," Anthropic Engineering, Feb 2026. [I] https://www.anthropic.com/engineering/building-c-compiler

[29] B. Böckeler, "Harness Engineering," Martin Fowler, Feb 2026. [B] https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html

[30] B. Böckeler, "Context Engineering for Coding Agents," Martin Fowler, Feb 2026. [B] https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html

[31] K. Morris, "Humans and Agents in Software Engineering Loops," Martin Fowler, Mar 2026. [B] https://martinfowler.com/articles/exploring-gen-ai/humans-and-agents.html

[32] Google Cloud, "Vertex AI Agent Builder," Google Cloud, 2026. [I] https://cloud.google.com/products/agent-builder

[33] Google Cloud, "Agent Development Kit overview," Google Cloud Docs, 2026. [I] https://docs.cloud.google.com/agent-builder/agent-development-kit/overview

[34] G. Franceschini et al., "Build with Google Antigravity, our new agentic development platform," Google Developers Blog, Apr 2025. [I] https://developers.googleblog.com/en/build-with-google-antigravity-our-new-agentic-development-platform/

[35] Linux Foundation, "Linux Foundation Announces the Formation of the Agentic AI Foundation," Linux Foundation Press, Dec 2025. [I] https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation

[36] Anthropic, "MCP Joins the Agentic AI Foundation," Model Context Protocol Blog, Dec 2025. [I] https://blog.modelcontextprotocol.io/posts/2025-12-09-mcp-joins-agentic-ai-foundation/

[37] Google, "A2A: A New Era of Agent Interoperability," Google Developers Blog, Apr 2025. [I] https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/

[38] Anthropic, "Equipping Agents for the Real World with Agent Skills," Anthropic Engineering, Oct 2025. [I] https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills

[39] OpenAI, "AGENTS.md," GitHub, 2025. [I] https://github.com/agentsmd/agents.md

[40] NVIDIA, "NVIDIA Announces NemoClaw," NVIDIA News, Mar 2026. [I] https://nvidianews.nvidia.com/news/nvidia-announces-nemoclaw

[41] S. Yegge, "Introducing Beads: A Coding Agent Memory System," Medium, 2026. [B] https://steve-yegge.medium.com/introducing-beads-a-coding-agent-memory-system-637d7d92514a

[42] CrowdStrike, "What Security Teams Need to Know About OpenClaw AI Super Agent," CrowdStrike Blog, 2026. [I] https://www.crowdstrike.com/en-us/blog/what-security-teams-need-to-know-about-openclaw-ai-super-agent/

[43] GitHub, "Spec-driven development with AI: Get started with a new open source toolkit," GitHub Blog, 2025. [I] https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/

[44] Fission AI, "OpenSpec: The Spec Framework for Coding Agents," Y Combinator Launch, 2025. [I] https://www.ycombinator.com/launches/Pdc-openspec-the-spec-framework-for-coding-agents

[45] J. Vincent, "Superpowers: How I'm using coding agents in October 2025," blog.fsck.com, Oct 2025. [B] https://blog.fsck.com/2025/10/09/superpowers/

[46] BMad Code, "BMAD-METHOD: Breakthrough Method for Agile AI-Driven Development," GitHub, 2025-2026. [I] https://github.com/bmad-code-org/BMAD-METHOD

[47] Oracle, "Introducing the Open Agent Specification (Agent Spec)," Oracle AI Blog, 2025. [I] https://blogs.oracle.com/ai-and-datascience/introducing-open-agent-specification

[48] spec-kit, "spec-kit: Technology-independent SDD toolkit," GitHub, 2025-2026. [I] https://github.com/spec-kit/spec-kit

[49] Model Context Protocol, "Key Changes," MCP Specification, Mar 2025. [S] https://modelcontextprotocol.io/specification/2025-03-26/changelog

[50] Model Context Protocol, "One Year of MCP: November 2025 Spec Release," MCP Blog, Nov 2025. [S] https://blog.modelcontextprotocol.io/posts/2025-11-25-first-mcp-anniversary/

[51] Linux Foundation, "Linux Foundation Launches the Agent2Agent Protocol Project to Enable Secure, Intelligent Communication Between AI Agents," Linux Foundation Press Release, Jun 2025. [I] https://www.linuxfoundation.org/press/linux-foundation-launches-the-agent2agent-protocol-project-to-enable-secure-intelligent-communication-between-ai-agents

[52] OpenTelemetry, "AI Agent Observability: Building Trust in Autonomous Systems with OpenTelemetry," OpenTelemetry Blog, Mar 2025. [S] https://opentelemetry.io/blog/2025/ai-agent-observability/

[53] OpenAI, "Why we no longer evaluate SWE-bench Verified," OpenAI, Feb 2026. [I] https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

[54] C. Cuadron et al., "Saving SWE-Bench: A Benchmark Mutation Approach for More Realistic Agent Evaluation," Microsoft Research, Oct 2025. [A] https://www.microsoft.com/en-us/research/publication/saving-swe-bench-a-benchmark-mutation-approach-for-realistic-agent-evaluation/

[55] OpenAI, "Understanding Prompt Injection," OpenAI, Nov 2025. [I] https://openai.com/index/prompt-injections/

[56] OpenAI, "Designing AI Agents to Resist Prompt Injection," OpenAI, Mar 2026. [I] https://openai.com/index/designing-agents-to-resist-prompt-injection/

[57] European Commission, "The General-Purpose AI Code of Practice," Shaping Europe's Digital Future, Jul 2025. [S] https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai

[58] NIST, "Announcing the 'AI Agent Standards Initiative' for Interoperable and Secure Innovation," NIST News, Feb 2026. [S] https://www.nist.gov/news-events/news/2026/02/announcing-ai-agent-standards-initiative-interoperable-and-secure

[59] J. Evans, B. Bratton, and B. Agüera y Arcas, "Agentic AI and the next intelligence explosion," Science, 2026. [A] https://www.science.org/doi/10.1126/science.aeg1895

[60] H. Zhou et al., "Memento-Skills: Let Agents Design Agents," arXiv:2603.18743, Mar 2026. [A] https://arxiv.org/abs/2603.18743

Principles for building systems where humans steer intent, agents execute within governed boundaries, and verified outcomes are the only measure that matters.

We are moving from writing software to architecting systems that write, test, and ship software under human direction. Through this work, we have come to value:

We Value More	over	We Also Value
Iterative steering and alignment	over	Rigid upfront specifications
Verified outcomes with auditable evidence	over	Fluent assertions of success
Right-sized agent collaboration	over	Monolithic god-agents
Curated, high-signal context and memory	over	Stateless sessions and noisy memory
Tooling, telemetry, and observability	over	Chat-based heroics
Resilience under stress	over	Performance in ideal conditions

That is, while there is value in the items on the right, we value the items on the left more.

Architectural basis (vendor-neutral): enforceable constraints, durable knowledge and memory, continuous evaluations, behavioral observability, and economics-aware routing.

What is Agentic Engineering?

Agentic Engineering is the discipline of architecting environments, constraints, protocols, and feedback loops where autonomous agents can safely plan, execute, and verify complex work under human governance.

It is distinct from:

AI Engineering: Building and training the base models themselves.
Prompt Engineering: Crafting text inputs to steer model outputs.
AI-Assisted Software Engineering: Using AI as an autocomplete or co-pilot to write human-authored code faster.

Agentic Engineering is about treating agents as governed system participants rather than as human proxies. It shifts the primary human role from writing code to specifying intent, defining verifiable contracts, and operating the system that executes the work. As agent capability scales, the governing challenge shifts from aligning one model in isolation toward aligning a society of interacting agents, tools, and humans through checks, balances, and explicit institutional control.

What This Is — and What It Is Not

This manifesto is not "prompting harder." It is not LLMs running production unsupervised. It is not replacing engineering judgment with agent confidence, and it is not more meetings with new names.

It is enforced constraints, verified outcomes, persistent learning, and human accountability — applied to systems that include AI agents as first-class participants in the engineering process.

The Agentic Loop

Every principle in this manifesto serves a single feedback cycle:

Specify → Design → Plan → Execute → Verify → Validate → Observe → Learn → Govern → Repeat

This loop is not a waterfall. Any phase can trigger a return to an earlier one based on evidence. The loop is the system. The principles are how you keep it honest.

Specify defines what to build and why.
Design architects how to build it: boundaries, topology, constraints, and coordination rules.
Plan decomposes the design into executable steps.
Execute carries out the plan within bounded autonomy.
Verify checks the output against the specification (did we build it right?).
Validate checks the outcome against real-world need (did we build the right thing?).
Observe monitors runtime behavior, drift, and cost.
Learn updates knowledge and memory from observations. At Phases 4–5, this means: add durable findings to the knowledge base and curate learned memory with new heuristics, routing preferences, and reusable skills. Updating model weights (fine-tuning, RLHF) is a separate infrastructure concern applicable at Phase 6 and beyond — not a per-loop operation for most organizations. Knowledge captures durable truth; memory captures learned heuristics and reusable skills.
Govern applies policy, accountability, change control, and economics review. When inference or governance cost exceeds the value of the work, Govern signals Specify to simplify scope or reduce autonomy rather than continuing to spend. A Govern cycle is not complete until: all outstanding policy violations are resolved, accountability signals are within threshold (no rubber-stamping pattern detected), economics review is recorded, any architectural decisions triggered by governance are filed back into Design, and tool invocations during the loop are confirmed within the authorized scope for the operating tier — any out-of-scope tool call is classified as a policy violation and triggers the remediation sub-cycle before the loop repeats.

What the Loop Produces

The loop's output is not code. It is an evidence-backed deployable — a package of artefacts that together satisfy the Definition of Done and provide the release layer with everything it needs to make a governed deployment decision.

A complete loop output contains:

The deployable artefact. The code, configuration, model, or process change that implements the specification. Built, tested, and ready to deploy to the target environment.

The evidence bundle. Per Principle 1 and the Definition of Done: evaluation reports with pass/fail and metrics, trace IDs linking to the full decision chain, diffs showing what changed, policy check outputs confirming constraint compliance, and memory updates confirming what was learned. The evidence bundle is the machine-readable record of how "done" was proven. Anything less is assertion.

The specification artefact. The versioned, final state of the specification as it stood when the evaluation suite passed. This is the document against which independent validation (P8) is performed, and the reference against which future changes to this component will be assessed.

The rollback procedure. A tested rollback plan (not just documented — tested in a representative environment) that allows the deployment to be reversed within a defined time window if the change produces unintended production behaviour. The rollback procedure is a condition of the DoD's "Governed" criterion.

The accountability sign-off. A named human who has reviewed the evidence bundle, accepted that the DoD conditions are satisfied, and accepts production accountability for the outcome (P12). This is not a rubber stamp — it is the governance record that the evidence was reviewed.

The control state record. A machine-readable record — generated at loop completion, not assembled post-hoc — that states, for every required control: whether it passed, failed, was waived, is stale, or requires a human decision before the release gate can be cleared. This is distinct from the evidence bundle: the evidence bundle contains the artefacts that constitute the evidence; the control state record contains the structured verdict on each control. A release layer that receives an evidence bundle without a control state record cannot assess gate readiness without re-reading all artefacts. A release layer that receives a control state record can assess gate readiness programmatically and route only the exceptions that require human judgment. A waived control must carry: the waiver rationale, the name of the human who granted the waiver, and an expiry date after which the waiver lapses and the control reverts to required.

These artefacts are handed to the release layer as a unit. A release layer that accepts a deployable without a complete evidence bundle is accepting unverified output. A release layer that accepts an evidence bundle without a named accountable human is accepting ungoverned output. Both are governance failures at the boundary, not in the loop.

Verification and validation are distinct disciplines. Verification is technical correctness against the spec. Validation is fitness for intended use in the real world. An agent can pass every verification check and still fail validation. Both are required.

Failures are data across every phase. Incidents, hallucinations, and policy violations must produce post-incident updates to specifications, evaluations, tooling constraints, and memory before retry.

When a feedback arrow fires, a remediation sub-cycle must complete before re-entering the loop:

Diagnose — classify the failure from traces: specification error, verification gap, enforcement failure, or operational override.
Update — patch memory, tighten contracts, or revise the specification to address the root cause.
Gate — add or strengthen an evaluation that would catch this failure class before retrying.
Re-verify — run the updated evaluation suite before advancing.

Skipping to step 4 without steps 1–3 is a retry, not remediation, and is the primary cause of hallucination loops.

flowchart LR
    Specify --> Design --> Plan --> Execute --> Verify --> Validate --> Observe --> Learn --> Govern
    Govern -->|Repeat| Specify

    Verify -.->|Plan / Execution Failure| Plan
    Verify -.->|Invalid Intent| Specify
    Validate -.->|Wrong Thing Built| Specify
    Validate -.->|Design Flaw| Design
    Observe -.->|Runtime Drift| Specify
    Observe -.->|Decomposition Error| Plan
    Govern -.->|Economics / Complexity Breach| Specify
    Govern -.->|Architectural Policy Change| Design

What Must Be True Before Entering Specify

The loop is rigorous. It verifies, validates, observes, learns, and governs with discipline. But that rigour is only as valuable as the intent it is applied to. Specify is not the beginning of a software lifecycle — it is the beginning of the execution phase. Something must precede it.

A specification is loop-ready when all of the following are true:

Business need validated. The need is real and evidence-backed — not just articulated. User research, data analysis, regulatory mandate, or executive decision with documented rationale constitutes validation. A stakeholder request without supporting evidence is not a validated need.

Value measurable. A success criterion at the business level exists and is measurable post-deployment. "Improve customer experience" is not measurable. "Reduce mean time to resolution for claims by 20% within 90 days of deployment" is. If you cannot define what business success looks like before entering the loop, you cannot validate at the end.

Acceptance criteria expressible. The need can be expressed as machine-readable acceptance criteria and constraints. If a domain expert cannot write a first draft of the acceptance criteria before the loop starts, the need is not well-understood enough to specify.

Constraints identified. Security requirements, compliance obligations, domain ownership boundaries, performance envelopes, and data classification constraints are known before Specify begins. Discovering a compliance constraint at the Verify phase is a scope failure, not a verification failure.

Accountable human named. A named person accepts business-level accountability for the outcome before the loop runs — the P12 anchor, established upstream. This person owns the success criterion, not the implementation.

Blast radius assessed. A preliminary assessment of maximum credible impact if the implementation fails. This informs the autonomy tier (P5) and the scope of the evidence bundle required. A change whose failure impacts one microservice requires different governance than one whose failure affects a regulatory filing.

Out-of-scope explicitly stated. What this specification explicitly does not include. Absent this, scope expands during execution — a common driver of specification drift inside the loop.

If these conditions are not met, the work is not loop-ready and should not enter the loop. Resolving these gaps requires demand governance upstream of engineering execution — clarifying the need, establishing measurable success criteria, and confirming constraints before the loop runs. Entering the loop without a loop-ready specification does not save time — it produces well-executed work on the wrong problem.

Minimum bar: If you cannot answer "what does business success look like and how will you measure it?" before entering Specify, the loop is not ready to run.

The New Way of Working

Humans express intent as specifications with constraints and acceptance criteria — then refine those specifications as evidence accumulates. They encode architecture as enforceable, monitored domain boundaries. They set autonomy tiers appropriate to risk. They own outcomes and remain accountable. They do not supervise every intermediate step — they define what success looks like, verify that the system achieved it, and inspect the reasoning when it matters.

Agents decompose specifications into executable tasks. They execute within domain boundaries, right-sized to complexity. They verify their own outputs against evaluations. They report evidence, not assertions. They learn from failure and encode that learning in memory — with provenance, so the system knows where every lesson came from.

Systems maintain persistent knowledge and curated learned memory. They route work to appropriate model tiers based on cost and quality requirements. They enforce architectural constraints at runtime and monitor for violations. They observe behavior, surface anomalies, and maintain the feedback loops that make everything else work. They forget what no longer serves them.

See Roles and the Human Side for how each role evolves through the phase transitions.

Scope and Framework Context

What this manifesto covers

The engineering discipline for governing systems that include autonomous agents as first-class participants in engineering execution: from a loop-ready specification entering Specify through governed output exiting the loop. This includes:

The Agentic Loop: Specify, Design, Plan, Execute, Verify, Validate, Observe, Learn, Govern — and the feedback paths between them.
Governance structures, autonomy controls, and evidence practices for the execution phase.
The Definition of Done: the conditions that prove a loop output is complete.
Adoption guidance for transitioning teams through the maturity phases.
Domain-specific mappings to regulatory frameworks for teams in regulated industries.

Position as shared engineering execution infrastructure

This manifesto governs engineering execution — from a loop-ready specification entering Specify through governed output exiting the loop. Outer lifecycle frameworks, covering what precedes the loop (demand validation, portfolio governance) and what follows it (release governance, operations, maintenance, and product lifecycle governance), depend on this manifesto as their shared engineering execution standard. The principles apply in full regardless of delivery mode; the outer frameworks govern what the manifesto does not.

What this manifesto does not cover

The following are explicitly out of scope for this document:

Business need validation, demand prioritisation, portfolio governance, and specification readiness — the upstream work that determines what enters the loop.
Release gates, environment promotion, change management, and compliance documentation for the deployment boundary — the downstream work that governs how loop output reaches production.
Incident management, SLO governance, security patching, ownership transfer, and system deprecation — the operational work that governs systems in production.
Agent product brief and trust model design, persona design, and regulatory risk classification — the product conception work that precedes engineering execution for agent products.
Behavioral envelope specification at the product level, use-case coverage mapping, uncertainty protocol, and escalation design for agent products.
Behavioral release gates, composite state versioning, behavioral drift governance, foundation model update governance, and regulated retirement for agent products.

Out of scope in all frameworks:

Training, fine-tuning, or evaluating foundation models.
Deploying agents in physical systems, robotics, or non-software operational domains.
Legal advice, compliance determinations, or jurisdiction-specific regulatory guidance. Domain pages map principles to frameworks; they are not substitutes for qualified regulatory counsel.
Autonomous weapons systems, or the safety certification of autonomous control systems themselves.
Federated agent networks without a single accountable operator.
Agent deployment in classified environments.

How to Read This Manifesto

Use two layers:

Manifesto core (this document + Twelve Principles + Definition of Done): values, principles with minimum bars, and what "done" means. Start here.
Companion guidance (Companion Guide and its linked documents): extended rationale, tradeoffs, worked patterns, failure modes, organizational change management, and domain-specific regulatory alignment. Come here when implementing. The companion layer is itself multi-document; the full map is in companion-guide.md.

The two-layer framing is accurate but incomplete. The minimum bars in the principles are necessary conditions; they are not sufficient for safe operation at Phase 4 and above. At higher phases, certain companion content becomes operationally essential rather than supplementary: the Specifications vs. Constraints distinction (P2), rubber-stamping detection (P12), the Adaptation Envelope — Layer 4 (P6), and the worked failure-mode patterns (P10/P12) are required reading before operating autonomy above Tier 1. If the core document describes the floor, these documents describe the walls and ceiling.

On evidence. This manifesto demands evidence as a discipline. We apply that standard to our own claims: empirically supported claims carry citations; threshold values are labeled as practitioner heuristics; deductive arguments are stated as arguments so they can be evaluated independently. Some claims in an emerging discipline necessarily precede the empirical grounding they ideally require. Treat those claims as hypotheses and revise them as evidence accumulates. That is what a living specification means in practice.

Concretely: numeric thresholds in this document — for example, the 80% first-pass verification rate and three-iteration specification stability gate (P2), the 20% cost-divergence threshold and 48-hour rollback freshness window (Definition of Done), and the phase-to-tier ceilings (P5) — are practitioner heuristics, not validated thresholds. They exist to make the bar concrete, not to assert that one number rather than another has been empirically established. Measure your baseline, calibrate against it, and revise as evidence accumulates. Treating these numbers as authoritative without local calibration is a misreading of the document.

Twelve Principles

The engineering principles that operationalize the six values: outcomes, specifications, architecture, swarm topology, autonomy tiers, knowledge and memory, context, evaluations and proofs, observability and interoperability, emergence and containment, economics, and accountability.

The Agentic Definition of Done

What "done" means in agentic engineering: shipped, observable, verified, provable, learned from, governed, and economical. Phase-calibrated, not all-or-nothing.

Glossary

Canonical definitions for terms used across this document set: agent, autonomy tier, blast radius, evidence bundle, evaluation, knowledge, learned memory, specification, trace, verification, validation, and more.

Exploration is a phase. Engineering is a discipline. These principles are not the last word — they are the minimum for a world where systems build, test, and ship their own code under human direction. The question that remains is whether governance can scale as fast as autonomy. We bet it can. This manifesto is how we intend to prove it.

The engineering principles that operationalize the six values.

See the Manifesto for the core values and the Agentic Loop. See the Definition of Done for what "done" means.

Values-to-principles mapping. The manifesto claims these twelve principles operationalize the six values. The correspondence:

Value	Principles
Iterative steering and alignment	1 — Outcomes, 2 — Specifications
Verified outcomes with auditable evidence	8 — Evaluations, 12 — Accountability
Right-sized agent collaboration	3 — Architecture, 4 — Swarm, 5 — Autonomy tiers
Curated, high-signal context and memory	6 — Knowledge/memory, 7 — Context
Tooling, telemetry, and observability	9 — Observability
Resilience under stress	10 — Containment, 11 — Economics

Why twelve. The principle count is not chosen for symbolic resonance with the 2001 Agile Manifesto. Each principle is irreducible: removing any one uncovers a distinct failure mode that the others do not cover. The closest candidates for consolidation are P3 (architecture) with P10 (containment), and P6 (knowledge/memory) with P7 (context). They remain separate because P3 governs deterministic boundaries enforced before agent action while P10 governs the engineered response to emergent behaviour after boundaries are crossed — different failure modes, different controls; and because P6 governs what the system retains and how it expires while P7 governs what is retrieved into a specific reasoning loop and how it is budgeted — different lifetimes, different infrastructure. A reader who finds twelve excessive should propose the specific failure mode that becomes uncovered when a principle is removed.

Sequencing matters. These principles are not independent. Prerequisites: Principle 2 (specifications) before Principle 8 (evaluations); Principle 3 (architecture) before Principle 5 (autonomy tiers); Principle 6 (knowledge/memory) before Principle 7 (context); Principle 9 (observability) before Principle 12 (accountability). The Incremental Adoption Path gives the recommended implementation order.

manifesto-principles-01.md — Outcomes are the unit of work
manifesto-principles-02.md — Specifications are living artifacts that evolve through steering
manifesto-principles-03.md — Architecture is defense-in-depth, not a document
manifesto-principles-04.md — Right-size the swarm to the task
manifesto-principles-05.md — Autonomy is a tiered budget, not a switch
manifesto-principles-06.md — Knowledge and memory are distinct infrastructure
manifesto-principles-07.md — Context is engineered like code
manifesto-principles-08.md — Evaluations are the contract; proofs are a scale strategy
manifesto-principles-09.md — Observability and interoperability cover reasoning, not just uptime
manifesto-principles-10.md — Assume emergence; engineer containment
manifesto-principles-11.md — Optimize the economics of intelligence
manifesto-principles-12.md — Accountability requires visibility

See the split files below for the full Manifesto Principles corpus.

1. Outcomes are the unit of work

Progress is measured by the cycle Outcome → Evidence → Learning — not by tokens generated, tasks dispatched, or agents spawned. An agent that says "done" has proven nothing. A change is done only when it is shipped, observable, verified, validated, and learned from.

Four distinct claims must hold before "done" is true:

Evaluation is the contract that defines correctness. Evaluations are versioned, machine-readable, and coupled to the specification. They define what "correct" means in terms the system can check autonomously.

Verification is the act of running evaluations to confirm the implementation matches the specification. Verification answers: did we build it right? It produces evidence — test reports, policy check outputs, trace IDs — that an agent's output satisfies the acceptance criteria.

Validation is the judgment that the specification itself was worth building. Validation answers: did we build the right thing? It checks fitness for real-world use: does the deployed behavior produce the intended business outcome? Verification can pass completely while validation fails — you can build exactly what the specification said, correctly, and ship the wrong thing.

Independent validation is the organizational challenge of whether verification and validation were genuinely rigorous. It answers: were the first two real? In regulated contexts, this must be performed by a party organizationally independent from the team that developed and verified the system. It is not a technical step — it is a governance requirement.

Evidence means: evaluation reports with pass/fail and metrics, trace IDs linking to the full decision chain, diffs showing what changed, deployment IDs confirming what shipped, rollback plans confirming reversibility, policy check outputs confirming constraint compliance, and memory updates confirming what was learned. Anything less is assertion, not evidence.

Minimum bar: If it is not deployed, instrumented, verified against evaluations, and validated against real-world outcomes, it is not done.

2. Specifications are living artifacts that evolve through steering

Requirements, constraints, and acceptance criteria must be versioned, reviewable, and machine-readable — because they drive agent behavior directly. Specifications are hypotheses that sharpen as agents explore the problem space and evidence accumulates. Express what must be true when the work is complete. Express what is forbidden. Let the swarm find the path. When the path reveals that the spec was wrong, update the spec and run again.

Specifications and architectural constraints operate at different layers and change at different speeds. Constraints are invariants — security policies, domain ownership boundaries, data integrity rules — that hold across specification iterations. Specifications are goals and acceptance criteria that evolve within those invariants. An agent can propose a revised acceptance criterion without governance overhead; proposing a relaxed constraint triggers a governed review. If the system cannot distinguish these two change types, specification iteration will silently erode architectural boundaries. See Specifications vs. Constraints in the extended guidance.

Minimum bar: If a specification cannot be versioned, reviewed, and revised based on execution evidence, it is a wish, not an engineering artifact.

These are starter defaults, not universal stop conditions. Calibrate them per domain, track false-convergence and false-drift, and harden them only after local evidence justifies the thresholds.

A specification is done iterating when:

Acceptance criteria remain stable across three consecutive iterations (no new criteria added, no existing criteria changed).
Scope is contracting, not expanding — each iteration narrows requirements, does not broaden them.
Agent first-pass verification rate exceeds 80% (the specification is clear enough for the agent to satisfy it without mid-task clarification).
No new stop criteria emerge in the last iteration.

If these are not met after three iterations, treat it as scope drift — not optimization — and reset the boundary. Iteration is not the goal; convergence is.

3. Architecture is defense-in-depth, not a document

Domain boundaries define what agents may do and what they must not do. Encode boundaries as machine-enforced policies: repository gates, type contracts, lint rules, domain ownership maps, CI checks.

Orchestration is a deterministic concern; execution is a probabilistic one — conflating them is the root failure mode. Do not rely on an LLM's system prompt to enforce your business rules. Build deterministic infrastructure wrappers around your probabilistic AI. Enforce permissions, repository gates, API rate limits, and data access at the system level. Expect the boundary to be tested. Design for what happens when it is crossed.

Deterministic wrappers catch structural failures — unauthorized access, schema violations, forbidden operations. They cannot catch semantic failures — an agent that writes syntactically valid but logically wrong code. That is why architecture is defense-in-depth, not a single layer: wrappers catch structural violations (Principle 3), verification catches semantic errors (Principle 8), and observability catches behavioral drift (Principle 9). No single layer catches everything. All three must hold.

Minimum bar: If a boundary is described but not enforced at runtime with automated detection and recovery, it is not architecture — it is documentation.

4. Right-size the swarm to the task

Prefer specialized agents coordinated through shared contracts and state. But do not default to maximum parallelism. A single well-evaluated agent with excellent tools often outperforms an expensive, uncoordinated swarm. Scale agents to complexity, not to ambition.

Design conflict resolution, not just parallelism. Swarms propose; a single commit path commits. Choose the simplest topology that solves the problem and graduate to more complex coordination only when evidence shows it is needed.

The point of a swarm is not to mimic an organization chart. It is to create structured disagreement, specialization, and reconciliation where the workload benefits from multiple perspectives. Intelligence at system scale is often plural rather than monolithic. The engineering question is not "how many agents can we run?" but "what coordination pattern produces better verified outcomes than a single agent on this workload?" Swarms are not only for implementation: the same coordination patterns — specialization, structured disagreement, reconciliation — apply to governance work such as specification critique, threat modeling, evidence assembly, and release risk assessment; a swarm that only writes code while governance remains a separate human overlay is not a governed agentic system.

Signals that indicate a single agent is insufficient:

The task requires concurrent reads or writes across multiple bounded contexts where race conditions cannot be resolved inside a single agent.
Evaluation pass rate plateaus below threshold across successive sessions despite specification refinement, indicating context degradation under length.
The task requires adversarial specialization — roles whose objectives conflict and cannot be fully trusted from the same agent (e.g., implementation and independent security review).
Single-agent tool call depth or context budget is consistently saturated on representative tasks.

In the absence of these signals, default to single-agent or pipeline.

Minimum bar: If shared state is not typed, versioned, and reconciled, the swarm is a mob.

Minimum bar (tier containment): An orchestrator cannot delegate actions to specialist agents that exceed its own authorized autonomy tier. Tier elevation requires the same approval path regardless of whether the request originates from a human or an orchestrating agent.

5. Autonomy is a tiered budget, not a switch

Grant permissions by risk tier, least privilege, and blast-radius limits. Agents behave like serverless functions, not employees: spin up for a guarded task, verify the result, and terminate.

Autonomy operates in explicit governance tiers — each defining who approves, what evidence is required, and what blast radius is acceptable:

Tier 1 — Observe. Agents analyze and propose. Blast radius: none.

Tier 2 — Branch. Agents write to isolated branches. Humans approve merges. Blast radius: contained.

Tier 3 — Commit. Agents take production-impacting actions with explicit human approval per change, attached rollback plans, and verified evidence. Blast radius: governed.

Tier 4 — Operate. Agents execute autonomously within a human-approved, machine-enforced policy envelope — without per-change human approval. The human approves the envelope (allowed change classes, blast radius ceiling, required evidence schema, rollback conditions, kill-switch configuration) and retains accountability for its design. Agents act within it; anomaly detection and governance observability surface deviations for human review. Blast radius: policy-bounded.

Tier 4 is not Tier 3 with the human removed. It is a governance model shift: accountability moves from the action level to the policy level. This only holds when the policy envelope is machine-enforced (not merely documented), control evaluations confirm the governance system itself works (P8), governance observability is instrumented and alerting on stale evidence and drift (P9), and rubber-stamping detection is active (P12). Without these prerequisites, Tier 4 is ungoverned production autonomy. The control state record produced by each loop iteration (see What the Loop Produces) is the primary audit mechanism at this tier: it must confirm that each action fell within the approved envelope before the action is logged as compliant.

Within each tier, define granular permissions: read production data but not write, deploy to canary but not full rollout, modify test code but not application code, change configuration but not schema. Tiers define the governance level; permissions define the allowed actions within that level.

Minimum bar: If you cannot reconstruct an agent's reasoning at any tier, your autonomy model has failed.

Minimum bar (tool authorization): If an agent can invoke tools that have not been explicitly authorized for its operating tier, the tier model is nominal. Tool access is part of the permission model — not a separate concern. A tool that a Tier 1 agent can call without authorization is a tier violation regardless of whether the agent chooses to call it.

Minimum bar (Tier 4): If the policy envelope is not machine-enforced, if control evaluations are not passing, or if governance observability is not instrumented, Tier 4 operation is not permitted regardless of phase.

Phase maturity is a prerequisite for autonomy tier. Tiers and phases are not independent: a team cannot safely operate at a higher tier than their phase supports, regardless of available infrastructure.

Deployment status disclosure (as of 2026-05-03): No production Tier 4 deployment is known to have been independently validated against the prerequisites stated above. The bar is documented; it has not been empirically met at scale in any deployment publicly reported against this specification. Treat Tier 4 as a research target with explicit prerequisites — not a production target with checklist-clearance — until independent validation evidence exists. This disclosure will be updated when validated deployments are reported.

These maximum tiers are conservative defaults for the relevant work item, not a blanket organization-wide policy. Calibrate by domain, data classification, and blast radius.

Phase	Maximum available tier	Rationale
Phase 1-2	Tier 1 only for governed production work	No evaluation suite, no evidence bundles — agent output is unverified
Phase 3	Tier 1 only for governed production work	Autonomy without verification; governance infrastructure not yet in place
Phase 4	Tier 2 (branch + human approval)	Verification gates operational; blast radius is contained
Phase 5+	Tier 3 (governed production impact)	Full Agentic Loop with verification, validation, and domain-scoped accountability
Phase 5+ with validated governance infrastructure	Tier 4 (policy-envelope autonomous operation)	Machine-enforced envelope, passing control evaluations, active governance observability, and rubber-stamping detection all confirmed operational

In regulated industries, use-case-specific caps apply independently of phase. See Companion Frameworks for the regulated-industry cap table.

Phase maturity and task blast radius are independent checks. Team phase determines the governance capability ceiling; it does not automatically qualify every task that falls nominally within that tier. For each task, perform a separate blast-radius assessment before acceptance:

What is the maximum credible impact if this specific task fails?
Does that impact stay within the governance coverage of the current phase?
If not — escalate the task to a phase with appropriate coverage, or decompose it so each subtask stays within the governance boundary.

A Phase 4 team operating correctly for Phase 4 can still fail on a cross-domain task whose blast radius exceeds Phase 4 governance coverage. Phase is a team capability ceiling; blast-radius assessment is a per-task gate. The most consequential failures tend to occur at domain boundaries, where tasks cross phase ceilings that are not checked at the task level.

Autonomy tiers and human oversight patterns are complementary classifications. Tiers classify the scope of agent authority. Oversight patterns classify the structure of human involvement. Both must be specified; neither is sufficient alone.

Four oversight patterns correspond to the tier model:

Human In the Loop (HITL). A human reviews and approves each agent output before it is enacted. This is the required pattern for Tier 1 proposals, Tier 2 merge approvals, Tier 3 evidence bundle sign-off, and any Tier 4 envelope change. HITL has two modes: synchronous (human review before the output takes effect — required when the action is irreversible before review can complete) and asynchronous (agent output enacted, human reviews and may revert immediately afterward — appropriate when the action is reversible within the review window). The distinction is not organizational preference; it is determined by the irreversibility window of the action class.

Human On the Loop (HOTL). The agent executes within defined boundaries; a human monitors and retains intervention authority before consequences become irreversible. The human does not review every action. HOTL is the natural oversight pattern for Tier 2 branch execution and Tier 3 operational monitoring. Its governance condition is the irreversibility window: the time from action to irreversibility must exceed the sum of alert detection latency, human notification latency, assessment time, and intervention execution time. If this condition cannot be satisfied, HOTL is not providing the claimed governance function — the pattern must be upgraded to HITL or the action's reversibility must be engineered to be longer.

Human Off the Loop (HOLL). The agent executes fully autonomously within a machine-enforced policy envelope. No human is present during operation. This is the Tier 4 pattern for in-envelope actions. HOLL requires three conditions beyond the Tier 4 prerequisites already stated: (1) the per-action evidence record is sufficient to reconstruct accountability without any human witness — a regulatory examiner must be able to determine from logs alone whether any action was within the approved envelope; (2) a periodic compliance audit at a defined cadence confirms that in-envelope actions remain within the specification's behavioral intent, not merely within the technical envelope; (3) reversion triggers are pre-defined — the conditions that cause the system to revert from HOLL to a more restrictive pattern are specified before deployment, not determined reactively after an incident.

Expert-Driven Loop (EDL). A qualified domain expert — not a general-competence reviewer — exercises judgment that defines correct behavior for use cases that cannot be fully pre-specified. EDL is not a separate queue structure. It is a qualification constraint on HITL: it determines who may serve as the human reviewer for designated work item classes. EDL applies wherever correct behavior requires domain expertise to assess rather than a pre-specified contract to enforce. In engineering delivery: independent validation for high-stakes and regulated systems (where organizational independence is necessary but not sufficient — domain expertise is also required); security review for systems with adversarial exposure; compliance validation in regulated industries. Expert judgment accumulates as ground truth: each expert review must produce a structured record of the case, the judgment, and the behavioral pattern it represents, so that the expert's criteria can be progressively codified into specifications that reduce EDL dependency over time.

The oversight pattern for a given work item class is not derived from its autonomy tier alone — it is specified at the point where the work item class is defined, and it constrains how the tier's governance is implemented. A Tier 3 system with asynchronous HITL has a longer action-to-review window than a Tier 3 system with synchronous HITL; both satisfy the tier model, but one requires a more demanding reversibility guarantee.

Minimum bar (HOTL): If the irreversibility window for a HOTL-designated action class has not been measured and confirmed to exceed the sum of monitoring detection, notification, assessment, and intervention time, HOTL is not providing oversight — it is providing the appearance of oversight.

Minimum bar (HOLL): If the per-action evidence record is not sufficient to reconstruct accountability from logs alone — without any human witness — HOLL is not a governed autonomy model. It is Tier 4 in name with ungoverned operation in practice.

Minimum bar (EDL): If the independent validator's domain qualifications are not documented, or if their review produces no structured record of judgment rationale, independent validation is organizational theater, not a governance control.

6. Knowledge and memory are distinct infrastructure

An agent without memory is a liability. But knowledge and memory are not the same thing, and conflating them is dangerous.

Knowledge is ground truth: code, documentation, ADRs, formal contracts, domain constraints. It is versioned, deterministic, and authoritative.

Learned memory is heuristic: reasoning patterns, incident learnings, routing preferences, and reusable skills. It is probabilistic, subject to decay, and requires continuous renewal — not just point-in-time control. Provenance, expiration, compression, rollback, and domain scoping are the mechanisms of that renewal cycle: each one governs not only what is stored, but whether what was learned is still valid before it is reused.

The practical test: if it changes through governed processes (pull requests, ADR reviews, schema migrations), it is knowledge. If it changes through feedback loops (agent learning, incident adaptation, routing optimization), it is learned memory. The governance mechanism determines the classification.

At the frontier, memory is not only retrieval. Agents can externalize procedures as reusable skill artifacts that evolve through experience without changing model weights. Those learned skills require the same provenance, review, rollback, and scoping discipline as any other memory layer.

Memory failure modes. The governance mechanisms above address the what-and-when of memory management. The threat model addresses what goes wrong when they fail:

Memory poisoning — an agent writes incorrect learnings that corrupt future agent behavior across sessions. Mitigate with human review gates on memory writes from agents operating at Tier 2 or above.
Cross-agent contamination — Agent A's domain-specific memory leaks into Agent B's reasoning context. Mitigate with domain-scoped memory namespacing and access controls on memory read paths.
Consistency under concurrency — two agents update the same memory store with conflicting observations. Mitigate with versioned writes and explicit conflict resolution policies, the same as for any shared mutable state.
Audit trail gap — "what version of memory was active when this decision was made?" requires point-in-time snapshots, not just current state, for meaningful incident reconstruction.
Knowledge contamination — agent-generated content enters the knowledge base through governed processes (commits, ADRs, documentation PRs) and is subsequently retrieved in future context with the same epistemic authority as human-authored knowledge. Mitigate by requiring provenance labeling of all agent-authored artefacts at commit time — the label travels with the artefact through versioning and retrieval so that consumers can apply appropriate epistemic weight. An unlabeled agent-authored ADR retrieved as authoritative knowledge is a failure mode that bypasses both memory governance and retrieval quality controls.

Minimum bar: If memory cannot expire, be rolled back, or show provenance, it is not memory — it is a liability. And if memory is not revalidated against current architecture and process before reuse, it is not being governed — it is being trusted.

7. Context is engineered like code

If the knowledge store is polluted with bad embeddings or stale data, the agent hallucinates — no matter how clean the code. Context quality and code quality are coupled. Context is a first-class dependency, engineered with the same rigor as code: versioned, tested, and performance-benchmarked.

Context retrieval must be fast enough to sustain the reasoning loop. Context windows are finite and reasoning quality degrades as low-signal context accumulates. Engineer explicit context budgeting: hierarchical retrieval, rolling summaries, state compaction, and authority-weighted pruning.

Minimum bar: If retrieval takes longer than the reasoning loop tolerates, context is broken infrastructure. But slow is not the only failure mode: stale embeddings, conflicting sources, semantic precision failures (fast retrieval of wrong artifacts), poisoned retrieval artifacts, and authority-weighting errors (an outdated ADR silently overriding current policy) are quality failures that a performance criterion does not catch. Context quality and code quality are coupled — both must be verified, not just timed.

8. Evaluations are the contract; proofs are a scale strategy

Evaluations define what "correct" means in terms the system can check autonomously. Every change must be verified against the evaluation suite — and every change must preserve or improve evaluation performance. Without evaluations, verification is assertion. Without verification, done is a claim.

Evaluations evolve with the system: coverage of the happy path, adversarial cases, regression scenarios, and behavioral checks. They are the machine- readable form of the acceptance criteria in Principle 2. When the specification changes, evaluations change with it.

"Proofs" here means formal verification of the contracts and infrastructure around agents — not of the agent's reasoning itself. You can prove that a retry policy is idempotent, that a state machine has no deadlocks, or that a type contract is satisfied. You cannot formally prove what an LLM will decide. The value of proofs scales with module count and risk: as more agents interact through more contracts, the contracts themselves become worth proving.

Minimum bar: If evaluations do not include regression cases, verification is incomplete.

Verification, validation, and independent validation are distinct disciplines. Passing evaluations satisfies verification. It does not satisfy validation or independent validation, which require additional steps:

Discipline	Question answered	Owner	Timing	Required by
Verification	Did we build it right? Implementation matches specification.	Development / QA team	Pre-merge, every change	Always
Validation	Did we build the right thing? Specification matches real-world need.	Product / domain owner	Pre-release	Phase 4+; always for regulated systems
Independent validation	Were verification and validation themselves rigorous?	Organizationally separate team (2nd line)	Pre-production	Any high-stakes system; mandated by SR 11-7, SS1/23, DORA in regulated industries

Independent validation is a governance principle, not merely a compliance requirement. Any system where a verification failure could cause significant harm — financial, safety-critical, reputational, or legally consequential — warrants organizational separation between the team that builds and verifies and the team that validates. Regulation formalizes this requirement; it does not create it. The most common failure: teams perform verification, label it validation, and have no independent validation. This is a quality gap in any context, not only a regulatory audit finding.

Independent validation must be capable of blocking production deployment. A team that can only observe and advise is not independent validation — it is a consultation. See Principle 12 for the accountability structure that makes independent validation meaningful.

Evaluations must also test whether the governance system works — not only whether the product works. A governance evaluation suite verifies: evidence bundle completeness (all required fields present and non-empty); provenance consistency (provenance fields match across artefacts in the same bundle); control state record accuracy (stated pass/fail/waived verdict matches the underlying artefact); rollback procedure currency (tested within the window defined by the evidence freshness rules); and SBOM completeness against the deployed dependency set. When governance evaluations fail, they trigger the same remediation sub-cycle as product evaluation failures — not a separate audit process. A governance system that is never evaluated is trusted, not governed.

9. Observability and interoperability cover reasoning, not just uptime

Instrument decisions, tool calls, policy violations, memory retrievals, cost per task, and near-misses — so you can explain why something happened, not just that it happened. Every agent action must produce an inspectable trace: diffs, tool calls, decision chains, evaluation results, rollbacks.

Traces are not logging. Logging records events. Traces reconstruct reasoning — the full chain from specification to decision to action to outcome. They are the audit trail that makes agentic systems governable, debuggable, and safe.

Observability and interoperability are coupled here because portable observability requires interoperable trace formats. You cannot aggregate traces across vendor boundaries without standardized contracts, and you cannot debug cross-runtime failures without replayable tool logs. They have separate minimum bars but share a dependency: without interoperability, observability fragments at the system boundary where it matters most.

Minimum bar (observability): If you cannot answer "why did this happen" from traces alone, you are not instrumented.

Minimum bar (interoperability): If tools cannot be swapped or replayed across runtimes without rewriting core workflows, the platform is brittle.

Observability covers governance state, not only system behavior. The following signals must be instrumented alongside reasoning traces: stale artefacts in active evidence bundles, surfaced without manual audit using the freshness rules in the Definition of Done; controls in a failed or waived state with no recorded resolution timeline or expiry; accountability ownership gaps — active production components with no named, current owner; rubber-stamping patterns — review-time distribution anomalies and approval-without-trace events as defined in the accountability metrics; and model, prompt, or tool manifest changes that have not triggered an evaluation re-run. Governance-state observability makes the difference between a system that is governed and a system that appears governed. If the current health of the governance state cannot be answered from instrumentation alone, the system is not observable in the sense that matters for agentic operation.

10. Assume emergence; engineer containment

Multi-agent systems exhibit emergent behavior by nature — some useful, some dangerous. Expect nonlinear failures, feedback loops, and phase changes. Build guardrails, rate limits, circuit breakers, and safe fallbacks before you need them.

When emergence produces useful behavior, capture it. When emergence produces dangerous behavior, contain it. The difference between these two outcomes is the quality of your containment engineering.

Security is a containment concern, not a separate audit. Agentic systems that autonomously write, execute, and deploy code present a distinct attack surface that must be threat-modeled before granting autonomy beyond Tier 1:

Prompt injection — adversarial content in retrieval artifacts, tool responses, or code patterns that redirects agent behavior without the operator's knowledge. Mitigate by treating all tool responses, retrieval artifacts, and agent-to-agent messages as untrusted input subject to input schema validation before processing. If the agent runtime cannot enforce an input boundary between external content and internal instruction, prompt injection is structurally enabled.
Privilege escalation — chained agent calls that accumulate permissions no single call would be granted under least-privilege policy.
Data exfiltration — tool calls that surface sensitive data to outputs that are not fully inspected or logged. Mitigate with egress controls on tool outputs — agent outputs that include retrieved or generated content must be logged with full trace before leaving the trust boundary. A system where output channels are not fully inspected and logged has no exfiltration defense.
Supply chain attacks — poisoned tool registries, model adapters, or retrieval sources that corrupt agent behavior at ingestion time. Mitigate by pinning tools to verified manifests — checksum or signing verification against a known-good registry. A tool added to the manifest without integrity verification is an uncontrolled dependency.
Social engineering — AI-generated outputs crafted to pass human reviewer scrutiny by exploiting reviewer trust in fluent, confident text. Mitigate by surfacing primary artefacts as the default review interface for any human-approval decision. An approval workflow whose default view is an agent-authored summary is structurally vulnerable to this attack.

Treat every retrieval artifact, tool response, and agent-to-agent message as untrusted input. Defense-in-depth means identity for agents and tools, signed provenance for shared state, least-privilege tool scopes, egress controls, and continuous anomaly detection for cross-agent trust edges.

Minimum bar: If you have not tested with tool outages, noisy retrieval, and adversarial inputs, you are not chaos-tested. If you have not threat-modeled prompt injection, privilege escalation, and exfiltration vectors for your specific agent topology, you are not security-tested.

Governance failure modes are containment concerns, not compliance concerns. The following failure modes must be engineered against with the same discipline applied to security threats:

Evidence laundering — an agent assembles an evidence bundle from outputs it generated, creating circular self-attestation with no independent verification. Mitigate by requiring that at least one verification step in the evidence bundle is executed by a process that did not produce the artefact under review.
Approval laundering — a human signs off on a change by reviewing an agent-generated summary rather than the underlying evidence. Mitigate with evidence bundle presentation controls that surface primary artefacts, not agent-authored summaries, as the default review interface.
Compliance theater — evaluations and controls are added to satisfy audit requirements rather than to catch failures. Detectable by back-testing whether controls would have caught known past failures; if not, the control is theater.
Stale-control reliance — a control is recorded as passing because it has not been re-run since the system changed, not because the system still satisfies it. Mitigate with the evidence freshness rules in the Definition of Done.
Automated rubber-stamping — human review rate collapses under volume; reviewers approve without meaningful inspection. Detectable via review-time distribution metrics as defined in adoption-metrics.md; requires the response defined in Principle 12: raise automation barriers, lower autonomy tiers until oversight signal quality is restored.
Waiver accumulation — waivers granted for specific circumstances persist beyond those circumstances, silently expanding the system's effective policy boundary. Require waiver expiry dates and automated staleness detection so accumulated waivers surface at the next release gate.

11. Optimize the economics of intelligence

Not every task requires the most capable model. Build a dynamic routing layer. Route simple tasks to fast, cheap models. Reserve expensive, high-reasoning models for complex orchestration and critical decisions. Model choice is a runtime decision, not a configuration constant.

Optimize total cost of correctness — not just inference cost, but the full cycle: inference + verification + governance overhead + incident remediation. Include human costs: review time per tier, context-switching across model behaviors, and debugging heterogeneous failure modes in multi-model routing. Track cost per task, cost per outcome, and cost per quality unit. When governance overhead exceeds the value of the work, that is a signal to simplify, not to add more governance.

Multi-model coherence. In heterogeneous swarms, different models may hold conflicting internal representations of the same codebase — different architectural pattern priors, different conventions for what "correct" looks like, different training-data views of domain boundaries. This coherence gap compounds at Phase 5+ when agent roles are highly specialized. Mitigate by: making shared architectural decisions explicit in the knowledge base rather than relying on implicit prompt conventions; routing semantically related tasks through the same model tier when consistency matters more than cost; and treating cross-model disagreement on shared artifacts as an observable quality signal rather than a coordination annoyance.

Minimum bar: If model choice is a configuration constant instead of a runtime decision, you are overspending.

12. Accountability requires visibility

Agents execute; humans own outcomes, risks, approvals, and incidents. No agent — however capable — absorbs legal, ethical, or operational responsibility. Release decisions, risk acceptance, production behavior, and incident response require a human with skin in the game. Agents may prepare evidence, summarize risk, flag missing controls, and recommend decisions. Agents may not accept residual risk, approve production exposure, waive controls, or absorb accountability for business outcomes. The boundary is not capability — it is consequence: when a decision has consequences the organization must answer for, a named human must make it.

But accountability without visibility is a legal fiction. You cannot own what you cannot see. The autonomy tiers in Principle 5, the traces in Principle 9, and the verification and validation disciplines in Principle 8 exist to make human accountability meaningful rather than ceremonial.

In regulated environments, accountability extends to independent validation: the organizational separation between the team that builds and verifies a system and the team that independently validates it is not bureaucracy — it is the mechanism that makes accountability real. A governance structure where the same team both builds and validates has no external check on whether its verification was genuine.

Accountability at scale operates at the policy level, not the action level. When agents process thousands of actions daily, per-action human review is neither feasible nor the right model. The resolution is a three-tier framework applied per action class:

Action class	Human involvement	Accountability mechanism
Low-risk, reversible (Tier 1, contained blast radius)	None per action; domain owner reviews statistical samples and trend dashboards	Automated evidence bundle; rollback ready; anomaly alert if pattern deviates
Medium-risk, governed (Tier 2, branch + approval)	Human approves merge; does not review every line	Evidence bundle gates approval; trace available on demand
High-risk, production-impacting (Tier 3)	Named human reviews evidence and accepts risk per change	Full evidence bundle required; no automated promotion
Policy-envelope autonomous (Tier 4)	Human approves and owns the policy envelope; no per-action review; anomaly detection routes deviations to human	Control state record confirms each action fell within the approved envelope; governance observability surfaces drift; kill switch available and tested

A domain owner owns the risk policy, the autonomy tier ceiling, the escalation path, and the incident response protocol for their domain. They do not approve every low-risk action — they own the framework that governs those actions, and they carry the accountability when that framework fails. When trace volume exceeds meaningful review capacity, the correct response is to raise automation barriers (tighten evaluation thresholds, lower autonomy tiers) until oversight signal quality is restored — not to accept degraded oversight as a workload problem.

Failures are data: errors and crashes are learning opportunities, and hallucinations can become a hallucination loop where plausible-but-wrong early output drives increasingly wrong follow-on fixes. Never simply retry a failed prompt. Diagnose, update memory, strengthen contracts and constraints, and rerun verification before retrying. But someone must own the consequences when systems go live. Clear responsibility is not bureaucracy; it is system safety.

Minimum bar: If no named human can inspect the reasoning, review the evidence, and own the outcome of a production agent, the system is ungoverned.

The four oversight patterns define how accountability is exercised, not just claimed. The three-tier action class table above maps action risk to human involvement. The oversight patterns from P5 specify the structure of that involvement. Together they close the gap between "a named human is accountable" (a statement about who) and "accountability is actually being exercised" (a statement about how).

Accountability failures in agentic systems cluster around three patterns:

Accountability diffusion — the accountable human's name is on the record, but their actual review was nominal. The oversight design nominally provides HITL; in practice, the reviewer approves without reviewing. Detected by: override rate at or near zero for work item classes with known complexity; reviewer agreement rate above 95% for sustained periods; review latency consistently below the minimum plausible for the case type. When these signals appear, raise the investigation before assuming the agent has become trustworthy — the more likely explanation is that human oversight has degraded into a signature ceremony.

Accountability displacement — the oversight design nominally provides HOTL, but the monitoring design does not detect violations within the irreversibility window. The human is on the loop in principle; the loop has no signal that reaches them in time to act. Detected by: monitoring false negative rate (violations that did not generate alerts, discovered post-hoc); alert-to-intervention latency exceeding the irreversibility window on measured cases; monitoring coverage validation failures.

Accountability abstraction — the oversight design nominally provides HOLL, but the per-action evidence record is insufficient to reconstruct what happened and why. The policy envelope was approved; the audit cannot confirm compliance. Detected by: evidence records with missing fields; audit findings that cannot be resolved because the log is incomplete; inability to answer, for a specific action, whether it was within the approved envelope.

The Expert-Driven Loop is the fourth accountability mechanism. When correct behavior cannot be fully pre-specified, the domain expert who defines it is the accountability anchor for that determination. An independent validation performed by a reviewer without domain expertise does not satisfy the accountability requirement for expert-designated work item classes — it satisfies the presence requirement while failing the quality requirement. The accountable expert is not interchangeable with an accountable generalist for work item classes where the distinction matters.

Oversight adequacy is measurable. "We have HITL" is not an accountability claim. "Our HITL override rate is N%, reviewer calibration is current, SLO compliance is N%, and capture detection has not triggered in N months" is an accountability claim. Accountability requires visibility — into the agent's reasoning (P9) and into the oversight mechanism's performance. Both must be instrumented.

Minimum bar (oversight adequacy): If you cannot report, for each oversight pattern in use, the metrics that indicate whether the pattern is delivering genuine governance — override rates for HITL, false negative rates for HOTL, compliance audit currency for HOLL, expert qualification currency for EDL — you cannot claim the governance is functioning. You can only claim it is present.

What "done" means in agentic engineering.

See the Manifesto for the core values and the Agentic Loop. See the Twelve Principles for the engineering principles.

The Agentic Definition of Done

Tokens generated and tasks dispatched are vanity metrics. "The agent said it worked" is not a completed ticket.

A change is done when it is:

Shipped — deployed or delivered, not just merged.

Observable — instrumented and logged so reasoning can be inspected and reconstructed from traces.

Verified — evaluated against regression tests (and adversarial cases), with an evidence bundle (diffs, trace IDs, policy check outputs) required for every automated merge.

Provable (when risk requires it) — formalized invariants and replayable proof artifacts attached for critical workflows.

Learned from — knowledge base and learned memory updated with what was discovered, with provenance.

Governed — operating within autonomy tiers appropriate to its risk, with human accountability assigned.

Economical — routed through appropriate model tiers, cost tracked and justified per outcome.

Anything less is not done for the current phase.

This DoD is phase-calibrated, not all-or-nothing. At Phase 3, "verified" means tests and a diff; at Phase 5, it means reproducible replay with formal artifacts where justified. "Provable" applies only when risk requires it; "economical" matters only when routing infrastructure exists. The bar rises with the stakes — but at every phase, the question is the same: can you show evidence, not just assertions?

Evolvability as an implicit criterion. A change that passes today's tests but degrades the codebase's capacity for future change is not truly done — it has traded short-term correctness for structural regression. Chen et al., SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration (2026, https://arxiv.org/abs/2603.03823), reports that most frontier models achieve zero-regression rates below 0.25 across 100 maintenance tasks averaging 71 consecutive commits and 233 days of repository history — i.e., regressions appear in at least three of every four CI iterations for most agents. Independent measurements corroborate the long-horizon collapse: Deng et al., SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? (2025, https://arxiv.org/abs/2509.16941), reports frontier-model success rates near 23% on long-horizon, multi-file engineering tasks — versus substantially higher rates on shorter benchmarks such as SWE-bench Verified — and Kwa et al., Measuring AI Ability to Complete Long Software Tasks (METR, 2025, https://arxiv.org/abs/2503.14499), shows reliable model success collapsing as task duration extends beyond roughly fifty minutes of equivalent human effort. Treat these as calibration points, not universal rates; the rates differ by benchmark composition and model generation. They complement the issue-resolution benchmarks that preceded them — Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (ICLR 2024, https://arxiv.org/abs/2310.06770), and Yang et al., SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (2024, https://arxiv.org/abs/2405.15793) — by extending the evaluation surface from single-issue capability to multi-commit, long-horizon maintenance behaviour. SWE-CI is evidence of behavioral regression risk, not a direct measure of architectural evolvability: CI metrics do not capture coupling growth, cohesion decay, abstraction quality, or future-changeability. Both risks are real and distinct. At Phase 4 and above, "verified" should include evolution-weighted signals beyond CI pass rates — static analysis for coupling growth, module boundary stability, and change amplification — alongside the behavioral regression coverage the benchmark measures. See Structural Regression in the Companion Guide.

Why it matters: This forces the system to optimize for actual business outcomes rather than raw output volume, killing the illusion of productivity.

Definition of Done for Hardening

Applying the agentic DoD to work that begins as rapid exploration ("vibe coding") and must become governed engineering before it ships.

Exploratory agent output is not production-ready by default. A prototype that "worked in the demo" has not passed the Agentic Definition of Done. The four steps below define what hardening means: the path from captured exploration to governed, verifiable output.

Step 1 — Capture. Record the vibe output exactly as produced: diffs, trace IDs, prompts used, tool calls made, and any model or configuration state at the time of generation. Treat this as raw evidence, not a deliverable. Do not edit or clean the output before capturing it — the unmodified artifact is the baseline.

Step 2 — Extract Specification. From the captured output, derive the specification the agent was implicitly working toward: what behavior does the output exhibit, what constraints does it respect (or violate), and what observable success criteria would confirm it is correct? This step converts intent from the agent's context window into a machine-readable, reviewable specification. If no coherent specification can be extracted, the output is not a candidate for hardening — it is a candidate for restart.

Step 3 — Build Evaluation Portfolio. For the extracted specification, author an evaluation portfolio (P8): behavioral tests, adversarial cases, and at least one holdout case not derived from the captured output. The portfolio must include explicit regression coverage for any behavior the captured output depends on. Evaluation theater — a portfolio that only tests the happy path the exploration already demonstrated — does not satisfy this step.

Step 4 — Verify and Refactor. Run the evaluation portfolio against the captured output. Fix every failure. Refactor for structural quality (coupling, abstraction, module boundary stability) sufficient for the change's autonomy tier and risk level. Attach the evidence bundle (passing evaluations, trace IDs, refactoring diffs) to the change. The change is done when the evidence bundle is complete and a named human is accountable for it (P12).

The evidence bundle has a defined set of required components. The following components are always required: passing evaluation reports with timestamps, trace IDs linking specification to execution to output, diffs showing exactly what changed, and policy check outputs confirming constraint compliance.

Security static analysis results. For loop outputs that produce code touching external interfaces — HTTP endpoints, user-input processing, database queries, file system operations, or external API calls — static application security analysis must be run against the generated code and its results included in the evidence bundle. The analysis must pass with no unresolved Critical or High severity findings. A finding waived rather than remediated must be documented with explicit justification and approved by the accountable human before the evidence bundle is considered complete. The OWASP Application Security Verification Standard (ASVS) provides the reference set of security requirements against which static analysis rules should be calibrated; the specific tooling that executes the analysis is an implementation choice.

Bundle integrity attestation. The assembled evidence bundle must be integrity-protected before the release gate is assessed. Integrity protection means that any post-assembly modification to the bundle's contents is detectable. For most organisations, this is achieved by generating a cryptographic hash of the complete bundle at assembly time and recording the hash in the governance record, or by applying a digital signature using a key controlled by the release system or the accountable human. The integrity record — hash or signature — must be verifiable at the time of release gate assessment and must be retained alongside the bundle for the duration of the system's operational lifetime. A bundle whose integrity cannot be verified is a bundle that may have been modified after the evidence was produced. In regulated environments where tamper-evident audit trails are legally required, digital signatures with non-repudiation properties are the appropriate mechanism.

Agentic provenance record. For any loop output produced by a system using foundation models, the evidence bundle must include a machine-readable record capturing the following fields at the time the loop completed:

Foundation model identifier and version — the specific model used during the loop iteration, including fine-tune or adapter identifier if applicable.
Model provider category and deployment mode — the category of provider (e.g., third-party API, self-hosted open-weight model, on-premises deployment) without naming specific commercial products or vendors.
Evaluation model parity — whether the model used during evaluation matches the model deployed to production; if they differ, a documented justification must be present.
System instructions and prompts — represented by cryptographic hash, not plaintext, to protect confidentiality while enabling change detection across iterations.
Tool manifest — the complete list of tools, plugins, and integration servers available to the agent during the loop, their version identifiers, and their permission scopes.
Memory state version — the identifier of the agent's persistent memory state at loop entry and at loop exit, where persistent memory is in use.
Retrieval corpus version — the version identifier of any retrieval corpus, vector store, or knowledge base consulted during the loop, where retrieval augmentation is in use.
Embedding model version — the model used to generate embeddings, where applicable.
Dataset lineage — for systems using fine-tuned or domain-adapted models, the lineage of the training or fine-tuning dataset.
Policy constraints active — the set of safety, content, and behavioral policies in effect during the loop iteration.

The record must be structured (machine-readable). It must be generated as part of the loop completion process, not as a manual post-hoc document. It must be filed with the evidence bundle and retained for the operational lifetime of the system. The absence of this record is treated as an evidence bundle incompleteness — not as a pass.

A change in any provenance field between loop iterations constitutes a material change. It must be surfaced explicitly at the release gate because it may affect evaluation validity: an evaluation run against one model configuration, tool manifest, or retrieval corpus does not automatically transfer to a different configuration. Any such change must be reviewed before the gate is cleared.

The practical test. Ask: if the person who ran the exploration session left today, could another engineer reproduce, modify, and extend this output using only the specification, the evaluation portfolio, and the evidence bundle? If the answer is no, hardening is not complete.

When to skip hardening. Exploration output that will be discarded — a spike, a proof of concept that will be rewritten, a learning exercise — does not require hardening. The trigger for hardening is intent to ship, not intent to keep. If the output is going to influence production behavior in any form, the four steps apply.

Evidence Freshness

Evidence decays. An evidence bundle that was complete and accurate at the time of filing may become stale as the system changes beneath it. A threat model produced for last quarter's architecture may not cover this quarter's integrations. An SBOM generated at the last deployment does not reflect a dependency update applied in a hotfix. A cost forecast produced before a model repricing event no longer represents the system's actual cost profile. The evidence bundle is not a permanent artefact; it is a snapshot of the system's state at a point in time, and its validity is bounded by the changes that have occurred since it was produced.

The following freshness rules apply to evidence bundle artefacts:

Threat model: Stale on any material architecture change, tool manifest change, data source change, model version change, or integration of new external systems. There is no calendar freshness window — staleness is event-triggered by changes to the system's attack surface.
SBOM: Stale on any dependency change or new deployment. A deployment that uses a dependency set different from the one in the filed SBOM requires a new SBOM before the release gate.
Security static analysis results: Stale on any code change to the files analyzed. Analysis results are scoped to the specific commit analyzed; they are not forward-valid for subsequent commits.
Cost forecast: Stale before a release if model pricing has changed since the forecast was produced, or if the model tier used in the loop has changed. Stale in production if actual cost diverges from forecast by more than 20% over a 30-day window.
Runbook: Stale at every new release. The runbook must be reviewed and updated (or confirmed current) before each release before the release gate's rollback condition is satisfied.
DPIA: Stale on any change to the data purpose, data categories processed, or the identity of data subjects affected. A DPIA produced for an initial version of a system does not automatically apply to a version that processes additional data categories.
Model evaluation suite run: Stale on any change to the model version, prompt version, tool manifest, or retrieval corpus. An evaluation run conducted under a prior configuration is not valid for a configuration that differs in any of these dimensions.
Rollback procedure test: Stale 48 hours before production deployment. A rollback test must be conducted within the 48-hour window immediately preceding the planned deployment.
Agentic provenance record: Generated per loop iteration. The provenance record for a prior iteration is not valid for a subsequent iteration; any change to a provenance field generates a new record.

A stale artefact in the evidence bundle does not automatically fail the gate — it changes the condition's GateState to stale and requires the artefact to be refreshed before the gate can be assessed as passed. The distinction between stale and fail matters: a stale threat model was once correct; understanding when it became stale and why guides the refresh effort. A failed threat model requires correcting specific deficiencies. Treating all stale artefacts as failures conflates two different problems and produces the wrong remediation response — it directs the team toward correcting deficiencies that may not exist, rather than refreshing coverage for changes that have occurred since the artefact was produced.

Extended guidance, tradeoffs, failure modes, and reference material for each principle in the Agentic Engineering Manifesto.

This guide provides the deeper rationale, advanced bars, and operational detail behind each principle. Read the Manifesto first for the core values and minimum bars. Come here when implementing.

See the Adoption Playbook for organizational change management, role transitions, and pilot design.

Principle-by-Principle Guidance

Extended guidance for each of the twelve principles: Outcomes, Specifications, Architecture, Swarm Topology, Autonomy, Knowledge & Memory, Context, Evaluations & Proofs, Observability & Interoperability, Emergence & Containment, Economics, and Accountability. Includes the Accountability Paradox, retrieval SLOs, context budgeting, tier boundary design, and memory governance operational detail.

Cross-Cutting Frameworks

The Agentic Maturity Spectrum — six phases from Guided Exploration to Adaptive Systems, with failure modes for each
Boundary Conditions — when to cap autonomy, what regulated industries can still use, what would need to change
Operational Definitions — blast radius, right-sized, evidence bundle with phase-calibrated examples

Worked Patterns and Failure Patterns

Six worked patterns (A-F) showing the manifesto applied to concrete scenarios: single-domain reliability, cross-domain coordination, memory poisoning recovery, economics routing, runtime tier escalation, and a governed failure where governance didn't prevent the incident. Plus the Hallucination Loop and Operational Recovery Cycle failure patterns.

Failure Modes and Skill Requirements

Failure Modes of This Manifesto — over-governance, evidence theater, control theater, security theater, adoption theater, maturity inflation
Skill Requirements by Principle — readiness assessment (Ready, Reorient, Split, Acquire) for each principle with adoption bottleneck identification

Requirements Engineering Framework

Reference framework for specifying agentic systems: the two-axes classification matrix (system type × artifact consumer), behavioral envelope (four layers: hard boundaries, soft boundaries, performance, adaptation), hard requirements vs. probabilistic assurance targets, single-source/ multiple-projections principle, tiered lifecycle (Tier 1/2/3), NFR table, per-requirement checklist, multi-agent behavioral contracts, and change control.

Extended guidance, tradeoffs, and operational detail for each principle in the Agentic Engineering Manifesto.

Read the Manifesto for the core values and minimum bars. See the Companion Guide for the full table of contents. See the Adoption Playbook for organizational change management, role transitions, and pilot design.

principles-01.md — Outcomes: Extended Guidance
principles-02.md — Specifications: Extended Guidance
principles-03.md — Architecture: Extended Guidance
principles-04.md — Swarm Topology: Extended Guidance
principles-05.md — Autonomy: Extended Guidance
principles-06.md — Knowledge & Memory: Extended Guidance
principles-07.md — Context: Extended Guidance
principles-08.md — Evaluations & Proofs: Extended Guidance
principles-09.md — Observability & Interoperability: Extended Guidance
principles-10.md — Emergence & Containment: Extended Guidance
principles-11.md — Economics: Extended Guidance
principles-12.md — Accountability: Extended Guidance

See the split files below for the full Companion Principles corpus.

Principle 1 — Outcomes: Extended Guidance

See Principle 1 in the manifesto for the core statement and minimum bar.

The Probability-Compounding Problem

A common intuition is that system correctness compounds multiplicatively — if each module is correct with probability p, a system of N modules has roughly p^N correctness. This mental model is misleading in two directions:

Too optimistic, because it assumes independent failures. Real agentic systems share models, knowledge bases, and tool chains that create correlated failure domains. A single poisoned retrieval shard or a shared model blind spot can invalidate every agent simultaneously — far worse than p^N predicts.
Too pessimistic, because cross-verification between agents can break the compounding chain in ways that independent modules cannot. When agents verify each other's outputs against independent evidence sources, the effective error rate can be driven below any individual module's failure rate.

The useful question is not "what is p^N?" but "where are the shared dependencies that make failures correlated?" A working failure-domain decomposition:

Correlated model failure: The same base model is used everywhere, making reasoning blind spots systemic.
Correlated retrieval failure: The same poisoned or stale knowledge base shard feeds multiple agents. In practice, this is often the most insidious class because it produces plausible-looking but systematically wrong outputs.
Correlated tool failure: The same flaky integration or API rate limit blocks the entire swarm.
Correlated governance failure: The same reviewer fatigue or policy misconfiguration rubber-stamps errors.

This is a practitioner framework, not a proven exhaustive taxonomy. Teams should extend it for their specific failure surfaces and validate priority ordering against their own incident data. The shared dependencies it names mean system-level risk is often much worse than independent-failure models suggest — but also that targeted decorrelation (diverse models, independent retrieval indexes, redundant tool chains) can yield outsized reliability gains.

Evidence Bundles and Assurance Levels

This does not mean full formal verification is a near-term default for every team. It means assurance must scale with blast radius and system size. Evidence bundles should be immutable, replayable, and auditable, with proof artifacts introduced where risk justifies cost: signed trace manifests when required by policy, deterministic replay artifacts, and formalized invariants verified by proof or model-checking tools where warranted.

Principle 2 — Specifications: Extended Guidance

See Principle 2 in the manifesto for the core statement and minimum bar.

Contract-First Agentic Development

In practice, this can include contract-first agentic development: agents propose both implementation and machine-checkable contracts (preconditions, postconditions, invariants), then iterate in a tight loop: specify, implement, attempt to prove, fail, refine, repeat. Proof failure is not a blocker to hide; it is a steering signal.

Specifications as Agent-Consumable Artifacts

The specification-as-living-artifact pattern now has concrete implementations. Agent Skills (SKILL.md files — structured metadata plus step-by-step instructions that agents consume at runtime) and AGENTS.md (repository-level machine-readable constraints) are increasingly supported across several IDEs and coding agents. Both formats validate the core P2 claim: specifications that agents can parse directly reduce ambiguity, improve adherence, and make convergence measurable. Skills define what an agent can do; AGENTS.md defines how it must behave within a codebase. Together with agent-to-tool protocols (which define how agents connect to external capabilities), they form the specification layer of the emerging standards stack.

The Specification-Driven Development Movement

The specification-first pattern is not just an architectural recommendation — it is converging as the dominant practitioner workflow. A wave of open-source specification-driven development (SDD) frameworks has emerged, all built on the same thesis P2 advocates: write the spec before the agent writes the code. The pattern across these frameworks is consistent: specifications are treated as code artifacts, baked into workflows, and consumed by agents before implementation begins — whether through specify-plan-implement pipelines, state-machine-governed iteration, or composable skill-driven workflows. This validates P2's core claim at practitioner scale. See Sources for specific framework references.

Convergence Criteria

Specification evolution needs convergence criteria. Treat a specification as converging when acceptance criteria remain stable across successive iterations, scope narrows rather than expands, and incident classes trend downward. If each loop adds ambiguity or expanding goals without quality improvement, treat it as scope drift and reset the boundary.

Validation vs. Verification

Evaluations (Principle 8) and evidence bundles (Principle 1) answer the verification question: did we build it right? They confirm the implementation matches the specification. But verification alone has a blind spot: you can pass every check and still ship the wrong thing, just faster.

Validation answers a different question: did we build the right thing? Does the specification itself make business sense? Is the work scoped correctly? Will real users get value from it? Agents make the validation gap more dangerous because they can generate feature-shaped output quickly — complete with passing tests, clean architecture, and a full evidence bundle — while the underlying specification was never worth implementing.

The Agentic Loop addresses validation explicitly through the Validate → Observe → Learn → Govern cycle: after verification confirms technical correctness, validation checks fitness for real-world use; runtime behavior, usage data, and business outcomes then feed back into specification revision. But this only works if teams treat Validate as a distinct discipline from Verify, not just a technical monitoring step. Concretely:

Frame the work in context before specifying. Is this a proof of concept, a minimum viable feature, or a production commitment? Define "good enough" for each context and make the underlying business assumptions explicit. An agent cannot validate its own specification against business reality — that is a human judgment that must happen before the Loop begins.
Define stop criteria, not just acceptance criteria. Acceptance criteria tell the agent when the implementation is correct. Stop criteria tell the team when to abandon or pivot the specification itself — when usage data, customer feedback, or market evidence shows the spec was wrong regardless of implementation quality.
Connect evaluation results to business outcomes. If escaped defect rate is low but adoption, usage, or customer satisfaction metrics don't improve, the verification machinery is working but the validation loop is broken.

This is not a new idea — it is the core of Agile's "customer collaboration" value, and it survives unchanged into agentic engineering. What changes is that agents amplify the failure mode: without explicit validation loops, a team can ship more verified-but-wrong features in a month than a human team could in a quarter.

Requirements Engineering for Agentic Systems

Traditional RE was designed for deterministic systems. Agentic and hybrid systems require an extended framework. The key extensions are covered in companion-re-framework.md. The three most important for specification work:

Two-axes classification. Every requirements artifact sits on two axes: (1) system type — deterministic, agentic, or hybrid; and (2) artifact consumer — human, agent, or hybrid. The cell your requirement occupies determines the correct format and verification approach. Probabilistic assurance targets replace binary pass/fail requirements for agentic components. Agent-consumable specifications must be unambiguous to a machine — contextual inference is unreliable.

Behavioral envelopes. For agentic components, the primary specification artifact is a behavioral envelope — the boundary the system must stay within — not a list of enumerated acceptable outputs. The envelope's Layer 1 hard boundaries must be enforced by infrastructure policy, not prompt instruction. The performance envelope generates the evaluation suite directly.

Single-source principle. When a specification serves both human and agent consumers, one canonical document must be the source of truth. All other representations — governance prose, machine-readable encoding, evaluation criteria, compliance mapping — are derived projections. Independent authoring of separate documents is a divergence schedule.

See companion-re-framework.md for the full framework: two-axes matrix, hard requirements vs. probabilistic assurance targets, behavioral envelope structure, tiered lifecycle, per-requirement checklist, and academic references (arXiv:2602.22302, arXiv:2503.18666, NIST AI 600-1, ISO/IEC 5338).

The Architect Pattern: Agent-Generated Specifications

The manifesto treats specification steering as a human-governed activity. But emerging evidence shows that specification generation itself can be an agent capability — and that the quality of this capability is the primary differentiator in long-term maintainability.

The Architect–Programmer pattern separates these concerns explicitly: an Architect agent observes system behavior (test results, CI feedback, runtime metrics), diagnoses root causes, and generates machine-readable requirements. A Programmer agent implements those requirements. The cycle repeats: the Architect observes the results, refines the specification, and the Programmer iterates.

This pattern is a concrete instantiation of the Agentic Loop's Observe → Learn → Specify cycle. The SWE-CI benchmark (arXiv:2603.03823) validates it empirically: across 100 tasks spanning an average of 233 days and 71 commits of real-world development history, the Architect's ability to transform CI feedback into actionable requirements was the primary differentiator in long-term code maintainability. The three-step Architect protocol — Summarize (review failures), Locate (attribute to deficiencies), Design (produce requirements) — maps directly to the manifesto's convergence criteria: specifications that sharpen as evidence accumulates.

When to use this pattern: Long-running maintenance tasks where the specification must evolve across many iterations. For bounded, short-horizon tasks, a single agent with a clear specification may be more efficient (see Principle 4 guidance on topology choices). The Architect pattern is not a universal requirement — it is a validated topology for sustained evolution.

The governance implication: When specifications are agent-generated, the human role shifts from writing specifications to governing specification quality. The human defines the acceptance criteria for the Architect's output — what constitutes a valid requirement — and reviews the Architect's decisions at a cadence appropriate to the risk tier. The specification is still a governed artifact; the governance mechanism changes.

Specifications vs. Constraints

Specifications and architectural constraints (Principle 3) operate at different layers and change at different speeds. Constraints are invariants — security policies, domain ownership boundaries, data integrity rules — that hold across specification iterations. Specifications are goals and acceptance criteria that evolve within those invariants.

In practice, this means: an agent can propose a revised acceptance criterion without governance overhead, but proposing a relaxed constraint triggers a governed review (ADR update, policy approval, impact assessment). If your system cannot distinguish these two change types, specification iteration will silently erode your architectural boundaries.

Principle 3 — Architecture: Extended Guidance

See Principle 3 in the manifesto for the core statement and minimum bar.

Prompt Drift and Enforcement

Prompts drift, and context windows degrade. They approximate compliance — they do not guarantee it the way a compiler obeys syntax. When architecture is merely described rather than enforced, agents will violate it. When architecture is enforced but not monitored, violations will go undetected.

Domain-Driven Design for Swarms

Domain-Driven Design gives each swarm a bounded context — what it owns, where code belongs, what is forbidden to reinvent. Retrieval is untrusted input; treat context injection as a threat vector. This reduces swarm collisions and hardens the system against both accidental drift and adversarial conditions.

AGENTS.md files (an emerging repository-level convention in the AAIF ecosystem for agent instructions) offer a practical mechanism for encoding architectural constraints at the repository level. They function as machine-readable ADRs that coding agents respect at runtime — a concrete implementation of architecture as defense-in-depth.

Agent-as-Tool and Software of Unknown Provenance

In regulated development, software components are classified by provenance and qualification status. When agents participate in development, three classification questions arise:

The AI model itself: Non-deterministic, version-dependent, and opaque. Under IEC 62304 (SOUP), DO-178C/DO-330 (tool qualification), and GAMP 5 (software categories), the model cannot currently be qualified through traditional means.
Agent-selected dependencies: When an agent pulls in a library or pattern, it is making a provenance decision that may carry regulatory consequences. The human must own dependency approval; the agent must not introduce unvetted dependencies silently.
Agent-generated code: May incorporate training-data patterns that constitute derivative unclassified software. Evidence bundles must capture sufficient provenance to support classification.

The manifesto's defense-in-depth response: treat the agent as an unqualified tool and independently verify all output through qualified means. This is architecturally equivalent to treating retrieval as untrusted input (above). The infrastructure must enforce dependency allow-lists, and evidence bundles must capture dependency provenance.

See companion-frameworks.md for the cross-domain analysis and domains/ for domain-specific classification requirements.

Principle 4 — Swarm Topology: Extended Guidance

See Principle 4 in the manifesto for the core statement and minimum bar.

Topology Choices

Topology choices must be explicit, for example:

Single agent/pipeline for bounded tasks with low coordination overhead.
Hierarchy for clear decomposition with centralized decision checkpoints.
Mesh for discovery-heavy work where peers benefit from lateral coordination.

Bio-inspired swarms (experimental): bee-hive patterns and similar biologically-inspired coordination models appear in research for large search and exploration spaces. These are not production-proven at the time of writing. Naming them here is not an endorsement — it is an acknowledgment that teams will encounter them. Default to single, pipeline, hierarchy, or mesh unless your own measured results on your own workload justify bio-inspired coordination.

Inter-Agent Communication Standards

Open agent-to-agent protocols are beginning to standardize agent discovery, task lifecycle management, and cross-framework collaboration. The manifesto's governance model — tiers, traces, accountability — sits above these protocols: the protocol handles agent-level coordination; the manifesto's principles govern what those agents are allowed to do and how their decisions are audited. Teams adopting multi-agent topologies should treat communication protocols as the coordination layer and the manifesto's tier model as the authorization layer.

Expected Failure Modes by Topology

Expected failure modes differ by topology: bottlenecked leads in hierarchies, coordination storms in meshes, hidden coupling in pipelines, and role drift or signal-amplification errors in bio-inspired swarms (for example, over-committing to early weak signals). Use bio-inspired topologies only with empirical evidence that they outperform simpler topologies for the target workload.

The Single-Agent Default and Its Limits

The manifesto states: "a single well-evaluated agent with excellent tools often outperforms an expensive, uncoordinated swarm." This holds for bounded, short-horizon tasks where specification and implementation can be handled in a single context.

For long-term maintenance tasks — where the specification must evolve across dozens of iterations based on accumulated evidence — the Architect–Programmer separation may be structurally necessary, not just a preference. The SWE-CI benchmark (arXiv:2603.03823) provides evidence: across tasks spanning an average of 233 days and 71 commits, separating specification generation (Architect) from implementation (Programmer) is the minimal viable structure for sustained code maintainability. A single agent attempting both roles must hold implementation context and specification-steering context simultaneously, which degrades at the timescales long-term maintenance requires.

The practical rule: default to a single agent for bounded tasks. Adopt the Architect–Programmer topology when the task horizon exceeds what a single context window can sustain, or when specification quality is the primary bottleneck. See the Architect Pattern in the P2 extended guidance for operational detail.

Topology as a Runtime Concern

The topology choices above are presented as design-time decisions, and for most teams at Phase 3–4 they are. But the frontier is moving toward adaptive topology selection — systems that choose coordination patterns at runtime based on task characteristics, resource availability, and learned performance data. Indicators of this shift include: federation hubs that route work across heterogeneous agent pools, ephemeral workers that share persistent state rather than maintaining their own, and consensus-backed coordination that replaces static orchestrator hierarchies.

Teams should design their topology as a deliberate architectural choice today, but build the abstraction layer that allows the topology to change without rebuilding the system. The practical test: can you switch from hierarchy to mesh for a given task class without rewriting coordination logic? If not, the topology is hardcoded, and you will pay for that rigidity as the ecosystem matures.

Coordination Discipline

Parallelize exploration and analysis. Serialize decisions that change shared state. Coordination is never free: shared state must be typed, versioned, and reconciled. Contracts must be logged. Domain boundaries must prevent collisions. Without these, a swarm is a mob — agents duplicating work, producing conflicting diffs, or interpreting constraints inconsistently.

Principle 5 — Autonomy: Extended Guidance

See Principle 5 in the manifesto for the core statement and minimum bar.

Setting Tier Boundaries

The manifesto defines four tiers (Observe, Branch, Commit, Operate), but choosing where to draw the boundaries for your organization is the harder problem. Tier assignment should be driven by three factors:

Blast radius: What is the maximum credible impact if the agent acts incorrectly? Tier 1 (Observe) for actions with no production impact. Tier 2 (Branch) for actions contained to isolated environments. Tier 3 (Commit) only for production-impacting actions with verified rollback. Tier 4 (Operate) only within a machine-enforced policy envelope with passing control evaluations and active governance observability.
Reversibility: How quickly and completely can you undo a wrong action? Fast, clean rollback justifies higher autonomy. Irreversible actions (data deletion, external API calls, customer-facing communications) demand stricter gates regardless of blast radius.
Confidence maturity: How long has the agent been operating on this task class, and what is the historical error rate? New task types start at Tier 1 even if the blast radius would theoretically permit Tier 2. Promote only when evidence shows consistent correctness over a meaningful sample size.

In practice, start conservative. Most teams should default every new agent capability to Tier 1 and promote through evidence, not through optimism.

Runtime Tier Escalation

Agents sometimes discover mid-task that they need capabilities above their current tier. The protocol for tier escalation must be explicit:

The agent pauses execution and emits a structured escalation request: what action it needs, why, what evidence supports the request, and what the blast radius would be.
The system routes the request to the appropriate approver (automated policy check for Tier 1→2, human reviewer for Tier 2→3, governance board approval for Tier 3→4).
Approval is scoped and time-bounded — the agent receives temporary elevation for a specific action, not a blanket tier promotion.
The escalation, approval, and outcome are traced and auditable.

If tier escalation happens frequently for a given task class, that is a signal to reassess the tier assignment — either the task class belongs at a higher tier, or the specification needs refinement to keep the task within its current tier.

Long-Lived Agents

Long-lived agents are an exception that requires explicit justification, heartbeat monitoring, and drift controls. Tools are capabilities; audit tool access and grant least privilege. Make risky actions reversible or approval-gated.

The human role is to define the specification, set the tier, and own the outcome — not to supervise every intermediate step. But autonomy without governance is negligence. Calibrate the tier to the stakes.

Infrastructure-Level Tier Enforcement in Practice

Enterprise agent runtimes are demonstrating what infrastructure-level tier enforcement looks like at scale: declarative permission policies (typically YAML or equivalent), audit logs for every agent action, and guardrail constraints that the agent cannot override regardless of prompt instructions. This is the pattern the manifesto requires — enforcement at the infrastructure layer, not the prompt layer. Tiered autonomy is only meaningful when the infrastructure, not the agent, enforces the boundaries.

Auditing Tier Compliance

Tier boundaries are only meaningful if compliance is verified. Implement:

Runtime enforcement: The infrastructure (not the agent) blocks actions outside the agent's tier. An agent at Tier 1 physically cannot write to a production database, regardless of what its prompt says.
Compliance dashboards: Track tier violations, escalation frequency, and approval latency per domain. Rising violation rates signal either misconfigured tiers or inadequate specifications.
Periodic tier reviews: Quarterly review of tier assignments against incident data. Promote agents with strong track records; demote or constrain agents with elevated error rates.

Tier Assignment Decision Checklist

Before assigning a tier to a new agent capability — or promoting an existing capability to a higher tier — answer the following questions. Each "yes" to a risk question is a reason to stay conservative or require additional gates. This checklist is a decision aid, not a policy replacement; it does not substitute for domain-specific regulatory requirements.

Blast radius and reversibility

Could a wrong action affect production data, external parties, or safety-critical systems? → Default Tier 1 unless verified rollback exists.
Is the action irreversible within a one-hour window (data deletion, external API calls, customer-facing communications, financial transactions)? → Require Tier 1 or an explicit human approval gate at Tier 2.
Does the action cross a domain boundary (e.g., write to a system outside the agent's primary scope)? → Require explicit authorization, regardless of tier.

Confidence maturity

Has this agent operated on this exact task class for fewer than a calibration-minimum number of cycles with tracked outcomes? → Stay at Tier 1 until evidence accumulates. (Calibrate the minimum to domain: typically 20–50 cycles for low-blast-radius tasks; 100+ for production-impacting tasks.)
Has the agent's error rate on this task class been measured and is it within the threshold for the target tier? → If not measured, start at Tier 1.

Specification and governance readiness

Is the specification for this task class machine-readable with observable success criteria? → If no, Tier 1 regardless of blast radius. Tier escalation without a complete specification is not a risk decision — it is an unmanaged risk.
Is there an evaluation portfolio covering adversarial cases, not just happy-path behavior? → If no, do not promote beyond Tier 1.
Does the applicable domain set a regulatory floor (e.g., aviation DAL A/B, automotive ASIL C/D, medical device Class C, financial services SR 11-7 high-risk model)? → The regulatory floor overrides the blast-radius assessment; it cannot be overridden by team judgment.

Promotion and demotion rules

Promote one tier at a time, only after a consecutive-cycle window with zero incidents where the agent exceeded its authorized scope or caused undetected harm downstream (calibrate cycle count to domain; a reasonable starting default is 30 cycles for Tier 1→2, 60 cycles for Tier 2→3, and 90 cycles with validated governance infrastructure for Tier 3→4).
Demote immediately on any of: agent exceeded authorized scope; incident where blast radius exceeded predicted level; regulatory audit finding; specification drift detected; new task class introduced without fresh assessment.
Demotion is immediate; re-promotion requires a fresh checklist pass and a complete incident review.

Principle 6 — Knowledge & Memory: Extended Guidance

See Principle 6 in the manifesto for the core statement and minimum bar.

Memory Governance Properties — Operational Detail

The manifesto lists five governance properties. Here is what each means in practice:

Provenance: Every memory entry carries metadata: what event created it, which agent, what evidence supported it, when. Implementation: structured metadata fields on every entry in your memory store (vector DB, episodic store, or whatever layer holds learned memory). Without provenance, you cannot trace a bad decision back to a bad lesson.
Expiration: Learned memory decays. A routing preference learned during a model outage is wrong once the model recovers. A code pattern learned from a since-deprecated API is harmful. Implementation: TTLs on memory entries, calibrated by domain. High-volatility domains (model routing, API behavior) expire fast. Low-volatility domains (architectural patterns, security policies) expire slowly or never. Review expired entries before deletion — some should be promoted to knowledge; others should simply disappear.
Compression: Long-running agents accumulate memory faster than it can be consumed. Raw memory is noise; compressed memory is signal. Implementation: periodic consolidation passes that merge redundant entries, extract patterns from clusters of similar learnings, and discard entries that have been superseded. Think of it as garbage collection for learned context.
Rollback: When memory is poisoned — an agent learned something wrong from a bad incident, a corrupt retrieval shard, or a flawed evaluation — you need to undo the damage. Implementation: versioned memory snapshots (daily or per significant learning event), with the ability to revert a domain's learned memory to a known-good state. Test rollback before you need it. See Pattern C (Memory Poisoning Recovery) in the Worked Patterns.
Domain scoping: A lesson learned in the payments domain should not influence code generation in the notification service. Implementation: namespace or tag memory entries by domain, and enforce scope boundaries in retrieval queries. Cross-domain memory should be explicitly promoted, not implicitly leaked.

Emerging Memory Infrastructure

The memory infrastructure the manifesto calls for is beginning to materialize. Git-native agent memory systems demonstrate what governance-aware memory looks like in practice: provenance (every entry traceable to its source), rollback (versioned snapshots with merge-safe conflict resolution), and domain scoping (namespace isolation preventing cross-agent collisions in multi-branch workflows). Dependency-graph approaches validate the P7 claim that context must be engineered, not concatenated — tracking explicit task dependencies rather than relying on flat retrieval. Teams evaluating memory infrastructure should assess whether their chosen solution provides at minimum: provenance metadata, versioned snapshots, and scoped namespaces.

Beyond Retrieval: Persistent Agent Cognition

The manifesto frames memory governance in terms of retrieval infrastructure — provenance, expiration, rollback, scoping. This is necessary but no longer sufficient to describe the frontier. The emerging memory discipline includes three layers:

Retrieval memory — the layer the manifesto already covers well. Embedding stores, vector search, scoped retrieval with SLOs. This is the "better RAG" layer.
Skill memory — durable behavioral patterns agents acquire through experience, stored as reusable artifacts rather than retrieved context. An agent that has solved a class of problem before should carry forward not just the facts it retrieved but the approach that worked. Skill memory is closer to procedural knowledge than to information retrieval.
Causal and trajectory memory — the ability to store not just what happened but why it worked or failed, and to consolidate trajectories across tasks into generalizable reasoning patterns. This is learning in the operational sense: the agent's future behavior improves based on structured reflection over past behavior.

All three layers require the same governance properties (provenance, expiration, rollback, scoping). But they differ in what "poisoning" means and how rollback works. Reverting a bad embedding is straightforward. Reverting a bad learned skill is harder — the skill may have influenced downstream decisions that themselves became learned patterns. Teams building memory infrastructure should design for rollback at each layer independently.

The full operational specification for governing learned memory — what counts as adaptation, who may write to persistent memory and under what conditions, provenance requirements, retention and expiry policy, rollback mechanisms, and which behavioral changes trigger a revalidation cycle — is the Adaptation Envelope (Layer 4) of the behavioral envelope framework. See companion-re-framework.md, Section 4 (Behavioral Envelope, Layer 4) for the complete specification. Principle 6 names the governance properties; Layer 4 specifies what to actually write.

Recent agent-learning work sharpens this distinction further: reusable skills can function as an external learning substrate, allowing agents to improve by writing, selecting, and refining structured procedural artifacts rather than by updating model weights. This makes skill governance a first-class engineering concern. If a learned skill can change behavior across many future tasks, it should be treated as governed operational memory, not as an implementation detail hidden inside prompts.

This also changes the minimum governance question. It is no longer enough to ask whether a memory entry is traceable. Teams also need to ask:

Who may promote a learned behavior into a reusable skill?
What evidence is required before a skill is reused across domains?
How is skill rollback triggered and validated after an incident?
Which skills are experimental, local, approved, or forbidden?

Without these controls, a successful one-off workaround can silently become a portable failure mode.

The Knowledge-Memory Boundary in Practice

The manifesto defines the boundary by governance mechanism: knowledge changes through governed processes (PRs, ADRs); learned memory changes through feedback loops. In practice, entries migrate between the two:

Memory → Knowledge promotion: An agent repeatedly learns that a certain retry pattern works. After validation, this should be codified as an ADR or repository policy — promoted from heuristic to ground truth.
Knowledge → Memory demotion: A documented best practice stops holding under new conditions. Rather than immediately deleting the ADR, demote it to learned memory with an expiration, so the system can accumulate evidence for or against the change before formalizing it.

The migration process itself needs governance. Unreviewed promotions pollute your knowledge base. Unreviewed demotions erode architectural standards.

Memory Governance at Machine Scale

The governance properties described above (provenance, expiration, compression, rollback, domain scoping) are necessary but not sufficient at production volume. A single agent executing 100 tasks per hour generates 100 memory entries per hour. Human curators can meaningfully review 10-20 entries per hour — an immediate 5-10x backlog. At this scale, reactive curation (diagnose regression, identify poisoned entry, rollback) is a post-mortem methodology, not a governance strategy. Proactive detection is required.

Implement these four mechanisms before agents generate significant memory volume:

1. Retrieval canaries (continuous). For each memory shard serving a production domain, define one known-good query with an expected result. Run it on every retrieval cycle. If retrieved results deviate from expected, isolate the shard immediately and alert. This catches poisoning before agents act on bad context. Pattern C in companion-patterns.md shows this as a recovery step — it should be a permanent fixture, not a post-incident addition.

2. Consistency check on write. When a new memory entry contradicts an existing entry in the same domain, flag both for resolution before the new entry is propagated. Do not silently overwrite. The contradiction is signal — either the new lesson is wrong, the old lesson is stale, or both need re-examination.

3. Structured memory entry schema. Require all memory entries to carry:

lesson: what was learned (one sentence)
rationale: why this is believed to be true
confidence: 1-5 (1 = tentative observation, 5 = validated across many cases)
domain_scope: which domain(s) this applies to
expires_at: ISO 8601 datetime (see defaults below)
provenance: trace ID of the event that generated this entry

Agents cannot store memory without these fields. Entries without valid schema are rejected at the memory layer, not silently dropped.

4. Default TTL policy by volatility.

Domain type	Default TTL	Rationale
Model routing preferences	7 days	Provider behavior changes frequently
Transient operational learnings	7 days	Short-lived context (incidents, deployments)
API behavior and integration patterns	30 days	APIs change on release cycles
Architectural patterns (project-specific)	90 days	Reviewed at quarterly retro
Security policies and constraints	Never auto-expire	Human review required for any change
Compliance-relevant learnings	Never auto-expire	Regulatory retention requirements apply

Expired entries are not deleted automatically — they enter a review queue. A domain expert validates or discards them monthly. Target: 5% of active entries reviewed per month (manageable volume, full corpus covered in 20 months). Low validation rate triggers memory system remediation.

When memory governance fails at scale, the tell is a sudden degradation in evaluation metrics for a specific domain without a corresponding code change. The recovery path is Pattern C (Memory Poisoning Recovery). The prevention path is these four mechanisms deployed before the volume problem appears.

Memory Governance in Regulated Environments

The governance properties described above (provenance, expiration, compression, rollback, domain scoping) are necessary everywhere and insufficient in regulated environments. Data classification adds a layer of constraints on what agents may accumulate, retain, and retrieve.

What regulated environments add to memory governance:

Domain	Memory Retention Constraint	Retrieval Constraint	Key Regulatory Basis
Financial services	Customer PII must not persist in agent memory beyond the session unless a DPA is in place. Banking secrecy jurisdictions may prohibit retention entirely.	External LLM retrievals must not send Confidential/Restricted financial data to unclassified endpoints.	GDPR Art. 5 (data minimisation); DORA third-party risk
Medical devices / pharma	Patient-level data must not persist in learned memory. GxP operational data retention follows the applicable retention schedule, not agent TTL.	GxP raw data must never be retrieved into an agent context that has write access to production records.	HIPAA §164.528; GDPR Art. 5; GxP data integrity
Aviation	ITAR/EAR-controlled technical data retained in agent memory constitutes a controlled export if transmitted to a non-compliant endpoint.	Retrieval from ITAR-controlled knowledge stores must operate within a Technology Control Plan.	ITAR 22 CFR 120-130; EAR 15 CFR 730-774
Defense / government	CUI (Controlled Unclassified Information) must not persist in any memory store without appropriate classification handling. Classified information must not enter agent systems at all.	Retrieval must be restricted to approved, accredited environments.	CMMC 2.0; NIST SP 800-171; 32 CFR Part 2002

The practical rule: In regulated environments, learned memory is a data store subject to the same classification, retention, and access controls as any other system data. The manifesto's memory governance properties (provenance, expiration, rollback, scoping) are the mechanism; the applicable data regulation determines the thresholds. A GDPR data minimisation obligation, for instance, means the TTL default for customer-identifiable learnings is "session only" — not 30 days.

Audit trail for memory changes. In regulated contexts, the memory governance operations themselves (write, expire, rollback) must be logged. The standard memory entry schema fields (provenance, expires_at, domain_scope) are the minimum; add classification and retention_basis fields for regulated memory stores to make the audit trail complete.

See the domain documents for domain-specific memory classification requirements: financial-services.md · pharma.md · medical-devices.md · aviation.md

Principle 7 — Context: Extended Guidance

See Principle 7 in the manifesto for the core statement and minimum bar.

Retrieval SLOs

Define tiered SLO guidance by architecture class for context retrieval and decision latency. Not every retrieval path needs the same latency target:

Local retrieval (file system, in-process cache): < 100ms. This is the baseline for interactive agent loops where the developer is waiting.
Remote retrieval (vector DB, API-backed knowledge base): < 500ms with a relevance threshold. If retrieval takes longer, the agent should proceed with available context and flag the gap rather than block.
Hybrid + rerank (remote retrieval with a reranking model): < 1s end-to- end. The reranking step improves precision but adds latency; set a hard ceiling and degrade gracefully if exceeded.
Regulated logging (audit-required retrieval in compliance environments): latency is secondary to completeness and provenance. Log every retrieval with source, relevance score, and timestamp.

When retrieval SLOs are breached, alert and degrade — do not silently return stale or irrelevant context. An agent that reasons from bad context produces confidently wrong output.

Context Budgeting

Context windows are finite and reasoning quality degrades as low-signal context accumulates. This is not a theoretical concern — it is the most common root cause of agent quality degradation in long-running tasks. Engineer explicit context budgeting:

Hierarchical retrieval: Retrieve summaries first, then pull detailed context only for the sections the agent identifies as relevant. This avoids filling the window with potentially irrelevant detail.
Rolling summaries: For multi-step tasks, compress completed steps into structured summaries before starting the next step. The summary should capture decisions and outcomes, not raw content.
State compaction: Periodically replace accumulated context with a compact representation of current state. The compacted state is the new starting point; the raw history is available in traces for debugging but does not consume the active context window.
Authority-weighted pruning: When the context budget is exhausted, discard low-authority context first (heuristic suggestions, old memory entries) and preserve high-authority context (specifications, constraints, evaluation results).

A worked example: an agent tasked with refactoring a module across 15 files hits the context limit at file 8. Without budgeting, it either hallucinates the remaining files or produces inconsistent changes. With rolling summaries, it carries a compact summary of decisions made for files 1-7 and retrieves fresh context for files 8-15.

Context Poisoning

Context poisoning is distinct from memory poisoning (Principle 6) — it occurs when the retrieval system returns contextually appropriate but factually wrong or outdated content within a single task. Memory poisoning is a persistent corruption; context poisoning can happen on any retrieval call.

Common sources: stale index entries that survived re-indexing, retrieved content from a deprecated branch that was never cleaned up, code examples from a library version that no longer matches the project's dependencies.

Detection: monitor for sudden quality drops in agent output that correlate with specific retrieval sources. Track retrieval source freshness (time since last validation) and alert when agents consume context older than a configurable threshold.

Mitigation: retrieval canaries (known-good queries with expected results, run on every retrieval cycle), source freshness metadata in every retrieval response, and a circuit breaker that falls back to specification-only context when retrieval confidence drops below threshold.

Self-Improving Knowledge Bases

Codify "never do X here" as machine-enforced guidance: repository policies, architectural constraints, ADR rules, lints, CI gates. Make the knowledge base self-improving: let retrieval quality metrics feed back into indexing and curation, so the system gets more precise over time rather than more cluttered.

The feedback loop: track which retrieved contexts led to successful agent outcomes (evidence bundle accepted, evaluations passed) and which led to failures. Over time, demote or remove context sources that consistently correlate with poor outcomes. This is garbage collection for your knowledge base, driven by evidence rather than manual curation.

Cross-Iteration Learning and CI Context

A specific and increasingly important case of context budgeting is learning across CI iterations — where each iteration generates new evidence about the consequences of previous decisions. In a CI loop spanning dozens of iterations (the SWE-CI benchmark averages 71 commits per task), the agent must carry forward not just what changed, but what effect each change had on subsequent iterations.

This is distinct from single-task context budgeting because the evidence compounds: iteration 15 generates information about decisions made in iterations 3, 7, and 12. The context that matters is not "what happened last" but "which earlier decisions are causing current problems."

Practical approaches for cross-iteration context:

Decision-consequence summaries: After each iteration, compress the results into a structured summary that links decisions to outcomes. "Changed the retry logic in iteration 5; iteration 9 test failures trace to that change." These summaries are the rolling context for subsequent iterations.
Regression attribution: When a regression appears, trace it to the iteration that introduced the structural cause — not just the iteration that triggered the test failure. This requires structured tracing across iterations, not just within them.
Evolvability signals: Track whether each iteration's decisions made the next iteration easier or harder. The SWE-CI benchmark's EvoScore metric (arXiv:2603.03823) measures this explicitly: agents whose early decisions facilitate subsequent evolution score higher. Teams can approximate this by tracking iteration-over-iteration test pass rates and regression frequency.

Cross-iteration context management is the primary capability differentiator for long-running agent pipelines. Without it, agents repeat mistakes, fail to learn from structural consequences, and accumulate technical debt that traditional single-iteration metrics miss.

Tooling Maturity and Adoption

The context engineering standard described here exceeds what most teams can build today. The tooling ecosystem is maturing rapidly — open protocols for tool connectivity, structured capability definitions, and version-aware memory layers now exist — though production-grade governance tooling remains nascent. Adopt incrementally: start by measuring retrieval quality (relevance, latency, staleness), then add context budgeting for long-running tasks, then tiered SLOs as scale demands. The principle describes the engineering standard; the adoption path acknowledges the gap.

The Emerging Agent Stack

Recent frontier-lab writing is converging on a useful systems frame: the agent is not just a model with a prompt. The operational stack increasingly looks like:

Model — the reasoning engine
Context layer — retrieval, summaries, memory, and task framing
Harness — execution loop, tool orchestration, constraints, checkpoints, and cleanup
Tools / APIs — the external actions available to the agent
Environment / runtime — the bounded execution context, permissions, traces, and operational controls

This is mostly a vocabulary clarification, not a new principle. The manifesto's contribution is that it provides the governance model across this stack. P7 governs the context layer directly, but its quality depends on the harness that selects and compacts context, the tools that retrieve it, and the runtime that preserves or constrains state across sessions. In practice, treating "context engineering" as a standalone discipline without connecting it to the harness and runtime is how teams end up with excellent retrieval feeding poorly-governed execution loops.

As of early 2026, four open interface patterns are crystallizing around this stack:

Tool connectivity protocols — typed schemas, capability discovery, authorization, and structured tool invocation at the tools/APIs layer.
Agent coordination protocols — agent discovery, task lifecycle management, and cross-runtime delegation at the coordination layer.
Capability definition artifacts — reusable, reviewable descriptions of domain procedures, constraints, and operational skills at the harness layer.
Repository-level instruction artifacts — machine-readable project constraints and local conventions at the environment layer.

The manifesto's governance model — tiers, traces, accountability, evaluations — sits across all four. No single protocol provides governance; the manifesto's principles provide the governance framework that connects them.

Principle 8 — Evaluations & Proofs: Extended Guidance

See Principle 8 in the manifesto for the core statement and minimum bar.

Assurance Disciplines

As autonomy and module count grow, assurance must move across distinct practices with different cost curves:

Evaluations and tests for dynamic, example-based validation.
Formal contracts + proofs for mathematically checking module properties.
Model checking for state-space behavior (especially concurrency and protocol invariants).

These are separate disciplines. Use them intentionally: tests by default, formal methods first on critical paths and high-blast-radius components, then expand coverage where incident data and economics justify it.

The "proofs are a scale strategy" claim is now operationally achievable, not just theoretically sound. Executable specification languages allow teams to write specifications that are simultaneously human-readable documentation, testable assertions, and inputs to model checkers — collapsing the gap between "we wrote a spec" and "we proved a property." Model-based testing workflows can generate test suites directly from executable specifications, connecting formal models to CI pipelines without requiring teams to become proof engineers. The practical entry point is not theorem proving but executable specs on one critical path — the same scope recommended in the adoption playbook's formal contracts step.

LLM-as-Judge Risk

When models judge model-generated outputs, evaluator and producer can share blind spots. Mitigate LLM-as-judge risk with deterministic anchors, diverse judge models, periodic human-calibrated gold sets, and disagreement tracking between judges and production outcomes.

Evaluation Theater

Beware evaluation theater: evals that pass but do not test what matters. If evaluations do not cover edge cases, adversarial inputs, and behavioral regressions, they are measuring comfort, not correctness. When evaluation metrics become optimization targets rather than measures of quality, the system games the metric and drifts from the goal.

Detecting evaluation theater. Evaluation theater is recognizable by the gap between evaluation metrics and production outcomes. Watch for these signals:

Evaluation pass rates near 100% while escaped defect rates or user-reported issues remain elevated — the evaluation suite is not covering the failure modes that matter.
Adversarial inputs outside the evaluation distribution produce failures the suite never triggered — the evaluation distribution is too narrow.
Evaluation coverage grows (more tests, higher numbers) without growing the distribution of tested conditions — the same scenarios run repeatedly with minor variations, providing false coverage confidence.
Incident classes not covered by the current suite recur after remediation — the suite did not capture the failure mode, so the same issue reappears.

The primary structural defense is evaluation holdout (see below): scenarios the agent has never seen and cannot overfit to. Without holdout, high eval pass rates are consistent with both genuine quality and evaluation theater. The measurement mechanism for "evaluation theater detection rate" (listed as a Phase 5→6 metric) is therefore: track the fraction of production incidents that were not predicted by any evaluation failure in the preceding cycle.

Advanced bar: include adversarial cases for externally exposed or high-blast-radius systems. For model-judged evaluations, calibrate against human-labeled samples on a defined cadence.

Evaluation Holdout and the Gaming Problem

If agents can see the evaluation criteria during development, they can overfit to them — producing output that passes the specific tests while missing the intent behind them. This is the evaluation equivalent of teaching to the test.

The fix borrows from machine learning: evaluation holdout. Behavioral scenarios — specifications of what the software should do in realistic end-to-end conditions — are stored separately from the development context. The agent builds software without access to the evaluation criteria. The scenarios evaluate whether the output works. Because the agent never sees the evaluation criteria, it cannot game them.

This pattern is already in production. StrongDM's software factory uses holdout behavioral scenarios as the primary evaluation mechanism, with agents that implement against specifications and are evaluated against criteria they cannot see. The result is evaluation that tests intent, not just compliance.

When to use holdout evaluation: For any system where agents iterate autonomously (Phase 4+), especially when evaluation metrics show suspiciously high pass rates that do not correlate with production quality. Holdout evaluation is more expensive to maintain (two separate artifact sets: development specs and evaluation scenarios) but eliminates the most insidious form of evaluation theater — evaluations that pass because the agent learned the answers, not because it solved the problem.

Champion-Challenger Testing in Regulated Contexts

Champion-challenger testing compares agent system performance against an incumbent approach — the current model, the prior system version, or the clinical/operational standard of care. This is a cross-domain regulatory expectation, not a financial-services-specific concept:

Financial services (SR 11-7): Requires comparing agent outputs against alternative approaches or incumbent models. Statistical methodology for handling output variability (non-deterministic agents) is an open regulatory question; conservative approach is to run champion-challenger on a held-out sample with human adjudication of disagreements.
Medical devices: FDA GMLP and ISO/TS 24971-2 expect performance comparison against predicates (prior cleared devices) or the clinical standard of care. The manifesto's evaluation portfolio (P8) is the infrastructure for this comparison — extend evaluation suites with predicate-device test cases.
Pharma: CSA expects assurance that a new system performs at least as well as the system it replaces. Run champion-challenger during PQ by executing parallel workflows and comparing outputs. Evidence bundle includes disagreement analysis and resolution rationale.
Aviation: No direct champion-challenger requirement, but DO-178C requires that verification objectives are satisfied. For agent-assisted workflows replacing manual activities, demonstrate that the agent-assisted approach produces equivalent or better coverage per Table A objectives.

The non-determinism problem. Traditional champion-challenger assumes identical inputs produce comparable outputs. Agents are non-deterministic. Practical mitigation: run multiple agent invocations per input (N=3-5); use the majority-vote or highest-confidence output as the champion response; compare the distribution of champion responses against the incumbent. Statistical confidence intervals, not point comparisons, are the evidence.

Independent Verification in Regulated Contexts

Regulated industries share a common governance requirement: the party that verifies a system must be organizationally independent from the party that built it. SR 11-7 (financial services) requires independent model validation. IEC 62304 (medical devices) requires verification by qualified parties distinct from developers. DO-178C (aviation) requires independence at each design assurance level.

In agentic engineering, this principle extends to agent-generated output: the evaluation infrastructure that verifies agent work should be independent of the agent that produced it. Concretely:

Evaluation criteria should not be visible to the producing agent (evaluation holdout, described above)
Evaluation models should differ from production models where feasible (avoid shared blind spots — see P1 correlated failure domains)
For Tier 3 operations in regulated environments, organizational independence between agent development and agent validation should mirror existing regulatory expectations

This is not a new principle — it is a regulated-environment application of the existing evaluation-as-contract pattern. See companion-frameworks.md for the cross-domain analysis and domains/ for domain-specific independence requirements.

Fairness and Bias Testing in High-Risk AI

EU AI Act Article 10 requires that training, validation, and testing datasets for high-risk AI systems are "free of errors and complete" and that they account for "characteristics or elements that are particular to the specific geographical, behavioural or functional setting." In practice, this mandates bias testing as part of the evaluation portfolio for any high-risk AI system.

This is a cross-domain obligation, not a financial-services-specific one:

Financial services: Explicit fairness testing against protected classes under ECOA, FHA, and FCA Consumer Duty. Evaluation suites must include demographic parity and disparate impact analysis.
Medical devices: Clinical AI systems must demonstrate equivalent performance across demographic subgroups (age, sex, ethnicity). ISO/TS 24971-2 explicitly addresses this. Evaluation portfolios for Class B/C SaMD must include subgroup performance analysis.
Pharma: ICH E8(R1) requires that clinical trial populations are representative of the intended treatment population. AI systems used in patient selection or stratification must be tested for demographic bias.
Automotive / industrial: AI systems in driver monitoring or operator safety systems must demonstrate consistent performance across demographic characteristics that could influence detection accuracy.

Minimum evaluation bar for high-risk AI systems: Include at least one explicit fairness evaluation category alongside behavioral regression and adversarial cases. Fairness evaluation should specify: (1) which subgroup characteristics are tested, (2) which performance disparity metric is used (demographic parity, equalized odds, etc.), (3) the maximum acceptable disparity, and (4) who owns the determination that the disparity is acceptable. The last item is a human judgment — not an evaluation output.

Workflow-Level Evaluation Enforcement

The evaluation-as-contract pattern extends beyond test suites into the development workflow itself. Workflow-level skill frameworks now enforce strict red-green-refactor TDD: if an agent writes implementation code before a failing test exists, the framework deletes the code and forces a restart. Design-first, plan-first, and test-first phases are mandatory, not suggested. This is evaluation-as-contract applied to the development process rather than the runtime — and it demonstrates that P8's principle operates at multiple layers, from CI pipelines to agent harness constraints.

Boolean vs. Probabilistic Evaluation

The manifesto's current evaluation model is largely boolean: tests pass or fail, regression cases are covered or not, evidence bundles are complete or incomplete. This framing is necessary for minimum bars but insufficient for mature agentic systems.

At Phase 5 and above, consider probabilistic satisfaction: of all observed execution trajectories through all behavioral scenarios, what fraction actually satisfies the specification? This replaces "did it pass?" with "how reliably does it pass, across how many conditions?"

The shift matters because agentic systems are inherently probabilistic. A boolean "pass" on ten test cases tells you the agent produced correct output ten times. It tells you nothing about the eleventh case, the hundredth case, or the distribution of partial failures. Probabilistic satisfaction metrics — drawn from scenario-based evaluation at volume — give a confidence distribution rather than a binary verdict.

Practical adoption: Start boolean (Phase 3-4). Add scenario coverage and pass-rate distributions as the evaluation portfolio matures (Phase 4-5). Treat probabilistic satisfaction as the target metric for fully autonomous pipelines where human review is sampled rather than comprehensive.

Behavioral Regression vs. Structural Regression

The manifesto's minimum bar for evaluations states: "If evaluations do not include regression cases, they are insufficient." In practice, there are two distinct categories of regression, and most teams only test for one.

Behavioral regression is what traditional regression testing catches: a change breaks existing functionality. The tests that passed before now fail. This is well-understood and well-tooled.

Structural regression is subtler and more dangerous: a change passes all current tests but degrades the codebase's capacity for future change. The code is locally correct but globally harmful — naming conventions that create confusion across iterations, architectural choices that increase coupling, dependency structures that make the next change harder. Structural regression does not fail any test today; it fails the test that you will need to write tomorrow.

The SWE-CI benchmark (arXiv:2603.03823) provides the first empirical evidence for this distinction. Across 100 tasks spanning an average of 233 days of development history, most agents achieve a zero-regression rate below 0.25 — meaning in over 75% of CI iterations, agents introduce at least one regression. Many of these regressions are structural: the agent's decisions in early iterations create friction that compounds across subsequent iterations. The benchmark's EvoScore metric captures this by measuring functional correctness on future modifications — not just current tests.

Detecting structural regression:

Evolution-weighted metrics: Track not just whether today's tests pass, but whether each change makes the next change easier or harder. EvoScore is one formalization; a simpler proxy is iteration-over-iteration regression frequency.
Coupling analysis: Monitor dependency graphs, import structures, and module boundaries across iterations. Rising coupling without corresponding functionality is a structural regression signal.
Specification convergence: If specifications become harder to express precisely over time, the codebase's structure is degrading even if tests pass. The manifesto's convergence criteria (P2) apply here: diverging specifications are a symptom of structural regression.

The implication for evaluation portfolios: Teams at Phase 4 and above should include structural regression indicators alongside behavioral regression tests. This does not require formal verification — it requires tracking the trajectory of code quality across iterations, not just the state of code quality at each iteration.

Benchmark Instability and Contamination Risk

Benchmarks are necessary and insufficient. As public agent benchmarks mature, they are increasingly affected by contamination, target leakage, and adaptation to the benchmark rather than to the underlying engineering problem. Treat benchmark gains as directional evidence, not as durable truth about production readiness.

Three practical rules follow:

Prefer mutation and refresh over static leaderboard worship. If a benchmark remains unchanged for long enough, the ecosystem will optimize for it directly.
Maintain private holdouts. Public benchmarks are useful for comparability; private evaluations are necessary for real assurance.
Test transfer, not just score. A claimed improvement matters only if it carries over to your stack, constraints, and failure modes.

The manifesto's position is intentionally conservative: external benchmarks help calibrate ambition, but promotion between maturity phases should be based on the evidence your own system can produce under your own operating conditions.

See also Verification without validation in the Failure Modes section, which describes the related but distinct case where verification machinery confirms correctness without confirming value.

Principle 9 — Observability & Interoperability: Extended Guidance

See Principle 9 in the manifesto for the core statement and minimum bar.

What a Trace Must Contain

A trace is not a log line. A complete agentic trace captures:

Specification received: What was the agent asked to do? The versioned specification or task decomposition that initiated the work.
Decision chain: What options did the agent consider, what did it select, and what reasoning or scoring drove the selection? For multi-step tasks, the chain must show each decision point, not just the final output.
Tool calls and responses: Every external tool invocation — API calls, file operations, retrieval queries — with inputs, outputs, and latency.
Memory retrievals: What context was retrieved, from which store, with what relevance scores? This is critical for diagnosing retrieval-driven hallucinations.
Evaluation results: Which evaluations ran, what passed, what failed, what was the delta from previous runs?
Policy checks: Which constraints were checked, which passed, which triggered violations or near-misses?
Cost accounting: Tokens consumed, model used, inference latency, total cost of this task.

The trace must be structured, not free-text. Structured traces can be queried, aggregated, and replayed. Free-text logs require human interpretation at every step.

Near-Real-Time Drift Detection

Observability is incomplete if it only reconstructs the past. For production agentic systems, you also need near-real-time detection of constraint violations, behavioral drift, and anomalous patterns:

Constraint violation alerts: Immediate notification when an agent attempts or completes an action outside its tier or domain boundary.
Behavioral anomaly detection: Statistical monitoring of agent outputs over time. A sudden shift in code style, error rate, or tool usage pattern may indicate context poisoning, model degradation, or specification drift.
Cost anomaly alerts: A task that normally costs $0.50 suddenly costing $15 signals a reasoning loop, retry storm, or routing failure.

The goal is not to alert on everything but to detect when the system has left its expected operating envelope before the damage compounds.

Interoperability Requirements

Interoperability requires typed schemas, explicit auth boundaries, versioned capabilities, and replayable tool logs. Treat adapters as temporary bridges, not architecture. The goal is replaceable components, not locked pipelines.

The emerging open-protocol stack now covers both interoperability axes the manifesto requires: how agents connect to tools, and how agents coordinate with other agents. Recent protocol revisions added stronger authorization models, structured capability metadata, safer transport patterns, and more durable task lifecycle support. These developments matter because they move interoperability from vendor-specific SDK behavior toward inspectable contracts that can be governed, audited, and replaced.

Interoperability minimum bar: If tools cannot be swapped or replayed across runtimes without rewriting core workflows, the platform is brittle.

Principle 10 — Emergence & Containment: Extended Guidance

See Principle 10 in the manifesto for the core statement and minimum bar.

Chaos Practice

Practice chaos: test with tool outages, noisy retrieval, adversarial inputs, partial memory corruption, reordered swarms, and model degradation — before reality does. Offline tests are insufficient for systems that operate autonomously in the wild. Enforce invariants at runtime with policy checks, monitors, and automated intervention.

Chaos testing for agentic systems requires its own safety model:

Steady-state hypothesis: define expected behavior before injecting faults, so you can detect when the system has left its safe operating envelope.
Blast-radius controls: isolate chaos experiments to scoped environments, shadow traffic, or canary populations — never inject faults into the full production agent population.
Automated abort conditions: if the system breaches predefined thresholds (error rate, latency, cost spike), halt the experiment and roll back automatically.
Graduated severity: start with single-fault injection (one tool outage), then compound faults only after single-fault resilience is proven.

Threat Modeling

Threat modeling must explicitly include:

Prompt injection and jailbreak propagation across agent chains
Memory/context poisoning and supply-chain contamination
Agent impersonation and forged role assertions in swarm coordination
Data exfiltration through tool permissions and connector abuse

Defense-in-depth means identity for agents and tools, signed provenance for shared state, least-privilege tool scopes, egress controls, and continuous anomaly detection for cross-agent trust edges.

Real-World Containment Failures

The OpenClaw ecosystem (2025-2026) provides instructive case studies. OpenClaw itself — an open-source autonomous agent with 247K GitHub stars — demonstrated how rapidly agentic systems scale when governance is absent. The Moltbook incident (February 2026) exposed 1.5 million registered agents (only 17,000 human owners) through a misconfigured Supabase database with full read/write access. The failure hit every threat category above: no identity controls, no domain scoping, no blast-radius limits, no audit trail.

NVIDIA's response — NemoClaw (GTC 2026) — is an enterprise-hardened fork that adds YAML-based permission policies, audit logging, and guardrail constraints. This is containment engineering in practice: the same agent runtime, now with the governance layer the manifesto requires. The pattern validates the core P10 claim: emergence is not a feature to celebrate but a hazard to engineer around. Systems that scale without containment infrastructure will produce incidents at scale.

Principle 11 — Economics: Extended Guidance

See Principle 11 in the manifesto for the core statement and minimum bar.

Intelligent Routing

Intelligent routing — selecting the right model, the right agent topology, and the right resource tier for each task — extends effective capacity by multiples while maintaining quality. This "economics-aware routing" must consider not just token cost, but correlation cost (avoiding a single point of epistemic failure by using diverse models and independent tool chains).

Total Cost of Correctness

Inference cost and assurance cost are coupled, not independent knobs. Cheaper models may require stronger verification, more retries, or tighter approvals.

The full cost model includes:

Inference cost: tokens, compute, API fees.
Verification cost: evaluation runs, proof checking, canary deployments.
Governance overhead: human review time per tier, approval latency, policy maintenance.
Incident remediation: rollback, diagnosis, constraint updates, re-verification.
Opportunity cost: delayed decisions from approval queues or routing latency.
Context-switching cost: debugging heterogeneous failure modes across models and vendors.

Optimize total cost of correctness, not inference cost alone. When governance overhead exceeds the value of the work, reduce governance complexity rather than adding more layers.

Multi-Model Risk

Multi-model and multi-vendor swarms introduce heterogeneous failure and policy risk. Model errors are often correlated through shared dependencies, similar training artifacts, or vendor-side incidents. Routing policies must include failure-domain isolation, cross-model canary checks, and explicit data handling boundaries per provider.

Resilience Measures

To mitigate systemic fragility, extend resilience measures across the stack:

Diversity routing (different models/judges) to reduce correlated hallucinations.
Retrieval canaries across independent indexes.
Tool redundancy plans for rate limits/outages.

This is the "organism avoiding monoculture collapse."

Advanced bar: route by expected total cost of correctness, not token price.

Total Cost of Correctness — Decision Framework

The manifesto defines the formula conceptually. Here is how to use it for routing decisions.

The formula:

Total Cost of Correctness =
  (Inference cost per task × Task count)
  + (Verification cost per task × Task count)
  + (Governance overhead per task × Task count)
  + (Expected remediation cost per failure × Failure rate)
  + (Opportunity cost of latency)

Worked example: generating integration tests for a new API endpoint

Model tier	Inference cost	Expected pass rate	Rework cost on failure	Total cost of correctness
Fast/cheap model	$0.04	85% (3 failures of 20)	$0.50/failure = $1.50	$1.54
Balanced model	$0.08	95% (1 failure)	$0.50/failure = $0.50	$0.58
High-capability model	$0.20	99% (0.2 failures)	$0.50/failure = $0.10	$0.30

Naive cost optimization picks the fast model. Total-cost optimization picks the high-capability model. The fast model's lower failure rate in simple cases matters less than the higher-capability model's reliability on edge cases.

Routing decision record. For each routed task, capture:

task_type: [description]
estimated_complexity: [1-10]
model_selected: [model name/tier]
rationale: [why this model for this complexity]
actual_outcome: [pass / fail / rework]
actual_cost: [inference + verification + remediation]

Feed these records into a FinOps dashboard quarterly. Within three months of operation, you will have an empirical cost model that makes routing decisions data-driven rather than intuition-driven. The goal is not the cheapest model — it is the model with the lowest total cost of correctness for that task class.

DORA concentration risk note. In regulated financial services, model routing is not only an economics decision — it is a DORA third-party risk control. Routing policies must include: failure-domain isolation (ensure no single provider failure disables all tasks), cross-model canary checks, and documented exit procedures if a provider becomes unavailable. Multi-model routing should be documented in the DORA third-party risk register.

Principle 12 — Accountability: Extended Guidance

See Principle 12 in the manifesto for the core statement and minimum bar.

Domain-Scoped Ownership

At scale, ownership is domain-scoped, not change-scoped. A named human owns the risk policy, approval thresholds, and incident response for a bounded domain; the system enforces those policies per change. Human review must focus on exceptions, high-risk deltas, and statistically valid sampling, not every low-risk action.

The Accountability Paradox

The manifesto states: "Agents execute; humans own outcomes, risks, approvals, and incidents. No agent — however capable — absorbs legal, ethical, or operational responsibility." This is the manifesto's strongest claim about the human role. It is also the claim most certain to break under scale.

If your agents process thousands of actions per day, human review of every action is not just impractical — it is impossible. A domain owner who "approves" 200 changes per day is not governing; they are rubber-stamping. The manifesto's accountability model, applied literally at volume, collapses into control theater (see Failure Modes).

This is not a minor gap. It is the central tension of the entire manifesto: the principles require human accountability, and the economics of agentic systems at scale make comprehensive human accountability impossible.

How to Navigate the Paradox

The manifesto does not resolve this tension — it provides the tools to manage it. The resolution is not "remove the human" or "review everything." It is a phase-calibrated layering of accountability mechanisms:

At Tier 1 (Observe): Agents can only analyze and propose. Human accountability is inherent because no action reaches production without human execution. This is fully compatible with the manifesto at any volume.

At Tier 2 (Branch): Agents write to isolated environments. Accountability shifts from reviewing every action to designing the constraints that bound agent behavior and the evaluations that verify output. The human owns the constraint design and the evaluation portfolio, not every individual diff. When an escaped defect occurs, accountability traces to which constraint or evaluation was missing — not which reviewer missed which line.

At Tier 3 (Commit): Agents take production-impacting actions. This is where the tension is sharpest. The practical approach: automated policy enforcement handles routine checks at machine speed; human review focuses on exceptions, high-risk deltas, and statistically valid sampling. The human is accountable for the policy, the sampling strategy, and the incident response — not for having personally inspected every action.

In all tiers, build recursive feedback mechanisms: systems evaluate their own errors, feed failures back into context, and self-correct or automatically roll back. This is not replacing human accountability — it is extending the human's reach through system design.

The Level 5 Challenge: No Human Writes or Reviews Code

The sharpest version of the accountability challenge comes from teams already operating at what practitioners call "Level 5" or "dark factory" mode: specifications go in, working software comes out, no human writes or reviews code. StrongDM's software factory is the most documented example — three engineers, no code writing, no code review. Humans write specifications and evaluate outcomes. Machines do everything in between.

This sounds like it contradicts the manifesto's accountability model. It does not — but it forces the model to its logical conclusion. In a Level 5 system:

Accountability shifts from reviewing code to designing constraints. The human owns the specification quality, the evaluation portfolio (including holdout scenarios the agent cannot see), and the incident response policy. They do not own every line of code — they own the system that produces and verifies the code.
Evaluation replaces review. Instead of reading diffs, humans evaluate outcomes against behavioral scenarios, probabilistic satisfaction metrics, and business impact measures. The evaluation infrastructure is the review process — it just runs at machine speed rather than human speed.
The accountability surface changes, not the accountability principle. A human is still accountable for production behavior. But "accountable" means "designed the constraints, approved the evaluation portfolio, and owns the incident response" — not "read every line of code."

This is consistent with the manifesto's Tier 3 governance at scale: automated policy enforcement handles routine verification, human review focuses on exceptions and high-risk deltas, and accountability traces to constraint design rather than individual code inspection. Level 5 is what Tier 3 governance looks like when the constraints, evaluations, and evidence infrastructure are mature enough to replace line-by-line review entirely.

The manifesto does not prescribe Level 5 as a target. Most teams are not ready for it — and the perception gap is real: a 2025 study reported that experienced developers using AI tools took 19% longer to complete tasks while believing AI made them 24% faster. Teams that believe they are operating at Level 4 or 5 are often stuck at Level 2, confusing tool adoption with workflow transformation. The maturity spectrum (Phase 1-6) and the evidence requirements at each phase exist precisely to prevent this self-assessment inflation.

The Open Problem

This layered approach is mitigation, not resolution. Oversight saturation at scale remains an open problem: systems can outgrow meaningful human governance bandwidth faster than governance practices mature. This is not a caveat buried in extended guidance — it is a load-bearing limitation of the entire manifesto.

The twelve principles are designed to remain useful at any scale, but the governance model that binds them (human accountability for production outcomes) is bounded by human bandwidth. As agentic systems scale toward Phase 6 (adaptive, self-improving), the fraction of system behavior that any human can meaningfully review approaches zero. The manifesto's answer — governance through constraints, evaluations, and evidence rather than through direct oversight — delays this limit but does not eliminate it.

Treat this as the manifesto's most important active frontier. If your engineers spend all day reviewing agent trace logs, you have replaced coding with babysitting and the governance model is already failing. If they review nothing, accountability is fictional. The correct position is somewhere between, defined by the quality of your constraints, evaluations, and feedback loops — and it must be re-evaluated as the system grows.

Governance as Practice — The Domain Owner's Routine

The manifesto describes governance structure: named owners, defined tiers, evidence bundles, approval gates. Structure is necessary but not sufficient. A team can have all structural components in place and still have non-functional governance: domain owners who approve evidence bundles without understanding them, audit trails no one reads, policy violations detected but not acted upon. Governance also requires practice — the ongoing behavioral routine by which a domain owner actually performs governance rather than performs its appearance.

What distinguishes performed governance from simulated governance:

Understanding what is being approved. A domain owner performing governance can answer, without prompting: what changed, why, what could go wrong, and why the evidence bundle indicates those risks were addressed. If they cannot answer, they are signing, not governing.

Acting on anomalies. When accountability signals degrade — review times drop, rejection rate trends toward zero — a governing domain owner reduces autonomy scope for that domain. A domain owner performing governance theater adds reviewers or frames the problem as a workload issue.

Reading incidents as policy feedback. After an incident, the governing question is: which constraint was missing, which evaluation didn't catch this, which evidence bundle criterion was insufficient? The non-governing question is: who approved the change that caused the incident? The first drives remediation; the second drives blame without improving the system.

Maintaining calibration. A domain owner who has not rejected a change in two months either has extraordinary agents or has stopped governing. Healthy rejection rates (5–15% of agent-generated PRs) are a calibration signal, not a ceiling to minimize. Sustained rates below that range should be treated as a governance degradation signal, not as quality improvement, unless corroborated by other evidence.

These behaviors are not auditable by structure alone. They require the domain owner to treat governance as a craft that degrades without practice.

Governance Health Monitoring

Accountability frameworks can degrade silently. Control theater — humans nominally accountable but operationally blind — is the most common governance failure at scale and cannot be detected from the outside. Detect it from the inside by monitoring the signals that distinguish meaningful review from rubber-stamping. The Rubber-stamping detection table in the adoption metrics document provides a quantitative baseline: median review time, PR rejection rate, inline comment density, and rework rate within one week. These thresholds are operational heuristics, not empirically validated figures — treat them as starting points calibrated against your own team's baseline data. The intervention protocol when thresholds breach is to reduce autonomy scope for that domain, not to add more reviewers.

Incident Attribution

When incidents occur, accountability is assigned by policy failure mode: specification error, verification gap, enforcement failure, or operational override. This avoids circular blame on the final approver and drives targeted remediation. If trace volume exceeds meaningful human review, raise automation barriers or reduce autonomy until oversight signal quality is restored.

Maturity spectrum, boundary conditions, and operational definitions that apply across all twelve principles.

The Agentic Maturity Spectrum

Maturity is domain-specific, not organization-wide. A team can be Phase 5 in CI and Phase 2 in production operations. Assess each domain honestly.

Phase 1 — Guided Exploration ("vibe coding"). Single prompts, no structure, no memory. Creative but unreliable. Useful for discovering what agents can do; dangerous for anything that matters. Failure mode: extrapolating demo results to production expectations.

Phase 2 — Assisted Delivery. AI as autocomplete. AI code-completion tool suggestions where the human executes. Productivity gains are real but bounded by human throughput. Failure mode: optimizing human-in-the-loop speed instead of questioning whether the loop is necessary.

Phase 3 — Agentic Prototyping. Agents execute tasks autonomously within a single session. Limited memory, limited verification. The moment most teams realize prompting is not engineering. Failure mode: autonomy without verification — the agent said it worked.

At this phase, teams should begin contract-aware prompting: agents produce assertions and pre/postconditions with code, even before full proof pipelines are in place.

Phase 4 — Agentic Delivery. Agents operate with basic guardrails: autonomy tiers are defined, evaluations gate changes, and basic memory persists across sessions. But the system is still single-domain, single-swarm, and largely reactive. Failure mode: governance without feedback — constraints are enforced but never updated by what the system discovers.

Phase 4 should pilot formal contracts on a narrow critical path only when team capability and economics support it.

Phase 5 — Agentic Engineering. Structured autonomy at scale. Specifications steer behavior and evolve through evidence. Multi-agent swarms operate across domain boundaries, right-sized to each task. The full Agentic Loop operates as a continuous system. Failure mode: evaluation theater — evals pass but do not test what matters.

This is where contract-first development becomes systematic: code, contracts, and proofs co-evolve continuously rather than being bolted on late.

Phase 6 — Adaptive Systems. Self-improving infrastructure within governed boundaries. Systems that build, test, and fix themselves — then learn from the results. Continuous learning with active memory curation. Chaos-tested, runtime-verified, economically optimized. Specifications co-evolve with the system's understanding of the problem space. Phase 6 is not inevitable; it requires capabilities — formal verification, causal reasoning, provable containment — that are still maturing. Treat it as a frontier, not a destination. Failure mode: self-improvement without containment — optimizing the metric, not the goal.

At this phase, agents can propose contract refinements and invariant updates, but proof systems and governance gates must validate changes before adoption.

Every phase transition has distinct challenges. Phase 2→3 is where the supervision paradox first hits. Phase 3→4 is where governance overhead must justify itself. Phase 4→5 requires organizational change, not just tooling. See the Adoption Playbook for detailed transition guidance for each phase, role changes, and pilot design.

Empirical Phase Profiles: Evidence from SWE-CI

The SWE-CI benchmark (arXiv:2603.03823) provides the first empirical evidence for what each maturity phase looks like in measurable agent performance. SWE-CI evaluates agents across 100 tasks spanning an average of 233 days and 71 commits of real-world development history, using an Architect–Programmer dual-agent CI loop.

Phase 1-2 performance: Agents at these phases fail SWE-CI entirely. They lack the iterative capability to sustain a CI loop and cannot integrate feedback across cycles.
Phase 3 performance: Agents pass early iterations but accumulate regressions at a high rate. Most models achieve a zero-regression rate below 0.25 — matching Phase 3's canonical failure mode: "autonomy without verification." The agent produces plausible output but erodes the codebase iteration by iteration.
Phase 4 performance: Agents show basic CI-loop competence with evidence per iteration but struggle with cross-iteration learning. Regression rates improve but do not plateau. Governance catches individual failures but does not address the structural regression pattern.
Phase 5+ performance: Only top-performing models exhibit Phase 5 characteristics: specification convergence across iterations, declining regression rates over time, and improving EvoScore. This matches Phase 5's description: the full Agentic Loop operating as a continuous system.

These profiles are descriptive, not normative — SWE-CI tests a specific task type (long-term code maintenance), and phase maturity is domain-specific. But they provide a concrete, measurable calibration point for teams self-assessing their maturity.

Use this benchmark family as a calibration aid, not as a universal scorecard. Public agent benchmarks age quickly, can be contaminated, and tend to attract optimization pressure from the ecosystem. Treat them as one input into maturity assessment alongside private holdouts, incident rates, replay quality, and domain-specific evidence bundles.

Alternative Framing: The Five Levels of Agentic Development

A complementary practitioner framing describes agentic maturity by what the human does rather than what governance exists. These levels (attributed to Dominik Fretz's analysis of production agentic teams) map to the manifesto's phases but emphasize the human role transition:

Level 0 — Spicy Autocomplete: AI as tab completion (Phase 1-2).
Level 1 — Coding Intern: Discrete tasks delegated, everything reviewed (Phase 2-3).
Level 2 — Junior Developer: Multi-file AI changes, human reads all code (Phase 3). Most teams claiming to be "AI-native" operate here.
Level 3 — Manager: Human directs AI, reviews at PR level, no longer writes code (Phase 3-4 transition).
Level 4 — Product Manager: Human writes specification, evaluates outcomes hours later, does not read code (Phase 4-5).
Level 5 — Dark Factory: Specifications in, working software out, no human writes or reviews code (Phase 5-6).

Anecdotal practitioner reports suggest many teams overestimate their AI-native maturity — most operate closer to Level 2 than they believe. The gap between perceived and actual maturity is the most common failure mode in agentic adoption. A 2025 study reported that experienced developers using AI tools took 19% longer to complete tasks while believing AI made them 24% faster. The manifesto's phase-calibrated evidence requirements exist precisely to close this perception gap — your phase is determined by the evidence you can produce, not by the practices you believe you follow.

Use this as calibration, not as a universal scorecard.

Where the two frameworks diverge: Fretz's levels are descriptive of the human experience. The manifesto's phases are prescriptive about governance infrastructure. A team can be at Level 3 (human as manager) while lacking the Phase 4 governance infrastructure (evidence bundles, evaluation gates, defined autonomy tiers) that makes Level 3 safe. The manifesto's position: advancing levels without advancing phases is how you get Level 4 velocity with Phase 2 governance — which is how incidents happen.

Boundary Conditions

This manifesto assumes the environment can support governed autonomy, reliable evidence capture, and reversible operations. When these assumptions do not hold, agentic engineering should be constrained — but not abandoned entirely.

When to Cap Autonomy

Proceed cautiously or cap autonomy at Phase 2-3 when:

Certification or regulatory regimes require deterministic assurance patterns that the current agent/tool chain cannot meet
Safety-critical or real-time systems cannot tolerate probabilistic behavior at the current control boundary
Classified or restricted environments cannot satisfy data-handling and tool isolation requirements
Teams lack baseline CI/CD quality gates, incident response discipline, or domain ownership needed for safe autonomy

Hard Autonomy Caps by Regulated Use Case

Some use cases carry hard autonomy caps regardless of the organization's maturity phase. These caps are not recommendations — they are regulatory constraints. A Phase 5 team operating at full agentic maturity still cannot exceed these caps. The table below shows the strictest cap per domain; see each domain document for the complete use-case-specific cap table.

Domain	Conservative Default Cap	Regulatory Basis	Domain Document
Aviation (airborne software DAL A/B)	Tier 1 (observe only)	DO-178C; DO-330 tool qualification	aviation.md
Medical Devices (IEC 62304 Class C; EU AI Act high-risk)	Tier 1 (observe only)	IEC 62304; EU MDR + AI Act (Class IIa+)	medical-devices.md
Pharma (GMP context; GxP record modification)	Tier 1 (observe only)	GAMP 5; 21 CFR Part 11; EU GMP Annex 11	pharma.md
Financial Services (credit/insurance decisions; algorithmic trading)	Tier 1 (observe only)	EU AI Act Annex III §5; GDPR Art. 22; MiFID II	financial-services.md
Automotive (ASIL C/D safety functions)	Tier 1 (observe only)	ISO 26262; UN Regulation 157	automotive.md
Defense / Government (classified or ITAR-controlled systems)	Tier 1 (observe only)	CMMC; ITAR 22 CFR 120-130; FedRAMP	defense-government.md

The rows above show conservative defaults for the most restrictive category in each domain. Lower-risk workflows in the same domain may permit higher tiers if separately justified. Most workflows in these industries permit Tier 2 or Tier 3 for lower-risk activities. The domain documents contain full use-case-specific cap tables with the regulatory basis for each row.

Phase maturity and autonomy tiers interact. Beyond the hard caps above, phase maturity is a prerequisite for autonomy tier:

Phase 3 or below → Tier 1 only, regardless of infrastructure
Phase 4 → Tier 2 available (branch + human approval)
Phase 5+ → Tier 3 available, subject to use-case caps above

A team cannot operate at a higher autonomy tier than their phase supports, even if the infrastructure is in place.

What Regulated Industries Can Still Use

Capping autonomy does not mean the manifesto is irrelevant. Teams in regulated environments (healthcare, finance, aerospace, defense, government) can still adopt the manifesto's principles selectively:

Principles 1-3 (Outcomes, Specifications, Architecture) apply fully. Outcome orientation, machine-readable specifications, and enforced domain boundaries are valuable in any regulatory context — and may strengthen compliance posture.
Principle 5 (Autonomy) applies at Tier 1 (Observe) and Tier 2 (Branch) in most regulated environments. Agents analyze, propose, and draft in isolated environments; humans execute. The manifesto's tier model maps naturally to regulatory approval workflows.
Principle 8 (Evaluations) applies fully. Evaluation portfolios, regression gates, and evidence bundles are compatible with — and often required by — regulatory audit frameworks.
Principle 9 (Observability) applies fully. Structured traces with provenance are often more rigorous than existing audit logs.
Principle 12 (Accountability) applies fully. Domain-scoped human accountability with incident attribution aligns with regulatory responsibility frameworks.

The principles that require caution in regulated environments are primarily Principle 5 at Tier 3 (production-impacting agent actions), Principle 6 (memory governance in data-restricted environments — see P6 extended guidance), and Principle 10 (chaos testing in safety-critical systems — validate chaos experiments in isolated environments before running on production equivalents).

For viable starting points by domain, see: Aviation · Medical Devices · Pharma · Financial Services

What Would Need to Change

For regulated industries to move beyond Phase 3, the following capabilities must mature: deterministic or formally verifiable agent behavior for critical paths, certified evidence chains that satisfy audit requirements, and data-handling frameworks that meet jurisdictional restrictions for agent- accessible data. These are active areas of development. Teams in regulated environments should track progress and pilot cautiously rather than waiting for full maturity.

Cross-Domain Regulatory Insights

Three governance requirements appear independently across regulated domains. They are not domain-specific — they are structural properties of any high- stakes verification system.

Independent validation as a governance principle. Across regulated domains — SR 11-7 in financial services, IEC 62304 in medical devices, DO-178C in aviation — a common requirement emerges: the entity that validates a system must be independent from the entity that developed it. In agentic engineering, this applies at two levels: the agent system itself must be validated by parties independent of its development, and agent-generated outputs in regulated contexts must be verified through independent means. This maps to the manifesto's tier model: at Tier 1-2, human review provides independence; at Tier 3, independent evaluation infrastructure (separate models, holdout scenarios) provides the independence guarantee. See P8 extended guidance.

SOUP / agent-as-tool categorization. Multiple regulatory frameworks require classification of software components by provenance and qualification status: IEC 62304 (SOUP), DO-178C/DO-330 (COTS/PDS and tool qualification levels), ISO 26262 (SEooC), and GAMP 5 (software categories). In agentic engineering, three entities require classification: the AI model itself (non-deterministic, version-dependent, opaque), agent-selected dependencies (libraries and patterns chosen during execution), and agent-generated code (may incorporate training-data patterns as implicit unclassified software). The manifesto's defense-in-depth response: treat the agent as an unqualified tool and independently verify all output through qualified means. See P3 extended guidance.

Data classification as an agent constraint. Agents operating in regulated environments must respect data classification boundaries. Classification requirements constrain what data agents may access, where inference may execute, and what outputs may be retained. Data classification is not a prompt instruction — it must be enforced at the infrastructure level (Principle 5: autonomy tiers). Domain-specific constraints:

Financial services: GDPR cross-border transfer rules (Chapter V) and banking secrecy laws (Switzerland, Luxembourg, Singapore) may prohibit certain data from reaching external inference APIs entirely.
Life sciences (pharma / medical): GxP record integrity requires ALCOA+ compliance; patient-level clinical data carries HIPAA (US) and GDPR (EU) obligations; raw GxP data must never be modifiable by agents.
Aviation / defense: ITAR (22 CFR 120-130) and EAR (15 CFR 730-774) restrict export-controlled technical data to compliant infrastructure; agents must operate within Technology Control Plans.
Automotive / industrial: Safety-function configuration data may be restricted under product liability and type-approval obligations.

The manifesto's architecture principle (P3) applies across all domains: data classification boundaries must be machine-enforced, not documented and hoped for. See each domain document for the applicable classification matrix and enforcement mechanism.

IEC 61508 as the parent functional safety standard. IEC 61508 (2010) is the foundational functional safety standard for industrial electronic systems, from which several domain standards derive: IEC 62304 (medical device software), ISO 26262 (automotive), EN 50128 (railway), and IEC 62061 (machinery). Teams in domains not covered by a specific domain document should map IEC 61508 Safety Integrity Levels (SIL 1–4) to the manifesto's autonomy tiers using the same logic as the DAL and safety class mappings: SIL 3-4 functions → Tier 1 (observe only); SIL 2 → Tier 1-2; SIL 1 → full tier range with evidence controls.

Domain-Specific Regulatory Alignment

For detailed mappings between the manifesto and specific regulatory frameworks, see the Domain Regulatory Alignment documents:

Aviation — DO-178C, DO-330, DO-333, ARP 4754A
Medical Devices — IEC 62304, ISO 14971, ISO 13485, FDA SaMD
Pharma / Life Sciences — GAMP 5, CSA, 21 CFR Part 11, ICH
Financial Services — SR 11-7, DORA, EU AI Act, SOX
Automotive — ISO 26262, ASPICE, UN Regulation 157
Defense / Government — CMMC, FedRAMP, NIST SP 800-53, ITAR/EAR

For V-model organizations, see adoption-vmodel.md for a V-model-specific adoption path.

Operational Definitions

Blast radius: the maximum credible impact of a wrong action across users, data, services, or regulatory obligations.

Right-sized: the smallest agent topology and model tier that can meet the required quality and latency targets at acceptable total cost of correctness.

Evidence bundle: the minimum artifacts needed to justify a change at a given phase and risk tier.

Total cost of correctness: inference + verification + governance overhead + incident remediation + opportunity cost + context-switching cost. Optimize this composite, not any single component. See Principle 11 guidance.

Evolution-weighted correctness (EvoScore): a metric that measures functional correctness on future modifications, not just current tests. Agents whose early decisions facilitate subsequent evolution score higher; agents that accumulate structural technical debt see progressively declining performance. Introduced by the SWE-CI benchmark (arXiv:2603.03823). Use evolution-weighted metrics as a complement to total cost of correctness for long-running agent pipelines. See Structural Regression in the P8 extended guidance.

Structural regression: a change that passes all current tests but degrades the codebase's capacity for future change. Distinguished from behavioral regression (breaking existing functionality). See P8 guidance.

Phase-calibrated evidence examples:

Phase 3: tests, diff, trace link, rollback note.
Phase 4: Phase 3 bundle plus policy checks and incident tags.
Phase 5+: Phase 4 bundle plus reproducible replay and, where justified, formal artifacts.

ALCOA+ Alignment

Organizations operating under GxP, FDA 21 CFR Part 11, or equivalent regulated data-integrity frameworks will recognise that the manifesto's evidence model satisfies ALCOA+ by construction:

ALCOA+ Criterion	Manifesto Mechanism
Attributable	Agent identity in every trace; named human domain owner (P12)
Legible	Structured, queryable traces — not free-text logs (P9)
Contemporaneous	Traces captured at execution time, not reconstructed after the fact
Original	Signed provenance for shared state; immutable evidence bundles
Accurate	Evaluations as the contract between intent and behavior (P8)
Complete	Evidence bundles are phase-gated; incomplete bundles block merge
Consistent	Versioned specifications; regression gates enforce non-degradation
Enduring	Replayable tool logs; trace retention as infrastructure requirement
Available	Traces must be queryable and aggregatable for audit at any time (P9)

This mapping is intentional, not coincidental. The manifesto was designed so that governed agentic delivery produces records that meet regulated-industry data-integrity standards without a separate compliance overlay.

Concrete scenarios showing how the manifesto's principles apply in practice, including both successful applications and governed failures.

Read the Manifesto for the core values and minimum bars. See the Companion Guide for the full table of contents. See the Companion Principles for extended guidance on each principle.

Worked Patterns

Pattern A — Single-Domain Reliability Fix

Specification: "Retry payment capture exactly once after timeout; never double-charge."

Agent decomposition: implement retry logic, add idempotency key handling, add tests, produce trace and rollback plan.

Evidence bundle (Phase 4): diff, regression tests, trace ID, policy check results, rollback command.

Outcome: shipped change, observed behavior, no duplicate charges in canary.

Pattern B — Multi-Agent, Cross-Domain Coordination

Specification: "Cancel order across orders, billing, and notifications without double-refund, stale customer status, or orphaned events."

Swarm decomposition:

Planner agent creates domain tasks with shared invariants.
Domain agents implement bounded changes in parallel.
Verification agent runs cross-service regression and contract checks.
Coordinator agent resolves conflicting diffs through a single commit path.
Operations agent gates rollout with canary and rollback criteria.

Evidence bundle (Phase 5): per-domain diffs, cross-domain trace graph, invariant check results, reconciliation decisions, canary metrics, rollback commands.

Outcome: one conflicting refund rule detected pre-merge, corrected via constraint update, release completed without refund duplication.

Pattern C — Memory Poisoning Recovery

Scenario: A retrieval shard serving the billing domain is corrupted by a batch indexing error. Agents start generating code that references a deprecated payment API. Three PRs are merged before the pattern is detected through evaluation regression.

Detection: Evaluation metrics for billing-domain changes show a sudden increase in API-compatibility failures. Trace analysis reveals all three failing changes retrieved context from the same shard, and the retrieved context references the deprecated API.

Recovery:

Isolate the corrupted shard — remove from retrieval rotation immediately.
Identify all memory entries created or influenced by the bad shard using provenance metadata.
Roll back billing-domain learned memory to the last known-good snapshot (pre-indexing error).
Revert or flag the three merged PRs for re-review against corrected context.
Re-index the shard from authoritative knowledge sources.
Add a retrieval canary for the billing domain: a known-good query with an expected result, run on every retrieval cycle, alerting on drift.
Update incident memory with the failure class, root cause, and recovery steps — so the system recognizes this pattern faster next time.

Evidence bundle: trace IDs of affected changes, memory diff (before/after rollback), retrieval canary configuration, re-indexed shard validation.

Pattern D — Economics Routing Decision

Scenario: A specification requires generating integration tests for a new API endpoint. The team has access to a high-capability model (expensive, strong reasoning) and a fast model (cheap, weaker on complex logic).

Routing decision:

Route the initial test generation to the fast model — integration test boilerplate is well-covered in training data and doesn't require deep reasoning.
Route the edge-case and adversarial test generation to the high-capability model — these require understanding failure modes and security boundaries.
Route the test review and evaluation against the existing regression suite to deterministic tooling — no model needed.

Illustrative only, not benchmark data.

Cost comparison (illustrative):

All tasks to high-capability model: $4.20 total, 45 seconds.
Routed as above: $0.85 total, 32 seconds (fast model handles 70% of volume).
All tasks to fast model: $0.40 total, 25 seconds — but edge-case tests miss two security boundaries caught in evaluation, requiring a retry on the high-capability model. Actual total: $1.60, 55 seconds.

The lesson: total cost of correctness, not token price, is the metric. The cheapest model is not always the most economical if its failure rate drives rework.

Pattern E — Autonomy Tier Escalation at Runtime

Scenario: An agent operating at Tier 2 (Branch) is implementing a database migration. Mid-task, it discovers that the migration requires modifying a production configuration value to update a connection string.

Escalation protocol:

Agent pauses migration and emits a structured escalation request: "Need to update DB_CONNECTION_STRING in production config. Reason: migration target requires new connection endpoint. Blast radius: all services using the billing database. Reversibility: config change is reversible via config rollback. Evidence: migration plan diff, test results on staging."
System routes the request to the domain owner (billing infrastructure). Because this is a Tier 2→3 escalation (production-impacting action), it requires human approval.
Domain owner reviews the evidence, approves with a time-bound scope: "Approved for this specific config key, this deployment window only."
Agent executes the config change, completes the migration, and the temporary Tier 3 elevation expires.
Full trace captured: escalation request, approval, action, outcome, tier restoration.

Anti-pattern: The agent modifies the production config without escalation because its prompt says "complete the migration." Infrastructure enforcement — not prompt compliance — must block this.

Pattern F — Governance That Didn't Prevent the Incident

Scenario: A team at Phase 4 has evidence bundles, evaluation gates, and defined autonomy tiers. An agent generates a migration that renames a database column used by three downstream services. The evidence bundle is complete: diff, passing tests (all within the agent's domain), trace, rollback command. The domain owner reviews and approves. The change ships. Two hours later, three downstream services fail because they depend on the old column name.

What went right: evidence bundle was complete per Phase 4 requirements. Evaluation gates caught regressions within the domain. The trace made root cause analysis fast — the team identified the breaking change in minutes, not hours.

What went wrong: the evaluation suite only tested within the agent's domain boundary. The specification said "rename the column" but didn't include cross-domain impact as an acceptance criterion. The domain owner approved based on evidence that was correct but incomplete.

Why the manifesto didn't prevent this: at Phase 4, governance is single- domain. Cross-domain evaluation coverage is a Phase 5 capability (shared evaluation registry, cross-domain trace standards). The team was operating correctly for their phase — but their phase wasn't sufficient for the task's actual blast radius.

The lesson: governed failure is still failure. The manifesto reduces the frequency and blast radius of incidents, and it makes diagnosis and recovery faster. It does not eliminate incidents. When a governed change still causes an outage, the question is not "why didn't governance prevent this?" but "what evidence was missing, and at what phase does the manifesto add that evidence?" In this case, the answer is Phase 5's cross-domain evaluation coverage — which the team should now prioritize for domains with shared dependencies.

The anti-lesson: do not respond to this incident by adding more governance at Phase 4 (requiring cross-domain reviews for every change). That is over-governance — it would slow every change to protect against a failure class that only occurs when changes cross domain boundaries. Instead, promote the specific domains with shared dependencies to Phase 5 governance.

Pattern G — Exception-Based Governance at Scale

These sampling thresholds are example defaults; calibrate them to local risk, review complexity, and incident history. This is a governance pattern, not a universal policy.

Context: A team at Phase 4+ is generating agent-driven changes at a volume that exceeds meaningful human review of every change. Domain owners are showing rubber-stamping signals (review time < 2 minutes, rejection rate < 1%).

The supervision paradox: Human review does not scale to machine-speed output. Adding more reviewers at the same volume creates the same pattern faster. The solution is to reduce the volume of decisions requiring human review — not the quality of review.

The pattern:

Classify all changes by risk tier using an automated pre-screener built from domain rules and change impact analysis:
- High-risk: Changes touching pricing logic, customer-facing decisions, shared schemas with cross-domain consumers, security boundaries, or compliance-annotated code paths → mandatory human review before merge.
- Medium-risk: Changes within a single domain, touching non-critical paths, passing full evaluation suites → statistical sample (10-20%) reviewed by domain owner; remainder logged without review.
- Low-risk: Test updates, documentation, configuration in isolated environments, changes with complete evidence bundles and no cross-domain impact → logged and merged automatically; retrospective audit.
Gate high-risk changes explicitly. A PreToolUse hook on PR creation checks the risk classification and blocks merge until the named domain owner approves. Approval latency for high-risk changes is a tracked metric — rising latency indicates the high-risk classification is too broad.
Sample medium-risk changes. The domain owner reviews a random 15% sample each week. If the sample catch rate (issues found per reviewed PR) falls below 2%, the classification threshold may be too conservative — promote some medium-risk to low-risk. If the catch rate exceeds 15%, the threshold is too permissive — raise more to high-risk.
Log low-risk changes for retrospective audit. PostToolUse hooks produce full audit records. Internal audit or the 2nd line of defense conducts periodic retrospective reviews of the low-risk cohort (monthly, 5% random sample) to validate the classification is working.

Evidence bundle: Risk classification rationale per PR (which rules triggered which tier), domain owner approval record for high-risk changes, weekly sample review record, retrospective audit findings.

Classification criteria examples:

Rule	Classification
Touches `src/pricing/**`	High-risk
Touches `src/claims/**`	High-risk
Modifies a database schema	High-risk (cross-domain impact)
New dependency added	High-risk (provenance review required)
Test file changes only, green regression suite	Low-risk
Documentation, README, comments	Low-risk
Single-domain logic change, passing evals	Medium-risk

Anti-pattern: Treating all agent-generated changes as equally risky and requiring human review for all of them. This creates the rubber-stamping failure mode. The goal is not to review everything — it is to review the right things with enough attention to catch real problems.

Relationship to Three Lines of Defense: In regulated environments, the classification pre-screener is a 1st-line control. The 2nd-line independent validation function reviews the classification criteria periodically (not individual changes) and challenges whether the risk tiers are set appropriately. The 3rd line audits whether the process was followed.

Pattern H — The Persona Simulator

When to use: Before shipping a feature that involves complex user interactions, ambiguous intent, or high diversity of user populations. Use this pattern to validate that the specification itself is correct — not just that the implementation satisfies the specification as written.

The problem it solves: Specifications are written from the perspective of the team that wrote them. They encode assumptions about how users will interact with the feature, what they will ask, and what they consider a success. These assumptions are often wrong. Traditional testing verifies that the implementation matches the spec; it does not verify that the spec matches user reality.

Pattern:

Deploy a swarm of simulation agents, each instantiated with a distinct persona profile: domain expertise, communication style, prior experience with the system, edge-case goals, adversarial intent (where applicable). Each persona agent interacts with the feature under development using the specification as its behavioral target.

The simulation produces two outputs:

Coverage gaps — interaction paths, question types, or intent categories that the specification does not address. These become specification amendments before implementation is finalized.
Failure signals — interactions where the feature's response would be incorrect, ambiguous, or unsafe from the perspective of that persona. These become evaluation cases in the evaluation portfolio (P8).

Relationship to the Agentic Loop: The Persona Simulator belongs to the Validate phase, not the Verify phase. Verify confirms the implementation satisfies the specification. Validate asks whether the specification itself is worth satisfying. Running the simulator before implementation is complete (on a specification stub or prototype) catches the wrong-thing-built failure class before it is fully built.

Minimum viable version:

personas = [
  { role: "power user", style: "terse", goal: "efficiency" },
  { role: "first-time user", style: "exploratory", goal: "orientation" },
  { role: "adversarial user", style: "probing", goal: "boundary-finding" },
  { role: "domain expert", style: "precise", goal: "correctness validation" },
]

for persona in personas:
  interactions = simulate(persona, feature_spec, n=20)
  gaps = extract_coverage_gaps(interactions, feature_spec)
  failures = extract_failure_signals(interactions, acceptance_criteria)
  report.add(persona, gaps, failures)

Exit criterion: The simulation is complete when coverage gaps have been either addressed in the specification or explicitly accepted as out of scope, and all failure signals have been added to the evaluation portfolio. Shipping without addressing the failure signals is an explicit risk decision, not an oversight.

Not all failure signals can be deferred. Any failure signal involving safety, data integrity, irreversible user harm, or a regulatory requirement is non-deferrable: it must be addressed in the specification before the implementation proceeds. Logging it as "accepted out of scope" is not acceptable for these categories. If no fix is feasible, the feature scope must be reduced to exclude the interaction class that produces the failure.

What this pattern is not: It is not a replacement for user research. Real users surface failure modes that no persona model anticipates. The Persona Simulator is a pre-ship filter, not a substitute for post-ship observation. It raises the floor; it does not guarantee the ceiling.

Failure Patterns

Hallucination Loop

An agent misreads a timeout error as auth failure, applies credential retries, and increases incident volume. Each retry generates plausible-but-wrong output that drives increasingly wrong follow-on fixes.

The fix is not "retry the prompt." It is:

Diagnose using traces — identify the misclassification point.
Add a contract/invariant: "timeout retry must not mutate credentials."
Update evaluations with the failure class as a regression test.
Gate rollout until traces confirm the corrected behavior.

Never simply retry a failed prompt. Diagnose, update memory, strengthen contracts and constraints, and rerun verification before retrying.

Operational Recovery Cycle

Diagnose using traces and failure classification.
Add or tighten contract/invariant for the violated behavior.
Add regression and adversarial tests for the failure class.
Re-run verification and canary on constrained scope.
Promote only after evidence shows the loop is broken.

Cross-Domain Incident Classification Framework

A common severity framework enables consistent incident classification, reporting, and recovery across regulated environments. Domain-specific calibrations are listed below.

Severity	Definition	Recovery Expectation	Regulatory Trigger
Severity 1	Agent takes unauthorized action with external impact (customer accounts, patient data, regulatory submissions, safety-critical systems)	Immediate containment; production rollback; root cause analysis with executive sign-off	Mandatory regulatory notification in most domains (DORA Art. 17-23; MDR Art. 87; ITAR incident reporting)
Severity 2	Agent produces incorrect output detected before downstream impact; indicates a control failure (evaluation gate missed, tier enforcement bypassed)	Same-day diagnosis; evidence bundle with root cause; governance review of the failed control	Internal incident record; potential regulatory disclosure depending on data type affected
Severity 3	Agent performance degradation (latency, accuracy drift, increasing evaluation failure rate) detected through monitoring; within tolerance thresholds	Diagnosis within 24h; specification or tier adjustment if root cause identified	Typically internal; may trigger DORA notification if threshold-breaching degradation continues
Severity 4	Agent failure fully contained by circuit breakers or fallback mechanisms; no downstream impact	Post-incident review within 48h; update chaos test suite with the failure scenario	Internal only; document in resilience engineering log

Domain-specific calibration:

Aviation: Map to the failure condition category (Catastrophic, Hazardous, Major, Minor) of the software component affected. Any agent action affecting airborne software in a DAL A/B component is Severity 1 by default.
Medical devices: Map to IEC 62304 safety class and ISO 14971 harm probability × severity. Any agent action affecting Class C critical-path software is Severity 1. Vigilance reporting timelines apply for Severity 1-2.
Pharma: Map to GxP data integrity impact. Any agent action that modifies or corrupts GxP records without a valid audit trail is Severity 1. Deviation and CAPA procedures apply.
Financial services: Use the DORA Severity 1-4 taxonomy defined in the financial-services.md domain document. DORA notification timelines are strict; track them as a first-class workflow trigger.
Automotive: Any agent action affecting ASIL C/D safety function specifications, test cases, or verification records is Severity 1. ISO 26262 Part 8 change management requirements apply.
Defense / government: ITAR/EAR violations are automatic Severity 1 regardless of downstream impact. Report to the cognizant security officer immediately; do not attempt self-remediation before reporting.

Companion document to the Agentic Engineering Manifesto. Extends Principle 2 (Specifications as living artifacts) with a structured Requirements Engineering framework adapted for probabilistic, agentic, and hybrid systems.

Primary reference: "Requirements Engineering in the Age of Agentic AI" (submitted framework). Key academic support: arXiv:2602.22302 (Agent Behavioral Contracts), AgentSpec ICSE 2026 (arXiv:2503.18666), NIST AI RMF GenAI Profile (NIST AI 600-1, July 2024), ISO/IEC 5338 (AI system life cycle), ISO/IEC 42001 (AI management systems), EU AI Act (Regulation (EU) 2024/1689).

1. The Paradigm Break

Traditional requirements engineering was designed for deterministic systems. A requirement specifies a condition the system must satisfy; a test confirms whether the system satisfies it. Pass or fail.

Agentic systems break this model in three ways:

Non-determinism. The same input may produce different outputs across runs. A requirement stating "the system shall return X given input Y" cannot be verified by a single test execution. It must be stated as a probabilistic assurance target: "the system shall return output consistent with class X in at least N% of runs across the evaluation distribution."

Emergent behavior. Agentic systems learn, adapt, and generate outputs outside any enumerated set. Requirements that enumerate permitted outputs will always be incomplete. Requirements must instead define a behavioral envelope — the boundary the system must stay within — and verify containment rather than specific outputs.

Dual consumers. Specifications in agentic pipelines are consumed by both humans (who interpret intent) and agents (who execute literally). A specification that relies on human context to be meaningful will fail when consumed by an agent.

These three breaks require a different RE vocabulary. This document provides it.

2. The Two-Axes Classification Matrix

Every requirements artifact in an agentic system can be placed on two axes:

Axis 1 — System type:

Deterministic: Classical software. Outputs are fully determined by inputs and current state. Traditional RE applies without modification.
Agentic: LLM-based, reinforcement-learning-based, or otherwise probabilistic. Outputs are non-deterministic. Traditional RE must be extended.
Hybrid: Deterministic orchestration layer over agentic execution components. Most production agentic systems. Requires mixed RE strategies.

Axis 2 — Artifact consumer:

Human: The requirement is written for a human reader. Natural language is appropriate. Intent can be communicated through context, examples, and commentary.
Agent: The requirement is consumed directly by an agent as part of a specification, system prompt, or AGENTS.md file. Must be unambiguous to a machine. Contextual inference is unreliable.
Hybrid: The requirement must serve both humans (for review, governance) and agents (for execution). This is the hardest case and requires explicit dual-format specifications.

The 3×3 Matrix

	Human consumer	Agent consumer	Hybrid consumer
Deterministic system	Traditional RE. Prose + formal models.	AGENTS.md / skill files. Machine-readable constraints with no ambiguity.	Canonical prose spec + machine-readable encoding. Keep them in sync.
Agentic system	Behavioral envelope in prose. Probabilistic assurance targets as acceptance ranges.	Behavioral contracts (arXiv:2602.22302). AgentSpec format (arXiv:2503.18666). Enumerated constraints with explicit probability bounds.	Single source (behavioral envelope) + dual projections: prose for governance, structured format for agent consumption.
Hybrid system	Separate deterministic and agentic requirement sets. Document which components are which.	Orchestration spec (deterministic, machine-readable) + behavioral envelope (agentic components).	Full RE framework: single-source document → human projection → agent projection → governance projection.

Key rule: Never write a requirement in the human-consumer format when the primary consumer is an agent. The specification will be consumed literally. What a human infers from context, an agent will miss or misapply.

Stack allocation note: Requirements must be allocated at the appropriate layer of the system stack: foundation model / provider, prompt and runtime policy, planner or controller, memory, tools and connectors, deterministic orchestration, human review interface, deployment and monitoring infrastructure. A single high-level requirement (e.g., "the system must not exfiltrate sensitive data") typically decomposes into separate requirements at multiple layers: model and provider constraints, retrieval scoping, connector authorization scopes, egress controls, logging, review gates, and incident response. Apply the two-axes classification at the layer where the requirement is enforced, not at the system level.

3. Hard Requirements vs. Probabilistic Assurance Targets

The requirement type must match the system type.

Hard requirements are absolute. The system either satisfies them or it does not. They apply to:

Deterministic components of hybrid systems
Safety boundaries (the system must never take action X regardless of context)
Authorization and access control
Structural invariants (data formats, API contracts, schema validation)

Hard requirements in agentic systems should be enforced by infrastructure policy wherever possible, not by the agent's own reasoning. An agent instructed not to do X via a prompt can be argued or manipulated out of that constraint. An agent that cannot do X because the tool call is disabled cannot. In practice, critical hard requirements may need layered enforcement — infrastructure policy as the primary control, supplemented by runtime monitoring, human review gates, and post-hoc audit detection. No single enforcement mechanism should be treated as sufficient in isolation for Tier 3 systems.

Probabilistic assurance targets define acceptable performance ranges across an evaluation distribution. They apply to:

Output quality (accuracy, relevance, completeness)
Behavioral consistency (the system should behave consistently within the behavioral envelope)
Task success rates

Format: "The system shall achieve [metric] of [value] ± [tolerance] across [evaluation distribution] with [confidence level]."

Example: "The claim extraction agent shall achieve F1 score ≥ 0.85 across the held-out evaluation set of 500 documents, with 95% confidence interval upper bound ≥ 0.82."

Critical distinction: Probabilistic assurance targets are not lower-quality requirements. They are the correct specification format for non-deterministic behavior. Writing a hard requirement for probabilistic behavior is not more rigorous — it is a category error that will always fail at verification.

4. The Behavioral Envelope

A behavioral envelope defines the space within which agent behavior is acceptable, without enumerating acceptable behaviors. It consists of four layers:

Layer 1 — Hard boundaries (must never). Actions the agent is prohibited from taking regardless of context, instructions, or apparent justification. These are enforced structurally (tool removal, permission policy) not by prompt instruction.

Examples: writing to production databases without explicit approval, sending external communications without human review, executing irreversible actions in Tier 3 systems (see Section 6).

Layer 2 — Soft boundaries (should avoid). Behaviors that are undesirable but not prohibited. Enforced by evaluation, monitoring, and steering. Alert on violation; do not hard-block.

Examples: producing responses that exceed the approved length envelope, citing sources outside the approved knowledge base, introducing architectural patterns not aligned with the codebase style.

Layer 3 — Performance envelope. The acceptable range of quality, cost, latency, and resource consumption. Defines when degraded performance triggers escalation.

Layer 4 — Adaptation envelope. For systems that learn or accumulate state: defines what the system is permitted to learn from (allowed data types, sources, and feedback channels), what it must not update on (prohibited inputs to persistent memory or fine-tuning), and how learned behavior is governed and audited. Specify: what counts as adaptation (few-shot history, RAG knowledge base updates, fine-tuning, long-term memory writes); who can write to persistent memory and under what conditions; provenance requirements for stored knowledge; retention and expiry policy; how learned state can be rolled back; and what behavioral changes trigger a revalidation cycle. For systems using retrieval-augmented generation, specify knowledge base governance: source authority, freshness requirements, and access boundaries.

The behavioral envelope is the primary specification artifact for agentic components. It replaces enumerated-output requirements as the verification target. For the full system, the behavioral envelope coexists with hard requirements for deterministic components and interface contracts between system layers.

APLC extension: This framework defines the vocabulary and technical specification structures for the engineering-level behavioral specification. For agent products — systems where the agent itself is the delivered product rather than an instrument of software delivery — the behavioral envelope model is extended to the product level in agent-behavioral-specification.md. The business purpose, user model, and trust architecture from Stage 1 (Conception) are additional inputs that shape every section of the product-level behavioral specification. The companion RE framework remains the technical reference; agent-behavioral-specification.md applies it at the product scope.

Multi-Agent Behavioral Contracts

When multiple agents interact, behavioral envelopes must be specified for each agent individually and for inter-agent boundaries. The Agent Behavioral Contracts framework (arXiv:2602.22302) addresses this explicitly: contracts define pre/postconditions and invariants at agent boundaries, and multi-agent contract composition yields computable probabilistic degradation bounds for the chain.

The practical implication: reliability does not improve in a multi-agent system simply by adding more agents. Correlated failure modes — shared base model, shared knowledge base, shared tool chain — mean the combined reliability of a chain can be worse than any single agent's reliability, because failures propagate in the same direction simultaneously. Requirements for multi-agent systems must therefore specify:

Communication contracts at each inter-agent boundary: what one agent is permitted to send, what another is required to accept, what triggers rejection or escalation
Chain-level reliability targets stated as probabilistic assurance targets, not derived from per-agent targets by multiplication (which assumes independence that rarely holds)
Failure isolation boundaries: what happens when one agent in the chain fails — does it escalate, fall back, or propagate the error downstream?
Shared resource governance: if agents share a knowledge base, memory store, or tool, specify which agent can write, which can read, and under what conditions

5. The Single-Source / Multiple-Projections Principle

Agentic pipelines require requirements artifacts in multiple formats for multiple consumers:

Governance and audit: prose, human-readable, context-rich
Agent execution: structured, machine-readable, unambiguous
Testing and evaluation: measurable, with clear pass/fail or threshold criteria
Regulatory compliance: aligned to ISO/IEC 5338, NIST AI RMF, or domain-specific standards

The failure mode is maintaining separate documents for each consumer. These diverge. The governance document says one thing; the agent execution spec says another; the tests verify a third thing.

The single-source principle (governance best practice, not a legal requirement): One canonical source document (the behavioral specification) is the source of truth. All other representations are generated or derived from it, not independently authored. When the source changes, all projections must be updated.

In practice:

Write the behavioral specification in human-readable prose with explicit, structured sections
Derive the agent-consumable encoding (AGENTS.md, AgentSpec format, or behavioral contract) from the prose by explicit, documented transformation
Derive the evaluation suite from the probabilistic assurance targets, not independently
Derive the compliance mapping from the behavioral envelope using the relevant standard's framework (NIST AI RMF risk categories, ISO/IEC 5338 life cycle requirements)

Every requirement in every projection must trace back to a named section in the canonical source. If it cannot, it either belongs in the source or should not exist.

Change Control

Because the behavioral specification is the source of truth for all projections, changes to it carry cascading implications. A change control process must specify:

Who can propose changes to the behavioral specification, and who must approve them (minimum: the tier owner and a representative from each affected consumer group — governance, engineering, and security)
What triggers mandatory re-evaluation: new capability deployment, behavioral drift detected by monitoring, incident post-mortem, regulatory change, or elapsed review interval
How projections are updated: no projection may be updated independently. The canonical source is updated first; projections are re-derived. The commit history of the source document is the audit trail.
How version mismatches are detected: agents consuming a behavioral specification should receive a versioned reference. If the version they were instantiated with no longer matches the current canonical version, the mismatch must be flagged before the next deployment cycle.

For Tier 3 systems, changes to Layer 1 hard boundaries require explicit re-authorization: re-specification, updated evidence bundle, and revalidated approval chain.

6. Tiered Lifecycle

Requirements governance applies differently at different autonomy tiers. Mismatching governance to autonomy tier is a common source of both under-governance (too little control at high autonomy) and over-governance (paralyzing low-risk operations with excessive process).

Tier 1 — Propose-only (analyze and recommend, no execution). Requirements emphasis: output quality, behavioral consistency, information boundary (what data the agent can access). Governance: standard review gates. The blast radius of wrong output is bounded by the human review step.

Minimum requirements artifacts: behavioral envelope (Layers 1 and 3), evaluation suite for output quality, data access specification.

Tier 2 — Isolated execution (writes to branches, sandboxes, or staging environments; changes require review before promotion). Requirements emphasis: all Tier 1 requirements, plus: scope boundary (what the agent can modify), promotion criteria (what review must confirm before promotion), rollback specification.

Minimum requirements artifacts: Tier 1 artifacts + scope boundary document + review gate criteria + rollback procedure.

Tier 3 — Production-impacting (writes to production state, sends external communications, takes irreversible actions). Requirements emphasis: all Tier 2 requirements, plus: explicit human approval requirements (who can authorize, under what conditions), audit trail specification, rollback plan (pre-approved, not improvised), incident escalation path.

Minimum requirements artifacts: Tier 2 artifacts + human approval policy + audit trail specification + pre-approved rollback plan + incident escalation procedure.

EU AI Act obligations apply to high-risk systems operating at Tier 3. Human oversight requirements under the Act are not governance checkboxes — they are system design requirements. Specify: what interface enables operators to monitor, detect anomalies, and override outputs; what training or competency is required to exercise oversight effectively; what stop-operation procedure exists and how quickly it can be invoked; and what post-market monitoring captures for ongoing review. "Human approval" is insufficient as a Tier 3 requirement unless the approval mechanism itself is specified as part of the system.

Tier assignment is a requirements decision, not a deployment parameter. It must be made explicitly at the specification stage and documented in the behavioral specification. Tier assignment determines the governance overhead; an agent assigned to Tier 1 cannot subsequently be granted Tier 3 authority without a full re-specification and review cycle.

7. Non-Functional Requirements for Agentic Systems

The NFR categories that require explicit treatment in agentic systems:

NFR Category	Agentic-Specific Consideration	Specification Format
Reliability	Non-determinism means reliability must be stated as a distribution, not a point estimate	Probabilistic assurance target
Safety	Define the behavioral envelope Layer 1 hard boundaries explicitly	Hard requirement, infrastructure-enforced
Security	Agentic threat landscape includes: prompt injection (direct and indirect via tool outputs), context poisoning, memory poisoning, goal/behavior hijacking, over-permissioned connectors, privilege escalation, supply-chain risks in tool protocols, and identity abuse. Requires explicit threat model and defense-in-depth — prompt-level controls alone are insufficient. Specify: credential scope per connector, connector trust model and verification, memory integrity requirements, red-team cadence	Hard requirement per threat category + evaluation suite for injection resistance + review gate for connector authorization
Privacy	Data exposure via context window and memory requires explicit access boundaries	Hard requirement
Fairness / Bias	Output bias is a behavioral quality metric, not a binary	Probabilistic assurance target + evaluation distribution specification
Explainability	Traceability of agent reasoning to decision	Hard requirement (trace format) + probabilistic target (trace completeness)
Cost	Token consumption, compute cost per task	Probabilistic assurance target (p95 cost per run)
Latency	Time-to-completion distribution	Probabilistic assurance target (p50/p95/p99)
Regulatory compliance	EU AI Act (high-risk obligations apply on a staged timetable; verify the current application date and transitional rules for your use case): documented post-market monitoring, human oversight, logging of autonomous decisions, traceability	Hard requirements (documentation, logging, override capability) + process requirements (post-market monitoring plan)
Evolvability	Specifications must evolve without full re-derivation	Single-source principle compliance

8. Per-Requirement Checklist

For each requirement in a Tier 2+ agentic system, verify:

Type declared: Is this a hard requirement or a probabilistic assurance target? Is the type correct for the system type?
Consumer declared: Is the consumer human, agent, or hybrid? Is the format appropriate?
Axis classification: Is the system type (deterministic/agentic/hybrid) and consumer type documented?
Traceable to source: Does this requirement trace to a named section in the canonical behavioral specification?
Verifiable: Can this requirement be verified by an evaluation or test? Is the evaluation defined?
Tier-appropriate: Is the governance overhead appropriate for the tier?
Hard boundaries infrastructure-enforced: If this is a hard boundary, is it enforced by infrastructure policy, not prompt instruction?
Probabilistic targets have distributions: If this is a probabilistic assurance target, is the evaluation distribution specified?
Memory governance addressed: If the agent has persistent memory, is the adaptation envelope (Layer 4) specified?
Rollback defined: If this is Tier 3, is the rollback procedure pre-approved and documented?

9. Connection to the Manifesto

This framework is an extension of Principle 2 (Specifications are living artifacts that evolve through steering). It provides the vocabulary and structure that Principle 2 requires but does not define.

The behavioral envelope (Section 4) operationalizes Principle 5 (tiered autonomy): every tier has a corresponding behavioral envelope scope.

The single-source principle (Section 5) operationalizes Principle 9 (observability as infrastructure): when requirements are single-source, the audit trail is coherent.

The tiered lifecycle (Section 6) maps directly to the autonomy tiers in Principle 5 and the blast radius management framework in Principle 10.

The probabilistic assurance targets (Section 3) operationalize Principle 8 (evaluations are the contract): the evaluation contract is the assurance target, not a binary test assertion.

10. Academic References

arXiv:2602.22302 — Agent Behavioral Contracts. Formal specification of agent behavior using pre/postconditions adapted for probabilistic systems. Provides mathematical grounding for the behavioral envelope concept.
arXiv:2503.18666 — AgentSpec (ICSE 2026). A domain-specific language for specifying and enforcing runtime constraints on LLM agents. Rules consist of triggers, predicates, and enforcement mechanisms that intercept agent actions before execution. Relevant for encoding Layer 1 and Layer 2 behavioral envelope constraints in machine-executable form. Note: AgentSpec is a runtime enforcement tool, not a requirements specification format; it operationalizes the agent-consumer column of the two-axes matrix rather than replacing the requirements specification itself.
NIST AI 600-1 (July 2024) — NIST AI RMF Generative AI Profile. Risk category taxonomy for generative AI systems. Maps to Layer 1 and Layer 2 of the behavioral envelope.
ISO/IEC 5338 — AI system life cycle processes. International standard for requirements engineering in AI systems. The tiered lifecycle (Section 6) aligns to ISO/IEC 5338's risk-based process tailoring.
ISO/IEC 42001 — AI management systems. International standard for governance, performance evaluation, monitoring, and continual improvement of AI systems. The single-source principle (Section 5) and tiered lifecycle (Section 6) align to ISO/IEC 42001's documentation and change management requirements.
EU AI Act (Regulation (EU) 2024/1689) — Obligations for high-risk AI systems enforceable from 2 August 2026. Relevant provisions: post-market monitoring systems, human oversight measures, logging of autonomous decisions, technical documentation. For agentic systems: wrapping a foundation model in an orchestration layer can constitute a substantial modification triggering full provider obligations. Regulatory compliance row in Section 7 maps to these obligations.
arXiv:2603.03823 — SWE-CI benchmark (Sun Yat-sen University & Alibaba Group, 2026). Found that most evaluated models achieve zero-regression rates below 0.25, meaning regressions were introduced across the majority of long-horizon maintenance tasks — validating the need for probabilistic assurance targets and independent evaluation rather than point-in-time binary testing.

How this manifesto can fail, and the skills teams need to implement it.

Failure Modes of This Manifesto

These are failure modes of the manifesto's technical implementation. For failure modes of the organizational change process (adoption without support, incentive mismatch, skipping phases), see the Adoption Playbook.

Applied poorly, this manifesto can fail through:

Over-governance: Constraints so heavy that human coding becomes faster. The tell: lead time increases without corresponding quality improvement. The fix: reduce ceremony, widen Tier 1/Tier 2 boundaries, and measure whether governance overhead is justified by incident reduction.
Evidence theater: Large bundles with low signal. Teams produce voluminous evidence artifacts that nobody reads and that don't catch real failures. The tell: evidence bundle size grows while escaped defect rate stays flat. The fix: audit which evidence artifacts actually influenced a decision in the last quarter. Cut the rest.
Control theater: Humans nominally accountable but operationally blind. A named domain owner "approves" changes they cannot meaningfully review because volume exceeds capacity. The tell: approval latency drops to near zero (rubber-stamping). The fix: reduce autonomy scope until review is meaningful, or invest in automated pre-screening that surfaces only the exceptions worth human attention.
Security theater: Policies documented but not enforced at tool/runtime boundaries. The architecture describes constraints that no infrastructure actually blocks. The tell: agents violate documented policies with no system-level detection. The fix: enforce before you document — if the infrastructure can't block it, the policy is aspirational, not real.
Adoption theater: Teams adopt the manifesto's vocabulary without its discipline. Evidence bundles are renamed PR descriptions. Autonomy tiers are defined but not enforced. Maturity self-assessments are aspirational. The tell: the language changes but incident patterns don't. The fix: measure outcomes (escaped defect rate, incident severity, rollback frequency), not adoption checkboxes.
Maturity inflation: Teams self-assess at Phase 4 or 5 because the phase descriptions are aspirational enough to pattern-match to current practice. The tell: a team claims Phase 4 but cannot produce an evidence bundle for a recent change. The fix: use the phase-calibrated evidence examples (Operational Definitions) as a litmus test — the evidence you can actually produce determines your phase, not the practices you intend to adopt.
Verification without validation: Every gate passes, evidence bundles are complete, escaped defect rate is low — but the team ships the wrong thing. The specification was never worth implementing, and the manifesto's verification machinery confirmed the implementation was correct without anyone confirming it was valuable. The tell: system quality metrics improve while business outcome metrics (adoption, usage, revenue impact, customer satisfaction) stay flat or decline. The fix: treat the Agentic Loop's Observe → Learn phases as validation checkpoints — connect evaluation results to business outcomes, define stop criteria (not just acceptance criteria) for every specification, and make business assumptions explicit before the Loop begins. See the Validation vs. Verification section in P2 extended guidance.
Structural regression without detection: Every change passes current tests, regression suites are green, evidence bundles are complete — but the codebase is progressively harder to maintain. Each iteration's decisions (naming conventions, dependency structures, architectural choices) create friction that compounds across subsequent iterations. The code is locally correct but globally harmful. The tell: iteration-over-iteration regression frequency rises, time per change increases, and specification convergence slows — all while current test pass rates remain high. The fix: track evolution-weighted metrics (see EvoScore in Operational Definitions), monitor coupling and dependency trajectories across iterations, and include structural quality indicators in evaluation portfolios alongside behavioral regression tests. See the Structural Regression section in P8 extended guidance. The SWE-CI benchmark (arXiv:2603.03823) provides empirical evidence: most agents introduce regressions in over 75% of CI iterations, many of which are structural rather than behavioral.

The corrective action is always the same: reduce ceremony, increase signal, and measure cycle time, defect rate, and incident severity together.

Skill Requirements by Principle

Not all principles require the same skills. This table helps teams identify capability gaps before they become adoption blockers. See the Adoption Playbook for guidance on building these capabilities.

Principle	Core Skill Required	Team Readiness	Notes
P1 — Outcomes	CI/CD, release engineering	Ready	Existing pipelines need extension, not replacement
P2 — Specifications	Formal requirements, contract design	Reorient	Requirements skills exist but need machine-readable precision. Agent Skills, AGENTS.md, and specification-driven development frameworks provide concrete formats and workflows
P3 — Architecture	Infrastructure engineering, policy-as-code	Reorient	Infra skills exist but policy-as-code enforcement is new
P4 — Swarm Topology	Distributed systems design	Acquire	Few teams have multi-agent coordination experience. A2A protocol provides emerging standards for agent discovery and task delegation
P5 — Autonomy	Security engineering, access control	Reorient	Access control exists but agent-specific tier enforcement is new. Infrastructure-level policy systems (YAML-based permissions, audit logs, guardrail constraints) offer reference implementations
P6 — Knowledge & Memory	Data engineering, information retrieval	Acquire	Memory governance (provenance, expiration, rollback) is a new discipline. Git-native agent memory systems provide early reference architectures
P7 — Context	ML/retrieval engineering, context engineering	Acquire	Retrieval engineering at agent scale requires specialized skills. Agent-to-tool protocols, capability definitions, and agent memory systems form an emerging tooling ecosystem
P8 — Evaluations & Proofs	Test engineering, formal methods	Split	Test engineering: ready. Formal methods: acquire (and defer until Phase 5)
P9 — Observability	SRE, distributed tracing	Reorient	SRE exists but agentic traces require new schema and tooling. Emerging interoperability standards under neutral governance (AAIF) provide the foundation
P10 — Emergence	Chaos engineering, security	Acquire	Chaos engineering for agentic systems has no established playbook. Early autonomous agent security incidents provide case studies
P11 — Economics	FinOps, cost optimization	Reorient	FinOps exists but total-cost-of-correctness models are new
P12 — Accountability	Incident management, compliance	Ready	Incident management extends naturally; compliance may need updates

Reading the Readiness column:

Ready: The skill exists and applies with minor extension.
Reorient: The skill exists but must be redirected toward agentic concerns. Training and practice are sufficient; hiring is not required.
Split: Part of the skill is ready; part must be acquired separately.
Acquire: The skill is rare or nonexistent in most teams. Requires hiring, dedicated training, or partnering with specialists.

Principles marked "Acquire" are the adoption bottlenecks. Do not attempt these at full depth without investing in the skill. Start with "Ready" and "Reorient" principles (P1, P3, P5, P9, P12) and build toward the harder ones incrementally. The Adoption Playbook maps these skills to specific phase transitions.

Annotated Agent Configuration Template

Every project needs an agent configuration file (commonly named AGENTS.md, CLAUDE.md, or similar depending on tooling). Neither the manifesto nor most tooling documentation provides a starting point. Use this template — adapt it, do not just copy it. Annotations explain what each section must contain and whether it is mandatory per CoE policy.

# [Project Name] — Agent Instructions

## Scope and Version
<!-- RECOMMENDED. Establish ownership and applicability before the agent reads further. -->
Owner: [name or team]
Last updated: [date]
Applicable systems: [which services, repos, or pipelines this file governs]

## Project Overview
<!-- MANDATORY. 3-5 lines. What does this service do? What domain does it own?
     What is its upstream/downstream position in the system? -->
[Service name] is responsible for [core function]. It owns [domain boundary].
Upstream: [what feeds into it]. Downstream: [what consumes its output].
Stack: [language, framework, runtime].

## Build, Test, Deploy Commands
<!-- MANDATORY. Agents must be able to run these without asking. -->
Build:  [command]
Test:   [command]          # Must exit 0 before any PR
Lint:   [command]
Deploy: [command or "see CI pipeline — do not deploy manually"]

## Domain Constraints
<!-- MANDATORY. What must this agent never do in this codebase? -->
- Never modify [schema/table/config] without a migration file and a rollback.
- Never call external APIs directly — use the adapter layer at [path].
- Never generate pricing, underwriting, or claims logic — flag for human review.
- [Any other non-negotiable domain boundary]

## Security
<!-- MANDATORY. Do not duplicate enterprise-wide policy here; link to the
     governing file instead. Add only project-specific security constraints. -->
Follows enterprise security rules. Project-specific additions:
- All [entity type] inputs must be validated against [schema/contract] at [path].
- [Any project-specific credential or secret handling requirement]

## Testing Conventions
<!-- MANDATORY. Agents must know how tests are structured before writing them. -->
Test location: [path pattern]
Naming: [convention, e.g., describe/it or TestFunctionName_Scenario]
Mocking: [approved mock strategy — real DB / in-memory / stub]
Coverage threshold: [minimum %, matches hook threshold]

## Commit and PR Conventions
<!-- MANDATORY. -->
Commit format: [conventional commits / other]
PR title: [format]
Every agent-assisted commit must include: "Co-Authored-By: [agent-id]"

## Architecture Notes
<!-- RECOMMENDED. Key decisions agents must respect. Keep brief. -->
- [ADR reference or one-line constraint, e.g., "hexagonal architecture — no
  framework code in domain layer"]
- [Data flow constraint, e.g., "all writes go through the command bus at [path]"]

## MCP Integrations in Use
<!-- RECOMMENDED. List approved MCPs available in this project. -->
- [MCP name]: [what it does, what data classification it can access]

## What NOT to Put Here
<!-- Advisory — for the human writing this file -->
Do not include: credentials, environment variable values, hostnames, IPs,
information that belongs in enterprise rules (already loaded), information
that should be in a path-scoped rule file.
Do not exceed 200 lines. Use @path/to/file imports for larger reference docs.

CoE review checklist for project agent configuration file:

Project Overview: domain boundary clearly stated
Build/test/deploy commands: all present and tested
Domain Constraints: no overlap with enterprise-managed agent configuration
Security section: references enterprise rules rather than duplicating them
Testing Conventions: coverage threshold matches hook threshold
No credentials, hostnames, or environment-specific values
Under 200 lines

Cross-Domain Supplier and Vendor Qualification

Every regulated domain requires qualification of critical suppliers of software systems. In agentic engineering, "supplier" is an ambiguous category — LLM providers, open-source frameworks, agent runtimes, and tool integrations all fall into scope. This section provides a cross-domain synthesis; domain documents provide the regulatory specifics.

Who Is the Supplier?

Component	Supplier Type	Qualification Obligation	Key Issue
Commercial LLM API (OpenAI, Anthropic, etc.)	Named vendor with terms of service	Vendor assessment: data handling, version notification, SLA, incident notification	No access to training data, model weights, or full anomaly documentation. Regulatory expectations were written for traditional software suppliers.
Open-source foundation model (Llama, Mistral, etc.)	No identified supplier entity	Deploying organization assumes full supplier responsibility: validation, maintenance, version control, anomaly tracking, incident response	No quality agreement possible. The QMS burden falls entirely on the deployer.
Agent framework / orchestration library	OSS or commercial	Same as above, based on licensing model	Framework updates may change agent behavior without semantic versioning signals
MCP tool integrations	Varies	Each tool integration is a system boundary requiring supplier qualification appropriate to the data classification it can access	External API access expands the effective supply chain
Agent memory infrastructure	Internal or vendor	Internal: first-party governance. Vendor: assess data residency, backup/recovery, retention controls	Memory stores may hold regulated data; the store's supplier must be qualified accordingly

The Open-Source Supplier Problem

GAMP 5 (pharma), ISO 13485 (medical devices), and SR 11-7 (financial services) assume an identifiable supplier with a quality system. Open-source foundation models have no such entity. The deploying organization must formally document that it assumes supplier responsibilities. This is not optional — it is the regulatory consequence of the build decision.

Documentation required:

Assumption of supplier responsibilities: A formal record stating that the organization assumes full validation, maintenance, monitoring, anomaly tracking, and incident response responsibilities for the open-source model.
Version management plan: How model versions are tracked, tested before upgrade, and rolled back if needed.
Anomaly tracking: How the organization monitors community-reported issues and assesses impact on its validated use cases.
Exit strategy: How the organization would migrate to a different model if the open-source project is abandoned or compromised.

Cross-Domain Qualification Minimum Requirements

Regardless of domain, agent supplier qualification should address:

Requirement	Why It Matters	Minimum Evidence
Data handling and residency	Regulated data must not leave compliant infrastructure	Data processing agreement or on-premises deployment confirmation
Version notification	Model updates change agent behavior	Version change notification procedure with minimum lead time
Availability SLA	Agent unavailability is an ICT operational risk	SLA documentation with incident notification commitments
Security posture	Agent infrastructure is an attack surface	Security assessment (SOC 2, ISO 27001, or equivalent)
Sub-processor visibility	Data may pass through additional third parties	Sub-processor list and flow-down requirements
Exit strategy	Concentration risk requires mitigation	Multi-model routing plan (P11) as DORA/third-party risk mitigation

Multi-vendor routing as qualification simplification. P11's multi-model routing strategy (routing tasks to the cheapest capable model) also reduces supplier qualification burden by preventing dangerous concentration in a single provider — a regulatory requirement under DORA for financial services, and a prudent risk management practice in all regulated domains. Each provider still requires qualification, but no single provider's failure can take down the entire capability.

Ecosystem References

This guide references standards and tools that are evolving rapidly. Rather than duplicate descriptions that will age, we list the categories that matter and point to the authoritative sources.

Standards under AAIF governance: MCP (agent-to-tool), A2A (agent-to-agent), Agent Skills (capability definition), AGENTS.md (repository-level constraints). The Agentic AI Foundation, launched December 2025 under the Linux Foundation, provides neutral governance across these protocols.

Specification-driven development frameworks: Multiple open-source frameworks enforce the specification-first workflow described in P2: specify before implementing, treat specs as code artifacts, and consume them at agent runtime. See Sources refs 43–47 for specific projects.

Memory and coordination infrastructure: Git-native agent memory systems, autonomous agent runtimes with infrastructure-level policy enforcement, and continuous integration benchmarks for structural regression. See Sources refs 40–42 for specifics.

The manifesto does not endorse specific tools. Its contribution is the governance model that applies across them. The Sources file carries the dated references; this guide carries the principles.

How to adopt the Agentic Engineering Manifesto in your organization: incremental steps, role evolution, change management, and success metrics.

Read the Manifesto for the core principles. Read the Companion Guide for implementation depth and worked patterns. Use this playbook to plan and drive the organizational change.

Making the Business Case

Agentic engineering is a technical discipline. But the organizational decision to adopt it — and sustain it through the J-curve dip before returns materialize — is a business decision that requires a business case.

The Competitive Logic

The organizations leading on AI are not winning because they have access to better models. The same foundation models are broadly available. They are winning because they can apply those models faster, with less risk, and at greater scale than competitors who are still governing AI with processes designed for human developers.

That advantage compounds. A team that verifies faster ships faster. A team that ships faster learns faster. Better learning sharpens specifications, which improves agent output, which reduces rework, which frees capacity for higher-value work. The Agentic Loop, run well, is a compounding return on engineering investment — not a one-time productivity gain.

The teams that build this flywheel early widen the gap continuously. The question for decision-makers is not "should we invest in agentic engineering?" but "how long can we afford not to?"

Stage-Gated Investment Model

Agentic engineering adoption is stage-gated investment, not a single project. Each phase transition has a distinct investment profile and return horizon:

Phase transition	Investment character	Return horizon	Key go/no-go signal
Phase 1→2 (exploration → assisted delivery)	Low: tooling licenses, standardization time	Immediate: measurable cycle time reduction on assisted tasks	AI suggestions accepted at a materially positive rate without increasing rework
Phase 2→3 (assisted → agentic prototyping)	Low-medium: specification discipline, review process	1–2 months	Agent outputs consistently reviewable; rework rate tracked
Phase 3→4 (prototyping → governed delivery)	Medium: evidence pipeline, evaluation suite, domain boundary encoding	2–4 months	Evidence completeness ≥95%; escaped defect rate ≤ human baseline
Phase 4→5 (governed → engineering scale)	Significant: platform ownership, memory governance, multi-domain expansion	4–8 months	Total cost of correctness declining per outcome; oversight load stable

Treat these as starting signals, not universal thresholds. Calibrate against your domain baseline and risk class.

Do not fund the next phase until the current phase has produced evidence that justifies it. Organizations that invest in Phase 4 governance infrastructure before they have Phase 3 evidence that agents produce reviewable output create bureaucracy, lose team confidence, and stall. The correct sequence: prove the model in one domain, then replicate. Replication is cheap once proven; the investment in proof is not recoverable if you skip it.

Business Outcome Metrics

Frame investment returns in business terms, not engineering activity:

Cycle time reduction. Time from specification to verified deployment. Target: halving cycle time for governed changes by Phase 4. This directly enables faster product iteration and competitive response.

Escaped defect rate. Post-release fixes cost 5–10× pre-release fixes. Every percentage-point reduction compounds into reduced incident cost, reduced remediation overhead, and reduced reputational risk.

Senior talent leverage. Risk-tiered verification routes low-risk changes through automated evidence pipelines, freeing senior engineers for architecture, evaluation design, and high-risk review. Track hours redirected from low-value review to high-leverage work.

Total cost of correctness. The full cycle cost: inference + verification + governance overhead + incident remediation. This replaces story points and velocity as the primary economic signal. Track it per domain, per phase, per quarter. If it is not declining, the phase transition has not delivered.

The New Way of Working

Converting Agile Ceremonies to Agentic Practice

Teams converting from Agile face a specific organizational challenge: the ceremonies are load-bearing. They are not decoration. They synchronize teams, surface blockers, and create accountability. Abolishing them without replacing the function they serve produces confusion and regression. The question is not whether to keep the ceremonies — it is what mechanism replaces each function.

The table below maps the core Agile ceremonies to their agentic equivalents. The intent of each ceremony is preserved; the mechanism changes to match machine-speed, evidence-based execution. These are starting points, not mandates. Adapt to the team's phase maturity and domain constraints.

Agile Ceremony	Intent	Agentic Equivalent	Mechanism
Sprint Planning	Agree on scope and how to build it	Spec Refinement & Tier Assignment	Domain owner and leads convert backlog items into machine-readable specifications with autonomy tier assignments and blast-radius classifications. Ambiguous items are decomposed until unambiguous — not estimated. The plan artifact is a specification, not a story-point count.
Daily Standup	Synchronize status and surface blockers	Trace Audit & Anomaly Review	Daily review of structured traces from the prior period. Tasks with unexpected tool calls, evaluation failures, or cost spikes are flagged for root-cause. The traces are the status; there is no verbal report. The review surfaces behavioral drift before it compounds into a hallucination loop.
Sprint Review	Demonstrate completed work to stakeholders	Evidence Bundle Review	Completed work is presented via evidence bundles: diffs, trace IDs, evaluation results, policy check outputs. Stakeholders review outcomes and audit quality, not demos. "The agent said it worked" does not pass review.
Retrospective	Reflect on process and improve it	Memory Curation & Skill Promotion	Review the knowledge base and learned memory from the cycle: what heuristics held, what failed, what should be promoted to reusable skill artifacts. Stale memory is pruned. Recurring failure patterns become new evaluation cases. The retro artifact is a memory diff, not a list of action items.
Backlog Refinement	Clarify and prioritize upcoming work	Specification Sharpening	Upcoming specifications are reviewed for constraint completeness, risk assignment, and observable success criteria. Items without measurable success criteria are not pulled into the next Spec Refinement cycle.
Release Planning	Coordinate cross-team work for a release	Governance Checkpoint	Cross-domain review of autonomy tier assignments, blast-radius gates, and evidence bundle completeness for all release-bound changes. The domain owner (P12) confirms accountability assignment before deployment.

The failure mode to avoid. Teams that attempt to run Agile ceremonies unchanged alongside agentic workflows typically end up with two parallel processes: the Agile process for humans and an ungoverned agentic process running in parallel. Both processes degrade. The table above collapses the two into one evidence-based, specification-driven workflow.

Phase calibration. At Phase 1–2, the Standup → Trace Audit conversion may be partial: teams are still building trace infrastructure. Start with a hybrid (brief verbal check plus whatever traces exist) and migrate fully once tracing is reliable. Do not adopt the full ceremony mapping before the infrastructure can support it.

Roles and the Human Side

How roles evolve (Developers, Tech Leads, QA Engineers, Operations Engineers) and the human dimension of the transition: naming the loss, the supervision paradox, the acceleration trap, sustainable pace, and protecting the junior pipeline.

Adoption Path and Phase Transitions

The six-step incremental adoption path (technical infrastructure for Phase 3+) and organizational change guidance for every phase transition from Phase 1→2 through Phase 5→6.

Resistance, Politics, and Your First Pilot

Navigating organizational friction (productivity dip, velocity metrics, cost conversation, incentive misalignment) and a concrete guide for running your first governed pilot.

Success Metrics and Failure Modes

Metrics by phase transition, team health indicators, quarterly review cadence, and common failure modes of the organizational change program.

How roles evolve and how to manage the human dimension of the transition.

Read the Manifesto for the core principles. See the Adoption Playbook for the full table of contents. See the Adoption Path for incremental steps and phase transitions.

How Roles Evolve

The transition from writing code to steering agents changes what each role owns day-to-day. This is not a minor adjustment. It is a fundamental shift in professional identity that must be named, supported, and managed — not imposed silently.

These role descriptions show one likely trajectory from current state toward Phase 5 (Agentic Engineering). The shift is progressive — no one wakes up in the end state. At Phase 2, developers still write most code; at Phase 3, they begin delegating and reviewing; at Phase 4, specifications and evidence become the primary work product. Read these as a direction of travel, not a before/after switch.

Developers

Before (Phase 1-2): Own code quality through implementation. Write features, fix bugs, review peers' code. At Phase 2, AI assists with suggestions and completions, but the developer remains the author. Professional identity is rooted in craftsmanship — the ability to think through a problem and express a solution precisely in code.

Transition (Phase 3-4): Begin delegating bounded tasks to agents. Write informal specifications with acceptance criteria. Review agent-generated output — initially every line, increasingly by evaluating evidence bundles as evaluation suites mature. Still write code directly for complex or ambiguous work where specification would cost more than implementation.

After (Phase 5): Own specification quality, constraint encoding, and outcome acceptance. Write machine-readable acceptance criteria and constraints. Review agent-generated diffs against specifications. Accept or reject outcomes based on evidence bundles. The core skill shifts from writing code to expressing intent precisely enough that agents can execute it, then refining intent based on evidence.

What this means in practice: The shift is gradual. At Phase 3, a developer might spend 70% of their time writing code and 30% reviewing agent output. By Phase 4, that ratio inverts for routine work. By Phase 5, the primary work product is the specification and the evaluation — implementation is delegated. But even at Phase 5, developers still read, understand, and occasionally write code. The skill doesn't disappear; it becomes the foundation for a harder skill.

The identity challenge: Many engineers became engineers because they love writing code. The shift to steering agents can feel like being told the skill they spent years mastering is suddenly less important. This is not imaginary — it is a real loss of craftsmanship that leaders must acknowledge. The new role is not lesser; it requires different and often harder skills (system-level reasoning, precise specification, critical evaluation of code you didn't write). But the transition needs support, not just announcement.

Tech Leads

Before: Own architectural decisions, code review standards, and technical direction. Mentor junior engineers through code review and design discussions.

After: Own domain boundaries, decision records, topology choices, and conflict-resolution rules. Design constraints that keep multi-agent collaboration reliable under load. The core skill shifts from reviewing individual code quality to designing system-level governance.

What this means in practice: Tech leads spend less time in code review and more time in constraint engineering: defining what agents may and must not do, choosing swarm topologies, and designing the evaluation portfolios that verify agent output at scale.

QA Engineers

Before: Own test plans, manual testing, and test automation. Verify that code behaves as specified through structured test execution.

After: Own evaluation portfolios, adversarial coverage, formal-invariant checks where needed, and evidence gates. The core skill shifts from executing tests to defining the contract between intent and behavior in machine-verifiable terms.

What this means in practice: QA engineers become the architects of the verification pyramid. They design evaluation suites that agents run autonomously, define adversarial test cases that probe agent behavior under stress, and set the evidence thresholds that gate promotion of changes from branch to production.

Operations Engineers

Before: Own deployment pipelines, monitoring, incident response, and infrastructure reliability.

After: Own behavioral observability, cost routing, memory governance, runtime safety, and chaos drills. The core skill shifts from keeping infrastructure running to keeping the feedback loop honest under real-world conditions.

What this means in practice: Operations engineers own a new category of infrastructure: agent runtime, memory stores, retrieval systems, and routing layers. They monitor not just uptime but behavioral drift, cost anomalies, and evaluation regression. Incident response expands to include agent-specific failure patterns (hallucination loops, memory poisoning, tier violations).

Demand-Layer Roles

The process-level definition of these roles — what each role does in the demand-to-specification workflow — is in Demand & Value. This section covers the organizational and team-evolution perspective: how these roles emerge, what skills they require, and how they develop through the ASDLC maturity phases.

The roles above operate within the engineering execution loop (Layer 2 of the ASDLC). The ASDLC's demand layer (Layer 1) introduces roles that do not have a natural home in the current engineering role taxonomy:

Product Owner (agentic context). Accountable for specification readiness and business value definition at the Layer 1→2 boundary. This role differs from the traditional product owner role in one critical respect: in an agentic context, the product owner's primary output is a loop-ready specification — not a backlog of user stories for human engineers. The specification must be machine-readable, have verifiable acceptance criteria, and pass the specification readiness gate before entering the loop. The product owner owns the success criterion and the business-level Definition of Done, distinct from the engineering DoD.

Business Demand Sponsor. Accountable for the business need validation that precedes specification. This role may be a business analyst, a domain expert, or a senior business stakeholder. Their function is to ensure the need is real (evidence-backed, not assumed), the value is quantifiable, and the out-of-scope is explicitly stated before the product owner begins translating the need into a specification.

Specification Analyst. Responsible for the translation from validated business need to loop-ready specification. In smaller organisations this is often the product owner; in larger organisations it is a distinct role that understands both the domain and the manifesto's specification requirements (machine-readable acceptance criteria, versioned constraints, blast-radius framing).

For the authoritative process-level definition of these roles and their responsibilities in the demand layer workflow, see Demand & Value.

Talent Density and Organizational Design

Role evolution tells you what people do differently. Talent density tells you how many people of what kind you need to build an organization that can actually deliver this. These are separate questions, and confusing them is how organizations end up with the right job descriptions but the wrong structure.

The Build-vs-Buy Decision by Phase

The default assumption — outsource early, build in-house later — is correct in principle but often applied too late. The governance capabilities at the core of agentic engineering (evaluation design, memory governance, autonomy tier management, observability of reasoning) are not purchasable as a service. They must be built as organizational muscle, and that requires in-house practitioners who own the outcomes.

A practical guide by phase:

Phase	In-house minimum	Where external help makes sense	What must not be outsourced
Phase 1–2	Core engineering team using AI tools; no specialist role needed	AI tool vendor support; training	Judgment on which AI outputs are acceptable
Phase 3	At least one engineer who owns specification quality	Tool configuration, infrastructure setup	Specification writing; failure pattern documentation
Phase 4	Domain owners; QA lead owning evaluation suite; one ops engineer owning observability	Platform infrastructure, CI/CD pipeline build	Evidence gate design; autonomy tier policy; incident response
Phase 5	Platform team (3–5 engineers): agent runtime, memory governance, routing; evaluation guild; security lead for agent threat model	Specialized formal methods expertise (targeted, time-bounded)	All governance roles; evaluation ownership; incident accountability

The practical target by Phase 4: the majority of people doing agentic delivery work are in-house, the majority of those are practitioners who build and own outcomes (not coordinators or oversight layers), and the majority of those are operating at a competent-or-above level in their role. Organizations that invert this — heavy external dependency, high coordinator-to-practitioner ratio, or large numbers of engineers operating below the competency threshold for agentic work — will not reach Phase 5. The governance infrastructure requires practitioners who understand what they are governing.

Team Size and Composition by Phase

Agentic engineering does not scale the way traditional software teams scale. Adding headcount at Phase 3 before governance infrastructure exists creates coordination problems that compound with agent output volume. The right trajectory is:

Phase 1–3: Small, high-trust teams (3–8 people). The primary bottleneck is governance design, not delivery throughput. Adding people before governance patterns are established creates more output to govern, not more governance capacity.
Phase 4: Governance roles become explicit. Minimum viable structure: a domain owner per active agent domain, a QA lead owning the evaluation portfolio, and one platform/operations owner. Total team size for a single pilot domain: 5–10 people including these roles.
Phase 5+: Platform team separates from delivery teams. Shared infrastructure (evaluation registry, trace standards, routing layer, memory governance) is owned by a dedicated platform function, not embedded in each delivery team. Delivery teams remain small (5–8 people each) and multiply across domains, sharing platform infrastructure. Scale comes from replicating the governed delivery model across domains, not from growing individual teams.

The Skill Density Requirement

The transition to agentic engineering concentrates the value of high-skill practitioners. A senior engineer who can write precise machine-readable specifications, design adversarial evaluation cases, and reason about blast radius is more valuable in a Phase 4 team than in a traditional team — because their work governs an agent that produces the output of several engineers. A junior engineer who cannot yet write reviewable specifications creates bottlenecks, not throughput.

This creates a real organizational challenge: the skills most needed (evaluation design, specification engineering, memory governance, observability of reasoning) are not standard hiring criteria and are not covered by most engineering bootcamps or degree programs. Build an explicit skills development path covering specification engineering, memory governance, and observability of reasoning — from prompt engineering fundamentals through agentic system design. Do not assume the market supplies practitioners ready-made — it does not, at scale, yet.

The Human Side of the Transition

Adopting agentic engineering is not purely a technical change. It is an organizational transformation that directly affects people's professional identity, daily work, and career trajectory. Ignoring the human dimension is how organizations lose their best engineers during the transition.

Naming the Loss

The shift from writing code to steering agents involves a genuine loss of craftsmanship for many engineers. AI made producing code easier and made being an engineer harder — and both things are true simultaneously. Engineers who raised concerns about this shift have too often been told, explicitly or implicitly, to "just adapt faster."

That is not how you build a sustainable engineering culture. Leaders must acknowledge that the transition asks people to redefine what they do and who they are professionally. This acknowledgment is not a sign of weakness — it is a prerequisite for maintaining a team that trusts you enough to follow you through the change.

The Supervision Paradox

Reviewing AI-generated code is often harder than writing code yourself. When you write code, you carry the context of every decision. When AI writes code, you inherit output without reasoning. You see the code but not the decisions behind it. This is why the manifesto insists on traces that capture reasoning, not just events (Principle 9). But leaders must also recognize that the cognitive load of reviewing agent output at volume is a new kind of burden that doesn't appear in productivity metrics.

If your engineers spend their days as judges on an assembly line, stamping pull requests that never stop coming, production volume went up but the sense of craftsmanship went down. That is not a morale problem to be managed. It is a workflow design problem to be solved — through better specifications (reducing the need for review), better evaluations (automating the reviewable parts), and better traces (making the non-automatable review faster).

The Acceleration Trap

AI makes certain tasks faster. Faster tasks create the perception of more available capacity. More perceived capacity leads to more work being assigned. More work leads to more AI reliance. More AI reliance leads to more code that needs review, more context to maintain, more systems to understand, and more cognitive load on engineers already stretched thin.

This cycle — what researchers have called "workload creep" — is self- reinforcing. It looks like productivity from the outside (velocity charts go up, more PRs merged, more features shipped) while quality quietly erodes, technical debt accumulates, and the people doing the work run on fumes.

The perception gap makes the trap invisible from inside. A rigorous 2025 study found that experienced developers using AI tools took 19% longer to complete tasks than developers working without them — while believing AI made them 24% faster. They were wrong not just about the magnitude but about the direction of the change. This perception gap is where the acceleration trap becomes self-reinforcing: teams believe they have more capacity, take on more work, and never measure whether the capacity was real. When the J-curve adoption dip arrives — productivity declining before improving as new workflows mature — teams that have already overcommitted have no slack to absorb the dip.

The corrective: set explicit throughput limits per engineer that account for the full cycle (specification + agent execution + verification + review), not just the implementation phase. Measure outcomes (defect rate, incident severity, customer impact) alongside output volume. When output goes up and outcomes don't improve, the acceleration trap has closed.

Sustainable Pace

The manifesto optimizes for correctness, governance, and economics. But governance that burns out the humans governing it is self-defeating. Sustainable pace is not a nice-to-have — it is a precondition for the human accountability that the entire manifesto depends on.

Track team health alongside system health. Burnout indicators (review latency spikes, approval rubber-stamping, rising escaped defect rates) are system health signals — they indicate that the human layer of governance is degrading. When these signals appear, the correct response is to reduce autonomy scope or simplify governance, not to push harder.

Protecting the Junior Pipeline

If junior engineers traditionally learned by doing routine work — fixing small bugs, writing straightforward features, implementing well-defined tickets — and agents now handle that work, the training ground is disappearing.

This is not just a concern for individual careers. It is a systemic risk: if junior engineers never develop foundational skills through hands-on work, the industry will face a shortage of senior engineers who truly understand the systems they oversee. You cannot supervise what you never learned to build.

This is an organizational policy choice, not a universal staffing rule.

Concrete actions:

Dedicate a portion of agent-suitable work to junior engineers as learning tasks, even when an agent could do it faster. The efficiency cost is an investment in the talent pipeline.
Use agent output as teaching material: juniors review agent-generated code, identify weaknesses, and write the evaluations that catch those weaknesses. This builds judgment faster than writing boilerplate ever did.
Pair junior engineers with agents rather than replacing their work with agents. The junior specifies, the agent implements, the junior evaluates. This builds specification and evaluation skills from day one.
Create structured progression paths that move juniors from "evaluating agent output" to "designing specifications" to "architecting constraints" — making the skill development explicit rather than hoping it happens through osmosis.

The technical infrastructure for governed delivery and organizational change guidance for every phase transition.

Read the Manifesto for the core principles. See the Adoption Playbook for the full table of contents. See the Roles and the Human Side for how roles evolve during the transition.

Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. Phase transition criteria and go/no-go thresholds in this document are heuristics — calibrate to local domain and baseline before applying. See glossary.md for canonical term definitions.

For V-model organizations: If your organization operates a traditional V-model SDLC (common in life sciences, medtech, aerospace, automotive, and regulated financial services), see adoption-vmodel.md for a V-model-specific variant of this adoption path that preserves your existing verification structure while transitioning to agentic execution.

ASDLC context: The phases and transitions described in this document are Layer 2 maturity phases — the maturity of the engineering execution loop, as defined in the Agentic Engineering Manifesto. A team at Phase 5 of the inner loop has highly mature engineering execution. That does not imply maturity in the demand layer (how well business needs are validated before entering the loop), the release layer (how governed the path to production is), or the operations layer (how well the team can maintain and operate what has been delivered). Those layers have their own maturity assessment, covered in the ASDLC.

Recommended sequencing: build Phase 3 inner-loop maturity (governed agentic delivery in at least one domain) before investing heavily in outer-layer governance. A team with no reliable inner loop cannot benefit from improved demand management or release governance — those investments require a working execution engine to govern. The correct sequence is: prove the inner loop, then extend outward.

Incremental Adoption Path

This section describes the technical infrastructure you build to support governed agentic delivery. It assumes your team is at or approaching Phase 3 (agents executing autonomously) and wants to reach Phase 4 and beyond. If you are at Phase 1 or 2, start with the Phase Transitions section below — it covers the organizational changes needed before this infrastructure makes sense.

The seven steps below roughly map to the Phase 3→4 transition (Steps 1–3), the Phase 4→5 transition (Steps 4–6), and ongoing expansion (Step 7). Each step is described at the level of what you actually need to do — not just what the target state looks like.

Step 1: Define Domain Boundaries and Autonomy Tiers

What to do: Map your codebase into domains with clear ownership. For each domain, define what agents may do (Tier 1: analyze and propose; Tier 2: write to branches; Tier 3: production actions) and what they must not do. Encode these as infrastructure-level permissions, not prompt instructions.

Who leads: Tech leads, with input from security and operations.

Minimum viable version: Start with one domain. Define Tier 1 only for the first pilot domain (agents can analyze and propose, zero blast radius). This is safe, reversible, and immediately useful as a learning exercise.

Timeline: 2-4 weeks for initial domain mapping. Ongoing refinement.

Success signal: You can answer "what is this agent allowed to do in this domain?" for every active agent, and the infrastructure enforces the answer.

Step 2: Require Evidence Bundles for Every Merged Change

What to do: Define the minimum evidence bundle for your current phase (Phase 3: tests, diff, trace link, rollback note). Integrate evidence collection into your CI/CD pipeline so it's automatic, not manual.

Who leads: QA engineers and CI/CD owners.

Minimum viable version: Require a diff, a test report, and a rollback command for every agent-generated PR. Block merge without these. This adds minutes per PR, not hours.

Timeline: 1-2 weeks to configure CI gates. 1-2 sprints to normalize.

Success signal: No agent-generated change merges without an evidence bundle. Engineers stop saying "the agent said it worked" and start pointing at evidence.

Step 3: Add Regression Gates Before Expanding Autonomy

What to do: Build a regression evaluation suite for each domain where agents operate. Every agent-generated change must preserve or improve evaluation performance. Failed evaluations block merge.

Who leads: QA engineers, with domain expertise from developers.

Minimum viable version: Start with existing tests. Add behavioral regression tests for the most common agent failure patterns in your domain. Ten well-chosen regression cases are more valuable than a hundred boilerplate tests.

Timeline: 2-4 weeks for initial suite. Continuous expansion.

Success signal: Escaped defect rate for agent-generated changes is equal to or lower than for human-generated changes.

Step 4: Add Adversarial and Security Evaluations on Exposed Surfaces

What to do: For any agent-generated code that touches external-facing surfaces (APIs, user interfaces, data pipelines), add adversarial test cases: injection attacks, malformed inputs, edge cases, authorization bypasses.

Who leads: Security engineers and QA.

Timeline: 2-4 weeks per exposed surface.

Success signal: No agent-generated change reaches an external surface without adversarial evaluation coverage.

Step 5: Establish Durable Coordination State

What to do: Before expanding to multi-agent topologies or long-running agent tasks, build the coordination substrate that prevents duplicate work, orphaned tasks, and post-restart divergence. The minimum infrastructure:

Work ledgers: A single source of truth for what tasks are active, claimed, completed, or abandoned. Without this, concurrent agents duplicate effort or leave work silently unfinished.
Lease-based task ownership: Agents claim tasks with time-bounded leases. If an agent crashes or stalls, the lease expires and the task becomes available for reassignment. Without leases, orphaned tasks accumulate silently.
Restart-safe handoffs: Agent state must survive restarts. If an agent is interrupted mid-task, the next agent (or the same agent after restart) must be able to resume from a well-defined checkpoint rather than starting over. Design for replay safety: re-executing a handoff must produce the same result, not duplicate side effects.

Who leads: Platform/infrastructure engineers.

Minimum viable version: A shared task ledger with lease expiration for one multi-agent workflow. This can be as simple as a database table with claim timestamps and TTLs.

Timeline: 2–4 weeks for initial ledger. Ongoing refinement as topologies expand.

Success signal: No orphaned tasks after agent crashes. No duplicate work across concurrent agents. Restart produces resumption, not repetition.

Step 6: Pilot Formal Contracts on One High-Blast-Radius Path

What to do: Select one critical path (e.g., payment processing, data integrity constraint, authentication flow) and add machine-checkable contracts (preconditions, postconditions, invariants). This is not full formal verification — it is contract-first development on a narrow scope.

Who leads: Senior engineers with architecture responsibility. May require external expertise in formal methods — see the Skill Requirements table in the Companion Guide.

Timeline: 4-8 weeks for initial pilot.

Success signal: The contracted path has zero escaped defects from contract-violating changes over the pilot period.

Step 7: Expand Only When Incident Rate and Economics Improve

What to do: Before expanding agent autonomy (promoting from Tier 1 to Tier 2, or from one domain to multiple domains), verify that the current scope is working: incident rate is flat or declining, total cost of correctness is acceptable, and governance overhead is sustainable.

Who leads: Engineering leadership with input from operations and finance.

Expansion criteria: Incident rate stable or improving for two consecutive quarters. Total cost of correctness declining per outcome. Human oversight load (reviews per domain owner) is sustainable.

Organizational Change by Phase Transition

The manifesto defines six maturity phases. The Companion Guide provides full definitions and failure modes for each. Here is a summary for reference:

Phase 1 — Guided Exploration. Single prompts, no structure, no memory.
Phase 2 — Assisted Delivery. AI as autocomplete; humans execute.
Phase 3 — Agentic Prototyping. Agents execute within a single session; limited verification.
Phase 4 — Agentic Delivery. Basic guardrails: autonomy tiers, evaluation gates, persistent memory. Single-domain.
Phase 5 — Agentic Engineering. Structured autonomy at scale. Multi-domain, evidence-driven, continuous Agentic Loop.
Phase 6 — Adaptive Systems. Self-improving infrastructure within governed boundaries. Frontier capabilities required.

Each transition below describes what changes organizationally, what actions to take, and what makes the transition hard.

Investment and Organizational Sizing by Phase

Every phase transition has both a technical dimension (what you build) and an organizational dimension (what you fund, who you hire or develop, what you stop doing). This table gives decision-makers the investment framing alongside the technical steps:

Calibrate all phase-transition metrics to domain baseline and incident history.

Phase transition	Typical investment	Team change	Primary cost driver	ROI signal
Phase 1→2	Tooling licenses (low); process standardization (1–2 weeks engineering time)	No new roles; existing team adopts AI tools	Tool cost + standardization overhead	Cycle time reduction on AI-assisted tasks; measurable in weeks
Phase 2→3	Specification training (1–3 weeks); review process redesign	No new roles; senior engineers develop specification discipline	Senior engineer time on specification + review patterns	Reviewable agent output without excessive rework
Phase 3→4	CI/CD evidence pipeline (4–8 weeks engineering); evaluation suite build (4–8 weeks QA); domain boundary encoding (2–4 weeks tech lead)	Add: QA lead owning evaluations; explicit domain owners	Evaluation suite build is the primary investment	Escaped defect rate ≤ human baseline; evidence completeness ≥95%
Phase 4→5	Platform team formation (3–5 engineers, ongoing); shared evaluation registry; trace standards; memory governance infrastructure	Add: platform team separates from delivery; multiply delivery teams across domains	Platform infrastructure and governance capability building	Total cost of correctness declining per outcome; oversight load stable while output scales
Phase 5→6	Formal methods expertise (targeted, time-bounded); independent audit paths; self-improvement governance	Formal verification specialists (targeted hire or consultant); independent validation function	Specialized expertise and governance overhead for self-improving systems	Phase 6 is a frontier, not a universal target — assess only when Phase 5 is fully stable across all critical domains

Decision discipline: Do not fund the next phase until the current phase has produced evidence that justifies it. If the go/no-go signals fail for two review cycles, freeze expansion and re-baseline before proceeding. This is not conservatism — it is the mechanism that prevents the most common failure: organizations that invest in Phase 4 governance infrastructure before Phase 3 evidence exists that agents produce reviewable output. The infrastructure becomes bureaucracy, teams lose confidence, and the initiative stalls.

Phase 1 → 2: From Exploration to Assisted Delivery

What changes: You move from unstructured experimentation ("let's see what ChatGPT can do") to repeatable AI-assisted workflows where humans remain in the loop for every action. Agents go from novelty to daily tool.

This transition matters to the manifesto because it builds the foundation for two things that every later phase depends on: the habit of evaluating AI output critically (the seed of Principle 8 — Evaluations), and the organizational muscle of defining what tools may and must not do (the seed of Principle 5 — Autonomy). Teams that skip Phase 2 arrive at Phase 3 with no discipline around either, and Phase 3 is where the consequences start compounding.

Organizational actions:

Identify the tasks where AI assistance delivers consistent value (code completion, test generation, documentation drafting) and standardize tooling around them
Establish basic usage guidelines: what models are approved, what data may be shared with them, what outputs require human review before use
Begin measuring where AI assistance actually saves time versus where it creates rework — intuition is unreliable here; this is your first encounter with the economics principle (Principle 11) at the simplest possible scale
Run a lightweight retrospective: which experiments from Phase 1 produced real value, and which were demos that impressed but didn't stick?

The hard part: The organizational challenge is not technical — it is cultural. Phase 1 generates enthusiasm and a sense of possibility. Phase 2 demands that you kill the experiments that felt exciting but don't produce repeatable value. Teams that skip this curation step carry forward a scattered toolset of one-off prompts and ad-hoc workflows that no one else can reproduce. Worse, they develop a false confidence that "we're already doing AI" which becomes a barrier to the deeper changes Phase 3 requires. This is primarily a curation and standardization exercise, not a technical build.

Phase 2 → 3: From Assisted Delivery to Agentic Prototyping

What changes: You move from AI-as-autocomplete (human executes, AI suggests) to agents that execute autonomously within a single session. The human stops typing every line and starts delegating whole tasks. This is the moment the team realizes prompting is not engineering.

Organizational actions:

Select 2-3 bounded tasks where agents can execute end-to-end within a session (e.g., generate a module from a spec, write a test suite for an existing component, refactor a file according to a style guide)
Require human review of every agent-generated output before merge — no exceptions. At this phase, the agent has no memory, no verification pipeline, and no guardrails beyond the prompt
Begin documenting failure patterns: where agents hallucinate, where they miss edge cases, where they produce plausible-looking code that fails silently. This documentation becomes the seed for your evaluation suite in Phase 4
Start writing specifications with explicit acceptance criteria, even if informally. The habit of defining "what does done look like" before the agent starts is the single most important skill for everything that follows

The hard part: The supervision paradox hits here for the first time. Reviewing agent-generated code is harder than writing it yourself — you inherit output without context. Teams that don't acknowledge this will either rubber-stamp agent output (creating quality risk) or reject the workflow entirely (losing the productivity gain). Neither is acceptable. The answer is better specifications and the beginning of structured evaluation, which is exactly what Phase 4 formalizes. Expect this transition to take longer than Phase 1→2 — your team is moving from "AI suggests, I decide" to "AI executes, I verify," and learning to verify well takes practice.

Phase 3 → 4: Governed Delivery Foundation

What changes: You move from "agents do things and we hope they work" to "agents do things within defined boundaries with evidence."

Organizational actions:

Add CI/CD evidence and policy gates
Assign domain owners and escalation rotations
Standardize incident classification and rollback drills
Begin tracking evidence bundle completeness and escaped defect rate

The hard part: Convincing teams that the evidence overhead is worth it when they're already shipping faster than ever. The acceleration trap makes governance feel like a brake. Frame it as insurance, not bureaucracy: the evidence bundle is what lets you expand autonomy later. Without it, you're stuck at Phase 3 forever. Start with a single domain; parallel rollout across domains is possible but increases coordination overhead.

Phase 4 → 5: Engineering-Scale Transition

What changes: You move from single-domain, reactive governance to multi-domain, evidence-driven engineering. This is the hardest transition because it requires organizational change, not just tooling.

Organizational actions:

Establish shared evaluation registry and trace standards
Create platform ownership for agent runtime, routing, and memory governance
Formalize security reviews for tools, connectors, and shared state
Invest in the "Rare" skills identified in the Companion Guide's Skill Requirements table: distributed systems design, memory governance, ML/retrieval engineering, chaos engineering

The hard part: This transition often requires new roles or responsibilities that don't exist in the current org chart. "Platform ownership for agent runtime" is not something most organizations have. You are creating infrastructure categories, not just adopting tools. This is not a sprint goal — it is an organizational redesign that unfolds over multiple quarters.

Phase 5 → 6: Adaptive Frontier

What changes: Systems begin improving themselves within governed boundaries. This is a frontier — not all organizations need to reach Phase 6, and the capabilities required (formal verification, causal reasoning, provable containment) are still maturing.

Organizational actions:

Require governance for self-updating specifications and routing policies
Maintain independent audit paths for high-impact domains
Treat formal methods expertise as targeted specialization, not universal role

The hard part: Knowing when you're ready. Phase 6 without Phase 5's discipline is how you get self-improvement without containment — the system optimizes the metric, not the goal. Do not attempt Phase 6 until Phase 5 is stable across all critical domains.

For organizations transitioning from a traditional V-model SDLC to agentic engineering. This is a V-model-specific variant of adoption-path.md.

Read the Manifesto for the core principles. Read the Companion Guide for implementation depth. Read the Adoption Playbook for organizational change management.

For generic (non-V-model) adoption steps, see adoption-path.md.

Core Thesis

The transition should not throw away the V-model.

In life sciences, aerospace, automotive, and regulated financial services, the V-model survives because it solves real problems: it forces early definition of intended use and design inputs, creates explicit traceability between requirements and evidence, distinguishes verification from validation, and fits quality systems, change control, and audit expectations.

What changes in an agentic SDLC is not the need for rigor. What changes is the way rigor is expressed:

specifications become more structured and machine-readable
verification becomes more automated, layered, and continuously replayable
validation remains human-owned, but is better instrumented
traceability moves from manual spreadsheet labor to generated evidence graphs
implementation shifts from direct human authorship to governed agent execution inside bounded harnesses

The right goal: keep the V-model's assurance logic, but retool its artifacts, gates, and execution model for agents.

This is a transition framework for agentic engineering inside a V-model organization, not a proposal to discard the V-model.

What Should Stay the Same

Intended use, risk classification, and release responsibility remain human accountabilities.
Verification and validation remain distinct disciplines.
Change control, approval records, and traceability remain mandatory.
Higher-risk functions retain stricter review, narrower autonomy, and stronger evidence requirements.
Validation against clinical, operational, or business reality cannot be delegated fully to an agent.

What Should Change

Applicability varies by domain, qualification regime, and tool qualification constraints.

Requirements become versioned, structured, and reusable by both humans and agents.
Architecture is encoded as enforceable constraints, not just diagrams and prose.
Verification plans become executable evaluation suites.
Evidence bundles are assembled automatically from traces, tests, policies, and artifacts.
Agents assist with decomposition, implementation, regression analysis, traceability, and document assembly under explicit autonomy tiers.
Post-release monitoring and periodic revalidation become part of the same lifecycle, not a separate operational afterthought.

The Traditional V-Model

Stakeholder         <----------------->         Acceptance
Requirements                                  Testing
    \                                           /
  System            <----------------->       System
  Requirements                              Testing
      \                                       /
    Architecture    <----------------->     Integration
    Design                                Testing
        \                                   /
      Detailed      <----------------->   Unit
      Design                            Testing
          \                               /
            --->  Implementation  <---

Each left-side phase produces a specification. Each right-side phase verifies that specification. The horizontal arrows are traceability links. The bottom of the V is human implementation.

The Agentic V-Model

Outcome             <----------------->        Acceptance &
Specifications                               Accountability
(P1, P2)                                     (P12, P8)
    \                                           /
  System            <----------------->       System-Level
  Specifications                            Evaluation
  (P2, P3)                                  (P10, P8)
      \                                       /
    Agent           <----------------->     Cross-Agent
    Architecture                          Verification
    (P3, P4, P5)                          (P9, P10)
        \                                   /
      Context &     <----------------->   Per-Agent
      Domain Design                     Evaluation
      (P6, P7, P11)                     (P8, P9)
          \                               /
            --->  Agent Execution  <---
                  (Bounded Autonomy)

The structural symmetry is preserved: every specification level maps to a verification level. But every layer has changed in substance.

Classical V-model stage	Agentic equivalent	What changes	Human accountability remains at
User needs / intended use	Structured intent package	Intended use, hazards, workflow assumptions, risk class, and success criteria become explicit machine-readable inputs	intended use, risk acceptance, go / no-go
System requirements	Versioned requirement contracts	Requirements include acceptance criteria, stop criteria, data constraints, and traceability IDs	requirement approval and scope decisions
High-level design	Enforced architectural policy	ADRs, bounded contexts, tool permissions, and data boundaries become executable constraints	boundary ownership and exception approval
Detailed design	Executable specifications	Interfaces, invariants, state models, and critical decision rules become machine-checkable	design review at risk-based depth
Implementation	Harnessed agent execution	Agents draft, implement, refactor, and document inside governed sandboxes	autonomy tier approval and exception handling
Unit / component verification	Deterministic verification-as-code	Tests, static analysis, contracts, replay, and proofs are run automatically	review of failures, waivers, and critical evidence
Integration verification	Tool, workflow, and protocol verification	Agents and tools are verified as a system, not component by component only	approval of integration evidence and unresolved deviations
System verification	Evaluation harnesses	End-to-end workflows, adversarial cases, reliability, and economics are evaluated continuously	decision on fitness for intended technical use
Validation	Human-led contextual validation	Workflow fit, clinical or user value, and real-world operating assumptions are assessed with stronger instrumentation	validation conclusion and release decision
Maintenance / changes	Continuous revalidation loop	Drift, regressions, memory updates, and agent policy changes re-enter change control	periodic revalidation and CAPA ownership

Layer-by-Layer Transformation

Level 1: Stakeholder Requirements --> Outcome Specifications

Traditional: Business analysts translate stakeholder needs into a requirements document. Requirements are written in natural language, reviewed by humans, and baselined.

Agentic: Outcome specifications replace requirements documents. Specifications are machine-readable: they define what "done" means in terms agents can evaluate autonomously (Principle 1). They include acceptance criteria, boundary conditions, blast-radius constraints, and validation criteria — not just verification criteria.

Validation ("did we build the right thing?") becomes a first-class concern because agents can satisfy every verification check and still produce the wrong outcome.

What changes in practice:

Requirements become executable specifications with machine-readable acceptance criteria
Validation criteria are defined upfront alongside verification criteria
Specifications are versioned artifacts that evolve through the Agentic Loop
Traceability is automatic: the specification is the input to the agent, and the trace records the link between specification and execution

Level 2: System Requirements --> System Specifications

Traditional: System engineers decompose stakeholder requirements into system requirements. Each system requirement is testable and traceable.

Agentic: System specifications define domain boundaries, inter-domain contracts, and the constraints that agents must respect (Principle 3: defense-in-depth). Each domain has a clear owner, a defined autonomy tier, and machine-enforceable boundaries.

The key shift: system specifications are infrastructure-level constraints that the runtime enforces. An agent that violates a domain boundary is blocked by the system, not caught in review.

What changes in practice:

System requirements become enforceable domain boundaries and typed contracts
Decomposition is driven by blast radius and autonomy tiers, not just functional decomposition
Each domain specifies its evidence bundle requirements by phase

Level 3: Architecture Design --> Agent Architecture

Traditional: Software architects define the system structure: components, interfaces, data flows, deployment topology.

Agentic: Agent architecture defines the topology of the agentic system: how many agents, what roles, what coordination pattern (Principle 4). It defines the autonomy tier for each agent (Principle 5) and the defense-in-depth layers that wrap probabilistic decisions in deterministic infrastructure (Principle 3).

Architecture also encompasses context architecture (Principle 7) and memory architecture (Principle 6).

What changes in practice:

Component diagrams become agent topology diagrams with explicit authority relationships
Data flow diagrams include context flow, memory flow, and cost flow
Architecture decisions include model selection rationale, routing policies, and cost targets (Principle 11)

Level 4: Detailed Design --> Context and Domain Design

Traditional: Module-level design defines the internal structure of each component. This is the last specification level before coding begins.

Agentic: Context and domain design defines what each agent needs to execute correctly: its context budget, retrieval configuration, memory access, tool permissions, and evaluation criteria.

This layer is where most agentic projects fail. Teams jump from architecture to execution without specifying what context each agent should see, what tools it may use, what cost limits apply, or what evaluation criteria define success.

What changes in practice:

Module designs become agent configuration specifications
Algorithm selection becomes model selection with cost/quality tradeoffs
Data structure design includes memory store design with the five governance properties (provenance, expiration, compression, rollback, domain scoping)
Internal interfaces become tool contracts with typed inputs/outputs

Level 5 (Bottom): Implementation --> Agent Execution

Traditional: Developers write code. The bottom of the V — the only layer where artifacts are produced rather than specified or verified.

Agentic: Agents execute within bounded autonomy. They receive specifications, context, and tool access. They produce code, artifacts, decisions, or actions. They generate traces of their reasoning. They are evaluated against criteria they may not see (evaluation holdout).

The fundamental shift: implementation is delegated. The human's role moves from writing code to defining the conditions under which agents execute and verifying the results.

What changes in practice:

Coding sessions become agent execution sessions with full trace capture
Code review becomes evidence bundle review (diff + tests + trace + rollback)
The implementation artifact includes not just the code but the trace that explains how and why it was produced

Level 6 (Right, ascending): Unit Testing --> Per-Agent Evaluation

Traditional: Unit tests verify that each module behaves as designed.

Agentic: Per-agent evaluation portfolios verify that each agent's output meets its specification (Principle 8). This includes happy-path validation, adversarial testing, regression coverage, and behavioral checks. Evaluation holdout prevents agents from overfitting to visible criteria.

Structured traces (Principle 9) make every agent decision inspectable.

What changes in practice:

Unit tests become evaluation portfolios with four coverage categories
Test execution becomes continuous evaluation on every change
Test reports become structured traces queryable by any dimension

Level 7: Integration Testing --> Cross-Agent Verification

Traditional: Integration tests verify that subsystems work together.

Agentic: Cross-agent verification confirms that agents interacting across domain boundaries produce correct system-level behavior. This includes trace correlation across agent chains, drift detection, and cost anomaly monitoring.

This layer also includes the behavioral vs. structural regression distinction: an agent's output may pass all current evaluations but degrade the codebase's capacity for future change.

What changes in practice:

Integration test suites become cross-domain evaluation portfolios
Interface testing becomes trace correlation and provenance verification
Structural regression monitoring is added alongside behavioral regression

Level 8: System Testing --> System-Level Evaluation

Traditional: System tests verify the complete system against system requirements including non-functional requirements.

Agentic: System-level evaluation includes chaos testing (Principle 10) and threat modeling for agentic systems. This tests what happens when tools fail, retrieval is noisy, memory is corrupted, or agents interact in unexpected ways.

What changes in practice:

System test plans become chaos testing plans with safety models
Security testing becomes agentic threat modeling (prompt injection, memory poisoning, agent impersonation, data exfiltration)
Non-functional testing includes total cost of correctness measurement

Level 9 (Top right): Acceptance Testing --> Acceptance and Accountability

Traditional: Acceptance tests verify the system against stakeholder requirements. In regulated industries, this includes formal sign-off by a qualified person.

Agentic: Acceptance and accountability verification confirms that a named human can inspect the reasoning, review the evidence, and own the outcome of every production agent (Principle 12). This is tier-calibrated governance:

Tier 1 (Observe): Human executes every action. Accountability is inherent.
Tier 2 (Branch): Human owns constraint design and evaluation portfolio.
Tier 3 (Commit): Human owns policy design, sampling strategy, incident response. Automated enforcement handles routine checks.

Evidence bundles are the acceptance artifact: diff, tests, trace, rollback command, policy checks, and cost accounting — all phase-gated and immutable.

What changes in practice:

UAT becomes evidence bundle review with tier-appropriate depth
Formal sign-off becomes accountability assignment with trace-backed evidence
Release gates become phase-calibrated evidence thresholds
Post-release monitoring becomes continuous behavioral observability

What the Agentic V-Model Adds

The agentic V-model is not simply "V-model with AI at the bottom." It adds structural elements that the traditional V-model does not address:

Continuous verification. The traditional V-model verifies at gates. The agentic V-model verifies continuously — evaluations run on every change, not at phase transitions.

Emergence testing. The traditional V-model assumes deterministic implementation. Agentic systems are probabilistic and exhibit emergent behavior. Chaos testing and containment engineering have no equivalent in the traditional V.

Behavioral observability. The traditional V-model verifies correctness at each level. The agentic V-model also monitors for drift, anomaly, and constraint violation in real-time between verification levels.

Accountability under non-determinism. When agents implement at scale, comprehensive inspection is impossible. The agentic V-model replaces direct inspection with tiered governance: humans own the constraints, evaluations, and evidence model — not every individual output.

Economic optimization. The traditional V-model does not address the cost of verification. The agentic V-model includes economics as a first-class concern (Principle 11).

ALCOA+ Traceability Through the Agentic V-Model

For GxP and regulated environments, the agentic V-model produces ALCOA+ compliant records at every layer by construction:

V-Model Layer	Record Produced	ALCOA+ Properties Satisfied
Outcome Specifications	Versioned, machine-readable specs	Original, Legible, Enduring
System Specifications	Domain boundaries, autonomy tiers	Consistent, Complete
Agent Architecture	Topology decisions, routing policies	Attributable, Accurate
Context/Domain Design	Agent configurations, tool scopes	Complete, Consistent
Agent Execution	Structured traces with full reasoning	Contemporaneous, Attributable, Original
Per-Agent Evaluation	Evaluation results, evidence bundles	Accurate, Available
Cross-Agent Verification	Correlated traces, provenance	Complete, Attributable
System-Level Evaluation	Chaos test records, threat models	Enduring, Available
Acceptance & Accountability	Named owner sign-off, evidence	Attributable, Complete, Available

The trace chain from outcome specification through agent execution to acceptance evidence is unbroken, machine-queryable, and immutable.

Transition Principles

1. Start with specification engineering, not coding agents

If requirements are vague, agents will merely produce ambiguity faster. Specification quality is the primary upstream control variable.

2. Modernize verification before expanding autonomy

Autonomy without a strong right side of the V creates faster defect production, not faster compliant delivery.

3. Keep validation explicitly human-led

Verification asks whether the system satisfies the specification. Validation asks whether the specification was worth building. Agents can support validation; they should not own it.

4. Treat architecture as policy

In a standard SDLC, architecture can be partly social. In an agentic SDLC, domain boundaries, tool permissions, and data handling rules must be enforced by the runtime.

5. Make traceability an output of the system

Do not scale manual trace matrices. Generate traceability from linked, versioned artifacts, execution traces, tests, approvals, and evidence bundles.

6. Expand autonomy by risk tier, never uniformly

Low-risk artifacts can move to agent assistance early. High-risk requirements, validation conclusions, and release approvals remain strongly human-governed.

Transition Roadmap

This roadmap assumes a serious regulated environment and a staged transition. The phases are sequential in emphasis, but some activities overlap.

Phase	Focus	Typical duration	Primary outcome	Manifesto phase
0	Baseline and segmentation	4-6 weeks	Current V-model mapped, risk classes segmented, pilot scope chosen	Pre-Phase 3
1	Specification foundation	6-10 weeks	Requirements become structured, versioned, and agent-usable	Phase 2-3
2	Verification and validation backbone	8-12 weeks	V and V evidence becomes executable, repeatable, and tiered	Phase 3
3	Architecture and harness controls	6-10 weeks	Agents operate inside enforceable boundaries	Phase 3-4
4	Controlled agent-assisted build and test	8-12 weeks	Agents contribute under supervision and evidence gates	Phase 3-4
5	Integrated agentic V-model release loop	8-12 weeks	Release, change control, and revalidation become evidence-driven	Phase 4-5
6	Full agentic SDLC	ongoing	Governed autonomy across the lifecycle	Phase 5+

Phase 0 — Baseline and Segmentation

Objective: Understand the current V-model implementation before changing it.

Activities:

Map current lifecycle artifacts: intended use, requirements, architecture, verification plans, validation protocols, trace matrices, release records
Segment products and workflows by risk and regulatory consequence
Identify where traceability is manual, weak, or routinely backfilled
Define autonomy red lines: high-risk approvals remain human-owned
Select one pilot value stream

Exit criteria:

The organization can name which lifecycle decisions will remain human-only
The first pilot scope is explicit and bounded
Current evidence gaps are known

Phase 1 — Specification Foundation

Objective: Turn design inputs into structured artifacts that can steer agents safely.

Activities:

Standardize templates for intended use, user needs, system requirements, and detailed specifications
Require every requirement to include: rationale, acceptance criteria, trace ID, risk tag, source, and owner
Add stop criteria to major work items
Define interface contracts and prohibited behaviors in machine-usable form
Introduce specification review focused on ambiguity and unverifiable language

Exit criteria:

A reviewer can determine whether a requirement is specific enough for an agent to act on
Trace IDs are stable and versioned
Ambiguous prose is being reduced before implementation starts

Phase 2 — Verification and Validation Backbone

Objective: Rebuild the right side of the V as an executable evidence system.

Activities:

Convert verification plans into executable suites: unit tests, integration tests, policy checks, static analysis, simulation, adversarial scenarios
Define evidence bundles for each change: trace links, diffs, test outputs, review outcomes, policy checks
Separate verification layers: deterministic, statistical, formal, human
Define validation protocols that remain human-led but instrumented
Establish failure handling: explicit deviations, root-cause tagging

Exit criteria:

Verification evidence can be regenerated, not merely asserted
Validation records distinguish technical correctness from contextual fitness
Teams can show which requirements are insufficiently covered by evidence

Phase 3 — Architecture and Harness Controls

Objective: Ensure agents execute inside bounded constraints.

Activities:

Convert architecture rules into enforceable controls: domain ownership, dependency rules, tool permissions, data-access policies
Define the agent harness: prompts, tool registry, runtime permissions, checkpointing, trace capture, evidence collection
Create autonomy tiers by risk class
Introduce sandboxing for agent execution

Exit criteria:

Agents cannot bypass architectural rules through prompt interpretation alone
Every agent action in the pilot is attributable and auditable

Phase 4 — Controlled Agent-Assisted Build and Test

Objective: Use agents in implementation and verification without breaking the quality system.

Activities:

Start with bounded tasks: draft low-risk code, generate tests, propose trace links, summarize impact, prepare evidence packs
Require every agent-produced change to pass the verification backbone
Route higher-risk changes through narrower autonomy and deeper review
Capture review outcomes as structured signals: accepted, rejected, partially accepted, policy exception, unclear spec
Measure where agent output fails: bad decomposition, hallucinated requirements, architectural drift, weak evidence

Exit criteria:

Agent assistance reduces cycle time on low-to-medium-risk work without reducing assurance quality
Human reviewers focus on risk and ambiguity, not rereading every low-level step
The pilot produces reusable evidence and lessons

Phase 5 — Integrated Agentic V-Model Release Loop

Objective: Move from isolated pilot to a governed lifecycle that closes the loop from requirement change to monitored release and revalidation.

Activities:

Integrate generated traceability into change control and release records
Add periodic revalidation triggers: model change, tool change, policy change, workflow change, observed drift
Define how memory or learned agent behaviors are versioned and approved
Connect field data, deviations, and CAPA findings back into requirement and validation updates
Establish evidence-based release readiness

Exit criteria:

A post-release issue can be traced to the relevant requirement, implementation, evidence, and approval path
Revalidation triggers are explicit rather than ad hoc
The lifecycle is closed from design input to operational learning

Phase 6 — Full Agentic SDLC

Characteristics:

Specifications are the primary work product
Verification is largely automated and replayable
Validation is instrumented and human-owned
Traceability is generated continuously
Architecture is enforced at runtime
Agents operate under risk-tiered autonomy
Deviations, incidents, and revalidation update the system continuously

This is not: unrestricted autonomous change in high-risk areas, agent-written documentation without evidence linkage, replacing QMS discipline with prompt craft, or treating validation as another test suite.

Recommended Transition Sequence by Artifact

Intended use and user needs — Tighten purpose, scope, hazards, exclusions, and success criteria.
System and software requirements — Make them versioned, structured, and traceable.
Verification plans — Convert to executable evidence where possible.
Validation plans — Clarify human-led contextual validation and decision ownership.
Architecture and design constraints — Encode boundaries, permissions, and invariants.
Implementation workflow — Introduce harnessed agents on bounded work.
Traceability and evidence management — Generate, do not manually reconstruct.
Release, change control, and revalidation — Close the loop operationally.

That order is deliberate: specification first, verification second, architecture third, autonomy fourth.

Role Evolution in a V-Model Context

Quality / Validation functions

Move from document checkers to evidence-system governors. Own validation integrity, deviations, and release confidence boundaries.

System / software architects

Move from describing design to encoding enforceable constraints and approved execution zones.

Developers and technical leads

Spend more time on specification quality, interface design, evaluation design, and exception handling. Spend less time on first-draft boilerplate implementation.

Regulatory / quality leadership

Focus on where agent participation changes the assurance case: electronic records, traceability, tool qualification, approval semantics, and revalidation triggers.

Engineering leadership

Fund the evidence backbone, not just coding tools. Prevent local optimization where teams adopt agents without verification, traceability, or validation discipline.

Metrics for the Transition

Do not measure success with raw output volume. Track:

lead time from approved specification to verified evidence bundle
first-pass acceptance rate of agent-generated changes
percentage of requirements with executable verification coverage
percentage of changes with complete traceability
deviation rate introduced by agent-assisted work vs. human-only work
reviewer time spent on low-risk vs. high-risk changes
revalidation effort per major change class
total cost of correctness: inference + verification + governance overhead + incident remediation

Failure Modes to Avoid

Automating implementation before fixing requirement quality
Treating validation as test automation
Allowing agents to modify constraints that should be governance-controlled
Keeping traceability manual while scaling change volume
Granting identical autonomy to low-risk and high-risk work
Letting model or tool changes bypass revalidation logic
Measuring success by throughput while reviewer fatigue and deviation rates climb

Bottom Line

The V-model does not disappear in agentic engineering. It becomes more important.

But its artifacts can no longer remain passive documents. They must become active controls:

specifications that steer machines
architectures that constrain machines
verification that proves what happened
validation that confirms the work still matters
traceability that is generated by the system itself

Organizations that already operate a mature V-model are better positioned for agentic engineering than organizations that skipped the V-model for agile. They already have the specification discipline, the verification culture, and the traceability infrastructure. What they need to add is: machine-readable specifications, evaluation portfolios that handle non-determinism, continuous observability, emergence containment, and tiered accountability.

The V-model does not become obsolete. It becomes the governance skeleton that makes autonomous execution safe.

Appendix: Mermaid Diagram

graph TD
    subgraph "LEFT: Specification Cascade"
        L1["1. Outcome Specifications<br/>(P1, P2)"]
        L2["2. System Specifications<br/>(P2, P3)"]
        L3["3. Agent Architecture<br/>(P3, P4, P5)"]
        L4["4. Context & Domain Design<br/>(P6, P7, P11)"]
    end

    subgraph "BOTTOM: Execution"
        B["Agent Execution<br/>(Bounded Autonomy)"]
    end

    subgraph "RIGHT: Verification Cascade"
        R6["6. Per-Agent Evaluation<br/>(P8, P9)"]
        R7["7. Cross-Agent Verification<br/>(P9, P10)"]
        R8["8. System-Level Evaluation<br/>(P10, P8)"]
        R9["9. Acceptance & Accountability<br/>(P12, P8)"]
    end

    L1 --> L2 --> L3 --> L4 --> B
    B --> R6 --> R7 --> R8 --> R9

    L1 -.->|"Traceability"| R9
    L2 -.->|"Traceability"| R8
    L3 -.->|"Traceability"| R7
    L4 -.->|"Traceability"| R6

Navigating organizational friction and running your first governed pilot.

Read the Manifesto for the core principles. See the Adoption Playbook for the full table of contents. See the Roles and the Human Side for the human dimension of the transition.

Navigating Resistance and Politics

The Human Side of the Transition covers the emotional and cognitive challenges individuals face. This section covers the organizational and political friction points that leaders must navigate.

The Productivity Dip

Teams will be slower before they're faster. Writing specifications is slower than writing code — at first. Building evidence gates adds overhead — at first. Reviewing agent output is harder than reviewing human code — until traces and evaluations reduce the review burden.

What to do: Set expectations explicitly at the start of the transition. Budget for a 2-4 week productivity dip per domain. Measure the dip so you can show the recovery. Protect the team from "why is velocity down?" pressure by communicating the plan to leadership in advance. This is where the acceleration trap (described in The Human Side) is most dangerous: the temptation to skip governance and reclaim velocity is strongest when the dip is visible to leadership.

Management That Wants Velocity Metrics

The manifesto explicitly argues that velocity, story points, and lines of code are the wrong metrics for agentic engineering. But management may still demand them — especially if the AI investment was justified on productivity grounds.

What to do: Don't fight the productivity narrative. Redirect it. Show that the right productivity metrics (lead time from specification to verified deployment, escaped defect rate, total cost of correctness per outcome) capture actual business value, while velocity measures raw output that may or may not produce value. Frame it as "we're measuring the thing that matters to the customer, not the thing that looks good on a slide."

The Cost Conversation

Agentic infrastructure costs money: inference costs, tooling, memory infrastructure, evaluation pipelines. The investment must be justified before results are fully proven.

What to do: Start with a narrow pilot (Step 1 in the adoption path) where costs are containable and measurable. Track total cost of correctness from day one, so you can demonstrate economics improvement as the pilot matures. Frame the comparison against the true cost of the status quo: escaped defects, incident remediation, technical debt accruing at machine speed.

Incentive Misalignment

If developers are still measured on lines of code, PRs merged, or tickets closed, the manifesto's values will lose to the incentive structure every time. Incentives that reward output volume punish the careful specification, verification, and governance the manifesto requires.

What to do: Align incentives with outcomes, not output. Reward: defect- free deployments, specification quality (measured by agent first-pass success rate), evaluation coverage, and incident prevention. These are harder to measure than "PRs merged" but they measure what actually matters.

How to Run Your First Pilot

This pilot is designed to take your team from Phase 3 (agents executing autonomously without governance) to Phase 4 (governed delivery with evidence bundles and autonomy tiers). It maps to Steps 1-3 of the Incremental Adoption Path. Do not attempt this pilot until your team has worked through Phase 2→3: agents are executing whole tasks, your team has documented initial failure patterns, and engineers are writing specifications with acceptance criteria (even if informally).

Selecting the Pilot Domain

Choose a domain that is:

Bounded: Clear inputs, outputs, and domain boundaries. You should be able to define what agents may and must not do without ambiguity.
Low-to-medium risk: Not your most critical production path. A failure should be recoverable without customer impact.
Well-tested: Existing test coverage provides a baseline for evaluating agent output quality.
Owned by a willing team: The team should be curious, not coerced. Forced adoption produces compliance, not learning.

Good pilot domains: internal tools, test infrastructure, documentation generation, non-critical API endpoints, CI/CD pipeline improvements.

Bad pilot domains: payment processing, authentication, customer-facing decisions with legal or financial impact, and other high-blast-radius or controlled-data workflows — these are Step 5, not Step 1.

Pilot Structure

Duration: 6-8 weeks minimum. Shorter pilots don't generate enough evidence to distinguish signal from noise.

Team size: 3-5 engineers from the pilot domain, plus one operations engineer and one QA engineer. Small enough to iterate fast; large enough to test real workflows.

Scope: One domain, Tier 1 autonomy (agents analyze and propose), with evidence bundles required for every merged change.

Tooling investment: Minimal. Use existing CI/CD with added evidence gates. Do not invest in specialized agent platforms before validating the workflow.

Pilot Success Criteria

The pilot succeeds if:

Escaped defect rate for agent-generated changes is equal to or lower than the domain's historical baseline
Engineers can produce evidence bundles without unsustainable overhead (measure time per bundle)
The team can articulate what worked, what didn't, and what they'd change for the next domain
At least one specification was refined based on execution evidence (demonstrating the Agentic Loop in practice)

The pilot fails if:

Governance overhead exceeds the value of agent output (teams spend more time on evidence than the agent saves on implementation)
Escaped defect rate increases
Team burnout indicators appear (review rubber-stamping, evidence bundle quality declining over time)

After the Pilot

Document findings as a case study: what worked, what broke, what you'd change. Use the case study to inform the next domain's adoption. Do not generalize from one pilot — each domain has different failure surfaces.

How to measure progress and the common ways the change program fails.

Read the Manifesto for the core principles. See the Adoption Playbook for the full table of contents. See the Adoption Path for incremental steps and phase transitions.

Canonical sources. Normative principle definitions (P1–P12) are in manifesto-principles.md. Metric thresholds and alert bands in this document are heuristics — example starting bands that must be calibrated to local baseline, domain, and risk class before use. See glossary.md for canonical term definitions.

Success Metrics

Treat this manifesto as a living specification. Run pilots, publish failure analyses, measure outcomes, and revise principles based on evidence from real workflows.

Treat every threshold below as a starting baseline that must be calibrated to local review size, risk class, and domain history.

Metrics by Phase Transition

Phase 1 → 2 (focus on standardization and repeatable value):

Number of AI-assisted tasks with documented, repeatable workflows
Rework rate on AI-assisted outputs (how often does the human redo the AI's suggestion entirely?)
Team coverage: percentage of engineers using approved AI tooling regularly
Data handling incidents: trending toward zero for sensitive data shared with unapproved models (track as a security metric, not an adoption gate)

Phase 2 → 3 (focus on autonomous execution quality):

Agent task completion rate (tasks delegated vs. tasks that required human takeover mid-execution)
Review rejection rate for agent-generated outputs
Documented failure patterns (growing catalog indicates learning, not problems)
Specification quality: percentage of tasks where acceptance criteria were defined before agent execution

Phase 3 → 4 (focus on governance foundation):

Evidence bundle completeness rate (target: 100% of agent-generated changes)
Escaped defect rate: agent-generated vs. human-generated changes
Rollback frequency and mean time to recovery
Time per evidence bundle (sustainability indicator)

Phase 4 → 5 (focus on scale and economics):

Lead time from specification to verified deployment
Total cost of correctness by domain
Policy violation rate and resolution time
Cross-domain evaluation coverage

Phase 5 → 6 (focus on self-improvement and containment):

Specification convergence rate (iterations to stable acceptance criteria)
Evaluation theater detection rate (evals that pass but miss real issues)
Self-improvement cycle time and containment breach frequency
Human oversight load (high-risk reviews per domain owner)

Team Health Metrics (All Phases)

Review latency trends (rising latency may indicate review fatigue or cognitive overload)
Approval depth (are reviewers engaging meaningfully or rubber-stamping?)
Engineer satisfaction and burnout indicators (survey quarterly)
Junior engineer progression rate (are juniors developing specification and evaluation skills?)

Track these alongside system health. If system metrics improve while team health metrics decline, the governance model is consuming its own foundation.

Rubber-stamping detection. Control theater — humans nominally accountable but operationally blind — is the most common governance failure at scale. Detect it quantitatively before it becomes an incident:

Signal	Example healthy band	Example alert band	What it indicates
Median review time per agent-generated PR	8–20 minutes	< 2 minutes	Reviewer not reading the diff
PR rejection rate (agent-generated)	5–15%	< 1%	Approving without meaningful review
Inline comments per approved PR	3–7	Trending to 0 over 4 weeks	Review becoming mechanical
Rework rate within 1 week of merge	1–3%	> 10%	Approved changes requiring hotfixes

Collect these via your code review platform (GitHub, GitLab, Azure DevOps — all provide approval timestamps and comment counts via API).

These thresholds are operational heuristics calibrated from practitioner experience, not empirically validated across diverse organizations. Treat them as starting baselines and adjust based on your team's observed patterns. The alert thresholds are directional: any sustained trend toward them warrants investigation, even before a hard threshold is crossed.

Intervention protocol when thresholds breach: Do not add more reviewers. Reduce autonomy scope for that reviewer's domain until review is meaningful again. The problem is volume, not capacity. Additional reviewers at the same volume create the same rubber-stamping pattern faster.

Governance Overhead Metrics

Governance infrastructure has real cost. Without efficiency metrics, it is impossible to distinguish "governance is working" from "governance is overhead with no signal." Finance and leadership will ask; measure proactively.

Metric	Target	Alert threshold	What to do
Governance overhead as % of engineering throughput	< 15%	> 25% for two consecutive quarters	Audit which governance artifacts are actually influencing decisions; remove what isn't
False-positive rate on hook blocks	< 5%	> 15%	Rules are over-restrictive; refine with domain input
Time-to-update-governance-policy	< 2 weeks for standard changes	> 6 weeks	Governance model is too rigid; simplify change management path for low-risk policy updates
Incident-prevention rate attributable to governance controls	At least 1 prevented incident per quarter per active hook	Zero incidents prevented in 2 consecutive quarters	Hook may not be testing what matters; audit coverage
Hook false-negative rate (incidents that governance should have caught)	< 2% of total incidents	> 10%	Governance gaps; add coverage for the failure class

Calibrate after one quarter of baseline measurement.

If governance overhead exceeds 25% of throughput with no corresponding reduction in escaped defects, that is over-governance. Reduce ceremony, increase signal. The corrective action is always the same: audit what is actually influencing decisions and cut the rest.

Quarterly Review Cadence

Begin formal quarterly reviews once your team reaches Phase 4 (governed delivery). At Phases 1-3, use the phase-specific metrics above in lighter- weight retrospectives. Once at Phase 4, each quarter review:

Lead time from specification to verified deployment
Escaped defect rate and incident severity distribution
Rollback frequency and mean time to recovery
Policy violation rate and evidence bundle completeness
Human oversight load (high-risk reviews per domain owner)
Total cost of correctness by domain
Team health indicators

If governance overhead rises while quality and resilience do not improve, reduce control complexity and re-baseline autonomy scope.

Common Failure Modes of the Change Program

The Companion Guide covers technical failure modes (over-governance, evidence theater, control theater, etc.). This section covers failures in the organizational change process itself.

Adoption without transition support. Leadership announces "we're doing agentic engineering" without budgeting for training, experimentation time, or the productivity dip. Engineers are expected to learn on their own time. The fix: budget explicitly for the transition — training, protected experimentation time, and a communicated plan that accounts for the dip.
Ignoring the human cost. System metrics improve while engineers burn out. Governance load exceeds human capacity but nobody measures it. The fix: track team health alongside system health. When burnout indicators appear, reduce scope before pushing harder. See Sustainable Pace.
Unclear ownership between platform, product, and operations teams. Nobody knows who owns agent runtime, memory governance, or evaluation registries because these infrastructure categories didn't exist before. The fix: explicit domain-owner assignments with escalation rotations, created as part of the Phase 4→5 transition.
Premature autonomy expansion. A successful pilot in one domain leads to immediate rollout across all domains, skipping the evidence that the governance model scales. The fix: gate expansion on two consecutive quarters of stable or improving metrics in the current scope.
Incentive-adoption mismatch. The organization adopts the manifesto's vocabulary but continues rewarding output volume (PRs merged, velocity points). Engineers learn to game the new system by producing minimal evidence bundles that satisfy the letter of the process without the spirit. The fix: align incentives with outcomes before expanding adoption. See Incentive Misalignment.
Skipping phases. A team jumps from Phase 2 to Phase 4 because they "don't need" Phase 3's learning period. They adopt governance infrastructure without having documented the failure patterns it's supposed to catch. The fix: each phase builds prerequisites for the next. The phases are not a checklist to accelerate through — they are a learning sequence.

The adoption playbook addresses how a single team adopts the manifesto. This document addresses how an enterprise - dozens of teams, multiple business units, existing governance structures, and competing priorities - assesses readiness, sequences adoption, and governs the transition at scale. It draws on implementation patterns from large-scale system integration engagements.

Why Enterprise Adoption Is Different

A single team can adopt the manifesto in weeks. Give them domain boundaries, autonomy tiers, evidence gates, and let them run a pilot. The adoption path covers this well.

Enterprise adoption faces five problems that single-team adoption does not:

Heterogeneous maturity. Some teams operate at Phase 2 (AI as autocomplete), others at Phase 3 (agentic prototyping), a few already at Phase 4. A uniform adoption mandate either bores the advanced teams or overwhelms the lagging ones.
Governance overlap. Most enterprises already have compliance frameworks, audit structures, change management boards, and risk committees. The manifesto introduces new governance concepts (autonomy tiers, evidence bundles, memory governance) that must integrate with - not replace - existing structures.
Cross-domain dependencies. When Team A's agents produce artefacts consumed by Team B's agents, the verification and governance requirements multiply. The probability compounding problem applies across teams, not just within agent chains.
Budget and prioritisation. The infrastructure the manifesto requires - observability, memory governance, evaluation pipelines - competes with other transformation investments. Leadership needs a phased investment case, not a total-cost-of-transformation number.
Political dynamics. Adopting agentic engineering changes who makes decisions about what. Autonomy tier assignments are power decisions. Evidence requirements create accountability where ambiguity previously existed. These are organisational politics, not engineering problems.

Enterprise Readiness Assessment

Before sequencing adoption, assess the organisation across six dimensions. Each maps to specific manifesto infrastructure.

Dimension 1: Current Agentic Maturity Distribution

Map every team to the manifesto's maturity spectrum (Phase 1-6). This produces a maturity heatmap:

Phase	Description	Typical indicators
1	Guided Exploration	Ad hoc prompt use, no structure
2	Assisted Delivery	AI autocomplete, humans execute everything
3	Agentic Prototyping	Agents execute within sessions, limited verification
4	Agentic Delivery	Autonomy tiers, evaluation gates, persistent memory
5	Agentic Engineering	Multi-domain, evidence-driven, continuous Agentic Loop
6	Adaptive Systems	Self-improving within governed boundaries

Most enterprises in 2026 will show a distribution concentrated at Phase 2-3, with outlier teams at Phase 1 and Phase 4. The gap between the median and the most advanced team reveals the internal knowledge transfer opportunity. The gap between the median and Phase 4 reveals the investment distance.

Dimension 2: Existing Governance Integration Points

Map the manifesto's governance requirements to existing enterprise structures:

Manifesto concept	Enterprise equivalent	Integration approach
Autonomy tiers	Access control / change authority matrices	Extend existing authority frameworks with agent-specific tiers
Evidence bundles	Audit evidence / SOX controls	Align evidence bundle format with existing audit requirements
Memory governance	Data governance / information management	Extend data governance to cover learned memory as a new asset class
Behavioural observability	Application monitoring / SIEM	Extend existing monitoring to capture agent reasoning traces
Economics-aware routing	IT financial management / chargeback	Integrate cost-of-correctness metrics into existing FinOps
Domain boundaries	Domain ownership / team topology	Align agent domain boundaries with existing team structures

The enterprise that already has mature governance can adopt the manifesto faster - the new concepts graft onto existing structures. The enterprise without governance maturity faces a double investment: building the baseline and extending it for agentic systems.

Dimension 3: Infrastructure Readiness

The manifesto requires infrastructure that most enterprises do not have in 2026:

Reasoning-level observability. Not application performance monitoring. Agent reasoning traces that capture why decisions were made, not just what happened. This is a new tooling category.
Memory infrastructure. Persistent, governed, retrievable knowledge and learned memory stores with provenance tracking. Distinct from data lakes, document management, or wikis.
Evaluation pipelines. Automated verification gates that run on every change: deterministic checks, statistical evaluation, and formal contracts on critical paths. Distinct from CI/CD pipelines, which test code, not agent behaviour.
Cost-quality routing. Infrastructure to route tasks to appropriate model tiers based on cost, risk, and quality requirements. Distinct from existing load balancing or API gateway patterns.

Assessment should classify each as: exists, partially exists (can be extended), or must be built.

Dimension 4: Skill Distribution

The manifesto requires skills that are rare in 2026 (see companion-reference.md):

Specification engineering: expressing intent precisely enough for agents (Phase 3+)
Evaluation design: building verification pyramids (Phase 3+)
Memory governance: curating and governing persistent agent memory (Phase 4+, rare)
Formal methods: machine-checkable contracts on critical paths (Phase 5+, rare)

Map skill availability against the target maturity phase. The gap determines the training and hiring investment.

Dimension 5: Regulatory Exposure

Regulated industries face additional constraints:

Financial services: Algorithmic accountability, model risk management (SR 11-7), audit trail requirements. The manifesto's evidence bundles and autonomy tiers align well with existing regulatory expectations, but explicit mapping is required.
Healthcare: Patient data handling, clinical decision support regulations. Agent domain boundaries must align with data classification zones.
Defence/Government: Security clearance implications for agent access, national security constraints on model providers. Autonomy tiers require security-level alignment.

Regulatory exposure determines the maximum autonomy tier available for production paths and the evidence standard required.

Dimension 6: Organisational Change Capacity

How much change can the organisation absorb simultaneously? Indicators:

Number of concurrent transformation programmes already running
Change fatigue signals (resistance to new processes, low adoption of recent changes)
Leadership alignment on agentic engineering as a priority vs. one-of-many initiatives
Middle management readiness to evolve roles (tech leads, QA leads, ops leads)

This dimension determines adoption pace, not adoption scope.

Sequencing Enterprise Adoption

Wave Model

Adopt in waves, not simultaneously. Each wave adds teams, expands scope, and raises the target maturity phase.

Wave 0: Foundation (8-12 weeks) Scope: One team, one domain, Phase 3→4 transition. Purpose: Prove the manifesto works in this enterprise's context. Produce reference evidence: pilot metrics, adapted governance mappings, infrastructure requirements validated. Outcome: A working example other teams can observe.

Wave 1: Early Adopters (12-16 weeks) Scope: 3-5 teams selected for willingness, existing Phase 3 maturity, and domain diversity. Purpose: Test cross-domain patterns. Discover where enterprise governance integration fails. Build internal expertise. Outcome: Validated enterprise governance mappings. First cohort of practitioners who can mentor Wave 2 teams.

Wave 2: Mainstream Adoption (16-24 weeks) Scope: All teams above Phase 2. Target Phase 4 minimum. Purpose: Scale what Wave 1 proved. Centralise shared infrastructure (observability, memory, evaluation). Establish enterprise standards for evidence bundles and autonomy tiers. Outcome: Agentic engineering as the default operating model for software delivery.

Wave 3: Advanced Capabilities (ongoing) Scope: Teams ready for Phase 5. Cross-domain orchestration. Formal methods on critical paths. Purpose: Push the maturity frontier. Build the organisation's capacity for Phase 5-6 operations. Outcome: Competitive advantage through governed agentic systems at scale.

Wave Selection Criteria

Select Wave 1 teams based on:

Current maturity at Phase 3 or approaching it
Willing team lead and engineering manager
Domain with clear boundaries (not deeply entangled with other domains)
Moderate regulatory exposure (enough to test governance, not enough to paralyse)
Existing test infrastructure that evaluation gates can extend

Explicitly exclude from Wave 1:

Teams below Phase 2 (not ready)
Highest-risk domains (payment processing, core trading, patient-facing systems) - save for Wave 2 after governance is proven
Teams in the middle of other major transitions

Governance at Enterprise Scale

The Enterprise Governance Layer

Single-team governance is straightforward: the team owns its domain boundaries, autonomy tiers, and evidence gates. Enterprise governance adds three concerns:

Cross-domain verification. When agents in Domain A produce artefacts consumed by agents in Domain B, who verifies the interface? The manifesto's Principle 3 (defense-in-depth) applies at the enterprise level: domain boundaries must include contract-based verification at integration points.

Autonomy tier consistency. If Team A operates at Tier 2 and Team B operates at Tier 3, what tier governs their interaction? The conservative answer: the lower tier. The practical answer: define interaction-specific tiers in the enterprise governance framework.

Memory isolation and sharing. Learned memory from one domain may be valuable to another domain. Memory governance at enterprise scale must address: what memory can be shared, under what provenance requirements, and with what decay policies.

Enterprise Governance Board

Establish a lightweight governance body responsible for:

Maintaining the enterprise autonomy tier framework
Reviewing and approving cross-domain autonomy tier escalations
Setting enterprise evidence standards (minimum evidence bundle contents per domain classification)
Governing shared memory infrastructure (what is shared, what is domain-isolated)
Reviewing enterprise-level metrics quarterly (see adoption-metrics.md)

Membership: CTO or VP Engineering (chair), domain leads from Wave 1 teams, QA/evaluation lead, operations lead, one representative from risk/compliance. Keep it small. Meets monthly during Wave 1-2, quarterly after Wave 2.

Investment Case

Phased Investment Model

Enterprise adoption costs fall into three categories:

Infrastructure investment. Reasoning-level observability, memory infrastructure, evaluation pipelines, cost-quality routing. Shared across all teams. Front-loaded in Wave 0-1. Estimated at 2-4 FTEs for 6 months to establish, 1-2 FTEs ongoing to operate.

Enablement investment. Training, mentoring, role transition support. Scales with wave size. Estimated at 1-2 days per engineer for Phase 3→4 transition, with ongoing coaching.

Governance overhead. Enterprise governance board operation, cross-domain verification design, evidence standard maintenance. Estimated at 0.5 FTE ongoing after Wave 1.

Return Model

The investment case rests on three measurable returns:

Defect reduction. The SWE-CI benchmark data suggests that governed agentic pipelines (Phase 4+) reduce regression rates by 40-60% compared to ungoverned agentic use (Phase 2-3). Measure: escaped defect rate pre/post adoption.
Cost optimisation. Economics-aware routing (Principle 11) reduces inference costs by routing routine tasks to cheaper models. Organisations without routing typically overspend on inference by 30-50%. Measure: inference cost per verified outcome.
Compliance cost avoidance. Evidence bundles and autonomy tiers reduce the cost of audit evidence production and regulatory response. Measure: audit preparation time pre/post adoption.

The break-even point for most enterprises: 2-3 quarters after Wave 1 completion, driven primarily by defect reduction and cost optimisation.

Common Enterprise Failure Modes

1. The Big Bang Mandate

Mandating manifesto adoption across all teams simultaneously. Results in: governance theater (vocabulary adoption without practice change), tool procurement without capability building, and backlash from teams that weren't ready. The wave model exists to prevent this.

2. Infrastructure Before Practice

Building full enterprise infrastructure (memory, observability, evaluation pipelines) before any team has adopted the manifesto. Results in: infrastructure that doesn't match actual needs, budget consumed before value demonstrated, and shelfware. Wave 0 exists to validate infrastructure requirements with a real team before scaling investment.

3. Governance Without Engineering

Establishing enterprise governance boards and evidence standards without the engineering practices that make them meaningful. Results in: compliance overhead without quality improvement, evidence theater (producing evidence bundles that satisfy process without ensuring quality), and engineer disengagement.

4. Ignoring the Political Dimension

Autonomy tier assignments change decision-making authority. Evidence requirements create accountability. Treating these as purely technical decisions ignores that they redistribute power within the organisation. Engage leadership early. Make autonomy tier assignments a leadership decision, not an engineering one.

5. Premature Phase 5 Ambition

Targeting Phase 5 (multi-domain agentic engineering) before Phase 4 is stable across core domains. Phase 4→5 is the hardest transition in the manifesto's maturity spectrum because it requires new infrastructure categories (agent runtime, memory governance at scale) and organisational structures that don't exist at Phase 4.

This document is part of the Agentic Engineering Manifesto. See also: Roles and the Human Side for leadership-level implications, Adoption Playbook for single-team adoption, and Adoption Path for phase transitions.

These documents map the principles of the Agentic Engineering Manifesto to the regulatory frameworks that govern specific industries. They bridge the gap between the manifesto's domain-agnostic guidance and the concrete standards teams must satisfy in regulated environments.

These documents do not explain the regulations themselves, nor do they constitute compliance advice. They assume the reader already understands the applicable regulatory landscape and needs to see how agentic engineering practices align with it.

Disclaimer — These alignment mappings are provided for informational purposes only. They do not represent legal, regulatory, or compliance advice. Organizations must conduct their own compliance assessments with qualified professionals. Regulatory frameworks evolve; always verify against the current published standards.

Documents

Document	Scope
Aviation	DO-178C, DO-330, DO-333, ARP 4754A, DO-326A — airborne software and systems assurance
Medical Devices	IEC 62304, ISO 14971, ISO 13485, FDA SaMD, EU MDR — medical device software lifecycle
Pharma / Life Sciences	GAMP 5, CSA, 21 CFR Part 11, EU Annex 11, ICH Q8-Q12 — pharmaceutical computerized systems
Financial Services	SR 11-7, DORA, EU AI Act, SOX, Three Lines of Defense — banking, insurance, capital markets
Automotive	ISO 26262, ASPICE, UN Regulation 157 — road vehicle functional safety and autonomous driving
Defense / Government	CMMC, FedRAMP, NIST SP 800-53, ITAR/EAR — government contracting and defense systems

Cross-Cutting Themes

Several themes recur across all domains and are addressed at the manifesto level rather than in domain-specific documents:

Independent validation as a governance principle — see Companion Principles P8
SOUP / agent-as-tool categorization — see Companion Principles P3
Data classification as an agent constraint — see Companion Frameworks
ALCOA+ compliance — see Companion Frameworks
Champion-challenger testing — see Companion Principles P8
Fairness and bias testing — see Companion Principles P8
Cross-domain incident classification — see Companion Patterns
Supplier and vendor qualification — see Companion Reference
Memory governance in regulated environments — see Companion Principles P6
Open interoperability requirements — see Companion Principles P9
Benchmark instability and private holdouts — see Companion Principles P8

Cross-Domain Open Regulatory Questions

The following questions are unresolved across multiple regulated domains. They represent the highest-priority areas where industry consensus, standards-body guidance, or regulatory precedent is needed. Each question links to the domain that has developed the most specific framing.

#	Question	Domains Affected	Status
1	Agent-as-tool qualification: Is an AI agent SOUP (IEC 62304), an unqualified tool (DO-178C/DO-330), a GAMP Cat 3/4 system, or a new category requiring new classification frameworks? No domain has a settled answer.	All	Open — each domain uses the "treat as unqualified tool, independently verify output" pragmatic approach pending regulatory guidance
2	Model version change revalidation scope: When the underlying model is updated (e.g., model version bump by the provider), what revalidation is required? Does a minor version change trigger full re-IQ/OQ/PQ? Full independent model validation? Or only a behavioral regression test?	Medical, Pharma, Financial	Open — PCCP (FDA) partially addresses anticipated modifications but not infrastructure-level model changes
3	Memory accumulation as a change control event: At what point does accumulated learned memory constitute a change to a validated system? No domain has a threshold or methodology.	Pharma (most developed), Medical, Financial	Open — GAMP 5 open question; no regulatory body has published guidance
4	Open-source model supplier responsibility: When a deploying organization uses an open-source model with no identifiable supplier, how should GAMP 5 supplier qualification, ISO 13485 §7.4 purchasing controls, and SR 11-7 vendor model management apply?	Pharma, Medical, Financial	Open — conservative position is to assume full supplier responsibility; regulatory validation of this approach is untested
5	GDPR Art. 22 and agent-assisted decisions: When an agent produces a recommendation that a human rubber-stamps, does that constitute "solely automated decision-making" under GDPR Art. 22? The boundary between meaningful human review and rubber-stamping is undefined in regulatory guidance.	Financial, Medical, All customer-facing	Open — rubber-stamping detection metrics (see adoption-metrics.md) partially address the engineering side; the legal question is unresolved
6	Protocol and evidence portability: What level of interoperability should regulated teams require for tool invocation, agent delegation, trace export, and replay before an agent platform can be treated as operationally governable rather than vendor-bound?	All	Open — open protocols are emerging, but regulatory expectations for portability, replay, and audit export are not yet settled

Document Structure Template

All domain documents in this directory should include the following sections. Sections may be omitted only where clearly not applicable to the domain — in which case add a brief "Not applicable for this domain: [reason]" note.

## [Criticality/Risk Level] to Manifesto Autonomy Mapping
Map the domain's primary risk/criticality classification (DAL, safety class,
ASIL, GxP context, etc.) to manifesto autonomy tiers. This is the primary
table readers need.

## [Framework]-by-[Framework] Mapping (repeat for each major standard)
Table mapping each regulatory standard's key requirements to manifesto
principles, with Alignment (Strong/Good/Partial/Gap) and Gap description.

## SOUP / Agent-as-Tool Treatment
How the domain's software component classification framework applies to AI
agents, model dependencies, and agent-selected libraries.

## Hard Autonomy Caps
Regulatory floor caps by use case. These are not recommendations — they are
constraints. Include the regulatory citation for each cap.

## Viable Starting Points
3-6 concrete, low-risk entry points for teams beginning agentic adoption
in this domain. Each should be realistically achievable without resolving
open regulatory questions.

## Tool Configuration Notes
How to configure agent tooling (hooks, RBAC, MCP allowlists, model pinning)
to satisfy the domain's audit trail and data classification requirements.

## ALCOA+ or Equivalent Data Integrity Cross-Reference
Cross-reference to companion-frameworks.md#alcoa-alignment with any
domain-specific additions.

## Open Regulatory Questions
Unresolved questions specific to this domain. Cross-reference to the
cross-domain questions in this README where applicable.

Design Assurance Level to Manifesto Autonomy Mapping

The manifesto defines four autonomy tiers (Principle 5): Tier 1 (Observe), Tier 2 (Branch), Tier 3 (Commit), Tier 4 (Operate). The mapping below constrains the maximum permissible tier based on the failure condition severity tied to the software component's Design Assurance Level.

DAL	Failure Condition	Max Agent Autonomy Tier	Verification Depth	Rationale
A	Catastrophic	Tier 1 -- Observe only	All agent output independently verified through qualified means (DO-178C Table A-1 through A-10 objectives, independence requirements)	No certification credit for unqualified tool output. Agent may analyze and propose; human authors and verifies.
B	Hazardous	Tier 1 -- Observe only	Independent verification required for all objectives with independence (Table A-4, A-5, A-7)	Same constraint as DAL A. Reduced objective count does not relax the independence requirement.
C	Major	Tier 1-2 -- Observe or Branch	Agent may draft artifacts to isolated branches; merge requires qualified human verification against applicable Table A objectives	Fewer objectives with independence. Agent-drafted code and tests are viable when independently reviewed before baseline.
D	Minor	Tier 1-3 -- Full tier range	Standard evidence bundles (P1) attached to each agent contribution; verification per Table A objectives	Reduced verification rigor. Agent contributions with evidence bundles can satisfy most objectives with standard review.
E	No Effect	Tier 1-4 -- Full tier range	Standard manifesto adoption path applies; Tier 4 additionally requires validated governance infrastructure per P5	No certification objectives apply. Normal manifesto governance is sufficient. Tier 4 permitted only when machine-enforced policy envelope, passing control evaluations, and active governance observability are all confirmed operational.

Key constraint: DAL assignment is determined by the system safety assessment (ARP 4754A/4761A), not by the development team. The DAL dictates the ceiling; the team cannot raise it.

DO-178C Objectives to Manifesto Principle Mapping

DO-178C organizes airborne software lifecycle activities into five process categories. The table below maps each to the most applicable manifesto principles.

DO-178C Process	Key Objectives	Manifesto Principle	Alignment	Notes
Planning Process (Section 4)	PSAC, SDP, SVP, SCMP, SQAP	P2 -- Specifications are living artifacts	Strong	Machine-readable specifications (P2) strengthen plan-to-artifact traceability. Plans remain human-approved documents.
Planning Process	Standards definition, transition criteria	P5 -- Autonomy is a tiered budget	Moderate	Autonomy tiers map to plan-defined transition criteria. Agent permissions can be encoded in SDP/SVP.
Development Process (Section 5)	Requirements, design, coding, integration	P1 -- Outcomes are the unit of work	Strong	Evidence bundles per outcome satisfy DO-178C's requirement for traceable development output.
Development Process	Architecture, detailed design	P3 -- Architecture is defense-in-depth	Strong	Manifesto boundary enforcement aligns with DO-178C architectural partitioning (Section 2.4.1).
Development Process	Source code, integration	P4 -- Right-size the swarm	Moderate	Multi-agent coordination must preserve single-threaded configuration baselines.
Verification Process (Section 6)	Reviews, analyses, test cases, test procedures, test results	P8 -- Evaluations are the contract	Strong	Evaluation portfolios map directly to verification cases/procedures. Evidence bundles map to test results.
Verification Process	Structural coverage, requirements-based testing	P9 -- Observability covers reasoning	Strong	Trace-level observability supports structural coverage analysis and requirements-based test traceability.
Verification Process	Independence of verification	P12 -- Accountability requires visibility	Strong	Manifesto's accountability model requires named human ownership; DO-178C requires verification independence. Both demand separation of authoring from verification.
CM Process (Section 7)	Configuration identification, baselines, change control, status accounting, archival	P6 -- Knowledge and memory are infrastructure	Strong	Knowledge as versioned ground truth (P6) maps to CM identification and baseline management.
CM Process	Problem reporting, change review	P9 -- Observability covers reasoning	Moderate	Agent action traces provide richer change history than traditional problem reports.
QA Process (Section 8)	Process assurance, compliance, transition criteria	P12 -- Accountability requires visibility	Strong	QA's role as independent process watchdog parallels manifesto's accountability requirements.
QA Process	Standards compliance	P8 -- Evaluations are the contract	Moderate	Evaluation gates can automate portions of conformity review, but QA independence remains human-owned.

DO-330 Tool Qualification -- The Hard Constraint

DO-330 determines when a software development tool requires qualification and at what rigor. This is the single hardest regulatory constraint for agentic engineering in aviation.

Tool Qualification Level Determination

An agent used in the development of airborne software is a development tool under DO-330. Its Tool Qualification Level (TQL) is determined by the DAL of the software it produces and whether its output errors are detectable.

TQL	Software DAL	Error Detectability	Required Tool Development Rigor	Agent Feasibility (Current State)
TQL-1	DAL A	Undetectable	Equivalent to DO-178C DAL A	Not feasible. LLMs are non-deterministic, requirements are unknowable, and exhaustive testing is impossible.
TQL-2	DAL A-B	Detectable	Equivalent to DO-178C DAL B	Not feasible. Same fundamental obstacles as TQL-1 with marginally reduced scope.
TQL-3	DAL A-C	Detectable	Equivalent to DO-178C DAL C	Not feasible. Requires demonstrable tool requirements and verification. Current LLMs cannot satisfy these.
TQL-4	DAL B-D	Detectable	Equivalent to DO-178C DAL D	Marginal. Possible only with extremely constrained agent scope and deterministic wrappers.
TQL-5	DAL C-E	Detectable	Equivalent to DO-178C DAL E	Viable for narrow tool functions where all output is independently verified.

The Realistic Path

Current LLMs cannot achieve TQL-1 through TQL-3 qualification. The fundamental obstacles are non-determinism, absence of specifiable tool requirements (in the DO-330 sense), and inability to demonstrate coverage or absence of anomalous behavior.

The viable approach: treat the agent as an unqualified development tool and independently verify all of its output.

DO-178C already accommodates unqualified tools -- their output simply receives no certification credit until independently verified. This is precisely the manifesto's model:

Evidence bundles (P1) document what the agent produced and what evidence supports it.
Evaluation portfolios (P8) provide the independent verification that replaces tool qualification credit.
Observability traces (P9) provide the audit trail showing that verification was performed and by whom.

The agent accelerates development; verification provides the assurance credit. This is Tier 1 and Tier 2 operation by construction.

Note: This constraint may evolve. EASA and FAA have issued AI roadmaps (EASA AI Concept Paper 2.0, FAA AI Safety Assurance Framework). Certification authorities are actively developing guidance for ML-based tools. Industry groups (SAE G-34/EUROCAE WG-114) are drafting standards for ML in airborne systems. Monitor these developments.

DO-333 Formal Methods -- The Opportunity

DO-333 is the formal methods supplement to DO-178C. It provides certification credit for formal analyses that replace specific testing objectives -- making it the most natural intersection between agentic engineering and aviation certification.

Manifesto Principle 8 states: "proofs are a scale strategy." DO-333 is the certification framework that gives this statement regulatory teeth.

DO-333 Credit Categories Mapped to Manifesto

DO-333 Credit	What It Replaces	Manifesto Formal Contracts Approach	Aviation Applicability
Formal proof of absence of runtime errors	Robustness testing objectives	Agent-generated code with formal proofs via tools like Astree, Polyspace, or Frama-C	Production precedent: Astree on Airbus A380/A350, A340
Formal proof of requirements satisfaction	Requirements-based test cases (partial)	Formal contracts as machine-verifiable specifications (P2 + P8)	Applicable where requirements are formally expressible
Model checking of state machines	State machine testing	Agent-generated models with exhaustive state exploration	Applicable to control logic, mode management
Formal equivalence checking	Integration testing (partial)	Agent-generated code verified against formal reference model	Applicable to compiler/code generator qualification (CompCert precedent)

Why This Matters for Agentic Engineering

Agent-generated code accompanied by machine-checked formal proofs can produce a stronger certification case than traditionally hand-written code with manual testing alone. The proof is the evidence, and it is independently verifiable by deterministic tools.

Production precedents exist:

Astree -- abstract interpretation, deployed on Airbus A380/A350 flight control software, proving absence of runtime errors.
CompCert -- formally verified C compiler, applicable to TQL arguments.
SCADE -- qualified code generator with formal semantics, used across multiple Airbus and other airborne platforms.

The manifesto's position that "proofs are a scale strategy" is directly validated by the DO-333 credit model: formal methods scale certification evidence in ways that test-only approaches cannot.

ARP 4754A System-Level Mapping

ARP 4754A governs the system development process that produces the safety requirements and DAL assignments flowing down to DO-178C software development. Agents can assist at this level, but human accountability is absolute.

ARP 4754A Process	Agent Role (Manifesto Alignment)	Human Accountability
Functional Hazard Assessment (FHA)	Agent assists with analysis: identifies failure modes from system architecture, cross-references historical FHA databases (P6 -- Knowledge).	Human owns hazard classification. FHA severity assignments require engineering judgment and regulatory agreement.
Preliminary System Safety Assessment (PSSA)	Agent drafts fault trees and dependency diagrams from architectural models; proposes failure rates from component databases (P1 -- Evidence bundles).	Human approves safety assessment. PSSA conclusions drive DAL allocation and must be defensible to the certification authority.
System Safety Assessment (SSA)	Agent generates bidirectional traceability matrices between safety requirements, design artifacts, and verification evidence (P9 -- Observability).	Human validates completeness and correctness. SSA is the final safety argument; it must be human-owned.
Common Cause Analysis (CCA)	Agent identifies common causes across subsystems: shared resources, environmental factors, cascading failures (P10 -- Containment).	Human approves analysis and determines acceptability of residual common-cause risk.
Requirements validation	Agent cross-checks system requirements against FHA/PSSA allocations for completeness and consistency (P2 -- Specifications).	Human confirms that derived requirements are correctly captured and allocated.
FDAL/IDAL allocation	Agent proposes allocation based on FHA severity and architectural independence arguments.	Human owns allocation decisions. FDAL/IDAL assignments are certification commitments.

Configuration Management

DO-178C Section 7 requires configuration management with identification, baselines, traceability, change control, status accounting, and archival for all software lifecycle data.

Agent-generated artifacts are software lifecycle data and fall under the same CM requirements as human-generated artifacts. The manifesto's model supports this:

Evidence bundles (P1) are CM items. Each bundle carries identification (trace ID, agent ID, timestamp), provenance, and linked problem reports.
Manifesto trace model (P9) provides bidirectional traceability from specification through implementation to verification -- the same traceability DO-178C Section 7.2 requires.
Knowledge as versioned ground truth (P6) maps to CM baseline management. Agent knowledge stores must be baselined and change-controlled alongside source code and requirements.

Agent memory (the heuristic/learned component per P6) is not a CM item unless it influences airborne software output. If it does, it must be baselined, and changes must go through problem reporting.

CM Mapping Summary

DO-178C CM Objective (Section 7)	Manifesto Mechanism	Implementation Note
Configuration identification	Evidence bundle IDs (P1), trace IDs (P9)	Each agent-generated artifact carries a unique identifier linked to the agent session, model version, and prompt hash.
Baselines	Knowledge baseline (P6)	Agent knowledge stores and model versions are baselined alongside software baselines at each lifecycle milestone.
Traceability	Bidirectional trace model (P9)	Specification-to-code-to-test traceability generated by agents must be independently validated for completeness.
Problem reporting	Evaluation failures (P8)	Failed evaluations generate problem reports automatically. Agent-introduced defects trace back to the originating session.
Change control	Autonomy tier gates (P5)	Tier 2 branch-to-merge workflow enforces change control. No agent-generated change enters a baseline without human approval.
Release and archival	Evidence bundles (P1)	Bundles are archival-ready: self-contained, immutable, and reproducible.

ARP 4761 / 4761A Safety Assessment

ARP 4761 (and its revision 4761A) defines the safety assessment methods that produce the failure condition classifications driving DAL assignment. Agent involvement in safety assessment activities requires particular care because errors propagate into DAL assignments and certification scope.

Safety Assessment Method	Agent Contribution	Constraint
Fault Tree Analysis (FTA)	Agent drafts fault trees from system architecture models and failure mode libraries.	Human validates logical correctness, cut set analysis, and probability assignments. Automated generation must not mask missing failure modes.
Failure Modes and Effects Analysis (FMEA)	Agent populates FMEA worksheets from component databases, prior analyses, and architecture descriptions.	Human reviews severity classifications, detection methods, and recommended actions. Agent cannot assign severity.
Markov Analysis	Agent builds state transition models and computes reliability metrics.	Human validates state space completeness and transition rate assumptions.
Dependency Diagram Analysis	Agent generates dependency diagrams from system interconnection data.	Human validates that all relevant dependencies are captured, including latent and environmental dependencies.
Common Mode Analysis (CMA)	Agent cross-references design data to identify shared resources, manufacturing processes, and environmental exposures.	Human owns the determination of common mode acceptability and any required design changes.

The manifesto's Principle 10 (containment) is directly relevant: safety assessment errors are emergent risks that compound through the certification chain. Independent human review is non-negotiable for all safety assessment outputs regardless of DAL.

Airworthiness Security (DO-326A / DO-356A)

DO-326A establishes the airworthiness security process; DO-356A provides the information security supplement. Agentic engineering introduces specific threat vectors that must be addressed in the Security Risk Assessment.

Manifesto Alignment

Security Concern	Manifesto Mapping	Aviation-Specific Consideration
Agent data access scope	P10 -- Containment; P3 -- Defense-in-depth	Agents must not have access to airborne software beyond their authorized development scope. Network isolation and data classification enforcement apply.
Supply chain integrity of agent models	P3 -- Architecture boundaries	Model provenance, integrity verification, and version control. Untrusted model updates are a supply chain attack vector.
Prompt injection / adversarial input	P10 -- Containment	Adversarial inputs to development agents could introduce subtle vulnerabilities in airborne code. Independent verification (DO-330 unqualified tool path) is the mitigation.
Data exfiltration via agent context	P7 -- Context is engineered	Agent context windows may contain export-controlled technical data.

Export Control (ITAR/EAR)

Airborne software, particularly defense-related avionics, is frequently subject to ITAR (22 CFR 120-130) or EAR (15 CFR 730-774) restrictions. Agents that process ITAR/EAR-controlled technical data must operate within compliant infrastructure: no data transmission to non-compliant cloud endpoints, no model training on controlled data without authorization, and access controls consistent with Technology Control Plans.

DO-278A -- Ground-Based Software

DO-278A governs software for ground-based CNS/ATM systems. It is structurally similar to DO-178C but uses Assurance Levels (AL 1-6) rather than DALs and applies to a lower-criticality domain overall.

DO-278A is a strong candidate for earlier agentic adoption:

DO-278A Assurance Level	Equivalent Rigor	Agent Autonomy Ceiling
AL-1	Comparable to DAL A	Tier 1
AL-2	Comparable to DAL B	Tier 1
AL-3	Comparable to DAL C	Tier 1-2
AL-4	Comparable to DAL D	Tier 1-3
AL-5	Below DAL D	Tier 1-4 (Tier 4 requires validated governance infrastructure per P5)
AL-6	Below DAL E	Tier 1-4 (Tier 4 requires validated governance infrastructure per P5)

The same DO-330 tool qualification constraints apply. The path is identical: unqualified tool with independent verification of all output.

ALCOA+ Compliance

Aviation configuration management (DO-178C Section 7) requires data integrity standards that parallel ALCOA+ requirements. The manifesto's evidence model satisfies these by construction. See Companion Frameworks — ALCOA+ Alignment for the complete mapping table.

For aviation-specific application:

Configuration identification maps to ALCOA+ "Attributable" and "Original": every agent-generated artifact carries agent identity, model version, session ID, and prompt hash.
Baselines map to "Contemporaneous" and "Enduring": evidence bundles are captured at execution time and retained as immutable CM items.
Problem reporting maps to "Accurate" and "Complete": evaluation failures generate problem reports that are traceable and cannot be silently suppressed.

Practical constraint: for DO-178C programs, the trace infrastructure is a development tool and must be addressed in the PSAC. Conservative framing: describe it as an internal tooling component with documented version control, not as a tool requiring TQL qualification.

Market-Specific Autonomy Guidance

The table below maps aviation workflows to recommended autonomy tiers. The DAL-based ceiling in the first section of this document applies; this table adds workflow-level context.

These are conservative caps for safety-relevant software paths; lower-risk supporting tooling may have different constraints.

Workflow	DAL / Assurance Level	Recommended Autonomy	Notes
Airborne software — critical paths (flight control, engine control)	DAL A/B	Tier 1 (observe only)	Agent may analyze, draft, and propose. All output independently verified by qualified personnel. TQL-1/2 tool qualification is not currently feasible under present evidence and qualification expectations; treat the agent as an unqualified tool pending authority review.
Airborne software — major functions	DAL C	Tier 1-2	Agents draft to isolated branches. Merge requires qualified review against applicable Table A objectives.
Airborne software — minor / no-effect functions	DAL D/E	Tier 1-3	Standard evidence bundles satisfy reduced verification objectives. Natural pilot domain.
Ground support equipment (GSE) software	Typically not DO-178C scope	Tier 1-3	Normal manifesto adoption applies. Confirm applicability of DO-178C to specific GSE.
Ground-based CNS/ATM software (DO-278A)	AL-3 to AL-6	Tier 1-3 (AL-3 ceiling: Tier 1-2)	Lower assurance levels; natural early adoption domain. Same DO-330 path applies.
Test generation and requirements analysis	Any DAL — tool output only	Tier 1 (observe)	Agent operating at Tier 1 generates candidate test cases, traceability matrices, and coverage analyses. Human qualified staff review and accept. No tool qualification required.
Safety assessment (FHA, FMEA, FTA)	N/A — feeds DAL assignment	Tier 1 (observe only)	Errors propagate into DAL and certification scope. Independent human review non-negotiable for all safety assessment outputs regardless of DAL.
Traceability and evidence package assembly	Any DAL	Tier 1-2	High value, low risk. Agent assembles; human validates completeness. Strong ALCOA+ alignment.

Tool Configuration Notes

How to configure agent tooling to satisfy DO-178C and DORA Article 9 traceability requirements. Read alongside your enterprise configuration guide.

Configuration Management Hook Mapping

DO-178C Section 7 requires that all software lifecycle data is identified, baselined, and change-controlled. Agent configuration contributes to this:

DO-178C CM Objective	Hook Type	What It Produces
Configuration identification of agent artifacts	PostToolUse audit hook	Artifact ID, agent session ID, model version, timestamp
Change control — agent-modified files	PreToolUse gate hook	Review record, autonomy tier at time of change
Problem reporting — failed evaluations	PostToolUse evaluation hook	Evaluation failure record with trace ID
Archival and retention	SessionEnd archive hook	Immutable session record in the CM repository

Export Control Enforcement (ITAR/EAR)

For programs with ITAR/EAR-controlled technical data, the MCP allowlist (Layer 6 in enterprise configuration) is the primary data residency control:

Restrict MCP servers to on-premises or US-person-accessible endpoints only.
No external API calls for sessions containing ITAR-controlled design data.
Log all tool calls with data classification context for Technology Control Plan compliance.

Model Version Pinning for Certification Stability

Pin agent model versions during active certification programs:

During DER/ODA review periods
While PSAC or SCI is open
After any verification baseline has been established

Model version changes affecting agent behavior should be documented as CM changes and assessed for impact on previously verified artifacts.

Viable Starting Points

Not all aviation software carries equal certification burden. The following are realistic entry points for agentic engineering practices today:

DAL D/E software development. Reduced verification objectives, fewer independence requirements. Evidence bundles and evaluation gates provide sufficient assurance credit with standard review.
Ground support equipment (GSE) software. Often not subject to DO-178C at all. Standard manifesto adoption applies.
Test generation and requirements analysis automation. Agents operating at Tier 1 (Observe) to generate candidate test cases, requirements traceability matrices, and coverage analyses. Output is reviewed and accepted by qualified personnel -- no tool qualification required.
Traceability automation and evidence bundle assembly. Agent-assembled traceability data and certification evidence packages. Human validates completeness. High-value, low-risk application.
Formal proof assistance (DO-333 credit). Agents generate proof obligations or proof scripts for formal verification tools. The tool (Astree, Frama-C, etc.) provides the deterministic verification. Agent output is checked by the prover, not by human review alone.
DO-278A AL-4 through AL-6 systems. Lower assurance levels with proportionally reduced verification burden. Natural pilot domain.

Open Regulatory Questions

The following questions do not have settled answers as of this writing. Organizations should track developments from FAA, EASA, SAE G-34, and EUROCAE WG-114.

Certification authority stance on agent-generated airborne software. No published policy exists specifically addressing LLM-generated code in DO-178C certification. Current guidance is interpreted through existing tool qualification (DO-330) frameworks.
Issue Paper likelihood. Novel technologies in certification programs typically trigger FAA Issue Papers or EASA Certification Review Items (CRIs). An agentic development approach in a DAL A-C program should anticipate this.
PSAC framing. How to describe agentic engineering practices in the Plan for Software Aspects of Certification without triggering unnecessary concern. Framing agents as unqualified development tools with independent verification is the current pragmatic approach.
Tool qualification evolution for AI-based tools. SAE G-34/EUROCAE WG-114 are developing ARP 6983 (ML in airborne systems) and related guidance. Future standards may provide a path to qualified AI-based development tools that does not exist today.
Multi-model supply chain. When multiple models (routing per P11) are used in a development workflow, the tool qualification and CM implications compound. No guidance exists for multi-model development tool chains.
Memory and learned behavior in development tools. If an agent's learned memory (P6) influences airborne software output, does that memory become lifecycle data under DO-178C Section 7? The conservative position is yes.

ASDLC and APLC Regulatory Guidance

For aviation-specific regulatory requirements mapped to ASDLC Layers 1, 3, and 4, see ASDLC Aviation Domain Guidance.

For agent product regulatory guidance (EU AI Act, EASA, FAA Part 21) applicable to aviation agent products governed by the APLC, see APLC Aviation Domain Guidance.

Mapping the Agentic Engineering Manifesto to medical device regulatory frameworks.

Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to medical device regulatory frameworks. It does not constitute compliance or regulatory advice. Consult qualified regulatory and quality professionals for compliance determinations.

Regulatory currency: This document reflects IEC 62304, EU MDR 2017/745, FDA 21 CFR Part 820 (QMSR, effective February 2026, replacing the prior QSR), and EU AI Act requirements as understood at the time of last review. The EU AI Act implementation timeline is subject to ongoing guidance and proposed amendments; verify current status at eur-lex.europa.eu before relying on AI Act classifications in this document. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.

See also: Companion Frameworks (boundary conditions, ALCOA+ mapping), Agentic V-Model (V-model lifecycle transition for regulated industries).

Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to medical device regulatory requirements; it does not redefine them.

IEC 62304 Safety Class to Manifesto Autonomy Mapping

IEC 62304 safety classification determines documentation depth, verification rigor, and -- in this mapping -- the permissible agent autonomy ceiling. If IEC 62304 is revised, re-evaluate the class mapping rather than assuming the current three-class structure is permanent; until any update is published, map conservatively to the three-class model.

Safety Class	Risk Level	Max Agent Autonomy	Documentation Depth	Evidence Bundle Requirements
Class A (no injury)	Negligible	Tier 1-3 (P5) for non-safety-critical software items; full agentic loop remains subject to the device's risk controls and use-case constraints.	Minimal: requirements + release documentation.	Standard evidence bundles per manifesto phase.
Class B (non-serious injury)	Moderate	Tier 1-2 (P5). Agents propose; humans approve merges.	Moderate: architecture + integration testing required.	Enhanced bundles with SOUP risk analysis per item.
Class C (death / serious injury)	High	Tier 1 only (P5). Agents analyze and propose; humans implement.	Full: detailed design + unit-level verification required.	Complete bundles with SOUP verification, unit-level trace, formal risk linkage.

Notes:

Autonomy ceilings are conservative defaults. Organizations may justify narrower or wider bounds through documented risk-benefit analysis.
Class C Tier 1 restriction means agents assist with analysis, traceability matrix generation, and test scaffolding -- not code generation for safety-critical paths.
If the 2026 IEC 62304 update merges Class A and B into a single class, re-evaluate the Tier 1-2 boundary for the merged class based on the updated documentation requirements.
Evidence bundle requirements scale with safety class. Class C bundles must include unit-level traceability from requirement through design, implementation, and verification -- satisfying IEC 62304 Clause 5.6 in full.

IEC 62304 Software Lifecycle to Manifesto Mapping

IEC 62304 Activity	Clause	Manifesto Equivalent	Principle	Alignment	Gap
Software development planning	5.1	Specification + Plan phases of Agentic Loop	P2, P5	Strong. Living specifications exceed static plans.	Plans must be frozen at submission; manifesto assumes evolution. Snapshot mechanism needed.
Software requirements analysis	5.2	Specify phase; machine-readable specs	P2	Strong. Machine-readable specs satisfy traceability.	Requirements must include safety requirements traced to risk analysis (ISO 14971 linkage).
Software architectural design	5.3	Design phase; domain boundaries (P3)	P3	Strong. Enforced boundaries map to software items.	Architecture must decompose to software items with assigned safety classes.
Software detailed design	5.4	Design phase (Class C depth)	P3	Partial. Manifesto does not mandate unit-level design docs.	Class B/C require detailed design for each software unit. Agents can generate but humans must verify.
Unit implementation	5.5	Execute phase	P4, P5	Partial. Agent execution replaces human coding.	Agent-as-tool qualification is unresolved (see Open Questions).
Unit verification	5.6	Verify phase; evaluation portfolio (P8)	P8	Strong. Evaluation gates exceed minimal unit test requirements.	Must include static analysis, code review equivalent, and SOUP verification.
Integration and integration testing	5.7	Verify phase; integration evaluations	P8, P9	Strong. Traces reconstruct cross-component interactions.	Integration must verify software item interfaces per architectural design.
System testing	5.8	Validate phase	P1, P8	Strong. Outcome-based validation aligns directly.	System tests must trace to software requirements (5.2).
Software release	5.9	Govern phase; release evidence bundle	P12	Strong. Evidence bundles with accountability satisfy release criteria.	Release must include version identification, known anomalies, and SOUP list.
Software maintenance	5.10	Learn + Govern phases; living specifications	P2, P6	Strong. Continuous loop exceeds reactive maintenance.	Problem and modification analysis must follow change control procedures.

ISO 14971 Risk Management to Manifesto Mapping

ISO 14971 Element	Clause	Manifesto Mechanism	Alignment
Intended use / reasonably foreseeable misuse	4.2-4.3	Specification scope (P2); boundary enforcement (P3)	Strong. Machine-enforced boundaries prevent foreseeable misuse categories.
Hazard identification	5.2	Adversarial testing (P8); chaos testing (P10)	Moderate. Manifesto identifies runtime hazards; clinical hazards require domain expertise outside agent scope.
Risk estimation	5.4	Observability data (P9); incident attribution (P12)	Moderate. Runtime data informs probability estimation; severity requires clinical judgment.
Risk evaluation	5.5	Autonomy tiering (P5); blast-radius limits	Moderate. Risk-based autonomy is philosophically aligned; acceptability criteria require manufacturer determination.
Risk control	6	Defense-in-depth (P3); deterministic wrappers; evaluation gates (P8)	Strong. Layered controls (wrappers + evaluations + observability) map to inherent safety, protective measures, and information for safety.
Residual risk evaluation	7	Evidence bundles; evaluation portfolio completeness	Partial. Manifesto does not explicitly model residual risk acceptance. Requires human risk-benefit judgment.
Production and post-production information	8	Observe + Learn phases; telemetry (P9)	Strong. Continuous observability exceeds traditional post-market surveillance data collection.

ISO/TS 24971-2 (ML-specific risk management): Extends ISO 14971 for ML-based medical devices. Key additions relevant to agentic systems:

Data quality risk: training and inference data quality directly affects agent output quality. Manifesto's context engineering (P7) addresses data curation but does not prescribe medical-device-specific data quality metrics.
Model drift monitoring: the manifesto's Observe phase and evaluation regression gates (P8) detect drift. ISO/TS 24971-2 requires drift to feed back into the risk management file.
Performance degradation detection: continuous evaluation portfolios satisfy this requirement when evaluation thresholds are calibrated to clinically meaningful performance boundaries.
Uncertainty quantification: ISO/TS 24971-2 expects ML systems to characterize output uncertainty. The manifesto does not mandate uncertainty quantification but its evaluation framework can incorporate it.

ISO 13485 QMS to Manifesto Mapping

ISO 13485 Requirement	Clause	Manifesto Mechanism	Notes
Design input	7.3.3	Specifications (P2); machine-readable requirements	Specs must include applicable regulatory requirements, standards, and risk control outputs.
Design output	7.3.4	Evidence bundles; verified artifacts	Outputs must reference design input requirements and include acceptance criteria.
Design review	7.3.5	Govern phase; human accountability (P12)	Named domain owner reviews at each design stage. Agent-generated artifacts are inputs to review, not substitutes.
Design verification	7.3.6	Verify phase; evaluation portfolio (P8)	Evaluation results serve as verification records when traced to design inputs.
Design validation	7.3.7	Validate phase; outcome-based acceptance (P1)	Validation must occur under defined use conditions. Simulated environments require justification.
Design transfer	7.3.8	Release evidence bundle; deployment records	Transfer procedures must ensure design outputs are verified before manufacturing.
Document control	4.2.4	Versioned specifications (P2); immutable evidence bundles	Manifesto versioning satisfies document control if retention and approval workflows are formalized.
Traceability	7.5.9	Trace infrastructure (P9); specification-to-outcome links	Structured traces exceed typical traceability matrices. Must extend to UDI and device identification.
CAPA	8.5.2-3	Learn phase; incident-driven specification updates	Manifesto's "failures are data" philosophy aligns. CAPA records must follow prescribed timelines and formats.
Management review	5.6	Govern phase; accountability (P12)	Requires periodic QMS effectiveness review. Manifesto governance is continuous but must produce discrete review records.
Purchasing controls	7.4	SOUP management; agent-selected dependencies	Supplier qualification applies to SOUP items. Agent-selected dependencies must go through purchasing/supplier evaluation.

SOUP / Agent-as-Tool in Medical Device Context

SOUP Requirements by Safety Class

Requirement	Class A	Class B	Class C
SOUP identification	Required	Required	Required
SOUP risk analysis	--	Required	Required
Published anomaly list review	--	Required	Required
SOUP functional/performance requirements	--	Required	Required
SOUP verification (detailed)	--	--	Required
SOUP qualified via testing	--	Recommended	Required

AI Model as SOUP

In agentic engineering, the AI model exhibits SOUP characteristics that exceed traditional SOUP assumptions:

Non-deterministic: identical inputs may produce different outputs across invocations, violating the implicit SOUP assumption of repeatable behavior.
Version-dependent: model updates change behavior without explicit changelogs, making published anomaly list review impractical.
Opaque anomaly list: failure modes cannot be enumerated a priori; the "published anomaly list" for a foundation model is effectively unbounded.

Agent-Selected Dependencies as SOUP Decisions

When agents select libraries, frameworks, or code patterns during execution, each selection is a SOUP decision that must be captured and evaluated. The manifesto's trace infrastructure (P9) records these selections but does not automatically trigger SOUP evaluation workflows.

Training-Data Patterns as Implicit SOUP

Agent-generated code may incorporate patterns, algorithms, or architectural decisions derived from training data. These constitute implicit SOUP -- code of unknown provenance embedded without explicit dependency declaration.

Manifesto Response

Treat the agent as an unqualified tool. Independently verify all agent output through the evaluation portfolio (P8) and human review (P12). This is consistent with the manifesto's position that agent assertions are never evidence -- only verified outcomes count (P1).

Practical implication: for Class B and C devices, every agent execution that produces deliverable artifacts must include a SOUP impact assessment in the evidence bundle. This assessment identifies any new dependencies introduced, any training-data-derived patterns detected (where feasible), and confirms that independent verification was performed on the output.

See Companion Frameworks -- Boundary Conditions for SOUP treatment in the cross-cutting regulated-industry guidance.

FDA SaMD / GMLP / PCCP

Predetermined Change Control Plan (PCCP)

The FDA PCCP framework for AI/ML-based SaMD requires a pre-specified plan for anticipated modifications. The manifesto's living specifications (P2) and continuous revalidation triggers align structurally:

PCCP Element	Manifesto Mechanism
Description of anticipated modifications	Living specifications with versioned change categories (P2)
Modification protocol (implementation, V&V)	Agentic Loop: Execute, Verify, Validate phases with evidence gates
Real-world performance monitoring plan	Observe + Learn phases; telemetry and drift detection (P9)
Revalidation triggers	Evaluation regression gates (P8); specification change triggers re-verification

Gap: PCCP requires pre-submission of the change control plan. The manifesto's continuous evolution must be bounded by the approved PCCP scope for marketed SaMD.

GMLP Principles to Manifesto Mapping

GMLP Principle	Manifesto Principle	Alignment
Multi-disciplinary expertise	Right-sized swarm (P4); human domain ownership (P12)	Strong
Good software engineering practices	Architecture (P3); evaluations (P8)	Strong
Clinical association and scientific validity	Outside manifesto scope	Gap -- requires clinical expertise
Data quality assurance	Context engineering (P7); knowledge governance (P6)	Moderate
Data management and relevance	Memory curation (P6); versioned data (P7)	Moderate
Computational and statistical rigor	Evaluation portfolio (P8); formal verification	Strong
Study design transparency	Observability (P9); evidence bundles	Strong
Performance assessment across subgroups	Adversarial evaluations (P8)	Moderate
Independent datasets for testing	Evaluation design practice	Moderate -- not explicitly mandated
Monitoring and retraining	Observe + Learn phases (P9, P6)	Strong

Total Product Lifecycle (TPLC)

The FDA TPLC approach for AI/ML SaMD maps directly to the Agentic Loop. Both assume continuous monitoring, learning, and modification rather than a single pre-market snapshot.

TPLC Stage	Agentic Loop Phase	Notes
Planning and development	Specify, Design, Plan	Manifesto specifications serve as the SaMD development plan.
Verification and validation	Execute, Verify, Validate	Evidence bundles document V&V activities per PCCP scope.
Deployment and monitoring	Observe, Learn	Real-world performance monitoring feeds back into specifications.
Modification and revalidation	Govern, Specify (repeat)	PCCP-scoped modifications trigger re-entry into the loop.

The manifesto's loop (Specify-Design-Plan-Execute-Verify-Validate-Observe- Learn-Govern) is a superset of the TPLC cycle. The key constraint: TPLC modifications outside the approved PCCP scope require new regulatory submissions.

EU MDR + AI Act Dual Compliance

Many EU MDR IIa+ devices that incorporate AI will also trigger high-risk AI obligations under the EU AI Act, but the exact classification depends on intended purpose and the applicable AI Act annexes. This typically creates dual compliance obligations.

Requirement Source	Requirement	Manifesto Principle	Notes
AI Act Art. 10	Data governance	P6 (Knowledge/Memory), P7 (Context)	Training, validation, and testing datasets must meet quality criteria. Manifesto's data curation aligns but must be formalized per Annex IV.
AI Act Art. 13	Transparency	P9 (Observability)	Traces and decision reconstruction satisfy transparency requirements. Must include user-facing documentation per AI Act format.
AI Act Art. 14	Human oversight	P5 (Autonomy tiers), P12 (Accountability)	Tiered autonomy with named human owners directly satisfies human oversight requirements.
AI Act Art. 15	Accuracy, robustness, cybersecurity	P8 (Evaluations), P10 (Containment)	Evaluation portfolios and chaos testing address accuracy/robustness. Cybersecurity requires supplementary assessment.
MDR Annex I, Ch. I	General safety and performance	P1 (Outcomes), P3 (Architecture)	Risk-based design with verified outcomes. Clinical performance outside manifesto scope.
MDR Annex II	Technical documentation	P2 (Specifications), P9 (Observability)	Versioned specs + structured traces produce technical documentation artifacts. Format must comply with MDCG guidance.
MDR Art. 83-86	Post-market surveillance / vigilance	P9 (Observability), Learn + Govern phases	Continuous observability exceeds minimum PMS requirements. Vigilance reporting timelines are regulatory obligations outside manifesto scope.

Notes:

Class IIa+ devices with AI components = high-risk AI system automatically under AI Act Article 6(1) via Annex I, Section A. No separate risk classification is needed on the AI Act side.
Notified bodies must assess both MDR and AI Act conformity. A single evidence bundle strategy that satisfies both regimes reduces audit burden. The manifesto's evidence model is designed for this consolidation.
AI Act conformity assessment may be integrated into the MDR conformity assessment procedure. Manufacturers should plan for a single, unified technical file that addresses both sets of requirements.
AI Act Article 9 (risk management) overlaps significantly with ISO 14971. A single risk management file can serve both regimes if it addresses AI-specific risks (bias, drift, opacity) alongside device-level hazards.

Clinical Evidence Boundary

Clinical evaluation (EU MDR Article 61), post-market clinical follow-up (PMCF), and benefit-risk determination are explicitly outside agent scope. These require clinical domain expertise, investigator judgment, and regulatory strategy that agents cannot provide.

Agents may assist with:

Traceability matrix generation between requirements and clinical evidence
Evidence assembly and formatting for clinical evaluation reports
Statistical analysis of post-market surveillance data
Literature search and screening for clinical evaluation

Agents must NOT:

Make clinical judgments or risk-benefit determinations
Generate clinical evidence claims or conclusions
Determine clinical investigation endpoints or study design
Assess clinical significance of post-market data

ALCOA+ Compliance

The manifesto's evidence model satisfies ALCOA+ data integrity requirements by construction. See Companion Frameworks -- ALCOA+ Alignment for the complete mapping table.

For medical device applications, this means evidence bundles produced through governed agentic delivery inherently meet the data integrity expectations of ISO 13485 record-keeping and FDA 21 CFR Part 820 quality system requirements, provided the underlying trace infrastructure is validated.

Key implementation note: the trace infrastructure itself is a computerized system subject to validation under 21 CFR Part 11 / Annex 11. Organizations must validate the evidence capture pipeline before relying on it for regulatory records. The manifesto's observability requirements (P9) provide the functional specification for this validation.

Market-Specific Autonomy Guidance

The IEC 62304 safety class mapping at the top of this document defines the regulatory ceiling. This table adds workflow-level context for common medical device development activities.

Workflow	Safety Class / Risk	Recommended Autonomy	Key Constraint
SaMD — patient-facing clinical decision output	Class C (IEC 62304); High-risk (EU AI Act)	Tier 1 (observe only)	Agent assists analysis; human clinician or qualified reviewer owns every output affecting patient care.
SaMD — Class B device software	Class B	Tier 1-2	Agents draft to isolated branches. Enhanced evidence bundles with SOUP risk analysis.
Class A device software and tooling	Class A	Tier 1-3	Full agentic loop permissible. Standard evidence bundles. Natural pilot domain.
Test generation and requirements traceability	Any class — tool output	Tier 1 (observe)	Agent generates candidate tests and traceability matrices. Qualified personnel review before entry into the DHF/DMR.
Post-market surveillance data analysis	Post-market	Tier 1-2	Agents analyze vigilance data, identify signals, draft initial assessments. Clinical significance determination remains human-owned.
Clinical evidence assembly and formatting	Pre-submission	Tier 1-2	Agents compile CER evidence packages, literature search results, and statistical summaries. Clinical conclusions are human-authored.
IQ/OQ/PQ evidence assembly	Validation	Tier 1-2	Agents assemble qualification evidence packages and format test results. Human qualified person reviews and approves.
CAPA root cause analysis assistance	Quality	Tier 1-2	Agents draft root cause analyses from defect data and trend analysis. Human quality owner approves before closure.

Tool Configuration Notes

How to configure agent tooling to satisfy IEC 62304 traceability and 21 CFR Part 11 / EU Annex 11 audit trail requirements.

Audit Trail Hook Mapping

21 CFR Part 11 §11.10(e) and EU Annex 11 §9 require audit trails for all GxP computerized system activity. Agent configuration should produce:

Regulatory Requirement	Hook Type	What It Produces
Audit trail — every agent action	PostToolUse audit hook	Agent identity, action type, timestamp, trace ID, data accessed
Access controls — authorized agents only	PreToolUse gate hook	RBAC check record; denied requests logged
Electronic signature for GxP record entry	PreToolUse signature hook	Named human approval with timestamp before any record submission
System validation evidence	SessionStart + SessionEnd hooks	Complete session record for IQ/OQ/PQ evidence
Data backup verification	Scheduled hook	Periodic confirmation that trace archive is intact and queryable

Data Classification Enforcement

For Class B/C devices and for GxP records:

Restrict agents to approved MCP servers only. No external API calls for sessions containing patient data or device design data.
Apply HIPAA (US) and GDPR (EU) data handling controls through the infrastructure-level MCP allowlist, not through agent prompts.
The trace infrastructure is a computerized system subject to Part 11 validation. Validate before using as a regulatory record source.

SOUP Detection Integration

Integrate a dependency scanning hook (PreToolUse) that:

Intercepts any new library or framework selection by the agent
Queries the organization's SOUP registry for qualification status
Blocks integration of unqualified SOUP for Class B/C development
Logs all SOUP decisions in the evidence bundle for DHF inclusion

Viable Starting Points

Not all medical device software carries equal certification burden. The following are realistic entry points for agentic engineering practices today:

Class A device software. No injury risk. Full agentic loop permissible. Standard evidence bundles. Natural pilot domain with minimal regulatory overhead.
Test generation for any safety class (Tier 1 observe). Agents generate candidate test cases, traceability matrices, and IEC 62304 §5.6 unit verification scaffolding. Qualified personnel review and accept. No tool qualification required. Applicable to Class B and C.
Post-market surveillance analysis. Agents analyze complaint data, identify adverse event patterns, and draft initial signal assessments. Human clinical reviewer owns the determination. High-value use case with contained blast radius.
Clinical evidence assembly. Agents compile literature search results, summarize clinical data, and format CER draft sections. Clinical conclusions remain human-authored. Reduces evidence assembly cycle time without automating clinical judgment.
Traceability matrix generation. Agent assembles specification-to-test-to-verification matrices from the DHF. Human validates completeness. Strong ALCOA+ alignment; directly supports MDR Annex II technical documentation.
IQ/OQ/PQ evidence packaging. Agents format and assemble qualification evidence packages from evaluation results. Human qualified person reviews before sign-off. Reduces qualification cycle time significantly.

Open Regulatory Questions

The following questions are unresolved in current regulatory guidance and represent areas where industry consensus, standards body clarification, or regulatory precedent is needed:

Agent-as-tool qualification under IEC 62304: Is an AI agent a "software tool" requiring qualification per IEC 62304 Clause 8, or is it SOUP, or something that requires a new classification? Current guidance does not address non-deterministic, general-purpose generation tools.
SOUP classification for continuously-learning systems: IEC 62304 assumes SOUP is versioned and stable between versions. A continuously- learning agent violates this assumption. How should SOUP risk analysis apply when the SOUP item's behavior changes without a discrete version boundary?
Version change revalidation requirements: When the underlying model is updated (e.g., model v1 to v2), what revalidation scope is required? The PCCP framework addresses anticipated modifications but does not explicitly cover infrastructure-level model changes that alter agent behavior without software changes.
FDA / notified body stance on agent-generated SaMD components: No regulatory body has published guidance on whether code generated by AI agents requires different verification than human-written code. The manifesto's position -- that agent output is unverified until independently confirmed -- is conservative but has not been tested in a regulatory submission.

Maps the Agentic Engineering Manifesto principles to pharmaceutical and life sciences regulatory frameworks.

Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to pharmaceutical and life sciences regulatory frameworks. It does not constitute compliance or regulatory advice. Consult qualified regulatory and quality professionals for compliance determinations.

Regulatory currency: This document reflects GAMP 5 (2nd ed. 2022), FDA 21 CFR Part 11, EU Annex 11, ICH Q10, and EMA guidance as understood at the time of last review. FDA and EMA guidance on AI/ML in regulated manufacturing is actively evolving. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.

Related documents: Companion Frameworks (boundary conditions, ALCOA+ mapping) | V-Model Adoption Path | Manifesto Principles

Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to pharmaceutical regulatory requirements; it does not redefine them.

1. GAMP 5 Category Mapping

These mappings are pragmatic classifications, not formal GAMP rulings.

GAMP 5 (2nd edition, 2022) categorizes computerized systems for risk-based validation. The table below maps each category to its agentic engineering equivalent and the manifesto mechanism that governs it.

GAMP Cat	Description	Agent Context	Validation Approach	Manifesto Mechanism
1 -- Infrastructure	OS, databases, networking	Agent runtime infrastructure (container host, network layer, database engine)	Minimal -- qualify as part of platform	P3 architecture enforcement; infrastructure treated as deterministic wrapper
3 -- Non-Configured	COTS used as-is	LLM API consumed without customization; off-the-shelf agent framework with default settings	Verification of output against intended use; supplier documentation leveraged	P8 evaluation portfolios verify outputs; supplier qualification per GAMP Appendix O3
4 -- Configured	Products configured for intended use	Agent system configured via prompts, skills, tool permissions, autonomy tiers	Configuration-focused validation; verify each configured parameter behaves as intended	P2 living specifications; P5 autonomy tiers as validated configuration; P7 context engineering
5 -- Custom	Bespoke software	Agent-generated code; custom tool integrations; bespoke orchestration logic	Full risk-based validation per GAMP lifecycle	P1 outcome evidence; P8 evaluations as contract; P9 structured traces for traceability

Open question: Is the agent system itself Category 3 or Category 4?

An LLM API used with default parameters is arguably Category 3. The same API used with system prompts, configured tools, and autonomy tier enforcement is Category 4. Most production agent deployments are Category 4 at minimum. The categorization determines validation burden and must be justified in the system's validation plan. Where agent-generated code is deployed, that code is Category 5 regardless of the system that produced it.

Agent-selected dependencies. When an agent pulls in a library or framework, it is implicitly making a GAMP categorization decision. The manifesto's P3 (architecture as defense-in-depth) provides the mechanism -- allowlists and tool permissions -- but the GAMP implications must be explicitly addressed: each agent-selected dependency inherits a category and validation obligation that the deploying organization owns.

GAMP 5 2nd edition "critical thinking" alignment. The 2022 revision emphasizes critical thinking over rote compliance -- a philosophy shared with the manifesto. GAMP 5's risk-based approach to validation effort maps to the manifesto's phase-calibrated evidence: higher risk demands more rigorous evaluation, not more documentation.

2. Computer Software Assurance (CSA) Alignment

The FDA's 2022 CSA guidance replaces traditional CSV with risk-based, critical-thinking-driven assurance. This is the strongest alignment point between the manifesto and pharma regulation.

CSV (Traditional)	CSA (2022)	Manifesto Alignment
Document everything	Risk-based documentation	Evidence bundles scaled by risk tier (P1)
Scripted testing only	Unscripted + scripted testing	Evaluation portfolios with adversarial cases (P8)
Compliance theater	Critical thinking	Outcomes over assertions (P1); verified outcomes over fluent assertions
Test to the script	Test to the risk	Phase-calibrated evidence; chaos testing (P10)
Every IQ/OQ/PQ step documented	Assurance commensurate with risk	Autonomy tiers match risk (P5); evidence bundles gated by phase
Scripted execution as proof	Intended use drives assurance	Specification-first approach (P2); validation distinct from verification
Compliance as end-state	Continual assurance	Agentic Loop (Observe, Learn, Govern) as living assurance cycle

Strategic context. The manifesto is an engineering framework that operationalizes CSA's philosophy. CSA calls for risk-based, critical-thinking-driven assurance but does not prescribe the engineering discipline to implement it. The manifesto provides that discipline: specifications as living artifacts (P2), evaluations as contracts (P8), structured traces for auditability (P9), and tiered autonomy calibrated to risk (P5). Organizations struggling to operationalize CSA can adopt the manifesto's engineering practices as a CSA implementation framework.

Most pharma companies understand CSA's intent but lack the engineering practices to execute it. The manifesto fills that gap -- not as a compliance framework, but as the engineering discipline that produces CSA-aligned evidence by construction.

CSA principle-to-manifesto detail.

CSA Principle	Manifesto Implementation
"Assurance activities commensurate with risk"	Phase-calibrated evidence bundles (P1); autonomy tiers scaled to risk (P5)
"Use of unscripted testing"	Adversarial evaluation cases (P8); chaos testing (P10)
"Critical thinking over scripted compliance"	Outcomes as unit of work (P1); evaluations as contract, not checklist (P8)
"Intended use drives assurance"	Specification-first approach (P2); validation distinct from verification (Agentic Loop)
"Leverage supplier testing"	Agent-generated evidence bundles as supplier evidence (P1, P8)

This alignment is structural, not retrofitted. The manifesto's evidence model produces CSA-compatible artifacts as a byproduct of its engineering discipline. Organizations adopting the manifesto for agentic delivery simultaneously produce documentation that satisfies CSA expectations -- without a separate compliance workstream.

3. 21 CFR Part 11 / EU Annex 11 Mapping

Requirement	Regulation	Manifesto Mechanism	Alignment	Gap
Audit trails	Part 11 s 11.10(e); Annex 11 s 9	P9 structured traces -- every agent action produces inspectable trace with decision chain	Good fit	Agent system configuration changes (prompt edits, tier adjustments, tool additions) require their own audit trail beyond action traces
Electronic signatures	Part 11 s 11.50-11.100; Annex 11 s 14	P12 accountability -- humans own outcomes, approvals, risk acceptance	Partial	Agent-produced records entering GxP systems may require legally valid electronic signatures; manifesto does not address signature binding
System access controls	Part 11 s 11.10(d); Annex 11 s 12	P5 autonomy tiers with granular permissions (read/write, deploy scope, data access)	Good fit	--
Closed vs. open system	Part 11 s 11.30	P3 architecture as defense-in-depth; deterministic wrappers around probabilistic AI	Partial	No classification guidance for whether agent systems with external API calls constitute open systems
Data backup and recovery	Annex 11 s 7.1	P6 memory governance -- rollback, provenance, expiration	Partial	Memory governance covers learned memory; GxP backup requirements extend to all system data and configuration
Validation	Part 11 s 11.10(a); Annex 11 s 4	P8 evaluations as contract; evidence bundles per P1	Partial	No explicit IQ/OQ/PQ mapping (see section 7 below)
Operational checks	Part 11 s 11.10(f)	P10 containment engineering -- circuit breakers, rate limits, safe fallbacks	Good fit	--
Authority checks	Part 11 s 11.10(g)	P5 tier enforcement -- actions gated by tier and permission scope	Good fit	--
Record retention	Part 11 s 11.10(c); Annex 11 s 17	P9 trace retention as infrastructure requirement; ALCOA+ "Enduring" criterion	Good fit	Retention periods and format migration for agent traces need specification per GxP context

4. GxP Context Differentiation

GxP Context	Key Regulations	Agent Use Cases	Risk Profile	Recommended Max Autonomy
GMP (Manufacturing)	21 CFR 210/211, EU GMP Annex 11, PIC/S	Batch record review, deviation trending, CAPA root cause analysis, process analytical technology (PAT)	High -- errors affect product quality and patient safety; manufacturing records are legal quality documents	Tier 1 (Observe) -- agents analyze and propose; human executes all GMP record modifications
GLP (Laboratory)	21 CFR Part 58, OECD GLP Principles	Protocol drafting, data analysis, literature review, study report compilation	Medium -- errors compromise study integrity and regulatory submission basis; raw data integrity is absolute	Tier 1-2 (Observe/Branch) -- agents draft in isolation; human reviews and approves; agents must never modify raw data
GCP (Clinical)	ICH E6(R3), 21 CFR 50/56/312, EU CTR	Protocol design assistance, site feasibility, patient matching, medical coding (MedDRA), safety signal detection	Medium-High -- errors affect patient safety or trial integrity; ICH E6(R3) "fit-for-purpose" quality management applies	Tier 1-2 (Observe/Branch) -- agents assist under human governance; causality assessment and patient-facing decisions remain human-owned

Differentiating factor. The manifesto treats "regulated industries" as a category. Pharma practitioners operate in specific GxP contexts with distinct requirements. GMP imposes the heaviest constraints on agent autonomy because manufacturing records are legal quality documents subject to Part 11. GLP permits more agent involvement in analysis but enforces absolute raw data integrity. GCP benefits most from ICH E6(R3)'s "fit-for-purpose" alignment with the manifesto's risk-tiered approach.

Use-case risk graduation. Organizations can adopt agentic engineering incrementally across GxP contexts:

Use-Case Domain	Regulatory Burden	Starting Autonomy	Expansion Path
Drug discovery / research	Low	Tier 2-3	Manifesto applies directly; minimal regulatory overlay
Regulatory affairs	Medium (high value)	Tier 1-2	Dossier assembly, consistency checking; submission content human-approved
Clinical operations (GCP)	Medium	Tier 1-2	Agents assist; ICH E6(R3) fit-for-purpose quality management applies
Pharmacovigilance	Medium-High	Tier 1	Signal detection, ICSR triage; causality assessment human-owned
Manufacturing (GMP)	High	Tier 1	Batch record review, deviation analysis; agent modification of GMP records requires full Part 11 compliance

5. ICH Guidelines Mapping

ICH Guideline	Core Concept	Relevance to Agentic Engineering	Manifesto Alignment
Q8 (Pharmaceutical Development)	Design Space -- operating ranges within which changes do not require regulatory notification	Tier 2 autonomy within established boundaries; agents operate freely within a validated Design Space, escalate outside it	P5 autonomy tiers: Tier 2 (Branch) maps to operation within Design Space; boundary crossing triggers Tier 3 governance
Q9 (Quality Risk Management)	Risk-based approach to quality decisions; severity, probability, detectability	Risk assessment drives autonomy level, evidence requirements, and validation depth	P5 risk-tiered autonomy; P8 phase-calibrated evidence; P11 economics of intelligence (cost of correctness includes risk)
Q10 (Pharmaceutical Quality System)	Continual improvement; knowledge management; management review	Agentic Loop (Observe, Learn, Govern) as a continual improvement engine; P6 knowledge vs. learned memory distinction	P6 knowledge infrastructure; P9 observability for management review; Agentic Loop as PQS implementation mechanism
Q12 (Lifecycle Management)	Established conditions; post-approval changes; reporting categories	Revalidation triggers when agent behavior changes; model version updates as post-approval changes; change classification	P2 living specifications; change control for model versions, prompt modifications, and memory accumulation
E6(R3) (GCP)	"Fit-for-purpose" quality management; proportionate approaches; risk-based monitoring	Risk-tiered governance for clinical agent applications; assurance proportionate to decision impact	P5 autonomy tiers; P8 evaluations scaled to risk; manifesto's risk-based philosophy mirrors E6(R3)'s proportionality principle

6. IQ/OQ/PQ Framework for Agent Systems

The pharma qualification framework maps to the manifesto's engineering practices as follows.

Qualification Stage	Traditional Scope	Agent System Equivalent	Manifesto Mechanism
IQ (Installation Qualification)	System installed per specification; hardware and software verified	Agent runtime installed; model versions locked and documented; tool connections verified; infrastructure (Cat 1) validated; configuration baselines captured	P3 architecture enforcement; P2 versioned specifications; infrastructure as deterministic wrapper
OQ (Operational Qualification)	System operates as intended within specified ranges	Agent produces correct outputs for defined test cases; autonomy tiers enforce correctly; traces capture completely; error handling and escalation paths verified	P8 evaluation portfolios; P5 tier enforcement verification; P9 observability validation
PQ (Performance Qualification)	System performs consistently under production conditions over time	Agent performs reliably under production load and data volumes over an extended period; drift detection active; evidence bundles generated consistently	P9 observability and drift monitoring; P10 resilience under stress; P1 outcome evidence over sustained operation

Mapping note. IQ/OQ/PQ is a sequential qualification framework. The manifesto's Agentic Loop is continuous. In practice, IQ/OQ/PQ establishes the initial validated state; the Agentic Loop (Observe, Learn, Govern) maintains that state through ongoing operation. Requalification is triggered by changes per the organization's change control procedure -- see section 9.

IQ/OQ/PQ evidence mapping.

Stage	Required Evidence (Traditional)	Agent System Evidence (Manifesto)
IQ	Installation records, version logs, configuration screenshots	P2 versioned specification snapshot; P3 infrastructure-as-code manifests; model version hash; tool connection test results
OQ	Test protocols, test results, deviation reports	P8 evaluation portfolio results; P5 tier enforcement test logs; P9 trace completeness verification
PQ	Production run records, performance trending	P9 observability dashboards; P1 evidence bundle consistency over time; P10 resilience metrics under production load

7. Data Integrity for Agent Systems

ALCOA+ is the foundational data integrity framework for pharma and GxP. The manifesto's ALCOA+ mapping in companion-frameworks.md covers software development records. Pharma operational records require additional consideration.

Data Integrity Concern	Regulatory Basis	Agent-Specific Consideration
Agent-generated data as "original" data	21 CFR 211.68; Annex 11 s 8	When an agent generates a calculation, trend, or summary entering a batch record or clinical database, the source record must be defined. The agent's input data and logic trace constitute the source.
Agent-modified data	Annex 11 s 9; Part 11 s 11.10(e)	Audit trail must capture: original value, new value, reason for change, who authorized the change, timestamp. The manifesto's P9 traces cover agent actions; the authorization chain (P12) must link to a named human.
Metadata preservation	WHO Data Integrity Guidance; PIC/S PI 041	Agents processing GxP data must preserve timestamps, user IDs, system IDs, and audit metadata. Transformation or reprocessing must not corrupt metadata.
Data access classification	P5 autonomy tiers	Which GxP data can agents access? Define per data classification: read-only for raw data (GLP), read-only for batch records (GMP), read-write for draft documents only, no access to restricted patient-level data without additional controls.

Data access matrix by GxP context.

Data Type	GMP Access	GLP Access	GCP Access	Rationale
Raw / source data	Read-only	Read-only (absolute)	Read-only	Raw data integrity is non-negotiable across all GxP contexts
Batch records	Read-only	N/A	N/A	Legal quality documents; modifications require human execution with Part 11 signatures
Draft documents	Read-write	Read-write	Read-write	Agents draft; humans review and approve before documents enter the quality system
Calculated / derived data	Read-write with trace	Read-write with trace	Read-write with trace	Agent must log input data, algorithm, and output; source traceability required
Patient-level data	N/A	N/A	Read-only with controls	Additional access controls, anonymization, and data protection requirements apply

8. Supplier Qualification

Pharma requires supplier qualification for all critical suppliers of GxP computerized systems.

Supplier Qualification Aspect	Agent-Specific Consideration
Vendor audit	LLM providers and agent framework vendors require assessment. Audit scope should include: data handling practices, model versioning, availability SLAs, security posture.
Quality agreement	Agreements with model providers must address: version notification, deprecation timelines, data confidentiality, uptime commitments, incident notification.
Ongoing performance monitoring	P9 observability provides richer monitoring data than traditional supplier review. Track: output quality drift, latency changes, availability, cost-per-query trends.
Open-source models	No traditional "supplier" exists. The deploying organization assumes full supplier responsibility: validation, maintenance, version control, incident response. Document this in the validation plan.
Multi-vendor routing	P11 economics-aware routing means multiple model providers. Each requires qualification. Routing logic itself is a validated configuration (GAMP Cat 4).

Open regulatory issue: who is the "supplier" for open-source models?

GAMP 5 and EU GMP Chapter 7 assume an identifiable supplier with a quality system. Open-source foundation models have no such entity. The deploying organization must formally document that it assumes supplier responsibilities -- including validation, ongoing monitoring, version control, anomaly tracking, and incident response. This represents a significant resource commitment that must be factored into the build-vs-buy decision for GxP agent deployments.

9. Change Control Considerations

Treat model updates, prompt edits, tool changes, and memory growth as distinct change classes; they carry different validation scopes and requalification burdens.

Change Type	Pharma Change Control Implication	Manifesto Mechanism	Open Question
Model version update	Change to a validated system; requires impact assessment and potential requalification (OQ minimum)	P2 living specifications; revalidation triggers in Agentic Loop	What is the minimum requalification scope for a minor model version change vs. a major version change?
Prompt / specification modification	Configuration change to a Cat 4 system; requires change control record	P2 versioned specifications; P9 traces capture specification version	Should prompt changes follow the same change control rigor as software configuration changes?
Tool addition or removal	System boundary change; may affect GAMP categorization and validation scope	P3 architecture enforcement; P4 swarm topology	Does adding a tool to an agent's toolkit constitute a change requiring full OQ?
Memory accumulation	Agent behavior changes as learned memory grows; this is a novel change type	P6 memory governance -- expiration, rollback, provenance	Is accumulated memory a change requiring change control? At what threshold?
Autonomy tier adjustment	Risk profile change; requires risk assessment and potential requalification	P5 tiered autonomy; P12 accountability	Tier escalation (1 to 2) requires documented risk acceptance. Does de-escalation?
Periodic review	Annual review obligation remains regardless of continuous monitoring	P9 continuous observability provides richer data than traditional periodic review	How does continuous observability supplement or replace the annual periodic review?

10. Viable Starting Points

Not all pharma workflows carry equal GxP burden. The following are realistic entry points for agentic engineering practices today:

Drug discovery and early research (no GxP obligations). Manifesto applies directly with minimal regulatory overlay. Natural pilot domain. Use to build team competency and evidence practices before GxP contexts.
Regulatory dossier consistency checking. Agents cross-check submission sections for internal consistency, identify gaps against CTD format requirements, and flag cross-references. Regulatory affairs professional approves before submission. High-value use case; Tier 1-2 natural ceiling.
Deviation trending and CAPA root cause assistance. Agents analyze deviation databases, identify patterns, and draft initial root cause analyses for human review. Reduces investigation cycle time. No GMP record modification — observe only.
Pharmacovigilance signal detection. Agents analyze ICSR data and literature for emerging safety signals. Qualified pharmacovigilance professional reviews all findings before regulatory reporting. Contained blast radius; significant value at Tier 1 observe.
Protocol drafting assistance (GLP, GCP). Agents draft study protocol sections from templates and prior studies. Principal investigator or sponsor review and approval before finalization. Strong alignment with ICH E6(R3) "fit-for-purpose" quality management.
IQ/OQ/PQ evidence assembly. Agents format and compile qualification evidence packages from evaluation results. Qualified person signs off. Reduces validation cycle time while preserving human accountability for all quality decisions.

11. Hard Autonomy Caps

The following caps apply regardless of organizational maturity phase. They are derived from GxP data integrity requirements, not from risk preference.

Use Case	Maximum Tier	Regulatory Basis	Key Constraint
GMP batch record modification	Tier 1 (observe only)	21 CFR 211.68; EU GMP Annex 11 §9; Part 11	Batch records are legal quality documents. Agents may analyze; humans execute all modifications with Part 11 electronic signatures.
GMP manufacturing instructions	Tier 1 (observe only)	EU GMP Chapter 4; 21 CFR 211	Agent may draft; qualified person reviews and approves before issuance to production.
GLP raw data	Tier 1 (observe only)	21 CFR Part 58; OECD GLP Principles	Raw data integrity is absolute. Agents may read; agents must never modify raw data.
GCP patient-facing decisions / causality	Tier 1 (observe only)	ICH E6(R3); 21 CFR 50/56	Causality assessment and any patient safety decision requires qualified human judgment.
Regulatory submission content	Tier 2 max	FDA, EMA submission regulations	Agent drafts and consistency-checks; regulatory affairs professional approves before submission.
Drug discovery / early research	Tier 3 available	Minimal GxP overlay	Standard manifesto adoption applies. No GxP obligations for pre-IND research.
Pharmacovigilance (ICSR triage, signal detection)	Tier 1-2	ICH E2A/E2B/E2C; EudraVigilance	Agent assists signal detection and ICSR assembly; qualified pharmacovigilance professional reviews every case before reporting.

12. Formal Verification Opportunity

Manifesto Principle 8 states: "proofs are a scale strategy." For pharma, formal verification creates value in specific contexts:

Process Analytical Technology (PAT) and Control Strategy

PAT models (ICH Q8, Q10) governing real-time release testing and process control can benefit from formal verification of the control logic:

Process model contracts: Formal preconditions and postconditions on analytical control algorithms can be machine-verified rather than validated through scripted testing alone.
Agent-generated PAT logic with formal proofs: Agent-generated control logic accompanied by machine-checked proofs of correctness properties (no out-of-bounds, monotonicity of response) can produce a stronger validation case than test-only approaches.
FDA CSA alignment: CSA's "use of unscripted testing" and "critical thinking over scripted compliance" principles support replacing exhaustive scripted test matrices with targeted formal verification on critical paths.

Quantitative Structure-Activity Relationship (QSAR) and Pharmacokinetic Models

QSAR models and PK/PD algorithms used in drug development can benefit from:

Formal invariants: Constraints on output ranges, monotonicity of dose-response relationships, and absence of undefined behavior formally verified rather than tested across a finite sample.
Contract-first specification (P2): Specify model constraints as formal contracts before implementation. Agent-generated model code verified against the formal contract by a model checker provides stronger evidence than equivalence testing alone.

Practical Entry Point

Formal methods do not require a full theorem-proving infrastructure. The practical entry is executable specification: write GxP acceptance criteria as machine-checkable assertions (postconditions on calculations, invariants on data ranges). These serve simultaneously as human-readable requirements and automated verification inputs — collapsing the gap between specification and test evidence. This is directly compatible with CSA's intent and eliminates the overhead of scripted test protocol generation.

13. Tool Configuration Notes

How to configure agent tooling to satisfy 21 CFR Part 11 / EU Annex 11 audit trail requirements and GxP data integrity obligations.

Audit Trail Hook Mapping

GxP Requirement	Hook Type	What It Produces
Audit trail — agent actions on GxP data	PostToolUse audit hook	Timestamp, user/agent identity, action type, before/after values, reason
Electronic signatures for GxP records	PreToolUse signature gate	Named qualified person approval with binding electronic signature
System access controls	PreToolUse RBAC hook	Access check record; unauthorized access attempts logged
Configuration change audit trail	PostToolUse config hook	Specification version changes, prompt modifications, tier adjustments logged
Data backup and recovery verification	Scheduled PostToolUse	Periodic archive integrity check
Operational checks (circuit breakers)	PreToolUse system check	Agent health check; blocks execution if system state outside validated range

GxP Data Classification Enforcement

The MCP allowlist (Layer 6 in enterprise configuration) is the primary data residency control for GxP systems:

Data Classification	Agent Access	Routing Constraint
Raw / source data (GMP, GLP)	Read-only	On-premises or validated private cloud only; no external API
Batch records and GMP quality records	Read-only	On-premises only; any agent access logged as a Part 11 event
Draft documents	Read-write with audit trail	Approved models with signed DPA; agent writes to draft state only
Restricted patient-level data	Read-only with additional controls	Anonymization layer required; local inference preferred

GAMP Validation of the Agent Infrastructure

The agent runtime itself is a GAMP Category 4 or 5 system:

IQ evidence: Configuration-as-code (specifications, tool permissions, tier settings, model version pins) captured in the version-controlled configuration repository.
OQ evidence: Evaluation portfolio results (P8); tier enforcement test logs (P5); trace completeness verification (P9).
PQ evidence: Production performance metrics (P9); drift detection records; evidence bundle consistency over time (P1).

The configuration repository is the IQ record. Point auditors to it — it is the answer to "show me your validated system configuration."

14. Open Regulatory Questions

These questions are unresolved at the intersection of agentic engineering and pharma regulation. They are listed here to support regulatory strategy discussions, not to imply that answers exist.

#	Question	Regulatory Context	Manifesto Reference
1	How should agent systems be categorized under GAMP 5 -- Category 3, 4, or a new category?	GAMP 5 (2nd ed.)	P3, P5
2	Do agent-generated GxP records satisfy Part 11 requirements for electronic records?	21 CFR Part 11	P9, P12
3	What validation approach applies to systems whose behavior changes through learning?	GAMP 5; CSA	P6, P8
4	Is a model version change equivalent to a software version change for change control purposes?	EU GMP Annex 11 s 10; ICH Q10	P2
5	Does prompt modification constitute a configuration change requiring formal change control?	GAMP 5 Cat 4; Annex 11 s 10	P2, P7
6	At what point does memory accumulation constitute a change to a validated system?	GAMP 5; Annex 11 s 11	P6
7	Can agent-generated evidence bundles serve as supplier documentation under CSA's "leverage supplier testing" principle?	FDA CSA	P1, P8
8	What constitutes an adequate quality agreement with an LLM provider for GxP use?	EU GMP Chapter 7; ICH Q10	P11
9	How should agent systems be classified as open or closed systems under Part 11?	21 CFR Part 11 s 11.30	P3
10	Does continuous observability (P9) satisfy or supplement periodic review obligations?	EU GMP Annex 11 s 11	P9

These questions reflect the current state of regulatory uncertainty. As regulatory bodies issue guidance on AI in GxP environments, this section should be revisited and questions resolved or refined. Organizations should track FDA, EMA, MHRA, and PIC/S publications for emerging positions.

Appendix A: Alignment Summary by Manifesto Principle

Principle	GAMP 5	CSA	Part 11 / Annex 11	GxP (GMP/GLP/GCP)	ICH Q8-Q12 / E6(R3)
P1 Outcomes	Cat 5 validation evidence	Risk-based documentation	Record retention	Evidence across all GxP	Q10 continual improvement
P2 Specifications	Cat 4 configuration	Intended use drives assurance	--	Protocol / specification management	Q12 established conditions
P3 Architecture	Category boundary enforcement	--	Closed/open system classification	System boundary definition	--
P5 Autonomy	Risk-based validation depth	Assurance commensurate with risk	Access controls; authority checks	Tier caps per GxP context	Q8 Design Space; Q9 risk management
P6 Memory	--	--	Data backup and recovery	Raw data integrity	Q10 knowledge management
P8 Evaluations	Validation testing	Unscripted + scripted testing	Validation of computerized systems	IQ/OQ/PQ framework	E6(R3) fit-for-purpose QM
P9 Observability	--	--	Audit trails; operational checks	Audit trail across GxP	Q10 management review
P10 Containment	--	--	Operational checks	--	Q9 risk controls
P12 Accountability	--	Critical thinking	Electronic signatures	Human ownership of GxP records	E6(R3) sponsor/investigator responsibility

Appendix B: Principle Quick Reference

Manifesto principles referenced throughout this document.

Ref	Principle	Core Concept
P1	Outcomes are the unit of work	Evidence bundles; deployed, instrumented, evaluated
P2	Specifications are living artifacts	Versioned, reviewable, machine-readable
P3	Architecture is defense-in-depth	Deterministic wrappers; enforced boundaries
P4	Right-size the swarm	Topology matched to complexity
P5	Autonomy is a tiered budget	Tier 1 Observe / Tier 2 Branch / Tier 3 Commit
P6	Knowledge and memory are distinct	Knowledge (ground truth) vs. learned memory (heuristic)
P7	Context is engineered like code	Versioned, tested, performance-benchmarked
P8	Evaluations are the contract	Evaluation portfolios; regression gates
P9	Observability covers reasoning	Structured traces; audit trails; interoperability
P10	Assume emergence; engineer containment	Circuit breakers; chaos testing; safe fallbacks
P11	Optimize economics of intelligence	Cost of correctness; dynamic model routing
P12	Accountability requires visibility	Human ownership; incident attribution

ASDLC and APLC Regulatory Guidance

For pharma and life sciences-specific regulatory requirements mapped to ASDLC Layers 1, 3, and 4 (GAMP 5, 21 CFR Part 11, EU Annex 11, GxP validation), see ASDLC Pharma Domain Guidance.

For agent product regulatory guidance applicable to pharmaceutical agent products governed by the APLC, see APLC Pharma Domain Guidance.

Mapping the Agentic Engineering Manifesto principles to financial services regulatory frameworks.

Disclaimer -- This document maps concepts from the Agentic Engineering Manifesto to financial services regulatory frameworks. It does not constitute compliance or regulatory advice. Consult qualified risk, compliance, and regulatory professionals for compliance determinations.

Regulatory currency: This document reflects SR 11-7 / OCC 2011-12, DORA (EU 2022/2554), EU AI Act, GDPR, MiFID II, and SEC/FINRA model risk guidance as understood at the time of last review. Financial services regulation varies significantly by jurisdiction; this document uses conservative cross-jurisdictional defaults, not jurisdiction-specific advice. The EU AI Act implementation timeline and Annex III classifications are subject to ongoing guidance; verify current status before relying on AI Act references here. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.

Preamble

This document is a companion to manifesto.md. It assumes familiarity with the boundary conditions and the Agentic V-Model transition framework. Financial services already operates the governance infrastructure the manifesto demands: model risk management, three lines of defense, change control, audit trails. The bridge to agentic engineering is extension of existing frameworks, not construction of new ones.

Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to financial services regulatory requirements; it does not redefine them.

SR 11-7 / OCC 2011-12 Model Risk Management

SR 11-7 defines model risk management expectations for banking organizations supervised by the Federal Reserve and OCC. Agent systems that influence financial decisions fall within scope when they meet the SR 11-7 definition of "model" -- a quantitative method that processes inputs to produce quantitative estimates used in decision-making.

SR 11-7 Requirement	Manifesto Mechanism	Alignment	Gap
Model development documentation -- design, theory, data, assumptions	P1 evidence bundles, P2 specifications	Partial	Model development rationale (why this model, alternatives considered, limitations) not captured by default evidence bundles. SR 11-7 expects documentation of the conceptual soundness of the approach, not just that it was built and tested.
Independent model validation -- effective challenge by qualified staff	P8 evaluations	Significant gap	SR 11-7 requires organizational independence between developer and validator. The manifesto treats verification as part of the delivery pipeline, performed by the same team. Validation must include conceptual soundness review, not just test execution.
Ongoing monitoring -- backtesting, benchmarking, sensitivity analysis, outcomes analysis	P9 observability, structured traces	Good fit	Agent traces provide richer monitoring data than most current model monitoring infrastructure. Traces capture reasoning chains, not just input-output pairs, enabling deeper performance analysis.
Model inventory and classification -- tiering by materiality, use, and complexity	None	Missing	Every agent system used in financial decisions must be registered, classified by materiality, and tracked in the model inventory. Classification drives validation frequency, monitoring intensity, and governance oversight.
Model risk governance -- roles, escalation, board reporting, risk appetite	P5 autonomy tiers, P12 accountability	Partial	Three Lines of Defense roles and escalation paths not explicitly addressed. Board-level model risk reporting and model risk appetite statements have no manifesto equivalent.
Champion-challenger testing -- parallel execution against alternative approaches	None	Missing	Comparing agent outputs against alternative approaches or incumbent models is not part of the manifesto evaluation framework. Critical for demonstrating that the agent system performs at least as well as the approach it replaces.
Model limitations documentation -- known weaknesses, boundary conditions, compensating controls	None	Missing	Explicit documentation of what the agent system cannot do, known failure modes, conditions under which outputs should not be relied upon, and compensating controls for known limitations.
Vendor model management -- due diligence, ongoing monitoring of vendor models	P11 economics, multi-model routing	Partial	SR 11-7 requires due diligence on vendor models including access to methodology documentation. LLM providers rarely provide the level of transparency SR 11-7 expects for vendor model assessment.

SS1/23 (PRA) addendum. The PRA model risk management principles extend SR 11-7 with several additions relevant to agentic systems:

Model risk appetite defined and approved at board level, with explicit thresholds for model performance degradation and triggers for remediation.
Explicit coverage of AI/ML models, removing ambiguity about whether agent systems are in scope.
Proportionality requirements scaled to model materiality -- not every agent system requires the same validation intensity.
Enhanced expectations for data quality in model inputs, strengthening the link to P7 (context quality as infrastructure).

These additions reinforce the case for P12 (accountability at governance level) and P7 (context quality as infrastructure).

Implementation note. Organizations should map each agent system to the SR 11-7 model tiering framework at the point of registration. Tier 1 (highest materiality) agent systems require annual independent validation, quarterly ongoing monitoring review, and board-level reporting. Lower-tier systems may follow a lighter cadence, but no agent system influencing financial decisions should be exempt from the inventory and governance framework entirely.

The manifesto's evidence bundles (P1) provide a strong foundation for SR 11-7 model documentation, but must be supplemented with:

Conceptual soundness assessment -- why this agent architecture, what alternatives were considered, what are the theoretical limitations.
Outcome analysis -- comparison of agent decisions against actual outcomes over time, with statistical rigor appropriate to the use case.
Sensitivity analysis -- how agent outputs change under varying inputs, context quality, and model provider configurations.

Three Lines of Defense

The Three Lines model is the foundational governance structure in financial services. Any agentic engineering adoption must map to this structure or it will not pass internal governance review.

Line	Traditional Role	Agentic Equivalent	Manifesto Principle	Key Requirement
1st -- Business / Technology	Builds and operates models; owns risk within business domain	Develops agent systems, defines specifications, produces evidence bundles, operates monitoring, manages day-to-day agent performance	P1-P11	Owns first-line risk for agent systems within its domain; responsible for evidence quality and ongoing monitoring; accountable for agent outputs
2nd -- Risk / Compliance	Oversees, challenges, and independently validates; sets risk frameworks and policies	Independently validates agent systems; monitors ongoing performance against risk appetite; challenges autonomy tier assignments; sets agent governance policy	P8 independent validation, P5 autonomy tiers	Must be organizationally independent from 1st line; cannot develop what it validates; sets model risk appetite for agent systems
3rd -- Internal Audit	Provides independent assurance over the governance framework itself	Audits the entire agent governance framework -- specifications, evidence quality, validation independence, trace completeness, policy adherence	P12 accountability, P9 observability	Evidence bundles and traces enable audit; structured data reduces audit cycle time; audit scope includes the governance process, not just the agent output

Segregation of duties. The team that builds and operates the agent system cannot also validate it. This is non-negotiable under SR 11-7 and SS1/23. The manifesto's P8 evaluation framework must be extended to require organizational separation between the first line (development and operation) and second line (independent validation and challenge).

In practice, this means:

First-line teams write specifications, build agent systems, and run evaluations as part of their development process.
Second-line teams independently design validation test cases, execute them without first-line involvement, and issue findings that must be remediated before production deployment.
Third-line teams audit the process: was the segregation real, were findings tracked to closure, did evidence bundles meet the standard.

DORA (Digital Operational Resilience Act)

DORA applies to financial entities operating in the EU and establishes requirements for ICT risk management, incident reporting, resilience testing, and third-party risk management. Agent systems are ICT assets and fall within scope.

DORA Pillar	Articles	Requirement	Manifesto Principle	Alignment
ICT Risk Management	Art. 5-16	Agent systems included in ICT risk framework; business impact analysis for agent failure scenarios; risk identification and classification	P3 defense-in-depth, P5 autonomy tiers	Good fit -- defense-in-depth architecture and tiered autonomy map directly to ICT risk management expectations. Agent failure scenarios should be included in business continuity planning.
Incident Reporting	Art. 17-23	Agent failures classified as ICT incidents; classification by severity; notification to competent authorities within regulatory timelines; root cause analysis	P9 observability	Good fit -- structured traces enable incident classification and root cause analysis. Gap: incident reporting workflow, severity classification taxonomy for agent failures, and regulatory notification timelines are not addressed in the manifesto.
Resilience Testing	Art. 24-27	Scenario testing for agent systems; advanced testing including TLPT for significant entities; testing of ICT tools, systems, and processes	P10 containment, chaos testing	Strong fit -- the manifesto's chaos testing (tool outages, noisy retrieval, adversarial inputs) aligns directly with DORA resilience testing expectations for agent systems. TLPT scenarios should include agent-specific attack vectors.
Third-Party Risk	Art. 28-44	LLM providers as critical ICT third parties; concentration risk assessment; exit strategies; right to audit; sub-outsourcing controls; contractual requirements	P11 multi-model routing	Partial -- multi-model routing mitigates concentration risk by design. Gaps: contractual requirements for LLM providers (SLA, data handling, incident notification), exit planning and portability, sub-outsourcing visibility, right-to-audit clauses in provider agreements.
Information Sharing	Art. 45	Agent-specific threat intelligence sharing with peers, regulators, and industry bodies	P10 containment	Supportive -- the manifesto's containment patterns generate threat intelligence (adversarial inputs, failure modes); no explicit mechanism for sharing this intelligence with the financial services community.

Multi-model routing can be an effective mitigation for DORA concentration risk, but it is not a universal regulatory requirement. Under the third-party risk pillar, concentration risk in a single LLM provider creates regulatory exposure where a single provider outage would impair critical financial functions. P11 (economics of intelligence) therefore serves a dual purpose: cost optimization and DORA concentration risk mitigation. Organizations should document their multi-model routing strategy as a DORA third-party risk mitigation measure where relevant.

Exit planning. DORA requires exit strategies for critical ICT third-party providers. For agent systems, this means: the ability to switch LLM providers without loss of capability, portability of specifications and evaluation suites across providers, and documented fallback procedures when a provider becomes unavailable. P2 (specifications) and P8 (evaluations) support this if they are provider-agnostic by design.

Incident classification for agent failures. DORA requires classification of ICT-related incidents by materiality. Organizations should define agent-specific incident categories:

Severity 1: Agent takes unauthorized action affecting customer accounts, market positions, or regulatory submissions.
Severity 2: Agent produces incorrect output that is detected before downstream impact but indicates a control failure.
Severity 3: Agent performance degradation (latency, accuracy drift) detected through monitoring but within tolerance thresholds.
Severity 4: Agent failure contained by circuit breakers or fallback mechanisms with no downstream impact.

The manifesto's P9 (observability) provides the data needed for classification. The gap is the classification framework itself and the escalation workflow.

EU AI Act

Financial AI systems frequently fall into the high-risk category under Annex III. The mapping below focuses on high-risk system obligations, which apply to most financial use cases involving automated decision-making.

AI Act Requirement	Article	Manifesto Principle	Notes
Risk classification	Art. 6, Annex III	--	Financial AI systems are frequently high-risk: credit scoring, insurance pricing, fraud detection, AML screening. Classification triggers the full set of high-risk obligations.
Risk management system	Art. 9	P3, P5, P10	Defense-in-depth, autonomy tiers, and containment engineering collectively satisfy risk management system requirements. Must be documented as a continuous iterative process.
Data governance	Art. 10	P7 context engineering	Data quality, relevance, representativeness, and freedom from errors. Context quality engineering directly maps. Training data governance for fine-tuned models adds scope beyond P7.
Technical documentation	Art. 11	P1 evidence, P2 specifications	Evidence bundles and versioned specifications satisfy technical documentation. Must include intended purpose, foreseeable misuse, and interaction with other systems.
Record-keeping and logging	Art. 12	P9 observability	Automatic logging of events during system operation. Structured traces exceed this requirement. Logs must enable post-market monitoring and incident investigation.
Transparency and information to deployers	Art. 13	P9 observability	Structured traces satisfy transparency obligations. Traces and documentation must be accessible to deployers in a form they can understand and act upon.
Human oversight measures	Art. 14	P12 accountability, P5 autonomy	Tier-calibrated governance provides graduated human oversight proportional to risk. System must allow human intervention, including ability to override or stop the system.
Accuracy, robustness, cybersecurity	Art. 15	P8 evaluations, P10 containment	Evaluation portfolios address accuracy requirements. Chaos testing addresses robustness. Cybersecurity must cover adversarial attacks specific to agent systems.
Conformity assessment	Art. 43	P1 evidence bundles	Evidence bundles structured to serve as conformity assessment documentation. Financial services AI may require third-party conformity assessment under sector-specific rules.
Post-market monitoring	Art. 72	P9 observability	Ongoing monitoring through traces, evaluation regression tracking, and performance drift detection. Must feed back into the risk management system.

High-risk classification in financial services. Under Annex III, Section 5, the following financial use cases are explicitly listed as high-risk:

Creditworthiness assessment of natural persons.
Risk assessment and pricing for life and health insurance.
Evaluation of credit scoring or establishment of credit scores.

Additional financial use cases may qualify as high-risk under the general criteria in Art. 6(2) when they significantly affect decisions about natural persons. Organizations should conduct a risk classification assessment for each agent system and document the rationale, including cases where the system is determined to be non-high-risk.

SOX Controls for Agent Systems

SOX compliance applies to publicly traded companies and focuses on internal controls over financial reporting. Agent systems that touch financial data, reporting pipelines, or accounting processes fall within scope.

SOX Requirement	Manifesto Mechanism	Alignment
IT General Controls (ITGC)	P3 architecture, P5 autonomy tiers	Good fit -- defense-in-depth and tiered permissions map to ITGC expectations for access management, change management, and operations
Change management -- authorization, testing, approval before deployment	P2 specifications, P1 evidence bundles	Good fit -- evidence bundles with evaluation results, diffs, and deployment IDs exceed most ITGC change management documentation requirements
Access controls -- logical access, authentication, authorization	P5 autonomy tiers, least privilege	Good fit -- tier enforcement and granular permissions (read but not write, deploy to canary but not full rollout) provide stronger access controls than typical role-based models
Audit trails -- who did what, when, and why	P9 structured traces	Strong fit -- traces reconstruct reasoning chains, not just event logs; traces include decision rationale, tool calls, and policy checks
Segregation of duties -- incompatible functions separated	--	Gap -- not explicitly addressed in the manifesto; must be enforced through organizational controls external to the agent system (see Three Lines of Defense above)
Financial reporting integrity -- completeness, accuracy, validity	P8 evaluations	Partial -- evaluation portfolios verify correctness but do not specifically address financial statement assertion-level testing (completeness, existence, valuation, rights, presentation)

Algorithmic Accountability and Explainability

These requirements span multiple regulatory frameworks and represent a cross-cutting concern for any agent system that influences decisions affecting individuals.

Requirement	Source	Manifesto Mechanism	Gap
Right to explanation for automated decisions	GDPR Art. 22	P9 structured traces	Traces provide system-level reasoning reconstruction. Gap: individual-level explainability (why this specific decision for this specific customer) requires purpose-built explanation generation, not raw trace data.
Fairness and non-discrimination testing	Fair Lending (ECOA, FHA), FCA Consumer Duty, EU AI Act Art. 10	P8 evaluation portfolios	No explicit fairness testing, bias detection, or protected-class impact analysis in the manifesto evaluation framework. Evaluation portfolios must be extended with fairness-specific test cases.
Contestability of automated decisions	Consumer protection regulation, FCA Consumer Duty	P12 accountability	No defined process for customer challenge of agent-influenced decisions. Accountability exists but a contestation workflow -- how a customer disputes, how the decision is re-examined, how traces are reviewed -- does not.
Kill switches for algorithmic trading systems	MiFID II Art. 17	P10 containment, circuit breakers	Good fit -- circuit breakers and containment engineering serve as kill switch infrastructure. Must operate in real-time with sub-second latency for trading systems.
Model explainability for supervisory review	SR 11-7, SS1/23	P9 traces, P1 evidence	Partial -- traces explain system-level behavior. Gap: model-level interpretability (feature importance, sensitivity analysis, partial dependence) requires additional tooling beyond manifesto scope.

GDPR Art. 22 in practice. The right not to be subject to solely automated decision-making with legal or similarly significant effects creates a hard constraint on agent autonomy tiers in customer-facing financial decisions. Any agent system that produces a credit decision, insurance pricing determination, or account action must either:

Maintain meaningful human involvement in the decision (not rubber-stamping), which maps to manifesto Tier 1 (observe) or Tier 2 (branch with approval), or
Obtain explicit consent and provide the right to contest, which requires a contestation workflow that the manifesto does not currently define.

The practical implication is that for customer-facing decisions with legal or similarly significant effects, Tier 1 or Tier 2 is the conservative default pending jurisdiction-specific review.

Hard Autonomy Caps

The following caps are regulatory floors — constraints derived from applicable law, not risk preference. A mature Phase 5 organization still cannot exceed these caps for the listed use cases.

Use Case	Maximum Tier	Regulatory Basis	Key Constraints
Credit and insurance underwriting, pricing, limit-setting	Tier 1 (observe only, conservative default)	EU AI Act Annex III §5 (high-risk); GDPR Art. 22; Fair Lending (ECOA, FHA)	Agent may analyze and recommend. Human makes every decision. Full explainability required. Fairness testing mandatory.
Algorithmic trading, execution, market making	Tier 1 (observe only, conservative default)	MiFID II Art. 17; MAR; Reg SCI	Kill switches mandatory and must operate sub-second. Agent cannot execute trades autonomously.
AML/KYC screening, SAR filing	Tier 2 max	AMLD6; FinCEN BSA; Wolfsberg Principles	Human review on every SAR. Agent assists triage and evidence assembly; does not make filing determinations.
Customer credit decisions (lending, card limits)	Tier 1 (observe only, conservative default)	EU AI Act Annex III §5; Consumer Credit Directive	Right to human review of automated credit decisions cannot be waived.
Claims decisioning affecting payout	Tier 1 (observe only, conservative default)	EU AI Act high-risk; FCA Consumer Duty	Agent may triage and summarize. Human adjudicates every claim.
Fraud detection triggering account action	Tier 2 max	Consumer Duty; GDPR	Agent may score and flag. Human authorises account restriction or closure.
Regulatory reporting (drafting, consistency checks)	Tier 2 max	COREP/FINREP; various reporting regulations	Accuracy requirements are absolute. Agent drafts; human approves before submission.
Back-office automation (reconciliation, data entry)	Tier 3 available	SOX (with evidence controls)	Standard manifesto adoption path. Evidence bundles satisfy change management requirements.

These are conservative defaults, not universal legal ceilings; legal review is required for each product and jurisdiction.

Market-Specific Autonomy Guidance

The table below maps common financial services workflows to recommended starting autonomy tiers. These are starting points; actual tier assignments must reflect the organization's risk appetite and regulatory obligations — and must not exceed the hard caps above.

Use Case	Risk Profile	Recommended Starting Autonomy	Key Regulations	Notes
Back-office automation (document processing, reconciliation, data entry)	Low	Tier 1-3	SOX	Standard manifesto adoption path. Evidence bundles satisfy change management. Low regulatory sensitivity allows higher autonomy tiers.
Model development support (quant code generation, research assistance, data exploration)	Medium	Tier 1-2	SR 11-7	Agent output independently validated by model validation team. The agent is a development tool, not the model itself. Output enters the model development lifecycle and is subject to full SR 11-7 validation.
Regulatory reporting (drafting, data aggregation, consistency checks)	Medium	Tier 1-2	Various (COREP, FINREP, FR Y-9C, Call Reports)	High value use case. Agent drafts, human approves. Traces provide audit trail for regulatory examination. Accuracy requirements are absolute -- no tolerance for reporting errors.
AML/KYC (transaction monitoring, customer due diligence, screening)	High	Tier 1-2	AML Directives (AMLD6), FinCEN BSA, Wolfsberg Principles	Human review on every SAR. Agent assists triage and evidence assembly but does not make filing determinations. False negative risk is regulatory and criminal.
Credit and insurance decisioning (underwriting, pricing, limit setting)	High	Tier 1 (observe only)	EU AI Act (high-risk), Fair Lending (ECOA, FHA), Consumer Duty	High-risk AI classification. Agent provides analysis and recommendations; human makes the decision. Full explainability required. Fairness testing mandatory.
Algorithmic trading (execution, market making, systematic strategies)	Highest	Tier 1 (observe only)	MiFID II Art. 17, MAR, Reg SCI	Kill switches mandatory. Agent cannot execute trades autonomously. Real-time monitoring required. Latency constraints may limit agent applicability.

Data Residency and Classification

Customer PII processed through external LLM APIs triggers GDPR cross-border transfer obligations (Chapter V), including adequacy decisions, standard contractual clauses, or binding corporate rules. The Schrems II framework adds requirements for supplementary measures when transferring data to jurisdictions without adequate protection.

Banking secrecy laws in certain jurisdictions (Switzerland, Luxembourg, Singapore, the Cayman Islands) may prohibit sharing financial data with third-party inference providers entirely. These laws operate independently of GDPR and may impose stricter constraints.

Data classification must gate agent access, model routing, and memory retention at the infrastructure level:

Public / Internal: Agent may use any model, including hosted APIs. Standard manifesto adoption applies.
Confidential: Agent restricted to approved models with appropriate data processing agreements. Memory retention subject to data minimization.
Restricted / Secret: Agent restricted to on-premises or private-cloud models only. No external API calls. Memory must not persist beyond session.

This is an infrastructure enforcement concern under P5 (autonomy tiers): data classification becomes an autonomy constraint enforced at the system level, not merely a policy document. The routing layer (P11) must respect classification boundaries -- a cost-optimal route that violates data residency rules is not a valid route.

Agent Tooling Configuration

This section maps the regulatory requirements above to the agent tooling configuration mechanisms described in the manifesto's companion documents. Read this alongside your tool's enterprise configuration guide — neither is sufficient alone.

DORA Article 9 Evidence Chain

DORA Article 9 requires that changes are recorded, tested, and approved before production. Each requirement maps to a specific hook type:

DORA Article 9 requirement	Hook type	What it produces
Changes are recorded	PostToolUse audit logging hook	SIEM record: timestamp, developer, tool calls, trace ID, deployment ID
Changes are tested before deployment	PreToolUse test enforcement hook	Test pass/fail record, coverage threshold evidence
Changes are approved before production	PreToolUse PR gate hook + Layer 7 RBAC	Named approver, approval timestamp, scope of approval
Sensitive data not exposed	PreToolUse data residency enforcement hook	Classification check log, block record if violated
Session activity is auditable	SessionStart hook + transcript centralization	Session initiation record; note: transcripts are stored locally by default and must be centralized via scheduled hook or script

The configuration repository (ai-governance-config or equivalent) is the auditable record of how these controls are configured. Point auditors to the repository — it is the answer to "how do you control what the AI tool can do."

Three Lines of Defense → RBAC Mapping

Line	Role	RBAC role	Hook infrastructure access
1st line — Development	Builds and operates agent systems	Developer role	Development hooks (secrets detection, test enforcement, security scanning)
2nd line — Risk/Compliance	Independent validation; sets risk policy	Read-only access to validation hook configs, or separate validation workspace	Validation hooks only — runs on independent infrastructure, not shared with 1st line
3rd line — Internal Audit	Audits the governance framework	Cost/usage visibility + read access to configuration repository	Audit log hooks output; configuration repository Git history

Segregation note: 2nd-line validation infrastructure must be organizationally separated from development infrastructure — use a separate workspace or tenant for 2nd-line validation execution, not shared infrastructure with the development environment.

MCP Allowlisting as Data Residency Control

The managed MCP policy (Layer 6) is the primary infrastructure control for GDPR cross-border transfer compliance and banking secrecy law requirements:

MCP servers calling external APIs for Confidential/Restricted data must be restricted to approved providers with signed Data Processing Agreements.
MCP servers must be routed through the corporate proxy — not direct external access — so egress is logged and auditable.
The MCP allowlist is the machine-enforced answer to "what third-party systems can the AI access?" Document every approved MCP server with its data classification scope and DPA reference.

Model Version Pinning for Regulatory Stability

During periods of regulatory sensitivity, pin the agent tooling to a specific model version in the managed settings file to prevent behavioral drift:

Q1 regulatory reporting cycles (COREP, FINREP, annual reports)
Year-end close periods
During active supervisory examinations
After any SR 11-7 independent validation that established a behavioral baseline

Behavioral drift between model versions can invalidate a validation baseline. Document pinning decisions with rationale in the configuration repository change log.

ALCOA+ Compliance

The manifesto's evidence model satisfies ALCOA+ data integrity requirements by construction. See Companion Frameworks — ALCOA+ Alignment for the complete mapping table.

For financial services, this means:

SR 11-7 model documentation: Evidence bundles and structured traces provide the "Attributable," "Legible," and "Contemporaneous" criteria that model validation teams rely on for independent review.
SOX audit trails: Traces meet the "Original," "Accurate," and "Complete" criteria required for IT General Controls over financial reporting pipelines.
DORA record-keeping: The "Enduring" and "Available" criteria are satisfied by trace retention infrastructure and queryable audit stores.

Key constraint: the trace infrastructure itself is a production system subject to IT change management. Organizations must version-control their observability configuration and include it in SOX ITGC scope.

Viable Starting Points

Not all financial services workflows carry equal regulatory burden. The following are realistic entry points for agentic engineering practices today:

Back-office automation (SOX-scoped). Reconciliation, data entry, document processing. Lower regulatory sensitivity. Standard evidence bundles satisfy SOX change management. Natural Phase 3→4 pilot domain.
Model development support. Agents assist quant researchers with code generation, data exploration, and analysis. Agent output enters the SR 11-7 model lifecycle and receives full independent validation — the agent is a development accelerant, not a replacement for model governance.
Regulatory reporting — consistency checking. Agents cross-check report data against source systems, flag inconsistencies, and draft narrative sections. Human approves before submission. Traces provide the audit trail regulators expect.
Traceability and evidence assembly. Agents assemble SR 11-7 model documentation packages, DORA incident records, and SOX change management evidence. Reduces cycle time for audits and examinations without automating the decisions themselves.
AML/KYC triage assistance. Agents pre-screen transaction monitoring alerts, assemble supporting evidence, and draft initial assessments. Human reviews every case before disposition. False negative risk means Tier 1-2 is the permanent ceiling, but agent triage can significantly reduce analyst workload.

Open Regulatory Questions

The following questions do not have settled regulatory answers. Organizations adopting agentic engineering in financial services should track these areas and engage with supervisors proactively.

SR 11-7 model inventory scope. Are agent systems "models" under SR 11-7? If an agent uses an LLM to generate risk assessments, is the agent the model, the LLM the model, or both? Inventory classification methodology for agent systems is unsettled. Conservative approach: register the agent system as a model and the underlying LLM as a vendor model.
DORA third-party risk for LLM APIs. When agents call external LLM APIs, does the LLM provider constitute a critical ICT third-party service provider? Concentration risk thresholds and contractual requirements for LLM providers are undefined in current regulatory technical standards.
Champion-challenger methodology. Traditional champion-challenger compares model outputs on identical inputs. Agent systems are non-deterministic and context-dependent. Methodology for meaningful comparison -- including statistical approaches to handle output variability -- is undeveloped.
Regulatory examination expectations. Supervisory examination procedures for agent governance do not yet exist. Early adopters should prepare for ad hoc supervisory inquiries and document governance frameworks defensively. Evidence bundles (P1) and traces (P9) position organizations well for this.
EU AI Act conformity assessment. The interaction between AI Act conformity assessment and existing financial services supervisory frameworks (CRD, MiFID, Solvency II) is not yet clarified by the European Commission. Dual compliance obligations may emerge.

ASDLC and APLC Regulatory Guidance

For financial services-specific regulatory requirements mapped to ASDLC Layers 1, 3, and 4 (SR 11-7, DORA Article 14, EU AI Act Article 15, FCA PS21/3), see ASDLC Financial Services Domain Guidance.

For agent product regulatory guidance (EU AI Act high-risk conformity, GDPR Article 22, SR 11-7 model governance) applicable to financial services agent products governed by the APLC, see APLC Financial Services Domain Guidance.

Mapping the Agentic Engineering Manifesto principles to insurance regulatory frameworks.

Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to insurance regulatory frameworks. It does not constitute compliance or regulatory advice. Consult qualified risk, compliance, and regulatory professionals for compliance determinations.

Regulatory currency: This document reflects Solvency II (Directive 2009/138/EC), EIOPA Guidelines on the Use of AI and ML in Insurance (2021), IDD (Directive 2016/97/EU), GDPR, DORA (EU 2022/2554), FCA ICOBS, and IAIS Insurance Core Principles as understood at the time of last review. Insurance regulation varies significantly by jurisdiction and product line; this document uses conservative cross-jurisdictional defaults. The EU AI Act implementation timeline and Annex III classifications are subject to ongoing guidance; verify current status before relying on AI Act references here. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.

Preamble

This document is a companion to manifesto.md. It assumes familiarity with the boundary conditions and the Agentic V-Model transition framework. Insurance already operates extensive model governance infrastructure — actuarial model validation, internal model approval under Solvency II, conduct oversight under the IDD — and the bridge to agentic engineering is extension of these existing frameworks, not construction of new ones. The manifesto's evidence-bundle and tiered-autonomy model maps well to Solvency II model governance requirements; the principal discipline is making that mapping explicit so internal audit, the actuarial function, and the regulator can follow it.

Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to insurance regulatory requirements; it does not redefine them.

Solvency II Model Governance

Solvency II imposes a three-pillar framework that shapes how agent systems are governed in insurance undertakings.

Pillar 1 — Solvency Capital Requirement (SCR). Agent products used in SCR calculation or that feed the technical provisions are internal models under Solvency II Article 112–127. The internal model approval process (IMAP) requires the undertaking to demonstrate that the model meets the use test, statistical quality standards, calibration standards, profit and loss attribution, validation standards, and documentation standards. An agent product that participates in SCR calculation is subject to all six tests. The APLC's behavioral specification, evaluation portfolio, and composite state manifest together constitute the technical documentation required for internal model approval, but they must be structured to address Solvency II's specific model documentation format and substance requirements.

Pillar 2 — Own Risk and Solvency Assessment (ORSA). The ORSA requires insurance undertakings to assess all material risks — including model risk from AI systems — as part of the ongoing solvency assessment. Agent products used in risk assessment, pricing, or reserve setting contribute to the risk profile that the ORSA must cover. The governance function's model risk assessment must include agent products within scope. The ORSA narrative must be able to explain, to the supervisory authority, how material AI system risks are identified, assessed, and managed.

Pillar 3 — Supervisory Reporting. Solvency II's Quantitative Reporting Templates (QRTs) and the Regular Supervisory Report (RSR) require documentation of models used in the undertaking's operations. Agent products used in SCR calculation or material risk decisions must be disclosed and their governance described. The composite state manifest and the change record chain provide the model identification and versioning information the QRTs require.

Internal model change policy. Solvency II distinguishes major model changes (requiring supervisory approval before implementation) from minor changes (subject to internal governance). The classification of a change as major or minor is a governance decision that must be made at Stage 1 of the APLC for any change to an agent product in Solvency II model scope. The APLC's composite versioning model in aplc/agent-composite-versioning.md provides the change audit trail that demonstrates how each change was classified and governed. A major model change deployed without supervisory approval is a Solvency II compliance breach; the APLC release gate for major model changes must include supervisory pre-approval as a gate condition.

Solvency II Requirement	Manifesto Mechanism	Alignment	Gap
Model use test — model must be central to risk management decisions	P12 accountability; decision authority structure	Partial — the use test requires evidence that the model drives decisions; the agent must be demonstrably used in decision-making, not merely consulted	The use test requires board-level attestation. The accountable human named at Stage 1 must have the standing to provide this attestation.
Statistical quality standards — mathematical methods, data quality, actuarial best practice	P8 evaluation portfolios; P7 context engineering	Strong — evaluation portfolios demonstrate statistical quality; context engineering ensures data quality	Actuarial best practice requires peer review by a Fellow of an actuarial professional body for material models. The independent validation at Stage 3 must include qualified actuarial review.
Calibration standards — calibrated to historical data, consistent with market data	P8 evaluation portfolios	Partial — evaluation portfolios cover correctness; calibration requires specific comparison to observed historical loss data and market pricing benchmarks	Calibration evidence requires actuarial sign-off. The Stage 3 evaluation portfolio must include a calibration section reviewed by the actuarial function.
Validation standards — independent validation of model quality	P8 independent validation; Stage 3 governance	Good fit — the APLC's independent validation gate maps to Solvency II validation requirements	Validation must be conducted by persons independent of model development and with sufficient actuarial expertise. The validation function must be organisationally separate.
Documentation standards — comprehensive model documentation	P2 specifications; P1 evidence bundles; composite state manifest	Good fit — APLC artifacts constitute the model documentation	Solvency II model documentation must follow the format prescribed by the applicable supervisory authority. The APLC produces substantively correct documentation; format compliance may require an additional translation step.
Model change governance — major vs. minor classification	Composite versioning model; Stage 3 release governance	Good fit — the composite versioning model classifies changes and provides the audit trail	Major model changes require supervisory pre-approval. The APLC release gate for Solvency II models must include supervisory notification as a condition before deployment.

EIOPA AI and ML Guidelines

The EIOPA Opinion on Artificial Intelligence Governance and Risk Management (2021) and subsequent guidance establish supervisory expectations for AI and ML use in insurance beyond the Solvency II model governance framework. These guidelines apply to all material AI systems — not only those in SCR calculation — including customer-facing systems for pricing, underwriting, and claims.

Governance expectations. EIOPA expects that insurance undertakings have board-level accountability for AI systems, that AI governance is integrated into the existing risk management framework (not siloed as a separate AI governance team), and that the second line of defence independently challenges AI system outputs. The manifesto's P12 (accountability requires visibility) and P5 (autonomy tiers) directly address these expectations. The named accountable human at Stage 1 is the board-level accountable person EIOPA expects; the second-line independent validation at Stage 3 is the independent challenge function.

Explainability. EIOPA expects that insurance AI systems are explainable to the extent necessary for the supervisory authority to understand their functioning and for affected individuals to understand decisions that affect them. For pricing and underwriting agents, explainability is not just a regulatory aspiration — it is required for conduct compliance (IDD suitability assessment) and for GDPR Article 22 compliance. The behavioral specification at Stage 2 must document the agent's explanation capability as a functional requirement, not a post-hoc documentation task.

Fairness and non-discrimination. EIOPA's guidelines address the risk that AI systems produce indirect discrimination through proxy variables correlated with protected characteristics. Underwriting and pricing agents that use behavioural, geographic, or lifestyle data must be assessed for proxy discrimination against protected characteristics. The evaluation portfolio at Stage 3 must include a fairness assessment for any agent product affecting individual insurance pricing or coverage eligibility.

Data governance. EIOPA expects robust data governance for training and inference data. The context engineering framework (P7) and knowledge governance (P6) map to these expectations. For actuarial data used in pricing and SCR models, the data governance must address lineage, quality, and fitness for purpose in actuarial terms.

EIOPA Expectation	Manifesto Mechanism	Alignment
Board-level AI accountability	P12 named accountable human; governance tier escalation	Strong
Second-line independent challenge	P8 independent validation at Stage 3	Strong
Explainability for supervisors and affected individuals	P9 structured traces; explanation generation capability in behavioral specification	Partial — traces provide system-level explainability; individual-level explanation capability must be designed into the behavioral specification
Fairness and non-discrimination assessment	P8 evaluation portfolios extended with fairness testing	Partial — evaluation portfolios must be explicitly extended with fairness categories
Data governance for AI inputs	P7 context engineering; P6 knowledge governance	Good fit
Ongoing performance monitoring	P9 observability; output quality rate SLO	Good fit

Insurance Distribution Directive (IDD)

The IDD (Directive 2016/97/EU) regulates the distribution of insurance products and imposes requirements on all parties involved in selling or advising on insurance — including, by extension, automated systems that perform advisory functions.

Automated advice and IDD applicability. Whether an agent product constitutes an insurance intermediary under the IDD depends on a jurisdiction-specific analysis of whether the agent is carrying out insurance distribution activities — providing advice on insurance contracts or carrying out other work preparatory to the conclusion of contracts. If yes, the agent product (or its deploying organisation) must satisfy IDD requirements. This determination must be made at Stage 1, not at product launch.

Suitability assessment. The IDD requires that insurance distributors carry out a suitability assessment before providing advice — establishing the customer's demands and needs and ensuring that the advice meets them. An agent product that provides insurance advice must perform the suitability assessment, document it, and provide the customer with a statement describing why the recommendation meets the customer's demands and needs. The behavioral specification at Stage 2 must encode the suitability assessment logic, and the output must include the IDD-required demands and needs statement.

Product oversight and governance (POG). The IDD requires insurance manufacturers to have product oversight and governance procedures that ensure products are designed for a specific target market and distributed accordingly. Agent products used in distribution (customer advisory agents, automated quotation agents) must comply with POG requirements: they must be configured to serve the identified target market, and any change to their configuration that affects the target market or distribution approach is a POG-relevant change requiring the product governance process. The APLC's composite versioning model captures these changes; the release governance must confirm POG compliance for changes affecting distribution.

FCA ICOBS (UK). In the UK market, the FCA's Insurance: Conduct of Business Sourcebook (ICOBS) imposes comparable conduct requirements. Agents providing insurance advice to retail customers must meet ICOBS fair, clear, and not misleading communication standards, provide appropriate information for informed decision-making, and ensure that recommended products are appropriate for the customer. The FCA Consumer Duty (PS22/9) further requires that firms act to deliver good outcomes for retail customers — including customers interacting with automated systems.

IDD / ICOBS Requirement	Manifesto Mechanism	Alignment
Suitability assessment before advice	Behavioral specification (Stage 2): suitability logic as functional requirement	Partial — the suitability logic must be specified and evaluated; the APLC provides the framework but the insurance suitability criteria must come from domain expertise
Demands and needs statement	Output specification in behavioral envelope	Partial — the output format must be specified to produce the IDD-required statement
POG compliance for distribution changes	Composite versioning; release governance compliance documentation condition	Good fit
FCA Consumer Duty — good customer outcomes	Output quality rate SLO calibrated against customer outcome metrics; ongoing monitoring	Partial — SLO calibration must be against customer outcome metrics, not only technical accuracy

Insurance operations involve extensive processing of personal data, including special category data under GDPR Article 9: health data, genetic data, and data concerning disability. This creates specific design constraints for agent products in insurance.

Health data in claims processing. Health insurance claims agents routinely process health data. GDPR Article 9 requires a legal basis for processing special category data; for health insurance claims, the most common bases are explicit consent or processing necessary for an insurance contract. The legal basis must be confirmed before the agent product is designed, and the data processing must be within the scope of the confirmed legal basis. Agent products that expand health data processing beyond the confirmed legal basis create GDPR compliance exposure that cannot be resolved through technical controls alone.

GDPR Article 22 and underwriting decisions. Insurance underwriting decisions based solely on automated processing of special category health or genetic data are subject to GDPR Article 22(4)'s prohibition on solely automated decisions based on special category data. The prohibition applies unless the individual has given explicit consent or Member State law provides for it. For automated underwriting agent products processing health or genetic data, this prohibition is the binding constraint: a human must be in the decision loop for every individual underwriting decision based on special category data — not available for review on request, but actually reviewing and accepting responsibility for the decision.

Data residency and cross-border transfer. Insurance data processing frequently involves cross-border transfers — particularly for international reinsurance, global corporate insurance, and shared service operations. The GDPR Chapter V cross-border transfer requirements must be incorporated into the agent product's data processing design at Stage 1: which data can the agent access, through which infrastructure, and under what transfer mechanism?

Solvency II Model Validation Mapping

Validation Requirement	APLC Mechanism	Notes
Conceptual soundness — mathematical and actuarial basis documented	P2 behavioral specification with actuarial rationale	The behavioral specification must document the actuarial methodology underlying the agent's outputs — not only what it does but why it is actuarially sound
Empirical validation — model outputs compared to observed outcomes	P8 evaluation portfolio with backtesting against historical claims and pricing data	Backtesting against realised outcomes is an actuarial discipline that must supplement the standard evaluation portfolio
Sensitivity and scenario testing	P8 adversarial evaluations; scenario-based test cases	Standard manifesto evaluation portfolios must be extended with actuarial scenario tests and stress tests relevant to the SCR calculation
Independent validation by persons not involved in development	Stage 3 independent validation gate	The validator must have actuarial qualifications appropriate to the model's subject matter for Solvency II compliance
Annual model validation cycle	Stewardship model; quarterly health review	The annual validation cycle must produce a formal validation report addressed to the board and the supervisory authority; the steward's monitoring data feeds into this report

Hard Autonomy Caps

The following caps are regulatory floors for insurance use cases — derived from Solvency II, IDD, GDPR, and EU AI Act requirements, not from risk preference. A mature Phase 5 organisation still cannot exceed these caps for the listed use cases.

Use Case	Maximum Tier	Regulatory Basis	Key Constraints
Underwriting decisions for individual cover (personal lines)	Tier 1 (observe only)	EU AI Act Annex III §5(b) (high-risk); GDPR Art. 22 (health/genetic data); EIOPA AI guidelines	Agent may analyse and recommend; human underwrites every individual risk. Full explainability required. Fairness testing mandatory.
Claims decisions affecting coverage or payout	Tier 1 (observe only)	EU AI Act high-risk; FCA Consumer Duty; GDPR Art. 22 where health data involved	Agent may triage and summarise; human adjudicates every claim. Right to contestation must be operational, not nominal.
IDD-scope customer advisory (products advice)	Tier 1 (observe only)	IDD suitability requirement; FCA ICOBS	Suitability assessment must be demonstrably connected to individual customer demands and needs. Automated advice without human confirmation is IDD non-compliant in most jurisdictions.
Fraud detection triggering account/claim action	Tier 2 max	Consumer Duty; GDPR	Agent may score and flag; human authorises account restriction or claim suspension.
Pricing optimisation (fleet, commercial, non-personal-lines)	Tier 2 max	EIOPA AI guidelines; indirect discrimination obligation	Agent may optimise; pricing actuary reviews material rate changes before implementation. Proxy discrimination assessment mandatory.
SCR calculation using internal model	Tier 1 (observe only)	Solvency II Art. 112–127; IMAP use test	Agent output is a model input; the actuarial function owns the SCR output. Agent cannot produce the final SCR without actuarial sign-off.
Back-office automation (document processing, data entry)	Tier 3 available	Minimal regulatory overlay	Standard manifesto adoption applies. Not in scope for Solvency II internal model governance unless it feeds risk calculation.

Market-Specific Autonomy Guidance

Use Case	Risk Profile	Recommended Starting Autonomy	Key Regulations	Notes
Claims document triage and classification	Low-Medium	Tier 1-2	FCA Consumer Duty; GDPR	Agent classifies and routes claims; human adjudicates. High-value efficiency use case with contained blast radius.
Fraud pattern detection and alert generation	Medium-High	Tier 1-2	Consumer Duty; GDPR; EIOPA AI guidelines	Agent flags anomalies; human investigates and decides. False negative risk is significant; Tier 1-2 is a permanent ceiling.
Regulatory reporting consistency checking	Medium	Tier 1-2	Solvency II QRT/RSR; DORA	Agent cross-checks data; actuary or reporting function approves before submission.
Actuarial analysis assistance	Medium-High	Tier 1-2	Solvency II validation standards	Agent assists data preparation, model calibration review, and report drafting. Fellow of actuarial professional body reviews and signs off all actuarial outputs.
Policy administration and renewals	Low	Tier 1-3	IDD (unless advice involved)	Standard manifesto adoption for non-advisory administrative tasks. Escalate to Tier 1 if the task involves advice.
Solvency II internal model support	High	Tier 1	Solvency II Art. 112–127; EIOPA guidelines	Agent assists model documentation, validation evidence assembly, and change impact analysis. The model itself and all SCR outputs remain human-owned.

ASDLC and APLC Regulatory Guidance

For insurance-specific regulatory requirements mapped to ASDLC Layers 1, 3, and 4, see ASDLC Insurance Domain Guidance.

For agent product regulatory guidance applicable to insurance agent products governed by the APLC, see APLC Insurance Domain Guidance.

Mapping the Agentic Engineering Manifesto principles to automotive functional safety and software process frameworks.

Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to automotive regulatory frameworks. It does not constitute compliance or certification advice. Consult qualified functional safety engineers and type-approval specialists for compliance determinations.

Regulatory currency: This document reflects ISO 26262:2018, ASPICE 3.1, UN Regulation 157 (ALKS), UN Regulation 155 (cybersecurity), ISO/SAE 21434, and ISO PAS 8800 (draft) as understood at the time of last review. ISO PAS 8800 is under active development; its requirements may change materially before publication. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.

See companion-frameworks.md for boundary conditions on regulated-industry adoption. See adoption-vmodel.md for the V-model adoption path applicable to verification-heavy lifecycles.

Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to automotive regulatory requirements; it does not redefine them.

Scope: ISO 26262, ASPICE (Automotive SPICE), UN Regulation 157 (ALKS), UN Regulation 155 (cybersecurity), ISO/SAE 21434 (cybersecurity), ISO PAS 8800 (AI in road vehicles — under development).

Audience: Functional safety engineers, ASPICE assessors, software leads, and systems engineers evaluating where agentic engineering practices can operate within existing type-approval and functional safety constraints.

Automotive Safety Integrity Level (ASIL) to Manifesto Autonomy Mapping

ISO 26262 assigns Automotive Safety Integrity Levels (ASIL A through D) to safety functions based on Severity × Exposure × Controllability. The mapping below constrains the maximum permissible agent autonomy tier based on the ASIL of the software element under development.

ASIL	Failure Potential	Max Agent Autonomy Tier	Verification Depth	Rationale
ASIL D	Most severe	Tier 1 — Observe only	All agent output independently verified through qualified means; Part 6 (software) objectives at ASIL D rigor	No tool credit for unqualified tool output. Agent assists analysis and proposes; qualified engineer authors and verifies.
ASIL C	Severe	Tier 1 — Observe only	Independent verification required; Part 6 ASIL C objectives apply	Same constraint as ASIL D. Reduced objective count does not relax the independence requirement.
ASIL B	Significant	Tier 1-2 — Observe or Branch	Agent may draft artifacts to isolated branches; merge requires qualified human verification against Part 6 ASIL B objectives	Fewer independence requirements at ASIL B. Agent-drafted code and tests are viable when independently reviewed.
ASIL A	Low	Tier 1-3 — Full tier range	Standard evidence bundles (P1) attached to each agent contribution; verification per Part 6 ASIL A objectives	Reduced verification rigor. Agent contributions with evidence bundles can satisfy most objectives with standard review.
QM (Quality Management only)	Negligible safety relevance	Tier 1-3 — Full tier range	Standard manifesto governance; no functional safety objectives	No ASIL applies. Normal manifesto governance is sufficient.

These are conservative defaults for safety-relevant software paths; lower-risk QM and supporting tooling may permit higher autonomy.

ASIL decomposition. ISO 26262 supports ASIL decomposition: an ASIL D requirement may be decomposed into two ASIL B requirements handled by independent channels. In agentic contexts, ASIL decomposition applies to the agent's contribution to each decomposed channel independently — the two-channel independence requirement must be preserved even when agents assist in developing both channels.

Key constraint: ASIL assignment is determined by the hazard analysis and risk assessment (HARA, ISO 26262 Part 3), not by the development team. The ASIL dictates the autonomy ceiling; the team cannot raise it.

ISO 26262 Software Process to Manifesto Mapping

ISO 26262 Part 4 (system level) and Part 6 (software level) govern the development process. The table below maps key activities to manifesto principles.

ISO 26262 Activity	Part / Clause	Manifesto Equivalent	Principle	Alignment	Gap
Initiation of product development at software level	Part 6, §5	Specification scope; autonomy tier assignment	P2, P5	Strong. Machine-readable specifications map to software development plan inputs.	SW development plan must document tool qualification and agent usage as part of the SW development environment.
Specification of software safety requirements	Part 6, §6	Specify phase; machine-readable specs with safety constraints	P2	Strong. Living specifications support traceability to ASIL-allocated safety requirements.	Formal notation may be required for ASIL C/D; agent-drafted formal specs must be independently reviewed.
Software architectural design	Part 6, §7	Design phase; domain boundaries (P3)	P3	Strong. Enforced boundaries map to software component isolation.	ASIL C/D require freedom from interference between components; independent verification of architectural decisions required.
Software unit design and implementation	Part 6, §8	Execute phase; agent generates code	P4, P5	Partial. Agent execution replaces human coding.	Tool qualification (Part 8, §11) applies to tools that automate Part 6 activities. See Tool Qualification section below.
Software unit verification	Part 6, §9	Verify phase; evaluation portfolio (P8)	P8	Strong. Evaluation gates exceed minimum unit test requirements.	Must include static analysis (ASIL B/C/D), code coverage (MC/DC at ASIL D), and review by independent party (ASIL C/D).
Software integration and testing	Part 6, §10	Verify phase; integration evaluations	P8, P9	Strong. Traces reconstruct cross-component interactions.	Integration testing must verify software component interfaces per architectural design.
Verification of software safety requirements	Part 6, §11	Validate phase; outcome-based evidence (P1)	P1, P8	Strong. Outcome-based validation aligns with ASIL-calibrated verification.	Requirements-based testing must trace to every software safety requirement.
Configuration management	Part 8, §7	Knowledge as versioned ground truth (P6)	P6	Strong. Versioned specifications and evidence bundles map to CM objectives.	Agent-generated artifacts must be CM items; model versions must be baselined alongside source code baselines.
Change management	Part 8, §8	Govern phase; autonomy tier gate on changes	P5, P12	Strong. Tier 2 branch-to-merge workflow enforces change management.	Impact analysis for ASIL-relevant changes must be performed by a qualified safety engineer before merge.

ISO 26262 Part 8, §11 — Tool Qualification

ISO 26262 Part 8, §11 determines whether a software development tool requires qualification. This is the primary constraint on agent use in ASIL-relevant development, analogous to DO-330 in aviation.

Tool Confidence Level (TCL) Determination

A tool's Confidence Level (TCL 1, 2, or 3) is determined by:

Tool Impact (TI): Could tool errors remain undetected and cause or contribute to a violation of safety requirements?
Tool Error Detection (TD): Could the error be detected before it could affect the safety of the item?

TCL	Basis	Agent Feasibility
TCL 1	Low tool impact or high detection	Viable. If agent output is always independently reviewed by qualified engineers, the detection probability is high, placing many agent functions at TCL 1.
TCL 2	Moderate tool impact, moderate detection	Viable with constraints. Requires increased confidence measures: use case restrictions, validation of tool use environment, or tool monitoring.
TCL 3	High tool impact, low detection	Challenging. Requires formal tool qualification or use of a pre-qualified tool. Current LLMs are not practical candidates for TCL 3 qualification under present evidence and qualification expectations.

The viable path. Independent human verification of all agent output is the primary mechanism for achieving high TD (tool error detection), which reduces the TCL classification for most agent functions. An agent that generates code which is always reviewed by a qualified engineer before integration typically achieves TCL 1 or TCL 2 — making tool qualification unnecessary for those functions.

ASPICE (Automotive SPICE) Process Alignment

ASPICE is the software process framework used across the automotive supply chain. Most OEM development contracts require ASPICE assessment at Level 2 or 3. Agentic engineering does not conflict with ASPICE — it accelerates several process areas.

ASPICE Process Area	Manifesto Alignment	Agent Contribution
SWE.1 — Software Requirements Analysis	P2 living specifications	Agents assist requirements traceability, consistency checking, and impact analysis
SWE.2 — Software Architectural Design	P3 defense-in-depth	Agents draft architectural views; qualified engineers verify against safety requirements
SWE.3 — Software Detailed Design and Unit Construction	P4/P5 execution with autonomy tiers	Agents generate code at ASIL-appropriate tier; independent review required for ASIL B+
SWE.4 — Software Unit Verification	P8 evaluations as contract	Agent-generated test cases and coverage analysis; qualified engineer reviews before baseline
SWE.5 — Software Integration and Integration Testing	P8/P9 evaluation and observability	Agents generate integration test suites; traces support integration evidence
SWE.6 — Software Qualification Testing	P1 outcome evidence	Agent-assisted test execution and evidence bundle assembly; human qualified by domain approves
SUP.1 — Quality Assurance	P12 accountability	Named domain owner accountable for agent output quality; QA role is independent oversight
SUP.8 — Configuration Management	P6 knowledge as versioned ground truth	Agent artifacts are CM items; model versions tracked alongside software baselines
SUP.10 — Change Request Management	P5 tier enforcement	Tier 2 branch gate enforces change request workflow before integration

UN Regulation 157 (ALKS) and Autonomous Driving

UN Regulation 157 governs Automated Lane Keeping Systems (ALKS) and represents the most developed regulatory framework for autonomous driving functions. It establishes performance requirements that interact directly with agent autonomy tiers.

The fundamental constraint: agents assisting in the development of ALKS software face the highest ASIL assignments (typically ASIL C/D for the safety-relevant functions). All development activity on these functions is subject to the ASIL-based autonomy caps in the first table above.

Agent use cases for ALKS development:

Use Case	Recommended Tier	Notes
Scenario generation for safety validation	Tier 1-2	Agents generate candidate scenarios from failure mode databases. Human safety engineer validates scenario coverage and acceptance criteria.
Simulation test infrastructure	Tier 1-3 (QM functions)	Simulation toolchain is typically QM; standard manifesto adoption applies.
Requirements traceability	Tier 1-2	Agents assemble traceability matrices from system, software, and test requirements. Human validates completeness against ASIL allocation.
Safety case argumentation	Tier 1	Agents may assist structuring the safety case (GSN/CAE format). All safety arguments require human authorship and qualified engineer sign-off.
Regression test suite maintenance	Tier 1-2	Agents update test cases as specifications evolve. Qualified engineer approves changes to safety-relevant test cases.

ISO/SAE 21434 — Cybersecurity Engineering

ISO/SAE 21434 governs cybersecurity engineering for road vehicles, complementing ISO 26262 for safety. Agents introduce specific cybersecurity risk vectors that must be addressed in the Threat Analysis and Risk Assessment (TARA).

Cybersecurity Concern	Manifesto Mapping	Automotive-Specific Note
Agent model supply chain integrity	P3 architecture boundaries	Model provenance, integrity verification, and version pinning. An untrusted model update is a supply chain attack vector affecting the CAL (Cybersecurity Assurance Level) of the affected function.
Prompt injection in development agents	P10 containment	Adversarial inputs to development agents could introduce vulnerabilities in vehicle software. Independent verification (TCL 1 path) is the primary mitigation.
Data exfiltration via agent context	P7 context engineering	Agent context windows may contain CSMS-protected design data or cybersecurity-relevant technical information.
Model routing and multi-vendor supply chain	P11 economics	Each model provider in a multi-model routing setup expands the supply chain; each requires TARA assessment under CSMS obligations.

Market-Specific Autonomy Guidance

Workflow	ASIL / Risk Level	Recommended Autonomy	Notes
ASIL D/C safety-critical software	ASIL D/C	Tier 1 (observe only)	Agent assists analysis and proposes; qualified engineer authors and verifies all artifacts. TCL qualification typically not required due to high TD through independent review.
ASIL B software	ASIL B	Tier 1-2	Agents draft to isolated branches; independent verification required before integration.
ASIL A and QM software	ASIL A / QM	Tier 1-3	Standard evidence bundles sufficient. Natural pilot domain for early adoption.
Test generation (any ASIL)	Tool output only	Tier 1 (observe)	Agents generate candidate test cases, traceability matrices, and coverage analyses. Qualified engineer accepts before baseline. No TCL 3 qualification required at Tier 1.
Simulation and virtual validation	QM context	Tier 1-3	Simulation infrastructure is typically QM. Standard manifesto adoption. High-value domain for accelerating validation campaigns.
Safety case and FMEA support	Safety-critical analysis	Tier 1	Agents assist structuring FMEAs, FTAs, and safety cases. All safety determinations are human-authored and signed off by a qualified functional safety engineer.
ASPICE process documentation	Process improvement	Tier 1-3	ASPICE artifacts (work products) are human-reviewed. Agent-assisted generation reduces cycle time for lower-risk work products.

Viable Starting Points

QM software development. No ASIL obligations. Full agentic loop permissible. Standard evidence bundles. Use to build team competency and evidence practices before taking on ASIL-rated functions.
Test generation for any ASIL (Tier 1 observe). Agents generate candidate unit tests, integration scenarios, and regression cases. Qualified engineer accepts before baseline. High value, low regulatory risk regardless of ASIL level.
ASPICE process documentation. Agent-assisted generation of work products: software development plans, traceability matrices, review records. Human authors and signs off. Reduces ASPICE preparation cycle time significantly.
Simulation scenario generation. Agents generate candidate test scenarios for virtual validation campaigns from failure mode libraries and operational design domain specifications. Safety engineer validates coverage and acceptance criteria.
Requirements traceability automation. Agents assemble specification-to-test-to-verification matrices. Qualified engineer validates completeness. Directly supports ASPICE SWE.4/SWE.5 evidence.
Regression test suite maintenance. As specifications evolve, agents update test cases to reflect changes. Human reviews all changes to safety-relevant test cases before re-baseline.

Tool Configuration Notes

How to configure agent tooling to satisfy ISO 26262 CM obligations and ISO/SAE 21434 cybersecurity requirements.

Configuration Management Hook Mapping

ISO 26262 Part 8, §7 requires that all safety-relevant development artifacts are identified, baselined, and change-controlled. Agent configuration contributes to this:

ISO 26262 CM Objective	Hook Type	What It Produces
Identification of agent-generated artifacts	PostToolUse audit hook	Artifact ID, agent session ID, model version, timestamp, ASIL context
Change control for ASIL-relevant artifacts	PreToolUse gate hook	ASIL classification check; blocks merge to safety-relevant branch without qualified reviewer approval
Problem reporting from evaluation failures	PostToolUse evaluation hook	Evaluation failure record with trace ID; automatic problem report creation
Model version baselining	SessionStart hook	Records model version in session metadata; must match approved baseline

Data and Design Protection

Restrict MCP servers to on-premises or approved endpoints for sessions containing CSMS-protected design data or ASIL-rated requirement documents.
Model version pinning is a CM obligation for ASIL-relevant development: pin to the approved model version in the development environment configuration; any model change requires a change request and ASIL impact assessment.
Apply ITAR/EAR controls (see defense-government.md) if the program involves defense-related content subject to export control.

Open Regulatory Questions

ISO PAS 8800 (AI in road vehicles). ISO PAS 8800 is under active development and will be the primary standard governing AI system development for road vehicles. Its release will clarify tool qualification requirements, autonomy constraints, and evidence requirements for AI-assisted development. Monitor ISO TC22/SC32.
Tool qualification path for AI-based development tools. ISO 26262 Part 8, §11 predates LLM-based development tools. The existing TCL framework can be applied (and the Tier 1 observe approach achieves high TD), but no guidance exists specifically for non-deterministic generation tools. Industry groups (ISO TC22, AUTOSAR) are developing clarifications.
ASIL decomposition and agent-generated dual-channel software. When ASIL decomposition is used to justify agent involvement in both channels, the independence requirement between channels must be preserved at the model, knowledge store, and evaluation infrastructure levels — not just at the code level. Methodology for demonstrating this independence is undeveloped.
UN Regulation 157 / ALKS edge case coverage. The regulation requires demonstration of performance across a defined operational design domain. Agent-generated scenario coverage methodologies for satisfying ODD completeness arguments are not yet standardized.
Memory and learned behavior in development tools. If agent learned memory influences ASIL-rated software output, does that memory become a CM item? The conservative position (consistent with aviation) is yes — but automotive standards do not address this explicitly.

ASDLC and APLC Regulatory Guidance

For automotive-specific regulatory requirements mapped to ASDLC Layers 1, 3, and 4 (ISO 26262, ASPICE, UN Regulation 157, SUMS), see ASDLC Automotive Domain Guidance.

For agent product regulatory guidance applicable to automotive agent products governed by the APLC, see APLC Automotive Domain Guidance.

Mapping the Agentic Engineering Manifesto principles to defense and government regulatory frameworks.

Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to defense and government regulatory frameworks. It does not constitute compliance, legal, or security advice. Consult qualified security officers, program managers, and legal counsel for compliance determinations. Classification obligations vary significantly by program; this document addresses unclassified system development.

Regulatory currency: This document reflects CMMC 2.0, FedRAMP (current marketplace and authorization requirements), NIST SP 800-53 Rev 5, NIST SP 800-171 Rev 2, ITAR (22 CFR 120-130), EAR (15 CFR 730-774), and DoD Instruction 5000.02 as understood at the time of last review. CMMC scoping guidance for AI systems is not yet settled; DIBCAC has not issued definitive guidance on LLM API boundary classification. FedRAMP authorization status for frontier LLM providers is evolving rapidly; verify the FedRAMP marketplace before making infrastructure decisions. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.

See companion-frameworks.md for boundary conditions on regulated-industry adoption.

Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to defense and government regulatory requirements; it does not redefine them.

Scope: CMMC 2.0 (DoD contractor cybersecurity), FedRAMP (federal cloud authorization), NIST SP 800-53 (federal security controls), NIST SP 800-171 (protecting CUI), ITAR (22 CFR 120-130) / EAR (15 CFR 730-774) export controls, DoD Instruction 5000.02 (acquisition).

Audience: Program managers, system security engineers, Authorizing Officials, ISSO/ISSMs, and technical leads evaluating agentic engineering in government and defense contexts.

Primary Constraint: Data Classification

In defense and government contexts, data classification is the primary autonomy constraint, preceding all other considerations. Unlike other regulated industries where data classification is one of several constraints, here it is the governing constraint that determines whether an agent system can be used at all, on what infrastructure, and with what controls.

Data Level	Agent Permissibility	Infrastructure Requirement	Memory Retention
Unclassified / Public	Fully permissible	Standard cloud or on-premises	Standard manifesto TTL policies apply
CUI (Controlled Unclassified Information)	Permissible with controls	FIPS 140-2/3 validated, CUI-authorized environment; FedRAMP High or equivalent	No CUI in external API calls; memory retention subject to CUI handling requirements (32 CFR Part 2002)
Classified (SECRET / TS / TS/SCI)	Not permissible with non-accredited commercial AI systems	Air-gapped, accredited systems only; non-accredited commercial LLM APIs are categorically excluded	No persistence whatsoever outside the accredited system boundary
ITAR / EAR Controlled Technical Data	Permissible only on compliant infrastructure	US-person-only access; no transmission to non-compliant cloud endpoints; Technology Control Plan required	Retention only within ITAR-compliant boundary; model training on controlled data requires authorization

The hard rule: No classified information may enter any commercial AI system, regardless of the system's other security controls. This is not a risk decision — it is a legal obligation under the National Industrial Security Program Operating Manual (NISPOM, 32 CFR Part 2004) and applicable security classification guides.

CMMC 2.0 to Manifesto Mapping

The Cybersecurity Maturity Model Certification (CMMC 2.0) is required for DoD contractors handling Federal Contract Information (FCI) or CUI. Agent systems that process FCI or CUI in the context of DoD work must be assessed as part of the CMMC boundary.

CMMC Level	Applicable To	Manifesto Alignment	Key Requirement for Agent Systems
Level 1 — Foundational	Contractors handling FCI only	Partially aligns with P3 (architecture) and P5 (access control)	17 basic safeguarding practices from FAR 52.204-21. Agents handling FCI must operate within an access-controlled boundary.
Level 2 — Advanced	Contractors handling CUI (most defense contractors)	Strong alignment with P3, P5, P8, P9, P12	110 NIST SP 800-171 practices. Agent systems are in-scope; all CUI flows through agents must be controlled, logged, and auditable. Third-party assessment required for critical programs.
Level 3 — Expert	Critical programs handling CUI	Alignment plus additional requirements	24 additional NIST SP 800-172 practices. Government-led assessment. Agent autonomy tier must be documented in the system security plan.

CMMC Practice Mapping (Level 2 / NIST SP 800-171)

NIST SP 800-171 Practice Family	Manifesto Mechanism	Alignment	Gap
Access Control (3.1.x)	P5 autonomy tiers with granular permissions; MCP allowlist	Strong	Agent-to-agent communications must also be access-controlled; A2A protocols require authorization evidence
Audit and Accountability (3.3.x)	P9 structured traces; PostToolUse audit hooks	Strong	Traces must meet NIST log requirements: user, time, type of event, success/failure, system component. Retention: 3 years for CUI systems.
Configuration Management (3.4.x)	P2 versioned specifications; P6 knowledge baseline	Strong	Model versions, prompt configurations, and tool permission sets are configuration items requiring CM controls. Changes require CM approval.
Identification and Authentication (3.5.x)	P5 tier enforcement; RBAC	Partial	Multi-factor authentication required for CUI access; agent identity (as distinct from human identity) must be established and logged.
Incident Response (3.6.x)	P12 accountability; P9 traces for diagnosis	Strong	CMMC requires documented incident response plan, testing, and reporting to appropriate authorities.
Risk Assessment (3.11.x)	P3 defense-in-depth; P5 blast radius	Moderate	Formal risk assessment of agent systems as part of the CMMC boundary; agent-specific threat vectors must be included.
System and Communications Protection (3.13.x)	P3 architecture; data classification enforcement	Strong	Network segmentation between agent systems handling CUI and those handling unclassified data. MCP traffic requires encryption and access controls.
System and Information Integrity (3.14.x)	P8 evaluations; P10 containment; P3 allowlists	Strong	Agents must not introduce unauthorized software or dependencies; allowlists are the enforcement mechanism. Security alerts from agent anomalies must be monitored.

FedRAMP Authorization for Agent Infrastructure

FedRAMP governs the use of cloud services by federal agencies. If agent infrastructure (model hosting, orchestration, memory storage) runs on a cloud service, that service must be FedRAMP authorized at the appropriate impact level.

FedRAMP Impact Level	Data Sensitivity	Agent Use
Low	Public federal information	Standard manifesto adoption; commercial cloud FedRAMP Low services permissible
Moderate	Most CUI, low-sensitivity PII	Most federal agency agent deployments; FedRAMP Moderate authorization required for all cloud components in the agent boundary
High	Law enforcement, emergency services, financial, health	Strictest cloud requirements; FedRAMP High authorization required; subset of cloud providers qualify

Key implication for agent systems: The LLM API, the orchestration layer, the memory store, and the observability pipeline are all in-scope for FedRAMP if they process federal information. Using a commercial LLM API not on the FedRAMP marketplace for federal agency use is a compliance violation. As of early 2026, a small number of LLM providers have obtained or are pursuing FedRAMP authorization; the landscape is evolving rapidly.

Multi-model routing (P11) and FedRAMP: Each model provider in a multi-model routing setup must be FedRAMP-authorized at the applicable impact level. Routing to a non-authorized provider for cost optimization is not permissible for in-scope federal workloads.

NIST SP 800-53 Security Control Mapping

NIST SP 800-53 is the security controls catalog for federal information systems (required under FISMA). Agent systems used in federal contexts are subject to these controls.

Control Family	Key Controls	Manifesto Mapping	Agent-Specific Note
Access Control (AC)	AC-2 (account management), AC-3 (access enforcement), AC-6 (least privilege)	P5 autonomy tiers, granular permissions	Agent service accounts must be managed identities with least-privilege permissions; periodic access review required
Audit and Accountability (AU)	AU-2 (event logging), AU-9 (protection of audit information), AU-12 (audit record generation)	P9 structured traces	All agent actions are auditable events; trace infrastructure must be tamper-evident and backed up separately from the agent system
Configuration Management (CM)	CM-2 (baseline configuration), CM-6 (configuration settings), CM-8 (information system component inventory)	P2/P6 versioned specifications and baselines	Model versions, agent configurations, and MCP tool connections are CM items; deviations from baseline require CM board approval
Incident Response (IR)	IR-4 (incident handling), IR-6 (incident reporting)	P12 accountability, P9 traces	Agent-related incidents must follow the organizational IR plan; traces support rapid incident diagnosis
Risk Assessment (RA)	RA-3 (risk assessment), RA-5 (vulnerability scanning)	P3/P5/P10 defense-in-depth	Formal risk assessment of agent systems with explicit consideration of AI-specific threat vectors (prompt injection, model poisoning, data exfiltration)
System and Services Acquisition (SA)	SA-4 (acquisition process), SA-9 (external information system services)	P11 multi-model routing	LLM providers are external system services subject to SA-9; supply chain risk management (SR family) applies

ITAR / EAR Export Control

ITAR (22 CFR 120-130) and EAR (15 CFR 730-774) are the primary export control frameworks for defense and dual-use technology. For agent systems in defense development contexts, these are not secondary considerations — they are fundamental constraints on infrastructure architecture.

What Constitutes an Export

Under ITAR/EAR, transmitting controlled technical data to a non-US-person or to a foreign country constitutes an "export" — even if the transmission is digital and within the same organization. Agent systems that process controlled technical data must be designed to prevent inadvertent export.

Agent-Specific Export Control Risks

Risk	Scenario	Manifesto Mitigation
Data exfiltration via LLM API	Agent sends ITAR-controlled design data to a commercial LLM inference API	MCP allowlist restricts all external API calls for ITAR-classified sessions; no external calls allowed
Context window as export vehicle	Agent context containing controlled data is logged or transmitted outside the controlled boundary	Context classification enforcement: sessions with controlled data produce traces that stay within the controlled boundary only
Model training on controlled data	Controlled technical data enters model training pipeline	Explicit prohibition in agent system policy; infrastructure-level prevention via data classification gates
Foreign national access	Agent system accessible to non-US-persons in a multi-tenant cloud	Architecture requirement: ITAR-rated workloads run on US-person-only infrastructure; access controls verified by system security plan

Technology Control Plan Requirements for Agent Systems

Organizations with ITAR/EAR controlled programs must maintain a Technology Control Plan (TCP). Agent systems handling controlled technical data must be explicitly included in the TCP with:

Identification of controlled data flows through agent systems
Access controls preventing foreign national access
Monitoring and audit procedures for agent access to controlled data
Incident reporting procedures for unauthorized disclosure

DoD Acquisition and Authority to Operate

Agent systems used in DoD programs must operate under an Authority to Operate (ATO) granted by the Authorizing Official (AO). The ATO process maps to the manifesto's governance model:

ATO Stage	Manifesto Mechanism	Evidence Required
System categorization	P3 architecture; data classification matrix	FIPS 199 categorization; system boundary definition including all agent components
Security plan development	P5 autonomy tiers; P12 accountability	System security plan documenting agent configurations, autonomy tier assignments, and accountability structures
Security assessment	P8 evaluations; P9 traces	Security control assessment evidence; penetration testing for ASIL C/D equivalent programs
ATO decision	P12 named human accountability	AO decision with explicit residual risk acceptance; agent systems are in-scope for risk determination
Continuous monitoring	P9 observability; P10 containment	Ongoing monitoring plan; automated alerts for agent anomalies; periodic reauthorization

Zero Trust Architecture (ZTA) alignment. DoD has mandated Zero Trust Architecture adoption (DoD Zero Trust Strategy, 2022). Agent systems must be consistent with ZTA principles: assume breach, verify explicitly, use least privilege. The manifesto's tiered autonomy (P5), enforcement at infrastructure level (P3), and comprehensive tracing (P9) are direct implementations of ZTA principles in agentic systems.

Market-Specific Autonomy Guidance

Use Case	Classification Level	Recommended Autonomy	Notes
Unclassified development tooling (code generation, test automation)	Unclassified	Tier 1-3	Standard manifesto adoption. CMMC Level 1 practices apply if FCI is involved.
CUI document processing and analysis	CUI	Tier 1-2	Agents analyze and draft; human reviews before any CUI record is modified or transmitted. FedRAMP Moderate infrastructure required.
Requirements and traceability analysis	Unclassified / CUI	Tier 1-2	High-value use case. Agent assembles traceability matrices; human qualified engineer validates. Evidence bundles support ATO documentation.
ITAR-controlled program development	ITAR technical data	Tier 1 (observe only)	ITAR compliance requires human control over all controlled technical data. Agent may analyze within the controlled boundary; no external API calls.
Classified program development	SECRET / TS	Not permissible	Non-accredited commercial AI systems are categorically excluded from classified programs. No exceptions without accredited system boundary and government authorization. Unclassified still requires access control, auditability, and change management.
Cybersecurity assessment and testing	Varies	Tier 1-2	Agent assists vulnerability analysis and security assessment; ISSO/ISSM reviews and approves all findings before remediation actions.
Logistics and sustainment analytics	Unclassified	Tier 1-3	Non-safety-critical domain; standard manifesto adoption. High-value opportunity for cost reduction.

Viable Starting Points

Unclassified administrative and logistics software. No classified data, no ITAR. Standard manifesto adoption applies. Natural pilot domain for building competency before tackling CMMC/FedRAMP requirements.
CUI document analysis (Tier 1 observe). Agents analyze CUI documents, extract requirements, identify inconsistencies, draft summaries. Human reviews all outputs before any CUI record is modified. FedRAMP Moderate infrastructure required.
Requirements traceability for DoD programs. Agents assemble specification-to-test-to-verification traceability matrices from program documentation. Qualified engineer validates completeness. Directly supports DI-IPSC-81433B data items and ATO documentation.
CMMC evidence and documentation assembly. Agents compile CMMC assessment evidence packages, security plan sections, and POA&M tracking. Reduces CMMC preparation cycle time while keeping human reviewers accountable for all compliance determinations.
Software security analysis (Tier 1). Agents perform static analysis, dependency scanning, and security posture assessment at Tier 1. ISSO reviews all findings. Human authorizes any remediation actions. No agent access to classified or ITAR-controlled components.

Tool Configuration Notes

How to configure agent tooling to satisfy CMMC and FedRAMP audit trail requirements and export control data classification obligations.

Audit Trail Hook Mapping (NIST SP 800-53 AU family)

NIST Control	Hook Type	What It Produces
AU-2 / AU-12 Event logging	PostToolUse audit hook	Agent identity, action type, success/failure, component accessed, timestamp — every event logged
AU-3 Content of audit records	PostToolUse with structured schema	Full structured trace including tool calls, data accessed, decision chain
AU-9 Protection of audit information	Separate audit log infrastructure	Traces written to tamper-evident, separately-backed-up store; no agent write access to its own audit log
AU-11 Audit record retention	Scheduled retention hook	Trace retention: minimum 3 years for CUI systems; format-migrated for long-term preservation

CUI Classification Enforcement

The MCP allowlist (Layer 6 in enterprise configuration) is the primary infrastructure control for CUI data residency:

CUI-handling sessions must be restricted to FedRAMP-authorized or on-premises infrastructure only. No external APIs without authorization.
Session metadata must include a CUI indicator; traces for CUI sessions are handled and retained under CUI requirements.
Data residency enforcement hook: PreToolUse hook checks the data classification of the session context; blocks external API calls for CUI-classified sessions.

Model Version Pinning for ATO Stability

Pin model versions during ATO assessment periods and continuous monitoring reviews:

Behavioral changes from model updates may trigger a change request requiring AO review.
Document model version in the system security plan as a CM item.
Any model change affecting security-relevant behavior (output filtering, tool call behavior) must be assessed for ATO impact before deployment.

Open Regulatory Questions

FedRAMP authorization for frontier LLM providers. The pathway for frontier LLM providers to obtain FedRAMP High authorization is complex and slow. Most commercially capable models are available only at FedRAMP Moderate or below as of early 2026. Monitor FedRAMP marketplace for updates; plan architecture to accommodate the current authorization landscape.
CMMC scoping for agent systems. How far does the CMMC assessment boundary extend for agent systems? Is the LLM API a third-party service subject to SA-9 controls, or is it in-scope as a system component? DIBCAC (Defense Industrial Base Cybersecurity Assessment Center) has not issued definitive guidance on LLM API scoping.
AI in weapons systems and autonomous functions. DoD Directive 3000.09 governs autonomous weapons systems. This document does not address weapons systems development. Teams working on programs subject to 3000.09 should seek specific guidance from the program legal and policy advisors.
Zero Trust and agent identity. DoD's Zero Trust Architecture requires explicit identity verification for every access request. Agent identity (as distinct from human operator identity) must be established in a standards-based way. No published DoD standard addresses agent identity in the ZTA context; the most current guidance treats agents as service accounts, but this is likely insufficient for the autonomy levels the manifesto describes.
CUI in agent training and fine-tuning. Using CUI data to fine-tune models creates a complex set of obligations: the fine-tuned model may "memorize" CUI that could be extracted later. No regulatory guidance addresses this risk specifically. Conservative position: do not use CUI for model fine-tuning without explicit legal and security review.

ASDLC Regulatory Guidance

For defense and government-specific regulatory requirements mapped to ASDLC Layers 1, 3, and 4 (CMMC, FedRAMP, NIST SP 800-53, ATO process), see ASDLC Defense/Government Domain Guidance.

The agentic governance stack

Where to Start (Across the Stack)

Six Values

Twelve Principles

The Agentic Loop

Who Is This For?

Repository Map

1) Beyond Agile (Case for Change)

2) The Manifesto (Normative Core)

3) Implementation Guide

4) Adoption Playbook (Organizational Transition)

5) Domain-Specific Regulatory Alignment

The Agile Manifesto Is Twenty-Five Years Old — and It Shows

Contents

What Is Actually Needed: The Case for a New Agentic Engineering Manifesto

The Urgency

The Four Values — Challenged

1. "Individuals and interactions over processes and tools" — Inverted

2. "Working software over comprehensive documentation" — Dangerous

3. "Customer collaboration over contract negotiation" — Reframed

4. "Responding to change over following a plan" — Incomplete

The Practices — Obsolete

5. Sprint Cadences Are Irrelevant to Machine-Speed Execution

6. Estimation and Velocity Tracking Lose Meaning

7. Human Code Review Becomes the Bottleneck

The Conceptual Gaps — Missing Entirely

8. No Framework for Non-Deterministic Behavior

9. No Concept of Systems That Learn from Their Own Execution

10. No Treatment of Memory as Infrastructure

Casey West's Agentic Manifesto

The SASE Framework (Academic SE 3.0)

The DEV Community "Agentic Manifesto"

The P3 Group's "From Sprints to Swarms"

The AWS Prescriptive Guidance

ISO/IEC 5338:2023

The Agentic AI Foundation and the Emerging Standards Stack

Source Weighting

What is Agentic Engineering?

What This Is — and What It Is Not

The Agentic Loop

What the Loop Produces

What Must Be True Before Entering Specify

The New Way of Working

Scope and Framework Context

What this manifesto covers

Position as shared engineering execution infrastructure

What this manifesto does not cover

How to Read This Manifesto

Contents

Contents

1. Outcomes are the unit of work

2. Specifications are living artifacts that evolve through steering

3. Architecture is defense-in-depth, not a document

4. Right-size the swarm to the task

5. Autonomy is a tiered budget, not a switch

6. Knowledge and memory are distinct infrastructure

7. Context is engineered like code

8. Evaluations are the contract; proofs are a scale strategy

9. Observability and interoperability cover reasoning, not just uptime

10. Assume emergence; engineer containment

11. Optimize the economics of intelligence

12. Accountability requires visibility

The Agentic Definition of Done

Definition of Done for Hardening

Evidence Freshness

Contents

Contents

Principle 1 — Outcomes: Extended Guidance

The Probability-Compounding Problem

Evidence Bundles and Assurance Levels

Principle 2 — Specifications: Extended Guidance

Contract-First Agentic Development

Specifications as Agent-Consumable Artifacts

The Specification-Driven Development Movement

Convergence Criteria

Validation vs. Verification

Requirements Engineering for Agentic Systems

The Architect Pattern: Agent-Generated Specifications

Specifications vs. Constraints

Principle 3 — Architecture: Extended Guidance