The Agentic Engineering Manifesto
Principles for building systems where humans steer intent, agents execute within governed boundaries, and verified outcomes are the only measure that matters.
This is a living document. Agentic engineering is a fast-moving field, and this manifesto evolves continuously — informed by our own practices, what we witness in the field, and the new technologies, trends, and practices that emerge. Contributions are welcome.
Editorial Stance
The Agile Manifesto was written for a world where humans wrote all the code. That world no longer exists.
In agentic workflows, generation, verification, and deployment run at machine speed. Legacy ceremonies — sprint cadence, velocity scoring, manual review-first pipelines — become bottlenecks and blind spots. Early empirical evidence, including the SWE-CI benchmark showing regression rates above 75% per CI iteration across 18 models (arXiv:2603.03823), suggests that agentic systems require purpose-built engineering discipline, not retrofitted Agile ceremonies.
This repository provides a complete alternative: the case for change, the manifesto itself, a companion implementation guide, an organizational adoption playbook, and domain-specific regulatory alignment for six industries.
Six Values
| We value more | We also value |
|---|---|
| Iterative steering and alignment | Rigid upfront specifications |
| Verified outcomes with auditable evidence | Fluent assertions of success |
| Right-sized agent collaboration | Monolithic god-agents |
| Curated, high-signal context and memory | Stateless sessions and noisy memory |
| Tooling, telemetry, and observability | Chat-based heroics |
| Resilience under stress | Performance in ideal conditions |
While there is value in the items on the right, we value the items on the left more.
Twelve Principles
- Outcomes are the unit of work
- Specifications are living artifacts that evolve through steering
- Architecture is defense-in-depth, not a document
- Right-size the swarm to the task
- Autonomy is a tiered budget, not a switch
- Knowledge and memory are distinct infrastructure
- Context is engineered like code
- Evaluations are the contract; proofs are a scale strategy
- Observability and interoperability cover reasoning, not just uptime
- Assume emergence; engineer containment
- Optimize the economics of intelligence
- Accountability requires visibility
See full text in manifesto-principles.md.
The Agentic Loop
Specify → Design → Plan → Execute → Verify → Validate → Observe → Learn → Govern → Repeat
Any phase can trigger a return to an earlier one based on evidence. The loop is the system. The principles are how you keep it honest.
Who Is This For?
| If you are | Start with |
|---|---|
| New to agentic engineering | Beyond Agile → The Manifesto → Adoption Playbook |
| A practitioner implementing now | Twelve Principles → Principle Guidance → Patterns → Adoption Path |
| An engineering leader or change owner | Beyond Agile Landscape → Adoption Roles → Metrics |
| In a regulated industry | Domain Overview → your domain document |
Repository Map
1) Beyond Agile (Case for Change)
- beyond_agile.md: The argument for why Agile is insufficient for agentic systems.
- beyond-agile-failures.md: Ten structural failures in values, practices, and conceptual coverage.
- beyond-agile-landscape.md: Critical comparison of competing manifestos, standards, and frameworks.
- beyond-agile-sources.md: Twenty-three cited sources including academic benchmarks (SWE-CI, Feldt et al.), industry frameworks (AWS, P3 Group, ISO 5338), and practitioner perspectives.
2) The Manifesto (Normative Core)
- manifesto.md: Core values, scope, Agentic Loop, and reading guide.
- manifesto-principles.md: Twelve principles with minimum bars.
- manifesto-done.md: Agentic Definition of Done (seven criteria plus evolvability) and Definition of Done for Hardening (vibe-to-prod path).
- glossary.md: Canonical definitions for all terms used across the manifesto document set.
3) Implementation Guide
- companion-principles.md: Extended guidance and tradeoffs by principle. Includes the Architect–Programmer pattern, evaluation holdout and probabilistic satisfaction, and behavioral vs. structural regression analysis.
- companion-frameworks.md: Maturity spectrum, boundary conditions, and operational definitions.
- companion-patterns.md: Worked patterns and failure patterns.
- companion-re-framework.md: Requirements engineering framework — two-axes classification, behavioral envelopes, and probabilistic assurance targets.
- companion-reference.md: Failure modes and skill requirements.
4) Adoption Playbook (Organizational Transition)
- adoption-playbook.md: Playbook overview and new way of working.
- adoption-roles.md: Role evolution and human-side transition guidance.
- adoption-path.md: Incremental technical adoption path and phase transitions.
- adoption-vmodel.md: V-model-specific adoption path for regulated and verification-heavy organizations.
- adoption-pilot.md: Resistance management and first pilot execution.
- adoption-metrics.md: Success metrics, quarterly review cadence, and failure modes.
5) Domain-Specific Regulatory Alignment
- domains/README.md: Navigation and disclaimers.
- domains/aviation.md: DO-178C, DO-330, DO-333, ARP 4754A.
- domains/medical-devices.md: IEC 62304, ISO 14971, ISO 13485, FDA SaMD.
- domains/pharma.md: GAMP 5, CSA, 21 CFR Part 11, ICH.
- domains/financial-services.md: SR 11-7, DORA, EU AI Act, SOX, Three Lines of Defense.
- domains/automotive.md: ISO 26262, SOTIF, ASPICE.
- domains/defense-government.md: MIL-STD-882, DO-326A, NIST AI RMF.
This is a living document. Contributions are welcome — see CONTRIBUTING.md for guidelines on proposing changes, submitting worked patterns, or reporting issues. See AUTHORS.md for contributors. See LICENSE for terms.
The Agile Manifesto Is Twenty-Five Years Old — and It Shows
In February 2001, seventeen software developers gathered at a ski lodge in Snowbird, Utah, and wrote a document that would reshape an industry. The Agile Manifesto was a rebellion against waterfall bureaucracy, and it won. Its four values — individuals and interactions, working software, customer collaboration, responding to change — liberated a generation of engineers from Gantt charts and heavyweight process.
But the Agile Manifesto was written with an unstated assumption so fundamental that nobody thought to say it aloud: humans write all the code.
Every practice built on the manifesto's assumptions — Scrum's two-week sprints, SAFe's velocity tracking, daily standups, story points, pair programming, retrospectives — is calibrated to the pace, cognition, and coordination needs of human teams. When autonomous agents can generate functional applications in hours, execute legacy migrations during a single flight, and run verification pipelines that dwarf what any human QA team could attempt in a quarter, these practices do not merely feel dated. They become structural liabilities [15].
Steve Jones, in his widely circulated essay "AI Killed the Agile Manifesto," argues that the manifesto is "a great way to screw up in a big way when using Agentic SDLCs at scale" because "Agentic SDLCs are too fast for Agile" [1]. The P3 Group's From Sprints to Swarms white paper goes further, calling established agile frameworks "strategic liabilities" that throttle innovation when confronted with AI-driven workflows, and declaring the daily standup "an exercise in absurdity" when an AI orchestrator knows the precise status of every task at any given microsecond [5].
Not everyone agrees. Jon Kern, one of the original seventeen signatories, describes himself as "smitten" with vibe coding but insists the manifesto "will endure" — arguing that you "need to understand agility more than ever" and should "learn a little bit more about what constitutes the ability to create high-quality software at speed with responsibility" [10]. Martin Fowler, hosting a 25th-anniversary workshop at Thoughtworks, said he doesn't "have a lot of time for manifestos" and that writing a new one is "way too early" — though the workshop itself concluded that test-driven development "has never been more important" and "produces dramatically better results from AI coding agents" [9][16]. According to InfoQ's summary of Forrester's 2025 State of Agile Development report (primary report not publicly available), 95% of surveyed professionals affirm Agile's critical relevance [14].
But relevance and sufficiency are not the same thing. A compass is relevant in a car, but it does not replace the steering wheel. The Agile Manifesto remains relevant as a philosophical compass. It is fundamentally insufficient as an operating system for agentic engineering.
Contents
Where Agile Breaks: Ten Structural Failures
The four Agile values challenged, the practices rendered obsolete, and the conceptual gaps — memory, non-determinism, self-improving systems — that Agile never addressed because it never needed to.
The Existing Manifestos: What They Get Right and What They Miss
A critical review of every competing framework: Casey West's Agentic Manifesto, the SASE Framework, the DEV Community manifesto, P3 Group's "From Sprints to Swarms," the AWS Prescriptive Guidance, and ISO/IEC 5338.
Sources
Sixty cited sources classified by type: press, blog, industry, academic, standard, and internal reference.
What Is Actually Needed: The Case for a New Agentic Engineering Manifesto
Every existing framework gets something right. None of them are complete. The gap is not philosophical — it is operational. The industry needs a manifesto that is simultaneously:
- Philosophical — values that reframe priorities for a probabilistic world
- Principled — concrete engineering principles with minimum bars
- Operational — connected to real tooling and measurable outcomes
- Evolutionary — a maturity spectrum, not a binary switch
The Agentic Engineering Manifesto — six core values, twelve principles mapped to operational tooling, an agentic definition of done, and a maturity spectrum from first adoption to recursive self-improvement — is one attempt to meet this standard. Whether it succeeds is for the engineering community to determine through practice, evidence, and iteration.
What recent agentic-AI research adds is a stronger explanation of why a new discipline is needed. If intelligence at frontier scale is increasingly plural, relational, and organized through internal or external societies of thought, then the engineering problem is no longer "how do we steer one smart assistant?" It becomes: how do we govern distributed cognition across agents, tools, and humans using explicit protocols, evidence, and institutional checks and balances? Likewise, if agents can improve by externalizing reusable skills and refining them through experience, then memory governance is no longer a nice-to-have optimization. It becomes part of the control plane for learning systems [59][60].
The Urgency
Current industry forecasts and long-horizon maintenance benchmarks suggest that agentic delivery fails when teams force probabilistic systems through legacy SDLC workflows [8][22]. The signal is strongest for maintenance-heavy, multi-iteration work, not every software activity. Methodology mismatch is one failure mode among several; cost, governance, tool quality, and organizational incentives also matter.
The failure is not only technical — it is organizational. Enterprises are trying to adopt agentic technology using Agile governance structures designed for human teams. The Sprint Review is the governance checkpoint, but who reviews agent output at machine speed? The Scrum Master is the process guardian, but who governs a recursive feedback loop? The Product Owner is the requirements authority, but who owns the specification that constrains agent behavior across a swarm? These roles do not map to agentic engineering, and the P3 Group's observation that organizations face a choice between evolutionary and revolutionary adoption [5] understates the challenge: most organizations are attempting neither, clinging instead to methodologies calibrated for a world that no longer exists.
The industry is repeating the pattern it followed with every previous paradigm shift: rushing to adopt the new technology while clinging to the old methodology. The two-week sprint does not accommodate machine-speed execution. Story points do not measure probabilistic output. Human code review does not scale to agent-generated volume. Standups do not synchronize digital swarms. Early empirical evidence indicates the scale of the problem: the SWE-CI benchmark, testing 18 models across 100 tasks spanning an average of 233 days of real development history, found that most agents introduce at least one regression in three out of four CI iterations — many of them structural regressions that pass current tests but degrade the codebase's capacity for future change [22].
The question is not whether the Agile Manifesto needs a successor. The question is whether the successor will emerge from principled engineering or from the wreckage of failed projects.
The window is closing. Every month without a coherent engineering discipline for agentic systems is another month of "vibe coding" masquerading as engineering, another month of hallucination loops shipping to production, another month of technical debt accruing at machine speed. The organizations that adopt a rigorous agentic engineering manifesto now — with verified outcomes, governed autonomy, curated memory, defense-in-depth architecture, economics-aware routing, formal verification where risk warrants it, and human accountability at every tier — will define the next era of software. The rest risk becoming Gartner's statistic.
Exploration is a phase. Engineering is a discipline.
The four values challenged, the practices rendered obsolete, and the conceptual gaps Agile never addressed.
See Beyond Agile for the full argument. See the Existing Manifestos for what competing frameworks get right and miss. All references link to Sources.
The failures are not cosmetic. They are structural — rooted in assumptions that no longer hold. Some belong to the Agile Manifesto itself: its four values, written for a human-only world. Others belong to the practices that grew around it — Scrum's sprints, SAFe's velocity tracking, the ceremonies and metrics that became the operational expression of Agile but were never part of the original document. The distinction matters: the manifesto is a philosophical statement; the practices are an implementation. Both break, but for different reasons. These failures are failures of Agile in agentic systems; they are not a claim that Agile is obsolete for all human-led software work.
The Four Values — Challenged
1. "Individuals and interactions over processes and tools" — Inverted
This was Agile's most liberating principle: trust people, not bureaucracy. But in an agentic pipeline, the toolchain is the capability. The choice of orchestration platform, the choice of verification fleet, the choice of memory infrastructure — these are not implementation details. They are architectural decisions that determine what is possible. Using one tool versus another creates fundamentally different operational realities [1].
In agentic systems, processes and tools are now fundamental to success, not obstacles to it. The human's role has shifted from writing code to architecting the environment in which agents write code. The Agile Manifesto's founding value has been inverted by the reality it never anticipated [16].
2. "Working software over comprehensive documentation" — Dangerous
This value was a corrective against waterfall's thousand-page specifications that nobody read. It made sense when humans wrote code deliberately and could explain their reasoning. It is actively dangerous when applied to autonomous agents.
In agentic systems, AI models excel at producing software that appears to work. Jones calls this out directly: "AI is spectacular at building software that looks like it works" but "can create technical debt at a rate that normal developers absolutely couldn't" [1]. The phenomenon — what Andrej Karpathy termed "vibe coding" [7] — generates code satisfying immediate tests while lacking modularity, architectural integrity, and scalability. Without documentation serving as the contractual boundary holding agents accountable, systems hallucinate from legacy training data and corrupt their own operational context [13].
In agentic engineering, documentation is the specification that constrains agent behavior. Architecture Decision Records, formal contracts, constraint files, capability definitions — these are not bureaucratic overhead. They are the machine-readable rules that prevent autonomous systems from optimizing for the wrong thing. The Agile Manifesto's suspicion of documentation becomes negligence when your workforce is probabilistic [36].
3. "Customer collaboration over contract negotiation" — Reframed
Agile rightly elevated direct customer collaboration over adversarial contract negotiations. But in an agentic system, the "contract" is no longer a legal document between humans — it is the machine-readable specification that governs agent behavior. The agent does not collaborate; it executes within constraints. If those constraints are vague, the agent will fill the gaps with its own probabilistic inference — and the customer will receive something nobody specified [3].
Contract negotiation has been reborn as specification engineering: defining precise, testable, machine-enforceable boundaries. The collaboration happens between humans during specification. The contract happens between human intent and agent execution [4][11].
4. "Responding to change over following a plan" — Incomplete
Agile rightly valued adaptability over rigid upfront planning. But it assumed the entity responding to change was a human with judgment, context, and accountability. When agents respond to change, they do so probabilistically — and without the judgment to know when adaptation has become drift [3].
In agentic systems, specifications need to steer behavior and evolve through evidence — something the Agile Manifesto never contemplated. Not rigid plans, but not unconstrained adaptation either. Living specifications that tighten through iterative refinement — specify, execute, evaluate, adjust — with convergence criteria that distinguish productive evolution from scope drift [34].
The four Agile values each assumed a human-only world. But the structural failures extend beyond the values themselves — into the practices, metrics, and ceremonies that Scrum, SAFe, and related frameworks built on top of those values.
The Practices — Obsolete
5. Sprint Cadences Are Irrelevant to Machine-Speed Execution
A two-week sprint assumes human pace. When agents can complete a full development cycle in hours [1], the sprint boundary is not just arbitrary — it is a bottleneck that prevents the system from shipping validated increments the moment they are architecturally sound [5][11].
The replacement is continuous flow with verification gates: agents produce work continuously, and every increment passes through deterministic checks, evaluation harnesses, and proof generation before it advances. The cadence is not time-boxed — it is evidence-gated [4][11].
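A minimal sketch of such an evidence-gated flow, with invented gate names and a plain string standing in for an increment (a real pipeline would run linters, evaluation harnesses, and proof tooling in each gate):

```python
from typing import Callable

# A gate inspects an increment and returns (passed, evidence detail).
Gate = Callable[[str], tuple[bool, str]]

def evidence_gated(increment: str, gates: list[tuple[str, Gate]]) -> tuple[bool, list[str]]:
    """Advance an increment only if every gate passes; collect auditable evidence."""
    evidence = []
    for name, gate in gates:
        passed, detail = gate(increment)
        evidence.append(f"{name}: {'pass' if passed else 'FAIL'} ({detail})")
        if not passed:
            return False, evidence  # stop at the first failing gate; no time-box involved
    return True, evidence

# Illustrative gates: a deterministic check and a stand-in evaluation harness.
gates = [
    ("lint",  lambda inc: ("TODO" not in inc, "no TODO markers")),
    ("evals", lambda inc: (len(inc) > 0, "non-empty increment")),
]
ok, log = evidence_gated("def add(a, b): return a + b", gates)
assert ok and len(log) == 2
```

The returned evidence list is the point: every advancement leaves an auditable trail rather than a time-boxed sign-off.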
6. Estimation and Velocity Tracking Lose Meaning
Story points and velocity metrics assume human cognitive throughput as the constraint. When an agent can generate ten implementations in the time a human would estimate one, velocity tracking measures the wrong thing [5]. The meaningful metric becomes total cost of correctness (the sum of inference spend, verification overhead, and incident remediation when failures escape) — not story points completed [3][18].
This point is worth sharpening because even sophisticated frameworks miss it. McKinsey's AI Transformation Manifesto frames success in terms of EBITDA uplift and return on AI investment — business outcomes, not engineering activity. That framing is correct and the manifesto community should adopt it. But McKinsey's own framework contains no mechanism for how those outcomes are verified at the task level. The missing link is exactly total cost of correctness: the economics-aware routing, verification overhead, and incident cost that determine whether a given investment in AI delivery produces real return or just faster output that fails downstream.
The economics shift runs deeper than metrics. In Agile, cost is simple: developers cost X per sprint, multiply by sprints. In agentic engineering, the cost model is per-token, per-model, per-task — and it varies by orders of magnitude depending on which model is routed to which task. Sending every task to the most capable model is like flying first-class for a cross-town trip; sending every task to the cheapest is like taking a bicycle to the airport. The Agile Manifesto has no vocabulary for economics-aware routing — selecting which model handles which task based on the cost-quality tradeoff — because it never needed one [17]. Reuven Cohen has described a "sudden flip in the cost curve" around mid-2025 — the moment long-horizon agentic swarms became economically feasible, making cost-quality routing not just desirable but essential [21]. The Gartner prediction that 40% of agentic projects will be canceled cites "escalating costs" as a primary driver [8] — this is precisely the problem: organizations burning through inference budgets because they lack cost-quality routing discipline.
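The routing decision can be made concrete with a toy expected-cost model. The model names, per-task costs, and correctness probabilities below are invented for illustration, not benchmark figures:

```python
# Hypothetical model catalogue: (inference cost per task, probability the output is correct).
MODELS = {
    "small":    {"cost": 0.01, "p_correct": 0.80},
    "frontier": {"cost": 0.50, "p_correct": 0.98},
}

def expected_total_cost(model: str, remediation_cost: float) -> float:
    """Total cost of correctness: inference spend plus the expected cost of
    remediating a failure that escapes verification."""
    m = MODELS[model]
    return m["cost"] + (1 - m["p_correct"]) * remediation_cost

def route(remediation_cost: float) -> str:
    """Economics-aware routing: pick the model with the lowest expected total cost."""
    return min(MODELS, key=lambda name: expected_total_cost(name, remediation_cost))

# A task that is cheap to fix favours the small model; a task whose escaped
# failures are expensive favours the frontier model.
assert route(remediation_cost=0.10) == "small"     # 0.03 vs 0.502
assert route(remediation_cost=10.0) == "frontier"  # 2.01 vs 0.70
```

The same arithmetic explains the "escalating costs" failure mode: routing everything to the frontier model ignores the left column, and routing everything to the cheap model ignores the right one.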
7. Human Code Review Becomes the Bottleneck
When agents produce code at machine speed, the human reviewer becomes the rate limiter. The Agile Manifesto has no answer for this because it never imagined a world where code generation was not the bottleneck [18]. The answer requires tiered verification: deterministic checks filter autonomously, statistical evaluation filters semi-autonomously, and human review focuses exclusively on high-risk deltas and policy exceptions [13][18].
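One way to sketch that tiering as routing logic — the field names, the 0.95 threshold, and the tier labels are all hypothetical choices for illustration:

```python
def review_tier(delta: dict) -> str:
    """Route a change to the cheapest tier that can accept its risk.
    Hypothetical delta fields: checks_pass, risk, policy_exception, eval_score."""
    if not delta["checks_pass"]:
        return "reject"           # deterministic tier filters autonomously
    if delta["risk"] == "high" or delta["policy_exception"]:
        return "human-review"     # humans see only high-risk deltas and exceptions
    if delta["eval_score"] >= 0.95:
        return "auto-merge"       # statistical tier filters semi-autonomously
    return "human-review"         # ambiguous evidence escalates to a human

assert review_tier({"checks_pass": False, "risk": "low",
                    "policy_exception": False, "eval_score": 1.0}) == "reject"
assert review_tier({"checks_pass": True, "risk": "low",
                    "policy_exception": False, "eval_score": 0.99}) == "auto-merge"
assert review_tier({"checks_pass": True, "risk": "high",
                    "policy_exception": False, "eval_score": 0.99}) == "human-review"
```

The design point is that the human tier is reached by exception, not by default, so reviewer attention scales with risk rather than with volume.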
The Conceptual Gaps — Missing Entirely
8. No Framework for Non-Deterministic Behavior
In agentic systems, this is the deepest category of failure — concepts that neither the manifesto nor its derivative practices ever addressed, because they did not need to. Agile assumes deterministic execution: write code, run tests, the same input produces the same output. Agents are probabilistic. The same specification can produce different implementations across runs. The same tool call can produce different results depending on context window contents, model temperature, and retrieved memory [34].
The Agile Manifesto has no vocabulary for emergence, containment, hallucination loops, memory poisoning, or probability-compounding across multi-agent systems [4][13]. These are not edge cases — they are routine operating conditions in agentic engineering.
It is worth noting that even current enterprise guidance — including McKinsey's AI Transformation Manifesto — reproduces this gap at the strategic level. McKinsey's theme on agentic engineering (#11 of their twelve themes) describes the challenge as "ingesting unstructured data, extending AI platforms with agentic capabilities, automating guardrails and controls." This describes Agile-era configuration management dressed in agentic vocabulary. It has no concept of blast radius, swarm topology, correlated failure modes, or the verification/validation distinction that separates "the agent said it worked" from "we can prove it worked." The absence of non-determinism vocabulary in enterprise AI guidance is not a strategic oversight — it is a symptom of the same conceptual gap that limits Agile: the framework was designed for human executors, and the vocabulary has not caught up to probabilistic ones.
9. No Concept of Systems That Learn from Their Own Execution
In agentic systems, this may be the most consequential failure — the one that generates all the others. Agile's feedback loop is the retrospective: humans reflecting on what happened, deciding what to change, implementing changes in the next sprint. That loop runs on a two-week cadence because it requires human cognition. The feedback is soft — "we should try shorter sprints," "let's refine our definition of done."
Agentic systems have a qualitatively different feedback loop. Errors, logs, successes, and failures feed back into the system in real-time. The system does not wait for a retrospective. It does not require human reflection to adapt. The feedback is hard, unambiguous signals: passing tests, zero runtime errors, validated API responses, converging evaluation metrics [2][17]. This is not merely faster iteration — it is a different kind of learning. Reasoning consolidation cycles can compress chains of inference using reinforcement-learning algorithms in seconds. Meta-cognitive layers enable systems to monitor and modify their own operational parameters. Self-optimizing architectures adapt query and retrieval strategies in microseconds based on access patterns [19].
The Agile Manifesto has no concept of a system that improves its own process without human intervention. It assumes the learner is human, the cadence is weekly, and the feedback requires interpretation. In agentic systems, the learner is the system itself, the cadence is continuous, and the feedback is machine-readable. This is not an incremental improvement over retrospectives — it is a paradigm shift that Agile's vocabulary cannot express [2].
10. No Treatment of Memory as Infrastructure
In agentic systems, Agile's concept of institutional memory — tribal knowledge and documentation that nobody reads — is insufficient. There is no Agile practice for curating, governing, or versioning what the organization knows — let alone what it has learned. In an agentic system, this gap is fatal.
An agent without persistent memory — whether internalized or externalized through retrieval layers, episodic stores, or vector databases — must be reconstructed from scratch for every task. Some architectures externalize memory entirely, separating the agent from its state. But the memory still exists as infrastructure; it still requires curation, governance, and retrieval engineering. The question is not whether memory is needed but where it lives and who governs it. In practice, the distinction between a stateless tool and a memory-augmented agent is the ability to accumulate, curate, and act on context across invocations [19]. And memory itself is not monolithic: knowledge (what was given) and learned memory (what was discovered through execution) are distinct infrastructure with different curation, governance, and retrieval requirements [17].
Context windows are finite. What goes into them determines what comes out. Memory governance — the discipline of deciding what to retain, what to forget, how to retrieve, and how to prevent poisoning of an agent's accumulated context — is an engineering discipline as consequential as database design. The Agile Manifesto treats memory as overhead. Agentic engineering treats it as infrastructure [3].
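A sketch of what memory-as-infrastructure might look like in code. All names and policies here are invented for illustration: provenance tagging, the knowledge/learned split, TTL-based retention, and source-level purging as a poisoning response.

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryEntry:
    """Hypothetical record: every entry carries provenance and a timestamp."""
    content: str
    kind: str     # "knowledge" (what was given) vs "learned" (discovered in execution)
    source: str   # provenance, needed to trace and purge poisoned context
    created: float = field(default_factory=time.time)

class GovernedMemory:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds                 # retention policy: what to forget, and when
        self.entries: list[MemoryEntry] = []

    def retain(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def purge_source(self, source: str) -> int:
        """Poisoning response: drop everything traced to a compromised source."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if e.source != source]
        return before - len(self.entries)

    def retrieve(self, kind: str) -> list[str]:
        """Retrieval honours both the knowledge/learned split and the TTL."""
        now = time.time()
        return [e.content for e in self.entries
                if e.kind == kind and now - e.created < self.ttl]

mem = GovernedMemory(ttl_seconds=3600)
mem.retain(MemoryEntry("API base URL is /v2", "knowledge", "spec"))
mem.retain(MemoryEntry("retry twice on 503", "learned", "run-42"))
assert mem.retrieve("learned") == ["retry twice on 503"]
assert mem.purge_source("run-42") == 1 and mem.retrieve("learned") == []
```

Even this toy version makes the governance questions explicit — what is retained, under whose provenance, for how long, and how contamination is rolled back — questions Agile practice never had to ask.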
A critical review of every competing framework.
See Beyond Agile for the full argument. See Ten Structural Failures for how Agile breaks. All references link to Sources.
The industry has not been idle. Multiple manifestos and frameworks have emerged to fill the vacuum. But none of them are sufficient.
Casey West's Agentic Manifesto
What it gets right: The shift from verification ("did it do what I said?") to validation ("did it do what I wanted?"). The Agentic Delivery Lifecycle (ADLC) across five non-linear phases. The "Determinism Gap" — the fundamental difference between a system whose output is known in advance and one whose output is discovered in real-time. The emphasis on continuous flow over time-boxed sprints. The insistence that human engineers and agents must work together continuously, rejecting fully unsupervised delegation [4].
What it misses: No treatment of memory as infrastructure — West does not distinguish knowledge from learned memory or address memory governance. No economics-aware routing — no recognition that model choice is a runtime decision with cost implications [17]. No framework for formal verification or proof generation — and this matters because when execution is non-deterministic, at least one layer of the verification pyramid must be provably correct; executable specification languages and model checkers are now production-viable for this purpose [12][19]. No treatment of swarm topology as an engineering decision. No recognition that alignment must move from the single-agent prompt layer toward institutional alignment across interacting agents, tools, and humans [59]. The manifesto reads as a philosophical reframe of Agile rather than a new engineering discipline.
The SASE Framework (Academic SE 3.0)
What it gets right: The dual modality of SE4H (Software Engineering for Humans) and SE4A (Software Engineering for Agents). The elevation of the developer from syntax author to "Agent Coach." The structured artifacts: BriefingScripts, Merge-Readiness Packs (MRPs), Consultation Request Packs (CRPs). The separation of Agent Command Environment (ACE) and Agent Execution Environment (AEE). The Plan-Do-Assess-Review (PDAR) loop with agent-initiated callbacks [12].
What it misses: Overly academic — lacks operational tooling references. No treatment of cost-quality routing. No framework for memory governance beyond "institutional memory." No recognition that formal verification and statistical evaluation are complementary disciplines with different cost curves [19]. No treatment of self-improving recursive systems or of skill memory as an external learning substrate that must itself be governed [60].
The DEV Community "Agentic Manifesto"
What it gets right: Four values that correctly reframe priorities: human intent over exhaustive requirements, continuous flow over sprints, architectural integrity over feature output, automated validation over manual estimation [11].
What it misses: Values without principles are aspirational, not operational. No definition of "done." No treatment of observability, memory, domain boundaries, or accountability. No framework for what happens when architectural integrity conflicts with continuous flow. No recognition that "automated validation" requires a multi-layered verification pyramid (deterministic → statistical → formal → human) rather than a single binary gate [13].
The P3 Group's "From Sprints to Swarms"
What it gets right: The most thorough deconstruction of how specific Agile practices (standups, sprints, estimation, retrospectives) fail under agentic workflows. The strategic framing of evolutionary versus revolutionary adoption paths. The recognition that the Agile Manifesto's values can survive as governance principles even as its practices become obsolete [5].
What it misses: Primarily diagnostic rather than prescriptive. Identifies what breaks but does not provide the replacement engineering discipline with sufficient depth. No treatment of formal verification, memory governance, or economics-aware routing. No operational tooling framework.
The AWS Prescriptive Guidance
What it gets right: "Zones of intent" — bounded operational spaces where agents have high autonomy within architectural constraints. The evolution from "Sprint Planning" to "Intent Design." The recognition that "done" must be redefined as runtime readiness with observability, explainable traces, and feedback mechanisms [3].
What it misses: Vendor-contextualized (AWS-centric). No treatment of multi-vendor swarm coordination. No framework for formal verification. Limited treatment of memory and learning systems.
ISO/IEC 5338:2023
What it gets right: Among the first comprehensive international frameworks for AI system lifecycle processes [20]. The integration of Model Engineering into standard Implementation processes. The mandate for Continuous Validation — acknowledging that AI agents can suffer from context drift, hallucination, and data staleness over time. The emphasis on bias mitigation, transparency, and purpose-binding for training data [15].
What it misses: Designed for AI systems broadly, not for agentic engineering specifically. No treatment of multi-agent coordination, swarm topologies, or inter-agent trust. No framework for memory governance, economics-aware routing, or self-improving systems. Compliance-oriented rather than engineering-oriented.
The Agentic AI Foundation and the Emerging Standards Stack
Since the frameworks above were published, the most significant structural development has been institutional: in December 2025, the Linux Foundation launched the Agentic AI Foundation (AAIF), co-founded by Anthropic, OpenAI, Google, Microsoft, AWS, and Block [35]. MCP, A2A, AGENTS.md, and goose were donated as founding projects.
This matters because the competing frameworks above all suffer from the same gap: they describe what agentic engineering needs without naming the protocols that implement it. At the time of writing, the industry is actively standardizing into four complementary layers, all under neutral governance:
- MCP (Model Context Protocol) — agent-to-tool connectivity. Defines typed schemas, auth boundaries, and replayable tool logs [36].
- A2A (Agent-to-Agent Protocol) — agent discovery, task delegation, and cross-framework collaboration [37].
- Agent Skills — capability definition via SKILL.md files consumed at runtime [38].
- AGENTS.md — repository-level machine-readable constraints for coding agents [39].
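As one hedged illustration of the fourth layer, an AGENTS.md file is plain markdown that coding agents read for repository-level constraints. The sections and rules below are hypothetical examples, not a normative schema:

```markdown
# AGENTS.md (illustrative sketch)

## Build and test
- Run `make test` before proposing any change.

## Boundaries
- Do not modify files under `infra/` or `secrets/`.
- All database access goes through the repository layer; no raw SQL in handlers.

## Conventions
- Follow the existing lint configuration; do not disable rules inline.
```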
None of the six frameworks reviewed above anticipated this convergence. The Agentic Engineering Manifesto does not prescribe specific protocols — its contribution is the governance model that sits across all four layers. But the existence of AAIF supports one of the manifesto's core theses: vendor-neutral, interoperable architecture is not aspirational but actively being built.
A parallel movement reinforces the shift: specification-driven development (SDD) frameworks are emerging as an increasingly common workflow pattern for agentic coding. Multiple widely adopted open-source frameworks [43][44][45][46][48] now enforce the same discipline — write the specification before the agent writes the code. This effectively inverts Agile's founding principle of "working software over comprehensive documentation." In agentic workflows, comprehensive specification is the precondition for working software. The documentation is not overhead; it is the control surface.
Sources are classified by type: [P] press/trade, [B] blog/opinion, [I] industry/vendor, [A] academic, [S] standard, [R] internal reference.
See Beyond Agile for the full argument.
Source Weighting
Not all sources carry equal evidentiary weight. When drawing conclusions, rely on the highest-weight source available for the claim:
- Primary — standards, regulations, official documentation, peer-reviewed academic papers. Strongest evidentiary weight.
- Secondary — industry and vendor guidance, white papers, practitioner frameworks. Useful for operational context; may reflect vendor perspective.
- Tertiary — press, blogs, opinion pieces. Useful for framing and practitioner sentiment; treat as directional, not conclusive.
Conclusions about requirements, governance obligations, or regulatory constraints should rest on primary sources. Secondary and tertiary sources provide context and practitioner signal, not evidentiary grounding for technical claims.
[1] S. Jones, "AI Killed the Agile Manifesto," MetaMirror (blog), Jan 2026. [B] https://blog.metamirror.io/ai-killed-the-agile-manifesto-805ad9a639db
[2] Infosys, "How Is AI-Native Software Development Lifecycle Disrupting Traditional Software Development?" Infosys IKI TechCompass, 2025. [I] https://www.infosys.com/iki/techcompass/ai-native-software-development-lifecycle.html
[3] AWS, "Evolving Software Delivery for Agentic AI," AWS Prescriptive Guidance, 2026. [I] https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-operationalizing-agentic-ai/software-delivery.html
[4] C. West, "The Agentic Manifesto: Engineering in the Era of Autonomy," caseywest.com, Nov 2025. [B] https://caseywest.com/the-agentic-manifesto/
[5] P3 Group, "From Sprints to Swarms: Navigating the Post-Agile Future in the Age of AI," P3 Group White Paper, Sep 2025. [I] https://www.p3-group.com/en/p3-updates/navigating-the-post-agile-future-in-the-age-of-ai/
[6] D. Shortino, "The Software Development Lifecycle as We Know It Is Over," WebProNews, Jan 2026. [P] https://www.webpronews.com/the-software-development-lifecycle-as-we-know-it-is-over-and-ai-agents-are-writing-the-obituary/
[7] D. Rubinstein, "Is Agile Dead in the Age of AI?" SD Times, 2025. [P] https://sdtimes.com/agile/is-agile-dead-in-the-age-of-ai/
[8] Gartner, "Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027," Gartner Press Release, Jun 2025. [I] https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
[9] L. Claburn, "Test-Driven Development Ideal for AI, Says Agile Workshop," The Register, Feb 2026. [P] https://www.theregister.com/2026/02/20/from_agile_to_ai_anniversary/
[10] L. Claburn, "Agile Manifesto Co-Author 'Smitten' with Vibe Coding," The Register, Feb 2026. [P] https://www.theregister.com/2026/02/19/jon_kern_vibe_coding/
[11] crywolfe, "The Agentic Manifesto: Why Agile Is Breaking in the Age of AI Agents," DEV Community, 2025. [B] https://dev.to/crywolfe/the-agentic-manifesto-why-agile-is-breaking-in-the-age-of-ai-agents-1939
[12] R. Feldt et al., "Agentic Software Engineering: Foundational Pillars and a Research Roadmap," arXiv:2509.06216v2, Sep 2025. [A] https://arxiv.org/html/2509.06216v2
[13] B. Linders, "From Prompts to Production: A Playbook for Agentic Development," InfoQ, 2026. [P] https://www.infoq.com/articles/prompts-to-production-playbook-for-agentic-development/
[14] B. Linders, "Does AI Make the Agile Manifesto Obsolete?" InfoQ, Feb 2026. Note: cites Forrester's 2025 State of Agile Development report; primary report not publicly available. [P] https://www.infoq.com/news/2026/02/ai-agile-manifesto-debate/
[15] Software Improvement Group, "ISO/IEC 5338: Get to Know the Global Standard on AI Systems," SIG Blog, 2024. [I] https://www.softwareimprovementgroup.com/blog/iso-5338-get-to-know-the-global-standard-on-ai-systems/
[16] T. Claburn, "From Agile to AI: Anniversary Workshop Says Test-Driven Development Ideal for AI Coding," DevClass, Feb 2026. [P] https://www.devclass.com/development/2026/02/21/should-there-be-a-new-manifesto-for-ai-development/4091612
[17] Y. Zhou, "2025 Overpromised AI Agents. 2026 Demands Agentic Engineering," Medium, Jan 2026. [B] https://medium.com/generative-ai-revolution-ai-native-transformation/2025-overpromised-ai-agents-2026-demands-agentic-engineering-5fbf914a9106
[18] Svngoku, "2026 Agentic Coding Trends — Implementation Guide," Hugging Face Blog, 2026. [B] https://huggingface.co/blog/Svngoku/agentic-coding-trends-2026
[19] L. Cabrera-Diego et al., "Toward Agentic Software Engineering Beyond Code: Framing Vision, Values, and Vocabulary," arXiv:2510.19692v2, Oct 2025. [A] https://arxiv.org/html/2510.19692v2
[20] ISO/IEC, "ISO/IEC 5338:2023 — Information technology — AI system life cycle processes," International Organization for Standardization, 2023. [S] https://www.iso.org/standard/81118.html
[21] The AI Native Dev podcast interview, "Can Agentic Engineering Really Deliver Enterprise-Grade Code?" (with R. Cohen), Sep 2025. [I] https://ainativedev.io
[22] Y. Pan et al., "SWE-CI: Evaluating LLM-based Agents in Continuous Integration Environments," arXiv:2603.03823, Mar 2026. [A] https://arxiv.org/abs/2603.03823
[23] D. Fretz, "The 5 Levels of AI Agentic Software Development," LinkedIn, Feb 2026. [B] https://www.linkedin.com/pulse/5-levels-ai-agentic-software-development-dominik-fretz-mba-pmp-xhvze/
[24] OpenAI, "Harness engineering: leveraging Codex in an agent-first world," OpenAI, Feb 2026. [I] https://openai.com/index/harness-engineering/
[25] Anthropic, "Demystifying evals for AI agents," Anthropic Engineering, Jan 2026. [I] https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
[26] D. Bursztein and B. Lewis, "Building agents with the Claude Agent SDK," Claude Blog, Jul 2025. [I] https://claude.com/blog/building-agents-with-the-claude-agent-sdk
[27] C. Horne, "Writing effective tools for AI agents," Anthropic Engineering, Sep 2025. [I] https://www.anthropic.com/engineering/writing-tools-for-agents
[28] A. Zhang et al., "Building a C compiler with a team of parallel Claudes," Anthropic Engineering, Feb 2026. [I] https://www.anthropic.com/engineering/building-c-compiler
[29] B. Böckeler, "Harness Engineering," Martin Fowler, Feb 2026. [B] https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html
[30] B. Böckeler, "Context Engineering for Coding Agents," Martin Fowler, Feb 2026. [B] https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html
[31] K. Morris, "Humans and Agents in Software Engineering Loops," Martin Fowler, Mar 2026. [B] https://martinfowler.com/articles/exploring-gen-ai/humans-and-agents.html
[32] Google Cloud, "Vertex AI Agent Builder," Google Cloud, 2026. [I] https://cloud.google.com/products/agent-builder
[33] Google Cloud, "Agent Development Kit overview," Google Cloud Docs, 2026. [I] https://docs.cloud.google.com/agent-builder/agent-development-kit/overview
[34] G. Franceschini et al., "Build with Google Antigravity, our new agentic development platform," Google Developers Blog, Apr 2025. [I] https://developers.googleblog.com/en/build-with-google-antigravity-our-new-agentic-development-platform/
[35] Linux Foundation, "Linux Foundation Announces the Formation of the Agentic AI Foundation," Linux Foundation Press, Dec 2025. [I] https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation
[36] Anthropic, "MCP Joins the Agentic AI Foundation," Model Context Protocol Blog, Dec 2025. [I] https://blog.modelcontextprotocol.io/posts/2025-12-09-mcp-joins-agentic-ai-foundation/
[37] Google, "A2A: A New Era of Agent Interoperability," Google Developers Blog, Apr 2025. [I] https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
[38] Anthropic, "Equipping Agents for the Real World with Agent Skills," Anthropic Engineering, Oct 2025. [I] https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills
[39] OpenAI, "AGENTS.md," GitHub, 2025. [I] https://github.com/agentsmd/agents.md
[40] NVIDIA, "NVIDIA Announces NemoClaw," NVIDIA News, Mar 2026. [I] https://nvidianews.nvidia.com/news/nvidia-announces-nemoclaw
[41] S. Yegge, "Introducing Beads: A Coding Agent Memory System," Medium, 2026. [B] https://steve-yegge.medium.com/introducing-beads-a-coding-agent-memory-system-637d7d92514a
[42] CrowdStrike, "What Security Teams Need to Know About OpenClaw AI Super Agent," CrowdStrike Blog, 2026. [I] https://www.crowdstrike.com/en-us/blog/what-security-teams-need-to-know-about-openclaw-ai-super-agent/
[43] GitHub, "Spec-driven development with AI: Get started with a new open source toolkit," GitHub Blog, 2025. [I] https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/
[44] Fission AI, "OpenSpec: The Spec Framework for Coding Agents," Y Combinator Launch, 2025. [I] https://www.ycombinator.com/launches/Pdc-openspec-the-spec-framework-for-coding-agents
[45] J. Vincent, "Superpowers: How I'm using coding agents in October 2025," blog.fsck.com, Oct 2025. [B] https://blog.fsck.com/2025/10/09/superpowers/
[46] BMad Code, "BMAD-METHOD: Breakthrough Method for Agile AI-Driven Development," GitHub, 2025-2026. [I] https://github.com/bmad-code-org/BMAD-METHOD
[47] Oracle, "Introducing the Open Agent Specification (Agent Spec)," Oracle AI Blog, 2025. [I] https://blogs.oracle.com/ai-and-datascience/introducing-open-agent-specification
[48] spec-kit, "spec-kit: Technology-independent SDD toolkit," GitHub, 2025-2026. [I] https://github.com/spec-kit/spec-kit
[49] Model Context Protocol, "Key Changes," MCP Specification, Mar 2025. [S] https://modelcontextprotocol.io/specification/2025-03-26/changelog
[50] Model Context Protocol, "One Year of MCP: November 2025 Spec Release," MCP Blog, Nov 2025. [S] https://blog.modelcontextprotocol.io/posts/2025-11-25-first-mcp-anniversary/
[51] Linux Foundation, "Linux Foundation Launches the Agent2Agent Protocol Project to Enable Secure, Intelligent Communication Between AI Agents," Linux Foundation Press Release, Jun 2025. [I] https://www.linuxfoundation.org/press/linux-foundation-launches-the-agent2agent-protocol-project-to-enable-secure-intelligent-communication-between-ai-agents
[52] OpenTelemetry, "AI Agent Observability: Building Trust in Autonomous Systems with OpenTelemetry," OpenTelemetry Blog, Mar 2025. [S] https://opentelemetry.io/blog/2025/ai-agent-observability/
[53] OpenAI, "Why we no longer evaluate SWE-bench Verified," OpenAI, Feb 2026. [I] https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
[54] C. Cuadron et al., "Saving SWE-Bench: A Benchmark Mutation Approach for More Realistic Agent Evaluation," Microsoft Research, Oct 2025. [A] https://www.microsoft.com/en-us/research/publication/saving-swe-bench-a-benchmark-mutation-approach-for-realistic-agent-evaluation/
[55] OpenAI, "Understanding Prompt Injection," OpenAI, Nov 2025. [I] https://openai.com/index/prompt-injections/
[56] OpenAI, "Designing AI Agents to Resist Prompt Injection," OpenAI, Mar 2026. [I] https://openai.com/index/designing-agents-to-resist-prompt-injection/
[57] European Commission, "The General-Purpose AI Code of Practice," Shaping Europe's Digital Future, Jul 2025. [S] https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai
[58] NIST, "Announcing the 'AI Agent Standards Initiative' for Interoperable and Secure Innovation," NIST News, Feb 2026. [S] https://www.nist.gov/news-events/news/2026/02/announcing-ai-agent-standards-initiative-interoperable-and-secure
[59] J. Evans, B. Bratton, and B. Agüera y Arcas, "Agentic AI and the next intelligence explosion," Science, 2026. [A] https://www.science.org/doi/10.1126/science.aeg1895
[60] H. Zhou et al., "Memento-Skills: Let Agents Design Agents," arXiv:2603.18743, Mar 2026. [A] https://arxiv.org/abs/2603.18743
Principles for building systems where humans steer intent, agents execute within governed boundaries, and verified outcomes are the only measure that matters.
We are moving from writing software to architecting systems that write, test, and ship software under human direction. Through this work, we have come to value:
| We Value More | over | We Also Value |
|---|---|---|
| Iterative steering and alignment | over | Rigid upfront specifications |
| Verified outcomes with auditable evidence | over | Fluent assertions of success |
| Right-sized agent collaboration | over | Monolithic god-agents |
| Curated, high-signal context and memory | over | Stateless sessions and noisy memory |
| Tooling, telemetry, and observability | over | Chat-based heroics |
| Resilience under stress | over | Performance in ideal conditions |
That is, while there is value in the items on the right, we value the items on the left more.
Architectural basis (vendor-neutral): enforceable constraints, durable knowledge and memory, continuous evaluations, behavioral observability, and economics-aware routing.
What is Agentic Engineering?
Agentic Engineering is the discipline of architecting environments, constraints, protocols, and feedback loops where autonomous agents can safely plan, execute, and verify complex work under human governance.
It is distinct from:
- AI Engineering: Building and training the base models themselves.
- Prompt Engineering: Crafting text inputs to steer model outputs.
- AI-Assisted Software Engineering: Using AI as an autocomplete or co-pilot to write human-authored code faster.
Agentic Engineering is about treating agents as governed system participants rather than as human proxies. It shifts the primary human role from writing code to specifying intent, defining verifiable contracts, and operating the system that executes the work. As agent capability scales, the governing challenge shifts from aligning one model in isolation toward aligning a society of interacting agents, tools, and humans through checks, balances, and explicit institutional control.
What This Is — and What It Is Not
This manifesto is not "prompting harder." It is not LLMs running production unsupervised. It is not replacing engineering judgment with agent confidence, and it is not more meetings with new names.
It is enforced constraints, verified outcomes, persistent learning, and human accountability — applied to systems that include AI agents as first-class participants in the engineering process.
The Agentic Loop
Every principle in this manifesto serves a single feedback cycle:
Specify → Design → Plan → Execute → Verify → Validate → Observe → Learn → Govern → Repeat
This loop is not a waterfall. Any phase can trigger a return to an earlier one based on evidence. The loop is the system. The principles are how you keep it honest.
- Specify defines what to build and why.
- Design architects how to build it: boundaries, topology, constraints, and coordination rules.
- Plan decomposes the design into executable steps.
- Execute carries out the plan within bounded autonomy.
- Verify checks the output against the specification (did we build it right?).
- Validate checks the outcome against real-world need (did we build the right thing?).
- Observe monitors runtime behavior, drift, and cost.
- Learn updates knowledge and memory from observations. At Phases 4–5, this means: add durable findings to the knowledge base and curate learned memory with new heuristics, routing preferences, and reusable skills. Updating model weights (fine-tuning, RLHF) is a separate infrastructure concern applicable at Phase 6 and beyond — not a per-loop operation for most organizations. Knowledge captures durable truth; memory captures learned heuristics and reusable skills.
- Govern applies policy, accountability, change control, and economics review. When inference or governance cost exceeds the value of the work, Govern signals Specify to simplify scope or reduce autonomy rather than continuing to spend. A Govern cycle is not complete until: all outstanding policy violations are resolved, accountability signals are within threshold (no rubber-stamping pattern detected), economics review is recorded, and any architectural decisions triggered by governance are filed back into Design.
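One hedged way to make the phase order and fallback behavior concrete is a minimal state-machine sketch. The phase names come from the loop above; the evidence keys and handler signature are hypothetical illustrations, not a prescribed API:

```python
from enum import Enum, auto

class Phase(Enum):
    SPECIFY = auto(); DESIGN = auto(); PLAN = auto(); EXECUTE = auto()
    VERIFY = auto(); VALIDATE = auto(); OBSERVE = auto(); LEARN = auto(); GOVERN = auto()

# Forward order of the loop; Govern wraps back to Specify on "Repeat".
FORWARD = list(Phase)

def next_phase(current: Phase, evidence: dict) -> Phase:
    """Advance the loop, or return to an earlier phase based on evidence.

    Only a few feedback arrows are sketched; a real system would carry
    one per failure class (hypothetical keys below)."""
    if current is Phase.VERIFY and evidence.get("invalid_intent"):
        return Phase.SPECIFY
    if current is Phase.VALIDATE and evidence.get("design_flaw"):
        return Phase.DESIGN
    if current is Phase.GOVERN and evidence.get("economics_breach"):
        return Phase.SPECIFY
    idx = FORWARD.index(current)
    return FORWARD[(idx + 1) % len(FORWARD)]  # Govern -> Specify wraps the cycle
```

The point of the sketch is that fallback is evidence-driven: the same phase can advance or rewind depending on what the evidence shows.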
Verification and validation are distinct disciplines. Verification is technical correctness against the spec. Validation is fitness for intended use in the real world. An agent can pass every verification check and still fail validation. Both are required.
Failures are data across every phase. Incidents, hallucinations, and policy violations must produce post-incident updates to specifications, evaluations, tooling constraints, and memory before retry.
When a feedback arrow fires, a remediation sub-cycle must complete before re-entering the loop:
- Diagnose — classify the failure from traces: specification error, verification gap, enforcement failure, or operational override.
- Update — patch memory, tighten contracts, or revise the specification to address the root cause.
- Gate — add or strengthen an evaluation that would catch this failure class before retrying.
- Re-verify — run the updated evaluation suite before advancing.
Skipping to step 4 without steps 1–3 is a retry, not remediation, and is the primary cause of hallucination loops.
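The four-step sub-cycle can be expressed as a guard that refuses loop re-entry until every step has left a record. The record structure below is a hypothetical sketch, not a prescribed schema:

```python
from dataclasses import dataclass, field

# Failure classes from the Diagnose step.
FAILURE_CLASSES = {"specification_error", "verification_gap",
                   "enforcement_failure", "operational_override"}

@dataclass
class Remediation:
    """Evidence that Diagnose -> Update -> Gate -> Re-verify actually ran."""
    failure_class: str = ""                             # Diagnose: classified from traces
    updates: list[str] = field(default_factory=list)    # Update: memory/contract/spec patches
    new_evals: list[str] = field(default_factory=list)  # Gate: checks added for this failure class
    suite_passed: bool = False                          # Re-verify: updated suite ran green

def may_reenter_loop(r: Remediation) -> bool:
    """A green suite alone (step 4 without 1-3) is a retry, not remediation."""
    return (r.failure_class in FAILURE_CLASSES
            and bool(r.updates)
            and bool(r.new_evals)
            and r.suite_passed)
```

Note that `may_reenter_loop(Remediation(suite_passed=True))` is false by construction: passing the old suite again proves nothing about the root cause.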
```mermaid
flowchart LR
    Specify --> Design --> Plan --> Execute --> Verify --> Validate --> Observe --> Learn --> Govern
    Govern -->|Repeat| Specify
    Verify -.->|Plan / Execution Failure| Plan
    Verify -.->|Invalid Intent| Specify
    Validate -.->|Wrong Thing Built| Specify
    Validate -.->|Design Flaw| Design
    Observe -.->|Runtime Drift| Specify
    Observe -.->|Decomposition Error| Plan
    Govern -.->|Economics / Complexity Breach| Specify
    Govern -.->|Architectural Policy Change| Design
```
The New Way of Working
Humans express intent as specifications with constraints and acceptance criteria — then refine those specifications as evidence accumulates. They encode architecture as enforceable, monitored domain boundaries. They set autonomy tiers appropriate to risk. They own outcomes and remain accountable. They do not supervise every intermediate step — they define what success looks like, verify that the system achieved it, and inspect the reasoning when it matters.
Agents decompose specifications into executable tasks. They execute within domain boundaries, right-sized to complexity. They verify their own outputs against evaluations. They report evidence, not assertions. They learn from failure and encode that learning in memory — with provenance, so the system knows where every lesson came from.
Systems maintain persistent knowledge and curated learned memory. They route work to appropriate model tiers based on cost and quality requirements. They enforce architectural constraints at runtime and monitor for violations. They observe behavior, surface anomalies, and maintain the feedback loops that make everything else work. They forget what no longer serves them.
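Cost-quality routing, as described here, can be sketched as a tier selector that picks the cheapest model meeting a quality floor. The tier names, costs, and scores below are invented for illustration; a real router would use measured evaluation scores and actual prices:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_task: float  # illustrative relative cost
    quality: float        # illustrative 0-1 capability score from evals

# Hypothetical tiers for the sketch.
TIERS = [
    ModelTier("small", 0.01, 0.60),
    ModelTier("medium", 0.10, 0.80),
    ModelTier("large", 1.00, 0.95),
]

def route(required_quality: float) -> ModelTier:
    """Cheapest tier that meets the requirement; escalate if none does."""
    for tier in sorted(TIERS, key=lambda t: t.cost_per_task):
        if tier.quality >= required_quality:
            return tier
    return TIERS[-1]  # no tier qualifies: fall back to the strongest
```

Under these invented numbers, a low-stakes task (`route(0.5)`) lands on the small tier while a demanding one (`route(0.9)`) escalates to the large tier; the design choice is that quality requirements come from the specification, not from the agent's self-assessment.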
See Roles and the Human Side for how each role evolves through the phase transitions.
Scope and Non-Goals
What this manifesto covers:
- The engineering discipline for building and operating systems that include autonomous agents as first-class participants in the software development and delivery lifecycle (SDLC).
- Governance structures, autonomy controls, and evidence practices for agent-assisted software engineering.
- Adoption guidance for regulated and unregulated software delivery contexts, including V-model and compliance-heavy organizations.
- Domain-specific mappings to regulatory frameworks for aviation, automotive, medical devices, pharmaceuticals, financial services, and defense/government.
What this manifesto does not cover:
- Training, fine-tuning, or evaluating foundation models. That is AI engineering.
- Deploying agents in physical systems, robotics, or non-software operational domains.
- Product management, UX design, or organizational strategy beyond what directly governs agent autonomy and accountability.
- Legal advice, compliance determinations, or jurisdiction-specific regulatory guidance. The domain pages map principles to frameworks; they are not substitutes for qualified regulatory counsel.
- Autonomous weapons systems, or the safety certification of autonomous control systems themselves (e.g., certifying an ALKS or autopilot function). The domain pages cover software engineering governance for teams building those systems; they do not cover the operational safety certification of the resulting autonomous system.
What requires separate guidance:
- Agentic systems operating outside the SDLC (e.g., customer-facing autonomous agents, trading agents, autonomous process automation at industrial scale). The principles are relevant starting points, but the operational context — real-time customer exposure, regulatory regimes, failure modes — differs enough to require purpose-built guidance rather than direct application.
- Federated agent networks without a single accountable operator (distinct from multi-provider model routing, which P11 addresses).
- Agent deployment in classified environments (the domain pages note this boundary; they do not provide guidance for classified system development).
How to Read This Manifesto
Use two layers:
- Manifesto core (this document + Twelve Principles + Definition of Done): values, principles with minimum bars, and what "done" means. Start here.
- Companion guidance (Companion Guide and its linked documents): extended rationale, tradeoffs, worked patterns, failure modes, organizational change management, and domain-specific regulatory alignment. Come here when implementing. The companion layer is itself multi-document; the full map is in companion-guide.md.
The two-layer framing is accurate but incomplete. The minimum bars in the principles are necessary conditions; they are not sufficient for safe operation at Phase 4 and above. At higher phases, certain companion content becomes operationally essential rather than supplementary: the Specifications vs. Constraints distinction (P2), rubber-stamping detection (P12), the Adaptation Envelope — Layer 4 (P6), and the worked failure-mode patterns (P10/P12) are required reading before operating autonomy above Tier 1. If the core document describes the floor, these documents describe the walls and ceiling.
On evidence. This manifesto demands evidence as a discipline. We apply that standard to our own claims: empirically supported claims carry citations; threshold values are labeled as practitioner heuristics; deductive arguments are stated as arguments so they can be evaluated independently. Some claims in an emerging discipline necessarily precede the empirical grounding they ideally require. Treat those claims as hypotheses and revise them as evidence accumulates. That is what a living specification means in practice.
Contents
Twelve Principles
The engineering principles that operationalize the six values: outcomes, specifications, architecture, swarm topology, autonomy tiers, knowledge and memory, context, evaluations and proofs, observability and interoperability, emergence and containment, economics, and accountability.
The Agentic Definition of Done
What "done" means in agentic engineering: shipped, observable, verified, provable, learned from, governed, and economical. Phase-calibrated, not all-or-nothing.
Glossary
Canonical definitions for terms used across this document set: agent, autonomy tier, blast radius, evidence bundle, evaluation, knowledge, learned memory, specification, trace, verification, validation, and more.
Exploration is a phase. Engineering is a discipline. These principles are not the last word — they are the minimum for a world where systems build, test, and ship their own code under human direction. The question that remains is whether governance can scale as fast as autonomy. We bet it can. This manifesto is how we intend to prove it.
The engineering principles that operationalize the six values.
See the Manifesto for the core values and the Agentic Loop. See the Definition of Done for what "done" means.
Values-to-principles mapping. The manifesto claims these twelve principles operationalize the six values. The correspondence:
| Value | Principles |
|---|---|
| Iterative steering and alignment | 1 — Outcomes, 2 — Specifications |
| Verified outcomes with auditable evidence | 8 — Evaluations, 12 — Accountability |
| Right-sized agent collaboration | 3 — Architecture, 4 — Swarm, 5 — Autonomy tiers |
| Curated, high-signal context and memory | 6 — Knowledge/memory, 7 — Context |
| Tooling, telemetry, and observability | 9 — Observability |
| Resilience under stress | 10 — Containment, 11 — Economics |
Sequencing matters. These principles are not independent. Prerequisites: Principle 2 (specifications) before Principle 8 (evaluations); Principle 3 (architecture) before Principle 5 (autonomy tiers); Principle 6 (knowledge/memory) before Principle 7 (context); Principle 9 (observability) before Principle 12 (accountability). The Incremental Adoption Path gives the recommended implementation order.
1. Outcomes are the unit of work
Progress is measured by the cycle Outcome → Evidence → Learning — not by tokens generated, tasks dispatched, or agents spawned. An agent that says "done" has proven nothing. A change is done only when it is shipped, observable, verified, validated, and learned from.
Four distinct claims must hold before "done" is true:
Evaluation is the contract that defines correctness. Evaluations are versioned, machine-readable, and coupled to the specification. They define what "correct" means in terms the system can check autonomously.
Verification is the act of running evaluations to confirm the implementation matches the specification. Verification answers: did we build it right? It produces evidence — test reports, policy check outputs, trace IDs — that an agent's output satisfies the acceptance criteria.
Validation is the judgment that the specification itself was worth building. Validation answers: did we build the right thing? It checks fitness for real-world use: does the deployed behavior produce the intended business outcome? Verification can pass completely while validation fails — you can build exactly what the specification said, correctly, and ship the wrong thing.
Independent validation is the organizational check that verification and validation were genuinely rigorous. It answers: were the first two real? In regulated contexts, this must be performed by a party organizationally independent from the team that developed and verified the system. It is not a technical step — it is a governance requirement.
Evidence means: evaluation reports with pass/fail and metrics, trace IDs linking to the full decision chain, diffs showing what changed, deployment IDs confirming what shipped, rollback plans confirming reversibility, policy check outputs confirming constraint compliance, and memory updates confirming what was learned. Anything less is assertion, not evidence.
Minimum bar: If it is not deployed, instrumented, verified against evaluations, and validated against real-world outcomes, it is not done.
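One way to make "evidence, not assertion" mechanical is a typed evidence bundle that a pipeline rejects when fields are missing. The field names below mirror the evidence list in this principle but are an illustrative sketch, not a normative format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidenceBundle:
    """The items Principle 1 lists as evidence; anything less is assertion."""
    eval_report: dict     # pass/fail plus metrics
    trace_id: str         # links to the full decision chain
    diff: str             # what changed
    deployment_id: str    # what shipped
    rollback_plan: str    # confirms reversibility
    policy_checks: dict   # constraint-compliance outputs
    memory_update: str    # what was learned

def is_done(bundle: EvidenceBundle) -> bool:
    """'Done' requires every evidence field populated and evaluations green."""
    required = [bundle.trace_id, bundle.diff, bundle.deployment_id,
                bundle.rollback_plan, bundle.memory_update]
    return all(required) and bool(bundle.eval_report.get("passed"))
```

An agent's "done" message carries no weight; the bundle either validates or the change is not done.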
2. Specifications are living artifacts that evolve through steering
Requirements, constraints, and acceptance criteria must be versioned, reviewable, and machine-readable — because they drive agent behavior directly. Specifications are hypotheses that sharpen as agents explore the problem space and evidence accumulates. Express what must be true when the work is complete. Express what is forbidden. Let the swarm find the path. When the path reveals that the spec was wrong, update the spec and run again.
Specifications and architectural constraints operate at different layers and change at different speeds. Constraints are invariants — security policies, domain ownership boundaries, data integrity rules — that hold across specification iterations. Specifications are goals and acceptance criteria that evolve within those invariants. An agent can propose a revised acceptance criterion without governance overhead; proposing a relaxed constraint triggers a governed review. If the system cannot distinguish these two change types, specification iteration will silently erode architectural boundaries. See Specifications vs. Constraints in the extended guidance.
Minimum bar: If a specification cannot be versioned, reviewed, and revised based on execution evidence, it is a wish, not an engineering artifact.
These are starter defaults, not universal stop conditions. Calibrate them per domain, track false-convergence and false-drift, and harden them only after local evidence justifies the thresholds.
A specification is done iterating when:
- Acceptance criteria remain stable across three consecutive iterations (no new criteria added, no existing criteria changed).
- Scope is contracting, not expanding — each iteration narrows requirements, does not broaden them.
- Agent first-pass verification rate exceeds 80% (the specification is clear enough for the agent to satisfy it without mid-task clarification).
- No new stop criteria emerge in the last iteration.
If these are not met after three iterations, treat it as scope drift — not optimization — and reset the boundary. Iteration is not the goal; convergence is.
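The four convergence checks above can be expressed as a small gate. This is a minimal sketch under assumed data shapes — `SpecIteration` and its field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpecIteration:
    """One specification iteration snapshot. Field names are illustrative."""
    acceptance_criteria: frozenset  # the criteria as written this iteration
    scope_items: int                # count of in-scope requirements
    first_pass_rate: float          # agent first-pass verification rate, 0.0-1.0
    new_stop_criteria: int          # stop criteria added this iteration

def has_converged(history: list) -> bool:
    """Apply the four starter convergence checks to the last three iterations."""
    if len(history) < 3:
        return False
    a, b, c = history[-3:]
    stable = a.acceptance_criteria == b.acceptance_criteria == c.acceptance_criteria
    contracting = a.scope_items >= b.scope_items >= c.scope_items  # narrowing, not broadening
    clear_enough = c.first_pass_rate > 0.80                       # starter default, calibrate per domain
    no_new_stops = c.new_stop_criteria == 0
    return stable and contracting and clear_enough and no_new_stops
```

A gate like this makes "done iterating" a checkable claim rather than a feeling; the thresholds should be hardened only after local evidence justifies them.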
3. Architecture is defense-in-depth, not a document
Domain boundaries define what agents may do and what they must not do. Encode boundaries as machine-enforced policies: repository gates, type contracts, lint rules, domain ownership maps, CI checks.
Orchestration is a deterministic concern; execution is a probabilistic one — conflating them is the root failure mode. Do not rely on an LLM's system prompt to enforce your business rules. Build deterministic infrastructure wrappers around your probabilistic AI. Enforce permissions, repository gates, API rate limits, and data access at the system level. Expect the boundary to be tested. Design for what happens when it is crossed.
Deterministic wrappers catch structural failures — unauthorized access, schema violations, forbidden operations. They cannot catch semantic failures — an agent that writes syntactically valid but logically wrong code. That is why architecture is defense-in-depth, not a single layer: wrappers catch structural violations (Principle 3), verification catches semantic errors (Principle 8), and observability catches behavioral drift (Principle 9). No single layer catches everything. All three must hold.
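A deterministic wrapper of the kind described above can be sketched as follows. The agent ids, operations, and policy map are hypothetical; the point is that enforcement lives in plain code outside the model, not in a system prompt:

```python
from dataclasses import dataclass

class PolicyViolation(Exception):
    """Raised when a proposed action crosses an enforced boundary."""

@dataclass(frozen=True)
class Action:
    operation: str       # e.g. "write", "deploy"
    target_domain: str   # e.g. "billing"

# Hypothetical policy map: agent id -> (allowed operations, owned domains).
POLICY = {
    "billing-agent": ({"read", "write"}, {"billing"}),
    "review-agent":  ({"read"},          {"billing", "payments"}),
}

def guarded_execute(agent_id: str, action: Action, execute):
    """Deterministic gate: the probabilistic agent proposes, this wrapper decides.
    Violations are detected at the system level and surface as exceptions."""
    allowed_ops, owned_domains = POLICY.get(agent_id, (set(), set()))
    if action.operation not in allowed_ops:
        raise PolicyViolation(f"{agent_id} may not {action.operation}")
    if action.target_domain not in owned_domains:
        raise PolicyViolation(f"{agent_id} does not own {action.target_domain}")
    return execute(action)  # boundary held; run the real tool call
```

Note that this catches only structural violations — the semantic layer still needs verification, as the paragraph above argues.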
Minimum bar: If a boundary is described but not enforced at runtime with automated detection and recovery, it is not architecture — it is documentation.
4. Right-size the swarm to the task
Prefer specialized agents coordinated through shared contracts and state. But do not default to maximum parallelism. A single well-evaluated agent with excellent tools often outperforms an expensive, uncoordinated swarm. Scale agents to complexity, not to ambition.
Design conflict resolution, not just parallelism. Swarms propose; a single commit path commits. Choose the simplest topology that solves the problem and graduate to more complex coordination only when evidence shows it is needed.
The point of a swarm is not to mimic an organization chart. It is to create structured disagreement, specialization, and reconciliation where the workload benefits from multiple perspectives. Intelligence at system scale is often plural rather than monolithic. The engineering question is not "how many agents can we run?" but "what coordination pattern produces better verified outcomes than a single agent on this workload?"
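The "swarms propose; a single commit path commits" rule can be sketched as a deterministic reconciler. The `Proposal` shape and scoring rule are assumptions for illustration, not a prescribed protocol:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposal:
    agent_id: str
    diff: str
    eval_score: float   # verified evaluation pass rate for this proposal
    verified: bool      # evidence bundle present and checks passed

def commit_path(proposals):
    """Single deterministic commit path: the swarm proposes in parallel, one
    reconciler commits. Only verified proposals are eligible; ties break on
    agent_id so the outcome is reproducible."""
    eligible = [p for p in proposals if p.verified]
    if not eligible:
        return None  # nothing met the verification bar; no commit happens
    return max(eligible, key=lambda p: (p.eval_score, p.agent_id))
```

The design choice worth noting: reconciliation is boring, deterministic code, so disagreement between agents resolves the same way every time.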
Signals that indicate a single agent is insufficient:
- The task requires concurrent reads or writes across multiple bounded contexts where race conditions cannot be resolved inside a single agent.
- Evaluation pass rate plateaus below threshold across successive sessions despite specification refinement, indicating context degradation under length.
- The task requires adversarial specialization — roles whose objectives conflict and cannot be fully trusted from the same agent (e.g., implementation and independent security review).
- Single-agent tool call depth or context budget is consistently saturated on representative tasks.
In the absence of these signals, default to single-agent or pipeline.
Minimum bar: If shared state is not typed, versioned, and reconciled, the swarm is a mob.
Minimum bar (tier containment): An orchestrator cannot delegate actions to specialist agents that exceed its own authorized autonomy tier. Tier elevation requires the same approval path regardless of whether the request originates from a human or an orchestrating agent.
5. Autonomy is a tiered budget, not a switch
Grant permissions by risk tier, least privilege, and blast-radius limits. Agents behave like serverless functions, not employees: spin up for a guarded task, verify the result, and terminate.
Autonomy operates in explicit governance tiers — each defining who approves, what evidence is required, and what blast radius is acceptable:
Tier 1 — Observe. Agents analyze and propose. Blast radius: none.
Tier 2 — Branch. Agents write to isolated branches. Humans approve merges. Blast radius: contained.
Tier 3 — Commit. Agents take production-impacting actions with explicit human approval, attached rollback plans, and verified evidence. Blast radius: governed.
Within each tier, define granular permissions: read production data but not write, deploy to canary but not full rollout, modify test code but not application code, change configuration but not schema. Tiers define the governance level; permissions define the allowed actions within that level.
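The tier model, the granular permissions within it, and the tier-containment rule for delegation can be sketched together. Permission strings and the `Grant` shape are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import IntEnum

class Tier(IntEnum):
    OBSERVE = 1   # analyze and propose; blast radius: none
    BRANCH = 2    # write to isolated branches; humans approve merges
    COMMIT = 3    # production impact with explicit human approval

@dataclass(frozen=True)
class Grant:
    tier: Tier
    permissions: frozenset  # granular, e.g. "read:prod", "deploy:canary"

def delegate(orchestrator: Grant, requested: Grant) -> Grant:
    """Tier containment: a specialist never exceeds the orchestrator's own
    tier or permission set. Elevation goes through the governed approval
    path, never through delegation."""
    if requested.tier > orchestrator.tier:
        raise PermissionError("tier elevation requires governed approval")
    if not requested.permissions <= orchestrator.permissions:
        raise PermissionError("delegated permissions exceed orchestrator grant")
    return requested
```

Because `delegate` is ordinary code, the containment minimum bar holds regardless of whether the elevation request originated from a human or an orchestrating agent.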
Minimum bar: If you cannot reconstruct an agent's reasoning at any tier, your autonomy model has failed.
Phase maturity is a prerequisite for autonomy tier. Tiers and phases are not independent: a team cannot safely operate at a higher tier than their phase supports, regardless of available infrastructure.
These maximum tiers are conservative defaults for the relevant work item, not a blanket organization-wide policy. Calibrate by domain, data classification, and blast radius.
| Phase | Maximum available tier | Rationale |
|---|---|---|
| Phase 1-2 | Tier 1 only for governed production work | No evaluation suite, no evidence bundles — agent output is unverified |
| Phase 3 | Tier 1 only for governed production work | Autonomy without verification; governance infrastructure not yet in place |
| Phase 4 | Tier 2 (branch + human approval) | Verification gates operational; blast radius is contained |
| Phase 5+ | Tier 3 (governed production impact) | Full Agentic Loop with verification, validation, and domain-scoped accountability |
In regulated industries, use-case-specific caps apply independently of phase. See Companion Frameworks for the regulated-industry cap table.
Phase maturity and task blast radius are independent checks. Team phase determines the governance capability ceiling; it does not automatically qualify every task that falls nominally within that tier. For each task, perform a separate blast-radius assessment before acceptance:
- What is the maximum credible impact if this specific task fails?
- Does that impact stay within the governance coverage of the current phase?
- If not — escalate the task to a phase with appropriate coverage, or decompose it so each subtask stays within the governance boundary.
A Phase 4 team operating correctly for Phase 4 can still fail on a cross-domain task whose blast radius exceeds Phase 4 governance coverage. Phase is a team capability ceiling; blast-radius assessment is a per-task gate. The most consequential failures tend to occur at domain boundaries, where tasks cross phase ceilings that are not checked at the task level.
6. Knowledge and memory are distinct infrastructure
An agent without memory is a liability. But knowledge and memory are not the same thing, and conflating them is dangerous.
Knowledge is ground truth: code, documentation, ADRs, formal contracts, domain constraints. It is versioned, deterministic, and authoritative.
Learned memory is heuristic: reasoning patterns, incident learnings, routing preferences, and reusable skills. It is probabilistic, subject to decay, and requires continuous renewal — not just point-in-time control. Provenance, expiration, compression, rollback, and domain scoping are the mechanisms of that renewal cycle: each one governs not only what is stored, but whether what was learned is still valid before it is reused.
The practical test: if it changes through governed processes (pull requests, ADR reviews, schema migrations), it is knowledge. If it changes through feedback loops (agent learning, incident adaptation, routing optimization), it is learned memory. The governance mechanism determines the classification.
At the frontier, memory is not only retrieval. Agents can externalize procedures as reusable skill artifacts that evolve through experience without changing model weights. Those learned skills require the same provenance, review, rollback, and scoping discipline as any other memory layer.
Memory failure modes. The governance mechanisms above address the what-and-when of memory management. The threat model addresses what goes wrong when they fail:
- Memory poisoning — an agent writes incorrect learnings that corrupt future agent behavior across sessions. Mitigate with human review gates on memory writes from agents operating at Tier 2 or above.
- Cross-agent contamination — Agent A's domain-specific memory leaks into Agent B's reasoning context. Mitigate with domain-scoped memory namespacing and access controls on memory read paths.
- Consistency under concurrency — two agents update the same memory store with conflicting observations. Mitigate with versioned writes and explicit conflict resolution policies, the same as for any shared mutable state.
- Audit trail gap — "what version of memory was active when this decision was made?" requires point-in-time snapshots, not just current state, for meaningful incident reconstruction.
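The mitigations above — provenance, expiration, domain scoping, versioned append-only writes with rollback — can be sketched as a minimal store. The API names are illustrative, not a real library:

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryRecord:
    key: str
    value: str
    version: int
    agent_id: str       # provenance: who wrote it
    domain: str         # scoping: which domain may read it
    written_at: float
    ttl_seconds: float  # expiration: learnings decay

class GovernedMemory:
    """Minimal sketch: versioned, scoped, expiring memory with rollback."""

    def __init__(self):
        self._history = {}  # key -> append-only list of versions (the audit trail)

    def write(self, key, value, agent_id, domain, ttl_seconds, now=None):
        now = time.time() if now is None else now
        versions = self._history.setdefault(key, [])
        rec = MemoryRecord(key, value, len(versions) + 1, agent_id, domain, now, ttl_seconds)
        versions.append(rec)  # old versions remain for point-in-time reconstruction
        return rec

    def read(self, key, domain, now=None):
        """Domain-scoped read of the latest unexpired version, else None."""
        now = time.time() if now is None else now
        for rec in reversed(self._history.get(key, [])):
            if rec.domain != domain:
                continue  # cross-agent contamination guard
            if now - rec.written_at > rec.ttl_seconds:
                return None  # expired: must be revalidated before reuse
            return rec
        return None

    def rollback(self, key):
        """Discard the most recent version (e.g. after detected poisoning)."""
        if self._history.get(key):
            return self._history[key].pop()
```

Versioned writes also give the audit trail its answer to "what memory was active when this decision was made" — the history list is the point-in-time snapshot.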
Minimum bar: If memory cannot expire, be rolled back, or show provenance, it is not memory — it is a liability. And if memory is not revalidated against current architecture and process before reuse, it is not being governed — it is being trusted.
7. Context is engineered like code
If the knowledge store is polluted with bad embeddings or stale data, the agent hallucinates — no matter how clean the code. Context quality and code quality are coupled. Context is a first-class dependency, engineered with the same rigor as code: versioned, tested, and performance-benchmarked.
Context retrieval must be fast enough to sustain the reasoning loop. Context windows are finite and reasoning quality degrades as low-signal context accumulates. Engineer explicit context budgeting: hierarchical retrieval, rolling summaries, state compaction, and authority-weighted pruning.
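Authority-weighted pruning under an explicit context budget can be sketched in a few lines. The `(authority, tokens, text)` shape and the greedy keep-highest-authority rule are assumptions, not a prescribed formula:

```python
def prune_context(items, budget_tokens):
    """Greedy authority-weighted pruning: keep the highest-authority items
    that fit the token budget; low-signal context is dropped first.
    items: iterable of (authority, tokens, text) triples."""
    kept, used = [], 0
    for authority, tokens, text in sorted(items, key=lambda i: -i[0]):
        if used + tokens <= budget_tokens:
            kept.append(text)
            used += tokens
    return kept
```

In a real system the authority weights would themselves be governed (an outdated ADR must not outrank current policy), which is exactly the quality failure the minimum bar below warns about.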
Minimum bar: If retrieval takes longer than the reasoning loop tolerates, context is broken infrastructure. But slow is not the only failure mode: stale embeddings, conflicting sources, semantic precision failures (fast retrieval of wrong artifacts), poisoned retrieval artifacts, and authority-weighting errors (an outdated ADR silently overriding current policy) are quality failures that a performance criterion does not catch. Context quality and code quality are coupled — both must be verified, not just timed.
8. Evaluations are the contract; proofs are a scale strategy
Evaluations define what "correct" means in terms the system can check autonomously. Every change must be verified against the evaluation suite — and every change must preserve or improve evaluation performance. Without evaluations, verification is assertion. Without verification, done is a claim.
Evaluations evolve with the system: coverage of the happy path, adversarial cases, regression scenarios, and behavioral checks. They are the machine-readable form of the acceptance criteria in Principle 2. When the specification changes, evaluations change with it.
"Proofs" here means formal verification of the contracts and infrastructure around agents — not of the agent's reasoning itself. You can prove that a retry policy is idempotent, that a state machine has no deadlocks, or that a type contract is satisfied. You cannot formally prove what an LLM will decide. The value of proofs scales with module count and risk: as more agents interact through more contracts, the contracts themselves become worth proving.
Minimum bar: If evaluations do not include regression cases, verification is incomplete.
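The two checkable claims in this principle — the portfolio must contain regression cases, and every change must preserve or improve evaluation performance — can be sketched as a gate. `EvalCase` and the case-kind labels are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    name: str
    kind: str    # "happy_path" | "adversarial" | "regression" | "behavioral"
    passed: bool

def verify_change(cases, baseline_pass_rate):
    """Verification gate: without regression cases the check is incomplete;
    a pass rate below baseline means the change degraded the contract."""
    if not any(c.kind == "regression" for c in cases):
        return False, "incomplete: no regression cases"
    pass_rate = sum(c.passed for c in cases) / len(cases)
    if pass_rate < baseline_pass_rate:
        return False, "regression: evaluation performance degraded"
    return True, "verified"
```

A gate in this shape makes "done is a claim" falsifiable: the change either carries the evidence or it does not merge.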
Verification, validation, and independent validation are distinct disciplines. Passing evaluations satisfies verification. It does not satisfy validation or independent validation, which require additional steps:
| Discipline | Question answered | Owner | Timing | Required by |
|---|---|---|---|---|
| Verification | Did we build it right? Implementation matches specification. | Development / QA team | Pre-merge, every change | Always |
| Validation | Did we build the right thing? Specification matches real-world need. | Product / domain owner | Pre-release | Phase 4+; always for regulated systems |
| Independent validation | Were verification and validation themselves rigorous? | Organizationally separate team (2nd line) | Pre-production | Any high-stakes system; mandated by SR 11-7, SS1/23, DORA in regulated industries |
Independent validation is a governance principle, not merely a compliance requirement. Any system where a verification failure could cause significant harm — financial, safety-critical, reputational, or legally consequential — warrants organizational separation between the team that builds and verifies and the team that validates. Regulation formalizes this requirement; it does not create it. The most common failure: teams perform verification, label it validation, and have no independent validation. This is a quality gap in any context, not only a regulatory audit finding.
Independent validation must be capable of blocking production deployment. A team that can only observe and advise is not independent validation — it is a consultation. See Principle 12 for the accountability structure that makes independent validation meaningful.
9. Observability and interoperability cover reasoning, not just uptime
Instrument decisions, tool calls, policy violations, memory retrievals, cost per task, and near-misses — so you can explain why something happened, not just that it happened. Every agent action must produce an inspectable trace: diffs, tool calls, decision chains, evaluation results, rollbacks.
Traces are not logging. Logging records events. Traces reconstruct reasoning — the full chain from specification to decision to action to outcome. They are the audit trail that makes agentic systems governable, debuggable, and safe.
Observability and interoperability are coupled here because portable observability requires interoperable trace formats. You cannot aggregate traces across vendor boundaries without standardized contracts, and you cannot debug cross-runtime failures without replayable tool logs. They have separate minimum bars but share a dependency: without interoperability, observability fragments at the system boundary where it matters most.
Minimum bar (observability): If you cannot answer "why did this happen" from traces alone, you are not instrumented.
Minimum bar (interoperability): If tools cannot be swapped or replayed across runtimes without rewriting core workflows, the platform is brittle.
10. Assume emergence; engineer containment
Multi-agent systems exhibit emergent behavior by nature — some useful, some dangerous. Expect nonlinear failures, feedback loops, and phase changes. Build guardrails, rate limits, circuit breakers, and safe fallbacks before you need them.
When emergence produces useful behavior, capture it. When emergence produces dangerous behavior, contain it. The difference between these two outcomes is the quality of your containment engineering.
Security is a containment concern, not a separate audit. Agentic systems that autonomously write, execute, and deploy code present a distinct attack surface that must be threat-modeled before granting autonomy beyond Tier 1:
- Prompt injection — adversarial content in retrieval artifacts, tool responses, or code patterns that redirects agent behavior without the operator's knowledge.
- Privilege escalation — chained agent calls that accumulate permissions no single call would be granted under least-privilege policy.
- Data exfiltration — tool calls that surface sensitive data to outputs that are not fully inspected or logged.
- Supply chain attacks — poisoned tool registries, model adapters, or retrieval sources that corrupt agent behavior at ingestion time.
- Social engineering — AI-generated outputs crafted to pass human reviewer scrutiny by exploiting reviewer trust in fluent, confident text.
Treat every retrieval artifact, tool response, and agent-to-agent message as untrusted input. Defense-in-depth means identity for agents and tools, signed provenance for shared state, least-privilege tool scopes, egress controls, and continuous anomaly detection for cross-agent trust edges.
Minimum bar: If you have not tested with tool outages, noisy retrieval, and adversarial inputs, you are not chaos-tested. If you have not threat-modeled prompt injection, privilege escalation, and exfiltration vectors for your specific agent topology, you are not security-tested.
11. Optimize the economics of intelligence
Not every task requires the most capable model. Build a dynamic routing layer. Route simple tasks to fast, cheap models. Reserve expensive, high-reasoning models for complex orchestration and critical decisions. Model choice is a runtime decision, not a configuration constant.
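A routing layer of this kind can be sketched as a per-task decision function. The tier names, thresholds, and risk classes are illustrative assumptions, not vendor identifiers:

```python
def route(task_complexity: float, risk: str) -> str:
    """Runtime model routing sketch. Tiers (cheapest to most capable):
    "fast-small", "mid-general", "frontier". task_complexity in [0, 1];
    risk comes from the task's governance assessment."""
    if risk == "critical":
        return "frontier"     # correctness dominates inference cost
    if task_complexity < 0.3:
        return "fast-small"   # cheap, fast model for simple work
    if task_complexity < 0.7:
        return "mid-general"
    return "frontier"         # complex orchestration and critical decisions
```

The thresholds themselves should be tuned against tracked cost-per-outcome data, which is the subject of the next paragraph.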
Optimize total cost of correctness — not just inference cost, but the full cycle: inference + verification + governance overhead + incident remediation. Include human costs: review time per tier, context-switching across model behaviors, and debugging heterogeneous failure modes in multi-model routing. Track cost per task, cost per outcome, and cost per quality unit. When governance overhead exceeds the value of the work, that is a signal to simplify, not to add more governance.
Multi-model coherence. In heterogeneous swarms, different models may hold conflicting internal representations of the same codebase — different architectural pattern priors, different conventions for what "correct" looks like, different training-data views of domain boundaries. This coherence gap compounds at Phase 5+ when agent roles are highly specialized. Mitigate by: making shared architectural decisions explicit in the knowledge base rather than relying on implicit prompt conventions; routing semantically related tasks through the same model tier when consistency matters more than cost; and treating cross-model disagreement on shared artifacts as an observable quality signal rather than a coordination annoyance.
Minimum bar: If model choice is a configuration constant instead of a runtime decision, you are overspending.
12. Accountability requires visibility
Agents execute; humans own outcomes, risks, approvals, and incidents. No agent — however capable — absorbs legal, ethical, or operational responsibility. Release decisions, risk acceptance, production behavior, and incident response require a human with skin in the game.
But accountability without visibility is a legal fiction. You cannot own what you cannot see. The autonomy tiers in Principle 5, the traces in Principle 9, and the verification and validation disciplines in Principle 8 exist to make human accountability meaningful rather than ceremonial.
In regulated environments, accountability extends to independent validation: the organizational separation between the team that builds and verifies a system and the team that independently validates it is not bureaucracy — it is the mechanism that makes accountability real. A governance structure where the same team both builds and validates has no external check on whether its verification was genuine.
Accountability at scale operates at the policy level, not the action level. When agents process thousands of actions daily, per-action human review is neither feasible nor the right model. The resolution is a three-tier framework applied per action class:
| Action class | Human involvement | Accountability mechanism |
|---|---|---|
| Low-risk, reversible (Tier 1, contained blast radius) | None per action; domain owner reviews statistical samples and trend dashboards | Automated evidence bundle; rollback ready; anomaly alert if pattern deviates |
| Medium-risk, governed (Tier 2, branch + approval) | Human approves merge; does not review every line | Evidence bundle gates approval; trace available on demand |
| High-risk, production-impacting (Tier 3) | Named human reviews evidence and accepts risk per change | Full evidence bundle required; no automated promotion |
A domain owner owns the risk policy, the autonomy tier ceiling, the escalation path, and the incident response protocol for their domain. They do not approve every low-risk action — they own the framework that governs those actions, and they carry the accountability when that framework fails. When trace volume exceeds meaningful review capacity, the correct response is to raise automation barriers (tighten evaluation thresholds, lower autonomy tiers) until oversight signal quality is restored — not to accept degraded oversight as a workload problem.
Failures are data: errors and crashes are learning opportunities, and hallucinations can compound into a loop where plausible-but-wrong early output drives increasingly wrong follow-on fixes. Never simply retry a failed prompt. Diagnose, update memory, strengthen contracts and constraints, and rerun verification before retrying. But someone must own the consequences when systems go live. Clear responsibility is not bureaucracy; it is system safety.
Minimum bar: If no named human can inspect the reasoning, review the evidence, and own the outcome of a production agent, the system is ungoverned.
What "done" means in agentic engineering.
See the Manifesto for the core values and the Agentic Loop. See the Twelve Principles for the engineering principles.
The Agentic Definition of Done
Tokens generated and tasks dispatched are vanity metrics. "The agent said it worked" is not a completed ticket.
A change is done when it is:
Shipped — deployed or delivered, not just merged.
Observable — instrumented and logged so reasoning can be inspected and reconstructed from traces.
Verified — evaluated against regression tests (and adversarial cases), with an evidence bundle (diffs, trace IDs, policy check outputs) required for every automated merge.
Provable (when risk requires it) — formalized invariants and replayable proof artifacts attached for critical workflows.
Learned from — knowledge base and learned memory updated with what was discovered, with provenance.
Governed — operating within autonomy tiers appropriate to its risk, with human accountability assigned.
Economical — routed through appropriate model tiers, cost tracked and justified per outcome.
Anything less is not done for the current phase.
This DoD is phase-calibrated, not all-or-nothing. At Phase 3, "verified" means tests and a diff; at Phase 5, it means reproducible replay with formal artifacts where justified. "Provable" applies only when risk requires it; "economical" matters only when routing infrastructure exists. The bar rises with the stakes — but at every phase, the question is the same: can you show evidence, not just assertions?
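The evidence-centric, phase-calibrated DoD can be made concrete as a data structure plus a gate. The field names are illustrative — real bundles would carry signed manifests and policy-check outputs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvidenceBundle:
    """Illustrative sketch of the evidence attached to a 'done' change."""
    diff_id: str
    trace_ids: list            # observable: reasoning reconstructable from traces
    eval_results: dict         # verified: case name -> passed
    deployed: bool             # shipped: deployed or delivered, not just merged
    accountable_owner: Optional[str]   # governed: a named human
    rollback_plan: Optional[str] = None

def is_done(b: EvidenceBundle, phase: int) -> bool:
    """The bar rises with the phase, but evidence is always required."""
    base = (b.deployed and bool(b.trace_ids) and bool(b.eval_results)
            and all(b.eval_results.values()) and b.accountable_owner is not None)
    if phase >= 4:   # higher phases demand stronger artifacts
        base = base and b.rollback_plan is not None
    return base
```

The gate answers the question the paragraph above poses: can you show evidence, not just assertions?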
Evolvability as an implicit criterion. A change that passes today's tests but degrades the codebase's capacity for future change is not truly done — it has traded short-term correctness for structural regression. The SWE-CI benchmark (arXiv:2603.03823) documents that most agents introduce behavioral regressions in over 75% of CI iterations on a long-horizon maintenance benchmark; treat it as one calibration point, not a universal rate. It is evidence of behavioral regression risk, not a direct measure of architectural evolvability: CI metrics do not capture coupling growth, cohesion decay, abstraction quality, or future-changeability. Both risks are real and distinct. At Phase 4 and above, "verified" should include evolution-weighted signals beyond CI pass rates — static analysis for coupling growth, module boundary stability, and change amplification — alongside the behavioral regression coverage the benchmark measures. See Structural Regression in the Companion Guide.
Why it matters: This forces the system to optimize for actual business outcomes rather than raw output volume, killing the illusion of productivity.
Definition of Done for Hardening
Applying the agentic DoD to work that begins as rapid exploration ("vibe coding") and must become governed engineering before it ships.
Exploratory agent output is not production-ready by default. A prototype that "worked in the demo" has not passed the Agentic Definition of Done. The four steps below define what hardening means: the path from captured exploration to governed, verifiable output.
Step 1 — Capture. Record the vibe output exactly as produced: diffs, trace IDs, prompts used, tool calls made, and any model or configuration state at the time of generation. Treat this as raw evidence, not a deliverable. Do not edit or clean the output before capturing it — the unmodified artifact is the baseline.
Step 2 — Extract Specification. From the captured output, derive the specification the agent was implicitly working toward: what behavior does the output exhibit, what constraints does it respect (or violate), and what observable success criteria would confirm it is correct? This step converts intent from the agent's context window into a machine-readable, reviewable specification. If no coherent specification can be extracted, the output is not a candidate for hardening — it is a candidate for restart.
Step 3 — Build Evaluation Portfolio. For the extracted specification, author an evaluation portfolio (P8): behavioral tests, adversarial cases, and at least one holdout case not derived from the captured output. The portfolio must include explicit regression coverage for any behavior the captured output depends on. Evaluation theater — a portfolio that only tests the happy path the exploration already demonstrated — does not satisfy this step.
Step 4 — Verify and Refactor. Run the evaluation portfolio against the captured output. Fix every failure. Refactor for structural quality (coupling, abstraction, module boundary stability) sufficient for the change's autonomy tier and risk level. Attach the evidence bundle (passing evaluations, trace IDs, refactoring diffs) to the change. The change is done when the evidence bundle is complete and a named human is accountable for it (P12).
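The Step 3 check against evaluation theater can be sketched as a simple gate. The case shape (a `"kind"` key) is an assumption for illustration:

```python
def portfolio_is_valid(cases) -> bool:
    """Step 3 gate: the portfolio must include behavioral, adversarial, and
    regression cases, plus at least one holdout case not derived from the
    captured output. A happy-path-only portfolio fails."""
    kinds = {c["kind"] for c in cases}
    return {"behavioral", "adversarial", "regression", "holdout"} <= kinds
```

A portfolio that fails this gate sends the output back to Step 3, not forward to Step 4.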
The practical test. Ask: if the person who ran the exploration session left today, could another engineer reproduce, modify, and extend this output using only the specification, the evaluation portfolio, and the evidence bundle? If the answer is no, hardening is not complete.
When to skip hardening. Exploration output that will be discarded — a spike, a proof of concept that will be rewritten, a learning exercise — does not require hardening. The trigger for hardening is intent to ship, not intent to keep. If the output is going to influence production behavior in any form, the four steps apply.
Extended guidance, tradeoffs, and operational detail for each principle in the Agentic Engineering Manifesto.
Read the Manifesto for the core values and minimum bars. See the Companion Guide for the full table of contents. See the Adoption Playbook for organizational change management, role transitions, and pilot design.
Principle 1 — Outcomes: Extended Guidance
See Principle 1 in the manifesto for the core statement and minimum bar.
The Probability-Compounding Problem
A common intuition is that system correctness compounds multiplicatively — if each module is correct with probability p, a system of N modules has roughly p^N correctness. This mental model is misleading in two directions:
- Too optimistic, because it assumes independent failures. Real agentic systems share models, knowledge bases, and tool chains that create correlated failure domains. A single poisoned retrieval shard or a shared model blind spot can invalidate every agent simultaneously — far worse than p^N predicts.
- Too pessimistic, because cross-verification between agents can break the compounding chain in ways that independent modules cannot. When agents verify each other's outputs against independent evidence sources, the effective error rate can be driven below any individual module's failure rate.
The useful question is not "what is p^N?" but "where are the shared dependencies that make failures correlated?" A working failure-domain decomposition:
- Correlated model failure: The same base model is used everywhere, making reasoning blind spots systemic.
- Correlated retrieval failure: The same poisoned or stale knowledge base shard feeds multiple agents. In practice, this is often the most insidious class because it produces plausible-looking but systematically wrong outputs.
- Correlated tool failure: The same flaky integration or API rate limit blocks the entire swarm.
- Correlated governance failure: The same reviewer fatigue or policy misconfiguration rubber-stamps errors.
This is a practitioner framework, not a proven exhaustive taxonomy. Teams should extend it for their specific failure surfaces and validate priority ordering against their own incident data. The shared dependencies it names mean system-level risk is often much worse than independent-failure models suggest — but also that targeted decorrelation (diverse models, independent retrieval indexes, redundant tool chains) can yield outsized reliability gains.
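One correlated-failure class can be made concrete with a small Monte Carlo sketch. All probabilities here are illustrative assumptions: a shared dependency (say, one base model) fails with some probability and invalidates every module at once; otherwise modules fail independently. The independent-failure model predicts 0.95^10 ≈ 0.60 system success; the shared fault drags the rate below that, and the gap widens with the correlation no matter how good each module is in isolation:

```python
import random

def system_success(p_module=0.95, n_modules=10, p_shared_fault=0.03,
                   trials=50000, seed=0):
    """Estimate system success when a shared dependency can take down all
    modules at once. Returns the simulated success rate in [0, 1]."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    ok = 0
    for _ in range(trials):
        if rng.random() < p_shared_fault:
            continue  # correlated failure: the whole system is invalid
        if all(rng.random() < p_module for _ in range(n_modules)):
            ok += 1  # every module succeeded independently
    return ok / trials
```

Raising `p_module` cannot buy back the correlated term — only decorrelation (diverse models, independent retrieval indexes, redundant tool chains) removes it, which is why targeted decorrelation yields outsized reliability gains.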
Evidence Bundles and Assurance Levels
This does not mean full formal verification is a near-term default for every team. It means assurance must scale with blast radius and system size. Evidence bundles should be immutable, replayable, and auditable, with proof artifacts introduced where risk justifies cost: signed trace manifests when required by policy, deterministic replay artifacts, and formalized invariants verified by proof or model-checking tools where warranted.
Principle 2 — Specifications: Extended Guidance
See Principle 2 in the manifesto for the core statement and minimum bar.
Contract-First Agentic Development
In practice, this can include contract-first agentic development: agents propose both implementation and machine-checkable contracts (preconditions, postconditions, invariants), then iterate in a tight loop: specify, implement, attempt to prove, fail, refine, repeat. Proof failure is not a blocker to hide; it is a steering signal.
Specifications as Agent-Consumable Artifacts
The specification-as-living-artifact pattern now has concrete implementations. Agent Skills (SKILL.md files — structured metadata plus step-by-step instructions that agents consume at runtime) and AGENTS.md (repository-level machine-readable constraints) are increasingly supported across several IDEs and coding agents. Both formats validate the core P2 claim: specifications that agents can parse directly reduce ambiguity, improve adherence, and make convergence measurable. Skills define what an agent can do; AGENTS.md defines how it must behave within a codebase. Together with agent-to-tool protocols (which define how agents connect to external capabilities), they form the specification layer of the emerging standards stack.
The Specification-Driven Development Movement
The specification-first pattern is not just an architectural recommendation — it is converging as the dominant practitioner workflow. A wave of open-source specification-driven development (SDD) frameworks has emerged, all built on the same thesis P2 advocates: write the spec before the agent writes the code. The pattern across these frameworks is consistent: specifications are treated as code artifacts, baked into workflows, and consumed by agents before implementation begins — whether through specify-plan-implement pipelines, state-machine-governed iteration, or composable skill-driven workflows. This validates P2's core claim at practitioner scale. See Sources for specific framework references.
Convergence Criteria
Specification evolution needs convergence criteria. Treat a specification as converging when acceptance criteria remain stable across successive iterations, scope narrows rather than expands, and incident classes trend downward. If each loop adds ambiguity or expanding goals without quality improvement, treat it as scope drift and reset the boundary.
Validation vs. Verification
Evaluations (Principle 8) and evidence bundles (Principle 1) answer the verification question: did we build it right? They confirm the implementation matches the specification. But verification alone has a blind spot: you can pass every check and still ship the wrong thing, just faster.
Validation answers a different question: did we build the right thing? Does the specification itself make business sense? Is the work scoped correctly? Will real users get value from it? Agents make the validation gap more dangerous because they can generate feature-shaped output quickly — complete with passing tests, clean architecture, and a full evidence bundle — while the underlying specification was never worth implementing.
The Agentic Loop addresses validation explicitly through the Validate → Observe → Learn → Govern cycle: after verification confirms technical correctness, validation checks fitness for real-world use; runtime behavior, usage data, and business outcomes then feed back into specification revision. But this only works if teams treat Validate as a distinct discipline from Verify, not just a technical monitoring step. Concretely:
- Frame the work in context before specifying. Is this a proof of concept, a minimum viable feature, or a production commitment? Define "good enough" for each context and make the underlying business assumptions explicit. An agent cannot validate its own specification against business reality — that is a human judgment that must happen before the Loop begins.
- Define stop criteria, not just acceptance criteria. Acceptance criteria tell the agent when the implementation is correct. Stop criteria tell the team when to abandon or pivot the specification itself — when usage data, customer feedback, or market evidence shows the spec was wrong regardless of implementation quality.
- Connect evaluation results to business outcomes. If escaped defect rate is low but adoption, usage, or customer satisfaction metrics don't improve, the verification machinery is working but the validation loop is broken.
This is not a new idea — it is the core of Agile's "customer collaboration" value, and it survives unchanged into agentic engineering. What changes is that agents amplify the failure mode: without explicit validation loops, a team can ship more verified-but-wrong features in a month than a human team could in a quarter.
Requirements Engineering for Agentic Systems
Traditional RE was designed for deterministic systems. Agentic and hybrid systems require an extended framework. The key extensions are covered in companion-re-framework.md. The three most important for specification work:
Two-axes classification. Every requirements artifact sits on two axes: (1) system type — deterministic, agentic, or hybrid; and (2) artifact consumer — human, agent, or hybrid. The cell your requirement occupies determines the correct format and verification approach. Probabilistic assurance targets replace binary pass/fail requirements for agentic components. Agent-consumable specifications must be unambiguous to a machine — contextual inference is unreliable.
Behavioral envelopes. For agentic components, the primary specification artifact is a behavioral envelope — the boundary the system must stay within — not a list of enumerated acceptable outputs. The envelope's Layer 1 hard boundaries must be enforced by infrastructure policy, not prompt instruction. The performance envelope generates the evaluation suite directly.
Single-source principle. When a specification serves both human and agent consumers, one canonical document must be the source of truth. All other representations — governance prose, machine-readable encoding, evaluation criteria, compliance mapping — are derived projections. Independent authoring of separate documents is a divergence schedule.
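The two-axes classification can be sketched as a lookup over the matrix cells. The cell contents below are illustrative placeholders; the authoritative matrix lives in companion-re-framework.md:

```python
from enum import Enum

class SystemType(Enum):
    DETERMINISTIC = "deterministic"
    AGENTIC = "agentic"
    HYBRID = "hybrid"

class Consumer(Enum):
    HUMAN = "human"
    AGENT = "agent"
    HYBRID = "hybrid"

def verification_approach(system: SystemType, consumer: Consumer) -> str:
    """Return the verification approach for a requirement's cell.
    Illustrative only -- see companion-re-framework.md for the real matrix."""
    if system is SystemType.DETERMINISTIC:
        approach = "binary pass/fail requirements"
    else:
        # agentic and hybrid components get probabilistic assurance targets
        approach = "probabilistic assurance targets"
    if consumer is not Consumer.HUMAN:
        # anything an agent consumes must be unambiguous to a machine
        approach += "; machine-unambiguous specification format"
    return approach
```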
See companion-re-framework.md for the full framework: two-axes matrix, hard requirements vs. probabilistic assurance targets, behavioral envelope structure, tiered lifecycle, per-requirement checklist, and academic references (arXiv:2602.22302, arXiv:2503.18666, NIST AI 600-1, ISO/IEC 5338).
The Architect Pattern: Agent-Generated Specifications
The manifesto treats specification steering as a human-governed activity. But emerging evidence shows that specification generation itself can be an agent capability — and that the quality of this capability is the primary differentiator in long-term maintainability.
The Architect–Programmer pattern separates these concerns explicitly: an Architect agent observes system behavior (test results, CI feedback, runtime metrics), diagnoses root causes, and generates machine-readable requirements. A Programmer agent implements those requirements. The cycle repeats: the Architect observes the results, refines the specification, and the Programmer iterates.
This pattern is a concrete instantiation of the Agentic Loop's Observe → Learn → Specify cycle. The SWE-CI benchmark (arXiv:2603.03823) validates it empirically: across 100 tasks spanning an average of 233 days and 71 commits of real-world development history, the Architect's ability to transform CI feedback into actionable requirements was the primary differentiator in long-term code maintainability. The three-step Architect protocol — Summarize (review failures), Locate (attribute to deficiencies), Design (produce requirements) — maps directly to the manifesto's convergence criteria: specifications that sharpen as evidence accumulates.
When to use this pattern: Long-running maintenance tasks where the specification must evolve across many iterations. For bounded, short-horizon tasks, a single agent with a clear specification may be more efficient (see Principle 4 guidance on topology choices). The Architect pattern is not a universal requirement — it is a validated topology for sustained evolution.
The governance implication: When specifications are agent-generated, the human role shifts from writing specifications to governing specification quality. The human defines the acceptance criteria for the Architect's output — what constitutes a valid requirement — and reviews the Architect's decisions at a cadence appropriate to the risk tier. The specification is still a governed artifact; the governance mechanism changes.
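The three-step Architect protocol can be sketched as a function from CI feedback to machine-readable requirements. The heuristics below are deterministic stand-ins for what is, in practice, an agent's judgment; the `Requirement` shape is an assumption:

```python
from dataclasses import dataclass, field

@dataclass
class Requirement:
    """Machine-readable requirement emitted by the Architect (illustrative)."""
    id: str
    statement: str
    evidence: list = field(default_factory=list)  # traces supporting this requirement

def architect_step(ci_failures: list) -> list:
    """Summarize -> Locate -> Design, as a deterministic sketch."""
    # Summarize: review and deduplicate failure reports
    summary = sorted(set(ci_failures))
    # Locate: attribute each failure class to a deficiency (stubbed heuristic)
    deficiencies = {f: f"deficiency behind: {f}" for f in summary}
    # Design: produce actionable requirements, each carrying its evidence
    return [
        Requirement(id=f"REQ-{i}", statement=f"Resolve {d}", evidence=[f])
        for i, (f, d) in enumerate(deficiencies.items(), start=1)
    ]
```

The human governs this output: acceptance criteria for what counts as a valid requirement apply to the returned list, not to the Programmer's diffs.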
Specifications vs. Constraints
Specifications and architectural constraints (Principle 3) operate at different layers and change at different speeds. Constraints are invariants — security policies, domain ownership boundaries, data integrity rules — that hold across specification iterations. Specifications are goals and acceptance criteria that evolve within those invariants.
In practice, this means: an agent can propose a revised acceptance criterion without governance overhead, but proposing a relaxed constraint triggers a governed review (ADR update, policy approval, impact assessment). If your system cannot distinguish these two change types, specification iteration will silently erode your architectural boundaries.
Principle 3 — Architecture: Extended Guidance
See Principle 3 in the manifesto for the core statement and minimum bar.
Prompt Drift and Enforcement
Prompts drift, and context windows degrade. They approximate compliance — they do not guarantee it the way a compiler obeys syntax. When architecture is merely described rather than enforced, agents will violate it. When architecture is enforced but not monitored, violations will go undetected.
Domain-Driven Design for Swarms
Domain-Driven Design gives each swarm a bounded context — what it owns, where code belongs, what is forbidden to reinvent. Retrieval is untrusted input; treat context injection as a threat vector. This reduces swarm collisions and hardens the system against both accidental drift and adversarial conditions.
AGENTS.md files (an emerging repository-level convention in the AAIF ecosystem for agent instructions) offer a practical mechanism for encoding architectural constraints at the repository level. They function as machine-readable ADRs that coding agents respect at runtime — a concrete implementation of architecture as defense-in-depth.
Agent-as-Tool and Software of Unknown Provenance
In regulated development, software components are classified by provenance and qualification status. When agents participate in development, three classification questions arise:
- The AI model itself: Non-deterministic, version-dependent, and opaque. Under IEC 62304 (SOUP), DO-178C/DO-330 (tool qualification), and GAMP 5 (software categories), the model cannot currently be qualified through traditional means.
- Agent-selected dependencies: When an agent pulls in a library or pattern, it is making a provenance decision that may carry regulatory consequences. The human must own dependency approval; the agent must not introduce unvetted dependencies silently.
- Agent-generated code: May incorporate training-data patterns that constitute derivative unclassified software. Evidence bundles must capture sufficient provenance to support classification.
The manifesto's defense-in-depth response: treat the agent as an unqualified tool and independently verify all output through qualified means. This is architecturally equivalent to treating retrieval as untrusted input (above). The infrastructure must enforce dependency allow-lists, and evidence bundles must capture dependency provenance.
See companion-frameworks.md for the cross-domain analysis and domains/ for domain-specific classification requirements.
Principle 4 — Swarm Topology: Extended Guidance
See Principle 4 in the manifesto for the core statement and minimum bar.
Topology Choices
Topology choices must be explicit, for example:
- Single agent/pipeline for bounded tasks with low coordination overhead.
- Hierarchy for clear decomposition with centralized decision checkpoints.
- Mesh for discovery-heavy work where peers benefit from lateral coordination.
Bio-inspired swarms (experimental): bee-hive patterns and similar biologically-inspired coordination models appear in research for large search and exploration spaces. These are not production-proven at the time of writing. Naming them here is not an endorsement — it is an acknowledgment that teams will encounter them. Default to single, pipeline, hierarchy, or mesh unless your own measured results on your own workload justify bio-inspired coordination.
Inter-Agent Communication Standards
Open agent-to-agent protocols are beginning to standardize agent discovery, task lifecycle management, and cross-framework collaboration. The manifesto's governance model — tiers, traces, accountability — sits above these protocols: the protocol handles agent-level coordination; the manifesto's principles govern what those agents are allowed to do and how their decisions are audited. Teams adopting multi-agent topologies should treat communication protocols as the coordination layer and the manifesto's tier model as the authorization layer.
Expected Failure Modes by Topology
Expected failure modes differ by topology: bottlenecked leads in hierarchies, coordination storms in meshes, hidden coupling in pipelines, and role drift or signal-amplification errors in bio-inspired swarms (for example, over-committing to early weak signals). Use bio-inspired topologies only with empirical evidence that they outperform simpler topologies for the target workload.
The Single-Agent Default and Its Limits
The manifesto states: "a single well-evaluated agent with excellent tools often outperforms an expensive, uncoordinated swarm." This holds for bounded, short-horizon tasks where specification and implementation can be handled in a single context.
For long-term maintenance tasks — where the specification must evolve across dozens of iterations based on accumulated evidence — the Architect–Programmer separation may be structurally necessary, not just a preference. The SWE-CI benchmark (arXiv:2603.03823) provides evidence: across tasks spanning an average of 233 days and 71 commits, separating specification generation (Architect) from implementation (Programmer) is the minimal viable structure for sustained code maintainability. A single agent attempting both roles must hold implementation context and specification-steering context simultaneously, which degrades at the timescales long-term maintenance requires.
The practical rule: default to a single agent for bounded tasks. Adopt the Architect–Programmer topology when the task horizon exceeds what a single context window can sustain, or when specification quality is the primary bottleneck. See the Architect Pattern in the P2 extended guidance for operational detail.
Topology as a Runtime Concern
The topology choices above are presented as design-time decisions, and for most teams at Phase 3–4 they are. But the frontier is moving toward adaptive topology selection — systems that choose coordination patterns at runtime based on task characteristics, resource availability, and learned performance data. Indicators of this shift include: federation hubs that route work across heterogeneous agent pools, ephemeral workers that share persistent state rather than maintaining their own, and consensus-backed coordination that replaces static orchestrator hierarchies.
Teams should design their topology as a deliberate architectural choice today, but build the abstraction layer that allows the topology to change without rebuilding the system. The practical test: can you switch from hierarchy to mesh for a given task class without rewriting coordination logic? If not, the topology is hardcoded, and you will pay for that rigidity as the ecosystem matures.
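The practical test above can be made concrete with a thin abstraction layer: coordination logic depends on a topology interface, not a topology. All names here are illustrative:

```python
from typing import Protocol

class Topology(Protocol):
    """Coordination strategy interface. Swapping implementations must not
    require changes to the dispatch logic below -- that is the test."""
    def assign(self, task: str, agents: list) -> dict: ...

class Hierarchy:
    def assign(self, task: str, agents: list) -> dict:
        # first agent acts as lead; workers report through it
        lead, *workers = agents
        return {w: f"{task} (reports to {lead})" for w in workers}

class Mesh:
    def assign(self, task: str, agents: list) -> dict:
        # every agent coordinates laterally with its peers
        return {a: f"{task} (peer coordination)" for a in agents}

def dispatch(topology: Topology, task: str, agents: list) -> dict:
    """Topology-agnostic coordination logic."""
    return topology.assign(task, agents)
```

If switching `Hierarchy()` for `Mesh()` at the `dispatch` call site is the only change needed, the topology is not hardcoded.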
Coordination Discipline
Parallelize exploration and analysis. Serialize decisions that change shared state. Coordination is never free: shared state must be typed, versioned, and reconciled. Contracts must be logged. Domain boundaries must prevent collisions. Without these, a swarm is a mob — agents duplicating work, producing conflicting diffs, or interpreting constraints inconsistently.
Principle 5 — Autonomy: Extended Guidance
See Principle 5 in the manifesto for the core statement and minimum bar.
Setting Tier Boundaries
The manifesto defines three tiers (Observe, Branch, Commit), but choosing where to draw the boundaries for your organization is the harder problem. Tier assignment should be driven by three factors:
- Blast radius: What is the maximum credible impact if the agent acts incorrectly? Tier 1 (Observe) for actions with no production impact. Tier 2 (Branch) for actions contained to isolated environments. Tier 3 (Commit) only for production-impacting actions with verified rollback.
- Reversibility: How quickly and completely can you undo a wrong action? Fast, clean rollback justifies higher autonomy. Irreversible actions (data deletion, external API calls, customer-facing communications) demand stricter gates regardless of blast radius.
- Confidence maturity: How long has the agent been operating on this task class, and what is the historical error rate? New task types start at Tier 1 even if the blast radius would theoretically permit Tier 2. Promote only when evidence shows consistent correctness over a meaningful sample size.
In practice, start conservative. Most teams should default every new agent capability to Tier 1 and promote through evidence, not through optimism.
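The three factors can be combined into a conservative assignment rule. The thresholds below are illustrative defaults, not manifesto policy — calibrate them to your domain:

```python
def assign_tier(blast_radius: str, reversible: bool,
                cycles_observed: int, error_rate: float) -> int:
    """Conservative tier assignment from blast radius, reversibility,
    and confidence maturity. Thresholds are illustrative."""
    if blast_radius == "production" and not reversible:
        return 1  # irreversible production impact: observe only
    if cycles_observed < 30 or error_rate > 0.01:
        return 1  # insufficient confidence maturity: start at Observe
    if blast_radius == "isolated":
        return 2  # contained to isolated environments: Branch
    if blast_radius == "production" and reversible:
        # Commit only with a spotless record; otherwise stay at Branch
        return 3 if error_rate == 0.0 else 2
    return 1  # default conservative: promote through evidence, not optimism
```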
Runtime Tier Escalation
Agents sometimes discover mid-task that they need capabilities above their current tier. The protocol for tier escalation must be explicit:
- The agent pauses execution and emits a structured escalation request: what action it needs, why, what evidence supports the request, and what the blast radius would be.
- The system routes the request to the appropriate approver (automated policy check for Tier 1→2, human reviewer for Tier 2→3).
- Approval is scoped and time-bounded — the agent receives temporary elevation for a specific action, not a blanket tier promotion.
- The escalation, approval, and outcome are traced and auditable.
If tier escalation happens frequently for a given task class, that is a signal to reassess the tier assignment — either the task class belongs at a higher tier, or the specification needs refinement to keep the task within its current tier.
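The escalation protocol can be sketched as a structured request plus a scoped, time-bounded grant. Field names and the 15-minute default are assumptions for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class EscalationRequest:
    """Structured escalation request emitted by a paused agent."""
    action: str                   # the specific action needing elevation
    justification: str            # why the agent needs it
    evidence: list                # trace IDs supporting the request
    predicted_blast_radius: str
    from_tier: int
    to_tier: int

def approve(req: EscalationRequest, ttl_minutes: int = 15) -> dict:
    """Scoped, time-bounded approval: temporary elevation for one action,
    not a blanket tier promotion. Routing and audit logging omitted."""
    expires = datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes)
    return {
        "action": req.action,                # elevation covers this action only
        "tier": req.to_tier,
        "expires_at": expires.isoformat(),
        "requires_human": req.to_tier >= 3,  # Tier 2->3 routes to a human reviewer
    }
```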
Long-Lived Agents
Long-lived agents are an exception that requires explicit justification, heartbeat monitoring, and drift controls. Tools are capabilities; audit tool access and grant least privilege. Make risky actions reversible or approval-gated.
The human role is to define the specification, set the tier, and own the outcome — not to supervise every intermediate step. But autonomy without governance is negligence. Calibrate the tier to the stakes.
Infrastructure-Level Tier Enforcement in Practice
Enterprise agent runtimes are demonstrating what infrastructure-level tier enforcement looks like at scale: declarative permission policies (typically YAML or equivalent), audit logs for every agent action, and guardrail constraints that the agent cannot override regardless of prompt instructions. This is the pattern the manifesto requires — enforcement at the infrastructure layer, not the prompt layer. Tiered autonomy is only meaningful when the infrastructure, not the agent, enforces the boundaries.
Auditing Tier Compliance
Tier boundaries are only meaningful if compliance is verified. Implement:
- Runtime enforcement: The infrastructure (not the agent) blocks actions outside the agent's tier. An agent at Tier 1 physically cannot write to a production database, regardless of what its prompt says.
- Compliance dashboards: Track tier violations, escalation frequency, and approval latency per domain. Rising violation rates signal either misconfigured tiers or inadequate specifications.
- Periodic tier reviews: Quarterly review of tier assignments against incident data. Promote agents with strong track records; demote or constrain agents with elevated error rates.
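Runtime enforcement can be sketched as an infrastructure-side guard that both blocks out-of-tier actions and feeds the compliance dashboard. The action names and tier policy here are illustrative assumptions:

```python
class TierViolation(Exception):
    pass

# Illustrative declarative policy: actions permitted at each tier.
TIER_ALLOWED_ACTIONS = {
    1: {"read_logs", "open_issue"},
    2: {"read_logs", "open_issue", "push_branch", "run_ci"},
    3: {"read_logs", "open_issue", "push_branch", "run_ci",
        "deploy", "write_prod_db"},
}

def enforce(agent_tier: int, action: str, audit_log: list) -> None:
    """Infrastructure-side check: blocks the action regardless of what the
    agent's prompt says, and records every decision for the dashboard."""
    allowed = action in TIER_ALLOWED_ACTIONS.get(agent_tier, set())
    audit_log.append({"tier": agent_tier, "action": action, "allowed": allowed})
    if not allowed:
        raise TierViolation(f"tier {agent_tier} may not perform {action!r}")
```

Note that the policy table lives in infrastructure configuration, not in the agent's context — the agent cannot edit its way around it.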
Tier Assignment Decision Checklist
Before assigning a tier to a new agent capability — or promoting an existing capability to a higher tier — answer the following questions. Each "yes" to a risk question is a reason to stay conservative or require additional gates. This checklist is a decision aid, not a policy replacement; it does not substitute for domain-specific regulatory requirements.
Blast radius and reversibility
- Could a wrong action affect production data, external parties, or safety-critical systems? → Default Tier 1 unless verified rollback exists.
- Is the action irreversible within a one-hour window (data deletion, external API calls, customer-facing communications, financial transactions)? → Require Tier 1 or an explicit human approval gate at Tier 2.
- Does the action cross a domain boundary (e.g., write to a system outside the agent's primary scope)? → Require explicit authorization, regardless of tier.
Confidence maturity
- Has this agent operated on this exact task class for fewer than a calibration-minimum number of cycles with tracked outcomes? → Stay at Tier 1 until evidence accumulates. (Calibrate the minimum to domain: typically 20–50 cycles for low-blast-radius tasks; 100+ for production-impacting tasks.)
- Has the agent's error rate on this task class been measured and is it within the threshold for the target tier? → If not measured, start at Tier 1.
Specification and governance readiness
- Is the specification for this task class machine-readable with observable success criteria? → If no, Tier 1 regardless of blast radius. Tier escalation without a complete specification is not a risk decision — it is an unmanaged risk.
- Is there an evaluation portfolio covering adversarial cases, not just happy-path behavior? → If no, do not promote beyond Tier 1.
- Does the applicable domain set a regulatory floor (e.g., aviation DAL A/B, automotive ASIL C/D, medical device Class C, financial services SR 11-7 high-risk model)? → The regulatory floor overrides the blast-radius assessment; it cannot be overridden by team judgment.
Promotion and demotion rules
- Promote one tier at a time, only after a consecutive-cycle window with zero incidents where the agent exceeded its authorized scope or caused undetected harm downstream (calibrate cycle count to domain; a reasonable starting default is 30 cycles for Tier 1→2 and 60 cycles for Tier 2→3).
- Demote immediately on any of: agent exceeded authorized scope; incident where blast radius exceeded predicted level; regulatory audit finding; specification drift detected; new task class introduced without fresh assessment.
- Demotion is immediate; re-promotion requires a fresh checklist pass and a complete incident review.
Principle 6 — Knowledge & Memory: Extended Guidance
See Principle 6 in the manifesto for the core statement and minimum bar.
Memory Governance Properties — Operational Detail
The manifesto lists five governance properties. Here is what each means in practice:
Provenance: Every memory entry carries metadata: what event created it, which agent, what evidence supported it, when. Implementation: structured metadata fields on every entry in your memory store (vector DB, episodic store, or whatever layer holds learned memory). Without provenance, you cannot trace a bad decision back to a bad lesson.
Expiration: Learned memory decays. A routing preference learned during a model outage is wrong once the model recovers. A code pattern learned from a since-deprecated API is harmful. Implementation: TTLs on memory entries, calibrated by domain. High-volatility domains (model routing, API behavior) expire fast. Low-volatility domains (architectural patterns, security policies) expire slowly or never. Review expired entries before deletion — some should be promoted to knowledge; others should simply disappear.
Compression: Long-running agents accumulate memory faster than it can be consumed. Raw memory is noise; compressed memory is signal. Implementation: periodic consolidation passes that merge redundant entries, extract patterns from clusters of similar learnings, and discard entries that have been superseded. Think of it as garbage collection for learned context.
Rollback: When memory is poisoned — an agent learned something wrong from a bad incident, a corrupt retrieval shard, or a flawed evaluation — you need to undo the damage. Implementation: versioned memory snapshots (daily or per significant learning event), with the ability to revert a domain's learned memory to a known-good state. Test rollback before you need it. See Pattern C (Memory Poisoning Recovery) in the Worked Patterns.
Domain scoping: A lesson learned in the payments domain should not influence code generation in the notification service. Implementation: namespace or tag memory entries by domain, and enforce scope boundaries in retrieval queries. Cross-domain memory should be explicitly promoted, not implicitly leaked.
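Domain scoping, enforced at query time, can be sketched as a retrieval filter. The entry shape and the `cross_domain_promoted` flag are illustrative assumptions:

```python
def scoped_retrieve(store: list, query_domain: str) -> list:
    """Return only entries visible to the querying domain: a lesson learned
    in one domain reaches another only via explicit promotion, never by
    implicit leakage."""
    results = []
    for entry in store:
        in_scope = entry["domain_scope"] == query_domain
        promoted = entry.get("cross_domain_promoted", False)
        if in_scope or promoted:
            results.append(entry)
    return results
```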
Emerging Memory Infrastructure
The memory infrastructure the manifesto calls for is beginning to materialize. Git-native agent memory systems demonstrate what governance-aware memory looks like in practice: provenance (every entry traceable to its source), rollback (versioned snapshots with merge-safe conflict resolution), and domain scoping (namespace isolation preventing cross-agent collisions in multi-branch workflows). Dependency-graph approaches validate the P7 claim that context must be engineered, not concatenated — tracking explicit task dependencies rather than relying on flat retrieval. Teams evaluating memory infrastructure should assess whether their chosen solution provides at minimum: provenance metadata, versioned snapshots, and scoped namespaces.
Beyond Retrieval: Persistent Agent Cognition
The manifesto frames memory governance in terms of retrieval infrastructure — provenance, expiration, rollback, scoping. This is necessary but no longer sufficient to describe the frontier. The emerging memory discipline includes three layers:
- Retrieval memory — the layer the manifesto already covers well. Embedding stores, vector search, scoped retrieval with SLOs. This is the "better RAG" layer.
- Skill memory — durable behavioral patterns agents acquire through experience, stored as reusable artifacts rather than retrieved context. An agent that has solved a class of problem before should carry forward not just the facts it retrieved but the approach that worked. Skill memory is closer to procedural knowledge than to information retrieval.
- Causal and trajectory memory — the ability to store not just what happened but why it worked or failed, and to consolidate trajectories across tasks into generalizable reasoning patterns. This is learning in the operational sense: the agent's future behavior improves based on structured reflection over past behavior.
All three layers require the same governance properties (provenance, expiration, rollback, scoping). But they differ in what "poisoning" means and how rollback works. Reverting a bad embedding is straightforward. Reverting a bad learned skill is harder — the skill may have influenced downstream decisions that themselves became learned patterns. Teams building memory infrastructure should design for rollback at each layer independently.
The full operational specification for governing learned memory — what counts as adaptation, who may write to persistent memory and under what conditions, provenance requirements, retention and expiry policy, rollback mechanisms, and which behavioral changes trigger a revalidation cycle — is the Adaptation Envelope (Layer 4) of the behavioral envelope framework. See companion-re-framework.md, Section 4 (Behavioral Envelope, Layer 4) for the complete specification. Principle 6 names the governance properties; Layer 4 specifies what to actually write.
Recent agent-learning work sharpens this distinction further: reusable skills can function as an external learning substrate, allowing agents to improve by writing, selecting, and refining structured procedural artifacts rather than by updating model weights. This makes skill governance a first-class engineering concern. If a learned skill can change behavior across many future tasks, it should be treated as governed operational memory, not as an implementation detail hidden inside prompts.
This also changes the minimum governance question. It is no longer enough to ask whether a memory entry is traceable. Teams also need to ask:
- Who may promote a learned behavior into a reusable skill?
- What evidence is required before a skill is reused across domains?
- How is skill rollback triggered and validated after an incident?
- Which skills are experimental, local, approved, or forbidden?
Without these controls, a successful one-off workaround can silently become a portable failure mode.
The Knowledge-Memory Boundary in Practice
The manifesto defines the boundary by governance mechanism: knowledge changes through governed processes (PRs, ADRs); learned memory changes through feedback loops. In practice, entries migrate between the two:
- Memory → Knowledge promotion: An agent repeatedly learns that a certain retry pattern works. After validation, this should be codified as an ADR or repository policy — promoted from heuristic to ground truth.
- Knowledge → Memory demotion: A documented best practice stops holding under new conditions. Rather than immediately deleting the ADR, demote it to learned memory with an expiration, so the system can accumulate evidence for or against the change before formalizing it.
The migration process itself needs governance. Unreviewed promotions pollute your knowledge base. Unreviewed demotions erode architectural standards.
Memory Governance at Machine Scale
The governance properties described above (provenance, expiration, compression, rollback, domain scoping) are necessary but not sufficient at production volume. A single agent executing 100 tasks per hour generates 100 memory entries per hour. Human curators can meaningfully review 10-20 entries per hour — an immediate 5-10x backlog. At this scale, reactive curation (diagnose regression, identify poisoned entry, rollback) is a post-mortem methodology, not a governance strategy. Proactive detection is required.
Implement these four mechanisms before agents generate significant memory volume:
1. Retrieval canaries (continuous). For each memory shard serving a production domain, define one known-good query with an expected result. Run it on every retrieval cycle. If retrieved results deviate from expected, isolate the shard immediately and alert. This catches poisoning before agents act on bad context. Pattern C in companion-patterns.md shows this as a recovery step — it should be a permanent fixture, not a post-incident addition.
2. Consistency check on write. When a new memory entry contradicts an existing entry in the same domain, flag both for resolution before the new entry is propagated. Do not silently overwrite. The contradiction is signal — either the new lesson is wrong, the old lesson is stale, or both need re-examination.
3. Structured memory entry schema. Require all memory entries to carry:
- `lesson`: what was learned (one sentence)
- `rationale`: why this is believed to be true
- `confidence`: 1-5 (1 = tentative observation, 5 = validated across many cases)
- `domain_scope`: which domain(s) this applies to
- `expires_at`: ISO 8601 datetime (see defaults below)
- `provenance`: trace ID of the event that generated this entry
Agents cannot store memory without these fields. Entries without valid schema are rejected at the memory layer, not silently dropped.
4. Default TTL policy by volatility.
| Domain type | Default TTL | Rationale |
|---|---|---|
| Model routing preferences | 7 days | Provider behavior changes frequently |
| Transient operational learnings | 7 days | Short-lived context (incidents, deployments) |
| API behavior and integration patterns | 30 days | APIs change on release cycles |
| Architectural patterns (project-specific) | 90 days | Reviewed at quarterly retro |
| Security policies and constraints | Never auto-expire | Human review required for any change |
| Compliance-relevant learnings | Never auto-expire | Regulatory retention requirements apply |
Expired entries are not deleted automatically — they enter a review queue. A domain expert validates or discards them monthly. Target: 5% of active entries reviewed per month (manageable volume, full corpus covered in 20 months). Low validation rate triggers memory system remediation.
When memory governance fails at scale, the tell is a sudden degradation in evaluation metrics for a specific domain without a corresponding code change. The recovery path is Pattern C (Memory Poisoning Recovery). The prevention path is these four mechanisms deployed before the volume problem appears.
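The retrieval canary (mechanism 1) can be sketched as a comparison of a known-good query against its expected result set, with an isolation hook on deviation. All names here are illustrative:

```python
def run_canary(shard_query, expected_ids: set, on_breach) -> bool:
    """Run a known-good query against a memory shard. On deviation from the
    expected result set, invoke the isolation/alert hook and report failure
    -- catching poisoning before agents act on bad context."""
    retrieved = set(shard_query())
    if retrieved != expected_ids:
        unexpected = retrieved - expected_ids   # possibly poisoned entries
        missing = expected_ids - retrieved      # possibly evicted entries
        on_breach(unexpected, missing)
        return False
    return True
```

Wire this into every retrieval cycle per production shard, as the text prescribes, rather than running it only after an incident.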
Memory Governance in Regulated Environments
The governance properties described above (provenance, expiration, compression, rollback, domain scoping) are necessary everywhere and insufficient in regulated environments. Data classification adds a layer of constraints on what agents may accumulate, retain, and retrieve.
What regulated environments add to memory governance:
| Domain | Memory Retention Constraint | Retrieval Constraint | Key Regulatory Basis |
|---|---|---|---|
| Financial services | Customer PII must not persist in agent memory beyond the session unless a DPA is in place. Banking secrecy jurisdictions may prohibit retention entirely. | External LLM retrievals must not send Confidential/Restricted financial data to unclassified endpoints. | GDPR Art. 5 (data minimisation); DORA third-party risk |
| Medical devices / pharma | Patient-level data must not persist in learned memory. GxP operational data retention follows the applicable retention schedule, not agent TTL. | GxP raw data must never be retrieved into an agent context that has write access to production records. | HIPAA §164.528; GDPR Art. 5; GxP data integrity |
| Aviation | ITAR/EAR-controlled technical data retained in agent memory constitutes a controlled export if transmitted to a non-compliant endpoint. | Retrieval from ITAR-controlled knowledge stores must operate within a Technology Control Plan. | ITAR 22 CFR 120-130; EAR 15 CFR 730-774 |
| Defense / government | CUI (Controlled Unclassified Information) must not persist in any memory store without appropriate classification handling. Classified information must not enter agent systems at all. | Retrieval must be restricted to approved, accredited environments. | CMMC 2.0; NIST SP 800-171; 32 CFR Part 2002 |
The practical rule: In regulated environments, learned memory is a data store subject to the same classification, retention, and access controls as any other system data. The manifesto's memory governance properties (provenance, expiration, rollback, scoping) are the mechanism; the applicable data regulation determines the thresholds. A GDPR data minimisation obligation, for instance, means the TTL default for customer-identifiable learnings is "session only" — not 30 days.
Audit trail for memory changes. In regulated contexts, the memory governance operations themselves (write, expire, rollback) must be logged. The standard memory entry schema fields (provenance, expires_at, domain_scope) are the minimum; add classification and retention_basis fields for regulated memory stores to make the audit trail complete.
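The extended entry schema can be sketched as a small dataclass. The provenance, expires_at, domain_scope, classification, and retention_basis fields come from the text above; the classification vocabulary, default values, and the session-only helper are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MemoryEntry:
    """Memory entry with the governance fields named in this section."""
    content: str
    provenance: str        # where the learning came from (trace ID, source)
    expires_at: datetime   # TTL; "session only" for customer-identifiable data
    domain_scope: str      # which domain may retrieve this entry
    classification: str = "internal"   # illustrative vocabulary, not a standard
    retention_basis: str = "none"      # e.g. a regulation or retention schedule

def session_only_entry(content: str, provenance: str, domain: str,
                       session_end: datetime) -> MemoryEntry:
    """GDPR-minimised default: customer-identifiable learnings expire
    with the session, not after a 30-day TTL."""
    return MemoryEntry(
        content=content,
        provenance=provenance,
        expires_at=session_end,
        domain_scope=domain,
        classification="confidential",
        retention_basis="GDPR Art. 5 (data minimisation)",
    )
```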
See the domain documents for domain-specific memory classification requirements: financial-services.md · pharma.md · medical-devices.md · aviation.md
Principle 7 — Context: Extended Guidance
See Principle 7 in the manifesto for the core statement and minimum bar.
Retrieval SLOs
Define tiered SLO guidance by architecture class for context retrieval and decision latency. Not every retrieval path needs the same latency target:
- Local retrieval (file system, in-process cache): < 100ms. This is the baseline for interactive agent loops where the developer is waiting.
- Remote retrieval (vector DB, API-backed knowledge base): < 500ms with a relevance threshold. If retrieval takes longer, the agent should proceed with available context and flag the gap rather than block.
- Hybrid + rerank (remote retrieval with a reranking model): < 1s end-to-end. The reranking step improves precision but adds latency; set a hard ceiling and degrade gracefully if exceeded.
- Regulated logging (audit-required retrieval in compliance environments): latency is secondary to completeness and provenance. Log every retrieval with source, relevance score, and timestamp.
When retrieval SLOs are breached, alert and degrade — do not silently return stale or irrelevant context. An agent that reasons from bad context produces confidently wrong output.
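A minimal sketch of SLO-bounded retrieval under these rules, assuming any callable retriever with a query-to-passages interface. On breach it degrades explicitly — the agent proceeds with available context and a flagged gap — rather than blocking or silently returning stale results:

```python
import concurrent.futures
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class RetrievalResult:
    context: List[str]       # retrieved passages (may be empty)
    degraded: bool           # True if the SLO was breached and we proceeded anyway
    gap_note: Optional[str]  # surfaced to the agent so the gap is explicit

def retrieve_within_slo(retriever: Callable[[str], List[str]],
                        query: str, slo_seconds: float) -> RetrievalResult:
    """Run a retrieval call under a hard latency ceiling; degrade on breach."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(retriever, query)
    try:
        passages = future.result(timeout=slo_seconds)
        return RetrievalResult(context=passages, degraded=False, gap_note=None)
    except concurrent.futures.TimeoutError:
        # Proceed with available context and flag the gap rather than block.
        return RetrievalResult(
            context=[],
            degraded=True,
            gap_note=f"retrieval exceeded {slo_seconds}s SLO for query {query!r}",
        )
    finally:
        pool.shutdown(wait=False, cancel_futures=True)
```

The same wrapper fits all three latency tiers by varying `slo_seconds`; the regulated-logging tier would additionally record source, relevance score, and timestamp on every call.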
Context Budgeting
Context windows are finite and reasoning quality degrades as low-signal context accumulates. This is not a theoretical concern — it is the most common root cause of agent quality degradation in long-running tasks. Engineer explicit context budgeting:
- Hierarchical retrieval: Retrieve summaries first, then pull detailed context only for the sections the agent identifies as relevant. This avoids filling the window with potentially irrelevant detail.
- Rolling summaries: For multi-step tasks, compress completed steps into structured summaries before starting the next step. The summary should capture decisions and outcomes, not raw content.
- State compaction: Periodically replace accumulated context with a compact representation of current state. The compacted state is the new starting point; the raw history is available in traces for debugging but does not consume the active context window.
- Authority-weighted pruning: When the context budget is exhausted, discard low-authority context first (heuristic suggestions, old memory entries) and preserve high-authority context (specifications, constraints, evaluation results).
A worked example: an agent tasked with refactoring a module across 15 files hits the context limit at file 8. Without budgeting, it either hallucinates the remaining files or produces inconsistent changes. With rolling summaries, it carries a compact summary of decisions made for files 1-7 and retrieves fresh context for files 8-15.
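Authority-weighted pruning can be sketched as follows. The authority ordering follows the bullet above (specifications and constraints outrank heuristic suggestions and old memory entries); the numeric levels and item schema are illustrative assumptions:

```python
# Authority levels: higher survives pruning longer. Illustrative values.
AUTHORITY = {"specification": 3, "constraint": 3, "evaluation_result": 2,
             "memory_entry": 1, "heuristic": 0}

def prune_to_budget(items, budget_tokens):
    """Drop low-authority context first when the budget is exhausted.

    Each item is a dict like {"kind": ..., "tokens": ..., "text": ...};
    the schema is a sketch, not a standard.
    """
    # Rank highest authority first; within a tier, prefer more recent items.
    ranked = sorted(enumerate(items),
                    key=lambda pair: (AUTHORITY.get(pair[1]["kind"], 0), pair[0]),
                    reverse=True)
    kept, used = set(), 0
    for idx, item in ranked:
        if used + item["tokens"] <= budget_tokens:
            kept.add(idx)
            used += item["tokens"]
    # Preserve the original ordering of the survivors.
    return [item for i, item in enumerate(items) if i in kept]
```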
Context Poisoning
Context poisoning is distinct from memory poisoning (Principle 6) — it occurs when the retrieval system returns contextually appropriate but factually wrong or outdated content within a single task. Memory poisoning is a persistent corruption; context poisoning can happen on any retrieval call.
Common sources: stale index entries that survived re-indexing, retrieved content from a deprecated branch that was never cleaned up, code examples from a library version that no longer matches the project's dependencies.
Detection: monitor for sudden quality drops in agent output that correlate with specific retrieval sources. Track retrieval source freshness (time since last validation) and alert when agents consume context older than a configurable threshold.
Mitigation: retrieval canaries (known-good queries with expected results, run on every retrieval cycle), source freshness metadata in every retrieval response, and a circuit breaker that falls back to specification-only context when retrieval confidence drops below threshold.
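The canary-plus-circuit-breaker mitigation can be sketched like this, assuming a hypothetical `run_query` callable returning a list of passages. In a real system the fallback path would also emit an alert rather than degrade quietly:

```python
def retrieval_healthy(run_query, canaries):
    """Run known-good canary queries; each expected snippet must appear.

    `canaries` maps query -> substring expected in at least one result.
    """
    for query, expected in canaries.items():
        results = run_query(query)
        if not any(expected in text for text in results):
            return False
    return True

def build_context(run_query, query, canaries, specification):
    """Circuit breaker: fall back to specification-only context
    when the canary check fails (retrieval confidence below threshold)."""
    if retrieval_healthy(run_query, canaries):
        return run_query(query)
    return [specification]
```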
Self-Improving Knowledge Bases
Codify "never do X here" as machine-enforced guidance: repository policies, architectural constraints, ADR rules, lints, CI gates. Make the knowledge base self-improving: let retrieval quality metrics feed back into indexing and curation, so the system gets more precise over time rather than more cluttered.
The feedback loop: track which retrieved contexts led to successful agent outcomes (evidence bundle accepted, evaluations passed) and which led to failures. Over time, demote or remove context sources that consistently correlate with poor outcomes. This is garbage collection for your knowledge base, driven by evidence rather than manual curation.
Cross-Iteration Learning and CI Context
A specific and increasingly important case of context budgeting is learning across CI iterations — where each iteration generates new evidence about the consequences of previous decisions. In a CI loop spanning dozens of iterations (the SWE-CI benchmark averages 71 commits per task), the agent must carry forward not just what changed, but what effect each change had on subsequent iterations.
This is distinct from single-task context budgeting because the evidence compounds: iteration 15 generates information about decisions made in iterations 3, 7, and 12. The context that matters is not "what happened last" but "which earlier decisions are causing current problems."
Practical approaches for cross-iteration context:
- Decision-consequence summaries: After each iteration, compress the results into a structured summary that links decisions to outcomes. "Changed the retry logic in iteration 5; iteration 9 test failures trace to that change." These summaries are the rolling context for subsequent iterations.
- Regression attribution: When a regression appears, trace it to the iteration that introduced the structural cause — not just the iteration that triggered the test failure. This requires structured tracing across iterations, not just within them.
- Evolvability signals: Track whether each iteration's decisions made the next iteration easier or harder. The SWE-CI benchmark's EvoScore metric (arXiv:2603.03823) measures this explicitly: agents whose early decisions facilitate subsequent evolution score higher. Teams can approximate this by tracking iteration-over-iteration test pass rates and regression frequency.
Cross-iteration context management is the primary capability differentiator for long-running agent pipelines. Without it, agents repeat mistakes, fail to learn from structural consequences, and accumulate technical debt that traditional single-iteration metrics miss.
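A decision-consequence summary can be represented as a small structure that links each iteration's decisions to effects observed later, making regression attribution a lookup rather than a re-derivation. The field names here are illustrative:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class IterationSummary:
    """Rolling cross-iteration context: decisions linked to later consequences."""
    iteration: int
    decisions: List[str]
    # Maps a later iteration number to the effect observed there,
    # e.g. {9: "test failures trace to the retry-logic change"}.
    consequences: Dict[int, str] = field(default_factory=dict)

def attribute_regression(summaries, failing_iteration):
    """Return earlier iterations whose recorded consequences point at
    the failing iteration -- the structural cause, not just the trigger."""
    return [s.iteration for s in summaries
            if failing_iteration in s.consequences]
```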
Tooling Maturity and Adoption
The context engineering standard described here exceeds what most teams can build today. The tooling ecosystem is maturing rapidly — open protocols for tool connectivity, structured capability definitions, and version-aware memory layers now exist — though production-grade governance tooling remains nascent. Adopt incrementally: start by measuring retrieval quality (relevance, latency, staleness), then add context budgeting for long-running tasks, then tiered SLOs as scale demands. The principle describes the engineering standard; the adoption path acknowledges the gap.
The Emerging Agent Stack
Recent frontier-lab writing is converging on a useful systems frame: the agent is not just a model with a prompt. The operational stack increasingly looks like:
- Model — the reasoning engine
- Context layer — retrieval, summaries, memory, and task framing
- Harness — execution loop, tool orchestration, constraints, checkpoints, and cleanup
- Tools / APIs — the external actions available to the agent
- Environment / runtime — the bounded execution context, permissions, traces, and operational controls
This is mostly a vocabulary clarification, not a new principle. The manifesto's contribution is that it provides the governance model across this stack. P7 governs the context layer directly, but its quality depends on the harness that selects and compacts context, the tools that retrieve it, and the runtime that preserves or constrains state across sessions. In practice, treating "context engineering" as a standalone discipline without connecting it to the harness and runtime is how teams end up with excellent retrieval feeding poorly-governed execution loops.
As of early 2026, four open interface patterns are crystallizing around this stack:
- Tool connectivity protocols — typed schemas, capability discovery, authorization, and structured tool invocation at the tools/APIs layer.
- Agent coordination protocols — agent discovery, task lifecycle management, and cross-runtime delegation at the coordination layer.
- Capability definition artifacts — reusable, reviewable descriptions of domain procedures, constraints, and operational skills at the harness layer.
- Repository-level instruction artifacts — machine-readable project constraints and local conventions at the environment layer.
The manifesto's governance model — tiers, traces, accountability, evaluations — sits across all four. No single protocol provides governance; the manifesto's principles provide the governance framework that connects them.
Principle 8 — Evaluations & Proofs: Extended Guidance
See Principle 8 in the manifesto for the core statement and minimum bar.
Assurance Disciplines
As autonomy and module count grow, assurance must span distinct practices with different cost curves:
- Evaluations and tests for dynamic, example-based validation.
- Formal contracts + proofs for mathematically checking module properties.
- Model checking for state-space behavior (especially concurrency and protocol invariants).
These are separate disciplines. Use them intentionally: tests by default, formal methods first on critical paths and high-blast-radius components, then expand coverage where incident data and economics justify it.
The "proofs are a scale strategy" claim is now operationally achievable, not just theoretically sound. Executable specification languages allow teams to write specifications that are simultaneously human-readable documentation, testable assertions, and inputs to model checkers — collapsing the gap between "we wrote a spec" and "we proved a property." Model-based testing workflows can generate test suites directly from executable specifications, connecting formal models to CI pipelines without requiring teams to become proof engineers. The practical entry point is not theorem proving but executable specs on one critical path — the same scope recommended in the adoption playbook's formal contracts step.
LLM-as-Judge Risk
When models judge model-generated outputs, evaluator and producer can share blind spots. Mitigate LLM-as-judge risk with deterministic anchors, diverse judge models, periodic human-calibrated gold sets, and disagreement tracking between judges and production outcomes.
Evaluation Theater
Beware evaluation theater: evals that pass but do not test what matters. If evaluations do not cover edge cases, adversarial inputs, and behavioral regressions, they are measuring comfort, not correctness. When evaluation metrics become optimization targets rather than measures of quality, the system games the metric and drifts from the goal.
Detecting evaluation theater. Evaluation theater is recognizable by the gap between evaluation metrics and production outcomes. Watch for these signals:
- Evaluation pass rates near 100% while escaped defect rates or user-reported issues remain elevated — the evaluation suite is not covering the failure modes that matter.
- Adversarial inputs outside the evaluation distribution produce failures the suite never triggered — the evaluation distribution is too narrow.
- Evaluation coverage grows (more tests, higher numbers) without growing the distribution of tested conditions — the same scenarios run repeatedly with minor variations, providing false coverage confidence.
- Incident classes not covered by the current suite recur after remediation — the suite did not capture the failure mode, so the same issue reappears.
The primary structural defense is evaluation holdout (see below): scenarios the agent has never seen and cannot overfit to. Without holdout, high eval pass rates are consistent with both genuine quality and evaluation theater. The measurement mechanism for "evaluation theater detection rate" (listed as a Phase 5→6 metric) is therefore: track the fraction of production incidents that were not predicted by any evaluation failure in the preceding cycle.
Advanced bar: include adversarial cases for externally exposed or high-blast-radius systems. For model-judged evaluations, calibrate against human-labeled samples on a defined cadence.
Evaluation Holdout and the Gaming Problem
If agents can see the evaluation criteria during development, they can overfit to them — producing output that passes the specific tests while missing the intent behind them. This is the evaluation equivalent of teaching to the test.
The fix borrows from machine learning: evaluation holdout. Behavioral scenarios — specifications of what the software should do in realistic end-to-end conditions — are stored separately from the development context. The agent builds software without access to the evaluation criteria. The scenarios evaluate whether the output works. Because the agent never sees the evaluation criteria, it cannot game them.
This pattern is already in production. StrongDM's software factory uses holdout behavioral scenarios as the primary evaluation mechanism, with agents that implement against specifications and are evaluated against criteria they cannot see. The result is evaluation that tests intent, not just compliance.
When to use holdout evaluation: For any system where agents iterate autonomously (Phase 4+), especially when evaluation metrics show suspiciously high pass rates that do not correlate with production quality. Holdout evaluation is more expensive to maintain (two separate artifact sets: development specs and evaluation scenarios) but eliminates the most insidious form of evaluation theater — evaluations that pass because the agent learned the answers, not because it solved the problem.
Champion-Challenger Testing in Regulated Contexts
Champion-challenger testing compares agent system performance against an incumbent approach — the current model, the prior system version, or the clinical/operational standard of care. This is a cross-domain regulatory expectation, not a financial-services-specific concept:
- Financial services (SR 11-7): Requires comparing agent outputs against alternative approaches or incumbent models. Statistical methodology for handling output variability (non-deterministic agents) is an open regulatory question; conservative approach is to run champion-challenger on a held-out sample with human adjudication of disagreements.
- Medical devices: FDA GMLP and ISO/TS 24971-2 expect performance comparison against predicates (prior cleared devices) or the clinical standard of care. The manifesto's evaluation portfolio (P8) is the infrastructure for this comparison — extend evaluation suites with predicate-device test cases.
- Pharma: CSA expects assurance that a new system performs at least as well as the system it replaces. Run champion-challenger during PQ by executing parallel workflows and comparing outputs. Evidence bundle includes disagreement analysis and resolution rationale.
- Aviation: No direct champion-challenger requirement, but DO-178C requires that verification objectives are satisfied. For agent-assisted workflows replacing manual activities, demonstrate that the agent-assisted approach produces equivalent or better coverage per Table A objectives.
The non-determinism problem. Traditional champion-challenger assumes identical inputs produce comparable outputs. Agents are non-deterministic. Practical mitigation: run multiple agent invocations per input (N=3-5); use the majority-vote or highest-confidence output as the champion response; compare the distribution of champion responses against the incumbent. Statistical confidence intervals, not point comparisons, are the evidence.
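The N-invocation mitigation can be sketched as follows. Majority voting assumes outputs are hashable and comparable, which holds for classification-style decisions; for free-form text, an equivalence function would replace equality. The function names are illustrative:

```python
from collections import Counter

def champion_response(agent, task_input, n=5):
    """Run N agent invocations on one input and take the majority-vote
    output, returning it with its agreement rate."""
    outputs = [agent(task_input) for _ in range(n)]
    winner, count = Counter(outputs).most_common(1)[0]
    return winner, count / n

def disagreement_indices(champion_outputs, incumbent_outputs):
    """Indices where champion and incumbent disagree -- the held-out
    sample queued for human adjudication."""
    return [i for i, (champ, inc)
            in enumerate(zip(champion_outputs, incumbent_outputs))
            if champ != inc]
```

The agreement rate is what feeds the distributional comparison: low agreement on an input is itself evidence, independent of whether the majority answer matches the incumbent.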
Independent Verification in Regulated Contexts
Regulated industries share a common governance requirement: the party that verifies a system must be organizationally independent from the party that built it. SR 11-7 (financial services) requires independent model validation. IEC 62304 (medical devices) requires verification by qualified parties distinct from developers. DO-178C (aviation) requires independence at each design assurance level.
In agentic engineering, this principle extends to agent-generated output: the evaluation infrastructure that verifies agent work should be independent of the agent that produced it. Concretely:
- Evaluation criteria should not be visible to the producing agent (evaluation holdout, described above)
- Evaluation models should differ from production models where feasible (avoid shared blind spots — see P1 correlated failure domains)
- For Tier 3 operations in regulated environments, organizational independence between agent development and agent validation should mirror existing regulatory expectations
This is not a new principle — it is a regulated-environment application of the existing evaluation-as-contract pattern. See companion-frameworks.md for the cross-domain analysis and domains/ for domain-specific independence requirements.
Fairness and Bias Testing in High-Risk AI
EU AI Act Article 10 requires that training, validation, and testing datasets for high-risk AI systems are "free of errors and complete" and that they account for "characteristics or elements that are particular to the specific geographical, behavioural or functional setting." In practice, this mandates bias testing as part of the evaluation portfolio for any high-risk AI system.
This is a cross-domain obligation, not a financial-services-specific one:
- Financial services: Explicit fairness testing against protected classes under ECOA, FHA, and FCA Consumer Duty. Evaluation suites must include demographic parity and disparate impact analysis.
- Medical devices: Clinical AI systems must demonstrate equivalent performance across demographic subgroups (age, sex, ethnicity). ISO/TS 24971-2 explicitly addresses this. Evaluation portfolios for Class B/C SaMD must include subgroup performance analysis.
- Pharma: ICH E8(R1) requires that clinical trial populations are representative of the intended treatment population. AI systems used in patient selection or stratification must be tested for demographic bias.
- Automotive / industrial: AI systems in driver monitoring or operator safety systems must demonstrate consistent performance across demographic characteristics that could influence detection accuracy.
Minimum evaluation bar for high-risk AI systems: Include at least one explicit fairness evaluation category alongside behavioral regression and adversarial cases. Fairness evaluation should specify: (1) which subgroup characteristics are tested, (2) which performance disparity metric is used (demographic parity, equalized odds, etc.), (3) the maximum acceptable disparity, and (4) who owns the determination that the disparity is acceptable. The last item is a human judgment — not an evaluation output.
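Demographic parity difference, one of the disparity metrics named in item (2), can be computed as follows; the acceptable threshold and who owns it remain human judgments, per items (3) and (4):

```python
def demographic_parity_gap(outcomes, groups):
    """Maximum difference in positive-outcome rate across subgroups.

    outcomes: 1/0 (or True/False) decisions; groups: subgroup label
    per decision, aligned by index.
    """
    rates = {}
    for g in set(groups):
        member_outcomes = [o for o, label in zip(outcomes, groups) if label == g]
        rates[g] = sum(member_outcomes) / len(member_outcomes)
    return max(rates.values()) - min(rates.values())
```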
Workflow-Level Evaluation Enforcement
The evaluation-as-contract pattern extends beyond test suites into the development workflow itself. Workflow-level skill frameworks now enforce strict red-green-refactor TDD: if an agent writes implementation code before a failing test exists, the framework deletes the code and forces a restart. Design-first, plan-first, and test-first phases are mandatory, not suggested. This is evaluation-as-contract applied to the development process rather than the runtime — and it demonstrates that P8's principle operates at multiple layers, from CI pipelines to agent harness constraints.
Boolean vs. Probabilistic Evaluation
The manifesto's current evaluation model is largely boolean: tests pass or fail, regression cases are covered or not, evidence bundles are complete or incomplete. This framing is necessary for minimum bars but insufficient for mature agentic systems.
At Phase 5 and above, consider probabilistic satisfaction: of all observed execution trajectories through all behavioral scenarios, what fraction actually satisfies the specification? This replaces "did it pass?" with "how reliably does it pass, across how many conditions?"
The shift matters because agentic systems are inherently probabilistic. A boolean "pass" on ten test cases tells you the agent produced correct output ten times. It tells you nothing about the eleventh case, the hundredth case, or the distribution of partial failures. Probabilistic satisfaction metrics — drawn from scenario-based evaluation at volume — give a confidence distribution rather than a binary verdict.
Practical adoption: Start boolean (Phase 3-4). Add scenario coverage and pass-rate distributions as the evaluation portfolio matures (Phase 4-5). Treat probabilistic satisfaction as the target metric for fully autonomous pipelines where human review is sampled rather than comprehensive.
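Probabilistic satisfaction can be reported as a pass rate with a confidence interval rather than a boolean verdict. This sketch uses a 95% Wilson score interval as one reasonable choice of estimator:

```python
import math

def satisfaction_rate(trajectory_results):
    """Pass rate across observed trajectories, with a 95% Wilson score
    interval -- 'how reliably does it pass?' instead of 'did it pass?'."""
    n = len(trajectory_results)
    p = sum(trajectory_results) / n
    z = 1.96  # 95% confidence
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return p, (max(0.0, centre - half), min(1.0, centre + half))
```

A pass rate of 0.95 over 100 trajectories yields an interval of roughly (0.89, 0.98) — the interval width, not the point estimate, is what tells you whether sampled human review is safe to rely on.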
Behavioral Regression vs. Structural Regression
The manifesto's minimum bar for evaluations states: "If evaluations do not include regression cases, they are insufficient." In practice, there are two distinct categories of regression, and most teams only test for one.
Behavioral regression is what traditional regression testing catches: a change breaks existing functionality. The tests that passed before now fail. This is well-understood and well-tooled.
Structural regression is subtler and more dangerous: a change passes all current tests but degrades the codebase's capacity for future change. The code is locally correct but globally harmful — naming conventions that create confusion across iterations, architectural choices that increase coupling, dependency structures that make the next change harder. Structural regression does not fail any test today; it fails the test that you will need to write tomorrow.
The SWE-CI benchmark (arXiv:2603.03823) provides the first empirical evidence for this distinction. Across 100 tasks spanning an average of 233 days of development history, most agents achieve a zero-regression rate below 0.25 — meaning in over 75% of CI iterations, agents introduce at least one regression. Many of these regressions are structural: the agent's decisions in early iterations create friction that compounds across subsequent iterations. The benchmark's EvoScore metric captures this by measuring functional correctness on future modifications — not just current tests.
Detecting structural regression:
- Evolution-weighted metrics: Track not just whether today's tests pass, but whether each change makes the next change easier or harder. EvoScore is one formalization; a simpler proxy is iteration-over-iteration regression frequency.
- Coupling analysis: Monitor dependency graphs, import structures, and module boundaries across iterations. Rising coupling without corresponding functionality is a structural regression signal.
- Specification convergence: If specifications become harder to express precisely over time, the codebase's structure is degrading even if tests pass. The manifesto's convergence criteria (P2) apply here: diverging specifications are a symptom of structural regression.
The implication for evaluation portfolios: Teams at Phase 4 and above should include structural regression indicators alongside behavioral regression tests. This does not require formal verification — it requires tracking the trajectory of code quality across iterations, not just the state of code quality at each iteration.
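The simpler proxy mentioned above — iteration-over-iteration regression frequency — can be tracked with a windowed trend. The window size and the windowed-mean estimator are illustrative choices; any trend estimator works:

```python
def regression_frequency_trend(regressions_per_iteration, window=5):
    """Trend in regression frequency across iterations (an EvoScore-style proxy).

    Positive values mean regressions are becoming more frequent: a structural
    regression signal even while today's tests pass.
    """
    if len(regressions_per_iteration) < 2 * window:
        raise ValueError("need at least two windows of history")
    early = sum(regressions_per_iteration[:window]) / window
    late = sum(regressions_per_iteration[-window:]) / window
    return late - early
```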
Benchmark Instability and Contamination Risk
Benchmarks are necessary and insufficient. As public agent benchmarks mature, they are increasingly affected by contamination, target leakage, and adaptation to the benchmark rather than to the underlying engineering problem. Treat benchmark gains as directional evidence, not as durable truth about production readiness.
Three practical rules follow:
- Prefer mutation and refresh over static leaderboard worship. If a benchmark remains unchanged for long enough, the ecosystem will optimize for it directly.
- Maintain private holdouts. Public benchmarks are useful for comparability; private evaluations are necessary for real assurance.
- Test transfer, not just score. A claimed improvement matters only if it carries over to your stack, constraints, and failure modes.
The manifesto's position is intentionally conservative: external benchmarks help calibrate ambition, but promotion between maturity phases should be based on the evidence your own system can produce under your own operating conditions.
See also Verification without validation in the Failure Modes section, which describes the related but distinct case where verification machinery confirms correctness without confirming value.
Principle 9 — Observability & Interoperability: Extended Guidance
See Principle 9 in the manifesto for the core statement and minimum bar.
What a Trace Must Contain
A trace is not a log line. A complete agentic trace captures:
- Specification received: What was the agent asked to do? The versioned specification or task decomposition that initiated the work.
- Decision chain: What options did the agent consider, what did it select, and what reasoning or scoring drove the selection? For multi-step tasks, the chain must show each decision point, not just the final output.
- Tool calls and responses: Every external tool invocation — API calls, file operations, retrieval queries — with inputs, outputs, and latency.
- Memory retrievals: What context was retrieved, from which store, with what relevance scores? This is critical for diagnosing retrieval-driven hallucinations.
- Evaluation results: Which evaluations ran, what passed, what failed, what was the delta from previous runs?
- Policy checks: Which constraints were checked, which passed, which triggered violations or near-misses?
- Cost accounting: Tokens consumed, model used, inference latency, total cost of this task.
The trace must be structured, not free-text. Structured traces can be queried, aggregated, and replayed. Free-text logs require human interpretation at every step.
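The trace fields above can be captured in a structured, serializable record. The field names are illustrative; a real system would map them onto its tracing backend's schema, but the point stands regardless: this serializes to something queryable, not a free-text log line:

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Dict, List

@dataclass
class AgentTrace:
    """Structured trace covering the seven capture points listed above."""
    spec_version: str               # specification received
    decision_chain: List[dict]      # options considered, selection, rationale
    tool_calls: List[dict]          # inputs, outputs, latency_ms per invocation
    memory_retrievals: List[dict]   # store, query, relevance scores
    evaluation_results: Dict        # suite -> pass/fail plus delta
    policy_checks: List[dict]       # constraint, outcome, near-miss flag
    cost: Dict = field(default_factory=dict)  # tokens, model, latency, dollars

    def to_json(self):
        """Queryable, aggregatable serialization -- not free text."""
        return json.dumps(asdict(self), sort_keys=True)
```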
Near-Real-Time Drift Detection
Observability is incomplete if it only reconstructs the past. For production agentic systems, you also need near-real-time detection of constraint violations, behavioral drift, and anomalous patterns:
- Constraint violation alerts: Immediate notification when an agent attempts or completes an action outside its tier or domain boundary.
- Behavioral anomaly detection: Statistical monitoring of agent outputs over time. A sudden shift in code style, error rate, or tool usage pattern may indicate context poisoning, model degradation, or specification drift.
- Cost anomaly alerts: A task that normally costs $0.50 suddenly costing $15 signals a reasoning loop, retry storm, or routing failure.
The goal is not to alert on everything but to detect when the system has left its expected operating envelope before the damage compounds.
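A minimal cost-anomaly check in the spirit of the $0.50-to-$15 example above; the median baseline and the 5x factor are illustrative assumptions, and a production system would tune both per task class:

```python
def cost_anomaly(history, current, factor=5.0):
    """Flag a task whose cost exceeds `factor` times the median historical
    cost -- a reasoning loop, retry storm, or routing failure signal."""
    ordered = sorted(history)
    mid = len(ordered) // 2
    median = (ordered[mid] if len(ordered) % 2
              else (ordered[mid - 1] + ordered[mid]) / 2)
    return current > factor * median
```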
Interoperability Requirements
Interoperability requires typed schemas, explicit auth boundaries, versioned capabilities, and replayable tool logs. Treat adapters as temporary bridges, not architecture. The goal is replaceable components, not locked pipelines.
The emerging open-protocol stack now covers both interoperability axes the manifesto requires: how agents connect to tools, and how agents coordinate with other agents. Recent protocol revisions added stronger authorization models, structured capability metadata, safer transport patterns, and more durable task lifecycle support. These developments matter because they move interoperability from vendor-specific SDK behavior toward inspectable contracts that can be governed, audited, and replaced.
Interoperability minimum bar: If tools cannot be swapped or replayed across runtimes without rewriting core workflows, the platform is brittle.
Principle 10 — Emergence & Containment: Extended Guidance
See Principle 10 in the manifesto for the core statement and minimum bar.
Chaos Practice
Practice chaos: test with tool outages, noisy retrieval, adversarial inputs, partial memory corruption, reordered swarms, and model degradation — before reality does. Offline tests are insufficient for systems that operate autonomously in the wild. Enforce invariants at runtime with policy checks, monitors, and automated intervention.
Chaos testing for agentic systems requires its own safety model:
- Steady-state hypothesis: define expected behavior before injecting faults, so you can detect when the system has left its safe operating envelope.
- Blast-radius controls: isolate chaos experiments to scoped environments, shadow traffic, or canary populations — never inject faults into the full production agent population.
- Automated abort conditions: if the system breaches predefined thresholds (error rate, latency, cost spike), halt the experiment and roll back automatically.
- Graduated severity: start with single-fault injection (one tool outage), then compound faults only after single-fault resilience is proven.
Threat Modeling
Threat modeling must explicitly include:
- Prompt injection and jailbreak propagation across agent chains
- Memory/context poisoning and supply-chain contamination
- Agent impersonation and forged role assertions in swarm coordination
- Data exfiltration through tool permissions and connector abuse
Defense-in-depth means identity for agents and tools, signed provenance for shared state, least-privilege tool scopes, egress controls, and continuous anomaly detection for cross-agent trust edges.
Real-World Containment Failures
The OpenClaw ecosystem (2025-2026) provides instructive case studies. OpenClaw itself — an open-source autonomous agent with 247K GitHub stars — demonstrated how rapidly agentic systems scale when governance is absent. The Moltbook incident (February 2026) exposed 1.5 million registered agents (only 17,000 human owners) through a misconfigured Supabase database with full read/write access. The failure hit every threat category above: no identity controls, no domain scoping, no blast-radius limits, no audit trail.
NVIDIA's response — NemoClaw (GTC 2026) — is an enterprise-hardened fork that adds YAML-based permission policies, audit logging, and guardrail constraints. This is containment engineering in practice: the same agent runtime, now with the governance layer the manifesto requires. The pattern validates the core P10 claim: emergence is not a feature to celebrate but a hazard to engineer around. Systems that scale without containment infrastructure will produce incidents at scale.
Principle 11 — Economics: Extended Guidance
See Principle 11 in the manifesto for the core statement and minimum bar.
Intelligent Routing
Intelligent routing — selecting the right model, the right agent topology, and the right resource tier for each task — extends effective capacity by multiples while maintaining quality. This "economics-aware routing" must consider not just token cost, but correlation cost (avoiding a single point of epistemic failure by using diverse models and independent tool chains).
Total Cost of Correctness
Inference cost and assurance cost are coupled, not independent knobs. Cheaper models may require stronger verification, more retries, or tighter approvals.
The full cost model includes:
- Inference cost: tokens, compute, API fees.
- Verification cost: evaluation runs, proof checking, canary deployments.
- Governance overhead: human review time per tier, approval latency, policy maintenance.
- Incident remediation: rollback, diagnosis, constraint updates, re-verification.
- Opportunity cost: delayed decisions from approval queues or routing latency.
- Context-switching cost: debugging heterogeneous failure modes across models and vendors.
Optimize total cost of correctness, not inference cost alone. When governance overhead exceeds the value of the work, reduce governance complexity rather than adding more layers.
Multi-Model Risk
Multi-model and multi-vendor swarms introduce heterogeneous failure and policy risk. Model errors are often correlated through shared dependencies, similar training artifacts, or vendor-side incidents. Routing policies must include failure-domain isolation, cross-model canary checks, and explicit data handling boundaries per provider.
Resilience Measures
To mitigate systemic fragility, extend resilience measures across the stack:
- Diversity routing (different models/judges) to reduce correlated hallucinations.
- Retrieval canaries across independent indexes.
- Tool redundancy plans for rate limits/outages.
This is the "organism avoiding monoculture collapse."
Advanced bar: route by expected total cost of correctness, not token price.
Total Cost of Correctness — Decision Framework
The manifesto defines the formula conceptually. Here is how to use it for routing decisions.
The formula:
Total Cost of Correctness =
(Inference cost per task × Task count)
+ (Verification cost per task × Task count)
+ (Governance overhead per task × Task count)
+ (Expected remediation cost per failure × Failure rate)
+ (Opportunity cost of latency)
Worked example: generating integration tests for a new API endpoint
| Model tier | Inference cost | Expected pass rate | Rework cost on failure | Total cost of correctness |
|---|---|---|---|---|
| Fast/cheap model | $0.04 | 85% (3 failures of 20) | $0.50/failure = $1.50 | $1.54 |
| Balanced model | $0.08 | 95% (1 failure) | $0.50/failure = $0.50 | $0.58 |
| High-capability model | $0.20 | 99% (0.2 failures) | $0.50/failure = $0.10 | $0.30 |
Naive cost optimization picks the fast model. Total-cost optimization picks the high-capability model. The fast model's lower failure rate in simple cases matters less than the higher-capability model's reliability on edge cases.
Routing decision record. For each routed task, capture:
task_type: [description]
estimated_complexity: [1-10]
model_selected: [model name/tier]
rationale: [why this model for this complexity]
actual_outcome: [pass / fail / rework]
actual_cost: [inference + verification + remediation]
Feed these records into a FinOps dashboard quarterly. Within three months of operation, you will have an empirical cost model that makes routing decisions data-driven rather than intuition-driven. The goal is not the cheapest model — it is the model with the lowest total cost of correctness for that task class.
DORA concentration risk note. In regulated financial services, model routing is not only an economics decision — it is a DORA third-party risk control. Routing policies must include: failure-domain isolation (ensure no single provider failure disables all tasks), cross-model canary checks, and documented exit procedures if a provider becomes unavailable. Multi-model routing should be documented in the DORA third-party risk register.
Principle 12 — Accountability: Extended Guidance
See Principle 12 in the manifesto for the core statement and minimum bar.
Domain-Scoped Ownership
At scale, ownership is domain-scoped, not change-scoped. A named human owns the risk policy, approval thresholds, and incident response for a bounded domain; the system enforces those policies per change. Human review must focus on exceptions, high-risk deltas, and statistically valid sampling, not every low-risk action.
The Accountability Paradox
The manifesto states: "Agents execute; humans own outcomes, risks, approvals, and incidents. No agent — however capable — absorbs legal, ethical, or operational responsibility." This is the manifesto's strongest claim about the human role. It is also the claim most certain to break under scale.
If your agents process thousands of actions per day, human review of every action is not just impractical — it is impossible. A domain owner who "approves" 200 changes per day is not governing; they are rubber-stamping. The manifesto's accountability model, applied literally at volume, collapses into control theater (see Failure Modes).
This is not a minor gap. It is the central tension of the entire manifesto: the principles require human accountability, and the economics of agentic systems at scale make comprehensive human accountability impossible.
How to Navigate the Paradox
The manifesto does not resolve this tension — it provides the tools to manage it. The resolution is not "remove the human" or "review everything." It is a phase-calibrated layering of accountability mechanisms:
At Tier 1 (Observe): Agents can only analyze and propose. Human accountability is inherent because no action reaches production without human execution. This is fully compatible with the manifesto at any volume.
At Tier 2 (Branch): Agents write to isolated environments. Accountability shifts from reviewing every action to designing the constraints that bound agent behavior and the evaluations that verify output. The human owns the constraint design and the evaluation portfolio, not every individual diff. When an escaped defect occurs, accountability traces to which constraint or evaluation was missing — not which reviewer missed which line.
At Tier 3 (Commit): Agents take production-impacting actions. This is where the tension is sharpest. The practical approach: automated policy enforcement handles routine checks at machine speed; human review focuses on exceptions, high-risk deltas, and statistically valid sampling. The human is accountable for the policy, the sampling strategy, and the incident response — not for having personally inspected every action.
In all tiers, build recursive feedback mechanisms: systems evaluate their own errors, feed failures back into context, and self-correct or automatically roll back. This is not replacing human accountability — it is extending the human's reach through system design.
The Level 5 Challenge: No Human Writes or Reviews Code
The sharpest version of the accountability challenge comes from teams already operating at what practitioners call "Level 5" or "dark factory" mode: specifications go in, working software comes out, no human writes or reviews code. StrongDM's software factory is the most documented example — three engineers, no code writing, no code review. Humans write specifications and evaluate outcomes. Machines do everything in between.
This sounds like it contradicts the manifesto's accountability model. It does not — but it forces the model to its logical conclusion. In a Level 5 system:
- Accountability shifts from reviewing code to designing constraints. The human owns the specification quality, the evaluation portfolio (including holdout scenarios the agent cannot see), and the incident response policy. They do not own every line of code — they own the system that produces and verifies the code.
- Evaluation replaces review. Instead of reading diffs, humans evaluate outcomes against behavioral scenarios, probabilistic satisfaction metrics, and business impact measures. The evaluation infrastructure is the review process — it just runs at machine speed rather than human speed.
- The accountability surface changes, not the accountability principle. A human is still accountable for production behavior. But "accountable" means "designed the constraints, approved the evaluation portfolio, and owns the incident response" — not "read every line of code."
This is consistent with the manifesto's Tier 3 governance at scale: automated policy enforcement handles routine verification, human review focuses on exceptions and high-risk deltas, and accountability traces to constraint design rather than individual code inspection. Level 5 is what Tier 3 governance looks like when the constraints, evaluations, and evidence infrastructure are mature enough to replace line-by-line review entirely.
The manifesto does not prescribe Level 5 as a target. Most teams are not ready for it — and the perception gap is real: a 2025 study reported that experienced developers using AI tools took 19% longer to complete tasks while believing AI made them 24% faster. Teams that believe they are operating at Level 4 or 5 are often stuck at Level 2, confusing tool adoption with workflow transformation. The maturity spectrum (Phase 1-6) and the evidence requirements at each phase exist precisely to prevent this self-assessment inflation.
The Open Problem
This layered approach is mitigation, not resolution. Oversight saturation at scale remains an open problem: systems can outgrow meaningful human governance bandwidth faster than governance practices mature. This is not a caveat buried in extended guidance — it is a load-bearing limitation of the entire manifesto.
The twelve principles are designed to remain useful at any scale, but the governance model that binds them (human accountability for production outcomes) is bounded by human bandwidth. As agentic systems scale toward Phase 6 (adaptive, self-improving), the fraction of system behavior that any human can meaningfully review approaches zero. The manifesto's answer — governance through constraints, evaluations, and evidence rather than through direct oversight — delays this limit but does not eliminate it.
Treat this as the manifesto's most important active frontier. If your engineers spend all day reviewing agent trace logs, you have replaced coding with babysitting and the governance model is already failing. If they review nothing, accountability is fictional. The correct position is somewhere between, defined by the quality of your constraints, evaluations, and feedback loops — and it must be re-evaluated as the system grows.
Governance as Practice — The Domain Owner's Routine
The manifesto describes governance structure: named owners, defined tiers, evidence bundles, approval gates. Structure is necessary but not sufficient. A team can have all structural components in place and still have non-functional governance: domain owners who approve evidence bundles without understanding them, audit trails no one reads, policy violations detected but not acted upon. Governance also requires practice — the ongoing behavioral routine by which a domain owner actually performs governance rather than performs its appearance.
What distinguishes performed governance from simulated governance:
Understanding what is being approved. A domain owner performing governance can answer, without prompting: what changed, why, what could go wrong, and why the evidence bundle indicates those risks were addressed. If they cannot answer, they are signing, not governing.
Acting on anomalies. When accountability signals degrade — review times drop, rejection rate trends toward zero — a governing domain owner reduces autonomy scope for that domain. A domain owner performing governance theater adds reviewers or frames the problem as a workload issue.
Reading incidents as policy feedback. After an incident, the governing question is: which constraint was missing, which evaluation didn't catch this, which evidence bundle criterion was insufficient? The non-governing question is: who approved the change that caused the incident? The first drives remediation; the second drives blame without improving the system.
Maintaining calibration. A domain owner who has not rejected a change in two months either has extraordinary agents or has stopped governing. Healthy rejection rates (5–15% of agent-generated PRs) are a calibration signal, not a ceiling to minimize. Sustained rates below that range should be treated as a governance degradation signal, not as quality improvement, unless corroborated by other evidence.
These behaviors are not auditable by structure alone. They require the domain owner to treat governance as a craft that degrades without practice.
Governance Health Monitoring
Accountability frameworks can degrade silently. Control theater — humans nominally accountable but operationally blind — is the most common governance failure at scale and cannot be detected from the outside. Detect it from the inside by monitoring the signals that distinguish meaningful review from rubber-stamping. The Rubber-stamping detection table in the adoption metrics document provides a quantitative baseline: median review time, PR rejection rate, inline comment density, and rework rate within one week. These thresholds are operational heuristics, not empirically validated figures — treat them as starting points calibrated against your own team's baseline data. The intervention protocol when thresholds breach is to reduce autonomy scope for that domain, not to add more reviewers.
Incident Attribution
When incidents occur, accountability is assigned by policy failure mode: specification error, verification gap, enforcement failure, or operational override. This avoids circular blame on the final approver and drives targeted remediation. If trace volume exceeds meaningful human review, raise automation barriers or reduce autonomy until oversight signal quality is restored.
Maturity spectrum, boundary conditions, and operational definitions that apply across all twelve principles.
Read the Manifesto for the core values and minimum bars. See the Companion Guide for the full table of contents. See the Adoption Playbook for organizational change management, role transitions, and pilot design.
The Agentic Maturity Spectrum
Maturity is domain-specific, not organization-wide. A team can be Phase 5 in CI and Phase 2 in production operations. Assess each domain honestly.
Phase 1 — Guided Exploration ("vibe coding"). Single prompts, no structure, no memory. Creative but unreliable. Useful for discovering what agents can do; dangerous for anything that matters. Failure mode: extrapolating demo results to production expectations.
Phase 2 — Assisted Delivery. AI as autocomplete. AI code-completion tool suggestions where the human executes. Productivity gains are real but bounded by human throughput. Failure mode: optimizing human-in-the-loop speed instead of questioning whether the loop is necessary.
Phase 3 — Agentic Prototyping. Agents execute tasks autonomously within a single session. Limited memory, limited verification. The moment most teams realize prompting is not engineering. Failure mode: autonomy without verification — the agent said it worked.
At this phase, teams should begin contract-aware prompting: agents produce assertions and pre/postconditions with code, even before full proof pipelines are in place.
Phase 4 — Agentic Delivery. Agents operate with basic guardrails: autonomy tiers are defined, evaluations gate changes, and basic memory persists across sessions. But the system is still single-domain, single-swarm, and largely reactive. Failure mode: governance without feedback — constraints are enforced but never updated by what the system discovers.
Phase 4 should pilot formal contracts on a narrow critical path only when team capability and economics support it.
Phase 5 — Agentic Engineering. Structured autonomy at scale. Specifications steer behavior and evolve through evidence. Multi-agent swarms operate across domain boundaries, right-sized to each task. The full Agentic Loop operates as a continuous system. Failure mode: evaluation theater — evals pass but do not test what matters.
This is where contract-first development becomes systematic: code, contracts, and proofs co-evolve continuously rather than being bolted on late.
Phase 6 — Adaptive Systems. Self-improving infrastructure within governed boundaries. Systems that build, test, and fix themselves — then learn from the results. Continuous learning with active memory curation. Chaos-tested, runtime-verified, economically optimized. Specifications co-evolve with the system's understanding of the problem space. Phase 6 is not inevitable; it requires capabilities — formal verification, causal reasoning, provable containment — that are still maturing. Treat it as a frontier, not a destination. Failure mode: self-improvement without containment — optimizing the metric, not the goal.
At this phase, agents can propose contract refinements and invariant updates, but proof systems and governance gates must validate changes before adoption.
Every phase transition has distinct challenges. Phase 2→3 is where the supervision paradox first hits. Phase 3→4 is where governance overhead must justify itself. Phase 4→5 requires organizational change, not just tooling. See the Adoption Playbook for detailed transition guidance for each phase, role changes, and pilot design.
Empirical Phase Profiles: Evidence from SWE-CI
The SWE-CI benchmark (arXiv:2603.03823) provides the first empirical evidence for what each maturity phase looks like in measurable agent performance. SWE-CI evaluates agents across 100 tasks spanning an average of 233 days and 71 commits of real-world development history, using an Architect–Programmer dual-agent CI loop.
- Phase 1-2 performance: Agents at these phases fail SWE-CI entirely. They lack the iterative capability to sustain a CI loop and cannot integrate feedback across cycles.
- Phase 3 performance: Agents pass early iterations but accumulate regressions at a high rate. Most models achieve a zero-regression rate below 0.25 — matching Phase 3's canonical failure mode: "autonomy without verification." The agent produces plausible output but erodes the codebase iteration by iteration.
- Phase 4 performance: Agents show basic CI-loop competence with evidence per iteration but struggle with cross-iteration learning. Regression rates improve but do not plateau. Governance catches individual failures but does not address the structural regression pattern.
- Phase 5+ performance: Only top-performing models exhibit Phase 5 characteristics: specification convergence across iterations, declining regression rates over time, and improving EvoScore. This matches Phase 5's description: the full Agentic Loop operating as a continuous system.
These profiles are descriptive, not normative — SWE-CI tests a specific task type (long-term code maintenance), and phase maturity is domain-specific. But they provide a concrete, measurable calibration point for teams self-assessing their maturity.
Use this benchmark family as a calibration aid, not as a universal scorecard. Public agent benchmarks age quickly, can be contaminated, and tend to attract optimization pressure from the ecosystem. Treat them as one input into maturity assessment alongside private holdouts, incident rates, replay quality, and domain-specific evidence bundles.
Alternative Framing: The Five Levels of Agentic Development
A complementary practitioner framing describes agentic maturity by what the human does rather than what governance exists. These levels (attributed to Dominik Fretz's analysis of production agentic teams) map to the manifesto's phases but emphasize the human role transition:
- Level 0 — Spicy Autocomplete: AI as tab completion (Phase 1-2).
- Level 1 — Coding Intern: Discrete tasks delegated, everything reviewed (Phase 2-3).
- Level 2 — Junior Developer: Multi-file AI changes, human reads all code (Phase 3). Most teams claiming to be "AI-native" operate here.
- Level 3 — Manager: Human directs AI, reviews at PR level, no longer writes code (Phase 3-4 transition).
- Level 4 — Product Manager: Human writes specification, evaluates outcomes hours later, does not read code (Phase 4-5).
- Level 5 — Dark Factory: Specifications in, working software out, no human writes or reviews code (Phase 5-6).
Anecdotal practitioner reports suggest many teams overestimate their AI-native maturity — most operate closer to Level 2 than they believe. The gap between perceived and actual maturity is the most common failure mode in agentic adoption. A 2025 study reported that experienced developers using AI tools took 19% longer to complete tasks while believing AI made them 24% faster. The manifesto's phase-calibrated evidence requirements exist precisely to close this perception gap — your phase is determined by the evidence you can produce, not by the practices you believe you follow.
Use this as calibration, not as a universal scorecard.
Where the two frameworks diverge: Fretz's levels are descriptive of the human experience. The manifesto's phases are prescriptive about governance infrastructure. A team can be at Level 3 (human as manager) while lacking the Phase 4 governance infrastructure (evidence bundles, evaluation gates, defined autonomy tiers) that makes Level 3 safe. The manifesto's position: advancing levels without advancing phases is how you get Level 4 velocity with Phase 2 governance — which is how incidents happen.
Boundary Conditions
This manifesto assumes the environment can support governed autonomy, reliable evidence capture, and reversible operations. When these assumptions do not hold, agentic engineering should be constrained — but not abandoned entirely.
When to Cap Autonomy
Proceed cautiously or cap autonomy at Phase 2-3 when:
- Certification or regulatory regimes require deterministic assurance patterns that the current agent/tool chain cannot meet
- Safety-critical or real-time systems cannot tolerate probabilistic behavior at the current control boundary
- Classified or restricted environments cannot satisfy data-handling and tool isolation requirements
- Teams lack baseline CI/CD quality gates, incident response discipline, or domain ownership needed for safe autonomy
Hard Autonomy Caps by Regulated Use Case
Some use cases carry hard autonomy caps regardless of the organization's maturity phase. These caps are not recommendations — they are regulatory constraints. A Phase 5 team operating at full agentic maturity still cannot exceed these caps. The table below shows the strictest cap per domain; see each domain document for the complete use-case-specific cap table.
| Domain | Conservative Default Cap | Regulatory Basis | Domain Document |
|---|---|---|---|
| Aviation (airborne software DAL A/B) | Tier 1 (observe only) | DO-178C; DO-330 tool qualification | aviation.md |
| Medical Devices (IEC 62304 Class C; EU AI Act high-risk) | Tier 1 (observe only) | IEC 62304; EU MDR + AI Act (Class IIa+) | medical-devices.md |
| Pharma (GMP context; GxP record modification) | Tier 1 (observe only) | GAMP 5; 21 CFR Part 11; EU GMP Annex 11 | pharma.md |
| Financial Services (credit/insurance decisions; algorithmic trading) | Tier 1 (observe only) | EU AI Act Annex III §5; GDPR Art. 22; MiFID II | financial-services.md |
| Automotive (ASIL C/D safety functions) | Tier 1 (observe only) | ISO 26262; UN Regulation 157 | automotive.md |
| Defense / Government (classified or ITAR-controlled systems) | Tier 1 (observe only) | CMMC; ITAR 22 CFR 120-130; FedRAMP | defense-government.md |
The rows above show conservative defaults for the most restrictive category in each domain. Lower-risk workflows in the same domain may permit higher tiers if separately justified. Most workflows in these industries permit Tier 2 or Tier 3 for lower-risk activities. The domain documents contain full use-case-specific cap tables with the regulatory basis for each row.
Phase maturity and autonomy tiers interact. Beyond the hard caps above, phase maturity is a prerequisite for autonomy tier:
- Phase 3 or below → Tier 1 only, regardless of infrastructure
- Phase 4 → Tier 2 available (branch + human approval)
- Phase 5+ → Tier 3 available, subject to use-case caps above
A team cannot operate at a higher autonomy tier than their phase supports, even if the infrastructure is in place.
What Regulated Industries Can Still Use
Capping autonomy does not mean the manifesto is irrelevant. Teams in regulated environments (healthcare, finance, aerospace, defense, government) can still adopt the manifesto's principles selectively:
- Principles 1-3 (Outcomes, Specifications, Architecture) apply fully. Outcome orientation, machine-readable specifications, and enforced domain boundaries are valuable in any regulatory context — and may strengthen compliance posture.
- Principle 5 (Autonomy) applies at Tier 1 (Observe) and Tier 2 (Branch) in most regulated environments. Agents analyze, propose, and draft in isolated environments; humans execute. The manifesto's tier model maps naturally to regulatory approval workflows.
- Principle 8 (Evaluations) applies fully. Evaluation portfolios, regression gates, and evidence bundles are compatible with — and often required by — regulatory audit frameworks.
- Principle 9 (Observability) applies fully. Structured traces with provenance are often more rigorous than existing audit logs.
- Principle 12 (Accountability) applies fully. Domain-scoped human accountability with incident attribution aligns with regulatory responsibility frameworks.
The principles that require caution in regulated environments are primarily Principle 5 at Tier 3 (production-impacting agent actions), Principle 6 (memory governance in data-restricted environments — see P6 extended guidance), and Principle 10 (chaos testing in safety-critical systems — validate chaos experiments in isolated environments before running on production equivalents).
For viable starting points by domain, see: Aviation · Medical Devices · Pharma · Financial Services
What Would Need to Change
For regulated industries to move beyond Phase 3, the following capabilities must mature: deterministic or formally verifiable agent behavior for critical paths, certified evidence chains that satisfy audit requirements, and data-handling frameworks that meet jurisdictional restrictions for agent- accessible data. These are active areas of development. Teams in regulated environments should track progress and pilot cautiously rather than waiting for full maturity.
Cross-Domain Regulatory Insights
Three governance requirements appear independently across regulated domains. They are not domain-specific — they are structural properties of any high- stakes verification system.
Independent validation as a governance principle. Across regulated domains — SR 11-7 in financial services, IEC 62304 in medical devices, DO-178C in aviation — a common requirement emerges: the entity that validates a system must be independent from the entity that developed it. In agentic engineering, this applies at two levels: the agent system itself must be validated by parties independent of its development, and agent-generated outputs in regulated contexts must be verified through independent means. This maps to the manifesto's tier model: at Tier 1-2, human review provides independence; at Tier 3, independent evaluation infrastructure (separate models, holdout scenarios) provides the independence guarantee. See P8 extended guidance.
SOUP / agent-as-tool categorization. Multiple regulatory frameworks require classification of software components by provenance and qualification status: IEC 62304 (SOUP), DO-178C/DO-330 (COTS/PDS and tool qualification levels), ISO 26262 (SEooC), and GAMP 5 (software categories). In agentic engineering, three entities require classification: the AI model itself (non-deterministic, version-dependent, opaque), agent-selected dependencies (libraries and patterns chosen during execution), and agent-generated code (may incorporate training-data patterns as implicit unclassified software). The manifesto's defense-in-depth response: treat the agent as an unqualified tool and independently verify all output through qualified means. See P3 extended guidance.
Data classification as an agent constraint. Agents operating in regulated environments must respect data classification boundaries. Classification requirements constrain what data agents may access, where inference may execute, and what outputs may be retained. Data classification is not a prompt instruction — it must be enforced at the infrastructure level (Principle 5: autonomy tiers). Domain-specific constraints:
- Financial services: GDPR cross-border transfer rules (Chapter V) and banking secrecy laws (Switzerland, Luxembourg, Singapore) may prohibit certain data from reaching external inference APIs entirely.
- Life sciences (pharma / medical): GxP record integrity requires ALCOA+ compliance; patient-level clinical data carries HIPAA (US) and GDPR (EU) obligations; raw GxP data must never be modifiable by agents.
- Aviation / defense: ITAR (22 CFR 120-130) and EAR (15 CFR 730-774) restrict export-controlled technical data to compliant infrastructure; agents must operate within Technology Control Plans.
- Automotive / industrial: Safety-function configuration data may be restricted under product liability and type-approval obligations.
The manifesto's architecture principle (P3) applies across all domains: data classification boundaries must be machine-enforced, not documented and hoped for. See each domain document for the applicable classification matrix and enforcement mechanism.
IEC 61508 as the parent functional safety standard. IEC 61508 (2010) is the foundational functional safety standard for industrial electronic systems, from which several domain standards derive: IEC 62304 (medical device software), ISO 26262 (automotive), EN 50128 (railway), and IEC 62061 (machinery). Teams in domains not covered by a specific domain document should map IEC 61508 Safety Integrity Levels (SIL 1–4) to the manifesto's autonomy tiers using the same logic as the DAL and safety class mappings: SIL 3-4 functions → Tier 1 (observe only); SIL 2 → Tier 1-2; SIL 1 → full tier range with evidence controls.
Domain-Specific Regulatory Alignment
For detailed mappings between the manifesto and specific regulatory frameworks, see the Domain Regulatory Alignment documents:
- Aviation — DO-178C, DO-330, DO-333, ARP 4754A
- Medical Devices — IEC 62304, ISO 14971, ISO 13485, FDA SaMD
- Pharma / Life Sciences — GAMP 5, CSA, 21 CFR Part 11, ICH
- Financial Services — SR 11-7, DORA, EU AI Act, SOX
- Automotive — ISO 26262, ASPICE, UN Regulation 157
- Defense / Government — CMMC, FedRAMP, NIST SP 800-53, ITAR/EAR
For V-model organizations, see adoption-vmodel.md for a V-model-specific adoption path.
Operational Definitions
Blast radius: the maximum credible impact of a wrong action across users, data, services, or regulatory obligations.
Right-sized: the smallest agent topology and model tier that can meet the required quality and latency targets at acceptable total cost of correctness.
Evidence bundle: the minimum artifacts needed to justify a change at a given phase and risk tier.
Total cost of correctness: inference + verification + governance overhead + incident remediation + opportunity cost + context-switching cost. Optimize this
composite, not any single component. See
Principle 11 guidance.
Evolution-weighted correctness (EvoScore): a metric that measures functional correctness on future modifications, not just current tests. Agents whose early decisions facilitate subsequent evolution score higher; agents that accumulate structural technical debt see progressively declining performance. Introduced by the SWE-CI benchmark (arXiv:2603.03823). Use evolution-weighted metrics as a complement to total cost of correctness for long-running agent pipelines. See Structural Regression in the P8 extended guidance.
Structural regression: a change that passes all current tests but degrades the codebase's capacity for future change. Distinguished from behavioral regression (breaking existing functionality). See P8 guidance.
Phase-calibrated evidence examples:
- Phase 3: tests, diff, trace link, rollback note.
- Phase 4: Phase 3 bundle plus policy checks and incident tags.
- Phase 5+: Phase 4 bundle plus reproducible replay and, where justified, formal artifacts.
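These phase-calibrated bundles lend themselves to a mechanical completeness check before merge. A minimal sketch, assuming hypothetical artifact keys that mirror the list above:

```python
# Hypothetical sketch: validate that an evidence bundle contains the minimum
# artifacts for its phase. Artifact keys and phase table are illustrative.
PHASE_REQUIREMENTS = {
    3: {"tests", "diff", "trace_link", "rollback_note"},
    4: {"tests", "diff", "trace_link", "rollback_note",
        "policy_checks", "incident_tags"},
    5: {"tests", "diff", "trace_link", "rollback_note",
        "policy_checks", "incident_tags", "reproducible_replay"},
}

def missing_artifacts(bundle: dict, phase: int) -> set:
    """Return required artifacts absent from the bundle; empty set means pass."""
    present = {key for key, value in bundle.items() if value}
    return PHASE_REQUIREMENTS[phase] - present

bundle = {"tests": "suite-142", "diff": "pr-881.patch",
          "trace_link": "trace://abc", "rollback_note": "revert pr-881"}
print(missing_artifacts(bundle, 3))  # complete for Phase 3
print(missing_artifacts(bundle, 4))  # Phase 4 needs policy checks, incident tags
```

An incomplete bundle returns a non-empty set, which a merge gate can use to block the change.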
ALCOA+ Alignment
Organizations operating under GxP, FDA 21 CFR Part 11, or equivalent regulated data-integrity frameworks will recognize that the manifesto's evidence model satisfies ALCOA+ by construction:
| ALCOA+ Criterion | Manifesto Mechanism |
|---|---|
| Attributable | Agent identity in every trace; named human domain owner (P12) |
| Legible | Structured, queryable traces — not free-text logs (P9) |
| Contemporaneous | Traces captured at execution time, not reconstructed after the fact |
| Original | Signed provenance for shared state; immutable evidence bundles |
| Accurate | Evaluations as the contract between intent and behavior (P8) |
| Complete | Evidence bundles are phase-gated; incomplete bundles block merge |
| Consistent | Versioned specifications; regression gates enforce non-degradation |
| Enduring | Replayable tool logs; trace retention as infrastructure requirement |
| Available | Traces must be queryable and aggregatable for audit at any time (P9) |
This mapping is intentional, not coincidental. The manifesto was designed so that governed agentic delivery produces records that meet regulated-industry data-integrity standards without a separate compliance overlay.
Concrete scenarios showing how the manifesto's principles apply in practice, including both successful applications and governed failures.
Read the Manifesto for the core values and minimum bars. See the Companion Guide for the full table of contents. See the Companion Principles for extended guidance on each principle.
Worked Patterns
Pattern A — Single-Domain Reliability Fix
Specification: "Retry payment capture exactly once after timeout; never double-charge."
Agent decomposition: implement retry logic, add idempotency key handling, add tests, produce trace and rollback plan.
Evidence bundle (Phase 4): diff, regression tests, trace ID, policy check results, rollback command.
Outcome: shipped change, observed behavior, no duplicate charges in canary.
Pattern B — Multi-Agent, Cross-Domain Coordination
Specification: "Cancel order across orders, billing, and notifications without double-refund, stale customer status, or orphaned events."
Swarm decomposition:
- Planner agent creates domain tasks with shared invariants.
- Domain agents implement bounded changes in parallel.
- Verification agent runs cross-service regression and contract checks.
- Coordinator agent resolves conflicting diffs through a single commit path.
- Operations agent gates rollout with canary and rollback criteria.
Evidence bundle (Phase 5): per-domain diffs, cross-domain trace graph, invariant check results, reconciliation decisions, canary metrics, rollback commands.
Outcome: one conflicting refund rule detected pre-merge, corrected via constraint update, release completed without refund duplication.
Pattern C — Memory Poisoning Recovery
Scenario: A retrieval shard serving the billing domain is corrupted by a batch indexing error. Agents start generating code that references a deprecated payment API. Three PRs are merged before the pattern is detected through evaluation regression.
Detection: Evaluation metrics for billing-domain changes show a sudden increase in API-compatibility failures. Trace analysis reveals all three failing changes retrieved context from the same shard, and the retrieved context references the deprecated API.
Recovery:
- Isolate the corrupted shard — remove from retrieval rotation immediately.
- Identify all memory entries created or influenced by the bad shard using provenance metadata.
- Roll back billing-domain learned memory to the last known-good snapshot (pre-indexing error).
- Revert or flag the three merged PRs for re-review against corrected context.
- Re-index the shard from authoritative knowledge sources.
- Add a retrieval canary for the billing domain: a known-good query with an expected result, run on every retrieval cycle, alerting on drift.
- Update incident memory with the failure class, root cause, and recovery steps — so the system recognizes this pattern faster next time.
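The retrieval canary from the recovery steps can be sketched as a known-good query with an expected content marker, alerting when the marker disappears. The retriever functions and marker strings here are simulated, not a real retrieval API:

```python
# Hypothetical retrieval canary: a known-good query whose result must contain
# an expected marker. A failing canary signals possible shard drift/poisoning.
def run_canary(retrieve, query: str, must_contain: str) -> bool:
    """Return True if the canary passes; False should raise an alert."""
    results = retrieve(query)
    return any(must_contain in doc for doc in results)

# Simulated retrievers for illustration only.
def healthy_shard(query: str) -> list:
    return ["Use payments.v2.capture for payment capture."]

def poisoned_shard(query: str) -> list:
    return ["Use payments.v1.capture (deprecated) for payment capture."]

assert run_canary(healthy_shard, "payment capture API", "payments.v2")
assert not run_canary(poisoned_shard, "payment capture API", "payments.v2")
```

Running this on every retrieval cycle turns silent corruption into an immediate, attributable alert.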
Evidence bundle: trace IDs of affected changes, memory diff (before/after rollback), retrieval canary configuration, re-indexed shard validation.
Pattern D — Economics Routing Decision
Scenario: A specification requires generating integration tests for a new API endpoint. The team has access to a high-capability model (expensive, strong reasoning) and a fast model (cheap, weaker on complex logic).
Routing decision:
- Route the initial test generation to the fast model — integration test boilerplate is well-covered in training data and doesn't require deep reasoning.
- Route the edge-case and adversarial test generation to the high-capability model — these require understanding failure modes and security boundaries.
- Route the test review and evaluation against the existing regression suite to deterministic tooling — no model needed.
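The routing decision above reduces to a small dispatch rule. A sketch with assumed task labels and model names (none of these are real model identifiers):

```python
# Illustrative cost-aware router: well-understood generation goes to the fast
# model, reasoning-heavy work to the high-capability model, and deterministic
# checks to tooling. Task labels and model names are assumptions.
def route(task: str) -> str:
    if task in {"boilerplate_tests", "scaffolding", "docstrings"}:
        return "fast-model"            # well-covered patterns, shallow reasoning
    if task in {"edge_case_tests", "adversarial_tests", "security_review"}:
        return "high-capability-model"  # failure modes and security boundaries
    return "deterministic-tooling"      # e.g. regression-suite evaluation

assert route("boilerplate_tests") == "fast-model"
assert route("edge_case_tests") == "high-capability-model"
assert route("regression_eval") == "deterministic-tooling"
```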
Cost comparison (illustrative only, not benchmark data):
- All tasks to high-capability model: $4.20 total, 45 seconds.
- Routed as above: $0.85 total, 32 seconds (fast model handles 70% of volume).
- All tasks to fast model: $0.40 total, 25 seconds — but edge-case tests miss two security boundaries caught in evaluation, requiring a retry on the high-capability model. Actual total: $1.60, 55 seconds.
The lesson: total cost of correctness, not token price, is the metric. The cheapest model is not always the most economical if its failure rate drives rework.
Pattern E — Autonomy Tier Escalation at Runtime
Scenario: An agent operating at Tier 2 (Branch) is implementing a database migration. Mid-task, it discovers that the migration requires modifying a production configuration value to update a connection string.
Escalation protocol:
- Agent pauses the migration and emits a structured escalation request: "Need to update `DB_CONNECTION_STRING` in production config. Reason: migration target requires new connection endpoint. Blast radius: all services using the billing database. Reversibility: config change is reversible via config rollback. Evidence: migration plan diff, test results on staging."
- System routes the request to the domain owner (billing infrastructure). Because this is a Tier 2→3 escalation (production-impacting action), it requires human approval.
- Domain owner reviews the evidence, approves with a time-bound scope: "Approved for this specific config key, this deployment window only."
- Agent executes the config change, completes the migration, and the temporary Tier 3 elevation expires.
- Full trace captured: escalation request, approval, action, outcome, tier restoration.
Anti-pattern: The agent modifies the production config without escalation because its prompt says "complete the migration." Infrastructure enforcement — not prompt compliance — must block this.
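Infrastructure enforcement of this kind can be sketched as a deny-by-default gate with time-bound approvals. The hook name, tool name, and approval store below are illustrative, not a real PreToolUse API:

```python
# Hypothetical deny-by-default gate for production config writes. An agent's
# call is allowed only while a matching, unexpired approval exists.
import time

APPROVALS = {}  # (agent_id, config_key) -> expiry timestamp

def grant(agent_id: str, key: str, ttl_seconds: int) -> None:
    """Record a time-bound approval (the Tier 2->3 elevation window)."""
    APPROVALS[(agent_id, key)] = time.time() + ttl_seconds

def pre_tool_use(agent_id: str, tool: str, target_key: str) -> bool:
    """Return True to allow the tool call; block prod config writes by default."""
    if tool != "write_production_config":
        return True
    expiry = APPROVALS.get((agent_id, target_key), 0)
    return time.time() < expiry  # allowed only inside the approved window

# Blocked before approval, allowed within the approved window.
assert not pre_tool_use("agent-7", "write_production_config", "DB_CONNECTION_STRING")
grant("agent-7", "DB_CONNECTION_STRING", ttl_seconds=3600)
assert pre_tool_use("agent-7", "write_production_config", "DB_CONNECTION_STRING")
```

The prompt never enters the decision: the gate evaluates identity, action, and approval state, so "complete the migration" cannot talk its way past it.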
Pattern F — Governance That Didn't Prevent the Incident
Scenario: A team at Phase 4 has evidence bundles, evaluation gates, and defined autonomy tiers. An agent generates a migration that renames a database column used by three downstream services. The evidence bundle is complete: diff, passing tests (all within the agent's domain), trace, rollback command. The domain owner reviews and approves. The change ships. Two hours later, three downstream services fail because they depend on the old column name.
What went right: evidence bundle was complete per Phase 4 requirements. Evaluation gates caught regressions within the domain. The trace made root cause analysis fast — the team identified the breaking change in minutes, not hours.
What went wrong: the evaluation suite only tested within the agent's domain boundary. The specification said "rename the column" but didn't include cross-domain impact as an acceptance criterion. The domain owner approved based on evidence that was correct but incomplete.
Why the manifesto didn't prevent this: at Phase 4, governance is single-domain. Cross-domain evaluation coverage is a Phase 5 capability (shared evaluation registry, cross-domain trace standards). The team was operating correctly for their phase — but their phase wasn't sufficient for the task's actual blast radius.
The lesson: governed failure is still failure. The manifesto reduces the frequency and blast radius of incidents, and it makes diagnosis and recovery faster. It does not eliminate incidents. When a governed change still causes an outage, the question is not "why didn't governance prevent this?" but "what evidence was missing, and at what phase does the manifesto add that evidence?" In this case, the answer is Phase 5's cross-domain evaluation coverage — which the team should now prioritize for domains with shared dependencies.
The anti-lesson: do not respond to this incident by adding more governance at Phase 4 (requiring cross-domain reviews for every change). That is over-governance — it would slow every change to protect against a failure class that only occurs when changes cross domain boundaries. Instead, promote the specific domains with shared dependencies to Phase 5 governance.
Pattern G — Exception-Based Governance at Scale
These sampling thresholds are example defaults; calibrate them to local risk, review complexity, and incident history. This is a governance pattern, not a universal policy.
Context: A team at Phase 4+ is generating agent-driven changes at a volume that exceeds meaningful human review of every change. Domain owners are showing rubber-stamping signals (review time < 2 minutes, rejection rate < 1%).
The supervision paradox: Human review does not scale to machine-speed output. Adding more reviewers at the same volume creates the same pattern faster. The solution is to reduce the volume of decisions requiring human review — not the quality of review.
The pattern:
Classify all changes by risk tier using an automated pre-screener built from domain rules and change impact analysis:
- High-risk: Changes touching pricing logic, customer-facing decisions, shared schemas with cross-domain consumers, security boundaries, or compliance-annotated code paths → mandatory human review before merge.
- Medium-risk: Changes within a single domain, touching non-critical paths, passing full evaluation suites → statistical sample (10-20%) reviewed by domain owner; remainder logged without review.
- Low-risk: Test updates, documentation, configuration in isolated environments, changes with complete evidence bundles and no cross-domain impact → logged and merged automatically; retrospective audit.
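The automated pre-screener can be sketched as a rule cascade over change metadata. The globs and rules here are example defaults, not a universal policy; calibrate them to local risk:

```python
# Illustrative three-tier risk pre-screener. Path globs and rules are example
# defaults; real deployments derive them from domain rules and impact analysis.
from fnmatch import fnmatch

HIGH_RISK_GLOBS = ["src/pricing/*", "src/claims/*"]

def classify(changed_files: list, schema_change: bool,
             new_dependency: bool, evals_green: bool) -> str:
    # Schema changes and new dependencies always get human review.
    if schema_change or new_dependency:
        return "high-risk"
    # Compliance-sensitive paths get human review.
    if any(fnmatch(f, g) for f in changed_files for g in HIGH_RISK_GLOBS):
        return "high-risk"
    # Docs and tests only, with green evals: log and merge automatically.
    if evals_green and all(f.endswith((".md", "_test.py")) for f in changed_files):
        return "low-risk"
    # Everything else: sampled review if evals pass, full review otherwise.
    return "medium-risk" if evals_green else "high-risk"

assert classify(["src/pricing/discount.py"], False, False, True) == "high-risk"
assert classify(["docs/README.md"], False, False, True) == "low-risk"
assert classify(["src/billing/retry.py"], False, False, True) == "medium-risk"
```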
Gate high-risk changes explicitly. A PreToolUse hook on PR creation checks the risk classification and blocks merge until the named domain owner approves. Approval latency for high-risk changes is a tracked metric — rising latency indicates the high-risk classification is too broad.
Sample medium-risk changes. The domain owner reviews a random 15% sample each week. If the sample catch rate (issues found per reviewed PR) falls below 2%, the classification threshold may be too conservative — promote some medium-risk to low-risk. If the catch rate exceeds 15%, the threshold is too permissive — raise more to high-risk.
Log low-risk changes for retrospective audit. PostToolUse hooks produce full audit records. Internal audit or the 2nd line of defense conducts periodic retrospective reviews of the low-risk cohort (monthly, 5% random sample) to validate the classification is working.
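The catch-rate feedback loop for medium-risk sampling can be made explicit. The 2% and 15% bounds are the example defaults stated above, not recommended constants:

```python
# Sketch of the catch-rate feedback loop for the medium-risk sample.
# The 2% / 15% bounds are the example defaults from this pattern.
def threshold_advice(issues_found: int, prs_reviewed: int) -> str:
    catch_rate = issues_found / prs_reviewed
    if catch_rate < 0.02:
        return "too-conservative: promote some medium-risk to low-risk"
    if catch_rate > 0.15:
        return "too-permissive: raise more medium-risk to high-risk"
    return "calibrated"

assert threshold_advice(0, 40) == "too-conservative: promote some medium-risk to low-risk"
assert threshold_advice(8, 40) == "too-permissive: raise more medium-risk to high-risk"
assert threshold_advice(2, 40) == "calibrated"
```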
Evidence bundle: Risk classification rationale per PR (which rules triggered which tier), domain owner approval record for high-risk changes, weekly sample review record, retrospective audit findings.
Classification criteria examples:
| Rule | Classification |
|---|---|
| Touches `src/pricing/**` | High-risk |
| Touches `src/claims/**` | High-risk |
| Modifies a database schema | High-risk (cross-domain impact) |
| New dependency added | High-risk (provenance review required) |
| Test file changes only, green regression suite | Low-risk |
| Documentation, README, comments | Low-risk |
| Single-domain logic change, passing evals | Medium-risk |
Anti-pattern: Treating all agent-generated changes as equally risky and requiring human review for all of them. This creates the rubber-stamping failure mode. The goal is not to review everything — it is to review the right things with enough attention to catch real problems.
Relationship to Three Lines of Defense: In regulated environments, the classification pre-screener is a 1st-line control. The 2nd-line independent validation function reviews the classification criteria periodically (not individual changes) and challenges whether the risk tiers are set appropriately. The 3rd line audits whether the process was followed.
Pattern H — The Persona Simulator
When to use: Before shipping a feature that involves complex user interactions, ambiguous intent, or high diversity of user populations. Use this pattern to validate that the specification itself is correct — not just that the implementation satisfies the specification as written.
The problem it solves: Specifications are written from the perspective of the team that wrote them. They encode assumptions about how users will interact with the feature, what they will ask, and what they consider a success. These assumptions are often wrong. Traditional testing verifies that the implementation matches the spec; it does not verify that the spec matches user reality.
Pattern:
Deploy a swarm of simulation agents, each instantiated with a distinct persona profile: domain expertise, communication style, prior experience with the system, edge-case goals, adversarial intent (where applicable). Each persona agent interacts with the feature under development using the specification as its behavioral target.
The simulation produces two outputs:
- Coverage gaps — interaction paths, question types, or intent categories that the specification does not address. These become specification amendments before implementation is finalized.
- Failure signals — interactions where the feature's response would be incorrect, ambiguous, or unsafe from the perspective of that persona. These become evaluation cases in the evaluation portfolio (P8).
Relationship to the Agentic Loop: The Persona Simulator belongs to the Validate phase, not the Verify phase. Verify confirms the implementation satisfies the specification. Validate asks whether the specification itself is worth satisfying. Running the simulator before implementation is complete (on a specification stub or prototype) catches the wrong-thing-built failure class before it is fully built.
Minimum viable version:
# simulate(), extract_coverage_gaps(), extract_failure_signals(), and report
# are hypothetical helpers; this is a sketch, not a library API.
personas = [
    {"role": "power user", "style": "terse", "goal": "efficiency"},
    {"role": "first-time user", "style": "exploratory", "goal": "orientation"},
    {"role": "adversarial user", "style": "probing", "goal": "boundary-finding"},
    {"role": "domain expert", "style": "precise", "goal": "correctness validation"},
]

for persona in personas:
    interactions = simulate(persona, feature_spec, n=20)
    gaps = extract_coverage_gaps(interactions, feature_spec)
    failures = extract_failure_signals(interactions, acceptance_criteria)
    report.add(persona, gaps, failures)
Exit criterion: The simulation is complete when coverage gaps have been either addressed in the specification or explicitly accepted as out of scope, and all failure signals have been added to the evaluation portfolio. Shipping without addressing the failure signals is an explicit risk decision, not an oversight.
Not all failure signals can be deferred. Any failure signal involving safety, data integrity, irreversible user harm, or a regulatory requirement is non-deferrable: it must be addressed in the specification before the implementation proceeds. Logging it as "accepted out of scope" is not acceptable for these categories. If no fix is feasible, the feature scope must be reduced to exclude the interaction class that produces the failure.
What this pattern is not: It is not a replacement for user research. Real users surface failure modes that no persona model anticipates. The Persona Simulator is a pre-ship filter, not a substitute for post-ship observation. It raises the floor; it does not guarantee the ceiling.
Failure Patterns
Hallucination Loop
An agent misreads a timeout error as auth failure, applies credential retries, and increases incident volume. Each retry generates plausible-but-wrong output that drives increasingly wrong follow-on fixes.
The fix is not "retry the prompt." It is:
- Diagnose using traces — identify the misclassification point.
- Add a contract/invariant: "timeout retry must not mutate credentials."
- Update evaluations with the failure class as a regression test.
- Gate rollout until traces confirm the corrected behavior.
Never simply retry a failed prompt. Diagnose, update memory, strengthen contracts and constraints, and rerun verification before retrying.
Operational Recovery Cycle
- Diagnose using traces and failure classification.
- Add or tighten contract/invariant for the violated behavior.
- Add regression and adversarial tests for the failure class.
- Re-run verification and canary on constrained scope.
- Promote only after evidence shows the loop is broken.
Cross-Domain Incident Classification Framework
A common severity framework enables consistent incident classification, reporting, and recovery across regulated environments. Domain-specific calibrations are listed below.
| Severity | Definition | Recovery Expectation | Regulatory Trigger |
|---|---|---|---|
| Severity 1 | Agent takes unauthorized action with external impact (customer accounts, patient data, regulatory submissions, safety-critical systems) | Immediate containment; production rollback; root cause analysis with executive sign-off | Mandatory regulatory notification in most domains (DORA Art. 17-23; MDR Art. 87; ITAR incident reporting) |
| Severity 2 | Agent produces incorrect output detected before downstream impact; indicates a control failure (evaluation gate missed, tier enforcement bypassed) | Same-day diagnosis; evidence bundle with root cause; governance review of the failed control | Internal incident record; potential regulatory disclosure depending on data type affected |
| Severity 3 | Agent performance degradation (latency, accuracy drift, increasing evaluation failure rate) detected through monitoring; within tolerance thresholds | Diagnosis within 24h; specification or tier adjustment if root cause identified | Typically internal; may trigger DORA notification if threshold-breaching degradation continues |
| Severity 4 | Agent failure fully contained by circuit breakers or fallback mechanisms; no downstream impact | Post-incident review within 48h; update chaos test suite with the failure scenario | Internal only; document in resilience engineering log |
Domain-specific calibration:
- Aviation: Map to the failure condition category (Catastrophic, Hazardous, Major, Minor) of the software component affected. Any agent action affecting airborne software in a DAL A/B component is Severity 1 by default.
- Medical devices: Map to IEC 62304 safety class and ISO 14971 harm probability × severity. Any agent action affecting Class C critical-path software is Severity 1. Vigilance reporting timelines apply for Severity 1-2.
- Pharma: Map to GxP data integrity impact. Any agent action that modifies or corrupts GxP records without a valid audit trail is Severity 1. Deviation and CAPA procedures apply.
- Financial services: Use the DORA Severity 1-4 taxonomy defined in the financial-services.md domain document. DORA notification timelines are strict; track them as a first-class workflow trigger.
- Automotive: Any agent action affecting ASIL C/D safety function specifications, test cases, or verification records is Severity 1. ISO 26262 Part 8 change management requirements apply.
- Defense / government: ITAR/EAR violations are automatic Severity 1 regardless of downstream impact. Report to the cognizant security officer immediately; do not attempt self-remediation before reporting.
Companion document to the Agentic Engineering Manifesto. Extends Principle 2 (Specifications as living artifacts) with a structured Requirements Engineering framework adapted for probabilistic, agentic, and hybrid systems.
Primary reference: "Requirements Engineering in the Age of Agentic AI" (submitted framework). Key academic support: arXiv:2602.22302 (Agent Behavioral Contracts), AgentSpec ICSE 2026 (arXiv:2503.18666), NIST AI RMF GenAI Profile (NIST AI 600-1, July 2024), ISO/IEC 5338 (AI system life cycle), ISO/IEC 42001 (AI management systems), EU AI Act (Regulation (EU) 2024/1689).
1. The Paradigm Break
Traditional requirements engineering was designed for deterministic systems. A requirement specifies a condition the system must satisfy; a test confirms whether the system satisfies it. Pass or fail.
Agentic systems break this model in three ways:
Non-determinism. The same input may produce different outputs across runs. A requirement stating "the system shall return X given input Y" cannot be verified by a single test execution. It must be stated as a probabilistic assurance target: "the system shall return output consistent with class X in at least N% of runs across the evaluation distribution."
Emergent behavior. Agentic systems learn, adapt, and generate outputs outside any enumerated set. Requirements that enumerate permitted outputs will always be incomplete. Requirements must instead define a behavioral envelope — the boundary the system must stay within — and verify containment rather than specific outputs.
Dual consumers. Specifications in agentic pipelines are consumed by both humans (who interpret intent) and agents (who execute literally). A specification that relies on human context to be meaningful will fail when consumed by an agent.
These three breaks require a different RE vocabulary. This document provides it.
2. The Two-Axes Classification Matrix
Every requirements artifact in an agentic system can be placed on two axes:
Axis 1 — System type:
- Deterministic: Classical software. Outputs are fully determined by inputs and current state. Traditional RE applies without modification.
- Agentic: LLM-based, reinforcement-learning-based, or otherwise probabilistic. Outputs are non-deterministic. Traditional RE must be extended.
- Hybrid: Deterministic orchestration layer over agentic execution components. Most production agentic systems. Requires mixed RE strategies.
Axis 2 — Artifact consumer:
- Human: The requirement is written for a human reader. Natural language is appropriate. Intent can be communicated through context, examples, and commentary.
- Agent: The requirement is consumed directly by an agent as part of a specification, system prompt, or AGENTS.md file. Must be unambiguous to a machine. Contextual inference is unreliable.
- Hybrid: The requirement must serve both humans (for review, governance) and agents (for execution). This is the hardest case and requires explicit dual-format specifications.
The 3×3 Matrix
| Human consumer | Agent consumer | Hybrid consumer | |
|---|---|---|---|
| Deterministic system | Traditional RE. Prose + formal models. | AGENTS.md / skill files. Machine-readable constraints with no ambiguity. | Canonical prose spec + machine-readable encoding. Keep them in sync. |
| Agentic system | Behavioral envelope in prose. Probabilistic assurance targets as acceptance ranges. | Behavioral contracts (arXiv:2602.22302). AgentSpec format (arXiv:2503.18666). Enumerated constraints with explicit probability bounds. | Single source (behavioral envelope) + dual projections: prose for governance, structured format for agent consumption. |
| Hybrid system | Separate deterministic and agentic requirement sets. Document which components are which. | Orchestration spec (deterministic, machine-readable) + behavioral envelope (agentic components). | Full RE framework: single-source document → human projection → agent projection → governance projection. |
Key rule: Never write a requirement in the human-consumer format when the primary consumer is an agent. The specification will be consumed literally. What a human infers from context, an agent will miss or misapply.
Stack allocation note: Requirements must be allocated at the appropriate layer of the system stack: foundation model / provider, prompt and runtime policy, planner or controller, memory, tools and connectors, deterministic orchestration, human review interface, deployment and monitoring infrastructure. A single high-level requirement (e.g., "the system must not exfiltrate sensitive data") typically decomposes into separate requirements at multiple layers: model and provider constraints, retrieval scoping, connector authorization scopes, egress controls, logging, review gates, and incident response. Apply the two-axes classification at the layer where the requirement is enforced, not at the system level.
3. Hard Requirements vs. Probabilistic Assurance Targets
The requirement type must match the system type.
Hard requirements are absolute. The system either satisfies them or it does not. They apply to:
- Deterministic components of hybrid systems
- Safety boundaries (the system must never take action X regardless of context)
- Authorization and access control
- Structural invariants (data formats, API contracts, schema validation)
Hard requirements in agentic systems should be enforced by infrastructure policy wherever possible, not by the agent's own reasoning. An agent instructed not to do X via a prompt can be argued or manipulated out of that constraint. An agent that cannot do X because the tool call is disabled cannot. In practice, critical hard requirements may need layered enforcement — infrastructure policy as the primary control, supplemented by runtime monitoring, human review gates, and post-hoc audit detection. No single enforcement mechanism should be treated as sufficient in isolation for Tier 3 systems.
Probabilistic assurance targets define acceptable performance ranges across an evaluation distribution. They apply to:
- Output quality (accuracy, relevance, completeness)
- Behavioral consistency (the system should behave consistently within the behavioral envelope)
- Task success rates
Format: "The system shall achieve [metric] of [value] ± [tolerance] across [evaluation distribution] with [confidence level]."
Example: "The claim extraction agent shall achieve F1 score ≥ 0.85 across the held-out evaluation set of 500 documents, with 95% confidence interval upper bound ≥ 0.82."
Critical distinction: Probabilistic assurance targets are not lower-quality requirements. They are the correct specification format for non-deterministic behavior. Writing a hard requirement for probabilistic behavior is not more rigorous — it is a category error that will always fail at verification.
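A probabilistic assurance target in this format can be verified mechanically. The sketch below uses the Wilson score interval for a pass-rate metric; a score-valued metric like the F1 example above would use bootstrap resampling instead. Numbers and thresholds are illustrative:

```python
# Sketch: verify a target of the form "pass rate >= X across the evaluation
# distribution, with a 95% confidence lower bound". Uses the Wilson interval.
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """95% (z=1.96) lower confidence bound on a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def meets_target(successes: int, n: int, target: float) -> bool:
    return wilson_lower_bound(successes, n) >= target

# 470/500 passing: point estimate 0.94, 95% lower bound ~0.916.
assert meets_target(470, 500, 0.90)
assert not meets_target(470, 500, 0.93)
```

Note that the gate runs on the lower confidence bound, not the point estimate: a 94% observed pass rate does not certify a 93% target on a 500-case suite.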
4. The Behavioral Envelope
A behavioral envelope defines the space within which agent behavior is acceptable, without enumerating acceptable behaviors. It consists of four layers:
Layer 1 — Hard boundaries (must never). Actions the agent is prohibited from taking regardless of context, instructions, or apparent justification. These are enforced structurally (tool removal, permission policy) not by prompt instruction.
Examples: writing to production databases without explicit approval, sending external communications without human review, executing irreversible actions in Tier 3 systems (see Section 6).
Layer 2 — Soft boundaries (should avoid). Behaviors that are undesirable but not prohibited. Enforced by evaluation, monitoring, and steering. Alert on violation; do not hard-block.
Examples: producing responses that exceed the approved length envelope, citing sources outside the approved knowledge base, introducing architectural patterns not aligned with the codebase style.
Layer 3 — Performance envelope. The acceptable range of quality, cost, latency, and resource consumption. Defines when degraded performance triggers escalation.
Layer 4 — Adaptation envelope. For systems that learn or accumulate state: defines what the system is permitted to learn from (allowed data types, sources, and feedback channels), what it must not update on (prohibited inputs to persistent memory or fine-tuning), and how learned behavior is governed and audited. Specify: what counts as adaptation (few-shot history, RAG knowledge base updates, fine-tuning, long-term memory writes); who can write to persistent memory and under what conditions; provenance requirements for stored knowledge; retention and expiry policy; how learned state can be rolled back; and what behavioral changes trigger a revalidation cycle. For systems using retrieval-augmented generation, specify knowledge base governance: source authority, freshness requirements, and access boundaries.
The behavioral envelope is the primary specification artifact for agentic components. It replaces enumerated-output requirements as the verification target. For the full system, the behavioral envelope coexists with hard requirements for deterministic components and interface contracts between system layers.
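One way to make the four layers machine-checkable is a structured envelope object whose hard boundaries map to blocks and soft boundaries map to alerts. Field names and action strings are illustrative:

```python
# Hypothetical encoding of the four-layer behavioral envelope as a
# machine-checkable structure. Field names and actions are illustrative.
from dataclasses import dataclass, field

@dataclass
class BehavioralEnvelope:
    hard_boundaries: set = field(default_factory=set)       # must never; enforced structurally
    soft_boundaries: set = field(default_factory=set)       # should avoid; alert, don't block
    performance_envelope: dict = field(default_factory=dict)  # quality/cost/latency ranges
    adaptation_envelope: dict = field(default_factory=dict)   # what may be learned, by whom

    def check_action(self, action: str) -> str:
        if action in self.hard_boundaries:
            return "block"   # Layer 1: structural enforcement
        if action in self.soft_boundaries:
            return "alert"   # Layer 2: monitor and steer
        return "allow"

env = BehavioralEnvelope(
    hard_boundaries={"write_production_db", "send_external_email"},
    soft_boundaries={"cite_external_source"},
)
assert env.check_action("write_production_db") == "block"
assert env.check_action("cite_external_source") == "alert"
assert env.check_action("edit_branch_file") == "allow"
```

The design choice worth noting: "block" decisions should be wired to tool removal or permission policy, while "alert" decisions feed monitoring, matching the Layer 1 / Layer 2 enforcement split.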
Multi-Agent Behavioral Contracts
When multiple agents interact, behavioral envelopes must be specified for each agent individually and for inter-agent boundaries. The Agent Behavioral Contracts framework (arXiv:2602.22302) addresses this explicitly: contracts define pre/postconditions and invariants at agent boundaries, and multi-agent contract composition yields computable probabilistic degradation bounds for the chain.
The practical implication: reliability does not improve in a multi-agent system simply by adding more agents. Correlated failure modes — shared base model, shared knowledge base, shared tool chain — mean the combined reliability of a chain can be worse than any single agent's reliability, because failures propagate in the same direction simultaneously. Requirements for multi-agent systems must therefore specify:
- Communication contracts at each inter-agent boundary: what one agent is permitted to send, what another is required to accept, what triggers rejection or escalation
- Chain-level reliability targets stated as probabilistic assurance targets, not derived from per-agent targets by multiplication (which assumes independence that rarely holds)
- Failure isolation boundaries: what happens when one agent in the chain fails — does it escalate, fall back, or propagate the error downstream?
- Shared resource governance: if agents share a knowledge base, memory store, or tool, specify which agent can write, which can read, and under what conditions
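The independence fallacy behind per-agent multiplication can be demonstrated with a small simulation: a verifier that shares the generator's base-model blind spot lets roughly ten times as many failures through as an independent verifier would. All rates here are illustrative:

```python
# Illustration: why multiplying per-agent reliabilities overstates chain
# reliability when failures are correlated. All probabilities are made up.
import random

def undetected_failure_rate(correlated: bool, n: int = 100_000, seed: int = 1) -> float:
    rng = random.Random(seed)
    undetected = 0
    for _ in range(n):
        blind_spot = rng.random() < 0.1      # shared base-model blind spot (10%)
        generator_fails = blind_spot          # generator fails on blind spots
        if correlated:
            verifier_misses = blind_spot      # same base model: same blind spot
        else:
            verifier_misses = rng.random() < 0.1  # independent 10% miss rate
        if generator_fails and verifier_misses:
            undetected += 1
    return undetected / n

# Independence predicts 0.1 * 0.1 = 1% undetected; shared blind spots yield ~10%.
assert undetected_failure_rate(correlated=True) > 5 * undetected_failure_rate(correlated=False)
```

This is why chain-level targets must be measured on the composed system rather than derived by multiplication.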
5. The Single-Source / Multiple-Projections Principle
Agentic pipelines require requirements artifacts in multiple formats for multiple consumers:
- Governance and audit: prose, human-readable, context-rich
- Agent execution: structured, machine-readable, unambiguous
- Testing and evaluation: measurable, with clear pass/fail or threshold criteria
- Regulatory compliance: aligned to ISO/IEC 5338, NIST AI RMF, or domain-specific standards
The failure mode is maintaining separate documents for each consumer. These diverge. The governance document says one thing; the agent execution spec says another; the tests verify a third thing.
The single-source principle (governance best practice, not a legal requirement): One canonical source document (the behavioral specification) is the source of truth. All other representations are generated or derived from it, not independently authored. When the source changes, all projections must be updated.
In practice:
- Write the behavioral specification in human-readable prose with explicit, structured sections
- Derive the agent-consumable encoding (AGENTS.md, AgentSpec format, or behavioral contract) from the prose by explicit, documented transformation
- Derive the evaluation suite from the probabilistic assurance targets, not independently
- Derive the compliance mapping from the behavioral envelope using the relevant standard's framework (NIST AI RMF risk categories, ISO/IEC 5338 life cycle requirements)
Every requirement in every projection must trace back to a named section in the canonical source. If it cannot, it either belongs in the source or should not exist.
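The traceability rule can be enforced mechanically. A minimal sketch, where the section names and the `source` field layout are hypothetical, not a mandated schema:

```python
# Named sections of the canonical behavioral specification (hypothetical names).
canonical_sections = {"scope", "hard-boundaries", "assurance-targets", "memory-governance"}

# Each projection item carries a `source` field naming its canonical section.
agent_spec = [
    {"rule": "never write to the production schema", "source": "hard-boundaries"},
    {"rule": "p95 latency under 30s", "source": "assurance-targets"},
]
eval_suite = [
    {"check": "injection resistance >= 0.99", "source": "hard-boundaries"},
    {"check": "cost per run p95", "source": "billing"},  # orphan: no such section
]

def orphans(projection):
    """Requirements that do not trace to a named canonical section."""
    return [item for item in projection if item["source"] not in canonical_sections]

for name, proj in [("agent_spec", agent_spec), ("eval_suite", eval_suite)]:
    for item in orphans(proj):
        print(f"{name}: untraceable requirement -> {item}")  # flags the 'billing' item
```

An orphaned requirement either belongs in the source, in which case the source is amended first, or it should not exist.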
Change Control
Because the behavioral specification is the source of truth for all projections, changes to it carry cascading implications. A change control process must specify:
- Who can propose changes to the behavioral specification, and who must approve them (minimum: the tier owner and a representative from each affected consumer group — governance, engineering, and security)
- What triggers mandatory re-evaluation: new capability deployment, behavioral drift detected by monitoring, incident post-mortem, regulatory change, or elapsed review interval
- How projections are updated: no projection may be updated independently. The canonical source is updated first; projections are re-derived. The commit history of the source document is the audit trail.
- How version mismatches are detected: agents consuming a behavioral specification should receive a versioned reference. If the version they were instantiated with no longer matches the current canonical version, the mismatch must be flagged before the next deployment cycle.
For Tier 3 systems, changes to Layer 1 hard boundaries require explicit re-authorization: re-specification, updated evidence bundle, and revalidated approval chain.
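Version mismatch detection can be as simple as pinning a content hash of the canonical specification at agent instantiation and comparing it before deployment. A minimal sketch, assuming the specification is available as text:

```python
import hashlib

def spec_version(spec_text: str) -> str:
    """Content hash of the canonical behavioral specification."""
    return hashlib.sha256(spec_text.encode()).hexdigest()[:12]

class Agent:
    def __init__(self, spec_text: str):
        # The version the agent was instantiated with.
        self.pinned_version = spec_version(spec_text)

def pre_deploy_check(agent: Agent, current_spec_text: str) -> None:
    """Flag a mismatch before the next deployment cycle."""
    current = spec_version(current_spec_text)
    if agent.pinned_version != current:
        raise RuntimeError(
            f"spec mismatch: agent pinned {agent.pinned_version}, "
            f"canonical is {current}; re-derive projections before deploying"
        )

agent = Agent("v1 of the behavioral specification")
pre_deploy_check(agent, "v1 of the behavioral specification")  # passes: versions match
```

Pinning to the commit hash of the source document gives the same guarantee while reusing the audit trail that already exists.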
6. Tiered Lifecycle
Requirements governance applies differently at different autonomy tiers. Mismatching governance to autonomy tier is a common source of both under-governance (too little control at high autonomy) and over-governance (paralyzing low-risk operations with excessive process).
Tier 1 — Propose-only (analyze and recommend, no execution). Requirements emphasis: output quality, behavioral consistency, information boundary (what data the agent can access). Governance: standard review gates. The blast radius of wrong output is bounded by the human review step.
Minimum requirements artifacts: behavioral envelope (Layers 1 and 3), evaluation suite for output quality, data access specification.
Tier 2 — Isolated execution (writes to branches, sandboxes, or staging environments; changes require review before promotion). Requirements emphasis: all Tier 1 requirements, plus: scope boundary (what the agent can modify), promotion criteria (what review must confirm before promotion), rollback specification.
Minimum requirements artifacts: Tier 1 artifacts + scope boundary document + review gate criteria + rollback procedure.
Tier 3 — Production-impacting (writes to production state, sends external communications, takes irreversible actions). Requirements emphasis: all Tier 2 requirements, plus: explicit human approval requirements (who can authorize, under what conditions), audit trail specification, rollback plan (pre-approved, not improvised), incident escalation path.
Minimum requirements artifacts: Tier 2 artifacts + human approval policy + audit trail specification + pre-approved rollback plan + incident escalation procedure.
EU AI Act obligations apply to high-risk systems operating at Tier 3. Human oversight requirements under the Act are not governance checkboxes — they are system design requirements. Specify: what interface enables operators to monitor, detect anomalies, and override outputs; what training or competency is required to exercise oversight effectively; what stop-operation procedure exists and how quickly it can be invoked; and what post-market monitoring captures for ongoing review. "Human approval" is insufficient as a Tier 3 requirement unless the approval mechanism itself is specified as part of the system.
Tier assignment is a requirements decision, not a deployment parameter. It must be made explicitly at the specification stage and documented in the behavioral specification. Tier assignment determines the governance overhead; an agent assigned to Tier 1 cannot subsequently be granted Tier 3 authority without a full re-specification and review cycle.
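Because the tiers are cumulative, the minimum artifact sets compose mechanically. A sketch of a tier-appropriate artifact check, where the artifact names paraphrase the lists above:

```python
# Artifacts introduced at each tier (names paraphrase the tier descriptions).
TIER_ARTIFACTS = {
    1: {"behavioral_envelope", "evaluation_suite", "data_access_spec"},
    2: {"scope_boundary", "review_gate_criteria", "rollback_procedure"},
    3: {"human_approval_policy", "audit_trail_spec",
        "preapproved_rollback_plan", "incident_escalation"},
}

def required_artifacts(tier: int) -> set:
    """Artifacts are cumulative: Tier 3 requires the Tier 1-3 sets."""
    return set().union(*(TIER_ARTIFACTS[t] for t in range(1, tier + 1)))

def missing(tier: int, present: set) -> set:
    return required_artifacts(tier) - present

print(sorted(missing(3, {"behavioral_envelope", "evaluation_suite"})))
```

A non-empty result at specification time is a governance gap, not a deployment detail to resolve later.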
7. Non-Functional Requirements for Agentic Systems
The NFR categories that require explicit treatment in agentic systems:
| NFR Category | Agentic-Specific Consideration | Specification Format |
|---|---|---|
| Reliability | Non-determinism means reliability must be stated as a distribution, not a point estimate | Probabilistic assurance target |
| Safety | Define the behavioral envelope Layer 1 hard boundaries explicitly | Hard requirement, infrastructure-enforced |
| Security | Agentic threat landscape includes: prompt injection (direct and indirect via tool outputs), context poisoning, memory poisoning, goal/behavior hijacking, over-permissioned connectors, privilege escalation, supply-chain risks in tool protocols, and identity abuse. Requires explicit threat model and defense-in-depth — prompt-level controls alone are insufficient. Specify: credential scope per connector, connector trust model and verification, memory integrity requirements, red-team cadence | Hard requirement per threat category + evaluation suite for injection resistance + review gate for connector authorization |
| Privacy | Data exposure via context window and memory requires explicit access boundaries | Hard requirement |
| Fairness / Bias | Output bias is a behavioral quality metric, not a binary | Probabilistic assurance target + evaluation distribution specification |
| Explainability | Traceability of agent reasoning to decision | Hard requirement (trace format) + probabilistic target (trace completeness) |
| Cost | Token consumption, compute cost per task | Probabilistic assurance target (p95 cost per run) |
| Latency | Time-to-completion distribution | Probabilistic assurance target (p50/p95/p99) |
| Regulatory compliance | EU AI Act (high-risk obligations apply on a staged timetable; verify the current application date and transitional rules for your use case): documented post-market monitoring, human oversight, logging of autonomous decisions, traceability | Hard requirements (documentation, logging, override capability) + process requirements (post-market monitoring plan) |
| Evolvability | Specifications must evolve without full re-derivation | Single-source principle compliance |
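A probabilistic assurance target is checked against an observed distribution, not a single run. A minimal sketch using nearest-rank percentiles; the sample values and latency targets are illustrative assumptions:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile; q in (0, 100]."""
    s = sorted(samples)
    k = max(0, math.ceil(q / 100 * len(s)) - 1)
    return s[k]

def check_assurance(samples, targets):
    """targets: e.g. {'p50': 10.0, 'p95': 30.0, 'p99': 60.0} (seconds).
    Returns per-target pass/fail against the observed distribution."""
    return {
        name: percentile(samples, float(name[1:])) <= limit
        for name, limit in targets.items()
    }

latencies = [4, 5, 6, 7, 9, 12, 18, 25, 28, 55]  # seconds per run (illustrative)
print(check_assurance(latencies, {"p50": 10.0, "p95": 30.0, "p99": 60.0}))
# -> {'p50': True, 'p95': False, 'p99': True}
```

The same shape applies to the cost row: replace latency samples with per-run cost and check the p95 target.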
8. Per-Requirement Checklist
For each requirement in a Tier 2+ agentic system, verify:
- Type declared: Is this a hard requirement or a probabilistic assurance target? Is the type correct for the system type?
- Consumer declared: Is the consumer human, agent, or hybrid? Is the format appropriate?
- Axis classification: Is the system type (deterministic/agentic/hybrid) and consumer type documented?
- Traceable to source: Does this requirement trace to a named section in the canonical behavioral specification?
- Verifiable: Can this requirement be verified by an evaluation or test? Is the evaluation defined?
- Tier-appropriate: Is the governance overhead appropriate for the tier?
- Hard boundaries infrastructure-enforced: If this is a hard boundary, is it enforced by infrastructure policy, not prompt instruction?
- Probabilistic targets have distributions: If this is a probabilistic assurance target, is the evaluation distribution specified?
- Memory governance addressed: If the agent has persistent memory, is the adaptation envelope (Layer 4) specified?
- Rollback defined: If this is Tier 3, is the rollback procedure pre-approved and documented?
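Several of these checklist items are mechanically checkable if requirements are stored as structured records. A sketch of such a linter; the field names are illustrative assumptions, not a mandated schema:

```python
def lint_requirement(req: dict) -> list:
    """Return checklist violations for one Tier 2+ requirement record."""
    problems = []
    if req.get("type") not in {"hard", "probabilistic"}:
        problems.append("type not declared as hard or probabilistic")
    if req.get("consumer") not in {"human", "agent", "hybrid"}:
        problems.append("consumer not declared")
    if not req.get("source_section"):
        problems.append("no trace to canonical specification")
    if req.get("type") == "hard" and not req.get("infra_enforced"):
        problems.append("hard boundary not infrastructure-enforced")
    if req.get("type") == "probabilistic" and not req.get("eval_distribution"):
        problems.append("no evaluation distribution specified")
    return problems

print(lint_requirement({"type": "hard", "consumer": "agent",
                        "source_section": "hard-boundaries"}))
# -> ['hard boundary not infrastructure-enforced']
```

Items that need human judgment, such as whether governance overhead fits the tier, stay in the manual review; the linter only clears the mechanical ground first.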
9. Connection to the Manifesto
This framework is an extension of Principle 2 (Specifications are living artifacts that evolve through steering). It provides the vocabulary and structure that Principle 2 requires but does not define.
The behavioral envelope (Section 4) operationalizes Principle 5 (tiered autonomy): every tier has a corresponding behavioral envelope scope.
The single-source principle (Section 5) operationalizes Principle 9 (observability as infrastructure): when requirements are single-source, the audit trail is coherent.
The tiered lifecycle (Section 6) maps directly to the autonomy tiers in Principle 5 and the blast radius management framework in Principle 10.
The probabilistic assurance targets (Section 3) operationalize Principle 8 (evaluations are the contract): the evaluation contract is the assurance target, not a binary test assertion.
10. Academic References
- arXiv:2602.22302 — Agent Behavioral Contracts. Formal specification of agent behavior using pre/postconditions adapted for probabilistic systems. Provides mathematical grounding for the behavioral envelope concept.
- arXiv:2503.18666 — AgentSpec (ICSE 2026). A domain-specific language for specifying and enforcing runtime constraints on LLM agents. Rules consist of triggers, predicates, and enforcement mechanisms that intercept agent actions before execution. Relevant for encoding Layer 1 and Layer 2 behavioral envelope constraints in machine-executable form. Note: AgentSpec is a runtime enforcement tool, not a requirements specification format; it operationalizes the agent-consumer column of the two-axes matrix rather than replacing the requirements specification itself.
- NIST AI 600-1 (July 2024) — NIST AI RMF Generative AI Profile. Risk category taxonomy for generative AI systems. Maps to Layer 1 and Layer 2 of the behavioral envelope.
- ISO/IEC 5338 — AI system life cycle processes. International standard for requirements engineering in AI systems. The tiered lifecycle (Section 6) aligns to ISO/IEC 5338's risk-based process tailoring.
- ISO/IEC 42001 — AI management systems. International standard for governance, performance evaluation, monitoring, and continual improvement of AI systems. The single-source principle (Section 5) and tiered lifecycle (Section 6) align to ISO/IEC 42001's documentation and change management requirements.
- EU AI Act (Regulation (EU) 2024/1689) — Obligations for high-risk AI systems enforceable from 2 August 2026. Relevant provisions: post-market monitoring systems, human oversight measures, logging of autonomous decisions, technical documentation. For agentic systems: wrapping a foundation model in an orchestration layer can constitute a substantial modification triggering full provider obligations. Regulatory compliance row in Section 7 maps to these obligations.
- arXiv:2603.03823 — SWE-CI benchmark (Sun Yat-sen University & Alibaba Group, 2026). Found that most evaluated models achieve zero-regression rates below 0.25, meaning regressions were introduced across the majority of long-horizon maintenance tasks — validating the need for probabilistic assurance targets and independent evaluation rather than point-in-time binary testing.
How this manifesto can fail, and the skills teams need to implement it.
Read the Manifesto for the core values and minimum bars. See the Companion Guide for the full table of contents. See the Adoption Playbook for organizational change management, role transitions, and pilot design.
Failure Modes of This Manifesto
These are failure modes of the manifesto's technical implementation. For failure modes of the organizational change process (adoption without support, incentive mismatch, skipping phases), see the Adoption Playbook.
Applied poorly, this manifesto can fail through:
Over-governance: Constraints so heavy that human coding becomes faster. The tell: lead time increases without corresponding quality improvement. The fix: reduce ceremony, widen Tier 1/Tier 2 boundaries, and measure whether governance overhead is justified by incident reduction.
Evidence theater: Large bundles with low signal. Teams produce voluminous evidence artifacts that nobody reads and that don't catch real failures. The tell: evidence bundle size grows while escaped defect rate stays flat. The fix: audit which evidence artifacts actually influenced a decision in the last quarter. Cut the rest.
Control theater: Humans nominally accountable but operationally blind. A named domain owner "approves" changes they cannot meaningfully review because volume exceeds capacity. The tell: approval latency drops to near zero (rubber-stamping). The fix: reduce autonomy scope until review is meaningful, or invest in automated pre-screening that surfaces only the exceptions worth human attention.
Security theater: Policies documented but not enforced at tool/runtime boundaries. The architecture describes constraints that no infrastructure actually blocks. The tell: agents violate documented policies with no system-level detection. The fix: enforce before you document — if the infrastructure can't block it, the policy is aspirational, not real.
Adoption theater: Teams adopt the manifesto's vocabulary without its discipline. Evidence bundles are renamed PR descriptions. Autonomy tiers are defined but not enforced. Maturity self-assessments are aspirational. The tell: the language changes but incident patterns don't. The fix: measure outcomes (escaped defect rate, incident severity, rollback frequency), not adoption checkboxes.
Maturity inflation: Teams self-assess at Phase 4 or 5 because the phase descriptions are aspirational enough to pattern-match to current practice. The tell: a team claims Phase 4 but cannot produce an evidence bundle for a recent change. The fix: use the phase-calibrated evidence examples (Operational Definitions) as a litmus test — the evidence you can actually produce determines your phase, not the practices you intend to adopt.
Verification without validation: Every gate passes, evidence bundles are complete, escaped defect rate is low — but the team ships the wrong thing. The specification was never worth implementing, and the manifesto's verification machinery confirmed the implementation was correct without anyone confirming it was valuable. The tell: system quality metrics improve while business outcome metrics (adoption, usage, revenue impact, customer satisfaction) stay flat or decline. The fix: treat the Agentic Loop's Observe → Learn phases as validation checkpoints — connect evaluation results to business outcomes, define stop criteria (not just acceptance criteria) for every specification, and make business assumptions explicit before the Loop begins. See the Validation vs. Verification section in P2 extended guidance.
Structural regression without detection: Every change passes current tests, regression suites are green, evidence bundles are complete — but the codebase is progressively harder to maintain. Each iteration's decisions (naming conventions, dependency structures, architectural choices) create friction that compounds across subsequent iterations. The code is locally correct but globally harmful. The tell: iteration-over-iteration regression frequency rises, time per change increases, and specification convergence slows — all while current test pass rates remain high. The fix: track evolution-weighted metrics (see EvoScore in Operational Definitions), monitor coupling and dependency trajectories across iterations, and include structural quality indicators in evaluation portfolios alongside behavioral regression tests. See the Structural Regression section in P8 extended guidance. The SWE-CI benchmark (arXiv:2603.03823) provides empirical evidence: most agents introduce regressions in over 75% of CI iterations, many of which are structural rather than behavioral.
The corrective action is always the same: reduce ceremony, increase signal, and measure cycle time, defect rate, and incident severity together.
Skill Requirements by Principle
Not all principles require the same skills. This table helps teams identify capability gaps before they become adoption blockers. See the Adoption Playbook for guidance on building these capabilities.
| Principle | Core Skill Required | Team Readiness | Notes |
|---|---|---|---|
| P1 — Outcomes | CI/CD, release engineering | Ready | Existing pipelines need extension, not replacement |
| P2 — Specifications | Formal requirements, contract design | Reorient | Requirements skills exist but need machine-readable precision. Agent Skills, AGENTS.md, and specification-driven development frameworks provide concrete formats and workflows |
| P3 — Architecture | Infrastructure engineering, policy-as-code | Reorient | Infra skills exist but policy-as-code enforcement is new |
| P4 — Swarm Topology | Distributed systems design | Acquire | Few teams have multi-agent coordination experience. A2A protocol provides emerging standards for agent discovery and task delegation |
| P5 — Autonomy | Security engineering, access control | Reorient | Access control exists but agent-specific tier enforcement is new. Infrastructure-level policy systems (YAML-based permissions, audit logs, guardrail constraints) offer reference implementations |
| P6 — Knowledge & Memory | Data engineering, information retrieval | Acquire | Memory governance (provenance, expiration, rollback) is a new discipline. Git-native agent memory systems provide early reference architectures |
| P7 — Context | ML/retrieval engineering, context engineering | Acquire | Retrieval engineering at agent scale requires specialized skills. Agent-to-tool protocols, capability definitions, and agent memory systems form an emerging tooling ecosystem |
| P8 — Evaluations & Proofs | Test engineering, formal methods | Split | Test engineering: ready. Formal methods: acquire (and defer until Phase 5) |
| P9 — Observability | SRE, distributed tracing | Reorient | SRE exists but agentic traces require new schema and tooling. Emerging interoperability standards under neutral governance (AAIF) provide the foundation |
| P10 — Emergence | Chaos engineering, security | Acquire | Chaos engineering for agentic systems has no established playbook. Early autonomous agent security incidents provide case studies |
| P11 — Economics | FinOps, cost optimization | Reorient | FinOps exists but total-cost-of-correctness models are new |
| P12 — Accountability | Incident management, compliance | Ready | Incident management extends naturally; compliance may need updates |
Reading the Readiness column:
- Ready: The skill exists and applies with minor extension.
- Reorient: The skill exists but must be redirected toward agentic concerns. Training and practice are sufficient; hiring is not required.
- Split: Part of the skill is ready; part must be acquired separately.
- Acquire: The skill is rare or nonexistent in most teams. Requires hiring, dedicated training, or partnering with specialists.
Principles marked "Acquire" are the adoption bottlenecks. Do not attempt these at full depth without investing in the skill. Start with the "Ready" and "Reorient" principles (P1, P2, P3, P5, P9, P11, P12) and build toward the harder ones incrementally. The Adoption Playbook maps these skills to specific phase transitions.
Annotated Agent Configuration Template
Every project needs an agent configuration file (commonly named AGENTS.md,
CLAUDE.md, or similar depending on tooling). Neither the manifesto nor most
tooling documentation provides a starting point. Use this template — adapt it,
do not just copy it. Annotations explain what each section must contain and
whether it is mandatory per CoE policy.
# [Project Name] — Agent Instructions
## Scope and Version
<!-- RECOMMENDED. Establish ownership and applicability before the agent reads further. -->
Owner: [name or team]
Last updated: [date]
Applicable systems: [which services, repos, or pipelines this file governs]
## Project Overview
<!-- MANDATORY. 3-5 lines. What does this service do? What domain does it own?
What is its upstream/downstream position in the system? -->
[Service name] is responsible for [core function]. It owns [domain boundary].
Upstream: [what feeds into it]. Downstream: [what consumes its output].
Stack: [language, framework, runtime].
## Build, Test, Deploy Commands
<!-- MANDATORY. Agents must be able to run these without asking. -->
Build: [command]
Test: [command] # Must exit 0 before any PR
Lint: [command]
Deploy: [command or "see CI pipeline — do not deploy manually"]
## Domain Constraints
<!-- MANDATORY. What must this agent never do in this codebase? -->
- Never modify [schema/table/config] without a migration file and a rollback.
- Never call external APIs directly — use the adapter layer at [path].
- Never generate pricing, underwriting, or claims logic — flag for human review.
- [Any other non-negotiable domain boundary]
## Security
<!-- MANDATORY. Do not duplicate enterprise-wide policy here; link to the
governing file instead. Add only project-specific security constraints. -->
Follows enterprise security rules. Project-specific additions:
- All [entity type] inputs must be validated against [schema/contract] at [path].
- [Any project-specific credential or secret handling requirement]
## Testing Conventions
<!-- MANDATORY. Agents must know how tests are structured before writing them. -->
Test location: [path pattern]
Naming: [convention, e.g., describe/it or TestFunctionName_Scenario]
Mocking: [approved mock strategy — real DB / in-memory / stub]
Coverage threshold: [minimum %, matches hook threshold]
## Commit and PR Conventions
<!-- MANDATORY. -->
Commit format: [conventional commits / other]
PR title: [format]
Every agent-assisted commit must include: "Co-Authored-By: [agent-id]"
## Architecture Notes
<!-- RECOMMENDED. Key decisions agents must respect. Keep brief. -->
- [ADR reference or one-line constraint, e.g., "hexagonal architecture — no
framework code in domain layer"]
- [Data flow constraint, e.g., "all writes go through the command bus at [path]"]
## MCP Integrations in Use
<!-- RECOMMENDED. List approved MCPs available in this project. -->
- [MCP name]: [what it does, what data classification it can access]
## What NOT to Put Here
<!-- Advisory — for the human writing this file -->
Do not include: credentials, environment variable values, hostnames, IPs,
information that belongs in enterprise rules (already loaded), information
that should be in a path-scoped rule file.
Do not exceed 200 lines. Use @path/to/file imports for larger reference docs.
CoE review checklist for project agent configuration file:
- Project Overview: domain boundary clearly stated
- Build/test/deploy commands: all present and tested
- Domain Constraints: no overlap with enterprise-managed agent configuration
- Security section: references enterprise rules rather than duplicating them
- Testing Conventions: coverage threshold matches hook threshold
- No credentials, hostnames, or environment-specific values
- Under 200 lines
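Parts of this checklist can be automated as a pre-review lint. A rough sketch: the heading list mirrors the template above, and the secret patterns are crude illustrations, not a complete scanner.

```python
import re

MANDATORY = ["## Project Overview", "## Build, Test, Deploy Commands",
             "## Domain Constraints", "## Security",
             "## Testing Conventions", "## Commit and PR Conventions"]
SECRET_PATTERNS = [r"(?i)api[_-]?key\s*[:=]",      # inline API keys
                   r"(?i)password\s*[:=]",         # inline passwords
                   r"\b\d{1,3}(?:\.\d{1,3}){3}\b"]  # crude IPv4 match

def review(text: str) -> list:
    """CoE pre-review findings for a project agent configuration file."""
    findings = []
    lines = text.splitlines()
    if len(lines) > 200:
        findings.append(f"file is {len(lines)} lines (limit 200)")
    for heading in MANDATORY:
        if heading not in text:
            findings.append(f"missing mandatory section: {heading}")
    for pat in SECRET_PATTERNS:
        if re.search(pat, text):
            findings.append(f"possible credential/host material: {pat}")
    return findings
```

Checks that need context, such as overlap with enterprise-managed configuration or whether the coverage threshold matches the hook threshold, remain with the human reviewer.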
Cross-Domain Supplier and Vendor Qualification
Every regulated domain requires qualification of critical suppliers of software systems. In agentic engineering, "supplier" is an ambiguous category — LLM providers, open-source frameworks, agent runtimes, and tool integrations all fall into scope. This section provides a cross-domain synthesis; domain documents provide the regulatory specifics.
Who Is the Supplier?
| Component | Supplier Type | Qualification Obligation | Key Issue |
|---|---|---|---|
| Commercial LLM API (OpenAI, Anthropic, etc.) | Named vendor with terms of service | Vendor assessment: data handling, version notification, SLA, incident notification | No access to training data, model weights, or full anomaly documentation. Regulatory expectations were written for traditional software suppliers. |
| Open-source foundation model (Llama, Mistral, etc.) | No identified supplier entity | Deploying organization assumes full supplier responsibility: validation, maintenance, version control, anomaly tracking, incident response | No quality agreement possible. The QMS burden falls entirely on the deployer. |
| Agent framework / orchestration library | OSS or commercial | Same as above, based on licensing model | Framework updates may change agent behavior without semantic versioning signals |
| MCP tool integrations | Varies | Each tool integration is a system boundary requiring supplier qualification appropriate to the data classification it can access | External API access expands the effective supply chain |
| Agent memory infrastructure | Internal or vendor | Internal: first-party governance. Vendor: assess data residency, backup/recovery, retention controls | Memory stores may hold regulated data; the store's supplier must be qualified accordingly |
The Open-Source Supplier Problem
GAMP 5 (pharma), ISO 13485 (medical devices), and SR 11-7 (financial services) assume an identifiable supplier with a quality system. Open-source foundation models have no such entity. The deploying organization must formally document that it assumes supplier responsibilities. This is not optional — it is the regulatory consequence of the build decision.
Documentation required:
- Assumption of supplier responsibilities: A formal record stating that the organization assumes full validation, maintenance, monitoring, anomaly tracking, and incident response responsibilities for the open-source model.
- Version management plan: How model versions are tracked, tested before upgrade, and rolled back if needed.
- Anomaly tracking: How the organization monitors community-reported issues and assesses impact on its validated use cases.
- Exit strategy: How the organization would migrate to a different model if the open-source project is abandoned or compromised.
Cross-Domain Qualification Minimum Requirements
Regardless of domain, agent supplier qualification should address:
| Requirement | Why It Matters | Minimum Evidence |
|---|---|---|
| Data handling and residency | Regulated data must not leave compliant infrastructure | Data processing agreement or on-premises deployment confirmation |
| Version notification | Model updates change agent behavior | Version change notification procedure with minimum lead time |
| Availability SLA | Agent unavailability is an ICT operational risk | SLA documentation with incident notification commitments |
| Security posture | Agent infrastructure is an attack surface | Security assessment (SOC 2, ISO 27001, or equivalent) |
| Sub-processor visibility | Data may pass through additional third parties | Sub-processor list and flow-down requirements |
| Exit strategy | Concentration risk requires mitigation | Multi-model routing plan (P11) as DORA/third-party risk mitigation |
Multi-vendor routing as qualification simplification. P11's multi-model routing strategy (routing tasks to the cheapest capable model) also reduces supplier qualification burden by preventing dangerous concentration in a single provider — a regulatory requirement under DORA for financial services, and a prudent risk management practice in all regulated domains. Each provider still requires qualification, but no single provider's failure can take down the entire capability.
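A router that picks the cheapest capable provider while surfacing concentration risk can be sketched in a few lines. Provider names, capability scores, and costs below are invented for illustration:

```python
# Qualified providers with a capability score and per-1K-token cost (illustrative).
PROVIDERS = [
    {"name": "provider-a", "capability": 0.90, "cost": 0.50, "qualified": True},
    {"name": "provider-b", "capability": 0.75, "cost": 0.10, "qualified": True},
    {"name": "provider-c", "capability": 0.95, "cost": 1.20, "qualified": False},
]

def route(task_difficulty: float):
    """Cheapest qualified provider whose capability meets the task."""
    viable = [p for p in PROVIDERS
              if p["qualified"] and p["capability"] >= task_difficulty]
    if not viable:
        raise LookupError("no qualified provider meets the capability floor")
    if len(viable) == 1:
        # Concentration signal: no failover exists for this task class.
        print(f"warning: single-provider dependency at difficulty {task_difficulty}")
    return min(viable, key=lambda p: p["cost"])

print(route(0.7)["name"])   # provider-b (cheapest viable)
print(route(0.85)["name"])  # provider-a, after a concentration warning
```

Note that the unqualified provider is excluded before cost enters the decision: qualification gates routing, it does not merely document it.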
Ecosystem References
This guide references standards and tools that are evolving rapidly. Rather than duplicate descriptions that will age, we list the categories that matter and point to the authoritative sources.
Standards under AAIF governance: MCP (agent-to-tool), A2A (agent-to-agent), Agent Skills (capability definition), AGENTS.md (repository-level constraints). The Agentic AI Foundation, launched December 2025 under the Linux Foundation, provides neutral governance across these protocols.
Specification-driven development frameworks: Multiple open-source frameworks enforce the specification-first workflow described in P2: specify before implementing, treat specs as code artifacts, and consume them at agent runtime. See Sources refs 43–47 for specific projects.
Memory and coordination infrastructure: Git-native agent memory systems, autonomous agent runtimes with infrastructure-level policy enforcement, and continuous integration benchmarks for structural regression. See Sources refs 40–42 for specifics.
The manifesto does not endorse specific tools. Its contribution is the governance model that applies across them. The Sources file carries the dated references; this guide carries the principles.
How to adopt the Agentic Engineering Manifesto in your organization: incremental steps, role evolution, change management, and success metrics.
Read the Manifesto for the core principles. Read the Companion Guide for implementation depth and worked patterns. Use this playbook to plan and drive the organizational change.
Making the Business Case
Agentic engineering is a technical discipline. But the organizational decision to adopt it — and sustain it through the J-curve dip before returns materialize — is a business decision that requires a business case.
The Competitive Logic
The organizations leading on AI are not winning because they have access to better models. The same foundation models are broadly available. They are winning because they can apply those models faster, with less risk, and at greater scale than competitors who are still governing AI with processes designed for human developers.
That advantage compounds. A team that verifies faster ships faster. A team that ships faster learns faster. Better learning sharpens specifications, which improves agent output, which reduces rework, which frees capacity for higher-value work. The Agentic Loop, run well, is a compounding return on engineering investment — not a one-time productivity gain.
The teams that build this flywheel early widen the gap continuously. The question for decision-makers is not "should we invest in agentic engineering?" but "how long can we afford not to?"
Stage-Gated Investment Model
Agentic engineering adoption is stage-gated investment, not a single project. Each phase transition has a distinct investment profile and return horizon:
| Phase transition | Investment character | Return horizon | Key go/no-go signal |
|---|---|---|---|
| Phase 1→2 (exploration → assisted delivery) | Low: tooling licenses, standardization time | Immediate: measurable cycle time reduction on assisted tasks | AI suggestions accepted at a materially positive rate without increasing rework |
| Phase 2→3 (assisted → agentic prototyping) | Low-medium: specification discipline, review process | 1–2 months | Agent outputs consistently reviewable; rework rate tracked |
| Phase 3→4 (prototyping → governed delivery) | Medium: evidence pipeline, evaluation suite, domain boundary encoding | 2–4 months | Evidence completeness ≥95%; escaped defect rate ≤ human baseline |
| Phase 4→5 (governed → engineering scale) | Significant: platform ownership, memory governance, multi-domain expansion | 4–8 months | Total cost of correctness declining per outcome; oversight load stable |
Treat these as starting signals, not universal thresholds. Calibrate against your domain baseline and risk class.
Do not fund the next phase until the current phase has produced evidence that justifies it. Organizations that invest in Phase 4 governance infrastructure before they have Phase 3 evidence that agents produce reviewable output create bureaucracy, lose team confidence, and stall. The correct sequence: prove the model in one domain, then replicate. Replication is cheap once the model is proven; the cost of skipping the proof step is not recoverable.
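The go/no-go signals in the table reduce to a simple gate function. A sketch for the Phase 3 to 4 transition; the thresholds are the table's starting signals and should be calibrated per domain:

```python
def phase_gate_3_to_4(evidence_completeness: float,
                      escaped_defect_rate: float,
                      human_baseline_defect_rate: float) -> bool:
    """Go/no-go for the Phase 3 -> 4 transition.
    Evidence completeness >= 95% and escaped defects at or below
    the human baseline, per the table's starting signals."""
    return (evidence_completeness >= 0.95
            and escaped_defect_rate <= human_baseline_defect_rate)

print(phase_gate_3_to_4(0.97, 0.012, 0.015))  # True: fund the transition
print(phase_gate_3_to_4(0.91, 0.012, 0.015))  # False: stay in Phase 3
```

The point is not the arithmetic; it is that the gate is evaluated against recorded evidence rather than asserted readiness.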
Business Outcome Metrics
Frame investment returns in business terms, not engineering activity:
Cycle time reduction. Time from specification to verified deployment. Target: halving cycle time for governed changes by Phase 4. This directly enables faster product iteration and competitive response.
Escaped defect rate. Post-release fixes cost 5–10× pre-release fixes. Every percentage-point reduction compounds into reduced incident cost, reduced remediation overhead, and reduced reputational risk.
Senior talent leverage. Risk-tiered verification routes low-risk changes through automated evidence pipelines, freeing senior engineers for architecture, evaluation design, and high-risk review. Track hours redirected from low-value review to high-leverage work.
Total cost of correctness. The full cycle cost: inference + verification + governance overhead + incident remediation. This replaces story points and velocity as the primary economic signal. Track it per domain, per phase, per quarter. If it is not declining, the phase transition has not delivered.
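To make the metric operational rather than rhetorical, the full cycle cost can be computed per verified outcome. A minimal sketch — the four cost components mirror the definition above, but the field names and figures are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class CycleCosts:
    """Illustrative cost components for one delivery cycle, in currency units."""
    inference: float      # model/API spend
    verification: float   # evaluation runs, CI compute, human review hours priced in
    governance: float     # evidence pipeline upkeep, audits, policy maintenance
    remediation: float    # incident response and post-release fixes

def cost_of_correctness_per_outcome(costs: CycleCosts, verified_outcomes: int) -> float:
    """Total cost of correctness, normalized per verified outcome.

    A value that declines quarter-over-quarter is the ROI signal the
    playbook looks for at phase transitions.
    """
    total = costs.inference + costs.verification + costs.governance + costs.remediation
    return total / max(verified_outcomes, 1)

# Example: compare two quarters for one domain (numbers invented for illustration).
q1 = cost_of_correctness_per_outcome(CycleCosts(1200, 3400, 900, 2500), verified_outcomes=40)
q2 = cost_of_correctness_per_outcome(CycleCosts(1500, 3000, 950, 1100), verified_outcomes=55)
assert q2 < q1  # the phase transition is delivering only if this holds
```

Tracked per domain and per quarter, this single number replaces velocity as the economic signal without requiring any new instrumentation beyond cost attribution.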
The New Way of Working
Humans express intent as specifications with constraints and acceptance criteria — then refine those specifications as evidence accumulates. They encode architecture as enforceable, monitored domain boundaries. They set autonomy tiers appropriate to risk. They own outcomes and remain accountable. They do not supervise every intermediate step — they define what success looks like, verify that the system achieved it, and inspect the reasoning when it matters.
Agents decompose specifications into executable tasks. They execute within domain boundaries, right-sized to complexity. They verify their own outputs against evaluations. They report evidence, not assertions. They learn from failure and encode that learning in memory — with provenance, so the system knows where every lesson came from.
Systems maintain persistent knowledge and curated learned memory. They route work to appropriate model tiers based on cost and quality requirements. They enforce architectural constraints at runtime and monitor for violations. They observe behavior, surface anomalies, and maintain the feedback loops that make everything else work. They forget what no longer serves them.
Converting Agile Ceremonies to Agentic Practice
Teams converting from Agile face a specific organizational challenge: the ceremonies are load-bearing. They are not decoration. They synchronize teams, surface blockers, and create accountability. Abolishing them without replacing the function they serve produces confusion and regression. The question is not whether to keep the ceremonies — it is what mechanism replaces each function.
The table below maps the core Agile ceremonies to their agentic equivalents. The intent of each ceremony is preserved; the mechanism changes to match machine-speed, evidence-based execution. These are starting points, not mandates. Adapt to the team's phase maturity and domain constraints.
| Agile Ceremony | Intent | Agentic Equivalent | Mechanism |
|---|---|---|---|
| Sprint Planning | Agree on scope and how to build it | Spec Refinement & Tier Assignment | Domain owner and leads convert backlog items into machine-readable specifications with autonomy tier assignments and blast-radius classifications. Ambiguous items are decomposed until unambiguous — not estimated. The plan artifact is a specification, not a story-point count. |
| Daily Standup | Synchronize status and surface blockers | Trace Audit & Anomaly Review | Daily review of structured traces from the prior period. Tasks with unexpected tool calls, evaluation failures, or cost spikes are flagged for root-cause. The traces are the status; there is no verbal report. The review surfaces behavioral drift before it compounds into a hallucination loop. |
| Sprint Review | Demonstrate completed work to stakeholders | Evidence Bundle Review | Completed work is presented via evidence bundles: diffs, trace IDs, evaluation results, policy check outputs. Stakeholders review outcomes and audit quality, not demos. "The agent said it worked" does not pass review. |
| Retrospective | Reflect on process and improve it | Memory Curation & Skill Promotion | Review the knowledge base and learned memory from the cycle: what heuristics held, what failed, what should be promoted to reusable skill artifacts. Stale memory is pruned. Recurring failure patterns become new evaluation cases. The retro artifact is a memory diff, not a list of action items. |
| Backlog Refinement | Clarify and prioritize upcoming work | Specification Sharpening | Upcoming specifications are reviewed for constraint completeness, risk assignment, and observable success criteria. Items without measurable success criteria are not pulled into the next Spec Refinement cycle. |
| Release Planning | Coordinate cross-team work for a release | Governance Checkpoint | Cross-domain review of autonomy tier assignments, blast-radius gates, and evidence bundle completeness for all release-bound changes. The domain owner (P12) confirms accountability assignment before deployment. |
The failure mode to avoid. Teams that attempt to run Agile ceremonies unchanged alongside agentic workflows typically end up with two parallel processes: a legacy Agile process for the humans and an ungoverned agentic process alongside it. Both degrade. The table above collapses the two into one evidence-based, specification-driven workflow.
Phase calibration. At Phase 1–2, the Standup → Trace Audit conversion may be partial: teams are still building trace infrastructure. Start with a hybrid (brief verbal check plus whatever traces exist) and migrate fully once tracing is reliable. Do not adopt the full ceremony mapping before the infrastructure can support it.
Contents
Roles and the Human Side
How roles evolve (Developers, Tech Leads, QA Engineers, Operations Engineers) and the human dimension of the transition: naming the loss, the supervision paradox, the acceleration trap, sustainable pace, and protecting the junior pipeline.
Adoption Path and Phase Transitions
The six-step incremental adoption path (technical infrastructure for Phase 3+) and organizational change guidance for every phase transition from Phase 1→2 through Phase 5→6.
Resistance, Politics, and Your First Pilot
Navigating organizational friction (productivity dip, velocity metrics, cost conversation, incentive misalignment) and a concrete guide for running your first governed pilot.
Success Metrics and Failure Modes
Metrics by phase transition, team health indicators, quarterly review cadence, and common failure modes of the organizational change program.
How roles evolve and how to manage the human dimension of the transition.
Read the Manifesto for the core principles. See the Adoption Playbook for the full table of contents. See the Adoption Path for incremental steps and phase transitions.
How Roles Evolve
The transition from writing code to steering agents changes what each role owns day-to-day. This is not a minor adjustment. It is a fundamental shift in professional identity that must be named, supported, and managed — not imposed silently.
These role descriptions show one likely trajectory from current state toward Phase 5 (Agentic Engineering). The shift is progressive — no one wakes up in the end state. At Phase 2, developers still write most code; at Phase 3, they begin delegating and reviewing; at Phase 4, specifications and evidence become the primary work product. Read these as a direction of travel, not a before/after switch.
Developers
Before (Phase 1–2): Own code quality through implementation. Write features, fix bugs, review peers' code. At Phase 2, AI assists with suggestions and completions, but the developer remains the author. Professional identity is rooted in craftsmanship — the ability to think through a problem and express a solution precisely in code.
Transition (Phase 3–4): Begin delegating bounded tasks to agents. Write informal specifications with acceptance criteria. Review agent-generated output — initially every line, increasingly by evaluating evidence bundles as evaluation suites mature. Still write code directly for complex or ambiguous work where specification would cost more than implementation.
After (Phase 5): Own specification quality, constraint encoding, and outcome acceptance. Write machine-readable acceptance criteria and constraints. Review agent-generated diffs against specifications. Accept or reject outcomes based on evidence bundles. The core skill shifts from writing code to expressing intent precisely enough that agents can execute it, then refining intent based on evidence.
What this means in practice: The shift is gradual. At Phase 3, a developer might spend 70% of their time writing code and 30% reviewing agent output. By Phase 4, that ratio inverts for routine work. By Phase 5, the primary work product is the specification and the evaluation — implementation is delegated. But even at Phase 5, developers still read, understand, and occasionally write code. The skill doesn't disappear; it becomes the foundation for a harder skill.
The identity challenge: Many engineers became engineers because they love writing code. The shift to steering agents can feel like being told the skill they spent years mastering is suddenly less important. This is not imaginary — it is a real loss of craftsmanship that leaders must acknowledge. The new role is not lesser; it requires different and often harder skills (system-level reasoning, precise specification, critical evaluation of code you didn't write). But the transition needs support, not just announcement.
Tech Leads
Before: Own architectural decisions, code review standards, and technical direction. Mentor junior engineers through code review and design discussions.
After: Own domain boundaries, decision records, topology choices, and conflict-resolution rules. Design constraints that keep multi-agent collaboration reliable under load. The core skill shifts from reviewing individual code quality to designing system-level governance.
What this means in practice: Tech leads spend less time in code review and more time in constraint engineering: defining what agents may and must not do, choosing swarm topologies, and designing the evaluation portfolios that verify agent output at scale.
QA Engineers
Before: Own test plans, manual testing, and test automation. Verify that code behaves as specified through structured test execution.
After: Own evaluation portfolios, adversarial coverage, formal-invariant checks where needed, and evidence gates. The core skill shifts from executing tests to defining the contract between intent and behavior in machine-verifiable terms.
What this means in practice: QA engineers become the architects of the verification pyramid. They design evaluation suites that agents run autonomously, define adversarial test cases that probe agent behavior under stress, and set the evidence thresholds that gate promotion of changes from branch to production.
Operations Engineers
Before: Own deployment pipelines, monitoring, incident response, and infrastructure reliability.
After: Own behavioral observability, cost routing, memory governance, runtime safety, and chaos drills. The core skill shifts from keeping infrastructure running to keeping the feedback loop honest under real-world conditions.
What this means in practice: Operations engineers own a new category of infrastructure: agent runtime, memory stores, retrieval systems, and routing layers. They monitor not just uptime but behavioral drift, cost anomalies, and evaluation regression. Incident response expands to include agent-specific failure patterns (hallucination loops, memory poisoning, tier violations).
Talent Density and Organizational Design
Role evolution tells you what people do differently. Talent density tells you how many people of what kind you need to build an organization that can actually deliver this. These are separate questions, and confusing them is how organizations end up with the right job descriptions but the wrong structure.
The Build-vs-Buy Decision by Phase
The default assumption — outsource early, build in-house later — is correct in principle but often applied too late. The governance capabilities at the core of agentic engineering (evaluation design, memory governance, autonomy tier management, observability of reasoning) are not purchasable as a service. They must be built as organizational muscle, and that requires in-house practitioners who own the outcomes.
A practical guide by phase:
| Phase | In-house minimum | Where external help makes sense | What must not be outsourced |
|---|---|---|---|
| Phase 1–2 | Core engineering team using AI tools; no specialist role needed | AI tool vendor support; training | Judgment on which AI outputs are acceptable |
| Phase 3 | At least one engineer who owns specification quality | Tool configuration, infrastructure setup | Specification writing; failure pattern documentation |
| Phase 4 | Domain owners; QA lead owning evaluation suite; one ops engineer owning observability | Platform infrastructure, CI/CD pipeline build | Evidence gate design; autonomy tier policy; incident response |
| Phase 5 | Platform team (3–5 engineers): agent runtime, memory governance, routing; evaluation guild; security lead for agent threat model | Specialized formal methods expertise (targeted, time-bounded) | All governance roles; evaluation ownership; incident accountability |
The practical target by Phase 4: the majority of people doing agentic delivery work are in-house, the majority of those are practitioners who build and own outcomes (not coordinators or oversight layers), and the majority of those are operating at a competent-or-above level in their role. Organizations that invert this — heavy external dependency, high coordinator-to-practitioner ratio, or large numbers of engineers operating below the competency threshold for agentic work — will not reach Phase 5. The governance infrastructure requires practitioners who understand what they are governing.
Team Size and Composition by Phase
Agentic engineering does not scale the way traditional software teams scale. Adding headcount at Phase 3 before governance infrastructure exists creates coordination problems that compound with agent output volume. The right trajectory is:
Phase 1–3: Small, high-trust teams (3–8 people). The primary bottleneck is governance design, not delivery throughput. Adding people before governance patterns are established creates more output to govern, not more governance capacity.
Phase 4: Governance roles become explicit. Minimum viable structure: a domain owner per active agent domain, a QA lead owning the evaluation portfolio, and one platform/operations owner. Total team size for a single pilot domain: 5–10 people including these roles.
Phase 5+: Platform team separates from delivery teams. Shared infrastructure (evaluation registry, trace standards, routing layer, memory governance) is owned by a dedicated platform function, not embedded in each delivery team. Delivery teams remain small (5–8 people each) and multiply across domains, sharing platform infrastructure. Scale comes from replicating the governed delivery model across domains, not from growing individual teams.
The Skill Density Requirement
The transition to agentic engineering concentrates the value of high-skill practitioners. A senior engineer who can write precise machine-readable specifications, design adversarial evaluation cases, and reason about blast radius is more valuable in a Phase 4 team than in a traditional team — because their work governs an agent that produces the output of several engineers. A junior engineer who cannot yet write reviewable specifications creates bottlenecks, not throughput.
This creates a real organizational challenge: the skills most needed (evaluation design, specification engineering, memory governance, observability of reasoning) are not standard hiring criteria and are not covered by most engineering bootcamps or degree programs. Build an explicit skills development path covering specification engineering, memory governance, and observability of reasoning — from prompt engineering fundamentals through agentic system design. Do not assume the market supplies practitioners ready-made — it does not, at scale, yet.
The Human Side of the Transition
Adopting agentic engineering is not purely a technical change. It is an organizational transformation that directly affects people's professional identity, daily work, and career trajectory. Ignoring the human dimension is how organizations lose their best engineers during the transition.
Naming the Loss
The shift from writing code to steering agents involves a genuine loss of craftsmanship for many engineers. AI made producing code easier and made being an engineer harder — and both things are true simultaneously. Engineers who raised concerns about this shift have too often been told, explicitly or implicitly, to "just adapt faster."
That is not how you build a sustainable engineering culture. Leaders must acknowledge that the transition asks people to redefine what they do and who they are professionally. This acknowledgment is not a sign of weakness — it is a prerequisite for maintaining a team that trusts you enough to follow you through the change.
The Supervision Paradox
Reviewing AI-generated code is often harder than writing code yourself. When you write code, you carry the context of every decision. When AI writes code, you inherit output without reasoning. You see the code but not the decisions behind it. This is why the manifesto insists on traces that capture reasoning, not just events (Principle 9). But leaders must also recognize that the cognitive load of reviewing agent output at volume is a new kind of burden that doesn't appear in productivity metrics.
If your engineers spend their days as judges on an assembly line, stamping pull requests that never stop coming, production volume went up but the sense of craftsmanship went down. That is not a morale problem to be managed. It is a workflow design problem to be solved — through better specifications (reducing the need for review), better evaluations (automating the reviewable parts), and better traces (making the non-automatable review faster).
The Acceleration Trap
AI makes certain tasks faster. Faster tasks create the perception of more available capacity. More perceived capacity leads to more work being assigned. More work leads to more AI reliance. More AI reliance leads to more code that needs review, more context to maintain, more systems to understand, and more cognitive load on engineers already stretched thin.
This cycle — what researchers have called "workload creep" — is self-reinforcing. It looks like productivity from the outside (velocity charts go up, more PRs merged, more features shipped) while quality quietly erodes, technical debt accumulates, and the people doing the work run on fumes.
The perception gap makes the trap invisible from inside. A rigorous 2025 study found that experienced developers using AI tools took 19% longer to complete tasks than developers working without them — while believing AI made them 24% faster. They were wrong not just about the magnitude but about the direction of the change. This perception gap is where the acceleration trap becomes self-reinforcing: teams believe they have more capacity, take on more work, and never measure whether the capacity was real. When the J-curve adoption dip arrives — productivity declining before improving as new workflows mature — teams that have already overcommitted have no slack to absorb the dip.
The corrective: set explicit throughput limits per engineer that account for the full cycle (specification + agent execution + verification + review), not just the implementation phase. Measure outcomes (defect rate, incident severity, customer impact) alongside output volume. When output goes up and outcomes don't improve, the acceleration trap has closed.
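The arithmetic behind full-cycle throughput limits is worth making explicit. A sketch with invented phase durations — the hours per phase and the weekly budget are assumptions to calibrate against your own baseline:

```python
# Full-cycle hours per change. The verification and review phases are the
# ones the acceleration trap ignores. All numbers are illustrative.
PHASE_HOURS = {
    "specification": 2.0,
    "agent_execution": 0.25,   # the only phase that "got faster"
    "verification": 1.5,
    "review": 1.75,
}

def sustainable_weekly_throughput(engineer_hours_per_week: float = 30.0) -> float:
    """Changes per engineer per week that the full cycle can actually absorb."""
    full_cycle = sum(PHASE_HOURS.values())
    return engineer_hours_per_week / full_cycle

# Counting only agent execution suggests 30 / 0.25 = 120 changes per week;
# the full cycle supports about 5.45. The gap between those two numbers is
# the perceived capacity that fuels workload creep.
assert round(sustainable_weekly_throughput(), 2) == 5.45
```

The point of the exercise is not the specific numbers but the ratio: whenever throughput targets are set from the execution phase alone, the trap is already closing.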
Sustainable Pace
The manifesto optimizes for correctness, governance, and economics. But governance that burns out the humans governing it is self-defeating. Sustainable pace is not a nice-to-have — it is a precondition for the human accountability that the entire manifesto depends on.
Track team health alongside system health. Burnout indicators (review latency spikes, approval rubber-stamping, rising escaped defect rates) are system health signals — they indicate that the human layer of governance is degrading. When these signals appear, the correct response is to reduce autonomy scope or simplify governance, not to push harder.
Protecting the Junior Pipeline
If junior engineers traditionally learned by doing routine work — fixing small bugs, writing straightforward features, implementing well-defined tickets — and agents now handle that work, the training ground is disappearing.
This is not just a concern for individual careers. It is a systemic risk: if junior engineers never develop foundational skills through hands-on work, the industry will face a shortage of senior engineers who truly understand the systems they oversee. You cannot supervise what you never learned to build.
This is an organizational policy choice, not a universal staffing rule.
Concrete actions:
- Dedicate a portion of agent-suitable work to junior engineers as learning tasks, even when an agent could do it faster. The efficiency cost is an investment in the talent pipeline.
- Use agent output as teaching material: juniors review agent-generated code, identify weaknesses, and write the evaluations that catch those weaknesses. This builds judgment faster than writing boilerplate ever did.
- Pair junior engineers with agents rather than replacing their work with agents. The junior specifies, the agent implements, the junior evaluates. This builds specification and evaluation skills from day one.
- Create structured progression paths that move juniors from "evaluating agent output" to "designing specifications" to "architecting constraints" — making the skill development explicit rather than hoping it happens through osmosis.
The technical infrastructure for governed delivery and organizational change guidance for every phase transition.
Read the Manifesto for the core principles. See the Adoption Playbook for the full table of contents. See the Roles and the Human Side for how roles evolve during the transition.
Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. Phase transition criteria and go/no-go thresholds in this document are heuristics — calibrate to local domain and baseline before applying. See glossary.md for canonical term definitions.
For V-model organizations: If your organization operates a traditional V-model SDLC (common in life sciences, medtech, aerospace, automotive, and regulated financial services), see adoption-vmodel.md for a V-model-specific variant of this adoption path that preserves your existing verification structure while transitioning to agentic execution.
Incremental Adoption Path
This section describes the technical infrastructure you build to support governed agentic delivery. It assumes your team is at or approaching Phase 3 (agents executing autonomously) and wants to reach Phase 4 and beyond. If you are at Phase 1 or 2, start with the Phase Transitions section below — it covers the organizational changes needed before this infrastructure makes sense.
The seven steps below roughly map to the Phase 3→4 transition (Steps 1–3), the Phase 4→5 transition (Steps 4–6), and ongoing expansion (Step 7). Each step is described at the level of what you actually need to do — not just what the target state looks like.
Step 1: Define Domain Boundaries and Autonomy Tiers
What to do: Map your codebase into domains with clear ownership. For each domain, define what agents may do (Tier 1: analyze and propose; Tier 2: write to branches; Tier 3: production actions) and what they must not do. Encode these as infrastructure-level permissions, not prompt instructions.
Who leads: Tech leads, with input from security and operations.
Minimum viable version: Start with one domain. Define Tier 1 only for the first pilot domain (agents can analyze and propose, zero blast radius). This is safe, reversible, and immediately useful as a learning exercise.
Timeline: 2–4 weeks for initial domain mapping. Ongoing refinement.
Success signal: You can answer "what is this agent allowed to do in this domain?" for every active agent, and the infrastructure enforces the answer.
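One way to make "the infrastructure enforces the answer" concrete is a deny-by-default policy lookup. A minimal in-process sketch — in production these permissions belong in IAM roles and CI scopes rather than application code, and the domain names, tier labels, and action names here are illustrative assumptions:

```python
from enum import IntEnum

class Tier(IntEnum):
    ANALYZE = 1   # read and propose only, zero blast radius
    BRANCH = 2    # may write to branches
    PROD = 3      # may take production actions

# Illustrative domain policy. The pilot domain gets Tier 1 only,
# per the minimum viable version described above.
DOMAIN_POLICY: dict[str, Tier] = {
    "billing": Tier.ANALYZE,
    "docs-site": Tier.BRANCH,
}

# Minimum tier each action requires (hypothetical action vocabulary).
ACTION_TIER: dict[str, Tier] = {
    "read": Tier.ANALYZE,
    "propose": Tier.ANALYZE,
    "push_branch": Tier.BRANCH,
    "deploy": Tier.PROD,
}

def authorize(domain: str, action: str) -> bool:
    """Answer 'what is this agent allowed to do in this domain?' mechanically."""
    allowed = DOMAIN_POLICY.get(domain)
    needed = ACTION_TIER.get(action)
    # Unknown domains and unknown actions are denied by default.
    return allowed is not None and needed is not None and needed <= allowed

assert authorize("billing", "propose")
assert not authorize("billing", "push_branch")   # Tier 2 action in a Tier 1 domain
assert not authorize("unknown-domain", "read")   # unmapped domains are denied
```

The deny-by-default posture matters more than the data structure: an agent operating in an unmapped domain should fail closed, not fall back to a prompt instruction.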
Step 2: Require Evidence Bundles for Every Merged Change
What to do: Define the minimum evidence bundle for your current phase (Phase 3: tests, diff, trace link, rollback note). Integrate evidence collection into your CI/CD pipeline so it's automatic, not manual.
Who leads: QA engineers and CI/CD owners.
Minimum viable version: Require a diff, a test report, and a rollback command for every agent-generated PR. Block merge without these. This adds minutes per PR, not hours.
Timeline: 1–2 weeks to configure CI gates. 1–2 sprints to normalize.
Success signal: No agent-generated change merges without an evidence bundle. Engineers stop saying "the agent said it worked" and start pointing at evidence.
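The merge gate can start as a simple completeness check over the bundle. A sketch assuming the Phase 3 minimum described above — the field names are illustrative, not a prescribed schema:

```python
# Phase 3 minimum evidence bundle (illustrative field names).
REQUIRED_FIELDS = {"diff", "test_report", "trace_id", "rollback_command"}

def evidence_gate(bundle: dict) -> tuple[bool, set[str]]:
    """Return (merge_allowed, missing_fields) for an agent-generated PR.

    Intended to run as a CI check: merge is blocked when any required
    field is absent or empty, so "the agent said it worked" never passes.
    """
    missing = {f for f in REQUIRED_FIELDS if not bundle.get(f)}
    return (not missing, missing)

# A complete bundle passes (values are placeholders).
ok, missing = evidence_gate({
    "diff": "patch-4f2a.diff",
    "test_report": "214 passed, 0 failed",
    "trace_id": "trace-9c1e",
    "rollback_command": "git revert 4f2a",
})
assert ok and not missing

# An incomplete bundle is blocked, and the gate names what is missing.
ok, missing = evidence_gate({"diff": "patch.diff", "trace_id": "trace-1"})
assert not ok and missing == {"test_report", "rollback_command"}
```

Reporting the missing fields, not just a boolean, is what keeps the gate from feeling like friction: engineers see exactly what evidence to attach.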
Step 3: Add Regression Gates Before Expanding Autonomy
What to do: Build a regression evaluation suite for each domain where agents operate. Every agent-generated change must preserve or improve evaluation performance. Failed evaluations block merge.
Who leads: QA engineers, with domain expertise from developers.
Minimum viable version: Start with existing tests. Add behavioral regression tests for the most common agent failure patterns in your domain. Ten well-chosen regression cases are more valuable than a hundred boilerplate tests.
Timeline: 2–4 weeks for initial suite. Continuous expansion.
Success signal: Escaped defect rate for agent-generated changes is equal to or lower than for human-generated changes.
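A regression gate reduces to comparing candidate evaluation scores against the domain baseline. A minimal sketch — the case names, scores, and zero-tolerance default are illustrative assumptions:

```python
def regression_gate(baseline: dict[str, float], candidate: dict[str, float],
                    tolerance: float = 0.0) -> list[str]:
    """Return the evaluation cases where the candidate regressed.

    Every baseline case must be preserved or improved; a case missing
    from the candidate run counts as a regression. An empty result
    means merge may proceed.
    """
    regressions = []
    for case, base_score in baseline.items():
        cand_score = candidate.get(case)
        if cand_score is None or cand_score < base_score - tolerance:
            regressions.append(case)
    return sorted(regressions)

# Illustrative scores: one behavioral case regressed, so merge is blocked.
baseline = {"auth/login": 1.0, "auth/timeout": 0.9, "billing/refund": 0.8}
candidate = {"auth/login": 1.0, "auth/timeout": 0.7, "billing/refund": 0.85}
assert regression_gate(baseline, candidate) == ["auth/timeout"]
```

Treating a missing case as a failure is deliberate: an evaluation that silently stops running is indistinguishable from one that passes, which is exactly the blind spot the gate exists to close.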
Step 4: Add Adversarial and Security Evaluations on Exposed Surfaces
What to do: For any agent-generated code that touches external-facing surfaces (APIs, user interfaces, data pipelines), add adversarial test cases: injection attacks, malformed inputs, edge cases, authorization bypasses.
Who leads: Security engineers and QA.
Timeline: 2–4 weeks per exposed surface.
Success signal: No agent-generated change reaches an external surface without adversarial evaluation coverage.
Step 5: Establish Durable Coordination State
What to do: Before expanding to multi-agent topologies or long-running agent tasks, build the coordination substrate that prevents duplicate work, orphaned tasks, and post-restart divergence. The minimum infrastructure:
- Work ledgers: A single source of truth for what tasks are active, claimed, completed, or abandoned. Without this, concurrent agents duplicate effort or leave work silently unfinished.
- Lease-based task ownership: Agents claim tasks with time-bounded leases. If an agent crashes or stalls, the lease expires and the task becomes available for reassignment. Without leases, orphaned tasks accumulate silently.
- Restart-safe handoffs: Agent state must survive restarts. If an agent is interrupted mid-task, the next agent (or the same agent after restart) must be able to resume from a well-defined checkpoint rather than starting over. Design for replay safety: re-executing a handoff must produce the same result, not duplicate side effects.
Who leads: Platform/infrastructure engineers.
Minimum viable version: A shared task ledger with lease expiration for one multi-agent workflow. This can be as simple as a database table with claim timestamps and TTLs.
Timeline: 2–4 weeks for initial ledger. Ongoing refinement as topologies expand.
Success signal: No orphaned tasks after agent crashes. No duplicate work across concurrent agents. Restart produces resumption, not repetition.
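The "database table with claim timestamps and TTLs" can be sketched in a few lines. An in-memory illustration of lease-based ownership, assuming a single process — a real ledger would live in durable, shared storage, and the TTL value is an assumption:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskLedger:
    """Minimal work ledger with lease-based task ownership."""
    lease_ttl: float = 300.0  # seconds before an unrenewed claim expires
    _leases: dict[str, tuple[str, float]] = field(default_factory=dict)
    _done: set[str] = field(default_factory=set)

    def claim(self, task_id: str, agent_id: str, now: Optional[float] = None) -> bool:
        """Claim a task. Fails if a live lease is held by another agent."""
        now = time.time() if now is None else now
        if task_id in self._done:
            return False  # completed work is never reassigned
        holder = self._leases.get(task_id)
        if holder and holder[0] != agent_id and now - holder[1] < self.lease_ttl:
            return False  # another agent holds a live lease: no duplicate work
        self._leases[task_id] = (agent_id, now)  # claim, renew, or take over expired lease
        return True

    def complete(self, task_id: str, agent_id: str) -> bool:
        """Mark a task done; only the current lease holder may complete it."""
        holder = self._leases.get(task_id)
        if holder and holder[0] == agent_id:
            self._done.add(task_id)
            del self._leases[task_id]
            return True
        return False

ledger = TaskLedger(lease_ttl=300.0)
assert ledger.claim("task-1", "agent-a", now=0.0)
assert not ledger.claim("task-1", "agent-b", now=100.0)  # lease live: no duplication
assert ledger.claim("task-1", "agent-b", now=400.0)      # agent-a stalled: reassigned
assert ledger.complete("task-1", "agent-b")
assert not ledger.claim("task-1", "agent-c", now=500.0)  # done work is not repeated
```

The same three properties the text calls for fall out of this shape: no orphans (expired leases become claimable), no duplicates (live leases are exclusive), and restart-safe resumption (completed tasks stay completed).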
Step 6: Pilot Formal Contracts on One High-Blast-Radius Path
What to do: Select one critical path (e.g., payment processing, data integrity constraint, authentication flow) and add machine-checkable contracts (preconditions, postconditions, invariants). This is not full formal verification — it is contract-first development on a narrow scope.
Who leads: Senior engineers with architecture responsibility. May require external expertise in formal methods — see the Skill Requirements table in the Companion Guide.
Timeline: 4–8 weeks for initial pilot.
Success signal: The contracted path has zero escaped defects from contract-violating changes over the pilot period.
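Contract-first development on a narrow scope can be as lightweight as runtime-checked pre- and postconditions. A sketch — the decorator, the `debit` example, and integer-cent balances are illustrative assumptions, not a specific formal-methods toolchain:

```python
from functools import wraps

def contract(pre=None, post=None):
    """Attach machine-checkable pre/postconditions to one function.

    This is deliberately not full formal verification: it checks the
    contract at runtime on the one critical path it wraps.
    """
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if pre and not pre(*args, **kwargs):
                raise ValueError(f"precondition violated: {fn.__name__}")
            result = fn(*args, **kwargs)
            if post and not post(result, *args, **kwargs):
                raise ValueError(f"postcondition violated: {fn.__name__}")
            return result
        return wrapper
    return decorate

@contract(
    pre=lambda balance, amount: amount > 0 and amount <= balance,
    post=lambda new_balance, balance, amount: new_balance == balance - amount,
)
def debit(balance: int, amount: int) -> int:
    """Debit an account balance, in integer cents to avoid float drift."""
    return balance - amount

assert debit(10_000, 2_500) == 7_500
try:
    debit(1_000, 2_500)  # overdraft: the precondition blocks the change
except ValueError as exc:
    assert "precondition" in str(exc)
```

An agent-generated change that violates the contract fails loudly at the boundary rather than silently corrupting state, which is the property the zero-escaped-defects success signal measures.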
Step 7: Expand Only When Incident Rate and Economics Improve
What to do: Before expanding agent autonomy (promoting from Tier 1 to Tier 2, or from one domain to multiple domains), verify that the current scope is working: incident rate is flat or declining, total cost of correctness is acceptable, and governance overhead is sustainable.
Who leads: Engineering leadership with input from operations and finance.
Expansion criteria: Incident rate stable or improving for two consecutive quarters. Total cost of correctness declining per outcome. Human oversight load (reviews per domain owner) is sustainable.
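The expansion criteria above can be encoded as a mechanical go/no-go check. A sketch — the threshold values, quarter counts, and review-load ceiling are illustrative assumptions to calibrate per domain:

```python
def expansion_go(incident_rate: list[float],
                 cost_per_outcome: list[float],
                 reviews_per_owner_per_week: float,
                 sustainable_review_load: float = 25.0) -> bool:
    """Go/no-go for expanding agent autonomy.

    incident_rate and cost_per_outcome are per-quarter series, oldest first.
    All three criteria must hold; any single failure blocks expansion.
    """
    # Incident rate stable or improving for two consecutive quarters.
    incidents_ok = (len(incident_rate) >= 3
                    and incident_rate[-1] <= incident_rate[-2] <= incident_rate[-3])
    # Total cost of correctness declining per outcome.
    cost_ok = len(cost_per_outcome) >= 2 and cost_per_outcome[-1] < cost_per_outcome[-2]
    # Human oversight load sustainable (ceiling is an invented example value).
    load_ok = reviews_per_owner_per_week <= sustainable_review_load
    return incidents_ok and cost_ok and load_ok

# Illustrative figures: improving incidents and costs, sustainable review load.
assert expansion_go([12, 9, 9], [210.0, 180.0], reviews_per_owner_per_week=18)
# Rising incident rate blocks expansion regardless of the other signals.
assert not expansion_go([9, 9, 12], [210.0, 180.0], reviews_per_owner_per_week=18)
```

Making the gate mechanical is the point: expansion decisions become an evidence review, not a negotiation.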
Organizational Change by Phase Transition
The manifesto defines six maturity phases. The Companion Guide provides full definitions and failure modes for each. Here is a summary for reference:
- Phase 1 — Guided Exploration. Single prompts, no structure, no memory.
- Phase 2 — Assisted Delivery. AI as autocomplete; humans execute.
- Phase 3 — Agentic Prototyping. Agents execute within a single session; limited verification.
- Phase 4 — Agentic Delivery. Basic guardrails: autonomy tiers, evaluation gates, persistent memory. Single-domain.
- Phase 5 — Agentic Engineering. Structured autonomy at scale. Multi-domain, evidence-driven, continuous Agentic Loop.
- Phase 6 — Adaptive Systems. Self-improving infrastructure within governed boundaries. Frontier capabilities required.
Each transition below describes what changes organizationally, what actions to take, and what makes the transition hard.
Investment and Organizational Sizing by Phase
Every phase transition has both a technical dimension (what you build) and an organizational dimension (what you fund, who you hire or develop, what you stop doing). This table gives decision-makers the investment framing alongside the technical steps:
Calibrate all phase-transition metrics to domain baseline and incident history.
| Phase transition | Typical investment | Team change | Primary cost driver | ROI signal |
|---|---|---|---|---|
| Phase 1→2 | Tooling licenses (low); process standardization (1–2 weeks engineering time) | No new roles; existing team adopts AI tools | Tool cost + standardization overhead | Cycle time reduction on AI-assisted tasks; measurable in weeks |
| Phase 2→3 | Specification training (1–3 weeks); review process redesign | No new roles; senior engineers develop specification discipline | Senior engineer time on specification + review patterns | Reviewable agent output without excessive rework |
| Phase 3→4 | CI/CD evidence pipeline (4–8 weeks engineering); evaluation suite build (4–8 weeks QA); domain boundary encoding (2–4 weeks tech lead) | Add: QA lead owning evaluations; explicit domain owners | Evaluation suite build is the primary investment | Escaped defect rate ≤ human baseline; evidence completeness ≥95% |
| Phase 4→5 | Platform team formation (3–5 engineers, ongoing); shared evaluation registry; trace standards; memory governance infrastructure | Add: platform team separates from delivery; multiply delivery teams across domains | Platform infrastructure and governance capability building | Total cost of correctness declining per outcome; oversight load stable while output scales |
| Phase 5→6 | Formal methods expertise (targeted, time-bounded); independent audit paths; self-improvement governance | Formal verification specialists (targeted hire or consultant); independent validation function | Specialized expertise and governance overhead for self-improving systems | Phase 6 is a frontier, not a universal target — assess only when Phase 5 is fully stable across all critical domains |
Decision discipline: Do not fund the next phase until the current phase has produced evidence that justifies it. If the go/no-go signals fail for two review cycles, freeze expansion and re-baseline before proceeding. This is not conservatism — it is the mechanism that prevents the most common failure: investing in Phase 4 governance infrastructure before Phase 3 has produced evidence that agents generate reviewable output. The infrastructure becomes bureaucracy, teams lose confidence, and the initiative stalls.
Phase 1 → 2: From Exploration to Assisted Delivery
What changes: You move from unstructured experimentation ("let's see what ChatGPT can do") to repeatable AI-assisted workflows where humans remain in the loop for every action. Agents go from novelty to daily tool.
This transition matters to the manifesto because it builds the foundation for two things that every later phase depends on: the habit of evaluating AI output critically (the seed of Principle 8 — Evaluations), and the organizational muscle of defining what tools may and must not do (the seed of Principle 5 — Autonomy). Teams that skip Phase 2 arrive at Phase 3 with no discipline around either, and Phase 3 is where the consequences start compounding.
Organizational actions:
- Identify the tasks where AI assistance delivers consistent value (code completion, test generation, documentation drafting) and standardize tooling around them
- Establish basic usage guidelines: what models are approved, what data may be shared with them, what outputs require human review before use
- Begin measuring where AI assistance actually saves time versus where it creates rework — intuition is unreliable here; this is your first encounter with the economics principle (Principle 11) at the simplest possible scale
- Run a lightweight retrospective: which experiments from Phase 1 produced real value, and which were demos that impressed but didn't stick?
The hard part: The organizational challenge is not technical — it is cultural. Phase 1 generates enthusiasm and a sense of possibility. Phase 2 demands that you kill the experiments that felt exciting but don't produce repeatable value. Teams that skip this curation step carry forward a scattered toolset of one-off prompts and ad-hoc workflows that no one else can reproduce. Worse, they develop a false confidence that "we're already doing AI," which becomes a barrier to the deeper changes Phase 3 requires. This is primarily a curation and standardization exercise, not a technical build.
Phase 2 → 3: From Assisted Delivery to Agentic Prototyping
What changes: You move from AI-as-autocomplete (human executes, AI suggests) to agents that execute autonomously within a single session. The human stops typing every line and starts delegating whole tasks. This is the moment the team realizes prompting is not engineering.
Organizational actions:
- Select 2-3 bounded tasks where agents can execute end-to-end within a session (e.g., generate a module from a spec, write a test suite for an existing component, refactor a file according to a style guide)
- Require human review of every agent-generated output before merge — no exceptions. At this phase, the agent has no memory, no verification pipeline, and no guardrails beyond the prompt
- Begin documenting failure patterns: where agents hallucinate, where they miss edge cases, where they produce plausible-looking code that fails silently. This documentation becomes the seed for your evaluation suite in Phase 4
- Start writing specifications with explicit acceptance criteria, even if informally. The habit of defining "what does done look like" before the agent starts is the single most important skill for everything that follows
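One way to make "what does done look like" concrete is to express acceptance criteria as executable checks rather than prose. The sketch below is illustrative only — the `Criterion` record and `check_done` helper are hypothetical names, not part of any tool named in this document:

```python
# Hypothetical sketch: acceptance criteria as executable predicates
# over the produced artifact, so "done" is checkable, not debatable.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str
    check: Callable[[dict], bool]  # inspects the produced artifact

def check_done(artifact: dict, criteria: list) -> list:
    """Return the descriptions of every criterion the artifact fails."""
    return [c.description for c in criteria if not c.check(artifact)]

criteria = [
    Criterion("all tests pass", lambda a: a["tests_failed"] == 0),
    Criterion("public API unchanged", lambda a: not a["api_diff"]),
]

# An empty failure list means the task meets its definition of done.
failures = check_done({"tests_failed": 0, "api_diff": []}, criteria)
```

Even at this informal stage, the habit of writing criteria an agent (or a script) could evaluate is what later phases formalize into evaluation suites.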
The hard part: The supervision paradox hits here for the first time. Reviewing agent-generated code is harder than writing it yourself — you inherit output without context. Teams that don't acknowledge this will either rubber-stamp agent output (creating quality risk) or reject the workflow entirely (losing the productivity gain). Neither is acceptable. The answer is better specifications and the beginning of structured evaluation, which is exactly what Phase 4 formalizes. Expect this transition to take longer than Phase 1→2 — your team is moving from "AI suggests, I decide" to "AI executes, I verify," and learning to verify well takes practice.
Phase 3 → 4: Governed Delivery Foundation
What changes: You move from "agents do things and we hope they work" to "agents do things within defined boundaries with evidence."
Organizational actions:
- Add CI/CD evidence and policy gates
- Assign domain owners and escalation rotations
- Standardize incident classification and rollback drills
- Begin tracking evidence bundle completeness and escaped defect rate
The hard part: Convincing teams that the evidence overhead is worth it when they're already shipping faster than ever. The acceleration trap makes governance feel like a brake. Frame it as insurance, not bureaucracy: the evidence bundle is what lets you expand autonomy later. Without it, you're stuck at Phase 3 forever. Start with a single domain; parallel rollout across domains is possible but increases coordination overhead.
Phase 4 → 5: Engineering-Scale Transition
What changes: You move from single-domain, reactive governance to multi-domain, evidence-driven engineering. This is the hardest transition because it requires organizational change, not just tooling.
Organizational actions:
- Establish shared evaluation registry and trace standards
- Create platform ownership for agent runtime, routing, and memory governance
- Formalize security reviews for tools, connectors, and shared state
- Invest in the "Rare" skills identified in the Companion Guide's Skill Requirements table: distributed systems design, memory governance, ML/retrieval engineering, chaos engineering
The hard part: This transition often requires new roles or responsibilities that don't exist in the current org chart. "Platform ownership for agent runtime" is not something most organizations have. You are creating infrastructure categories, not just adopting tools. This is not a sprint goal — it is an organizational redesign that unfolds over multiple quarters.
Phase 5 → 6: Adaptive Frontier
What changes: Systems begin improving themselves within governed boundaries. This is a frontier — not all organizations need to reach Phase 6, and the capabilities required (formal verification, causal reasoning, provable containment) are still maturing.
Organizational actions:
- Require governance for self-updating specifications and routing policies
- Maintain independent audit paths for high-impact domains
- Treat formal methods expertise as targeted specialization, not universal role
The hard part: Knowing when you're ready. Phase 6 without Phase 5's discipline is how you get self-improvement without containment — the system optimizes the metric, not the goal. Do not attempt Phase 6 until Phase 5 is stable across all critical domains.
For organizations transitioning from a traditional V-model SDLC to agentic engineering. This is a V-model-specific variant of adoption-path.md.
Read the Manifesto for the core principles. Read the Companion Guide for implementation depth. Read the Adoption Playbook for organizational change management.
For generic (non-V-model) adoption steps, see adoption-path.md.
Core Thesis
The transition should not throw away the V-model.
In life sciences, aerospace, automotive, and regulated financial services, the V-model survives because it solves real problems: it forces early definition of intended use and design inputs, creates explicit traceability between requirements and evidence, distinguishes verification from validation, and fits quality systems, change control, and audit expectations.
What changes in an agentic SDLC is not the need for rigor. What changes is the way rigor is expressed:
- specifications become more structured and machine-readable
- verification becomes more automated, layered, and continuously replayable
- validation remains human-owned, but is better instrumented
- traceability moves from manual spreadsheet labor to generated evidence graphs
- implementation shifts from direct human authorship to governed agent execution inside bounded harnesses
The right goal: keep the V-model's assurance logic, but retool its artifacts, gates, and execution model for agents.
This is a transition framework for agentic engineering inside a V-model organization, not a proposal to discard the V-model.
What Should Stay the Same
- Intended use, risk classification, and release responsibility remain human accountabilities.
- Verification and validation remain distinct disciplines.
- Change control, approval records, and traceability remain mandatory.
- Higher-risk functions retain stricter review, narrower autonomy, and stronger evidence requirements.
- Validation against clinical, operational, or business reality cannot be delegated fully to an agent.
What Should Change
Applicability varies by domain, qualification regime, and tool qualification constraints.
- Requirements become versioned, structured, and reusable by both humans and agents.
- Architecture is encoded as enforceable constraints, not just diagrams and prose.
- Verification plans become executable evaluation suites.
- Evidence bundles are assembled automatically from traces, tests, policies, and artifacts.
- Agents assist with decomposition, implementation, regression analysis, traceability, and document assembly under explicit autonomy tiers.
- Post-release monitoring and periodic revalidation become part of the same lifecycle, not a separate operational afterthought.
The Traditional V-Model
```
Stakeholder  <-------------------------------->  Acceptance
Requirements                                     Testing
     \                                          /
      System  <---------------------------->  System
      Requirements                            Testing
           \                                 /
            Architecture  <------------->  Integration
            Design                         Testing
                 \                        /
                  Detailed  <-------->  Unit
                  Design                Testing
                       \               /
                        -->  Implementation  <--
```
Each left-side phase produces a specification. Each right-side phase verifies that specification. The horizontal arrows are traceability links. The bottom of the V is human implementation.
The Agentic V-Model
```
Outcome  <-------------------------------->  Acceptance &
Specifications                               Accountability
(P1, P2)                                     (P12, P8)
     \                                      /
      System  <------------------------>  System-Level
      Specifications                      Evaluation
      (P2, P3)                            (P10, P8)
           \                             /
            Agent  <----------------->  Cross-Agent
            Architecture                Verification
            (P3, P4, P5)                (P9, P10)
                 \                     /
                  Context &  <----->  Per-Agent
                  Domain Design       Evaluation
                  (P6, P7, P11)       (P8, P9)
                       \             /
                        -->  Agent Execution  <--
                            (Bounded Autonomy)
```
The structural symmetry is preserved: every specification level maps to a verification level. But every layer has changed in substance.
| Classical V-model stage | Agentic equivalent | What changes | Human accountability remains at |
|---|---|---|---|
| User needs / intended use | Structured intent package | Intended use, hazards, workflow assumptions, risk class, and success criteria become explicit machine-readable inputs | intended use, risk acceptance, go / no-go |
| System requirements | Versioned requirement contracts | Requirements include acceptance criteria, stop criteria, data constraints, and traceability IDs | requirement approval and scope decisions |
| High-level design | Enforced architectural policy | ADRs, bounded contexts, tool permissions, and data boundaries become executable constraints | boundary ownership and exception approval |
| Detailed design | Executable specifications | Interfaces, invariants, state models, and critical decision rules become machine-checkable | design review at risk-based depth |
| Implementation | Harnessed agent execution | Agents draft, implement, refactor, and document inside governed sandboxes | autonomy tier approval and exception handling |
| Unit / component verification | Deterministic verification-as-code | Tests, static analysis, contracts, replay, and proofs are run automatically | review of failures, waivers, and critical evidence |
| Integration verification | Tool, workflow, and protocol verification | Agents and tools are verified as a system, not component by component only | approval of integration evidence and unresolved deviations |
| System verification | Evaluation harnesses | End-to-end workflows, adversarial cases, reliability, and economics are evaluated continuously | decision on fitness for intended technical use |
| Validation | Human-led contextual validation | Workflow fit, clinical or user value, and real-world operating assumptions are assessed with stronger instrumentation | validation conclusion and release decision |
| Maintenance / changes | Continuous revalidation loop | Drift, regressions, memory updates, and agent policy changes re-enter change control | periodic revalidation and CAPA ownership |
Layer-by-Layer Transformation
Level 1: Stakeholder Requirements --> Outcome Specifications
Traditional: Business analysts translate stakeholder needs into a requirements document. Requirements are written in natural language, reviewed by humans, and baselined.
Agentic: Outcome specifications replace requirements documents. Specifications are machine-readable: they define what "done" means in terms agents can evaluate autonomously (Principle 1). They include acceptance criteria, boundary conditions, blast-radius constraints, and validation criteria — not just verification criteria.
Validation ("did we build the right thing?") becomes a first-class concern because agents can satisfy every verification check and still produce the wrong outcome.
What changes in practice:
- Requirements become executable specifications with machine-readable acceptance criteria
- Validation criteria are defined upfront alongside verification criteria
- Specifications are versioned artifacts that evolve through the Agentic Loop
- Traceability is automatic: the specification is the input to the agent, and the trace records the link between specification and execution
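A minimal sketch of what a versioned, machine-readable outcome specification might look like — the schema and the `revise` helper are assumptions for illustration, not a standard:

```python
# Sketch of an outcome specification as a versioned artifact that
# evolves through the Agentic Loop. All field names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class OutcomeSpec:
    spec_id: str
    version: int
    intent: str
    acceptance_criteria: list   # verification: "did we build it right?"
    validation_criteria: list   # validation: "did we build the right thing?"
    blast_radius: str           # e.g. "single service, no schema changes"

    def revise(self, **changes) -> "OutcomeSpec":
        """Each pass through the loop yields a new immutable version."""
        data = {**self.__dict__, **changes, "version": self.version + 1}
        return OutcomeSpec(**data)

spec = OutcomeSpec(
    spec_id="SPEC-9", version=1, intent="add CSV export",
    acceptance_criteria=["export matches published schema"],
    validation_criteria=["analysts can open the file unaided"],
    blast_radius="single service",
)
spec_v2 = spec.revise(intent="add CSV and JSON export")
```

Freezing the dataclass means revisions are new versions rather than in-place edits, which is what makes the specification-to-trace link auditable.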
Level 2: System Requirements --> System Specifications
Traditional: System engineers decompose stakeholder requirements into system requirements. Each system requirement is testable and traceable.
Agentic: System specifications define domain boundaries, inter-domain contracts, and the constraints that agents must respect (Principle 3: defense-in-depth). Each domain has a clear owner, a defined autonomy tier, and machine-enforceable boundaries.
The key shift: system specifications are infrastructure-level constraints that the runtime enforces. An agent that violates a domain boundary is blocked by the system, not caught in review.
What changes in practice:
- System requirements become enforceable domain boundaries and typed contracts
- Decomposition is driven by blast radius and autonomy tiers, not just functional decomposition
- Each domain specifies its evidence bundle requirements by phase
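The "blocked by the system, not caught in review" shift can be sketched as a runtime guard. The agent names, domain names, and guard API below are all hypothetical:

```python
# Sketch: a runtime check that refuses a cross-domain write instead of
# relying on human review to catch it. Names are illustrative.
class DomainBoundaryViolation(Exception):
    pass

DOMAIN_WRITE_SCOPES = {
    "billing-agent": {"billing"},
    "docs-agent": {"docs", "examples"},
}

def authorize_write(agent: str, domain: str) -> None:
    """Raise before the write happens if the agent lacks scope."""
    allowed = DOMAIN_WRITE_SCOPES.get(agent, set())
    if domain not in allowed:
        raise DomainBoundaryViolation(
            f"{agent} may not write to {domain!r}; allowed: {sorted(allowed)}"
        )

authorize_write("docs-agent", "docs")         # permitted, returns silently
try:
    authorize_write("docs-agent", "billing")  # blocked by the runtime
except DomainBoundaryViolation as e:
    blocked = str(e)
```

The essential property is that the boundary is data the runtime consults on every action, not prose the reviewer remembers to check.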
Level 3: Architecture Design --> Agent Architecture
Traditional: Software architects define the system structure: components, interfaces, data flows, deployment topology.
Agentic: Agent architecture defines the topology of the agentic system: how many agents, what roles, what coordination pattern (Principle 4). It defines the autonomy tier for each agent (Principle 5) and the defense-in-depth layers that wrap probabilistic decisions in deterministic infrastructure (Principle 3).
Architecture also encompasses context architecture (Principle 7) and memory architecture (Principle 6).
What changes in practice:
- Component diagrams become agent topology diagrams with explicit authority relationships
- Data flow diagrams include context flow, memory flow, and cost flow
- Architecture decisions include model selection rationale, routing policies, and cost targets (Principle 11)
Level 4: Detailed Design --> Context and Domain Design
Traditional: Module-level design defines the internal structure of each component. This is the last specification level before coding begins.
Agentic: Context and domain design defines what each agent needs to execute correctly: its context budget, retrieval configuration, memory access, tool permissions, and evaluation criteria.
This layer is where most agentic projects fail. Teams jump from architecture to execution without specifying what context each agent should see, what tools it may use, what cost limits apply, or what evaluation criteria define success.
What changes in practice:
- Module designs become agent configuration specifications
- Algorithm selection becomes model selection with cost/quality tradeoffs
- Data structure design includes memory store design with the five governance properties (provenance, expiration, compression, rollback, domain scoping)
- Internal interfaces become tool contracts with typed inputs/outputs
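As a sketch, an agent configuration specification could gather these decisions into one reviewable record. Every field value below is illustrative, and the model name is a placeholder:

```python
# Sketch of a per-agent configuration record covering the design
# decisions this layer must pin down before execution starts.
from dataclasses import dataclass

@dataclass
class AgentConfig:
    name: str
    model: str                  # chosen on cost/quality tradeoff, not prestige
    context_budget_tokens: int  # what the agent is allowed to see
    tool_permissions: set       # typed tool contracts it may call
    memory_scope: str           # which governed memory store it touches
    cost_limit_usd: float       # hard stop per task
    eval_suite: str             # the criteria that define success

cfg = AgentConfig(
    name="refactor-agent",
    model="small-fast-model",          # placeholder model identifier
    context_budget_tokens=32_000,
    tool_permissions={"read_file", "write_file", "run_tests"},
    memory_scope="domain:payments",
    cost_limit_usd=2.50,
    eval_suite="evals/refactor.yaml",  # hypothetical path
)
```

Teams that cannot fill in a record like this for each agent are, by this document's diagnosis, jumping from architecture to execution.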
Level 5 (Bottom): Implementation --> Agent Execution
Traditional: Developers write code. The bottom of the V — the only layer where artifacts are produced rather than specified or verified.
Agentic: Agents execute within bounded autonomy. They receive specifications, context, and tool access. They produce code, artifacts, decisions, or actions. They generate traces of their reasoning. They are evaluated against criteria they may not see (evaluation holdout).
The fundamental shift: implementation is delegated. The human's role moves from writing code to defining the conditions under which agents execute and verifying the results.
What changes in practice:
- Coding sessions become agent execution sessions with full trace capture
- Code review becomes evidence bundle review (diff + tests + trace + rollback)
- The implementation artifact includes not just the code but the trace that explains how and why it was produced
Level 6 (Right, ascending): Unit Testing --> Per-Agent Evaluation
Traditional: Unit tests verify that each module behaves as designed.
Agentic: Per-agent evaluation portfolios verify that each agent's output meets its specification (Principle 8). This includes happy-path validation, adversarial testing, regression coverage, and behavioral checks. Evaluation holdout prevents agents from overfitting to visible criteria.
Structured traces (Principle 9) make every agent decision inspectable.
What changes in practice:
- Unit tests become evaluation portfolios with four coverage categories
- Test execution becomes continuous evaluation on every change
- Test reports become structured traces queryable by any dimension
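Evaluation holdout can be sketched in a few lines: score the agent against cases it never optimized for. The split ratio, seed, and predicate-style cases below are assumptions for illustration:

```python
# Sketch of evaluation holdout: partition the portfolio so the agent
# sees only the visible cases; held-out cases catch overfitting to
# visible criteria. The case format (predicates on output) is illustrative.
import random

def split_portfolio(cases: list, holdout_ratio: float = 0.3, seed: int = 7):
    """Partition evaluation cases into (visible, holdout) sets."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * holdout_ratio)
    return shuffled[k:], shuffled[:k]

def run_evals(agent_output: str, cases: list) -> float:
    """Fraction of cases passed; each case is a predicate on the output."""
    return sum(1 for c in cases if c(agent_output)) / len(cases)

cases = [
    lambda o: "def " in o,        # produced a function
    lambda o: "TODO" not in o,    # no unfinished stubs
    lambda o: len(o) < 5_000,     # stayed within scope
    lambda o: o.strip() != "",    # produced something at all
]
visible, holdout = split_portfolio(cases)
# Report the holdout score, not the score on cases the agent saw.
score = run_evals("def add(a, b):\n    return a + b\n", holdout)
```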
Level 7: Integration Testing --> Cross-Agent Verification
Traditional: Integration tests verify that subsystems work together.
Agentic: Cross-agent verification confirms that agents interacting across domain boundaries produce correct system-level behavior. This includes trace correlation across agent chains, drift detection, and cost anomaly monitoring.
This layer also includes the behavioral vs. structural regression distinction: an agent's output may pass all current evaluations but degrade the codebase's capacity for future change.
What changes in practice:
- Integration test suites become cross-domain evaluation portfolios
- Interface testing becomes trace correlation and provenance verification
- Structural regression monitoring is added alongside behavioral regression
Level 8: System Testing --> System-Level Evaluation
Traditional: System tests verify the complete system against system requirements including non-functional requirements.
Agentic: System-level evaluation includes chaos testing (Principle 10) and threat modeling for agentic systems. This tests what happens when tools fail, retrieval is noisy, memory is corrupted, or agents interact in unexpected ways.
What changes in practice:
- System test plans become chaos testing plans with safety models
- Security testing becomes agentic threat modeling (prompt injection, memory poisoning, agent impersonation, data exfiltration)
- Non-functional testing includes total cost of correctness measurement
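A chaos test for a tool dependency can be sketched as injected failures plus an assertion about the safety model. The wrapper, retry policy, and "DEGRADED" sentinel below are all illustrative assumptions:

```python
# Sketch of a chaos test: inject tool failures and assert the system
# degrades into a declared state rather than crashing unhandled.
import random

def flaky(tool, failure_rate: float, rng: random.Random):
    """Return a version of `tool` that fails with the given probability."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected tool failure")
        return tool(*args, **kwargs)
    return wrapped

def resilient_call(tool, arg, retries: int = 3, fallback="DEGRADED"):
    """The behavior under test: retry, then degrade instead of crashing."""
    for _ in range(retries):
        try:
            return tool(arg)
        except TimeoutError:
            continue
    return fallback

rng = random.Random(0)
search = flaky(lambda q: f"results for {q}", failure_rate=0.5, rng=rng)
outcomes = [resilient_call(search, "query") for _ in range(100)]

# Safety model: every outcome is either a real result or a declared
# degraded state -- never an unhandled exception.
assert all(o == "DEGRADED" or o.startswith("results") for o in outcomes)
```

The assertion is the point: a chaos plan states what "safe under failure" means and checks it mechanically, under load the happy path never exercises.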
Level 9 (Top right): Acceptance Testing --> Acceptance and Accountability
Traditional: Acceptance tests verify the system against stakeholder requirements. In regulated industries, this includes formal sign-off by a qualified person.
Agentic: Acceptance and accountability verification confirms that a named human can inspect the reasoning, review the evidence, and own the outcome of every production agent (Principle 12). This is tier-calibrated governance:
- Tier 1 (Observe): Human executes every action. Accountability is inherent.
- Tier 2 (Branch): Human owns constraint design and evaluation portfolio.
- Tier 3 (Commit): Human owns policy design, sampling strategy, incident response. Automated enforcement handles routine checks.
Evidence bundles are the acceptance artifact: diff, tests, trace, rollback command, policy checks, and cost accounting — all phase-gated and immutable.
What changes in practice:
- UAT becomes evidence bundle review with tier-appropriate depth
- Formal sign-off becomes accountability assignment with trace-backed evidence
- Release gates become phase-calibrated evidence thresholds
- Post-release monitoring becomes continuous behavioral observability
What the Agentic V-Model Adds
The agentic V-model is not simply "V-model with AI at the bottom." It adds structural elements that the traditional V-model does not address:
Continuous verification. The traditional V-model verifies at gates. The agentic V-model verifies continuously — evaluations run on every change, not at phase transitions.
Emergence testing. The traditional V-model assumes deterministic implementation. Agentic systems are probabilistic and exhibit emergent behavior. Chaos testing and containment engineering have no equivalent in the traditional V.
Behavioral observability. The traditional V-model verifies correctness at each level. The agentic V-model also monitors for drift, anomaly, and constraint violation in real time between verification levels.
Accountability under non-determinism. When agents implement at scale, comprehensive inspection is impossible. The agentic V-model replaces direct inspection with tiered governance: humans own the constraints, evaluations, and evidence model — not every individual output.
Economic optimization. The traditional V-model does not address the cost of verification. The agentic V-model includes economics as a first-class concern (Principle 11).
ALCOA+ Traceability Through the Agentic V-Model
For GxP and regulated environments, the agentic V-model produces ALCOA+ compliant records at every layer by construction:
| V-Model Layer | Record Produced | ALCOA+ Properties Satisfied |
|---|---|---|
| Outcome Specifications | Versioned, machine-readable specs | Original, Legible, Enduring |
| System Specifications | Domain boundaries, autonomy tiers | Consistent, Complete |
| Agent Architecture | Topology decisions, routing policies | Attributable, Accurate |
| Context/Domain Design | Agent configurations, tool scopes | Complete, Consistent |
| Agent Execution | Structured traces with full reasoning | Contemporaneous, Attributable, Original |
| Per-Agent Evaluation | Evaluation results, evidence bundles | Accurate, Available |
| Cross-Agent Verification | Correlated traces, provenance | Complete, Attributable |
| System-Level Evaluation | Chaos test records, threat models | Enduring, Available |
| Acceptance & Accountability | Named owner sign-off, evidence | Attributable, Complete, Available |
The trace chain from outcome specification through agent execution to acceptance evidence is unbroken, machine-queryable, and immutable.
Transition Principles
1. Start with specification engineering, not coding agents
If requirements are vague, agents will merely produce ambiguity faster. Specification quality is the primary upstream control variable.
2. Modernize verification before expanding autonomy
Autonomy without a strong right side of the V creates faster defect production, not faster compliant delivery.
3. Keep validation explicitly human-led
Verification asks whether the system satisfies the specification. Validation asks whether the specification was worth building. Agents can support validation; they should not own it.
4. Treat architecture as policy
In a standard SDLC, architecture can be partly social. In an agentic SDLC, domain boundaries, tool permissions, and data handling rules must be enforced by the runtime.
5. Make traceability an output of the system
Do not scale manual trace matrices. Generate traceability from linked, versioned artifacts, execution traces, tests, approvals, and evidence bundles.
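Generated traceability can be sketched as a walk over a graph of artifact links. The artifact IDs and link types below are hypothetical, standing in for whatever the quality system actually records:

```python
# Sketch: the trace matrix as a derived view over linked artifacts,
# not a hand-maintained spreadsheet. IDs and link types are illustrative.
links = [
    ("REQ-101", "implements", "CHANGE-77"),
    ("CHANGE-77", "verified_by", "TEST-RUN-900"),
    ("CHANGE-77", "traced_by", "TRACE-ab12"),
    ("CHANGE-77", "approved_in", "REVIEW-55"),
]

def evidence_for(requirement: str) -> set:
    """Collect every artifact reachable from a requirement via links."""
    found, frontier = set(), {requirement}
    while frontier:
        node = frontier.pop()
        for src, _, dst in links:
            if src == node and dst not in found:
                found.add(dst)
                frontier.add(dst)
    return found

# The trace row for REQ-101 is computed on demand from the link graph.
evidence = evidence_for("REQ-101")
```

When traceability is a query rather than a document, it cannot be stale, and it scales with change volume instead of against it.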
6. Expand autonomy by risk tier, never uniformly
Low-risk artifacts can move to agent assistance early. High-risk requirements, validation conclusions, and release approvals remain strongly human-governed.
Transition Roadmap
This roadmap assumes a serious regulated environment and a staged transition. The phases are sequential in emphasis, but some activities overlap.
| Phase | Focus | Typical duration | Primary outcome | Manifesto phase |
|---|---|---|---|---|
| 0 | Baseline and segmentation | 4-6 weeks | Current V-model mapped, risk classes segmented, pilot scope chosen | Pre-Phase 3 |
| 1 | Specification foundation | 6-10 weeks | Requirements become structured, versioned, and agent-usable | Phase 2-3 |
| 2 | Verification and validation backbone | 8-12 weeks | V&V evidence becomes executable, repeatable, and tiered | Phase 3 |
| 3 | Architecture and harness controls | 6-10 weeks | Agents operate inside enforceable boundaries | Phase 3-4 |
| 4 | Controlled agent-assisted build and test | 8-12 weeks | Agents contribute under supervision and evidence gates | Phase 3-4 |
| 5 | Integrated agentic V-model release loop | 8-12 weeks | Release, change control, and revalidation become evidence-driven | Phase 4-5 |
| 6 | Full agentic SDLC | ongoing | Governed autonomy across the lifecycle | Phase 5+ |
Phase 0 — Baseline and Segmentation
Objective: Understand the current V-model implementation before changing it.
Activities:
- Map current lifecycle artifacts: intended use, requirements, architecture, verification plans, validation protocols, trace matrices, release records
- Segment products and workflows by risk and regulatory consequence
- Identify where traceability is manual, weak, or routinely backfilled
- Define autonomy red lines: high-risk approvals remain human-owned
- Select one pilot value stream
Exit criteria:
- The organization can name which lifecycle decisions will remain human-only
- The first pilot scope is explicit and bounded
- Current evidence gaps are known
Phase 1 — Specification Foundation
Objective: Turn design inputs into structured artifacts that can steer agents safely.
Activities:
- Standardize templates for intended use, user needs, system requirements, and detailed specifications
- Require every requirement to include: rationale, acceptance criteria, trace ID, risk tag, source, and owner
- Add stop criteria to major work items
- Define interface contracts and prohibited behaviors in machine-usable form
- Introduce specification review focused on ambiguity and unverifiable language
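A requirement contract with the fields listed above, plus a simple lint for unverifiable language, might look like the sketch below. The field names mirror the list; the banned-word list and `ambiguity_findings` helper are illustrative:

```python
# Sketch of a requirement record with the mandated fields and a lint
# flagging vague words an agent cannot act on. Names are illustrative.
from dataclasses import dataclass

UNVERIFIABLE = {"fast", "user-friendly", "robust", "appropriate", "etc"}

@dataclass
class Requirement:
    trace_id: str
    rationale: str
    acceptance_criteria: list
    risk_tag: str        # e.g. "high", "medium", "low"
    source: str
    owner: str

    def ambiguity_findings(self) -> list:
        """Return vague words found in the rationale, sorted."""
        words = self.rationale.lower().split()
        return sorted(w for w in UNVERIFIABLE if w in words)

req = Requirement(
    trace_id="SR-042",
    rationale="Export must be fast and robust",
    acceptance_criteria=["p95 export latency < 2s"],
    risk_tag="medium",
    source="user-need UN-7",
    owner="j.doe",
)
findings = req.ambiguity_findings()  # flags "fast" and "robust"
```

The specification review this phase introduces is, in effect, this lint run by humans: it pushes "fast" and "robust" out of the rationale and into criteria like the latency bound above.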
Exit criteria:
- A reviewer can determine whether a requirement is specific enough for an agent to act on
- Trace IDs are stable and versioned
- Ambiguous prose is being reduced before implementation starts
Phase 2 — Verification and Validation Backbone
Objective: Rebuild the right side of the V as an executable evidence system.
Activities:
- Convert verification plans into executable suites: unit tests, integration tests, policy checks, static analysis, simulation, adversarial scenarios
- Define evidence bundles for each change: trace links, diffs, test outputs, review outcomes, policy checks
- Separate verification layers: deterministic, statistical, formal, human
- Define validation protocols that remain human-led but instrumented
- Establish failure handling: explicit deviations, root-cause tagging
Exit criteria:
- Verification evidence can be regenerated, not merely asserted
- Validation records distinguish technical correctness from contextual fitness
- Teams can show which requirements are insufficiently covered by evidence
Phase 3 — Architecture and Harness Controls
Objective: Ensure agents execute inside bounded constraints.
Activities:
- Convert architecture rules into enforceable controls: domain ownership, dependency rules, tool permissions, data-access policies
- Define the agent harness: prompts, tool registry, runtime permissions, checkpointing, trace capture, evidence collection
- Create autonomy tiers by risk class
- Introduce sandboxing for agent execution
Exit criteria:
- Agents cannot bypass architectural rules through prompt interpretation alone
- Every agent action in the pilot is attributable and auditable
Phase 4 — Controlled Agent-Assisted Build and Test
Objective: Use agents in implementation and verification without breaking the quality system.
Activities:
- Start with bounded tasks: draft low-risk code, generate tests, propose trace links, summarize impact, prepare evidence packs
- Require every agent-produced change to pass the verification backbone
- Route higher-risk changes through narrower autonomy and deeper review
- Capture review outcomes as structured signals: accepted, rejected, partially accepted, policy exception, unclear spec
- Measure where agent output fails: bad decomposition, hallucinated requirements, architectural drift, weak evidence
Exit criteria:
- Agent assistance reduces cycle time on low-to-medium-risk work without reducing assurance quality
- Human reviewers focus on risk and ambiguity, not rereading every low-level step
- The pilot produces reusable evidence and lessons
Phase 5 — Integrated Agentic V-Model Release Loop
Objective: Move from isolated pilot to a governed lifecycle that closes the loop from requirement change to monitored release and revalidation.
Activities:
- Integrate generated traceability into change control and release records
- Add periodic revalidation triggers: model change, tool change, policy change, workflow change, observed drift
- Define how memory or learned agent behaviors are versioned and approved
- Connect field data, deviations, and CAPA findings back into requirement and validation updates
- Establish evidence-based release readiness
Exit criteria:
- A post-release issue can be traced to the relevant requirement, implementation, evidence, and approval path
- Revalidation triggers are explicit rather than ad hoc
- The lifecycle is closed from design input to operational learning
Phase 6 — Full Agentic SDLC
Characteristics:
- Specifications are the primary work product
- Verification is largely automated and replayable
- Validation is instrumented and human-owned
- Traceability is generated continuously
- Architecture is enforced at runtime
- Agents operate under risk-tiered autonomy
- Deviations, incidents, and revalidation update the system continuously
This is not: unrestricted autonomous change in high-risk areas, agent-written documentation without evidence linkage, replacing QMS discipline with prompt craft, or treating validation as another test suite.
Recommended Transition Sequence by Artifact
- Intended use and user needs — Tighten purpose, scope, hazards, exclusions, and success criteria.
- System and software requirements — Make them versioned, structured, and traceable.
- Verification plans — Convert to executable evidence where possible.
- Validation plans — Clarify human-led contextual validation and decision ownership.
- Architecture and design constraints — Encode boundaries, permissions, and invariants.
- Implementation workflow — Introduce harnessed agents on bounded work.
- Traceability and evidence management — Generate, do not manually reconstruct.
- Release, change control, and revalidation — Close the loop operationally.
That order is deliberate: specification first, verification second, architecture third, autonomy fourth.
Role Evolution in a V-Model Context
Quality / Validation functions
Move from document checkers to evidence-system governors. Own validation integrity, deviations, and release confidence boundaries.
System / software architects
Move from describing design to encoding enforceable constraints and approved execution zones.
Developers and technical leads
Spend more time on specification quality, interface design, evaluation design, and exception handling. Spend less time on first-draft boilerplate implementation.
Regulatory / quality leadership
Focus on where agent participation changes the assurance case: electronic records, traceability, tool qualification, approval semantics, and revalidation triggers.
Engineering leadership
Fund the evidence backbone, not just coding tools. Prevent local optimization where teams adopt agents without verification, traceability, or validation discipline.
Metrics for the Transition
Do not measure success with raw output volume. Track:
- lead time from approved specification to verified evidence bundle
- first-pass acceptance rate of agent-generated changes
- percentage of requirements with executable verification coverage
- percentage of changes with complete traceability
- deviation rate introduced by agent-assisted work vs. human-only work
- reviewer time spent on low-risk vs. high-risk changes
- revalidation effort per major change class
- total cost of correctness: inference + verification + governance overhead + incident remediation
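The last metric is a roll-up, and it helps to make the arithmetic explicit. A minimal sketch, assuming illustrative field names (this is not a standard schema — map it to whatever your finance and incident tooling actually records):

```python
from dataclasses import dataclass

@dataclass
class CorrectnessCosts:
    """Illustrative cost components for one verified outcome (currency units)."""
    inference: float             # model/API spend across generation attempts
    verification: float          # compute plus reviewer time to verify evidence
    governance_overhead: float   # traceability, approvals, audit upkeep
    incident_remediation: float  # cost of escaped defects traced to the change

def total_cost_of_correctness(c: CorrectnessCosts) -> float:
    # Sum every component: cheap generation paired with expensive
    # remediation is not an improvement, it is a cost transfer.
    return (c.inference + c.verification
            + c.governance_overhead + c.incident_remediation)
```

Compare per-outcome totals between agent-assisted and human-only baselines; comparing raw inference spend alone hides the verification and remediation terms.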
Failure Modes to Avoid
- Automating implementation before fixing requirement quality
- Treating validation as test automation
- Allowing agents to modify constraints that should be governance-controlled
- Keeping traceability manual while scaling change volume
- Granting identical autonomy to low-risk and high-risk work
- Letting model or tool changes bypass revalidation logic
- Measuring success by throughput while reviewer fatigue and deviation rates climb
Bottom Line
The V-model does not disappear in agentic engineering. It becomes more important.
But its artifacts can no longer remain passive documents. They must become active controls:
- specifications that steer machines
- architectures that constrain machines
- verification that proves what happened
- validation that confirms the work still matters
- traceability that is generated by the system itself
Organizations that already operate a mature V-model are better positioned for agentic engineering than organizations that skipped the V-model for agile. They already have the specification discipline, the verification culture, and the traceability infrastructure. What they need to add is: machine-readable specifications, evaluation portfolios that handle non-determinism, continuous observability, emergence containment, and tiered accountability.
The V-model does not become obsolete. It becomes the governance skeleton that makes autonomous execution safe.
Appendix: Mermaid Diagram
```mermaid
graph TD
    subgraph "LEFT: Specification Cascade"
        L1["1. Outcome Specifications<br/>(P1, P2)"]
        L2["2. System Specifications<br/>(P2, P3)"]
        L3["3. Agent Architecture<br/>(P3, P4, P5)"]
        L4["4. Context & Domain Design<br/>(P6, P7, P11)"]
    end
    subgraph "BOTTOM: Execution"
        B["Agent Execution<br/>(Bounded Autonomy)"]
    end
    subgraph "RIGHT: Verification Cascade"
        R6["6. Per-Agent Evaluation<br/>(P8, P9)"]
        R7["7. Cross-Agent Verification<br/>(P9, P10)"]
        R8["8. System-Level Evaluation<br/>(P10, P8)"]
        R9["9. Acceptance & Accountability<br/>(P12, P8)"]
    end
    L1 --> L2 --> L3 --> L4 --> B
    B --> R6 --> R7 --> R8 --> R9
    L1 -.->|"Traceability"| R9
    L2 -.->|"Traceability"| R8
    L3 -.->|"Traceability"| R7
    L4 -.->|"Traceability"| R6
```

Navigating organizational friction and running your first governed pilot.
Read the Manifesto for the core principles. See the Adoption Playbook for the full table of contents. See the Roles and the Human Side for the human dimension of the transition.
Navigating Resistance and Politics
The Human Side of the Transition covers the emotional and cognitive challenges individuals face. This section covers the organizational and political friction points that leaders must navigate.
The Productivity Dip
Teams will be slower before they're faster. Writing specifications is slower than writing code — at first. Building evidence gates adds overhead — at first. Reviewing agent output is harder than reviewing human code — until traces and evaluations reduce the review burden.
What to do: Set expectations explicitly at the start of the transition. Budget for a 2-4 week productivity dip per domain. Measure the dip so you can show the recovery. Protect the team from "why is velocity down?" pressure by communicating the plan to leadership in advance. This is where the acceleration trap (described in The Human Side) is most dangerous: the temptation to skip governance and reclaim velocity is strongest when the dip is visible to leadership.
Management That Wants Velocity Metrics
The manifesto explicitly argues that velocity, story points, and lines of code are the wrong metrics for agentic engineering. But management may still demand them — especially if the AI investment was justified on productivity grounds.
What to do: Don't fight the productivity narrative. Redirect it. Show that the right productivity metrics (lead time from specification to verified deployment, escaped defect rate, total cost of correctness per outcome) capture actual business value, while velocity measures raw output that may or may not produce value. Frame it as "we're measuring the thing that matters to the customer, not the thing that looks good on a slide."
The Cost Conversation
Agentic infrastructure costs money: inference costs, tooling, memory infrastructure, evaluation pipelines. The investment must be justified before results are fully proven.
What to do: Start with a narrow pilot (Step 1 in the adoption path) where costs are containable and measurable. Track total cost of correctness from day one so you can demonstrate improving economics as the pilot matures. Frame the comparison against the true cost of the status quo: escaped defects, incident remediation, technical debt accruing at machine speed.
Incentive Misalignment
If developers are still measured on lines of code, PRs merged, or tickets closed, the manifesto's values will lose to the incentive structure every time. Incentives that reward output volume punish the careful specification, verification, and governance the manifesto requires.
What to do: Align incentives with outcomes, not output. Reward defect-free deployments, specification quality (measured by agent first-pass success rate), evaluation coverage, and incident prevention. These are harder to measure than "PRs merged," but they measure what actually matters.
How to Run Your First Pilot
This pilot is designed to take your team from Phase 3 (agents executing autonomously without governance) to Phase 4 (governed delivery with evidence bundles and autonomy tiers). It maps to Steps 1-3 of the Incremental Adoption Path. Do not attempt this pilot until your team has worked through Phase 2→3: agents are executing whole tasks, your team has documented initial failure patterns, and engineers are writing specifications with acceptance criteria (even if informally).
Selecting the Pilot Domain
Choose a domain that is:
- Bounded: Clear inputs, outputs, and domain boundaries. You should be able to define what agents may and must not do without ambiguity.
- Low-to-medium risk: Not your most critical production path. A failure should be recoverable without customer impact.
- Well-tested: Existing test coverage provides a baseline for evaluating agent output quality.
- Owned by a willing team: The team should be curious, not coerced. Forced adoption produces compliance, not learning.
Good pilot domains: internal tools, test infrastructure, documentation generation, non-critical API endpoints, CI/CD pipeline improvements.
Bad pilot domains: payment processing, authentication, customer-facing decisions with legal or financial impact, and other high-blast-radius or controlled-data workflows — these are Step 5, not Step 1.
Pilot Structure
Duration: 6-8 weeks minimum. Shorter pilots don't generate enough evidence to distinguish signal from noise.
Team size: 3-5 engineers from the pilot domain, plus one operations engineer and one QA engineer. Small enough to iterate fast; large enough to test real workflows.
Scope: One domain, Tier 1 autonomy (agents analyze and propose), with evidence bundles required for every merged change.
Tooling investment: Minimal. Use existing CI/CD with added evidence gates. Do not invest in specialized agent platforms before validating the workflow.
Pilot Success Criteria
The pilot succeeds if:
- Escaped defect rate for agent-generated changes is equal to or lower than the domain's historical baseline
- Engineers can produce evidence bundles without unsustainable overhead (measure time per bundle)
- The team can articulate what worked, what didn't, and what they'd change for the next domain
- At least one specification was refined based on execution evidence (demonstrating the Agentic Loop in practice)
The pilot fails if:
- Governance overhead exceeds the value of agent output (teams spend more time on evidence than the agent saves on implementation)
- Escaped defect rate increases
- Team burnout indicators appear (review rubber-stamping, evidence bundle quality declining over time)
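The quantitative parts of these criteria can be checked mechanically at the end of the pilot. A hedged sketch — the function name, parameter names, and default thresholds are illustrative, and the qualitative criteria (team retrospective, burnout indicators) still need human judgment:

```python
def evaluate_pilot(escaped_defect_rate: float,
                   baseline_defect_rate: float,
                   avg_minutes_per_bundle: float,
                   bundle_time_budget: float = 60.0,
                   specs_refined_from_evidence: int = 0) -> list[str]:
    """Return failure reasons for the measurable pilot criteria.

    An empty list means the quantitative bar was met; it does not
    replace the retrospective. Defaults are illustrative, not normative.
    """
    reasons = []
    if escaped_defect_rate > baseline_defect_rate:
        reasons.append("escaped defect rate above historical baseline")
    if avg_minutes_per_bundle > bundle_time_budget:
        reasons.append("evidence bundle overhead unsustainable")
    if specs_refined_from_evidence < 1:
        reasons.append("no specification refined from execution evidence")
    return reasons
```

Returning reasons rather than a boolean matters: a failed pilot is a learning artifact, and the case study needs to say *which* criterion failed.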
After the Pilot
Document findings as a case study: what worked, what broke, what you'd change. Use the case study to inform the next domain's adoption. Do not generalize from one pilot — each domain has different failure surfaces.
How to measure progress and the common ways the change program fails.
Read the Manifesto for the core principles. See the Adoption Playbook for the full table of contents. See the Adoption Path for incremental steps and phase transitions.
Canonical sources. Normative principle definitions (P1–P12) are in manifesto-principles.md. Metric thresholds and alert bands in this document are heuristics — example starting bands that must be calibrated to local baseline, domain, and risk class before use. See glossary.md for canonical term definitions.
Success Metrics
Treat this manifesto as a living specification. Run pilots, publish failure analyses, measure outcomes, and revise principles based on evidence from real workflows.
Treat every threshold below as a starting baseline that must be calibrated to local review size, risk class, and domain history.
Metrics by Phase Transition
Phase 1 → 2 (focus on standardization and repeatable value):
- Number of AI-assisted tasks with documented, repeatable workflows
- Rework rate on AI-assisted outputs (how often does the human redo the AI's suggestion entirely?)
- Team coverage: percentage of engineers using approved AI tooling regularly
- Data handling incidents (sensitive data shared with unapproved models): trending toward zero; track as a security metric, not an adoption gate
Phase 2 → 3 (focus on autonomous execution quality):
- Agent task completion rate (tasks delegated vs. tasks that required human takeover mid-execution)
- Review rejection rate for agent-generated outputs
- Documented failure patterns (growing catalog indicates learning, not problems)
- Specification quality: percentage of tasks where acceptance criteria were defined before agent execution
Phase 3 → 4 (focus on governance foundation):
- Evidence bundle completeness rate (target: 100% of agent-generated changes)
- Escaped defect rate: agent-generated vs. human-generated changes
- Rollback frequency and mean time to recovery
- Time per evidence bundle (sustainability indicator)
Phase 4 → 5 (focus on scale and economics):
- Lead time from specification to verified deployment
- Total cost of correctness by domain
- Policy violation rate and resolution time
- Cross-domain evaluation coverage
Phase 5 → 6 (focus on self-improvement and containment):
- Specification convergence rate (iterations to stable acceptance criteria)
- Evaluation theater detection rate (evals that pass but miss real issues)
- Self-improvement cycle time and containment breach frequency
- Human oversight load (high-risk reviews per domain owner)
Team Health Metrics (All Phases)
- Review latency trends (rising latency may indicate review fatigue or cognitive overload)
- Approval depth (are reviewers engaging meaningfully or rubber-stamping?)
- Engineer satisfaction and burnout indicators (survey quarterly)
- Junior engineer progression rate (are juniors developing specification and evaluation skills?)
Track these alongside system health. If system metrics improve while team health metrics decline, the governance model is consuming its own foundation.
Rubber-stamping detection. Control theater — humans nominally accountable but operationally blind — is the most common governance failure at scale. Detect it quantitatively before it becomes an incident:
| Signal | Example healthy band | Example alert band | What it indicates |
|---|---|---|---|
| Median review time per agent-generated PR | 8–20 minutes | < 2 minutes | Reviewer not reading the diff |
| PR rejection rate (agent-generated) | 5–15% | < 1% | Approving without meaningful review |
| Inline comments per approved PR | 3–7 | Trending to 0 over 4 weeks | Review becoming mechanical |
| Rework rate within 1 week of merge | 1–3% | > 10% | Approved changes requiring hotfixes |
Collect these via your code review platform (GitHub, GitLab, Azure DevOps — all provide approval timestamps and comment counts via API).
These thresholds are operational heuristics calibrated from practitioner experience, not empirically validated across diverse organizations. Treat them as starting baselines and adjust based on your team's observed patterns. The alert thresholds are directional: any sustained trend toward them warrants investigation, even before a hard threshold is crossed.
Intervention protocol when thresholds breach: Do not add more reviewers. Reduce autonomy scope for that reviewer's domain until review is meaningful again. The problem is volume, not capacity. Additional reviewers at the same volume create the same rubber-stamping pattern faster.
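Two of the four signals above can be computed from nothing more than a per-PR export of review duration and approval status. A minimal sketch, assuming an illustrative record shape (map the field names from whatever your review platform's API actually returns):

```python
from statistics import median

# Example alert bands from the table above; calibrate to local baselines.
ALERT_MEDIAN_REVIEW_MINUTES = 2.0
ALERT_REJECTION_RATE = 0.01  # < 1% rejections suggests rubber-stamping

def rubber_stamp_signals(reviews: list[dict]) -> list[str]:
    """Flag rubber-stamping signals over a window of agent-generated PRs.

    Each record: {'minutes': float, 'approved': bool}. Field names are
    illustrative assumptions, not any platform's native schema.
    """
    alerts = []
    if median(r["minutes"] for r in reviews) < ALERT_MEDIAN_REVIEW_MINUTES:
        alerts.append("median review time under 2 minutes")
    rejection_rate = sum(not r["approved"] for r in reviews) / len(reviews)
    if rejection_rate < ALERT_REJECTION_RATE:
        alerts.append("rejection rate under 1%")
    return alerts
```

Run it over a rolling four-week window per reviewer domain, and treat a sustained drift toward the bands as the trigger, not only a hard breach.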
Governance Overhead Metrics
Governance infrastructure has real cost. Without efficiency metrics, it is impossible to distinguish "governance is working" from "governance is overhead with no signal." Finance and leadership will ask; measure proactively.
| Metric | Target | Alert threshold | What to do |
|---|---|---|---|
| Governance overhead as % of engineering throughput | < 15% | > 25% for two consecutive quarters | Audit which governance artifacts are actually influencing decisions; remove what isn't |
| False-positive rate on hook blocks | < 5% | > 15% | Rules are over-restrictive; refine with domain input |
| Time-to-update-governance-policy | < 2 weeks for standard changes | > 6 weeks | Governance model is too rigid; simplify change management path for low-risk policy updates |
| Incident-prevention rate attributable to governance controls | At least 1 prevented incident per quarter per active hook | Zero incidents prevented in 2 consecutive quarters | Hook may not be testing what matters; audit coverage |
| Hook false-negative rate (incidents that governance should have caught) | < 2% of total incidents | > 10% | Governance gaps; add coverage for the failure class |
Calibrate after one quarter of baseline measurement.
If governance overhead exceeds 25% of throughput with no corresponding reduction in escaped defects, that is over-governance. Reduce ceremony, increase signal. The corrective action is always the same: audit what is actually influencing decisions and cut the rest.
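That over-governance rule is mechanical enough to automate alongside the quarterly review. A sketch using the example bands above; the function name and the "stable or improving defects" proxy are assumptions to be calibrated, not a prescribed formula:

```python
def governance_verdict(overhead_pct_by_quarter: list[float],
                       escaped_defects_by_quarter: list[int]) -> str:
    """Flag over-governance: overhead above 25% of throughput for two
    consecutive quarters with no corresponding drop in escaped defects.

    Thresholds mirror the example alert bands in the table above.
    """
    over = [pct > 25.0 for pct in overhead_pct_by_quarter[-2:]]
    defects_improving = (len(escaped_defects_by_quarter) >= 2
                         and escaped_defects_by_quarter[-1]
                             < escaped_defects_by_quarter[-2])
    if len(over) == 2 and all(over) and not defects_improving:
        return "over-governance: audit and cut"
    return "within band"
```

The verdict string is only a trigger for the audit ("which artifacts actually influence decisions?"), not an automatic policy change.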
Quarterly Review Cadence
Begin formal quarterly reviews once your team reaches Phase 4 (governed delivery). At Phases 1-3, use the phase-specific metrics above in lighter-weight retrospectives. Once at Phase 4, each quarter review:
- Lead time from specification to verified deployment
- Escaped defect rate and incident severity distribution
- Rollback frequency and mean time to recovery
- Policy violation rate and evidence bundle completeness
- Human oversight load (high-risk reviews per domain owner)
- Total cost of correctness by domain
- Team health indicators
If governance overhead rises while quality and resilience do not improve, reduce control complexity and re-baseline autonomy scope.
Common Failure Modes of the Change Program
The Companion Guide covers technical failure modes (over-governance, evidence theater, control theater, etc.). This section covers failures in the organizational change process itself.
- Adoption without transition support. Leadership announces "we're doing agentic engineering" without budgeting for training, experimentation time, or the productivity dip. Engineers are expected to learn on their own time. The fix: budget explicitly for the transition — training, protected experimentation time, and a communicated plan that accounts for the dip.
- Ignoring the human cost. System metrics improve while engineers burn out. Governance load exceeds human capacity but nobody measures it. The fix: track team health alongside system health. When burnout indicators appear, reduce scope before pushing harder. See Sustainable Pace.
- Unclear ownership between platform, product, and operations teams. Nobody knows who owns agent runtime, memory governance, or evaluation registries because these infrastructure categories didn't exist before. The fix: explicit domain-owner assignments with escalation rotations, created as part of the Phase 4→5 transition.
- Premature autonomy expansion. A successful pilot in one domain leads to immediate rollout across all domains, skipping the evidence that the governance model scales. The fix: gate expansion on two consecutive quarters of stable or improving metrics in the current scope.
- Incentive-adoption mismatch. The organization adopts the manifesto's vocabulary but continues rewarding output volume (PRs merged, velocity points). Engineers learn to game the new system by producing minimal evidence bundles that satisfy the letter of the process without the spirit. The fix: align incentives with outcomes before expanding adoption. See Incentive Misalignment.
- Skipping phases. A team jumps from Phase 2 to Phase 4 because they "don't need" Phase 3's learning period. They adopt governance infrastructure without having documented the failure patterns it's supposed to catch. The fix: each phase builds prerequisites for the next. The phases are not a checklist to accelerate through — they are a learning sequence.
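The "premature autonomy expansion" fix above is a concrete gate worth encoding in the change program itself. A minimal sketch, assuming metric deltas are normalized so that a non-negative value means stable or improving (the function name and that convention are illustrative):

```python
def may_expand_autonomy(quarterly_metric_deltas: list[float]) -> bool:
    """Gate domain expansion on two consecutive quarters of stable or
    improving metrics in the current scope (delta >= 0).

    One good quarter is not evidence the governance model scales.
    """
    recent = quarterly_metric_deltas[-2:]
    return len(recent) == 2 and all(delta >= 0 for delta in recent)
```

In practice you would evaluate this per tracked metric (escaped defects, evidence completeness, oversight load) and expand only when every gate passes.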
These documents map the principles of the Agentic Engineering Manifesto to the regulatory frameworks that govern specific industries. They bridge the gap between the manifesto's domain-agnostic guidance and the concrete standards teams must satisfy in regulated environments.
These documents do not explain the regulations themselves, nor do they constitute compliance advice. They assume the reader already understands the applicable regulatory landscape and needs to see how agentic engineering practices align with it.
Disclaimer — These alignment mappings are provided for informational purposes only. They do not represent legal, regulatory, or compliance advice. Organizations must conduct their own compliance assessments with qualified professionals. Regulatory frameworks evolve; always verify against the current published standards.
Documents
| Document | Scope |
|---|---|
| Aviation | DO-178C, DO-330, DO-333, ARP 4754A, DO-326A — airborne software and systems assurance |
| Medical Devices | IEC 62304, ISO 14971, ISO 13485, FDA SaMD, EU MDR — medical device software lifecycle |
| Pharma / Life Sciences | GAMP 5, CSA, 21 CFR Part 11, EU Annex 11, ICH Q8-Q12 — pharmaceutical computerized systems |
| Financial Services | SR 11-7, DORA, EU AI Act, SOX, Three Lines of Defense — banking, insurance, capital markets |
| Automotive | ISO 26262, ASPICE, UN Regulation 157 — road vehicle functional safety and autonomous driving |
| Defense / Government | CMMC, FedRAMP, NIST SP 800-53, ITAR/EAR — government contracting and defense systems |
Cross-Cutting Themes
Several themes recur across all domains and are addressed at the manifesto level rather than in domain-specific documents:
- Independent validation as a governance principle — see Companion Principles P8
- SOUP / agent-as-tool categorization — see Companion Principles P3
- Data classification as an agent constraint — see Companion Frameworks
- ALCOA+ compliance — see Companion Frameworks
- Champion-challenger testing — see Companion Principles P8
- Fairness and bias testing — see Companion Principles P8
- Cross-domain incident classification — see Companion Patterns
- Supplier and vendor qualification — see Companion Reference
- Memory governance in regulated environments — see Companion Principles P6
- Open interoperability requirements — see Companion Principles P9
- Benchmark instability and private holdouts — see Companion Principles P8
Cross-Domain Open Regulatory Questions
The following questions are unresolved across multiple regulated domains. They represent the highest-priority areas where industry consensus, standards-body guidance, or regulatory precedent is needed. Each question links to the domain that has developed the most specific framing.
| # | Question | Domains Affected | Status |
|---|---|---|---|
| 1 | Agent-as-tool qualification: Is an AI agent SOUP (IEC 62304), an unqualified tool (DO-178C/DO-330), a GAMP Cat 3/4 system, or a new category requiring new classification frameworks? No domain has a settled answer. | All | Open — each domain uses the "treat as unqualified tool, independently verify output" pragmatic approach pending regulatory guidance |
| 2 | Model version change revalidation scope: When the underlying model is updated (e.g., model version bump by the provider), what revalidation is required? Does a minor version change trigger full re-IQ/OQ/PQ? Full independent model validation? Or only a behavioral regression test? | Medical, Pharma, Financial | Open — PCCP (FDA) partially addresses anticipated modifications but not infrastructure-level model changes |
| 3 | Memory accumulation as a change control event: At what point does accumulated learned memory constitute a change to a validated system? No domain has a threshold or methodology. | Pharma (most developed), Medical, Financial | Open — GAMP 5 open question; no regulatory body has published guidance |
| 4 | Open-source model supplier responsibility: When a deploying organization uses an open-source model with no identifiable supplier, how should GAMP 5 supplier qualification, ISO 13485 §7.4 purchasing controls, and SR 11-7 vendor model management apply? | Pharma, Medical, Financial | Open — conservative position is to assume full supplier responsibility; regulatory validation of this approach is untested |
| 5 | GDPR Art. 22 and agent-assisted decisions: When an agent produces a recommendation that a human rubber-stamps, does that constitute "solely automated decision-making" under GDPR Art. 22? The boundary between meaningful human review and rubber-stamping is undefined in regulatory guidance. | Financial, Medical, All customer-facing | Open — rubber-stamping detection metrics (see adoption-metrics.md) partially address the engineering side; the legal question is unresolved |
| 6 | Protocol and evidence portability: What level of interoperability should regulated teams require for tool invocation, agent delegation, trace export, and replay before an agent platform can be treated as operationally governable rather than vendor-bound? | All | Open — open protocols are emerging, but regulatory expectations for portability, replay, and audit export are not yet settled |
Document Structure Template
All domain documents in this directory should include the following sections. Sections may be omitted only where clearly not applicable to the domain — in which case add a brief "Not applicable for this domain: [reason]" note.
```markdown
## [Criticality/Risk Level] to Manifesto Autonomy Mapping

Map the domain's primary risk/criticality classification (DAL, safety class,
ASIL, GxP context, etc.) to manifesto autonomy tiers. This is the primary
table readers need.

## [Framework]-by-[Framework] Mapping (repeat for each major standard)

Table mapping each regulatory standard's key requirements to manifesto
principles, with Alignment (Strong/Good/Partial/Gap) and Gap description.

## SOUP / Agent-as-Tool Treatment

How the domain's software component classification framework applies to AI
agents, model dependencies, and agent-selected libraries.

## Hard Autonomy Caps

Regulatory floor caps by use case. These are not recommendations — they are
constraints. Include the regulatory citation for each cap.

## Viable Starting Points

3-6 concrete, low-risk entry points for teams beginning agentic adoption
in this domain. Each should be realistically achievable without resolving
open regulatory questions.

## Tool Configuration Notes

How to configure agent tooling (hooks, RBAC, MCP allowlists, model pinning)
to satisfy the domain's audit trail and data classification requirements.

## ALCOA+ or Equivalent Data Integrity Cross-Reference

Cross-reference to companion-frameworks.md#alcoa-alignment with any
domain-specific additions.

## Open Regulatory Questions

Unresolved questions specific to this domain. Cross-reference to the
cross-domain questions in this README where applicable.
```
Recommended Reading Path
- companion-frameworks.md — boundary conditions for regulated-industry adoption
- adoption-vmodel.md — V-model-specific adoption path for verification-heavy organizations
- Your domain document (above) — map manifesto principles to your specific regulatory framework
Mapping the Agentic Engineering Manifesto principles to aviation certification frameworks.
See companion-frameworks.md for boundary conditions on regulated-industry adoption. See adoption-vmodel.md for the V-model adoption path applicable to verification-heavy lifecycles.
Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. Autonomy tier assignment criteria are in companion-principles.md — P5. This document maps those definitions to aviation certification requirements; it does not redefine them.
Scope: DO-178C, DO-330, DO-333, ARP 4754A, ARP 4761/4761A, DO-326A, DO-356A, DO-278A.
Audience: DERs, ODA unit members, certification liaisons, software leads, and systems engineers evaluating where agentic engineering practices can operate within existing certification constraints.
Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to aviation regulatory frameworks. It does not constitute compliance or certification advice. Consult your DER, ODA, or certification authority for compliance determinations.
Regulatory currency: This document reflects DO-178C, DO-330, DO-333, ARP 4754A, ARP 4761/4761A, DO-326A, DO-356A, and DO-278A as understood at the time of last review. These standards evolve; EASA, FAA, and TCCA guidance material is updated periodically. Verify currency against official sources before relying on this content. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.
Design Assurance Level to Manifesto Autonomy Mapping
The manifesto defines three autonomy tiers (Principle 5): Tier 1 (Observe), Tier 2 (Branch), Tier 3 (Commit). The mapping below constrains the maximum permissible tier based on the failure condition severity tied to the software component's Design Assurance Level.
| DAL | Failure Condition | Max Agent Autonomy Tier | Verification Depth | Rationale |
|---|---|---|---|---|
| A | Catastrophic | Tier 1 -- Observe only | All agent output independently verified through qualified means (DO-178C Table A-1 through A-10 objectives, independence requirements) | No certification credit for unqualified tool output. Agent may analyze and propose; human authors and verifies. |
| B | Hazardous | Tier 1 -- Observe only | Independent verification required for all objectives with independence (Table A-4, A-5, A-7) | Same constraint as DAL A. Reduced objective count does not relax the independence requirement. |
| C | Major | Tier 1-2 -- Observe or Branch | Agent may draft artifacts to isolated branches; merge requires qualified human verification against applicable Table A objectives | Fewer objectives with independence. Agent-drafted code and tests are viable when independently reviewed before baseline. |
| D | Minor | Tier 1-3 -- Full tier range | Standard evidence bundles (P1) attached to each agent contribution; verification per Table A objectives | Reduced verification rigor. Agent contributions with evidence bundles can satisfy most objectives with standard review. |
| E | No Effect | Tier 1-3 -- Full tier range | Standard manifesto adoption path applies | No certification objectives apply. Normal manifesto governance is sufficient. |
Key constraint: DAL assignment is determined by the system safety assessment (ARP 4754A/4761A), not by the development team. The DAL dictates the ceiling; the team cannot raise it.
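Because the DAL sets a hard ceiling, the mapping in the table is worth encoding as a guard in agent orchestration configuration rather than leaving it to convention. A minimal Python sketch; the dictionary and function names are illustrative, not part of DO-178C or the manifesto:

```python
# Maximum agent autonomy tier per Design Assurance Level, per the
# mapping table above. Tier 1 = Observe, Tier 2 = Branch, Tier 3 = Commit.
MAX_TIER_BY_DAL = {"A": 1, "B": 1, "C": 2, "D": 3, "E": 3}

def enforce_tier_ceiling(dal: str, requested_tier: int) -> int:
    """Clamp a requested autonomy tier to the DAL ceiling.

    The system safety assessment sets the DAL; the development team
    can lower the operating tier but never raise it past the ceiling.
    """
    ceiling = MAX_TIER_BY_DAL[dal.upper()]
    return min(requested_tier, ceiling)
```

A request for Tier 3 on DAL A software silently degrades to Tier 1 (observe only); logging such clamps is a useful audit signal in its own right.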
DO-178C Objectives to Manifesto Principle Mapping
DO-178C organizes airborne software lifecycle activities into five process categories. The table below maps each to the most applicable manifesto principles.
| DO-178C Process | Key Objectives | Manifesto Principle | Alignment | Notes |
|---|---|---|---|---|
| Planning Process (Section 4) | PSAC, SDP, SVP, SCMP, SQAP | P2 -- Specifications are living artifacts | Strong | Machine-readable specifications (P2) strengthen plan-to-artifact traceability. Plans remain human-approved documents. |
| Planning Process | Standards definition, transition criteria | P5 -- Autonomy is a tiered budget | Moderate | Autonomy tiers map to plan-defined transition criteria. Agent permissions can be encoded in SDP/SVP. |
| Development Process (Section 5) | Requirements, design, coding, integration | P1 -- Outcomes are the unit of work | Strong | Evidence bundles per outcome satisfy DO-178C's requirement for traceable development output. |
| Development Process | Architecture, detailed design | P3 -- Architecture is defense-in-depth | Strong | Manifesto boundary enforcement aligns with DO-178C architectural partitioning (Section 2.4.1). |
| Development Process | Source code, integration | P4 -- Right-size the swarm | Moderate | Multi-agent coordination must preserve single-threaded configuration baselines. |
| Verification Process (Section 6) | Reviews, analyses, test cases, test procedures, test results | P8 -- Evaluations are the contract | Strong | Evaluation portfolios map directly to verification cases/procedures. Evidence bundles map to test results. |
| Verification Process | Structural coverage, requirements-based testing | P9 -- Observability covers reasoning | Strong | Trace-level observability supports structural coverage analysis and requirements-based test traceability. |
| Verification Process | Independence of verification | P12 -- Accountability requires visibility | Strong | Manifesto's accountability model requires named human ownership; DO-178C requires verification independence. Both demand separation of authoring from verification. |
| CM Process (Section 7) | Configuration identification, baselines, change control, status accounting, archival | P6 -- Knowledge and memory are infrastructure | Strong | Knowledge as versioned ground truth (P6) maps to CM identification and baseline management. |
| CM Process | Problem reporting, change review | P9 -- Observability covers reasoning | Moderate | Agent action traces provide richer change history than traditional problem reports. |
| QA Process (Section 8) | Process assurance, compliance, transition criteria | P12 -- Accountability requires visibility | Strong | QA's role as independent process watchdog parallels manifesto's accountability requirements. |
| QA Process | Standards compliance | P8 -- Evaluations are the contract | Moderate | Evaluation gates can automate portions of conformity review, but QA independence remains human-owned. |
DO-330 Tool Qualification -- The Hard Constraint
DO-330 determines when a software development tool requires qualification and at what rigor. This is the single hardest regulatory constraint for agentic engineering in aviation.
Tool Qualification Level Determination
An agent used in the development of airborne software is a development tool under DO-330. Its Tool Qualification Level (TQL) is determined by the DAL of the software it produces and whether its output errors are detectable.
| TQL | Software DAL | Error Detectability | Required Tool Development Rigor | Agent Feasibility (Current State) |
|---|---|---|---|---|
| TQL-1 | DAL A | Undetectable | Equivalent to DO-178C DAL A | Not feasible. LLMs are non-deterministic, tool requirements cannot be fully specified, and exhaustive testing is impossible. |
| TQL-2 | DAL A-B | Detectable | Equivalent to DO-178C DAL B | Not feasible. Same fundamental obstacles as TQL-1 with marginally reduced scope. |
| TQL-3 | DAL A-C | Detectable | Equivalent to DO-178C DAL C | Not feasible. Requires demonstrable tool requirements and verification. Current LLMs cannot satisfy these. |
| TQL-4 | DAL B-D | Detectable | Equivalent to DO-178C DAL D | Marginal. Possible only with extremely constrained agent scope and deterministic wrappers. |
| TQL-5 | DAL C-E | Detectable | Equivalent to DO-178C DAL E | Viable for narrow tool functions where all output is independently verified. |
The Realistic Path
Current LLMs cannot achieve TQL-1 through TQL-3 qualification. The fundamental obstacles are non-determinism, absence of specifiable tool requirements (in the DO-330 sense), and inability to demonstrate coverage or absence of anomalous behavior.
The viable approach: treat the agent as an unqualified development tool and independently verify all of its output.
DO-178C already accommodates unqualified tools -- their output simply receives no certification credit until independently verified. This is precisely the manifesto's model:
- Evidence bundles (P1) document what the agent produced and what evidence supports it.
- Evaluation portfolios (P8) provide the independent verification that replaces tool qualification credit.
- Observability traces (P9) provide the audit trail showing that verification was performed and by whom.
The agent accelerates development; verification provides the assurance credit. This is Tier 1 and Tier 2 operation by construction.
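The unqualified-tool model can be sketched as a simple gate: an agent-produced artifact earns certification credit only once an independent verification is recorded against it, and the agent's self-assessment is deliberately ignored. This is an illustrative sketch; class and field names are assumptions, not defined by DO-178C or the manifesto.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """An agent-produced artifact awaiting independent verification."""
    artifact_id: str
    agent_asserts_correct: bool = True  # agent self-assessment -- never evidence (P1)
    verifications: list = field(default_factory=list)  # (verifier, method, passed)

    def add_verification(self, verifier: str, method: str, passed: bool):
        self.verifications.append((verifier, method, passed))

def certification_credit(a: Artifact) -> bool:
    # Credit requires at least one passing independent verification;
    # the agent's own assertion is never consulted.
    return any(passed for _, _, passed in a.verifications)

art = Artifact("SRC-042")
assert not certification_credit(art)   # agent assertion alone earns nothing
art.add_verification("j.doe", "requirements-based test", True)
assert certification_credit(art)
```

The design choice mirrors the text: the `agent_asserts_correct` field exists but is unreachable from the credit decision, making "assertions are never evidence" structural rather than procedural.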
Note: This constraint may evolve. EASA and FAA have issued AI roadmaps (EASA AI Concept Paper 2.0, FAA AI Safety Assurance Framework). Certification authorities are actively developing guidance for ML-based tools. Industry groups (SAE G-34/EUROCAE WG-114) are drafting standards for ML in airborne systems. Monitor these developments.
DO-333 Formal Methods -- The Opportunity
DO-333 is the formal methods supplement to DO-178C. It provides certification credit for formal analyses that replace specific testing objectives -- making it the most natural intersection between agentic engineering and aviation certification.
Manifesto Principle 8 states: "proofs are a scale strategy." DO-333 is the certification framework that gives this statement regulatory teeth.
DO-333 Credit Categories Mapped to Manifesto
| DO-333 Credit | What It Replaces | Manifesto Formal Contracts Approach | Aviation Applicability |
|---|---|---|---|
| Formal proof of absence of runtime errors | Robustness testing objectives | Agent-generated code with formal proofs via tools like Astree, Polyspace, or Frama-C | Production precedent: Astree on Airbus A380/A350, A340 |
| Formal proof of requirements satisfaction | Requirements-based test cases (partial) | Formal contracts as machine-verifiable specifications (P2 + P8) | Applicable where requirements are formally expressible |
| Model checking of state machines | State machine testing | Agent-generated models with exhaustive state exploration | Applicable to control logic, mode management |
| Formal equivalence checking | Integration testing (partial) | Agent-generated code verified against formal reference model | Applicable to compiler/code generator qualification (CompCert precedent) |
Why This Matters for Agentic Engineering
Agent-generated code accompanied by machine-checked formal proofs can produce a stronger certification case than traditionally hand-written code with manual testing alone. The proof is the evidence, and it is independently verifiable by deterministic tools.
Production precedents exist:
- Astree -- abstract interpretation, deployed on Airbus A380/A350 flight control software, proving absence of runtime errors.
- CompCert -- formally verified C compiler, applicable to TQL arguments.
- SCADE -- qualified code generator with formal semantics, used across multiple Airbus and other airborne platforms.
The manifesto's position that "proofs are a scale strategy" is directly validated by the DO-333 credit model: formal methods scale certification evidence in ways that test-only approaches cannot.
ARP 4754A System-Level Mapping
ARP 4754A governs the system development process that produces the safety requirements and DAL assignments flowing down to DO-178C software development. Agents can assist at this level, but human accountability is absolute.
| ARP 4754A Process | Agent Role (Manifesto Alignment) | Human Accountability |
|---|---|---|
| Functional Hazard Assessment (FHA) | Agent assists with analysis: identifies failure modes from system architecture, cross-references historical FHA databases (P6 -- Knowledge). | Human owns hazard classification. FHA severity assignments require engineering judgment and regulatory agreement. |
| Preliminary System Safety Assessment (PSSA) | Agent drafts fault trees and dependency diagrams from architectural models; proposes failure rates from component databases (P1 -- Evidence bundles). | Human approves safety assessment. PSSA conclusions drive DAL allocation and must be defensible to the certification authority. |
| System Safety Assessment (SSA) | Agent generates bidirectional traceability matrices between safety requirements, design artifacts, and verification evidence (P9 -- Observability). | Human validates completeness and correctness. SSA is the final safety argument; it must be human-owned. |
| Common Cause Analysis (CCA) | Agent identifies common causes across subsystems: shared resources, environmental factors, cascading failures (P10 -- Containment). | Human approves analysis and determines acceptability of residual common-cause risk. |
| Requirements validation | Agent cross-checks system requirements against FHA/PSSA allocations for completeness and consistency (P2 -- Specifications). | Human confirms that derived requirements are correctly captured and allocated. |
| FDAL/IDAL allocation | Agent proposes allocation based on FHA severity and architectural independence arguments. | Human owns allocation decisions. FDAL/IDAL assignments are certification commitments. |
Configuration Management
DO-178C Section 7 requires configuration management with identification, baselines, traceability, change control, status accounting, and archival for all software lifecycle data.
Agent-generated artifacts are software lifecycle data and fall under the same CM requirements as human-generated artifacts. The manifesto's model supports this:
- Evidence bundles (P1) are CM items. Each bundle carries identification (trace ID, agent ID, timestamp), provenance, and linked problem reports.
- Manifesto trace model (P9) provides bidirectional traceability from specification through implementation to verification -- the same traceability DO-178C Section 7.2 requires.
- Knowledge as versioned ground truth (P6) maps to CM baseline management. Agent knowledge stores must be baselined and change-controlled alongside source code and requirements.
Agent memory (the heuristic/learned component per P6) is not a CM item unless it influences airborne software output. If it does, it must be baselined, and changes must go through problem reporting.
CM Mapping Summary
| DO-178C CM Objective (Section 7) | Manifesto Mechanism | Implementation Note |
|---|---|---|
| Configuration identification | Evidence bundle IDs (P1), trace IDs (P9) | Each agent-generated artifact carries a unique identifier linked to the agent session, model version, and prompt hash. |
| Baselines | Knowledge baseline (P6) | Agent knowledge stores and model versions are baselined alongside software baselines at each lifecycle milestone. |
| Traceability | Bidirectional trace model (P9) | Specification-to-code-to-test traceability generated by agents must be independently validated for completeness. |
| Problem reporting | Evaluation failures (P8) | Failed evaluations generate problem reports automatically. Agent-introduced defects trace back to the originating session. |
| Change control | Autonomy tier gates (P5) | Tier 2 branch-to-merge workflow enforces change control. No agent-generated change enters a baseline without human approval. |
| Release and archival | Evidence bundles (P1) | Bundles are archival-ready: self-contained, immutable, and reproducible. |
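The configuration-identification row above implies a concrete record shape. A minimal sketch, assuming field names that DO-178C Section 7 does not itself prescribe; the prompt is stored as a hash so the record stays archivable without embedding large or sensitive prompt text:

```python
import hashlib
from datetime import datetime, timezone

def cm_identifier(agent_session: str, model_version: str,
                  prompt: str, artifact_path: str) -> dict:
    """Build one configuration-identification record for an
    agent-generated artifact (illustrative field names)."""
    return {
        "artifact": artifact_path,
        "agent_session": agent_session,
        "model_version": model_version,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

rec = cm_identifier("sess-19af", "model-2026.01",
                    "Implement SW-REQ-114", "src/mode_mgr.c")
assert len(rec["prompt_sha256"]) == 64
```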
ARP 4761 / 4761A Safety Assessment
ARP 4761 (and its revision 4761A) defines the safety assessment methods that produce the failure condition classifications driving DAL assignment. Agent involvement in safety assessment activities requires particular care because errors propagate into DAL assignments and certification scope.
| Safety Assessment Method | Agent Contribution | Constraint |
|---|---|---|
| Fault Tree Analysis (FTA) | Agent drafts fault trees from system architecture models and failure mode libraries. | Human validates logical correctness, cut set analysis, and probability assignments. Automated generation must not mask missing failure modes. |
| Failure Modes and Effects Analysis (FMEA) | Agent populates FMEA worksheets from component databases, prior analyses, and architecture descriptions. | Human reviews severity classifications, detection methods, and recommended actions. Agent cannot assign severity. |
| Markov Analysis | Agent builds state transition models and computes reliability metrics. | Human validates state space completeness and transition rate assumptions. |
| Dependency Diagram Analysis | Agent generates dependency diagrams from system interconnection data. | Human validates that all relevant dependencies are captured, including latent and environmental dependencies. |
| Common Mode Analysis (CMA) | Agent cross-references design data to identify shared resources, manufacturing processes, and environmental exposures. | Human owns the determination of common mode acceptability and any required design changes. |
The manifesto's Principle 10 (containment) is directly relevant: safety assessment errors are emergent risks that compound through the certification chain. Independent human review is non-negotiable for all safety assessment outputs regardless of DAL.
Airworthiness Security (DO-326A / DO-356A)
DO-326A establishes the airworthiness security process; DO-356A provides the information security supplement. Agentic engineering introduces specific threat vectors that must be addressed in the Security Risk Assessment.
Manifesto Alignment
| Security Concern | Manifesto Mapping | Aviation-Specific Consideration |
|---|---|---|
| Agent data access scope | P10 -- Containment; P3 -- Defense-in-depth | Agents must not have access to airborne software beyond their authorized development scope. Network isolation and data classification enforcement apply. |
| Supply chain integrity of agent models | P3 -- Architecture boundaries | Model provenance, integrity verification, and version control. Untrusted model updates are a supply chain attack vector. |
| Prompt injection / adversarial input | P10 -- Containment | Adversarial inputs to development agents could introduce subtle vulnerabilities in airborne code. Independent verification (DO-330 unqualified tool path) is the mitigation. |
| Data exfiltration via agent context | P7 -- Context is engineered | Agent context windows may contain export-controlled technical data; context assembly must enforce data classification and export-control boundaries. |
Export Control (ITAR/EAR)
Airborne software, particularly defense-related avionics, is frequently subject to ITAR (22 CFR 120-130) or EAR (15 CFR 730-774) restrictions. Agents that process ITAR/EAR-controlled technical data must operate within compliant infrastructure: no data transmission to non-compliant cloud endpoints, no model training on controlled data without authorization, and access controls consistent with Technology Control Plans.
DO-278A -- Ground-Based Software
DO-278A governs software for ground-based CNS/ATM systems. It is structurally similar to DO-178C but uses Assurance Levels (AL 1-6) rather than DALs and applies to a lower-criticality domain overall.
DO-278A is a strong candidate for earlier agentic adoption:
| DO-278A Assurance Level | Equivalent Rigor | Agent Autonomy Ceiling |
|---|---|---|
| AL-1 | Comparable to DAL A | Tier 1 |
| AL-2 | Comparable to DAL B | Tier 1 |
| AL-3 | Comparable to DAL C | Tier 1-2 |
| AL-4 | Comparable to DAL D | Tier 1-3 |
| AL-5 | Below DAL D | Tier 1-3 |
| AL-6 | Below DAL E | Tier 1-3 |
The same DO-330 tool qualification constraints apply. The path is identical: unqualified tool with independent verification of all output.
ALCOA+ Compliance
Aviation configuration management (DO-178C Section 7) requires data integrity standards that parallel ALCOA+ requirements. The manifesto's evidence model satisfies these by construction. See Companion Frameworks — ALCOA+ Alignment for the complete mapping table.
For aviation-specific application:
- Configuration identification maps to ALCOA+ "Attributable" and "Original": every agent-generated artifact carries agent identity, model version, session ID, and prompt hash.
- Baselines map to "Contemporaneous" and "Enduring": evidence bundles are captured at execution time and retained as immutable CM items.
- Problem reporting maps to "Accurate" and "Complete": evaluation failures generate problem reports that are traceable and cannot be silently suppressed.
Practical constraint: for DO-178C programs, the trace infrastructure is a development tool and must be addressed in the PSAC. Conservative framing: describe it as an internal tooling component with documented version control, not as a tool requiring TQL qualification.
Market-Specific Autonomy Guidance
The table below maps aviation workflows to recommended autonomy tiers. The DAL-based ceiling in the first section of this document applies; this table adds workflow-level context.
These are conservative caps for safety-relevant software paths; lower-risk supporting tooling may have different constraints.
| Workflow | DAL / Assurance Level | Recommended Autonomy | Notes |
|---|---|---|---|
| Airborne software — critical paths (flight control, engine control) | DAL A/B | Tier 1 (observe only) | Agent may analyze, draft, and propose. All output independently verified by qualified personnel. TQL-1/2 tool qualification is not feasible under current evidence and qualification expectations; treat the agent as an unqualified tool pending authority review. |
| Airborne software — major functions | DAL C | Tier 1-2 | Agents draft to isolated branches. Merge requires qualified review against applicable Table A objectives. |
| Airborne software — minor / no-effect functions | DAL D/E | Tier 1-3 | Standard evidence bundles satisfy reduced verification objectives. Natural pilot domain. |
| Ground support equipment (GSE) software | Typically not DO-178C scope | Tier 1-3 | Normal manifesto adoption applies. Confirm applicability of DO-178C to specific GSE. |
| Ground-based CNS/ATM software (DO-278A) | AL-3 to AL-6 | Tier 1-3 (AL-3 ceiling: Tier 1-2) | Lower assurance levels; natural early adoption domain. Same DO-330 path applies. |
| Test generation and requirements analysis | Any DAL — tool output only | Tier 1 (observe) | Agent operating at Tier 1 generates candidate test cases, traceability matrices, and coverage analyses. Qualified staff review and accept. No tool qualification required. |
| Safety assessment (FHA, FMEA, FTA) | N/A — feeds DAL assignment | Tier 1 (observe only) | Errors propagate into DAL and certification scope. Independent human review non-negotiable for all safety assessment outputs regardless of DAL. |
| Traceability and evidence package assembly | Any DAL | Tier 1-2 | High value, low risk. Agent assembles; human validates completeness. Strong ALCOA+ alignment. |
Tool Configuration Notes
How to configure agent tooling to satisfy DO-178C traceability requirements. Read alongside your enterprise configuration guide.
Configuration Management Hook Mapping
DO-178C Section 7 requires that all software lifecycle data is identified, baselined, and change-controlled. Agent configuration contributes to this:
| DO-178C CM Objective | Hook Type | What It Produces |
|---|---|---|
| Configuration identification of agent artifacts | PostToolUse audit hook | Artifact ID, agent session ID, model version, timestamp |
| Change control — agent-modified files | PreToolUse gate hook | Review record, autonomy tier at time of change |
| Problem reporting — failed evaluations | PostToolUse evaluation hook | Evaluation failure record with trace ID |
| Archival and retention | SessionEnd archive hook | Immutable session record in the CM repository |
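A PostToolUse audit hook from the table above might look like the following sketch. The event and session payload shapes are assumptions about the hook interface, not a documented API; the point is that each file-modifying tool call emits one CM identification record:

```python
from datetime import datetime, timezone

def post_tool_use_audit(event: dict, session: dict) -> dict:
    """PostToolUse audit hook: emit a CM identification record for one
    file-modifying tool call. Payload field names are illustrative."""
    return {
        "objective": "configuration-identification",
        "artifact": event["file_path"],
        "tool": event["tool_name"],
        "agent_session": session["session_id"],
        "model_version": session["model_version"],
        "autonomy_tier": session["autonomy_tier"],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

record = post_tool_use_audit(
    {"tool_name": "edit_file", "file_path": "src/fuel_calc.c"},
    {"session_id": "sess-07", "model_version": "m-2026.01", "autonomy_tier": 2},
)
assert record["artifact"] == "src/fuel_calc.c"
```

In practice the returned record would be appended to an immutable audit store rather than held in memory.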
Export Control Enforcement (ITAR/EAR)
For programs with ITAR/EAR-controlled technical data, the MCP allowlist (Layer 6 in enterprise configuration) is the primary data residency control:
- Restrict MCP servers to on-premises or US-person-accessible endpoints only.
- No external API calls for sessions containing ITAR-controlled design data.
- Log all tool calls with data classification context for Technology Control Plan compliance.
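The allowlist check itself is deliberately simple: reject any MCP server whose hostname is not on the approved list. Hostnames below are hypothetical placeholders for a program's Technology Control Plan endpoints:

```python
from urllib.parse import urlparse

# Hypothetical allowlist of on-premises, TCP-approved endpoints.
APPROVED_HOSTS = {"mcp.internal.example.com", "tools.onprem.example.com"}

def endpoint_allowed(server_url: str) -> bool:
    """Layer-6 style data residency check: permit only allowlisted hosts."""
    return urlparse(server_url).hostname in APPROVED_HOSTS

assert endpoint_allowed("https://mcp.internal.example.com/sse")
assert not endpoint_allowed("https://api.external-vendor.com/mcp")
```

Deny-by-default matters here: an unrecognized endpoint is treated as non-compliant rather than merely unlogged.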
Model Version Pinning for Certification Stability
Pin agent model versions during active certification programs:
- During DER/ODA review periods
- While PSAC or SCI is open
- After any verification baseline has been established
Model version changes affecting agent behavior should be documented as CM changes and assessed for impact on previously verified artifacts.
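A pin check can make this mechanical: compare the runtime model against the version recorded when the verification baseline was established, and surface any mismatch as a CM change. Version strings here are illustrative:

```python
PINNED_MODEL = "model-2026.01.15"  # hypothetical pin recorded at baseline

def check_model_pin(runtime_model: str, pinned: str = PINNED_MODEL) -> str:
    """Return 'ok' if the runtime model matches the certification pin;
    otherwise flag a CM change requiring impact assessment."""
    if runtime_model == pinned:
        return "ok"
    return (f"CM change: model moved from {pinned} to {runtime_model}; "
            f"assess impact on previously verified artifacts")

assert check_model_pin("model-2026.01.15") == "ok"
assert check_model_pin("model-2026.03.02").startswith("CM change")
```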
Viable Starting Points
Not all aviation software carries equal certification burden. The following are realistic entry points for agentic engineering practices today:
DAL D/E software development. Reduced verification objectives, fewer independence requirements. Evidence bundles and evaluation gates provide sufficient assurance credit with standard review.
Ground support equipment (GSE) software. Often not subject to DO-178C at all. Standard manifesto adoption applies.
Test generation and requirements analysis automation. Agents operating at Tier 1 (Observe) to generate candidate test cases, requirements traceability matrices, and coverage analyses. Output is reviewed and accepted by qualified personnel -- no tool qualification required.
Traceability automation and evidence bundle assembly. Agent-assembled traceability data and certification evidence packages. Human validates completeness. High-value, low-risk application.
Formal proof assistance (DO-333 credit). Agents generate proof obligations or proof scripts for formal verification tools. The tool (Astree, Frama-C, etc.) provides the deterministic verification. Agent output is checked by the prover, not by human review alone.
DO-278A AL-4 through AL-6 systems. Lower assurance levels with proportionally reduced verification burden. Natural pilot domain.
Open Regulatory Questions
The following questions do not have settled answers as of this writing. Organizations should track developments from FAA, EASA, SAE G-34, and EUROCAE WG-114.
Certification authority stance on agent-generated airborne software. No published policy exists specifically addressing LLM-generated code in DO-178C certification. Current guidance is interpreted through existing tool qualification (DO-330) frameworks.
Issue Paper likelihood. Novel technologies in certification programs typically trigger FAA Issue Papers or EASA Certification Review Items (CRIs). An agentic development approach in a DAL A-C program should anticipate this.
PSAC framing. How to describe agentic engineering practices in the Plan for Software Aspects of Certification without triggering unnecessary concern. Framing agents as unqualified development tools with independent verification is the current pragmatic approach.
Tool qualification evolution for AI-based tools. SAE G-34/EUROCAE WG-114 are developing AS6983/ED-324 (ML in airborne systems) and related guidance. Future standards may provide a path to qualified AI-based development tools that does not exist today.
Multi-model supply chain. When multiple models (routing per P11) are used in a development workflow, the tool qualification and CM implications compound. No guidance exists for multi-model development tool chains.
Memory and learned behavior in development tools. If an agent's learned memory (P6) influences airborne software output, does that memory become lifecycle data under DO-178C Section 7? The conservative position is yes.
Mapping the Agentic Engineering Manifesto to medical device regulatory frameworks.
Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to medical device regulatory frameworks. It does not constitute compliance or regulatory advice. Consult qualified regulatory and quality professionals for compliance determinations.
Regulatory currency: This document reflects IEC 62304, EU MDR 2017/745, FDA 21 CFR Part 820 (QMSR, effective February 2026, replacing the prior QSR), and EU AI Act requirements as understood at the time of last review. The EU AI Act implementation timeline is subject to ongoing guidance and proposed amendments; verify current status at eur-lex.europa.eu before relying on AI Act classifications in this document. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.
See also: Companion Frameworks (boundary conditions, ALCOA+ mapping), Agentic V-Model (V-model lifecycle transition for regulated industries).
Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to medical device regulatory requirements; it does not redefine them.
IEC 62304 Safety Class to Manifesto Autonomy Mapping
IEC 62304 safety classification determines documentation depth, verification rigor, and -- in this mapping -- the permissible agent autonomy ceiling. If IEC 62304 is revised, re-evaluate the class mapping rather than assuming the current three-class structure is permanent; until any update is published, map conservatively to the three-class model.
| Safety Class | Risk Level | Max Agent Autonomy | Documentation Depth | Evidence Bundle Requirements |
|---|---|---|---|---|
| Class A (no injury) | Negligible | Tier 1-3 (P5) for non-safety-critical software items; full agentic loop remains subject to the device's risk controls and use-case constraints. | Minimal: requirements + release documentation. | Standard evidence bundles per manifesto phase. |
| Class B (non-serious injury) | Moderate | Tier 1-2 (P5). Agents propose; humans approve merges. | Moderate: architecture + integration testing required. | Enhanced bundles with SOUP risk analysis per item. |
| Class C (death / serious injury) | High | Tier 1 only (P5). Agents analyze and propose; humans implement. | Full: detailed design + unit-level verification required. | Complete bundles with SOUP verification, unit-level trace, formal risk linkage. |
Notes:
- Autonomy ceilings are conservative defaults. Organizations may justify narrower or wider bounds through documented risk-benefit analysis.
- Class C Tier 1 restriction means agents assist with analysis, traceability matrix generation, and test scaffolding -- not code generation for safety-critical paths.
- If the 2026 IEC 62304 update merges Class A and B into a single class, re-evaluate the Tier 1-2 boundary for the merged class based on the updated documentation requirements.
- Evidence bundle requirements scale with safety class. Class C bundles must include unit-level traceability from requirement through design, implementation, and verification -- satisfying IEC 62304 Clause 5.6 in full.
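The Class C requirement for unit-level traceability can be checked mechanically before a bundle is released. A sketch, assuming a simple link-per-stage chain (link names are illustrative, not from IEC 62304):

```python
REQUIRED_LINKS = ("requirement", "design", "implementation", "verification")

def trace_complete(chain: dict, safety_class: str) -> bool:
    """Check unit-level trace completeness for an evidence bundle entry.
    Class C requires every link; this sketch relaxes Class A/B to
    requirement and verification only."""
    needed = (REQUIRED_LINKS if safety_class == "C"
              else ("requirement", "verification"))
    return all(chain.get(link) for link in needed)

chain = {"requirement": "REQ-12", "design": "DD-12.3",
         "implementation": "unit_dose.c", "verification": "UT-12-07"}
assert trace_complete(chain, "C")
assert not trace_complete({"requirement": "REQ-12",
                           "verification": "UT-12-07"}, "C")
```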
IEC 62304 Software Lifecycle to Manifesto Mapping
| IEC 62304 Activity | Clause | Manifesto Equivalent | Principle | Alignment | Gap |
|---|---|---|---|---|---|
| Software development planning | 5.1 | Specification + Plan phases of Agentic Loop | P2, P5 | Strong. Living specifications exceed static plans. | Plans must be frozen at submission; manifesto assumes evolution. Snapshot mechanism needed. |
| Software requirements analysis | 5.2 | Specify phase; machine-readable specs | P2 | Strong. Machine-readable specs satisfy traceability. | Requirements must include safety requirements traced to risk analysis (ISO 14971 linkage). |
| Software architectural design | 5.3 | Design phase; domain boundaries (P3) | P3 | Strong. Enforced boundaries map to software items. | Architecture must decompose to software items with assigned safety classes. |
| Software detailed design | 5.4 | Design phase (Class C depth) | P3 | Partial. Manifesto does not mandate unit-level design docs. | Class B/C require detailed design for each software unit. Agents can generate but humans must verify. |
| Unit implementation | 5.5 | Execute phase | P4, P5 | Partial. Agent execution replaces human coding. | Agent-as-tool qualification is unresolved (see Open Questions). |
| Unit verification | 5.6 | Verify phase; evaluation portfolio (P8) | P8 | Strong. Evaluation gates exceed minimal unit test requirements. | Must include static analysis, code review equivalent, and SOUP verification. |
| Integration and integration testing | 5.7 | Verify phase; integration evaluations | P8, P9 | Strong. Traces reconstruct cross-component interactions. | Integration must verify software item interfaces per architectural design. |
| System testing | 5.8 | Validate phase | P1, P8 | Strong. Outcome-based validation aligns directly. | System tests must trace to software requirements (5.2). |
| Software release | 5.9 | Govern phase; release evidence bundle | P12 | Strong. Evidence bundles with accountability satisfy release criteria. | Release must include version identification, known anomalies, and SOUP list. |
| Software maintenance | 5.10 | Learn + Govern phases; living specifications | P2, P6 | Strong. Continuous loop exceeds reactive maintenance. | Problem and modification analysis must follow change control procedures. |
ISO 14971 Risk Management to Manifesto Mapping
| ISO 14971 Element | Clause | Manifesto Mechanism | Alignment |
|---|---|---|---|
| Intended use / reasonably foreseeable misuse | 4.2-4.3 | Specification scope (P2); boundary enforcement (P3) | Strong. Machine-enforced boundaries prevent foreseeable misuse categories. |
| Hazard identification | 5.2 | Adversarial testing (P8); chaos testing (P10) | Moderate. Manifesto identifies runtime hazards; clinical hazards require domain expertise outside agent scope. |
| Risk estimation | 5.4 | Observability data (P9); incident attribution (P12) | Moderate. Runtime data informs probability estimation; severity requires clinical judgment. |
| Risk evaluation | 5.5 | Autonomy tiering (P5); blast-radius limits | Moderate. Risk-based autonomy is philosophically aligned; acceptability criteria require manufacturer determination. |
| Risk control | 6 | Defense-in-depth (P3); deterministic wrappers; evaluation gates (P8) | Strong. Layered controls (wrappers + evaluations + observability) map to inherent safety, protective measures, and information for safety. |
| Residual risk evaluation | 7 | Evidence bundles; evaluation portfolio completeness | Partial. Manifesto does not explicitly model residual risk acceptance. Requires human risk-benefit judgment. |
| Production and post-production information | 8 | Observe + Learn phases; telemetry (P9) | Strong. Continuous observability exceeds traditional post-market surveillance data collection. |
ISO/TS 24971-2 (ML-specific risk management): Extends ISO 14971 for ML-based medical devices. Key additions relevant to agentic systems:
- Data quality risk: training and inference data quality directly affects agent output quality. Manifesto's context engineering (P7) addresses data curation but does not prescribe medical-device-specific data quality metrics.
- Model drift monitoring: the manifesto's Observe phase and evaluation regression gates (P8) detect drift. ISO/TS 24971-2 requires drift to feed back into the risk management file.
- Performance degradation detection: continuous evaluation portfolios satisfy this requirement when evaluation thresholds are calibrated to clinically meaningful performance boundaries.
- Uncertainty quantification: ISO/TS 24971-2 expects ML systems to characterize output uncertainty. The manifesto does not mandate uncertainty quantification but its evaluation framework can incorporate it.
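A drift-detecting regression gate of the kind described above can be as simple as a rolling mean against a calibrated floor. Threshold calibration to clinically meaningful boundaries remains the manufacturer's responsibility; the numbers below are illustrative only:

```python
def regression_gate(scores: list, threshold: float, window: int = 3) -> str:
    """Flag drift when the rolling mean of recent evaluation scores
    falls below a clinically calibrated threshold (illustrative)."""
    recent = scores[-window:]
    mean = sum(recent) / len(recent)
    if mean < threshold:
        return "drift: feed back into risk management file"
    return "within bounds"

assert regression_gate([0.95, 0.94, 0.96, 0.95], threshold=0.90) == "within bounds"
assert regression_gate([0.95, 0.88, 0.85, 0.84], threshold=0.90).startswith("drift")
```

Note the ISO/TS 24971-2 framing: the gate's output is not just a CI signal but an input to the risk management file.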
ISO 13485 QMS to Manifesto Mapping
| ISO 13485 Requirement | Clause | Manifesto Mechanism | Notes |
|---|---|---|---|
| Design input | 7.3.3 | Specifications (P2); machine-readable requirements | Specs must include applicable regulatory requirements, standards, and risk control outputs. |
| Design output | 7.3.4 | Evidence bundles; verified artifacts | Outputs must reference design input requirements and include acceptance criteria. |
| Design review | 7.3.5 | Govern phase; human accountability (P12) | Named domain owner reviews at each design stage. Agent-generated artifacts are inputs to review, not substitutes. |
| Design verification | 7.3.6 | Verify phase; evaluation portfolio (P8) | Evaluation results serve as verification records when traced to design inputs. |
| Design validation | 7.3.7 | Validate phase; outcome-based acceptance (P1) | Validation must occur under defined use conditions. Simulated environments require justification. |
| Design transfer | 7.3.8 | Release evidence bundle; deployment records | Transfer procedures must ensure design outputs are verified before manufacturing. |
| Document control | 4.2.4 | Versioned specifications (P2); immutable evidence bundles | Manifesto versioning satisfies document control if retention and approval workflows are formalized. |
| Traceability | 7.5.9 | Trace infrastructure (P9); specification-to-outcome links | Structured traces exceed typical traceability matrices. Must extend to UDI and device identification. |
| CAPA | 8.5.2-3 | Learn phase; incident-driven specification updates | Manifesto's "failures are data" philosophy aligns. CAPA records must follow prescribed timelines and formats. |
| Management review | 5.6 | Govern phase; accountability (P12) | Requires periodic QMS effectiveness review. Manifesto governance is continuous but must produce discrete review records. |
| Purchasing controls | 7.4 | SOUP management; agent-selected dependencies | Supplier qualification applies to SOUP items. Agent-selected dependencies must go through purchasing/supplier evaluation. |
SOUP / Agent-as-Tool in Medical Device Context
SOUP Requirements by Safety Class
| Requirement | Class A | Class B | Class C |
|---|---|---|---|
| SOUP identification | Required | Required | Required |
| SOUP risk analysis | -- | Required | Required |
| Published anomaly list review | -- | Required | Required |
| SOUP functional/performance requirements | -- | Required | Required |
| SOUP verification (detailed) | -- | -- | Required |
| SOUP qualified via testing | -- | Recommended | Required |
AI Model as SOUP
In agentic engineering, the AI model exhibits SOUP characteristics that exceed traditional SOUP assumptions:
- Non-deterministic: identical inputs may produce different outputs across invocations, violating the implicit SOUP assumption of repeatable behavior.
- Version-dependent: model updates change behavior without explicit changelogs, making published anomaly list review impractical.
- Opaque anomaly list: failure modes cannot be enumerated a priori; the "published anomaly list" for a foundation model is effectively unbounded.
Agent-Selected Dependencies as SOUP Decisions
When agents select libraries, frameworks, or code patterns during execution, each selection is a SOUP decision that must be captured and evaluated. The manifesto's trace infrastructure (P9) records these selections but does not automatically trigger SOUP evaluation workflows.
Training-Data Patterns as Implicit SOUP
Agent-generated code may incorporate patterns, algorithms, or architectural decisions derived from training data. These constitute implicit SOUP -- code of unknown provenance embedded without explicit dependency declaration.
Manifesto Response
Treat the agent as an unqualified tool. Independently verify all agent output through the evaluation portfolio (P8) and human review (P12). This is consistent with the manifesto's position that agent assertions are never evidence -- only verified outcomes count (P1).
Practical implication: for Class B and C devices, every agent execution that produces deliverable artifacts must include a SOUP impact assessment in the evidence bundle. This assessment identifies any new dependencies introduced, any training-data-derived patterns detected (where feasible), and confirms that independent verification was performed on the output.
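As an illustration of what such an assessment entry might look like inside an evidence bundle, here is a minimal sketch. The schema and field names are hypothetical, not a prescribed format:

```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class SoupImpactAssessment:
    """Hypothetical evidence-bundle entry for one agent execution (Class B/C)."""
    execution_trace_id: str       # links back to the P9 trace
    new_dependencies: List[str]   # libraries/frameworks the agent introduced
    patterns_flagged: List[str]   # training-data-derived patterns detected, where feasible
    independent_verification: bool  # confirmed by P8 evaluation + P12 review, never by the agent
    reviewer: str = ""            # named human owner (P12)

    def is_complete(self) -> bool:
        # The assessment only gates the bundle when verification is
        # confirmed by a named human reviewer.
        return self.independent_verification and bool(self.reviewer)

assessment = SoupImpactAssessment(
    execution_trace_id="trace-0001",
    new_dependencies=["numpy==1.26.4"],
    patterns_flagged=[],
    independent_verification=True,
    reviewer="j.doe",
)
```

A bundle validator would reject any execution record whose assessment fails `is_complete()`, forcing the human-ownership chain before DHF entry.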
See Companion Frameworks -- Boundary Conditions for SOUP treatment in the cross-cutting regulated-industry guidance.
FDA SaMD / GMLP / PCCP
Predetermined Change Control Plan (PCCP)
The FDA PCCP framework for AI/ML-based SaMD requires a pre-specified plan for anticipated modifications. The manifesto's living specifications (P2) and continuous revalidation triggers align structurally:
| PCCP Element | Manifesto Mechanism |
|---|---|
| Description of anticipated modifications | Living specifications with versioned change categories (P2) |
| Modification protocol (implementation, V&V) | Agentic Loop: Execute, Verify, Validate phases with evidence gates |
| Real-world performance monitoring plan | Observe + Learn phases; telemetry and drift detection (P9) |
| Revalidation triggers | Evaluation regression gates (P8); specification change triggers re-verification |
Gap: PCCP requires pre-submission of the change control plan. The manifesto's continuous evolution must be bounded by the approved PCCP scope for marketed SaMD.
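The "revalidation triggers" row can be pictured as an evaluation regression gate that blocks promotion whenever scores fall below the PCCP-approved baseline. A minimal sketch; the metric names and tolerance are assumptions:

```python
def regression_gate(baseline: dict, current: dict, tolerance: float = 0.0) -> list:
    """Return the evaluation metrics that regressed beyond tolerance.

    An empty list means the gate passes; any entry triggers re-verification
    per the PCCP modification protocol. Metric names are hypothetical.
    """
    regressed = []
    for metric, base_score in baseline.items():
        if current.get(metric, 0.0) < base_score - tolerance:
            regressed.append(metric)
    return regressed

baseline = {"sensitivity": 0.95, "specificity": 0.92}
current = {"sensitivity": 0.96, "specificity": 0.88}
failures = regression_gate(baseline, current, tolerance=0.01)  # -> ["specificity"]
```

The gate is deliberately asymmetric: improvements never trigger it, but any regression beyond tolerance does, matching the P8 "evaluations as contract" stance.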
GMLP Principles to Manifesto Mapping
| GMLP Principle | Manifesto Principle | Alignment |
|---|---|---|
| Multi-disciplinary expertise | Right-sized swarm (P4); human domain ownership (P12) | Strong |
| Good software engineering practices | Architecture (P3); evaluations (P8) | Strong |
| Clinical association and scientific validity | Outside manifesto scope | Gap -- requires clinical expertise |
| Data quality assurance | Context engineering (P7); knowledge governance (P6) | Moderate |
| Data management and relevance | Memory curation (P6); versioned data (P7) | Moderate |
| Computational and statistical rigor | Evaluation portfolio (P8); formal verification | Strong |
| Study design transparency | Observability (P9); evidence bundles | Strong |
| Performance assessment across subgroups | Adversarial evaluations (P8) | Moderate |
| Independent datasets for testing | Evaluation design practice | Moderate -- not explicitly mandated |
| Monitoring and retraining | Observe + Learn phases (P9, P6) | Strong |
Total Product Lifecycle (TPLC)
The FDA TPLC approach for AI/ML SaMD maps directly to the Agentic Loop. Both assume continuous monitoring, learning, and modification rather than a single pre-market snapshot.
| TPLC Stage | Agentic Loop Phase | Notes |
|---|---|---|
| Planning and development | Specify, Design, Plan | Manifesto specifications serve as the SaMD development plan. |
| Verification and validation | Execute, Verify, Validate | Evidence bundles document V&V activities per PCCP scope. |
| Deployment and monitoring | Observe, Learn | Real-world performance monitoring feeds back into specifications. |
| Modification and revalidation | Govern, Specify (repeat) | PCCP-scoped modifications trigger re-entry into the loop. |
The manifesto's loop (Specify-Design-Plan-Execute-Verify-Validate-Observe-Learn-Govern) is a superset of the TPLC cycle. The key constraint: TPLC modifications outside the approved PCCP scope require new regulatory submissions.
EU MDR + AI Act Dual Compliance
Many devices classified IIa or higher under the EU MDR that incorporate AI will also trigger high-risk AI obligations under the EU AI Act, though the exact classification depends on intended purpose and the applicable AI Act annexes. This typically creates dual compliance obligations.
| Requirement Source | Requirement | Manifesto Principle | Notes |
|---|---|---|---|
| AI Act Art. 10 | Data governance | P6 (Knowledge/Memory), P7 (Context) | Training, validation, and testing datasets must meet quality criteria. Manifesto's data curation aligns but must be formalized per Annex IV. |
| AI Act Art. 13 | Transparency | P9 (Observability) | Traces and decision reconstruction satisfy transparency requirements. Must include user-facing documentation per AI Act format. |
| AI Act Art. 14 | Human oversight | P5 (Autonomy tiers), P12 (Accountability) | Tiered autonomy with named human owners directly satisfies human oversight requirements. |
| AI Act Art. 15 | Accuracy, robustness, cybersecurity | P8 (Evaluations), P10 (Containment) | Evaluation portfolios and chaos testing address accuracy/robustness. Cybersecurity requires supplementary assessment. |
| MDR Annex I, Ch. I | General safety and performance | P1 (Outcomes), P3 (Architecture) | Risk-based design with verified outcomes. Clinical performance outside manifesto scope. |
| MDR Annex II | Technical documentation | P2 (Specifications), P9 (Observability) | Versioned specs + structured traces produce technical documentation artifacts. Format must comply with MDCG guidance. |
| MDR Art. 83-86 | Post-market surveillance / vigilance | P9 (Observability), Learn + Govern phases | Continuous observability exceeds minimum PMS requirements. Vigilance reporting timelines are regulatory obligations outside manifesto scope. |
Notes:
- Class IIa and higher devices with AI safety components qualify as high-risk AI systems under AI Act Article 6(1) via Annex I, Section A, because they are subject to third-party conformity assessment under the MDR. No separate risk classification is needed on the AI Act side.
- Notified bodies must assess both MDR and AI Act conformity. A single evidence bundle strategy that satisfies both regimes reduces audit burden. The manifesto's evidence model is designed for this consolidation.
- AI Act conformity assessment may be integrated into the MDR conformity assessment procedure. Manufacturers should plan for a single, unified technical file that addresses both sets of requirements.
- AI Act Article 9 (risk management) overlaps significantly with ISO 14971. A single risk management file can serve both regimes if it addresses AI-specific risks (bias, drift, opacity) alongside device-level hazards.
Clinical Evidence Boundary
Clinical evaluation (EU MDR Article 61), post-market clinical follow-up (PMCF), and benefit-risk determination are explicitly outside agent scope. These require clinical domain expertise, investigator judgment, and regulatory strategy that agents cannot provide.
Agents may assist with:
- Traceability matrix generation between requirements and clinical evidence
- Evidence assembly and formatting for clinical evaluation reports
- Statistical analysis of post-market surveillance data
- Literature search and screening for clinical evaluation
Agents must NOT:
- Make clinical judgments or risk-benefit determinations
- Generate clinical evidence claims or conclusions
- Determine clinical investigation endpoints or study design
- Assess clinical significance of post-market data
ALCOA+ Compliance
The manifesto's evidence model satisfies ALCOA+ data integrity requirements by construction. See Companion Frameworks -- ALCOA+ Alignment for the complete mapping table.
For medical device applications, this means evidence bundles produced through governed agentic delivery inherently meet the data integrity expectations of ISO 13485 record-keeping and FDA 21 CFR Part 820 quality system requirements, provided the underlying trace infrastructure is validated.
Key implementation note: the trace infrastructure itself is a computerized system subject to validation under 21 CFR Part 11 / Annex 11. Organizations must validate the evidence capture pipeline before relying on it for regulatory records. The manifesto's observability requirements (P9) provide the functional specification for this validation.
Market-Specific Autonomy Guidance
The IEC 62304 safety class mapping at the top of this document defines the regulatory ceiling. This table adds workflow-level context for common medical device development activities.
| Workflow | Safety Class / Risk | Recommended Autonomy | Key Constraint |
|---|---|---|---|
| SaMD — patient-facing clinical decision output | Class C (IEC 62304); High-risk (EU AI Act) | Tier 1 (observe only) | Agent assists analysis; human clinician or qualified reviewer owns every output affecting patient care. |
| SaMD — Class B device software | Class B | Tier 1-2 | Agents draft to isolated branches. Enhanced evidence bundles with SOUP risk analysis. |
| Class A device software and tooling | Class A | Tier 1-3 | Full agentic loop permissible. Standard evidence bundles. Natural pilot domain. |
| Test generation and requirements traceability | Any class — tool output | Tier 1 (observe) | Agent generates candidate tests and traceability matrices. Qualified personnel review before entry into the DHF/DMR. |
| Post-market surveillance data analysis | Post-market | Tier 1-2 | Agents analyze vigilance data, identify signals, draft initial assessments. Clinical significance determination remains human-owned. |
| Clinical evidence assembly and formatting | Pre-submission | Tier 1-2 | Agents compile CER evidence packages, literature search results, and statistical summaries. Clinical conclusions are human-authored. |
| IQ/OQ/PQ evidence assembly | Validation | Tier 1-2 | Agents assemble qualification evidence packages and format test results. Human qualified person reviews and approves. |
| CAPA root cause analysis assistance | Quality | Tier 1-2 | Agents draft root cause analyses from defect data and trend analysis. Human quality owner approves before closure. |
Tool Configuration Notes
How to configure agent tooling to satisfy IEC 62304 traceability and 21 CFR Part 11 / EU Annex 11 audit trail requirements.
Audit Trail Hook Mapping
21 CFR Part 11 §11.10(e) and EU Annex 11 §9 require audit trails for all GxP computerized system activity. Agent configuration should produce:
| Regulatory Requirement | Hook Type | What It Produces |
|---|---|---|
| Audit trail — every agent action | PostToolUse audit hook | Agent identity, action type, timestamp, trace ID, data accessed |
| Access controls — authorized agents only | PreToolUse gate hook | RBAC check record; denied requests logged |
| Electronic signature for GxP record entry | PreToolUse signature hook | Named human approval with timestamp before any record submission |
| System validation evidence | SessionStart + SessionEnd hooks | Complete session record for IQ/OQ/PQ evidence |
| Data backup verification | Scheduled hook | Periodic confirmation that trace archive is intact and queryable |
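The PostToolUse row above can be sketched as a hook that emits one audit record per agent action. The field names mirror the table but are hypothetical; this is not a specific vendor's hook API, and a real deployment would write to append-only, validated storage rather than return the record:

```python
import json
import time
import uuid

def post_tool_use_audit_hook(agent_id: str, action: str, data_accessed: list) -> dict:
    """Emit an audit-trail record for one agent action (Part 11 §11.10(e) / Annex 11 §9)."""
    record = {
        "agent_identity": agent_id,
        "action_type": action,
        "timestamp_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "trace_id": str(uuid.uuid4()),
        "data_accessed": data_accessed,
    }
    # Round-trip through JSON so the record is serializable and enduring (ALCOA+).
    return json.loads(json.dumps(record))

rec = post_tool_use_audit_hook("agent-7", "read_batch_summary", ["lot-42/summary"])
```

Because the hook fires after every tool invocation rather than relying on the agent to self-report, the resulting trail is attributable and contemporaneous by construction.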
Data Classification Enforcement
For Class B/C devices and for GxP records:
- Restrict agents to approved MCP servers only. No external API calls for sessions containing patient data or device design data.
- Apply HIPAA (US) and GDPR (EU) data handling controls through the infrastructure-level MCP allowlist, not through agent prompts.
- The trace infrastructure is a computerized system subject to Part 11 validation. Validate before using as a regulatory record source.
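The allowlist point deserves emphasis: prompts can be ignored by the model, an infrastructure gate cannot. A minimal sketch of such a gate, with hypothetical server names:

```python
# Hypothetical approved-server set; in practice sourced from validated configuration.
APPROVED_MCP_SERVERS = {"internal-docs", "soup-registry", "trace-store"}

def allow_connection(server: str, is_external: bool, session_has_regulated_data: bool) -> bool:
    """Infrastructure-level gate, enforced by the runtime rather than by prompts.

    - Only approved MCP servers are reachable at all (default deny).
    - Sessions containing patient or device design data get no external egress.
    """
    if server not in APPROVED_MCP_SERVERS:
        return False
    if is_external and session_has_regulated_data:
        return False
    return True
```

The design choice is default-deny: an unlisted server is unreachable even if the agent asks for it, which is what makes the control auditable under Part 11 rather than merely advisory.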
SOUP Detection Integration
Integrate a dependency scanning hook (PreToolUse) that:
- Intercepts any new library or framework selection by the agent
- Queries the organization's SOUP registry for qualification status
- Blocks integration of unqualified SOUP for Class B/C development
- Logs all SOUP decisions in the evidence bundle for DHF inclusion
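The four behaviors above can be sketched as a single PreToolUse gate. The registry shape and field names are assumptions; the key property is that every decision is logged, allowed or not:

```python
def soup_gate(dependency: str, safety_class: str, registry: dict, evidence_log: list) -> bool:
    """PreToolUse sketch of the SOUP gate described above.

    registry maps dependency name -> qualification status ("qualified"/"unqualified");
    unknown dependencies are treated as unqualified. Returns True if the agent
    may integrate the dependency.
    """
    status = registry.get(dependency, "unqualified")
    decision = {
        "dependency": dependency,
        "qualification_status": status,
        "safety_class": safety_class,
        "allowed": status == "qualified" or safety_class == "A",
    }
    # Every SOUP decision enters the evidence bundle for DHF inclusion.
    evidence_log.append(decision)
    return decision["allowed"]

registry = {"requests": "qualified"}
log: list = []
ok = soup_gate("requests", "B", registry, log)       # qualified -> allowed
blocked = soup_gate("leftpad", "C", registry, log)   # unknown + Class C -> blocked
```

Blocking only for Class B/C while still logging Class A selections keeps the lighter-weight path open for low-risk software without losing the decision trail.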
Viable Starting Points
Not all medical device software carries equal certification burden. The following are realistic entry points for agentic engineering practices today:
Class A device software. No injury risk. Full agentic loop permissible. Standard evidence bundles. Natural pilot domain with minimal regulatory overhead.
Test generation for any safety class (Tier 1 observe). Agents generate candidate test cases, traceability matrices, and IEC 62304 §5.5 unit verification scaffolding. Qualified personnel review and accept. No tool qualification required. Applicable to Class B and C.
Post-market surveillance analysis. Agents analyze complaint data, identify adverse event patterns, and draft initial signal assessments. Human clinical reviewer owns the determination. High-value use case with contained blast radius.
Clinical evidence assembly. Agents compile literature search results, summarize clinical data, and format CER draft sections. Clinical conclusions remain human-authored. Reduces evidence assembly cycle time without automating clinical judgment.
Traceability matrix generation. Agent assembles specification-to-test-to-verification matrices from the DHF. Human validates completeness. Strong ALCOA+ alignment; directly supports MDR Annex II technical documentation.
IQ/OQ/PQ evidence packaging. Agents format and assemble qualification evidence packages from evaluation results. Human qualified person reviews before sign-off. Reduces qualification cycle time significantly.
Open Regulatory Questions
The following questions are unresolved in current regulatory guidance and represent areas where industry consensus, standards body clarification, or regulatory precedent is needed:
Agent-as-tool qualification under IEC 62304: Is an AI agent a "software tool" requiring qualification per IEC 62304 Clause 8, or is it SOUP, or something that requires a new classification? Current guidance does not address non-deterministic, general-purpose generation tools.
SOUP classification for continuously-learning systems: IEC 62304 assumes a SOUP item is versioned and behaves consistently within a version. A continuously-learning agent violates this assumption. How should SOUP risk analysis apply when the SOUP item's behavior changes without a discrete version boundary?
Version change revalidation requirements: When the underlying model is updated (e.g., model v1 to v2), what revalidation scope is required? The PCCP framework addresses anticipated modifications but does not explicitly cover infrastructure-level model changes that alter agent behavior without software changes.
FDA / notified body stance on agent-generated SaMD components: No regulatory body has published guidance on whether code generated by AI agents requires different verification than human-written code. The manifesto's position -- that agent output is unverified until independently confirmed -- is conservative but has not been tested in a regulatory submission.
Maps the Agentic Engineering Manifesto principles to pharmaceutical and life sciences regulatory frameworks.
Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to pharmaceutical and life sciences regulatory frameworks. It does not constitute compliance or regulatory advice. Consult qualified regulatory and quality professionals for compliance determinations.
Regulatory currency: This document reflects GAMP 5 (2nd ed. 2022), FDA 21 CFR Part 11, EU Annex 11, ICH Q10, and EMA guidance as understood at the time of last review. FDA and EMA guidance on AI/ML in regulated manufacturing is actively evolving. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.
Related documents: Companion Frameworks (boundary conditions, ALCOA+ mapping) | V-Model Adoption Path | Manifesto Principles
Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to pharmaceutical regulatory requirements; it does not redefine them.
1. GAMP 5 Category Mapping
These mappings are pragmatic classifications, not formal GAMP rulings.
GAMP 5 (2nd edition, 2022) categorizes computerized systems for risk-based validation. The table below maps each category to its agentic engineering equivalent and the manifesto mechanism that governs it.
| GAMP Cat | Description | Agent Context | Validation Approach | Manifesto Mechanism |
|---|---|---|---|---|
| 1 -- Infrastructure | OS, databases, networking | Agent runtime infrastructure (container host, network layer, database engine) | Minimal -- qualify as part of platform | P3 architecture enforcement; infrastructure treated as deterministic wrapper |
| 3 -- Non-Configured | COTS used as-is | LLM API consumed without customization; off-the-shelf agent framework with default settings | Verification of output against intended use; supplier documentation leveraged | P8 evaluation portfolios verify outputs; supplier qualification per GAMP Appendix O3 |
| 4 -- Configured | Products configured for intended use | Agent system configured via prompts, skills, tool permissions, autonomy tiers | Configuration-focused validation; verify each configured parameter behaves as intended | P2 living specifications; P5 autonomy tiers as validated configuration; P7 context engineering |
| 5 -- Custom | Bespoke software | Agent-generated code; custom tool integrations; bespoke orchestration logic | Full risk-based validation per GAMP lifecycle | P1 outcome evidence; P8 evaluations as contract; P9 structured traces for traceability |
Open question: Is the agent system itself Category 3 or Category 4?
An LLM API used with default parameters is arguably Category 3. The same API used with system prompts, configured tools, and autonomy tier enforcement is Category 4. Most production agent deployments are Category 4 at minimum. The categorization determines validation burden and must be justified in the system's validation plan. Where agent-generated code is deployed, that code is Category 5 regardless of the system that produced it.
Agent-selected dependencies. When an agent pulls in a library or framework, it is implicitly making a GAMP categorization decision. The manifesto's P3 (architecture as defense-in-depth) provides the mechanism -- allowlists and tool permissions -- but the GAMP implications must be explicitly addressed: each agent-selected dependency inherits a category and validation obligation that the deploying organization owns.
GAMP 5 2nd edition "critical thinking" alignment. The 2022 revision emphasizes critical thinking over rote compliance -- a philosophy shared with the manifesto. GAMP 5's risk-based approach to validation effort maps to the manifesto's phase-calibrated evidence: higher risk demands more rigorous evaluation, not more documentation.
2. Computer Software Assurance (CSA) Alignment
The FDA's 2022 CSA guidance replaces traditional CSV with risk-based, critical-thinking-driven assurance. This is the strongest alignment point between the manifesto and pharma regulation.
| CSV (Traditional) | CSA (2022) | Manifesto Alignment |
|---|---|---|
| Document everything | Risk-based documentation | Evidence bundles scaled by risk tier (P1) |
| Scripted testing only | Unscripted + scripted testing | Evaluation portfolios with adversarial cases (P8) |
| Compliance theater | Critical thinking | Outcomes over assertions (P1); verified outcomes over fluent assertions |
| Test to the script | Test to the risk | Phase-calibrated evidence; chaos testing (P10) |
| Every IQ/OQ/PQ step documented | Assurance commensurate with risk | Autonomy tiers match risk (P5); evidence bundles gated by phase |
| Scripted execution as proof | Intended use drives assurance | Specification-first approach (P2); validation distinct from verification |
| Compliance as end-state | Continual assurance | Agentic Loop (Observe, Learn, Govern) as living assurance cycle |
Strategic context. The manifesto is an engineering framework that operationalizes CSA's philosophy. CSA calls for risk-based, critical-thinking-driven assurance but does not prescribe the engineering discipline to implement it. The manifesto provides that discipline: specifications as living artifacts (P2), evaluations as contracts (P8), structured traces for auditability (P9), and tiered autonomy calibrated to risk (P5). Most pharma organizations understand CSA's intent but lack the engineering practices to execute it; the manifesto fills that gap -- not as a compliance framework, but as the engineering discipline that produces CSA-aligned evidence by construction.
CSA principle-to-manifesto detail.
| CSA Principle | Manifesto Implementation |
|---|---|
| "Assurance activities commensurate with risk" | Phase-calibrated evidence bundles (P1); autonomy tiers scaled to risk (P5) |
| "Use of unscripted testing" | Adversarial evaluation cases (P8); chaos testing (P10) |
| "Critical thinking over scripted compliance" | Outcomes as unit of work (P1); evaluations as contract, not checklist (P8) |
| "Intended use drives assurance" | Specification-first approach (P2); validation distinct from verification (Agentic Loop) |
| "Leverage supplier testing" | Agent-generated evidence bundles as supplier evidence (P1, P8) |
This alignment is structural, not retrofitted. The manifesto's evidence model produces CSA-compatible artifacts as a byproduct of its engineering discipline. Organizations adopting the manifesto for agentic delivery simultaneously produce documentation that satisfies CSA expectations -- without a separate compliance workstream.
3. 21 CFR Part 11 / EU Annex 11 Mapping
| Requirement | Regulation | Manifesto Mechanism | Alignment | Gap |
|---|---|---|---|---|
| Audit trails | Part 11 §11.10(e); Annex 11 §9 | P9 structured traces -- every agent action produces inspectable trace with decision chain | Good fit | Agent system configuration changes (prompt edits, tier adjustments, tool additions) require their own audit trail beyond action traces |
| Electronic signatures | Part 11 §§11.50-11.100; Annex 11 §14 | P12 accountability -- humans own outcomes, approvals, risk acceptance | Partial | Agent-produced records entering GxP systems may require legally valid electronic signatures; manifesto does not address signature binding |
| System access controls | Part 11 §11.10(d); Annex 11 §12 | P5 autonomy tiers with granular permissions (read/write, deploy scope, data access) | Good fit | -- |
| Closed vs. open system | Part 11 §11.30 | P3 architecture as defense-in-depth; deterministic wrappers around probabilistic AI | Partial | No classification guidance for whether agent systems with external API calls constitute open systems |
| Data backup and recovery | Annex 11 §7.1 | P6 memory governance -- rollback, provenance, expiration | Partial | Memory governance covers learned memory; GxP backup requirements extend to all system data and configuration |
| Validation | Part 11 §11.10(a); Annex 11 §4 | P8 evaluations as contract; evidence bundles per P1 | Partial | No explicit IQ/OQ/PQ mapping (see section 6 below) |
| Operational checks | Part 11 §11.10(f) | P10 containment engineering -- circuit breakers, rate limits, safe fallbacks | Good fit | -- |
| Authority checks | Part 11 §11.10(g) | P5 tier enforcement -- actions gated by tier and permission scope | Good fit | -- |
| Record retention | Part 11 §11.10(c); Annex 11 §17 | P9 trace retention as infrastructure requirement; ALCOA+ "Enduring" criterion | Good fit | Retention periods and format migration for agent traces need specification per GxP context |
4. GxP Context Differentiation
| GxP Context | Key Regulations | Agent Use Cases | Risk Profile | Recommended Max Autonomy |
|---|---|---|---|---|
| GMP (Manufacturing) | 21 CFR 210/211, EU GMP Annex 11, PIC/S | Batch record review, deviation trending, CAPA root cause analysis, process analytical technology (PAT) | High -- errors affect product quality and patient safety; manufacturing records are legal quality documents | Tier 1 (Observe) -- agents analyze and propose; human executes all GMP record modifications |
| GLP (Laboratory) | 21 CFR Part 58, OECD GLP Principles | Protocol drafting, data analysis, literature review, study report compilation | Medium -- errors compromise study integrity and regulatory submission basis; raw data integrity is absolute | Tier 1-2 (Observe/Branch) -- agents draft in isolation; human reviews and approves; agents must never modify raw data |
| GCP (Clinical) | ICH E6(R3), 21 CFR 50/56/312, EU CTR | Protocol design assistance, site feasibility, patient matching, medical coding (MedDRA), safety signal detection | Medium-High -- errors affect patient safety or trial integrity; ICH E6(R3) "fit-for-purpose" quality management applies | Tier 1-2 (Observe/Branch) -- agents assist under human governance; causality assessment and patient-facing decisions remain human-owned |
Differentiating factor. The manifesto treats "regulated industries" as a category. Pharma practitioners operate in specific GxP contexts with distinct requirements. GMP imposes the heaviest constraints on agent autonomy because manufacturing records are legal quality documents subject to Part 11. GLP permits more agent involvement in analysis but enforces absolute raw data integrity. GCP benefits most from ICH E6(R3)'s "fit-for-purpose" alignment with the manifesto's risk-tiered approach.
Use-case risk graduation. Organizations can adopt agentic engineering incrementally across GxP contexts:
| Use-Case Domain | Regulatory Burden | Starting Autonomy | Expansion Path |
|---|---|---|---|
| Drug discovery / research | Low | Tier 2-3 | Manifesto applies directly; minimal regulatory overlay |
| Regulatory affairs | Medium (high value) | Tier 1-2 | Dossier assembly, consistency checking; submission content human-approved |
| Clinical operations (GCP) | Medium | Tier 1-2 | Agents assist; ICH E6(R3) fit-for-purpose quality management applies |
| Pharmacovigilance | Medium-High | Tier 1 | Signal detection, ICSR triage; causality assessment human-owned |
| Manufacturing (GMP) | High | Tier 1 | Batch record review, deviation analysis; agent modification of GMP records requires full Part 11 compliance |
5. ICH Guidelines Mapping
| ICH Guideline | Core Concept | Relevance to Agentic Engineering | Manifesto Alignment |
|---|---|---|---|
| Q8 (Pharmaceutical Development) | Design Space -- operating ranges within which changes do not require regulatory notification | Tier 2 autonomy within established boundaries; agents operate freely within a validated Design Space, escalate outside it | P5 autonomy tiers: Tier 2 (Branch) maps to operation within Design Space; boundary crossing triggers Tier 3 governance |
| Q9 (Quality Risk Management) | Risk-based approach to quality decisions; severity, probability, detectability | Risk assessment drives autonomy level, evidence requirements, and validation depth | P5 risk-tiered autonomy; P8 phase-calibrated evidence; P11 economics of intelligence (cost of correctness includes risk) |
| Q10 (Pharmaceutical Quality System) | Continual improvement; knowledge management; management review | Agentic Loop (Observe, Learn, Govern) as a continual improvement engine; P6 knowledge vs. learned memory distinction | P6 knowledge infrastructure; P9 observability for management review; Agentic Loop as PQS implementation mechanism |
| Q12 (Lifecycle Management) | Established conditions; post-approval changes; reporting categories | Revalidation triggers when agent behavior changes; model version updates as post-approval changes; change classification | P2 living specifications; change control for model versions, prompt modifications, and memory accumulation |
| E6(R3) (GCP) | "Fit-for-purpose" quality management; proportionate approaches; risk-based monitoring | Risk-tiered governance for clinical agent applications; assurance proportionate to decision impact | P5 autonomy tiers; P8 evaluations scaled to risk; manifesto's risk-based philosophy mirrors E6(R3)'s proportionality principle |
6. IQ/OQ/PQ Framework for Agent Systems
The pharma qualification framework maps to the manifesto's engineering practices as follows.
| Qualification Stage | Traditional Scope | Agent System Equivalent | Manifesto Mechanism |
|---|---|---|---|
| IQ (Installation Qualification) | System installed per specification; hardware and software verified | Agent runtime installed; model versions locked and documented; tool connections verified; infrastructure (Cat 1) validated; configuration baselines captured | P3 architecture enforcement; P2 versioned specifications; infrastructure as deterministic wrapper |
| OQ (Operational Qualification) | System operates as intended within specified ranges | Agent produces correct outputs for defined test cases; autonomy tiers enforce correctly; traces capture completely; error handling and escalation paths verified | P8 evaluation portfolios; P5 tier enforcement verification; P9 observability validation |
| PQ (Performance Qualification) | System performs consistently under production conditions over time | Agent performs reliably under production load and data volumes over an extended period; drift detection active; evidence bundles generated consistently | P9 observability and drift monitoring; P10 resilience under stress; P1 outcome evidence over sustained operation |
Mapping note. IQ/OQ/PQ is a sequential qualification framework. The manifesto's Agentic Loop is continuous. In practice, IQ/OQ/PQ establishes the initial validated state; the Agentic Loop (Observe, Learn, Govern) maintains that state through ongoing operation. Requalification is triggered by changes per the organization's change control procedure -- see section 9.
IQ/OQ/PQ evidence mapping.
| Stage | Required Evidence (Traditional) | Agent System Evidence (Manifesto) |
|---|---|---|
| IQ | Installation records, version logs, configuration screenshots | P2 versioned specification snapshot; P3 infrastructure-as-code manifests; model version hash; tool connection test results |
| OQ | Test protocols, test results, deviation reports | P8 evaluation portfolio results; P5 tier enforcement test logs; P9 trace completeness verification |
| PQ | Production run records, performance trending | P9 observability dashboards; P1 evidence bundle consistency over time; P10 resilience metrics under production load |
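The IQ row's configuration baseline and model version hash can be captured mechanically. A minimal sketch, assuming the configuration is serializable; the configuration keys are hypothetical:

```python
import hashlib
import json

def iq_baseline(config: dict) -> dict:
    """Capture an IQ configuration baseline (hypothetical structure).

    Hashing the canonical JSON form of the configuration gives a stable
    fingerprint; any later drift from the validated state is detectable
    by recomputing and comparing the hash.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return {
        "config_snapshot": config,
        "config_hash": hashlib.sha256(canonical.encode()).hexdigest(),
    }

baseline = iq_baseline({"model": "model-x@v1", "autonomy_tier": 2, "tools": ["trace-store"]})
drifted = iq_baseline({"model": "model-x@v2", "autonomy_tier": 2, "tools": ["trace-store"]})
```

Comparing `config_hash` values at OQ/PQ checkpoints is one way to demonstrate the system still matches its installed, qualified state.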
7. Data Integrity for Agent Systems
ALCOA+ is the foundational data integrity framework for pharma and GxP. The manifesto's ALCOA+ mapping in companion-frameworks.md covers software development records. Pharma operational records require additional consideration.
| Data Integrity Concern | Regulatory Basis | Agent-Specific Consideration |
|---|---|---|
| Agent-generated data as "original" data | 21 CFR 211.68; Annex 11 §8 | When an agent generates a calculation, trend, or summary entering a batch record or clinical database, the source record must be defined. The agent's input data and logic trace constitute the source. |
| Agent-modified data | Annex 11 §9; Part 11 §11.10(e) | Audit trail must capture: original value, new value, reason for change, who authorized the change, timestamp. The manifesto's P9 traces cover agent actions; the authorization chain (P12) must link to a named human. |
| Metadata preservation | WHO Data Integrity Guidance; PIC/S PI 041 | Agents processing GxP data must preserve timestamps, user IDs, system IDs, and audit metadata. Transformation or reprocessing must not corrupt metadata. |
| Data access classification | P5 autonomy tiers | Which GxP data can agents access? Define per data classification: read-only for raw data (GLP), read-only for batch records (GMP), read-write for draft documents only, no access to restricted patient-level data without additional controls. |
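The audit-trail row above enumerates the fields a GxP change record must carry. A minimal sketch with hypothetical field names, showing the authorization chain terminating at a named human (P12):

```python
import time

def change_record(field: str, original, new, reason: str, authorized_by: str) -> dict:
    """Build an Annex 11 §9-style audit record for an agent-proposed data change.

    authorized_by must be a named human (P12); an empty value rejects the
    record outright rather than producing an unattributable change.
    """
    if not authorized_by:
        raise ValueError("GxP change requires a named human authorizer")
    return {
        "field": field,
        "original_value": original,
        "new_value": new,
        "reason_for_change": reason,
        "authorized_by": authorized_by,
        "timestamp_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

rec = change_record("assay_result", 4.2, 4.3, "transcription correction", "q.owner")
```

Raising instead of defaulting is the point: an agent can propose the change, but the record literally cannot exist without a human in the authorization field.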
Data access matrix by GxP context.
| Data Type | GMP Access | GLP Access | GCP Access | Rationale |
|---|---|---|---|---|
| Raw / source data | Read-only | Read-only (absolute) | Read-only | Raw data integrity is non-negotiable across all GxP contexts |
| Batch records | Read-only | N/A | N/A | Legal quality documents; modifications require human execution with Part 11 signatures |
| Draft documents | Read-write | Read-write | Read-write | Agents draft; humans review and approve before documents enter the quality system |
| Calculated / derived data | Read-write with trace | Read-write with trace | Read-write with trace | Agent must log input data, algorithm, and output; source traceability required |
| Patient-level data | N/A | N/A | Read-only with controls | Additional access controls, anonymization, and data protection requirements apply |
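The access matrix above lends itself to deny-by-default enforcement in code. The sketch below is illustrative, not a reference implementation; the data-type and context identifiers are hypothetical labels, and a real deployment would load the matrix from validated configuration rather than hard-code it.

```python
from enum import Enum

class Access(Enum):
    NONE = "none"
    READ_ONLY = "read-only"
    READ_WRITE = "read-write"
    READ_WRITE_TRACED = "read-write with trace"

# Access matrix from the table above: (data type, GxP context) -> access level.
# Any pair not listed is denied by default.
ACCESS_MATRIX = {
    ("raw_data", "GMP"): Access.READ_ONLY,
    ("raw_data", "GLP"): Access.READ_ONLY,
    ("raw_data", "GCP"): Access.READ_ONLY,
    ("batch_record", "GMP"): Access.READ_ONLY,
    ("draft_document", "GMP"): Access.READ_WRITE,
    ("draft_document", "GLP"): Access.READ_WRITE,
    ("draft_document", "GCP"): Access.READ_WRITE,
    ("derived_data", "GMP"): Access.READ_WRITE_TRACED,
    ("derived_data", "GLP"): Access.READ_WRITE_TRACED,
    ("derived_data", "GCP"): Access.READ_WRITE_TRACED,
    ("patient_data", "GCP"): Access.READ_ONLY,  # additional controls apply upstream
}

def check_access(data_type: str, context: str, operation: str) -> bool:
    """Deny-by-default gate evaluated before any agent tool call."""
    granted = ACCESS_MATRIX.get((data_type, context), Access.NONE)
    if operation == "read":
        return granted is not Access.NONE
    if operation == "write":
        return granted in (Access.READ_WRITE, Access.READ_WRITE_TRACED)
    return False
```

The deny-by-default lookup matters as much as the matrix itself: a new data type or GxP context gets no access until someone deliberately adds an entry.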
8. Supplier Qualification
Pharma quality systems require formal qualification of all critical suppliers of GxP computerized systems.
| Supplier Qualification Aspect | Agent-Specific Consideration |
|---|---|
| Vendor audit | LLM providers and agent framework vendors require assessment. Audit scope should include: data handling practices, model versioning, availability SLAs, security posture. |
| Quality agreement | Agreements with model providers must address: version notification, deprecation timelines, data confidentiality, uptime commitments, incident notification. |
| Ongoing performance monitoring | P9 observability provides richer monitoring data than traditional supplier review. Track: output quality drift, latency changes, availability, cost-per-query trends. |
| Open-source models | No traditional "supplier" exists. The deploying organization assumes full supplier responsibility: validation, maintenance, version control, incident response. Document this in the validation plan. |
| Multi-vendor routing | P11 economics-aware routing means multiple model providers. Each requires qualification. Routing logic itself is a validated configuration (GAMP Cat 4). |
Open regulatory issue: who is the "supplier" for open-source models?
GAMP 5 and EU GMP Chapter 7 assume an identifiable supplier with a quality system. Open-source foundation models have no such entity. The deploying organization must formally document that it assumes supplier responsibilities -- including validation, ongoing monitoring, version control, anomaly tracking, and incident response. This represents a significant resource commitment that must be factored into the build-vs-buy decision for GxP agent deployments.
9. Change Control Considerations
Treat model updates, prompt edits, tool changes, and memory growth as distinct change classes; they carry different validation scopes and requalification burdens.
| Change Type | Pharma Change Control Implication | Manifesto Mechanism | Open Question |
|---|---|---|---|
| Model version update | Change to a validated system; requires impact assessment and potential requalification (OQ minimum) | P2 living specifications; revalidation triggers in Agentic Loop | What is the minimum requalification scope for a minor model version change vs. a major version change? |
| Prompt / specification modification | Configuration change to a Cat 4 system; requires change control record | P2 versioned specifications; P9 traces capture specification version | Should prompt changes follow the same change control rigor as software configuration changes? |
| Tool addition or removal | System boundary change; may affect GAMP categorization and validation scope | P3 architecture enforcement; P4 swarm topology | Does adding a tool to an agent's toolkit constitute a change requiring full OQ? |
| Memory accumulation | Agent behavior changes as learned memory grows; this is a novel change type | P6 memory governance -- expiration, rollback, provenance | Is accumulated memory a change requiring change control? At what threshold? |
| Autonomy tier adjustment | Risk profile change; requires risk assessment and potential requalification | P5 tiered autonomy; P12 accountability | Tier escalation (1 to 2) requires documented risk acceptance. Does de-escalation? |
| Periodic review | Annual review obligation remains regardless of continuous monitoring | P9 continuous observability provides richer data than traditional periodic review | How does continuous observability supplement or replace the annual periodic review? |
10. Viable Starting Points
Not all pharma workflows carry equal GxP burden. The following are realistic entry points for agentic engineering practices today:
Drug discovery and early research (no GxP obligations). Manifesto applies directly with minimal regulatory overlay. Natural pilot domain. Use to build team competency and evidence practices before GxP contexts.
Regulatory dossier consistency checking. Agents cross-check submission sections for internal consistency, identify gaps against CTD format requirements, and flag inconsistent cross-references. Regulatory affairs professional approves before submission. High-value use case; Tier 1-2 natural ceiling.
Deviation trending and CAPA root cause assistance. Agents analyze deviation databases, identify patterns, and draft initial root cause analyses for human review. Reduces investigation cycle time. No GMP record modification — observe only.
Pharmacovigilance signal detection. Agents analyze ICSR data and literature for emerging safety signals. Qualified pharmacovigilance professional reviews all findings before regulatory reporting. Contained blast radius; significant value at Tier 1 observe.
Protocol drafting assistance (GLP, GCP). Agents draft study protocol sections from templates and prior studies. Principal investigator or sponsor reviews and approves before finalization. Strong alignment with ICH E6(R3) "fit-for-purpose" quality management.
IQ/OQ/PQ evidence assembly. Agents format and compile qualification evidence packages from evaluation results. Qualified person signs off. Reduces validation cycle time while preserving human accountability for all quality decisions.
11. Hard Autonomy Caps
The following caps apply regardless of organizational maturity phase. They are derived from GxP data integrity requirements, not from risk preference.
| Use Case | Maximum Tier | Regulatory Basis | Key Constraint |
|---|---|---|---|
| GMP batch record modification | Tier 1 (observe only) | 21 CFR 211.68; EU GMP Annex 11 s 9; Part 11 | Batch records are legal quality documents. Agents may analyze; humans execute all modifications with Part 11 electronic signatures. |
| GMP manufacturing instructions | Tier 1 (observe only) | EU GMP Chapter 4; 21 CFR 211 | Agent may draft; qualified person reviews and approves before issuance to production. |
| GLP raw data | Tier 1 (observe only) | 21 CFR Part 58; OECD GLP Principles | Raw data integrity is absolute. Agents may read; agents must never modify raw data. |
| GCP patient-facing decisions / causality | Tier 1 (observe only) | ICH E6(R3); 21 CFR 50/56 | Causality assessment and any patient safety decision requires qualified human judgment. |
| Regulatory submission content | Tier 2 max | FDA, EMA submission regulations | Agent drafts and consistency-checks; regulatory affairs professional approves before submission. |
| Drug discovery / early research | Tier 3 available | Minimal GxP overlay | Standard manifesto adoption applies. No GxP obligations for pre-IND research. |
| Pharmacovigilance (ICSR triage, signal detection) | Tier 1-2 | ICH E2A/E2B/E2C; EudraVigilance | Agent assists signal detection and ICSR assembly; qualified pharmacovigilance professional reviews every case before reporting. |
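Because these caps apply regardless of maturity phase, they are best enforced as a clamp that no configuration can override. A minimal sketch, with hypothetical use-case identifiers and an observe-only default for anything unlisted:

```python
# Hard autonomy caps from the table above (use case -> maximum permitted tier).
# Keys are hypothetical identifiers; unknown use cases default to Tier 1.
HARD_CAPS = {
    "gmp_batch_record": 1,
    "gmp_manufacturing_instructions": 1,
    "glp_raw_data": 1,
    "gcp_patient_decisions": 1,
    "regulatory_submission": 2,
    "pharmacovigilance": 2,
    "drug_discovery": 3,
}

def effective_tier(use_case: str, requested_tier: int) -> int:
    """Clamp a requested autonomy tier to the regulatory hard cap,
    regardless of organizational maturity phase."""
    return min(requested_tier, HARD_CAPS.get(use_case, 1))
```

Requesting Tier 3 for a GMP batch-record workflow silently yields Tier 1; the cap is structural, not advisory.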
12. Formal Verification Opportunity
Manifesto Principle 8 states: "proofs are a scale strategy." For pharma, formal verification creates value in specific contexts:
Process Analytical Technology (PAT) and Control Strategy
PAT models (ICH Q8, Q10) governing real-time release testing and process control can benefit from formal verification of the control logic:
- Process model contracts: Formal preconditions and postconditions on analytical control algorithms can be machine-verified rather than validated through scripted testing alone.
- Agent-generated PAT logic with formal proofs: Agent-generated control logic accompanied by machine-checked proofs of correctness properties (no out-of-bounds, monotonicity of response) can produce a stronger validation case than test-only approaches.
- FDA CSA alignment: CSA's "use of unscripted testing" and "critical thinking over scripted compliance" principles support replacing exhaustive scripted test matrices with targeted formal verification on critical paths.
Quantitative Structure-Activity Relationship (QSAR) and Pharmacokinetic Models
QSAR models and PK/PD algorithms used in drug development can benefit from:
- Formal invariants: Constraints on output ranges, monotonicity of dose-response relationships, and absence of undefined behavior formally verified rather than tested across a finite sample.
- Contract-first specification (P2): Specify model constraints as formal contracts before implementation. Agent-generated model code verified against the formal contract by a model checker provides stronger evidence than equivalence testing alone.
Practical Entry Point
Formal methods do not require a full theorem-proving infrastructure. The practical entry is executable specification: write GxP acceptance criteria as machine-checkable assertions (postconditions on calculations, invariants on data ranges). These serve simultaneously as human-readable requirements and automated verification inputs — collapsing the gap between specification and test evidence. This is directly compatible with CSA's intent and reduces the overhead of scripted test protocol generation.
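An executable specification of this kind can be as simple as a function whose checks are the acceptance criteria. The sketch below is a hypothetical potency calculation with illustrative limits, not a real specification:

```python
def check_potency_result(potency_pct: float, dilution_factor: float) -> list:
    """Executable acceptance criteria for a hypothetical potency calculation.

    Each check doubles as a human-readable requirement and an automated
    verification input. Limits are illustrative, not real specifications.
    """
    failures = []
    # Postcondition: reported potency inside the validated range.
    if not 90.0 <= potency_pct <= 110.0:
        failures.append("potency outside validated range 90.0-110.0%")
    # Invariant: only validated dilution factors may be used.
    if dilution_factor not in (1.0, 2.0, 10.0):
        failures.append("non-validated dilution factor")
    return failures
```

An empty list is the machine-checkable "pass" that can enter an evidence bundle; any non-empty result routes the record to human review.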
13. Tool Configuration Notes
How to configure agent tooling to satisfy 21 CFR Part 11 / EU Annex 11 audit trail requirements and GxP data integrity obligations.
Audit Trail Hook Mapping
| GxP Requirement | Hook Type | What It Produces |
|---|---|---|
| Audit trail — agent actions on GxP data | PostToolUse audit hook | Timestamp, user/agent identity, action type, before/after values, reason |
| Electronic signatures for GxP records | PreToolUse signature gate | Named qualified person approval with binding electronic signature |
| System access controls | PreToolUse RBAC hook | Access check record; unauthorized access attempts logged |
| Configuration change audit trail | PostToolUse config hook | Specification version changes, prompt modifications, tier adjustments logged |
| Data backup and recovery verification | Scheduled PostToolUse | Periodic archive integrity check |
| Operational checks (circuit breakers) | PreToolUse system check | Agent health check; blocks execution if system state outside validated range |
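A PostToolUse audit hook from the first row might be sketched as follows. The function name and fields are hypothetical; the point is that the record captures the Annex 11 s 9 elements (timestamp, identity, before/after values, reason) and carries tamper evidence:

```python
import json
import hashlib
from datetime import datetime, timezone

def post_tool_use_audit(actor, action, target, before, after, reason):
    """Hypothetical PostToolUse hook emitting a Part 11 / Annex 11 s 9
    style audit record: timestamp, identity, before/after values, reason."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,      # agent or named user identity
        "action": action,    # e.g. "update_field"
        "target": target,    # record or document identifier
        "before": before,
        "after": after,
        "reason": reason,
    }
    # Tamper evidence: hash the serialized record so later alteration
    # of the stored log entry is detectable.
    payload = json.dumps(record, sort_keys=True)
    record["sha256"] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return record
```

In practice the hook would append the record to write-once storage; the hash chain (each entry hashing its predecessor) is a common hardening step beyond this sketch.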
GxP Data Classification Enforcement
The MCP allowlist (Layer 6 in enterprise configuration) is the primary data residency control for GxP systems:
| Data Classification | Agent Access | Routing Constraint |
|---|---|---|
| Raw / source data (GMP, GLP) | Read-only | On-premises or validated private cloud only; no external API |
| Batch records and GMP quality records | Read-only | On-premises only; any agent access logged as a Part 11 event |
| Draft documents | Read-write with audit trail | Approved models with signed DPA; agent writes to draft state only |
| Restricted patient-level data | Read-only with additional controls | Anonymization layer required; local inference preferred |
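The allowlist from this table can be expressed as a deny-by-default routing check evaluated before any model call leaves the boundary. Classification and target names below are hypothetical placeholders:

```python
# Hypothetical routing allowlist derived from the classification table above.
ROUTING_ALLOWLIST = {
    "raw_source_data": {"on_prem", "validated_private_cloud"},
    "batch_record": {"on_prem"},
    "draft_document": {"on_prem", "validated_private_cloud", "approved_external_api"},
    "patient_data": {"on_prem"},  # anonymization layer required upstream
}

def route_allowed(classification: str, target: str) -> bool:
    """Deny-by-default residency check evaluated before a model call is routed."""
    return target in ROUTING_ALLOWLIST.get(classification, set())
```

As with the GxP access matrix, an unknown classification gets no routing targets until it is deliberately added to the allowlist.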
GAMP Validation of the Agent Infrastructure
The agent runtime itself is a GAMP Category 4 or 5 system:
- IQ evidence: Configuration-as-code (specifications, tool permissions, tier settings, model version pins) captured in the version-controlled configuration repository.
- OQ evidence: Evaluation portfolio results (P8); tier enforcement test logs (P5); trace completeness verification (P9).
- PQ evidence: Production performance metrics (P9); drift detection records; evidence bundle consistency over time (P1).
The configuration repository is the IQ record. Point auditors to it — it is the answer to "show me your validated system configuration."
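One way to make "the configuration repository is the IQ record" concrete is an immutable configuration object with a stable fingerprint, so OQ/PQ evidence can bind to an exact configuration. This is a sketch under assumed conventions; the field names are hypothetical:

```python
import json
import hashlib
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical configuration-as-code record: the pinned state that
    serves as IQ evidence in the version-controlled repository."""
    spec_version: str
    model_pin: str            # exact model version, never "latest"
    autonomy_tier: int        # 1 = observe, 2 = branch, 3 = commit
    tool_permissions: tuple   # explicit allowlist of tools

    def fingerprint(self) -> str:
        """Stable hash recorded with every evidence bundle so OQ and PQ
        results bind to one exact configuration."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to the specification version, model pin, tier, or tool set produces a new fingerprint, which is exactly the change-control trigger Section 9 describes.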
14. Open Regulatory Questions
These questions are unresolved at the intersection of agentic engineering and pharma regulation. They are listed here to support regulatory strategy discussions, not to imply that answers exist.
| # | Question | Regulatory Context | Manifesto Reference |
|---|---|---|---|
| 1 | How should agent systems be categorized under GAMP 5 -- Category 3, 4, or a new category? | GAMP 5 (2nd ed.) | P3, P5 |
| 2 | Do agent-generated GxP records satisfy Part 11 requirements for electronic records? | 21 CFR Part 11 | P9, P12 |
| 3 | What validation approach applies to systems whose behavior changes through learning? | GAMP 5; CSA | P6, P8 |
| 4 | Is a model version change equivalent to a software version change for change control purposes? | EU GMP Annex 11 s 10; ICH Q10 | P2 |
| 5 | Does prompt modification constitute a configuration change requiring formal change control? | GAMP 5 Cat 4; Annex 11 s 10 | P2, P7 |
| 6 | At what point does memory accumulation constitute a change to a validated system? | GAMP 5; Annex 11 s 11 | P6 |
| 7 | Can agent-generated evidence bundles serve as supplier documentation under CSA's "leverage supplier testing" principle? | FDA CSA | P1, P8 |
| 8 | What constitutes an adequate quality agreement with an LLM provider for GxP use? | EU GMP Chapter 7; ICH Q10 | P11 |
| 9 | How should agent systems be classified as open or closed systems under Part 11? | 21 CFR Part 11 s 11.30 | P3 |
| 10 | Does continuous observability (P9) satisfy or supplement periodic review obligations? | EU GMP Annex 11 s 11 | P9 |
These questions reflect the current state of regulatory uncertainty. As regulatory bodies issue guidance on AI in GxP environments, this section should be revisited and questions resolved or refined. Organizations should track FDA, EMA, MHRA, and PIC/S publications for emerging positions.
Appendix A: Alignment Summary by Manifesto Principle
| Principle | GAMP 5 | CSA | Part 11 / Annex 11 | GxP (GMP/GLP/GCP) | ICH Q8-Q12 / E6(R3) |
|---|---|---|---|---|---|
| P1 Outcomes | Cat 5 validation evidence | Risk-based documentation | Record retention | Evidence across all GxP | Q10 continual improvement |
| P2 Specifications | Cat 4 configuration | Intended use drives assurance | -- | Protocol / specification management | Q12 established conditions |
| P3 Architecture | Category boundary enforcement | -- | Closed/open system classification | System boundary definition | -- |
| P5 Autonomy | Risk-based validation depth | Assurance commensurate with risk | Access controls; authority checks | Tier caps per GxP context | Q8 Design Space; Q9 risk management |
| P6 Memory | -- | -- | Data backup and recovery | Raw data integrity | Q10 knowledge management |
| P8 Evaluations | Validation testing | Unscripted + scripted testing | Validation of computerized systems | IQ/OQ/PQ framework | E6(R3) fit-for-purpose QM |
| P9 Observability | -- | -- | Audit trails; operational checks | Audit trail across GxP | Q10 management review |
| P10 Containment | -- | -- | Operational checks | -- | Q9 risk controls |
| P12 Accountability | -- | Critical thinking | Electronic signatures | Human ownership of GxP records | E6(R3) sponsor/investigator responsibility |
Appendix B: Principle Quick Reference
Manifesto principles referenced throughout this document.
| Ref | Principle | Core Concept |
|---|---|---|
| P1 | Outcomes are the unit of work | Evidence bundles; deployed, instrumented, evaluated |
| P2 | Specifications are living artifacts | Versioned, reviewable, machine-readable |
| P3 | Architecture is defense-in-depth | Deterministic wrappers; enforced boundaries |
| P4 | Right-size the swarm | Topology matched to complexity |
| P5 | Autonomy is a tiered budget | Tier 1 Observe / Tier 2 Branch / Tier 3 Commit |
| P6 | Knowledge and memory are distinct | Knowledge (ground truth) vs. learned memory (heuristic) |
| P7 | Context is engineered like code | Versioned, tested, performance-benchmarked |
| P8 | Evaluations are the contract | Evaluation portfolios; regression gates |
| P9 | Observability covers reasoning | Structured traces; audit trails; interoperability |
| P10 | Assume emergence; engineer containment | Circuit breakers; chaos testing; safe fallbacks |
| P11 | Optimize economics of intelligence | Cost of correctness; dynamic model routing |
| P12 | Accountability requires visibility | Human ownership; incident attribution |
Mapping the Agentic Engineering Manifesto principles to financial services regulatory frameworks.
Disclaimer -- This document maps concepts from the Agentic Engineering Manifesto to financial services regulatory frameworks. It does not constitute compliance or regulatory advice. Consult qualified risk, compliance, and regulatory professionals for compliance determinations.
Regulatory currency: This document reflects SR 11-7 / OCC 2011-12, DORA (EU 2022/2554), EU AI Act, GDPR, MiFID II, and SEC/FINRA model risk guidance as understood at the time of last review. Financial services regulation varies significantly by jurisdiction; this document uses conservative cross-jurisdictional defaults, not jurisdiction-specific advice. The EU AI Act implementation timeline and Annex III classifications are subject to ongoing guidance; verify current status before relying on AI Act references here. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.
Preamble
This document is a companion to manifesto.md. It assumes familiarity with the boundary conditions and the Agentic V-Model transition framework. Financial services already operates the governance infrastructure the manifesto demands: model risk management, three lines of defense, change control, audit trails. The bridge to agentic engineering is the extension of existing frameworks, not the construction of new ones.
Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to financial services regulatory requirements; it does not redefine them.
SR 11-7 / OCC 2011-12 Model Risk Management
SR 11-7 defines model risk management expectations for banking organizations supervised by the Federal Reserve and OCC. Agent systems that influence financial decisions fall within scope when they meet the SR 11-7 definition of "model" -- a quantitative method that processes inputs to produce quantitative estimates used in decision-making.
| SR 11-7 Requirement | Manifesto Mechanism | Alignment | Gap |
|---|---|---|---|
| Model development documentation -- design, theory, data, assumptions | P1 evidence bundles, P2 specifications | Partial | Model development rationale (why this model, alternatives considered, limitations) not captured by default evidence bundles. SR 11-7 expects documentation of the conceptual soundness of the approach, not just that it was built and tested. |
| Independent model validation -- effective challenge by qualified staff | P8 evaluations | Significant gap | SR 11-7 requires organizational independence between developer and validator. The manifesto treats verification as part of the delivery pipeline, performed by the same team. Validation must include conceptual soundness review, not just test execution. |
| Ongoing monitoring -- backtesting, benchmarking, sensitivity analysis, outcomes analysis | P9 observability, structured traces | Good fit | Agent traces provide richer monitoring data than most current model monitoring infrastructure. Traces capture reasoning chains, not just input-output pairs, enabling deeper performance analysis. |
| Model inventory and classification -- tiering by materiality, use, and complexity | None | Missing | Every agent system used in financial decisions must be registered, classified by materiality, and tracked in the model inventory. Classification drives validation frequency, monitoring intensity, and governance oversight. |
| Model risk governance -- roles, escalation, board reporting, risk appetite | P5 autonomy tiers, P12 accountability | Partial | Three Lines of Defense roles and escalation paths not explicitly addressed. Board-level model risk reporting and model risk appetite statements have no manifesto equivalent. |
| Champion-challenger testing -- parallel execution against alternative approaches | None | Missing | Comparing agent outputs against alternative approaches or incumbent models is not part of the manifesto evaluation framework. Critical for demonstrating that the agent system performs at least as well as the approach it replaces. |
| Model limitations documentation -- known weaknesses, boundary conditions, compensating controls | None | Missing | Explicit documentation of what the agent system cannot do, known failure modes, conditions under which outputs should not be relied upon, and compensating controls for known limitations. |
| Vendor model management -- due diligence, ongoing monitoring of vendor models | P11 economics, multi-model routing | Partial | SR 11-7 requires due diligence on vendor models including access to methodology documentation. LLM providers rarely provide the level of transparency SR 11-7 expects for vendor model assessment. |
SS1/23 (PRA) addendum. The PRA model risk management principles extend SR 11-7 with several additions relevant to agentic systems:
- Model risk appetite defined and approved at board level, with explicit thresholds for model performance degradation and triggers for remediation.
- Explicit coverage of AI/ML models, removing ambiguity about whether agent systems are in scope.
- Proportionality requirements scaled to model materiality -- not every agent system requires the same validation intensity.
- Enhanced expectations for data quality in model inputs, strengthening the link to P7 (context quality as infrastructure).
These additions reinforce the case for P12 (accountability at governance level) and P7 (context quality as infrastructure).
Implementation note. Organizations should map each agent system to the SR 11-7 model tiering framework at the point of registration. Tier 1 (highest materiality) agent systems require annual independent validation, quarterly ongoing monitoring review, and board-level reporting. Lower-tier systems may follow a lighter cadence, but no agent system influencing financial decisions should be exempt from the inventory and governance framework entirely.
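The registration step in the implementation note can be sketched as an inventory entry derived from the materiality tier. Only the Tier 1 cadence comes from the note above; the Tier 2 and Tier 3 cadences below are illustrative placeholders, not regulatory guidance:

```python
# Tier 1 cadence follows the implementation note above; lower-tier
# cadences are illustrative placeholders, not regulatory guidance.
TIER_CADENCE = {
    1: {"independent_validation": "annual",
        "monitoring_review": "quarterly",
        "board_reporting": True},
    2: {"independent_validation": "biennial",
        "monitoring_review": "semi-annual",
        "board_reporting": False},
    3: {"independent_validation": "triennial",
        "monitoring_review": "annual",
        "board_reporting": False},
}

def register_agent_system(name: str, materiality_tier: int) -> dict:
    """Inventory registration: every agent system influencing financial
    decisions gets an entry; there is no exempt path."""
    if materiality_tier not in TIER_CADENCE:
        raise ValueError(f"unknown materiality tier: {materiality_tier}")
    return {"name": name, "tier": materiality_tier, **TIER_CADENCE[materiality_tier]}
```

Raising on an unknown tier, rather than defaulting, preserves the rule that classification is a deliberate governance decision.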
The manifesto's evidence bundles (P1) provide a strong foundation for SR 11-7 model documentation, but must be supplemented with:
- Conceptual soundness assessment -- why this agent architecture, what alternatives were considered, what are the theoretical limitations.
- Outcome analysis -- comparison of agent decisions against actual outcomes over time, with statistical rigor appropriate to the use case.
- Sensitivity analysis -- how agent outputs change under varying inputs, context quality, and model provider configurations.
Three Lines of Defense
The Three Lines model is the foundational governance structure in financial services. Any agentic engineering adoption must map to this structure or it will not pass internal governance review.
| Line | Traditional Role | Agentic Equivalent | Manifesto Principle | Key Requirement |
|---|---|---|---|---|
| 1st -- Business / Technology | Builds and operates models; owns risk within business domain | Develops agent systems, defines specifications, produces evidence bundles, operates monitoring, manages day-to-day agent performance | P1-P11 | Owns first-line risk for agent systems within its domain; responsible for evidence quality and ongoing monitoring; accountable for agent outputs |
| 2nd -- Risk / Compliance | Oversees, challenges, and independently validates; sets risk frameworks and policies | Independently validates agent systems; monitors ongoing performance against risk appetite; challenges autonomy tier assignments; sets agent governance policy | P8 independent validation, P5 autonomy tiers | Must be organizationally independent from 1st line; cannot develop what it validates; sets model risk appetite for agent systems |
| 3rd -- Internal Audit | Provides independent assurance over the governance framework itself | Audits the entire agent governance framework -- specifications, evidence quality, validation independence, trace completeness, policy adherence | P12 accountability, P9 observability | Evidence bundles and traces enable audit; structured data reduces audit cycle time; audit scope includes the governance process, not just the agent output |
Segregation of duties. The team that builds and operates the agent system cannot also validate it. This is non-negotiable under SR 11-7 and SS1/23. The manifesto's P8 evaluation framework must be extended to require organizational separation between the first line (development and operation) and second line (independent validation and challenge).
In practice, this means:
- First-line teams write specifications, build agent systems, and run evaluations as part of their development process.
- Second-line teams independently design validation test cases, execute them without first-line involvement, and issue findings that must be remediated before production deployment.
- Third-line teams audit the process: was the segregation real, were findings tracked to closure, did evidence bundles meet the standard.
DORA (Digital Operational Resilience Act)
DORA applies to financial entities operating in the EU and establishes requirements for ICT risk management, incident reporting, resilience testing, and third-party risk management. Agent systems are ICT assets and fall within scope.
| DORA Pillar | Articles | Requirement | Manifesto Principle | Alignment |
|---|---|---|---|---|
| ICT Risk Management | Art. 5-16 | Agent systems included in ICT risk framework; business impact analysis for agent failure scenarios; risk identification and classification | P3 defense-in-depth, P5 autonomy tiers | Good fit -- defense-in-depth architecture and tiered autonomy map directly to ICT risk management expectations. Agent failure scenarios should be included in business continuity planning. |
| Incident Reporting | Art. 17-23 | Agent failures classified as ICT incidents; classification by severity; notification to competent authorities within regulatory timelines; root cause analysis | P9 observability | Good fit -- structured traces enable incident classification and root cause analysis. Gap: incident reporting workflow, severity classification taxonomy for agent failures, and regulatory notification timelines are not addressed in the manifesto. |
| Resilience Testing | Art. 24-27 | Scenario testing for agent systems; advanced testing including TLPT for significant entities; testing of ICT tools, systems, and processes | P10 containment, chaos testing | Strong fit -- the manifesto's chaos testing (tool outages, noisy retrieval, adversarial inputs) aligns directly with DORA resilience testing expectations for agent systems. TLPT scenarios should include agent-specific attack vectors. |
| Third-Party Risk | Art. 28-44 | LLM providers as critical ICT third parties; concentration risk assessment; exit strategies; right to audit; sub-outsourcing controls; contractual requirements | P11 multi-model routing | Partial -- multi-model routing mitigates concentration risk by design. Gaps: contractual requirements for LLM providers (SLA, data handling, incident notification), exit planning and portability, sub-outsourcing visibility, right-to-audit clauses in provider agreements. |
| Information Sharing | Art. 45 | Agent-specific threat intelligence sharing with peers, regulators, and industry bodies | P10 containment | Supportive -- the manifesto's containment patterns generate threat intelligence (adversarial inputs, failure modes); no explicit mechanism for sharing this intelligence with the financial services community. |
Multi-model routing can be an effective mitigation for DORA concentration risk, but it is not a universal regulatory requirement. Under the third-party risk pillar, concentration risk in a single LLM provider creates regulatory exposure where a single provider outage would impair critical financial functions. P11 (economics of intelligence) therefore serves a dual purpose: cost optimization and DORA concentration risk mitigation. Organizations should document their multi-model routing strategy as a DORA third-party risk mitigation measure where relevant.
Exit planning. DORA requires exit strategies for critical ICT third-party providers. For agent systems, this means: the ability to switch LLM providers without loss of capability, portability of specifications and evaluation suites across providers, and documented fallback procedures when a provider becomes unavailable. P2 (specifications) and P8 (evaluations) support this if they are provider-agnostic by design.
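A documented fallback procedure of the kind exit planning requires can be sketched as an ordered provider chain. Provider names here are hypothetical; real callables would wrap each provider's SDK behind a common interface so specifications and evaluations stay provider-agnostic:

```python
def call_with_fallback(prompt, providers, chain=("primary", "secondary", "local_fallback")):
    """Try each configured provider in order and return (provider_name, output).

    `providers` maps provider names to callables; the ordered chain is the
    documented fallback procedure that DORA exit planning asks for.
    """
    last_error = None
    for name in chain:
        client = providers.get(name)
        if client is None:
            continue  # provider not configured in this environment
        try:
            return name, client(prompt)
        except Exception as exc:  # provider outage, timeout, or API error
            last_error = exc
    raise RuntimeError(f"all providers in chain unavailable: {last_error!r}")
```

The terminal failure is itself a reportable event: exhausting the chain should raise a DORA-classified incident rather than fail silently.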
Incident classification for agent failures. DORA requires classification of ICT-related incidents by materiality. Organizations should define agent-specific incident categories:
- Severity 1: Agent takes unauthorized action affecting customer accounts, market positions, or regulatory submissions.
- Severity 2: Agent produces incorrect output that is detected before downstream impact but indicates a control failure.
- Severity 3: Agent performance degradation (latency, accuracy drift) detected through monitoring but within tolerance thresholds.
- Severity 4: Agent failure contained by circuit breakers or fallback mechanisms with no downstream impact.
The manifesto's P9 (observability) provides the data needed for classification. The gap is the classification framework itself and the escalation workflow.
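The classification framework itself could start as a simple mapping from trace-derived facts to the four illustrative categories above. This is a hypothetical taxonomy sketch, not a DORA-defined scheme:

```python
def classify_agent_incident(unauthorized_action: bool,
                            downstream_impact: bool,
                            contained_by_safeguard: bool,
                            within_tolerance: bool) -> int:
    """Map trace-derived facts to the four illustrative severity
    categories above (hypothetical taxonomy, not a DORA-defined one)."""
    if unauthorized_action and downstream_impact:
        return 1  # unauthorized action reached accounts, positions, or submissions
    if contained_by_safeguard and not downstream_impact:
        return 4  # contained by circuit breaker or fallback
    if within_tolerance:
        return 3  # degradation detected by monitoring, within thresholds
    return 2  # incorrect output caught before impact; still a control failure
```

Each input flag should be derivable mechanically from P9 traces; the escalation workflow attached to each severity level remains the organizational gap noted above.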
EU AI Act
Financial AI systems frequently fall into the high-risk category under Annex III. The mapping below focuses on high-risk system obligations, which apply to most financial use cases involving automated decision-making.
| AI Act Requirement | Article | Manifesto Principle | Notes |
|---|---|---|---|
| Risk classification | Art. 6, Annex III | -- | Financial AI systems are frequently high-risk: credit scoring, insurance pricing, fraud detection, AML screening. Classification triggers the full set of high-risk obligations. |
| Risk management system | Art. 9 | P3, P5, P10 | Defense-in-depth, autonomy tiers, and containment engineering collectively satisfy risk management system requirements. Must be documented as a continuous iterative process. |
| Data governance | Art. 10 | P7 context engineering | Data quality, relevance, representativeness, and freedom from errors. Context quality engineering directly maps. Training data governance for fine-tuned models adds scope beyond P7. |
| Technical documentation | Art. 11 | P1 evidence, P2 specifications | Evidence bundles and versioned specifications satisfy technical documentation. Must include intended purpose, foreseeable misuse, and interaction with other systems. |
| Record-keeping and logging | Art. 12 | P9 observability | Automatic logging of events during system operation. Structured traces exceed this requirement. Logs must enable post-market monitoring and incident investigation. |
| Transparency and information to deployers | Art. 13 | P9 observability | Structured traces satisfy transparency obligations. Traces and documentation must be accessible to deployers in a form they can understand and act upon. |
| Human oversight measures | Art. 14 | P12 accountability, P5 autonomy | Tier-calibrated governance provides graduated human oversight proportional to risk. System must allow human intervention, including ability to override or stop the system. |
| Accuracy, robustness, cybersecurity | Art. 15 | P8 evaluations, P10 containment | Evaluation portfolios address accuracy requirements. Chaos testing addresses robustness. Cybersecurity must cover adversarial attacks specific to agent systems. |
| Conformity assessment | Art. 43 | P1 evidence bundles | Evidence bundles structured to serve as conformity assessment documentation. Financial services AI may require third-party conformity assessment under sector-specific rules. |
| Post-market monitoring | Art. 72 | P9 observability | Ongoing monitoring through traces, evaluation regression tracking, and performance drift detection. Must feed back into the risk management system. |
High-risk classification in financial services. Under Annex III, Section 5, the following financial use cases are explicitly listed as high-risk:
- Creditworthiness assessment of natural persons, including the establishment of credit scores.
- Risk assessment and pricing for life and health insurance.
Additional financial use cases may qualify as high-risk under the general criteria in Art. 6(2) when they significantly affect decisions about natural persons. Organizations should conduct a risk classification assessment for each agent system and document the rationale, including cases where the system is determined to be non-high-risk.
SOX Controls for Agent Systems
SOX compliance applies to publicly traded companies and focuses on internal controls over financial reporting. Agent systems that touch financial data, reporting pipelines, or accounting processes fall within scope.
| SOX Requirement | Manifesto Mechanism | Alignment |
|---|---|---|
| IT General Controls (ITGC) | P3 architecture, P5 autonomy tiers | Good fit -- defense-in-depth and tiered permissions map to ITGC expectations for access management, change management, and operations |
| Change management -- authorization, testing, approval before deployment | P2 specifications, P1 evidence bundles | Good fit -- evidence bundles with evaluation results, diffs, and deployment IDs exceed most ITGC change management documentation requirements |
| Access controls -- logical access, authentication, authorization | P5 autonomy tiers, least privilege | Good fit -- tier enforcement and granular permissions (read but not write, deploy to canary but not full rollout) provide stronger access controls than typical role-based models |
| Audit trails -- who did what, when, and why | P9 structured traces | Strong fit -- traces reconstruct reasoning chains, not just event logs; traces include decision rationale, tool calls, and policy checks |
| Segregation of duties -- incompatible functions separated | -- | Gap -- not explicitly addressed in the manifesto; must be enforced through organizational controls external to the agent system (see the Three Lines of Defense mapping below) |
| Financial reporting integrity -- completeness, accuracy, validity | P8 evaluations | Partial -- evaluation portfolios verify correctness but do not specifically address financial statement assertion-level testing (completeness, existence, valuation, rights, presentation) |
Algorithmic Accountability and Explainability
These requirements span multiple regulatory frameworks and represent a cross-cutting concern for any agent system that influences decisions affecting individuals.
| Requirement | Source | Manifesto Mechanism | Gap |
|---|---|---|---|
| Right to explanation for automated decisions | GDPR Art. 22 | P9 structured traces | Traces provide system-level reasoning reconstruction. Gap: individual-level explainability (why this specific decision for this specific customer) requires purpose-built explanation generation, not raw trace data. |
| Fairness and non-discrimination testing | Fair Lending (ECOA, FHA), FCA Consumer Duty, EU AI Act Art. 10 | P8 evaluation portfolios | No explicit fairness testing, bias detection, or protected-class impact analysis in the manifesto evaluation framework. Evaluation portfolios must be extended with fairness-specific test cases. |
| Contestability of automated decisions | Consumer protection regulation, FCA Consumer Duty | P12 accountability | No defined process for customer challenge of agent-influenced decisions. Accountability exists but a contestation workflow -- how a customer disputes, how the decision is re-examined, how traces are reviewed -- does not. |
| Kill switches for algorithmic trading systems | MiFID II Art. 17 | P10 containment, circuit breakers | Good fit -- circuit breakers and containment engineering serve as kill switch infrastructure. Must operate in real-time with sub-second latency for trading systems. |
| Model explainability for supervisory review | SR 11-7, SS1/23 | P9 traces, P1 evidence | Partial -- traces explain system-level behavior. Gap: model-level interpretability (feature importance, sensitivity analysis, partial dependence) requires additional tooling beyond manifesto scope. |
GDPR Art. 22 in practice. The right not to be subject to solely automated decision-making with legal or similarly significant effects creates a hard constraint on agent autonomy tiers in customer-facing financial decisions. Any agent system that produces a credit decision, insurance pricing determination, or account action must either:
- Maintain meaningful human involvement in the decision (not rubber-stamping), which maps to manifesto Tier 1 (observe) or Tier 2 (branch with approval), or
- Obtain explicit consent and provide the right to contest, which requires a contestation workflow that the manifesto does not currently define.
The practical implication is that for customer-facing decisions with legal or similarly significant effects, Tier 1 or Tier 2 is the conservative default pending jurisdiction-specific review.
Hard Autonomy Caps
The following caps are constraints derived from applicable law, not from risk preference. A mature Phase 5 organization still cannot exceed them for the listed use cases absent jurisdiction-specific legal review.
| Use Case | Maximum Tier | Regulatory Basis | Key Constraints |
|---|---|---|---|
| Credit and insurance underwriting, pricing, limit-setting | Tier 1 (observe only, conservative default) | EU AI Act Annex III §5 (high-risk); GDPR Art. 22; Fair Lending (ECOA, FHA) | Agent may analyze and recommend. Human makes every decision. Full explainability required. Fairness testing mandatory. |
| Algorithmic trading, execution, market making | Tier 1 (observe only, conservative default) | MiFID II Art. 17; MAR; Reg SCI | Kill switches mandatory and must operate sub-second. Agent cannot execute trades autonomously. |
| AML/KYC screening, SAR filing | Tier 2 max | AMLD6; FinCEN BSA; Wolfsberg Principles | Human review on every SAR. Agent assists triage and evidence assembly; does not make filing determinations. |
| Customer credit decisions (lending, card limits) | Tier 1 (observe only, conservative default) | EU AI Act Annex III §5; Consumer Credit Directive | Right to human review of automated credit decisions cannot be waived. |
| Claims decisioning affecting payout | Tier 1 (observe only, conservative default) | EU AI Act high-risk; FCA Consumer Duty | Agent may triage and summarize. Human adjudicates every claim. |
| Fraud detection triggering account action | Tier 2 max | Consumer Duty; GDPR | Agent may score and flag. Human authorizes account restriction or closure. |
| Regulatory reporting (drafting, consistency checks) | Tier 2 max | COREP/FINREP; various reporting regulations | Accuracy requirements are absolute. Agent drafts; human approves before submission. |
| Back-office automation (reconciliation, data entry) | Tier 3 available | SOX (with evidence controls) | Standard manifesto adoption path. Evidence bundles satisfy change management requirements. |
These are conservative defaults, not universal legal ceilings; legal review is required for each product and jurisdiction.
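The caps in this table lend themselves to machine enforcement rather than policy documents alone. A minimal sketch in Python: the use-case keys, tier integers, and `assert_tier_allowed` guard are all illustrative assumptions, not names from the manifesto or any real tooling.

```python
# Hypothetical enforcement of the regulatory autonomy caps above.
# Use-case keys and tier integers are illustrative only.

MAX_TIER = {
    "credit_underwriting": 1,   # EU AI Act Annex III §5, GDPR Art. 22
    "algorithmic_trading": 1,   # MiFID II Art. 17
    "aml_kyc_screening": 2,     # AMLD6, FinCEN BSA
    "claims_decisioning": 1,    # EU AI Act high-risk, FCA Consumer Duty
    "regulatory_reporting": 2,  # COREP/FINREP
    "back_office": 3,           # SOX with evidence controls
}

def assert_tier_allowed(use_case: str, requested_tier: int) -> int:
    """Reject any tier assignment above the regulatory cap.

    Unknown use cases default to the most restrictive cap (Tier 1)
    until a documented risk classification exists.
    """
    cap = MAX_TIER.get(use_case, 1)
    if requested_tier > cap:
        raise PermissionError(
            f"{use_case}: requested Tier {requested_tier} exceeds cap Tier {cap}"
        )
    return requested_tier
```

The default-deny posture for unknown use cases mirrors the guidance above: a use case without a documented risk classification gets the conservative ceiling.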
Market-Specific Autonomy Guidance
The table below maps common financial services workflows to recommended starting autonomy tiers. These are starting points; actual tier assignments must reflect the organization's risk appetite and regulatory obligations — and must not exceed the hard caps above.
| Use Case | Risk Profile | Recommended Starting Autonomy | Key Regulations | Notes |
|---|---|---|---|---|
| Back-office automation (document processing, reconciliation, data entry) | Low | Tier 1-3 | SOX | Standard manifesto adoption path. Evidence bundles satisfy change management. Low regulatory sensitivity allows higher autonomy tiers. |
| Model development support (quant code generation, research assistance, data exploration) | Medium | Tier 1-2 | SR 11-7 | Agent output independently validated by model validation team. The agent is a development tool, not the model itself. Output enters the model development lifecycle and is subject to full SR 11-7 validation. |
| Regulatory reporting (drafting, data aggregation, consistency checks) | Medium | Tier 1-2 | Various (COREP, FINREP, FR Y-9C, Call Reports) | High value use case. Agent drafts, human approves. Traces provide audit trail for regulatory examination. Accuracy requirements are absolute -- no tolerance for reporting errors. |
| AML/KYC (transaction monitoring, customer due diligence, screening) | High | Tier 1-2 | AML Directives (AMLD6), FinCEN BSA, Wolfsberg Principles | Human review on every SAR. Agent assists triage and evidence assembly but does not make filing determinations. False negative risk is regulatory and criminal. |
| Credit and insurance decisioning (underwriting, pricing, limit setting) | High | Tier 1 (observe only) | EU AI Act (high-risk), Fair Lending (ECOA, FHA), Consumer Duty | High-risk AI classification. Agent provides analysis and recommendations; human makes the decision. Full explainability required. Fairness testing mandatory. |
| Algorithmic trading (execution, market making, systematic strategies) | Highest | Tier 1 (observe only) | MiFID II Art. 17, MAR, Reg SCI | Kill switches mandatory. Agent cannot execute trades autonomously. Real-time monitoring required. Latency constraints may limit agent applicability. |
Data Residency and Classification
Customer PII processed through external LLM APIs triggers GDPR cross-border transfer obligations (Chapter V), including adequacy decisions, standard contractual clauses, or binding corporate rules. The Schrems II framework adds requirements for supplementary measures when transferring data to jurisdictions without adequate protection.
Banking secrecy laws in certain jurisdictions (Switzerland, Luxembourg, Singapore, the Cayman Islands) may prohibit sharing financial data with third-party inference providers entirely. These laws operate independently of GDPR and may impose stricter constraints.
Data classification must gate agent access, model routing, and memory retention at the infrastructure level:
- Public / Internal: Agent may use any model, including hosted APIs. Standard manifesto adoption applies.
- Confidential: Agent restricted to approved models with appropriate data processing agreements. Memory retention subject to data minimization.
- Restricted / Secret: Agent restricted to on-premises or private-cloud models only. No external API calls. Memory must not persist beyond session.
This is an infrastructure enforcement concern under P5 (autonomy tiers): data classification becomes an autonomy constraint enforced at the system level, not merely a policy document. The routing layer (P11) must respect classification boundaries -- a cost-optimal route that violates data residency rules is not a valid route.
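The classification gates above can be sketched as a routing guard that runs before any cost ranking. The `Classification` enum, the model registry, and the `route_model` function below are illustrative assumptions, not part of any real routing layer.

```python
from enum import IntEnum

class Classification(IntEnum):
    # Ordered least to most sensitive, mirroring the list above.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

# Hypothetical model registry: name -> (hosting, has_dpa)
MODELS = {
    "hosted-general": ("external_api", False),
    "hosted-dpa": ("external_api", True),
    "onprem-private": ("on_premises", True),
}

def route_model(classification: Classification, candidates=MODELS):
    """Return model names permitted for this data classification.

    A cost-optimal route that violates residency rules is not a valid
    route, so this filter runs before any cost ranking (P11).
    """
    allowed = []
    for name, (hosting, has_dpa) in candidates.items():
        if classification <= Classification.INTERNAL:
            allowed.append(name)            # any approved model
        elif classification == Classification.CONFIDENTIAL:
            if has_dpa:                     # data processing agreement required
                allowed.append(name)
        else:                               # Restricted / Secret
            if hosting == "on_premises":    # no external API calls
                allowed.append(name)
    return allowed
```

For Restricted data, only the on-premises entry survives the filter; the cost optimizer never sees the external candidates.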
Agent Tooling Configuration
This section maps the regulatory requirements above to the agent tooling configuration mechanisms described in the manifesto's companion documents. Read this alongside your tool's enterprise configuration guide — neither is sufficient alone.
DORA Article 9 Evidence Chain
DORA Article 9 requires that changes are recorded, tested, and approved before production. Each requirement maps to a specific hook type:
| DORA Article 9 requirement | Hook type | What it produces |
|---|---|---|
| Changes are recorded | PostToolUse audit logging hook | SIEM record: timestamp, developer, tool calls, trace ID, deployment ID |
| Changes are tested before deployment | PreToolUse test enforcement hook | Test pass/fail record, coverage threshold evidence |
| Changes are approved before production | PreToolUse PR gate hook + Layer 7 RBAC | Named approver, approval timestamp, scope of approval |
| Sensitive data not exposed | PreToolUse data residency enforcement hook | Classification check log, block record if violated |
| Session activity is auditable | SessionStart hook + transcript centralization | Session initiation record; note: transcripts are stored locally by default and must be centralized via scheduled hook or script |
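As one illustration of the hook types in this table, here is a sketch of a PreToolUse data residency enforcement hook. It assumes a hook runner that passes a JSON event on stdin and treats a nonzero exit status as a block; the event schema, field names, and marker patterns are assumptions, not a documented interface of any specific tool.

```python
"""Sketch of a PreToolUse data residency enforcement hook.

The event shape ({"tool_name": ..., "tool_input": {...}}) and the
blocked tool names are illustrative assumptions.
"""
import json
import re
import sys

# Patterns suggesting classified data is about to leave the boundary.
CLASSIFIED_MARKERS = [
    re.compile(r"\bCONFIDENTIAL\b"),
    re.compile(r"\bRESTRICTED\b"),
    re.compile(r"\b\d{16}\b"),  # naive card-number-like check
]

def decide(event: dict):
    """Return (allowed, reason) for a single tool-call event."""
    payload = json.dumps(event.get("tool_input", {}))
    if event.get("tool_name") in {"WebFetch", "ExternalAPI"}:
        for pattern in CLASSIFIED_MARKERS:
            if pattern.search(payload):
                return False, f"blocked: classified marker {pattern.pattern}"
    return True, "ok"

def main() -> int:
    """Hook entry point: read the event, emit the reason, signal via exit code."""
    allowed, reason = decide(json.load(sys.stdin))
    print(reason, file=sys.stderr)
    return 0 if allowed else 2
```

A block by this hook is itself evidence: the reason string becomes the "block record if violated" artifact from the table above once it lands in the SIEM.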
The configuration repository (ai-governance-config or equivalent) is the auditable record of how these controls are configured. Point auditors to the repository — it is the answer to "how do you control what the AI tool can do."
Three Lines of Defense → RBAC Mapping
| Line | Role | RBAC role | Hook infrastructure access |
|---|---|---|---|
| 1st line — Development | Builds and operates agent systems | Developer role | Development hooks (secrets detection, test enforcement, security scanning) |
| 2nd line — Risk/Compliance | Independent validation; sets risk policy | Read-only access to validation hook configs, or separate validation workspace | Validation hooks only — runs on independent infrastructure, not shared with 1st line |
| 3rd line — Internal Audit | Audits the governance framework | Cost/usage visibility + read access to configuration repository | Audit log hooks output; configuration repository Git history |
Segregation note: 2nd-line validation infrastructure must be organizationally separated from development infrastructure — use a separate workspace or tenant for 2nd-line validation execution, not shared infrastructure with the development environment.
MCP Allowlisting as Data Residency Control
The managed MCP policy (Layer 6) is the primary infrastructure control for GDPR cross-border transfer compliance and banking secrecy law requirements:
- MCP servers calling external APIs for Confidential/Restricted data must be restricted to approved providers with signed Data Processing Agreements.
- MCP servers must be routed through the corporate proxy — not direct external access — so egress is logged and auditable.
- The MCP allowlist is the machine-enforced answer to "what third-party systems can the AI access?" Document every approved MCP server with its data classification scope and DPA reference.
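A default-deny allowlist check along these lines might look as follows; the registry format, server names, and DPA reference fields are assumptions for illustration, not a real managed MCP policy schema.

```python
# Illustrative MCP allowlist with classification scope and DPA
# reference per entry, as recommended above. All names are assumed.

MCP_ALLOWLIST = {
    "jira-internal": {"max_classification": "Confidential", "dpa": "DPA-2024-017"},
    "market-data": {"max_classification": "Internal", "dpa": "DPA-2023-102"},
}

_ORDER = ["Public", "Internal", "Confidential", "Restricted"]

def mcp_access_allowed(server: str, data_classification: str) -> bool:
    """Allow the call only if the server is allowlisted AND its approved
    classification scope covers the data being sent."""
    entry = MCP_ALLOWLIST.get(server)
    if entry is None:
        return False  # default-deny: unlisted servers are blocked
    return _ORDER.index(data_classification) <= _ORDER.index(entry["max_classification"])
```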
Model Version Pinning for Regulatory Stability
During periods of regulatory sensitivity, pin the agent tooling to a specific model version in the managed settings file to prevent behavioral drift:
- Q1 regulatory reporting cycles (COREP, FINREP, annual reports)
- Year-end close periods
- During active supervisory examinations
- After any SR 11-7 independent validation that established a behavioral baseline
Behavioral drift between model versions can invalidate a validation baseline. Document pinning decisions with rationale in the configuration repository change log.
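A drift guard of this kind can run in CI during pinning windows. The settings-file shape and the `check_model_pin` helper below are assumptions, sketched to show the intent rather than any tool's actual configuration format.

```python
# Sketch of a drift guard for model version pinning. The JSON shape
# and the idea of a stored validation baseline are illustrative.
import json

def check_model_pin(settings_json: str, validated_baseline: str) -> None:
    """Raise if the configured model differs from the validated baseline.

    Intended to run in CI during regulatory-sensitive windows so a
    silent model upgrade cannot invalidate an SR 11-7 baseline unnoticed.
    """
    pinned = json.loads(settings_json).get("model")
    if pinned is None:
        raise ValueError("no model pin set during a pinning window")
    if pinned != validated_baseline:
        raise ValueError(
            f"model drift: pinned {pinned!r} != validated baseline {validated_baseline!r}"
        )
```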
ALCOA+ Compliance
The manifesto's evidence model satisfies ALCOA+ data integrity requirements by construction. See Companion Frameworks — ALCOA+ Alignment for the complete mapping table.
For financial services, this means:
- SR 11-7 model documentation: Evidence bundles and structured traces provide the "Attributable," "Legible," and "Contemporaneous" criteria that model validation teams rely on for independent review.
- SOX audit trails: Traces meet the "Original," "Accurate," and "Complete" criteria required for IT General Controls over financial reporting pipelines.
- DORA record-keeping: The "Enduring" and "Available" criteria are satisfied by trace retention infrastructure and queryable audit stores.
Key constraint: the trace infrastructure itself is a production system subject to IT change management. Organizations must version-control their observability configuration and include it in SOX ITGC scope.
Viable Starting Points
Not all financial services workflows carry equal regulatory burden. The following are realistic entry points for agentic engineering practices today:
Back-office automation (SOX-scoped). Reconciliation, data entry, document processing. Lower regulatory sensitivity. Standard evidence bundles satisfy SOX change management. Natural Phase 3→4 pilot domain.
Model development support. Agents assist quant researchers with code generation, data exploration, and analysis. Agent output enters the SR 11-7 model lifecycle and receives full independent validation — the agent is a development accelerant, not a replacement for model governance.
Regulatory reporting — consistency checking. Agents cross-check report data against source systems, flag inconsistencies, and draft narrative sections. Human approves before submission. Traces provide the audit trail regulators expect.
Traceability and evidence assembly. Agents assemble SR 11-7 model documentation packages, DORA incident records, and SOX change management evidence. Reduces cycle time for audits and examinations without automating the decisions themselves.
AML/KYC triage assistance. Agents pre-screen transaction monitoring alerts, assemble supporting evidence, and draft initial assessments. Human reviews every case before disposition. False negative risk means Tier 1-2 is the permanent ceiling, but agent triage can significantly reduce analyst workload.
Open Regulatory Questions
The following questions do not have settled regulatory answers. Organizations adopting agentic engineering in financial services should track these areas and engage with supervisors proactively.
- SR 11-7 model inventory scope. Are agent systems "models" under SR 11-7? If an agent uses an LLM to generate risk assessments, is the agent the model, the LLM the model, or both? Inventory classification methodology for agent systems is unsettled. Conservative approach: register the agent system as a model and the underlying LLM as a vendor model.
- DORA third-party risk for LLM APIs. When agents call external LLM APIs, does the LLM provider constitute a critical ICT third-party service provider? Concentration risk thresholds and contractual requirements for LLM providers are undefined in current regulatory technical standards.
- Champion-challenger methodology. Traditional champion-challenger compares model outputs on identical inputs. Agent systems are non-deterministic and context-dependent. Methodology for meaningful comparison -- including statistical approaches to handle output variability -- is undeveloped.
- Regulatory examination expectations. Supervisory examination procedures for agent governance do not yet exist. Early adopters should prepare for ad hoc supervisory inquiries and document governance frameworks defensively. Evidence bundles (P1) and traces (P9) position organizations well for this.
- EU AI Act conformity assessment. The interaction between AI Act conformity assessment and existing financial services supervisory frameworks (CRD, MiFID, Solvency II) is not yet clarified by the European Commission. Dual compliance obligations may emerge.
Mapping the Agentic Engineering Manifesto principles to automotive functional safety and software process frameworks.
Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to automotive regulatory frameworks. It does not constitute compliance or certification advice. Consult qualified functional safety engineers and type-approval specialists for compliance determinations.
Regulatory currency: This document reflects ISO 26262:2018, ASPICE 3.1, UN Regulation 157 (ALKS), UN Regulation 155 (cybersecurity), ISO/SAE 21434, and ISO PAS 8800 (draft) as understood at the time of last review. ISO PAS 8800 is under active development; its requirements may change materially before publication. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.
See companion-frameworks.md for boundary conditions on regulated-industry adoption. See adoption-vmodel.md for the V-model adoption path applicable to verification-heavy lifecycles.
Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to automotive regulatory requirements; it does not redefine them.
Scope: ISO 26262, ASPICE (Automotive SPICE), UN Regulation 157 (ALKS), UN Regulation 155 (cybersecurity), ISO/SAE 21434 (cybersecurity), ISO PAS 8800 (AI in road vehicles — under development).
Audience: Functional safety engineers, ASPICE assessors, software leads, and systems engineers evaluating where agentic engineering practices can operate within existing type-approval and functional safety constraints.
Automotive Safety Integrity Level (ASIL) to Manifesto Autonomy Mapping
ISO 26262 assigns Automotive Safety Integrity Levels (ASIL A through D) to safety functions based on Severity × Exposure × Controllability. The mapping below constrains the maximum permissible agent autonomy tier based on the ASIL of the software element under development.
| ASIL | Failure Potential | Max Agent Autonomy Tier | Verification Depth | Rationale |
|---|---|---|---|---|
| ASIL D | Most severe | Tier 1 — Observe only | All agent output independently verified through qualified means; Part 6 (software) objectives at ASIL D rigor | No tool credit for unqualified tool output. Agent assists analysis and proposes; qualified engineer authors and verifies. |
| ASIL C | Severe | Tier 1 — Observe only | Independent verification required; Part 6 ASIL C objectives apply | Same constraint as ASIL D. Reduced objective count does not relax the independence requirement. |
| ASIL B | Significant | Tier 1-2 — Observe or Branch | Agent may draft artifacts to isolated branches; merge requires qualified human verification against Part 6 ASIL B objectives | Fewer independence requirements at ASIL B. Agent-drafted code and tests are viable when independently reviewed. |
| ASIL A | Low | Tier 1-3 — Full tier range | Standard evidence bundles (P1) attached to each agent contribution; verification per Part 6 ASIL A objectives | Reduced verification rigor. Agent contributions with evidence bundles can satisfy most objectives with standard review. |
| QM (Quality Management only) | Negligible safety relevance | Tier 1-3 — Full tier range | Standard manifesto governance; no functional safety objectives | No ASIL applies. Normal manifesto governance is sufficient. |
These are conservative defaults for safety-relevant software paths; lower-risk QM and supporting tooling may permit higher autonomy.
ASIL decomposition. ISO 26262 supports ASIL decomposition: an ASIL D requirement may be decomposed into two ASIL B requirements handled by independent channels. In agentic contexts, ASIL decomposition applies to the agent's contribution to each decomposed channel independently — the two-channel independence requirement must be preserved even when agents assist in developing both channels.
Key constraint: ASIL assignment is determined by the hazard analysis and risk assessment (HARA, ISO 26262 Part 3), not by the development team. The ASIL dictates the autonomy ceiling; the team cannot raise it.
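The ceiling logic in the table above reduces to a small lookup. The dictionary keys and `autonomy_ceiling` helper are illustrative, not part of ISO 26262.

```python
# Illustrative encoding of the ASIL-to-autonomy-ceiling table above.
ASIL_MAX_TIER = {"D": 1, "C": 1, "B": 2, "A": 3, "QM": 3}

def autonomy_ceiling(asil: str, requested_tier: int) -> int:
    """Clamp a requested tier to the ceiling dictated by the
    HARA-assigned ASIL. The team cannot raise the ceiling; it can
    only operate at or below it."""
    return min(requested_tier, ASIL_MAX_TIER[asil])
```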
ISO 26262 Software Process to Manifesto Mapping
ISO 26262 Part 4 (system level) and Part 6 (software level) govern the development process. The table below maps key activities to manifesto principles.
| ISO 26262 Activity | Part / Clause | Manifesto Equivalent | Principle | Alignment | Gap |
|---|---|---|---|---|---|
| Initiation of product development at software level | Part 6, §5 | Specification scope; autonomy tier assignment | P2, P5 | Strong. Machine-readable specifications map to software development plan inputs. | SW development plan must document tool qualification and agent usage as part of the SW development environment. |
| Specification of software safety requirements | Part 6, §6 | Specify phase; machine-readable specs with safety constraints | P2 | Strong. Living specifications support traceability to ASIL-allocated safety requirements. | Formal notation may be required for ASIL C/D; agent-drafted formal specs must be independently reviewed. |
| Software architectural design | Part 6, §7 | Design phase; domain boundaries (P3) | P3 | Strong. Enforced boundaries map to software component isolation. | ASIL C/D require freedom from interference between components; independent verification of architectural decisions required. |
| Software unit design and implementation | Part 6, §8 | Execute phase; agent generates code | P4, P5 | Partial. Agent execution replaces human coding. | Tool qualification (Part 8, §11) applies to tools that automate Part 6 activities. See Tool Qualification section below. |
| Software unit verification | Part 6, §9 | Verify phase; evaluation portfolio (P8) | P8 | Strong. Evaluation gates exceed minimum unit test requirements. | Must include static analysis (ASIL B/C/D), code coverage (MC/DC at ASIL D), and review by independent party (ASIL C/D). |
| Software integration and testing | Part 6, §10 | Verify phase; integration evaluations | P8, P9 | Strong. Traces reconstruct cross-component interactions. | Integration testing must verify software component interfaces per architectural design. |
| Verification of software safety requirements | Part 6, §11 | Validate phase; outcome-based evidence (P1) | P1, P8 | Strong. Outcome-based validation aligns with ASIL-calibrated verification. | Requirements-based testing must trace to every software safety requirement. |
| Configuration management | Part 8, §7 | Knowledge as versioned ground truth (P6) | P6 | Strong. Versioned specifications and evidence bundles map to CM objectives. | Agent-generated artifacts must be CM items; model versions must be baselined alongside source code baselines. |
| Change management | Part 8, §8 | Govern phase; autonomy tier gate on changes | P5, P12 | Strong. Tier 2 branch-to-merge workflow enforces change management. | Impact analysis for ASIL-relevant changes must be performed by a qualified safety engineer before merge. |
ISO 26262 Part 8, §11 — Tool Qualification
ISO 26262 Part 8, §11 determines whether a software development tool requires qualification. This is the primary constraint on agent use in ASIL-relevant development, analogous to DO-330 in aviation.
Tool Confidence Level (TCL) Determination
A tool's required confidence level (TCL 1, 2, or 3) is determined by two factors:
- Tool Impact (TI): Could tool errors remain undetected and cause or contribute to a violation of safety requirements?
- Tool Error Detection (TD): Could the error be detected before it could affect the safety of the item?
| TCL | Basis | Agent Feasibility |
|---|---|---|
| TCL 1 | Low tool impact or high detection | Viable. If agent output is always independently reviewed by qualified engineers, the detection probability is high, placing many agent functions at TCL 1. |
| TCL 2 | Moderate tool impact, moderate detection | Viable with constraints. Requires increased confidence measures: use case restrictions, validation of tool use environment, or tool monitoring. |
| TCL 3 | High tool impact, low detection | Challenging. Requires formal tool qualification or use of a pre-qualified tool. Current LLMs are not practical candidates for TCL 3 qualification under present evidence and qualification expectations. |
The viable path. Independent human verification of all agent output is the primary mechanism for achieving high TD (tool error detection), which reduces the TCL classification for most agent functions. An agent that generates code which is always reviewed by a qualified engineer before integration typically achieves TCL 1 or TCL 2 — making tool qualification unnecessary for those functions.
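The TI/TD logic can be sketched as a lookup that mirrors the determination logic in ISO 26262-8, where tool impact falls into classes TI1/TI2 and error detection into TD1-TD3. This simplification is for intuition only, not a substitute for the standard's tables.

```python
# Simplified TCL determination following ISO 26262-8 logic
# (TI1/TI2 impact classes, TD1-TD3 detection classes).

def tool_confidence_level(tool_impact: int, error_detection: int) -> int:
    """Return TCL 1-3 from TI (1-2) and TD (1-3).

    TI1 (no possible impact on safety) always yields TCL1. With TI2,
    high detection confidence (TD1) keeps the TCL at 1, which is why
    mandatory independent human review of agent output is the viable
    path described above.
    """
    if tool_impact == 1:
        return 1
    return {1: 1, 2: 2, 3: 3}[error_detection]
```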
ASPICE (Automotive SPICE) Process Alignment
ASPICE is the software process framework used across the automotive supply chain. Most OEM development contracts require ASPICE assessment at Level 2 or 3. Agentic engineering does not conflict with ASPICE — it accelerates several process areas.
| ASPICE Process Area | Manifesto Alignment | Agent Contribution |
|---|---|---|
| SWE.1 — Software Requirements Analysis | P2 living specifications | Agents assist requirements traceability, consistency checking, and impact analysis |
| SWE.2 — Software Architectural Design | P3 defense-in-depth | Agents draft architectural views; qualified engineers verify against safety requirements |
| SWE.3 — Software Detailed Design and Unit Construction | P4/P5 execution with autonomy tiers | Agents generate code at ASIL-appropriate tier; independent review required for ASIL B+ |
| SWE.4 — Software Unit Verification | P8 evaluations as contract | Agent-generated test cases and coverage analysis; qualified engineer reviews before baseline |
| SWE.5 — Software Integration and Integration Testing | P8/P9 evaluation and observability | Agents generate integration test suites; traces support integration evidence |
| SWE.6 — Software Qualification Testing | P1 outcome evidence | Agent-assisted test execution and evidence bundle assembly; a domain-qualified engineer approves |
| SUP.1 — Quality Assurance | P12 accountability | Named domain owner accountable for agent output quality; QA role is independent oversight |
| SUP.8 — Configuration Management | P6 knowledge as versioned ground truth | Agent artifacts are CM items; model versions tracked alongside software baselines |
| SUP.10 — Change Request Management | P5 tier enforcement | Tier 2 branch gate enforces change request workflow before integration |
UN Regulation 157 (ALKS) and Autonomous Driving
UN Regulation 157 governs Automated Lane Keeping Systems (ALKS) and represents the most developed regulatory framework for autonomous driving functions. It establishes performance requirements that interact directly with agent autonomy tiers.
The fundamental constraint: agents assisting in the development of ALKS software face the highest ASIL assignments (typically ASIL C/D for the safety-relevant functions). All development activity on these functions is subject to the ASIL-based autonomy caps in the first table above.
Agent use cases for ALKS development:
| Use Case | Recommended Tier | Notes |
|---|---|---|
| Scenario generation for safety validation | Tier 1-2 | Agents generate candidate scenarios from failure mode databases. Human safety engineer validates scenario coverage and acceptance criteria. |
| Simulation test infrastructure | Tier 1-3 (QM functions) | Simulation toolchain is typically QM; standard manifesto adoption applies. |
| Requirements traceability | Tier 1-2 | Agents assemble traceability matrices from system, software, and test requirements. Human validates completeness against ASIL allocation. |
| Safety case argumentation | Tier 1 | Agents may assist structuring the safety case (GSN/CAE format). All safety arguments require human authorship and qualified engineer sign-off. |
| Regression test suite maintenance | Tier 1-2 | Agents update test cases as specifications evolve. Qualified engineer approves changes to safety-relevant test cases. |
ISO/SAE 21434 — Cybersecurity Engineering
ISO/SAE 21434 governs cybersecurity engineering for road vehicles, complementing ISO 26262 for safety. Agents introduce specific cybersecurity risk vectors that must be addressed in the Threat Analysis and Risk Assessment (TARA).
| Cybersecurity Concern | Manifesto Mapping | Automotive-Specific Note |
|---|---|---|
| Agent model supply chain integrity | P3 architecture boundaries | Model provenance, integrity verification, and version pinning. An untrusted model update is a supply chain attack vector affecting the CAL (Cybersecurity Assurance Level) of the affected function. |
| Prompt injection in development agents | P10 containment | Adversarial inputs to development agents could introduce vulnerabilities in vehicle software. Independent verification (TCL 1 path) is the primary mitigation. |
| Data exfiltration via agent context | P7 context engineering | Agent context windows may contain CSMS-protected design data or cybersecurity-relevant technical information. |
| Model routing and multi-vendor supply chain | P11 economics | Each model provider in a multi-model routing setup expands the supply chain; each requires TARA assessment under CSMS obligations. |
Market-Specific Autonomy Guidance
| Workflow | ASIL / Risk Level | Recommended Autonomy | Notes |
|---|---|---|---|
| ASIL D/C safety-critical software | ASIL D/C | Tier 1 (observe only) | Agent assists analysis and proposes; qualified engineer authors and verifies all artifacts. TCL qualification typically not required due to high TD through independent review. |
| ASIL B software | ASIL B | Tier 1-2 | Agents draft to isolated branches; independent verification required before integration. |
| ASIL A and QM software | ASIL A / QM | Tier 1-3 | Standard evidence bundles sufficient. Natural pilot domain for early adoption. |
| Test generation (any ASIL) | Tool output only | Tier 1 (observe) | Agents generate candidate test cases, traceability matrices, and coverage analyses. Qualified engineer accepts before baseline. No TCL 3 qualification required at Tier 1. |
| Simulation and virtual validation | QM context | Tier 1-3 | Simulation infrastructure is typically QM. Standard manifesto adoption. High-value domain for accelerating validation campaigns. |
| Safety case and FMEA support | Safety-critical analysis | Tier 1 | Agents assist structuring FMEAs, FTAs, and safety cases. All safety determinations are human-authored and signed off by a qualified functional safety engineer. |
| ASPICE process documentation | Process improvement | Tier 1-3 | ASPICE artifacts (work products) are human-reviewed. Agent-assisted generation reduces cycle time for lower-risk work products. |
Viable Starting Points
QM software development. No ASIL obligations. Full agentic loop permissible. Standard evidence bundles. Use to build team competency and evidence practices before taking on ASIL-rated functions.
Test generation for any ASIL (Tier 1 observe). Agents generate candidate unit tests, integration scenarios, and regression cases. Qualified engineer accepts before baseline. High value, low regulatory risk regardless of ASIL level.
ASPICE process documentation. Agent-assisted generation of work products: software development plans, traceability matrices, review records. Human authors and signs off. Reduces ASPICE preparation cycle time significantly.
Simulation scenario generation. Agents generate candidate test scenarios for virtual validation campaigns from failure mode libraries and operational design domain specifications. Safety engineer validates coverage and acceptance criteria.
Requirements traceability automation. Agents assemble specification-to-test-to-verification matrices. Qualified engineer validates completeness. Directly supports ASPICE SWE.4/SWE.5 evidence.
Regression test suite maintenance. As specifications evolve, agents update test cases to reflect changes. Human reviews all changes to safety-relevant test cases before re-baseline.
Tool Configuration Notes
How to configure agent tooling to satisfy ISO 26262 CM obligations and ISO/SAE 21434 cybersecurity requirements.
Configuration Management Hook Mapping
ISO 26262 Part 8, §7 requires that all safety-relevant development artifacts are identified, baselined, and change-controlled. Agent configuration contributes to this:
| ISO 26262 CM Objective | Hook Type | What It Produces |
|---|---|---|
| Identification of agent-generated artifacts | PostToolUse audit hook | Artifact ID, agent session ID, model version, timestamp, ASIL context |
| Change control for ASIL-relevant artifacts | PreToolUse gate hook | ASIL classification check; blocks merge to safety-relevant branch without qualified reviewer approval |
| Problem reporting from evaluation failures | PostToolUse evaluation hook | Evaluation failure record with trace ID; automatic problem report creation |
| Model version baselining | SessionStart hook | Records model version in session metadata; must match approved baseline |
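The PostToolUse audit hook in the table above could emit a record along these lines. This is an illustrative sketch, not a schema mandated by ISO 26262; the function name and field names are hypothetical, and a real hook would also write the record to the configuration management system of record.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_cm_audit_record(artifact_path: str, artifact_bytes: bytes,
                          session_id: str, model_version: str,
                          asil_level: str) -> str:
    """Assemble an ISO 26262 CM identification record for an
    agent-generated artifact. Field names are illustrative only."""
    record = {
        # Content hash serves as a stable artifact identifier for baselining.
        "artifact_id": hashlib.sha256(artifact_bytes).hexdigest(),
        "artifact_path": artifact_path,
        "agent_session_id": session_id,
        "model_version": model_version,
        "asil_context": asil_level,  # e.g. "ASIL-B", "QM"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)
```

A qualified engineer's approval reference could be appended to the same record at baseline time, linking the identification objective to the change-control objective.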
Data and Design Protection
- Restrict MCP servers to on-premises or approved endpoints for sessions containing CSMS-protected design data or ASIL-rated requirement documents.
- Model version pinning is a CM obligation for ASIL-relevant development: pin to the approved model version in the development environment configuration; any model change requires a change request and ASIL impact assessment.
- Apply ITAR/EAR controls (see defense-government.md) if the program involves defense-related content subject to export control.
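A minimal sketch of the model version pinning check described above, run at session start. The function name, version strings, and error message are hypothetical; the point is that any deviation from the approved baseline halts the session rather than proceeding silently.

```python
def check_model_baseline(session_model: str, approved_baseline: str) -> bool:
    """SessionStart gate: refuse to start an ASIL-relevant session if the
    runtime model version deviates from the CM-approved baseline."""
    if session_model != approved_baseline:
        # Pinning is exact-match: a minor version bump is still a change
        # requiring a change request and ASIL impact assessment.
        raise RuntimeError(
            f"Model {session_model!r} does not match approved baseline "
            f"{approved_baseline!r}; file a change request before proceeding.")
    return True
```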
Open Regulatory Questions
ISO PAS 8800 (AI in road vehicles). ISO PAS 8800 is under active development and will be the primary standard governing AI system development for road vehicles. Its release will clarify tool qualification requirements, autonomy constraints, and evidence requirements for AI-assisted development. Monitor ISO TC22/SC32.
Tool qualification path for AI-based development tools. ISO 26262 Part 8, §11 predates LLM-based development tools. The existing TCL framework can be applied (and the Tier 1 observe approach yields a high tool error detection class, keeping the required TCL low), but no guidance exists specifically for non-deterministic generation tools. Industry groups (ISO TC22, AUTOSAR) are developing clarifications.
ASIL decomposition and agent-generated dual-channel software. When ASIL decomposition is used to justify agent involvement in both channels, the independence requirement between channels must be preserved at the model, knowledge store, and evaluation infrastructure levels — not just at the code level. Methodology for demonstrating this independence is undeveloped.
UN Regulation 157 / ALKS edge case coverage. The regulation requires demonstration of performance across a defined operational design domain. Agent-generated scenario coverage methodologies for satisfying ODD completeness arguments are not yet standardized.
Memory and learned behavior in development tools. If agent learned memory influences ASIL-rated software output, does that memory become a CM item? The conservative position (consistent with aviation) is yes — but automotive standards do not address this explicitly.
Mapping the Agentic Engineering Manifesto principles to defense and government regulatory frameworks.
Disclaimer — This document maps concepts from the Agentic Engineering Manifesto to defense and government regulatory frameworks. It does not constitute compliance, legal, or security advice. Consult qualified security officers, program managers, and legal counsel for compliance determinations. Classification obligations vary significantly by program; this document addresses unclassified system development.
Regulatory currency: This document reflects CMMC 2.0, FedRAMP (current marketplace and authorization requirements), NIST SP 800-53 Rev 5, NIST SP 800-171 Rev 2, ITAR (22 CFR 120-130), EAR (15 CFR 730-774), and DoD Instruction 5000.02 as understood at the time of last review. CMMC scoping guidance for AI systems is not yet settled; DIBCAC has not issued definitive guidance on LLM API boundary classification. FedRAMP authorization status for frontier LLM providers is evolving rapidly; verify the FedRAMP marketplace before making infrastructure decisions. Last reviewed: April 2026. Proposed changes not yet enacted are flagged as such.
See companion-frameworks.md for boundary conditions on regulated-industry adoption.
Canonical sources. Normative principle definitions (P1–P12) and autonomy tier definitions are in manifesto-principles.md. This document maps those definitions to defense and government regulatory requirements; it does not redefine them.
Scope: CMMC 2.0 (DoD contractor cybersecurity), FedRAMP (federal cloud authorization), NIST SP 800-53 (federal security controls), NIST SP 800-171 (protecting CUI), ITAR (22 CFR 120-130) / EAR (15 CFR 730-774) export controls, DoD Instruction 5000.02 (acquisition).
Audience: Program managers, system security engineers, Authorizing Officials, ISSO/ISSMs, and technical leads evaluating agentic engineering in government and defense contexts.
Primary Constraint: Data Classification
In defense and government contexts, data classification is the primary autonomy constraint, preceding all other considerations. Unlike other regulated industries where data classification is one of several constraints, here it is the governing constraint that determines whether an agent system can be used at all, on what infrastructure, and with what controls.
| Data Level | Agent Permissibility | Infrastructure Requirement | Memory Retention |
|---|---|---|---|
| Unclassified / Public | Fully permissible | Standard cloud or on-premises | Standard manifesto TTL policies apply |
| CUI (Controlled Unclassified Information) | Permissible with controls | FIPS 140-2/3 validated, CUI-authorized environment; FedRAMP Moderate baseline or equivalent (per DFARS 252.204-7012) | No CUI in external API calls; memory retention subject to CUI handling requirements (32 CFR Part 2002) |
| Classified (SECRET / TS / TS/SCI) | Not permissible with non-accredited commercial AI systems | Air-gapped, accredited systems only; non-accredited commercial LLM APIs are categorically excluded | No persistence whatsoever outside the accredited system boundary |
| ITAR / EAR Controlled Technical Data | Permissible only on compliant infrastructure | US-person-only access; no transmission to non-compliant cloud endpoints; Technology Control Plan required | Retention only within ITAR-compliant boundary; model training on controlled data requires authorization |
The hard rule: No classified information may enter any commercial AI system, regardless of the system's other security controls. This is not a risk decision — it is a legal obligation under the National Industrial Security Program Operating Manual (NISPOM, codified at 32 CFR Part 117) and applicable security classification guides.
CMMC 2.0 to Manifesto Mapping
The Cybersecurity Maturity Model Certification (CMMC 2.0) is required for DoD contractors handling Federal Contract Information (FCI) or CUI. Agent systems that process FCI or CUI in the context of DoD work must be assessed as part of the CMMC boundary.
| CMMC Level | Applicable To | Manifesto Alignment | Key Requirement for Agent Systems |
|---|---|---|---|
| Level 1 — Foundational | Contractors handling FCI only | Partially aligns with P3 (architecture) and P5 (access control) | 17 basic safeguarding practices from FAR 52.204-21. Agents handling FCI must operate within an access-controlled boundary. |
| Level 2 — Advanced | Contractors handling CUI (most defense contractors) | Strong alignment with P3, P5, P8, P9, P12 | 110 NIST SP 800-171 practices. Agent systems are in-scope; all CUI flows through agents must be controlled, logged, and auditable. Third-party assessment required for critical programs. |
| Level 3 — Expert | Critical programs handling CUI | Alignment plus additional requirements | 24 additional NIST SP 800-172 practices. Government-led assessment. Agent autonomy tier must be documented in the system security plan. |
CMMC Practice Mapping (Level 2 / NIST SP 800-171)
| NIST SP 800-171 Practice Family | Manifesto Mechanism | Alignment | Gap |
|---|---|---|---|
| Access Control (3.1.x) | P5 autonomy tiers with granular permissions; MCP allowlist | Strong | Agent-to-agent communications must also be access-controlled; A2A protocols require authorization evidence |
| Audit and Accountability (3.3.x) | P9 structured traces; PostToolUse audit hooks | Strong | Traces must meet NIST log requirements: user, time, type of event, success/failure, system component. Retention: 3 years for CUI systems. |
| Configuration Management (3.4.x) | P2 versioned specifications; P6 knowledge baseline | Strong | Model versions, prompt configurations, and tool permission sets are configuration items requiring CM controls. Changes require CM approval. |
| Identification and Authentication (3.5.x) | P5 tier enforcement; RBAC | Partial | Multi-factor authentication required for CUI access; agent identity (as distinct from human identity) must be established and logged. |
| Incident Response (3.6.x) | P12 accountability; P9 traces for diagnosis | Strong | CMMC requires documented incident response plan, testing, and reporting to appropriate authorities. |
| Risk Assessment (3.11.x) | P3 defense-in-depth; P5 blast radius | Moderate | Formal risk assessment of agent systems as part of the CMMC boundary; agent-specific threat vectors must be included. |
| System and Communications Protection (3.13.x) | P3 architecture; data classification enforcement | Strong | Network segmentation between agent systems handling CUI and those handling unclassified data. MCP traffic requires encryption and access controls. |
| System and Information Integrity (3.14.x) | P8 evaluations; P10 containment; P3 allowlists | Strong | Agents must not introduce unauthorized software or dependencies; allowlists are the enforcement mechanism. Security alerts from agent anomalies must be monitored. |
FedRAMP Authorization for Agent Infrastructure
FedRAMP governs the use of cloud services by federal agencies. If agent infrastructure (model hosting, orchestration, memory storage) runs on a cloud service, that service must be FedRAMP authorized at the appropriate impact level.
| FedRAMP Impact Level | Data Sensitivity | Agent Use |
|---|---|---|
| Low | Public federal information | Standard manifesto adoption; commercial cloud FedRAMP Low services permissible |
| Moderate | Most CUI, low-sensitivity PII | Most federal agency agent deployments; FedRAMP Moderate authorization required for all cloud components in the agent boundary |
| High | Law enforcement, emergency services, financial, health | Strictest cloud requirements; FedRAMP High authorization required; subset of cloud providers qualify |
Key implication for agent systems: The LLM API, the orchestration layer, the memory store, and the observability pipeline are all in-scope for FedRAMP if they process federal information. Using a commercial LLM API not on the FedRAMP marketplace for federal agency use is a compliance violation. As of early 2026, a small number of LLM providers have obtained or are pursuing FedRAMP authorization; the landscape is evolving rapidly.
Multi-model routing (P11) and FedRAMP: Each model provider in a multi-model routing setup must be FedRAMP-authorized at the applicable impact level. Routing to a non-authorized provider for cost optimization is not permissible for in-scope federal workloads.
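The routing constraint above can be expressed as a filter step before cost optimization. This is a hedged sketch: provider entries, the `fedramp_level` field, and the cost metric are all hypothetical, and a real router would verify authorization status against the FedRAMP marketplace rather than a static table.

```python
def route_model(candidates: list[dict], required_impact: str = "Moderate") -> str:
    """Filter a multi-model routing table down to providers authorized at
    (or above) the required FedRAMP impact level, then pick the cheapest
    survivor. Provider entries are illustrative."""
    order = {"Low": 0, "Moderate": 1, "High": 2}
    eligible = [c for c in candidates
                if c["fedramp_level"] is not None
                and order[c["fedramp_level"]] >= order[required_impact]]
    if not eligible:
        # Fail closed: never fall back to a non-authorized provider.
        raise LookupError("No FedRAMP-authorized provider meets the "
                          f"{required_impact} impact level; routing blocked.")
    return min(eligible, key=lambda c: c["cost_per_mtok"])["name"]
```

Failing closed is the essential design choice: for in-scope federal workloads, cost optimization never overrides authorization status.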
NIST SP 800-53 Security Control Mapping
NIST SP 800-53 is the security controls catalog for federal information systems (required under FISMA). Agent systems used in federal contexts are subject to these controls.
| Control Family | Key Controls | Manifesto Mapping | Agent-Specific Note |
|---|---|---|---|
| Access Control (AC) | AC-2 (account management), AC-3 (access enforcement), AC-6 (least privilege) | P5 autonomy tiers, granular permissions | Agent service accounts must be managed identities with least-privilege permissions; periodic access review required |
| Audit and Accountability (AU) | AU-2 (event logging), AU-9 (protection of audit information), AU-12 (audit record generation) | P9 structured traces | All agent actions are auditable events; trace infrastructure must be tamper-evident and backed up separately from the agent system |
| Configuration Management (CM) | CM-2 (baseline configuration), CM-6 (configuration settings), CM-8 (information system component inventory) | P2/P6 versioned specifications and baselines | Model versions, agent configurations, and MCP tool connections are CM items; deviations from baseline require CM board approval |
| Incident Response (IR) | IR-4 (incident handling), IR-6 (incident reporting) | P12 accountability, P9 traces | Agent-related incidents must follow the organizational IR plan; traces support rapid incident diagnosis |
| Risk Assessment (RA) | RA-3 (risk assessment), RA-5 (vulnerability scanning) | P3/P5/P10 defense-in-depth | Formal risk assessment of agent systems with explicit consideration of AI-specific threat vectors (prompt injection, model poisoning, data exfiltration) |
| System and Services Acquisition (SA) | SA-4 (acquisition process), SA-9 (external information system services) | P11 multi-model routing | LLM providers are external system services subject to SA-9; supply chain risk management (SR family) applies |
ITAR / EAR Export Control
ITAR (22 CFR 120-130) and EAR (15 CFR 730-774) are the primary export control frameworks for defense and dual-use technology. For agent systems in defense development contexts, these are not secondary considerations — they are fundamental constraints on infrastructure architecture.
What Constitutes an Export
Under ITAR/EAR, transmitting controlled technical data to a non-US-person or to a foreign country constitutes an "export" — even if the transmission is digital and within the same organization. Agent systems that process controlled technical data must be designed to prevent inadvertent export.
Agent-Specific Export Control Risks
| Risk | Scenario | Manifesto Mitigation |
|---|---|---|
| Data exfiltration via LLM API | Agent sends ITAR-controlled design data to a commercial LLM inference API | MCP allowlist restricts all external API calls for ITAR-classified sessions; no external calls allowed |
| Context window as export vehicle | Agent context containing controlled data is logged or transmitted outside the controlled boundary | Context classification enforcement: sessions with controlled data produce traces that stay within the controlled boundary only |
| Model training on controlled data | Controlled technical data enters model training pipeline | Explicit prohibition in agent system policy; infrastructure-level prevention via data classification gates |
| Foreign national access | Agent system accessible to non-US-persons in a multi-tenant cloud | Architecture requirement: ITAR-rated workloads run on US-person-only infrastructure; access controls verified by system security plan |
Technology Control Plan Requirements for Agent Systems
Organizations with ITAR/EAR controlled programs must maintain a Technology Control Plan (TCP). Agent systems handling controlled technical data must be explicitly included in the TCP with:
- Identification of controlled data flows through agent systems
- Access controls preventing foreign national access
- Monitoring and audit procedures for agent access to controlled data
- Incident reporting procedures for unauthorized disclosure
DoD Acquisition and Authority to Operate
Agent systems used in DoD programs must operate under an Authority to Operate (ATO) granted by the Authorizing Official (AO). The ATO process maps to the manifesto's governance model:
| ATO Stage | Manifesto Mechanism | Evidence Required |
|---|---|---|
| System categorization | P3 architecture; data classification matrix | FIPS 199 categorization; system boundary definition including all agent components |
| Security plan development | P5 autonomy tiers; P12 accountability | System security plan documenting agent configurations, autonomy tier assignments, and accountability structures |
| Security assessment | P8 evaluations; P9 traces | Security control assessment evidence; penetration testing for high-impact (FIPS 199 High) systems |
| ATO decision | P12 named human accountability | AO decision with explicit residual risk acceptance; agent systems are in-scope for risk determination |
| Continuous monitoring | P9 observability; P10 containment | Ongoing monitoring plan; automated alerts for agent anomalies; periodic reauthorization |
Zero Trust Architecture (ZTA) alignment. DoD has mandated Zero Trust Architecture adoption (DoD Zero Trust Strategy, 2022). Agent systems must be consistent with ZTA principles: assume breach, verify explicitly, use least privilege. The manifesto's tiered autonomy (P5), enforcement at infrastructure level (P3), and comprehensive tracing (P9) are direct implementations of ZTA principles in agentic systems.
Market-Specific Autonomy Guidance
| Use Case | Classification Level | Recommended Autonomy | Notes |
|---|---|---|---|
| Unclassified development tooling (code generation, test automation) | Unclassified | Tier 1-3 | Standard manifesto adoption. CMMC Level 1 practices apply if FCI is involved. |
| CUI document processing and analysis | CUI | Tier 1-2 | Agents analyze and draft; human reviews before any CUI record is modified or transmitted. FedRAMP Moderate infrastructure required. |
| Requirements and traceability analysis | Unclassified / CUI | Tier 1-2 | High-value use case. Agent assembles traceability matrices; human qualified engineer validates. Evidence bundles support ATO documentation. |
| ITAR-controlled program development | ITAR technical data | Tier 1 (observe only) | ITAR compliance requires human control over all controlled technical data. Agent may analyze within the controlled boundary; no external API calls. |
| Classified program development | SECRET / TS | Not permissible | Non-accredited commercial AI systems are categorically excluded from classified programs. No exceptions without an accredited system boundary and government authorization. Unclassified portions of such programs still require access control, auditability, and change management. |
| Cybersecurity assessment and testing | Varies | Tier 1-2 | Agent assists vulnerability analysis and security assessment; ISSO/ISSM reviews and approves all findings before remediation actions. |
| Logistics and sustainment analytics | Unclassified | Tier 1-3 | Non-safety-critical domain; standard manifesto adoption. High-value opportunity for cost reduction. |
Viable Starting Points
Unclassified administrative and logistics software. No classified data, no ITAR. Standard manifesto adoption applies. Natural pilot domain for building competency before tackling CMMC/FedRAMP requirements.
CUI document analysis (Tier 1 observe). Agents analyze CUI documents, extract requirements, identify inconsistencies, draft summaries. Human reviews all outputs before any CUI record is modified. FedRAMP Moderate infrastructure required.
Requirements traceability for DoD programs. Agents assemble specification-to-test-to-verification traceability matrices from program documentation. Qualified engineer validates completeness. Directly supports DI-IPSC-81433B data items and ATO documentation.
CMMC evidence and documentation assembly. Agents compile CMMC assessment evidence packages, security plan sections, and POA&M tracking. Reduces CMMC preparation cycle time while keeping human reviewers accountable for all compliance determinations.
Software security analysis (Tier 1). Agents perform static analysis, dependency scanning, and security posture assessment at Tier 1. ISSO reviews all findings. Human authorizes any remediation actions. No agent access to classified or ITAR-controlled components.
Tool Configuration Notes
How to configure agent tooling to satisfy CMMC and FedRAMP audit trail requirements and export control data classification obligations.
Audit Trail Hook Mapping (NIST SP 800-53 AU family)
| NIST Control | Hook Type | What It Produces |
|---|---|---|
| AU-2 / AU-12 Event logging | PostToolUse audit hook | Agent identity, action type, success/failure, component accessed, timestamp — every event logged |
| AU-3 Content of audit records | PostToolUse with structured schema | Full structured trace including tool calls, data accessed, decision chain |
| AU-9 Protection of audit information | Separate audit log infrastructure | Traces written to tamper-evident, separately-backed-up store; no agent write access to its own audit log |
| AU-11 Audit record retention | Scheduled retention hook | Trace retention: minimum 3 years for CUI systems; format-migrated for long-term preservation |
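The AU-3 and AU-9 rows above can be combined in a single append operation: each event carries the required content fields and is chained to its predecessor by hash so that tampering with an earlier record invalidates every later one. This is an illustrative sketch, not a NIST-defined schema; field names and the chaining scheme are assumptions, and AU-9 additionally requires the log store itself to be write-restricted and separately backed up.

```python
import hashlib
import json
from datetime import datetime, timezone

def append_audit_event(log: list, prev_hash: str, *, agent_id: str,
                       action: str, component: str, success: bool) -> str:
    """Append an AU-2/AU-3 style event with a hash chain for tamper
    evidence (AU-9). Returns the new chain head."""
    event = {
        "agent_identity": agent_id,        # distinct from human operator
        "event_type": action,
        "component": component,
        "outcome": "success" if success else "failure",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    entry = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev_hash + entry).encode()).hexdigest()
    log.append({"event": event, "hash": digest})
    return digest
```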
CUI Classification Enforcement
The MCP allowlist (Layer 6 in enterprise configuration) is the primary infrastructure control for CUI data residency:
- CUI-handling sessions must be restricted to FedRAMP-authorized or on-premises infrastructure only. No external APIs without authorization.
- Session metadata must include a CUI indicator; traces for CUI sessions are handled and retained under CUI requirements.
- Data residency enforcement hook: PreToolUse hook checks the data classification of the session context; blocks external API calls for CUI-classified sessions.
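The data residency enforcement hook described above reduces to a short classification check. A minimal sketch, assuming the session metadata carries a classification label and the allowlist is maintained externally; the endpoint names and function signature are hypothetical.

```python
# Hypothetical allowlist of FedRAMP-authorized or on-premises endpoints.
APPROVED_ENDPOINTS = {
    "https://llm.onprem.internal",
    "https://api.fedramp-authorized.example.gov",
}

def pre_tool_use_gate(session_classification: str, endpoint: str) -> bool:
    """PreToolUse residency check: for CUI-classified sessions, permit
    tool calls only to allowlisted endpoints; block everything else."""
    if session_classification == "CUI" and endpoint not in APPROVED_ENDPOINTS:
        raise PermissionError(f"Blocked: CUI session may not call {endpoint}")
    return True
```

The same gate generalizes to ITAR-classified sessions by shrinking the allowlist to the controlled boundary (in the strictest case, to no external endpoints at all).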
Model Version Pinning for ATO Stability
Pin model versions during ATO assessment periods and continuous monitoring reviews:
- Behavioral changes from model updates may trigger a change request requiring AO review.
- Document model version in the system security plan as a CM item.
- Any model change affecting security-relevant behavior (output filtering, tool call behavior) must be assessed for ATO impact before deployment.
Open Regulatory Questions
FedRAMP authorization for frontier LLM providers. The pathway for frontier LLM providers to obtain FedRAMP High authorization is complex and slow. Most commercially capable models are available only at FedRAMP Moderate or below as of early 2026. Monitor FedRAMP marketplace for updates; plan architecture to accommodate the current authorization landscape.
CMMC scoping for agent systems. How far does the CMMC assessment boundary extend for agent systems? Is the LLM API a third-party service subject to SA-9 controls, or is it in-scope as a system component? DIBCAC (Defense Industrial Base Cybersecurity Assessment Center) has not issued definitive guidance on LLM API scoping.
AI in weapons systems and autonomous functions. DoD Directive 3000.09 governs autonomous weapons systems. This document does not address weapons systems development. Teams working on programs subject to 3000.09 should seek specific guidance from the program legal and policy advisors.
Zero Trust and agent identity. DoD's Zero Trust Architecture requires explicit identity verification for every access request. Agent identity (as distinct from human operator identity) must be established in a standards-based way. No published DoD standard addresses agent identity in the ZTA context; the most current guidance treats agents as service accounts, but this is likely insufficient for the autonomy levels the manifesto describes.
CUI in agent training and fine-tuning. Using CUI data to fine-tune models creates a complex set of obligations: the fine-tuned model may "memorize" CUI that could be extracted later. No regulatory guidance addresses this risk specifically. Conservative position: do not use CUI for model fine-tuning without explicit legal and security review.