You can build an AI agent in six weeks. But can you defend it in six months?
John Cosstick is one of my favorite AI experts to talk to. He's a technologist and researcher with a focus on making AI systems work in a safe and scalable way.
Today he digs into a question that is suddenly catching even the strongest engineering teams off guard:
The AI agent works, but could you defend it six months from now?
This is not another piece on "building faster." Instead, let's talk about production readiness, ordering the skills that actually matter into a sequence you can work through with your team. (There's a five-layer process you can use to evaluate your own team's AI readiness; more on that below.)
If you're running agents in production, or hope to, you're going to get a lot out of this one. Now here's John.
Hi everyone, John here!
Engineering teams can stand up an agentic AI system in weeks. They cannot defend what they have built six months later.
The build skill is no longer the bottleneck. The audit-ready skill is.
This newsletter walks through the five-layer skill stack that separates a working agent from a defensible one, why four patterns surface in every production review, and the one layer most curricula are still catching up to.
Before we get into this week's topic, one of our courses most relevant to engineering leads working with agentic systems, Agentic System Design, covers the orchestration, state management, retry logic, and failure isolation patterns that turn a working prototype into a production-ready system. It pairs naturally with Mastering MCP: Building Advanced Agentic Applications for the tool-call boundary layer underneath.
If you want to build real judgment around the production-readiness gap this newsletter walks through, those two are the fastest way in.
Engineering teams with mature delivery discipline — code review gates, CI/CD pipelines, observability for their traditional services — routinely ship agentic AI systems they cannot explain to a senior reviewer six months later.
That's not unusual. Most engineering teams building agentic AI today are significantly underestimating the skills required to get a system through an audit, an insurance renewal, or a procurement review. The reason: the skill model for agentic AI is fundamentally different from the traditional ML deployment many have been planning for.
The build skill is no longer the bottleneck
Building agentic AI in 2024 was hard. Building it in 2026 is not. Frameworks are mature. The Model Context Protocol has converged the tool-call layer into something approaching a standard. A senior engineer with six focused weeks can stand up a working agentic prototype against a real corpus and real tools.
What teams are discovering is that the build skill is no longer where the failure happens. The failure happens later — at the audit, at the insurance renewal, at the procurement review, at the board-level risk discussion that now precedes production sign-off at serious firms. And the skill that closes that gap was never on the learning roadmap.
Call it the agentic skill paradox. Build capability is up. Audit-ready capability is flat. The bills, the exposure, and the regulatory load are all growing into the gap.
The four patterns that surface at audit
Across professional services firms running agentic AI in production, four patterns surface repeatedly in governance and insurability reviews. They are the context that makes the skill stack legible.
Pattern 1 — Teams can show what the agent does, but not why
When a senior reviewer or auditor asks why a particular output was produced, the team often genuinely cannot answer. This is not incompetence: nobody asked them to design for explainability. The literacy layer was never built into the work.
Pattern 2 — Provenance breaks at the first link
Where did the retrieved information come from? Was it authorised for that use? Most agentic systems cannot answer. The audit trail breaks before it starts.
Pattern 3 — Tool calls go beyond scope
Scope creep at runtime is now the most common production incident pattern. Agents invoke tools they were never meant to touch, or take actions without an approval gate that was supposed to exist but was never designed in. This is where liability surfaces fastest.
Pattern 4 — Telemetry was never built in
The most expensive pattern. The team ships a capable system and then discovers, at first production review, that there is no runtime observability layer. Retrofitting telemetry produces incomplete trails that satisfy no one — not auditors, not underwriters, not boards. The cost of this retrofit is now showing up in insurance pricing.
These are not theoretical patterns. They are the four failures I see surface across professional services AI deployments, and they map directly onto the five-layer skill stack that follows.
The five-layer skill stack, in sequence
This is the order. Not a menu. Each layer depends on the one before it, and skipping any of them creates predictable failures downstream.
Layer 1 — Foundations: LLM literacy and reasoning
What it covers: transformer architecture, failure modes, why hallucinations happen as an architectural property rather than a bug, what reasoning under uncertainty actually means at the model level.
Why it matters: if no one in the team can explain why the model produced an output, no downstream oversight rescues the system at audit. This is the first checkpoint, and it is the most fixable.
Layer 2 — Knowledge and retrieval: grounding the agent
What it covers: RAG architecture, vector databases and embeddings, retrieval pipeline failure modes, and when fine-tuning (LoRA/QLoRA) is the right answer instead.
Why it matters: this is the layer that determines whether your agent's outputs are traceable. Without it, "where did this information come from" has no satisfactory answer at a procurement review — and no defensible answer at an insurance renewal.
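To make "traceable" concrete, here is a minimal sketch of carrying provenance alongside retrieved text. It assumes each retrieved chunk already comes with a source identifier and a set of uses it is cleared for; the names `RetrievedChunk`, `allowed_uses`, and `build_context` are hypothetical and not taken from any particular RAG framework.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetrievedChunk:
    text: str
    source_id: str           # hypothetical document identifier
    allowed_uses: frozenset  # uses this source is cleared for (assumed metadata)


def build_context(chunks, use):
    """Keep only chunks authorised for `use`, and log a provenance entry for every chunk."""
    context, provenance = [], []
    for chunk in chunks:
        authorised = use in chunk.allowed_uses
        provenance.append({
            "source_id": chunk.source_id,
            "use": use,
            "authorised": authorised,
            "checked_at": datetime.now(timezone.utc).isoformat(),
        })
        if authorised:
            context.append(chunk.text)
    return context, provenance


chunks = [
    RetrievedChunk("Fee schedule for 2025.", "doc-001", frozenset({"client_advice"})),
    RetrievedChunk("Internal HR memo.", "doc-002", frozenset({"internal_only"})),
]
context, provenance = build_context(chunks, use="client_advice")
```

The point of the design is that the provenance list records rejected chunks as well as accepted ones, so the question "was it authorised for that use?" has an answer either way.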
Layer 3 — Agent construction: building the action layer
What it covers: tool calling, action authorisation, multi-step reasoning chains, MCP as the emerging standard for connecting agents to external tools and data in a structured, auditable way.
Why it matters: this is where most production incidents now originate. Every tool the agent can invoke is both an incident surface and a control point, and the only question is whether the controls on it were documented before the incident or after.
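One way to picture a documented control point is an explicit allowlist checked before any tool runs. The sketch below is an illustration under assumptions, not an MCP API: the tool names, the `ALLOWED_TOOLS` table, and `authorise_tool_call` are all hypothetical.

```python
class ToolCallDenied(Exception):
    """Raised when a tool call is out of scope or lacks a required approval."""


# Hypothetical scope table: tool name -> whether a human approval is required.
ALLOWED_TOOLS = {
    "search_documents": False,
    "send_email": True,
}


def authorise_tool_call(tool_name, approved_by=None):
    """Return a decision record, or raise if the call is not permitted."""
    if tool_name not in ALLOWED_TOOLS:
        raise ToolCallDenied(f"{tool_name} is outside the agent's scope")
    if ALLOWED_TOOLS[tool_name] and approved_by is None:
        raise ToolCallDenied(f"{tool_name} requires human approval")
    return {"tool": tool_name, "approved_by": approved_by}


decision = authorise_tool_call("send_email", approved_by="reviewer-1")
```

Because the scope table is plain data, it doubles as the documentation: the same structure that blocks an out-of-scope call at runtime can be shown to a reviewer afterwards.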
Layer 4 — System design: architecture for production scale
What it covers: orchestration across multiple agents and tools, state management, retry logic and failure isolation, and — critically — designing telemetry hooks into the architecture rather than bolting them on later.
Why it matters: this is the distance between a working prototype and a system that behaves predictably under load. The design decisions made at this layer determine whether Layer 5 is even possible.
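As a rough illustration of designing telemetry hooks in rather than bolting them on, the sketch below wraps a step in retry logic that reports every attempt to a caller-supplied hook. The function name `call_with_retry`, the `emit` hook, and the event fields are assumptions made for this example, not part of any framework.

```python
import time


def call_with_retry(step, fn, emit, attempts=3, delay_s=0.0):
    """Run fn(), retrying on any exception, and emit one event per attempt."""
    for attempt in range(1, attempts + 1):
        try:
            result = fn()
        except Exception as exc:
            emit({"step": step, "attempt": attempt, "status": "error", "error": repr(exc)})
            if attempt == attempts:
                raise  # give up and let the caller isolate the failure
            time.sleep(delay_s)
        else:
            emit({"step": step, "attempt": attempt, "status": "ok"})
            return result


# Usage: a hypothetical step that fails twice before succeeding.
events = []
calls = {"n": 0}


def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream timed out")
    return "done"


result = call_with_retry("lookup", flaky_lookup, events.append)
```

Here the hook is just `events.append`; in a real system it would more likely forward to whatever tracing or logging backend the team already runs.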
Layer 5 — Telemetry and oversight: the layer the curriculum hasn't caught up to
This is the layer that separates capable teams from defensible ones, and it is the layer most learning paths — including ours — are still catching up to. Most AI education was built around building. The runtime oversight discipline — capturing what the AI did, what it was authorised to do, who reviewed the outputs — has lagged the build skill by roughly two years.
That's the gap. It is not a knowledge problem; it is a sequencing problem. Teams that learn Layers 1 through 4 and skip Layer 5 ship capable systems they cannot defend. Teams that build Layer 5 thinking into their work from Layer 4 onward are the ones whose deployments survive procurement, insurance, and board review.
What Layer 5 actually covers:
• Runtime AI telemetry: the observability layer for agentic systems, equivalent to APM for traditional services but extended to capture what the AI did, not just whether the server responded.
• Audit-trail design: structured logging, approval-gate documentation, and output review records. Built in, not bolted on.
• Pre-deployment validation discipline: testing and staging treated as governance evidence, not just quality assurance.
• The commercial case: organisations with documented oversight are getting better insurance pricing, faster procurement, and cleaner regulator engagement. The market is now pricing governance capability.
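A structured audit record can be very small. The sketch below captures the three things named above (what the AI did, what it was authorised to do, and who reviewed the output) as one JSON line. The field names and the `audit_record` helper are hypothetical; a real schema would be agreed with whoever reviews the trail.

```python
import json
from datetime import datetime, timezone


def audit_record(action, authorised_actions, reviewer=None):
    """Build one audit entry: the action taken, whether it was in scope, who reviewed it."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "authorised": action in authorised_actions,
        "reviewed_by": reviewer,  # None means no human review was recorded
    }


record = audit_record(
    "send_email",
    authorised_actions={"search_documents", "send_email"},
    reviewer="j.smith",
)
line = json.dumps(record, sort_keys=True)  # one JSON object per line, ready to append to a log
```

Writing one self-describing line per action keeps the trail easy to query later, and a missing reviewer shows up as an explicit null rather than as an absent field.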
Layer 5 is also the layer worth building on the job, deliberately, because it is where the next generation of senior engineering roles is being defined. The promotion path through agentic AI in 2026 runs through oversight, not through faster builds.
Where to start your team this month
Pick where you are. If foundations are shaky, start at Layer 1 — the team cannot defend what they cannot explain. If you are already shipping, start at Layer 5; the audit gap will find you before you find it.
Either way, the stack is the map. Work through it deliberately, in order, and treat every layer as both a learning decision and a production-readiness decision.
The engineers getting hired into senior agentic AI roles in 2026 are not the ones who can build the fastest. They are the ones who can defend what they built. The skill stack is the difference.
Build boldly. Defend wisely.
John Cosstick
Freelance Journalist | Author | Founder – TechLifeFuture.com 🏆 BOLD Award Winner 2024 – Open Innovation, Digital Industries