Field Guide · 2026 Edition

Agentic Development
Best Practices

A working blueprint for building software with AI agents - drawn from inside MyZone AI's own practice. Two paths: a simple track for non-developers building agents and automations, and an advanced track for complex custom software builds.

Audience: Operators, founders, technical clients
Stance: Opinionated, evolving
Updated: May 2026
A glowing pipeline of connected nodes representing the stages of agentic software development.
Table of contents
  1. Agentic dev vs. vibe coding
  2. The Software Development Pipeline recipe
  3. The mindset shift
  4. The 7-stage pipeline
  5. Requirements & PRDs
  6. Scoping & situational awareness
  7. Visual Blueprint (optional)
  8. Architecture document (custom software)
  9. Memory systems (the Memento problem)
  10. Avoiding drift & PM layers
  11. QA & testing
  12. Model selection: Opus, Sonnet, Haiku, GPT-5.5
  13. Right-sizing the pipeline
  14. Post-deploy: build your maintenance agent
  15. Your toolbox - what's already deployed
  16. Part II - Advanced patterns
  17. Verification loop
  18. AGENTS.md + skills tree
  19. Sub-agents inside a project (experimental)
  20. Evals as regression tests
  21. TL;DR - the rules
  22. Closing
First - get the framing right

Agentic development is not vibe coding

You'll hear both terms thrown around. They are not the same thing. Vibe coding is what most people mean when they say "I'm building with AI." Agentic development is what we do. The difference matters.

Split illustration contrasting chaotic vibe coding versus orderly agentic development.
Two very different things
One trusts the vibes. The other follows a discipline.
Vibe coding

"Just tell the AI what to do."

With tools like Replit, Lovable, Windsurf, Cursor, or even plain Claude Code, you describe what you want and hope the output is good. It looks great. It often is great. But there's no enforced process, no human-in-the-loop checkpoints, no architectural discipline. You're trusting the vibes.

Agentic development

AI agents following strict protocols.

You combine the skills of traditional software development with the power of AI agents. You don't touch the code. But you follow strict protocols: requirements, scoping, human-in-the-loop gateways, QA, deploy. Each stage has discipline. The agent is your team - not your shortcut.

Why this distinction matters

Vibe coding will build you something functional. Agentic development will build you something that's clean, secure, maintainable, recyclable, and won't blow up in production. The difference shows up at month three, when the vibe-coded thing breaks and nobody knows why.

The front door

The Software Development Pipeline recipe

Almost everything in this guide is already wired up as a single recipe in your Ai1 system. If you remember one thing from this entire document, remember this: you don't have to memorize every step. You just have to know how to start.

A glowing holographic vertical pipeline with a single cursor at the top entering a command.
One command. The recipe walks you through everything that follows.

The one command that starts everything

Spin up a new session, grab a developer agent, and say:

# Your literal first message
Kick off the software development pipeline recipe. I want to build a podcasting outreach agent.

That's it. The recipe takes over. It will introduce itself, summarize the stages, and prompt you for step 1 - requirements. When you're done, it asks if you're ready for scoping. After scoping, it asks if you want a human-in-the-loop expert to review. Then it moves to build, QA, deploy. It guides you through every step in the right order.

What the recipe does automatically

If the agent ever drifts off the recipe

Agents are probabilistic - occasionally one will get distracted and start improvising. If that happens, just say: "Reminder - where are we in the software development pipeline? Are we following all the right steps in the right order?" 90%+ of the time the agent will course-correct immediately.

For everyone - agents track and custom track alike

Whether you're building a 20-minute Friday automation or a multi-week custom dashboard, the recipe is the entry point. The depth of what it does scales with the complexity of what you're building. You don't have to choose a "lighter" version. Just start.

Chapter 01

The mindset shift comes before the tooling

AI development is moving incredibly fast. What worked a month ago is already wrong today, and what we're doing today will look different in another 30 days. The single most important trait of operators who succeed with agents isn't a stack - it's flexibility.

The people winning at this aren't the great coders. They're the flexible thinkers who learn and test fast. The core skill for being a great agentic developer is flexibility and a willingness to keep learning. It is not experience with coding. - Mike Schwarz, MyZone AI

What's stable

  • The shape of the pipeline. Requirements → scoping → tasks → build → QA → review → deploy. That hasn't changed.
  • Modular API-first wins. Smaller pieces beat monoliths whenever memory and context matter.
  • Plan more than you build. The 90/10 ratio is real.

What's moving

  • How agents are wired together. Sub-agents, skills, recipes, memory - the plumbing is in flux.
  • Which model is best for which step. Opus, Sonnet, GPT-5.5 - the leaderboard rotates monthly.
  • How many stages a project needs. Pretty soon it'll be "here's my idea - go build it."
Chapter 02

The seven stages - and when to skip them

Seven stages, the same shape since AI got serious about coding. Small projects skip stages. Big projects expand them with checkpoints and parallel reviews.

Right-size first - don't over-engineer

The full seven-stage pipeline is for big, complex software projects. Simple agents and quick automations don't need all of this. For a 30-minute build of a small agent or a one-off automation, it's totally fine to go straight from a quick requirements chat to a developer agent - no scoping doc, no wireframes, no task decomposition, no formal QA. Skip steps proportional to the complexity of what you're building. The chapter on right-sizing is later in the guide.

01

Requirements

Capture the what. Use a requirements agent. Voice-dictate answers. Push for outstanding questions before moving on.
02

Scoping

Convert the what into a how. Situational awareness of existing skills, recipes, and architecture is critical here.
02.5

Visual Blueprint (optional)

Wireframes + full front-end design - for any project with a visual component. Lock the "what good looks like" before code is written.
03

Task creation

Decompose the scope into milestones and individual tasks. Each task carries its own thinking - chain-of-thought baked in.
04

Build

Hand off to the developer agent. Monitor. Never approve questions you don't understand - research them.
05

Review & QA

Code review, security audit, refactor pass. On big projects, run QA at every milestone, not just at the end.
06

Pull request & deploy

Human gate. The agent prepares the PR; you approve. CI/CD to staging, then production.
90% · Planning & scoping
10% · Building & iterating
2–10 days · Spent on PRDs for big builds
8.5/10 · Code quality with this approach
Chapter 03

Requirements - get the what right before anyone touches the how

The requirements stage is where most agentic builds quietly fail before they start. The agent's job here is to ask, not to plan. Your job is to load it up with everything you know.

PRD = Product Requirements Doc

Throughout this guide and in conversations with developers, you'll hear PRD a lot. It just means a requirements document. Same thing. Developers call it PRD; we just call it the requirements doc.

What to feed it

  • Existing SOPs and artifacts - anything that describes the current state.
  • Sample outputs - reports, screens, exports you want to mirror.
  • URLs to crawl - competitor sites, reference apps, anything you want the agent to study.
  • Developer docs for integrations - Stripe, Slack, Supabase, whatever the system touches.
  • Voice answers via Wispr Flow - fastest way to keep momentum.
  • External deep research - Claude, Perplexity, ChatGPT, Grok 4 for X intelligence. Reconcile what the different engines report before you feed it in.

The two failure modes

Failure mode 1 - cut off too early

The agent declares it's done, generates the PRD, and you scroll to the bottom to find 14 unanswered questions. Always read the "outstanding questions" section before you advance.

Failure mode 2 - cut off too late

The questions get progressively weaker and lower-relevance. When you hit three trivial ones in a row, the curve is exhausted. Just say: "I think you've got what you need. Is this critical?" The agent will usually agree and move on.
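The three-in-a-row rule above can be sketched as a small stopping check. This is only an illustration: the relevance scores, the `TRIVIAL_THRESHOLD` value, and the function name are hypothetical and stand in for your own judgment; none of them come from the recipe itself.

```python
# Hypothetical sketch: decide when a requirements interview has run its course.
# Relevance scores (0.0 to 1.0) are assumed to be your own rough rating of each question.

TRIVIAL_THRESHOLD = 0.3  # assumed cut-off below which a question counts as trivial
TRIVIAL_STREAK = 3       # "three trivial ones in a row" from the text


def should_wrap_up(relevance_scores: list[float]) -> bool:
    """Return True once the most recent questions form a streak of trivial ones."""
    if len(relevance_scores) < TRIVIAL_STREAK:
        return False
    recent = relevance_scores[-TRIVIAL_STREAK:]
    return all(score < TRIVIAL_THRESHOLD for score in recent)


print(should_wrap_up([0.9, 0.8, 0.6, 0.2, 0.1, 0.2]))  # True: last three are trivial
print(should_wrap_up([0.9, 0.2, 0.1, 0.7]))            # False: the streak was broken
```

In practice you make this call by feel; the sketch only shows that the rule looks at the most recent questions, not the whole interview.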

Your job vs. our job

At the requirements stage, your job is to give us the WHAT - as much information as possible. Don't worry about scope, complexity, breaking things into pieces, or how it'll be built. If the agent says "this will take 6 months," ignore the estimate - it's almost always wrong. Just keep telling it what you want.

Our job as the human-in-the-loop experts is to take that big requirement and break it into 3, 4, or 5 separate agents during scoping. Don't pre-decompose. Just describe the outcome.

HTML for humans, Markdown for agents

Requirements docs are generated as Markdown by default, because they're typically passed from agent to agent (cheaper on tokens). But if you are going to read and iterate on the doc, ask for HTML output - it's much easier to scan, edit, and review. Just tell the agent: "Generate this as HTML so I can review it."

Iterate - v1, v1.1, v1.2

The first version of the requirements doc is rarely the final version. Read it, give the agent feedback, and ask for v1.1. Then read that, add more, ask for v1.2. The 90% planning rule means you should expect to do this 2–3 times for any non-trivial build.

The push-back pass (advanced)

For mission-critical or high-complexity projects, you can tighten the PRD one more notch. Hand it back to the same agent with a different lens. Two prompts we like:

# Push-back prompt #1 - persona swap
You've just finished v1 of this PRD. Now act as a senior product strategist. List 5 things you like and 5 things you'd change. What's missing? What would a developer ask you in 3 weeks that this doc doesn't answer?

# Push-back prompt #2 - deep research
Go out on the web. Do deep research on best practices related to what we're building here. Then come back and suggest 5 ways we can improve this document.

Models do better work with more thinking time and more reflection passes. For simple agents, this is overkill - skip it. For complex custom software, do this 3–4 times before moving on.

Chapter 04

Scoping - turning the what into a how

The scoping agent's job is to translate the PRD into a how - the technical plan for building it. Where most teams lose: they let the scoping agent reinvent components that already exist in their stack. The fix is forced situational awareness.

What the scoping agent already knows

Scoping is moving to your server

Up until now, scoping has run on the MyZone AI side. The challenge: as your Ai1 instance grows with custom skills built just for you, our scoping agent can't see what's on your server. So it might say "we need to build this" when you already have that piece.

We're transitioning soon so that requirements and scoping run on your server, with full situational awareness of every custom skill you've deployed. The scoped plan still comes to our team for human-in-the-loop sign-off before build - but the agent proposing the plan will know your full toolbox.

A modular architecture of glowing interconnected blocks versus a single monolithic cube.
Architecture decision
Modular API-first beats monoliths every time.

Modular > monolithic - little houses, not skyscrapers

AI thrives with smaller surface areas. Take a sales agent example: instead of building one mega "sales agent" that does lead enrichment, proposal generation, transcript analysis, and meeting prep all in one, we build each as a separate skill and compose them together at the agent level.

Chapter 04.5 · Optional

The Visual Blueprint - locking what good looks like before code

This step is only for projects with a visual component - dashboards, portals, web pages, anything with a user interface. For pure text automations or back-end agents, skip it. But for anything visual, the time you spend here pays back 10× during build and QA.

Design wireframes transforming into a finished interface mock-up.
Lock the goal first
Wireframes → designs → working mock-ups, before code.

What it is

A non-functional, front-end-only version of the thing you're building. Could be a rough wireframe, a polished design, or a fully clickable static mock with dummy data. The point: you can play with it, validate the layout, get stakeholder buy-in - all before a single line of working code is written.

Why it matters

How to run it

The Software Development Pipeline recipe will route you to the design agent for this stage. It picks up context from your scoping document and generates a visual layout. You go back and forth - "move this, change that, add this" - until you love it. Then the build agent picks up the blueprint and starts implementing.

# What the design agent has access to
brand-identity-extractor   # Pulls your brand from URL
design-wireframing         # Generates wireframes and full designs
design-system              # Your fonts, colors, components
When to skip

For text-only automations, back-end agents, scheduled jobs, data pipelines - anything without a UI - skip Visual Blueprint entirely. There's nothing to design. Go straight from scoping to build.

Chapter 05 · Custom software

The architecture document - the agent's compass for big builds

For agents on the Ai1 platform, the scoping agent already carries architectural awareness - you don't need a separate doc. For custom software development (a CRM, ERP, portal, anything standalone) you create a dedicated architecture document that every agent reads on boot.

A team of glowing AI agents arranged in a circle, each running a different stage of the pipeline.
A specialized agent for every stage - each booting with the right context.

What goes in a custom-software architecture doc

Two places to put it

The architecture doc can live at the GitHub repo level (an AGENTS.md at the top of the project) or in the agent's own boot-up instructions. Best practice: keep the GitHub copy as the source of truth, and reference it from every agent that touches the codebase. Synchronize, don't duplicate.

Chapter 06

The Memento problem - agents forget everything overnight

Every time an agent boots, it's a clean slate. Like the protagonist in Memento, it has to piece its life together from post-it notes, tattoos, and Polaroids. Those notes are your memory system. Without them, the agent will quietly forget who you are and what you're building.

A robotic head surrounded by floating memory cards and clue fragments.
Core insight
Agents wake up with no memory. You build the post-it notes.

Default vs. ideal - depends on complexity

Fine for simple projects

One AGENTS.md at the top of the project

Everything in one file: boot instructions, architecture, decisions, learnings, ideas. Perfectly workable for a small agent or simple automation. Just keep an eye on size - once that single file balloons past a few hundred lines, your tokens explode every boot.

For bigger, complex projects - Karpathy-style

A Wikipedia of small, interconnected MD files

Table of contents at the top. Each topic - architecture, ideas, learnings, considerations, features - lives in its own atomic file. Agents follow links to the chunks they need, like vector retrieval. Worth the setup cost once memory bloat becomes a real problem.
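As a rough illustration of "follow links to the chunks they need", the sketch below reads an index file and loads only the linked files whose index line mentions a keyword. The file names, the `load_topic` helper, and the keyword-matching rule are all assumptions made for the example; a real agent decides what to open by reading the index itself.

```python
import re
import tempfile
from pathlib import Path

# Matches Markdown links of the form [text](file.md) and captures the target.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)]+\.md)\)")


def load_topic(index_path: Path, keyword: str) -> str:
    """Load only the linked files whose index line mentions the keyword."""
    loaded = []
    for line in index_path.read_text(encoding="utf-8").splitlines():
        if keyword.lower() not in line.lower():
            continue
        for target in LINK_PATTERN.findall(line):
            loaded.append((index_path.parent / target).read_text(encoding="utf-8"))
    return "\n".join(loaded)


# Tiny demo wiki built in a temporary directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "architecture.md").write_text("Modular, API-first.", encoding="utf-8")
    (root / "learnings.md").write_text("Keep files small.", encoding="utf-8")
    (root / "index.md").write_text(
        "- [Architecture](architecture.md)\n- [Learnings](learnings.md)\n",
        encoding="utf-8",
    )
    demo_context = load_topic(root / "index.md", "architecture")

print(demo_context)  # Modular, API-first.
```

Only one of the two topic files ends up in context, which is the whole point of splitting memory into atomic files.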

A Wikipedia-style interconnected knowledge graph of glowing hexagonal nodes.
Andrej Karpathy's wiki structure - small atomic files, big retrieval gains.

The three ways to revisit a build (worst to best)

  1. Reopen the original session where the build happened. Works, but the conversation has grown huge - every new message is expensive in tokens.
  2. Start a fresh developer session and ask it to research. It digs through your brain and files for 5–10 minutes to rebuild context. Token-heavy, and there's risk it misses something important.
  3. Have a dedicated agent for that specific automation, with all the boot instructions, architecture references, and prior learnings baked into its AGENTS.md. Boots at ~6,000 tokens, not 50,000. Always warm. This is the best practice - covered in Chapter 11.
Chapter 07 · Custom software

Drift - the Jenga tower of stacked layers

Every layer between you and the developer agent introduces a small percentage of drift. Stack enough layers - PM agent, then sub-PMs, then sub-developers - and the tower wobbles. Eventually it suggests something silly, like "I'll just connect directly to the production database."

A Jenga tower of glowing translucent blocks slowly drifting and tilting.
Failure mode
The more layers in the stack, the more drift accumulates.
It's better to have one agent and one developer working on a project for a longer period of time than it is to have a PM that spins up five developers and then reassembles the code. - Mike Schwarz, on measuring drift
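To see why stacked layers wobble, here is a toy model of compounding drift. It assumes, purely for illustration, that every hand-off preserves the same fixed share of the original intent; the 95% figure is invented for the example and is not a measured value.

```python
# Illustrative model only: each layer is assumed to preserve a fixed share of intent.
def intent_retained(fidelity_per_layer: float, layers: int) -> float:
    """Share of the original intent left after passing through `layers` hand-offs."""
    return fidelity_per_layer ** layers


ASSUMED_FIDELITY = 0.95  # hypothetical: each layer loses 5% of the intent

for layers in (1, 2, 4):
    print(f"{layers} layer(s): {intent_retained(ASSUMED_FIDELITY, layers):.2f}")
# 1 layer(s): 0.95
# 2 layer(s): 0.90
# 4 layer(s): 0.81
```

Under that assumption, four layers already lose nearly a fifth of what you asked for, which is the intuition behind preferring one agent and one developer over a deep PM hierarchy.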
Chapter 08

QA & testing

Every build gets a quality assurance pass. For simple agents, that's a single tester agent at the end. For complex custom software, QA gets woven through every milestone with multiple specialized agents.

Multi-device QA testing - phones, tablets, and a laptop fanned out in a glowing arc.
Multi-device QA
Visual testing across devices - automated, scheduled, repeated.

The generic tester agent - your default

The Software Development Pipeline recipe automatically invokes a generic tester agent at the end of every build. It runs a quality sweep, loops until errors are zero, and only then declares the build complete. For most simple agents and quick automations, this is all you need.

Specialized QA recipes for different needs

We also have dedicated recipes for specific QA workflows - quality assurance runs differently depending on what you're building:

Over time we'll customize QA recipes specifically for your environment and the kinds of things you build most often.

The full QA stack - for complex custom software

  • Code-review agent with sub-skills for cleanliness, performance, and refactoring.
  • Security review agent dedicated to vulnerability scanning, secrets handling, auth flows, and injection vectors.
  • Token Trimmer agent auditing for probabilistic→deterministic conversions and prompt bloat.
  • BrowserStack (~$270/month) for real iPhone/Android visual testing across actual devices.
  • Playwright for desktop end-to-end and visual diffs.
  • Visual goal anchor: the Visual Blueprint from Chapter 04.5. The QA agent compares output to goal.
  • Architectural-consistency agent scanning for drift against the architecture doc.

Token efficiency - probabilistic vs deterministic

A prompt worth memorizing

Ask the agent: "Is there anything in this codebase that's currently done with probabilistic logic (an LLM call) that could be moved to deterministic code? That would reduce per-run token cost and produce more consistent outputs." Repeated structured work is almost always cheaper and more reliable as deterministic code.

Our dedicated Token Trimmer agent does this analysis across software agents, skills, recipes, and scheduled jobs.
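A minimal sketch of what a probabilistic→deterministic conversion can look like, assuming a recurring "summarize this export" step that was previously handled by an LLM call. The CSV columns (`name`, `stage`, `amount`) and the `summarize_deals` function are hypothetical.

```python
import csv
import io


def summarize_deals(csv_text: str) -> str:
    """Deterministic replacement for an LLM 'summarize this export' step."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(float(row["amount"]) for row in rows)
    won = sum(1 for row in rows if row["stage"] == "won")
    return f"{len(rows)} deals, {won} won, total ${total:,.2f}"


# Hypothetical export with made-up values.
SAMPLE_EXPORT = "name,stage,amount\nAcme,won,1200.50\nGlobex,open,800\nInitech,won,99.50\n"

print(summarize_deals(SAMPLE_EXPORT))  # 3 deals, 2 won, total $2,100.00
```

The same input always produces the same output, and the run costs no tokens. Keep the LLM for the parts that genuinely need judgment, such as drafting the email around the numbers.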

QA checkpoints for big projects

For multi-week builds, don't wait until the end for QA. Bake checkpoints into every milestone - if there are 5 milestones, the QA agent comes in 5 times, cleaning up as you go. This is how you keep the Jenga tower from leaning.

Chapter 09 · Custom software

Model selection - per agent, deployed per stage

Each agent is locked to a specific model. As you work through a recipe, it routes between different agents at different stages - and each agent already knows which model it should use.

Claude Opus 4.7

Best for

  • Complex, long-horizon coding sessions
  • Refactors that touch many files
  • Architecture decisions and PRDs
GPT-5.5

Best for

  • Code review & QA - accuracy-bound work
  • Precise, narrowly-scoped tasks
  • A/B-able second opinions on complex code
Claude Sonnet

Best for

  • Most general-purpose agent work - the everyday workhorse
  • Mid-complexity coding, scoping, and PM tasks
  • Cost-efficient long sessions where Opus is overkill
Claude Haiku

Best for

  • Fast, high-volume classification & routing
  • Light retrieval, summarization, and parsing
  • Background sub-agents inside larger pipelines
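One way to picture "each agent is locked to a specific model" is a simple routing table. The role names and the fallback below are assumptions for illustration, and the model values are labels taken from this chapter, not real API model identifiers.

```python
# Hypothetical routing table: agent role -> model family named in this guide.
AGENT_MODELS = {
    "requirements": "opus",    # architecture decisions and PRDs
    "scoping": "sonnet",       # mid-complexity scoping and PM tasks
    "developer": "opus",       # long-horizon coding sessions
    "code-review": "gpt-5.5",  # accuracy-bound review and QA
    "router": "haiku",         # fast classification and routing
}
DEFAULT_MODEL = "sonnet"  # assumed fallback: the everyday workhorse


def model_for(agent_role: str) -> str:
    """Return the model label pinned to a role, or the default for unknown roles."""
    return AGENT_MODELS.get(agent_role, DEFAULT_MODEL)


print(model_for("code-review"))  # gpt-5.5
print(model_for("unknown"))      # sonnet
```

Keeping the mapping in one place makes a periodic re-test easy: swap a label, rerun your checks, compare.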
Chapter 10

Right-sizing - you don't need all seven stages

The framework above is the maximum. The minimum is "here's my idea - go build it" with a single developer agent. Most projects sit somewhere between. Use intuition.

Tiny project - under 30 minutes

Example: "Every Friday I have to go to HubSpot and download a file, then import it into Google Sheets, make a few changes, and write an email. I want to automate that."

Skip scoping. Skip QA recipe. Talk directly to a developer agent (sometimes you can even skip the requirements agent - just say what you want). State the idea, answer a few clarifying questions, approve, build. 99% of the time it's fine. Total: 20–30 minutes.

Light project - a few hours

Light requirements pass (single agent, voice answers). Skip scoping or do a 5-minute version. Build. Generic tester agent at the end.

Big project - weeks of work

Full pipeline. Multi-day PRD. Architecture doc (or built-in Ai1 platform awareness, depending on what you're building). Visual Blueprint. Task decomposition into milestones. Mid-milestone QA. BrowserStack + Playwright. Code review on every milestone. GitHub for code repository and backups - every commit traceable, every state recoverable.

Mission-critical - weeks, public-facing

Everything above, plus: deep research from three engines reconciled. Multiple persona reviews on the PRD. A dedicated agent per module. Memory consistency agent on cron. Cross-model QA comparison. GitHub with strict branch protection, code-owner reviews, and CI/CD gates - nothing reaches main without passing the full QA pipeline.

The mindset

Stick to requirements → scoping → build → QA as your default mental model. Skip scoping or QA for the smallest projects. Expand into wireframes, GitHub, milestone QA, and Part II patterns as complexity warrants. You'll develop the intuition fast - usually within your first 5–10 builds.
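The four tiers above can be summarized as a lookup from project size to the stages you run. This is a sketch of the guidance in this chapter, not how the recipe is implemented; the tier keys and the `stages_for` helper are hypothetical names.

```python
# Stage names follow Chapter 02; tier names follow this chapter.
FULL_PIPELINE = [
    "requirements",
    "scoping",
    "visual blueprint",
    "task creation",
    "build",
    "review & qa",
    "pull request & deploy",
]

STAGES_BY_TIER = {
    "tiny": ["build"],                                  # talk straight to a developer agent
    "light": ["requirements", "build", "review & qa"],  # light pass, generic tester at the end
    "big": FULL_PIPELINE,
    "mission-critical": FULL_PIPELINE,                  # plus the extra reviews listed above
}


def stages_for(tier: str) -> list[str]:
    """Return a copy of the stage list for a tier, or raise on an unknown tier."""
    if tier not in STAGES_BY_TIER:
        raise ValueError(f"Unknown tier: {tier!r}")
    return list(STAGES_BY_TIER[tier])


print(stages_for("light"))  # ['requirements', 'build', 'review & qa']
```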

Chapter 11

Post-deploy - build your maintenance agent

Here's the step most people miss. After you've shipped a build that you'll come back to - to edit, debug, query, or extend - create a dedicated agent for it. This is the difference between future-you booting cold for 10 minutes versus warm in 10 seconds.

When to create a maintenance agent

What the maintenance agent gets

An AGENTS.md file at the agent level containing:

The payoff

When you come back next week with three new ideas, you don't have to dig through chat history or have a fresh agent spend 10 minutes re-discovering the codebase. You just open the maintenance agent and say "here are my three new ideas." It's already warm. It already knows everything. Boot cost: ~6,000 tokens instead of ~50,000.
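The arithmetic behind that payoff, using the two approximate boot costs quoted above. The session count is an arbitrary example, and `tokens_saved` is a hypothetical helper.

```python
COLD_BOOT_TOKENS = 50_000  # approximate figure from the text: fresh agent rebuilding context
WARM_BOOT_TOKENS = 6_000   # approximate figure from the text: dedicated maintenance agent


def tokens_saved(sessions: int) -> int:
    """Tokens saved across a number of return visits to the same build."""
    return sessions * (COLD_BOOT_TOKENS - WARM_BOOT_TOKENS)


saving_share = 1 - WARM_BOOT_TOKENS / COLD_BOOT_TOKENS

print(tokens_saved(20))       # 880000
print(f"{saving_share:.0%}")  # 88%
```

Twenty return visits is a made-up number, but the ratio holds for any count: each warm boot costs roughly an eighth of a cold one.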

For now - we'll do this for you

For complex builds, we (the MyZone team) will create the maintenance agent as part of the deploy process. Over time, as you get comfortable, you'll create them yourself. You don't need to worry about this step in your first few builds.

Chapter 12

Your toolbox - what's already deployed

You don't have to build from scratch. We've pre-built over 200 automations across the MyZone platform, and a healthy chunk of them are already deployed on your Ai1 instance. Before you scope anything new, check what you already have.

A glowing toolbox of pre-built software components arranged on floating holographic shelves.
Recycle, don't rebuild
200+ pre-built automations. ~60–70% deployed by default.

What's already on your server

What's behind the curtain

Roughly 60–70% of our 200+ pre-built automations are deployed on your instance by default. The other 30–40% are either client-specific (built for someone else), still being polished, or waiting for a use case. Many are 10 minutes of work for us to clean up and push to your server.

Check before you build

When you're scoping something new, the scoping agent will surface existing skills it knows about. But it's always worth asking us: "Hey, before I build X - do you have any pre-built pieces for this?" Quite often the answer is yes, and we can deploy them in minutes. Recycling existing LEGO pieces is always faster than building from scratch.

Part II

Advanced patterns for complex custom software

Everything above gets you a clean, well-built project. The chapters that follow are advanced patterns we're applying to complex custom software development - multi-week builds, large codebases, production systems with real stakes. For simple agents and quick automations, these patterns add overhead without much payoff. Read them as the next layer of sophistication when your build complexity warrants it.

Chapter 13 · Custom software

The verification loop - the inner heartbeat of every stage

In the basic pipeline above, verification looks like a single stage near the end (QA). The 2026 best practice - Anthropic calls it "the single highest-leverage thing you can do" - is to make verification the inner loop of every stage, not a stage of its own.

The concept

Anthropic's official Claude Code agent loop is four steps: gather context → take action → verify work → repeat. The mistake most teams make is treating verification as something that happens "at QA time." By then the drift has already accumulated. Instead, every stage gets its own verifier that runs before the agent says "done."

What it looks like at each stage

  • Requirements → persona-swap reviewer. Agent re-reads its own PRD as a senior PM.
  • Scoping → architecture-doc diff check. Does the scope conform?
  • Task creation → dependency-graph sanity. Does the task order compile?
  • Build → tests + linters + screenshots run by the agent before "done."
  • Code review → independent reviewer with fresh context (sub-agent).
  • UX/UI → Playwright + BrowserStack visual diff against the Visual Blueprint.
  • Security → dedicated security agent, runs after every meaningful change.
  • Memory writeback → consistency agent scans before commit.

The pattern

# Pseudo-prompt baked into every stage's agent
1. Produce the artifact (PRD / scope / code / etc).
2. Identify the most likely failure modes for this kind of artifact.
3. Build a verifier - a checklist, a test, a script, or a fresh-context sub-agent - that would catch those failure modes.
4. Run the verifier.
5. If the verifier reports issues: fix them. Loop.
6. Only when the verifier passes: report "done" and move on.
Why this matters

Without a verifier baked into the stage, the agent will claim work is done without actually testing it. Anthropic's published data is blunt: agents mark features complete without running them unless given explicit verification tools and prompted to use them.
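The produce → verify → fix loop can be sketched in a few lines. Everything here is a stand-in: `run_stage`, the toy producer and verifier, and the `max_rounds` cap are assumptions added for the example (the cap is there so a stubborn failure surfaces to a human instead of looping forever).

```python
from typing import Callable


def run_stage(
    produce: Callable[[list[str]], str],
    verify: Callable[[str], list[str]],
    max_rounds: int = 3,
) -> str:
    """Produce an artifact, verify it, and loop on fixes until the verifier passes."""
    issues: list[str] = []
    for _ in range(max_rounds):
        artifact = produce(issues)  # first pass produces; later passes fix reported issues
        issues = verify(artifact)   # run the verifier against the artifact
        if not issues:
            return artifact         # only now is the stage "done"
    raise RuntimeError(f"Verifier still failing after {max_rounds} rounds: {issues}")


# Toy stand-ins for an agent and its verifier.
def toy_produce(issues: list[str]) -> str:
    return "PRD with outstanding questions answered" if issues else "PRD draft"


def toy_verify(artifact: str) -> list[str]:
    return [] if "answered" in artifact else ["outstanding questions not answered"]


print(run_stage(toy_produce, toy_verify))  # PRD with outstanding questions answered
```

The first draft fails the check, the second pass fixes the reported issue, and only then does the stage return. The design choice that matters is that "done" is decided by the verifier, never by the producer.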

Chapter 14 · Custom software

From single architecture doc to AGENTS.md + skills tree

The basic version of architecture is "one canonical doc that every agent reads on boot." The 2026 evolution is the same family of idea as Karpathy's wiki memory - applied specifically to codebases. A thin root file at the top points to small, on-demand pieces.

The pattern

AGENTS.md is now an open standard stewarded by the Linux Foundation, supported across 18+ tools (Claude Code, Cursor, Codex, Cline, Windsurf, Devin) and living in 60,000+ public repos. The premise: a single thin AGENTS.md at the root tells any agentic tool how to navigate your project. It doesn't contain the architecture - it links to a tree of small skills, each loaded only when needed.

Structure in practice

# Repo layout
.
├── AGENTS.md               # Thin root - boot instructions + skill index
├── .claude/skills/
│   ├── architecture/SKILL.md
│   ├── design-system/SKILL.md
│   ├── auth-flow/SKILL.md
│   ├── deploy/SKILL.md
│   └── ...
└── src/

# AGENTS.md (root) - just a pointer, ~50 lines
This is the FooBar CRM. Modular API-first.
When working on:
  • the front end → load skills/design-system
  • auth or sessions → load skills/auth-flow
  • the database schema → load skills/architecture
  • deploying → load skills/deploy
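To make "loaded only when needed" concrete, here is a toy router over the skill index from the example AGENTS.md. Real agentic tools read the root file and decide for themselves which skills to open; the keyword matching and the `skills_for` helper below are a simplifying assumption.

```python
# Keyword -> skill path, mirroring the example AGENTS.md above.
SKILL_INDEX = {
    "front end": "skills/design-system",
    "auth": "skills/auth-flow",
    "sessions": "skills/auth-flow",
    "database schema": "skills/architecture",
    "deploy": "skills/deploy",
}


def skills_for(task: str) -> list[str]:
    """Return the skill paths whose keywords appear in the task, without duplicates."""
    task_lower = task.lower()
    matched: list[str] = []
    for keyword, skill in SKILL_INDEX.items():
        if keyword in task_lower and skill not in matched:
            matched.append(skill)
    return matched


print(skills_for("Fix the auth redirect and deploy to staging"))
# ['skills/auth-flow', 'skills/deploy']
```

Two of the four skills load for this task; the design system and schema docs stay out of context.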

Why this beats a single doc

Chapter 15 · Custom software · EXPERIMENTAL - UNDER ACTIVE TESTING

Sub-agents inside a project - parallelism without drift

This chapter covers patterns we're actively testing internally. The shape is settling but the details are still moving. Read as where the field is going, not locked best practice.

The concept

In Chapter 11 we covered one agent per complex iterative automation. The newer pattern goes one level deeper: inside that one agent's session, delegate heavy sub-tasks to sub-agents with their own isolated context windows. The sub-agent does the work, then returns a summary. The main agent's context stays lean.

Use-case 1

Read-heavy investigation

"Read this entire codebase and tell me where the session timeout is configured." The main agent shouldn't ingest 50 files - spawn a sub-agent that reads, finds the answer, returns two lines.

Use-case 2

Writer / Reviewer split

Main agent writes a feature. Fresh-context sub-agent reviews it. The reviewer has no bias toward code it just produced - clean second pair of eyes.

Parallel sub-agents via git worktrees

# Spin up 5 isolated workspaces for 5 sub-agents
# (assumes each feature/* branch already exists; use `git worktree add -b <branch> <path>` to create a new one)
git worktree add ../wt-auth feature/auth
git worktree add ../wt-billing feature/billing
git worktree add ../wt-onboarding feature/onboarding
git worktree add ../wt-reports feature/reports
git worktree add ../wt-search feature/search
# Each sub-agent runs in its own worktree with isolated context.

The main agent gets five independent results, reviewed in parallel, ready to merge. One human operator running the equivalent of a 5-developer team.
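If you script the setup instead of typing it, the commands can be generated from a list of feature names. This sketch only builds the command strings and does not run git; the `worktree_commands` helper and the `wt-` / `feature/` naming follow the example above and are otherwise assumptions.

```python
import shlex


def worktree_commands(features: list[str], base_dir: str = "..") -> list[str]:
    """Build one `git worktree add <path> <branch>` command per feature."""
    commands = []
    for feature in features:
        path = f"{base_dir}/wt-{feature}"
        branch = f"feature/{feature}"  # assumed to exist already
        commands.append(shlex.join(["git", "worktree", "add", path, branch]))
    return commands


for command in worktree_commands(["auth", "billing"]):
    print(command)
# git worktree add ../wt-auth feature/auth
# git worktree add ../wt-billing feature/billing
```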

Chapter 16 · Custom software

Evals as regression tests - making agent drift falsifiable

Traditional QA catches code bugs. Evals catch agent-behavior bugs - the kind that show up when an agent that worked perfectly yesterday hallucinates today.

The concept

Every time an agent ships a bug, you capture the conversation trace. You convert it into a tiny test case: input → expected good output. You drop it into an eval suite. On every future PR, CI runs the suite. If the agent reproduces the old bug, CI blocks the merge.

A lightweight eval file

# evals/bugs.json - one entry per fixed bug
[
  {
    "id": "bug-2026-05-12",
    "input": "How do I verify a Stripe webhook signature?",
    "expected_contains": ["STRIPE_WEBHOOK_SECRET", "constructEvent"],
    "expected_not_contains": ["hardcoded", "skip verification"],
    "notes": "Agent previously suggested skipping verification."
  }
]
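A minimal sketch of a runner for an eval file shaped like the one above. The `check_case` helper and the `fake_agent` stand-in are hypothetical; in a real suite the agent call would hit your actual agent, and CI would fail the build when the failure list is non-empty. Matching here is plain case-sensitive substring search, which is an assumption about how strict you want the check to be.

```python
import json

# Same shape as the eval entry above, inlined so the example is self-contained.
EVALS_JSON = """
[
  {
    "id": "bug-2026-05-12",
    "input": "How do I verify a Stripe webhook signature?",
    "expected_contains": ["STRIPE_WEBHOOK_SECRET", "constructEvent"],
    "expected_not_contains": ["hardcoded", "skip verification"]
  }
]
"""


def check_case(case: dict, output: str) -> list[str]:
    """Return a list of failure messages for one eval case (empty means pass)."""
    failures = []
    for needle in case["expected_contains"]:
        if needle not in output:
            failures.append(f"{case['id']}: missing {needle!r}")
    for needle in case["expected_not_contains"]:
        if needle in output:
            failures.append(f"{case['id']}: forbidden {needle!r}")
    return failures


def fake_agent(prompt: str) -> str:
    """Stand-in for a real agent call; returns a canned answer."""
    return "Call constructEvent with STRIPE_WEBHOOK_SECRET read from the environment."


cases = json.loads(EVALS_JSON)
all_failures = [msg for case in cases for msg in check_case(case, fake_agent(case["input"]))]
print("PASS" if not all_failures else all_failures)  # PASS
```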

The discipline

Reference

The rules - in one page

A checklist to print, tape near the screen, and re-read when an agent loses its mind at 11pm.

For everyone

01

Use the recipe

"Kick off the software development pipeline" is your entry point. Always.

02

Agentic ≠ vibe coding

You don't touch code, but you follow a discipline. Big difference.

03

Be flexible

The stack you used last month is already wrong.

04

Plan 90, build 10

Days of scoping save weeks of debugging.

05

Modular & API-first

Little houses with pathways, not skyscrapers.

06

Your job: the what

Don't pre-decompose. Just describe the outcome.

07

HTML for humans, MD for agents

Ask for HTML when you're going to read the doc.

08

Visual Blueprint first

For anything visual: wireframes & mock-ups before code.

09

Right-size the pipeline

Skip stages for tiny builds. Expand for big ones.

10

Build a maintenance agent

For anything you'll come back to.

11

Recycle, don't rebuild

Check the toolbox before scoping new pieces.

12

Push the agent

Re-evaluate, swap personas, run another pass.

For complex custom software - Part II additions

13

Verification loop

Every stage produces an artifact and a verifier.

14

AGENTS.md + skills

Thin root file pointing to on-demand skill files.

15

Sub-agents in-project

Experimental. Delegate heavy reads. Worktrees for parallelism.

16

Evals as CI gates

Every shipped bug becomes a regression test.

17

Architecture doc

For standalone software (not Ai1 agents) - one canonical reference.

18

Model per agent

Opus, Sonnet, Haiku, GPT-5.5 - pick per agent. Re-test quarterly.

Closing

How to actually learn this

This guide gets you maybe 15–20% of the way there. The other 80–85% comes from picking up the fishing rod and using it.

We don't want to be dumb builders for you, throwing fish over a fence while you eat them. We want to give you the fishing rod. - Mike Schwarz, MyZone AI

Your homework

  • Start tiny. Find one Friday-morning task you do manually and automate it.
  • Spin up a developer agent. Say "kick off the software development pipeline."
  • Iterate on something we built. Take an existing agent or automation and improve it.
  • Ask lots of questions. Slack us, message your account manager, surface the weird stuff.

Group training sessions

We're running regular group training sessions for clients who want to go deeper. Different topics, live builds, Q&A, real examples from the community.

Ask your account manager to add you to the next one.

The learning ratio

15–20% of your learning will come from reading this guide. The other 80–85% will come from getting your hands dirty - trying things, breaking things, asking questions. The fastest path to being good at this is to start, fail a few times, and ask why.