The AI Engineer

I let AI agents run my entire GitHub workflow

Owain Lewis — Fri, 29 May 2026 08:34:57 GMT

Hey friend 👋,

For the past year I’ve used AI agents to run my entire GitHub workflow. If you write code with AI, getting this part right makes a real difference to the quality of your work.

The best engineers I’ve worked with have one thing in common. They follow a repeatable workflow. When you work with AI coding agents, you need the same thing to avoid inconsistent results.

Once your workflow is clear, there’s a simple way to hand the whole thing over: give the agent the GitHub CLI.

In this issue I’ll walk through how I use the GitHub CLI with agents like Codex and Claude Code to run a project end to end. To keep it concrete, I built a small Go command line tool along the way, and the agent did nearly all of it: the repo, the tickets, the code, the reviews, the merges, and a full release pipeline.

Here’s the video if you want to check it out.

Give the agent one tool

Agents can write code. On their own they can’t touch the place your work actually lives. GitHub is where most of us store code, review it, and work together, so that’s the gap to close.

You close it by installing the GitHub CLI. On a Mac it’s two commands:

brew install gh
gh auth login

Once it’s authenticated, the agent can run any GitHub command you can. I tested this by asking for a new repo, without telling it to use the CLI:

Create a new public GitHub repo, initialise it with a README and a Go .gitignore, and set the description to "a CLI tool for managing your agent workspace".

It checked whether the CLI was installed, then set the whole thing up. I didn’t touch a thing.

Keep the work on a board, not in markdown files

If you track agent tasks in markdown files on your machine, it gets chaotic fast. You lose sight of what’s done and what’s still in progress.

GitHub project boards are built for this. I asked the agent to create one and link it to the repo. Now every task is a ticket, and as the agents work they move the tickets themselves, so the board always shows what’s done and what’s in progress. It’s also what makes parallel work practical: you can run several agents at once, each on its own ticket, and the board tells you who’s doing what.

Plan first, then break it into tickets

Don’t ask an agent to build the whole thing in one go. Plan it, then split it into tasks you can review one at a time.

I asked for a spec first, and gave the agent a link to the existing project so it could read the API and understand what I was building:

Write a spec for a single-binary Go CLI that lets AI agents operate this project. Here's the codebase: [link].

Once the spec looked right, I had it break the work down and push it to the board:

Break this into tasks we can do one by one, then push them all to the project board.

That gave me ten tickets, each linked on the board. From there it’s just working through them.

Make the agent review its own work

The first time an agent writes code, there’s usually (always if we’re honest) something wrong with it. So before any code goes up for review, I have it check its own work first.

I use an implementation skill that reminds the agent to review what it just wrote. On the first task it caught a bug in its own code before opening the pull request. It’s a small habit, and it saves you finding the same thing later.

Let the agent run the review cycle

When a task is done, the agent opens a pull request.

I have automated review set up, so when a PR opens, Copilot and Codex review the code and comment on anything they find. Then I send the agent back in to deal with the comments. The prompt I use has one important detail:

Pull down the PR comments with the GitHub CLI and fix them. Only fix the issues you think need fixing. If you disagree with one, say so.

That last line matters. You don’t want the agent blindly accepting every suggestion. After it fixes the real issues, I have it reply on each comment to record what changed and where. You’d do the same for a human reviewer who spent time on your code. Then the agent merges the branch and cleans up the stale ones.

At no point did I open GitHub to do any of this by hand.

Automate the boring infrastructure too

This is where it goes past writing code (and get’s really interesting). The CLI tool needs releases people can download, so I had the agent build a release pipeline with GitHub Actions. Here’s the prompt:

Create a GitHub Actions release pipeline. When I push a version tag, check out the code, build binaries for each OS, name them clearly, generate release notes from the commits, and create a GitHub release with the binaries attached.

It came back with a plan and one design question worth answering yourself: attach the raw binaries, or bundle them as compressed archives? I went with archives. Then it wrote the whole release yaml: the build matrix across operating systems, the build steps and environment, the naming, and the release notes. Writing that by hand is tedious, and I’ve lost whole afternoons to it over the years.

To fire it, you just push a tag:

git tag v1.0.0 && git push --tags

The action ran, built the binaries, generated the notes from the commit history, and published a release anyone can download and install. Configuring this by hand can eat an afternoon. The agent did it in minutes.

More ways to put the agent to work

I’ve only scratched the surface here. Once an agent can drive the GitHub CLI, almost any repeatable chore around your repo is fair game. A few more worth trying:

Triage incoming issues: read the new ones, label them, flag duplicates, and ask for any detail that’s missing.
Watch CI on a pull request, read the failing logs, and fix the build without you copying anything across.
Summarise a long pull request before you review it, so you read a plain-English overview instead of wading through the diff yourself.
Keep the change log and release notes current as work merges, instead of scrambling to write them at release time.
Open fixes for dependency and security alerts as small, reviewable pull requests.
Write a Monday digest of what merged last week and what’s still open, pulled straight from the repo.
Tidy as you go: close stale issues, prune merged branches, keep labels consistent.

Where the leverage is

By the end I’d built a working project from scratch: the repo, the board, the code, the reviews, the merges, and an automated release. I’ve been a software engineer a long time, and I’m still struck by how much an agent can do once you give it the right tool.

Notice that I drove each step by hand here, one prompt at a time. But those steps don’t really change from project to project, which points to the next move: once the cycle is repeatable, you can stitch it into one pipeline the agents run end to end, with you stepping in only when something needs a real decision.

The value here isn’t writing code faster, it’s handing over the whole cycle: plan, implement, review, fix, merge, release. So work out your workflow, give your agents the right tools, and let them run it. Next time you catch yourself doing slow, mundane setup by hand, like a release or deploy pipeline, ask whether there’s a CLI an agent could drive instead.

Checkout the video if you want to go deeper.

Thanks for reading.

Have an awesome week :)

P.S. If you want to go deeper on building AI systems, I run a community where we build real projects hands-on: 👉 https://aienginer.co

The 7 skills I actually use every day with AI coding agents

Owain Lewis — Fri, 15 May 2026 11:24:18 GMT

If you spend any time in AI engineering circles online, you’ll quickly get the impression that being productive with coding agents requires hundreds of skills, dozens of MCP servers, an insane workflow, and a wall of plugins blinking at you whenever you open your terminal.

I don’t think that’s true. In fact, I think it’s actively unhelpful.

The longer I work with these tools, the more convinced I am that you just need a small number of high-quality skills to handle the vast majority of your day-to-day work.

It took me a while to actually internalise it, because the gravitational pull is always toward more. More tools. More skills. More configuration. More cleverness.

In this post I’ll share the skills I enjoy using most day to day building software.

I made a full video here:

Managing Skills

One of the biggest headaches I found with skills is having markdown files scattered all over my filesystem. Every coding agent seems to keep skills in a different place. The solution is to use this CLI which makes managing and updating skills easy.

https://github.com/vercel-labs/skills

1. `spec`. Get clear before you write code

If I had to pick one skill that’s done more for the quality of my output than any other, it would be this one.

The idea is straightforward. You come to the agent with a rough idea, usually a couple of sentences, sometimes a paragraph if I’m being generous, and the skill produces a proper spec. What we’re building. Why. The relevant existing code in the repo. The data model. The requirements. The constraints.

This is simple the “think before you code” principal for agents.

The first draft is almost never right, and that’s exactly the point. You sit with it. You read through carefully. You correct the things the agent misunderstood. You make the implicit assumptions explicit. By the time you’re done iterating, you and the agent are genuinely aligned on what’s being built, and every micro-decision you’ve nailed down is one fewer decision the agent has to guess at later.

The cost is maybe twenty minutes up front. The saving is hours of rework on the other side.

2. `plan`. Break the work into shippable chunks

Once the spec is solid, I run plan to break it into discrete tasks.

I never, and I really mean never, ask an agent to build an entire feature in a single shot. It rushes. It cuts corners. The diff becomes impossible to review, and you end up either rubber-stamping work you haven’t really understood or unpicking the whole thing.

Smaller chunks fix all of that. Each task is reviewable. Each task is shippable. Each task is something you can actually course-correct on before the next one starts.

I push the tasks straight to Linear or GitHub Issues. The tracker doesn’t really matter. Having one does. Trying to manage agent work through a folder of markdown files is a special kind of pain I don’t recommend.

3. `explain-visually`. Turn anything into a rich HTML explainer

There was a post on X recently that went pretty viral about the “unreasonable effectiveness of HTML” as an output format for agents. The argument is roughly: markdown is the lazy default, but HTML is a much richer medium. Better for diagrams, better for layout, better for the way humans actually consume information.

I took that idea and turned it into a skill. You point it at a repo, an architecture, a PR, a concept, anything really, and it produces a properly designed HTML page. Cheat sheets. Diagrams. Structured explanations.

I’m a visual learner, and this has quietly become one of the skills I use most often. When I’m onboarding to an unfamiliar codebase, I run this first. When I want to understand a concept properly, I run this. The output is the kind of thing that would take a person hours to put together.

4. `clarify`. For when you’re rambling

Most of the prompts I write now are dictated. I talk to my agents. It’s just faster.

The problem is that voice dictation produces messy prompts. Half-formed thoughts. Contradictions. Vague references (”the thing”, “that script”, “you know, that bit we did earlier”). An agent trying to act on that is going to do something, but probably not what you wanted.

The clarify skill is the fix. It interviews you one question at a time, walks down the decision tree, and turns your ramble into a tight, self-contained optimised prompt. The difference between what goes in and what comes out is genuinely night-and-day.

5. `address-pr-feedback`. My favourite of the seven

This one has changed how I work more than any of the others, and it’s almost embarrassingly simple.

I used to do most of my code review locally. It was slow, and honestly, I wasn’t great at it on my own work. So I changed the loop. Now the agent pushes the branch, and I rely on automated reviewers (Copilot, Codex, Claude) to comment on the PR. Then address-pr-feedback pulls every comment down, judges which are valid, fixes what needs fixing, pushes the changes, and replies to each reviewer noting the commit that resolved their comment.

The crucial part is that it doesn’t blindly accept feedback. It evaluates each comment on its merits. Bad suggestions get pushed back on. Good ones get implemented. And replying to reviewers (even AI ones, but especially humans) is just good practice.

6. `refactor`. Always do a second pass

The first version of code an agent writes is almost always fine. Workable. Functional. But almost never the best version.

The refactor skill is the second pass I run on anything significant. Simpler logic. Less duplication. Fewer lines saying the same thing twice. It’s a small skill in terms of what it does, but compounded across every change I ship, it’s one of the biggest quality multipliers in my whole workflow.

The lesson buried in here is broader than the skill itself. Never blindly accept the first output. There’s almost always something to improve. The agents themselves know this. They’re often perfectly happy to tighten their own work when you ask.

You just have to ask.

7. `design-doc`. Think clearly before you build

For anything with real architectural weight (a new service, a significant refactor, a decision that’s going to be expensive to undo), I write a design doc before the spec.

A spec is for the agent. A design doc is for humans. For you. For the teammate you’ll loop in. For the senior engineer reviewing the approach before code gets written. Context and scope. Goals and non-goals. Architecture diagrams. APIs. Storage. Trade-offs. The kind of structured thinking that companies like Google and Amazon have built into their engineering culture for good reason. The cost of changing code in production is enormous, and a few hours of thinking up front is dirt cheap by comparison.

Agents help us write code. They don’t help us think clearly. Design docs do.

Get The Skills Free

All seven skills are open and free. They’re linked from the video that goes with this post. Grab whichever ones look useful, ignore the rest, fork them and make them yours.

Thanks for reading.

Owain

Is Pi better than Claude Code?

Owain Lewis — Sat, 18 Apr 2026 17:00:15 GMT

I spent last week using Pi as my main coding agent, which was an interesting exercise because Pi is radically different coding agent compared to Claude Code or Codex.

If you have not come across it, Pi is an open source coding agent that has a very minimalist philosophy. It ships with only 4 tools (read, edit, write, bash), supports ALL AI models, has a system prompt under a thousand tokens, no MCP, no permissions. The one feature that makes it really different is the ability to extend and customise the agent through extensions.

The minimalism is a feature not a bug. As coding agents have more and more random (often unnecessary) features bolted on, they can start to feel like a big mess. You have no control of third party coding agents so if you want to change something you’re at the mercy of Anthropic or OpenAI.

A really interesting feature of Pi to me is that you can edit or completely replace the system prompt for the agent. This is useful if you want to use it for things that aren’t coding.

How extensions work

Extensions are just TypeScript files you drop in a folder, and on startup Pi picks them up. They can do almost anything. Change the UI. Register tools. Intercept calls. Gate permissions if you want that back. or anything else you can think of.

I built a workflow extension that takes a spec, write code, reviews the code with a fresh context window, fixes the issues, runs the tests, and verifies. It took me about twenty minutes to put together because Pi can read its own source and documentation and write the extension for you. The agent is, in a fairly literal sense, able to extend itself.

I have wanted this for a long time. The loop I run with any coding agent is roughly: write, review, fix, test, verify. Doing it by hand means a lot of prompts that are always the same prompts, and the fresh-context review step is the one I skip most often because it is annoying to set up. Putting it in code, deterministically, so the agent does not have to remember to do it, feels like a real step beyond relying on non deterministic prompts.

What the minimalism costs you

The honest other side is that minimalism has a cost, and whether you care about the cost depends on how you work. Pi does not ship with MCP. You can add it through an extension, but the default answer to “how do I connect this to my other tools” is to use a command-line tool from bash. There is no to-do tracker. No sub-agents. No hooks. These are not accidental omissions, they are the design, but if your current workflow leans on any of them, you will feel it.

The bigger issue for me personally is the provider situation. Pi lets you bring any model. OpenAI via ChatGPT, Google, GitHub Copilot, OpenRouter, local models through Ollama. But Anthropic recently stopped allowing third-party agents to use the Claude subscription, which means if you want to use Claude models through Pi, you pay API rates. I use Claude heavily, I am on the subscription, and moving to API pricing for my daily agent is not something I want to do. That is not Pi’s fault. Thankfully OpenAI let you use your Codex subscription with other tools.

Where I think this fits

If I stripped out the Anthropic-subscription problem, I think Pi would probably be my default coding agent. I like the minimalism. I like that I can read and understand the system prompt. I like that extensions are code, not configuration. I like that the agent can improve itself by writing its own tools.

As it stands, I am going to keep Pi installed and keep using it for specific workflows where the extension system earns its keep. The multi-stage review loop is one. Anything where I want deterministic control over what the agent does between prompts is another. For everything else, I am still inside Claude Code, because that is where my subscription works and where the models I trust most live.

The broader thing I will take away is this. The interesting question for coding agents is no longer what features they ship, because the features have converged. The interesting question is what they let you change. Pi bets harder on user-defined behaviour than any other agent I have used, and that bet is, I think, the right one.

If you want to see what this looks like in practice, I recorded a full walkthrough where I install Pi, configure providers, customise the system prompt, and build an extension from scratch on camera.

You can watch it here:

Thanks for reading. Feel free to comment on the video if I missed anything.

Six RAG strategies, explained simply (with code).

Owain Lewis — Fri, 03 Apr 2026 08:36:55 GMT

Most RAG tutorials jump straight to vector embeddings. Half the time, that’s the wrong tool.

RAG means retrieval augmented generation. Retrieve information, add it to the prompt, let the model answer using it as context. The retrieval part is where it gets interesting; there are a lot more options than most people realise.

As a side note, I use Postgres for everything. You don’t need complex database infrastructure. Postgres can handle all of these strategies, making it a pragmatic choice for most situations.

Here are six approaches I use on real client projects, roughly in order of complexity.

PS: If you want a full walkthrough, I made a video here.

1. Document loading

Almost everyone dismisses this one, because it’s simple and boring.

If you’re loading a step-by-step runbook, a checklist, or a recipe - you can’t just retrieve partial bits of information. You need the entire document or the answer won’t make sense. Partial retrieval of a setup guide produces partial answers.

Two ways to do it. The naive approach: read the file, stick it in the prompt.

with open(path, "r") as f:
    document = f.read()
prompt = f"Answer using this document:\n\n{document}\n\nQuestion: {question}"

The smarter approach: an index or lookup system that describes each document. Pass the index to the model, let it pick the right document first, then load it. Slower, but handles a larger document set.

The downside is tokens. The other downside is you need to know roughly where the information lives. If you have hundreds of documents and no idea which one is relevant, this won’t work. But for a focused document set, surprisingly reliable.

Don’t dismiss the simple option.

2. Full text search

This has been around forever. Search by keyword. Built into Postgres with tsvector and tsquery, no extra infrastructure needed.

SELECT content, ts_rank(search_vector, query) AS rank
FROM document_chunks, plainto_tsquery('english', %s) query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT %s

Someone asks about “30-day returns” and your document says “30-day returns,” you’ll find it. Postgres also stems keywords, so a search for “running” matches “runner” and “runs” as well.

Where it breaks: meaning. “Comfortable shoes for long distance running” will only find product descriptions that contain the word comfortable. It’ll miss cushioned, supportive, anything like that. Use this when your users naturally reach for the same words your documents use.

3. Vector search

This is what most people mean by RAG. Take a document, break it into chunks, turn each chunk into a vector, store it. When a user asks a question, embed the query into the same space and find the closest matches.

PS: As well as the cliched chunk documents approach - you can also turn database fields (product descriptions e.t.c) into vectors and search them as well (no one talks about this!).

The power: it understands what you meant, not just what you said. “Comfortable shoes for long distance running” finds a product described as “plush cushioned sole designed for marathon training.” The description never used the word comfortable. Vector search found it anyway.

Pgvector adds this directly to Postgres. No separate database.

sql = """
    SELECT content, 1 - (embedding <=> %s::vector) AS similarity
    FROM document_chunks
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""

Where it breaks: exact filters. “Nike shoes under £100” is a structured query. The embedding of “under £100” does not reliably land near documents that contain £99. It might return £200 shoes because the description is semantically similar. Semantic similarity and numerical filtering are different problems.

4. Hybrid search

Combine keyword and vector search, merge the results. This is my default when I’m not sure which approach a dataset needs.

The merging uses Reciprocal Rank Fusion. Each document scores based on where it appeared in each ranked list.

def reciprocal_rank_fusion(keyword_results, vector_results, k=60):
    scores = {}
    for rank, result in enumerate(keyword_results):
        scores[result["id"]] = scores.get(result["id"], 0) + 1 / (k + rank + 1)
    for rank, result in enumerate(vector_results):
        scores[result["id"]] = scores.get(result["id"], 0) + 1 / (k + rank + 1)
    return sorted(scores, key=lambda x: scores[x], reverse=True)

“Nike running shoes, comfortable for long distance.” Keyword search finds anything with Nike, vector search finds anything semantically similar to comfortable and long distance. You get both.

A solid and pragmatic choice for many business applications.

5. SQL RAG (Database RAG)

This one is relatively underrated and under-discussed, and it’s one of my favourites.

Plot twist: Most business data isn’t in documents. Customer records, orders, inventory, product listings. None of that is in a PDF. It’s in a database. SQL RAG turns a natural language question into a database query and just goes and gets exactly what you need. Tends to be very reliable.

Two approaches with different risk profiles.

Parameterised queries (safer). Pre-write the SQL. The model extracts parameters from the question and slots them in. The model never writes SQL.

sql = """
    SELECT name, price, stock_quantity
    FROM products
    WHERE category = %(category)s
    AND price < %(max_price)s
    AND rating >= %(min_rating)s
"""

Dynamic query generation (more powerful, riskier). The model writes the actual SQL. LLMs are surprisingly good at this. The queries get complex fast, joining tables, applying multiple filters, and they’re usually correct.

I really like this approach for internal analytics tools or database querying tools where the cost of a bad query is low. I’d be very hesitant to use it on a customer-facing product.

Start with parameterised queries via regular tool calls.

6. Agentic RAG

Give the model access to all the retrieval tools above and let it decide which one to use.

tools = [
    {"name": "search_documents", "description": "Search docs and FAQs"},
    {"name": "query_products", "description": "Search products by price, category, rating"},
    {"name": "get_order_status", "description": "Look up orders for a customer"}
]

Where this shines is compound questions. “I want running shoes under £150, and what’s the return policy if they don’t fit?” That needs a product database query AND a document lookup. One retrieval strategy can’t answer it. The agent looks at the question, looks at the tools it has, figures out which to call, and synthesises the answer.

The downside is latency. The agent has to make decisions and sometimes makes bad ones. Picks the wrong search term, retries. It’s also less deterministic. But this is the kind of strategy tools like Claude Code use: search in one file, realise it’s wrong, correct, search somewhere else. Very powerful.

How to choose

A few well-defined or small documents, low query volume or cost sensitivity, users need full context: document loading.
Users search with specific keywords: full text search.
Semantic understanding matters: vector search.
Not sure which applies, or need both: hybrid search.
Data is in a database: SQL RAG, start with parameterised queries.
Compound questions or complex search requirements across multiple data sources: agentic RAG.

Most production systems combine at least two. Agentic RAG is really just a routing layer over the others.

All six strategies are implemented against the same database in the video. Run the same questions through each one and watch where they fail. Seeing the failure modes side by side is more useful than any explanation.

If you want a full walkthrough, I made a video here.

Thanks for reading.

Owain

How I Use AI To Review AI Code

Owain Lewis — Fri, 27 Mar 2026 17:33:31 GMT

We're offloading more and more of our coding to AI agents. But AI-generated code has more bugs, security issues, and logic errors than human-written code — and we're generating it faster than any team can review it.

The answer isn’t to skip review. It’s to automate parts of it so humans only spend time on the things that actually require human judgement.

Here’s the four-layer setup I use. Each layer filters out a category of problems so the next layer sees less noise.

Automated checks run your linter, tests, and security scanner before the agent can finish.
Local AI review gets a second agent to review the code before you push.
CI review runs AI code review automatically on every PR. The safety net for when you skip step two (it happens)
Human review handles what’s left: architecture, business logic, and “should we even build this?”

By the time a human looks at the code, the only things remaining are the things only a human can judge.

Here’s the setup.

Layer 1: Automate The Obvious

Claude Code has a feature called hooks. A hook is a shell script that runs automatically at certain points in the agent lifecycle (like when the agent finishes a task). If the script fails, the agent is blocked from completing and has to fix the issues first.

I use a Stop hook that runs my linter and scanner every time Claude finishes work.

The config goes in your Claude Code settings:

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/stop-checks.sh"
          }
        ]
      }
    ]
  }
}

The script itself is just whatever checks you already run:

#!/bin/bash
set -e
rubocop .
brakeman -q
bundle exec rspec --fail-fast

Swap those for whatever your project uses. Ruff and pytest for Python. ESLint for JavaScript. The point is the same: the agent can’t say “done” until these pass.

This alone catches a surprising amount. Formatting issues, unused imports, type errors, broken tests. None of that makes it into a review.

Layer 2: Agent Review

After automated checks pass, review the code yourself and get an AI second opinion before you push.

Two things matter here. First, actually run the code. This sounds obvious but it catches the most embarrassing bugs in two minutes. Second, read the diff. You don’t need to understand every line; understand the shape of the change. What files were touched? Does the scope match what you asked for? Did the agent silently change something you didn’t ask it to?

For the AI review, the key is a fresh context window. Don’t ask the same agent that wrote the code to review it. It has sunk-cost bias and is less likely to challenge its own decisions.

There are a few ways to do this:

Custom Claude Code command. A review prompt in .claude/commands/review.md, paired with a REVIEW file at the project root that encodes your project-specific rules. Portable across tools, fully customisable. Claude Code also ships with some built in plugins.
Codex /review. Four presets covering every scenario (base branch, uncommitted changes, specific commit, custom instructions). Priority-ranked findings. The best local review UX I’ve seen. Bonus: writing with Claude and reviewing with Codex means cross-model review built into your workflow. Different models have different blind spots.
CodeRabbit. /coderabbit:review locally. 40+ linters and scanners running behind the scenes, purpose-built for code review. There are many other great code review tools like Greptile to explore also.

I use a custom review command that reads a REVIEW file at the project root. This file has project-specific rules, things I always want checked.

# REVIEW.md

## Project Patterns
- Repository pattern for data access. Direct DB queries in handlers are a flag.
- New API routes need an integration test. Flag if missing.

The general review catches general problems. The project-specific rules catch the things that are unique to your codebase.

Layer 3: External Review

Sometimes I forget to run the local review. Sometimes I’m in a rush. So I have an automated check on GitHub that reviews every PR before a human sees it.

There are a few options for this. Codex has a GitHub integration that reviews PRs automatically. CodeRabbit has a GitHub App that does the same thing. Anthropic has an open source GitHub Action for security-focused review.

I like having this as a separate layer because it catches things even when I skip the local step. Set it up once, runs on every PR for free.

Layer 4: Human Review

By the time a teammate opens the PR, the linter has passed, tests are green, and an AI has already flagged obvious issues. The human reviewer doesn’t need to catch formatting problems or unused variables.

What’s left is the stuff only a human can judge. Is this the right approach? Does it solve the actual business problem? Will this cause issues in three months? Five minutes of focused review on those questions is more valuable than thirty minutes of line-by-line reading.

TL;DR

I spent a long time trying to find the perfect code review setup. The experience was frustrating. There are hundreds of tools, plugins, and approaches, many of them doing the same thing in slightly different ways.

Don’t get lost looking for the perfect solution or perfect prompt. Start with Layer 1. Set up your linter and your hooks; that alone eliminates an entire category of review noise. Then find one way to get an AI review locally that you trust. Add CI when you’re ready.

Start simple and never accept the first output from an agent.

If you’re interested in this topic: I also made a video walking through the full setup with demos if you prefer to watch.

How I delegate work to a team of AI agents

Owain Lewis — Thu, 19 Mar 2026 14:42:29 GMT

Hey 👋,

Most of us are using AI coding agents the same way. You’re in the terminal, you’re very involved. You prompt, you review, you go back and forth. This is still the right way to work in many cases. We need to be in the loop to keep quality high and think through problems.

But if you’re working on smaller tasks like bug fixes or documentation updates, you generally don’t need to be in the loop. I think about this as delegating vs micro-managing. You just want to hand these off to an agent and trust that they’re going to do the work.

The problem is there aren’t many easy ways to do that right now.

So this week, I built a proof of concept solution to help with this problem.

Agent worker

I built a simple TypeScript worker script that polls a task management system for tickets. When it finds one, it picks it up, delegates the work to Claude Code (headless), runs a series of checks, and opens a pull request. The ticket moves to “In Review.” I go and check the output.

I call this an agent control plane. Your tasks manager becomes your interface for delegating work to one or more agents.

I’m using Linear - but this works with Jira, Monday.com, or anything with an API.

Task managers are the right way to delegate this kind of work because if you have many things going on at once, you don’t want to be chatting with agents. You want a way to actually track what they’re doing. You’d do the same thing if you’re working in a team. You wouldn’t delegate work via chat. You’d have some kind of system to track it, especially if you’re working on hundreds of tasks.

Pull vs push architecture

This is maybe the most interesting architectural decision. Push-based systems like OpenClaw use webhooks. You expose an endpoint, something hits it, the agent starts working. That means your agent runtime is reachable from the internet. Anyone can hit that endpoint.

A pull-based architecture is different. The agent worker makes outbound requests only. No need for open inbound ports. No exposed servers. If the worker goes down, tickets just sit in the queue until it comes back. The only trade-off is latency. If it takes 60 seconds to pick up a ticket, that’s fine. We don’t care about latency for this kind of work.

For a system where you’re giving an AI agent write access to your codebase and the ability to open PRs, we want the smallest possible attack surface. Polling gives you that.

Deterministic guardrails around non-deterministic agents

One of the challenges when you’re delegating to agents this way is you can’t really do an iterative process. You need it to work in one shot and you need full permissions. This is challenging because more often than not agents make mistakes on their first attempt.

So we wrap the non-deterministic part (the agent writing code) with deterministic checks on both sides.

Pre-hooks run before the agent starts. Check out a worktree, git pull, make sure the environment is clean. If any of that fails, the agent doesn’t start.

Post-hooks run after the agent finishes. Run tests, run linting, push the code. These are just shell commands.

The workflow inside the agent is also structured to compensate for the lack of back-and-forth. Write the code, run the tests, then run a code review with CodeRabbit to look for errors, fix anything it finds, and run the tests again. This reduces the number of iterations you need.

When solving a ticket:

1. Write the code to solve the ticket
2. Run `bun test` and fix any failures
3. Review your changes for code quality. Use CodeRabbit if available
4. Fix any issues found in the review
5. Run `bun test` again to confirm fixes didn't break anything

Automated code review

When you’re not sitting in the terminal reviewing the code, you need some kind of automated system to do that for you. I use CodeRabbit. It’s an AI code review tool that integrates with Claude Code and also runs on GitHub when a PR is opened. So every PR the agents open gets reviewed automatically before I even look at it.

It doesn’t catch everything. But the PRs I end up reviewing have already been through linting, tests, and an AI code review. The obvious stuff is already handled.

Scaling agents

What I like about this simple approach is that it scales well.

You can start with one worker running on your laptop. But you can also run multiple workers on different machines, on a VPS, wherever. Same delegation process, more throughput.

I showed this in this video with two workers picking up two tickets at the same time and completing them in parallel.

The code is mostly a proof of concept to demonstrate the idea. What’s interesting here is the architecture, not necessarily the code. Pull-based delegation with deterministic guardrails around non-deterministic workers. That pattern holds regardless of which tools you use.

Full walkthrough with the demo and all the code is in the video:

Thanks for reading. Have an awesome week : )

P.S. If you want to go deeper on building AI systems, I run a community for people interested in these topics: https://skool.com/aiengineer

The 7 stages of building software with AI (with prompts you can steal)

Owain Lewis — Fri, 06 Mar 2026 13:15:47 GMT

Every week there’s a new AI coding framework that promises to revolutionise how you build software. A new agent. A new spec driven agent workflow. A new way to structure your prompts that will supposedly change everything.

Most of them are packaging the same ideas with different names, and if you’re feeling overwhelmed by all of it, I think stepping back and looking at the big picture is more useful than chasing the next tool.

Here’s what I mean. Every piece of software ever built, at Google, at a two-person startup, on a weekend project, went through some version of the same lifecycle. Requirements. Design. Task breakdown. Build. Review. Deploy. Monitor. The tools change constantly. The lifecycle doesn’t. It hasn’t changed in decades, and a new AI framework isn’t going to change it now.

What has changed is that AI now accelerates every single stage of that lifecycle, not just the coding step. And most people are only using it for one part, code generation, and leaving enormous value on the table everywhere else.

I want to walk through all seven stages, share how I actually use AI at each one, and give you specific prompts and examples that have worked well for me. Some of this might seem obvious to experienced engineers, but I’ve been building software for over twenty years and I still find it useful to step back and look at the full picture. Especially now that the tools have changed so dramatically.

If you want a full video: get it here.

Why planning still matters, even when building is fast

On a recent client project, I spent three full days on research and planning before I wrote a single line of code. That probably sounds like a long time when you could just open a terminal and start prompting an agent.

But here’s what happened: because I had a clear plan, requirements, technical design, key decisions all documented, I could constantly go back to it as I was building. When I hit a fork in the road, the plan had already made the decision for me. When the agent drifted in a direction I didn’t want, I could point it back to the spec. Over the course of the project, those three days of planning saved me far more time than they cost. The project went smoothly in a way that felt almost unusual.

One benefit I didn’t expect: because the requirements were so clearly defined, when it came time to write evals, the agent was able to generate large numbers of them almost automatically. It knew exactly what the software was supposed to do, so it could test against that. Without those clear requirements, the agent wouldn’t have had enough context to generate useful evals at all. That’s a downstream benefit of planning that you don’t really see until you’ve experienced it. The clarity compounds through every later stage.

If I hadn’t done that planning, I know exactly what would have happened, because I’ve seen it play out dozens of times over my career. You rush into building, you make a decision about your database schema or your auth strategy that feels fine in the moment, and then three weeks later you realise it was wrong. But by then your software is in production, customers are using it, and the cost of reversing that decision is so high that most teams just live with it. I’ve watched teams carry bad architectural decisions for years because someone rushed the planning phase. That’s not a hypothetical. It’s one of the most common patterns in software engineering.

So the first two stages of the lifecycle are requirements (what are we building and why) and technical design (how are we building it). AI is useful at both. You can have a real conversation with Claude about your architecture, ask it to challenge your assumptions, even prototype multiple approaches quickly to see which one feels right. But the thinking still needs to happen. You need to own these decisions.

Here’s a simple requirements template that works well:

What: User authentication system
Why: Users need accounts to save preferences
Who: End users of the web app
In scope: Email/password login, signup page
Out of scope: OAuth, password reset (v1), admin roles

That “out of scope” section is quietly one of the most useful things you can write. It stops scope creep before it starts, and it gives the agent a clear boundary for what not to build.

Breaking work down is where most people go wrong

The third stage is task breakdown, and this is the one that makes the biggest difference to the quality of what you get from AI coding agents.

The instinct is to hand an agent your entire project and say “build this.” Don’t do that. You’ll get a mess of code that’s hard to review, hard to test, and hard to understand. What you want instead is a series of small, clear, bounded tasks. Each one with enough context that the agent can do it well without needing to hold your entire application in its head.

I use a prompt like this to break a spec down into tasks:

Read the spec in .ai/specs/auth.md.

Break it down into independent work items that can be completed
one at a time. Each work item should have a clear title, a short
description of what needs to be done, and any dependencies on
other work items.

Once you have the list, push each work item to Linear as a new task.

And then when I hand a specific task to Claude Code, I give it real context:

Task: Create the login API endpoint
Context: We're using FastAPI with SQLAlchemy async.
Auth is JWT tokens in httpOnly cookies.
User model is already defined in app/models/user.py.
Follow the existing pattern in app/routers/health.py.

The difference between this and a vague “add login” prompt is night and day. An LLM is making hundreds of small decisions as it writes your code. Naming conventions, error handling patterns, where to put things. If you give it context, those decisions are well-informed. If you don’t, it guesses, and it guesses in ways that feel plausible but don’t fit your application.

Review is the step that changed my workflow

I keep Claude Code open all day. I run multiple terminal sessions. And I have a slash command specifically for reviewing code. This is the part of the workflow that I think most people skip, and it’s the part that has made the single biggest difference to the quality of what I ship.

After the agent finishes a task, I ask it to review its own work:

Look at the code you just wrote. Find any bugs, edge cases,
security issues, or potential problems.

I’m consistently surprised by how much this catches. Not dramatic, application-breaking bugs. Usually small things. A missing edge case. Input validation that isn’t there. An error handling path that doesn’t quite work. But these small things compound. If every change you make introduces one minor issue, over time your codebase degrades in ways that are hard to track down later.

The reason this works is that generation and review are fundamentally different cognitive tasks (for humans and agents). When the agent is writing code, it’s focused on making things work. When it’s reviewing, it’s looking for problems. These aren’t the same mode of thinking, and almost every time I run a review pass, it finds something meaningful that it missed the first time around.

Once I’ve built out a complete feature, I’ll also do a secondary review of the whole thing end-to-end. You catch a different class of issues at that level. Things that look fine in isolation but don’t quite fit together, or patterns that are inconsistent across files. I’ve started thinking of this as just part of the work now, not an extra step.

Deploy and monitor

The last two stages are deployment and monitoring. Neither is as glamorous as the build step, but both are areas where AI has saved me more time than I expected.

For deployment, I’ve used prompts as simple as:

Commit and save these changes with a clear commit message.
Then push the latest version to GCP Cloud Run.

If you’re not deeply familiar with CI/CD pipelines or infrastructure configuration, this is one of those areas where AI genuinely shines. You can describe what you want and it will walk you through the setup or just do it for you. Things that used to take an afternoon of reading (truly awful) cloud provider docs now take minutes.

Monitoring is the stage that most people skip entirely, and then they find out their application is broken because a customer emails them about it. I’ve seen this happen more times than I’d like to admit, including on my own projects. The fix is simple: set up error tracking with something like Sentry, add uptime monitoring, configure alerts. You can ask Claude Code to integrate all of this into your application, and the whole thing takes less time than you’d spend debugging one production incident without it.

What I’ve learned after a year with Claude Code

I’ve been using Claude Code since it first launched, and at this point I use it for essentially all of my development work. But that experience has also taught me something important: it’s only a powerful tool if you know how to guide it.

Claude Code still makes a significant number of mistakes. It still makes decisions that don’t align with what you want. It still needs clear direction, careful planning, and thorough review to produce software you’d actually be proud of. The agents are getting better all the time, but we’re not at a point where you can skip the thinking and get good results. I’m not sure we ever will be, honestly. The thinking is the valuable part.

The people who are getting the most out of these tools aren’t the ones with the cleverest prompts or the most elaborate frameworks. They’re the ones who understand the fundamentals of building software (requirements, design, task breakdown, review) and use AI to accelerate each of those stages rather than trying to skip them entirely.

Every new framework that comes along is ultimately just a different way of sending text to a language model. The framework doesn’t change the quality of the output. Your thinking before you write the prompt does.

Vibe coding isn’t the enemy. Skipping the thinking is. A senior engineer who has done the design work, made the architectural decisions, and broken the work down can move fast within that structure and produce something great. Someone who skips all of that and just prompts their way through will produce a mess, no matter how good the tools are.

Do the thinking. Then you’ve earned the right to move fast.

I put together a companion repo on GitHub with all the prompts from this piece, plus the presentation slides I used in the video. Clone it, steal the prompts, adapt them to your own workflow.

How I'm using OpenAI Codex automations to improve my code

Owain Lewis — Sat, 21 Feb 2026 15:17:09 GMT

OpenAI just added a feature to their coding agent, Codex, that most people missed. It’s called Automations.

If you haven’t used Codex - it’s an AI coding assistant, similar to Claude Code. You give it a task in plain English (”fix this bug”, “review this file”) and it writes the code for you. Think of it as a developer on your team that you hand tasks to.

Automations let you take any task you’d give Codex and run it on a schedule. You write a prompt, pick a frequency (every morning, every 3 hours, whatever), and the agent runs that task automatically in the background on repeat. No manual prompting. You’re not at the keyboard.

I’ve been running two of these for a few weeks and they’ve caught bugs I almost certainly would have missed. One scans for issues and creates Linear tickets. The other picks up those tickets, fixes the code, and opens PRs.

This video walks through the full demo. This edition breaks down the setup and how I wired the two automations together.

The Setup

Each automation lives as a .toml file (a simple config format) inside a .codex/automations/ folder in your project. It has three things: a prompt (what the agent should do), a schedule (when it runs), and a memory.md file that persists between runs so the agent remembers what happened last time.

Here’s what the bug scanner looks like stripped down:

[automation]
name = "Bug Scanner"
cwd = "/workspace/myproject"
schedule = "0 9 * * *"  # Every day at 9am

[prompt]
content = """
You are performing a daily code review. Your job is to find critical bugs,
security issues, and unhandled edge cases in the codebase.

For each issue you find:
1. Check whether a Linear ticket already exists for this issue. If it does, skip it.
2. If it's new, create a Linear ticket with the following:
   - Title: concise description of the bug
   - Label: autofix
   - Body: summary, affected files with full paths, customer impact,
     reproduction steps, suggested fix

Use full absolute paths when referencing files. This automation runs inside
a Git work tree and relative paths will not resolve correctly.

At the end, report: X bugs found, Y skipped as duplicates.
"""

The bug fixer runs on the same schedule:

[automation]
name = "Bug Fixer"
cwd = "/workspace/myproject"
schedule = "0 9 * * *"  # Every day at 9am

[prompt]
content = """
Scan your Linear board for open issues with the label: autofix.

For each issue:
1. Read the bug description in full
2. Check out a new git branch: fix/-
3. Implement the fix
4. Verify the build passes
5. Open a pull request using the GitHub CLI
6. Move the ticket to In Review status

If anything fails, stop and report the error. Do not silently work around failures.
"""

The memory file sits alongside the automation config and gets updated after each run. It keeps a record of what the agent found and did previously. On the next run, the agent reads it before scanning - so it knows which issues it already reported and won’t duplicate them.

Quality

This is what surprised me most.

The agents write better bug tickets than most developers do.

The tickets generated by these automations had incredibly detailed descriptions, a suggested fix, and detailed steps to reproduce the issue.

Memory

Traditional automation tools like Zapier and n8n run fixed flows. They do the same steps every time. These Codex automations are different because they persist memory across runs.

After each run, the agent writes what it found into the memory file. Here’s a simplified version of what that looks like after a few runs:

# Automation Memory

## 2026-02-18
- Found 3 new issues. Created tickets: LIN-47, LIN-48, LIN-49.
- Skipped 1 issue (duplicate of LIN-44).

## 2026-02-19
- Found 1 new issue. Created ticket: LIN-52.
- Note: The webhook timeout issue (LIN-47) was fixed and merged.
  Removed from watch list.

## 2026-02-20
- No new issues found in webhook or retry modules.
- Flagged auth module for closer review tomorrow - noticed some patterns
  that could lead to session fixation under specific conditions.

On the next run, the agent reads this before starting. It knows what it already reported. It knows what was fixed. It can notice when it flagged something yesterday and follow up on it today.

What to Automate

The pattern generalises. This works in any agent environment - it’s just a prompt, a recurring schedule, and a memory file.

What makes a good candidate:

It’s something you’d do on a regular schedule anyway
A vague instruction is enough to produce useful output (you don’t need pixel-perfect determinism)
It’s safe for an agent to try and fail - the output goes somewhere reviewable before anything irrevocable happens

Bug scanning fits all three. So does dependency review, documentation checks, stale ticket cleanup, security audits, release notes, and test coverage monitoring. Anything that currently gets skipped because you’re busy is a candidate.

Where it doesn’t work well: anything where the agent needs to make a decision you’d want to make yourself, or where a wrong answer is hard to detect in review.

Summary

If you have a codebase with more than a few thousand lines, set up one scanner this week.

Write a prompt that asks the agent to do a daily code review and create a ticket for anything new. Run it manually a few times to see what it finds.

I was really impressed by this feature. The idea is simple but the impact is significant. A bug that never reaches your customers, a task you’re too busy to do that can now be done automatically, a security issue that is detected early on.

I made a video covering this in more depth here.

Claude Code agent teams explained

Owain Lewis — Thu, 12 Feb 2026 17:25:21 GMT

Hey 👋,

Claude Code just shipped a really interesting feature called Agent Teams. Instead of one agent doing everything, you can now run multiple Claude Code instances that work together as a team. Each agent has its own context window, and they can talk to each other directly (which is crazy).

Using AI agents to write code is standard now. Using one agent in a terminal or IDE feels natural - describe a task, it builds, you review. Straightforward loop.

Multiple agents talking to each other feels completely different. And, despite this being an early feature, it feels like looking into the future.

I made a video showing how to set this up with tmux so you can watch all the agents working together.

Subagents vs Agent Teams

You’re probably familiar with subagents in Claude Code. With subagents, each one has its own context window and results return to the parent. Communication is one direction. The parent manages everything.

With agent teams, each agent also has its own context window but they’re fully independent Claude Code sessions. They can message each other directly. There’s a shared task list for coordination. It’s best for long running work that needs discussion and iteration between agents.

The quick test is: if your agents would benefit from talking to each other, use a team. If they just need to return results, subagents are simpler and cheaper.

When To Use This

The pattern I keep coming back to is agents reviewing other agents’ work.

One agent writes code. A second agent reads the output and sends specific feedback. The first agent fixes the issue. The reviewer checks again.

With subagents, this feedback loop runs through you. The reviewer reports back, you read it, you paste it into a new prompt for the builder. You’re the middleman.

With agent teams, the reviewer sends notes directly to the builder. The builder fixes it. The reviewer checks again. Multiple rounds without you relaying anything.

That’s a small thing on paper. In practice, it changes what kind of work you can hand to agents. Single-pass generation - write this function, generate these tests - works fine with one agent. Multi-pass long running work - build something, review it, revise, review again — requires agents that can talk to each other.

C Compiler

To see where this goes, look at what Anthropic’s engineering team did. Nicholas Carlini ran sixteen Claude instances to build a C compiler from scratch in Rust. Two weeks, about two thousand sessions, just under twenty thousand dollars in tokens. The result: a hundred thousand lines of Rust that compiles the Linux kernel. Ninety-nine percent GCC torture test pass rate.

The interesting part isn’t the scale. It’s the change in the type of work agents can do.

Carlini’s key observation: “Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect.” The agents weren’t the bottleneck. The quality of the feedback loop was.

That’s the same pattern at a different scale. Builder agents produce code. Reviewer agents check it and send feedback. The builders act on it. The loop runs without a human relaying messages.

My Experience

I used agent teams to build an app with three Claude instances: a backend agent, a frontend agent, and a code reviewer. The reviewer watches both, checks that the API contract lines up, and sends issues to the agent that owns the code.

During the build, the reviewer caught bugs in the initial implementation and delegated back to the front end and back end agents to fix the issues.

That’s a closed-loop correction that happened without me. With a single agent, I’d have caught it during code review. With teams, the feedback loop ran on its own.

Tradeoffs

Three agents in parallel costs roughly three times as much. I burned through my rate limits making a video about this (and rarely have issues). That’s the honest trade-off.

The C compiler project used about twenty thousand dollars in tokens over two weeks. That’s sixteen agents running in loops. For most of us, the question isn’t “can I afford sixteen agents” — it’s “is the closed-loop feedback worth 3x the cost for this particular task?”

Where Is This Heading?

Right now, agent teams are experimental.

But the pattern - agents reviewing agents in a loop, self-correcting, picking up the next task when they’re done - that’s clearly the direction we’re going in. Systems of specialised agents working together on more complex tasks. The C compiler project showed it actually works at scale. Agent teams in Claude Code bring a version of it to your terminal.

I walked through the full build in my latest video here.

Your agent workflow doesn't scale (here's the fix)

Owain Lewis — Sat, 07 Feb 2026 16:36:17 GMT

Hey,

I’ve been using a project board to manage my AI agents and it’s working really well. Instead of prompting back and forth in the terminal, I put tasks on a Linear board. The agents pick up tickets, follow a workflow I’ve defined, and open PRs. I just review the PRs.

Here’s the setup.

The Setup

Two things make this work: an MCP connection to your task board, and a CLAUDE.md file that defines how the agent should work.

Connect Claude Code to Linear

One command:

claude mcp add --transport http linear-server https://mcp.linear.app/mcp

Open a Claude Code session and authenticate through Linear’s OAuth flow. Claude can now read tickets, create tickets, update status, and close issues.

While I’m using Linear, you could follow this flow in any task management system.

I break all this setup down in a video here.

Encode the Workflow in CLAUDE.md

Your project’s CLAUDE.md file tells the agent how to behave. Here’s a stripped-down version of what I actually use:

## Linear Integration

- Fetch issues using the Linear MCP tool.
- Always read the parent issue (if one exists) for full context.
- If a description references a spec file, read it before implementing.
- Set issue status to **In Progress** when starting,
  **In Review** after PR creation.

## Branching

Branch format: `/-`
- `feature/` for features
- `fix/` for bugs
- `cleanup/` for tech debt

Example: `feature/gra-12-add-supabase-sync`

## Commits

- Format: ` ()` e.g. `Add Supabase sync (GRA-12)`
- Never commit code that doesn't build. Run `bun run build` first.

## Pull Requests

Create with `gh pr create`. PR body must include:
- Summary of changes
- Verification: `bun run build` result, files changed
- Link to the Linear issue

## Self-Review (required before pushing)

After implementation, launch a sub-agent to review the diff:
- Check for bugs, dead code, security issues, over-engineering

That’s the whole system. The agent reads this file, follows the workflow, and produces PRs that are structured, verified, and linked to tickets. You write it once and every task follows the same process.

The agent fetches the ticket, reads context, checks out a branch, implements, runs the build, reviews its own code with a sub-agent, then opens a PR. All defined in a file that lives in your repo.

Write Good Tickets

The workflow only works if your tickets are clear. Here’s what a good one looks like:

Add authentication to the dashboard
Users should be able to log in with email/password. The login form should validate input, create a session, and redirect to the dashboard on success.
Files to update: auth.ts
Acceptance criteria:
Login form renders at /login
Invalid email/password shows an error message
Successful login creates a session and redirects to /dashboard
Unauthenticated users are redirected to /login
Reference: See specs/auth.md for expected behaviour

More detailed than most developer tickets in the real world. That’s the point. Clear tickets are what let you step back. Vague tickets pull you back into the terminal.

Start Assigning

Fetch the open tickets on my Linear board and show me what's in the backlog.

Pick a ticket. Tell Claude to work on it. Watch it follow the workflow.

Before this, I was writing Markdown specs and handing them to agents. It worked for a while. But once I had more than a few tasks going, I couldn’t keep track of what was done, what was stuck, or what depended on what. Markdown files don’t have status. A board does.

It’s the same thing that happens when you grow as an engineer. Early on you just write code and push it. Then you add tests, CI, code review. Not because you want more process, but because you’ve been burned enough times to know that a bit of structure saves you from a lot of pain.

Same thing with agents. Prompting in the terminal works fine for small stuff. But once you’re juggling multiple tasks or building something real, having tickets with clear acceptance criteria and a build step that runs before every PR just makes everything more reliable.

How I Decide What to Hand Off

Not everything needs the full workflow. Small stuff I still just prompt directly. But for anything that takes more than a few minutes, I put it on the board.

I started by staying pretty hands-on. Watching the agent work through tickets, seeing where it needed better instructions. Over time I got a feel for what it handles well on its own (bug fixes, refactoring, straightforward features) and where I need to stay involved (architectural decisions, anything where I need to see the result before committing to an approach).

Summary

Pick one feature you’re working on right now. Break it into three or four tickets on a board. Start assigning tickets to your agents like you would assign to an engineer.

Watch it pick up the ticket, implement the work, and open a PR.

Then see what else you can hand off.

Thanks for reading. I walked through the full setup in a video here.

How I code with AI agents (spec-driven development)

Owain Lewis — Fri, 30 Jan 2026 12:24:42 GMT

Hey there 👋,

Here’s a pattern you’ll recognise: You tell an AI agent to “add authentication to the app.” It starts coding immediately. An hour later, you’re undoing decisions you never asked for.

Complex code you didn’t ask for. Password reset flows you didn’t need. New dependencies you explicitly avoid. The agent was trying to help. It just had no idea what you actually wanted.

The fix isn’t better prompting. It’s a different workflow entirely.

This post breaks down spec-driven development - the practice of defining a specification before letting an agent execute. I’ll show you exactly what goes in a spec, how it differs from other documents you might write, and a step-by-step workflow I use every day.

What Is Spec-Driven Development?

Spec-driven development is simple: instead of prompting first and figuring things out as you go, you define a specification up front. A short markdown document that describes what you’re building, constraints, relevant context, and a list of tasks to complete.

When you tell an AI agent to “add authentication,” there are dozens of decisions to make. Token expiration. Storage approach. Error handling. Library choices. If you don’t specify, the agent guesses. And guesses compound.

Even though agents like Claude Code can write plans, ask clarifying questions, and resume sessions, you’ll still want to own the spec yourself. A plan Claude generates lives in a conversation. A spec lives in your repo - a markdown file you can review, edit, version control, and hand to any other agent or teammate. And writing the spec yourself forces you to make decisions rather than just review the agent’s choices.

The Three Documents (and Why They’re Different)

I see people conflating PRDs, design docs, and specs constantly. They serve different purposes.

Product Requirements Document (PRD): For humans; product managers, stakeholders. Covers what we’re building and why. Business value, user stories, success metrics. This is a debate document.

Technical Design Document: For engineers. Covers how we’re building it. Architecture decisions, scalability considerations, security implications. Also debated and reviewed.

AI Spec: For agents. This is an execution document - not a debate, a plan. It translates decisions from the PRD and design doc into something an agent can act on.

In practice, you don’t always write all three. For a small feature, you might skip straight to a spec. For a large initiative, you’d have a PRD that spawns multiple design docs, each spawning multiple specs. The spec is always the final translation layer before code.

Anatomy of a Good Spec

A spec has four parts:

1. Why (Brief Context)

Keep this short. One or two sentences about the problem you’re solving. This helps the agent make intelligent decisions if it encounters ambiguity.

2. What (Scope)

Define the boundaries. What features are you building? Be specific about implementation details the agent would otherwise guess about.

Example: “JWT-based auth with one-hour access tokens and seven-day refresh tokens. Users can register, login, and refresh tokens.”

3. Constraints (Boundaries)

This is where you prevent the agent from being too eager. What libraries to use. What patterns to follow. What’s explicitly out of scope.

Example: “Use bcrypt for password hashing. Store user data in Postgres via Prisma. Must not add new dependencies. Must not store tokens in the database. Out of scope: password reset, OAuth, email verification.”

4. Tasks (Discrete Work Units)

Break the work into small, verifiable chunks. Each task should specify what to build, which files to touch, and how to verify completion.

Example:

Task 1: Add user model to Prisma schema. Verify: npx prisma generate succeeds.
Task 2: Create registration endpoint. Verify: Test with curl, user appears in database.
Task 3: Create login endpoint. Verify: Returns valid JWT on correct credentials.

When Specs Go Wrong

Specs fail in two directions.

Over-specified: You’ve constrained the agent so tightly it can’t solve the problem. Signs: the agent keeps asking for permission, or produces convoluted code to satisfy contradictory constraints. Fix: loosen constraints, focus on outcomes rather than implementation details.

Under-specified: The agent still has to guess. Signs: you review the code and find unexpected decisions - new files, different patterns, surprise dependencies. Fix: add the missing constraints. Each surprise is a constraint you forgot to write down.

The goal is a spec tight enough that the agent can’t make decisions you’d disagree with, but loose enough that it can solve problems you didn’t anticipate.

My Workflow

It’s important to point out that this level of planning isn’t always needed. If you’re fixing a simple bug, you likely don’t need extensive planning. Just do it. If you’re working on something large that might split into many tasks or run over multiple sessions - write a spec.

Here’s how I actually use specs day to day:

Step 1: Generate. I describe what I want to build to the agent and ask it to write a spec—not implement the feature. I use a /spec command for this.

Step 2: Iterate. I review the spec carefully. The agent will make assumptions. I correct them, add constraints I forgot, remove scope creep. This is where I catch problems before they become code.

Step 3: Execute. I open a fresh session. I ask the agent to read the spec and implement Task 1. Review the code. Commit. Move to Task 2.

“Read and implement T1”

Step 4: Adapt. Review the code. Could it be improved? Maybe Task 3 reveals a flaw in the spec. I go back and update it. This isn’t waterfall, it’s iterative. The spec is a living document.

The key insight: don’t ask the same agent to plan the work and do the work. Planning and execution are different modes. An agent that’s planning will think through edge cases. An agent that’s executing will rush to ship.

Skip the Frameworks?

There are a lot of spec-driven development frameworks out there. OpenSpec. Kiro. GitHub Spec Kit. I’ve tried them.

To me, they felt like overkill.

They generate tons of files. They want you to define user stories in markdown. They add ceremony that slows you down without adding value.

Here’s what you actually need: one slash command that acts as a meta-prompt to generate a spec.

> “/spec implement rate limiting in the API.

The power of spec-driven development isn’t in the tooling. It’s in the practice of thinking before prompting. A fancy framework won’t fix sloppy thinking. A simple markdown file that forces you to articulate constraints is probably enough.

Why This Creates Leverage

Spec-driven development isn’t new. Software teams have always worked this way. PRD > Design doc > Task breakdown > Implementation. The only difference is we’re handing tasks to agents instead of other developers.

But here’s what changes: an agent that executes well-defined specs can move faster than any human. The bottleneck shifts from implementation to specification. Your job becomes defining work clearly enough that an agent can execute it autonomously.

Final Thoughts

If you’re building anything non-trivial with AI agents, write a spec first. It prevents agents from guessing. It gives you control over implementation decisions. It produces higher-quality code.

Working incrementally - one task at a time, reviewed and committed - beats letting an agent generate 10,000 lines you have to untangle later. Match the spec’s detail to the task’s complexity. One liner? Just do it. Small feature? Short spec. Large feature? Detailed spec with many tasks.

Your job is to architect the work. The agent’s job is to build it.

PS: Watch the video here. It contains a link to all the templates I use:

Thanks for reading. Have an awesome week : )

P.S. If you want to go deeper on building AI systems, I run a community where we build agents hands-on: https://skool.com/aiengineer

The simplest way to build AI agents in 2026

Owain Lewis — Fri, 09 Jan 2026 17:43:04 GMT

Hey there 👋,

You can build a working AI agent with just a folder, a markdown file, and a python script.

No N8N. No LangGraph. No FastAPI. No infrastructure at all.

The AI agent space has convinced people that building agents requires either expensive no-code platforms or serious engineering overhead. I think that’s backwards - at least for personal use.

If you’re a solo builder who wants AI agents that handle your research, automate your workflows, or manage repetitive tasks, you don’t need production infrastructure. You need something you can build in an afternoon and modify in minutes.

I call this the Micro-Agent Architecture. It’s the pattern I use for my own agents, and it’s embarrassingly simple.

The Structure

Here’s everything you need:

my-agent/
├── AGENTS.md
├── tools/
├── context/
└── workspace/

Four folders. Let me show you what each one does.

AGENTS.md (The Instructions)

This is where you tell the agent who it is and what it can do. Think of it as a system prompt you can version control.

Most modern coding agents read this file on startup. For Claude Code just add this to your Claude.md and it will read that file.

@AGENTS.md

Here’s a real example: a research agent I use for YouTube content analysis which can fetch videos, research topics, get video transcripts, and even uploads my videos for me (writing all the metadata, tags, and descriptions):

# YouTube Research Agent

You are a research agent specialising in YouTube content analysis.

## Tools

Use the following tools.

### get_channel_videos

Fetch videos for a YouTube channel. Returns view counts, titles, outlier scores.

uv run tools/youtube.py get_channel_videos @mkbhd --days 30

### get_transcript

Pulls the transcript for a specific video.

uv run tools/youtube.py get_transcript VIDEO_ID

The agent knows its role, knows what tools it has, and knows the workflow for common tasks.

Tools (The Scripts)

Simple scripts that do specific things. Python, Bash, Node. If you don’t know how to code, Claude can just write the scripts for you (“write a python script to fetch youtube videos for a channel. Tell me how to use it”). The LLM reads AGENTS.md, sees the command, runs it. No SDK. No framework. Just scripts.

Context (The Knowledge)

Reference material the agent reads before working. Style guides, templates, examples, SOPs. This is how you make agents consistent-by giving them documentation, the same way you’d onboard a person.

Workspace (The Output)

Where the agent saves its work. Research, drafts, data. Files that persist between sessions. Everything it creates goes here, so you can review it, edit it, and build on it.

As an example, when using my YouTube agent, I store complete video transcripts as files and sometimes refer back to them during conversations.

How It Works

You already have the agent runtime. It’s Claude Code, Codex, Amp - whatever agentic coding tool you’re already using. ANY of them. These tools can read files, follow instructions, and most importantly run commands. That’s all an agent needs.

We treat the agent harness itself (Claude Code, Codex, Goose, Amp) as a building block and largely interchangeable.

Point your tool at the folder and give it a task:

The agent reads AGENTS.md, understands its role, runs the tools, and saves everything to workspace. Real research, done automatically, saved locally.

The folder IS the agent. Instructions are markdown. Knowledge is markdown. Tools are scripts. Storage is files. The agentic coding tool you already have is the runtime.

No deployment. No hosting. No complexity.

The Insight That Makes This Work

Software engineers have been building CLIs and scripts for decades. We write utilities that automate our work. It’s one of the oldest traditions in the craft.

Here’s what I’ve realised: agents are exceptionally good at using CLIs. Better than humans, actually.

Think about it. An agent can read documentation perfectly, remember every flag, and invoke your scripts hundreds of times without getting tired or making typos. Give it a conversational interface and suddenly your little Python script becomes something you can talk to.

Any CLI becomes 100x more powerful when you add an intelligence layer to it.

That YouTube research tool I showed earlier? It’s just a script. But when an agent uses it, it can analyse fifty channels in parallel, cross-reference the results, and synthesise insights I’d never have time to find manually.

And here’s the thing-anything can become a tool. A Python script. A bash one-liner. A Docker container. If it runs from a terminal, an agent can use it.

You’re not learning a new skill. You’re amplifying one you already have.

Why I Use This Instead of Frameworks

For personal agents, frameworks are overhead.

N8N, LangGraph, and similar tools solve real problems - for teams shipping production systems to users. If you’re building agents other people will use, you need observability, APIs, error handling, deployment pipelines, all of it.

But if you’re building agents for yourself? You don’t need any of that. You need something you can modify in two minutes when your requirements change. You need something you can understand completely. You need something that doesn’t break when a framework updates.

A folder of markdown and scripts gives you that. It’s not sophisticated. That’s the point.

The Leverage Angle

A tool helps you once. A system helps you a thousand times.

The Micro-Agent Architecture isn’t about building impressive AI systems. It’s about building personal agents that help 1 person do the work of 10.

Thanks for reading. Have an awesome week : )

P.S: If you want to build agents like this hands-on with other engineers, find more in depth content here: https://skool.com/aiengineer

AI frameworks worth learning in 2026

Owain Lewis — Wed, 07 Jan 2026 17:56:40 GMT

Hey there 👋,

AI framework hell is real.

Every week, a new AI framework arrives. And every week, developers ask the same question: which one should I learn?

LangChain, CrewAI, LangGraph, AutoGen, Spring AI, LlamaIndex, Vercel AI SDK, OpenAI Agents SDK, Google ADK. Everyone’s telling you to learn the latest one or you’ll fall behind.

Here’s what nobody tells you: you don’t need any of these. The provider SDKs (OpenAI, Anthropic, Google) are powerful enough on their own. You can build agents and complex multi-step workflows without a framework.

Frameworks are genuinely useful once you understand what they’re abstracting away. But, the problem is most developers start with frameworks before they understand the fundamentals. When something breaks, they’re stuck.

This week, I’ll show you the frameworks I think are worth learning right now, and what each one is best for.

The Trade-Off

Every framework is a trade-off. You’re adding a layer between you and the model. That layer gives you convenience, patterns, and abstractions. It also means bugs you didn’t write, upgrade paths that break your code, and opinions about how AI apps should work.

I call this framework tax. Not because frameworks are bad, but because they’re not free. You’re trading flexibility for convenience. Sometimes that’s the right trade. Sometimes it isn’t.

The provider SDKs (OpenAI, Anthropic, Google) are well-documented, stable, and give you direct control. The Gemini SDK in particular has seamless tool calling out of the box.

Here’s the approach I recommend: SDK-first development. Start with the raw SDK. Understand how tool calling works. Build a simple agent loop. Feel the edges.

Then, when you hit a wall, when you need something the SDK doesn’t give you easily, reach for a framework. Now you understand what it’s doing for you. You’re not cargo-culting. You’re making an informed choice.

When Frameworks Genuinely Help

Here are the five situations where I’d reach for one:

First: a production-ready agent loop. The core agent loop is simple. Maybe 70 lines of code. But the production details add up: max iterations, timeouts, graceful error recovery, retry logic. Frameworks have battle-tested these patterns. You can write this yourself. The question is whether you want to discover all the edge cases on your own. Plus, you probably don’t want to write this by hand on every project.

Second: provider flexibility. This is the big one. Moving from OpenAI to Azure OpenAI. Switching from Anthropic to Bedrock. Testing a new model from a different provider. These changes touch a lot of code if you’re using raw SDKs. An abstraction layer makes swapping providers a config change. If you think you might switch, or want the option, this alone justifies a framework.

Third: team standardisation. On larger teams, frameworks enforce consistent patterns. Same structure, same debugging approach, same conventions. Everyone speaks the same language. This benefit scales with team size.

Fourth: complex workflows. Retries, human-in-the-loop approvals, branching logic, parallel execution. If you’re building a workflow engine, frameworks have already solved the hard parts. You could build it yourself, but you’d be reinventing solutions that have been refined over years.

Fifth: multi-agent orchestration. Handoffs between agents, shared state, delegation patterns. Most apps don’t need this. But if yours does, frameworks make it easier than rolling your own.

Provider-Specific vs Provider-Agnostic

One more distinction before the list.

Provider-specific frameworks (Google ADK, Claude Agents SDK, OpenAI Agents SDK) are built by the model providers. They’re optimised for one model family, encode that provider’s best practices, and typically offer the smoothest experience. The trade-off is commitment. You’re betting on that provider.

Provider-agnostic frameworks (Pydantic AI, LangGraph, Vercel AI SDK) abstract across multiple providers. You trade some optimisation for flexibility. When a new model leapfrogs the competition, you swap a config value instead of rewriting code.

Neither is universally better. If you’re all-in on one provider, go provider-specific. If you want options, go agnostic. Know which game you’re playing.

The Five Frameworks Worth Learning

Here are the five I’d actually invest time in - organised by language and use case.

1. Pydantic AI (Python)

If you’ve used FastAPI or Pydantic, this feels instantly familiar. Same developer experience (types, validation, contracts) applied to LLM applications.

It handles response validation, retries on malformed outputs, and structured error handling. Works across all major providers.

Best for: Python developers who think in types and want reliable, testable agents.

2. LangGraph (Python)

LangGraph models AI applications as graphs. Nodes and edges with explicit state control at every point.

This shines when your workflow has branches, retries, parallel execution, or human-in-the-loop steps. More complex than simpler frameworks, but that complexity pays off for sophisticated systems.

Best for: Production systems with complex workflow logic.

3. Vercel AI SDK (TypeScript)

The default choice for TypeScript developers. Clean DX, provider-agnostic, swap models without touching frontend code.

Best for: TypeScript developers building AI applications. If you’re in this ecosystem, start here.

4. Google Agent Development Kit (ADK)

The first framework I’ve seen that takes polyglot teams seriously. Same mental model across four languages (Java, Go, TypeScript, Python).

The standout feature is observability. A built-in web UI (super useful) for traces, tool calls, and agent decisions. You can see exactly why your agent did what it did.

Best for: Teams on Google Cloud, or polyglot teams that want consistency across languages.

5. Spring AI (Java)

For Java shops, this is the natural choice. LLM services exposed through familiar Spring patterns.

The value isn’t innovation. It’s integration. If your team thinks in Spring terms, you add AI capabilities without learning a new paradigm.

Best for: Enterprise teams with Spring Boot microservices adding AI incrementally.

Final Thoughts

Frameworks aren’t the enemy. Neither are they required.

Start with the SDK. Understand how the primitives work: tool calling, message formats, streaming. Build something simple. Feel where it gets painful.

Then, when you have a real problem (provider switching, team coordination, complex workflows) pick the framework that solves that specific problem.

The goal isn’t to avoid frameworks. It’s to use them deliberately. Know what you’re trading away and what you’re getting back.

SDK-first. Then frameworks when they earn it.

That’s how you stay out of framework hell.

Thanks for reading. Have an awesome week : )

P.S. If you want to go deeper on building AI systems without the hype, I run a community where we build agents hands-on: https://skool.com/aiengineer

From software engineer to AI engineer (the 2026 roadmap)

Owain Lewis — Wed, 31 Dec 2025 14:55:59 GMT

Hey friend 👋,

Breaking into AI Engineering can feel overwhelming.

New tools launch weekly. Tutorials assume you already know everything. Half the advice contradicts the other half. It’s hard to know where to start, or what actually matters versus what’s just noise.

Here’s the good news: it’s simpler than it looks.

AI Engineering is software engineering with LLMs. You’re not training models from scratch or doing research. You’re building products that use language models as one component among many.

AI Engineers spend most of their time on the same things great software engineers focus on: designing reliable systems, writing code, testing properly, and making sure things work in production. The LLM part is maybe 20% of the job. The other 80% is engineering.

This roadmap is the practical path through the noise.

Let’s dive in.

Stage 1: Programming and Architecture

You’re a software engineer first. Everything else builds on this.

Most AI problems are design and architecture problems - the same kind of problems good engineers have always solved. How do the pieces communicate? Where can things fail? Why is latency so high? How do we know if it’s working? How does data flow from input to output? These questions matter more in AI applications than most people realise.

Know when to use different types of databases. Have a mental model for how web services work. If terms like “stateless,” “caching,” or “message queue” are unfamiliar, spend time here before moving on.

One myth to bust early: you don’t need Python. It’s popular and has great ecosystem support, but AI systems are language agnostic. Java developers can use Spring AI and Google’s ADK. TypeScript developers have the Vercel AI SDK and great provider support, Use what you know. What matters is understanding fundamentals, not picking the “right” language.

Don’t skip this stage. People who jump straight to complex AI frameworks end up with impressive demos that fall apart when real users touch them.

Stage 2: Working With LLMs

Now you’re ready to add LLMs to your toolkit.

Start by deeply learning one provider’s API: OpenAI, Anthropic, or Google. Understand how authentication works, how to handle streaming responses, what happens when you hit rate limits, and how to implement proper retry logic. Know what tokens are and why they matter for both context limits and cost.

Think about production inference early. The API you prototype with isn’t always what you’ll use in production. OpenAI in production means Azure OpenAI. Gemini means Vertex AI. Claude is available on AWS Bedrock, Azure (via Microsoft Foundry), and GCP. Enterprise platforms offer better SLAs, compliance guarantees, and network control. The SDKs are similar but not identical: authentication differs, and some features lag behind. Know where your application will run before you’ve built too much to easily switch.

Master the core capabilities that modern LLMs offer:

Text generation: master prompting: writing instructions that get LLMs to do what you want. The simplest and most overlooked strategy for learning is to master meta-prompting (ask AI to refine and improve your prompts).

Structured outputs matter because software systems need predictable structure, not free-form text. Learn how to constrain model outputs to schemas your code can reliably parse.

Tool calling is how LLMs take actions in the world. The model doesn’t just generate text: it decides which function to call and with what arguments. Understand this at the API level before reaching for abstractions.

MCP (Model Context Protocol) is becoming the standard for connecting LLMs to external data and tools. Instead of writing custom integrations for every database or API, MCP provides a common protocol that works across providers. It’s the connective tissue between your model and everything it needs to access. Worth understanding even if you don’t adopt it immediately.

Multi-modal inputs are increasingly essential. Models like Gemini process images, audio, and documents natively. A customer support system can accept photos of broken products. Voice agents are getting traction. A research tool can process PDFs with charts and diagrams. If you’re only thinking text-in-text-out, you’re missing half of what’s possible.

Frameworks can help, but they’re not where you should start. The SDKs from OpenAI, Anthropic, and Google handle all of this directly. Learn what’s happening at the API level first. Build a simple loop that reasons and acts using tool calling. Once you genuinely understand what’s underneath, then evaluate whether a framework adds value for your specific use case.

Stage 3: RAG

Retrieval-Augmented Generation is how you give LLMs your specific knowledge: your documents, your data, your domain expertise (stuff the LLM wasn’t trained on).

The core idea is simple: instead of relying only on what the model learned during training, you retrieve relevant information and include it in the prompt. But RAG isn’t one specific technique. Vector databases are popular, but they’re just one strategy. Keyword search, hybrid approaches, and even simple file lookups all count. Pick what fits your use case.

Don’t overcomplicate RAG at the start. Most improvements come down to better retrieval: finding the right information - not from adding sophisticated components.

The difference between a demo and a production system comes down to three things: measuring retrieval quality (are you actually finding relevant content?), handling failure cases gracefully (what happens when nothing relevant exists?), and keeping your index fresh when source documents change.

Stage 4: System Design

There’s a spectrum of approaches for building AI applications. On one end: deterministic workflows with fixed sequences of steps where you control exactly what happens. In the middle: agentic workflows that add flexibility within boundaries you define. On the other end: autonomous agents that plan and execute with significant independence.

As a general rule: agents are less reliable/predictable but can handle open ended problems (where you don’t know the steps in advance).

Here’s an insight that’s important: LLM calls are slow. A single call can take seconds. A multi-step agent flow could take minutes. If you run these inside web requests, you’ll get timeouts and hanging UIs.

The fix is async patterns you’ve probably used before. Your API receives the task and puts a message on a queue. The user gets an immediate response. A worker picks up the job, does the LLM calls, stores the result when it’s done.

Celery, SQS, Temporal workflows - these are battle-tested tools with automatic retries and easy observability.

Stage 5: Observability and Testing

You can’t improve what you can’t see.

Every LLM call in your application should be traced. Capture the inputs, the outputs, latency, token usage, and cost. This isn’t optional for production systems. You need this data to debug issues, understand costs, and improve quality over time.

Tools like Langfuse and Braintrust make this easier. The important thing is visibility.

Testing LLM applications is different from traditional software. Outputs are non-deterministic: same prompt, different responses. A “good output” is often subjective. Your regular code should have normal unit tests with mocked LLM responses. For LLM behaviour itself, you need evaluation.

Build a dataset of test cases with inputs and some notion of good outputs (“can you define what good output means?”). Run your application against these regularly. Check that outputs contain required fields, or use another LLM to judge quality. Run evals on every significant change to catch regressions.

Evals are the only way to confidently modify prompts and logic without breaking things.

Stage 6: Deployment

Your application isn’t done until it’s running reliably for real users.

For hosting, you have options. Docker + VPS. Platform-as-a-Service providers like Render let you deploy without thinking about infrastructure: push your code and it runs. For more control, managed container services like AWS ECS or Google Cloud Run give you automatic scaling and health checks without managing servers directly.

Cloud infrastructure is no fun, but it’s often necessary in the real world. Keep things simple if you can. A PaaS that handles the boring stuff lets you focus on your actual application. Only move to more complex setups when you’ve genuinely outgrown the simple option.

Set up proper CI/CD so deployments are automated. Make it easy to roll back when something goes wrong. Have logs you can search, traces to see what agents are actually doing, dashboards that show what’s happening, and alerts for the important stuff.

Stage 7: Security and Compliance

Security is no longer an afterthought: it’s often the reason AI projects get killed.

Data privacy comes first. Don’t send customer data to third-party providers like OpenAI unless you know what you’re doing. Understand your data processing agreements. Know where your data is stored and who can access it. Many promising demos become liabilities the moment real customer information flows through them.

LLM applications have unique attack surfaces:

Prompt injection is where malicious inputs that try to override your system instructions and make the model do something unintended.

Data leakage happens when your model accidentally reveals sensitive information from its context.

Practical tools to protect your applications:

LLM Guard - scans inputs and outputs for prompt injections, toxic content, and data leakage.
Garak - command-line vulnerability scanner for LLMs from NVIDIA. Red-team your own application before someone else does.

Human-in-the-loop (HITL) isn’t just a UX pattern: it’s a security architecture. For high-stakes actions, require human approval before the system executes. This catches both model errors and successful attacks.

A note on local inference: Tools like Ollama let you run models locally, keeping data off third-party servers entirely. This sounds appealing for privacy, but running your own inference is a full-time job done properly. You’re now responsible for hardware, scaling, model updates, and performance optimisation. Use this as a last resort when compliance requirements leave no alternative: not as a default choice.

The pattern is straightforward: validate inputs before they reach your model, monitor outputs for sensitive data, require human approval for high-stakes actions, and log everything. This isn’t different from securing any other application: it’s just that the attack vectors are newer.

The Path Forward

Here’s what to remember:

You’re a software engineer first. The fundamentals matter more than the frameworks.
Learn the APIs directly. Understand what’s underneath before reaching for abstractions.
Start simple. Boring patterns win. Frameworks aren’t necessary.
Think about production early. Where you deploy matters. So does security.
Build things. A GitHub full of working projects beats any credential.

The path to AI Engineering is simpler than it looks. You don’t need every framework. You don’t need to chase every new tool. You need solid engineering fundamentals, direct experience with LLM APIs, and the judgment to keep things as simple as possible.

The demand for people who can build reliable AI applications far exceeds the supply. If you put in the work to develop genuine skills, you’ll find plenty of opportunities.

Now go build something.

Thanks for reading. Have an awesome week : )

P.S. If you want to go deeper on building AI systems, I run a community where we build agents hands-on: https://skool.com/aiengineer

The 10x skill for AI engineers in 2026: agent feedback loops

Owain Lewis — Sun, 28 Dec 2025 15:09:05 GMT

Hey there 👋,

Here’s a truth: no one can build software without feedback.

Engineers don’t one-shot code. We make mistakes on the first try. Syntax errors, wrong variable names, off-by-one bugs. But we run the code and use that feedback to self-correct. Red squiggles in the IDE. Stack traces in the terminal. Failing tests. We fix, run again, iterate until it works.

This is so fundamental we take it for granted. The feedback loop is the process.

Agents are the same way.

Right now, most engineers treat agents like they should do something we’ve never done: write working code on the first try without running it. When the agent fails (code doesn’t work), we call it “hallucination.” But imagine trying to write code without having the ability to run tests or run the app to verify it’s working.

It’s not a reasoning problem. It’s a visibility problem.

The Problem: You’ve Become the Loop

Watch what happens in most agent workflows:

The Manual Loop (Slow): Agent → You → Terminal → You → Copy/Paste → Agent

You’re the feedback loop. The agent generates code in seconds, but you take minutes to close each iteration.

You’ve become the slowest part of the system.

The Closed Loop (Fast): Agent ↔ Terminal

Agents Are Brilliant But Blind

Here’s the mental model that changed how I work with agents:

Before every task, ask: what can my agent actually see?

Each session starts fresh. No memory of your codebase. No context from yesterday. The agent only knows what’s in its context window right now.

If it can’t see the error, it can’t fix the error. If it can’t see the test output, it doesn’t know something is broken. If it can’t see the logs, it can’t debug the integration.

There are three types of feedback agents need:

Execution output. Stack traces with line numbers. The agent needs to run the code and see what happens.

Test results. Specific assertion failures. “Expected 200, got 401” is actionable. “Tests failed” is noise.

System logs. API responses, container logs. For anything with dependencies, the bug is often in the integration.

If you can’t debug it with the information available, neither can the agent.

The Fix: CLAUDE.md

Claude Code reads a file called CLAUDE.md from your project root at the start of every conversation. This is where you tell the agent to verify its own work.

The same concept can be found in other coding agents like OpenCode.

# Development Process

## After code changes:
1. Run `uv run pytest` - all tests must pass
2. Run `ruff check . --fix` - fix linting issues

## Rules:
- Do NOT ask me to run tests. Run them yourself.
- If tests fail, read the output, fix, re-run.
- Provide a summary including the files you've changed and test results

This is the first step towards building a closed loop workflow. Agent writes code and then it verifies it’s own work.

Commands for Heavy Workflows

Tests run fast. But spinning up servers, running E2E suites, checking logs? Make those on-demand commands.

Create .claude/commands/e2e.md:

# End To End Test

Test the endpoint: $ARGUMENTS

1. Run `./scripts/e2e.sh $ARGUMENTS`
2. If it fails, read the logs, fix, test again.

Keep heavy verification logic in shell scripts. The command just tells the agent what to run and what to do when it fails.

The Shift

Every time you run tests and copy output back, start a server and paste the error, check logs to see what went wrong, you’re doing work the agent should do.

The agent can run commands. The agent can read output. The agent can iterate. You just have to tell it what “working” looks like.

Build feedback loops for your coding agents.

Thanks for reading.

Have an awesome week : )

P.S. If you want to go deeper on building professional AI systems, I run a community where we do this hands-on: https://skool.com/aiengineer

The AI design pattern playbook

Owain Lewis — Sun, 21 Dec 2025 13:12:55 GMT

Hey there 👋,

When it comes to designing AI systems, it helps to have a high-level view of what patterns are available. You don’t want to reinvent the wheel every time you start a new project.

There are two typical ways to structure LLM applications: workflows and agents.

These aren’t mutually exclusive. A workflow is a graph of steps you define upfront (a → b → c). You control what happens and in what order. An agent is where the LLM controls the flow. It decides what steps to take based on results.

Agents are powerful when you don’t know ahead of time what work needs to be done. Deep research is an example - you can’t predict what searches you’ll need or what rabbit holes matter until you start exploring.

Before we dive in, one principle worth internalising: LLM calls are expensive. Not just in cost, but in latency. Every call you add is another round trip, another few seconds of wait time. Everything in this guide is a trade-off between capability and speed. Always ask: do I really need another LLM call here, or can code handle it?

This is a reference for the patterns I find most useful.

Start Here: The Single LLM Call

This sounds obvious, but it’s worth stating: you can go a long way with a single, well-crafted LLM call.

CV parsing. Email drafting. Structured data extraction. Classification. Summarisation. A single call with a good prompt handles all of these.

Before reaching for frameworks or agents, ask yourself: can one prompt do the job? Often the answer is yes. And when it is, you get the fastest possible response time and the lowest possible cost.

The rest of this guide is for when a single call isn’t enough.

Workflow Patterns

You define the graph. The LLM is one step among many - composed with code, API calls, database operations.

1. Chain

Sequential steps where each builds on the previous.

This is the most common pattern. Do a, then b, then c.

Not every step needs to be an LLM call. You can mix LLM calls with deterministic code. Every LLM call you can replace with deterministic code is latency saved.

2. Parallel

Run independent operations simultaneously.

Image generation is the classic example because it’s notoriously slow and each image generation can be parallelised.

If the operations can be done in parallel, it’s a no brainer. Use a semaphore to avoid hitting rate limits.

sem = asyncio.Semaphore(5)

async def generate_image(prompt: str) -> bytes:
    async with sem:
        return await image_model(prompt)

images = await asyncio.gather(
    generate_image("generate an image of ..."),
    generate_image("generate an image of ..."),
    generate_image("generate an image of ..."),
    ...
)

3. Route

This is my favourite pattern. Classify first, then dispatch to specialised handlers.

Different inputs need different treatment. Billing questions need account context. Technical questions need documentation RAG. Sales inquiries need a human.

One prompt can’t handle all of this well. Classify first, then route to the appropriate subsystem. Each branch can have its own prompts, tools, even models.

Use a fast, cheap model (or code) for classification. Save the expensive model for the actual work.

4. Map-Reduce

Process many items, then synthesise.

Due diligence across 50 contracts. Research synthesis across 20 papers. Log analysis across gigabytes of files. No way any of this fits in one context window.

The map phase splits out the work. The reduce phase is where you lose information, so be deliberate about it. For critical details, keep structured data rather than summarising to prose too early.

5. Orchestrator-Workers

When you can’t predict the subtasks upfront, but you still want workflow-level control. This is similar to map reduce but with an LLM dynamically making a plan.

Example: “Add authentication to the app.” Which files need to change? You don’t know until you analyse the codebase.

The orchestrator examines the task and spawns workers (sub agents) dynamically. This is like parallel, but the LLM decides at runtime what workers are needed and what each should do.

6. Evaluate-Refine

Generate, check, improve. Loop until good enough. Most of us do this manually when working with LLMs. We ask a question. Ask for improvements. Keep iterating until done.

Example: generate a blog post, get feedback from an LLM, improve it based on that feedback.

The evaluator doesn’t have to be an LLM. Code checks are often better. Run the tests. Validate the schema. Lint the output. Deterministic evaluation is faster, cheaper, and more reliable.

7. Fallback

Try cheap first. Escalate when needed.

Most requests are straightforward. “What are your opening hours?” doesn’t need a frontier model.

Route everything through a fast model first. When confidence is low, escalate to something more powerful. This can cut costs dramatically while maintaining quality where it matters.

The trick is reliable confidence detection.

Agent Patterns

Unlike a workflow, an agent makes decision about control flow. The LLM decides what to do next. You provide instructions, tools and set boundaries. The agent decides what tools to call.

8. Tool Loop

The core agent pattern. Call tools until the task is done.

Claude Code is a good example. Read file, make change, run tests, see error, fix error, run tests again. The model decides what to do based on what it observes.

The critical mechanism is feedback. Model writes buggy code - sees the stack trace - fixes it. API returns an error - model adjusts parameters. Without feeding errors back to the model, agents can’t easily self correct.

The obvious downside of agents is loss of control and less predictability in exchange for more power.

9. Plan-Execute

Separate thinking from doing.

Deep research is the canonical use case. You can’t know upfront what searches you’ll need, what sources matter, what rabbit holes are worth exploring. The agent plans an investigation, executes steps, and replans as it learns.

10. Human-in-the-Loop

Pause for approval on high-stakes actions. For a simple terminal agent, this is a trivial if statement:

for tool_call in response.tool_calls:
    if requires_approval(tool_call):
        approved = input(f"Execute {tool_call.name}? (y/n): ")
        if approved != "y":
            continue
    
    result = execute(tool_call)

That’s really all it is. Before executing sensitive operations - sending emails, making payments, deleting data - ask first.

Auto-approve small refunds. Human review for large ones. Auto-send routine confirmations. Human review for anything sensitive.

This is how you build trust in a new system. Start with humans approving everything. Track what gets approved versus rejected. Identify patterns. Automate the safe categories. Keep humans on the edge cases.

Over time, the system earns more autonomy.

Choosing Between Them

Start with the simplest option: a single LLM call. Only add complexity when you have a clear reason.

Default to workflows over agents. Workflows are predictable, debuggable, and easier to reason about. You know exactly what will happen because you defined the graph.

Reach for agents when you don’t know the steps upfront. Research, coding, customer questions. The flexibility is worth the unpredictability.

Every pattern is a trade-off. The best AI systems aren’t the most sophisticated - they’re the simplest thing that solves the problem.

Thanks for reading. Have an awesome week : )

P.S. If you want to go deeper on building AI systems, I run a community where we build these patterns hands-on: https://skool.com/aiengineer

My 8 principles for agentic coding

Owain Lewis — Sun, 14 Dec 2025 16:34:05 GMT

Hey friend,

I’ve been coding for 20 years. I’ve written code in Scala, OCaml, Lisp, Ruby, Python, Java; I genuinely love the craft. But lately, I’ve had to rethink what that craft actually is.

AI coding was phase one - AI helps you write code faster, but you’re still driving every decision. Useful, but limited.

We’re in phase two now. Agents (like Claude Code) that plan, execute, test, deliver. You review what comes back. The job is moving up the stack: building systems that write better code than you could write yourself.

The shift from AI coding to agentic coding is a shift in identity.

You’re not a developer who uses AI tools. You’re an engineer who builds systems that build software.

Here are the 8 principles that finally made it click for me.

1. Your Agent Is Capable But Contextless

Agents can read codebases, edit files, run terminal commands. They’re genuinely capable.

But every session starts empty. No memory of your architecture. No understanding of your conventions. No awareness of what you tried yesterday.

Before getting frustrated that “AI sucks”, ask yourself: does the agent have everything it needs to succeed without me? If yes, let it run. If no, either provide the missing context or plan to stay in the loop.

2. Plan First, Execute Second

This is probably the biggest unlock: don’t ask the same agent to plan the work and do the work.

When you tell an agent to “build feature X,” it tends to rush. It makes assumptions. It starts coding immediately and wanders.

Instead, use a two-step process:

Plan: Ask an agent to analyze the problem and write a plan. Review it. Iterate until it meets your standards.
Execute: Start a fresh session. The new agent reads the approved plan and executes with focus.

The pattern: Plan → Review → Execute (new terminal session) → Ship.

Here’s an example of Claude Code command to generate a detailed plan from a high level instruction.

> /plan create a new python project for a rag agent using postgresql and hybrid search.

Iterate on the plan

> Update the plan to use uv for package management. Use Gemini as the model. Skip frameworks and use the SDK directly.

Match spec weight to task weight. Most tasks are small. The spec should be too.

3. Match Your Involvement to the Task

Not every task needs the same level of attention.

Low Ambiguity: Writing a unit test, adding a standard endpoint, fixing a lint error. Hand these off completely and go get coffee.
High Ambiguity: Designing an auth system, refactoring core abstractions, debugging race conditions. Keep these in the loop; treat it like pair programming.

The goal isn’t maximum autonomy everywhere. It’s spending your attention where it matters.

4. The SDLC Still Applies

The software development lifecycle didn’t disappear. It just runs differently with agents. Plan, code, test, review, document: agents can handle all five phases if you set them up for it.

The mistake I see: skipping straight to code. When agents plan first, they execute better. When they test their own work, they self-correct.

5. Stack Your Leverage Points

A few things multiply your agent’s effectiveness:

Context files. A README, a conventions doc, an AGENTS/CLAUDE.md. If it exists in a file, you don’t have to explain it in prompts.

Runnable tests. This is the highest leverage thing you can provide. If an agent can run tests and see green or red, it validates its own work without waiting for you.

Concrete plans. A good plan includes verification steps. When “done” is defined by a passing test, the agent knows when to stop.

Reusable workflows. Solve a process once, save it. Planning, testing, shipping-each becomes something you invoke rather than explain.

6. Write for Agents, Not Humans

Documentation for humans assumes shared context. Documentation for agents needs to be blunt and assume nothing. Provide commands to run. Don’t be vague. Define success criteria concretely.

7. Encode Your Workflows

If you’re typing a long, complex prompt more than twice, save it.

Here’s an example: a workflow I use for creating pull requests:

Check branch name, recent commits, and changed files
Think hard about what this change accomplishes and why
Write a PR title and concise body with summary, changes, and testing notes
Push the branch, run gh pr create, return the URL

I type one command, the agent handles everything. Five minutes to write, hundreds of uses afterward.

Pair a planning workflow with an execution workflow and you have a system: Plan → Build → Ship.

8. Measure Your Progress

How do you know you’re getting better at this?

Longer autonomous runs: The agent works for 10 minutes without asking a clarifying question.
Fewer iteration cycles: Tasks complete in 1-2 rounds instead of 5-6.
Higher first-try success: Tests pass without manual fixes more often.

When the agent gets stuck, don’t just fix the code-fix the workflow so it won’t get stuck there next time. That’s how you improve a system.

Thanks for reading.

AI code has no taste

Owain Lewis — Thu, 04 Dec 2025 17:05:19 GMT

Hey friend 👋,

AI code has no taste.

It doesn’t know why you avoided that dependency. Why you chose that abstraction. Why the naming convention matters. It produces functioning code with no opinion - code that passes review but that nobody wants to maintain.

Most people see this as AI’s limitation. I see it as an invitation.

Because AI code has no taste of its own. It’s waiting for yours.

The rules you’ve built up over years of engineering - the heuristics that live in your head during code review, the patterns you enforce without thinking - they can become explicit. They can become instructions or SOPs.

And once they do, something interesting happens.

Your taste scales.

The Shift

There’s a transition happening that most engineers haven’t named yet.

Most of us are still “in the loop.” Prompting back and forth. Reviewing every output. Tweaking, fixing, regenerating. The AI codes faster, but you’re still the bottleneck.

The shift is stepping out of the loop entirely.

You stop doing the work. You start building systems that do the work - better than you could do it yourself, at a scale you could never sustain.

Not “AI as assistant.” Not autocomplete on steroids. Something more fundamental: you encode your taste, your standards, your judgment into a system. Then you let it run.

The craft doesn’t disappear. It moves. Out of the code, into the rules that shape the code. Out of the output, into the system that produces the output.

This is a different job. And it requires an uncomfortable question.

The Question Most People Aren’t Asking

When I was managing software teams, I saw the same pattern constantly.

A team gets stuck on some tedious process. They brainstorm improvements. They shave 10% off the time. Everyone feels productive.

But that’s not strategy. That’s minor optimisation.

The real question was always simpler: “Why are we doing this work at all?”

I see the same thing with AI now. Most engineers ask: “How can AI help me do my old work faster?”

Fine question. But you’re still in the loop. Thinking small. The constraint isn’t the model. It’s not the tools. It’s you - still reviewing everything, thinking “only I can write this code well”, still the ceiling on what gets shipped.

The better question: How do I use AI to do previously impossible work - at a quality level that reflects or exceeds my standards?

Not “code faster.” Build something that couldn’t exist without the system.

Let me make this concrete.

What This Looks Like

A few weeks ago I built a system that generates ambient electronic mixes - focus music for coding. It produces about 2 hours of original music daily, runs a 24/7 YouTube livestream, and keeps going without me.

I could not do this manually. Not “it would take a long time” - I literally could not sustain this output.

But I didn’t just prompt “generate lo-fi beats” and walk away. That would produce garbage.

I built filter chains to mix audio to my taste - EQ, compression, analog warmth. I listened to hundreds of outputs to tweak parameters. I encoded my standards into every part of the pipeline.

The craft didn’t go into each track. The craft went into the system.

Same principle applies to code. You don’t review every line the agent writes. You encode your standards into the rules it follows - architectural preferences, naming conventions, the stuff you’d flag in PR review. The agent becomes the system. Your taste becomes the instructions.

You build the system that builds the system.

So how do you know if you’ve actually done this, or just built another automation?

Two Tests

The Impossible Test. Could a motivated human do this sustainably - without burning out or cutting corners?

Scheduling a cron job? Automation. An agent that monitors your on-call queue, root-causes incidents, and pushes a fix before you’ve opened your laptop? No human can do that.

Linting code? Automation. An agent that reviews every PR against your team’s architectural principles, catches subtle violations, and explains why it flagged them - across 50 PRs a day, without getting tired or sloppy? Impossible.

The Craft Test. Does the output reflect the taste of whoever built it?

If a different person built this system, would the results be different? If the answer is no, there’s no craft encoded. Just a generic pipeline anyone could spin up.

Scale without craft produces slop. Craft without scale means you’re still in the loop, doing everything yourself.

Scale plus craft is the unlock.

But passing both tests once is easy. Keeping quality high over time - that’s the hard part.

Closed Loops

Every system drifts toward garbage. Entropy always wins unless you fight it.

When you’re coding manually, you are the feedback loop. You notice when something’s off. In a system that runs without you, that loop has to be built in.

Three components:

The Generator produces output - the LLM, the agent, the pipeline.

The Sensor measures quality (evals). Did tests pass? Does it follow conventions? Did latency spike?

The Controller enforces your standards. It rejects bad outputs, adjusts parameters, decides what ships and what gets thrown away.

The craft lives in the Controller. That’s where you encode “good enough” vs “not good enough.”

For code: the Sensor runs your test suite, reviews for coherence across the codebase, checks for simplifications, flags complexity. The Controller decides whether to retry, refactor, or escalate to a human.

Without this loop, quality drifts. Always. This is where most “AI slop” comes from - not bad models, but missing feedback loops.

Build This

Want to try it? Start small.

Pick something repetitive - code review, documentation, test generation. Something you do weekly and see if you can encode it as a system that does the work without you. Before you build the pipeline:

Write down your standards. What makes output “good” vs “acceptable” vs “garbage”?
Build a sensor. How will you measure whether output meets those standards automatically?
Build a controller. What happens when it fails? Retry? Adjust? Flag for review?

You’ll learn more about your own taste by trying to encode it than you ever did by just doing the work yourself.

The New Job

Here’s the thing most engineers haven’t internalised yet:

This isn’t a productivity hack. It’s a different job.

You stop doing the work. You start building systems that do the work - better than you could do it yourself, at a scale you could never sustain.

The engineers who figure this out first will build things the rest of us can’t compete with. Not because they’re smarter. Because they stepped out of the loop.

Different question. Different results.

Have an awesome week :)

Want to build these systems alongside other engineers doing the same? I run a community where we work on real projects and share what’s actually working.

👉 Join the AI Engineering Community

4 context engineering strategies every AI engineer needs to know

Owain Lewis — Thu, 27 Nov 2025 17:01:45 GMT

Hey friend 👋,

A few months ago, I was building an AI agent to help engineers debug production issues. The idea was simple: pull logs from multiple sources, find patterns, and explain what went wrong.

“Search the logs and tell me why this alert fired.”

The agent would come back with something like:

“At 14:32 UTC, the checkout service started returning 503 errors. The root cause was the Redis cache hitting memory limits. The issue self-resolved at 14:47.”

Incredible, right?

Except it didn’t work.

The log data was massive and noisy. Within a few conversational turns, I’d maxed out the context window. The agent couldn’t keep all those log outputs in memory. It would start strong, then eventually fail or hallucinate.

The solution wasn’t to switch models or add more data. It was to rethink the context management strategy.

What Is Context Engineering?

When you talk to an AI model, it sees more than just your prompts. Your instructions, the conversation so far, tool call results, documents-all of it sits in this window together.

Andrej Karpathy has a useful mental model for this: the LLM is the CPU, and the context window is the RAM. It’s the model’s working memory. Everything has to fit there.

But it’s not just about overflow. Even before you hit the limit, models suffer from “context rot”-performance degrades as more tokens are added, even within the window size.

Think about finding one important note on a desk. Easy with 10 papers. Hard with 1,000. The note is still there-but good luck finding it.

Drew Breunig outlined four ways bad context breaks your agent:

Context Poisoning: A hallucination enters context and corrupts all future reasoning
Context Distraction: Too much context overwhelms the model
Context Confusion: Irrelevant information influences responses
Context Clash: Different parts of the context contradict each other

If your agent works at first then drifts later, one of these is usually why.

Why This Matters For Agents

Here’s the thing that makes this click: LLMs are stateless.

They don’t “remember” anything between calls. Every time you call the model, you pass in the entire conversation history via an API call.

→ User asks a question (20 tokens)
→ Assistant decides to call a tool (50 tokens)
→ Tool returns results (2,000 tokens)
→ Assistant reasons about the results (100 tokens)
→ ...repeat 50 times...

Eventually, you’re passing hundreds of thousands of tokens just to generate the next sentence.

This is the context engineering problem.

The Four Strategies

So how do you actually manage context? There are four main strategies. We’ll use Claude Code (a terminal AI agent) as a reference because it uses all of these.

1. Write (External Memory)

Don’t keep everything in context. Have your agent write important stuff somewhere external.

Claude Code writes its plans to disk. It also uses a TodoWrite tool to persist task state. When debugging a complex issue across 15 files, instead of holding “fixed auth.ts, need to check db.ts, then run tests” in context, it writes each step to a structured todo list. The todos live outside the window-the agent references them when needed, not constantly.

Cursor and Windsurf use rules files. ChatGPT saves memories across sessions. Same idea: give your agent a write_to_scratch tool that writes findings and plans to a file. Those notes don’t cost attention until the agent pulls them back in.

2. Select (Just-in-Time Retrieval)

Some people dump all docs and tools into context upfront. Don’t do this.

Claude Code never reads an entire codebase upfront. It uses Glob to find file paths matching a pattern (e.g., **/*.ts), Grep to locate specific code references, then Read to pull in only the relevant file. A question like “where is authentication handled?” triggers a targeted search-not a 50-file dump into context.

Keep references instead (file paths, database queries). When the agent needs the content, it loads it then. And if you have 50 tools, the model parses 50 descriptions every turn-keep your toolset minimal or dynamically load definitions based on the task.

Claude Skills was a recent feature that used this approach - it read’s a description of the tools - not the “entire” tool definition.

Here’s the algorithm:

Give agent a compressed summary of the tools (“Use this tool if the user asks about LinkedIn posts”)
Read the full tool description dynamically if the agent thinks it’s needed

3. Compress and Prune

Even with a 200k token window, a messy context leads to bad answers.

Summarization: If you’ve used Claude Code, you’ve seen this. When the window fills, it summarizes the conversation-preserving architectural decisions but dropping the exploration that led there.

Context editing (pruning): Sometimes you don’t need a summary. You just need to delete. Anthropic found that simply removing stale tool outputs reduced token usage by 84% on long-running tasks.

Did the agent run a ls -la command 10 turns ago? Delete the output. The model already used that info.
Did a tool return 5,000 lines of logs? Summarize it to “Found 847 errors, 92% were Redis timeouts: org.redis.client.RedisTimeoutException: Redis server response timeout (3000 ms) occured for command: (GET),” then delete the raw data.

4. Isolate (Multi-Agent Systems)

This is my favourite technique for complex tasks. Instead of one agent drowning in context, split the work.

Claude Code spawns specialized agents by type: Explore for codebase navigation, Plan for architecture decisions, claude-code-guide for documentation lookup. Each operates in its own context window. If the user asks “how does billing work?” and “what’s in the docs about webhooks?”-two agents run in parallel, each with fresh context, returning focused summaries to the main conversation.

When delegating to a sub-agent, the prompt is compressed: “Find all API endpoints that modify user data” rather than passing the full conversation history. The sub-agent explores freely, then returns a summary. The orchestrator never sees the 30 files the sub-agent read-just the 500-token answer.

Uses more total tokens. Gets better results.

TL;DR

Managing context is important when building long running AI agents. The window quickly fills up.

The counterintuitive thing about context windows is that bigger doesn’t always mean better. A 200k window full of noise performs worse than a 20k window with exactly what matters. Context engineering isn’t about cramming more in. It’s about curating what the model sees.

How to solve this:

Write: Save state to external files.
Select: Load data only when needed.
Compress: Summarise history and delete stale tool outputs.
Isolate: Use sub-agents to encapsulate high-token tasks.

Back to my logs agent: the fix was combining a few of these strategies together. What felt like a model limitation was actually a context engineering problem.

Thanks for reading.

Have an awesome week : )

P.S. If you want to go deeper on building AI systems, I run a community where we build agents hands-on: https://skool.com/aiengineer

How to build AI RAG agents with the new gemini file tool.

Owain Lewis — Tue, 18 Nov 2025 17:17:09 GMT

Hey there 👋,

Infrastructure complexity sucks.

As someone who’s spent years building public cloud services, I still find it frustrating how much time we waste on undifferentiated infrastructure setup.

You’ve probably been here before:

You want to build a simple RAG agent over your company docs. Suddenly you’re drowning in vector database setup (which one should I use?), writing complex embedding pipelines, spending weeks on undifferentiated infrastructure and wondering: why is this so hard?

For production systems with complex needs, this overhead makes sense. But for freelance projects, prototypes, and MVPs? It’s often a huge time sink.

Google recently shipped a feature that eliminates most of the headaches around basic RAG: the File Search Tool in the Gemini API.

I just built a customer support agent using it. The whole thing was a few lines of Python. No vector database. No embedding pipeline. No chunking logic.

Let me show you exactly how it works.

What Is Gemini File Search Tool?

Gemini File Search Tool handles the entire RAG pipeline through a simple API:

Create a store → Upload documents → Start querying. That’s it.

Behind the scenes, Google handles document parsing, automatic chunking with configurable overlap, semantic search using gemini-embedding-001, and citation extraction with grounding metadata.

Everything that used to take weeks of infrastructure work now takes one API call.

Building a Customer Support Agent (Step by Step)

Let me walk you through building a real FAQ agent using actual code from the Google documentation. This kind of setup has a lot of value to businesses (for example allowing employees to get instant answers to their questions rather than waiting for a human to respond).

Step 1: Create A File Store

The actual API is remarkably simple

Step 2: Upload Your Docs

We list all files in our docs directory and upload them to the file store. That’s it. Your documents are now chunked, embedded, indexed, and ready to query.

Optionally, you can control the chunking if you need to.

Step 3: Create Your Agent

You can simply use the file store as a tool when working with Gemini. This makes it incredibly easy. If you’re using another model or framework, you could wrap the API and provide it as a regular tool. Here’s the most minimal example of a RAG agent I could come up with.

This is basically all you need to build RAG agent in Gemini.

Final Thoughts

Google priced this aggressively. You only pay for one off costs when uploading your docs.

File Search removes the infrastructure tax from basic RAG. It lets you validate your idea in hours instead of weeks.

If you want the complete code you get get it all here for free.

Thanks for reading.

Have an awesome week : )

P.S. If you’re tired of learning this stuff alone, I run a community where ambitious software engineers master production AI by building real working projects together. The fastest path to master AI engineering: https://skool.com/aiengineer

The AI Engineer

I let AI agents run my entire GitHub workflow

Give the agent one tool

Keep the work on a board, not in markdown files

Plan first, then break it into tickets

Make the agent review its own work

Let the agent run the review cycle

Automate the boring infrastructure too

More ways to put the agent to work

Where the leverage is

The 7 skills I actually use every day with AI coding agents

Managing Skills

1. spec. Get clear before you write code

2. plan. Break the work into shippable chunks

3. explain-visually. Turn anything into a rich HTML explainer

4. clarify. For when you’re rambling

5. address-pr-feedback. My favourite of the seven

6. refactor. Always do a second pass

7. design-doc. Think clearly before you build

Get The Skills Free

Is Pi better than Claude Code?

How extensions work

What the minimalism costs you

Where I think this fits

Six RAG strategies, explained simply (with code).

1. Document loading

2. Full text search

3. Vector search

4. Hybrid search

5. SQL RAG (Database RAG)

6. Agentic RAG

How to choose

How I Use AI To Review AI Code

Layer 1: Automate The Obvious

Layer 2: Agent Review

Layer 3: External Review

Layer 4: Human Review

TL;DR

How I delegate work to a team of AI agents

Agent worker

Pull vs push architecture

Deterministic guardrails around non-deterministic agents

Automated code review

Scaling agents

The 7 stages of building software with AI (with prompts you can steal)

Why planning still matters, even when building is fast

Breaking work down is where most people go wrong

Review is the step that changed my workflow

Deploy and monitor

What I’ve learned after a year with Claude Code

How I'm using OpenAI Codex automations to improve my code

The Setup

Quality

Memory

What to Automate

Summary

Claude Code agent teams explained

Subagents vs Agent Teams

When To Use This

C Compiler

My Experience

Tradeoffs

Where Is This Heading?

Your agent workflow doesn't scale (here's the fix)

The Setup

Connect Claude Code to Linear

Encode the Workflow in CLAUDE.md

Write Good Tickets

Start Assigning

How I Decide What to Hand Off

Summary

How I code with AI agents (spec-driven development)

What Is Spec-Driven Development?

The Three Documents (and Why They’re Different)

Anatomy of a Good Spec

When Specs Go Wrong

My Workflow

Skip the Frameworks?

Why This Creates Leverage

Final Thoughts

1. `spec`. Get clear before you write code

2. `plan`. Break the work into shippable chunks

3. `explain-visually`. Turn anything into a rich HTML explainer

4. `clarify`. For when you’re rambling

5. `address-pr-feedback`. My favourite of the seven

6. `refactor`. Always do a second pass

7. `design-doc`. Think clearly before you build