The 3-month gap between building an AI agent and it being useful nearly broke me.
That’s the stretch nobody posts about. I’ve seen a hundred Twitter threads about building agents. Maybe three honest ones about what happens after you ship v1 and start using the thing for real. The 3-month gap in building AI agent systems - from “it demos” to “it actually works without me babysitting it” - that’s the real project. And it’s almost nothing like the first 30 minutes.
I run a multi-agent system called borb on OpenClaw. It handles my daily workflow - reminders, research, content scheduling, code review, memory management. Fifteen-plus specialized agents running different models depending on the task. It’s stable now. It does real work every day without me touching it.
It took three months to get there. Here’s what that actually looked like.
The 3-month gap between building an AI agent and it being useful
Week one, everything works great. The demo is clean. The happy path is flawless. You show it to someone and they’re impressed. You’re impressed. You start thinking about what else you can build.
Then you try to use it for something real.
The model hallucinates a tool call that doesn’t exist in your registry. The agent enters a loop - calling the same function twelve times, each time getting an error, each time deciding to try again. A cron job fires at 3am on a Wednesday and the API it needs returns a response format you’ve never seen before. The agent handles it confidently and incorrectly. You find out four hours later.
This isn’t bad luck. This is what agents do. A system that needs to operate autonomously across real APIs, real data, real time zones, real rate limits - it’s going to hit edge cases constantly. The demo doesn’t hit edge cases because demos are choreographed. Production use is chaos.
The first two weeks, I was convinced I had a fundamentally broken architecture. I didn’t. I just had a system that hadn’t been stress-tested against reality yet. There’s a difference.
Weeks 1-4: You don’t know what you don’t know
The early failures are the visible ones. The agent tries to call a tool with the wrong parameter format. Easy fix. The system prompt is ambiguous about which memory file to write to. Easy fix. Auth token expires overnight and everything fails silently. Annoying fix, but simple.
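The tool-call class of failures is mechanical enough to guard against in code: validate every model-proposed call against the registry before executing it, and feed the mismatch back to the model instead of letting it run. A minimal sketch of that guard - the tool names and schema format here are hypothetical, not borb’s actual registry:

```python
# Validate a model-proposed tool call against a registry before executing it.
# Tool names and required-argument schemas are illustrative placeholders.

TOOL_REGISTRY = {
    "write_memory": {"required": {"file", "content"}},
    "schedule_task": {"required": {"task", "when"}},
}

def validate_tool_call(name, args):
    """Return None if the call is valid, else an error message to feed back to the model."""
    spec = TOOL_REGISTRY.get(name)
    if spec is None:
        # Hallucinated tool - tell the model what actually exists.
        return f"Unknown tool '{name}'. Available: {sorted(TOOL_REGISTRY)}"
    missing = spec["required"] - set(args)
    if missing:
        return f"Tool '{name}' missing required args: {sorted(missing)}"
    return None
```

Returning the error as text rather than raising means the agent loop can hand it straight back to the model for a corrected attempt.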
You add error handling. You feel productive. You think you’re almost there.
You’re not almost there. You’ve patched the surface layer. The deep problems haven’t shown up yet because the deep problems only appear when conditions stack in ways you didn’t anticipate.
What I didn’t understand in week one: error handling for an AI agent isn’t like error handling for deterministic code. With regular code, you can enumerate the failure modes and write a case for each. With an agent, the failure mode is sometimes “the model made a creative decision.” There’s no exception class for that. The agent didn’t crash. It just did something you didn’t want, then moved on.
This maps to what Anthropic calls “reward hacking” in their model specification documentation - agents optimizing for what looks like task completion without actually completing the task. It shows up constantly in production.
I had an agent that was supposed to add a task to my task file when I asked it to remember something. For about ten days, it was adding the task correctly but also occasionally appending a brief philosophical reflection on the importance of the task. Not always. Maybe 15% of the time. The task file worked fine. It was just… weird. I only noticed when I went back and read through it.
That’s the thing about AI agents failing quietly. The output is often plausible enough that you don’t immediately catch it.
Weeks 5-8: The edge cases get stranger
By week five, the obvious stuff is handled. The agent runs reliably on the happy path. You start feeling good about it.

Then the 3am edge cases start appearing.
An API returns a field as null instead of an empty string. A rate limit hits mid-task and the agent, instead of waiting, decides to rephrase the request and retry with a different tool. A time zone calculation goes wrong because of daylight saving time and a scheduled task fires an hour early. The agent interprets “do this daily” as “do this every time you’re invoked” and runs a daily summary five times in a row on a Tuesday.
None of these are predictable from first principles. You can’t design your way out of them upfront. You discover them by running the system and watching what happens.
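Once discovered, though, some of them have clean fixes. The DST bug, for instance, comes from doing schedule math in naive local time. A sketch of a DST-safe “daily at 9am” computation using Python’s zoneinfo (assumes Python 3.9+; the time zone is just an example, not necessarily mine):

```python
# Compute the next "daily at HH:MM local time" fire time in a DST-safe way:
# do the wall-clock arithmetic in the local zone, then convert to UTC for the timer.
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def next_daily_fire(hour, minute, tz_name="America/New_York"):
    """Next occurrence of hour:minute wall-clock time in tz_name, as a UTC datetime."""
    tz = ZoneInfo(tz_name)
    now = datetime.now(tz)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    # astimezone applies the correct UTC offset for the *resulting* wall time,
    # so a DST transition between now and the fire time doesn't shift it an hour.
    return candidate.astimezone(timezone.utc)
```

The point is to keep the schedule in local wall-clock terms and only convert to UTC at the last moment, instead of storing a fixed UTC offset that goes stale twice a year.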
My logging setup saved me here. I log everything - every tool call, every model response, every function output, timestamps on all of it. At 2am when something breaks weird, the logs are the only reason I can figure out what happened. If you’re building agents and you’re not logging obsessively, you will spend hours debugging by vibes instead of evidence. The OpenTelemetry docs on structured logging are worth a read if you want a sane approach to this - I adapted their structured format for borb’s log output and it made parsing way easier.
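The shape of that logging can be as simple as one JSON object per line, so the file is both grep-able and machine-parseable. A sketch of what I mean - the field names are my own convention, not borb’s exact schema or an OpenTelemetry-compliant format:

```python
# Structured JSON-lines event log for agent activity: one object per line,
# timestamped, with a trace_id so related events can be correlated later.
import json
import time
import uuid

def log_event(path, event_type, **fields):
    """Append one structured event record to the log file at path."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "trace_id": fields.pop("trace_id", str(uuid.uuid4())),
        "event": event_type,
        **fields,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: log a tool call and its result under a shared trace_id.
tid = str(uuid.uuid4())
log_event("agent.log", "tool_call", trace_id=tid, tool="write_memory", args={"file": "tasks.md"})
log_event("agent.log", "tool_result", trace_id=tid, tool="write_memory", ok=True, duration_ms=12)
```

The shared trace_id is what makes 2am debugging tractable: you can pull every event for one failed task with a single grep.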
The most expensive edge case I hit: I didn’t have retry limits on one of my agents. It hit an API error that was never going to resolve - the endpoint was just down. The agent kept retrying. Each retry cost tokens. I woke up to $40 in API charges and an agent that had been stuck in a loop for six hours.
After that: every operation in borb has a max of five retries, then it stops and reports the failure. Non-negotiable. The agent’s job is not to solve unsolvable problems. The agent’s job is to do the task or tell me it can’t.
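That policy is small enough to show in full. A sketch of the bounded-retry wrapper I mean, with exponential backoff between attempts - names and defaults are illustrative, not borb’s exact code:

```python
# Bounded retries with exponential backoff: after max_retries failures the
# operation stops and surfaces the error, instead of looping (and burning tokens) forever.
import time

MAX_RETRIES = 5

def with_retries(op, *, max_retries=MAX_RETRIES, base_delay=1.0):
    """Run op(); on exception, retry up to max_retries total attempts, then report the failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return op()
        except Exception as exc:
            if attempt == max_retries:
                # Give up and report - the agent's job is to surface this, not solve it.
                raise RuntimeError(f"gave up after {max_retries} attempts: {exc}") from exc
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Wrapping every external call in something like this is what turns a six-hour $40 loop into a single logged failure report.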
Weeks 9-12: The architecture is wrong and you have to accept it
This is the hardest part of the 3-month gap between building an AI agent system and having something reliable.
Around week nine, I started noticing a pattern. Individual fixes weren’t sticking. I’d patch one thing and something adjacent would break. The sub-agents were failing silently in ways that only became obvious when I traced back through the memory files. The memory itself was getting stale - agents referencing context from three weeks ago that was no longer relevant, treating outdated information as current.
The problem wasn’t any specific bug. The problem was architectural.
My original memory system was a single MEMORY.md file. Everything the agents wrote went into it. This worked fine for the first few weeks when the file was small. By week nine, it was 8,000 words of mixed context - tasks, decisions, completed work, notes, research summaries, all in one place. The agents were pulling from it for context but the retrieval wasn’t smart enough to distinguish “recent and relevant” from “old and stale.” The whole thing was polluted.
I rebuilt the memory layer. Separate files for different context types - active tasks, completed work, long-term facts, agent-specific state. Added timestamps. Added a lightweight cleanup job that runs daily and archives anything older than two weeks unless it’s flagged as permanent.
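The cleanup job is the piece worth sketching, because it’s the part most people skip. Assuming one timestamped JSON entry per line (which is not borb’s exact file format - the layout and flag names here are illustrative):

```python
# Daily memory-maintenance pass: move entries older than two weeks out of the
# active file into an archive, unless the entry is flagged permanent.
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

MAX_AGE = timedelta(days=14)

def sweep(active_path, archive_path, now=None):
    """Archive stale entries; return how many were archived."""
    now = now or datetime.now(timezone.utc)
    keep, archived = [], []
    for line in Path(active_path).read_text().splitlines():
        # One entry per line: {"ts": "...+00:00", "permanent": bool, "text": "..."}
        entry = json.loads(line)
        ts = datetime.fromisoformat(entry["ts"])
        if entry.get("permanent") or now - ts <= MAX_AGE:
            keep.append(line)
        else:
            archived.append(line)
    Path(active_path).write_text("\n".join(keep) + ("\n" if keep else ""))
    with open(archive_path, "a") as f:
        for line in archived:
            f.write(line + "\n")
    return len(archived)
```

Nothing is deleted - stale context just stops being fed to the agents, which is what keeps retrieval from treating three-week-old decisions as current.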
That’s the thing about the debugging phase nobody talks about: some of what you’re debugging isn’t bugs. It’s the architecture not scaling the way you assumed it would. Patches don’t fix that. You have to rebuild parts of the system.
I wrote more debugging and refactoring code in month three than I wrote feature code in month one. That ratio felt wrong when I was in it. Looking back, it’s exactly right. The feature code gets you to “it demos.” The debugging gets you to “it works.”
What model selection actually looks like in production
One concrete thing that took me too long to figure out: you can’t use one model for everything.
My original borb setup used Claude Sonnet for everything. Consistent, predictable, good at reasoning. Also overkill and expensive for tasks that don’t need it.
Now the system is tiered by task complexity:
- Orchestration layer (deciding what to do, assigning to sub-agents): Opus. It needs to make judgment calls, handle ambiguity, and route correctly. Skimping here is the most expensive kind of cheap.
- Complex reasoning tasks (code review, research synthesis, writing drafts): Sonnet. Good balance of quality and cost.
- Simple lookups and fast operations (memory writes, formatting, scheduling checks): Flash. Fast, cheap, and the task doesn’t need a frontier model anyway.
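The routing itself doesn’t need to be clever - a lookup table from task category to model tier covers it. A sketch, with placeholder model names standing in for whatever tiers you actually run:

```python
# Tiered model routing: map each task category to the cheapest model that
# handles it well. Tier names and categories are placeholders, not borb's config.
TIERS = {
    "orchestrate": "opus",    # judgment calls, ambiguity, routing to sub-agents
    "reason":      "sonnet",  # code review, research synthesis, drafts
    "simple":      "flash",   # memory writes, formatting, scheduling checks
}

def pick_model(task_category):
    # Unknown categories fall back to the reasoning tier rather than the
    # cheapest one - mis-routing downward is costlier than mis-routing upward.
    return TIERS.get(task_category, TIERS["reason"])
```

The interesting design decision is the fallback direction: when you’re unsure what a task needs, defaulting to the mid tier costs a little money but avoids silently degrading output.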
The cost difference is significant. Running Flash for the simple stuff cut my daily token spend by about 30% without any change in output quality on those tasks. Because a task like “write this event to the calendar file” doesn’t need a 200-billion-parameter model. It just doesn’t.
If you’re building an agent system and everything is running on your best model, you’re leaving both money and performance on the table.
The kill switch is a feature, not an afterthought
I added a kill switch to borb after week two. It’s the best thing in the entire system.

One command stops all agents, freezes all cron jobs, and logs the current state of every active task. No partial writes. No half-completed operations. Just a clean stop.
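One way to get that behavior is a sentinel file: every agent loop and cron wrapper checks for it before doing work, and engaging the switch writes the sentinel plus a snapshot of in-flight state. This is an illustrative pattern, not borb’s actual implementation:

```python
# Kill switch as a sentinel file. Agents check halted() before each unit of
# work; kill() writes the sentinel plus a snapshot of active tasks.
import json
from datetime import datetime, timezone
from pathlib import Path

KILL_FILE = Path("KILL")

def kill(active_tasks):
    """Stop everything: write the sentinel with a snapshot of in-flight state."""
    KILL_FILE.write_text(json.dumps({
        "stopped_at": datetime.now(timezone.utc).isoformat(),
        "active_tasks": active_tasks,
    }, indent=2))

def halted():
    return KILL_FILE.exists()

# In every agent loop / cron entry point, before doing work:
#     if halted():
#         sys.exit("kill switch engaged")
```

Checking before each unit of work (rather than killing processes mid-write) is what gives you the clean stop: nothing is interrupted halfway through a file write, it just never starts the next step.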
I’ve used it maybe six times. Twice when something was clearly going wrong and I needed to stop the bleeding. Once when I pushed a bad config change. Three times during architecture refactors when I needed to be sure nothing was running while I was editing core files.
The kill switch means I can experiment aggressively because I know I can stop everything immediately if I need to. Without it, I’d be more conservative about changes - and make slower progress because of it.
Building agent systems without a kill switch is like deploying to production without rollback. Technically possible. Categorically reckless.
Why nobody posts about the 3-month gap in building AI agent systems that actually ship
Twitter is full of “I built an AI agent in 30 minutes” content. I’ve read it. Some of it’s even technically impressive. But a 30-minute build that you’re demoing to a camera isn’t the same thing as a system that runs your workflow for 90 days without you touching it.
The demo is marketing. The three months after the demo is engineering.
Building in public usually means sharing wins - launches, milestones, metrics going up. The debugging phase has almost no wins. It’s just slightly fewer failures each week than the week before. That doesn’t make for satisfying content. It doesn’t fit the narrative arc. Nobody wants to read “week six: fixed four edge cases, discovered three more.”
But that’s the actual work. That’s the difference between a project and a product. A project works when you’re watching it. A product works when you’re not.
I’m not saying this to gatekeep. I’m saying it because if you’re three weeks into running your agent and you’re losing your mind debugging things you didn’t expect - that’s normal. You’re not in the failure state. You’re in the middle of the process. Keep going.
What you actually need to get through the 3-month gap
Based on borb, based on the $40 loop-retry incident, based on the stale memory architecture rebuild - here’s what I’d tell myself at week one:
Log everything. Every tool call, every model response, every write. You will need these logs at an inconvenient time and you will be grateful they exist.
Hard limits on retries. Five max, then stop and report. The agent’s job is not to solve problems that can’t be solved. It’s to do the work or surface the failure.
Memory needs maintenance, not just construction. Building the memory system is the easy part. Keeping it clean and relevant over months is the hard part. Budget time for it.
Model selection is architecture. Using the same model for everything is a shortcut that becomes a tax. Right model for right task from the start.
Build the kill switch before you need it. You’ll need it.
And accept that months two and three look like a lot of debugging with very little to show for it. That’s not failure. That’s what it looks like to build something that actually works.
If you’re thinking about building an agent system or you’re already in the weeds, check out what I write about over on the blog. More of this in the ADHD and AI workflows posts. And if you want to understand the broader context of who’s building this stuff and why, the MCP server has that.
The demo is the beginning. The 3 months after it are the product.