Prompt engineering is dead

Prompt engineering had an 18-month run as a real skill. Models killed it. Learn what actually matters now - tool use, agent architecture, and MCP integration.

Prompt engineering as a skill is dead. I know that’s a spicy take. I also know it’s correct.

I’ve been building AI agent systems since March 2023. Went deep on prompt engineering early - custom GPTs with specialized knowledge bases, engineered prompt systems for UX research, meal tracking, viral content creation, planning assistants. I did the whole thing. Followed the guides, read the papers, obsessed over system prompt structure.

And somewhere in the last 12 months I realized: almost none of that matters anymore. The skill I spent real time developing got commoditized. What replaced it is harder, more valuable, and almost nobody’s talking about it correctly.

Why prompt engineering as a skill is dead

Prompt engineering had about 18 months as a legitimate technical differentiator. Call it mid-2022 to late-2023. During that window, knowing how to coax useful output from a model was genuinely non-obvious. Chain-of-thought prompting improved outputs meaningfully. Few-shot examples helped a lot. Persona framing, careful instruction ordering, output format control - all of it made a real difference because the base models needed the help.

Then the models got better. Fast.

GPT-4 Turbo, Claude 3, Gemini 1.5 - these models understand intent. You don’t need to hand-hold them through a task with elaborate prompting rituals anymore. Chain-of-thought? Current models do it automatically without being told to “think step by step.” Few-shot examples? Models infer from conversational context. Carefully structured personas? You can write “you’re a helpful assistant that specializes in X” and it works fine.

The elaborate stuff people were selling as advanced prompt engineering? It was always just “learning to give clear instructions to something that needed clearer instructions.” That’s not a skill. That’s communication.

Here’s the tell: if prompt engineering were a real technical skill, you’d need to understand something about how the system works to get better at it. But most prompt engineering advice is just… good writing advice with extra steps. Be specific. Give examples. State your constraints. These aren’t insights about AI - they’re basic communication principles.

The YouTube courses, the “certified prompt engineer” bootcamps, the LinkedIn posts about prompt frameworks - that whole industry built up around a skill that had a two-year shelf life. I’m not dunking on the people who built that content. It was real value at the time. It’s just not the bottleneck anymore.

What actually killed it

Three things landed simultaneously and the combination was lethal for prompt engineering as a career path.

Better base models that understand intent. I can give Claude Sonnet a roughly-worded, slightly ambiguous instruction and it’ll figure out what I mean. I don’t need to craft it like a legal contract. The model’s interpretive ability improved faster than the complexity of what people wanted to do with it.

Tool use and function calling. This is the big one. A huge chunk of what prompt engineering was trying to do was get models to simulate capabilities they didn’t actually have. “Imagine you have access to a search engine…” Now you just give it a search tool. The prompt gymnastics were a workaround for the absence of real tool integration. Now that tool calling is standard, you don’t need the workaround.
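The shift is easy to see in code. Here's a minimal sketch of the tool-use pattern: declare a real tool with a schema, and dispatch the model's tool calls to a real function. The schema shape follows the common JSON tool-definition style used by current model APIs; `web_search` and its stub implementation are hypothetical stand-ins.

```python
# Declare a real tool instead of prompting the model to "imagine" search.
SEARCH_TOOL = {
    "name": "web_search",
    "description": "Search the web and return top result snippets.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def web_search(query: str) -> str:
    # Placeholder: a real implementation would call a search API here.
    return f"results for: {query}"

TOOLS = {"web_search": web_search}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to the matching real function."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["input"])

# A tool-capable model emits a structured call like this instead of
# hallucinating search results in plain text:
call = {"name": "web_search", "input": {"query": "MCP spec"}}
print(dispatch(call))
```

The prompt's job shrinks to describing when to use the tool; the capability lives in the code behind it.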

The realization that it’s managerial, not technical. The best prompt engineers were good writers and good managers - people who could specify what they wanted clearly. That’s a useful skill. It’s not a technical skill. An engineer who spent years learning systems programming has depth that transfers. A “prompt engineer” who spent years optimizing instruction phrasing has a skill that’s been automated by model improvements.

What replaced it: agent architecture and tool design

I run a multi-agent system called OpenClaw. 15+ specialized agents, each running a different model, handling different parts of my daily workflow - morning digests, research, code tasks, memory management, content scheduling, crypto position monitoring. The system processes around 40K tokens a day. Monthly API cost is about $40 because I’m aggressive about model selection.

The prompts for each individual agent? Maybe 20 lines. Sometimes less. Nothing fancy.

The orchestration code is thousands of lines. The tool integrations are where most of the work lives. The hard thinking went into: which agent handles what, how agents hand off context to each other, what happens when an agent fails, how memory persists across sessions, which model is cheap enough for a given task while still being capable enough to not screw it up.

That’s the new skill. And it’s actually a skill - one that requires understanding systems, trade-offs, failure modes, and cost structures.

Here’s what the real work looks like now:

Model selection by task. I don’t use one model for everything. Opus handles orchestration decisions that require judgment calls. Sonnet handles fast tasks where I need speed over depth. Gemini Flash handles research synthesis because it’s cheap and good at skimming large amounts of text. Codex handles code because it’s purpose-built. Picking the wrong model for a task either burns money or degrades output quality. That selection logic matters far more than prompt phrasing.
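In practice this selection logic is unglamorous: a routing table plus a safe default. Here's a minimal sketch, with task labels and model keys that are hypothetical shorthand for the assignments described above.

```python
# Task-based model routing: a lookup plus a fallback, not a prompt trick.
ROUTES = {
    "orchestration": "opus",          # judgment-heavy decisions
    "quick_task":    "sonnet",        # speed over depth
    "research":      "gemini-flash",  # cheap, good at skimming large text
    "code":          "codex",         # purpose-built for code
}

DEFAULT_MODEL = "sonnet"  # safe middle tier for unclassified tasks

def pick_model(task_type: str) -> str:
    """Return the cheapest model still capable enough for the task."""
    return ROUTES.get(task_type, DEFAULT_MODEL)
```

The interesting engineering is in deciding what goes in that table and measuring when a cheaper tier starts degrading output.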

Memory architecture. A single model call is stateless. Real agent systems aren’t. How you persist context across sessions, what you include in each agent’s working memory, when to summarize versus retain full conversation history - these decisions affect whether your system stays coherent over time or slowly degrades into confusion. My system uses markdown files (MEMORY.md, SOUL.md, AGENTS.md) that agents read from on each run. It’s simple and it works. Getting there took iteration.
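The file-backed approach can be sketched in a few lines: each run, the agent reads a fixed set of markdown files into context. The file names match those mentioned above; the loader itself is illustrative.

```python
from pathlib import Path

# The memory files each agent reads at the start of a run.
MEMORY_FILES = ["MEMORY.md", "SOUL.md", "AGENTS.md"]

def load_memory(root: str) -> str:
    """Concatenate whichever memory files exist under `root`,
    each prefixed with its filename so the agent knows the source."""
    parts = []
    for name in MEMORY_FILES:
        path = Path(root) / name
        if path.exists():
            parts.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(parts)
```

Simple file reads beat a vector database here because the whole memory fits in context and stays human-editable.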

Error recovery. Model calls fail. APIs go down. An agent returns a response that’s correctly formatted but semantically wrong. What does your system do? Does it retry with the same prompt? Escalate to a more capable model? Log the failure and skip? Notify you? Building robust error handling into agent pipelines is where I’ve spent more time than on any individual prompt.
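One common shape for this is an escalation ladder: retry on the same model, then escalate to a more capable tier, then give up and let the caller decide. A minimal sketch, where `call_model` and `validate` are hypothetical stand-ins for a real API client and a semantic check:

```python
# Cheapest-first escalation order; hypothetical model keys.
ESCALATION = ["gemini-flash", "sonnet", "opus"]

def run_with_recovery(prompt, call_model, validate, retries_per_model=2):
    """Try each model tier in order, retrying transient failures,
    and only accept output that passes a semantic validation check."""
    for model in ESCALATION:
        for _ in range(retries_per_model):
            try:
                out = call_model(model, prompt)
            except Exception:
                continue  # transient API failure: retry this tier
            if validate(out):  # well-formed but wrong output also escalates
                return model, out
    return None, None  # exhausted: caller decides to skip, notify, or queue
```

The `validate` hook is the important part: it's what catches the "correctly formatted but semantically wrong" case that a bare try/except misses.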

Tool design. When you build tools for your agents to use, you’re essentially designing an API for a semi-autonomous system. The tool needs to do one thing clearly. The input schema needs to be simple enough that a model will call it correctly. The output needs to be structured in a way the model can reason about. Bad tool design leads to agents that can’t actually use their tools effectively - and you’ll spend time debugging model behavior when the real problem is your tool interface.
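Concretely, a well-designed agent tool takes flat inputs a model can't get wrong and returns a small structured result it can reason about. This hypothetical price tool is a sketch of that rule (the quoted prices are stand-ins, not real data):

```python
def get_price(symbol: str) -> dict:
    """Return the current price for one asset symbol.

    One job, one string input, and a flat dict output with explicit
    units and an `ok` flag the model can branch on.
    """
    prices = {"BTC": 67000.0, "ETH": 3500.0}  # stand-in for a real API call
    if symbol not in prices:
        return {"ok": False, "error": f"unknown symbol: {symbol}"}
    return {"ok": True, "symbol": symbol, "price_usd": prices[symbol]}
```

Compare that to a tool that takes a free-form "request" string and returns a prose blob: the model will misuse the first rarely and the second constantly.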

If you want to go deeper on how I think about agent systems and the principles behind building them, check out the universal MCP server for more context on where I’m coming from.

The MCP shift and why tool use is the real frontier

Model Context Protocol (MCP) is the clearest signal of where this is all heading. Anthropic published the spec, it’s being adopted fast, and it formalizes something I’ve believed for a while: the future of AI capability is real tool access, not better instructions.

The mental shift is this: instead of engineering a prompt that makes a model pretend to have access to data, you give it actual tools to access real data. Instead of “imagine you’re an expert at X with access to Y,” you connect it to Y’s API and let it query directly.

I built a universal MCP server with 56 API integrations. Notion, GitHub, Slack, calendar, weather, crypto data, Spotify, health APIs, web search, memory storage. My agents don’t simulate access to these systems. They actually query them. The outputs are real, current, and accurate in ways that no prompt engineering trick could achieve.

That server took weeks to build. The prompts I wrote for the agents that use it took hours. If you’re calibrating where to invest your time, that ratio is your answer.

The MCP GitHub repo has a solid list of existing server implementations if you want to see what’s available before building your own. Don’t rebuild what’s already there.

Function calling and tool use aren’t features bolted onto models - they’re the architecture shift that makes models actually useful in production. A model that can query a live database, run code, check current prices, read a file, and send a message is categorically different from a model that only outputs text. The difference isn’t prompt quality. It’s what the model has access to.

What this means if you’re building

If you’re still spending serious time on prompt templates and prompt libraries, I’d push back on that investment. Not because prompts don’t matter - they do, a little - but because the return on that investment has dropped sharply while the return on tool-building and orchestration knowledge has gone up.

The skills that matter right now:

Systems thinking. How do components fail? What are the dependencies? Where does state live? These are the questions that determine whether a multi-agent system works reliably or collapses randomly.

API integration. The ability to connect to external systems, understand their data models, handle their rate limits and errors, and build clean interfaces over them. This is the “prompt engineering” of 2025 - not glamorous, but high leverage.

Evaluation. How do you know your agent system is working? Not just “it didn’t crash” but “it did the right thing.” Building evals for AI systems is hard, undervalued, and increasingly important as these systems touch real workflows. Hamel Husain has written well on this.

Cost optimization. Running agents at scale costs money. Knowing how to profile token usage, pick the right model tier for each task, and cache aggressively is a real engineering skill. I got my 15-agent system to $40/month through iteration, not luck.
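The profiling habit that matters is attributing spend per agent, so you know exactly where to downgrade a model tier or add caching. A sketch, with made-up per-token rates (check your provider's current pricing, not these numbers):

```python
# Hypothetical $/1K-token rates for illustration only.
RATES_PER_1K = {"opus": 0.015, "sonnet": 0.003, "gemini-flash": 0.0003}

def cost(model: str, tokens: int) -> float:
    """Dollar cost of a call at the illustrative rates above."""
    return RATES_PER_1K[model] * tokens / 1000

# Example usage log: (agent, model, tokens used today).
usage = [("digest", "gemini-flash", 12000), ("orchestrator", "opus", 3000)]
by_agent = {agent: cost(model, toks) for agent, model, toks in usage}
```

Once spend is broken out per agent, the expensive outliers are obvious, and so is whether a cheaper tier would dent quality where it actually matters.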

For more on what that iteration actually looked like, see my post on the 3 months of debugging agents - I write about this stuff as I build.

The honest version of where prompts still matter

I want to be fair here. Prompts aren’t zero-value. There are still cases where prompt quality meaningfully affects output quality.

Highly specialized domains where the model needs tight constraints. Anything involving consistent output formatting that downstream code parses. System prompts for agents that need specific behavioral guardrails. Evaluation prompts where you’re asking a model to grade other model outputs.

But notice: these are narrow, specific cases. And even here, the prompt is usually less than 10% of the total engineering work. The rest is the surrounding system.

The people who’ll tell you prompt engineering is still a high-value career skill are usually selling prompt engineering courses. The people who are building real systems with AI have mostly moved on.

If you’re newer to this and trying to calibrate what to learn: spend maybe two weeks understanding how prompts work and what affects model behavior. Then spend the next six months learning to build systems. That’s the right ratio.

The model will follow good instructions just fine. The question is what system you build around it.


Start with one tool. Connect your agent to one real API. See how differently the model behaves when it has actual access to something instead of simulated knowledge. That single experience will do more to reframe your thinking than any prompt engineering course.