When we started TwoPlus in late 2024, our best agents lived inside 4,000-word system prompts. We had three senior engineers whose job was, essentially, to write and rewrite those prompts. A new customer segment? Fork the prompt. A new edge case? Append three paragraphs and pray nothing else regresses. We thought this was normal because, in 2024, it was.
It wasn't sustainable, and the reason it wasn't sustainable is worth thinking about even if you've never worked on an AI product.
A brief history of the mega-prompt
In 2023, the cutting edge of LLM work was prompt engineering. We wrote prompts like we wrote SQL: one big statement that, if crafted well, returned the right answer. It worked because the problems were small enough to fit in one statement.
By early 2025 the prompts were getting long enough that we started to notice: adding a new rule would silently change unrelated behaviors. We'd fix a bug in outbound-email-tone and our triage accuracy would drop 4%. There was no stack trace. There was just 4,000 words of English.
Why it stopped scaling
Three compounding problems:
- No scoping. Every rule in the prompt applied to every task. A rule meant for refunds also ran during onboarding. The model had no way to know which rules were relevant.
- No ownership. Who owns paragraph 14? Nobody. Everyone. It was added by Alice in February, edited by Bob in March, partially reverted by a bot in April.
- No observability. When the agent did something wrong, we couldn't point at the line of the prompt that caused it. We'd rewrite whole sections on vibes.
The mega-prompt worked when we had one agent doing one thing. It broke the moment we had fifteen agents each doing twelve things.
Three layers that replaced it
Over six months in 2025 we rebuilt the architecture. The mega-prompt became three layers with clear responsibilities:
Guardrails (enforced at the tool layer)
Hard rules the agent literally cannot violate. Not “please don't email more than 500 people” — a code path that refuses to send the 501st email. Guardrails are deterministic, testable, and owned by whoever owns the policy.
// Guardrails live in code, not prose.
export const sendEmail = guardrail({
  maxPerHour: 50,
  mustHave: ['unsubscribe_link', 'from_verified_domain'],
  forbid: regex(/discount|off|sale/i), // pricing policy
})(actuallySendEmail);

Coaching notes (scoped memory)
Corrections and preferences that apply to a specific agent, not the whole org. When you reject an email and write “too formal for LinkedIn,” that sentence gets indexed, embedded, and retrieved only when the agent is drafting LinkedIn posts. It doesn't leak into Slack replies or sales emails.
The practical upshot: coaching notes compose. Prompt edits collide.
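To make "scoped" concrete, here's a minimal sketch of how a coaching-note store could work. Names like `addNote` and `notesFor` are illustrative, not from any particular framework, and real systems would embed and rank notes rather than match scopes exactly, but the isolation property is the same: a note tagged for one context never surfaces in another.

```typescript
// A coaching note is a correction scoped to one agent and one task type.
type CoachingNote = { text: string; agent: string; scope: string };

const notes: CoachingNote[] = [];

function addNote(text: string, agent: string, scope: string): void {
  notes.push({ text, agent, scope });
}

// Retrieve only the notes matching this agent and this task scope,
// so a LinkedIn correction never leaks into Slack replies.
function notesFor(agent: string, scope: string): string[] {
  return notes
    .filter((n) => n.agent === agent && n.scope === scope)
    .map((n) => n.text);
}

addNote("too formal for LinkedIn", "outreach-agent", "linkedin_post");
addNote("keep Slack replies short", "outreach-agent", "slack_reply");

// Drafting a LinkedIn post retrieves only the LinkedIn note.
const retrieved = notesFor("outreach-agent", "linkedin_post");
```

The design choice that matters is the `scope` field on every note: composition falls out of retrieval, because two notes with different scopes can't interact.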
Role specs (the small remaining prompt)
A 200-word document that describes what this agent does. No edge cases, no rules, no style guide — those live in the other two layers. Just: role, goal, success conditions, scope. Readable by the human who hired the agent.
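As a sketch of what that page can look like in practice, here's a hypothetical role spec as a typed object. The shape and field names are our illustration, not a standard; the point is that everything fits in a handful of fields a human can read in one sitting.

```typescript
// Illustrative shape for a role spec; field names are assumptions.
type RoleSpec = {
  role: string;                // what this agent is
  goal: string;                // what it's trying to achieve
  success: string[];           // how we'd know it's working
  scope: { does: string[]; doesNot: string[] };
};

const outboundSDR: RoleSpec = {
  role: "Outbound SDR agent",
  goal: "Book qualified demos with mid-market prospects",
  success: ["Replies routed to a human within one hour", "Zero deliverability incidents"],
  scope: {
    does: ["cold email", "LinkedIn outreach"],
    doesNot: ["pricing negotiation", "contract terms"],
  },
};
```

Note what's absent: no tone rules (coaching notes), no send limits (guardrails). The spec only says what the job is.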
What to do Monday morning
If you're still operating a mega-prompt, here's the sequence that worked for us:
- Grep for “NEVER” and “ALWAYS” in your prompt. Those are guardrails pretending to be prose. Move them to code.
- Find every “in the past, you have said…” clause. Those are coaching notes pretending to be prompt. Move them to scoped memory.
- What's left should fit on a page. That's your spec.
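The triage above can be sketched as a rough first pass over the prompt text. This is a heuristic, not our actual tooling: the regexes are illustrative, and in practice a human reviews every line before it moves.

```typescript
// First-pass triage of a mega-prompt into the three layers.
// Heuristics only; a human still reviews every line.
function triagePrompt(prompt: string): {
  guardrails: string[];
  coaching: string[];
  spec: string[];
} {
  const guardrails: string[] = [];
  const coaching: string[] = [];
  const spec: string[] = [];
  for (const line of prompt.split("\n")) {
    if (/\b(NEVER|ALWAYS)\b/.test(line)) {
      guardrails.push(line); // hard rules pretending to be prose
    } else if (/in the past, you have said/i.test(line)) {
      coaching.push(line); // corrections pretending to be prompt
    } else if (line.trim()) {
      spec.push(line); // what remains is the role spec
    }
  }
  return { guardrails, coaching, spec };
}

const megaPrompt = [
  "You are a support agent for Acme.",
  "NEVER issue refunds above $200.",
  "In the past, you have said the greeting was too stiff.",
].join("\n");

const { guardrails, coaching, spec } = triagePrompt(megaPrompt);
```

Running this over a real 4,000-word prompt is mostly useful as a census: it tells you how much of the document is rules and memory in disguise before you start moving anything.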
The first time we did this for our best internal agent, the prompt went from 3,800 words to 180. Quality went up because the model could actually focus on the role instead of drowning in caveats.
We're not going to pretend the mega-prompt is dead for everyone — plenty of great work is still happening in one big string. But if you're managing more than a handful of agents, or more than a handful of edge cases per agent, you'll hit the same wall we did. Save yourself six months.


