Agent Infrastructure · Systems Engineering

Your AI Agent Isn't Broken. Your Prompt Is an Instruction Manual Written in Crayon.

April 5, 2024 · 18 min read · Bluecore Team

Everyone is shipping agents right now.

Most of them work beautifully in the demo. Then they hit production and start hallucinating tool calls, asking users to choose between Option A and Option B on decisions that should never require a human, fabricating explanations when a tool isn't available, and generally behaving like a very confident junior employee who has no idea what the approval process is.

The problem isn't the model. The model is fine.

The problem is that someone handed it a prompt written like a creative brief instead of an operations manual, then blamed the AI when it improvised.

Prose Prompts Are Technical Debt Dressed in Markdown

The instinct when building agents is to describe intent — explain what you want the agent to do, what it should care about, how it should behave. That feels right. It's also how you get a system that technically understands its mission and has absolutely no idea what to do when the user's message doesn't match the scenario you described.

Prose prompts have no state model. They have no explicit transition logic. They have no guards against duplicate execution. They don't define what a success response looks like versus a failure response. They're optimistic about everything and defensive about nothing.

At small scale, inside a single conversation, this is tolerable. You can patch it with more prose. Add another paragraph explaining the edge case. Throw in a caveat. Hope the model infers the right behavior.

At production scale, across concurrent sessions, in a multi-agent architecture where downstream systems depend on consistent outputs — the whole thing becomes a liability. Not sometimes. Reliably.

The Fix Isn't Better Writing. It's Better Architecture.

Think of your prompt the way you'd think of a production runbook.

A runbook doesn't describe intent. It describes states, triggers, and transitions. It specifies exactly what condition must be true before an action is taken. It defines what the output looks like when an action succeeds, and what it looks like when it fails. It tells you what to do if a precondition isn't met — not as a suggestion, but as a hard rule.

A prompt built like a runbook gives an agent the same properties a runbook gives a human operator: deterministic behavior under load, predictable handling of edge cases, and an audit trail that doesn't require someone to go spelunking through chat logs to figure out what happened.

The structure that makes this work has five components:

States — an explicit lifecycle your agent understands and enforces. Not "the agent knows what phase it's in" — an actual declared state machine. UNINITIALIZED → PLANNED → EXECUTABLE → EXECUTED. The agent should know what state it's in, refuse operations that aren't valid in that state, and transition cleanly when the preconditions are met.
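A declared state machine like this can be sketched in a few lines. This is a minimal illustration, not code from any real agent framework; the state names come from the lifecycle above, and the `Lifecycle` class and transition table are hypothetical:

```python
# Minimal sketch of an explicitly declared agent lifecycle.
# State names follow the lifecycle described above; the rest is illustrative.
from enum import Enum, auto

class AgentState(Enum):
    UNINITIALIZED = auto()
    PLANNED = auto()
    EXECUTABLE = auto()
    EXECUTED = auto()

# Each state maps to the only states it may legally transition into.
TRANSITIONS = {
    AgentState.UNINITIALIZED: {AgentState.PLANNED},
    AgentState.PLANNED: {AgentState.EXECUTABLE},
    AgentState.EXECUTABLE: {AgentState.EXECUTED},
    AgentState.EXECUTED: set(),  # terminal: no further transitions
}

class Lifecycle:
    def __init__(self) -> None:
        self.state = AgentState.UNINITIALIZED

    def transition(self, target: AgentState) -> None:
        # Refuse any operation that is not valid in the current state.
        if target not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.name} -> {target.name}"
            )
        self.state = target
```

The point is that illegal transitions raise instead of silently proceeding: an agent that tries to jump from PLANNED straight to EXECUTED gets an error, not an improvisation.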

Triggers — plain language phrases mapped deterministically to tools. Not fuzzy intent matching. If the user says "execute," the agent calls the execution tool. Full stop. No branching. No asking for clarification on things that aren't ambiguous.
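Deterministic trigger routing is just a lookup table, not an intent classifier. A minimal sketch, with hypothetical tool functions standing in for real tool calls:

```python
# Minimal sketch of phrase-to-tool routing. Tool names are illustrative.
def execute_tool(ctx: dict) -> str:
    return "executed"

def plan_tool(ctx: dict) -> str:
    return "planned"

# Exact phrase -> exact tool. No fuzzy matching, no clarifying question.
TRIGGERS = {
    "execute": execute_tool,
    "plan": plan_tool,
}

def route(utterance: str, ctx: dict) -> str:
    phrase = utterance.strip().lower()
    tool = TRIGGERS.get(phrase)
    if tool is None:
        # Unknown phrases fail loudly instead of being guessed at.
        raise KeyError(f"no deterministic trigger for {phrase!r}")
    return tool(ctx)
```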

Guards — explicit preconditions that block critical actions before they happen. Never create a second batch if a valid one exists. Never execute without authorization. Never skip approval state because the user sounds confident.
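Guards are cheapest to enforce when they wrap the action itself, so the precondition check cannot be skipped. A sketch using a decorator; the guard names and context shape are assumptions for illustration:

```python
# Minimal sketch: preconditions that run before a critical action.
from functools import wraps

def guarded(*preconditions):
    """Wrap an action so every precondition must pass before it runs."""
    def decorator(action):
        @wraps(action)
        def wrapper(ctx, *args, **kwargs):
            for check in preconditions:
                ok, reason = check(ctx)
                if not ok:
                    raise PermissionError(f"guard failed: {reason}")
            return action(ctx, *args, **kwargs)
        return wrapper
    return decorator

# Illustrative guards matching the rules above.
def no_existing_batch(ctx):
    return (ctx.get("open_batch") is None, "a valid batch already exists")

def authorized(ctx):
    return (ctx.get("authorized", False), "execution not authorized")

@guarded(no_existing_batch, authorized)
def create_batch(ctx):
    ctx["open_batch"] = "batch-001"
    return ctx["open_batch"]
```

Because the guard lives on the action rather than in prose, "never create a second batch" is enforced by code, not by hoping the model remembered a paragraph.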

Output contracts — structured, machine-readable, consistent. Every success response includes a full identifier and a link. Every failure response includes the actual error, not a paraphrase. No truncated IDs. No "the transaction was processed successfully" without the transaction hash.
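An output contract can be as simple as one dataclass that every tool response must serialize through. The field names and link format here are hypothetical, but the shape is the point: full identifier, link, and the verbatim error:

```python
# Minimal sketch of a machine-readable success/failure contract.
from dataclasses import dataclass, asdict

@dataclass
class ToolResult:
    ok: bool
    identifier: str   # full identifier, never truncated
    link: str         # link to the record or transaction
    error: str = ""   # verbatim error on failure, empty on success

def success(tx_hash: str) -> dict:
    return asdict(ToolResult(
        ok=True,
        identifier=tx_hash,
        link=f"https://explorer.example/tx/{tx_hash}",  # illustrative URL
    ))

def failure(tx_hash: str, raw_error: str) -> dict:
    # Pass the actual error through; no paraphrasing, no sanitizing.
    return asdict(ToolResult(ok=False, identifier=tx_hash,
                             link="", error=raw_error))
```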

One-click decisioning — after every significant action, the agent sends the correct next step. Not a menu. Not Option A/B/C. The agent knows what state you're in and what comes next. It sends that button. You click it or you don't.
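One-click decisioning falls out of the state machine almost for free: each state has at most one correct next action, so the agent can derive the button instead of offering a menu. A sketch with hypothetical state and tool names:

```python
# Minimal sketch: derive the single correct next action from state.
# State and tool names are illustrative.
NEXT_ACTION = {
    "PLANNED": {"label": "Approve plan", "tool": "approve"},
    "EXECUTABLE": {"label": "Execute batch", "tool": "execute"},
}

def next_button(state: str):
    # Terminal or unknown states get no button at all -- never a menu.
    return NEXT_ACTION.get(state)
```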

Why Agents Break Exactly When You Need Them Most

There's a specific failure mode we see repeatedly in multi-agent systems: the agent performs correctly until something slightly outside its described scenarios happens, at which point it improvises.

Improvisation in a system that touches money, records, or state is not a feature. It's the thing you don't find out about until it matters.

The most common improvisations:

The agent fabricates an explanation when it can't call the right tool. It's trying to be helpful. The output sounds plausible. The output is wrong. And because it sounded plausible, nobody checked.

The agent creates a duplicate record because it was never told to check whether one already existed before creating a new one. Idempotency is never the default — it has to be enforced by a guard.

The agent asks the user to make a technical decision they shouldn't have to make. "Would you like me to use the batch processor or execute individually?" is a question that should never reach a user. That's a routing decision. Route it in the prompt.

The agent truncates identifiers in its output. Now you have an audit trail that's almost complete, which is worse than not having one, because it creates the impression of traceability without the substance.

All of these are prompt architecture failures. All of them are preventable.

We Built This Into REFI2's Agent Infrastructure

When we designed the agent layer for REFI2 — our protocol for tokenizing Canadian residential mortgages into onchain yield instruments — we couldn't afford any of those failure modes.

Every agent in the system operates against an explicit state machine. The NAV calculation agent doesn't estimate — it calls the oracle, quotes the result verbatim, and attaches the full on-chain reference. The compliance gating agent doesn't interpret eligibility — it checks the regulatory gate at the Token-2022 level and returns a deterministic pass/fail with the exact condition that was evaluated. The deposit verification agent doesn't summarize — it returns the Merkle proof and the explorer link. Every time. Without improvisation.

What we found in building this way: agents operating against runbook-style prompts are dramatically faster to debug, dramatically easier to extend, and produce outputs that humans and downstream systems can both consume reliably. The audit trail isn't a feature you add at the end. It's a natural output of a system that was designed to explain itself at every step.

That architecture pattern — states, triggers, guards, output contracts, one-click decisioning — isn't specific to tokenized mortgages. It applies anywhere agents are doing consequential work.

What Production-Ready Actually Means for Agents

Production-ready isn't a vibe. It's a checklist:

Every intent resolves to a deterministic tool call, not a branching dialogue.

Every critical action has a guard that verifies preconditions before execution.

Every output includes a full identifier and a machine-readable success or failure contract.

Every failure returns the actual error, not a sanitized summary.

No action ever creates a duplicate, because the system verified existence before creating.

If your current agent architecture can't pass that checklist, you don't have a production system. You have a prototype that's pretending.

The good news is that the fix is architectural, not a limitation of the models. The models are capable. The infrastructure exists. What's missing, in almost every case, is a prompt that treats itself as software with guarantees rather than guidance with good intentions.

Write the runbook. Declare the states. Enforce the guards. Define the contracts.

Then your agent stops guessing and starts behaving like something you'd actually trust with real work.