The Opus Everywhere Anti-Pattern
Use the Smallest Model That Does the Job
Agent Systems Need Boundaries · Part 1: The Capability Boundary
Why reliable agent systems push intelligence away from execution.
Ask an LLM how many r’s are in “strawberry” and it may get the answer wrong. Counting characters is not what token prediction is built for.
Give that same model a code execution tool and the problem changes completely. The model writes "strawberry".count("r"), hands the task to a deterministic system, and gets back 3. Correct. Verifiable. Repeatable.
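A minimal sketch of that handoff, assuming the model is allowed to emit and run a few lines of Python. The `count_letter` helper is a hypothetical name chosen for illustration, not part of any particular tool API.

```python
def count_letter(word: str, letter: str) -> int:
    """Count how often a single character appears in a word, ignoring case."""
    if len(letter) != 1:
        raise ValueError("letter must be exactly one character")
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # prints 3
```

The model's job shrinks to choosing the function and its arguments. The counting itself never touches token prediction.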
The model should stop doing work that belongs to a tool.
That principle scales far beyond toy examples.
Most agent stacks put intelligence in the wrong places
A lot of current agent design assumes the safest move is to use the most capable model available everywhere.
Frontier model for routing. Frontier model for classification. Frontier model for extraction. Frontier model for execution planning. The reasoning seems obvious: if the model is more capable, the system should be better.
In practice, that often makes systems more expensive, less predictable, and more exposed to attack. Every model-mediated decision is a place an attacker can push on. Deterministic code can still carry bugs, but it does not reinterpret untrusted text as instructions the way an LLM can [OWASP 2025].
What matters on a bounded step is the kind of failure each component introduces. A deterministic component that misclassifies an input misclassifies it the same way on every run under the same version and environment, which makes the failure reproducible and easy to pin down with a regression test. A model on the same task fails nondeterministically: it can return the right answer nine times and the wrong one on the tenth with no change to the input [Atil et al. 2024], which means it can pass a test suite and still fail in production. That asymmetry is the reason to keep the model off bounded steps.
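To make the asymmetry concrete, here is a small sketch with a deliberately buggy rule-based classifier. The function name, the labels, and the keyword rules are all hypothetical. The point is only that the wrong answer is the same wrong answer on every run, so a regression test can capture it.

```python
def classify_priority(subject: str) -> str:
    """Hypothetical keyword classifier: same input, same output, every run."""
    text = subject.lower()
    if "outage" in text or "down" in text:
        return "urgent"
    if "invoice" in text:
        return "billing"
    return "general"


def test_known_misclassification() -> None:
    # "download" contains "down", so the rule wrongly returns "urgent".
    # The failure is identical across 100 runs, which is what makes it
    # easy to reproduce, pin in a test, and then fix.
    results = {classify_priority("Download link for the report") for _ in range(100)}
    assert results == {"urgent"}


test_known_misclassification()
```

A model with the same blind spot could pass this check on some runs and fail it on others, which is exactly what a fixed test suite cannot see.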
If the answer space is bounded, you don’t want a model free to reinterpret the problem. You want the least capable component that can reliably choose from the available options.
That might be a small model. It might be a rules engine. It might be a lookup table. It might be an if statement.
The right question to ask is: what is the simplest component that can do this job reliably?
The suitability gap
This is the move I keep watching teams miss. A frontier model is capable of doing your classification, your routing, your extraction. That doesn’t mean it’s suited to doing those things in a production system.
Capability and suitability look the same in a demo. They diverge under load, under cost pressure, under adversarial input, and under the kind of long-tail behavior that only shows up when you have ten thousand runs and not ten.
The strongest model in the system is rarely the right component for the steps that demand precision and consistency. Frontier models excel where broad language understanding, synthesis, and open-ended reasoning matter, which does not make them the best fit for narrow production steps where consistency, cost, and bounded outputs dominate [Belcak et al. 2025]. Precision and consistency are different objectives from raw capability.
Put differently: capability is what a model can do at its best. Suitability is how a component behaves across every input and operating condition the system will face: cost at volume, latency under load, adversarial input, and long-tail cases. Consistency is one part of suitability, not the whole of it.
Concretely: an email triage node with five possible labels does not need a model capable of debating the architecture of the system. It needs stable label selection, calibrated abstention when the input is genuinely ambiguous, and a test set that proves both.
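One way to sketch that triage node, assuming some upstream component (a small model or a rules engine) already produces a score per label. The five label names and both thresholds are hypothetical; in a real system the thresholds would be tuned against the test set rather than picked by hand.

```python
LABELS = ("billing", "support", "sales", "spam", "other")  # hypothetical label set


def select_label(scores: dict[str, float],
                 min_score: float = 0.6,
                 min_margin: float = 0.15) -> str:
    """Return the top-scoring label, or "abstain" when the input is too close to call."""
    if not scores or set(scores) - set(LABELS):
        raise ValueError("scores must be non-empty and use only the known labels")
    ranked = sorted(scores.values(), reverse=True)
    top = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    # Abstain when the best score is weak or barely ahead of the next one.
    if top < min_score or top - runner_up < min_margin:
        return "abstain"
    return max(scores, key=scores.get)
```

Abstention is an explicit output here, so the test set can check it alongside the five labels.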
Keep generative reasoning off the consequential path
Decisions that touch real money, production data, customer impact, or privileged actions should run on deterministic components or tightly constrained models. Treat this as a boundary. Models can propose, draft, and summarize on the advisory side of it; consequential execution on the other side passes through deterministic validation, policy enforcement, or human approval.
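A sketch of what that boundary can look like for a single consequential action, assuming a model has proposed a refund and deterministic code decides what happens next. `RefundProposal`, `decide`, and the approval limit are hypothetical names and values for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundProposal:
    """A refund suggested by a model on the advisory side of the boundary."""
    order_total_cents: int
    refund_cents: int


AUTO_APPROVE_LIMIT_CENTS = 5_000  # hypothetical policy threshold


def decide(proposal: RefundProposal) -> str:
    """Deterministic gate: the model proposes, this function disposes."""
    if proposal.refund_cents <= 0 or proposal.refund_cents > proposal.order_total_cents:
        return "reject"
    if proposal.refund_cents <= AUTO_APPROVE_LIMIT_CENTS:
        return "execute"
    return "needs_human_approval"
```

The model never calls the payment system. It only fills in a proposal that fixed logic can accept, refuse, or hand to a person.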
That sounds backwards until you look at the failure modes.
A language model is nondeterministic. It can overthink. It can hallucinate. It can latch onto the wrong detail. It can respond differently to the same problem after a small change in prompt wording or context [Sclar et al. 2024].
Those properties are tolerable in exploration. They are dangerous in execution.
A healthy agent pipeline usually looks something like this:
Execution: no model. Deterministic code, fixed logic, explicit math.
Routing and classification: the smallest model or simplest deterministic component that can do the task reliably.
Structured extraction: a constrained model with a schema and strict validation.
Exploration and strategy: this is where larger models earn their keep. Creativity matters here. Surprise can be useful here.
The pattern is simple: keep generative reasoning in the sandbox and the critical path deterministic wherever it can be.
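For the structured extraction tier, a minimal sketch of "schema and strict validation" using only the standard library. The invoice fields, the allowed currencies, and the `parse_invoice` name are assumptions made for this example; a production system might use a schema library instead.

```python
import json

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # hypothetical schema constraint
REQUIRED_FIELDS = {"invoice_id", "amount_cents", "currency"}


def parse_invoice(raw_model_output: str) -> dict:
    """Accept model output only if it matches the expected shape exactly."""
    data = json.loads(raw_model_output)  # raises ValueError on malformed JSON
    if not isinstance(data, dict) or set(data) != REQUIRED_FIELDS:
        raise ValueError("output must contain exactly the required fields")
    if not isinstance(data["invoice_id"], str) or not data["invoice_id"]:
        raise ValueError("invoice_id must be a non-empty string")
    amount = data["amount_cents"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount < 0:
        raise ValueError("amount_cents must be a non-negative integer")
    if not isinstance(data["currency"], str) or data["currency"] not in ALLOWED_CURRENCIES:
        raise ValueError("currency is not in the allowed set")
    return data
```

Anything the validator rejects stays in the sandbox. Downstream code only ever sees values that passed every check.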
A routing table
If you want a working rule of thumb, here is the one I use:
| Task | Right component |
|---|---|
| Ambiguous architecture / design | Strongest reasoning model |
| Resolved implementation | Cheaper direct model |
| Mechanical transform / refactor | Script, codemod, AST tool |
| Classification with bounded set | Small model or rules engine |
| Structured extraction | Constrained model with strict schema |
| Validation | Tests, typecheck, deterministic checks |
| Routing between known paths | Explicit rules |
| Security decision | Policy engine, not model judgment |
| Execution | Code |
The table is more useful as a question than as a prescription. The question is: for this specific node in my pipeline, what is the least capable component that handles it reliably? Start there. Escalate only when evaluation shows the cheaper component failing on a class of inputs you actually care about.
The table works as a containment strategy. Intelligence belongs where uncertainty is useful, concentrated there rather than smeared across every node in the system.
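The table itself can live in code as a plain lookup, so the assignment of components to tasks is reviewable and versioned. This is a sketch: the task-kind keys are hypothetical identifiers I made up, and raising on an unknown task kind, rather than quietly defaulting to the largest model, is my own design choice.

```python
COMPONENT_FOR_TASK = {
    "ambiguous_design": "strongest reasoning model",
    "resolved_implementation": "cheaper direct model",
    "mechanical_transform": "script, codemod, or AST tool",
    "bounded_classification": "small model or rules engine",
    "structured_extraction": "constrained model with strict schema",
    "validation": "tests, typecheck, deterministic checks",
    "known_path_routing": "explicit rules",
    "security_decision": "policy engine",
    "execution": "code",
}


def component_for(task_kind: str) -> str:
    """Look up the assigned component; unknown task kinds fail loudly."""
    try:
        return COMPONENT_FOR_TASK[task_kind]
    except KeyError:
        raise ValueError(f"no component assigned for task kind {task_kind!r}") from None
```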
The same architecture wins on cost, reliability, and security
These wins are operational. They show up in the bill, in the failure rate, and in the attack surface.
Cost
The least capable component is usually the cheapest one.
If a small model or a deterministic component can do classification, there is no reason to pay frontier-model prices for that step. On narrow, repetitive agentic tasks, small or specialized models can sometimes reach acceptable performance while costing far less to serve. The NVIDIA position paper estimates a 7B model is 10 to 30 times cheaper to serve than a 70 to 175B model across latency, energy, and compute [Belcak et al. 2025]. Those costs compound quickly in systems that run continuously or generate work at volume. The bill for “frontier model on every node” looks reasonable when you have one user. It looks ruinous when you have a thousand.
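A back-of-the-envelope sketch of how that compounds. The per-call price and the calls per user are hypothetical placeholders, not real prices, and applying the paper's 10 to 30 times serving-cost ratio directly to a per-call price is a simplifying assumption.

```python
def monthly_cost(calls_per_day: int, cost_per_call: float, days: int = 30) -> float:
    """Total spend for one pipeline node over a month."""
    return calls_per_day * cost_per_call * days


FRONTIER_COST_PER_CALL = 0.01  # hypothetical placeholder, not a real price
CALLS_PER_USER_PER_DAY = 200   # hypothetical workload

for users in (1, 1000):
    calls = users * CALLS_PER_USER_PER_DAY
    frontier = monthly_cost(calls, FRONTIER_COST_PER_CALL)
    # Apply the 10x to 30x range cited above to bracket the small-model cost.
    small_low = monthly_cost(calls, FRONTIER_COST_PER_CALL / 30)
    small_high = monthly_cost(calls, FRONTIER_COST_PER_CALL / 10)
    print(f"{users:>5} users: frontier ${frontier:,.2f}/month, "
          f"small model ${small_low:,.2f} to ${small_high:,.2f}/month")
```

The ratio stays the same at every scale. What changes is whether the absolute gap is a rounding error or a budget line.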
Reliability
Every LLM you put in the execution path adds a nondeterministic failure mode [Atil et al. 2024].
If the task can be handled by code, use code. If it can be handled by a rules engine, use a rules engine. If it can be handled by a small constrained model instead of a frontier model, do that.
The goal is to reserve models for the parts of the system where their strengths actually matter.
Security
There is also a security version of this argument. Every model-mediated decision is a surface an attacker can push on, and that surface widens as system complexity grows [OWASP 2025]. I cover that directly in Part 3. For now the practical point is simpler. If a step can be handled by deterministic code, a rules engine, a schema validator, or a smaller constrained model, do that before reaching for a frontier model. Deterministic components remove the surface rather than resize it.
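As a small illustration of a deterministic check in front of a model, here is a sketch of an allowlist for model-proposed tool calls. The action names, argument names, and `is_permitted` are hypothetical. A real policy engine would be far richer, but the shape is the same: the decision is a lookup, not a judgment.

```python
ALLOWED_ACTIONS = {
    # action name -> exact set of argument names it may carry (hypothetical)
    "read_ticket": {"ticket_id"},
    "add_comment": {"ticket_id", "body"},
}


def is_permitted(action: str, args: dict) -> bool:
    """Allow a proposed tool call only if the action and its argument names are on the list."""
    allowed_args = ALLOWED_ACTIONS.get(action)
    return allowed_args is not None and set(args) == allowed_args
```

No wording in the untrusted input can change what this function returns for a given action and argument set.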
A practical design rule
When you are building an agent system, audit every node in the pipeline and ask:
What is the simplest component that can do this job reliably?
Use that component. Evaluate it against the inputs you actually care about. Keep it until evaluation shows it failing on a class of inputs that matters, and escalate to a larger model only then.
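The escalation rule can be sketched as a per-class accuracy check over evaluation results. The `failing_classes` name, the input-class labels, and the 0.95 threshold are assumptions for illustration; the threshold that matters is whatever your system can actually tolerate.

```python
from collections import defaultdict
from typing import Iterable


def failing_classes(results: Iterable[tuple[str, bool]],
                    min_accuracy: float = 0.95) -> list[str]:
    """Given (input_class, passed) pairs, return the classes below the accuracy bar."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for input_class, passed in results:
        totals[input_class] += 1
        passes[input_class] += int(passed)
    return sorted(c for c in totals if passes[c] / totals[c] < min_accuracy)


# Hypothetical evaluation run for a small classifier.
results = (
    [("plain_invoice", True)] * 20
    + [("forwarded_thread", True)] * 8
    + [("forwarded_thread", False)] * 2
)
print(failing_classes(results))  # prints ['forwarded_thread']
```

Only the classes this returns are candidates for a larger model, and only if they are classes you care about.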
The bar is acceptable performance and predictable behavior on your task, not a leaderboard ranking.
That usually leads to better systems because it forces architectural discipline.
It separates exploration from execution.
It forces you to define where uncertainty is acceptable and where it is not.
It makes cost and security visible as architecture decisions instead of afterthoughts.
Where bigger models actually belong
None of this means frontier models are useless. It means they are most valuable in places where creativity and broad search are real advantages.
Use them to generate candidate strategies.
Use them to propose designs.
Use them to explore alternatives.
Use them to summarize ambiguous information for a human decision-maker.
Avoid them where the system needs precision and consistency.
That is the split many teams still get wrong. They put the smartest model nearest the point of consequence, when that is exactly where the architecture should become more constrained.
The right default
The right default for agent design is this:
Use the smallest model that does the job.
And where you can avoid a model entirely, do that.
That one design rule improves cost, reliability, and security at the same time. It forces a cleaner separation between creativity and execution, which is where most durable agent systems will be built.
The smartest architecture knows when not to use the smartest model.
References
Atil, B., et al. (2024). Non-Determinism of “Deterministic” LLM Settings. arXiv:2408.04667 (v1 2024, revised 2025). Measures output variance across identical runs for high-performing LLMs even under settings assumed to be deterministic, including temperature 0.
Belcak, P., Heinrich, G., Diao, S., Fu, Y., Dong, X., Muralidharan, S., Lin, Y. C., & Molchanov, P. (2025). Small Language Models are the Future of Agentic AI. NVIDIA Research position paper. arXiv:2506.02153. Argues that small language models are sufficiently capable, more suitable, and more economical for the narrow, repetitive tasks that make up most agentic invocations. As a position paper, it sets out an argued thesis rather than a settled empirical benchmark.
OWASP (2025). LLM01:2025 Prompt Injection. OWASP Top 10 for LLM Applications. genai.owasp.org/llmrisk/llm01-prompt-injection/. Documents prompt injection as the top LLM application risk and notes that attack surface expands with system complexity and tool integration.
Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2024). Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design. ICLR 2024. arXiv:2310.11324. Finds accuracy swings of up to 76 points between semantically equivalent prompt formats.
Further reading: a 2026 review in Information (17:1, 54), Prompt Injection Attacks in Large Language Models and AI Agent Systems, surveys how agent autonomy and tool use expand the attack surface beyond conversational LLMs.
Steve Ciraolo is the founder of Rank One Labs, where he builds agent operations infrastructure for AI-native teams and provides integration assurance for systems that touch real consequences, on-chain or off.