Agentic AI Is Growing Up: Engineering Skills Matter More Than Ever
The Era of Engineering in Agentic AI
The development of Agentic AI systems has been barreling ahead at a dizzying pace for the past couple of years. Beyond the frontier models hosted by OpenAI, Anthropic, Google, and others, there's now a growing ecosystem of high-quality open-weight models that anyone can download to customize and run on their own hardware. While model capabilities have continued to improve, the hardest problems in building production-ready systems have persisted:
Unreliable tool calls with brittle JSON/API handling
Quality degradation from subtle, hard-to-detect drift
Concurrency challenges from workload explosions at scale
Compliance gaps with missing guardrails and auditability
It's tempting to think that smarter models will make these issues go away. But even if LLMs reached human-level competence tomorrow, we'd still face the same challenges: resilience, auditability, and scale. These aren't "AI problems" — they're classic engineering problems all too familiar to seasoned developers.
We're already seeing the ecosystem respond. New frameworks are cropping up that bring classic engineering patterns into the agent world: Embabel applies planning and type-safety to JVM-based development, Guardrails AI constrains model outputs with grammars, and LangSmith focuses on monitoring and continuous evaluation. Taken together, these efforts signal a broader trend: agentic AI is moving out of the demo phase and into an engineering era.
Graduating to Real Workflows
The need for engineering rigor is already clear. Up to now, many agent implementations have been fairly ad-hoc, stitched together for low-stakes tasks like note-taking and calendaring. They've also shown promise in highly supervised settings such as code generation. The next phase, though, is agents embedded in core workflows in industries like healthcare, finance, and enterprise systems, where accuracy, reliability, and compliance aren't optional.
And that shift is already happening. In healthcare, agents drafting notes or checking drug interactions must be deterministic and compliant; a misplaced field or hallucinated contraindication could impact patient safety. In finance, agents parsing SEC filings and compliance manuals can't afford reliability lapses — a missed disclosure or malformed query could lead to fines or losses. Even in enterprise productivity, the challenge is scale: a CRM demo is straightforward, but supporting thousands of concurrent users without performance bottlenecks or runaway cost is not.
This is why engineering discipline matters. Determinism, reliability, compliance, and scale are exactly the kinds of problems traditional engineering has solved before. Now they're the gating factors for making agentic AI useful in the real world. Smarter models alone won't cut it; the path forward is applying proven engineering disciplines to move agents from fragile, ad-hoc beginnings to reliable, production-ready systems.
Garbage In, Garbage Out
Agents don't become valuable just by calling an LLM. They become valuable when they're shaped by the right data, using knowledge injection techniques or continuous feedback loops that teach them how to operate in a specific domain. Without that step, you don't have an "agent"; you have a generic assistant.
The catch is that most of the data we'd like to use is messy:
In CRM systems, sales notes are riddled with typos, shorthand, and inconsistent fields. Fine-tuning on that directly just bakes in noise.
In healthcare, clinical notes often mix structured codes with free-form doctor comments. Without normalization and anonymization, you risk both privacy violations and model confusion.
In finance, SEC filings are packed with boilerplate, duplicate disclosures, and formatting quirks. Without deduplication and alignment, your agent will learn the wrong patterns.
This is where data engineering comes in. Extract, Transform, Load (ETL) is the backbone of modern data lakes and analytics systems, and agents need that same rigor. Clean pipelines separate signal from noise and ensure models learn from the right examples. Without them, agents won't just be undertrained, they'll be brittle and unpredictable in the moments you need them most.
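To make that concrete, here is a minimal sketch of the kind of cleaning step such a pipeline might include: normalizing and deduplicating free-text CRM notes before they feed fine-tuning or retrieval. The field contents and heuristics are hypothetical; a real pipeline would add domain-specific normalization, anonymization, and near-duplicate detection.

```python
import hashlib
import re

def normalize_note(raw: str) -> str:
    """Strip markup remnants and collapse whitespace in a free-text note."""
    text = re.sub(r"<[^>]+>", " ", raw)       # drop stray HTML tags
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

def dedupe(notes: list[str]) -> list[str]:
    """Drop exact duplicates by content hash; a real pipeline would also catch near-duplicates."""
    seen: set[str] = set()
    unique = []
    for note in notes:
        digest = hashlib.sha256(note.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(note)
    return unique

# Hypothetical usage: raw CRM notes in, cleaned examples out.
raw_notes = ["Met w/ ACME  re: renewal <b>Q3</b>", "Met w/ ACME re: renewal Q3"]
cleaned = dedupe([normalize_note(n) for n in raw_notes])
print(cleaned)  # one normalized note instead of two noisy variants
```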
Scaling Up Without Melting Down
Scale changes everything. An agent that looks simple in isolation becomes a distributed systems challenge when hundreds or thousands run at once, each branching into dozens of model calls, retrieval steps, and API requests. What feels straightforward in a prototype quickly turns into problems of concurrency, synchronization, and resource allocation in production.
GPUs are a necessary and expensive resource for running agents, which makes efficient scheduling critical. Without it, GPU resources end up underutilized or misallocated, wasting money or slowing response times. Agent platforms need to make constant trade-offs: which jobs run now, which can wait, which can be routed to a smaller or cheaper model, and how to keep critical tasks from starving under load. In that sense, orchestrating agents looks less like handling simple request/response traffic and more like query planning in databases: the high-level intent may be clear, but translating it into an efficient execution plan requires careful optimization.
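As a sketch of that trade-off, here is a hypothetical router that reserves the expensive GPU-backed tier for critical work and sends small, low-stakes jobs to a cheaper model. The model names, priority scale, and thresholds are all assumptions; a production scheduler would also account for queue depth, latency budgets, and GPU utilization.

```python
from dataclasses import dataclass

@dataclass
class AgentJob:
    prompt: str
    priority: int     # hypothetical scale: 0 = best-effort, 2 = critical
    est_tokens: int   # rough size estimate from the planner

def route(job: AgentJob) -> str:
    """Pick a model tier so critical work gets GPU capacity and cheap work stays cheap."""
    if job.priority >= 2:
        return "large-gpu-model"   # placeholder name for the expensive tier
    if job.est_tokens < 500:
        return "small-model"       # short, low-stakes jobs go to a cheaper tier
    return "medium-model"          # everything else lands in the middle

# Usage: the orchestrator asks for a tier before dispatching each model call.
print(route(AgentJob("summarize meeting notes", priority=0, est_tokens=300)))
```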
And efficiency is only half the battle; reliability matters just as much. A single flaky tool call shouldn't bring down an entire workflow. Agents need checkpointing, retries, and execution plans that can adapt dynamically—rerouting around failures, swapping in backup models, or rebalancing workloads on the fly.
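One way to make checkpointing concrete: persist state after each completed step so a failed workflow can resume instead of restarting from scratch. The sketch below is illustrative only; the step names are hypothetical, and a production system would use a durable store rather than a local file.

```python
import json
from pathlib import Path
from typing import Callable

CHECKPOINT = Path("workflow_checkpoint.json")  # stand-in for a durable store

def run_workflow(steps: dict[str, Callable[[dict], object]]) -> dict:
    """Run named steps in order, skipping any that already completed in a prior run."""
    state = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    for name, step in steps.items():
        if name in state:
            continue               # finished before the last failure, so skip it
        state[name] = step(state)  # may raise; earlier progress is already persisted
        CHECKPOINT.write_text(json.dumps(state))
    return state

# Hypothetical three-step agent workflow.
steps = {
    "retrieve": lambda s: "docs",
    "draft":    lambda s: f"summary of {s['retrieve']}",
    "file":     lambda s: "ticket-123",
}
run_workflow(steps)
```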
Close Doesn't Count
One of the most fragile seams in agentic systems is the handoff from unstructured model output to structured systems. APIs, databases, and enterprise apps don't want prose; they want valid JSON, SQL, or parameters. Left unconstrained, LLMs are notorious for getting this almost right: an extra comma in JSON, a missing field in an API call, a hallucinated column in a query. In healthcare, for example, an invalid field in an HL7/FHIR record can prevent a note from being entered into the EHR. In all these cases, "almost right" is still broken.
The fix isn't a clever prompt, though; it's engineering. We've solved this problem before in compilers and programming languages. The way to make code safe and predictable was to enforce a grammar. The same principle applies here: structured inference constrains model outputs to follow a context-free grammar (CFG) or schema so that every response is guaranteed to be valid.
This is where classic skills come back into play: language design, parser generators, schema validation, type safety. Agent builders are rediscovering that structure isn't optional; it's the guardrail that makes automation dependable. Tools like Outlines or Guardrails AI repackage those lessons for the LLM era by enforcing grammar adherence at inference time.
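A minimal version of that guardrail is shown below: schema validation with a bounded re-prompt loop using Pydantic, rather than true grammar-constrained decoding of the kind Outlines or Guardrails AI apply at inference time. The schema fields and the call_llm function are hypothetical stand-ins.

```python
from pydantic import BaseModel, ValidationError

class DrugInteractionCheck(BaseModel):
    """Schema the downstream system accepts; the fields here are hypothetical."""
    patient_id: str
    drug_a: str
    drug_b: str
    interaction_found: bool

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call that should return JSON text."""
    raise NotImplementedError

def structured_call(prompt: str, retries: int = 2) -> DrugInteractionCheck:
    """Never pass unvalidated output downstream; re-prompt on schema violations."""
    for _ in range(retries + 1):
        raw = call_llm(prompt)
        try:
            return DrugInteractionCheck.model_validate_json(raw)
        except ValidationError as err:
            prompt = f"{prompt}\n\nYour last answer was invalid: {err}. Return valid JSON only."
    raise ValueError("model never produced schema-valid output")
```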
Learning to Fail Gracefully
One of the clearest lessons from decades of running production-grade distributed systems is that failures are inevitable. APIs go down, network connectivity becomes flaky, databases return something unexpected. The job isn't to prevent every failure — that's impossible — but to design systems that recover gracefully when things break.
The same mindset applies to agents. Early implementations often pass initial tests but falter the moment a tool times out or an API misbehaves. In real use cases, that brittleness won't cut it.
Reliability engineering gives us the playbook:
Retries with backoff so transient errors don't spiral.
Circuit breakers to keep repeated failures from cascading.
Checkpointing and rollback so an agent can pick up where it left off.
Graceful degradation — falling back to a simpler path when necessary.
These are the tools that help agents recover in the moment. When the world throws an error, resilience keeps the workflow moving.
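As a minimal sketch of two items from that playbook, retries with exponential backoff plus graceful degradation, consider the snippet below; the tool functions, timing constants, and fallback payload are hypothetical.

```python
import time

def call_with_resilience(primary, fallback, max_retries: int = 3, base_delay: float = 0.5):
    """Retry a flaky tool with exponential backoff, then degrade to a simpler path."""
    delay = base_delay
    for _ in range(max_retries):
        try:
            return primary()
        except Exception:
            time.sleep(delay)  # back off so transient errors don't spiral
            delay *= 2
    # Graceful degradation: a reduced but useful answer beats a crashed workflow.
    return fallback()

def formulary_lookup():
    raise TimeoutError("formulary service timed out")  # simulated outage

# Hypothetical usage: the note still gets written, with the gap flagged for review.
result = call_with_resilience(
    primary=formulary_lookup,
    fallback=lambda: {"status": "lookup_unavailable", "note": "flagged for human review"},
)
print(result)
```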
The lesson for agent builders is that resilience matters as much as intelligence. An agent that can handle bumps in the road without derailing is one people will actually trust. For example, a healthcare assistant that times out on a drug formulary lookup should still be able to complete the patient note, flagging the gap rather than abandoning the workflow entirely; in e-commerce, a checkout agent should retry a payment gateway or escalate gracefully to a human rather than leaving the customer stranded.
Staying on Track
But recovery and resilience only cover failures you can identify right away. Another challenge is catching the slower-moving degradation that sneaks in over weeks or months. That's where monitoring and continuous evaluation come in.
Even if an agent performs flawlessly today, it won't stay that way forever. Context changes. APIs update, data drifts, and user expectations evolve. Left alone, an agent that once felt sharp can become brittle or (perhaps worse) confidently wrong. The risk is even higher for long-running agents that stay active across days or weeks. As they accumulate context and interact with shifting systems, subtle misalignments compound, and without oversight they can drift far from their intended behavior.
That's why continuous evaluation is critical. Updating agents with new data or fine-tuning gives them new skills, but it doesn't guarantee stability. Without monitoring, you won't know when performance has slipped, and without testing, regressions reach production. Evaluation is the only way to tell whether new data helped or hurt.
The parallels to classic software are clear. For decades we've had unit tests, canary deployments, and observability stacks. Agentic AI needs the same discipline, but adapted for systems that can change their behavior daily. That means:
Testing and regression checks for prompts, toolchains, and workflows (a sketch follows this list).
Canary agents to safely trial updates in production.
Telemetry and evaluation loops to log decisions, monitor quality, and measure accuracy, safety, and compliance.
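As one hedged sketch of the first item: a small golden-set regression check that can run in CI before an updated prompt or model ships. The dataset, scoring rule, and threshold are all hypothetical; real evaluations would use richer scoring than substring checks.

```python
# Hypothetical golden set: inputs paired with facts the agent's output must preserve.
GOLDEN_CASES = [
    {"input": "Summarize filing X", "must_contain": ["net revenue", "risk factors"]},
    {"input": "Summarize filing Y", "must_contain": ["material weakness"]},
]

def run_agent(prompt: str) -> str:
    """Placeholder for the agent under test."""
    raise NotImplementedError

def regression_check(threshold: float = 0.95) -> bool:
    """Block the rollout if the pass rate on the golden set drops below the threshold."""
    passed = 0
    for case in GOLDEN_CASES:
        output = run_agent(case["input"]).lower()
        if all(term in output for term in case["must_contain"]):
            passed += 1
    return passed / len(GOLDEN_CASES) >= threshold
```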
Consider finance: an agent that analyzes SEC filings might perform well in Q1, but by Q2, reporting formats shift, disclosure rules change, and the model starts missing critical signals. Without monitoring and evaluation, those failures often surface only after the damage is done.
Continuous learning makes this discipline even more important. With the right monitoring in place, agents adapt responsibly to change while staying within guardrails you can measure and trust.
Engineering the Next Leap
If there's one theme that cuts across all of this, it's that the hardest problems in agentic AI aren't really "AI problems". They're engineering problems. And while models will keep improving, the real breakthroughs in making agents production-ready will come from the teams that build around them.
Making this work draws on a broad mix of engineering instincts. It starts with data engineering, cleaning up messy notes and logs so agents have something reliable to learn from. It extends into distributed systems, where workloads must be scaled across costly GPU clusters without waste. Compiler and language expertise ensures that outputs follow the right structure, while reliability engineering builds systems that assume failure and recover gracefully. Monitoring and evaluation provide the feedback loop that keeps agents sharp over time. And tying it all together is orchestration, which looks a lot like query planning in databases — translating high-level intent into efficient execution strategies.
You don't need to staff every specialty overnight. But if you're serious about agentic AI, you do need to start thinking this way. The next leap won't come from bigger models alone. Agentic AI is growing up, and engineering is how it comes of age.