AI agents can do impressive things in a controlled demo. They can answer questions, take actions, call tools, and move through long workflows without much friction. The hard part starts when those same agents meet real users, messy data, edge cases, and production traffic.
That is where reliability becomes the real test.
Jonathan Chávez built ZeroEval around that exact problem. Instead of treating AI quality as something teams check once before launch, ZeroEval is built around the idea that agent performance has to be measured, reviewed, and improved continuously. In simple terms, the company helps teams understand what their AI agents are doing, where they fail, and how to make them better over time.
That makes ZeroEval more than another AI tooling startup. It sits in an important part of the stack that more teams are now discovering they need. As AI products move from novelty to real business workflows, the difference between an interesting agent and a dependable one becomes huge. Jonathan Chávez saw that gap early and turned it into a company with strong momentum.
Who Is Jonathan Chávez?
Jonathan Chávez is the co-founder of ZeroEval, a startup focused on helping companies build more reliable AI agents. Before starting the company, he worked as an early employee on Datadog’s LLM Observability team. That background matters because observability gave him a close look at what happens once AI systems leave the lab and start running in production.
He also brought a technical foundation that goes deeper than startup buzz. His earlier work included research on vision transformers for particle physics and reinforcement learning for robotics, along with engineering roles at fast-moving startups. That mix of research thinking and practical engineering shows up clearly in ZeroEval’s product direction.
Jonathan did not approach AI reliability as a vague industry trend. He came into it from the side of real systems, real debugging, and real performance issues. That is a big reason ZeroEval feels tied to an actual need rather than a fashionable label.
What ZeroEval Is Building
ZeroEval is building what you could call a self-improving layer for AI agents. The platform is designed to help teams trace agent behavior, evaluate outputs, collect feedback, and optimize prompts and model configurations based on what is happening in real usage.
That sounds technical, but the underlying problem is easy to understand.
Most AI teams can get an agent working well enough for internal testing. What becomes much harder is keeping quality high once the agent handles long conversations, multiple tool calls, changing prompts, production users, and a growing list of edge cases. Teams quickly realize that manual reviews are slow, static evaluations age badly, and one-off prompt tweaks do not solve the deeper issue.
ZeroEval is built to close that gap.
Its mission centers on self-improving AI agents, which means the system is not only watching outputs but helping teams create a feedback loop around them. Instead of guessing why agent performance dropped, developers can inspect traces, attach human feedback, run calibrated judges, and improve the prompt or model setup from a much more informed position.
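To make that loop concrete, here is a deliberately small sketch in plain Python. ZeroEval's actual interfaces are not shown here; names like run_agent, judge_score, human_labels, and revise_prompt are hypothetical stand-ins for the stages described above.

```python
# A deliberately tiny sketch of the feedback loop described above,
# written in plain Python. None of these names come from ZeroEval;
# they are placeholders for the stages: trace, judge, feedback, revise.

def improvement_cycle(run_agent, judge_score, human_labels,
                      revise_prompt, prompt, inputs):
    """One pass of trace -> evaluate -> feedback -> optimize."""
    traces = [run_agent(prompt, x) for x in inputs]       # capture behavior
    scores = [judge_score(t) for t in traces]             # automated evaluation
    labels = human_labels(traces)                         # human spot checks
    return revise_prompt(prompt, traces, scores, labels)  # next version

# Toy wiring so the cycle runs end to end:
next_prompt = improvement_cycle(
    run_agent=lambda p, x: f"{p} -> answered {x}",
    judge_score=lambda t: 0.8 if "answered" in t else 0.0,
    human_labels=lambda ts: [True for _ in ts],
    revise_prompt=lambda p, ts, ss, ls: p + " (v2)",
    prompt="Be concise.",
    inputs=["q1", "q2"],
)
print(next_prompt)  # Be concise. (v2)
```

The point of the sketch is only the shape of the cycle: behavior gets captured, scored, checked against humans, and fed back into the next prompt version rather than discarded.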
Why AI Agent Reliability Became Such a Big Problem
The rise of AI agents has changed the nature of AI quality control. Traditional chatbot testing was already imperfect, but agents create a bigger challenge because they often take multiple steps before producing a final result. They may search, summarize, reason, generate structured output, and call external tools in the same workflow.
That creates more room for failure.
Sometimes the model gives a weak answer. Sometimes it chooses the wrong tool. Sometimes it completes the task in a way that looks correct on the surface but misses what the user actually needed. In other cases, the output is technically valid yet still poor in tone, usefulness, or accuracy. These failures are not always obvious from a simple pass or fail test.
This is why the conversation around AI observability, evaluation frameworks, production traces, and human-in-the-loop review has become much more important. Teams are no longer asking only whether the model works. They are asking whether the system is reliable enough for real workflows.
That shift created the opening for companies like ZeroEval.
How Jonathan Chávez Turned Observability Experience Into a Startup Idea
Jonathan Chávez’s background at Datadog helps explain why ZeroEval took the shape it did. Observability is incredibly useful because it tells teams what happened inside a system. It gives visibility into behavior, latency, errors, and other signals that become critical once software runs in production.
But visibility alone is not enough for AI agents.
A team can see that an agent failed, but that does not automatically tell them how to measure quality, compare outputs, calibrate evaluation criteria, or improve the next version. That is where ZeroEval moves beyond plain monitoring.
The company brings observability together with evaluation and optimization. It gives teams a way to capture what the agent is doing, score how well it is doing it, and then feed those learnings back into the product. That is a much stronger loop than simply logging model calls and hoping engineers figure the rest out later.
In that sense, ZeroEval feels like a natural next step from LLM observability. Jonathan Chávez seems to have recognized that the market needed more than dashboards: it needed a reliability layer built specifically for AI behavior.
The Core Features That Make ZeroEval Stand Out
One of the reasons ZeroEval has attracted attention is that it tackles AI agent reliability from several angles at once. Instead of narrowing the product to a single feature, it connects tracing, judges, feedback, and prompt optimization into one workflow.
Tracing and Monitoring
At the foundation is monitoring and tracing. ZeroEval helps teams track costs, latency, errors, sessions, and traces so they can see what happened during an agent interaction. That matters because AI failures are rarely isolated to one final answer. The issue might have started several steps earlier through a bad tool call, a weak prompt version, or an unexpected handoff.
Tracing gives teams operational visibility into agent behavior. It helps with debugging workflows, error analysis, and performance review. Instead of treating the model like a black box, teams can inspect the path it took.
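As an illustration of what a trace can capture, here is a minimal, stdlib-only sketch. It is not ZeroEval's schema; Span and Trace are hypothetical names, but they show how each step's timing, cost, and errors can be recorded so the agent's path can be reconstructed later.

```python
# A minimal, hypothetical trace model (Python stdlib only). This is
# not ZeroEval's schema; it just shows the kind of structure that lets
# a team reconstruct what an agent did, step by step.
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str                 # e.g. "tool:web_search" or "llm:draft_answer"
    duration_ms: float
    cost_usd: float = 0.0
    error: str | None = None  # populated if this step raised

@dataclass
class Trace:
    session_id: str
    spans: list[Span] = field(default_factory=list)

    def record(self, name: str, fn, cost_usd: float = 0.0):
        """Run one agent step, capturing timing, cost, and any error."""
        start = time.monotonic()
        error = None
        try:
            return fn()
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self.spans.append(Span(name, elapsed_ms, cost_usd, error))

trace = Trace(session_id="demo-1")
trace.record("tool:lookup", lambda: {"status": "ok"}, cost_usd=0.0003)
print(trace.spans[0].name, trace.spans[0].error)  # tool:lookup None
```

Even a structure this simple is enough to answer the debugging questions the section describes: which step failed, how long each step took, and where the cost went.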
Calibrated Judges
ZeroEval also focuses heavily on judges, which are AI-based evaluators used to score outputs automatically. Many teams already understand that manual review does not scale, but static evaluation setups often become brittle and inconsistent. ZeroEval’s approach is built around calibrated judges that can get better over time.
That idea matters because evaluation quality is everything in this category. If the judge is weak, the feedback loop becomes noisy. If the judge tracks human preferences more closely, teams get a far more useful signal. This is one of ZeroEval’s strongest positioning points in the AI evaluation space.
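One simple way to think about calibration, sketched below, is choosing the judge's pass threshold that agrees most often with human verdicts on the same outputs. This is a toy illustration, not ZeroEval's method; calibrate_threshold and the sample data are invented for the example.

```python
# Toy illustration of judge calibration (not ZeroEval's method): pick
# the score threshold at which the judge agrees most often with human
# pass/fail verdicts on the same outputs.

def calibrate_threshold(judge_scores: list[float],
                        human_pass: list[bool]) -> float:
    """Return the cutoff that maximizes agreement with human reviewers."""
    best_cut, best_agree = 0.5, -1
    for cut in (i / 100 for i in range(101)):
        agree = sum((s >= cut) == h for s, h in zip(judge_scores, human_pass))
        if agree > best_agree:
            best_cut, best_agree = cut, agree
    return best_cut

# Judge scores for five outputs vs. human verdicts on the same outputs.
scores = [0.92, 0.40, 0.75, 0.55, 0.18]
humans = [True, False, True, False, False]
print(calibrate_threshold(scores, humans))  # 0.56: full agreement here
```

Real calibration of an LLM-based judge involves much more than a threshold, but the principle is the same: the judge is tuned against human labels until its signal is trustworthy enough to act on.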
Human and AI Feedback Loops
Another major part of the platform is feedback collection. Human reviewers and end users can attach signals like thumbs up, thumbs down, ratings, or corrections. Judges can also add automated feedback based on custom criteria.
This turns vague quality opinions into measurable quality signals.
Once feedback is attached to traces, spans, or completions, teams can start finding patterns instead of relying on instinct. They can see where output quality drops, which workflows fail more often, and what kinds of prompts lead to stronger results. That makes continuous improvement feel a lot less random.
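A rough sketch of that idea: once feedback is a structured signal tied to a trace and a workflow, finding weak spots becomes an aggregation problem. The Feedback class and failure_hotspots function below are hypothetical names invented for this example, not ZeroEval's API.

```python
# Sketch (not ZeroEval's API): attaching feedback signals to traces and
# aggregating them so quality problems show up as patterns, not anecdotes.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Feedback:
    trace_id: str
    workflow: str        # e.g. "support_triage", "code_review"
    source: str          # "user", "reviewer", or "judge"
    score: float         # normalized: 0.0 (bad) to 1.0 (good)

def failure_hotspots(feedback: list[Feedback], floor: float = 0.5):
    """Group feedback by workflow and surface the weakest ones."""
    by_workflow: dict[str, list[float]] = defaultdict(list)
    for fb in feedback:
        by_workflow[fb.workflow].append(fb.score)
    averages = {wf: sum(s) / len(s) for wf, s in by_workflow.items()}
    return {wf: avg for wf, avg in averages.items() if avg < floor}

signals = [
    Feedback("t1", "support_triage", "user", 0.0),    # thumbs down
    Feedback("t2", "support_triage", "judge", 0.4),
    Feedback("t3", "code_review", "reviewer", 1.0),   # thumbs up
]
print(failure_hotspots(signals))  # {'support_triage': 0.2}
```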
Prompt Optimization and Version Control
Prompt changes are one of the easiest ways to influence agent behavior, but they can also create confusion fast. Teams often test multiple prompts without a clean system for prompt versioning, comparison, and deployment.
ZeroEval addresses that by tracking prompt versions and linking completions back to the exact prompt that produced them. From there, teams can use production feedback to optimize prompts and model configurations more intelligently.
This is important because prompt tuning on its own is often treated like trial and error. ZeroEval pushes it closer to a repeatable optimization process.
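To illustrate the mechanics, the sketch below content-addresses each prompt version and links every completion to the version that produced it, so feedback can be averaged per version instead of compared from memory. These names are invented for the example and do not reflect ZeroEval's implementation.

```python
# Hypothetical sketch of prompt version tracking: every completion is
# linked to the exact prompt version that produced it, so feedback can
# be compared version by version instead of guessed at.
import hashlib
from dataclasses import dataclass

@dataclass
class PromptVersion:
    text: str

    @property
    def version_id(self) -> str:
        # Content-addressed ID: the same prompt text always maps to
        # the same version, which makes comparisons reproducible.
        return hashlib.sha256(self.text.encode()).hexdigest()[:8]

@dataclass
class Completion:
    prompt_version_id: str
    output: str
    score: float   # from a judge or human feedback

def best_version(completions: list[Completion]) -> str:
    """Average feedback per prompt version; return the strongest one."""
    totals: dict[str, list[float]] = {}
    for c in completions:
        totals.setdefault(c.prompt_version_id, []).append(c.score)
    return max(totals, key=lambda v: sum(totals[v]) / len(totals[v]))

v1 = PromptVersion("You are a terse support agent.")
v2 = PromptVersion("You are a thorough support agent. Cite sources.")
runs = [Completion(v1.version_id, "...", 0.6),
        Completion(v2.version_id, "...", 0.9)]
print(best_version(runs) == v2.version_id)  # True
```

The design choice worth noting is the link itself: once every completion carries its prompt version, "which prompt is better" stops being an opinion and becomes a query.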
Why ZeroEval Fits the Market Right Now
The timing behind ZeroEval makes a lot of sense. The AI market has moved past the stage where companies only want a flashy chatbot on a landing page. More teams are now building production AI systems that have to support internal operations, customer support, research workflows, coding tasks, and structured decision-making.
As soon as those systems matter to the business, reliability becomes non-negotiable.
That is why topics like AI agent monitoring, trace-based evaluation, human feedback, model behavior analysis, and production readiness for AI are becoming central rather than optional. The more complex these agents get, the more teams need infrastructure that helps them understand and improve performance in a disciplined way.
ZeroEval is well positioned for that shift because it speaks directly to the pain teams feel after launch. It is not selling a futuristic promise disconnected from daily work. It is solving the very practical question of how to make AI agents dependable enough to trust.
Early Momentum and External Validation
ZeroEval’s early traction also adds weight to the story. The company was part of Y Combinator’s Summer 2025 batch, which gave it an important credibility signal early on. That kind of backing does not prove a business on its own, but it does show that experienced investors saw something strong in the founders and the problem they were tackling.
Jonathan Chávez and co-founder Sebastian Crossa also came into ZeroEval with real startup and engineering experience. They had built products together before launching the company and were already familiar with fast product cycles. Their earlier work on llm-stats.com, which reached a meaningful audience in a short period, suggested they understood both the market and the developer side of the AI ecosystem.
That matters because AI infrastructure companies often win by reading the problem clearly before the broader market fully catches up. ZeroEval appears to be doing exactly that.
What Jonathan Chávez’s Success Really Looks Like
Jonathan Chávez’s success with ZeroEval is not just about founding another AI startup at the right time. It is about identifying a painful, unglamorous, and increasingly necessary layer in the AI stack.
A lot of attention in AI still goes to model launches, big consumer apps, and flashy demos. Much less attention goes to the systems that make those products reliable enough to survive real use. That is where ZeroEval sits. It helps transform raw AI capability into something more measurable, manageable, and useful in production.
That is also why the company stands out.
Jonathan Chávez did not build ZeroEval around a broad claim that AI will change everything. He built it around a narrower and more important truth: AI agents are only valuable when they work consistently enough for people to rely on them. The companies that solve that problem well will shape a big part of the next AI infrastructure layer.
In that sense, ZeroEval’s progress says a lot about Jonathan Chávez as a founder. He saw where AI products were breaking, connected that problem to his experience in LLM observability, and built a company around making agent reliability a practical, ongoing system instead of an afterthought.