
How to Evaluate if Your AI in Production Is Actually Working

Most companies ship AI without knowing how to measure if it's working. This guide turns technical monitoring into strategic decision-making.


You shipped an AI feature. The team celebrated. The board loved the narrative. Three months later, someone asks: “Is it actually working?”

And you realize you have no idea how to answer.

This is the most common scenario in Brazilian companies adopting AI right now. Fast implementation, zero monitoring. The problem isn’t technical. It’s product: nobody defined what “working” means in this context.

Why AI monitoring is different

Traditional software has a comforting characteristic: if the code doesn’t change, the behavior doesn’t change. A bug that existed yesterday exists today. A feature that worked keeps working.

AI doesn’t operate that way.

Language models and machine learning systems produce probabilistic outputs: the same input can generate different outputs. More importantly, the context around the model changes constantly. Users ask different questions. Input data evolves. The world shifts.

Anthropic, in their published playbook on AI in production, makes a point that seems obvious but is rarely treated with the seriousness it deserves: AI systems need continuous evaluation, not just validation at launch.

What actually matters to measure

The temptation is to measure everything. Latency, tokens, cost per request, error rate. That data is useful, but it doesn’t answer the central question: is the AI delivering value to the user and the business?

For a PM, the metrics that matter are the ones connecting model behavior to product outcome.

Output quality metrics

Before thinking about volume or cost, you need to know if what the AI delivers is good. This means defining “good” concretely for your use case.

Some common criteria:

  • Relevance: Does the response address what the user asked?
  • Completeness: Is important information missing?
  • Factual accuracy: When applicable, is the information correct?
  • Appropriate tone: Is the communication style aligned with the brand?

None of these criteria are measured automatically by a system metric. They require human evaluation or custom evaluation systems (LLM-as-judge, for example).

Engagement metrics

If the AI is in a user-facing interface, you can measure what happens after the interaction:

  • Suggestion acceptance rate
  • Rate of edits to generated responses
  • Time until action after receiving the response
  • Abandonment rate mid-flow

An AI generating responses that get ignored isn’t working, even if it’s technically producing output.
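If those events are already instrumented, deriving the metrics is straightforward. A minimal sketch, assuming a hypothetical event log with `suggestion_shown`, `suggestion_accepted`, and `suggestion_edited` events (the names and schema are illustrative, not a standard):

```python
# A sketch of engagement metrics derived from an event log.
# Event names and schema are assumptions; map them to whatever
# your product actually tracks.
import pandas as pd

events = pd.DataFrame([
    {"session": "a1", "event": "suggestion_shown"},
    {"session": "a1", "event": "suggestion_accepted"},
    {"session": "b2", "event": "suggestion_shown"},
    {"session": "b2", "event": "suggestion_edited"},
    {"session": "c3", "event": "suggestion_shown"},  # ignored: no follow-up event
])

shown = (events["event"] == "suggestion_shown").sum()
accepted = (events["event"] == "suggestion_accepted").sum()
edited = (events["event"] == "suggestion_edited").sum()

print(f"acceptance rate: {accepted / shown:.0%}")  # 33%
print(f"edit rate:       {edited / shown:.0%}")    # 33%
```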

Business impact metrics

This is where investment justification lives:

  • Reduction in time spent on tasks the AI automates
  • Decrease in support tickets or calls
  • Conversion increase in AI-assisted flows
  • Cost avoided through process automation

If you can’t connect the AI feature to at least one business metric, it’s an experiment, not a product.

Drift: when the model degrades on its own

Drift is the technical term for when a model starts performing worse over time, even without code changes.

Two main types exist:

Data drift: The profile of input data changes. If you trained a chatbot on 2023 questions and now users ask about 2025 topics, the model may not handle the new context well.

Concept drift: The relationship between input and expected output shifts. What was a good answer six months ago might not be anymore.

Without drift detection

  • Problems surface through complaints
  • Reactive, slow investigation
  • Decisions based on intuition

With drift detection

  • Alerts before user impact
  • Diagnosis with structured data
  • Decisions based on trends

To detect drift, you need two things: a performance baseline defined at launch, and regular measurements compared against that baseline.
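A lightweight way to operationalize that comparison is the Population Stability Index (PSI), which quantifies how much a metric’s distribution has shifted relative to the baseline. A minimal sketch; the bin count and the commonly cited thresholds (below 0.1 stable, above 0.25 significant drift) are conventions to tune, not rules:

```python
# Population Stability Index (PSI) between a launch-time baseline
# and a recent window of the same metric. A sketch; bins and
# thresholds are illustrative conventions.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the baseline's quantiles so both samples
    # are compared on the same scale.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values

    base_pct = np.histogram(baseline, edges)[0] / len(baseline)
    cur_pct = np.histogram(current, edges)[0] / len(current)

    # Clip to a small epsilon to avoid log(0) in empty bins.
    base_pct = np.clip(base_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - base_pct) * np.log(cur_pct / base_pct)))

# Example: this week's quality scores vs. the launch baseline.
rng = np.random.default_rng(0)
baseline_scores = rng.normal(4.2, 0.5, 1_000)
current_scores = rng.normal(3.8, 0.6, 1_000)
print(f"PSI = {psi(baseline_scores, current_scores):.3f}")  # > 0.25: investigate
```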

Practical monitoring framework

This framework doesn’t require sophisticated tools. It requires clarity on what to observe and discipline to do it regularly.

Level 1: Operational health

Measured automatically, reviewed weekly:

  • Are average and 95th-percentile latency within the acceptable range?
  • Is error rate below the defined threshold?
  • Is cost per request within budget?
  • Is usage volume at expected levels?
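These checks are simple enough to script against a request log. A sketch with illustrative thresholds; the numbers are assumptions to set from your own baseline, not recommendations:

```python
# Weekly operational health check. Thresholds and sample data are
# illustrative assumptions, not recommendations.
import numpy as np

latencies_ms = np.array([420, 510, 380, 2900, 450, 610, 495])  # sampled requests
errors, total_requests = 3, 1_000
cost_usd, budget_usd = 41.50, 60.00

checks = {
    "p95 latency < 2000 ms": np.percentile(latencies_ms, 95) < 2000,
    "error rate < 1%": errors / total_requests < 0.01,
    "cost within budget": cost_usd <= budget_usd,
}
for name, ok in checks.items():
    print(f"{'OK  ' if ok else 'FAIL'} {name}")
```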

Level 2: Output quality

Measured by sampling, reviewed bi-weekly:

  • Has a sample of N responses been manually evaluated?
  • Is the average quality score stable or improving?
  • Are the most common error types mapped?
  • Are there degradation patterns in specific use cases?

Level 3: Business impact

Measured monthly, reported to stakeholders:

  • Is the primary business metric being positively impacted?
  • Is the estimated ROI of the AI system positive?
  • Is qualitative user feedback being collected?
  • Is the pre-AI baseline comparison up to date?
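The ROI estimate itself is simple arithmetic: value created minus operating cost, divided by operating cost. A worked example, where every number is an assumption for illustration:

```python
# A worked ROI estimate. All numbers here are assumptions;
# substitute your own measurements.
hours_saved_per_month = 320        # e.g., support replies auto-drafted
loaded_hourly_cost = 25.0          # USD per hour, fully loaded
value_created = hours_saved_per_month * loaded_hourly_cost   # 8,000 USD

monthly_cost = 1_200.0             # inference + monitoring + maintenance
roi = (value_created - monthly_cost) / monthly_cost
print(f"monthly ROI: {roi:.0%}")   # ~567%
```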

Automated evaluation with LLM-as-judge

A technique Anthropic details in their playbook is using one language model to evaluate outputs from another model. This doesn’t replace human evaluation, but it scales your ability to catch problems.

The concept is straightforward: you define quality criteria, create prompts instructing an evaluator model, and run automated evaluations on production samples.

It works well for objective criteria (does the response contain information X?), reasonably well for subjective criteria (is the response useful?), and poorly for brand nuance and tone, which require human context.

Practical recommendation: use LLM-as-judge for scaled screening, and human evaluation for calibration and edge cases.
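Here is a minimal sketch of that screening step using the Anthropic Python SDK. The model ID, rubric, and 1–5 scale are illustrative placeholders; calibrate whatever rubric you use against human evaluations before trusting the scores:

```python
# LLM-as-judge screening sketch using the Anthropic Python SDK.
# The model ID, rubric, and scale are placeholders to adapt.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = (
    "You are evaluating an AI assistant's response. Score each criterion "
    "from 1 (poor) to 5 (excellent). Reply with JSON only, in the form "
    '{"relevance": n, "completeness": n, "accuracy": n}.'
)

def judge(question: str, answer: str) -> dict:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; pin your own model
        max_tokens=200,
        system=RUBRIC,
        messages=[{
            "role": "user",
            "content": f"Question:\n{question}\n\nResponse:\n{answer}",
        }],
    )
    # A sketch: production code should validate that the reply is valid JSON.
    return json.loads(message.content[0].text)

# Run over a random sample of production traffic, not every request.
print(judge("What is our refund window?", "Refunds are accepted within 30 days."))
```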

The most common mistake: measuring only when there’s a problem

If you’re launching AI now, define metrics before production. If you already shipped without metrics, start collecting now. You won’t be able to prove future improvement without a documented starting point.

How to justify monitoring investment

The cost of monitoring AI is a fraction of the cost of operating AI. But that argument doesn’t always convince.

The argument that works better: monitoring transforms AI from recurring cost into measurable asset.

Without monitoring, you have two options: believe it’s working, or wait until it visibly breaks. With monitoring, you can:

  • Demonstrate ROI with data
  • Anticipate problems before impact
  • Justify additional investments with evidence
  • Discontinue features that aren’t delivering

Starting with what you have

You don’t need an MLOps platform to begin. You need discipline.

Start with:

  1. A spreadsheet with weekly operational health metrics
  2. A bi-weekly routine of manual sample evaluation
  3. A monthly report connecting AI usage to business metrics

That already puts you ahead of 80% of companies running AI in Brazil today.

The question you need to answer isn’t “Is the AI running?” It’s “Is the AI worth what it costs?” Only structured metrics answer that. And only PMs have the business vision needed to define which metrics matter.


Author

Raphael Pereira

Designer & strategist focused on performance-led digital experiences.
