A lot of companies shipped AI into their products over the last two years. Chatbots, recommendations, automated classification, content generation. The deploy happened. The model is running. The feature exists.
The problem is what comes after. Or more precisely: what doesn’t.
Most teams treat AI monitoring as a purely engineering responsibility. Infrastructure metrics, API latency, error rates. Things the product team never looks at. Meanwhile, the real quality of what the AI delivers to users might be degrading silently.
The model in production isn’t the same model you tested
When you test a model before deploy, you’re validating performance under controlled conditions. Clean data, predicted scenarios, stable environment.
Production is different. Data changes. User behavior changes. Context changes. And the model, which was static, starts getting outdated relative to the reality it’s supposed to represent.
This has a technical name: drift. It can be data drift (the distribution of inputs shifts), concept drift (the relationship between input and expected output changes), or both. In practice, it means your recommendation model that worked great in January might be suggesting irrelevant stuff by July, with no visible technical errors.
The point that matters for product: you won’t find this by checking API latency. You’ll find it when conversion rate drops, when NPS falls, when support starts getting complaints about “weird suggestions.”
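There is, though, a way to see it earlier: compare the distribution of production inputs against a reference window. Here's a minimal sketch of data drift detection using the Population Stability Index, a common drift measure (the function name and thresholds are illustrative, not tied to any specific tool):

```python
import numpy as np

def population_stability_index(reference, production, bins=10):
    """Compare the distribution of one numeric feature between a
    reference window (e.g. training data) and recent production data.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
    """
    # Bin edges come from the reference window so both samples are comparable
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Clip empty bins so the log term stays defined
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))
```

Run it per feature on a schedule. A climbing PSI on a key input is exactly the kind of signal that shows up weeks before conversion does.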
Why AI monitoring is a product problem, not just engineering
In traditional software, if the feature works technically, it works for the user. A button either opens a modal or it doesn’t. There’s no gradual quality degradation.
AI doesn’t work that way. The feature can be technically operational while delivering progressively worse results. The chatbot responds, but responses are getting less useful. The classifier categorizes, but accuracy dropped 15% because the fraud pattern shifted.
| Traditional software | AI in production |
| --- | --- |
| Works or doesn't work | Works in varying degrees |
| Bugs are visible immediately | Degradation is silent |
| Quality is binary | Quality is a continuous spectrum |
| Uptime monitoring is enough | Needs output metrics |
This changes who needs to be watching. If AI quality affects user experience, and user experience is product’s responsibility, then AI monitoring is partly product’s responsibility.
I’m not saying the PM needs to configure observability dashboards. I’m saying the PM needs to know which metrics matter and have access to them. The difference between a product that uses AI well and one that uses it amateurishly lives right there—in that visibility.
What to monitor: three layers of metrics
Layer 1: Technical metrics (engineering responsibility)
Latency, throughput, error rate, availability. Important, but they say nothing about output quality. You need them to ensure the model is accessible. You don’t need them to know if it’s useful.
Layer 2: Model metrics (shared responsibility)
This is where most blind spots live. Input distribution, output distribution, average prediction confidence, fallback rate.
Concrete example: if your support chatbot had an 8% fallback rate (responses like “I didn’t understand, can you rephrase?”) and it’s now 22%, something shifted. Could be the model, could be user behavior, could be the type of question arriving. Either way, product needs to know.
Another example: if the confidence distribution of responses is declining, the model is “less sure” than before. Even if final accuracy hasn’t dropped yet, that’s an alarm signal.
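To make layer 2 concrete: both of those signals can come out of a plain interaction log. A sketch in pandas, assuming a hypothetical log with one row per model response and `timestamp`, `confidence`, and `is_fallback` columns (the schema is invented for illustration):

```python
import pandas as pd

def weekly_model_metrics(logs: pd.DataFrame) -> pd.DataFrame:
    """Aggregate layer 2 metrics per week from an interaction log.

    Expects one row per model response with columns:
      timestamp (datetime), confidence (float 0-1), is_fallback (bool).
    """
    return (
        logs.groupby(pd.Grouper(key="timestamp", freq="W"))
        .agg(
            fallback_rate=("is_fallback", "mean"),      # share of fallback replies
            avg_confidence=("confidence", "mean"),      # is the model "less sure"?
            volume=("is_fallback", "size"),             # context for the rates
        )
    )
```

A fallback rate moving from 8% to 22%, or `avg_confidence` trending down week over week, is visible here long before anyone opens a support ticket.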
Layer 3: Business metrics (product responsibility)
Feature conversion rate, engagement with recommendations, chatbot ticket resolution, NPS specific to AI interaction.
These are the final thermometer. But they’re lagging indicators—by the time they move, the problem already happened. That’s why you need layer 2 metrics as leading indicators.
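One way to validate that a layer 2 metric really behaves as a leading indicator is to check how it correlates with the business metric at different lags. A rough sketch, assuming a weekly DataFrame that joins the layer 2 metrics above with a hypothetical business column like `resolution_rate`:

```python
import pandas as pd

def lead_lag_correlation(weekly: pd.DataFrame, lead: str, lag: str,
                         max_weeks: int = 4) -> pd.Series:
    """Correlate a candidate leading metric against a business metric
    shifted 0..max_weeks weeks into the future. A strong correlation at
    a lag > 0 suggests the layer 2 metric really does move first."""
    return pd.Series(
        {w: weekly[lead].corr(weekly[lag].shift(-w)) for w in range(max_weeks + 1)},
        name=f"corr({lead} vs future {lag})",
    )
```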
How to start: a minimal framework for product teams
If you’re reading this and your team doesn’t monitor AI in a structured way, the question is: where do you start without turning it into a six-month project?
Step 1: Identify the business metric the AI should affect
Before talking about technical monitoring, answer this: what business outcome does this AI exist to move? If it’s a support chatbot, it’s probably resolution rate without escalation. If it’s product recommendations, probably click-through or conversion rate on those recommendations.
If you can’t answer that, the problem isn’t monitoring. It’s clarity of purpose for the feature.
Step 2: Define a measurable proxy for output quality
You’re not going to manually evaluate every model response. You need an automated proxy that correlates with perceived quality. A sketch of how to compute a couple of these follows the examples below.
Examples:
- For chatbot: fallback rate, rate of interactions ending with user requesting human support
- For recommendations: CTR of recommendations, average position of clicked item in the list
- For classification: rate of manual reclassification by operators
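A minimal sketch of the first two proxies, assuming a generic event log (the event names and columns are made up for illustration, not a real schema):

```python
import pandas as pd

def chatbot_proxies(events: pd.DataFrame) -> dict:
    """Fallback rate and human-handoff rate from a chatbot event log.
    Assumes an event_type column with values like 'bot_reply',
    'bot_fallback', 'handoff_to_human' (names are illustrative)."""
    replies = events["event_type"].isin(["bot_reply", "bot_fallback"]).sum()
    return {
        "fallback_rate": (events["event_type"] == "bot_fallback").sum() / max(replies, 1),
        "handoff_rate": (events["event_type"] == "handoff_to_human").sum() / max(replies, 1),
    }

def recommendation_proxies(events: pd.DataFrame) -> dict:
    """CTR and average clicked position from a recommendations event log.
    Assumes 'rec_shown' / 'rec_click' events and a click_position column
    (1 = top of the list) on click events."""
    shown = (events["event_type"] == "rec_shown").sum()
    clicks = events[events["event_type"] == "rec_click"]
    return {
        "rec_ctr": len(clicks) / max(shown, 1),
        "avg_click_position": clicks["click_position"].mean(),
    }
```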
Step 3: Establish a baseline and monitor for variation
Take the last 4 weeks as your baseline. Set a variance threshold that triggers an alert. Doesn’t need to be sophisticated. “If fallback rate goes up more than 30% from average, someone investigates” is already better than zero monitoring.
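That rule fits in a few lines. A minimal sketch, assuming the metric arrives as a plain list of weekly values:

```python
def check_against_baseline(history, current, max_increase=0.30):
    """Alert when the current value exceeds the baseline average by more
    than max_increase (30% by default, per the rule above).

    history: metric values from the baseline window (e.g. last 4 weeks);
    current: the most recent value of the same metric.
    """
    baseline = sum(history) / len(history)
    if current > baseline * (1 + max_increase):
        return (f"ALERT: {current:.1%} vs baseline {baseline:.1%} "
                f"({(current / baseline - 1):+.0%})")
    return None

# The fallback example from earlier: ~8% baseline, 22% this week
print(check_against_baseline([0.08, 0.09, 0.07, 0.08], 0.22))
# -> ALERT: 22.0% vs baseline 8.0% (+175%)
```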
Step 4: Create a review cadence
A dashboard doesn’t help if nobody looks at it. Include AI metrics in your weekly product review. Not as a technical item—as an experience quality item. Putting the four steps together:
- Do you know which business metric your AI should affect?
- Is there an automated proxy for output quality?
- Do you have baseline data from the past weeks?
- Is there a defined threshold to trigger investigation?
- Do AI metrics make it into your weekly product review?
The most common mistake: treating evaluation as an event, not a process
A lot of companies evaluate the model before shipping and call it done. Pre-deploy evaluation is necessary, but it’s not sufficient.
Production reality needs continuous evaluation. I’m not talking about retraining the model every week. I’m talking about having visibility into whether performance is stable, improving, or degrading.
This shifts the mindset from “we shipped AI” to “we operate AI.” Same difference as between launching a product and maintaining a product. Launch is a moment. Operation is ongoing.
For small teams, this doesn’t need sophistication. A simple dashboard showing layer 2 and 3 metrics over time, with a baseline line, solves 80% of the problem. The cost of not having it is discovering degradation too late.
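That dashboard can be a dozen lines of matplotlib. A sketch, assuming a weekly rate series like the one aggregated earlier:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_with_baseline(weekly: pd.Series, baseline_weeks: int = 4):
    """Plot a quality metric over time with a dashed baseline computed
    from the first baseline_weeks points. Assumes the metric is a rate
    in 0-1 (e.g. fallback_rate)."""
    baseline = weekly.iloc[:baseline_weeks].mean()
    ax = weekly.plot(marker="o", label=weekly.name or "metric")
    ax.axhline(baseline, linestyle="--", color="gray",
               label=f"baseline ({baseline:.1%})")
    ax.set_title("AI output quality over time")
    ax.legend()
    plt.show()
```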
When to scale: signals you need more structure
The minimal framework works up to a point. Here are some signals that you need more structure:
- AI is critical to your value proposition, not just an auxiliary feature
- You have multiple models in production that interact with each other
- Volume is high enough that small drifts have large absolute impact
- You’re in a regulated market requiring auditability
In those cases, it’s worth investing in specialized ML observability tools, formal periodic evaluation processes, and possibly someone in a dedicated MLOps or AI Quality role.
But most Brazilian teams working with AI in product aren’t at that point yet. They’re at the previous one: no visibility whatsoever into what happens after deploy. And to move from that point, the minimal framework is already a significant leap.
Connecting to business results
AI monitoring, at the end of the day, is risk management and result optimization.
The risk is silent degradation affecting business metrics without you knowing why. Conversion drops and you test a thousand things before discovering the problem was your stale recommendation model.
Optimization is the flip side: you see a model metric improving and can correlate it to results. That tells you where to invest more, where to adjust, where to experiment.
The AI you shipped to production isn’t a static feature. It’s a system that needs continuous attention. The question isn’t whether you’ll monitor it. It’s whether you’ll monitor it before or after you discover something broke.