If you’re building a digital product and still think AI in production means paying $20 per million tokens to OpenAI, update your spreadsheet. DeepSeek V4 and Qwen 3.6-27B dropped in the last few days with a proposition that breaks the math: performance comparable to the best models on the market, running on infrastructure you control, at a cost that doesn’t break your unit economics.
I’m not talking about toy models. I’m talking about models that compete with GPT-4o and Claude 3.5 on benchmarks, available for self-hosting or via APIs that cost cents where you used to pay dollars.
The real problem nobody talks about publicly
Most Brazilian digital products that claim to “use AI” actually have AI features turned off or capped. The reason is simple: the per-request cost makes it unviable at scale.
Do the math. A SaaS product with 10K active users, each making 5 AI interactions per day, generates 1.5 million interactions a month. At $0.015 per 1K input tokens and $0.06 per 1K output tokens (GPT-4o pricing), with an average of 500 tokens per interaction, you're looking at $15,000 to $30,000 per month in API costs alone, depending on the input/output split. For a Brazilian SaaS charging R$99/month, that math doesn't work.
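To make that arithmetic concrete, here's the estimate as a quick sketch. The numbers come straight from the paragraph above; the 50/50 input/output split is my assumption, and an input-heavy mix pulls the total toward the lower end of the range:

```python
# Back-of-the-envelope API cost using the figures cited above.
users = 10_000
interactions_per_user_per_day = 5
days = 30
tokens_per_interaction = 500

price_in = 15.00   # $ per million input tokens (GPT-4o tier)
price_out = 60.00  # $ per million output tokens

interactions = users * interactions_per_user_per_day * days  # 1.5M/month

# Assumption: half the tokens are input, half are output.
tokens_in_millions = interactions * tokens_per_interaction * 0.5 / 1e6
tokens_out_millions = interactions * tokens_per_interaction * 0.5 / 1e6

monthly_cost = tokens_in_millions * price_in + tokens_out_millions * price_out
print(f"${monthly_cost:,.0f}/month")  # ~$28,000 with a 50/50 split
```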
The practical result: companies build AI features for marketing, not for real value. The user sees “powered by AI” on the site, uses it once, and never again. Because the feature was designed not to be used too much.
What changed this week
DeepSeek V4 and Qwen 3.6-27B represent an inflection point. These aren’t “almost good” models. They’re matching or beating GPT-4o on benchmarks that matter — reasoning, code, complex instructions — with one crucial difference: you can run them on your own infrastructure.
DeepSeek V4 is available with open weights. Qwen 3.6-27B runs on a single 24GB GPU (at 4-bit quantization, 27B parameters weigh roughly 14GB, leaving headroom for the KV cache). DeepSeek's API costs $0.14 per million input tokens: roughly 100x cheaper than GPT-4o.
The same AI feature that would cost $20,000/month via OpenAI could cost $200/month via DeepSeek API, or fixed infrastructure cost if you self-host.
| | Premium API model (GPT-4o) | Compact model (DeepSeek V4) |
| --- | --- | --- |
| Price | $15-60 per million tokens | $0.14-0.27 per million tokens |
| Deployment | Vendor API only | API or self-hosted |
| Latency | Extra network hop, outside your control | Under your control |
| Data | Passes through external servers | Can stay internal |
| Cost scaling | Linear with usage | Fixed cost if self-hosted |
When it makes sense to migrate (and when it doesn’t)
Before you start migrating everything to DeepSeek, understand the real trade-off.
Migrating makes sense when:
- Your usage volume is high enough that API cost is a real problem
- The AI feature is core to the product, not peripheral
- You have technical capacity to operate ML infrastructure (or can hire for it)
- Latency is critical and you need tight control
- Sensitive data can’t leave your infrastructure
Not migrating makes sense when:
- Your volume is still low — operational overhead doesn’t pay off
- You need the most advanced model on the market for specific cases (GPT-4o still wins in some scenarios)
- Your team has no ML infrastructure experience and you don’t want that problem right now
- Current API cost is acceptable within your unit economics
A quick checklist:
- Does your AI feature have volume above 100K requests/month?
- Does API cost represent more than 10% of your variable costs?
- Do you have someone on the team who can operate GPUs in production?
- Is the compact model’s quality sufficient for your use case?
- Is sub-500ms latency a product requirement?
If you answered yes to 3 or more, it’s worth investigating seriously.
The path of least friction
You don’t need self-hosting to capture most of the benefit. The most pragmatic path for most products:
Phase 1: Switch APIs
Keep your current architecture. Swap the OpenAI call for DeepSeek or another compact model provider. Cut costs by 90%+ without touching infrastructure. Implementation time: hours.
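DeepSeek's API speaks the OpenAI wire format, so in practice the swap can be a handful of lines. A minimal sketch using the official openai SDK; the model alias is an assumption here, so check the provider's docs for the current name:

```python
from openai import OpenAI

# Same SDK, different endpoint. Because the API is
# OpenAI-compatible, only the base_url, key, and model change.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_KEY",
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # assumed alias; verify in provider docs
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
print(response.choices[0].message.content)
```

The rest of your application never notices the difference: prompts, message shapes, and response parsing all stay as they are.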
Phase 2: Evaluate quality
Run both in parallel for a week. Compare outputs. For most use cases — classification, extraction, structured text generation — the difference is imperceptible. In some cases, DeepSeek is better.
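One simple way to run that comparison: reuse the two clients from Phase 1, replay real production prompts against both models, and log the paired outputs with timings for later review. The prompts and file name below are placeholders:

```python
import csv
import time

from openai import OpenAI

openai_client = OpenAI(api_key="YOUR_OPENAI_KEY")
deepseek_client = OpenAI(
    api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com"
)

def ask(client, model, prompt):
    """Return the model's answer plus wall-clock seconds."""
    t0 = time.perf_counter()
    r = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return r.choices[0].message.content, time.perf_counter() - t0

# Replay real production prompts, not synthetic benchmarks.
prompts = [
    "Classify this support ticket: ...",
    "Extract the invoice fields: ...",
]

with open("shadow_eval.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["prompt", "gpt4o", "gpt4o_sec", "deepseek", "deepseek_sec"])
    for p in prompts:
        a, ta = ask(openai_client, "gpt-4o", p)
        b, tb = ask(deepseek_client, "deepseek-chat", p)  # assumed alias
        w.writerow([p, a, f"{ta:.2f}", b, f"{tb:.2f}"])
```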
Phase 3: Self-hosting (if it makes sense)
If volume justifies it and you have operational capacity, move to your own infrastructure. Cost becomes fixed: an A100 GPU on AWS costs ~$3/hour, which is roughly $2,200/month running around the clock. If your API bill is comfortably above that, self-hosting probably pays off.
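A sketch of what that can look like with vLLM as the inference engine. The model ID and quantization choice are assumptions, not a published checkpoint; quantization is what lets a 27B model fit a single 24GB card:

```python
from vllm import LLM, SamplingParams

# Placeholder model ID -- point it at whatever repo the weights
# are actually published under. AWQ 4-bit keeps a 27B model
# within a 24GB GPU's memory budget.
llm = LLM(model="Qwen/Qwen3.6-27B-Instruct", quantization="awq")

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Classify this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```

In production you'd more likely run vLLM's OpenAI-compatible server (`vllm serve <model>`) so the Phase 1 client code keeps working with nothing but a base_url change.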
What this means for product
For PMs and product leaders, the implication is direct: AI features that were in the backlog due to cost infeasibility are now viable.
That idea of having a contextual assistant inside the product? Viable. Automatic analysis of documents users upload? Viable. Real-time content personalization? Viable.
The limiting factor shifted from “what does the API cost?” to “what makes sense for the user?” That’s a paradigm shift.
But here’s the catch: cost viability doesn’t mean product viability. The question is still “does this feature solve a real problem?” — not “is this feature cheap to run?” Many companies will fall into the trap of adding AI because it’s now affordable, not because it’s now useful.
The elephant in the room: latency and experience
Cost is half the equation. The other half is latency.
Large models via API have network latency plus processing latency. In typical use, you’re looking at 2-5 seconds of response time. For many features, that’s acceptable. For others, it kills the experience.
Compact models running locally or on edge can deliver responses in hundreds of milliseconds. That opens UX possibilities that simply didn’t exist before:
- Smart autocomplete as the user types
- Real-time semantic form validation
- Instant contextual suggestions
- Natural language search with real understanding
These features only work with low latency. And low latency in AI, until this week, meant prohibitive costs or poor quality. That trade-off is disappearing.
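If you want to hold a feature to that bar, measure it rather than trust vendor numbers. A quick probe against any OpenAI-compatible endpoint works; the local URL and model name below are placeholders for whatever you're actually running:

```python
import statistics
import time

from openai import OpenAI

# Point base_url at your self-hosted server (e.g. a local vLLM
# instance) or a remote API to compare the two. Local servers
# typically ignore the key, but the SDK requires one to be set.
client = OpenAI(api_key="unused", base_url="http://localhost:8000/v1")

samples = []
for _ in range(20):
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="local-model",  # placeholder: whatever the server exposes
        messages=[{"role": "user", "content": "Suggest a tag for: ..."}],
        max_tokens=32,
    )
    samples.append((time.perf_counter() - t0) * 1000)

print(f"p50: {statistics.median(samples):.0f} ms")
print(f"p95: {sorted(samples)[int(len(samples) * 0.95)]:.0f} ms")
```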
What I’d do right now
If you have a digital product with AI features or plan to:
- Revise your cost spreadsheet — recalculate unit economics with the new API pricing. You can probably raise usage limits or remove artificial caps.
- Test the new models — DeepSeek V4 and Qwen 3.6-27B have free playgrounds. Run your actual use cases, not generic benchmarks.
- Reevaluate features in your backlog — that idea you killed 6 months ago for cost reasons might be viable now.
- Don't self-host for hype — if the API solves your problem for $200/month, you don't need to operate GPUs. Operational complexity has invisible costs.
- Think about latency, not just cost — some AI features you think are impossible aren't blocked by price; they're blocked by latency. Local models solve that.
The window is open. Companies that understand quickly that AI went from premium cost to commodity will build features competitors will take months to copy. Not because of lack of technology — but because of outdated assumptions.
Your assumptions about AI in production from 6 months ago are probably wrong. Update them.