Most retrieval pipelines exist because of a price tag, not a principle. When a million-token prompt cost a fortune and crawled, teams chopped their documents into chunks, embedded them, stored them in a vector database, and retrieved the top few snippets at query time. It worked. It was also a workaround for expensive context.
That price tag is moving. MiniMax M3 landed this month with a sparse attention architecture that drops per-token compute to roughly a twentieth of what previous models needed, with reported gains of around 9x faster prefilling and 15x faster decoding on million-token contexts. Whatever the exact numbers settle at, the direction is clear. The long context that used to be a luxury is becoming a default.
Why this is an architecture question, not a model update
A new model is easy to ignore. A change in the cost curve is not. The whole reason RAG became the standard pattern was that stuffing everything into the prompt was too slow and too expensive to do per request. Take that constraint away and a lot of the complexity teams built starts to look optional.
Think about what a typical RAG stack actually carries. A chunking strategy nobody fully agrees on. An embedding model you have to keep in sync. A vector store to operate and pay for. A retrieval step that silently drops the one paragraph that mattered because it ranked seventh. Each piece is a place where quality leaks and where an engineer spends a Friday afternoon debugging why the answer was confidently wrong.
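To make the "ranked seventh" failure concrete, here is a minimal sketch of a top-k cutoff. The chunk names, the similarity scores, and the choice of k = 5 are all made up for illustration; a real stack would get its scores from an embedding model and a vector store.

```python
from typing import NamedTuple


class ScoredChunk(NamedTuple):
    chunk_id: str
    score: float  # similarity to the query, higher is better


def retrieve_top_k(scored, k):
    """Return the k highest-scoring chunks, best first."""
    return sorted(scored, key=lambda c: c.score, reverse=True)[:k]


# Hypothetical scores for a single query. "refund_exception" is the
# paragraph that actually answers the question, but it ranks seventh.
candidates = [
    ScoredChunk("overview", 0.83),
    ScoredChunk("pricing", 0.80),
    ScoredChunk("faq", 0.78),
    ScoredChunk("terms", 0.74),
    ScoredChunk("changelog", 0.71),
    ScoredChunk("contact", 0.69),
    ScoredChunk("refund_exception", 0.66),
    ScoredChunk("careers", 0.40),
]

TOP_K = 5  # assumed cutoff; many pipelines use a small fixed k
kept_ids = [c.chunk_id for c in retrieve_top_k(candidates, TOP_K)]
ranked_ids = [c.chunk_id for c in retrieve_top_k(candidates, len(candidates))]
rank_of_answer = ranked_ids.index("refund_exception") + 1

print(kept_ids)
print("answer rank:", rank_of_answer, "kept:", "refund_exception" in kept_ids)
```

Nothing errors and nothing is logged. The model simply never sees the paragraph, which is why this kind of leak is hard to spot from the answer alone.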
When long context gets cheap, you can sometimes skip that entire chain and hand the model the whole document, the whole ticket history, the whole policy file. Fewer moving parts. Fewer failure modes. Less to maintain.
What we would not throw away yet
This is not a funeral for retrieval. Some workloads still need it, and pretending otherwise is how you end up with a six-figure inference bill.
Retrieval still wins when your corpus is genuinely large. You are not going to fit a 40GB knowledge base into a prompt, cheap tokens or not. It still wins when freshness matters and you need the answer grounded in a source you can cite and audit. And there is a real difference between a model being able to read a million tokens and a model reasoning well across all of them. Recall in the middle of a long context is still uneven.
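A rough back-of-the-envelope check shows how far a 40GB corpus is from fitting. The figure of about 4 bytes per token is an assumption (a common rule of thumb for English text, and it varies by tokenizer and language), and the one-million-token window is simply the size discussed above.

```python
BYTES_PER_TOKEN = 4  # assumed rule of thumb, not a measured value


def estimate_tokens(corpus_bytes, bytes_per_token=BYTES_PER_TOKEN):
    """Rough token count for a plain-text corpus of the given size."""
    return corpus_bytes // bytes_per_token


def windows_needed(total_tokens, window_tokens):
    """How many full context windows it would take to hold the corpus."""
    return -(-total_tokens // window_tokens)  # ceiling division


corpus_bytes = 40 * 10**9      # the 40GB knowledge base
window_tokens = 1_000_000      # a million-token context

tokens = estimate_tokens(corpus_bytes)
windows = windows_needed(tokens, window_tokens)

print(f"{tokens:,} tokens")    # 10,000,000,000 tokens
print(f"{windows:,} windows")  # 10,000 windows
```

Under these assumptions the corpus is about four orders of magnitude larger than a single prompt, so cheaper tokens alone do not close the gap.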
The honest position is that the default is shifting. RAG used to be the obvious first move. Now the first question is simpler: how much context does this task actually need, and can I just give it directly?
A practical way to decide
For your next AI feature, try the boring version first. Put the relevant content straight into the context and measure quality, latency, and cost. Only reach for a vector store when the boring version breaks on size, freshness, or spend. You will ship faster, and you will avoid maintaining infrastructure you did not need.
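One way to keep that decision honest is to write the three break conditions down as an explicit check. This is a minimal sketch, not a framework: the field names, the example numbers, and the budget are hypothetical, and the measurements themselves would come from your own trial runs.

```python
from dataclasses import dataclass


@dataclass
class DirectContextTrial:
    """Measurements from trying the boring, direct-context version."""
    content_tokens: int          # size of the content you would pass in
    window_tokens: int           # the model's context limit
    cost_per_request: float      # measured spend per request
    budget_per_request: float    # what the feature can afford per request
    needs_fresher_sources: bool  # content goes stale faster than you can refresh it


def reasons_to_add_retrieval(trial: DirectContextTrial) -> list[str]:
    """Return which of size, freshness, or spend the direct version breaks on."""
    reasons = []
    if trial.content_tokens > trial.window_tokens:
        reasons.append("size")
    if trial.needs_fresher_sources:
        reasons.append("freshness")
    if trial.cost_per_request > trial.budget_per_request:
        reasons.append("spend")
    return reasons


# Hypothetical numbers for a single policy document.
small_feature = DirectContextTrial(
    content_tokens=200_000,
    window_tokens=1_000_000,
    cost_per_request=0.12,
    budget_per_request=0.50,
    needs_fresher_sources=False,
)

# Hypothetical numbers for a multi-year ticket archive.
large_archive = DirectContextTrial(
    content_tokens=3_000_000,
    window_tokens=1_000_000,
    cost_per_request=1.80,
    budget_per_request=0.50,
    needs_fresher_sources=False,
)

print(reasons_to_add_retrieval(small_feature))  # []
print(reasons_to_add_retrieval(large_archive))  # ['size', 'spend']
```

An empty list means the boring version holds and there is nothing to build. Quality is left out of the check on purpose: it needs a human-reviewed evaluation set, not a threshold.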
The teams that win the next year will not be the ones with the most elaborate retrieval pipeline. They will be the ones who keep checking which constraints still hold and delete the workarounds that no longer earn their keep.
We help founders and teams design and build digital products that scale with you instead of slowing you down. If you are looking to build something, get in touch with us today.