RAG Is Dead. The Engineering Isn't.
Everyone's declaring RAG dead. They're right about the term and wrong about what it means. I built a code search system to figure out what comes after it, and the answer surprised me: retrieval was the easy part.

"RAG is dead." I kept seeing this take, and after building a retrieval system from scratch, I think they're half right. The term deserved to die. It always collapsed three distinct engineering problems — finding information, assembling context, and generating answers — into one buzzword that made it sound like a solved pattern. Stick an embedding model in front of a vector database, pipe the results into a prompt, done. RAG.
But the retrieval problem didn't go away just because we stopped calling it RAG. If anything, killing the term forced a more honest question: what are you actually building when you build this thing?
Jeff Huber, CEO of Chroma, put it sharply in a Latent Space interview:
"Context engineering is the job of figuring out what should be in the context window at any given LLM generation step."
That framing clicked for me. Not because it's a better label — although it is — but because it decomposes the problem into pieces you can reason about independently. Retrieval. Filtering. Re-ranking. Assembly. Memory. Evaluation. Each one has its own failure modes, its own tuning knobs, its own way of going wrong.
So I built a code search system to see which of those pieces actually matters. The answer surprised me.
Full tutorial: learn-strands/rag-chatbot
What I built
The system indexes the Strands SDK codebase — Python files, markdown docs, examples — and lets you ask questions about the framework in natural language. A chatbot that answers questions about Strands by searching its own source code.
The stack:
- FAISS for dense vector search (local embeddings, no API calls)
- Strands SDK for agent orchestration (tools, multi-agent coordination)
- mem0 for conversation memory
- Gemini 2.5 Flash Lite for generation
Nothing exotic. I wanted to see how far you get with simple, well-composed primitives before reaching for anything fancy.
Retrieval was the straightforward part
I expected retrieval to be where I'd spend most of my time. It wasn't. FAISS with IndexFlatL2 and sentence-transformers gives you exact (brute-force) nearest-neighbor search, so no recall is lost to index approximation, and for a codebase under 10K chunks it's fast enough that you don't need approximate indices:
from typing import Any, Dict, List

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer


class FAISSVectorStore:
    """FAISS-based vector store with local embeddings."""

    def __init__(self, local_model: str = "all-MiniLM-L6-v2", dimension: int = 384):
        self.embedder = SentenceTransformer(local_model)
        self.dimension = dimension
        self.index = faiss.IndexFlatL2(dimension)
        self.documents = []

    def add_documents(self, documents: List[Dict[str, Any]], batch_size: int = 32):
        """Add documents with batched embedding generation."""
        texts = [doc.get('text', '') for doc in documents]
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            embeddings = self.embedder.encode(batch, show_progress_bar=False)
            all_embeddings.append(embeddings)
        embeddings_array = np.vstack(all_embeddings).astype('float32')
        # Unit-length vectors make L2 distance convertible to cosine similarity
        faiss.normalize_L2(embeddings_array)
        self.index.add(embeddings_array)
        self.documents.extend(documents)

    def search(self, query: str, k: int = 5, threshold: float = 0.3) -> List[Dict[str, Any]]:
        """Dense vector search with cosine similarity."""
        if self.index.ntotal == 0:
            return []
        query_embedding = self.embedder.encode([query]).astype('float32')
        faiss.normalize_L2(query_embedding)
        distances, indices = self.index.search(query_embedding, min(k, self.index.ntotal))
        # IndexFlatL2 returns squared L2 distances; on unit vectors d^2 = 2 - 2*cos
        similarities = 1 - (distances[0] / 2)
        results = []
        for idx, similarity in zip(indices[0], similarities):
            # FAISS pads missing results with -1, so guard both ends of the range
            if similarity >= threshold and 0 <= idx < len(self.documents):
                doc = self.documents[int(idx)].copy()
                doc['similarity'] = float(similarity)
                results.append(doc)
        return results
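Normalizing the vectors to unit length lets you recover cosine similarity from the distance FAISS reports: IndexFlatL2 returns the squared L2 distance d², and for unit vectors cos = 1 - d²/2. The 0.3 threshold is aggressive but intentional: I'd rather get fewer, better chunks than flood the context with marginal matches.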
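The conversion inside `search` rests on one identity: for unit vectors, squared Euclidean distance equals 2 - 2·cos. Here is a minimal check in plain Python, without FAISS. The vectors and helper names are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (the job faiss.normalize_L2 does in place)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance, the quantity IndexFlatL2 reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_from_squared_l2(d2):
    """For unit vectors: d^2 = 2 - 2*cos, so cos = 1 - d^2 / 2."""
    return 1.0 - d2 / 2.0


# Illustrative vectors only
a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
d2 = squared_l2(normalize(a), normalize(b))
recovered = cosine_from_squared_l2(d2)
```

Both routes give the same number, which is why a threshold on `similarity` behaves like a cosine threshold even though the index itself only knows about L2.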
The search itself becomes a Strands tool that the agent can call explicitly:
@tool
def search_strands_sdk(query: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Semantic search over Strands SDK codebase."""
    results = vector_store.search(query=query, k=max_results, threshold=0.3)
    return [{
        'text': r['text'],
        'similarity': r['similarity'],
        'file_path': r.get('file_path', 'unknown'),
        'file_type': r.get('file_type', 'unknown')
    } for r in results]


@tool
def search_by_file_type(query: str, file_type: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Combine dense search with metadata filtering."""
    # Over-fetch 3x so enough candidates survive the file_type filter
    all_results = vector_store.search(query=query, k=max_results * 3, threshold=0.3)
    filtered = [r for r in all_results if r.get('file_type', '') == file_type][:max_results]
    return [{
        'text': r['text'],
        'similarity': r['similarity'],
        'file_path': r.get('file_path', 'unknown')
    } for r in filtered]
These aren't buried in a pipeline. They're named tools the agent invokes explicitly — search_strands_sdk for broad semantic queries, search_by_file_type when you need to narrow by file extension. Huber calls this "naming the primitives," and he's right that it matters: when your retrieval is an explicit, composable step rather than a hidden subroutine, you can reason about it, debug it, and swap it out.
This whole layer took maybe a day to get working well. I kept waiting for the hard part. It wasn't here.
The surprise: accurate retrieval saves money
Here's what I didn't expect. When first-stage retrieval is precise — when you go from 6,000 chunks to 20 good ones instead of 200 mediocre ones — the downstream agent burns dramatically fewer tokens.
Think about what happens when retrieval is sloppy. The agent gets a pile of vaguely relevant chunks, can't find a clear answer, and starts exploring. It might call the search tool again with a rephrased query. It might ask the LLM to reason through ambiguous context. It might generate a hedged, uncertain answer that prompts the user to ask a follow-up. Every one of those extra steps costs tokens.
With tight retrieval — good embeddings, reasonable thresholds, metadata filtering — the agent gets what it needs on the first call. It reads five chunks, finds the answer, cites the source, done. The retrieval precision translates directly into agent efficiency.
Huber frames this as "win the first stage":
"Using signals like vector search, like full text search, like metadata filtering... to go from 10,000 down to 300."
He's talking about recall — making sure the good chunks are in the candidate set. But what I found is that the precision side matters just as much, maybe more, when an agent is consuming the results. An agent with 20 excellent chunks behaves completely differently from an agent with 200 okay chunks. The first one answers. The second one wanders.
Where I fell short: re-ranking
Here's where I have to be honest about a gap.
Huber is clear about re-ranking:
"Using an LLM as a re-ranker and brute forcing from 300 down to 30, I've seen now emerging... way more cost effective than a lot of people realize."
That's a separate pass. You take your 300 first-stage candidates, send them through a cross-encoder or an LLM scoring step, produce a ranked list, take the top 30, and then generate your answer from those 30. The re-ranking step reduces what enters the generation context.
My system doesn't do this. What I built was a response specialist that receives all retrieved chunks in a single prompt and generates from them:
@tool
def response_specialist_tool(query: str, context: str) -> str:
    """Generate response from retrieved context."""
    agent = Agent(
        system_prompt="""You are a response generation specialist for Strands SDK queries.
Generate answers based ONLY on provided context.
Guidelines:
1. PRIORITIZE the most relevant chunks from context
2. CITE sources with file paths and line numbers
3. COMBINE information from multiple chunks coherently
4. If context is insufficient, say so clearly
5. Provide runnable code snippets when possible""",
        tools=[use_llm],
        model=gemini_model
    )
    response = agent(f"""User Question: {query}
Retrieved Context:
{context}
Generate a comprehensive answer using ONLY the context above.""")
    return str(response)
I initially told myself that the LLM's attention mechanism was doing "implicit re-ranking" — it focuses on relevant chunks and ignores the rest. But that's not what re-ranking is. Everything is still in the context window. Every token still counts against the context budget. The model might attend more to good chunks, but the bad ones are still there, still contributing to what Huber calls context rot:
"As you use more and more tokens, the model can pay attention to less and then also can reason sort of less effectively."
So I'm eating the cost of stuffing noise into the context and hoping the model copes. For a small codebase with tight first-stage retrieval (where I'm sending 5-10 chunks, not 300), this works well enough. But it won't scale. A proper re-ranking step — score the candidates, take the top N, discard the rest — is on the list.
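The missing pass has a simple shape: score every first-stage candidate against the query, sort, keep the top N, and drop the rest before anything reaches the generation prompt. Here is a minimal sketch, assuming a pluggable `score_fn`. The `rerank` helper and the toy `lexical_overlap` scorer are hypothetical stand-ins for a cross-encoder or LLM scoring call, not part of the system described here, and the file paths are invented.

```python
from typing import Any, Callable, Dict, List


def lexical_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query terms that appear in the chunk text."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms)


def rerank(
    query: str,
    candidates: List[Dict[str, Any]],
    top_n: int = 30,
    score_fn: Callable[[str, str], float] = lexical_overlap,
) -> List[Dict[str, Any]]:
    """Score each candidate, sort descending, and keep only the top_n."""
    scored = []
    for chunk in candidates:
        item = dict(chunk)  # copy so the caller's list is left untouched
        item["rerank_score"] = score_fn(query, item.get("text", ""))
        scored.append(item)
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]


# Illustrative candidates; paths are made up
candidates = [
    {"text": "logging configuration for the sdk", "file_path": "docs/logging.md"},
    {"text": "create an agent with custom tools using the tool decorator",
     "file_path": "docs/agents.md"},
    {"text": "agent tools overview", "file_path": "docs/tools.md"},
]
top = rerank("create agent with custom tools", candidates, top_n=2)
```

The point of the separate function is that whatever it discards never costs a generation token. Swapping the scorer for something stronger changes one argument, not the pipeline.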
The hard part: assembly and memory
Retrieval was a day of work. Assembly took three.
Assembly is the problem of taking retrieved chunks — disconnected fragments from different files, different sections of documentation, different code examples — and composing them into context that the model can reason about coherently. It's not just concatenation. The order matters. The framing matters. Whether you include file paths and line numbers matters. Whether you show the chunk verbatim or summarize it matters.
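One way to make those choices explicit is a small assembly function that orders chunks, labels each with its source, and stops at a size budget. This is a sketch under assumptions: `assemble_context`, the header format, and the character budget are all illustrative, not what the tutorial does. The input dicts follow the shape returned by `search_strands_sdk`.

```python
from typing import Any, Dict, List


def assemble_context(chunks: List[Dict[str, Any]], max_chars: int = 6000) -> str:
    """Order chunks by similarity, tag each with its source, respect a budget."""
    ordered = sorted(chunks, key=lambda c: c.get("similarity", 0.0), reverse=True)
    sections: List[str] = []
    used = 0
    for rank, chunk in enumerate(ordered, start=1):
        path = chunk.get("file_path", "unknown")
        similarity = chunk.get("similarity", 0.0)
        header = f"[{rank}] {path} (similarity {similarity:.2f})"
        section = f"{header}\n{chunk.get('text', '').strip()}"
        if used + len(section) > max_chars:
            break  # drop lower-ranked chunks rather than truncating mid-chunk
        sections.append(section)
        used += len(section)  # separators are not counted; the budget is approximate
    return "\n\n---\n\n".join(sections)


# Illustrative chunks; paths are made up
chunks = [
    {"text": "def helper(): ...", "similarity": 0.41, "file_path": "src/util.py"},
    {"text": "class Agent: ...", "similarity": 0.82, "file_path": "src/agent.py"},
]
context = assemble_context(chunks)
```

Every decision from the paragraph above shows up as a line you can change: the sort key is the ordering, the header is the framing, and the budget check is the discard rule.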
And then there's conversation memory. In a multi-turn chat, the model needs context from previous turns, but you can't just append the full conversation history. That grows without bound and directly triggers context rot.
This is where mem0 earned its place in the stack:
mem0_config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "strands_chat",
            "embedding_model_dims": 384,
            "path": ":memory:"
        }
    },
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "all-MiniLM-L6-v2"}
    }
}
memory = Memory.from_config(mem0_config)


@tool
def remember_conversation(user_message: str, assistant_response: str, user_id: str = "user") -> str:
    """Extract and store salient facts from the conversation."""
    memory.add(
        f"User asked: {user_message}\nAssistant responded: {assistant_response}",
        user_id=user_id
    )
    return f"Stored conversation in memory for {user_id}"


@tool
def recall_conversation(query: str = "", user_id: str = "user") -> str:
    """Retrieve relevant conversation history, not the full transcript."""
    if not query:
        query = "recent conversation history"
    results = memory.search(query, user_id=user_id, limit=5)
    if not results or 'results' not in results or not results['results']:
        return "No previous conversation history found."
    history = []
    for item in results['results']:
        if 'memory' in item:
            history.append(item['memory'])
    return "\n\n".join(history) if history else "No relevant conversation history found."
Instead of appending full dialogue turns (which bloats the context window with every exchange), mem0 extracts salient facts and stores them as searchable memories. When the agent needs conversation context, it retrieves relevant memories, not the entire transcript. Huber's principle — "tight, structured contexts beat maximal windows" — is exactly right, and memory is where most systems violate it first.
Putting it together
The orchestrator ties retrieval, response generation, and memory into a single agent:
rag_chatbot = Agent(
    system_prompt="""You are an intelligent assistant with expertise in the Strands SDK.
WORKFLOW:
1. RETRIEVE: Use retrieval_specialist to find relevant docs and code
2. RECALL: Use recall_conversation for relevant conversation context
3. RESPOND: Use response_specialist to generate a cited answer
4. REMEMBER: Use remember_conversation to store salient facts
Always retrieve context before answering technical questions.
Prefer code examples from the actual Strands SDK codebase.""",
    tools=[
        retrieval_specialist,
        response_specialist_tool,
        remember_conversation,
        recall_conversation,
        use_llm
    ],
    model=gemini_model
)
This is what context engineering looks like in practice: not a monolithic pipeline but composable steps — retrieval, memory, assembly, generation — each one an explicit tool the orchestrator invokes as needed. When you ask "How do I create an agent with custom tools?", the orchestrator:
- Calls retrieval_specialist → searches the codebase, filters by relevance
- Calls recall_conversation → pulls relevant memories from earlier turns
- Passes both to response_specialist → generates an answer citing specific files
- Calls remember_conversation → stores the key facts for future turns
Every step is visible. Every step is debuggable. And when something breaks — when the answer is wrong, when the citations are off — you can tell which step failed instead of staring at a monolithic "RAG pipeline" wondering where the problem lives.
Full working tutorial (self-contained Jupyter notebook, one-command setup): learn-strands/rag-chatbot
Bootstrapping the runtime
An agent is only as good as the runtime it runs in. To make this runnable, we need to initialize the FAISS store, load the index from disk, set up our memory database, and start a chat loop.
Here is the glue code that holds it all together:
import os
from pathlib import Path

from strands import Agent, tool
from strands.models.gemini import GeminiModel
from strands_tools import use_llm
from mem0 import Memory

# 1. Initialize the LLM
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
gemini_model = GeminiModel(
    client_args={"api_key": GOOGLE_API_KEY},
    model_id="gemini-2.5-flash-lite"
)

# 2. Boot the FAISS vector store
vector_store = FAISSVectorStore(
    local_model="all-MiniLM-L6-v2",
    dimension=384
)

# 3. Load or index the codebase
if Path("data/strands_sdk.faiss").exists():
    vector_store.load("data/strands_sdk.faiss", "data/documents.json")
else:
    # Index the SDK docs and source code on the fly
    documents = load_and_chunk_documents(
        repo_path="data/strands-sdk",
        chunk_size=1000,
        chunk_overlap=200
    )
    vector_store.add_documents(documents)
    vector_store.save("data/strands_sdk.faiss", "data/documents.json")

# 4. Initialize in-memory conversation memory
memory = Memory.from_config({
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "strands_chat",
            "embedding_model_dims": 384,
            "path": ":memory:"
        }
    },
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "all-MiniLM-L6-v2"}
    }
})

# 5. Start the interactive chat loop
print("💬 Strands SDK RAG Chatbot is online.")
while True:
    query = input("👤 You: ")
    if query.lower() in ['exit', 'quit']:
        break
    response = rag_chatbot(query, user_id="user")
    print(f"\n🤖 Assistant: {response}\n")
This bootstrap is closer to a developer's local utility than a production architecture. Loading a pre-built FAISS index when one exists, and building it on the fly when it doesn't, keeps iteration fast. (The save and load methods it calls are omitted from the FAISSVectorStore listing above.)
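`load_and_chunk_documents` isn't shown here either. As a rough idea of what `chunk_size=1000` and `chunk_overlap=200` could mean, here is a character-based sliding window. `chunk_text` is a hypothetical helper, and the real tutorial may well split on different boundaries (tokens, lines, or syntax nodes).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 2500
pieces = chunk_text(sample)
```

With these defaults each window advances 800 characters, so consecutive chunks share 200. The overlap is there so a function signature or sentence that straddles a boundary still appears whole in at least one chunk.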
Connecting the dots
If we step back and look at how this maps to Jeff Huber's principles, the architecture feels remarkably clean:
- Named primitives: Our search, filter, response, and memory steps are not hidden subroutines of a large library class. They are explicit @tool-decorated functions, each independently testable.
- Winning the first stage: FAISS culls the candidates using cosine similarity (recovered from L2 distance on normalized vectors) and a 0.3 threshold. Here that stage is tuned for precision, because nothing downstream filters further.
- Re-ranking (still missing): There is no separate re-ranking pass. The response agent receives every retrieved chunk and leans on the model's attention to favor the relevant ones, which is not a substitute for scoring and discarding.
- Combatting context rot: mem0 prunes transcripts down to facts, keeping our prompt budget tidy.
By structuring our agent this way, we haven't built a rigid pipeline. We've built a set of composable capabilities that the agent can choose to employ.
What I'd measure next
Huber's most practical recommendation is one I haven't implemented yet: golden datasets.
"People should be creating small golden data sets of what queries they want to work and what chunks should return... quantitatively evaluate what matters."
The idea is dead simple — spend an evening creating labeled query-chunk pairs:
[
  {
    "query": "How do I create an agent with custom tools?",
    "expected_chunks": [
      "data/strands-sdk/examples/custom_tools.py",
      "data/strands-sdk/docs/agents.md"
    ],
    "expected_concepts": ["@tool decorator", "Agent class initialization"]
  }
]
Then wire Recall@10 into CI and fail the build if it drops below a threshold. Without this, retrieval quality degrades silently. You change an embedding model, update a chunking strategy, re-index the codebase, and never notice that three important queries stopped returning the right files. Golden datasets are the eval loop that keeps the system honest.
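Here is a sketch of that check, assuming the golden-file shape above and counting a hit whenever a retrieved result's `file_path` matches an expected path. `recall_at_k`, `evaluate`, and the `search_fn` signature are hypothetical. In the real system `search_fn` would wrap `vector_store.search`, and the stub retriever below exists only to make the example self-contained.

```python
from typing import Any, Callable, Dict, List


def recall_at_k(expected: List[str], retrieved: List[str], k: int = 10) -> float:
    """Fraction of expected items that appear in the top-k retrieved items."""
    if not expected:
        return 1.0  # nothing to find, so nothing was missed
    top_k = set(retrieved[:k])
    return sum(1 for path in expected if path in top_k) / len(expected)


def evaluate(
    golden: List[Dict[str, Any]],
    search_fn: Callable[[str, int], List[Dict[str, Any]]],
    k: int = 10,
) -> float:
    """Mean Recall@k over every query in the golden set."""
    scores = []
    for case in golden:
        results = search_fn(case["query"], k)
        paths = [r.get("file_path", "") for r in results]
        scores.append(recall_at_k(case["expected_chunks"], paths, k))
    return sum(scores) / len(scores) if scores else 0.0


def fake_search(query: str, k: int) -> List[Dict[str, Any]]:
    """Stub retriever: returns one expected file and one invented distractor."""
    return [
        {"file_path": "data/strands-sdk/docs/agents.md"},
        {"file_path": "data/strands-sdk/docs/logging.md"},  # made-up path
    ][:k]


golden = [{
    "query": "How do I create an agent with custom tools?",
    "expected_chunks": [
        "data/strands-sdk/examples/custom_tools.py",
        "data/strands-sdk/docs/agents.md",
    ],
}]
mean_recall = evaluate(golden, fake_search, k=10)
```

In CI this collapses to one assertion on `mean_recall` against a floor taken from a baseline run, so a chunking or embedding change that drops an expected file fails loudly instead of silently.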
What "RAG is dead" actually means
I started this project because everyone was saying RAG is dead, and I wanted to know what comes after it. The answer, it turns out, is the same engineering work — but decomposed into pieces that are honest about where the difficulty lives.
Retrieval was the easy part. FAISS, sentence-transformers, a threshold, some metadata filters. A day of work, and it was good enough. The real surprise was how much accurate retrieval matters downstream — not just for answer quality, but for token cost. A precise first stage means the agent stops wandering and starts answering.
Assembly was the hard part. Taking disconnected chunks from different files and composing them into context the model can reason about. Managing conversation memory without bloating the context window. Deciding what to include, what to summarize, and what to discard. This is where I spent most of my time and where the system still has the most room to improve.
And re-ranking — the step between retrieval and assembly — is where I cut a corner I shouldn't have. For a small codebase it's survivable, but it's the next thing to fix.
That's context engineering. Not a pipeline. Not a pattern you install from a template. A set of specific engineering decisions about what enters the context window and what doesn't, made independently at each stage, debugged independently when they break.
RAG is dead. Good riddance. The work was always more interesting than the acronym.


