Context Engineering with FAISS and Strands: Beyond the RAG Hype
How Jeff Huber's five retrieval principles map to building production code search with FAISS vector search and the Strands agent framework. Learn to implement named primitives, first-stage retrieval, re-ranking, and golden datasets for production RAG systems.

Why "RAG" is the Wrong Abstraction
"We never use the term rag. I hate the term rag."
This provocative statement from Jeff Huber, CEO of Chroma, in his recent Latent Space interview cuts to the heart of a fundamental problem in how we build AI systems. As he explains:
"RAG is just retrieval, first of all. Like, retrieval, augmented generation. Are three concepts put together into one thing?"
The problem isn't semantic—it's architectural. When we say "RAG," we're conflating retrieval (finding relevant information), augmentation (assembling context), and generation (LLM output) into a single abstraction. This obscures the real engineering work.
Instead, Huber advocates for context engineering:
"Context engineering is the job of figuring out what should be in the context window any given LLM generation step."
And this isn't just a better term—it's a competitive advantage:
"Most AI startups that you know of that you think of today that's doing very well... what are they fundamentally good at? It is context engineering."
This post explores Huber's five retrieval principles and shows how to implement them using FAISS for vector search and Strands SDK for agent orchestration. We'll build a production-ready code search system that indexes the Strands SDK itself, creating a chatbot that answers questions about the framework by searching its own codebase.
Full tutorial: learn-strands/rag-chatbot
The Five Principles of Context Engineering
Principle 1: Don't Ship "RAG." Ship Retrieval.
"Name the primitives: dense, lexical, filters, re-rank, assembly, eval loop."
Huber's first principle is about explicitness. Instead of a monolithic "RAG pipeline," production systems need named, composable primitives:
- Dense retrieval: Vector similarity search (FAISS, embeddings)
- Lexical retrieval: Full-text search, regex, exact matching
- Metadata filters: File types, timestamps, authors
- Re-ranking: LLM or cross-encoder scoring
- Assembly: Combining chunks into coherent context
- Eval loop: Continuous measurement against golden datasets
Our FAISS + Strands architecture implements each primitive explicitly:
# Primitive 1: Dense retrieval with FAISS
@tool
def search_strands_sdk(query: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Dense vector search over Strands SDK codebase."""
    results = vector_store.search(query=query, k=max_results, threshold=0.3)
    formatted_results = []
    for result in results:
        formatted_results.append({
            'text': result['text'],
            'similarity': result['similarity'],
            'file_path': result.get('file_path', 'unknown'),
            'file_type': result.get('file_type', 'unknown')
        })
    return formatted_results

# Primitive 2: Lexical retrieval with metadata filtering
@tool
def search_by_file_type(query: str, file_type: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Combine dense search + metadata filters."""
    # First: semantic search (cast a wide net)
    all_results = vector_store.search(query=query, k=max_results * 3, threshold=0.3)
    # Then: filter by metadata (file type)
    filtered = [r for r in all_results if r.get('file_type', '') == file_type][:max_results]
    return filtered
These aren't hidden implementation details—they're named tools that the agent can invoke explicitly. This is context engineering: making primitives visible and composable.
Principle 2: Win the First Stage with Hybrid Recall
"Using signals like vector search, like full text search, like metadata filtering... to go from 10,000 down to 300."
Huber emphasizes that first-stage retrieval is about aggressive culling, not perfect precision:
"200–300 candidates is fine—LLMs can read."
The goal isn't to retrieve the perfect 5 chunks. It's to reduce thousands of candidates to hundreds using cheap, fast signals. Then let re-ranking handle precision.
For code search specifically, Huber notes:
"We support regex search natively... because we've seen that as like a very powerful tool for code search."
This aligns with what we see in practice: developers often search for exact function names, class definitions, or import statements—things regex handles perfectly.
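Our tutorial doesn't ship a regex tool, but lexical retrieval is easy to add as another named primitive alongside dense search. Here's a minimal sketch, assuming the SDK source is checked out under data/strands-sdk (the regex_search tool and its parameters are illustrative, not part of the tutorial code):

import re
from pathlib import Path
from typing import List, Dict, Any

from strands import tool

@tool
def regex_search(pattern: str, repo_path: str = "data/strands-sdk", max_results: int = 20) -> List[Dict[str, Any]]:
    """Lexical retrieval: scan source files for an exact regex pattern."""
    compiled = re.compile(pattern)
    matches = []
    for path in Path(repo_path).rglob("*.py"):
        try:
            lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        except OSError:
            continue  # skip unreadable files
        for line_no, line in enumerate(lines, start=1):
            if compiled.search(line):
                matches.append({"file_path": str(path), "line": line_no, "text": line.strip()})
                if len(matches) >= max_results:
                    return matches
    return matches

A tool like this complements the dense search below: exact symbol lookups go to the regex primitive, conceptual questions go to the vector index.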
Our implementation combines dense (vector) and sparse (metadata) signals:
class FAISSVectorStore:
    """First-stage retrieval: semantic search with metadata."""

    def search(self, query: str, k: int = 5, threshold: float = 0.3) -> List[Dict[str, Any]]:
        """Dense vector search - first stage retrieval."""
        if self.index.ntotal == 0:
            return []
        # Generate query embedding
        query_embedding = self.embedder.encode([query]).astype('float32')
        faiss.normalize_L2(query_embedding)
        # Search FAISS index
        distances, indices = self.index.search(query_embedding, min(k, self.index.ntotal))
        # IndexFlatL2 returns squared L2 distances; for unit-normalized vectors,
        # cosine similarity = 1 - d²/2
        similarities = 1 - (distances[0] / 2)
        # Build results with metadata
        results = []
        for idx, similarity in zip(indices[0], similarities):
            if similarity >= threshold and idx < len(self.documents):
                doc = self.documents[int(idx)].copy()
                doc['similarity'] = float(similarity)
                doc['index'] = int(idx)
                results.append(doc)
        return results
Key architectural choices:
- Threshold filtering (0.3): Aggressive but fast culling
- L2 normalization: Enables cosine similarity via Euclidean distance
- Metadata preservation: File paths, types, line numbers for later filtering
- Batched encoding: Efficient for bulk indexing
This implements "win the first stage"—get to 200-300 candidates quickly, let re-ranking handle the rest.
Principle 3: Always Re-rank Before Assembly
"Using an LLM as a re-ranker and brute forcing from 300 down to 30, I've seen now emerging... way more cost effective than a lot of people realize."
Re-ranking is non-negotiable for production systems. Huber's recommendation: use the LLM itself as a re-ranker.
In our Strands implementation, re-ranking happens in the response specialist agent:
@tool
def response_specialist(query: str, context: str) -> str:
    """Generate response with implicit LLM re-ranking.

    The LLM's attention mechanism naturally re-ranks chunks
    by relevance when generating the response.
    """
    agent = Agent(
        system_prompt="""You are a response generation specialist for Strands SDK queries.
CRITICAL: Generate answers based ONLY on provided context.
Guidelines:
1. PRIORITIZE the most relevant chunks from context (implicit re-ranking)
2. CITE sources with file paths and line numbers
3. COMBINE information from multiple chunks coherently
4. EXPLICIT: If context is insufficient, say so clearly
5. CODE EXAMPLES: Provide runnable snippets when possible
Use markdown formatting for code blocks.""",
        tools=[use_llm],
        model=gemini_model
    )
    response = agent(f"""User Question: {query}
Retrieved Context:
{context}
Generate a comprehensive answer using ONLY the context above.
Cite specific files and line numbers.""")
    return str(response)
How this implements re-ranking:
The system prompt instructs the agent to "PRIORITIZE the most relevant chunks." The LLM's attention mechanism naturally focuses on chunks that answer the query, effectively re-ranking through selective attention.
This is what Huber means by "brute forcing from 300 down to 30"—the LLM reads all retrieved chunks and implicitly selects the most relevant ones for generating the response.
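If you want that culling step to be explicit and measurable rather than implicit in attention, you can also score candidates with a small LLM pass before assembly. A rough sketch under that assumption (the rerank_chunks helper, its prompt, and the 0-10 scale are ours, not part of the tutorial):

import re
from strands import Agent

def rerank_chunks(query: str, chunks: list, keep: int = 5) -> list:
    """Score each retrieved chunk with the LLM and keep only the top few."""
    scorer = Agent(
        system_prompt="You rate how well a code/document chunk answers a question. "
                      "Reply with a single integer from 0 (irrelevant) to 10 (directly answers it).",
        model=gemini_model  # same model object used elsewhere in this post
    )
    scored = []
    for chunk in chunks:
        reply = str(scorer(f"Question: {query}\n\nChunk:\n{chunk['text']}\n\nScore (0-10):"))
        match = re.search(r"\d+", reply)          # crude parse of the numeric score
        score = min(int(match.group()), 10) if match else 0
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]

The trade-off is one extra LLM call per candidate, which is exactly the "brute force" Huber argues is cheaper than people expect.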
Principle 4: Respect Context Rot
"As you use more and more tokens, the model can pay attention to less and then also can reason sort of less effectively."
Context rot is the phenomenon where LLM performance degrades as context length increases. Huber's advice:
"Tight, structured contexts beat maximal windows."
The mistake many make:
"A lot of people are just still eating everything into the context window... context caching certainly helps, but like there are costs and speed, but isn't helping the context problem at all."
Our implementation combats context rot through:
- Threshold filtering: Only chunks above 0.3 similarity make it to context
- Limited retrieval: Default to 5 chunks, not 50
- Explicit assembly: Response agent combines information, doesn't just concatenate
- Conversation memory: mem0 extracts salient facts instead of stuffing full history
# Initialize mem0 with an in-memory vector store
mem0_config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "strands_chat",
            "embedding_model_dims": 384,
            "path": ":memory:"  # In-memory, no persistence
        }
    },
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "all-MiniLM-L6-v2"}
    }
}
memory = Memory.from_config(mem0_config)

@tool
def recall_conversation(query: str = "", user_id: str = "user") -> str:
    """Retrieve relevant conversation history - not the full transcript."""
    if not query:
        query = "recent conversation history"
    results = memory.search(query, user_id=user_id, limit=5)
    if not results or 'results' not in results or not results['results']:
        return "No previous conversation history found."
    history = []
    for item in results['results']:
        if 'memory' in item:
            history.append(item['memory'])
    return "\n\n".join(history) if history else "No relevant conversation history found."
mem0 solves the conversation memory problem: instead of appending full dialogue history (which grows unbounded), it extracts and stores salient facts, then retrieves only relevant memories per query.
This is context engineering applied to conversation: structured, selective context beats dumping everything.
Principle 5: Build Golden Datasets
"Say to your team Thursday night, we're all going to be in the conference room. We're ordering pizza... bootstrap this... a couple hundred even, like, high-quality examples is extremely beneficial."
Huber's most practical advice: invest one evening creating labeled query-chunk pairs:
"People should be creating small golden data sets of what queries they want to work and what chunks should return... quantitatively evaluate what matters."
Example golden dataset for code search:
[
  {
    "query": "How do I create an agent with custom tools?",
    "expected_chunks": [
      "data/strands-sdk/examples/custom_tools.py",
      "data/strands-sdk/docs/agents.md"
    ],
    "expected_concepts": ["@tool decorator", "Agent class initialization"]
  },
  {
    "query": "What's the difference between Agent and tool?",
    "expected_chunks": [
      "data/strands-sdk/strands/agent.py",
      "data/strands-sdk/strands/tools.py"
    ],
    "expected_concepts": ["Agent orchestration", "tool decorator pattern"]
  }
]
Then wire this into CI:
def evaluate_retrieval(golden_dataset, vector_store):
    """Continuous evaluation against golden queries."""
    total_queries = len(golden_dataset)
    successful_retrievals = 0
    for item in golden_dataset:
        results = vector_store.search(item['query'], k=10)
        retrieved_files = [r['file_path'] for r in results]
        # Check if expected chunks are in the top 10
        if any(expected in retrieved_files for expected in item['expected_chunks']):
            successful_retrievals += 1
    recall_at_10 = successful_retrievals / total_queries
    print(f"Recall@10: {recall_at_10:.2%}")
    return recall_at_10

# Run in CI pipeline
assert evaluate_retrieval(golden_dataset, vector_store) >= 0.80, "Recall dropped below 80%"
This implements Huber's "eval loop" primitive—continuous measurement prevents silent degradation.
Building the System: FAISS + Strands Architecture
Now let's see how these five principles come together in a complete system.
Step 1: Indexing with FAISS
FAISS (Facebook AI Similarity Search) is our dense retrieval primitive. We use the simplest index type—IndexFlatL2—which provides exact search with 100% recall:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any

class FAISSVectorStore:
    """FAISS-based vector store with local embeddings."""

    def __init__(self, local_model: str = "all-MiniLM-L6-v2", dimension: int = 384):
        """Initialize with sentence-transformers for local embeddings."""
        print(f"🔧 Loading local embedding model: {local_model}")
        self.embedder = SentenceTransformer(local_model)
        self.dimension = dimension
        # IndexFlatL2: exact search, 100% recall
        self.index = faiss.IndexFlatL2(dimension)
        self.documents = []
        print(f"✅ Vector store initialized (dimension: {dimension})")

    def add_documents(self, documents: List[Dict[str, Any]], batch_size: int = 32):
        """Add documents with batched embedding generation."""
        if not documents:
            return
        print(f"📥 Adding {len(documents)} documents to vector store...")
        # Extract text from documents
        texts = [doc.get('text', '') for doc in documents]
        # Generate embeddings in batches
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            embeddings = self.embedder.encode(batch, show_progress_bar=False)
            all_embeddings.append(embeddings)
        # Combine and normalize for cosine similarity
        embeddings_array = np.vstack(all_embeddings).astype('float32')
        faiss.normalize_L2(embeddings_array)
        # Add to FAISS index
        self.index.add(embeddings_array)
        self.documents.extend(documents)
        print(f"✅ Indexed {len(documents)} chunks (total: {self.index.ntotal})")
Why IndexFlatL2?
- Exact search: No approximation, 100% recall
- Simple: No training, no hyperparameters
- Fast enough: For <10K vectors, exhaustive search is still sub-10ms
For larger codebases (100K+ chunks), migrate to HNSW for approximate but faster search.
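A sketch of that migration, assuming the same 384-dimensional normalized embeddings (the M and efSearch values below are common starting points, not tuned recommendations):

import faiss

dimension = 384
# HNSW graph index: approximate search, much faster at 100K+ vectors
index = faiss.IndexHNSWFlat(dimension, 32)   # 32 = neighbors per graph node (M)
index.hnsw.efConstruction = 200              # build-time accuracy/speed trade-off
index.hnsw.efSearch = 64                     # query-time accuracy/speed trade-off

# Drop-in for IndexFlatL2 inside FAISSVectorStore: add() and search() keep the same API
# index.add(embeddings_array)
# distances, indices = index.search(query_embedding, k)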
Step 2: Agent-as-Tool Pattern with Strands
Strands SDK's @tool decorator turns functions into agent tools. We create specialized agents for retrieval and response generation:
from strands import Agent, tool
from strands.models.gemini import GeminiModel

# Initialize Gemini model
gemini_model = GeminiModel(
    client_args={"api_key": GOOGLE_API_KEY},
    model_id="gemini-2.5-flash-lite"
)

# Tool 1: Dense retrieval
@tool
def search_strands_sdk(query: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Semantic search over Strands SDK codebase."""
    results = vector_store.search(query=query, k=max_results, threshold=0.3)
    formatted_results = []
    for result in results:
        formatted_results.append({
            'text': result['text'],
            'similarity': result['similarity'],
            'file_path': result.get('file_path', 'unknown'),
            'file_type': result.get('file_type', 'unknown'),
            'relevance_score': f"{result['similarity']:.2%}"
        })
    return formatted_results

# Tool 2: Metadata filtering
@tool
def search_by_file_type(query: str, file_type: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Search with file type filter - hybrid retrieval."""
    all_results = vector_store.search(query=query, k=max_results * 3, threshold=0.3)
    filtered = [r for r in all_results if r.get('file_type', '') == file_type][:max_results]
    return [{
        'text': r['text'],
        'similarity': r['similarity'],
        'file_path': r.get('file_path', 'unknown'),
        'relevance_score': f"{r['similarity']:.2%}"
    } for r in filtered]

# Retrieval Specialist Agent
retrieval_specialist = Agent(
    system_prompt="""You are a retrieval specialist for the Strands SDK.
Your job: search and find the most relevant documentation, code examples,
and explanations from the Strands SDK repository.
Always return:
- File paths and references
- Relevance scores
- Brief context for each result
- Code snippets when available
Use search_strands_sdk for general queries.
Use search_by_file_type when users need specific file types (.py, .md, etc.).""",
    tools=[search_strands_sdk, search_by_file_type],
    model=gemini_model
)
This implements Huber's "name your primitives"—search_strands_sdk and search_by_file_type are explicit, not hidden in a pipeline.
Step 3: Re-ranking and Assembly
The response specialist handles re-ranking and context assembly:
from strands_tools import use_llm

@tool
def response_specialist_tool(query: str, context: str) -> str:
    """Response agent with LLM re-ranking."""
    agent = Agent(
        system_prompt="""You are a response generation specialist for Strands SDK queries.
Generate accurate answers based ONLY on provided context.
Guidelines (context engineering principles):
1. PRIORITIZE the most relevant chunks (re-ranking via attention)
2. CITE sources with file paths and line numbers
3. COMBINE information from multiple chunks coherently
4. EXPLICIT: If context insufficient, say so clearly
5. CODE EXAMPLES: Provide runnable snippets when possible
Use markdown formatting. Explain clearly for AI engineers.""",
        tools=[use_llm],
        model=gemini_model
    )
    response = agent(f"""User Question: {query}
Retrieved Context:
{context}
Generate a comprehensive answer using ONLY the context above.
Cite specific files and line numbers.""")
    return str(response)
The LLM re-ranks implicitly through attention—it focuses on relevant chunks when generating the response.
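Assembly itself is worth naming too: in the tutorial it's just an f-string, but the context block should carry the metadata the response agent is asked to cite, and a size budget keeps context rot in check. A minimal sketch (the assemble_context helper is illustrative, not part of the tutorial code):

from typing import List, Dict, Any

def assemble_context(chunks: List[Dict[str, Any]], max_chars: int = 8000) -> str:
    """Format retrieved chunks into a tight, citable context block."""
    sections = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] {chunk.get('file_path', 'unknown')} (similarity: {chunk.get('similarity', 0):.2f})"
        body = chunk.get('text', '').strip()
        section = f"{header}\n{body}"
        if used + len(section) > max_chars:   # hard budget: respect context rot
            break
        sections.append(section)
        used += len(section)
    return "\n\n---\n\n".join(sections)

# Usage: context = assemble_context(retrieved_chunks)
#        answer = response_specialist_tool(query, context)

The hard character budget is a blunt instrument (token counting would be more precise), but the point stands: assembly is a deliberate step, not a concatenation.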
Step 4: Orchestration
The orchestrator coordinates retrieval, re-ranking, assembly, and memory:
# Conversation memory tools
@tool
def remember_conversation(user_message: str, assistant_response: str, user_id: str = "user") -> str:
    """Store conversation in mem0 memory."""
    memory.add(
        f"User asked: {user_message}\nAssistant responded: {assistant_response}",
        user_id=user_id
    )
    return f"Stored conversation in memory for {user_id}"

@tool
def recall_conversation(query: str = "", user_id: str = "user") -> str:
    """Retrieve relevant conversation history (not full transcript)."""
    if not query:
        query = "recent conversation history"
    results = memory.search(query, user_id=user_id, limit=5)
    if not results or 'results' not in results or not results['results']:
        return "No previous conversation history found."
    history = []
    for item in results['results']:
        if 'memory' in item:
            history.append(item['memory'])
    return "\n\n".join(history) if history else "No relevant conversation history found."

# Main RAG orchestrator
rag_chatbot = Agent(
    system_prompt="""You are an intelligent RAG assistant with expertise in the Strands SDK.
CONTEXT ENGINEERING WORKFLOW (Jeff Huber's principles):
1. FIRST-STAGE RETRIEVAL: Use retrieval_specialist to find relevant docs/code
- Goal: Reduce thousands of candidates to dozens
- Combine semantic search + metadata filtering
2. RE-RANKING: Use response_specialist to generate answer
- LLM re-ranks chunks by relevance via attention
- Assembles tight, structured context
- Cites sources with file paths
3. MEMORY: Use recall_conversation for user context
- Pull relevant conversation history (not full transcript)
- Combine with retrieval results
4. PERSISTENCE: Use remember_conversation after interaction
- Store salient facts for future retrieval
- Combat context rot by not bloating context
CRITICAL: Always retrieve context before answering technical questions.
Prefer code examples from actual Strands SDK.""",
    tools=[
        retrieval_specialist,
        response_specialist_tool,
        remember_conversation,
        recall_conversation,
        use_llm
    ],
    model=gemini_model
)

# Usage
response = rag_chatbot("How do I create an agent with custom tools?", user_id="engineer")
print(response)
This orchestrator implements all five principles:
- Named primitives: Explicit tools for retrieval, re-ranking, memory
- First-stage retrieval: retrieval_specialist reduces candidates
- Re-ranking: response_specialist applies LLM attention
- Respects context rot: mem0 for selective history, threshold filtering
- Eval loop: golden dataset evaluation in CI (a minimal sketch follows this list)
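One way to wire that eval loop into CI, reusing the evaluate_retrieval function from Principle 5 (the file names golden_dataset.json and test_retrieval.py are assumptions, not part of the tutorial):

# test_retrieval.py -- run with pytest in CI
# Assumes evaluate_retrieval and a populated vector_store are importable from the project.
import json

def test_recall_at_10():
    with open("golden_dataset.json") as f:
        golden_dataset = json.load(f)
    recall = evaluate_retrieval(golden_dataset, vector_store)
    assert recall >= 0.80, f"Recall@10 dropped to {recall:.2%}"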
How This Maps to Context Engineering
Let's connect the implementation back to Huber's framework:
| Huber's Principle | Our Implementation |
|---|---|
| Name primitives | search_strands_sdk, search_by_file_type, response_specialist as explicit tools |
| Win first stage | FAISS threshold filtering (0.3) + metadata filters reduce 6K chunks to dozens |
| Always re-rank | Response specialist's LLM attention re-ranks via selective focus |
| Respect context rot | mem0 extracts facts, not full transcripts; limited retrieval (k=5) |
| Golden datasets | Evaluation loop with query-chunk pairs in CI (recommended, not shown) |
The key insight: these aren't separate pipeline stages. They're composable tools the agent orchestrates dynamically.
When you ask "How do I create an agent?", the orchestrator:
- Invokes retrieval_specialist → calls search_strands_sdk (first-stage)
- Passes results to response_specialist → LLM re-ranks via attention
- May invoke recall_conversation → pulls relevant memory
- Generates response citing specific files
- Invokes remember_conversation → stores salient facts
This is context engineering: explicitly controlling what enters the LLM's context window at each step.
Complete Working Example
Here's the full initialization code from our tutorial:
from pathlib import Path
from strands import Agent, tool
from strands.models.gemini import GeminiModel
from strands_tools import use_llm
from mem0 import Memory
import os

# Configuration
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
gemini_model = GeminiModel(
    client_args={"api_key": GOOGLE_API_KEY},
    model_id="gemini-2.5-flash-lite"
)

# Initialize FAISS vector store
vector_store = FAISSVectorStore(
    local_model="all-MiniLM-L6-v2",
    dimension=384
)

# Load pre-built index or create new one
if Path("data/strands_sdk.faiss").exists():
    vector_store.load("data/strands_sdk.faiss", "data/documents.json")
else:
    # Index the Strands SDK
    documents = load_and_chunk_documents(
        repo_path="data/strands-sdk",
        chunk_size=1000,
        chunk_overlap=200
    )
    vector_store.add_documents(documents)
    vector_store.save("data/strands_sdk.faiss", "data/documents.json")

# Initialize mem0 (in-memory conversation memory)
memory = Memory.from_config({
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "strands_chat",
            "embedding_model_dims": 384,
            "path": ":memory:"
        }
    },
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "all-MiniLM-L6-v2"}
    }
})

# Create all tools and agents (code shown earlier)
# ...

# Start chatting
print("💬 Strands SDK RAG Chatbot")
print("Built with context engineering principles from Jeff Huber\n")
while True:
    query = input("👤 You: ")
    if query.lower() in ['exit', 'quit']:
        break
    response = rag_chatbot(query, user_id="user")
    print(f"\n🤖 Assistant: {response}\n")
Full tutorial with self-contained Jupyter notebook: learn-strands/rag-chatbot
The tutorial includes:
- One-command setup script (./start_notebook.sh)
- Automatic SDK cloning and indexing
- Interactive chat UI
- All code embedded in the notebook (fully self-contained)
Conclusion: Context Engineering as a Discipline
Jeff Huber's core insight:
"Context engineering clearly describes the job and it elevates the status of the job."
Building this RAG chatbot taught us that "RAG" obscures the real work:
- First-stage retrieval with hybrid signals (vector + metadata)
- Re-ranking with LLM attention
- Assembly into tight, structured context
- Evaluation against golden datasets
- Memory that extracts facts, not transcripts
FAISS + Strands SDK gave us the primitives to implement these explicitly:
- FAISS: Dense retrieval primitive (exact search, local embeddings)
- Strands agents: Composable tools with the @tool decorator
- Agent orchestration: Dynamic composition, not rigid pipelines
The result: a chatbot that answers questions about Strands SDK by searching its own codebase—implementing all five of Huber's context engineering principles.
Try the tutorial: learn-strands/rag-chatbot
What context engineering patterns are you building? Share your architectures—let's elevate this discipline together.
References
- Latent Space: Jeff Huber on Context Engineering - Full interview
- learn-strands Tutorial - Complete working implementation
- FAISS Documentation - Facebook AI Similarity Search
- Strands SDK - Lightweight agent framework
- mem0 - Conversation memory for agents
"The one thing every successful AI company does well is context engineering." - Jeff Huber



