Context Engineering with FAISS and Strands: Beyond the RAG Hype

How Jeff Huber's five retrieval principles map to building production code search with FAISS vector search and Strands agent framework. Learn to implement named primitives, first-stage retrieval, re-ranking, and golden datasets for production RAG systems.

RAG FAISS Strands LLM context engineering vector search AI agents

November 9, 2025⋅5 min read

Dan

Why "RAG" is the Wrong Abstraction

"We never use the term rag. I hate the term rag."

This provocative statement from Jeff Huber, CEO of Chroma, in his recent Latent Space interview cuts to the heart of a fundamental problem in how we build AI systems. As he explains:

"RAG is just retrieval, first of all. Like, retrieval, augmented generation. Are three concepts put together into one thing?"

The problem isn't semantic—it's architectural. When we say "RAG," we're conflating retrieval (finding relevant information), augmentation (assembling context), and generation (LLM output) into a single abstraction. This obscures the real engineering work.

Instead, Huber advocates for context engineering:

"Context engineering is the job of figuring out what should be in the context window any given LLM generation step."

And this isn't just a better term—it's a competitive advantage:

"Most AI startups that you know of that you think of today that's doing very well... what are they fundamentally good at? It is context engineering."

This post explores Huber's five retrieval principles and shows how to implement them using FAISS for vector search and Strands SDK for agent orchestration. We'll build a production-ready code search system that indexes the Strands SDK itself, creating a chatbot that answers questions about the framework by searching its own codebase.

Full tutorial: learn-strands/rag-chatbot

The Five Principles of Context Engineering

Principle 1: Don't Ship "RAG." Ship Retrieval.

"Name the primitives: dense, lexical, filters, re-rank, assembly, eval loop."

Huber's first principle is about explicitness. Instead of a monolithic "RAG pipeline," production systems need named, composable primitives:

Dense retrieval: Vector similarity search (FAISS, embeddings)
Lexical retrieval: Full-text search, regex, exact matching
Metadata filters: File types, timestamps, authors
Re-ranking: LLM or cross-encoder scoring
Assembly: Combining chunks into coherent context
Eval loop: Continuous measurement against golden datasets

Our FAISS + Strands architecture implements each primitive explicitly:

# Primitive 1: Dense retrieval with FAISS
@tool
def search_strands_sdk(query: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Dense vector search over Strands SDK codebase."""
    results = vector_store.search(query=query, k=max_results, threshold=0.3)

    formatted_results = []
    for result in results:
        formatted_results.append({
            'text': result['text'],
            'similarity': result['similarity'],
            'file_path': result.get('file_path', 'unknown'),
            'file_type': result.get('file_type', 'unknown')
        })

    return formatted_results


# Primitive 2: Dense retrieval with metadata filtering
@tool
def search_by_file_type(query: str, file_type: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Combine dense search + metadata filters."""
    # First: semantic search (cast wide net)
    all_results = vector_store.search(query=query, k=max_results * 3, threshold=0.3)

    # Then: filter by metadata (file type)
    filtered = [r for r in all_results if r.get('file_type', '') == file_type][:max_results]

    return filtered

These aren't hidden implementation details—they're named tools that the agent can invoke explicitly. This is context engineering: making primitives visible and composable.

Principle 2: Win the First Stage with Hybrid Recall

"Using signals like vector search, like full text search, like metadata filtering... to go from 10,000 down to 300."

Huber emphasizes that first-stage retrieval is about aggressive culling, not perfect precision:

"200–300 candidates is fine—LLMs can read."

The goal isn't to retrieve the perfect 5 chunks. It's to reduce thousands of candidates to hundreds using cheap, fast signals. Then let re-ranking handle precision.

For code search specifically, Huber notes:

"We support regex search natively... because we've seen that as like a very powerful tool for code search."

This aligns with what we see in practice: developers often search for exact function names, class definitions, or import statements—things regex handles perfectly.
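The tutorial code in this post covers dense search and metadata filters but does not show a lexical primitive. A minimal sketch of regex search over the same chunk dictionaries might look like the following. The `regex_search` function, the `match_count` field, and the toy chunks are assumptions for illustration, not part of the Strands SDK or the tutorial.

```python
import re
from typing import Any, Dict, List


def regex_search(
    documents: List[Dict[str, Any]],
    pattern: str,
    max_results: int = 50,
) -> List[Dict[str, Any]]:
    """Return chunks whose text matches a regular expression.

    `documents` is assumed to be a list of chunk dicts with 'text' and
    'file_path' keys, like the ones stored next to the FAISS index.
    """
    compiled = re.compile(pattern, re.MULTILINE)
    hits = []
    for doc in documents:
        count = sum(1 for _ in compiled.finditer(doc.get("text", "")))
        if count:
            hit = dict(doc)  # Copy so the stored chunk is not mutated.
            hit["match_count"] = count
            hits.append(hit)
    # More matches first; ties keep their original order (stable sort).
    hits.sort(key=lambda d: d["match_count"], reverse=True)
    return hits[:max_results]


# Toy chunks for illustration only (not real Strands SDK content).
chunks = [
    {"text": "class Agent:\n    def __call__(self, prompt): ...", "file_path": "a.py"},
    {"text": "from strands import Agent\nagent = Agent()", "file_path": "b.py"},
    {"text": "Agents are described in the docs.", "file_path": "c.md"},
]

class_defs = regex_search(chunks, r"^class\s+Agent\b")
mentions = regex_search(chunks, r"\bAgent\b")
```

The anchored pattern finds only the class definition, while the looser word-boundary pattern also picks up imports and call sites. That distinction is hard to express with embeddings alone.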

Our implementation combines dense vector search with metadata carried on each chunk, so results can be filtered later:

class FAISSVectorStore:
    """First-stage retrieval: semantic search with metadata."""

    def search(self, query: str, k: int = 5, threshold: float = 0.3) -> List[Dict[str, Any]]:
        """Dense vector search - first stage retrieval."""
        if self.index.ntotal == 0:
            return []

        # Generate query embedding
        query_embedding = self.embedder.encode([query]).astype('float32')
        faiss.normalize_L2(query_embedding)

        # Search FAISS index
        distances, indices = self.index.search(query_embedding, min(k, self.index.ntotal))

        # IndexFlatL2 returns squared L2 distances; for unit vectors, cosine = 1 - d^2 / 2
        similarities = 1 - (distances[0] / 2)

        # Build results with metadata
        results = []
        for idx, similarity in zip(indices[0], similarities):
            if similarity >= threshold and 0 <= idx < len(self.documents):
                doc = self.documents[int(idx)].copy()
                doc['similarity'] = float(similarity)
                doc['index'] = int(idx)
                results.append(doc)

        return results

Key architectural choices:

Threshold filtering (0.3): Aggressive but fast culling
L2 normalization: Enables cosine similarity via Euclidean distance
Metadata preservation: File paths, types, line numbers for later filtering
Batched encoding: Efficient for bulk indexing

This implements "win the first stage": called with a larger k, the same search yields the 200-300 candidates Huber describes, and re-ranking handles the rest.
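One common way to merge several cheap signals into a single candidate list is reciprocal rank fusion (RRF). The sketch below is not part of the tutorial code. It assumes each signal returns an ordered list of chunk IDs, and the IDs shown are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[str]],
    k: int = 60,
    limit: int = 300,
) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one candidate list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so output is deterministic.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:limit]


# Hypothetical outputs of a dense search and a regex search.
dense_hits = ["agent.py#3", "tools.py#1", "README.md#0"]
regex_hits = ["tools.py#1", "decorator.py#2"]
candidates = reciprocal_rank_fusion([dense_hits, regex_hits])
```

A chunk that appears in both lists (`tools.py#1` here) rises to the top even though neither signal ranked it first in isolation. RRF needs only ranks, so it sidesteps the problem of comparing cosine similarities with regex match counts.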

Principle 3: Always Re-rank Before Assembly

"Using an LLM as a re-ranker and brute forcing from 300 down to 30, I've seen now emerging... way more cost effective than a lot of people realize."

Re-ranking is non-negotiable for production systems. Huber's recommendation: use the LLM itself as a re-ranker.

In our Strands implementation, re-ranking happens in the response specialist agent:

@tool
def response_specialist(query: str, context: str) -> str:
    """Generate response with implicit LLM re-ranking.

    The LLM's attention mechanism naturally re-ranks chunks
    by relevance when generating the response.
    """
    agent = Agent(
        system_prompt="""You are a response generation specialist for Strands SDK queries.

CRITICAL: Generate answers based ONLY on provided context.

Guidelines:
1. PRIORITIZE the most relevant chunks from context (implicit re-ranking)
2. CITE sources with file paths and line numbers
3. COMBINE information from multiple chunks coherently
4. EXPLICIT: If context is insufficient, say so clearly
5. CODE EXAMPLES: Provide runnable snippets when possible

Use markdown formatting for code blocks.""",
        tools=[use_llm],
        model=gemini_model
    )

    response = agent(f"""User Question: {query}

Retrieved Context:
{context}

Generate a comprehensive answer using ONLY the context above.
Cite specific files and line numbers.""")

    return str(response)

How this implements re-ranking:

The system prompt instructs the agent to "PRIORITIZE the most relevant chunks," so ranking happens implicitly inside the generation step: the model reads every retrieved chunk and draws mainly on the ones that answer the query. No separate scores are produced, which means this ordering cannot be inspected or evaluated on its own.

This is a lightweight stand-in for what Huber describes as "brute forcing from 300 down to 30," where an LLM scores the candidates in a dedicated pass before assembly.
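An explicit re-ranking pass, where every candidate gets its own score before anything reaches the response agent, could be sketched as follows. The `rerank` function and the `score_fn` parameter are hypothetical names. The `keyword_overlap` scorer is only a stand-in so the example runs offline; in practice it would be replaced by an LLM or cross-encoder call.

```python
from typing import Any, Callable, Dict, List


def rerank(
    query: str,
    candidates: List[Dict[str, Any]],
    score_fn: Callable[[str, str], float],
    top_n: int = 30,
) -> List[Dict[str, Any]]:
    """Score every candidate against the query and keep the best top_n."""
    scored = []
    for chunk in candidates:
        item = dict(chunk)  # Copy so the caller's candidates are untouched.
        item["rerank_score"] = float(score_fn(query, chunk["text"]))
        scored.append(item)
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]


def keyword_overlap(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query words that occur in the chunk."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)


# Toy candidates for illustration only.
candidates = [
    {"text": "install the package with pip", "file_path": "README.md"},
    {"text": "use the tool decorator to create custom tools", "file_path": "tools.py"},
    {"text": "create an agent with tools", "file_path": "agent.py"},
]
top = rerank("create agent with custom tools", candidates, keyword_overlap, top_n=2)
```

Because the scores are attached to each chunk, this stage can be logged and measured independently of the final answer.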

Principle 4: Respect Context Rot

"As you use more and more tokens, the model can pay attention to less and then also can reason sort of less effectively."

Context rot is the phenomenon where LLM performance degrades as context length increases. Huber's advice:

"Tight, structured contexts beat maximal windows."

The mistake many make:

"A lot of people are just still eating everything into the context window... context caching certainly helps, but like there are costs and speed, but isn't helping the context problem at all."

Our implementation combats context rot through:

Threshold filtering: Only chunks above 0.3 similarity make it to context
Limited retrieval: Default to 5 chunks, not 50
Explicit assembly: Response agent combines information, doesn't just concatenate
Conversation memory: mem0 extracts salient facts instead of stuffing full history

# Initialize mem0 with in-memory vector store
mem0_config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "strands_chat",
            "embedding_model_dims": 384,
            "path": ":memory:"  # In-memory, no persistence
        }
    },
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "all-MiniLM-L6-v2"}
    }
}

memory = Memory.from_config(mem0_config)

@tool
def recall_conversation(query: str = "", user_id: str = "user") -> str:
    """Retrieve relevant conversation history - not full transcript."""
    if not query:
        query = "recent conversation history"

    results = memory.search(query, user_id=user_id, limit=5)

    if not results or 'results' not in results or not results['results']:
        return "No previous conversation history found."

    history = []
    for item in results['results']:
        if 'memory' in item:
            history.append(item['memory'])

    return "\n\n".join(history) if history else "No relevant conversation history found."

mem0 solves the conversation memory problem: instead of appending full dialogue history (which grows unbounded), it extracts and stores salient facts, then retrieves only relevant memories per query.

This is context engineering applied to conversation: structured, selective context beats dumping everything.
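The same idea applies to retrieved chunks: cap what enters the window with an explicit budget instead of a fixed chunk count. The following is a minimal sketch, not tutorial code. `assemble_context` is a hypothetical helper, and the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer.

```python
from typing import Any, Dict, List


def assemble_context(
    chunks: List[Dict[str, Any]],
    max_tokens: int = 2000,
    chars_per_token: int = 4,
) -> List[Dict[str, Any]]:
    """Greedily keep the highest-similarity chunks that fit a token budget."""
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["similarity"], reverse=True):
        # Rough token estimate; always at least 1 so empty chunks still cost something.
        cost = len(chunk["text"]) // chars_per_token + 1
        if used + cost > max_tokens:
            continue  # Too big for the remaining budget; a smaller chunk may still fit.
        selected.append(chunk)
        used += cost
    return selected


# Toy chunks: estimated costs are 101, 201 and 26 tokens.
chunks = [
    {"text": "a" * 400, "similarity": 0.9},
    {"text": "b" * 800, "similarity": 0.8},
    {"text": "c" * 100, "similarity": 0.5},
]
kept = assemble_context(chunks, max_tokens=150)
```

With a 150-token budget the second chunk is skipped even though it ranks above the third, because it would overflow the window on its own.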

Principle 5: Build Golden Datasets

"Say to your team Thursday night, we're all going to be in the conference room. We're ordering pizza... bootstrap this... a couple hundred even, like, high-quality examples is extremely beneficial."

Huber's most practical advice: invest one evening creating labeled query-chunk pairs:

"People should be creating small golden data sets of what queries they want to work and what chunks should return... quantitatively evaluate what matters."

Example golden dataset for code search:

[
  {
    "query": "How do I create an agent with custom tools?",
    "expected_chunks": [
      "data/strands-sdk/examples/custom_tools.py",
      "data/strands-sdk/docs/agents.md"
    ],
    "expected_concepts": ["@tool decorator", "Agent class initialization"]
  },
  {
    "query": "What's the difference between Agent and tool?",
    "expected_chunks": [
      "data/strands-sdk/strands/agent.py",
      "data/strands-sdk/strands/tools.py"
    ],
    "expected_concepts": ["Agent orchestration", "tool decorator pattern"]
  }
]

Then wire this into CI:

def evaluate_retrieval(golden_dataset, vector_store):
    """Continuous evaluation against golden queries."""
    total_queries = len(golden_dataset)
    successful_retrievals = 0

    for item in golden_dataset:
        results = vector_store.search(item['query'], k=10)
        retrieved_files = [r['file_path'] for r in results]

        # Lenient recall: a query counts as a hit if any expected file is in the top 10
        if any(expected in retrieved_files for expected in item['expected_chunks']):
            successful_retrievals += 1

    recall_at_10 = successful_retrievals / total_queries
    print(f"Recall@10: {recall_at_10:.2%}")
    return recall_at_10

# Run in CI pipeline
assert evaluate_retrieval(golden_dataset, vector_store) >= 0.80, "Recall dropped below 80%"

This implements Huber's "eval loop" primitive—continuous measurement prevents silent degradation.
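The check above counts a query as successful when any expected file shows up. Two stricter metrics that are often tracked alongside it are per-query recall@k (what fraction of the expected files were found) and mean reciprocal rank (how high the first correct file appears). The sketch below uses hypothetical helper names and toy file lists.

```python
from typing import Dict, List


def recall_at_k(retrieved: List[str], expected: List[str], k: int = 10) -> float:
    """Fraction of expected files that appear in the top-k results."""
    if not expected:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for path in expected if path in top_k) / len(expected)


def reciprocal_rank(retrieved: List[str], expected: List[str]) -> float:
    """1 / rank of the first expected file, or 0.0 if none was retrieved."""
    wanted = set(expected)
    for rank, path in enumerate(retrieved, start=1):
        if path in wanted:
            return 1.0 / rank
    return 0.0


def summarize(runs: List[Dict[str, List[str]]], k: int = 10) -> Dict[str, float]:
    """Average both metrics over a list of {'retrieved', 'expected'} records."""
    if not runs:
        return {"recall@k": 0.0, "mrr": 0.0}
    recalls = [recall_at_k(r["retrieved"], r["expected"], k) for r in runs]
    rrs = [reciprocal_rank(r["retrieved"], r["expected"]) for r in runs]
    return {"recall@k": sum(recalls) / len(runs), "mrr": sum(rrs) / len(runs)}


# Toy results for two golden queries (file names are illustrative).
runs = [
    {"retrieved": ["agent.py", "tools.py", "README.md"], "expected": ["tools.py", "agents.md"]},
    {"retrieved": ["tools.py", "agent.py"], "expected": ["tools.py"]},
]
metrics = summarize(runs)
```

Here the first query finds one of its two expected files at rank 2 and the second finds its only expected file at rank 1, so both averages come out to 0.75.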

Building the System: FAISS + Strands Architecture

Now let's see how these five principles come together in a complete system.

Step 1: Indexing with FAISS

FAISS (Facebook AI Similarity Search) is our dense retrieval primitive. We use the simplest index type—IndexFlatL2—which provides exact search with 100% recall:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any

class FAISSVectorStore:
    """FAISS-based vector store with local embeddings."""

    def __init__(self, local_model: str = "all-MiniLM-L6-v2", dimension: int = 384):
        """Initialize with sentence-transformers for local embeddings."""
        print(f"🔧 Loading local embedding model: {local_model}")
        self.embedder = SentenceTransformer(local_model)
        self.dimension = dimension

        # IndexFlatL2: exact search, 100% recall
        self.index = faiss.IndexFlatL2(dimension)
        self.documents = []
        print(f"✅ Vector store initialized (dimension: {dimension})")

    def add_documents(self, documents: List[Dict[str, Any]], batch_size: int = 32):
        """Add documents with batched embedding generation."""
        if not documents:
            return

        print(f"📥 Adding {len(documents)} documents to vector store...")

        # Extract text from documents
        texts = [doc.get('text', '') for doc in documents]

        # Generate embeddings in batches
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            embeddings = self.embedder.encode(batch, show_progress_bar=False)
            all_embeddings.append(embeddings)

        # Combine and normalize for cosine similarity
        embeddings_array = np.vstack(all_embeddings).astype('float32')
        faiss.normalize_L2(embeddings_array)

        # Add to FAISS index
        self.index.add(embeddings_array)
        self.documents.extend(documents)

        print(f"✅ Indexed {len(documents)} chunks (total: {self.index.ntotal})")

Why IndexFlatL2?

Exact search: No approximation, 100% recall
Simple: No training, no hyperparameters
Fast enough: For <10K vectors, exhaustive search is still sub-10ms

For larger codebases (100K+ chunks), migrate to HNSW for approximate but faster search.
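The search code converts distances with `1 - distances / 2`. This works because IndexFlatL2 reports squared L2 distances, and for unit vectors the squared distance equals 2 - 2·cos(θ). The stdlib-only check below illustrates that identity with two made-up vectors. The helper names are for illustration and are not FAISS APIs.

```python
import math
from typing import List


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (what faiss.normalize_L2 does per row)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from the definition."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [3.0, 4.0]
b = [4.0, 3.0]

# Path 1: normalize, take the squared L2 distance, convert.
from_distance = 1 - squared_l2(normalize(a), normalize(b)) / 2

# Path 2: cosine similarity on the raw vectors.
direct = cosine(a, b)
```

Both paths give the same value (0.96 for these vectors), which is why normalizing before `index.add` and before `index.search` lets an L2 index stand in for a cosine index.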

Step 2: Agent-as-Tool Pattern with Strands

Strands SDK's @tool decorator turns functions into agent tools. We create specialized agents for retrieval and response generation:

from strands import Agent, tool
from strands.models.gemini import GeminiModel

# Initialize Gemini model
gemini_model = GeminiModel(
    client_args={"api_key": GOOGLE_API_KEY},
    model_id="gemini-2.5-flash-lite"
)

# Tool 1: Dense retrieval
@tool
def search_strands_sdk(query: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Semantic search over Strands SDK codebase."""
    results = vector_store.search(query=query, k=max_results, threshold=0.3)

    formatted_results = []
    for result in results:
        formatted_results.append({
            'text': result['text'],
            'similarity': result['similarity'],
            'file_path': result.get('file_path', 'unknown'),
            'file_type': result.get('file_type', 'unknown'),
            'relevance_score': f"{result['similarity']:.2%}"
        })

    return formatted_results


# Tool 2: Metadata filtering
@tool
def search_by_file_type(query: str, file_type: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Search with file type filter - hybrid retrieval."""
    all_results = vector_store.search(query=query, k=max_results * 3, threshold=0.3)

    filtered = [r for r in all_results if r.get('file_type', '') == file_type][:max_results]

    return [{
        'text': r['text'],
        'similarity': r['similarity'],
        'file_path': r.get('file_path', 'unknown'),
        'relevance_score': f"{r['similarity']:.2%}"
    } for r in filtered]


# Retrieval Specialist Agent
retrieval_specialist = Agent(
    system_prompt="""You are a retrieval specialist for the Strands SDK.

Your job: search and find the most relevant documentation, code examples,
and explanations from the Strands SDK repository.

Always return:
- File paths and references
- Relevance scores
- Brief context for each result
- Code snippets when available

Use search_strands_sdk for general queries.
Use search_by_file_type when users need specific file types (.py, .md, etc.).""",
    tools=[search_strands_sdk, search_by_file_type],
    model=gemini_model
)

This implements Huber's "name your primitives"—search_strands_sdk and search_by_file_type are explicit, not hidden in a pipeline.
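One caveat with post-filtering: fetching `max_results * 3` candidates and then filtering can return fewer results than requested when the wanted file type is rare near the top. A possible refinement, sketched here with a hypothetical `filtered_search` helper and a fake search function, is to widen `k` until enough matches are found or the index has nothing more to give.

```python
from typing import Any, Callable, Dict, List

# Assumed shape of a search function: (query, k) -> ranked list of chunk dicts.
SearchFn = Callable[[str, int], List[Dict[str, Any]]]


def filtered_search(
    search_fn: SearchFn,
    query: str,
    file_type: str,
    max_results: int = 5,
    max_k: int = 320,
) -> List[Dict[str, Any]]:
    """Post-filter by file type, doubling k until enough matches are found."""
    k = max_results * 3
    while True:
        results = search_fn(query, k)
        matches = [r for r in results if r.get("file_type") == file_type]
        # Fewer than k results means a larger k cannot surface anything new.
        exhausted = len(results) < k
        if len(matches) >= max_results or exhausted or k >= max_k:
            return matches[:max_results]
        k = min(k * 2, max_k)


# Fake ranked corpus: 30 Python chunks ranked ahead of 10 Markdown chunks.
corpus = [{"text": f"code {i}", "file_type": ".py"} for i in range(30)]
corpus += [{"text": f"doc {i}", "file_type": ".md"} for i in range(10)]


def fake_search(query: str, k: int) -> List[Dict[str, Any]]:
    """Stand-in for vector_store.search that ignores the query."""
    return corpus[:k]


single_pass = [r for r in fake_search("q", 15) if r["file_type"] == ".md"]
widened = filtered_search(fake_search, "q", ".md", max_results=5)
```

In this toy corpus a single 3x over-fetch finds no Markdown chunks at all, while the widening loop returns five. The `max_k` cap keeps the worst case bounded.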

Step 3: Re-ranking and Assembly

The response specialist handles re-ranking and context assembly:

from strands_tools import use_llm

@tool
def response_specialist_tool(query: str, context: str) -> str:
    """Response agent with LLM re-ranking."""
    agent = Agent(
        system_prompt="""You are a response generation specialist for Strands SDK queries.

Generate accurate answers based ONLY on provided context.

Guidelines (context engineering principles):
1. PRIORITIZE the most relevant chunks (re-ranking via attention)
2. CITE sources with file paths and line numbers
3. COMBINE information from multiple chunks coherently
4. EXPLICIT: If context insufficient, say so clearly
5. CODE EXAMPLES: Provide runnable snippets when possible

Use markdown formatting. Explain clearly for AI engineers.""",
        tools=[use_llm],
        model=gemini_model
    )

    response = agent(f"""User Question: {query}

Retrieved Context:
{context}

Generate a comprehensive answer using ONLY the context above.
Cite specific files and line numbers.""")

    return str(response)

The LLM re-ranks implicitly through attention—it focuses on relevant chunks when generating the response.
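The `context` argument is a plain string, so how the retrieved chunks are rendered into it determines whether the model can cite sources. One possible layout, assuming result dicts shaped like the output of `search_strands_sdk`, is sketched below. The `format_context` helper and the sample results are hypothetical.

```python
from typing import Any, Dict, List


def format_context(results: List[Dict[str, Any]]) -> str:
    """Render retrieved chunks as numbered, labeled blocks the LLM can cite."""
    blocks = []
    for number, result in enumerate(results, start=1):
        header = (
            f"[{number}] {result.get('file_path', 'unknown')} "
            f"(similarity {result.get('similarity', 0.0):.2f})"
        )
        blocks.append(f"{header}\n{result['text'].strip()}")
    # A visible separator keeps chunk boundaries unambiguous.
    return "\n\n---\n\n".join(blocks)


# Sample results for illustration; the path is made up.
results = [
    {"text": "  def tool(fn): ...\n", "file_path": "strands/tools.py", "similarity": 0.8123},
    {"text": "Agents call tools.", "similarity": 0.5},
]
context = format_context(results)
```

Putting the file path in each header gives the response agent something concrete to quote when the prompt asks it to cite sources.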

Step 4: Orchestration

The orchestrator coordinates retrieval, re-ranking, assembly, and memory:

# Conversation memory tools
@tool
def remember_conversation(user_message: str, assistant_response: str, user_id: str = "user") -> str:
    """Store conversation in mem0 memory."""
    memory.add(
        f"User asked: {user_message}\nAssistant responded: {assistant_response}",
        user_id=user_id
    )
    return f"Stored conversation in memory for {user_id}"


@tool
def recall_conversation(query: str = "", user_id: str = "user") -> str:
    """Retrieve relevant conversation history (not full transcript)."""
    if not query:
        query = "recent conversation history"

    results = memory.search(query, user_id=user_id, limit=5)

    if not results or 'results' not in results or not results['results']:
        return "No previous conversation history found."

    history = []
    for item in results['results']:
        if 'memory' in item:
            history.append(item['memory'])

    return "\n\n".join(history) if history else "No relevant conversation history found."


# Main RAG orchestrator
rag_chatbot = Agent(
    system_prompt="""You are an intelligent RAG assistant with expertise in the Strands SDK.

CONTEXT ENGINEERING WORKFLOW (Jeff Huber's principles):

1. FIRST-STAGE RETRIEVAL: Use retrieval_specialist to find relevant docs/code
   - Goal: Reduce thousands of candidates to dozens
   - Combine semantic search + metadata filtering

2. RE-RANKING: Use response_specialist_tool to generate answer
   - LLM re-ranks chunks by relevance via attention
   - Assembles tight, structured context
   - Cites sources with file paths

3. MEMORY: Use recall_conversation for user context
   - Pull relevant conversation history (not full transcript)
   - Combine with retrieval results

4. PERSISTENCE: Use remember_conversation after interaction
   - Store salient facts for future retrieval
   - Combat context rot by not bloating context

CRITICAL: Always retrieve context before answering technical questions.
Prefer code examples from actual Strands SDK.""",
    tools=[
        retrieval_specialist,
        response_specialist_tool,
        remember_conversation,
        recall_conversation,
        use_llm
    ],
    model=gemini_model
)

# Usage
response = rag_chatbot("How do I create an agent with custom tools?", user_id="engineer")
print(response)

This orchestrator implements all five principles:

Named primitives: Explicit tools for retrieval, re-ranking, memory
First-stage retrieval: retrieval_specialist reduces candidates
Re-ranking: response_specialist applies LLM attention
Respects context rot: mem0 for selective history, threshold filtering
Eval loop: (Add golden dataset evaluation in CI)

How This Maps to Context Engineering

Let's connect the implementation back to Huber's framework:

| Huber's Principle | Our Implementation |
| --- | --- |
| Name primitives | `search_strands_sdk`, `search_by_file_type`, `response_specialist` as explicit tools |
| Win first stage | FAISS threshold filtering (0.3) + metadata filters reduce 6K chunks to dozens |
| Always re-rank | Response specialist's LLM attention re-ranks via selective focus |
| Respect context rot | mem0 extracts facts, not full transcripts; limited retrieval (k=5) |
| Golden datasets | Evaluation loop with query-chunk pairs in CI (recommended, not shown) |

The key insight: these aren't separate pipeline stages. They're composable tools the agent orchestrates dynamically.

When you ask "How do I create an agent?", the orchestrator:

Invokes retrieval_specialist → calls search_strands_sdk (first-stage)
Passes results to response_specialist → LLM re-ranks via attention
May invoke recall_conversation → pulls relevant memory
Generates response citing specific files
Invokes remember_conversation → stores salient facts

This is context engineering: explicitly controlling what enters the LLM's context window at each step.

Complete Working Example

Here's the full initialization code from our tutorial:

from strands import Agent, tool
from strands.models.gemini import GeminiModel
from strands_tools import use_llm
from mem0 import Memory
import os
from pathlib import Path

# Configuration
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
gemini_model = GeminiModel(
    client_args={"api_key": GOOGLE_API_KEY},
    model_id="gemini-2.5-flash-lite"
)

# Initialize FAISS vector store
vector_store = FAISSVectorStore(
    local_model="all-MiniLM-L6-v2",
    dimension=384
)

# Load pre-built index or create new one
# (load, save, and load_and_chunk_documents are tutorial helpers not shown in this post)
if Path("data/strands_sdk.faiss").exists():
    vector_store.load("data/strands_sdk.faiss", "data/documents.json")
else:
    # Index the Strands SDK
    documents = load_and_chunk_documents(
        repo_path="data/strands-sdk",
        chunk_size=1000,
        chunk_overlap=200
    )
    vector_store.add_documents(documents)
    vector_store.save("data/strands_sdk.faiss", "data/documents.json")

# Initialize mem0 (in-memory conversation memory)
memory = Memory.from_config({
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "strands_chat",
            "embedding_model_dims": 384,
            "path": ":memory:"
        }
    },
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "all-MiniLM-L6-v2"}
    }
})

# Create all tools and agents (code shown earlier)
# ...

# Start chatting
print("💬 Strands SDK RAG Chatbot")
print("Built with context engineering principles from Jeff Huber\n")

while True:
    query = input("👤 You: ")
    if query.lower() in ['exit', 'quit']:
        break

    response = rag_chatbot(query, user_id="user")
    print(f"\n🤖 Assistant: {response}\n")
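The `load_and_chunk_documents` helper is called above but not shown in this post. As a rough idea of what the chunking step might do with `chunk_size=1000` and `chunk_overlap=200`, here is a minimal character-window sketch. The `chunk_text` function is an assumption for illustration; the tutorial's actual helper may split differently, for example on code structure.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reaches the end of the text.
    return chunks


# 2,500 characters of filler text for illustration.
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
```

Each window starts 800 characters after the previous one, so the last 200 characters of a chunk reappear at the start of the next. The overlap reduces the chance that a function or sentence is cut in half with no chunk containing it whole.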

Full tutorial with self-contained Jupyter notebook: learn-strands/rag-chatbot

The tutorial includes:

One-command setup script (./start_notebook.sh)
Automatic SDK cloning and indexing
Interactive chat UI
All code embedded in notebook (fully self-contained)

Conclusion: Context Engineering as a Discipline

Jeff Huber's core insight:

"Context engineering clearly describes the job and it elevates the status of the job."

Building this RAG chatbot taught us that "RAG" obscures the real work:

First-stage retrieval with hybrid signals (vector + metadata)
Re-ranking with LLM attention
Assembly into tight, structured context
Evaluation against golden datasets
Memory that extracts facts, not transcripts

FAISS + Strands SDK gave us the primitives to implement these explicitly:

FAISS: Dense retrieval primitive (exact search, local embeddings)
Strands agents: Composable tools with @tool decorator
mem0: Conversation memory that respects context rot
Agent orchestration: Dynamic composition, not rigid pipelines

The result: a chatbot that answers questions about Strands SDK by searching its own codebase—implementing all five of Huber's context engineering principles.

Try the tutorial: learn-strands/rag-chatbot

What context engineering patterns are you building? Share your architectures—let's elevate this discipline together.

References

Latent Space: Jeff Huber on Context Engineering - Full interview
learn-strands Tutorial - Complete working implementation
FAISS Documentation - Facebook AI Similarity Search
Strands SDK - Lightweight agent framework
mem0 - Conversation memory for agents

"The one thing every successful AI company does well is context engineering." - Jeff Huber
