When your agent starts a session, Agent Memory needs to answer one question: out of everything this agent has ever done, what’s most relevant right now? The answer comes from three independent search streams running in parallel — BM25 keyword matching, vector similarity search, and knowledge graph traversal — whose results are then merged into a single ranked list using Reciprocal Rank Fusion. This triple-stream approach is why Agent Memory achieves 95.2% recall at R@5 on the LongMemEval-S benchmark: no single search method covers all retrieval patterns, but all three together almost never miss.

Three Search Streams

BM25 — Keyword Search

BM25 is always available: no LLM key, no embedding provider, no setup required. It tokenizes your query and your stored memories using stemming and synonym expansion, then scores matches using TF-IDF-style term frequency weighting.
BM25 excels at exact and near-exact matches: if you search for “jose JWT middleware”, BM25 finds observations that contain those terms. It also handles multilingual content: Greek, Cyrillic, Hebrew, Arabic, and accented Latin are tokenized out of the box. For Chinese, Japanese, and Korean content, install the optional segmenters:
npm install @node-rs/jieba tiny-segmenter
Default weight: BM25_WEIGHT=0.4
BM25 runs in-process against your SQLite database with no network calls. It’s the fastest stream and the fallback when no embedding provider is configured.
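To make the scoring step concrete, here is a minimal sketch of classic Okapi BM25 in Python. It is not Agent Memory's implementation: it splits on whitespace only (no stemming, synonym expansion, or multilingual tokenization), and the function name `bm25_scores` and the parameters `k1=1.2` and `b=0.75` are common textbook defaults assumed for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical observations, invented for this example.
docs = [
    "added jose JWT middleware to the express router",
    "fixed N+1 query in the orders endpoint",
    "rotated JWT signing keys",
]
scores = bm25_scores("jose JWT middleware", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The first observation wins because it matches all three query terms, the third gets a smaller score for matching only "JWT", and the second scores zero. That zero is the gap the vector stream is meant to fill.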

Vector Search — Semantic Similarity

Vector search converts your query and stored observations into dense embedding vectors, then finds the observations whose vectors are closest to your query using cosine similarity. This enables semantic retrieval: searching for “database performance optimization” can return the observation where your agent fixed an N+1 query, even if that observation never used the words “performance” or “optimization.”
Vector search requires an embedding provider. Agent Memory auto-detects your provider from environment variables (see the Embedding Providers table below). For free offline embeddings, install:
npm install @xenova/transformers
This gives you the all-MiniLM-L6-v2 model running entirely on your machine, with no API calls and no cost.
Default weight: VECTOR_WEIGHT=0.6
Vector search adds approximately 8 percentage points of recall over BM25 alone on the LongMemEval-S benchmark.
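The nearest-neighbor step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's code: the three-dimensional vectors and observation IDs are made up (real embedding models produce vectors with hundreds of dimensions), and `cosine_similarity` and `nearest` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def nearest(query_vec, stored, top_k=2):
    """Return the IDs of the top_k stored vectors most similar to the query."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [obs_id for obs_id, _ in ranked[:top_k]]


# Toy vectors: the first axis loosely stands for "database performance".
stored = {
    "obs_n_plus_one_fix": [0.9, 0.1, 0.0],
    "obs_css_tweak": [0.0, 0.2, 0.9],
    "obs_index_added": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
top = nearest(query, stored)
print(top)
```

Ranking depends only on the direction of the vectors, not on shared words, which is why the N+1 fix can surface for a query that never mentions it by name.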

Knowledge Graph — Conceptual Traversal

The knowledge graph stream traverses a graph of entities and relationships extracted from your memories. When your query mentions a concept (a file name, a library, an error type, an architectural pattern), Agent Memory identifies matching graph nodes and walks their edges to find related observations that a keyword or vector search might not surface.
Knowledge graph search requires GRAPH_EXTRACTION_ENABLED=true and an LLM provider. Once enabled, Agent Memory automatically extracts entities and relationships at session end.
Default weight: AGENTMEMORY_GRAPH_WEIGHT=0.3
The graph is optional but powerful for conceptual queries: “what decisions did we make about authentication?” surfaces the graph path through the authentication concept node to related decisions, files, and patterns.
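A depth-limited walk of this kind can be sketched as a breadth-first search. The sketch below is an assumption about the general technique, not Agent Memory's traversal: the node names are invented, the edge types are borrowed from the documented list, and treating edges as undirected is a simplification chosen for the example.

```python
from collections import deque


def related_nodes(edges, start, depth):
    """Return {node: hop_count} for nodes within `depth` hops of `start`."""
    neighbors = {}
    for src, _edge_type, dst in edges:
        # Assumption: follow edges in both directions when discovering nodes.
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == depth:
            continue  # do not expand past the hop limit
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    seen.pop(start)
    return seen


# Hypothetical graph fragment around an "authentication" concept node.
edges = [
    ("auth-middleware.ts", "uses", "jose"),
    ("auth-middleware.ts", "related_to", "authentication"),
    ("use-jwt-over-sessions", "related_to", "authentication"),
    ("jose", "related_to", "token-expiry-bug"),
]
hops = related_nodes(edges, "authentication", depth=2)
print(hops)
```

With a depth of 2, the walk reaches the middleware file, the decision node, and the jose library, but stops before the bug node three hops away.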

Reciprocal Rank Fusion (RRF)

The three streams each produce a ranked list of results. Reciprocal Rank Fusion merges those lists into one, giving credit to results that appear highly ranked across multiple streams:
final_score(d) = Σ_{i ∈ streams} 1 / (k + rank_i(d))
Agent Memory uses k=60, a standard value that smooths out differences between rank positions. A result ranked #1 by BM25 and #3 by vector search scores higher than a result ranked #1 by only one stream. This fusion approach means you don’t need to tune individual stream weights obsessively — RRF naturally surfaces results that multiple search methods agree on. Results are also session-diversified: Agent Memory caps results from any single session at 3, so you get a spread of relevant context across your history rather than the entirety of one verbose session.
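The formula is small enough to sketch directly. The following Python is an unweighted illustration of Reciprocal Rank Fusion with k=60 as stated above. The function name `rrf_fuse` and the document IDs are hypothetical, and how Agent Memory combines the per-stream weights with these scores is not covered here.

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked ID lists into one list of (id, score), best first."""
    scores = {}
    for ranked_ids in rankings.values():
        # Ranks start at 1, so the top result of a stream contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


rankings = {
    "bm25": ["a", "b", "c"],
    "vector": ["d", "c", "a"],
}
fused = rrf_fuse(rankings)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here "a" is ranked #1 by BM25 and #3 by vector search, so it scores 1/61 + 1/63 and beats "d", which is ranked #1 by the vector stream alone and scores only 1/61.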

Embedding Providers

Agent Memory auto-detects your embedding provider from environment variables. Set EMBEDDING_PROVIDER to force a specific provider, or let Agent Memory pick based on which API keys are present.
| Provider | Env Variable | Model | Notes |
| --- | --- | --- | --- |
| Local (offline) | EMBEDDING_PROVIDER=local | all-MiniLM-L6-v2 | Free, no API calls. Requires npm install @xenova/transformers. Best starting point. |
| OpenAI | OPENAI_API_KEY | text-embedding-3-small | Highest quality embeddings. Also activates the OpenAI LLM provider. |
| Voyage AI | VOYAGE_API_KEY | voyage-code-3 | Optimized specifically for code; recommended if your sessions are code-heavy. |
| Cohere | COHERE_API_KEY | embed-english-v3.0 | Strong general-purpose embeddings with a free trial tier. |
| Gemini | GEMINI_API_KEY | gemini-embedding-001 | 100+ languages, supports 768/1536/3072 dimensions (MRL), 2048-token input. |
| OpenRouter | OPENROUTER_API_KEY | provider-dependent | Multi-model proxy; embedding support varies by the underlying model. |
If multiple API keys are set, Agent Memory uses this auto-detection priority: Gemini → OpenAI → Voyage → Cohere → OpenRouter. Set EMBEDDING_PROVIDER=local explicitly to use local embeddings even when other keys are present.
Configure your provider in ~/.agentmemory/.env:
# Option 1: Free local embeddings (recommended for getting started)
EMBEDDING_PROVIDER=local

# Option 2: OpenAI embeddings
OPENAI_API_KEY=sk-...
# OPENAI_EMBEDDING_MODEL=text-embedding-3-small  # optional override
# OPENAI_EMBEDDING_DIMENSIONS=1536               # required for non-standard models

# Option 3: Voyage (best for code)
VOYAGE_API_KEY=pa-...

# Option 4: Gemini
GEMINI_API_KEY=...
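The auto-detection order described above can be expressed as a simple priority scan. This Python sketch only mirrors the documented rules: an explicit EMBEDDING_PROVIDER wins, otherwise the first API key found in priority order decides. The function name and the returned provider strings other than "local" are assumptions for illustration, and returning None stands in for the BM25-only fallback.

```python
# Documented auto-detection priority: Gemini, OpenAI, Voyage, Cohere, OpenRouter.
PRIORITY = [
    ("GEMINI_API_KEY", "gemini"),
    ("OPENAI_API_KEY", "openai"),
    ("VOYAGE_API_KEY", "voyage"),
    ("COHERE_API_KEY", "cohere"),
    ("OPENROUTER_API_KEY", "openrouter"),
]


def detect_embedding_provider(env):
    """Pick an embedding provider name from a mapping of environment variables."""
    forced = env.get("EMBEDDING_PROVIDER")
    if forced:
        return forced.strip().lower()  # an explicit setting overrides detection
    for key, provider in PRIORITY:
        if env.get(key):
            return provider
    return None  # no provider configured: keyword search only


print(detect_embedding_provider({"OPENAI_API_KEY": "sk-...", "GEMINI_API_KEY": "..."}))
print(detect_embedding_provider({"OPENAI_API_KEY": "sk-...", "EMBEDDING_PROVIDER": "local"}))
print(detect_embedding_provider({}))
```

To try it against your real environment, pass `dict(os.environ)` after importing `os`.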

Knowledge Graph

When GRAPH_EXTRACTION_ENABLED=true, Agent Memory uses your LLM to extract entities and relationships from observations at session end. These form a graph of typed nodes and edges:
Node types: file, function, concept, error, decision, pattern, library, person, project, preference, location, organization, event
Edge types: uses, imports, modifies, causes, fixes, depends_on, related_to, prefers, blocked_by, caused_by, optimizes_for, rejected, avoids, succeeded_by
Enable graph extraction in ~/.agentmemory/.env:
GRAPH_EXTRACTION_ENABLED=true
Query the graph directly using the memory_graph_query MCP tool:
{
  "tool": "memory_graph_query",
  "arguments": {
    "query": "authentication",
    "depth": 2
  }
}
Or via the REST API:
curl -X POST http://localhost:3111/agentmemory/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "authentication", "depth": 2}'
The graph viewer is available in the real-time dashboard at http://localhost:3113.

Search Tuning

You can tune every aspect of how search behaves through environment variables in ~/.agentmemory/.env:
# Stream weights (must be 0–1; don't need to sum to 1 — RRF handles normalization)
BM25_WEIGHT=0.4                  # keyword search contribution (default: 0.4)
VECTOR_WEIGHT=0.6                # semantic similarity contribution (default: 0.6)
AGENTMEMORY_GRAPH_WEIGHT=0.3     # knowledge graph contribution (default: 0.3)

# Result limits
TOKEN_BUDGET=2000                # max tokens of context injected per session (default: 2000)
MAX_OBS_PER_SESSION=500          # max observations stored per session (default: 500)
Increase BM25_WEIGHT (e.g., to 0.6) when you’re working with highly technical, terminology-dense codebases where exact term matching matters — specific function names, error codes, flag names. Keyword matching is more precise when the vocabulary is consistent.
Increase VECTOR_WEIGHT (e.g., to 0.8) when you want more semantic retrieval — finding observations about a concept even when the wording varies. Useful when your sessions use varied language to describe the same problems.
Enable the knowledge graph when you want conceptual traversal — “what else relates to this library?” or “what caused this error pattern?” The graph excels at multi-hop queries that keyword and vector search don’t cover well. Requires an LLM provider.
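If you generate or template your .env file, a small pre-flight check can catch out-of-range weights before they reach Agent Memory. The sketch below uses only the variable names, defaults, and the 0 to 1 range from the configuration above. The function `load_stream_weights` is hypothetical, and how Agent Memory itself reacts to an invalid value is not something this example claims to reproduce.

```python
# Defaults as documented in the Search Tuning configuration.
DEFAULT_WEIGHTS = {
    "BM25_WEIGHT": 0.4,
    "VECTOR_WEIGHT": 0.6,
    "AGENTMEMORY_GRAPH_WEIGHT": 0.3,
}


def load_stream_weights(env):
    """Read stream weights from a mapping, falling back to documented defaults."""
    weights = {}
    for name, default in DEFAULT_WEIGHTS.items():
        raw = env.get(name)
        value = default if raw is None or raw == "" else float(raw)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        weights[name] = value
    return weights


# Example: favor keyword matching for a terminology-dense codebase.
print(load_stream_weights({"BM25_WEIGHT": "0.6"}))
```

The weights are not required to sum to 1, so the check validates each value on its own.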

Token Budget

The TOKEN_BUDGET setting controls the maximum number of tokens Agent Memory injects into your context window at session start. The default is 2,000 tokens.
TOKEN_BUDGET=2000   # default — ~1,900 tokens of actual context in practice
Agent Memory uses your budget to fit as many high-scoring memories as possible:
  • Results are scored and ranked by the hybrid search
  • The ranked results are session-diversified (max 3 per session) before any budget trimming
  • Memories are added to the context block from highest to lowest score
  • Once the running token count would exceed TOKEN_BUDGET, injection stops
Increase the budget if you want more historical context and your agent model has a large context window. Decrease it if you’re paying per token and want to minimize context costs — the top memories are the most relevant anyway.
TOKEN_BUDGET=4000   # more context, more tokens per session
TOKEN_BUDGET=1000   # leaner injection, lower token cost
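The selection rules above (cap each session at 3 results, then add memories in score order until the budget would be exceeded) can be sketched as a two-pass greedy filter. This is an illustration, not Agent Memory's code: `select_for_injection` is a hypothetical name, and `estimate_tokens` uses a rough four-characters-per-token guess in place of a real tokenizer.

```python
from collections import Counter


def estimate_tokens(text):
    # Assumption: roughly four characters per token; a real tokenizer will differ.
    return max(1, len(text) // 4)


def select_for_injection(ranked, token_budget=2000, max_per_session=3):
    """ranked: list of (session_id, text) pairs sorted by score, best first."""
    # Pass 1: session diversification, keeping at most max_per_session per session.
    per_session = Counter()
    diversified = []
    for session_id, text in ranked:
        if per_session[session_id] < max_per_session:
            per_session[session_id] += 1
            diversified.append((session_id, text))

    # Pass 2: add memories in order and stop once the next one would overflow.
    selected, used = [], 0
    for session_id, text in diversified:
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            break
        selected.append((session_id, text))
        used += cost
    return selected, used


# Four memories from one verbose session, then one from another (100 tokens each).
ranked = [("s1", "x" * 400)] * 4 + [("s2", "y" * 400)]
selected, used = select_for_injection(ranked, token_budget=400)
print([session_id for session_id, _ in selected], used)
```

Because diversification runs first, the fourth memory from session s1 is dropped and the s2 memory fits within the same 400-token budget.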
To run a hybrid search on demand, call the memory_smart_search MCP tool from your agent:
{
  "tool": "memory_smart_search",
  "arguments": {
    "query": "how does the auth middleware work",
    "project": "my-api",
    "limit": 10
  }
}
This runs the full triple-stream hybrid search and returns ranked CompressedObservation results with individual BM25, vector, and graph scores plus the combined score.
Agent Memory works without any API keys using BM25-only search. Install @xenova/transformers to add free local vector embeddings. Add an LLM key and set GRAPH_EXTRACTION_ENABLED=true for the full triple-stream experience. Each layer is additive — you get value at every tier.