Modern AI systems struggle with memory. They often forget past interactions or rely on Retrieval-Augmented Generation (RAG), which depends on constant access to external data. This becomes a limitation when building assistants that need both historical context and a deeper understanding of users.
MemPalace offers a different approach, enabling structured, persistent memory with higher precision and consistency. In this article, we explore how it improves AI memory systems and how you can implement it effectively.

MemPalace is an open-source, local-first memory system that stores conversations and project data in their original form. Each message is treated as a distinct memory unit, enabling persistent, structured recall.
Its design follows a hierarchical “palace” model: Wings for people or projects, Rooms for topics, Halls for memory types, and Drawers for transcripts, with Closets for summaries.
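To make the hierarchy concrete, here is a minimal sketch of how a palace address could be modeled. The `PalaceAddress` class and its path format are illustrative assumptions about the Wing/Room/Hall/Drawer layout, not MemPalace’s actual internals:

```python
from dataclasses import dataclass

# Illustrative only: models the loci-style hierarchy as a simple address path.
@dataclass(frozen=True)
class PalaceAddress:
    wing: str    # person or project
    room: str    # topic
    hall: str    # memory type, e.g. 'transcripts' or 'summaries'
    drawer: str  # individual transcript or summary unit

    def path(self) -> str:
        return '/'.join([self.wing, self.room, self.hall, self.drawer])

addr = PalaceAddress('project-fastapi', 'auth', 'transcripts', 'turn-0004')
print(addr.path())  # project-fastapi/auth/transcripts/turn-0004
```

A path like this makes hierarchical narrowing natural: restricting a search to one wing or room is just a prefix match.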
Traditional systems such as RAG pipelines and vector databases optimize for retrieval efficiency at the cost of context richness: they split data into chunks, embed each chunk, and retrieve the most similar chunks at inference time.
MemPalace takes a different approach to storing information. Its hybrid design yields better reasoning and better traceability than conventional memory systems.

Most agent memory tools use an LLM to summarize or extract key facts from conversations. Tools such as Mem0 and Zep analyze chat content and distill it into brief records of essential facts and user preferences. This loses contextual information and subtle detail, because an LLM must decide what is “important” and discard the rest.
MemPalace takes the opposite approach: store everything. It keeps a complete, verbatim record of every message exchanged between user and assistant, with no summarization or deletion. This raw storage has important advantages: nothing is lost to an LLM’s judgment of importance, full context remains available for later reasoning, and every recalled memory can be traced back to its original wording.
MemPalace’s design is grounded in the ancient mnemonic method of loci. Its multi-tiered structure makes stored memories easy to locate and retrieve. The following overview explains the palace’s hierarchy and how data moves through it.
MemPalace’s pipeline has two main components: writing memory at ingestion time and reading memory at query time.
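To illustrate the write/read split, here is a dependency-free toy version of the pipeline. The `TinyMemory` class is purely illustrative: it scores by word overlap, whereas MemPalace uses vector embeddings:

```python
# Toy two-phase memory: a write path that ingests turns, and a read path
# that scores stored turns against a query. Not MemPalace's actual code.
class TinyMemory:
    def __init__(self):
        self.records = []  # (turn, text)

    def write(self, turn: int, text: str) -> None:
        self.records.append((turn, text))

    def read(self, query: str, top_k: int = 2):
        q = set(query.lower().split())
        scored = [
            (len(q & set(text.lower().split())), turn, text)
            for turn, text in self.records
        ]
        scored.sort(reverse=True)
        return [(turn, text) for score, turn, text in scored[:top_k] if score > 0]

mem = TinyMemory()
mem.write(1, 'I prefer async endpoints')
mem.write(2, 'Use Redis for caching')
print(mem.read('async endpoints please'))  # [(1, 'I prefer async endpoints')]
```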
MemPalace also offers an optional compression feature called AAAK, a shorthand system for storing extensive information in a minimal number of tokens. The compression is lossy: it primarily uses regular expressions to abbreviate words and extracts key sentences, reducing token counts by roughly 30x.
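The exact AAAK rules are not published here, but a lossy compressor of this shape can be sketched from the two mechanisms the article names: regex abbreviation plus key-sentence extraction. The abbreviation table and the first-sentence heuristic below are invented for illustration:

```python
import re

# Hypothetical AAAK-style lossy compressor (illustrative rules only).
ABBREVIATIONS = [
    (r'\bbecause\b', 'bc'),
    (r'\bwith\b', 'w/'),
    (r'\bwithout\b', 'w/o'),
    (r'\bapproximately\b', '~'),
]

def compress(text: str, keep_sentences: int = 1) -> str:
    # Step 1: keep only the first N sentences (crude key-sentence extraction).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    kept = ' '.join(sentences[:keep_sentences])
    # Step 2: apply regex abbreviations.
    for pattern, abbrev in ABBREVIATIONS:
        kept = re.sub(pattern, abbrev, kept, flags=re.IGNORECASE)
    return kept

original = 'I chose Redis because it is fast. It also works without much setup.'
print(compress(original))  # 'I chose Redis bc it is fast.'
```

Both steps are irreversible, which is exactly why the article calls AAAK lossy: the trade is fidelity for token count.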
This lets AI agents maintain permanent, searchable memories. Conversation turns are converted into vector representations and saved in ChromaDB, so the agent can pull in only the memories relevant to a query instead of its entire history.
Data ingestion is the first step. MemPalace listens to every turn of a conversation and captures user messages, AI responses, and metadata. It then prepares this raw text for storage.
During ingestion, the pipeline produces a vector embedding for each chunk: a sentence-transformer model converts the text into a high-dimensional numerical vector, and ChromaDB stores that vector alongside the original text and its metadata.
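Before embedding, text is split into chunks. A minimal character-based chunker matching the `chunk_size=512` / `chunk_overlap=64` values from `PalaceConfig` might look like this (MemPalace may chunk by tokens instead; this is a sketch):

```python
# Fixed-size chunking with overlap: each chunk repeats the last
# `chunk_overlap` characters of the previous one so no boundary
# context is lost at embedding time.
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = ''.join(str(i % 10) for i in range(1000))
chunks = chunk_text(text)
print([len(c) for c in chunks])  # [512, 512, 104]
```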
The index therefore has two key components: the embedding vector used for similarity search, and the metadata record (original text, session ID, role, turn number, timestamp) stored alongside it.
At query time, the user’s message is embedded into a query vector, and MemPalace searches ChromaDB for the most similar entries. Only chunks scoring at or above the 0.70 similarity threshold are returned.
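The scoring step can be sketched in pure Python: cosine similarity between a query vector and stored vectors, keeping only hits at or above the 0.70 threshold. In practice ChromaDB performs this search; the tiny vectors below are made up for illustration:

```python
import math

# Cosine similarity, then threshold filtering at 0.70 as described above.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = [1.0, 0.0, 1.0]
stored = {
    'turn-4': [1.0, 0.1, 0.9],   # similar -> kept
    'turn-9': [0.0, 1.0, 0.0],   # unrelated -> filtered out
}
hits = {k: round(cosine(query, v), 2) for k, v in stored.items()}
kept = {k: s for k, s in hits.items() if s >= 0.70}
print(kept)  # {'turn-4': 1.0}
```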
The retrieval pipeline applies three filters in order: a session_id metadata filter (so no cross-session bleed occurs), the 0.70 minimum similarity score, and a top-K cutoff. MemPalace does not stuff the entire conversation history into the prompt; instead, it builds a structured block from the top-K retrieved chunks and prepends it to the system prompt. The LLM sees only the relevant past context, not every turn.
The injected context block looks like this (the same format appears in the retrieval output later in this article):

[Memory 1 | score=0.94 | turn=4]
User prefers async endpoints. PostgreSQL + SQLAlchemy 2.
[Memory 2 | score=0.88 | turn=12]
User wants concise code examples. No verbose explanations.

Each memory block carries a similarity score and turn number, giving the LLM provenance for every injected memory: it can weigh a memory scored 0.94 differently from one scored 0.71. Injection adds no extra ChromaDB overhead, since it reuses the results already fetched during search.
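A formatter producing that bracketed block might look like the sketch below. The hit fields (`score`, `turn`, `text`) are assumptions about the retrieval result structure, chosen to match the output shown later:

```python
# Turns retrieved hits into the bracketed injection block described above.
def format_memory_block(hits):
    lines = [f'[MemPalace] Retrieved {len(hits)} memories.']
    for i, hit in enumerate(hits, start=1):
        lines.append(f"[Memory {i} | score={hit['score']:.2f} | turn={hit['turn']}]")
        lines.append(hit['text'])
    return '\n'.join(lines)

hits = [
    {'score': 0.94, 'turn': 4, 'text': 'User prefers async endpoints.'},
    {'score': 0.77, 'turn': 19, 'text': 'Project: FastAPI SaaS backend.'},
]
print(format_memory_block(hits))
```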
LangGraph lets you build agents as state machines: nodes perform single tasks, and edges determine the flow between them. MemPalace plugs in through two specialized nodes wired to the chat node, one for retrieval and one for saving, giving LangGraph agents persistent, searchable memory.
This section walks through each integration step, with complete Python code and the terminal output you should see at each stage.
Install MemPalace, LangGraph, ChromaDB, and the sentence-transformers library in a Python virtual environment.

Verify all packages installed correctly:
import mempalace
import langgraph
import chromadb
print(f'MemPalace: {mempalace.__version__}')
print(f'LangGraph: {langgraph.__version__}')
print(f'ChromaDB: {chromadb.__version__}')
Output:
MemPalace: 3.3.3
LangGraph: 1.1.10
ChromaDB: 1.5.8
Create a .env file at the root of your project. These variables control where ChromaDB stores its data and which embedding model MemPalace uses.
OPENAI_API_KEY=sk-...
MEMPALACE_DB_PATH="./chroma_palace"
MEMPALACE_COLLECTION="agent_memory"
MEMPALACE_EMBED_MODEL="all-MiniLM-L6-v2"
The following code creates the ChromaDB client connection, prepares the embedding function, and builds a MemPalace instance. The collection is created on the first run; subsequent runs automatically load the existing collection. Put this in palace_init.py.
import os
from dotenv import load_dotenv
import chromadb
from chromadb.utils import embedding_functions
from mempalace import MemPalace, PalaceConfig
load_dotenv()
# 1. Persistent ChromaDB client
chroma_client = chromadb.PersistentClient(
    path=os.getenv('MEMPALACE_DB_PATH', './chroma_palace')
)

# 2. Sentence-transformer embedding function
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=os.getenv('MEMPALACE_EMBED_MODEL', 'all-MiniLM-L6-v2'),
    device='cpu'  # switch to 'cuda' if a GPU is available
)

# 3. Get or create a named collection
collection = chroma_client.get_or_create_collection(
    name=os.getenv('MEMPALACE_COLLECTION', 'agent_memory'),
    embedding_function=embed_fn,
    metadata={'hnsw:space': 'cosine'}
)

# 4. Configure MemPalace
config = PalaceConfig(
    max_memories=5000,
    similarity_threshold=0.75,
    chunk_size=512,
    chunk_overlap=64,
    top_k=5,
)

# 5. Build the palace instance that the other modules import
#    (constructor signature assumed; adjust to your MemPalace version)
palace = MemPalace(collection=collection, config=config)
print(f'Palace ready. Memories stored: {palace.count()}')
Output:
# First run (empty palace):
Palace ready. Memories stored: 0
# Subsequent runs (data persists):
Palace ready. Memories stored: 243
LangGraph passes a state dictionary along its node connections. The AgentState TypedDict needs four fields: the message list, the injected memory context, a turn counter, and the session ID. The chat node reads from this state and writes back to it. Put this in agent.py.
from __future__ import annotations
from typing import Annotated, TypedDict, List
from langgraph.graph import StateGraph, END
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage
from langchain_openai import ChatOpenAI
class AgentState(TypedDict):
    messages: List[BaseMessage]
    memory_context: str  # retrieved memories, injected into system prompt
    turn_count: int      # tracks turns for auto-save trigger
    session_id: str

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.7)

def build_system_prompt(memory_ctx: str) -> str:
    base = 'You are a helpful assistant with persistent memory.\n'
    if memory_ctx:
        return base + f'\n## Relevant memories:\n{memory_ctx}\n'
    return base

def chat_node(state: AgentState) -> AgentState:
    system = build_system_prompt(state['memory_context'])
    response = llm.invoke([
        {'role': 'system', 'content': system},
        *state['messages']
    ])
    return {
        **state,
        'messages': state['messages'] + [AIMessage(content=response.content)],
        'turn_count': state['turn_count'] + 1,
    }
The retrieve node runs before every chat turn. It takes the most recent human message, uses it to search ChromaDB through MemPalace, and stores the formatted results in memory_context, which the chat node then sees in its system prompt. Put this in search_hooks.py.
from langchain_core.messages import HumanMessage
from palace_init import palace
from agent import AgentState
def retrieve_memories_node(state: AgentState) -> AgentState:
    messages = state['messages']
    if not messages:
        return {**state, 'memory_context': ''}

    # Use the last human message as the search query
    query = ''
    for msg in reversed(messages):
        if isinstance(msg, HumanMessage):
            query = msg.content
            break
    if not query:
        return {**state, 'memory_context': ''}

    # Search ChromaDB via MemPalace
    results = palace.search(
        query=query,
        top_k=5,
        filters={'session_id': state['session_id']},
        min_score=0.70
    )
    if not results:
        return {**state, 'memory_context': ''}

    # Format results for the system prompt
    # (hit attributes assumed: score, metadata, text; adjust to your version)
    ctx_lines = []
    for i, hit in enumerate(results, start=1):
        ctx_lines.append(f'[Memory {i} | score={hit.score:.2f} | turn={hit.metadata["turn"]}]')
        ctx_lines.append(hit.text)
    print(f'[MemPalace] Retrieved {len(results)} memories.')
    return {**state, 'memory_context': '\n'.join(ctx_lines)}
Output:
[MemPalace] Retrieved 3 memories.
[Memory 1 | score=0.94 | turn=4]
User prefers async endpoints. PostgreSQL + SQLAlchemy 2.
[Memory 2 | score=0.88 | turn=12]
User wants concise code examples. No verbose explanations.
[Memory 3 | score=0.77 | turn=19]
Project: FastAPI SaaS backend with Redis caching.
The save node runs after the chat node via a conditional edge. When turn_count reaches a multiple of 15, it writes the last 15 messages to ChromaDB with role, turn, and timestamp metadata, then resets turn_count to zero. Put this in autosave.py.
from datetime import datetime
from langchain_core.messages import HumanMessage, AIMessage
from palace_init import palace
from agent import AgentState
SAVE_EVERY = 15
def save_memories_node(state: AgentState) -> AgentState:
    messages = state['messages']
    session_id = state['session_id']
    batch_start = max(0, len(messages) - SAVE_EVERY)
    batch = messages[batch_start:]

    docs, metadatas, ids = [], [], []
    for i, msg in enumerate(batch):
        role = 'human' if isinstance(msg, HumanMessage) else 'ai'
        docs.append(msg.content)
        metadatas.append({
            'session_id': session_id,
            'role': role,
            'turn': batch_start + i,
            'saved_at': datetime.utcnow().isoformat(),
        })
        ids.append(f'{session_id}-turn-{batch_start + i}')

    palace.add_batch(documents=docs, metadatas=metadatas, ids=ids)
    print(f'[MemPalace] Saved {len(docs)} messages. Total: {palace.count()}')
    return {**state, 'turn_count': 0}  # reset counter

def should_save(state: AgentState) -> str:
    return 'save' if state['turn_count'] % SAVE_EVERY == 0 else 'end'
Output:
# Turn 15 fires the save:
[MemPalace] Saved 15 messages. Total: 15
# Turn 30 fires the save again:
[MemPalace] Saved 15 messages. Total: 30
As the palace grows, raw transcripts take up space and retrieval gets noisier. The summarize node fires after every save, once the total document count exceeds a threshold: it uses an LLM to condense a batch of 15 past dialogue turns into a single summary, then deletes the raw originals. Put this in summarizer.py.
from datetime import datetime
from typing import List
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_openai import ChatOpenAI
from palace_init import palace
SUMMARIZE_EVERY = 15 # batch window size
COMPRESS_THRESHOLD = 50 # only compress once palace exceeds this
summarizer_llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
SUMMARY_PROMPT = '''You are a memory compressor for an AI assistant.
Given the conversation excerpt below, produce a dense factual summary.
Preserve all user preferences, decisions, and context.
Write in third person. Aim for 3-6 sentences.
Conversation:
{transcript}
Summary:'''
def _format_transcript(messages: List[BaseMessage]) -> str:
    lines = []
    for msg in messages:
        role = 'User' if isinstance(msg, HumanMessage) else 'Assistant'
        lines.append(f'{role}: {msg.content}')
    return '\n'.join(lines)

def summarize_and_compress(messages, session_id, batch_start) -> str:
    transcript = _format_transcript(messages)
    prompt = SUMMARY_PROMPT.format(transcript=transcript)
    response = summarizer_llm.invoke([HumanMessage(content=prompt)])
    summary_text = response.content.strip()

    summary_id = f'{session_id}-summary-turns-{batch_start}-{batch_start + len(messages)}'
    palace.add_batch(
        documents=[summary_text],
        metadatas=[{
            'session_id': session_id,
            'type': 'summary',
            'turn_start': batch_start,
            'turn_end': batch_start + len(messages),
            'saved_at': datetime.utcnow().isoformat(),
            'raw_turns': len(messages),
        }],
        ids=[summary_id],
    )
    return summary_id

def delete_raw_batch(session_id: str, batch_start: int, batch_end: int) -> None:
    # Remove the raw turn documents now covered by the summary.
    # (Assumes MemPalace exposes a delete-by-ids method; adjust to your version.)
    raw_ids = [f'{session_id}-turn-{turn}' for turn in range(batch_start, batch_end)]
    palace.delete(ids=raw_ids)
The flow: 15 raw chunks go in, the LLM condenses them into a 3-6 sentence summary, and a single summary chunk comes out; ChromaDB then deletes the 15 originals. That cuts storage by roughly 93 percent while preserving the meaning of the content. Next, we create a summarizer node that decides when compression should run.
from agent import AgentState
from palace_init import palace
from summarizer import (
    summarize_and_compress,
    delete_raw_batch,
    SUMMARIZE_EVERY,
    COMPRESS_THRESHOLD
)
def summarize_node(state: AgentState) -> AgentState:
    if palace.count() < COMPRESS_THRESHOLD:
        print(f'[Summarizer] Skipped — {palace.count()} docs in palace.')
        return state

    messages = state['messages']
    session_id = state['session_id']
    total_turns = len(messages)

    batch_start = max(0, total_turns - SUMMARIZE_EVERY * 2)
    batch_end = batch_start + SUMMARIZE_EVERY
    batch = messages[batch_start:batch_end]
    if not batch:
        return state

    summarize_and_compress(batch, session_id, batch_start)
    delete_raw_batch(session_id, batch_start, batch_end)
    print(f'[Summarizer] Palace size after compression: {palace.count()}')
    return state

def should_summarize(state: AgentState) -> str:
    return 'summarize' if state['turn_count'] == 0 else 'end'
Finally, merge all the nodes into one StateGraph. The graph flows: retrieve -> chat -> (save | end) -> (summarize | end). Conditional edges keep it efficient, since each node activates only when its trigger condition is met. Put this in full_graph.py.
from langgraph.graph import StateGraph, END
from agent import AgentState, chat_node
from search_hooks import retrieve_memories_node
from autosave import save_memories_node, should_save
from summarize_node import summarize_node, should_summarize
graph = StateGraph(AgentState)
graph.add_node('retrieve', retrieve_memories_node)
graph.add_node('chat', chat_node)
graph.add_node('save', save_memories_node)
graph.add_node('summarize', summarize_node)
graph.set_entry_point('retrieve')
graph.add_edge('retrieve', 'chat')
# After chat: save if turn_count hit the threshold
graph.add_conditional_edges(
    'chat',
    should_save,
    {
        'save': 'save',
        'end': END
    }
)

# After save: compress if palace is large enough
graph.add_conditional_edges(
    'save',
    should_summarize,
    {
        'summarize': 'summarize',
        'end': END
    }
)
graph.add_edge('summarize', END)
agent = graph.compile()
To verify everything, we run a 20-turn test conversation and check three things: the auto-save fires at turn 15, memory retrieval kicks in from around turn 10 onward, and cross-session recall returns accurate results with plausible similarity scores.
import uuid
from langchain_core.messages import HumanMessage
from full_graph import agent
from palace_init import palace
SAMPLE_TURNS = [
    'Hi! I am building a FastAPI backend for a SaaS app.',
    'I prefer async endpoints. PostgreSQL is my database.',
    'Can you suggest a folder structure for the project?',
    'I want to add JWT authentication.',
    'Pydantic v2 for validation, SQLAlchemy 2 async ORM.',
    'Keep code examples concise — no verbose explanations.',
    'What is the best way to handle database migrations?',
    'Show me an async endpoint with a DB session dependency.',
    'Add rate limiting to the auth routes.',
    'How should I structure Pydantic schemas?',
    'I also need background tasks for email sending.',
    'Use Redis for caching user sessions.',
    'What testing framework do you recommend?',
    'Help me write a pytest fixture for the DB.',
    'Run a final check — is the project structure solid?',  # turn 15 -> save
    'Now add a websocket for real-time notifications.',
    'How do I deploy this to AWS ECS?',
    'Add a Dockerfile and docker-compose.yml.',
    'Configure CORS for the frontend at localhost:3000.',
    'Final review — anything I missed?',  # turn 20
]
def run_test():
    session_id = str(uuid.uuid4())
    state = {
        'messages': [],
        'memory_context': '',
        'turn_count': 0,
        'session_id': session_id,
    }
    print(f'=== Session: {session_id[:8]}... ===')
    for turn, text in enumerate(SAMPLE_TURNS, start=1):
        state['messages'].append(HumanMessage(content=text))
        state = agent.invoke(state)
        print(f"Turn {turn:02d} | memories={palace.count():04d} | ctx={bool(state['memory_context'])}")
    print(f'Final memories in palace: {palace.count()}')

if __name__ == '__main__':
    run_test()
Output:
=== Session: a3f9c2d1... ===
Turn 01 | memories=0000 | ctx=False
Turn 02 | memories=0000 | ctx=False
Turn 05 | memories=0000 | ctx=False
[MemPalace] Retrieved 1 memories.
Turn 10 | memories=0000 | ctx=True
[MemPalace] Saved 15 messages. Total: 15
Turn 15 | memories=0015 | ctx=True <- auto-save fired
[MemPalace] Retrieved 3 memories.
Turn 20 | memories=0015 | ctx=True
Final memories in palace: 15
--- Cross-session recall ---
[0.94] Turn 4: Pydantic v2 for validation, SQLAlchemy 2 async ORM...
[0.91] Turn 1: I prefer async endpoints. PostgreSQL is my database...
[0.77] Turn 11: Use Redis for caching user sessions...
The output shows memory being built and used step by step. Early turns have no stored memories to draw on, so ctx stays False. As the dialogue progresses, retrieval starts surfacing helpful context. At turn 15, the auto-save writes 15 messages into long-term memory, and by turn 20 the agent is answering with retrieved context. The cross-session recall check confirms that significant details from earlier turns are accurately recalled.
| Aspect | MemPalace vs RAG Pipelines | MemPalace vs Vector Databases | MemPalace vs Agent Memory Frameworks |
|---|---|---|---|
| Core Function | RAG retrieves static documents such as PDFs and knowledge bases at query time. | Vector databases store embeddings for similarity search. | Agent memory frameworks store short-term chat memory or key-value data. |
| Memory Type | RAG does not store previous dialogue sessions or track user behavior. | Vector databases provide flat embedding storage without memory structure. | These frameworks usually maintain brief records or essential facts. |
| MemPalace Difference | MemPalace acts as a persistent memory store beyond a single prompt. | MemPalace adds organized spatial elements such as wings, rooms, and halls. | MemPalace can replace commercial memory tools while giving users full control. |
| Key Advantage | RAG can be layered on top of MemPalace as document memory. | Its hierarchy helps users narrow down search results more effectively. | It offers privacy, control, and a local-first alternative to paid services like Letta. |
MemPalace illustrates a broader shift: AI systems are moving toward persistent, structured memory, with agents that learn continuously rather than operating as stateless tools. The architecture is evolving beyond RAG toward systems that treat memory as a core element of reasoning and user interaction.
The organization of MemPalace sets a new standard for AI memory systems by prioritizing fidelity, structure, and long-term retention. Its hierarchical design and exact data preservation overcome limitations of traditional systems like RAG and summarization-based approaches.
Its strength comes from combining AAAK compression, a temporal knowledge graph, and MCP integration. The next step for context-aware agents is building memory systems that preserve full user experiences, not just outputs. MemPalace reflects this shift by enabling extended memory capabilities and marking a significant step toward true AI memory.
Q. What is MemPalace?
A. MemPalace is a local-first memory system that stores complete conversations as structured, persistent memory units for accurate recall and context.
Q. How is MemPalace different from RAG?
A. Unlike RAG, MemPalace stores full data verbatim and uses hierarchical structure for richer context, better reasoning, and improved traceability.
Q. Why does storing raw conversations matter?
A. It preserves all details by storing raw conversations, ensuring higher recall, complete context, and verifiable memory without losing subtle information.