TreeMemory is an experimental external memory system for AI assistants. The core idea is simple:
Facts should not be stored as one flat pile. They should be organized into semantic branches, so updates and retrieval stay local.
This project explores whether a hierarchical memory can reduce context contamination, wrong-branch retrieval, and stale fact conflicts compared with flat lexical memory.
Long-term AI memory has a practical problem: similar words can refer to unrelated concepts.
Examples:
- Michelin tires vs Michelin stars
- Python code vs python snake
- Apple company vs apple fruit
- Jaguar car vs jaguar animal
- Mercury planet vs mercury element
A flat memory search can retrieve the right answer, but often includes many unrelated facts in the context. That is dangerous for an AI assistant because the language model may mix branches and answer from the wrong memory.
TreeMemory tests a different approach:
root
+-- artifacts
| +-- vehicles
| | +-- car_tires
| | +-- car_engine
| +-- computing
| +-- python_code
+-- living
| +-- reptiles
| | +-- python_snake
| +-- mammals
| +-- jaguar_animal
+-- culture
+-- food
+-- restaurants
Each fact is stored in a semantic path. Updates replace only the local fact in the same path and slot.
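As a conceptual illustration only (the engine stores richer fact records, and the slot names here are made up for the example), path + slot addressing can be pictured as a map keyed by both:

memory = {}

def put(path, slot, answer):
    # A fact is addressed by its semantic path plus a named slot.
    memory[(path, slot)] = answer

put("artifacts/vehicles/car_tires", "car_tires.maker", "Michelin")
put("culture/food/restaurants", "michelin_stars.meaning", "restaurant awards")

# A local update overwrites exactly one (path, slot) pair...
put("artifacts/vehicles/car_tires", "car_tires.maker", "Bridgestone")
# ...and the nearby Michelin-stars fact is untouched.
assert memory[("culture/food/restaurants", "michelin_stars.meaning")] == "restaurant awards"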
External hierarchical memory should improve AI memory quality by:
- Reducing retrieval from unrelated branches.
- Localizing updates to the correct branch.
- Preserving nearby but distinct facts.
- Producing cleaner context for downstream LLM answers.
- Making memory behavior easier to inspect and debug.
This project does not claim that LLM weights naturally store knowledge as a clean tree. Earlier experiments in this repository tested gradient masks in GPT-2. The stronger practical direction is external TreeMemory: a designed memory layer that an AI system can retrieve from.
The main implementation is:
tree_memory_engine.py
It implements a hybrid tree retrieval system:
- semantic paths such as artifacts/vehicles/car_tires
- local fact updates by path + slot
- beam routing to multiple likely branches
- global fallback retrieval
- final reranking
- compact context output
- retrieval explanations
- JSON save/load
The important design choice is that routing is not hard top-1 routing. A naive hard tree failed: whenever it chose the wrong branch, it lost recall entirely. TreeMemory therefore uses beam routing plus a global fallback.
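A minimal sketch of the routing idea, using toy term-overlap scoring rather than the engine's actual scorer:

def route_with_beam(query_terms, branch_aliases, beam_width=3):
    # Score every branch by how many query terms its alias text shares.
    scores = {
        path: len(set(query_terms) & set(alias.split()))
        for path, alias in branch_aliases.items()
    }
    # Keep the top-k candidate branches instead of committing to one.
    beam = sorted(scores, key=scores.get, reverse=True)[:beam_width]
    # Global fallback: if nothing matched, search all branches.
    if all(scores[path] == 0 for path in beam):
        return list(branch_aliases)
    return beam

branches = {
    "artifacts/vehicles/car_tires": "car tires rubber michelin bridgestone",
    "culture/food/restaurants": "michelin stars restaurant awards chef",
}
print(route_with_beam(["michelin", "stars"], branches, beam_width=2))
# Both branches mention "michelin", so both stay in the beam;
# the final reranking step decides between them.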
Run the demo:
python examples/demo.py
Run in Google Colab:
notebooks/tree_memory_colab_demo.ipynb
Direct Colab link:
https://colab.research.google.com/github/g1g4b1t/tree-memory/blob/main/notebooks/tree_memory_colab_demo.ipynb
Colab is optional. TreeMemory runs on CPU, but the notebook is useful for a zero-install demo. See:
docs/colab.md
Run the optional real-LLM benchmark in Colab:
https://colab.research.google.com/github/g1g4b1t/tree-memory/blob/main/notebooks/tree_memory_llm_benchmark_colab.ipynb
This benchmark uses a language model to test whether cleaner TreeMemory context improves generated answers. Use a T4 GPU if available. See:
docs/llm_benchmark.md
Latest real-LLM Colab result with google/flan-t5-small:
flat_replace accuracy 0.70, context contamination 0.875
hybrid_tree accuracy 0.70, context contamination 0.492
hard_tree accuracy 0.95, context contamination 0.000
gated_hybrid_tree accuracy 0.95, context contamination 0.000
no_context accuracy 0.00
See:
docs/llm_results.md
Run the optional TreeMemory vs LoRA benchmark in Colab:
https://colab.research.google.com/github/g1g4b1t/tree-memory/blob/main/notebooks/tree_memory_lora_benchmark_colab.ipynb
This compares external memory updates against LoRA adapter training for factual memory. See:
docs/lora_benchmark.md
Run the interactive CLI:
python examples/interactive_cli.py
Example CLI commands:
ask Who produces premium car tires now?
update
ask What are Michelin stars, not tires?
explain What are Michelin stars, not tires?
save my_memory.json
quit
Expected behavior:
Before update:
Who produces premium car tires? -> Michelin
What are Michelin stars? -> restaurant awards
After local update:
Who produces premium car tires now? -> Bridgestone
What are Michelin stars, not tires? -> restaurant awards
What power source do Tesla car engines use after the tire update? -> electric motors
This shows a local update:
artifacts/vehicles/car_tires:
Michelin -> Bridgestone
without corrupting:
culture/food/restaurants:
Michelin stars -> restaurant awards
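The same flow is available from the Python API: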
from tree_memory_engine import TreeMemory

memory = TreeMemory()

# Give the router lexical hints for this branch.
memory.add_alias(
    "artifacts/vehicles/car_tires",
    "car tires rubber road grip michelin bridgestone vehicle maker",
)

# Store a fact under a semantic path and a named slot.
memory.add_fact(
    path="artifacts/vehicles/car_tires",
    text="Michelin produces premium car tires.",
    slot="car_tires.maker",
    answer="Michelin",
    tags="vehicle tires maker",
)

print(memory.answer("Who produces premium car tires?")["answer"])

# The update replaces only the fact at this path + slot; other branches stay untouched.
memory.update_fact(
    path="artifacts/vehicles/car_tires",
    slot="car_tires.maker",
    text="Bridgestone produces premium car tires in the updated memory.",
    answer="Bridgestone",
    tags="vehicle tires maker update",
)

print(memory.answer("Who produces premium car tires now?")["answer"])

TreeMemory can explain its retrieval:
explanation = memory.explain_retrieval("What are Michelin stars, not tires?")
The explanation includes:
- routed branches
- matched terms
- retrieved facts
- scores
- source fact IDs
- final answer
This is useful for debugging memory behavior and for showing why the assistant answered a certain way.
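A minimal usage sketch, assuming the explanation comes back as a plain dict (the field names vary; print the keys to see the exact schema):

explanation = memory.explain_retrieval("What are Michelin stars, not tires?")
for key, value in explanation.items():
    print(key, "->", value)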
The current comparison benchmark is:
benchmarks/flat_vs_tree_5tasks.py
It compares:
- FlatMemory: one global lexical search over all facts
- HybridTreeMemory: beam-routed tree retrieval with local updates and fallback
The benchmark includes five task types:
- Ambiguous entities: python code vs python snake, apple company vs apple fruit
- Noisy queries: questions that explicitly reject a wrong meaning
- Local updates: replacing facts inside one branch
- Conflict isolation: checking that nearby branches are not corrupted
- Context efficiency: measuring how many retrieved facts are needed to reach the answer
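For example, context efficiency can be pictured as the rank at which the correct answer first appears in the retrieved facts. This is an illustrative reading; the benchmark defines its own exact metric:

def facts_needed(ranked_facts, gold_answer):
    # Rank (1-based) of the first retrieved fact containing the answer.
    for rank, fact in enumerate(ranked_facts, start=1):
        if gold_answer in fact:
            return rank
    return None  # the answer was never retrieved

print(facts_needed(
    ["A python is a large snake.", "Michelin produces premium car tires."],
    "Michelin",
))  # -> 2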
Run:
python benchmarks/flat_vs_tree_5tasks.py
Current local result:
Final Best Tree vs Flat verdict: PASS
The larger scaled benchmark is:
benchmarks/scaled_memory_benchmark.py
It compares five retrieval strategies:
- flat_append: flat memory where updates are appended and stale facts remain retrievable
- flat_replace: stronger flat memory with exact path+slot replacement
- hard_tree: strict top-1 branch routing
- hybrid_tree: beam tree routing with compact fallback
- gated_hybrid_tree: hard routing when confidence is high, fallback only when confidence is low (sketched below)
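An illustrative sketch of the gating idea; the margin, scoring, and beam width here are placeholders, not the benchmark's actual values:

def gated_route(scores, margin=2.0, beam_width=3):
    # scores maps branch path -> lexical match score (higher is better).
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) < 2 or scores[ranked[0]] >= margin * max(scores[ranked[1]], 1):
        # High confidence: commit to a single branch for zero contamination.
        return ranked[:1]
    # Low confidence: keep a wider beam and rely on reranking.
    return ranked[:beam_width]

print(gated_route({"a/b": 5, "c/d": 1}))  # -> ['a/b']          (confident)
print(gated_route({"a/b": 3, "c/d": 2}))  # -> ['a/b', 'c/d']   (ambiguous)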
Current scaled result:
Final Scaled Memory Benchmark verdict: PASS
Latest scaled summary:
Concepts: 37
Base facts: 111
Updates: 17
Queries: 256
Overall top-1 accuracy:
flat_replace 0.934
hard_tree 1.000
hybrid_tree 0.957
gated_hybrid_tree 1.000
Context contamination:
flat_replace 0.767
hard_tree 0.000
hybrid_tree 0.038
gated_hybrid_tree 0.000
AI context risk:
flat_replace 0.089
hard_tree 0.000
hybrid_tree 0.028
gated_hybrid_tree 0.000
This result supports the main claim: hierarchical memory can keep retrieval context much cleaner than flat memory while preserving strong answer retrieval.
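Context contamination here, as we read it, is the share of retrieved facts that come from outside the gold branch; the benchmark script defines the exact formula:

def contamination(retrieved_paths, gold_path):
    # Fraction of retrieved facts whose semantic path is not the gold path.
    if not retrieved_paths:
        return 0.0
    wrong = sum(1 for path in retrieved_paths if path != gold_path)
    return wrong / len(retrieved_paths)

print(contamination(
    ["artifacts/vehicles/car_tires", "culture/food/restaurants"],
    "artifacts/vehicles/car_tires",
))  # -> 0.5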
Summary from the latest flat-vs-tree benchmark run:
Overall top-1 accuracy:
flat 0.852
hybrid_tree 0.852
Path precision:
flat 0.272
hybrid_tree 0.840
Wrong branch hits:
flat 3.259
hybrid_tree 0.370
Conflict hits:
flat 0.111
hybrid_tree 0.000
Interpretation:
TreeMemory did not improve overall top-1 accuracy in this benchmark, but it produced much cleaner retrieval. It reduced wrong-branch facts and eliminated update conflicts while keeping answer accuracy competitive.
For a fuller write-up, see:
docs/results.md
For an AI assistant, retrieval quality is not only about whether the answer appears somewhere in the context. It is also about what else appears with it.
Flat memory can return the correct answer alongside many misleading facts. TreeMemory aims to return a smaller, cleaner, more local context.
This matters for:
- long-term assistant memory
- personal knowledge bases
- RAG systems
- local fact updates
- auditable memory
- reducing hallucination from contaminated context
This repository also contains earlier exploratory experiments under:
experiments/gpt2_gradient_masks/
Those files tested whether GPT-2 gradient masks behave like semantic memory branches. The results were mixed:
- gradient footprints show some semantic structure
- naive parameter freezing does not scale reliably
- conflict-aware gradient masking works better
The current direction is external TreeMemory, because it is more practical and controllable than trying to force long-term facts into model weights.
Older external-memory iterations are kept under:
experiments/external_memory_iterations/
Benchmark and demo outputs are stored under:
artifacts/results/
artifacts/demo/
tree_memory_engine.py Core TreeMemory implementation
examples/demo.py Small runnable demo
examples/interactive_cli.py Tiny manual testing CLI
benchmarks/flat_vs_tree_5tasks.py FlatMemory vs HybridTree benchmark
benchmarks/_flat_vs_tree_5tasks_impl.py Benchmark implementation
benchmarks/scaled_memory_benchmark.py Larger flat vs hard-tree vs hybrid-tree benchmark
benchmarks/llm_context_benchmark.py Optional real-LLM context benchmark
benchmarks/lora_vs_tree_benchmark.py Optional TreeMemory vs LoRA benchmark
docs/hypothesis.md Research hypothesis and predictions
docs/architecture.md System architecture
docs/results.md Benchmark summary and interpretation
docs/github_setup.md Publishing notes for GitHub
docs/colab.md Google Colab quickstart
docs/llm_benchmark.md Optional real-LLM benchmark notes
docs/llm_results.md First real-LLM benchmark result
docs/lora_benchmark.md Optional TreeMemory vs LoRA notes
experiments/ Archived exploratory experiments
artifacts/ Local demo and benchmark outputs
scripts/validate.py Full local validation suite
notebooks/tree_memory_colab_demo.ipynb Zero-install Colab demo
notebooks/tree_memory_llm_benchmark_colab.ipynb Colab notebook for the real-LLM benchmark
notebooks/tree_memory_lora_benchmark_colab.ipynb Colab notebook for TreeMemory vs LoRA
tests/test_tree_memory_engine.py Basic regression tests
.github/workflows/ci.yml GitHub Actions validation
LICENSE MIT License
README.md Project overview
requirements.txt Minimal benchmark dependency
requirements-llm.txt Optional LLM benchmark dependencies
requirements-lora.txt Optional LoRA benchmark dependencies
Run tests:
python -m unittest discover -s tests
Run the full local validation suite:
python scripts/validate.py
This checks syntax, unit tests, demo behavior, CLI smoke behavior, and the 5-task benchmark.
GitHub Actions runs the same validation suite on Ubuntu and Windows. See:
.github/workflows/ci.yml
docs/github_setup.md
This is an early prototype.
Known limitations:
- routing is still lexical, not embedding-based
- path aliases are manually provided
- no LLM is used yet for fact extraction or answer generation
- benchmark facts are synthetic
- no persistence backend beyond JSON
- no conflict resolution UI
- no automatic tree growth
These limitations are intentional at this stage. The goal is to isolate and test the memory structure before adding more moving parts.
Near-term:
- Add automatic fact extraction from text.
- Add automatic path routing using an LLM or embeddings.
- Add larger benchmark datasets.
- Add benchmark scripts for ablations.
Medium-term:
- SQLite storage.
- Version history per branch.
- Conflict reports.
- Memory deletion and forgetting.
- Branch summaries.
- LLM answer generation from retrieved context.
Research direction:
- Compare flat RAG vs TreeMemory RAG.
- Measure context contamination.
- Test on larger ambiguous datasets.
- Add human-readable retrieval traces.
Status: experimental prototype.
TreeMemory is not a solved memory system. It is a small research project exploring whether hierarchical external memory can make AI retrieval cleaner, more local, and easier to update.