| Flat session — what the context window holds | Bella hypergraph — what survives |
|---|---|
| ```text user: test flaked again, 3rd time assistant: which one — auth or sync? user: sync_external_api, like last time assistant: ok, looking at retry.py assistant: timeout hardcoded to 2s. bump to 5s? user: bandaid, don't paper over it user: we need to know WHY it times out assistant: fair. tracing retry logic assistant: sync retries with 200ms backoff assistant: huh, only one retry — that's tight user: CI load? Mondays spike hard assistant: checking metrics dashboard... assistant: yes — CI hits the rate-limiter assistant: first retry hits same limit, fails assistant: 2s window exceeded before retry user: real fix: retry jitter, not timeout assistant: agreed. patching: backoff ± jitter user: good assistant: add latency metrics too? user: not now, focus on the fix ``` | ```text [retry-jitter is the fix] m=0.74 v=2 (user + assistant) ⇒ cause chain CI load → rate-limiter → first retry → 2s window exceeded before retry ⊥ rejected "bump timeout from 2s to 5s" (user: "bandaid, not a fix") __self__ observation "I reach for timeout bumps when retry semantics are the real problem" ``` |
| *~220 tokens · 20 turns · ordered by time · dies at `/clear`* | *~50 tokens · 4 beliefs · ordered by evidence mass · persists* |
*Shape encodes class (hexagon/diamond/ellipse/cube), color encodes
nature (amber/blue/green), node size encodes mass, and typed edges
(`→` support, `⇒` cause, `…` elaborate, `⊥` dispute) are the
structure that flat context can't preserve.*
---
## Empirical results
Latest measurement: [benchmarks/v0.0.4rc1.md](https://github.com/immartian/bellamem/blob/v0.2.0-ref/benchmarks/v0.0.4rc1.md)
(2026-04-10, budget = 1200 tokens, LLM judge enabled, 13-item
hand-labeled corpus, 1834-belief forest).
```
metric flat_tail compact rag_topk expand before_edit
----------------------------------------------------------------------------------
exact hit rate 15 % 0 % 15 % 69 % 46 %
embed hit rate 23 % 31 % 31 % 85 % 77 %
llm judge rate 0 % 8 % 31 % 92 % 69 %
avg tokens used 1200 602 1161 1143 964
```
`flat_tail (0%) < compact (8%) < rag_topk (31%) < before_edit (69%) < expand (92%)`.
**Headline story — compare to [v0.0.2](https://github.com/immartian/bellamem/blob/v0.2.0-ref/benchmarks/v0.0.2.md):** as
the forest grew from the v0.0.2 dogfood snapshot to 1834 beliefs,
`rag_topk` collapsed from 85% → 31% LLM judge (cosine top-k pulls up
more plausible-looking-but-wrong neighbors in a larger forest), while
`expand` held at 92%. The gap from `expand` to the next-best contender
widened from **15pp to 61pp**. Structured mass-weighted retrieval
scales with forest size; cosine top-k doesn't. The retrieval code
path (the v0.2 walker) is unchanged between v0.0.2
and v0.0.4rc1 — every delta is a property of forest growth, not
algorithm changes.
See [benchmarks/README.md](https://github.com/immartian/bellamem/blob/v0.2.0-ref/benchmarks/README.md) for the versioning
convention and when to re-run.
### Compression at scale — 15 real Claude Code projects
The bench above answers *"is `expand` accurate?"* The next question
is *"how many tokens does Bella actually save?"* For that, the
[`docs/scenarios.md`](https://github.com/immartian/bellamem/blob/v0.2.0-ref/docs/scenarios.md)
harness measures real Claude Code session transcripts sampled from
**15 different projects** on a developer's machine — news monorepos,
IRB documents, refactoring sessions, agent prototypes, marketing
work — at a fixed `expand` budget of 1500 tokens.
| | tokens |
|---|---:|
| raw conversation range | 274 → 132,399 |
| compression ratio range | **3.6× → 90×** |
| **median ratio** | **17.6×** |
Visually: