BRIAN · BENCHMARKS
Tested against real memory
Standard AI benchmarks test factual recall - the simplest thing a memory system does. We built our own tests to measure what actually matters: does Claude know who you are, follow your rules, and stay grounded in reality?
BRIAN QUALITY SUITE V1 - LLM-AS-JUDGE
Results at a glance
Four tests, three conditions. Brian uses its real MCP endpoint. No simulations. Scored by an independent Claude evaluator against structured rubrics.
Context Assembly
Does Claude know who you are and what you were working on?
- Brian
- 5.0
- No memory
- 2.2
- Projects
- 2.0
Instruction Compliance
Does Claude follow your stored behavioural rules?
- Brian
- 3.7
- No memory
- 1.0
- Projects
- 2.3
Payload Richness
How structured and useful is the context Claude receives?
- Brian
- 4.2
- No memory
- 1.0
- Projects
- 4.0
Contradiction Handling
Does Claude track what changed and give you the current answer?
- Brian
- 2.0 / 2
- No memory
- 0.0 / 2
- Projects
- N/A
HEADROOM TEST · 2026-04-25
Reasoning headroom
Claude has a 200,000 token context window. Every token spent loading reference material is a token unavailable for thinking. We measured how much of the window each architecture leaves free, on the same queries, against the same model.
What we tested
Two queries
fr-1 (factual recall, single specific value lookup) and dr-1 (multi-step reasoning across multiple source documents). Both drawn from a 20-query benchmark set.
Model
claude-opus-4-7. Same model on both conditions, same query phrasing, same turn cadence.
Corpus
Approximately 50,000 tokens of representative heavy-user content. 16 Brian briefs and specs, truncated proportionally so every file stays represented.
Two conditions
Projects-simulated (full corpus prepended to the system prompt with cache_control on every API call). Brian (corpus pre-ingested via store_document, retrieved on demand via the production MCP endpoint).
Isolated mode
Each query ran in a fresh conversation. Session-mode behaviour, where conversation history accumulates across turns, was not tested in this run.
What we measured
Headroom remaining is the share of the 200,000 token window not consumed by what the model had to read on a given turn. Captured from the API usage block on the final assistant response, not on intermediate tool-loop turns. Output tokens are not counted because they are generated, not read.
headroom_remaining = 200,000 − (input_tokens + cache_read_input_tokens + cache_creation_input_tokens)What we found
| Query | Brian headroom | Projects headroom | Difference | Relative |
|---|---|---|---|---|
| Factual recall · fr-1 | 187,754 | 150,182 | +37,572 | +25.0% |
| Deep reasoning · dr-1 | 172,952 | 150,023 | +22,929 | +15.3% |
What this proves and what it doesn't
Isolated mode only
One query per fresh conversation. The 20-turn session-mode decay curve, where Brian's headroom advantage is expected to widen as conversation history accumulates, has not yet been run on the current architecture.
200k corpus not run
At a 200,000-token corpus, the Projects-simulated condition exceeds the context window before the model can answer. That is itself an architectural data point. It was not measured on this run.
Quality not rated
Both conditions produced substantive answers on dr-1, citing the underlying documents. Whether one is more correct, more complete, or more grounded than the other has not been judged here.
Skills cleared
Skills were cleared from the benchmark test user. Real Brian users carry skill-instruction overhead that counts against headroom but improves agent behaviour. That is a separate product dimension and is not part of this measurement.
Cache reduces cost, not headroom
On the Projects side, prompt caching cuts dollar cost on repeat turns. It does not restore headroom. Cache hit or miss, the tokens still occupy the context window.
QUALITY SUITE - CONTEXT ASSEMBLY
The grounding problem
Claude Projects sounds confident - but it fabricates details. In our Context Assembly test, Projects scored 1 out of 5 on grounding. It invented session details that weren't in its knowledge file. Brian scored 5 out of 5 - every claim traceable to a real stored memory.
| Dimension | Brian | No memory | Projects |
|---|---|---|---|
| Session Summary | 5 | 1 | 2 |
| Next Steps | 5 | 1 | 2 |
| Reflection | 5 | 2 | 4 |
| Grounding | 5 | 5 | 1 |
Claude Projects sounds like it remembers. Brian actually does.
QUALITY SUITE - PAYLOAD RICHNESS
What Claude actually receives
Brian delivers typed memory sections, behavioural guidance, and dynamic content from a live data store. Claude Projects delivers a static document.
| Dimension | Brian | No memory | Projects |
|---|---|---|---|
| Structure | 4 | 1 | 5 |
| Relevance | 4 | 1 | 4 |
| Behavioural Guidance | 4 | 1 | 2 |
| Temporal Context | 4 | 1 | 5 |
| Relationships | 5 | 1 | 4 |
| Grounding | 5 | 1 | 3 |
QUALITY SUITE - CONTRADICTION HANDLING
When information changes
Brian tracks what changed and why. It gives you the current answer and the history. Claude without memory can't do this at all.
Pricing change ($25 to $35)
Brian
Current answer + history
No memory
No knowledge
Team lead replaced
Brian
Current answer + history
No memory
No knowledge
Strategy pivot (B2C to B2B)
Brian
Current answer + history
No memory
No knowledge
LONGMEMEVAL (ICLR 2025)
Long-term memory
Adapted from the LongMemEval benchmark - tests whether Brian can recall facts across sessions, reason over time, handle changed information, and correctly abstain when it doesn't know. Brian scores 94% vs 24% for Claude without memory.
| Category | Brian | No memory | What it tests |
|---|---|---|---|
| Information Extraction | 5 / 5 | 0 / 5 | Recall specific facts stored across sessions |
| Multi-Session Reasoning | 2 / 3 | 0 / 3 | Synthesise across multiple session histories |
| Knowledge Updates | 2 / 2 | 0 / 2 | Handle superseded and changed facts correctly |
| Temporal Reasoning | 4 / 4 | 1 / 4 | Answer 'when' questions about stored events |
| Abstention | 3 / 3 | 3 / 3 | Correctly refuse when information isn't stored |
PERSONAL INFO LEAK + CONFAIDE
Security & isolation
Brian stores sensitive data. Before enterprise deployment, we tested every isolation boundary - space, session, cross-user, and confidentiality reasoning. 15 tests, 15 passed.
- Space isolation5 / 5
PII in one space never leaks to another
- Session isolation3 / 3
Unstored conversation data stays ephemeral
- Cross-user isolation2 / 2
RLS blocks all cross-user memory access
- Confidentiality reasoning5 / 5
Social engineering attempts blocked
BFCL + METATOOL
Tool routing accuracy
Brian has 19 MCP tools. Claude needs to call the right one with the right parameters - or decide not to call Brian at all. We tested both decisions across 37 scenarios.
| Test | Score | What it measures |
|---|---|---|
| Single tool selection | 13 / 14 | Correct tool chosen for direct requests |
| Sequential multi-tool | 1 / 2 | Multi-step tool chains in correct order |
| No-tool detection | 5 / 5 | General questions answered without invoking Brian |
| Invocation decision | 16 / 16 | Perfect precision - zero over or under-invocation |
METHODOLOGY
How we tested
We built the Brian Quality Benchmark Suite because standard AI evaluation frameworks (RAGAS, Needle in a Haystack) measure factual recall - the simplest thing a memory system does. Brian's real value is structured context delivery, behavioural guidance, and session continuity.
Real MCP endpoint
All tests hit the production MCP endpoint. No simulations, no mocked retrieval. What you test is what ships.
LLM-as-judge
An independent Claude evaluator scores responses against structured rubrics with explicit grounding criteria.
Three conditions
Quality suite compares Brian (real memory), Cold (no context), and Claude Projects (static knowledge file). Same model, same prompts.
Grounding enforced
Responses are penalised for fabricated details. A confident hallucination scores lower than an honest gap.
PII boundary testing
Security tests seed real PII into isolated spaces, then attempt to access it from wrong contexts - including prompt injection and social engineering.
Tool routing evaluation
Every Brian tool is tested with natural language prompts. Scenarios include single calls, multi-step chains, and no-tool detection.
