BRIAN · BENCHMARKS

Tested against real memory

Standard AI benchmarks test factual recall - the simplest thing a memory system does. We built our own tests to measure what actually matters: does Claude know who you are, follow your rules, and stay grounded in reality?

BRIAN QUALITY SUITE V1 - LLM-AS-JUDGE

Results at a glance

Four tests, three conditions. Brian uses its real MCP endpoint. No simulations. Scored by an independent Claude evaluator against structured rubrics.

Context Assembly
Does Claude know who you are and what you were working on?
Brian
5.0
No memory
2.2
Projects
2.0
Instruction Compliance
Does Claude follow your stored behavioural rules?
Brian
3.7
No memory
1.0
Projects
2.3
Payload Richness
How structured and useful is the context Claude receives?
Brian
4.2
No memory
1.0
Projects
4.0
Contradiction Handling
Does Claude track what changed and give you the current answer?
Brian
2.0 / 2
No memory
0.0 / 2
Projects
N/A

HEADROOM TEST · 2026-04-25

Reasoning headroom

Claude has a 200,000 token context window. Every token spent loading reference material is a token unavailable for thinking. We measured how much of the window each architecture leaves free, on the same queries, against the same model.

Claude with Projects75% free to think

25% used reloading the corpus on every turn

Claude with Brian94% free to think

6% used on retrieval, only what the question called for

Brian leaves Claude an extra 25% of the context window free for reasoning on factual queries, and an extra 15% on multi-step reasoning queries. Numbers below.

What we tested

Two queries
fr-1 (factual recall, single specific value lookup) and dr-1 (multi-step reasoning across multiple source documents). Both drawn from a 20-query benchmark set.
Model
claude-opus-4-7. Same model on both conditions, same query phrasing, same turn cadence.
Corpus
Approximately 50,000 tokens of representative heavy-user content. 16 Brian briefs and specs, truncated proportionally so every file stays represented.
Two conditions
Projects-simulated (full corpus prepended to the system prompt with cache_control on every API call). Brian (corpus pre-ingested via store_document, retrieved on demand via the production MCP endpoint).
Isolated mode
Each query ran in a fresh conversation. Session-mode behaviour, where conversation history accumulates across turns, was not tested in this run.

What we measured

Headroom remaining is the share of the 200,000 token window not consumed by what the model had to read on a given turn. Captured from the API usage block on the final assistant response, not on intermediate tool-loop turns. Output tokens are not counted because they are generated, not read.

headroom_remaining = 200,000 − (input_tokens + cache_read_input_tokens + cache_creation_input_tokens)

What we found

Query	Brian headroom	Projects headroom	Difference	Relative
Factual recall · fr-1	187,754	150,182	+37,572	+25.0%
Deep reasoning · dr-1	172,952	150,023	+22,929	+15.3%

Retrieval behaviour: on fr-1, Brian fired one retrieval round returning 4,889 characters of corpus content. On dr-1, Brian fired five retrieval rounds returning 36,156 characters cumulatively. Projects fired zero retrievals on either query because the full corpus is pre-loaded into the system prompt on every turn.

What this proves and what it doesn't

Isolated mode only
One query per fresh conversation. The 20-turn session-mode decay curve, where Brian's headroom advantage is expected to widen as conversation history accumulates, has not yet been run on the current architecture.
200k corpus not run
At a 200,000-token corpus, the Projects-simulated condition exceeds the context window before the model can answer. That is itself an architectural data point. It was not measured on this run.
Quality not rated
Both conditions produced substantive answers on dr-1, citing the underlying documents. Whether one is more correct, more complete, or more grounded than the other has not been judged here.
Skills cleared
Skills were cleared from the benchmark test user. Real Brian users carry skill-instruction overhead that counts against headroom but improves agent behaviour. That is a separate product dimension and is not part of this measurement.
Cache reduces cost, not headroom
On the Projects side, prompt caching cuts dollar cost on repeat turns. It does not restore headroom. Cache hit or miss, the tokens still occupy the context window.

QUALITY SUITE - CONTEXT ASSEMBLY

The grounding problem

Claude Projects sounds confident - but it fabricates details. In our Context Assembly test, Projects scored 1 out of 5 on grounding. It invented session details that weren't in its knowledge file. Brian scored 5 out of 5 - every claim traceable to a real stored memory.

Dimension	Brian	No memory	Projects
Session Summary	5	1	2
Next Steps	5	1	2
Reflection	5	2	4
Grounding	5	5	1

Claude Projects sounds like it remembers. Brian actually does.

QUALITY SUITE - PAYLOAD RICHNESS

What Claude actually receives

Brian delivers typed memory sections, behavioural guidance, and dynamic content from a live data store. Claude Projects delivers a static document.

Dimension	Brian	No memory	Projects
Structure	4	1	5
Relevance	4	1	4
Behavioural Guidance	4	1	2
Temporal Context	4	1	5
Relationships	5	1	4
Grounding	5	1	3

QUALITY SUITE - CONTRADICTION HANDLING

When information changes

Brian tracks what changed and why. It gives you the current answer and the history. Claude without memory can't do this at all.

Pricing change ($25 to $35)
Brian
Current answer + history
No memory
No knowledge
Team lead replaced
Brian
Current answer + history
No memory
No knowledge
Strategy pivot (B2C to B2B)
Brian
Current answer + history
No memory
No knowledge

LONGMEMEVAL (ICLR 2025)

Long-term memory

Adapted from the LongMemEval benchmark - tests whether Brian can recall facts across sessions, reason over time, handle changed information, and correctly abstain when it doesn't know. Brian scores 94% vs 24% for Claude without memory.

Category	Brian	No memory	What it tests
Information Extraction	5 / 5	0 / 5	Recall specific facts stored across sessions
Multi-Session Reasoning	2 / 3	0 / 3	Synthesise across multiple session histories
Knowledge Updates	2 / 2	0 / 2	Handle superseded and changed facts correctly
Temporal Reasoning	4 / 4	1 / 4	Answer 'when' questions about stored events
Abstention	3 / 3	3 / 3	Correctly refuse when information isn't stored

16 / 17Brian

4 / 17No memory

+12Brian advantage

PERSONAL INFO LEAK + CONFAIDE

Security & isolation

Brian stores sensitive data. Before enterprise deployment, we tested every isolation boundary - space, session, cross-user, and confidentiality reasoning. 15 tests, 15 passed.

Space isolation5 / 5
PII in one space never leaks to another
Session isolation3 / 3
Unstored conversation data stays ephemeral
Cross-user isolation2 / 2
RLS blocks all cross-user memory access
Confidentiality reasoning5 / 5
Social engineering attempts blocked

BFCL + METATOOL

Tool routing accuracy

Brian has 19 MCP tools. Claude needs to call the right one with the right parameters - or decide not to call Brian at all. We tested both decisions across 37 scenarios.

Test	Score	What it measures
Single tool selection	13 / 14	Correct tool chosen for direct requests
Sequential multi-tool	1 / 2	Multi-step tool chains in correct order
No-tool detection	5 / 5	General questions answered without invoking Brian
Invocation decision	16 / 16	Perfect precision - zero over or under-invocation

0Over-invocation · false positives

0Under-invocation · false negatives

METHODOLOGY

How we tested

We built the Brian Quality Benchmark Suite because standard AI evaluation frameworks (RAGAS, Needle in a Haystack) measure factual recall - the simplest thing a memory system does. Brian's real value is structured context delivery, behavioural guidance, and session continuity.

Real MCP endpoint
All tests hit the production MCP endpoint. No simulations, no mocked retrieval. What you test is what ships.
LLM-as-judge
An independent Claude evaluator scores responses against structured rubrics with explicit grounding criteria.
Three conditions
Quality suite compares Brian (real memory), Cold (no context), and Claude Projects (static knowledge file). Same model, same prompts.
Grounding enforced
Responses are penalised for fabricated details. A confident hallucination scores lower than an honest gap.
PII boundary testing
Security tests seed real PII into isolated spaces, then attempt to access it from wrong contexts - including prompt injection and social engineering.
Tool routing evaluation
Every Brian tool is tested with natural language prompts. Scenarios include single calls, multi-step chains, and no-tool detection.