Everything here is measured and reproducible — no hand-picked or hardcoded numbers. Every figure is produced by code in the repo and gated in CI, so it can’t silently regress. Where a number is an estimate or a real-repo extrapolation, it says so.
Reproduce locally: `python -m tests.benchmark.run` (token reduction + learning + synapse A/B), `python -m evals.faithfulness.runner --run` (answer quality), `python -m evals.onboarding.runner --run` (onboarding lift), `python -m evals.parity.run` (backend parity).
On code questions, NeuralMind sends the agent the few entities that matter instead of whole files — so the same answer costs 40–70× fewer tokens on real repositories. That real-repo range is the product’s positioning; the number we measure in CI is deliberately conservative, on a tiny 500-line fixture where there’s little to prune, and it still clears a wide margin.
| What | Measured (CI, 500-line fixture) | On real repos |
|---|---|---|
| Token reduction on code questions | 6.2× | 40–70× (more files to prune ⇒ larger ratio) |
| Regression floor (CI fails below) | 4.0× | — |
The fixture number is the floor of a floor: small repo, conservative gate. The mechanism is what scales — the bigger the codebase, the more whole-file context you avoid.
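To make the ratio and its gate concrete, here is a minimal sketch of how a reduction factor could be computed and checked against the 4.0× floor. The function names and the whitespace token counter are illustrative assumptions, not NeuralMind's code; the real measurement is whatever `tests.benchmark.run` does.

```python
# Illustrative sketch only: names and the token counter are assumptions,
# not the benchmark's actual implementation.

REGRESSION_FLOOR = 4.0  # CI fails below this ratio (value from the table above)


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def reduction_ratio(whole_file_context: str, selected_context: str) -> float:
    """Tokens in the whole-file baseline divided by tokens actually sent."""
    selected = count_tokens(selected_context)
    if selected == 0:
        raise ValueError("selected context is empty")
    return count_tokens(whole_file_context) / selected


def passes_gate(ratio: float, floor: float = REGRESSION_FLOOR) -> bool:
    """True when the measured ratio is at or above the regression floor."""
    return ratio >= floor


# Made-up inputs: a 620-token whole-file baseline vs. 100 selected tokens.
baseline = "tok " * 620
selected = "tok " * 100
ratio = reduction_ratio(baseline, selected)
print(f"{ratio:.1f}x, gate passed: {passes_gate(ratio)}")
```

The same arithmetic explains why the ratio grows with repository size: the numerator (whole-file context) grows with the codebase, while the denominator (the few entities that matter) stays roughly constant.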
Answer quality holds up under the smaller context, and that is measured. The faithfulness eval compares NeuralMind’s selected context against naive truncation at the same token budget, which is the honest comparison, not “small context vs the whole repo.”
| Metric (built-in backend, gold set) | NeuralMind | Matched-budget naive | Delta |
|---|---|---|---|
| Expected-fact recall | 0.717 | 0.574 | +0.143 |
| Grounding (right modules cited) | 1.000 | — | — |
A positive delta means smart selection beats dumb truncation at equal cost. Gated in CI at delta ≥ 0.
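The matched-budget comparison can be sketched as follows. This is a simplified illustration under stated assumptions: facts are matched by plain substring search, tokens are whitespace-separated words, and all names and sample data are hypothetical rather than taken from `evals.faithfulness.runner`.

```python
# Illustrative sketch: helper names, matching rule, and data are hypothetical.

def truncate_to_budget(text: str, budget: int) -> str:
    """Naive baseline: keep only the first `budget` whitespace tokens."""
    return " ".join(text.split()[:budget])


def fact_recall(context: str, expected_facts: list[str]) -> float:
    """Fraction of expected facts that appear verbatim in the context."""
    if not expected_facts:
        raise ValueError("need at least one expected fact")
    hits = sum(1 for fact in expected_facts if fact in context)
    return hits / len(expected_facts)


def matched_budget_delta(selected: str, full_source: str,
                         expected_facts: list[str]) -> float:
    """Recall of the selected context minus recall of naive truncation,
    where the truncation gets exactly as many tokens as the selection."""
    budget = len(selected.split())
    naive = truncate_to_budget(full_source, budget)
    return fact_recall(selected, expected_facts) - fact_recall(naive, expected_facts)


full_source = ("def load_config(): pass def parse_args(): pass "
               "def retry_with_backoff(): pass")
selected = "def load_config(): def retry_with_backoff():"
facts = ["load_config", "retry_with_backoff"]

delta = matched_budget_delta(selected, full_source, facts)
print(f"delta = {delta:+.3f}, gate (delta >= 0) passed: {delta >= 0}")
```

Giving both sides the same budget is the point of the design: any positive delta is attributable to which tokens were chosen, not to how many.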
NeuralMind’s moat is usage memory: a Hebbian synapse layer that learns what your team edits together and surfaces it on future queries. Both effects are measured by isolated A/Bs:
| Effect | Off | On | Lift |
|---|---|---|---|
| Synapse recall — top-k retrieval hit rate (same warm graph) | 72% | 83% | +12 pts |
| Onboarding lift — top-k module hit-rate from a committed team baseline | — | — | +6.5 pts |
Both are budget-neutral by design: recalled nodes displace the weakest hits rather than adding tokens. The onboarding lift is the answer to “does an agent that inherits a committed team memory retrieve better on its first queries than a cold agent?” — gated in CI at lift ≥ 0.
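Budget-neutral displacement can be pictured with a short sketch: the result list never grows past `k`, and each recalled node takes the slot of the weakest hit that was not itself recalled. The function name, the score format, and the file names are assumptions made for illustration; the actual synapse layer may rank and merge differently.

```python
# Illustrative sketch: one possible way to keep recall budget-neutral.
# Names and data are hypothetical, not NeuralMind's implementation.

def merge_budget_neutral(hits: list[tuple[str, float]],
                         recalled: list[str],
                         k: int) -> list[str]:
    """Return at most k node ids.

    hits:     (node_id, similarity) pairs from ordinary retrieval.
    recalled: node ids suggested by usage memory.
    """
    # Ordinary top-k by similarity, strongest first.
    ranked = [node for node, _ in sorted(hits, key=lambda h: h[1], reverse=True)][:k]
    protected = set(recalled)
    result = list(ranked)

    for node in dict.fromkeys(recalled):  # de-duplicate, keep order
        if node in result:
            continue  # already retrieved, nothing to displace
        if len(result) < k:
            result.append(node)  # spare room, no displacement needed
            continue
        # Replace the weakest hit that was not itself recalled.
        for i in range(len(result) - 1, -1, -1):
            if result[i] not in protected:
                result[i] = node
                break
    return result


hits = [("auth.py", 0.9), ("db.py", 0.8), ("cache.py", 0.4), ("utils.py", 0.3)]
merged = merge_budget_neutral(hits, recalled=["billing.py", "db.py"], k=4)
print(merged)
```

In this example `billing.py` replaces `utils.py`, the lowest-scoring hit, while `db.py` is left alone because it was already retrieved. The list stays at four entries, so any lift in hit rate comes at no extra token cost.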
The opt-in turbovec backend (Google TurboQuant) can embed and search with
zero ChromaDB, and it does so without giving up quality:
| Backend | Fact recall | Top-k hit@4 | Vector size |
|---|---|---|---|
| chroma (float32 HNSW, default) | 0.744 | 0.759 | 1× |
| turbovec (4-bit, ChromaDB-free) | 0.800 | 0.759 | ~8–16× smaller |

all-MiniLM-L6-v2): verified cosine 1.0, max elementwise diff 0.0, so retrieval quality is unchanged; only the index representation differs.

| Language | graphify symbols | built-in covers | dangling edges |
|---|---|---|---|
| Python | (gold-fact eval above) | — | — |
| TypeScript | 54 | 54 (100%) | 0 |
| Go | 45 | 45 (100%) | 0 |
The built-in tree-sitter backend matches graphify symbol-for-symbol on the
reference fixtures; an optional SCIP pass replaces heuristic call edges with
compiler-accurate ones. All gated by evals/parity/run.py.
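The columns in the coverage table above can be read as a set comparison. The sketch below assumes that "covers" means a reference symbol also appears in the built-in backend's output, and that a "dangling edge" is an edge with an endpoint that is not a known symbol. Both definitions, and all names and sample symbols, are assumptions for illustration and may differ from what `evals/parity/run.py` checks.

```python
# Illustrative sketch: definitions of "coverage" and "dangling edge" here are
# assumptions, and the symbols are made up.

def parity_report(reference_symbols: set[str],
                  builtin_symbols: set[str],
                  edges: list[tuple[str, str]]) -> dict:
    """Compare a reference symbol set against the built-in backend's output."""
    if not reference_symbols:
        raise ValueError("reference symbol set is empty")
    covered = reference_symbols & builtin_symbols
    # An edge dangles when either endpoint is missing from the symbol set.
    dangling = [(src, dst) for src, dst in edges
                if src not in builtin_symbols or dst not in builtin_symbols]
    return {
        "reference": len(reference_symbols),
        "covered": len(covered),
        "coverage": len(covered) / len(reference_symbols),
        "dangling_edges": len(dangling),
    }


reference = {"Server", "Server.start", "parseConfig"}
builtin = {"Server", "Server.start", "parseConfig"}
call_edges = [("Server.start", "parseConfig")]

report = parity_report(reference, builtin, call_edges)
print(report)
```

A row such as "54 | 54 (100%) | 0" corresponds to `covered == reference` and `dangling_edges == 0` in this representation.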
benchmark-your-repo.

    pip install -e ".[dev]" tiktoken
    python -m tests.benchmark.run                # reduction + learning + synapse A/B
    python -m evals.faithfulness.runner --run    # answer-quality delta
    python -m evals.onboarding.runner --run      # onboarding lift
    python -m evals.parity.run                   # backend parity (incl. turbovec)

Each prints a report and exits non-zero if it falls below its gate — the same checks that run on every PR.
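Because every runner signals failure through its exit code, the four gates can be chained from one small script. The sketch below is one possible wrapper, not something shipped in the repo: `GATES` and `run_gates` are hypothetical names, and the module paths are copied from the commands above.

```python
# Illustrative wrapper: GATES and run_gates are hypothetical names.
import subprocess
import sys

# Same four commands as above, run with the current interpreter.
GATES = [
    [sys.executable, "-m", "tests.benchmark.run"],
    [sys.executable, "-m", "evals.faithfulness.runner", "--run"],
    [sys.executable, "-m", "evals.onboarding.runner", "--run"],
    [sys.executable, "-m", "evals.parity.run"],
]


def run_gates(commands: list[list[str]]) -> list[list[str]]:
    """Run each command in turn and return those that exited non-zero."""
    failed = []
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd)
    return failed
```

Calling `sys.exit(1 if run_gates(GATES) else 0)` from the repository root would then fail the whole run if any single gate fails, which mirrors how a CI job treats a non-zero exit.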