neuralmind

Use Case: LLM Cost Optimization

What you’re solving for

Your team or product spends measurable money on LLM API calls for code questions. You need to reduce spend without moving to a cheaper model or cutting features, and report the savings to a stakeholder.

Step 1 — Baseline the current spend

Before installing NeuralMind, capture what you’re spending. Pick a representative workday:

# Count tokens per query today by logging your agent's input/output
# (most agents have a debug mode or you can estimate with tiktoken)

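Compute: avg_tokens_per_query × queries_per_day × 30 ÷ 1,000,000 × $_per_MTok. The division converts raw tokens into millions of tokens (MTok), the unit the price is quoted in. That’s your monthly baseline.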
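The arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not part of NeuralMind: the function name and all input numbers are placeholders, and it assumes a single blended price per million tokens, whereas most providers price input and output tokens separately.

```python
def monthly_cost_usd(avg_tokens_per_query: float,
                     queries_per_day: float,
                     usd_per_mtok: float,
                     days: int = 30) -> float:
    """Monthly spend in USD, given a price quoted per million tokens (MTok)."""
    monthly_tokens = avg_tokens_per_query * queries_per_day * days
    # Convert tokens to MTok before applying the per-MTok price.
    return monthly_tokens / 1_000_000 * usd_per_mtok


# Placeholder inputs, not measurements:
# 12,000 tokens per query, 400 queries per day, $3 per MTok.
baseline = monthly_cost_usd(12_000, 400, 3.00)
print(f"Baseline: ${baseline:,.2f} per month")
```

If your provider bills input and output at different rates, run the function once for each and add the results.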

Step 2 — Install NeuralMind

pip install neuralmind
neuralmind build .
neuralmind install-hooks .     # Claude Code users only

Step 3 — Measure the new baseline

neuralmind benchmark . --json

Returns:

{
  "wakeup_tokens": 341,
  "avg_query_tokens": 739,
  "avg_reduction_ratio": 65.6,
  "results": [...]
}

Compare avg_query_tokens to your pre-install baseline. This is the retrieval-side savings.
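One way to make that comparison concrete is to load the benchmark JSON and divide. The sketch below reuses the sample output shown above (without the elided results list); the helper name and the 12,000-token baseline are placeholders for your own Step 1 measurement. It reads only avg_query_tokens and makes no assumption about how wakeup_tokens or avg_reduction_ratio are defined.

```python
import json

# Sample benchmark output from above, minus the elided "results" list.
benchmark_json = (
    '{"wakeup_tokens": 341, "avg_query_tokens": 739, '
    '"avg_reduction_ratio": 65.6}'
)


def retrieval_savings(baseline_tokens: float, report: dict) -> dict:
    """Compare a pre-install per-query token count with the benchmark average."""
    new_tokens = report["avg_query_tokens"]
    return {
        "baseline_tokens": baseline_tokens,
        "new_tokens": new_tokens,
        "reduction_factor": baseline_tokens / new_tokens,
        "percent_saved": 100 * (1 - new_tokens / baseline_tokens),
    }


report = json.loads(benchmark_json)
# 12_000 is a placeholder; substitute your own Step 1 baseline.
savings = retrieval_savings(12_000, report)
print(f"{savings['reduction_factor']:.1f}x fewer tokens per query "
      f"({savings['percent_saved']:.1f}% saved)")
```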

Step 4 — Measure consumption-side savings (Claude Code)

PostToolUse hooks compress Read/Bash/Grep output. Rough numbers:

Tool    Typical reduction
Read    ~88% (file → skeleton)
Bash    ~91% (errors + tail)
Grep    Capped at 25 matches

Combined, retrieval-side and consumption-side savings typically give a 5–10× total reduction versus baseline.
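To estimate what the rough per-tool numbers mean for your own traffic, you can apply them to your raw tool-output token counts. The following is a back-of-the-envelope sketch: the function name and the daily totals are hypothetical, and the reduction figures are the approximate ones quoted above rather than guarantees.

```python
# Fraction of tokens removed per tool, from the rough numbers above.
# Grep is omitted: it is capped at 25 matches rather than cut by a percentage.
TYPICAL_REDUCTION = {"Read": 0.88, "Bash": 0.91}


def compressed_tokens(raw_tokens_by_tool: dict[str, int]) -> dict[str, float]:
    """Estimate tokens left after compression; unknown tools pass through unchanged."""
    return {
        tool: raw * (1 - TYPICAL_REDUCTION.get(tool, 0.0))
        for tool, raw in raw_tokens_by_tool.items()
    }


# Placeholder daily totals of raw tool output, not measurements.
raw = {"Read": 200_000, "Bash": 50_000}
after = compressed_tokens(raw)
factor = sum(raw.values()) / sum(after.values())
print(f"Tool output shrinks by about {factor:.1f}x")
```

The overall factor depends heavily on your mix of Read and Bash output, so measure your own split before quoting a number to stakeholders.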

Step 5 — Report to stakeholders

A one-page summary template:

NeuralMind rollout — token cost impact

Ongoing hygiene

Debugging cost spikes with the graph view (v0.6.0+)

If a query returns 5K tokens when you’d expect 800, debugging used to mean reading log files. v0.6.0 makes it visual.

  1. Run neuralmind serve . in a separate terminal.
  2. Run the offending query.
  3. In the detail panel, click Replay last query.

The graph view highlights the L3 hits the agent received. The diagnosis is usually obvious from the pulse pattern — a stale cluster boundary that grabbed too many nodes, a missing structural edge that forced the retriever wider, or an unexpected hub node pulling in unrelated context. Fix the underlying issue (rebuild the index, update CLAUDE.md, or tune the cluster boundary) and the next replay shows a tighter result.

The pulse-rings live feed is also useful during normal use: if you notice the canvas going quiet during sessions you’d expect to be busy, that’s a signal the agent isn’t actually using NeuralMind retrieval (maybe the MCP server isn’t wired up, or the hooks didn’t install). A visual heartbeat is faster than checking a log.
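To decide which queries are worth replaying in the graph view, a small script can flag the outliers first. This is a hypothetical sketch that assumes you already log a token count per query somewhere; the log contents, the function name, and the 3× threshold are illustrative and not part of NeuralMind.

```python
def find_spikes(query_tokens: dict[str, int],
                expected: int,
                factor: float = 3.0) -> list[str]:
    """Return queries whose token count exceeds factor * expected, largest first."""
    over = {q: t for q, t in query_tokens.items() if t > factor * expected}
    return sorted(over, key=over.get, reverse=True)


# Hypothetical log of query -> tokens returned.
log = {
    "where is auth handled?": 760,
    "how does billing retry?": 5_100,
    "list config loaders": 840,
}

for query in find_spikes(log, expected=800):
    print(f"replay candidate: {query} ({log[query]} tokens)")
```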

What this doesn’t fix

