BCR-memory-2 vs. a Python dict and SGLang's RadixCache: an honest comparison
Objective
BCR-memory-2 is a Rust prefix-cache data structure. This note puts it on the same table as the two things a user would reasonably compare it to — a trivial Python-dict cache and SGLang's production RadixCache — and walks through what we measured on a single NVIDIA L4. It calls out the wins, the losses, and the dimensions where the comparison can't honestly be made yet.
Description
What we're comparing
Three things sit on the same bench in this note:
| System | What it is | What it does |
|---|---|---|
| BCR-memory-2 | Rust + PyO3 prefix-cache data structure | Maps tokenized-prefix blocks to opaque KV-block values via a flat-array DFA. No serving engine attached. |
| Python dict reference | Pure-Python trivially-correct prefix cache | Same public API as BCR-memory-2. Exists only as a sanity-check baseline. |
| SGLang RadixCache | Production prefix cache inside SGLang 0.5.10 | Full serving-engine component: KV-aware, tree-based, integrated with the scheduler. |
This is intentionally three different things at three different levels of the stack. BCR-memory-2 and the Python dict are interchangeable at the API boundary. SGLang is a full server. Where the levels line up we compare directly; where they don't we say so.
All numbers below come from one NVIDIA L4 spot instance on GCP, Qwen2.5-7B-Instruct fp16, SGLang 0.5.10.post1, sglang.bench_serving with ShareGPT and the generated-shared-prefix dataset. Repo: github.com/armanas/BCR-memory-2.
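To make the comparison concrete, here is a minimal sketch of the shared public API the microbench exercises, written as the trivially-correct Python-dict reference. Method names mirror the ops measured below (`match_prefix` / `insert` / `release`); the class name and exact signatures are illustrative, not the repo's exact surface.

```python
# Hedged sketch of the Python-dict reference cache: the sanity-check baseline
# from the table above. Names and signatures are assumptions for illustration.
class DictPrefixCache:
    """Trivially-correct reference: maps token-tuple prefixes to opaque KV-block ids."""

    def __init__(self):
        self._blocks = {}    # tuple(tokens) -> block id (opaque value)
        self._refcount = {}  # block id -> live references

    def insert(self, tokens, block_id):
        key = tuple(tokens)
        self._blocks[key] = block_id
        self._refcount.setdefault(block_id, 0)

    def match_prefix(self, tokens):
        """Return (matched_len, block_id) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            block_id = self._blocks.get(tuple(tokens[:end]))
            if block_id is not None:
                self._refcount[block_id] = self._refcount.get(block_id, 0) + 1
                return end, block_id
        return 0, None

    def release(self, block_id):
        self._refcount[block_id] -= 1
```

The point of the reference is correctness by inspection, not speed: the longest-prefix scan is O(n) lookups per match, which is exactly why it only exists as a baseline.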
Comparison 1 — microbench vs. the Python dict
Same API, same 2000-request workload through both implementations, same hit rate (97.89% on both). Per-op p50 / p99 in microseconds:
| Op | BCR-memory-2 | Python dict | Who wins | Why |
|---|---|---|---|---|
| match_prefix p50 | 3.2 µs | 4.9 µs | BCR 1.5× | Rust hashing + flat-array traversal |
| match_prefix p99 | 5.4 µs | 9.9 µs | BCR 1.8× | Same |
| insert p50 | 6.0 µs | 10.6 µs | BCR 1.8× | Same |
| insert p99 | 20.7 µs | 24.3 µs | BCR 1.2× | Allocation path dominates; Rust edge is smaller |
| release p50 | 0.53 µs | 0.30 µs | Dict 1.8× | Trivial refcount tick; PyO3 FFI crossing dominates |
| Total wall-clock (2 000 req) | 11 ms | 16 ms | BCR 1.4× | Weighted by op mix |
Net: BCR-memory-2 is faster on the hot path (match + insert), slower on a trivial op where the FFI boundary costs more than the work itself. Total wall time: BCR finishes the workload in 2/3 of the time.
Honest caveat: the Python dict reference is not what serving engines actually ship. It exists so I can prove correctness parity by differential testing. The useful comparison is the next one.
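The differential-testing setup behind the parity claim can be sketched in a few lines: replay one randomized workload through both implementations and require identical observable behaviour at every step. The cache API assumed here (`insert` / `match_prefix`) follows the ops named above, not the repo's exact surface.

```python
# Minimal differential-testing sketch: any divergence between the two
# implementations on the same op stream fails immediately. By construction,
# the hit rate measured this way is identical for both caches.
import random

def differential_run(make_impl_a, make_impl_b, n_ops=2000, seed=0):
    rng = random.Random(seed)
    a, b = make_impl_a(), make_impl_b()
    hits = 0
    for i in range(n_ops):
        tokens = [rng.randrange(8) for _ in range(rng.randrange(1, 6))]
        if rng.random() < 0.5:
            a.insert(tokens, f"blk-{i}")
            b.insert(tokens, f"blk-{i}")
        else:
            ra, rb = a.match_prefix(tokens), b.match_prefix(tokens)
            assert ra == rb, f"divergence at op {i}: {ra} != {rb}"
            hits += ra[0] > 0
    return hits / n_ops
```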
Comparison 2 — memory behaviour under sustained load
The question: does either cache grow unboundedly under continuous churn?
BCR-memory-2 bookkeeping soak (standalone, 30 min)
Pure cache workload, no LLM: 8 million insert / match / release / evict ops at high rate, RSS + state count sampled every 15 seconds.
| Metric | Value |
|---|---|
| Ops served | 8 031 876 |
| RSS start / end | 785.8 MB / 785.9 MB |
| RSS growth over 8 M ops | +0.1 MB |
| Live state slots (steady state) | 179, flat |
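The soak sampler behind this table is simple: every 15 seconds, record the process RSS and the cache's live-state count, so flat lines can be asserted afterwards. The sketch below is Linux-only (it reads `/proc/self/statm`), and the `live_states()` accessor is an assumption standing in for however the real harness reads the slot count.

```python
# Hedged sketch of the soak-test sampler: RSS + live-state count on a fixed
# interval. /proc/self/statm field 1 is resident pages (Linux-specific).
import os
import time

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def rss_mb():
    """Current resident set size in MB, read from /proc/self/statm."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * PAGE_SIZE / 1e6

def soak_sample(cache, duration_s, interval_s=15.0):
    """Sample (timestamp, RSS MB, live states) until duration_s elapses."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), rss_mb(), cache.live_states()))
        time.sleep(interval_s)
    return samples
```

The pass criterion is then a post-hoc check on the samples: last RSS minus first RSS stays within a small tolerance, and the live-state count is flat at steady state.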
SGLang RadixCache soak inside a real server (8.6 min)
Same question, but with the actual LLM and the actual production cache. Continuous shared-prefix traffic through SGLang; sampler watches the server process's RSS.
| Metric | Value |
|---|---|
| Requests served | 1 200 |
| Duration | 471 s post-warmup |
| RSS drift | +4.6 MB |
| TTFT median (sustained) | 188 ms |
Net: both caches are stable under sustained load. BCR-memory-2 costs you essentially nothing in RSS growth per million ops. SGLang's RadixCache holds flat too. Neither has a memory-management problem on this workload. This is a draw — and a useful one to establish, because it removes the "stability" axis from the remaining comparison.
Comparison 3 — what SGLang's RadixCache does at the serving level
This is where BCR-memory-2 can't directly compete yet, so it's honestly reported as the bar rather than a head-to-head. I ran SGLang twice on the same 500-prompt workload: with RadixCache enabled, and with --disable-radix-cache. Two workloads: ShareGPT (random chat prompts, prefix-disjoint) and generated-shared-prefix (10 groups × 50 prompts with a 512-token shared system prefix).
| Workload | Cache | req/s | TTFT median | TTFT p99 | E2E median |
|---|---|---|---|---|---|
| ShareGPT | on | 3.17 | 4147 ms | 35 863 ms | 40 053 ms |
| ShareGPT | off | 3.18 | 4006 ms | 35 857 ms | 39 710 ms |
| GSP | on | 3.48 | 190 ms | 326 ms | 11 743 ms |
| GSP | off | 3.33 | 4055 ms | 8 477 ms | 32 389 ms |
Two findings worth calling out:
- On ShareGPT, the cache is effectively irrelevant. On vs off differ by 0.3% on throughput and 3.5% on TTFT. Random chat prompts don't share prefixes, so the cache has nothing to hit. Any prefix-cache benchmark that reports only ShareGPT numbers is measuring something other than cache effectiveness.
- On GSP the cache is transformative. TTFT median drops from 4055 ms to 190 ms — a 21.3× improvement — because the shared 512-token system prompt is a single cache hit instead of a full prefill. That's the real production value of a prefix cache.
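The shape of the GSP workload explains the size of the win. A sketch of that shape, with stand-in token values (the real dataset is built by `sglang.bench_serving`, not this code):

```python
# Illustrative sketch of the generated-shared-prefix (GSP) workload shape:
# 10 groups x 50 prompts, each group sharing one 512-token prefix, so a
# perfect prefix cache only pays full prefill once per group.
def make_gsp_workload(n_groups=10, prompts_per_group=50, prefix_tokens=512):
    workload = []
    for g in range(n_groups):
        # one shared prefix per group; only its first use can miss the cache
        shared = [g * 100_000 + t for t in range(prefix_tokens)]
        for p in range(prompts_per_group):
            suffix = [g * 100_000 + prefix_tokens + p]  # unique tail per prompt
            workload.append(shared + suffix)
    return workload

def cacheable_fraction(n_groups=10, prompts_per_group=50,
                       prefix_tokens=512, suffix_tokens=1):
    """Fraction of prompt tokens a perfect prefix cache skips on this shape."""
    total = n_groups * prompts_per_group * (prefix_tokens + suffix_tokens)
    skipped = n_groups * (prompts_per_group - 1) * prefix_tokens
    return skipped / total
```

With these parameters nearly all prefill tokens are cacheable, which is why TTFT collapses when the cache is on; ShareGPT has no such structure, so the same cache buys nothing there.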
What this means for BCR-memory-2. I never integrated it into SGLang's scheduler, so I can't put a fourth bar next to those. What I can say is: the microbench numbers from Comparison 1 make it plausible that BCR-memory-2 could match RadixCache on hot-path bookkeeping latency, because the per-op work (hash + trie walk + state update) is in the same regime (single-digit microseconds). It would not automatically outperform it — RadixCache is a mature tree-based structure tuned for SGLang's access patterns. The honest range of expectations is "competitive, not a landslide either way" on bookkeeping latency, with end-to-end serving throughput dominated by attention kernels no matter which cache wins.
Where BCR-memory-2 is clearly better
| Dimension | vs. Python dict | vs. SGLang RadixCache | Evidence |
|---|---|---|---|
| Hot-path op latency | 1.4 – 1.8× faster on match and insert | Not measured head-to-head; microbench numbers are in the same regime | exp3.json |
| Invariant auditability | Dict has no invariants beyond key-existence | RadixCache verifies correctness inside a much larger scheduler; the standalone surface is smaller | assert_invariants() walks the graph after every mutation; every test calls it |
| Collision-safe block hashing at scale | N/A (Python dict compares tokens directly) | N/A (RadixCache compares tokens at each radix node) | test_128bit_hash_distinguishes_across_many_blocks — 100 k distinct blocks, 0 collisions |
| Transport-agnostic reuse | Dict is tied to Python | RadixCache is tied to SGLang's scheduler | Rust cdylib + Python wheel; the core BCRGraph has zero Python deps |
| Single-writer contract enforced by the library itself | Dict has none; you write your own lock | SGLang enforces via the scheduler | Rust-side writer_guard raises on reentrant mutation; Python wrapper adds threading.Lock |
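The single-writer row deserves one concrete line of code. On the Python side the wrapper serializes every call behind one lock; the Rust-side `writer_guard` (not shown) independently raises on reentrant mutation. The `inner` object here is assumed to expose the `insert` / `match_prefix` / `release` API from Comparison 1.

```python
# Sketch of the Python-side single-writer wrapper: one lock around every
# call into the native cache. The wrapped `inner` API is an assumption.
import threading

class SingleWriterCache:
    def __init__(self, inner):
        self._inner = inner
        self._lock = threading.Lock()

    def insert(self, tokens, block_id):
        with self._lock:  # one mutator at a time, enforced in Python too
            return self._inner.insert(tokens, block_id)

    def match_prefix(self, tokens):
        with self._lock:
            return self._inner.match_prefix(tokens)

    def release(self, block_id):
        with self._lock:
            return self._inner.release(block_id)
```

Note the lock also covers `match_prefix`: reads go through the same gate, which is exactly why concurrent-reader throughput is listed below as unmeasured rather than assumed.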
Where BCR-memory-2 is clearly worse
| Dimension | vs. Python dict | vs. SGLang RadixCache | What would change this |
|---|---|---|---|
| Trivial-op latency (release) | 1.8× slower — PyO3 crossing dominates actual work | Not measured | Moving the refcount into Rust (sub-100 ns saved per op; probably not worth the interface friction) |
| End-to-end serving throughput | N/A | Not measured — BCR is not integrated into any serving engine | 1–2 days patching SGLang's scheduler to swap cache backends |
| Multi-model validation | N/A | SGLang works against every supported model out of the box | Per-model regression runs; not done |
| GPU-side KV cache integration | N/A | SGLang owns the full KV memory manager | Not in scope — BCR is a block-routing data structure, not a KV allocator |
| Hit rate on real serving traffic | Measured identical on the differential test | Directly measurable in SGLang via --enable-cache-report | Finishing the SGLang integration and re-running the GSP sweep with BCR plugged in |
| Maturity / ecosystem | Same — small project | RadixCache has production deployment history, telemetry, integration tests across many models | Time + real-world use |
Where the comparison simply can't be made yet
Three dimensions where I deliberately did not publish numbers because I haven't measured them honestly:
- BCR-memory-2 serving end-to-end. Requires the SGLang scheduler patch. Estimated 1–2 days of focused work plus a re-run of the full serving bench. I chose not to do it because the evidence above already told me whether the result would be worth the cost.
- Concurrent-reader throughput. BCR-memory-2 is single-writer under a lock. I didn't measure how many read-only `match_prefix` calls per second the lock allows. The SGLang scheduler serializes too, so it's unclear this would differ meaningfully.
- vLLM. The vLLM comparison run got blocked by a `flash-attn-4` beta vs. `vllm-0.19` `flash_attn.ops` dep conflict. A solvable problem, not a done one.
Summary ledger
BCR wins or ties on hot-path latency and memory stability; it can't compete today on anything that requires serving-engine integration; and the honest positioning is that it's a building block, not a serving engine.
Practical takeaways
- If you need a drop-in production prefix cache for LLM serving, use SGLang's `RadixCache` — it's already in the critical path of a working server and gives you the 21× TTFT win measured above.
- If you're building your own serving infrastructure and want a small, fast, inspectable cache data structure you can reason about in isolation, BCR-memory-2 is measurably faster than a naive Python implementation and holds flat under sustained load.
- If you want to extend or compare against BCR, the `assert_invariants()` harness and the differential-test framework in the repo are reusable for any prefix cache, not just ours.