BCR-memory-2 vs. a Python dict and SGLang's RadixCache: an honest comparison
Objective
BCR-memory-2 is a Rust prefix-cache data structure. This note puts it on the same table as the two things a user would reasonably compare it to — a trivial Python-dict cache and SGLang's production RadixCache — and walks through what we measured on a single NVIDIA L4. It calls out the wins, the losses, and the dimensions where the comparison can't honestly be made yet.
Description
What we're comparing
Three things sit on the same bench in this note:
| System | What it is | What it does |
|---|---|---|
| BCR-memory-2 | Rust + PyO3 prefix-cache data structure | Maps tokenized-prefix blocks to opaque KV-block values via a flat-array DFA. No serving engine attached. |
| Python dict reference | Pure-Python trivially-correct prefix cache | Same public API as BCR-memory-2. Exists only as a sanity-check baseline. |
| SGLang RadixCache | Production prefix cache inside SGLang 0.5.10 | Full serving-engine component: KV-aware, tree-based, integrated with the scheduler. |
This is intentionally three different things at three different levels of the stack. BCR-memory-2 and the Python dict are interchangeable at the API boundary. SGLang is a full server. Where the levels line up we compare directly; where they don't we say so.
All numbers below come from one NVIDIA L4 spot instance on GCP, Qwen2.5-7B-Instruct fp16, SGLang 0.5.10.post1, sglang.bench_serving with ShareGPT and the generated-shared-prefix dataset. Repo: github.com/armanas/BCR-memory-2.
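To make the comparison concrete, here is a minimal sketch of the shared public API the microbench exercises, written as the trivially-correct Python-dict reference. Method names mirror the ops measured below (`match_prefix` / `insert` / `release`); the class name and exact signatures are illustrative, not the repo's exact surface.

```python
# Hedged sketch of the Python-dict reference cache: the sanity-check baseline
# from the table above. Names and signatures are assumptions for illustration.
class DictPrefixCache:
    """Trivially-correct reference: maps token-tuple prefixes to opaque KV-block ids."""

    def __init__(self):
        self._blocks = {}    # tuple(tokens) -> block id (opaque value)
        self._refcount = {}  # block id -> live references

    def insert(self, tokens, block_id):
        key = tuple(tokens)
        self._blocks[key] = block_id
        self._refcount.setdefault(block_id, 0)

    def match_prefix(self, tokens):
        """Return (matched_len, block_id) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            block_id = self._blocks.get(tuple(tokens[:end]))
            if block_id is not None:
                self._refcount[block_id] = self._refcount.get(block_id, 0) + 1
                return end, block_id
        return 0, None

    def release(self, block_id):
        self._refcount[block_id] -= 1
```

The point of the reference is correctness by inspection, not speed: the longest-prefix scan is O(n) lookups per match, which is exactly why it only exists as a baseline.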
Comparison 1 — microbench vs. the Python dict
Same API, same 2000-request workload through both implementations, same hit rate (97.89% on both). Per-op p50 / p99 in microseconds:
| Op | BCR-memory-2 | Python dict | Who wins | Why |
|---|---|---|---|---|
| match_prefix p50 | 3.2 µs | 4.9 µs | BCR 1.5× | Rust hashing + flat-array traversal |
| match_prefix p99 | 5.4 µs | 9.9 µs | BCR 1.8× | Same |
| insert p50 | 6.0 µs | 10.6 µs | BCR 1.8× | Same |
| insert p99 | 20.7 µs | 24.3 µs | BCR 1.2× | Allocation path dominates; Rust edge is smaller |
| release p50 | 0.53 µs | 0.30 µs | Dict 1.8× | Trivial refcount tick; PyO3 FFI crossing dominates |
| Total wall-clock (2 000 req) | 11 ms | 16 ms | BCR 1.4× | Weighted by op mix |
Net: BCR-memory-2 is faster on the hot path (match + insert), slower on a trivial op where the FFI boundary costs more than the work itself. Total wall time: BCR finishes the workload in 2/3 of the time.
Honest caveat: the Python dict reference is not what serving engines actually ship. It exists so I can prove correctness parity by differential testing. The useful comparison is the next one.
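The differential-testing setup behind the parity claim can be sketched in a few lines: replay one randomized workload through both implementations and require identical observable behaviour at every step. The cache API assumed here (`insert` / `match_prefix`) follows the ops named above, not the repo's exact surface.

```python
# Minimal differential-testing sketch: any divergence between the two
# implementations on the same op stream fails immediately. By construction,
# the hit rate measured this way is identical for both caches.
import random

def differential_run(make_impl_a, make_impl_b, n_ops=2000, seed=0):
    rng = random.Random(seed)
    a, b = make_impl_a(), make_impl_b()
    hits = 0
    for i in range(n_ops):
        tokens = [rng.randrange(8) for _ in range(rng.randrange(1, 6))]
        if rng.random() < 0.5:
            a.insert(tokens, f"blk-{i}")
            b.insert(tokens, f"blk-{i}")
        else:
            ra, rb = a.match_prefix(tokens), b.match_prefix(tokens)
            assert ra == rb, f"divergence at op {i}: {ra} != {rb}"
            hits += ra[0] > 0
    return hits / n_ops
```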
Comparison 2 — memory behaviour under sustained load
The question: does either cache grow unboundedly under continuous churn?
BCR-memory-2 bookkeeping soak (standalone, 30 min)
Pure cache workload, no LLM: 8 million insert / match / release / evict ops at high rate, RSS + state count sampled every 15 seconds.
| Metric | Value |
|---|---|
| Ops served | 8 031 876 |
| RSS start / end | 785.8 MB / 785.9 MB |
| RSS growth over 8 M ops | +0.1 MB |
| Live state slots (steady state) | 179, flat |
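The soak sampler behind this table is simple: every 15 seconds, record the process RSS and the cache's live-state count, so flat lines can be asserted afterwards. The sketch below is Linux-only (it reads `/proc/self/statm`), and the `live_states()` accessor is an assumption standing in for however the real harness reads the slot count.

```python
# Hedged sketch of the soak-test sampler: RSS + live-state count on a fixed
# interval. /proc/self/statm field 1 is resident pages (Linux-specific).
import os
import time

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def rss_mb():
    """Current resident set size in MB, read from /proc/self/statm."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * PAGE_SIZE / 1e6

def soak_sample(cache, duration_s, interval_s=15.0):
    """Sample (timestamp, RSS MB, live states) until duration_s elapses."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), rss_mb(), cache.live_states()))
        time.sleep(interval_s)
    return samples
```

The pass criterion is then a post-hoc check on the samples: last RSS minus first RSS stays within a small tolerance, and the live-state count is flat at steady state.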
SGLang RadixCache soak inside a real server (8.6 min)
Same question, but with the actual LLM and the actual production cache. Continuous shared-prefix traffic through SGLang; sampler watches the server process's RSS.
| Metric | Value |
|---|---|
| Requests served | 1 200 |
| Duration | 471 s post-warmup |
| RSS drift | +4.6 MB |
| TTFT median (sustained) | 188 ms |
Net: both caches are stable under sustained load. BCR-memory-2 costs you essentially nothing in RSS growth per million ops. SGLang's RadixCache holds flat too. Neither has a memory-management problem on this workload. This is a draw — and a useful one to establish, because it removes the "stability" axis from the remaining comparison.
Comparison 3 — what SGLang's RadixCache does at the serving level
This is where BCR-memory-2 can't directly compete yet, so it's honestly reported as the bar rather than a head-to-head. I ran SGLang twice on the same 500-prompt workload: with RadixCache enabled, and with --disable-radix-cache. Two workloads: ShareGPT (random chat prompts, prefix-disjoint) and generated-shared-prefix (10 groups × 50 prompts with a 512-token shared system prefix).
| Workload | Cache | req/s | TTFT median | TTFT p99 | E2E median |
|---|---|---|---|---|---|
| ShareGPT | on | 3.17 | 4147 ms | 35 863 ms | 40 053 ms |
| ShareGPT | off | 3.18 | 4006 ms | 35 857 ms | 39 710 ms |
| GSP | on | 3.48 | 190 ms | 326 ms | 11 743 ms |
| GSP | off | 3.33 | 4055 ms | 8 477 ms | 32 389 ms |
Two findings worth calling out:
- On ShareGPT, the cache is effectively irrelevant. On vs off differ by 0.3% on throughput and 3.5% on TTFT. Random chat prompts don't share prefixes, so the cache has nothing to hit. Any prefix-cache benchmark that reports only ShareGPT numbers is measuring something other than cache effectiveness.
- On GSP the cache is transformative. TTFT median drops from 4055 ms to 190 ms — a 21.3× improvement — because the shared 512-token system prompt is a single cache hit instead of a full prefill. That's the real production value of a prefix cache.
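The shape of the GSP workload explains the size of the win. A sketch of that shape, with stand-in token values (the real dataset is built by `sglang.bench_serving`, not this code):

```python
# Illustrative sketch of the generated-shared-prefix (GSP) workload shape:
# 10 groups x 50 prompts, each group sharing one 512-token prefix, so a
# perfect prefix cache only pays full prefill once per group.
def make_gsp_workload(n_groups=10, prompts_per_group=50, prefix_tokens=512):
    workload = []
    for g in range(n_groups):
        # one shared prefix per group; only its first use can miss the cache
        shared = [g * 100_000 + t for t in range(prefix_tokens)]
        for p in range(prompts_per_group):
            suffix = [g * 100_000 + prefix_tokens + p]  # unique tail per prompt
            workload.append(shared + suffix)
    return workload

def cacheable_fraction(n_groups=10, prompts_per_group=50,
                       prefix_tokens=512, suffix_tokens=1):
    """Fraction of prompt tokens a perfect prefix cache skips on this shape."""
    total = n_groups * prompts_per_group * (prefix_tokens + suffix_tokens)
    skipped = n_groups * (prompts_per_group - 1) * prefix_tokens
    return skipped / total
```

With these parameters nearly all prefill tokens are cacheable, which is why TTFT collapses when the cache is on; ShareGPT has no such structure, so the same cache buys nothing there.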
What this means for BCR-memory-2. I never integrated it into SGLang's scheduler, so I can't put a fourth bar next to those. What I can say is: the microbench numbers from Comparison 1 make it plausible that BCR-memory-2 could match RadixCache on hot-path bookkeeping latency, because the per-op work (hash + trie walk + state update) is in the same regime (single-digit microseconds). It would not automatically outperform it — RadixCache is a mature tree-based structure tuned for SGLang's access patterns. The honest range of expectations is "competitive, not a landslide either way" on bookkeeping latency, with end-to-end serving throughput dominated by attention kernels no matter which cache wins.
Where BCR-memory-2 is clearly better
| Dimension | vs. Python dict | vs. SGLang RadixCache | Evidence |
|---|---|---|---|
| Hot-path op latency | 1.4 – 1.8× faster on match and insert | Not measured head-to-head; microbench numbers are in the same regime | exp3.json |
| Invariant auditability | Dict has no invariants beyond key-existence | RadixCache verifies correctness inside a much larger scheduler; the standalone surface is smaller | assert_invariants() walks the graph after every mutation; every test calls it |
| Collision-safe block hashing at scale | N/A (Python dict compares tokens directly) | N/A (RadixCache compares tokens at each radix node) | test_128bit_hash_distinguishes_across_many_blocks — 100 k distinct blocks, 0 collisions |
| Transport-agnostic reuse | Dict is tied to Python | RadixCache is tied to SGLang's scheduler | Rust cdylib + Python wheel; the core BCRGraph has zero Python deps |
| Single-writer contract enforced by the library itself | Dict has none; you write your own lock | SGLang enforces via the scheduler | Rust-side writer_guard raises on reentrant mutation; Python wrapper adds threading.Lock |
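The single-writer row deserves one concrete line of code. On the Python side the wrapper serializes every call behind one lock; the Rust-side `writer_guard` (not shown) independently raises on reentrant mutation. The `inner` object here is assumed to expose the `insert` / `match_prefix` / `release` API from Comparison 1.

```python
# Sketch of the Python-side single-writer wrapper: one lock around every
# call into the native cache. The wrapped `inner` API is an assumption.
import threading

class SingleWriterCache:
    def __init__(self, inner):
        self._inner = inner
        self._lock = threading.Lock()

    def insert(self, tokens, block_id):
        with self._lock:  # one mutator at a time, enforced in Python too
            return self._inner.insert(tokens, block_id)

    def match_prefix(self, tokens):
        with self._lock:
            return self._inner.match_prefix(tokens)

    def release(self, block_id):
        with self._lock:
            return self._inner.release(block_id)
```

Note the lock also covers `match_prefix`: reads go through the same gate, which is exactly why concurrent-reader throughput is listed below as unmeasured rather than assumed.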
Where BCR-memory-2 is clearly worse
| Dimension | vs. Python dict | vs. SGLang RadixCache | What would change this |
|---|---|---|---|
| Trivial-op latency (release) | 1.8× slower — PyO3 crossing dominates actual work | Not measured | Moving the refcount into Rust (sub-100 ns saved per op; probably not worth the interface friction) |
| End-to-end serving throughput | N/A | Not measured — BCR is not integrated into any serving engine | 1–2 days patching SGLang's scheduler to swap cache backends |
| Multi-model validation | N/A | SGLang works against every supported model out of the box | Per-model regression runs; not done |
| GPU-side KV cache integration | N/A | SGLang owns the full KV memory manager | Not in scope — BCR is a block-routing data structure, not a KV allocator |
| Hit rate on real serving traffic | Measured identical on the differential test | Directly measurable in SGLang via --enable-cache-report | Finishing the SGLang integration and re-running the GSP sweep with BCR plugged in |
| Maturity / ecosystem | Same — small project | RadixCache has production deployment history, telemetry, integration tests across many models | Time + real-world use |
Where the comparison simply can't be made yet
Three dimensions where I deliberately did not publish numbers because I haven't measured them honestly:
- BCR-memory-2 serving end-to-end. Requires the SGLang scheduler patch. Estimated 1–2 days of focused work plus a re-run of the full serving bench. I chose not to do it because the evidence above already told me whether the result would be worth the cost.
- Concurrent-reader throughput. BCR-memory-2 is single-writer under a lock. I didn't measure how many read-only `match_prefix` calls per second the lock allows. The SGLang scheduler serializes too, so it's unclear this would differ meaningfully.
- vLLM. The vLLM comparison run got blocked by a `flash-attn-4` beta vs. `vllm-0.19` `flash_attn.ops` dep conflict. A solvable problem, not a done one.
Summary ledger
BCR wins or ties on hot-path latency and memory stability; it can't compete today on anything that requires serving-engine integration; and the honest positioning is that it's a building block, not a serving engine.
Practical takeaways
- If you need a drop-in production prefix cache for LLM serving, use SGLang's `RadixCache` — it's already in the critical path of a working server and gives you the 21× TTFT win measured above.
- If you're building your own serving infrastructure and want a small, fast, inspectable cache data structure you can reason about in isolation, BCR-memory-2 is measurably faster than a naive Python implementation and holds flat under sustained load.
- If you want to extend or compare against BCR, the `assert_invariants()` harness and the differential-test framework in the repo are reusable for any prefix cache, not just ours.