BCR-memory-4: exact long-context compression is real; serving-speed wins are not
A scientific note on what the current evidence actually supports
Objective
This note separates three questions that are easy to conflate in prefix-cache work: exact-prefix serving fairness against SGLang's RadixCache, server-side long-context correctness, and tiered compression in the Hugging Face Qwen path. The current evidence supports a narrow but real result: exact outputs up to 32K with about 2.27x stored-token compression for one tiered setting. It does not support a latency, throughput, or strong peak-memory win.
Description
The question this note is actually answering
The current BCR-memory-4 repo contains three distinct experimental tracks. They are easy to blur together, but they do not support the same claim:
- Exact-prefix serving fairness: does the `bcr` serving path beat SGLang's production `RadixCache` on real workloads?
- Server-side long-context correctness: does exact-prefix `bcr` remain bit-exact with `radix` near 32K context?
- Tiered compression: can a compressed cache representation preserve exact outputs in the Hugging Face Qwen path, and if so, what does it buy in latency and memory?
This note keeps those tracks separate and only makes claims that are directly supported by the current fetched artifacts.
The central result is not "BCR wins." The central result is narrower:
- Exact-prefix BCR is not currently faster than `RadixCache` on the measured serving workloads.
- Tiered compression can preserve exact outputs on the tested Qwen 7B prompts up to 32K.
- That tiered result currently comes with a substantial latency penalty and only a small peak-VRAM reduction.
Experimental surface
All numbers below come from the current BCR-memory-4 artifact set, using Qwen/Qwen2.5-7B-Instruct as the model throughout:
| Track | Comparison | Hardware | Main metric |
|---|---|---|---|
| Exact-prefix serving fairness | bcr vs SGLang radix | A100 80GB | TTFT, E2E latency, reuse telemetry |
| Server long-context correctness | bcr vs radix | A100 80GB | Bit-exact output equality |
| Tiered compression promotion | dense vs tiered | L4 24GB at 16K, A100 80GB at 32K | Exactness, stored-token compression, TTFT/TPOT, peak VRAM |
For the tiered path, I use three derived quantities:

$$
f = \frac{N_{\text{stored}}}{N_{\text{logical}}},
\qquad
C = \frac{N_{\text{logical}}}{N_{\text{stored}}} = \frac{1}{f},
\qquad
\Delta_{\text{VRAM}} = \frac{V_{\text{dense}} - V_{\text{tiered}}}{V_{\text{dense}}},
$$

where:

- $N_{\text{logical}}$ is the logical token budget represented by the cache,
- $N_{\text{stored}}$ is the number of tokens actually kept after exact retention, surprise pages, and phantom summaries,
- $C$ is the stored-token compression factor, and
- $\Delta_{\text{VRAM}}$ is the measured peak-VRAM reduction.

The distinction matters: a large $C$ does not automatically imply a large $\Delta_{\text{VRAM}}$.
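As a worked check on the three derived quantities above, the following sketch reproduces the `front_ret25` 16K row of the promotion table from its reported stored fraction and peak-VRAM figures. The function names are illustrative, not from the repo.

```python
# Worked check of the derived quantities, using the front_ret25 @ 16K
# numbers reported in the promotion table. Names are illustrative.

def stored_fraction(n_stored: int, n_logical: int) -> float:
    """Fraction of the logical token budget actually kept in the cache."""
    return n_stored / n_logical

def compression_factor(n_stored: int, n_logical: int) -> float:
    """Stored-token compression: logical tokens per stored token."""
    return n_logical / n_stored

def vram_reduction(dense_gib: float, tiered_gib: float) -> float:
    """Measured peak-VRAM reduction as a fraction of the dense peak."""
    return (dense_gib - tiered_gib) / dense_gib

# front_ret25 at 16K: stored fraction 0.4425 of a 16384-token budget.
n_logical = 16384
n_stored = round(0.4425 * n_logical)  # ~7250 tokens kept

print(f"f  = {stored_fraction(n_stored, n_logical):.4f}")      # ~0.4425
print(f"C  = {compression_factor(n_stored, n_logical):.4f}x")  # ~2.26x
print(f"dV = {vram_reduction(17.294, 17.137):.4%}")            # ~0.91%
```

The point of the check is the last line: a compression factor of roughly 2.26x coexists with a VRAM reduction under one percent, exactly the gap the note is about.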
Result 1: exact-prefix BCR is not a serving-speed win over RadixCache
The serving-fairness comparison used the pinned SGLang path and two cache-friendly workloads on the A100. Reuse telemetry matched exactly, which makes the latency comparison fair: both systems reused the same amount of prompt work.
| Workload | Cached-token delta (bcr - radix) | TTFT p50 ratio (bcr / radix) | TTFT p99 ratio (bcr / radix) | E2E p50 ratio (bcr / radix) | Interpretation |
|---|---|---|---|---|---|
| `system_repeat` | 0 | 1.081 | 1.197 | 1.003 | Same reuse, slower TTFT |
| `rag_fanout` | 0 | 1.102 | 1.289 | 1.004 | Same reuse, slower TTFT |
This is a negative result for the exact-prefix serving path. On the current evidence:
- `bcr` is not faster than `radix`,
- `bcr` is not lower-latency than `radix`,
- and the current repo does not support a concurrency-win claim either.
What the data does support is narrower: BCR can match reuse accounting, but it does not currently convert that into a serving-speed advantage.
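The ratio columns in the table above are straightforward to reproduce from raw per-request latencies. A minimal sketch of that computation follows; the sample values are made up for illustration and are not the measured data.

```python
# Sketch: computing p50/p99 TTFT ratios from raw per-request latency
# samples. The sample values below are illustrative only, not the
# measured A100 data.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile on a sorted copy of the samples."""
    xs = sorted(samples)
    k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
    return xs[k]

def ttft_ratios(bcr_ms: list[float], radix_ms: list[float]) -> dict[str, float]:
    """Ratio > 1.0 means bcr is slower than radix at that percentile."""
    return {
        "p50_ratio": percentile(bcr_ms, 50) / percentile(radix_ms, 50),
        "p99_ratio": percentile(bcr_ms, 99) / percentile(radix_ms, 99),
    }

# Illustrative samples: bcr slightly slower at the median, noticeably
# slower in the tail, mirroring the shape of the system_repeat row.
radix = [100.0] * 95 + [150.0] * 5
bcr   = [108.0] * 95 + [180.0] * 5
print(ttft_ratios(bcr, radix))
```

Note how a near-1.0 E2E ratio can hide a worse TTFT tail: the p99 ratio is driven entirely by the slowest few percent of requests.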
Result 2: server-side long-context correctness still holds near 32K
The server-side exact-prefix comparison near 32K context remains clean:
| Metric | Value |
|---|---|
| Checked prompts | 30 |
| Mismatches | 0 |
| `radix` cached tokens total | 331,999 |
| `bcr` cached tokens total | 331,999 |
That is a real positive result, but it is a correctness result, not a speed result. It says the exact-prefix bcr path can preserve outputs and reuse accounting relative to radix on the tested long-context set.
Result 3: exact tiered compression exists, but it is not a memory or latency win yet
The tiered path had to be treated as a separate system because it does not compare `bcr` against `radix`. It compares dense local KV against a compressed tiered representation in the Hugging Face generation path.
How the promoted setting was chosen
The first 8K exactness gate with the baseline middle-layer setting failed. A small failure-isolation sweep then varied only the knobs that could plausibly change fidelity. Five follow-up cells were tested at 8K; only two were exact:
| 8K setting | Exact? |
|---|---|
| `middle_ret35` | no |
| `middle_ret50` | no |
| `middle_surp12` | no |
| `middle_phantom8` | yes |
| `front_ret25` | yes |
That is why the later promotion runs focus on `front_ret25` and `middle_phantom8`.
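The sweep amounts to varying one fidelity knob at a time off the failing baseline and keeping only the settings that pass the 8K exactness gate. A minimal sketch of that selection logic follows; the setting names come from the table above, but the gate outcomes are hard-coded here, whereas the real repo runs actual generation to decide them.

```python
# Sketch of the one-knob-at-a-time exactness sweep: each candidate
# varies a single fidelity knob off the failing baseline, and only
# candidates that pass the 8K exactness gate are promoted.
# Gate outcomes below are the observed results from the table;
# the real repo determines them by running generation.

SWEEP_8K = {
    "middle_ret35":    False,
    "middle_ret50":    False,
    "middle_surp12":   False,
    "middle_phantom8": True,
    "front_ret25":     True,
}

def promote(sweep: dict[str, bool]) -> list[str]:
    """Keep only the settings that were exact at 8K."""
    return [name for name, exact in sweep.items() if exact]

print(promote(SWEEP_8K))  # ['middle_phantom8', 'front_ret25']
```

The design choice worth noting: promotion is gated purely on exactness, so the 16K/32K runs inherit only settings with zero observed mismatches.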
Promotion results at 16K and 32K
| Setting | Context | Exactness | Stored-token compression | Stored fraction | TTFT ratio (tiered / dense) | TPOT ratio (tiered / dense) | Dense peak VRAM | Tiered peak VRAM | Peak-VRAM reduction |
|---|---|---|---|---|---|---|---|---|---|
| `front_ret25` | 16K | 4/4 | 2.2599x | 0.4425 | 1.2732x | 1.1920x | 17.294 GiB | 17.137 GiB | 0.91% |
| `middle_phantom8` | 16K | 4/4 | 1.5915x | 0.6283 | 1.2686x | 1.0042x | 17.294 GiB | 17.189 GiB | 0.60% |
| `front_ret25` | 32K | 4/4 | 2.2723x | 0.4401 | 2.2355x | 2.2135x | 20.349 GiB | 20.034 GiB | 1.55% |
Two points are important here:
- Exactness survives promotion. `front_ret25` is exact on the tested `16K` and `32K` Qwen 7B sets. That is the strongest positive result in this repo right now.
- The cost profile is poor. At `32K`, the representation-level compression remains strong (about 2.27x), but TTFT and TPOT both more than double relative to dense, while measured peak VRAM drops by only about `0.315 GiB`.
This is why I call the result "exact long-context compression" rather than "memory-efficient serving." The first claim is supported. The second is not.
Why stored-token compression does not translate into a large HBM win
The current implementation still builds dense decode views from page tensors during decode. In the repo, `_get_decode_kv()` concatenates the page tensors with `torch.cat(...)`, and `_refresh_compat_views()` also rebuilds concatenated compatibility views for the compressed layers.
That means the present system can satisfy both of these statements at the same time:
- the representation is compressed by roughly `2.27x`, and
- the live peak VRAM only drops by about `1.55%`.
This is not a contradiction. It is a sign that representation-level compression has not yet been converted into a fully page-native decode path.
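A toy accounting model makes the mechanism concrete: if decode materializes a dense contiguous view from the stored pages (the `torch.cat`-style path in `_get_decode_kv()`), the transient peak includes both the resident pages and the dense copy, so the stored-token savings never show up at the peak. This is a pure-Python model of that accounting under stated simplifications, not the repo's code.

```python
# Why ~2.27x stored-token compression can coexist with a ~1.5% peak-VRAM
# drop: if decode still builds a dense contiguous view from the stored
# pages (a torch.cat-style copy), the transient peak includes BOTH the
# pages and the dense copy. Pure-Python accounting model; byte counts
# are notional units, not real KV-entry sizes.

BYTES_PER_TOKEN = 1  # notional unit; real KV entries are much larger

def peak_bytes_dense_view(stored_pages: list[int]) -> int:
    """Pages stay resident while a dense view is concatenated from them."""
    stored = sum(stored_pages) * BYTES_PER_TOKEN
    dense_copy = sum(stored_pages) * BYTES_PER_TOKEN  # cat-style copy
    return stored + dense_copy

def peak_bytes_page_native(stored_pages: list[int]) -> int:
    """A page-native decode attends over pages in place; no dense copy."""
    return sum(stored_pages) * BYTES_PER_TOKEN

pages = [256] * 28  # ~7168 stored tokens out of a 16K logical budget
print(peak_bytes_dense_view(pages))   # 14336: pages + dense copy
print(peak_bytes_page_native(pages))  # 7168: pages only
```

Under this model, the compressed path's peak lands back near the dense baseline, which is qualitatively what the measured 0.9% to 1.6% reductions show.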
A claim ledger
The cleanest way to summarize the current state is as a claim ledger.
| Claim | Supported by current evidence? | Verdict |
|---|---|---|
| Exact-prefix BCR is faster than SGLang `RadixCache` | no | rejected by the current A100 serving comparison |
| Exact-prefix BCR preserves long-context correctness relative to `radix` near 32K | yes | supported |
| Tiered compression can preserve exact outputs on real Qwen 7B prompts at 16K and 32K | yes | supported for the tested sets |
| Stored-token compression is large | yes | supported; about 2.26x at 16K and 2.27x at 32K for `front_ret25` |
| Peak-VRAM reduction is large | no | not supported; measured reduction is only 0.9% to 1.6% |
| Tiered compression improves latency | no | rejected; TTFT and TPOT are worse |
| The result generalizes broadly across models and workloads | not yet | unverified |
Limitations
This note intentionally does not claim more than the measurements justify:
- It uses one model family: Qwen2.5-7B-Instruct.
- The tiered promotion runs use small exactness sets (`4` prompts each at `16K` and `32K`).
- The `32K` run emitted a model-length warning because generation steps push slightly beyond the nominal `32768` boundary. Exactness still passed `4/4`, so I report the result, but the caveat matters.
- The serving-fairness comparison covers two cache-friendly workloads, not a full production traffic distribution.
In other words, this is not yet "general long-context compression for all models." It is a precise result on a specific implementation surface and a specific model.
What I think the repo actually demonstrates
The most defensible interpretation of BCR-memory-4 today is:
- As an exact-prefix serving cache, it does not yet beat SGLang's `RadixCache`.
- As a server-side exactness path, it can remain bit-exact with `radix` near 32K.
- As a tiered compression experiment, it can preserve exact outputs on a real Qwen 7B path up to 32K while compressing the stored-token representation by about `2.27x`.
That is already scientifically interesting, but it is narrower than a systems-performance headline. The positive result is about exactness under compression, not about faster or cheaper serving.
The best next experiment
The next step should not be a broader sweep. It should be a latency decomposition on the exact 32K `front_ret25` setting:
- Measure time spent in summary construction,
- measure time spent rebuilding decode views,
- measure time spent in attention / generation proper,
- then rerun the exact same `32K` case after removing the dominant overhead.
If the decode-view concatenation path dominates, the real systems change is clear: move from a representation that still densifies at decode time to a manifest-aware or page-native decode path.
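The decomposition needs very little machinery. A hedged sketch of per-phase wall-clock accounting follows; the phase names track the list above, and the phase bodies are placeholders that the real experiment would replace with summary construction, decode-view rebuilds, and the attention/generation call.

```python
# Sketch of a per-phase latency decomposition for the 32K front_ret25
# run. Phase bodies here are placeholder work; in the real experiment
# they would wrap summary construction, decode-view rebuilds, and
# attention/generation proper.

import time
from collections import defaultdict
from contextlib import contextmanager

phase_totals: dict[str, float] = defaultdict(float)

@contextmanager
def phase(name: str):
    """Accumulate wall-clock time spent inside the named phase."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[name] += time.perf_counter() - t0

# Placeholder workload standing in for one decode step.
with phase("summary_construction"):
    sum(range(10_000))
with phase("decode_view_rebuild"):
    sum(range(10_000))
with phase("attention_generation"):
    sum(range(10_000))

total = sum(phase_totals.values())
for name, t in sorted(phase_totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {t * 1e3:8.3f} ms  ({t / total:.1%})")
```

Sorting phases by accumulated time makes the dominant overhead the first line of output, which is the only number the proposed follow-up experiment actually needs.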
That is the point where this line of work could plausibly become a stronger systems paper. Right now, the evidence is good enough for an honest technical note and a useful negative result, but not yet for a headline speedup claim.