BCR-memory-4: exact long-context compression is real; serving-speed wins are not
A scientific note on what the current evidence actually supports
Objective
This note separates three questions that are easy to conflate in prefix-cache work: exact-prefix serving fairness against SGLang's RadixCache, server-side long-context correctness, and tiered compression in the Hugging Face Qwen path. The current evidence supports a narrow but real result: exact outputs up to 32K with about 2.27x stored-token compression for one tiered setting. It does not support a latency, throughput, or strong peak-memory win.
Description
The question this note is actually answering
The current BCR-memory-4 repo contains three distinct experimental tracks. They are easy to blur together, but they do not support the same claim:
- Exact-prefix serving fairness: does the `bcr` serving path beat SGLang's production `RadixCache` on real workloads?
- Server-side long-context correctness: does exact-prefix `bcr` remain bit-exact with `radix` near 32K context?
- Tiered compression: can a compressed cache representation preserve exact outputs in the Hugging Face Qwen path, and if so, what does it buy in latency and memory?
This note keeps those tracks separate and only makes claims that are directly supported by the current fetched artifacts.
The central result is not "BCR wins." The central result is narrower:
- Exact-prefix BCR is not currently faster than `RadixCache` on the measured serving workloads.
- Tiered compression can preserve exact outputs on the tested Qwen 7B prompts up to 32K.
- That tiered result currently comes with a substantial latency penalty and only a small peak-VRAM reduction.
Experimental surface
All numbers below come from the current BCR-memory-4 artifact set, using Qwen/Qwen2.5-7B-Instruct as the model throughout:
| Track | Comparison | Hardware | Main metric |
|---|---|---|---|
| Exact-prefix serving fairness | bcr vs SGLang radix | A100 80GB | TTFT, E2E latency, reuse telemetry |
| Server long-context correctness | bcr vs radix | A100 80GB | Bit-exact output equality |
| Tiered compression promotion | dense vs tiered | L4 24GB at 16K, A100 80GB at 32K | Exactness, stored-token compression, TTFT/TPOT, peak VRAM |
For the tiered path, I use three derived quantities:

$$
f = \frac{N_{\text{stored}}}{N_{\text{logical}}},
\qquad
C = \frac{N_{\text{logical}}}{N_{\text{stored}}} = \frac{1}{f},
\qquad
\Delta_{\text{VRAM}} = \frac{V_{\text{dense}} - V_{\text{tiered}}}{V_{\text{dense}}},
$$

where:

- $N_{\text{logical}}$ is the logical token budget represented by the cache,
- $N_{\text{stored}}$ is the number of tokens actually kept after exact retention, surprise pages, and phantom summaries,
- $C$ is the stored-token compression factor, and
- $\Delta_{\text{VRAM}}$ is the measured peak-VRAM reduction.

The distinction matters: a large $C$ does not automatically imply a large $\Delta_{\text{VRAM}}$.
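As a worked check on the three derived quantities above, the following sketch reproduces the `front_ret25` 16K row of the promotion table from its reported stored fraction and peak-VRAM figures. The function names are illustrative, not from the repo.

```python
# Worked check of the derived quantities, using the front_ret25 @ 16K
# numbers reported in the promotion table. Names are illustrative.

def stored_fraction(n_stored: int, n_logical: int) -> float:
    """Fraction of the logical token budget actually kept in the cache."""
    return n_stored / n_logical

def compression_factor(n_stored: int, n_logical: int) -> float:
    """Stored-token compression: logical tokens per stored token."""
    return n_logical / n_stored

def vram_reduction(dense_gib: float, tiered_gib: float) -> float:
    """Measured peak-VRAM reduction as a fraction of the dense peak."""
    return (dense_gib - tiered_gib) / dense_gib

# front_ret25 at 16K: stored fraction 0.4425 of a 16384-token budget.
n_logical = 16384
n_stored = round(0.4425 * n_logical)  # ~7250 tokens kept

print(f"f  = {stored_fraction(n_stored, n_logical):.4f}")      # ~0.4425
print(f"C  = {compression_factor(n_stored, n_logical):.4f}x")  # ~2.26x
print(f"dV = {vram_reduction(17.294, 17.137):.4%}")            # ~0.91%
```

The point of the check is the last line: a compression factor of roughly 2.26x coexists with a VRAM reduction under one percent, exactly the gap the note is about.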
Result 1: exact-prefix BCR is not a serving-speed win over RadixCache
The serving-fairness comparison used the pinned SGLang path and two cache-friendly workloads on the A100. Reuse telemetry matched exactly, which makes the latency comparison fair: both systems reused the same amount of prompt work.
| Workload | Cached-token delta (bcr - radix) | TTFT p50 ratio (bcr / radix) | TTFT p99 ratio (bcr / radix) | E2E p50 ratio (bcr / radix) | Interpretation |
|---|---|---|---|---|---|
| `system_repeat` | 0 | 1.081 | 1.197 | 1.003 | Same reuse, slower TTFT |
| `rag_fanout` | 0 | 1.102 | 1.289 | 1.004 | Same reuse, slower TTFT |
This is a negative result for the exact-prefix serving path. On the current evidence:
- `bcr` is not faster than `radix`,
- `bcr` is not lower-latency than `radix`,
- and the current repo does not support a concurrency-win claim either.
What the data does support is narrower: BCR can match reuse accounting, but it does not currently convert that into a serving-speed advantage.
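The ratio columns in the table above are straightforward to reproduce from raw per-request latencies. A minimal sketch of that computation follows; the sample values are made up for illustration and are not the measured data.

```python
# Sketch: computing p50/p99 TTFT ratios from raw per-request latency
# samples. The sample values below are illustrative only, not the
# measured A100 data.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile on a sorted copy of the samples."""
    xs = sorted(samples)
    k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
    return xs[k]

def ttft_ratios(bcr_ms: list[float], radix_ms: list[float]) -> dict[str, float]:
    """Ratio > 1.0 means bcr is slower than radix at that percentile."""
    return {
        "p50_ratio": percentile(bcr_ms, 50) / percentile(radix_ms, 50),
        "p99_ratio": percentile(bcr_ms, 99) / percentile(radix_ms, 99),
    }

# Illustrative samples: bcr slightly slower at the median, noticeably
# slower in the tail, mirroring the shape of the system_repeat row.
radix = [100.0] * 95 + [150.0] * 5
bcr   = [108.0] * 95 + [180.0] * 5
print(ttft_ratios(bcr, radix))
```

Note how a near-1.0 E2E ratio can hide a worse TTFT tail: the p99 ratio is driven entirely by the slowest few percent of requests.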
Result 2: server-side long-context correctness still holds near 32K
The server-side exact-prefix comparison near 32K context remains clean:
| Metric | Value |
|---|---|
| Checked prompts | 30 |
| Mismatches | 0 |
| `radix` cached tokens total | 331,999 |
| `bcr` cached tokens total | 331,999 |
That is a real positive result, but it is a correctness result, not a speed result. It says the exact-prefix bcr path can preserve outputs and reuse accounting relative to radix on the tested long-context set.
Result 3: exact tiered compression exists, but it is not a memory or latency win yet
The tiered path had to be treated as a separate system because it does not compare `bcr` against `radix`. It compares dense local KV against a compressed tiered representation in the Hugging Face generation path.
How the promoted setting was chosen
The first 8K exactness gate with the baseline middle-layer setting failed. A small failure-isolation sweep then varied only the knobs that could plausibly change fidelity. Five follow-up cells were tested at 8K; only two were exact:
| 8K setting | Exact? |
|---|---|
| `middle_ret35` | no |
| `middle_ret50` | no |
| `middle_surp12` | no |
| `middle_phantom8` | yes |
| `front_ret25` | yes |
That is why the later promotion runs focus on `front_ret25` and `middle_phantom8`.
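The sweep amounts to varying one fidelity knob at a time off the failing baseline and keeping only the settings that pass the 8K exactness gate. A minimal sketch of that selection logic follows; the setting names come from the table above, but the gate outcomes are hard-coded here, whereas the real repo runs actual generation to decide them.

```python
# Sketch of the one-knob-at-a-time exactness sweep: each candidate
# varies a single fidelity knob off the failing baseline, and only
# candidates that pass the 8K exactness gate are promoted.
# Gate outcomes below are the observed results from the table;
# the real repo determines them by running generation.

SWEEP_8K = {
    "middle_ret35":    False,
    "middle_ret50":    False,
    "middle_surp12":   False,
    "middle_phantom8": True,
    "front_ret25":     True,
}

def promote(sweep: dict[str, bool]) -> list[str]:
    """Keep only the settings that were exact at 8K."""
    return [name for name, exact in sweep.items() if exact]

print(promote(SWEEP_8K))  # ['middle_phantom8', 'front_ret25']
```

The design choice worth noting: promotion is gated purely on exactness, so the 16K/32K runs inherit only settings with zero observed mismatches.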
Promotion results at 16K and 32K
| Setting | Context | Exactness | Stored-token compression | Stored fraction | TTFT ratio (tiered / dense) | TPOT ratio (tiered / dense) | Dense peak VRAM | Tiered peak VRAM | Peak-VRAM reduction |
|---|---|---|---|---|---|---|---|---|---|
| `front_ret25` | 16K | 4/4 | 2.2599x | 0.4425 | 1.2732x | 1.1920x | 17.294 GiB | 17.137 GiB | 0.91% |
| `middle_phantom8` | 16K | 4/4 | 1.5915x | 0.6283 | 1.2686x | 1.0042x | 17.294 GiB | 17.189 GiB | 0.60% |
| `front_ret25` | 32K | 4/4 | 2.2723x | 0.4401 | 2.2355x | 2.2135x | 20.349 GiB | 20.034 GiB | 1.55% |
Two points are important here:
- Exactness survives promotion. `front_ret25` is exact on the tested `16K` and `32K` Qwen 7B sets. That is the strongest positive result in this repo right now.
- The cost profile is poor. At `32K`, the representation-level compression remains strong (about 2.27x), but TTFT and TPOT both more than double relative to dense, while measured peak VRAM drops by only about `0.315 GiB`.
This is why I call the result "exact long-context compression" rather than "memory-efficient serving." The first claim is supported. The second is not.
Why stored-token compression does not translate into a large HBM win
The current implementation still builds dense decode views from page tensors during decode. In the repo, `_get_decode_kv()` concatenates the page tensors with `torch.cat(...)`, and `_refresh_compat_views()` also rebuilds concatenated compatibility views for the compressed layers.
That means the present system can satisfy both of these statements at the same time:
- the representation is compressed by roughly `2.27x`, and
- the live peak VRAM only drops by about `1.55%`.
This is not a contradiction. It is a sign that representation-level compression has not yet been converted into a fully page-native decode path.
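A toy accounting model makes the mechanism concrete: if decode materializes a dense contiguous view from the stored pages (the `torch.cat`-style path in `_get_decode_kv()`), the transient peak includes both the resident pages and the dense copy, so the stored-token savings never show up at the peak. This is a pure-Python model of that accounting under stated simplifications, not the repo's code.

```python
# Why ~2.27x stored-token compression can coexist with a ~1.5% peak-VRAM
# drop: if decode still builds a dense contiguous view from the stored
# pages (a torch.cat-style copy), the transient peak includes BOTH the
# pages and the dense copy. Pure-Python accounting model; byte counts
# are notional units, not real KV-entry sizes.

BYTES_PER_TOKEN = 1  # notional unit; real KV entries are much larger

def peak_bytes_dense_view(stored_pages: list[int]) -> int:
    """Pages stay resident while a dense view is concatenated from them."""
    stored = sum(stored_pages) * BYTES_PER_TOKEN
    dense_copy = sum(stored_pages) * BYTES_PER_TOKEN  # cat-style copy
    return stored + dense_copy

def peak_bytes_page_native(stored_pages: list[int]) -> int:
    """A page-native decode attends over pages in place; no dense copy."""
    return sum(stored_pages) * BYTES_PER_TOKEN

pages = [256] * 28  # ~7168 stored tokens out of a 16K logical budget
print(peak_bytes_dense_view(pages))   # 14336: pages + dense copy
print(peak_bytes_page_native(pages))  # 7168: pages only
```

Under this model, the compressed path's peak lands back near the dense baseline, which is qualitatively what the measured 0.9% to 1.6% reductions show.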
A claim ledger
The cleanest way to summarize the current state is as a claim ledger.
| Claim | Supported by current evidence? | Verdict |
|---|---|---|
| Exact-prefix BCR is faster than SGLang `RadixCache` | no | rejected by the current A100 serving comparison |
| Exact-prefix BCR preserves long-context correctness relative to `radix` near 32K | yes | supported |
| Tiered compression can preserve exact outputs on real Qwen 7B prompts at 16K and 32K | yes | supported for the tested sets |
| Stored-token compression is large | yes | supported; about 2.26x at 16K and 2.27x at 32K for `front_ret25` |
| Peak-VRAM reduction is large | no | not supported; measured reduction is only 0.9% to 1.6% |
| Tiered compression improves latency | no | rejected; TTFT and TPOT are worse |
| The result generalizes broadly across models and workloads | not yet | unverified |
Limitations
This note intentionally does not claim more than the measurements justify:
- It uses one model family: Qwen2.5-7B-Instruct.
- The tiered promotion runs use small exactness sets (`4` prompts each at `16K` and `32K`).
- The `32K` run emitted a model-length warning because generation steps push slightly beyond the nominal `32768` boundary. Exactness still passed `4/4`, so I report the result, but the caveat matters.
- The serving-fairness comparison covers two cache-friendly workloads, not a full production traffic distribution.
In other words, this is not yet "general long-context compression for all models." It is a precise result on a specific implementation surface and a specific model.
What I think the repo actually demonstrates
The most defensible interpretation of BCR-memory-4 today is:
- As an exact-prefix serving cache, it does not yet beat SGLang's `RadixCache`.
- As a server-side exactness path, it can remain bit-exact with `radix` near 32K.
- As a tiered compression experiment, it can preserve exact outputs on a real Qwen 7B path up to 32K while compressing the stored-token representation by about `2.27x`.
That is already scientifically interesting, but it is narrower than a systems-performance headline. The positive result is about exactness under compression, not about faster or cheaper serving.
The best next experiment
The next step should not be a broader sweep. It should be a latency decomposition on the exact 32K `front_ret25` setting:
- Measure time spent in summary construction,
- measure time spent rebuilding decode views,
- measure time spent in attention / generation proper,
- then rerun the exact same `32K` case after removing the dominant overhead.
If the decode-view concatenation path dominates, the real systems change is clear: move from a representation that still densifies at decode time to a manifest-aware or page-native decode path.
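The decomposition needs very little machinery. A hedged sketch of per-phase wall-clock accounting follows; the phase names track the list above, and the phase bodies are placeholders that the real experiment would replace with summary construction, decode-view rebuilds, and the attention/generation call.

```python
# Sketch of a per-phase latency decomposition for the 32K front_ret25
# run. Phase bodies here are placeholder work; in the real experiment
# they would wrap summary construction, decode-view rebuilds, and
# attention/generation proper.

import time
from collections import defaultdict
from contextlib import contextmanager

phase_totals: dict[str, float] = defaultdict(float)

@contextmanager
def phase(name: str):
    """Accumulate wall-clock time spent inside the named phase."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[name] += time.perf_counter() - t0

# Placeholder workload standing in for one decode step.
with phase("summary_construction"):
    sum(range(10_000))
with phase("decode_view_rebuild"):
    sum(range(10_000))
with phase("attention_generation"):
    sum(range(10_000))

total = sum(phase_totals.values())
for name, t in sorted(phase_totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {t * 1e3:8.3f} ms  ({t / total:.1%})")
```

Sorting phases by accumulated time makes the dominant overhead the first line of output, which is the only number the proposed follow-up experiment actually needs.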
That is the point where this line of work could plausibly become a stronger systems paper. Right now, the evidence is good enough for an honest technical note and a useful negative result, but not yet for a headline speedup claim.