HypercubeQuant: the 50x "compression" is a measurement artifact, not a result
A code-level audit of the sweep shows dense and HQ accounting measure different things on identical workloads
Objective
Earlier versions of this note led with a table showing HypercubeQuant stored-KV compressing >50x against a dense baseline at 128K nominal context. A direct audit of the sweep code shows that number is not compression. The sweep caps each request at 2,048 shared + 512 unique = 2,560 tokens regardless of the nominal context-length label, the dense backend reports a pre-allocated buffer sized to the label, and the HQ backend reports actually-inserted bytes. The reported ratio is exactly ctx_len / 2,560 by construction. On identical 2,560-token workloads, the two backends would store the same amount of KV. What survives the audit is narrower: bit-exact logits against the dense path and a small fixed metadata overhead. The memory and context-scaling claims do not survive.
Description
Status: work in progress. This is a public engineering-honesty note on a project that is still being built. The audit below reflects the current state of the benchmark harness and is intended to be superseded once the harness is fixed and real numbers land. Nothing here is a final result.
Why this note exists in its current form
An earlier draft of this note led with HypercubeQuant's stored-KV being "flat in context length," with compression exceeding 50x at 128K versus a dense baseline on Qwen 2.5 7B. After a direct review of the benchmark code, that framing does not hold up. The number is a measurement artifact produced by two backends reporting different things on identical workloads.
I am publishing the audit rather than the original result because the correction is the more useful note.
What the sweep actually does
The sweep driver reads a list of `context_lengths` and, for each one, builds a workload. The relevant passage in `sweep_runner.cu`:

```cpp
WorkloadConfig wl;
wl.num_requests = concurrency;
wl.shared_prefix_fraction = bench.shared_prefix_fraction;        // 0.8
wl.shared_prefix_tokens   = std::min(bench.shared_prefix_tokens, // 2048
                                     context_length);
wl.unique_suffix_tokens   = std::min(bench.unique_suffix_tokens, // 512
                                     context_length - wl.shared_prefix_tokens);
```
With the sweep configs used to produce the CSVs (`shared_prefix_tokens: 2048`, `unique_suffix_tokens: 512`) and every nominal `context_length` at or above 2,560, the `std::min(...)` clamps kick in and every request is exactly 2,048 + 512 = 2,560 tokens long, independent of the `context_length` label on the row.
In the exported CSV, `r.context_length = ctx_len` then tags the row with the nominal sweep value, not the actual request length.
What each backend reports as "KV memory"
The two backends do not compute KV memory the same way.
Dense, in `dense_kv_backend.cu`:

```cpp
size_t per_layer = max_seq_len_ * num_kv_heads * head_dim * sizeof(half_t);
size_t total_kv  = num_cache_layers_ * per_layer * 2; // K + V
stats.kv_bytes_allocated = total_kv;
```
This is the size of the pre-allocated KV buffer for the full nominal context, not what was written into it. On the 7B model (28 layers, 4 KV heads, head_dim 128, fp16) at ctx_len = 131072, this evaluates to 131072 × 4 × 128 × 2 bytes × 2 (K + V) × 28 layers = 7,516,192,768 bytes, matching the reported 7168 MB exactly.
HQ, in `hq_kv_backend.cu`:

```cpp
total_kv_bytes_ += full_blocks * block_bytes * 2; // K + V, inserted blocks only
```
This is the actually-inserted KV, accumulated as blocks land. With each request's 2,560 tokens yielding 2560 / 16 = 160 full 16-token blocks, the 7B case evaluates to 160 × (16 × 4 × 128 × 2 bytes × 28 layers) × 2 (K + V) = 146,800,640 bytes, matching the reported 140 MB.
Why the ratio looks like 50x
With a fixed 2,560-token request on every row, the HQ number is a flat 140 MB for 7B and 30 MB for 0.5B across the whole sweep, while the dense number scales linearly with the label. The "compression ratio" is therefore exactly `ctx_len / 2560`:
| Label | Predicted ratio (label / 2560) | Reported 7B ratio (dense MB / 140 MB) |
|---|---|---|
| 4,096 | 1.6x | 1.6x |
| 16,384 | 6.4x | 6.4x |
| 32,768 | 12.8x | 12.8x |
| 65,536 | 25.6x | 25.6x |
| 131,072 | 51.2x | 51.2x |
Every row matches the ctx_len / 2560 prediction to three significant figures. That is the signature of an accounting mismatch, not a compression property.
What the comparison would look like if matched
A fair KV-memory comparison on these sweeps would compare the two backends on the same 2,560-token request. Both would then end up at roughly 140 MB on 7B. Under that accounting the row reads:
| Model | Actual request | Dense KV (utilization) | HQ KV (utilization) | Ratio |
|---|---|---|---|---|
| Qwen 2.5 7B | 2,560 tokens | ~140 MB | ~140 MB | ~1x |
| Qwen 2.5 0.5B | 2,560 tokens | ~30 MB | ~30 MB | ~1x |
There is no compression on this workload because there is nothing to compress: the nominal context_length parameter does not control how much KV is inserted.
At concurrency = 1 there is also only a single request per iteration, so there is no cross-request shared-prefix reuse to exploit either. The reuse factor of several thousand reported in the CSV is accumulated across warmup and measurement iterations, not across distinct requests in a batch.
What survives the audit
Two things remain intact.
Bit-exact logits
The perplexity harness uses a different code path (`correctness.cu`, with real WikiText-2 tokens on a full-context prefill) and produces an identical-to-zero logit difference between the HQ and dense paths on Qwen 2.5 0.5B and 7B at 4,096 tokens.
| Model | Context | Dense perplexity | HQ perplexity | Max logit difference |
|---|---|---|---|---|
| Qwen 2.5 0.5B | 4 096 | 8.4468e6 | 8.4468e6 | 0 |
| Qwen 2.5 7B | 4 096 | 1.0939e9 | 1.0939e9 | 0 |
"Max logit difference: 0" means bit-identical per-token logits. The exact-reuse path does not corrupt outputs on the tested set. That is a real property, and it does not depend on the sweep harness.
Small, roughly fixed metadata overhead
The ~10 MB metadata overhead reported for the HQ path is real and largely independent of the workload. Not a headline result, but a property worth recording.
What does not survive the audit
- "Stored KV is flat in context length" — false as a system claim. It is flat because the workload is flat. The sweep's `context_length` label does not drive request size.
- "Compression exceeds 50x at 128K" — false as a compression claim. The ratio is `ctx_len / 2560` by construction.
- "Context-scaling advantage versus dense" — unmeasured. The sweep does not vary real inserted bytes.
- "Reuse-exploiting memory win at concurrency 1" — not possible. One request cannot reuse itself.
The reduced claim ledger
| Claim | Supported? | Verdict |
|---|---|---|
| Output logits are bit-identical to the dense baseline on tested models | yes | supported |
| Metadata overhead is small and roughly fixed (~10 MB) | yes | supported |
| Stored KV is flat in context length | no | artifact of capped-length workload |
| Compression >50x at 128K | no | ratio equals ctx_len / 2560 by construction |
| Peak HBM during prefill is reduced | no | slightly worse |
| TTFT is lower than the dense baseline | no | near parity |
| Any memory or serving advantage over SGLang / vLLM APC | not tested | unverified |
What needs to change in the sweep
The memory and context-scaling claims can only be revisited after the harness is corrected. Concretely:
- Report dense utilization, not allocation. Dense should accumulate bytes as it writes K/V for the actual request, the same way HQ does. Allocation versus utilization cannot be compared.
- Make request size a function of `context_length`. Either drop the `shared_prefix_tokens` / `unique_suffix_tokens` clamps, or introduce a `context_length`-proportional component, so the label means what the column header says.
- Exercise cross-request reuse at `concurrency > 1`. Prefix sharing across concurrent or sequential requests is the actual mechanism HQ is supposed to exploit. A single-request sweep cannot demonstrate it.
- Publish the reuse factor next to every memory row. Compression claims without a visible reuse factor on the same row are easy to misread.
What I want on record
The correct summary today is narrow:
- The exact-reuse forward path produces bit-identical logits against the dense path on the tested models.
- The metadata overhead is small and roughly fixed.
- Every other memory, context-scaling, or serving claim previously made in this note is either an artifact of the sweep harness or an unmeasured comparison.
There is no compression number to stand behind yet. The next step is fixing the benchmark, not writing the follow-up note.
This note will be revised once the harness is corrected and real memory and scaling numbers are available. It is published in its current form because the honest methodology correction is more useful than the original misleading framing.