HypercubeQuant: the 50x "compression" is a measurement artifact, not a result
A code-level audit of the sweep shows dense and HQ accounting measure different things on identical workloads
Objective
Earlier versions of this note led with a table showing HypercubeQuant stored-KV compressing >50x against a dense baseline at 128K nominal context. A direct audit of the sweep code shows that number is not compression. The sweep caps each request at 2,048 shared + 512 unique = 2,560 tokens regardless of the nominal context-length label, the dense backend reports a pre-allocated buffer sized to the label, and the HQ backend reports actually-inserted bytes. The reported ratio is exactly ctx_len / 2,560 by construction. On identical 2,560-token workloads, the two backends would store the same amount of KV. What survives the audit is narrower: bit-exact logits against the dense path and a small fixed metadata overhead. The memory and context-scaling claims do not survive.
Description
Status: work in progress. This is a public engineering-honesty note on a project that is still being built. The audit below reflects the current state of the benchmark harness and is intended to be superseded once the harness is fixed and real numbers land. Nothing here is a final result.
Why this note exists in its current form
An earlier draft of this note led with HypercubeQuant's stored-KV being "flat in context length," with compression exceeding 50x at 128K versus a dense baseline on Qwen 2.5 7B. After a direct review of the benchmark code, that framing does not hold up. The number is a measurement artifact produced by two backends reporting different things on identical workloads.
I am publishing the audit rather than the original result because the correction is the more useful note.
What the sweep actually does
The sweep driver reads a list of `context_lengths` and, for each one, builds a workload. The relevant passage in `sweep_runner.cu`:

```cpp
WorkloadConfig wl;
wl.num_requests = concurrency;
wl.shared_prefix_fraction = bench.shared_prefix_fraction;        // 0.8
wl.shared_prefix_tokens   = std::min(bench.shared_prefix_tokens, // 2048
                                     context_length);
wl.unique_suffix_tokens   = std::min(bench.unique_suffix_tokens, // 512
                                     context_length - wl.shared_prefix_tokens);
```
With the sweep configs used to produce the CSVs (`shared_prefix_tokens: 2048`, `unique_suffix_tokens: 512`) and every nominal `context_length` at or above 2,560, the `std::min(...)` clamps kick in and every request is exactly 2,048 + 512 = 2,560 tokens long, independent of the `context_length` label on the row.
In the exported CSV, `r.context_length = ctx_len` then tags the row with the nominal sweep value, not the actual request length.
What each backend reports as "KV memory"
The two backends do not compute KV memory the same way.
Dense, in `dense_kv_backend.cu`:

```cpp
size_t per_layer = max_seq_len_ * num_kv_heads * head_dim * sizeof(half_t);
size_t total_kv  = num_cache_layers_ * per_layer * 2; // K + V
stats.kv_bytes_allocated = total_kv;
```
This is the size of the pre-allocated KV buffer for the full nominal context, not what was written into it. On the 7B model (28 layers, 4 KV heads, head_dim 128, fp16) at ctx_len = 131072, this evaluates to 131072 × 4 × 128 × 2 bytes × 2 (K + V) × 28 layers = 7,516,192,768 bytes, matching the reported 7168 MB exactly.
HQ, in `hq_kv_backend.cu`:

```cpp
total_kv_bytes_ += full_blocks * block_bytes * 2; // K + V, inserted blocks only
```
This is the actually-inserted KV, accumulated as blocks land. With each request's 2,560 tokens yielding 2560 / 16 = 160 full 16-token blocks, the 7B case evaluates to 160 × (16 × 4 × 128 × 2 bytes × 28 layers) × 2 (K + V) = 146,800,640 bytes, matching the reported 140 MB.
Why the ratio looks like 50x
With a fixed 2,560-token request on every row, the HQ number is a flat 140 MB for 7B and 30 MB for 0.5B across the whole sweep, while the dense number scales linearly with the label. The "compression ratio" is therefore exactly `ctx_len / 2560`:
| Label | Predicted ratio (label / 2560) | Reported 7B ratio (dense MB / 140 MB) |
|---|---|---|
| 4,096 | 1.6x | 1.6x |
| 16,384 | 6.4x | 6.4x |
| 32,768 | 12.8x | 12.8x |
| 65,536 | 25.6x | 25.6x |
| 131,072 | 51.2x | 51.2x |
Every row matches the ctx_len / 2560 prediction to three significant figures. That is the signature of an accounting mismatch, not a compression property.
What the comparison would look like if matched
A fair KV-memory comparison on these sweeps would compare the two backends on the same 2,560-token request. Both would then end up at roughly 140 MB on 7B. Under that accounting the row reads:
| Model | Actual request | Dense KV (utilization) | HQ KV (utilization) | Ratio |
|---|---|---|---|---|
| Qwen 2.5 7B | 2,560 tokens | ~140 MB | ~140 MB | ~1x |
| Qwen 2.5 0.5B | 2,560 tokens | ~30 MB | ~30 MB | ~1x |
There is no compression on this workload because there is nothing to compress: the nominal context_length parameter does not control how much KV is inserted.
At concurrency = 1 there is also only a single request per iteration, so there is no cross-request shared-prefix reuse to exploit either. The reuse factor of several thousand reported in the CSV is accumulated across warmup and measurement iterations, not across distinct requests in a batch.
What survives the audit
Two things remain intact.
Bit-exact logits
The perplexity harness uses a different code path (`correctness.cu`, with real WikiText-2 tokens on a full-context prefill) and produces an identical-to-zero logit difference between the HQ and dense paths on Qwen 2.5 0.5B and 7B at 4,096 tokens.
| Model | Context | Dense perplexity | HQ perplexity | Max logit difference |
|---|---|---|---|---|
| Qwen 2.5 0.5B | 4 096 | 8.4468e6 | 8.4468e6 | 0 |
| Qwen 2.5 7B | 4 096 | 1.0939e9 | 1.0939e9 | 0 |
"Max logit difference: 0" means bit-identical per-token logits. The exact-reuse path does not corrupt outputs on the tested set. That is a real property, and it does not depend on the sweep harness.
Small, roughly fixed metadata overhead
The ~10 MB metadata overhead reported for the HQ path is real and largely independent of the workload. Not a headline result, but a property worth recording.
What does not survive the audit
- "Stored KV is flat in context length" — false as a system claim. It is flat because the workload is flat. The sweep's `context_length` label does not drive request size.
- "Compression exceeds 50x at 128K" — false as a compression claim. The ratio is `ctx_len / 2560` by construction.
- "Context-scaling advantage versus dense" — unmeasured. The sweep does not vary real inserted bytes.
- "Reuse-exploiting memory win at concurrency 1" — not possible. One request cannot reuse itself.
The reduced claim ledger
| Claim | Supported? | Verdict |
|---|---|---|
| Output logits are bit-identical to the dense baseline on tested models | yes | supported |
| Metadata overhead is small and roughly fixed (~10 MB) | yes | supported |
| Stored KV is flat in context length | no | artifact of capped-length workload |
| Compression >50x at 128K | no | ratio equals ctx_len / 2560 by construction |
| Peak HBM during prefill is reduced | no | slightly worse |
| TTFT is lower than the dense baseline | no | near parity |
| Any memory or serving advantage over SGLang / vLLM APC | not tested | unverified |
What needs to change in the sweep
The memory and context-scaling claims can only be revisited after the harness is corrected. Concretely:
- Report dense utilization, not allocation. Dense should accumulate bytes as it writes K/V for the actual request, the same way HQ does. Allocation versus utilization cannot be compared.
- Make request size a function of `context_length`. Either drop the `shared_prefix_tokens` / `unique_suffix_tokens` clamps, or introduce a `context_length`-proportional component, so the label means what the column header says.
- Exercise cross-request reuse at `concurrency > 1`. Prefix sharing across concurrent or sequential requests is the actual mechanism HQ is supposed to exploit. A single-request sweep cannot demonstrate it.
- Publish the reuse factor next to every memory row. Compression claims without a visible reuse factor on the same row are easy to misread.
What I want on record
The correct summary today is narrow:
- The exact-reuse forward path produces bit-identical logits against the dense path on the tested models.
- The metadata overhead is small and roughly fixed.
- Every other memory, context-scaling, or serving claim previously made in this note is either an artifact of the sweep harness or an unmeasured comparison.
There is no compression number to stand behind yet. The next step is fixing the benchmark, not writing the follow-up note.
This note will be revised once the harness is corrected and real memory and scaling numbers are available. It is published in its current form because the honest methodology correction is more useful than the original misleading framing.