HypercubeQuant: early notes from an A100 KV-cache experiment
Correctness is bit-exact on the models tested; the memory story needs a harness that isn't rigged in my favor
Objective
HypercubeQuant is an in-progress inference-time memory experiment for transformer serving on a single NVIDIA A100. This is an early teaser note: what it is, where it sits in the space, what the current evidence actually supports, and what the next gates are. The short version: outputs are bit-identical to a dense baseline on the models tested so far, but the memory numbers are not ready to publish. The benchmark harness currently produces headline compression ratios that are artifacts of how the sweep is wired, and pretending otherwise would not age well.
Description
Status: work in progress. HypercubeQuant is still being built. This is a teaser note from inside the build — directional, not definitive. Expect revisions.
What this is
HypercubeQuant is an experiment in how a single NVIDIA A100 should handle the memory side of transformer serving when the same prefix shows up across many requests. The problem it targets is familiar:
- KV cache dominates serving memory
- repeated prefixes waste both storage and attention work
- strong systems already exist for this (SGLang's RadixAttention, vLLM's automatic prefix caching)
The question the project is asking is narrower than "beat those systems." It is: can a GPU-resident design for exact prefix reuse make the metadata and hot-path work cheap enough to matter on A100, without losing correctness and without relying on approximations?
The "how" is intentionally left out of this note — it is the piece that has to survive contact with real benchmarks before it is worth writing up. This is a progress note, not a systems paper.
The piece that is real today
One result has held up under direct scrutiny and is worth putting on record:
Correctness is bit-exact. On Qwen 2.5 at 0.5B and 7B parameters, a full-context prefill under HypercubeQuant produces identical per-token logits to a dense baseline. Not "close." Not "within tolerance." Identical.
| Model | Dense perplexity | HQ perplexity | Max logit difference |
|---|---|---|---|
| Qwen 2.5 0.5B | 8.4468e6 | 8.4468e6 | 0 |
| Qwen 2.5 7B | 1.0939e9 | 1.0939e9 | 0 |
A zero logit difference is the strictest possible gate for an exact-reuse scheme. The reuse path is not corrupting outputs on the tested set. That is not a headline systems result, but it is the foundation everything else has to stand on.
Alongside that, the design's fixed metadata footprint is small and roughly constant across configurations — on the order of a few tens of megabytes. Not a performance claim, but a property the design needs to have if it is ever going to scale.
The piece that is not real yet
Earlier internal numbers suggested HypercubeQuant was compressing stored KV by more than an order of magnitude — and at the longest tested contexts, by more than 50× against a dense baseline.
That number does not survive scrutiny of the benchmark harness itself.
Without getting into the mechanics: at any given nominal context length, the sweep's "compression ratio" turns out to be arithmetically equal to (nominal context length) / (actual request length). The two backends under comparison report different things on identical workloads: one reports the buffer it pre-allocated for the nominal context, the other reports only the bytes actually written. Because per-request workload size is held roughly constant across every row of the sweep, the "compression" is really just the ratio between an allocated slab and a utilized slab.
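The arithmetic of the artifact is easy to reproduce. The sketch below uses a hypothetical per-token KV byte count (the exact figure cancels out, which is the point): when one side reports an allocated slab and the other reports utilized bytes, the "ratio" tracks the nominal context, not any real compression.

```python
def reported_ratio(nominal_ctx: int, actual_len: int) -> float:
    """Reproduce the harness artifact: the dense backend reports the
    slab it pre-allocated for the nominal context, the other backend
    reports only the bytes actually written. Their ratio collapses to
    nominal_ctx / actual_len regardless of any compression."""
    # Hypothetical KV bytes per token: K+V, 32 heads, head_dim 128, fp16.
    bytes_per_token = 2 * 32 * 128 * 2
    dense_reported = nominal_ctx * bytes_per_token   # allocated, mostly empty
    hq_reported = actual_len * bytes_per_token       # actually written
    return dense_reported / hq_reported

# With request length pinned at 512 tokens across the sweep, the
# "compression" grows with nominal context even though nothing is compressed.
assert reported_ratio(nominal_ctx=2048, actual_len=512) == 4.0
assert reported_ratio(nominal_ctx=32768, actual_len=512) == 64.0
```

At a 32k nominal context and a fixed 512-token request, this mechanism alone manufactures a 64× "win", which is consistent with the 50×-plus headline numbers falling out of the old sweep.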
That is not a result. It is a harness artifact. It would be embarrassing to publish as a compression claim, and it is exactly the kind of number that looks great in a graph and falls apart when someone asks a second question about it.
Why this matters as a teaser, not a takedown
The useful thing about catching this kind of artifact now is that it reshapes the entire near-term plan.
- Compression claims cannot be made yet. The workload has to actually vary in size, and the two backends have to measure the same quantity. Until both of those are true, no number goes out.
- Shared-prefix reuse requires more than one request. A single-request-at-a-time sweep cannot exercise the mechanism the design is built around. The benchmark has to be concurrent, with real cross-request prefix sharing, before any reuse-driven memory win can show up.
- The honest baseline is not a dense forward. It is SGLang's RadixAttention and vLLM's prefix cache — systems that already solve a lot of this problem and do it well. Any win has to be demonstrated against those, not against a softer comparison.
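To make the second requirement concrete, here is a sketch of what a workload generator that can actually exercise cross-request prefix sharing might look like. Everything here is hypothetical, including the function name, the request schema, and the parameter ranges; the real harness will differ.

```python
import random

def make_shared_prefix_workload(n_requests: int = 64, n_prefixes: int = 4,
                                prefix_len: int = 1024, seed: int = 0):
    """Sketch of a workload where prefix reuse can actually occur:
    many concurrent requests drawing from a few distinct shared
    prefixes, with per-request suffix lengths that genuinely vary."""
    rng = random.Random(seed)
    prefixes = [f"prefix-{i}" for i in range(n_prefixes)]  # stand-ins for token ids
    return [
        {
            "prefix_id": rng.choice(prefixes),   # shared across requests
            "prefix_tokens": prefix_len,
            "suffix_tokens": rng.randint(64, 2048),  # workload size varies
        }
        for _ in range(n_requests)
    ]

reqs = make_shared_prefix_workload()
# Both backends must then be measured on bytes actually resident for this
# same request set, not on pre-allocated buffer sizes.
```

The two properties the old sweep lacked are built in: request sizes vary (so a compression ratio cannot be an allocation artifact), and multiple requests share a prefix (so reuse has something to reuse).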
What I want this note to do
This note is a marker:
- It says, out loud, that the correctness property holds and is bit-exact.
- It says, out loud, that none of the memory numbers previously associated with this project should be cited until the harness is redone.
- It commits to publishing the next round of results only against a matched, concurrent, realistic benchmark — and against the systems that already do this well.
If the project is real, it will show up on that comparison. If it isn't, the correction above is the most useful thing the current state of this work can contribute.
A longer follow-up will land once the new benchmarks run. Until then, the only claims I'm willing to stand behind publicly are the two at the top: bit-exact outputs, and a small fixed metadata footprint. Everything else is in flight.
Revisions will follow as the harness is rebuilt and re-measured. This note exists mainly to retract the memory framing early and to set the bar for what the next numbers have to clear before they are publishable.