HypercubeQuant: early notes from an A100 KV-cache experiment
Correctness is bit-exact on the models tested; the memory story needs a harness that isn't rigged in my favor
Objective
HypercubeQuant is an in-progress inference-time memory experiment for transformer serving on a single NVIDIA A100. This is an early teaser note: what it is, where it sits in the space, what the current evidence actually supports, and what the next gates are. The short version: outputs are bit-identical to a dense baseline on the models tested so far, but the memory numbers are not ready to publish. The benchmark harness currently produces headline compression ratios that are artifacts of how the sweep is wired, and pretending otherwise would not age well.
Description
Status: work in progress. HypercubeQuant is still being built. This is a teaser note from inside the build — directional, not definitive. Expect revisions.
What this is
HypercubeQuant is an experiment in how a single NVIDIA A100 should handle the memory side of transformer serving when the same prefix shows up across many requests. The problem it targets is familiar:
- KV cache dominates serving memory
- repeated prefixes waste both storage and attention work
- strong systems already exist for this (SGLang's RadixAttention, vLLM's automatic prefix caching)
The question the project is asking is narrower than "beat those systems." It is: can a GPU-resident design for exact prefix reuse make the metadata and hot-path work cheap enough to matter on A100, without losing correctness and without relying on approximations?
The "how" is intentionally left out of this note — it is the piece that has to survive contact with real benchmarks before it is worth writing up. This is a progress note, not a systems paper.
The piece that is real today
One result has held up under direct scrutiny and is worth putting on record:
Correctness is bit-exact. On Qwen 2.5 at 0.5B and 7B parameters, a full-context prefill under HypercubeQuant produces identical per-token logits to a dense baseline. Not "close." Not "within tolerance." Identical.
| Model | Dense perplexity | HQ perplexity | Max logit difference |
|---|---|---|---|
| Qwen 2.5 0.5B | 8.4468e6 | 8.4468e6 | 0 |
| Qwen 2.5 7B | 1.0939e9 | 1.0939e9 | 0 |
A zero logit difference is the strictest possible gate for an exact-reuse scheme. The reuse path is not corrupting outputs on the tested set. That is not a headline systems result, but it is the foundation everything else has to stand on.
Alongside that, the design's fixed metadata footprint is small and roughly constant across configurations — on the order of a few tens of megabytes. Not a performance claim, but a property the design needs to have if it is ever going to scale.
The piece that is not real yet
Earlier internal numbers suggested HypercubeQuant was compressing stored KV by more than an order of magnitude — and at the longest tested contexts, by more than 50× against a dense baseline.
That number does not survive scrutiny of the benchmark harness itself.
Without getting into the mechanics: at any given nominal context length, the sweep's "compression ratio" turns out to be arithmetically equal to (nominal context length) / (actual request length). The two backends under comparison report different things on identical workloads: one reports the buffer it pre-allocated for the nominal context, the other reports only the bytes actually written. Because per-request workload size is held roughly constant across every row of the sweep, the "compression" is really just the ratio between an allocated slab and a utilized slab.
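The arithmetic of the artifact is easy to reproduce. The sketch below uses a hypothetical per-token KV byte count (the exact figure cancels out, which is the point): when one side reports an allocated slab and the other reports utilized bytes, the "ratio" tracks the nominal context, not any real compression.

```python
def reported_ratio(nominal_ctx: int, actual_len: int) -> float:
    """Reproduce the harness artifact: the dense backend reports the
    slab it pre-allocated for the nominal context, the other backend
    reports only the bytes actually written. Their ratio collapses to
    nominal_ctx / actual_len regardless of any compression."""
    # Hypothetical KV bytes per token: K+V, 32 heads, head_dim 128, fp16.
    bytes_per_token = 2 * 32 * 128 * 2
    dense_reported = nominal_ctx * bytes_per_token   # allocated, mostly empty
    hq_reported = actual_len * bytes_per_token       # actually written
    return dense_reported / hq_reported

# With request length pinned at 512 tokens across the sweep, the
# "compression" grows with nominal context even though nothing is compressed.
assert reported_ratio(nominal_ctx=2048, actual_len=512) == 4.0
assert reported_ratio(nominal_ctx=32768, actual_len=512) == 64.0
```

At a 32k nominal context and a fixed 512-token request, this mechanism alone manufactures a 64× "win", which is consistent with the 50×-plus headline numbers falling out of the old sweep.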
That is not a result. It is a harness artifact. It would be embarrassing to publish as a compression claim, and it is exactly the kind of number that looks great in a graph and falls apart when someone asks a second question about it.
Why this matters as a teaser, not a takedown
The useful thing about catching this kind of artifact now is that it reshapes the entire near-term plan.
- Compression claims cannot be made yet. The workload has to actually vary in size, and the two backends have to measure the same quantity. Until both of those are true, no number goes out.
- Shared-prefix reuse requires more than one request. A single-request-at-a-time sweep cannot exercise the mechanism the design is built around. The benchmark has to be concurrent, with real cross-request prefix sharing, before any reuse-driven memory win can show up.
- The honest baseline is not a dense forward. It is SGLang's RadixAttention and vLLM's prefix cache — systems that already solve a lot of this problem and do it well. Any win has to be demonstrated against those, not against a softer comparison.
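To make the second requirement concrete, here is a sketch of what a workload generator that can actually exercise cross-request prefix sharing might look like. Everything here is hypothetical, including the function name, the request schema, and the parameter ranges; the real harness will differ.

```python
import random

def make_shared_prefix_workload(n_requests: int = 64, n_prefixes: int = 4,
                                prefix_len: int = 1024, seed: int = 0):
    """Sketch of a workload where prefix reuse can actually occur:
    many concurrent requests drawing from a few distinct shared
    prefixes, with per-request suffix lengths that genuinely vary."""
    rng = random.Random(seed)
    prefixes = [f"prefix-{i}" for i in range(n_prefixes)]  # stand-ins for token ids
    return [
        {
            "prefix_id": rng.choice(prefixes),   # shared across requests
            "prefix_tokens": prefix_len,
            "suffix_tokens": rng.randint(64, 2048),  # workload size varies
        }
        for _ in range(n_requests)
    ]

reqs = make_shared_prefix_workload()
# Both backends must then be measured on bytes actually resident for this
# same request set, not on pre-allocated buffer sizes.
```

The two properties the old sweep lacked are built in: request sizes vary (so a compression ratio cannot be an allocation artifact), and multiple requests share a prefix (so reuse has something to reuse).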
What I want this note to do
This note is a marker:
- It says, out loud, that the correctness property holds and is bit-exact.
- It says, out loud, that none of the memory numbers previously associated with this project should be cited until the harness is redone.
- It commits to publishing the next round of results only against a matched, concurrent, realistic benchmark — and against the systems that already do this well.
If the project is real, it will show up on that comparison. If it isn't, the correction above is the most useful thing the current state of this work can contribute.
A longer follow-up will land once the new benchmarks run. Until then, the only claims I'm willing to stand behind publicly are the two at the top: bit-exact outputs, and a small fixed metadata footprint. Everything else is in flight.
Revisions will follow as the harness is rebuilt and re-measured. This note exists mainly to retract the memory framing early and to set the bar for what the next numbers have to clear before they are publishable.