Correctness is proven; the memory story needs a harness that isn't rigged in my favor
HypercubeQuant is an in-progress inference-time memory experiment for transformer serving on a single NVIDIA A100. This early teaser note covers what the project is, where it sits in the space, what the current evidence actually supports, and what the next gates are. The short version: ou...
A scientific note on what the current evidence actually supports
This note separates three questions that are easy to conflate in prefix-cache work: exact-prefix serving fairness against SGLang's RadixCache, server-side long-context correctness, and tiered compression in the Hugging Face Qwen path. The current evidence supports a narrow but re...