Armanas Povilionis-Muradian

HypercubeQuant: early notes from an A100 KV-cache experiment

3 months ago

Correctness is proven; the memory story needs a harness that isn't rigged in my favor

in-progress

Objective

HypercubeQuant is an in-progress inference-time memory experiment for transformer serving on a single NVIDIA A100. This is an early teaser note: what it is, where it sits in the space, what the current evidence actually supports, and what the next gates are. The short version: ou...

BCR-memory-4: exact long-context compression is real; serving-speed wins are not

3 months ago

A scientific note on what the current evidence actually supports

complete

Objective

This note separates three questions that are easy to conflate in prefix-cache work: exact-prefix serving fairness against SGLang's RadixCache, server-side long-context correctness, and tiered compression in the Hugging Face Qwen path. The current evidence supports a narrow but re...