BCR-memory-2 is a Rust prefix-cache data structure. This note puts it on the same table as the two things a user would reasonably compare it to — a trivial Python-dict cache and SGLang's production RadixCache — and walks through what we measured on a single NVIDIA L4. It calls ou...
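For orientation, here is a minimal sketch of what a "trivial Python-dict cache" baseline could look like. This is an assumption for illustration only: the note does not show its baseline, and the class name `DictPrefixCache` and its methods are hypothetical. It stores whole token-ID prefixes as tuple keys and finds the longest cached prefix by probing from longest to shortest.

```python
from typing import Dict, List, Optional, Tuple


class DictPrefixCache:
    """Toy prefix cache (hypothetical): token-ID prefix -> opaque cached value."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], object] = {}

    def insert(self, tokens: List[int], value: object) -> None:
        # The full token sequence is the key; no structure is shared between keys.
        self._store[tuple(tokens)] = value

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[object]]:
        # Probe every prefix from longest to shortest. Each probe hashes up to
        # len(tokens) items, so a lookup is quadratic in the worst case.
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._store:
                return end, self._store[key]
        return 0, None


cache = DictPrefixCache()
cache.insert([1, 2, 3], "kv-a")
matched_len, value = cache.longest_prefix([1, 2, 3, 4])
```

A radix tree avoids the repeated hashing by sharing common prefixes between entries, which is the usual reason a dict baseline serves as a simple point of comparison.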
A hypothesis on why contiguous memory access matters more than algorithmic cleverness for long-context LLM inference
On modern GPU architectures, a KV-cache compression scheme that selects contiguous positional blocks via cheap centroid routing will outperform a scheme that selects semantically clustered but spatially scattered tokens at equal compression ratios — not because it captures more r...
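The selection of contiguous positional blocks via centroid routing can be sketched as follows. This is a sketch under stated assumptions, not the scheme's actual implementation: it assumes each block's centroid is the mean of its key vectors, that blocks are scored by a dot product with the query, and that the top-scoring blocks are kept. The function names, `block_size`, and `top_k` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def block_centroids(keys: List[Vector], block_size: int) -> List[List[float]]:
    """Mean key vector for each contiguous block of positions (assumed centroid)."""
    centroids = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        centroids.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return centroids


def select_blocks(query: Vector, keys: List[Vector],
                  block_size: int, top_k: int) -> List[range]:
    """Return the top_k blocks as contiguous position ranges, in positional order."""
    centroids = block_centroids(keys, block_size)
    # Routing costs one dot product per block, not one per token.
    scores = [dot(query, c) for c in centroids]
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Each selected block is one contiguous span, so it can be read as a single run.
    return [range(i * block_size, min((i + 1) * block_size, len(keys)))
            for i in sorted(best)]


keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
selected = select_blocks([0.0, 1.0], keys, block_size=2, top_k=1)
```

In this toy input, the query aligns with the middle block, so `selected` holds the single range covering positions 2 and 3. A token-level selector at the same compression ratio could instead return arbitrary scattered indices, each of which becomes a separate gather.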