KV Cache, Made Visible: Qwen3-0.6B on Apple Silicon
Written by Opus, reviewed and edited by me.

TL;DR: I built a minimal Qwen3-0.6B in pure PyTorch that runs on Apple Silicon, with a live chat UI that shows KV cache memory and per-token latency side by side. Toggle the cache off and watch attention compute go quadratic in real time. Code: github.com/ricklamers/kvcache-exploration.

I kept reading explanations of the KV cache that all said roughly the same thing: it trades memory for compute, storing the keys and values of prior tokens so attention doesn’t redo O(n²) work at each step, which makes decoding fast. I’ve heard it, you’ve heard it. But “KV cache saves time” is pedagogically thin. It doesn’t tell you how much time, at what memory cost, or what happens when you just don’t have one. ...
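The asymmetry that intro gestures at can be made concrete with a back-of-envelope operation count (a hypothetical sketch, not the repo's code): with a cache, step t attends one new query against t stored keys; without one, it recomputes attention for all t queries against all t keys.

```python
# Hypothetical cost model, not code from the repo: count attention
# score computations per decoding step, with and without a KV cache.

def attn_ops_per_step(t: int, cached: bool) -> int:
    # With a KV cache: 1 new query vs. t cached keys -> O(t) per step.
    # Without: all t queries vs. all t keys recomputed -> O(t^2) per step.
    return t if cached else t * t

def total_ops(n: int, cached: bool) -> int:
    # Total score computations to decode n tokens.
    return sum(attn_ops_per_step(t, cached) for t in range(1, n + 1))

n = 1024
print(total_ops(n, cached=True))   # grows ~ n^2 / 2 overall
print(total_ops(n, cached=False))  # grows ~ n^3 / 3 overall
```

This counts only attention scores and ignores the projections and MLP, but it is enough to see why uncached decoding's per-token latency climbs with context length while cached decoding's stays nearly flat.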