The Chip That Finally Makes Local AI Worth It
For the last two years, "run AI locally" has been the advice everyone gives and almost nobody follows through on — because the performance gap between local inference and API calls was real. You'd fire up Ollama, load Llama 3.1, and wait. And wait. Five tokens per second isn't usable for anything interactive. That's starting to change. AMD's new Ryzen 9 9950X3D — the $699 chip Linus Tech Tips just reviewed — isn't being marketed as an AI processor. But its architecture, specifically that 128MB of L3 cache, makes it legitimately fast for the local inference workloads developers actually care about. [1]
The context: most consumer CPUs top out at 32-64MB of L3 cache. The 9950X3D doubles that to 128MB by literally stacking an extra layer of fast SRAM directly on top of the die, a technique AMD calls 3D V-Cache. That cache has far lower latency and far higher bandwidth than system RAM. For AI inference, where the model's weights have to be read on every generated token, a larger cache keeps more of the hot data (frequently reused weight blocks, the attention KV cache) close to the cores instead of fetching it from RAM on every pass. The result is noticeably higher tokens-per-second throughput for CPU inference. We're not talking H100 numbers, but for a quantized 7-13B model, a high-cache CPU like this pushes into ranges that are genuinely interactive: 15-25 tokens per second for 7B models with good quantization. That's enough to get real work done. [2]
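To see why those numbers are plausible, note that CPU token generation is roughly memory-bound: each decoded token touches approximately every weight once, so throughput is bounded by how fast the weights can be streamed to the cores. The sketch below runs that back-of-envelope estimate; the bandwidth figure and the 4-bit quantization assumption are illustrative, not measurements of this chip.

```python
# Back-of-envelope: CPU decode throughput is roughly memory-bandwidth-bound.
#   tokens/sec ≈ effective_bandwidth / model_size_in_bytes
# All figures below are illustrative assumptions, not benchmarks.

def est_tokens_per_sec(params_billions: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Estimate decode throughput for a memory-bound CPU inference run."""
    model_bytes = params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# A 7B model at 4-bit quantization is ~3.5 GB of weights.
# Assume ~70 GB/s effective bandwidth (hypothetical dual-channel DDR5 figure).
print(round(est_tokens_per_sec(7, 4, 70), 1))  # → 20.0
```

Twenty tokens per second from a pure bandwidth estimate lands right in the quoted 15-25 range, which is also why cache and quantization matter so much: halving the bits per weight roughly doubles throughput, and every byte served from cache instead of RAM raises the effective bandwidth.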





