Why does real-world LLM speed vary?

Real-world speed varies depending on the inference engine used (such as Ollama, llama.cpp, vLLM, or MLX), driver versions, and background system load. Therefore, calculator results should be treated as planning estimates rather than exact guarantees.

Free Speed Estimator

GPU Tokens Per Second Calculator: LLM Speed Estimator

How is tokens per second calculated?

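Tokens per second is estimated using the formula: (GPU memory bandwidth × roughly 70% efficiency) ÷ active model weight size. This models the memory-bandwidth-bound nature of large language model (LLM) inference: generating each token requires reading the active weights from memory, so memory bandwidth, rather than raw compute, usually sets the ceiling on speed.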
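As a minimal sketch of the formula above, the following Python function computes the estimate. The function name `estimate_tokens_per_second`, its parameter names, and the example figures (1,000 GB/s of memory bandwidth and 5 GB of active weights) are illustrative assumptions, not values or code taken from the calculator itself.

```python
def estimate_tokens_per_second(bandwidth_gb_s: float,
                               active_weights_gb: float,
                               efficiency: float = 0.70) -> float:
    """Estimate generation speed for memory-bandwidth-bound inference.

    Implements (memory bandwidth x efficiency) / active model weight size.
    Both sizes must use the same unit (GB here) so the result is tokens/sec.
    """
    if bandwidth_gb_s <= 0 or active_weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return (bandwidth_gb_s * efficiency) / active_weights_gb


# Hypothetical hardware and model, chosen only to make the arithmetic easy:
# 1,000 GB/s bandwidth, 5 GB of active weights, default 70% efficiency.
estimate = estimate_tokens_per_second(1000.0, 5.0)
print(f"{estimate:.1f} tokens/sec")
```

With these assumed inputs the arithmetic is 1,000 × 0.70 ÷ 5, or about 140 tokens per second. The efficiency is left as a parameter because the 70% figure is itself an approximation.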

Pick your GPU (or Mac) and a model to get an estimated tokens-per-second figure, then see how it stacks up against other hardware.

Your Setup

Hardware (GPU / Mac)

AI Model

Quantization

How we calculate this

Speed: (Memory bandwidth × ~70% efficiency) ÷ active model weight size. This models the memory-bandwidth-bound nature of LLM inference.

Note: Real-world speed varies by engine (Ollama, llama.cpp, vLLM, MLX), driver version, and system load. Treat this as a planning estimate, not a guarantee.

Live Speed Readout

Select your hardware and a model on the left to estimate speed.

GPU Speed Comparison

Estimated tokens/sec for this model across popular hardware.

Speed vs. Quality

Same hardware, different quantization levels.

Quant	Quality	Tokens/Sec
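To illustrate why quantization changes the estimate on the same hardware, here is a rough sketch. It assumes weight size is approximately parameter count × bits per weight ÷ 8, and it uses nominal bit widths (16, 8, 4) that ignore the per-block overhead of real quantization formats. The bandwidth, the parameter count, and all function and variable names are hypothetical, and the sketch says nothing about the quality column.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits / 8 bits per byte."""
    return params_billions * bits_per_weight / 8.0


def tokens_per_second(bandwidth_gb_s: float, size_gb: float,
                      efficiency: float = 0.70) -> float:
    """(memory bandwidth x efficiency) / active model weight size."""
    return bandwidth_gb_s * efficiency / size_gb


def quantization_sweep(bandwidth_gb_s: float, params_billions: float,
                       bit_widths: dict) -> dict:
    """Return estimated tokens/sec for each labelled nominal bit width."""
    return {
        label: tokens_per_second(bandwidth_gb_s,
                                 weights_gb(params_billions, bits))
        for label, bits in bit_widths.items()
    }


# Hypothetical inputs: 1,000 GB/s bandwidth and an 8-billion-parameter model.
# Nominal bit widths only; real formats carry extra overhead per weight.
NOMINAL_BITS = {"16-bit": 16, "8-bit": 8, "4-bit": 4}

for label, tps in quantization_sweep(1000.0, 8.0, NOMINAL_BITS).items():
    print(f"{label}: {tps:.1f} tokens/sec")
```

Under this model, halving the bits per weight halves the bytes read per token and so doubles the estimated speed.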