TL;DR
Installing flash-attn often means long compile times and CUDA compatibility headaches. flashinfer gives you FlashAttention-class inference kernels as ≈30 MB prebuilt wheels that install in < 5 s on the CUDA/PyTorch combos listed below (CUDA 11.8–12.4, PyTorch 2.3–2.6).
No ninja, no Triton, no blood sacrifice.
the pain
```bash
# You, 45 minutes ago
pip install flash-attn
# ... why isn't it working ...
pip install flash-attn --no-build-isolation
# error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1
```
If you've spent the last hour watching `pip install flash-attn` get stuck on "Building wheel for flash-attn (setup.py)", welcome to the club. Flash-attn's build-from-source dance is the new "it compiles on my laptop."
- 2 hours single-threaded if you forgot ninja.
- 3–5 minutes with ninja, if nothing's mismatched.
- Or it just dies because your minor CUDA patch version hurt its feelings.
enter flashinfer
flashinfer is different: it ships prebuilt wheels for most CUDA/PyTorch combos, and it just... works. A ≈30 MB download, done.
```bash
# This actually completes in seconds
pip install flashinfer-python
```
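A quick post-install sanity check (the `__version__` attribute is an assumption about the package; a clean import is the real test):

```bash
# __version__ assumed to exist; if the import succeeds, the wheel matched your environment
python -c "import flashinfer; print(flashinfer.__version__)"
```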
Wheels cover:
| CUDA | PyTorch |
|---|---|
| 11.8 | 2.3, 2.4, 2.5 |
| 12.1 | 2.3, 2.4, 2.5, 2.6 |
| 12.4 | 2.4, 2.5, 2.6 |
Linux only for now.
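If pip grabs a build that doesn't match your stack, you can pin the CUDA/PyTorch combo explicitly via the project's wheel index. The `cu121/torch2.4` path below is an assumed example of the index layout; check flashinfer.ai/whl for the paths that actually exist.

```bash
# Pin CUDA 12.1 + PyTorch 2.4 (index path layout assumed; see flashinfer.ai/whl)
pip install flashinfer-python -i https://flashinfer.ai/whl/cu121/torch2.4
```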
why it's faster (for inference)
flashinfer isn't "flash-attn with easier packaging."
It's inference-only and exploits that:
- Decode-phase attention is memory-bandwidth-bound → kernels tuned for streaming the KV cache, not for gradient sync or a backward pass.
- PagedAttention → paged KV cache with block-table management, no copying when sequences grow (sketched after this list).
- Cascade attention → hierarchical caches for shared prefixes.
- Fused RoPE/ALiBi → one kernel instead of three memory passes.
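To make the paged-KV-cache point concrete, here is a minimal, hypothetical sketch in plain PyTorch (not flashinfer's actual API): one shared pool of fixed-size pages plus a per-request block table, so growing a sequence means appending a page index rather than reallocating and copying its whole cache.

```python
import torch

PAGE_SIZE = 16            # tokens per KV page
NUM_PAGES = 1024          # pages in the shared pool
NUM_KV_HEADS, HEAD_DIM = 8, 128
device = "cuda" if torch.cuda.is_available() else "cpu"

# One pool of KV pages shared by every request:
# [num_pages, 2 (K and V), page_size, num_kv_heads, head_dim]
kv_pool = torch.zeros(NUM_PAGES, 2, PAGE_SIZE, NUM_KV_HEADS, HEAD_DIM,
                      dtype=torch.float16, device=device)

# Block table: for each request, the page indices it owns, in logical order.
# Growing a sequence appends a page index; nothing already cached is copied.
block_tables = {0: [3, 17, 42], 1: [5]}   # request_id -> page indices

def append_kv(request_id: int, pos: int, k: torch.Tensor, v: torch.Tensor) -> None:
    """Write the K/V of the token at position `pos` into its page slot.
    (Allocating a fresh page when the last one fills up is omitted here.)"""
    page = block_tables[request_id][pos // PAGE_SIZE]
    slot = pos % PAGE_SIZE
    kv_pool[page, 0, slot] = k        # k, v: [num_kv_heads, head_dim]
    kv_pool[page, 1, slot] = v

# e.g. token 47 of request 0 lands in its third page (index 42), slot 15
append_kv(0, 47,
          torch.randn(NUM_KV_HEADS, HEAD_DIM, dtype=torch.float16, device=device),
          torch.randn(NUM_KV_HEADS, HEAD_DIM, dtype=torch.float16, device=device))
```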
Numbers on H100:
| Task | Improvement vs. SOTA baselines |
|---|---|
| inter-token decode | 29–69 % lower latency |
| batched generation | 13–17 % higher throughput |
already in production
- TensorRT-LLM kernels now ship via flashinfer.
- vLLM, SGLang, MLC-Engine have opt-in backends.
- AOT compilation for prod (no JIT surprises).
3-line drop-in
```python
import flashinfer as fi

# Single-request prefill.
# q: [qo_len, num_qo_heads, head_dim]; k, v: [kv_len, num_kv_heads, head_dim]
output = fi.single_prefill_with_kv_cache(
    q, k, v,
    causal=True,
    pos_encoding_mode="ROPE_LLAMA",  # RoPE fused into the attention kernel
)
```
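And the decode-phase counterpart, as a sketch under the assumption that `single_decode_with_kv_cache` takes a single query token of shape `[num_qo_heads, head_dim]` against an NHD-layout KV cache (double-check shapes against the docs):

```python
import torch
import flashinfer as fi

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096

# Decode: one new query token attends over the full cached KV (CUDA required).
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

out = fi.single_decode_with_kv_cache(q, k, v)   # [num_qo_heads, head_dim]
```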
should you switch?
| You | Recommendation |
|---|---|
| Serving LLMs in prod | Yes, yesterday. |
| Training models | Stick with flash-attn for now. |
| Installing flash-attn just for inference | Save yourself; use flashinfer. |
links
- GitHub: github.com/flashinfer-ai/flashinfer
- Wheels index: flashinfer.ai/whl