On March 24, 2026, NVIDIA released driver 595.58.03. It fixes a tensor memory bug on the Blackwell architecture (RTX PRO 6000) and transitions to the Production branch. We immediately applied it to our AI inference server (k1) and benchmarked the performance changes. This version is only available via NVIDIA's official .run file — not yet in apt repositories.
At a glance:
- Driver version: 590 → 595
- Single-inference TPS: +1.6%
- CUDA: upgraded to 13.2
- Power limit: resets to 600W (warning)
1. What Changed in Driver 595
The driver moves from 590 (New Feature branch) to 595 (Production branch). Here are the changes most relevant to AI inference workloads.
| Item | 590.48.01 | 595.58.03 |
|---|---|---|
| Branch | New Feature | Production (stable) |
| CUDA | 13.1 | 13.2 |
| Kernel Module | Proprietary (DKMS) | Open (DKMS) |
| Release Date | 2025-12-08 | 2026-03-24 |
Key AI-Related Changes
CudaNoStablePerfLimit — P0 PState Now Reachable
On 590, CUDA applications were capped at the P2 P-state. Starting with 595, they can reach P0 (the highest clock state) for maximum performance. The trade-off: the GPU now stays at P0 even when idle, so idle power consumption is slightly higher.
cuTensorMapEncodeTiled() Bug Fix — Critical Blackwell Patch
Fixed an illegal memory access error that occurred with tensors smaller than 128KB. This is the key stability patch for Blackwell users (RTX PRO 6000, RTX 5090, etc.).
Improved VRAM → System Memory Fallback
The fallback logic that spills to system memory when VRAM runs out has been improved. Hard to notice with 96GB of VRAM on typical workloads, but beneficial when serving multiple models simultaneously.
Why Isn't It in apt?
Ubuntu official repositories typically take weeks to months to register new drivers. As of 2026-03-25, the nvidia-driver-595 package doesn't exist yet — installation is only possible via the official .run file from NVIDIA's website.
2. Test Environment
We measured on the same server (k1) with driver 590 first, then upgraded to 595 and re-measured under identical conditions. Inference was served with SGLang using the Qwen3-8B-AWQ model.
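For context, a minimal sketch of how a single-request TPS number like ours can be measured against an OpenAI-compatible endpoint (SGLang exposes one). The URL, port, and model name below are assumptions; adjust them to your deployment, and note our actual harness may differ.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput for one request."""
    return completion_tokens / elapsed_s


def measure_once(url: str = "http://localhost:30000/v1/completions",
                 model: str = "Qwen/Qwen3-8B-AWQ") -> float:
    # Send one completion request and time it end-to-end.
    payload = json.dumps({
        "model": model,
        "prompt": "Explain GPU P-states in one paragraph.",
        "max_tokens": 512,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    # The server reports how many tokens it generated.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

End-to-end timing like this includes prefill, so for long prompts the streamed per-token timing is more precise; for short prompts and 512 output tokens the difference is small.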
3. Benchmark Results
SGLang Inference Performance (Qwen3-8B-AWQ)
We ran the same model three times per driver, comparing single-request TPS (tokens/sec) and aggregate TPS across 4 concurrent requests.
| Test | 590 (Before) | 595 (After) | Change |
|---|---|---|---|
| Single Request TPS (Run 1) | 211.6 | 215.2 | +1.7% |
| Single Request TPS (Run 2) | 210.0 | 215.2 | +2.5% |
| Single Request TPS (Run 3) | 213.7 | 215.3 | +0.7% |
| Single Avg TPS | 211.8 | 215.2 | +1.6% |
| Concurrent 4-Req Aggregate TPS | 653.7 | 601.4 | -8.0% |
The -8% Concurrent Drop Isn't Real Performance Loss
The aggregate TPS drop from 653.7 to 601.4 in the 4-concurrent test is within run-to-run variance caused by response-length fluctuation (a consequence of probabilistic sampling, not the driver). The consistent +1.6% improvement across three single-request runs confirms that the actual performance difference is negligible.
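The response-length effect is easy to see with a simplified model (it ignores continuous-batching dynamics, so it only illustrates the direction of the effect): if each request decodes at roughly the same rate, the batch's wall time is set by the longest response, so uneven lengths drag the aggregate number down even though the GPU is no slower.

```python
def aggregate_tps(token_counts: list[int], per_request_tps: float) -> float:
    """Aggregate TPS under a simplified model: each request decodes at a
    fixed per-request rate, and wall time is set by the longest response."""
    wall_time = max(n / per_request_tps for n in token_counts)
    return sum(token_counts) / wall_time


# Four even responses: 400 tokens over 0.5s of wall time.
even = aggregate_tps([100, 100, 100, 100], per_request_tps=200.0)

# Same 400 total tokens, but one long response stretches the wall time,
# so the aggregate figure drops with no change in hardware speed.
uneven = aggregate_tps([50, 100, 100, 150], per_request_tps=200.0)
```

Here `even` works out to 800 TPS and `uneven` to about 533 TPS, a double-digit percentage swing purely from sampling different response lengths.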
CUDA Compute Benchmark (MatMul 4096×4096)
| Precision | 590 (Before) | 595 (After) | Change |
|---|---|---|---|
| FP32 | 77.6 TFLOPS | 77.6 TFLOPS | Same |
| FP16 | 315.3 TFLOPS | 319.6 TFLOPS | +1.4% |
| BF16 | 419.1 TFLOPS | 423.3 TFLOPS | +1.0% |
FP16/BF16 operations — the ones that actually matter for AI inference — show modest 1-1.4% improvement. FP32 is unchanged. Memory bandwidth (Memory Copy BW) is effectively identical at 1468→1467 GB/s.
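A minimal sketch of this kind of MatMul benchmark, assuming PyTorch with CUDA (our actual harness may differ). The FLOP count for a dense N×N matmul is 2·N³, which is all the TFLOPS arithmetic requires.

```python
import time


def matmul_tflops(n: int, elapsed_s: float, iters: int) -> float:
    # A dense N x N @ N x N matmul performs 2 * N^3 floating-point ops.
    return 2 * n**3 * iters / elapsed_s / 1e12


def bench(dtype_name: str = "bfloat16", n: int = 4096, iters: int = 50) -> float:
    """Time `iters` back-to-back matmuls on the GPU and convert to TFLOPS.
    Requires a CUDA-capable PyTorch install."""
    import torch
    dtype = getattr(torch, dtype_name)
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b  # warm-up to exclude kernel-launch and autotuning overhead
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return matmul_tflops(n, time.perf_counter() - start, iters)
```

The `torch.cuda.synchronize()` calls matter: CUDA launches are asynchronous, so timing without them measures launch overhead, not compute.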
4. P-State and Power Limit Changes
More important than the performance numbers are these operational changes. Two things you must check after upgrading.
| Item | 590 (Before) | 595 (After) |
|---|---|---|
| Idle P-State | P8 | P0 |
| Default Power Cap | 350W | 600W (auto-reset) |
| Load GPU Clock | 2602 MHz | 2647 MHz (+1.7%) |
| Load Power Draw | 79.82 W | 257.50 W |
| Load Temperature | 28°C | 38°C |
| Load P-State | P1 | P1 |
Power Limit Resets to 600W — Immediate Reconfiguration Required
After installing driver 595, the Power Limit resets to the default 600W. The RTX PRO 6000's TDP is 300W, but it allows up to 600W. Considering power costs and cooling burden, we reconfigured to 350W. (See our GPU Power Limit Performance Comparison)
```bash
# Set the power limit back to 350W
sudo nvidia-smi -pl 350
# Verify
nvidia-smi --query-gpu=power.limit --format=csv,noheader
```
Idle P-State Changed P8 → P0 — Persistence Mode Needed
On 590, the GPU idled at P8 (lowest power state). Starting with 595, it stays at P0 (highest clocks). This follows from the CudaNoStablePerfLimit change: it eliminates the clock-stabilization delay on first CUDA app launch, but slightly increases idle power consumption. Enable persistence mode if it isn't already enabled.
```bash
# Enable persistence mode
sudo nvidia-smi -pm 1
```
5. Upgrade Process Notes
Since 595 isn't in apt, you must install via NVIDIA's official .run file. Here are the key pitfalls we encountered.
Stop X Server / GDM + Unload Kernel Modules
The .run installer will fail if nvidia kernel modules are in use. Stop the display manager and unload the modules first.
```bash
sudo systemctl stop gdm
sudo modprobe -r nvidia-drm nvidia-modeset nvidia-uvm nvidia
```
Manual DKMS Build After Kernel Changes
If you upgraded the kernel before or after installing the driver, you need to manually build DKMS modules for the current kernel.
```bash
sudo dkms build nvidia/595.58.03 -k $(uname -r)
sudo dkms install nvidia/595.58.03 -k $(uname -r)
sudo modprobe nvidia nvidia-uvm nvidia-modeset nvidia-drm
```
Reset Power Limit After Every Reboot
The Power Limit reverts to 600W on each reboot. Add the reset command to your startup script or a systemd service.
```bash
# Add to /etc/rc.local or a systemd ExecStart
nvidia-smi -pl 350
nvidia-smi -pm 1
```
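For the systemd route, a small oneshot unit is one way to do it. A sketch, assuming a unit file at a path of your choosing such as `/etc/systemd/system/nvidia-power-limit.service` (the unit name is ours, not a standard one):

```ini
[Unit]
Description=Set NVIDIA power limit and persistence mode at boot
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -pl 350

[Install]
WantedBy=multi-user.target
```

Enable it once with `sudo systemctl enable --now nvidia-power-limit.service`; `Type=oneshot` lets both `ExecStart` lines run to completion in order.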
Key Takeaways
- ✓ 595 brings Production branch + CUDA 13.2 + Blackwell tensor bug fix — worth upgrading for long-term server stability
- ✓ AI inference improvement is +1.6% single, CUDA FP16/BF16 +1.0-1.4% — barely noticeable in practice
- ✓ Power limit auto-resets to 600W after upgrade — must immediately reconfigure to your target (e.g., 350W)
- ✓ Idle P-state changed P8 → P0 — eliminates CUDA app startup latency but slightly increases idle power
- ✓ As of 2026-03-25, not in apt — install only via NVIDIA's official .run file
Conclusion
Driver 595 isn't about dramatic performance gains. Its value lies in fixing the Blackwell tensor memory bug, upgrading to CUDA 13.2, and moving off the New Feature branch onto the stable Production branch.
For AI inference servers, we recommend upgrading for stability + CUDA modernization. However, the 600W Power Limit auto-reset is a must-check. Miss it, and you'll see direct impact on power bills and GPU temperatures.
Once it lands in apt, installation will be simpler. But for Blackwell users who need the fix now, the .run file installation is well worth the effort.
This test was conducted on March 25, 2026 on the k1 server (RTX PRO 6000 Blackwell). Measured sequentially on the same server: 590 first, then 595. All benchmark numbers are actual measurements. Non-commercial sharing is free, but for commercial use, please contact us.
Need AI Infrastructure?
Treeru designs local LLM infrastructure based on RTX PRO 6000 and other enterprise GPUs.