On March 24, 2026, NVIDIA released driver 595.58.03. It fixes a tensor memory bug on the Blackwell architecture (RTX PRO 6000) and transitions to the Production branch. We immediately applied it to our AI inference server (k1) and benchmarked the performance changes. This version is only available via NVIDIA's official .run file — not yet in apt repositories.
At a glance:
- Driver version: 590 → 595
- Single-inference TPS: +1.6%
- CUDA: upgraded to 13.2
- Power limit: resets to 600W (warning)
1. What Changed in Driver 595
The driver moves from 590 (New Feature branch) to 595 (Production branch). Here are the changes most relevant to AI inference workloads.
| Item | 590.48.01 | 595.58.03 |
|---|---|---|
| Branch | New Feature | Production (stable) |
| CUDA | 13.1 | 13.2 |
| Kernel Module | Proprietary (DKMS) | Open (DKMS) |
| Release Date | 2025-12-08 | 2026-03-24 |
Key AI-Related Changes
CudaNoStablePerfLimit — P0 PState Now Reachable
On 590, CUDA applications were capped at the P2 P-state. Starting with 595, they can reach P0 (the highest clock state) for maximum performance. The trade-off: the GPU now stays at P0 even when idle, so idle power consumption is slightly higher.
cuTensorMapEncodeTiled() Bug Fix — Critical Blackwell Patch
Fixed an illegal memory access error that occurred with tensors smaller than 128KB. This is the key stability patch for Blackwell users (RTX PRO 6000, RTX 5090, etc.).
Improved VRAM → System Memory Fallback
The fallback logic that spills to system memory when VRAM runs out has been improved. Hard to notice with 96GB of VRAM on typical workloads, but beneficial when serving multiple models simultaneously.
Why Isn't It in apt?
Ubuntu official repositories typically take weeks to months to register new drivers. As of 2026-03-25, the nvidia-driver-595 package doesn't exist yet — installation is only possible via the official .run file from NVIDIA's website.
2. Test Environment
We measured on the same server (k1) with driver 590 first, then upgraded to 595 and re-measured under identical conditions. Inference was served with SGLang using the Qwen3-8B-AWQ model.
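For context, a minimal sketch of how a single-request TPS number like ours can be measured against an OpenAI-compatible endpoint (SGLang exposes one). The URL, port, and model name below are assumptions; adjust them to your deployment, and note our actual harness may differ.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput for one request."""
    return completion_tokens / elapsed_s


def measure_once(url: str = "http://localhost:30000/v1/completions",
                 model: str = "Qwen/Qwen3-8B-AWQ") -> float:
    # Send one completion request and time it end-to-end.
    payload = json.dumps({
        "model": model,
        "prompt": "Explain GPU P-states in one paragraph.",
        "max_tokens": 512,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    # The server reports how many tokens it generated.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

End-to-end timing like this includes prefill, so for long prompts the streamed per-token timing is more precise; for short prompts and 512 output tokens the difference is small.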
3. Benchmark Results
SGLang Inference Performance (Qwen3-8B-AWQ)
We ran the same model three times per driver, comparing single-request TPS (tokens/sec) and aggregate TPS across 4 concurrent requests.
| Test | 590 (Before) | 595 (After) | Change |
|---|---|---|---|
| Single Request TPS (Run 1) | 211.6 | 215.2 | +1.7% |
| Single Request TPS (Run 2) | 210.0 | 215.2 | +2.5% |
| Single Request TPS (Run 3) | 213.7 | 215.3 | +0.7% |
| Single Avg TPS | 211.8 | 215.2 | +1.6% |
| Concurrent 4-Req Aggregate TPS | 653.7 | 601.4 | -8.0% |
The -8% Concurrent Drop Isn't Real Performance Loss
The aggregate TPS drop from 653.7 to 601.4 in the 4-concurrent test is within run-to-run variance caused by response-length fluctuation (a consequence of probabilistic sampling, not the driver). The consistent +1.6% improvement across three single-request runs confirms that the actual performance difference is negligible.
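The response-length effect is easy to see with a simplified model (it ignores continuous-batching dynamics, so it only illustrates the direction of the effect): if each request decodes at roughly the same rate, the batch's wall time is set by the longest response, so uneven lengths drag the aggregate number down even though the GPU is no slower.

```python
def aggregate_tps(token_counts: list[int], per_request_tps: float) -> float:
    """Aggregate TPS under a simplified model: each request decodes at a
    fixed per-request rate, and wall time is set by the longest response."""
    wall_time = max(n / per_request_tps for n in token_counts)
    return sum(token_counts) / wall_time


# Four even responses: 400 tokens over 0.5s of wall time.
even = aggregate_tps([100, 100, 100, 100], per_request_tps=200.0)

# Same 400 total tokens, but one long response stretches the wall time,
# so the aggregate figure drops with no change in hardware speed.
uneven = aggregate_tps([50, 100, 100, 150], per_request_tps=200.0)
```

Here `even` works out to 800 TPS and `uneven` to about 533 TPS, a double-digit percentage swing purely from sampling different response lengths.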
CUDA Compute Benchmark (MatMul 4096×4096)
| Precision | 590 (Before) | 595 (After) | Change |
|---|---|---|---|
| FP32 | 77.6 TFLOPS | 77.6 TFLOPS | Same |
| FP16 | 315.3 TFLOPS | 319.6 TFLOPS | +1.4% |
| BF16 | 419.1 TFLOPS | 423.3 TFLOPS | +1.0% |
FP16/BF16 operations — the ones that actually matter for AI inference — show modest 1-1.4% improvement. FP32 is unchanged. Memory bandwidth (Memory Copy BW) is effectively identical at 1468→1467 GB/s.
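A minimal sketch of this kind of MatMul benchmark, assuming PyTorch with CUDA (our actual harness may differ). The FLOP count for a dense N×N matmul is 2·N³, which is all the TFLOPS arithmetic requires.

```python
import time


def matmul_tflops(n: int, elapsed_s: float, iters: int) -> float:
    # A dense N x N @ N x N matmul performs 2 * N^3 floating-point ops.
    return 2 * n**3 * iters / elapsed_s / 1e12


def bench(dtype_name: str = "bfloat16", n: int = 4096, iters: int = 50) -> float:
    """Time `iters` back-to-back matmuls on the GPU and convert to TFLOPS.
    Requires a CUDA-capable PyTorch install."""
    import torch
    dtype = getattr(torch, dtype_name)
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b  # warm-up to exclude kernel-launch and autotuning overhead
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return matmul_tflops(n, time.perf_counter() - start, iters)
```

The `torch.cuda.synchronize()` calls matter: CUDA launches are asynchronous, so timing without them measures launch overhead, not compute.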
4. P-State and Power Limit Changes
More important than the performance numbers are these operational changes. Two things you must check after upgrading.
| Item | 590 (Before) | 595 (After) |
|---|---|---|
| Idle P-State | P8 | P0 |
| Default Power Cap | 350W | 600W (auto-reset) |
| Load GPU Clock | 2602 MHz | 2647 MHz (+1.7%) |
| Load Power Draw | 79.82 W | 257.50 W |
| Load Temperature | 28°C | 38°C |
| Load P-State | P1 | P1 |
Power Limit Resets to 600W — Immediate Reconfiguration Required
After installing driver 595, the Power Limit resets to the default 600W. The RTX PRO 6000's TDP is 300W, but it allows up to 600W. Considering power costs and cooling burden, we reconfigured to 350W. (See our GPU Power Limit Performance Comparison)
```bash
# Set the power limit back to 350W
sudo nvidia-smi -pl 350
# Verify
nvidia-smi --query-gpu=power.limit --format=csv,noheader
```
Idle P-State Changed P8 → P0 — Persistence Mode Needed
On 590, the GPU idled at P8 (lowest power state). Starting with 595, it stays at P0 (highest clocks). This follows from the CudaNoStablePerfLimit change: it eliminates the clock-stabilization delay on first CUDA app launch, but slightly increases idle power consumption. Enable persistence mode if it isn't already enabled.
```bash
# Enable persistence mode
sudo nvidia-smi -pm 1
```
5. Upgrade Process Notes
Since 595 isn't in apt, you must install via NVIDIA's official .run file. Here are the key pitfalls we encountered.
Stop X Server / GDM + Unload Kernel Modules
The .run installer will fail if nvidia kernel modules are in use. Stop the display manager and unload the modules first.
```bash
sudo systemctl stop gdm
sudo modprobe -r nvidia-drm nvidia-modeset nvidia-uvm nvidia
```
Manual DKMS Build After Kernel Changes
If you upgraded the kernel before or after installing the driver, you need to manually build DKMS modules for the current kernel.
```bash
sudo dkms build nvidia/595.58.03 -k $(uname -r)
sudo dkms install nvidia/595.58.03 -k $(uname -r)
sudo modprobe nvidia nvidia-uvm nvidia-modeset nvidia-drm
```
Reset Power Limit After Every Reboot
The Power Limit reverts to 600W on each reboot. Add the reset command to your startup script or a systemd service.
```bash
# Add to /etc/rc.local or a systemd ExecStart
nvidia-smi -pl 350
nvidia-smi -pm 1
```
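For the systemd route, a small oneshot unit is one way to do it. A sketch, assuming a unit file at a path of your choosing such as `/etc/systemd/system/nvidia-power-limit.service` (the unit name is ours, not a standard one):

```ini
[Unit]
Description=Set NVIDIA power limit and persistence mode at boot
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -pl 350

[Install]
WantedBy=multi-user.target
```

Enable it once with `sudo systemctl enable --now nvidia-power-limit.service`; `Type=oneshot` lets both `ExecStart` lines run to completion in order.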
Key Takeaways
- ✓ 595 brings Production branch + CUDA 13.2 + Blackwell tensor bug fix — worth upgrading for long-term server stability
- ✓ AI inference improvement is +1.6% single, CUDA FP16/BF16 +1.0-1.4% — barely noticeable in practice
- ✓ Power limit auto-resets to 600W after upgrade — must immediately reconfigure to your target (e.g., 350W)
- ✓ Idle P-state changed P8 → P0 — eliminates CUDA app startup latency but slightly increases idle power
- ✓ As of 2026-03-25, not in apt — install only via NVIDIA's official .run file
Conclusion
Driver 595 isn't about dramatic performance gains. Its value lies in fixing the Blackwell tensor memory bug, upgrading to CUDA 13.2, and moving off the New Feature branch onto the stable Production branch.
For AI inference servers, we recommend upgrading for stability + CUDA modernization. However, the 600W Power Limit auto-reset is a must-check. Miss it, and you'll see direct impact on power bills and GPU temperatures.
Once it lands in apt, installation will be simpler. But for Blackwell users who need the fix now, the .run file installation is well worth the effort.
This test was conducted on March 25, 2026 on the k1 server (RTX PRO 6000 Blackwell). Measured sequentially on the same server: 590 first, then 595. All benchmark numbers are actual measurements. Non-commercial sharing is free, but for commercial use, please contact us.
Need AI Infrastructure?
Treeru designs local LLM infrastructure based on RTX PRO 6000 and other enterprise GPUs.