What it does

This server diagnoses CUDA compatibility issues, the root cause of most GPU initialization failures in PyTorch, TensorFlow, JAX, and other frameworks. It scans your driver version, CUDA toolkit, cuDNN library, and installed Python packages, then identifies mismatches (for example, the driver supports up to CUDA 11.8 but CUDA 12.4 wheels are installed). It can also validate Docker GPU configurations, check compute capability for specific GPU architectures, detect Python version conflicts with AI libraries, and generate safe pip install commands. The MCP integration exposes 11 tools, including full environment diagnostics, component-specific checks, CUDA installation steps, and AI model memory validation.
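To make the mismatch idea concrete, here is a minimal Python sketch of a driver-versus-wheel comparison. It is not this server's implementation: the function names are hypothetical, and the rule (the wheel's CUDA version must not exceed the highest version the driver supports) is a simplification that ignores CUDA minor-version compatibility.

```python
import re


def cuda_from_wheel_tag(package_version):
    """Return (major, minor) from a local version tag such as '2.3.0+cu124'.

    Returns None for CPU-only or unrecognized tags (e.g. '2.3.0+cpu').
    Assumes the common 'cuXYZ' convention where the last digit is the minor.
    """
    match = re.search(r"\+cu(\d+)(\d)$", package_version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_dotted(version):
    """Turn a dotted version such as '11.8' into a comparable (11, 8) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def wheel_is_supported(driver_max_cuda, package_version):
    """Return True/False for a CUDA wheel, or None when no CUDA tag is found."""
    wheel_cuda = cuda_from_wheel_tag(package_version)
    if wheel_cuda is None:
        return None
    # Simplified rule: the driver must support at least the wheel's CUDA version.
    return wheel_cuda <= parse_dotted(driver_max_cuda)


if __name__ == "__main__":
    # The mismatch described above: an 11.8-capable driver with 12.4 wheels.
    print(wheel_is_supported("11.8", "2.3.0+cu124"))
```

Comparing tuples rather than strings avoids ordering mistakes such as "12.10" sorting before "12.4".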

Who it's for

Data scientists and ML engineers troubleshooting GPU initialization errors locally or in CI/CD pipelines, and platform engineers validating GPU configurations across Docker containers and distributed training setups.

Common use cases

Run a full CUDA/driver/cuDNN compatibility scan in seconds before attempting a fresh PyTorch installation
Check whether a pre-trained model fits in your GPU's available memory before downloading it
Validate Dockerfile GPU configurations for CUDA version mismatches before building
Generate safe pip install commands for extension libraries (flash-attn, xformers) matching your specific driver
Diagnose why torch.cuda.is_available() returns False on a new GPU architecture (e.g., Blackwell)
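The model memory check in the list above can be approximated by hand. The Python sketch below is a rough estimate and not the server's actual method: the function names are hypothetical, and the 1.2 overhead factor for activations and runtime buffers is an assumption that varies by model and workload.

```python
GIB = 1024 ** 3


def estimate_weights_gib(num_params, bytes_per_param=2):
    """Estimate the memory needed for model weights alone, in GiB.

    bytes_per_param is 2 for fp16/bf16 weights and 4 for fp32.
    """
    return num_params * bytes_per_param / GIB


def fits_in_memory(num_params, free_gib, bytes_per_param=2, overhead=1.2):
    """Return True if the weights plus an assumed overhead fit in free_gib."""
    required = estimate_weights_gib(num_params, bytes_per_param) * overhead
    return required <= free_gib


if __name__ == "__main__":
    # A 7-billion-parameter model in fp16 needs about 13 GiB for weights alone.
    print(round(estimate_weights_gib(7_000_000_000), 1))
    print(fits_in_memory(7_000_000_000, free_gib=24.0))
    print(fits_in_memory(7_000_000_000, free_gib=12.0))
```

An estimate like this only screens out models that clearly will not fit; long context windows or large batch sizes can push real usage well above it.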

Setup pitfalls

Requires filesystem write and network access permissions to query driver info and optionally install CUDA; consider sandboxing it or restricting it to trusted contexts
CUDA installation via the --run flag requires administrative privileges; CI/CD integration needs environment-specific handling (GitHub Actions, GitLab CI, etc.)
Some CUDA diagnostics rely on nvidia-smi and driver-level introspection; virtualized or WSL2 environments may report incomplete GPU state
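The last pitfall is easy to hit in scripts of your own. The Python sketch below shows one defensive way to query nvidia-smi; it illustrates the failure modes and is not the server's code. The helper names are hypothetical, and the handling of unavailable fields (stored as None when the value is not a plain number, such as "[N/A]") is an assumption about how incomplete state tends to appear.

```python
import shutil
import subprocess

QUERY_FIELDS = "name,driver_version,memory.total"


def parse_gpu_rows(output):
    """Parse 'csv,noheader,nounits' rows into dictionaries.

    Fields that are not plain integers (e.g. '[N/A]') become None instead
    of raising, so partial GPU state does not crash the caller.
    """
    gpus = []
    for line in output.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip rows that do not match the expected layout
        name, driver_version, memory = fields
        gpus.append({
            "name": name,
            "driver_version": driver_version,
            "memory_total_mib": int(memory) if memory.isdigit() else None,
        })
    return gpus


def query_gpus(timeout=10):
    """Return a list of GPU dictionaries, or None if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return None  # binary not on PATH (no driver, or not exposed here)
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=timeout, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None  # non-zero exit, timeout, or failure to launch
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    print(query_gpus())
```

Returning None rather than raising lets the caller distinguish "no usable nvidia-smi" from "nvidia-smi ran but reported zero GPUs" (an empty list).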

env-doctor
