Torch/NCCL version mismatch

Hello team,

I use a cluster equipped with A100 GPUs, which don’t support torch 1.8.1, so I had to install torch 1.9.1. What confuses me is that when I run the code as the tutorial says, it crashes with this error:

‘RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8. ncclUnhandledCudaError: Call to CUDA function failed.’

Is this a version mismatch problem? I’d appreciate any advice. Thank you in advance.

We had similar issues with our A100s early on as well. What resolved the issue was installing CUDA 11.1 instead of 10.2 (making sure your system CUDA version is also 11.1+). Note that you’ll also need to update the PyTorch Geometric packages accordingly.

Some additional information from our A100 system in case the above doesn’t work for you:
NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2.
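
To check which CUDA and NCCL versions your installed torch build actually uses, a minimal sketch like the one below may help. The helper names `arch_supported` and `report` are made up for illustration; the torch calls are the standard ones. An A100 reports compute capability (8, 0), so a build whose architecture list has no `sm_80` entry (typical of CUDA 10.2 builds) would be a likely suspect. The check is a simplification: it only looks for an exact architecture match and ignores PTX forward compatibility.

```python
def arch_supported(capability, arch_list):
    """Return True if arch_list contains an exact sm_XY entry for this GPU.

    capability is a (major, minor) tuple such as (8, 0) for an A100.
    arch_list is a list of strings such as ["sm_70", "sm_80"].
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report():
    """Print the versions that matter when debugging a CUDA/NCCL mismatch."""
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return

    print("torch:", torch.__version__)
    # torch.version.cuda is None on CPU-only builds.
    print("built against CUDA:", torch.version.cuda)

    if not torch.cuda.is_available():
        print("CUDA is not available to this build")
        return

    try:
        # The return type differs between torch releases, so just print it.
        print("bundled NCCL:", torch.cuda.nccl.version())
    except Exception as exc:
        print("could not query NCCL:", exc)

    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("GPU 0:", torch.cuda.get_device_name(0), capability)
    print("compiled architectures:", arch_list)
    print("exact arch match:", arch_supported(capability, arch_list))


if __name__ == "__main__":
    report()
```

If the reported CUDA version is 10.2, or the last line prints False, that would be consistent with the fix described above. Rerunning the failing job with the environment variable NCCL_DEBUG=INFO set should also make NCCL log more detail about the underlying CUDA failure.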