Torch/NCCL version mismatch

Hello team,

I use a cluster equipped with A100 GPUs, which don’t support torch 1.8.1, so I had to install torch 1.9.1. What confuses me is that when I run the code as the tutorial says, it crashes with the error:

‘RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8. ncclUnhandledCudaError: Call to CUDA function failed.’

Is this a version mismatch problem? I’d appreciate any advice. Thank you in advance.

We had similar issues with our A100s early on as well. What resolved the issue was installing CUDA 11.1 instead of 10.2 (and making sure your system CUDA version is also 11.1+). Note that you’ll also need to update the PyTorch Geometric packages.

Some additional information from our A100 system in case the above doesn’t work for you:
NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2.
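
In case it helps to compare against the numbers above, here is a minimal sketch for printing what your own environment reports. The helper names (`parse_version`, `cuda_build_supports_a100`, `report`) are made up for this post, and the check relies on the assumption that an A100 (compute capability 8.0) needs a PyTorch build compiled against CUDA 11.0 or newer.

```python
def parse_version(text):
    """Turn a string like '11.1' into a comparable tuple (11, 1)."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def cuda_build_supports_a100(cuda_version):
    """Hypothetical helper: True if the CUDA build version is 11.0 or newer.

    Assumption: A100 is compute capability 8.0 (sm_80), which CUDA 10.x
    builds do not target.
    """
    if cuda_version is None:  # CPU-only build reports no CUDA version
        return False
    return parse_version(cuda_version) >= (11, 0)


def report():
    """Print the versions that matter for this kind of NCCL/CUDA error."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("torch:", torch.__version__)
    print("built against CUDA:", torch.version.cuda)
    print("build OK for A100:", cuda_build_supports_a100(torch.version.cuda))
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("compute capability:", torch.cuda.get_device_capability(0))
        print("NCCL:", torch.cuda.nccl.version())


if __name__ == "__main__":
    report()
```

If "built against CUDA" shows 10.2 on an A100 node, that would be consistent with the error in the first post, although other causes are possible.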

Hello team,
Will the latest CUDA version, i.e., 12.1 and PyTorch version 2.0.1, work to run the OCP models on our systems?
I am asking because I am facing some issues when running the torch-scatter package; it is not being loaded despite being installed. I wonder if it’s due to a version incompatibility.

Hey @Sejal2002, both torch-scatter specifically and OCP models overall should work with PyTorch 2.0+ (note that torch.compile won’t work yet).

Before running OCP models, could you try one of the scatter examples in the readme here to see if they work: GitHub - rusty1s/pytorch_scatter: PyTorch Extension Library of Optimized Scatter Operations.
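
If you want something quick to paste, here is a minimal sketch of such a check. It is not copied from the readme: the data, the pure-Python reference `reference_scatter_sum`, and the wrapper `check_torch_scatter` are made up for illustration. It only assumes that `torch_scatter.scatter` accepts `dim`, `dim_size` and `reduce="sum"`.

```python
def reference_scatter_sum(src, index, size):
    """Pure-Python scatter-sum: add src[k] into out[index[k]]."""
    out = [0] * size
    for value, i in zip(src, index):
        out[i] += value
    return out


# Small made-up example; index 4 receives two values (2 and 1).
SRC = [2, 0, 1, 4, 3]
INDEX = [4, 5, 4, 2, 3]
EXPECTED = reference_scatter_sum(SRC, INDEX, size=6)


def check_torch_scatter():
    """Try to load torch-scatter and compare it with the reference above."""
    try:
        import torch
        from torch_scatter import scatter
    except (ImportError, OSError) as exc:
        # A failure here points at the installation, not at OCP itself.
        return f"torch-scatter failed to load: {exc!r}"
    out = scatter(torch.tensor(SRC), torch.tensor(INDEX),
                  dim=0, dim_size=6, reduce="sum")
    if out.tolist() != EXPECTED:
        return f"unexpected result: {out.tolist()} (expected {EXPECTED})"
    return "torch-scatter loaded and produced the expected result"


if __name__ == "__main__":
    print(check_torch_scatter())
```

If the import itself fails, the printed exception usually says which torch or CUDA build the installed binary expected, which is a useful hint for picking the right install.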

If that doesn’t work and it seems like it’s an installation issue, you could try installing via pip as mentioned here: GitHub - rusty1s/pytorch_scatter: PyTorch Extension Library of Optimized Scatter Operations.

Alright
Thanks a lot!!