Torch/NCCL version mismatch

Mingle · September 13, 2021, 2:48pm

Hello team,

I use a cluster equipped with A100 GPUs, which don’t support torch 1.8.1, so I had to install torch 1.9.1. What confuses me is that when I run the code as the tutorial says, it crashes with the error:

‘RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8. ncclUnhandledCudaError: Call to CUDA function failed.’

Is this a version mismatch problem? I’d appreciate any advice. Thank you in advance.

mshuaibi · September 13, 2021, 3:44pm

We had similar issues with our A100s early on as well. What resolved the issue was installing the CUDA 11.1 build of PyTorch instead of the 10.2 build (making sure your system CUDA version is also 11.1+). Note: you’ll also need to update the PyTorch Geometric packages to match.

Some additional information from our A100 system in case the above doesn’t work for you:
NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2.
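
To compare your own environment against the numbers above, a minimal sketch like the following may help. It assumes the 11.1 minimum mentioned in this reply; the helper names (`parse_version`, `cuda_build_ok`, `report`) are hypothetical and not part of PyTorch or OCP.

```python
def parse_version(text):
    """Turn a string like '11.1' into a tuple of ints: (11, 1)."""
    return tuple(int(part) for part in text.split(".")[:2])


def cuda_build_ok(build_version, minimum="11.1"):
    """True if the CUDA version PyTorch was built with meets the minimum.

    The default minimum of 11.1 is an assumption taken from the reply above.
    """
    if build_version is None:  # CPU-only builds report None
        return False
    return parse_version(build_version) >= parse_version(minimum)


def report():
    """Print the versions that matter for the NCCL error in this thread."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("torch:", torch.__version__)
    print("built with CUDA:", torch.version.cuda)
    print("CUDA build new enough:", cuda_build_ok(torch.version.cuda))
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("compute capability:", torch.cuda.get_device_capability(0))
        print("architectures in this build:", torch.cuda.get_arch_list())
        print("NCCL:", torch.cuda.nccl.version())
```

Calling `report()` on a GPU node prints the installed build details. If the CUDA build is older than 11.1, or the architecture list has no entry for the A100 (`sm_80`), that would be consistent with the error in the first post. Setting the environment variable `NCCL_DEBUG=INFO` before launching the job should also make NCCL log more detail about the failing call.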

Sejal2002 · June 26, 2023, 12:38am

Hello team,
Will the latest CUDA version (12.1) and PyTorch version (2.0.1) work for running the OCP models on our systems?
I am asking because I am having issues with the torch-scatter package: it fails to load despite being installed. I wonder if this is due to a version incompatibility.

abhshkdz · June 26, 2023, 4:45pm

Hey @Sejal2002, both torch-scatter specifically and OCP models overall should work with pytorch 2.0+ (note that torch.compile won’t work yet).

Before running OCP models, could you try one of the scatter examples in the readme of rusty1s/pytorch_scatter on GitHub to see if they work?

If that doesn’t work and it seems like it’s an installation issue, you could try installing via pip as described in the same readme.
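
A quick check along these lines might look like the sketch below. It is not the readme example itself; `scatter_smoke_test` is a hypothetical helper, and it assumes `torch_scatter.scatter` with `reduce="sum"` is available in the installed version.

```python
def scatter_smoke_test():
    """Try to import torch-scatter and run one tiny scatter-sum on the CPU."""
    try:
        import torch
        from torch_scatter import scatter
    except (ImportError, OSError) as exc:
        # OSError typically means the compiled extension was built against
        # a different torch/CUDA combination than the one installed.
        return f"import failed: {exc}"

    src = torch.tensor([1.0, 2.0, 3.0, 4.0])
    index = torch.tensor([0, 0, 1, 1])
    # Sum the entries of src that share an index: [1 + 2, 3 + 4]
    out = scatter(src, index, dim=0, reduce="sum")
    expected = torch.tensor([3.0, 7.0])
    return "ok" if torch.equal(out, expected) else f"unexpected result: {out}"


print(scatter_smoke_test())
```

If this prints an import failure, the installed torch-scatter wheel probably does not match the installed PyTorch and CUDA versions, and reinstalling via pip as described in the readme is the next thing to try.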

Sejal2002 · June 26, 2023, 8:15pm

Alright
Thanks a lot!!
