I use a cluster equipped with A100 GPUs, which don’t support torch 1.8.1, so I had to install torch 1.9.1. What confuses me is that when I run the code as the tutorial describes, it crashes with this error:
‘RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8. ncclUnhandledCudaError: Call to CUDA function failed.’
Could this be a version mismatch problem? I’d appreciate any advice. Thank you in advance.
We had similar issues with our A100s early on as well. What resolved the issue was installing CUDA 11.1 instead of 10.2 (make sure your system CUDA version is also 11.1+). Note: you’ll also need to update the PyTorch Geometric packages.
Some additional information from our A100 system in case the above doesn’t work for you: NVIDIA-SMI 460.73.01, Driver Version: 460.73.01, CUDA Version: 11.2.
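If it helps with comparing setups, here is a minimal sketch for printing which CUDA version your torch build was compiled against and what the GPUs report. The function name `cuda_build_report` and the dictionary keys are made up for illustration; the torch calls themselves (`torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.get_device_capability`) are standard. An A100 reports compute capability 8.0 (sm_80), which needs a CUDA 11+ build, so a `torch_cuda_build` of 10.2 would point to the kind of mismatch described above.

```python
import importlib


def cuda_build_report():
    """Collect the version facts relevant to a torch/CUDA mismatch.

    Returns a dict with None/empty defaults if torch is not importable.
    """
    report = {
        "torch": None,             # torch.__version__
        "torch_cuda_build": None,  # CUDA version torch was compiled against
        "cuda_available": False,   # whether torch can see a usable GPU
        "devices": [],             # (name, "sm_XY") for each visible GPU
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return report

    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            major, minor = torch.cuda.get_device_capability(i)
            report["devices"].append((name, f"sm_{major}{minor}"))
    return report


if __name__ == "__main__":
    for key, value in cuda_build_report().items():
        print(f"{key}: {value}")
```

Run it on a compute node rather than a login node, since the login node may not have a GPU or the same driver.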
Hello team,
Will the latest CUDA version, i.e., 12.1 and PyTorch version 2.0.1, work to run the OCP models on our systems?
I am asking because I am having issues with the torch-scatter package: it fails to load despite being installed. I wonder if this is due to a version incompatibility.
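For reference, this is a minimal sketch of how the load failure can be captured, so the actual exception is visible instead of just "not loaded". The helper name `check_extension` is hypothetical; it only uses the standard library's `importlib` and `importlib.metadata`. I am assuming the distribution is named `torch-scatter` and the module `torch_scatter`.

```python
import importlib
from importlib import metadata


def check_extension(dist_name, module_name):
    """Report whether a package is installed and whether it actually imports."""
    info = {"installed_version": None, "import_ok": False, "error": None}

    # Installed according to package metadata?
    try:
        info["installed_version"] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass

    # Importable in this interpreter? A compiled extension can be installed
    # but still fail here, so keep the exception text for inspection.
    try:
        importlib.import_module(module_name)
        info["import_ok"] = True
    except Exception as exc:
        info["error"] = f"{type(exc).__name__}: {exc}"
    return info


if __name__ == "__main__":
    print(check_extension("torch-scatter", "torch_scatter"))
```

If `installed_version` is set but `import_ok` is False, the `error` field is the part worth posting here.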