New GemNet-dT code, results, model weights

Hi all,

We just released an implementation of GemNet-dT (following arxiv.org/abs/2106.08903) on the OCP repository along with pretrained model weights. This model achieves the best results we know of thus far across all OCP tasks (see leaderboards). This was made possible by Johannes Klicpera, who implemented GemNet in the OCP codebase during his summer internship with us. Thank you!

Specifically, here are the improvements (averaged across all splits) over the next-best entry on the leaderboard:

- IS2RE energy MAE (via relaxation): 0.4342 -> 0.3997 (7.9% relative improvement)
- S2EF force MAE: 0.0297 -> 0.0242 (18.5% relative improvement)
- IS2RS AFbT: 21.8% -> 27.6% (26.6% relative improvement)
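
For anyone who wants to check the relative numbers above, here is a minimal sketch of the arithmetic. The helper name `relative_improvement` is hypothetical and not part of the OCP codebase; it assumes that lower is better for the MAE metrics and higher is better for AFbT.

```python
def relative_improvement(old, new, lower_is_better=True):
    """Return the improvement of `new` over `old` as a percentage of `old`."""
    # For error metrics (MAE) a decrease is an improvement;
    # for success-rate metrics (AFbT) an increase is.
    change = (old - new) if lower_is_better else (new - old)
    return 100.0 * change / old


# Leaderboard numbers quoted in the list above.
results = {
    "IS2RE energy MAE": relative_improvement(0.4342, 0.3997),
    "S2EF force MAE": relative_improvement(0.0297, 0.0242),
    "IS2RS AFbT": relative_improvement(21.8, 27.6, lower_is_better=False),
}

for name, pct in results.items():
    print(f"{name}: {pct:.1f}% relative improvement")
```

Rounded to one decimal place, this gives the 7.9%, 18.5%, and 26.6% figures quoted in the list.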

GemNet-dT is also quite efficient relative to other models. For S2EF, we’re able to fit batch sizes of up to 16 (or 32 with AMP) on NVIDIA 32GB V100 GPUs, compared to 8 for DimeNet++ and 3 for SpinConv. Training for 24 hours on 16 x V100s gets GemNet to 0.025 force MAE on val ID, compared to 0.035 for DimeNet++ and 0.047 for SpinConv.

Also included as part of this code release is an implementation of SpinConv (following arxiv.org/abs/2106.09575) and several other improvements. Complete details are in the release notes: Release v0.0.3: GemNet-dT, SpinConv, new data: MD, Rattled, per-adsorbate trajectories, etc. (Open-Catalyst-Project/ocp on GitHub).

Note that our team will not be entering GemNet, SpinConv, or any other model in the challenge we’re hosting at NeurIPS. We encourage everyone to refer to and/or build on any of this code for the challenge (or otherwise).

Thanks


Thanks for the release! It seems like you guys are using a very large number of GPUs (64 or 16 V100s?) for multiple days to train the models. If we are students at a research institution with limited GPU resources (e.g. 1 or 2 V100 GPUs for training each model), is it feasible to participate in this competition and get good results, or would you say that large-scale compute is a strong prerequisite to get good results?

Hi - This is a common concern we’ve been receiving. We discuss it in more detail here: IS2RE Leaderboard Concerns.

TLDR - Models trained on the S2EF dataset (trained with the compute you mentioned) that then run a relaxation to get the relaxed energy are currently the best-performing approach. Alternatively, training a model on the IS2RE dataset (~250x less data than the S2EF dataset) to directly predict the relaxed energy is something we’re also interested in for compute reasons (direct approaches are 200-400x faster at inference). To address this (not finalized yet), we are leaning towards awarding 2 teams: (1) the overall best performance, irrespective of the dataset/approach used, and (2) the best performance having trained only on the IS2RE dataset (~460k data points). This would allow teams without heavy compute to still compete without being at a significant disadvantage merely due to compute resources.

Let us know if there are any other concerns. We are constantly trying to make the competition as engaging as possible for the community.


Thanks for sharing this awesome model!
How does GemNet perform on the validation set?

Hi -

Here are some validation numbers for GemNet:

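IS2RE (relaxation) / IS2RS:

| Split | Energy MAE (eV) | EwT | ADwT |
|---|---|---|---|
| ID | 0.397 | 11.81% | 58.21% |

S2EF:

| Split | Energy MAE (eV) | Forces MAE (eV/Å) | Forces Cos |
|---|---|---|---|
| ID | 0.234 | 0.021 | 0.632 |
| OOD-Ads | 0.245 | 0.024 | 0.621 |
| OOD-Cat | 0.347 | 0.025 | 0.575 |
| OOD-Both | 0.405 | 0.032 | 0.605 |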

Here are the problems I ran into while trying to reproduce this: “Training these models for 24 hours on 16 x V100s gets to 0.025 force MAE on val ID for GemNet”.

  1. It seems that the pre-trained GemNet was trained with a batch size of 2048. However, 16 x V100s cannot fit such a large batch size, and I cannot find any code related to gradient accumulation.
  2. After implementing gradient accumulation myself, the training loss does not look good.

BTW, could you please provide the TensorBoard log file so that we can compare the training process?
Thanks a lot!

Hi -

The pre-trained GemNet was trained on 64 x 32 GB V100 cards with AMP (--amp at the command line). This allowed for a batch size of 2048, so no gradient accumulation was necessary. At an effective batch size of 512 you should be able to get similar performance as well.
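
To make the batch-size numbers above concrete, here is a rough sketch of the arithmetic. It assumes a per-GPU batch size of 32 (the AMP figure from the first post in this thread); the actual per-GPU setting in the released config may differ. The helper names are hypothetical and not part of the OCP codebase.

```python
import math


def effective_batch_size(num_gpus, per_gpu_batch, accumulation_steps=1):
    """Samples contributing to one optimizer step across all GPUs."""
    return num_gpus * per_gpu_batch * accumulation_steps


def accumulation_steps_needed(target_batch, num_gpus, per_gpu_batch):
    """Smallest number of accumulation steps that reaches `target_batch`."""
    return math.ceil(target_batch / (num_gpus * per_gpu_batch))


# Assumption: 32 samples per 32 GB V100 with AMP, as quoted earlier in the thread.
PER_GPU_BATCH = 32

pretrained = effective_batch_size(64, PER_GPU_BATCH)   # 64 GPUs, no accumulation
sixteen_gpu = effective_batch_size(16, PER_GPU_BATCH)  # 16 GPUs, no accumulation

# Hypothetical small setup: 2 GPUs aiming for an effective batch size of 512.
steps_for_two_gpus = accumulation_steps_needed(512, 2, PER_GPU_BATCH)

print(pretrained, sixteen_gpu, steps_for_two_gpus)
```

Under that assumption, 64 GPUs give 2048 and 16 GPUs give 512 without any accumulation, while a 2-GPU setup would need 8 accumulation steps to reach 512. If you accumulate gradients yourself, remember to average the loss over the accumulated steps so that the gradient scale matches a single large batch.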

As for the logs, we will certainly look into this and check internally whether sharing them is possible.

Thank you very much. That addresses my problems!