We’re very excited to announce the 2nd Open Catalyst Challenge at NeurIPS 2022!
This year’s challenge will be on the same task — Initial Structure to Relaxed Energy (IS2RE) prediction — and use the same training and validation data — OC20 — as last year.
Unlike last year, we’re planning to have a single track, and we’re allowing (and encouraging!) the use of both the IS2RE and the S2EF (Structure to Energy-Forces) 2M datasets for training.
We’ve found that relaxation-based IS2RE approaches (trained on S2EF-2M) consistently perform significantly better than direct IS2RE approaches, albeit at a higher compute cost.
To keep compute costs manageable, training on S2EF splits larger than 2M is not allowed for the challenge.
The submission deadline for the challenge is October 07, 2022. We’ll be releasing the test dataset for the challenge on September 21, 2022, ~two weeks before the deadline.
More details here: Open Catalyst Challenge.
We saw amazing participation and modeling improvements last year (thank you!), and look forward to more of the same this year!
Please let us know if you have any questions or concerns.
— Open Catalyst team
I have a question regarding this particular line: “Using DFT is not allowed.” Could you elaborate on what you mean by DFT? Do you mean the whole theory, codes and/or data derived from the theory, ideas taken from the theory, or something else? It would be good to be precise.
I have one question: to what extent can we use extra information in training, such as information from the Mendeleev periodic table or basic knowledge about the atoms themselves? Or are we not allowed to use any information that is not provided in the dataset, and limited to just the atomic number?
@abhshkdz, could you comment? Thanks!
Thanks for the questions!
Just to reiterate our motivation behind organizing the challenge: we want to encourage methods that can accelerate IS2RE pipelines and make them considerably faster than DFT. Using DFT (with theory equivalent to that used for the OC20 data) at test time would be similarly slow, and is hence not allowed.
Having said that, there might be other cheaper calculations (e.g. force fields, reactive force fields, some approximate tight binding methods) that are much faster. That’s fine to do, especially if these calculations take < 1 second per IS2RE prediction (simulating the entire relaxation may take slightly longer, < 10 seconds). We obviously don’t have a way to strictly enforce / check inference times since we just ask for predictions, but would appreciate it if you stick to the spirit and keep the ~1 second ballpark number in mind. Note that most tight binding methods are significantly more expensive than this.
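As a rough self-check against the ~1 second ballpark, something like the sketch below could be used. It is only an illustration: `mean_seconds_per_prediction`, `within_budget`, and the stand-in `predict_fn` are hypothetical names, not part of any OCP tooling, and wall-clock numbers depend heavily on hardware and batching.

```python
import time


def mean_seconds_per_prediction(predict_fn, inputs):
    """Average wall-clock seconds spent per call to predict_fn.

    predict_fn is a hypothetical callable that maps one initial structure
    to one relaxed-energy prediction.
    """
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one input to time")
    start = time.perf_counter()
    for structure in inputs:
        predict_fn(structure)
    elapsed = time.perf_counter() - start
    return elapsed / len(inputs)


def within_budget(mean_seconds, budget_seconds=1.0):
    """True if the measured average is within the chosen time budget."""
    return mean_seconds <= budget_seconds


if __name__ == "__main__":
    # Stand-in "model" that does trivial work; replace with a real predictor.
    seconds = mean_seconds_per_prediction(lambda structure: structure * 2, range(1000))
    print(f"{seconds:.6f} s per prediction, within budget: {within_budget(seconds)}")
```

For relaxation-based approaches, the same helper could be called with `budget_seconds=10.0` to reflect the looser bound mentioned above.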
Other auxiliary features (e.g. Bader charges, other element properties as in the CGCNN paper, etc.) are also fine to use to train models, but it’s worth keeping in mind that some of these features (e.g. Bader charges) might not be available at test time. We recently released Bader charge data for OC20 training / validation here: ocp/DATASET.md at main · Open-Catalyst-Project/ocp · GitHub.
The IS2RE data contains the relaxed structure for each initial structure. Could you confirm that we are allowed to use the relaxed structures? Thanks!
Yes, using relaxed structures is allowed.
Hello @mshuaibi, are IS2RE validation data allowed for training? Just curious.
Yep, you’re allowed to use IS2RE validation data for training.
I have some questions about the OCP challenge @ NeurIPS 2022:
- Is it permitted to use the Relaxation Trajectories dataset? You released the data in oc22_trajectories.tar.gz, but I couldn’t find any restriction on its use.
- It seems that S2EF-2M has not been released yet. Is s2ef_total_train_val_test_lmdbs.tar.gz the S2EF-total dataset?
- The link in “The challenge will have a single track, wherein participants are allowed to train on the IS2RE dataset (size 460k) and/or the S2EF 2M dataset.” points to the OC 2020 dataset. Is that link a mistake, or do we need to train our model on OC 2020 data?
This year’s Open Catalyst Challenge is based on the OC20 dataset (not the OC22 dataset). All the details posted on the challenge webpage (Open Catalyst Challenge) are correct.
You’re allowed to use the OC20 S2EF-2M and IS2RE datasets for the competition.
Hi @abhshkdz ,
Thanks for the clarification!
Another question: are we allowed to use the OC20 S2EF MD trajectories (https://dl.fbaipublicfiles.com/opencatalystproject/data/s2ef_md.tar)? Or are only OC20 IS2RE and S2EF-2M permitted, with no other data?
MD data is not permitted. Only OC20 IS2RE and S2EF-2M data is allowed.
Could you release the existing baselines’ results on the eval dataset in the table on this page Open Catalyst Challenge? I am trying to compare my model with the existing state-of-the-art models on the eval dataset at the early stage. However, I am unsure about the performance of existing baselines on the eval dataset. I have run the released code in the GitHub repository, but I am unsure whether I have successfully reproduced these baselines. Thanks a lot.
Yep, here are a couple of prediction files on IS2RE test:
- GemNet-OC trained on S2EF-2M: 0.407 eV energy MAE, averaged across the 4 test splits.
- GemNet-dT trained on S2EF-2M: 0.438 eV energy MAE, averaged across the 4 test splits.
The weights for both models are released here.
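For anyone reproducing a number of this kind locally, here is a minimal sketch of one way to read "energy MAE averaged across the 4 splits": compute the MAE on each split separately, then take the unweighted mean. The split keys and the toy energies below are made up for illustration, and the official evaluation may differ in its details.

```python
from statistics import fmean


def energy_mae(predicted, target):
    """Mean absolute error between two equal-length sequences of energies."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return fmean(abs(p - t) for p, t in zip(predicted, target))


def mae_averaged_over_splits(predictions, targets):
    """Return per-split MAEs and their unweighted mean.

    Both arguments map a split name to a sequence of energies.
    """
    per_split = {
        name: energy_mae(predictions[name], targets[name]) for name in targets
    }
    return per_split, fmean(per_split.values())


# Toy data only; the split names are assumed labels, not official identifiers.
toy_targets = {
    "id": [0.0, -1.0],
    "ood_ads": [1.0, 1.0],
    "ood_cat": [0.5, 0.5],
    "ood_both": [2.0, -2.0],
}
toy_predictions = {
    "id": [0.5, -1.0],
    "ood_ads": [1.0, 0.0],
    "ood_cat": [0.0, 1.0],
    "ood_both": [2.0, -1.0],
}

per_split_mae, averaged_mae = mae_averaged_over_splits(toy_predictions, toy_targets)
print(per_split_mae, averaged_mae)
```

Note that this gives each split equal weight regardless of its size, which is not the same as pooling all systems and computing a single MAE.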
Thanks for the reply. Could you release the energy MAE averaged across the 4 EVAL splits? We only have ten attempts on the test splits, so we would like to compare with baselines on the EVAL splits at the early stage.
Yeah, here are the predictions and results on the IS2RE validation splits for the GemNet-OC model trained on S2EF-2M.
Thanks. This helps a lot.
For alignment with the upcoming test dataset, I want to confirm that our goal is to predict E_ads and not E_system, i.e. that we need to subtract the reference energy. Will the reference energies be available in the upcoming test dataset?
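To make the question concrete, the subtraction being asked about can be sketched as below, assuming the relation E_ads = E_system - E_ref with one reference energy per system. The function name, system IDs, and energy values are hypothetical and do not reflect the actual dataset format.

```python
def to_adsorption_energies(system_energies, reference_energies):
    """Convert total system energies to adsorption energies.

    Assumes E_ads = E_system - E_ref, with both arguments given as
    dictionaries keyed by the same (hypothetical) system IDs.
    """
    missing = sorted(set(system_energies) - set(reference_energies))
    if missing:
        raise KeyError(f"no reference energy for systems: {missing}")
    return {
        system_id: energy - reference_energies[system_id]
        for system_id, energy in system_energies.items()
    }


# Made-up values, for illustration only.
toy_system_energies = {"sys_a": -100.5, "sys_b": -250.0}
toy_reference_energies = {"sys_a": -100.0, "sys_b": -248.5}

toy_adsorption_energies = to_adsorption_energies(
    toy_system_energies, toy_reference_energies
)
print(toy_adsorption_energies)
```

If reference energies turn out not to be provided for the test set, a model would have to predict E_ads directly, since this conversion could not be applied at test time.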