Open Catalyst 2022 (OC22) Dataset: Oxide Electrocatalysis

Hi all -

We’re very excited to announce the release of the Open Catalyst 2022 (OC22) Dataset: Oxide Electrocatalysis.

Two years ago we released the Open Catalyst 2020 (OC20) Dataset and have been impressed by the amazing progress the community has made so far. While OC20 spanned a large chemical and material space, it did not include everything. Specifically, OC20 lacked oxide materials - a class of materials that play an important role in green hydrogen production via the Oxygen Evolution Reaction (OER) and in other oxide chemistries. Today we’re releasing OC22 in hopes of continuing to encourage the development of faster, more accurate models on even more complex systems. OC22 consists of ~60,000 DFT relaxations (~9M single point calculations) and took upwards of 20M compute hours. For reference, OC20 took ~70M compute hours and was almost 16x the size of OC22.

While OC20 is not yet a solved problem, we anticipate that OC22 will aid in the development of more generally applicable models and methods. Notably, OC22 changes the energy target from the adsorption energy to the DFT total energy. Although this is a more challenging task, predicting the DFT total energy allows models to additionally screen surface configurations, an important and necessary step for studying OER. We’ve released a new dataloader that allows you to explore the same task for OC20 as well.

One question that arises when new datasets are created is whether the data complements existing datasets or vice versa. In this work, we explore the extent to which OC20 can aid OC22 via transfer learning or joint training on both datasets. We hope the existence of both datasets will also encourage the community to explore transfer learning strategies to aid catalyst applications more broadly.

For more details make sure to check out our paper.
Dataset download: ocp/ at main · Open-Catalyst-Project/ocp · GitHub

We hope to launch a public leaderboard similar to that of OC20 in the near future. Stay tuned!
In the meantime, please let us know if you have any questions.

– Open Catalyst Team


Dear OCP teams,

I have some questions about OC22.

i) Paper 1* says that the biggest challenge is combining datasets computed at varying levels of DFT. On the other hand, paper 2** says that the OC20 dataset enabled the catalysis community to use transfer learning to improve model performance on other datasets. In fact, OC20 and OC22 are combined even though they use different levels of DFT. Is it currently possible to combine OC20, OC22, and other datasets (such as ANI-1)?
**: [2206.08917] The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysis

ii) In the future, will it be possible to calculate the DFT total energy of an isolated molecule with Open Catalyst? I want to calculate adsorption energies from DFT total energies. What is the best practice for calculating the total energy and the adsorption energy, for example as input to microkinetic models of catalytic reactions? When I run an isolated molecule with the OC20 + OC22 model, it moves through the cell and the calculation does not converge.

I would appreciate your reply.


Hi -

  1. This is something we have started to think about as well: how to combine other datasets with OC20/OC22. Nothing specifically stops one from combining the datasets and training on energies/forces (if available), so it’s certainly possible. However, combining them directly may not be straightforward, since the different levels of DFT theory used across datasets could be problematic. This is certainly worth exploring.

  2. Our current models are not capable of accurately predicting the energies of isolated molecules. This is because our datasets do not contain isolated molecules, so models are unable to explicitly learn those energies. In the context of adsorption energies, we use linearly fit gas reference energies that apply to any arbitrary adsorbate (see Table 5 in the SI here). Using a total energy model, the steps necessary to compute an adsorption energy are as follows:

Example - CO on a Cu surface
a. Run an ML relaxation on a clean Cu surface; the relaxed energy here is E_slab
b. Place CO on the relaxed Cu surface
c. Run an ML relaxation on the CO + relaxed Cu surface; the relaxed energy here is E_adslab
d. Compute the CO gas reference energy from the table referenced above (E_gas = E_C + E_O = -14.486 eV)
e. E_ads = E_adslab - E_slab - E_gas
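
The arithmetic in steps d and e can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline code itself: the function names `gas_reference` and `adsorption_energy` are made up for this example, the slab and adslab energies are hypothetical placeholders rather than real model output, and the per-element reference energies are expected to come from the table referenced above. Only the CO value of -14.486 eV is taken from step d.

```python
from typing import Mapping


def gas_reference(composition: Mapping[str, int],
                  element_refs: Mapping[str, float]) -> float:
    """Step d: sum per-element reference energies (eV) over the adsorbate atoms.

    `element_refs` must be supplied by the caller (e.g. from the reference
    table); no values are hard-coded here.
    """
    return sum(count * element_refs[element]
               for element, count in composition.items())


def adsorption_energy(e_adslab: float, e_slab: float, e_gas: float) -> float:
    """Step e: E_ads = E_adslab - E_slab - E_gas, all in eV."""
    return e_adslab - e_slab - e_gas


# Gas reference for CO as quoted in step d (E_C + E_O).
E_GAS_CO = -14.486

# Hypothetical placeholder total energies in eV, NOT real model output.
# In practice these come from the ML relaxations in steps a and c.
e_slab = -100.0
e_adslab = -115.0

e_ads = adsorption_energy(e_adslab, e_slab, E_GAS_CO)
print(f"E_ads = {e_ads:.3f} eV")
```

For an adsorbate other than CO, `gas_reference` would be called with its atom counts (for example `{"C": 1, "O": 1}` for CO) and the per-element values from the table, and the result passed as `e_gas`.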

We hope to share some code for this pipeline as part of the OC22 paper in the near future to make it easier for users.

Thank you very much for your kind reply.
Could I use the same reference energies with a pre-trained model trained on OC22 (+ OC20)?


For the gas reference energy, yeah that will be fine.

Thank you very much!