Adsorption energy

Hello, excuse me, I would like to ask: what does the energy in each randomxxxx.extxyz file represent? Can you provide the adsorption energy for each step of the DFT-optimized trajectory? I hope you can clear up my confusion.

Hi -

The energy corresponds to the raw DFT energy produced by VASP. If you want the adsorption energy of each step, you must extract the reference energy and subtract that. Specifically, there should be a text file system.txt in the format system_id, reference_energy. system_id corresponds to the randomXXX of the particular .extxyz file. For example, if I wanted to compute per-step adsorption energy for random123456.extxyz, I would find the reference_energy corresponding to random123456 in system.txt then subtract it from each step in the trajectory:

import ase.io

# reference_energy must first be set to the value listed for random123456 in system.txt
trajectory = ase.io.read("random123456.extxyz", ":")  # read all frames
per_step_adsorption_energies = [atoms.get_potential_energy() - reference_energy for atoms in trajectory]
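To make the lookup step concrete, here is a minimal sketch of reading the reference energies and applying them. It assumes system.txt is comma-separated with one system_id, reference_energy pair per line and no header row; the helper names are made up for illustration and are not part of the OCP codebase.

```python
import csv


def load_reference_energies(path="system.txt"):
    """Map system_id -> reference_energy (eV).

    Assumes comma-separated 'system_id, reference_energy' lines, no header.
    """
    references = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            references[row[0].strip()] = float(row[1].strip())
    return references


def per_step_adsorption_energies(raw_energies, reference_energy):
    """Subtract the single per-system reference from every step's raw DFT energy."""
    return [energy - reference_energy for energy in raw_energies]


def adsorption_energies_for(system_id, reference_path="system.txt"):
    """Hypothetical convenience wrapper: read <system_id>.extxyz and reference it."""
    import ase.io  # imported lazily so the helpers above work without ASE

    reference_energy = load_reference_energies(reference_path)[system_id]
    trajectory = ase.io.read(f"{system_id}.extxyz", ":")  # all frames
    raw = [atoms.get_potential_energy() for atoms in trajectory]
    return per_step_adsorption_energies(raw, reference_energy)
```

Calling adsorption_energies_for("random123456") would then return one adsorption energy per frame, provided both files are in the working directory.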

Ok, I see. Thank you very much for your answer.

Hello, can I check whether I understood the reference_energy?

There is an energy in the extxyz file, and a reference_energy in the txt file.

adsorption_energy = energy - reference_energy where reference_energy = slab_energy + adsorbate_energy

energy is the raw DFT energy of the combined slab (catalyst) + adsorbate system.
slab_energy is the raw DFT energy of the slab (catalyst).
adsorbate_energy is the raw DFT energy of the adsorbate.

To train on the raw DFT energy, we have to use just energy as the target value.

Additionally, is the y in the OC22 LMDB files the same as energy in OC20 extxyz files?

Hi - Yes, your understanding is correct! One minor thing to clarify: adsorbate_energy comes from a linear combination of DFT energies of a few gas molecules (see here for more details).
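As a rough illustration of how the pieces fit together, the sketch below builds adsorbate_energy as a sum of per-atom reference energies and then forms the adsorption energy. The per-atom decomposition is one assumed way of expressing the linear combination of gas-phase energies, and the numbers are placeholders, not the values OC20 actually uses.

```python
from collections import Counter

# HYPOTHETICAL per-atom reference energies in eV. These are placeholders for
# illustration only; the real values come from the gas-phase DFT references.
ATOM_REFERENCE_EV = {"H": -1.0, "C": -2.0, "O": -3.0, "N": -4.0}


def adsorbate_energy(symbols, atom_reference=ATOM_REFERENCE_EV):
    """Linear combination: (number of atoms of each element) x (its reference energy)."""
    counts = Counter(symbols)
    return sum(n * atom_reference[element] for element, n in counts.items())


def adsorption_energy(energy, slab_energy, symbols):
    """adsorption_energy = energy - (slab_energy + adsorbate_energy)."""
    reference_energy = slab_energy + adsorbate_energy(symbols)
    return energy - reference_energy
```

With these placeholder values, an OH adsorbate has adsorbate_energy = -4.0 eV, so a combined-system energy of -110.0 eV on a -100.0 eV slab gives an adsorption energy of -6.0 eV.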

If you wanted to train on raw DFT energies for OC20 - yes you need to just use energy and not subtract off the reference. For OC22 we just use raw DFT energies and you can use y in the LMDB directly.

Thank you for replying.

Would it not be a problem to combine the OC20 and OC22 datasets to jointly train a model?
I saw what you mentioned before here.

The different levels of DFT calculation might cause trouble, since they give different energies for the same system.
Using different pseudopotentials (and other detailed settings) causes a huge energy difference for the same system.

Can you comment on this matter further?

Great question. This is something that was surprising to us as well. However, one thing to note here is that the datasets are non-overlapping, meaning OC20 has no oxide materials and OC22 has only oxide materials. While the two datasets differ in theory, OC20 (RPBE, no spin-polarization) vs OC22 (PBE+U, spin-polarization), no identical data point exists under both levels of theory; such duplicates would certainly pose challenges for the model. As a result, the model in principle should be able to learn the two theories simultaneously, which is what we observe in the manuscript. There certainly remain a lot of open questions here on how to best handle different levels of theory simultaneously.

What’s particularly interesting is that training on both datasets together, despite their DFT theory differences, aids performance on both datasets. The additional force supervision that comes with larger datasets could be benefiting both datasets here by improving their energy predictions as well. Better understanding these effects is certainly something we hope the community and ourselves explore in the future.


Thank you so much for the comments.

I have trouble understanding OC20 IS2RE datasets.
I’m checking the OC20 IS2RE datasets. These are in LMDB format.

I found that some of them have positive energies in y_relaxed, and almost every entry has a value near 0.
Does that mean the y_relaxed values are adsorption energies instead of raw DFT energies? And are all energies in units of eV?

However, y_relaxed in OC22 IS2RE data seems to be the raw DFT energy.

Correct - All OC20 LMDBs contain adsorption energies, not raw DFT energies. If you want to convert them to raw DFT energy you need to unreference them with the mapping provided here: ocp/DATASET.md at main · Open-Catalyst-Project/ocp · GitHub. If you’re looking to train models on OC20 raw energies you can do so using a config like this ocp/gemnet_oc_oc20_oc22.yml at main · Open-Catalyst-Project/ocp · GitHub.
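A minimal sketch of the unreferencing step follows. It assumes the reference file unpickles to a dictionary keyed by strings of the form "random<sid>" with the reference energy in eV as the value; check the actual file layout before relying on this. The function names are illustrative only.

```python
import pickle


def load_reference_mapping(path):
    """Load the pickled mapping (assumed: {"random<sid>": reference_energy_eV})."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def to_raw_energy(y_relaxed, sid, reference):
    """Undo the referencing: raw DFT energy = adsorption energy + reference energy."""
    return y_relaxed + reference[f"random{sid}"]
```

For an LMDB entry with sid 123456 and an adsorption energy of -0.5 eV, to_raw_energy(-0.5, 123456, reference) returns the raw DFT energy, as long as that key is present in the mapping.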


Thank you for letting me know.

Because there are various kinds of databases, I think I missed that information.

Based on your comments, here is a summary.

In IS2RE (and IS2RS) databases,

OC20

y_relaxed: adsorption energy, so converting to raw energy requires adding reference_energy from the OC20 reference information
raw_DFT_total_energy = y_relaxed + reference_energy

Catalyst system trajectories (optional download) for catalysts (not catalyst + adsorbate systems) in OC20
energy: raw_DFT_total_energy, therefore no requirement to convert.

OC22

y_relaxed: raw_DFT_total_energy, therefore no requirement to convert.

If you want to use the original DFT energy, you need raw DFT total energy = y_relaxed + reference_energy. However, I found that the ref_OC20.pkl file contains 964277 entries in total, while the OC20 dataset contains only 460328. So the random IDs and the sids do not correspond one to one. Why do the two counts differ?

Hi -

The 964,277 number in the oc20_ref.pkl is for data across the different splits of data - train, val, and test. The 460,328 number you’re seeing for the OC20 dataset is just the training set, so that will be a subset and is why you’re not seeing the same sizes.
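One way to confirm that the training set is simply a subset of the reference file is to check that every training system ID has an entry in the mapping. The sketch below assumes the mapping is a dictionary keyed by system ID and that you have already collected the training IDs into a list; both helpers are hypothetical.

```python
def missing_references(system_ids, reference):
    """Return the system IDs that have no entry in the reference mapping."""
    return sorted(set(system_ids) - set(reference))


def coverage_summary(system_ids, reference):
    """Count unique dataset IDs, reference entries, and IDs lacking a reference."""
    unique_ids = set(system_ids)
    return {
        "dataset_ids": len(unique_ids),
        "reference_entries": len(reference),
        "missing": len(unique_ids - set(reference)),
    }
```

If the training IDs are a true subset, "missing" is 0 and "reference_entries" is larger than "dataset_ids", which matches the situation described above.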