Hello, excuse me, I would like to ask what the energy in each randomxxxx.extxyz file represents. Can you provide the adsorption energy for each step of the DFT-optimized trajectory? I hope you can clear up my confusion.
Hi -
The energy corresponds to the raw DFT energy produced by VASP. If you want the adsorption energy of each step, you must extract the reference energy and subtract it. Specifically, there should be a text file, system.txt, in the format system_id, reference_energy. Here system_id corresponds to the randomXXX of the particular .extxyz file. For example, if I wanted to compute the per-step adsorption energy for random123456.extxyz, I would find the reference_energy corresponding to random123456 in system.txt, then subtract it from each step in the trajectory:
import ase.io
trajectory = ase.io.read("random123456.extxyz", ":") # read all frames
per_step_adsorption_energies = [atoms.get_potential_energy() - reference_energy for atoms in trajectory]
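The snippet above leaves reference_energy undefined. A minimal sketch of the lookup follows, assuming system.txt holds one comma-separated system_id,reference_energy pair per line with no header row (the exact file layout is an assumption, as are the helper names, which are hypothetical):

```python
import csv
import os


def load_reference_energies(lines):
    """Parse 'system_id,reference_energy' rows into a dict.

    ASSUMPTION: one comma-separated pair per line, no header row.
    """
    references = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        references[row[0].strip()] = float(row[1].strip())
    return references


def subtract_reference(raw_energies, reference_energy):
    """Turn raw DFT energies into adsorption energies for one system."""
    return [energy - reference_energy for energy in raw_energies]


def per_step_adsorption_energies(extxyz_path, system_txt_path):
    """Read every frame of a trajectory and subtract its reference energy."""
    import ase.io  # imported lazily so the helpers above work without ASE

    # "path/to/random123456.extxyz" -> "random123456"
    system_id = os.path.splitext(os.path.basename(extxyz_path))[0]
    with open(system_txt_path, newline="") as handle:
        references = load_reference_energies(handle)
    trajectory = ase.io.read(extxyz_path, ":")  # read all frames
    raw = [atoms.get_potential_energy() for atoms in trajectory]
    return subtract_reference(raw, references[system_id])
```

With both files in the working directory, per_step_adsorption_energies("random123456.extxyz", "system.txt") would return one adsorption energy per frame.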
Ok, I see. Thank you very much for your answer.
Hello, can I check whether I understood reference_energy correctly?
There is energy in the extxyz file and reference_energy in the txt file.
adsorption_energy = energy - reference_energy
where reference_energy = slab_energy + adsorbate_energy
energy is the raw DFT energy of the combined slab (catalyst) and adsorbate system.
slab_energy is the raw DFT energy of the slab (catalyst).
adsorbate_energy is the raw DFT energy of the adsorbate.
To train on raw DFT energies, we have to use just energy as the target value.
Additionally, is the y in the OC22 LMDB files the same as energy in the OC20 extxyz files?
Hi - Yes, your understanding is correct! One minor thing to clarify: adsorbate_energy comes from a linear combination of DFT energies of a few gas molecules (see here for more details).
If you want to train on raw DFT energies for OC20, then yes, you need to use just energy and not subtract off the reference. For OC22 we just use raw DFT energies, and you can use y in the LMDB directly.
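To make the bookkeeping concrete, here is a small sketch of how the pieces combine. It assumes the gas-phase linear combination has already been reduced to one energy per element. The numbers in ATOMIC_REFERENCE_EV are hypothetical placeholders, not the values used for OC20, and the function names are illustrative only.

```python
from collections import Counter

# HYPOTHETICAL per-atom energies in eV, for illustration only. The real
# values come from gas-phase DFT calculations documented with the dataset.
ATOMIC_REFERENCE_EV = {"H": -3.0, "O": -7.0, "C": -7.5, "N": -8.0}


def adsorbate_energy(symbols, atomic_reference=ATOMIC_REFERENCE_EV):
    """Linear combination: sum over elements of count * per-atom energy."""
    counts = Counter(symbols)
    return sum(n * atomic_reference[element] for element, n in counts.items())


def adsorption_energy(energy, slab_energy, symbols):
    """adsorption_energy = energy - (slab_energy + adsorbate_energy)."""
    reference_energy = slab_energy + adsorbate_energy(symbols)
    return energy - reference_energy
```

For an OH adsorbate, adsorbate_energy(["O", "H"]) adds one O term and one H term from the table, and adsorption_energy subtracts that sum plus the slab energy from the combined-system energy.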
Thank you for replying.
Would it not be a problem to combine the OC20 and OC22 datasets to jointly train a model?
I saw what you mentioned before here.
There might be trouble because the energies come from different levels of DFT calculation.
Different pseudopotentials (and other detailed settings) cause a huge energy difference for the same system.
Can you comment on this matter further?
Great question. This was surprising to us as well. However, one thing to note is that the datasets are non-overlapping: OC20 has no oxide materials and OC22 has only oxide materials. While the two datasets differ in level of theory, OC20 (RPBE, no spin-polarization) vs. OC22 (PBE+U, spin-polarization), no identical data point exists under both levels of theory; such overlap would certainly pose challenges for the model. As a result, the model should in principle be able to learn the two theories simultaneously, and that is what we observe in the manuscript. There certainly remain a lot of open questions about how best to handle different levels of theory simultaneously.
What’s particularly interesting is that training on both datasets together, despite their differences in DFT theory, aids performance on both datasets. The additional force supervision that comes with larger datasets could be benefiting both datasets by improving their energy predictions as well. Better understanding these effects is certainly something we hope the community and we ourselves will explore in the future.
Thank you so much for the comments.
I have trouble understanding the OC20 IS2RE datasets.
I’m checking the OC20 IS2RE datasets, which are in LMDB format.
I found that some entries have positive energies in y_relaxed, and almost all values are near 0.
Does that mean the y_relaxed values are adsorption energies instead of raw DFT energies? And are all energies in units of eV?
However, y_relaxed in the OC22 IS2RE data seems to be the raw DFT energy.
Correct - All OC20 LMDBs contain adsorption energies, not raw DFT energies. If you want to convert them to raw DFT energy you need to unreference them with the mapping provided here: ocp/DATASET.md at main · Open-Catalyst-Project/ocp · GitHub. If you’re looking to train models on OC20 raw energies you can do so using a config like this ocp/gemnet_oc_oc20_oc22.yml at main · Open-Catalyst-Project/ocp · GitHub.
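A minimal sketch of the unreferencing step, assuming the reference file unpickles to a dict whose keys look like "random123456" and whose values are reference energies in eV, and that each LMDB entry exposes an integer sid (both are assumptions to check against DATASET.md; the function names are hypothetical):

```python
import pickle


def load_reference_mapping(path):
    """Load the pickled {system_id: reference_energy} mapping."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def to_raw_dft_energy(y_relaxed, sid, reference_mapping):
    """raw DFT total energy = y_relaxed (adsorption energy) + reference_energy."""
    key = f"random{sid}"  # ASSUMPTION: keys are "random" followed by the sid
    return y_relaxed + reference_mapping[key]
```

Only call load_reference_mapping on a file from a source you trust, since unpickling can execute arbitrary code.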
Thank you for letting me know.
Because there are various kinds of databases, I think I missed that information.
Based on your comments, here is my summary.
In IS2RE (and IS2RS) databases:
OC20, y_relaxed: adsorption energy, so converting it requires adding reference_energy from the OC20 reference information: raw DFT total energy = y_relaxed + reference_energy.
OC20 catalyst system trajectories (optional download; catalysts only, not catalyst + adsorbate systems), energy: raw DFT total energy, so no conversion is required.
OC22, y_relaxed: raw DFT total energy, so no conversion is required.
If you want to use the original DFT energy, you need raw DFT total energy = y_relaxed + reference_energy. I found that there are a total of 964277 entries in the oc20_ref.pkl file, but only 460328 entries in the OC20 dataset. So the random IDs do not correspond one-to-one to the sids. Why is this? Why do the two counts differ?
Hi -
The 964,277 number in oc20_ref.pkl covers data across the different splits: train, val, and test. The 460,328 number you’re seeing for the OC20 dataset is just the training set, which is a subset, and that is why you’re not seeing the same sizes.
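One way to confirm this yourself is to check that every training sid has an entry in the reference mapping. The sketch below is a hypothetical helper that assumes the mapping keys have the form "random" followed by the sid:

```python
def reference_coverage(reference_mapping, split_sids):
    """Compare the reference mapping's keys with the sids of one split."""
    # ASSUMPTION: mapping keys are "random" followed by the integer sid.
    split_keys = {f"random{sid}" for sid in split_sids}
    reference_keys = set(reference_mapping)
    return {
        "n_reference": len(reference_keys),
        "n_split": len(split_keys),
        # Should be empty if the split is a subset of the reference file.
        "missing_from_reference": sorted(split_keys - reference_keys),
        # Entries belonging to other splits (e.g. val and test).
        "n_not_in_split": len(reference_keys - split_keys),
    }
```

An empty missing_from_reference list together with a non-zero n_not_in_split is the pattern you would expect if the training set is a strict subset of the reference file.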