Using adsorbate data to train models


I was just a little unsure about how best to convert the adsorbate trajectories provided here to a form that could be easily used with the models and training pipelines provided in the ocp package (specifically for the IS2RE task). I assume I need to convert them to an LMDB file somehow similar to the existing training/val/test LMDBs? I tried to look through the existing scripts but didn’t see a straightforward way to do it. The preprocess_ef script seems to do something similar, but expects .txt files that aren’t present (and it also says it is for the S2EF task not the IS2RE task). Do I need to write a script myself to do this or is there a simpler way?

Hi -

I would start with the section on creating IS2RE LMDBs here. Treat this script as a backbone; you may need to modify it slightly depending on the data format (adsorbate trajectories, in this case).

In this case, the only thing to be aware of is the line

initial_struc.y_init = initial_struc.y # subtract off reference energy, if applicable

By default, the energy stored will be the raw DFT energy. To reference it in the same manner as OC20, refer to the system.txt file that gets downloaded with the data. For a particular system randomXXX.extxyz, system.txt will contain a line of the form randomXXX,ref_energy. Use this reference energy as follows:

for system in systems:
    # look up the ref_energy associated with {system} in system.txt
    ref_energy = ref_energies[system]  # ref_energies: hypothetical dict parsed from system.txt
    initial_struc.y_init = initial_struc.y - ref_energy
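
A minimal sketch of that lookup, assuming system.txt has no header row and that every non-empty line has the form randomXXX,ref_energy as described above. The helper names `load_ref_energies` and `referenced_energy` are made up for illustration and are not part of the ocp package:

```python
import csv


def load_ref_energies(path):
    """Map system id -> reference energy from a 'system_id,ref_energy' file."""
    ref_energies = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            ref_energies[row[0].strip()] = float(row[1])
    return ref_energies


def referenced_energy(raw_energy, system_id, ref_energies):
    """Subtract the per-system reference energy from a raw DFT energy."""
    return raw_energy - ref_energies[system_id]
```

With this in place, the loop body would reduce to something like initial_struc.y_init = referenced_energy(initial_struc.y, system, ref_energies), where ref_energies is loaded once before the loop.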

If you run into issues, let me know.

Okay, I’ll try that. As far as I can tell though, the randomXXX.extxyz file contains only one state of the system, rather than the full trajectory. Don’t I need both the initial and final states in order to build the input as I need the initial positions and relaxed energies?

Correct. randomXXX.extxyz, however, does indeed contain the full trajectory; you just need to read it in accordingly (which the code I shared does):

import ase.io

traj = ase.io.read("randomXXX.extxyz", ":")  # ":" reads every frame of the trajectory
initial_state = traj[0]
final_state = traj[-1]
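
As a quick sanity check that a file really holds more than one frame, here is a small dependency-free sketch. It assumes the standard (ext)xyz layout, where each frame is an atom-count line, one comment/properties line, and then one line per atom. The function name `count_xyz_frames` is hypothetical. As far as I know, ase.io.read returns only the last frame when no index is given, which would explain seeing a single state:

```python
def count_xyz_frames(path):
    """Count the frames in an (ext)xyz file by walking its frame headers."""
    frames = 0
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            n_atoms = int(header)
            handle.readline()  # comment / properties line
            for _ in range(n_atoms):
                handle.readline()  # one line per atom
            frames += 1
    return frames
```

If this returns a number greater than 1 for randomXXX.extxyz, the file holds a trajectory rather than a single structure, and reading it with ":" should give a list of that length.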

Oh, I see, I thought it was intended to be read with

Excuse me, I have a question. When I extract the catalyst + adsorbate energies and add up the systems across all the adsorbates you provide, I get only a little over 560,000 systems, not the 1.28 million listed in the OC20 mapping data. I hope you can help me understand this, or point me to another way of extracting the data. Sorry for the trouble.

Hi - Sorry, can you provide more details as to what data and mapping file you’re comparing so I can better assist you?

Hello, I added up the systems corresponding to each of these adsorbates separately (ocp/ at master · Open-Catalyst-Project/ocp · GitHub), and the total does not match the number extracted from https: //. Did I choose the wrong location when extracting the energies of the catalyst + adsorbate systems? I hope that makes sense.

The mapping file, oc20_data_mapping.pkl, contains the metadata for every data point in the dataset (train, val, and test). The adsorbate+catalyst data you pointed to, however, are the full, raw trajectories of all data except the test data. Because the raw trajectories contain the energy and force targets, the test data was intentionally left out. As a result, the number of systems here is less than what you’ll find in the mapping file.

I hope this answers your question. If not, please provide more details as to what numbers you’re comparing exactly, etc.

I see what you mean, can you provide assistance with the catalyst + adsorbate extraction method for all the data? Thank you very much for your support

Do you mean how to generally read the data provided here, or how to retrieve the test systems as well?

If you are referring to the test systems, we do not release that data: it contains the ground-truth energies and forces, and releasing it would be unfair to the public challenge. If you’re looking for test data with no energies/forces attached, let me know and I can clarify more details.

Hello, I want to extract the energies of all the catalyst + adsorbate systems in the OC20 mapping file, because I want to reproduce the ~872,000 adsorption energy calculations reported in the literature. Can you provide the extraction method?

I’ll describe the splits to better explain. The OC20 mapping file has 1.28M entries, which correspond to all the adsorbate+catalyst systems in OC20. Of these, ~872,000 had successfully converged relaxed states for which a true adsorption energy can be calculated; this data is used for IS2RE. Of the 872,000, ~460,000 were used for training and ~100,000 for validation. The energies for this data are public: you can download them here, or, if you want them organized by adsorbate, here. Both are ~560,000 total, as you mentioned earlier. The remaining data is either part of the test set or intentionally left out for future use, so it is not available to share.

The additional data you’ll need to reproduce the adsorption energies are the bare slab relaxed energies, which you can find here. The gas phase reference energies can be taken from the appendix of the paper.

To extract the relaxed energy from the downloaded data -

import ase.io

traj = ase.io.read("randomXXX.extxyz", ":")  # where XXX is a unique identifier found in the mapping file
relaxed_energy = traj[-1].get_potential_energy()
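
Once the three pieces are in hand, combining them is a single subtraction. Below is a minimal sketch, assuming the usual definition of adsorption energy as the relaxed adsorbate+slab energy minus the bare slab energy minus the gas phase reference energy. The function name and the numbers in the example are placeholders, not values from OC20:

```python
def adsorption_energy(relaxed_energy, slab_energy, gas_reference_energy):
    """E_ads = E(adsorbate + slab) - E(bare slab) - E(gas phase reference)."""
    return relaxed_energy - slab_energy - gas_reference_energy


# Placeholder energies in eV, purely illustrative
example = adsorption_energy(
    relaxed_energy=-105.0,       # traj[-1].get_potential_energy()
    slab_energy=-100.0,          # from the bare slab relaxed energies
    gas_reference_energy=-4.0,   # from the appendix of the paper
)
```

All three inputs need to come from the same level of theory and use the same units for the result to be meaningful.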

I hope this clarifies it!

I see what you mean, thank you very much for your help.