Extract Specific Structure Information from OC22

Dear OC22 Team,

I am interested in extracting DFT-relaxed slab structures of oxides from the dataset. The primary data I require are the relaxed structural positions (pos_relaxed - x, y, z Cartesian coordinates of all atoms in the system) of these oxide slabs, which I intend to use for additional computations to build our proprietary database. I understand that the LMDBs in the Initial Structure to Relaxed Structure (IS2RS) task contain the structural information I seek. Additionally, I am aware of the Python pickle file provided for OC22 Mappings, which includes information about slabs and adsorbates for each system in the dataset. From my understanding, the system-ids (sids) in this dictionary should correspond to unique structures within the OC22 dataset, correct?

While I have been able to navigate through the LMDBs and transform the structures, an issue has arisen. It appears that the sids in the OC22 mappings do not correspond to the sids in the IS2RS dataset; they seem to reference entirely different structures. The extracted structures are not what I expected.

For example, here is the information for one structure I am interested in, taken from the OC22 mapping pickle file:

data = {
    'System ID': 42414,
    'Bulk ID': 'mp-1336',
    'Miller Index': (1, 1, 3),
    'Number of Adsorbates': 0,
    'Trajectory ID': '90423983501299385769012_PdO_mp-1336_clean_CafZ0MJQk8',
    'Bulk Symbols': 'Pd2O2',
    'Slab SID': 'Unknown',
    'Adsorbate Symbols': 'Unknown',
}

Now, I would like to extract the DFT-relaxed atomic coordinates of this structure. How can I locate the structure in any OC22 LMDB based on its sid (‘System ID’: 42414)?

This discrepancy has left me puzzled and unsure of how to proceed. Is there a comprehensive LMDB for OC22 containing all DFT-relaxed structure information where I can accurately find the corresponding structures using the sids from OC22 mappings? Any guidance or assistance you could provide on this matter would be greatly appreciated, as I am currently at an impasse.

Thank you for your time and consideration. I look forward to any insights you may be able to offer.

Hi. I’m also a user of OCP models. I’m not sure this is what you are looking for, but you could give it a try.
You can access the structures as ASE Atoms objects from the trajectory files. Here is an example code snippet that I hope helps:

import numpy as np
import pandas as pd
from tqdm import tqdm
import copy
import pickle
import matplotlib.pyplot as plt
import ase.io

# define the base directory (Google Drive for Colab or local computer)
try:
    import google.colab
    IN_COLAB = True
    PATH = '/content/drive/MyDrive/'
except ImportError:
    IN_COLAB = False
    PATH = 'path_in_your_local_system/'  # keep the trailing slash

# load only one trajectory file to see the structure of each file
traj_data = ase.io.read(PATH+"oc22_trajectories/trajectories/oc22/raw_trajs/1001022_Al2-WO4-3_mvc-14989_clean_DAf9CVLas7.traj", ":")

# take the last frame of the trajectory, i.e. the relaxed structure
t = traj_data[-1]

# visualize the structure
from ase.visualize.plot import plot_atoms
fig, ax = plt.subplots()
plot_atoms(t, ax, rotation='-80x, 0y, 0z')

Here is the image of the structure of Al16O72W20:

This notebook, which calculates adsorption energies, may also be helpful.
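The snippet above can be extended into a rough sketch that goes from a mapping entry to relaxed Cartesian coordinates. It assumes that each trajectory file is named after its Trajectory ID plus a ".traj" extension, which matches the file name used above but is not confirmed for every split. The helper names `traj_path_for` and `relaxed_positions` are hypothetical, and the dictionary keys are the ones shown in the question, which may differ from the keys in the actual pickle file.

```python
from pathlib import Path


def traj_path_for(entry, traj_dir):
    """Return the expected .traj path for a mapping entry.

    Returns None when the entry has no usable Trajectory ID.
    Assumption: the file name is "<Trajectory ID>.traj".
    """
    traj_id = entry.get("Trajectory ID", "Unknown")
    if traj_id == "Unknown":
        return None
    return Path(traj_dir) / f"{traj_id}.traj"


def relaxed_positions(traj_file):
    """Read the last frame of a trajectory and return its x, y, z positions."""
    import ase.io  # imported here so the path helper also works without ASE

    atoms = ase.io.read(str(traj_file), index=-1)  # last frame = relaxed structure
    return atoms.get_positions()


# Entry copied from the question above (keys as displayed there).
entry = {
    "System ID": 42414,
    "Bulk ID": "mp-1336",
    "Trajectory ID": "90423983501299385769012_PdO_mp-1336_clean_CafZ0MJQk8",
}

path = traj_path_for(entry, "oc22_trajectories/trajectories/oc22/raw_trajs")
print(path)
# With the trajectories downloaded, relaxed_positions(path) would return an
# (n_atoms, 3) NumPy array of Cartesian coordinates.
```

Returning None for an "Unknown" Trajectory ID keeps the missing-file case explicit instead of building a path that cannot exist.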


Hi, sedaoturak,

Thanks a lot. Yes, it worked. I successfully extracted the DFT-relaxed structures based on Traj_ID.


Hi sedaoturak,
I found that some traj_ids are “Unknown” in the OC22 mapping.

For example, I am interested in DFT-relaxed CeO2 slab structures. All of them look like the following case, in which the “Trajectory ID” is “Unknown”.
entry = {
    "System ID": 82010,
    "Bulk ID": "mp-20194",
    "Miller Index": (1, 0, 0),
    "Number of Adsorbates": 0,
    "Trajectory ID": "Unknown",
    "Bulk Symbols": "Ce4O8",
    "Slab SID": "Unknown",
    "Adsorbate Symbols": "Unknown",
}
I do not know what this means. My guess is that these entries come from the test set, for which OC22 does not provide DFT-relaxed structures. Is that right? I still need the structure, though. How can I extract it (initial or relaxed)? The only available identifier is the sid, correct? How should such cases be handled? Do you have any suggestions?

Thanks for your time. I look forward to your reply.
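To see how many entries for a given bulk material are affected, the mapping can be split into systems with and without a usable Trajectory ID. The following is a minimal sketch on an in-memory list built from the two entries quoted in this thread. The function name `split_by_traj_availability` is hypothetical, and the keys are the ones displayed in the posts, so they may need adapting to the real pickle file.

```python
def split_by_traj_availability(entries, bulk_id):
    """Split the System IDs of one bulk material by Trajectory ID availability.

    Returns (known, unknown): System IDs with a real Trajectory ID and
    System IDs whose Trajectory ID is "Unknown".
    """
    matches = [e for e in entries if e["Bulk ID"] == bulk_id]
    known = [e["System ID"] for e in matches if e["Trajectory ID"] != "Unknown"]
    unknown = [e["System ID"] for e in matches if e["Trajectory ID"] == "Unknown"]
    return known, unknown


# The two entries quoted in this thread, reduced to the fields used here.
entries = [
    {
        "System ID": 42414,
        "Bulk ID": "mp-1336",
        "Trajectory ID": "90423983501299385769012_PdO_mp-1336_clean_CafZ0MJQk8",
    },
    {
        "System ID": 82010,
        "Bulk ID": "mp-20194",
        "Trajectory ID": "Unknown",
    },
]

ceo2_known, ceo2_unknown = split_by_traj_availability(entries, "mp-20194")
print("with trajectory:", ceo2_known)
print("without trajectory:", ceo2_unknown)
```

Running the same function over the full mapping would show whether every entry for a bulk ID lacks a trajectory or only some of them.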

I checked the traj files (train, val_id, val_ood), but I didn’t see any file of Ce4O8 either. So I’m not sure where its structure and its calculated energy are.

If I see something related to this, I’ll update here.
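For anyone repeating this check, one way to search the downloaded trajectory folders is to match the bulk ID that appears in each file name. This is a sketch under the assumption that file names follow the pattern seen above, with the bulk ID between underscores (for example "_mvc-14989_"). The function name `find_trajs_for_bulk` and the demo files are made up for illustration.

```python
import tempfile
from pathlib import Path


def find_trajs_for_bulk(root, bulk_id):
    """Return all .traj files under root whose name contains _<bulk_id>_."""
    token = f"_{bulk_id}_"
    return sorted(p for p in Path(root).rglob("*.traj") if token in p.name)


# Small self-contained demo on empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "raw_trajs"
    demo_dir.mkdir()
    (demo_dir / "1001022_Al2-WO4-3_mvc-14989_clean_DAf9CVLas7.traj").touch()
    (demo_dir / "90423983501299385769012_PdO_mp-1336_clean_CafZ0MJQk8.traj").touch()

    pdo_hits = [p.name for p in find_trajs_for_bulk(tmp, "mp-1336")]
    ceo2_hits = [p.name for p in find_trajs_for_bulk(tmp, "mp-20194")]

print(pdo_hits)
print(ceo2_hits)
```

Matching on the underscore-delimited token avoids false hits such as "mp-13360" when searching for "mp-1336".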

Yes, it is missing from the trajectory files (train, val_id, val_ood) because its traj_id is “Unknown”. I could not find any CeOx in them either. It does, however, exist in the OC22 mapping file. I am interested in the DFT-relaxed CeOx structures because they play an important role in my project. Is it possible to get their DFT-relaxed structures from OC22, given that their sids appear in the mapping?

Thanks for your time.