Welcome to the discussion board!

Welcome to the Open Catalyst discussion forum! Feel free to ask any questions about the project, data, models, etc. Questions about the code may be posted on our GitHub page.

More information about the project can be found at opencatalystproject.org.

Best,
The Open Catalyst team

Hi OCP-team, I am trying to understand the IS2RE-Total data and the associated pickle dictionary.
It’s mentioned that for every OC22 dataset, the adsorbate+slab info can be obtained from the pickle dictionary file with a key. Can you specify where to get this key? I notice that the ‘sid’ of an entry in the lmdb file is not the key for the pickle dictionary. For example, an entry from the lmdb looks like this: ‘Data(edge_index=[2, 4774], pos=[113, 3], cell=[1, 3, 3], atomic_numbers=[113], natoms=113, cell_offsets=[4774, 3], force=[113, 3], distances=[4774], fixed=[113], sid=1189586, tags=[113], y_init=1.767500509999877, y_relaxed=-5.01559060000011, pos_relaxed=[113, 3])’

I simply cannot use the sid of the above data to get information on the adsorbate+slab system from the pickle dictionary by ‘dictionary[sid]’. So, how are the data related to adsorbate+slab information and what’s the key?

Thanks!
rajG

Hi -

To clarify, the data you shared is from OC20, not OC22. For OC20, the key used to index this table must be prefixed with “random”, so in the example you shared you would do dictionary["random1189586"].

If you end up exploring the OC22 dataset, you would index it directly with the sid; no prefix is needed. All details are outlined in the respective metadata sections:

OC20: https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md#oc20-mappings
OC22: https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md#oc22-mappings
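
To make the two lookup rules concrete, here is a minimal sketch. The helper names (mapping_key, lookup_metadata) and the toy dictionaries are hypothetical and only stand in for the real mapping files; it also assumes the OC22 key is the integer sid exactly as stored on the Data object.

```python
import pickle


def mapping_key(sid, dataset):
    """Return the key for the metadata mapping, given a system id (sid)."""
    if dataset == "oc20":
        # OC20: the sid must be prefixed with "random".
        return f"random{sid}"
    if dataset == "oc22":
        # OC22: index directly with the sid, no prefix.
        # Assumption: the key is the integer sid as it appears on the Data object.
        return sid
    raise ValueError(f"unknown dataset: {dataset!r}")


def lookup_metadata(mapping, sid, dataset):
    """Fetch the metadata entry for one system from a loaded mapping."""
    return mapping[mapping_key(sid, dataset)]


# Toy stand-ins for the real mapping files; the entry contents are placeholders.
# With a real file you would load it instead, for example:
#     with open(path_to_mapping_pickle, "rb") as f:
#         mapping = pickle.load(f)
toy_oc20 = pickle.loads(pickle.dumps({"random1189586": {"note": "placeholder entry"}}))
toy_oc22 = pickle.loads(pickle.dumps({29213: {"note": "placeholder entry"}}))

oc20_entry = lookup_metadata(toy_oc20, 1189586, "oc20")
oc22_entry = lookup_metadata(toy_oc22, 29213, "oc22")
print(oc20_entry)
print(oc22_entry)
```

If a lookup raises KeyError, a quick check is to print a few keys of the loaded mapping and compare their form with the sid on the Data object.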

Thank you, now understood :+1: Some small queries to clarify if possible:

  1. Is the adsorbate species not always specified in the pickle file? If not, is there another way to get the adsorbate species?

For example, corresponding to this “Data(pos=[82, 3], cell=[1, 3, 3], atomic_numbers=[82], natoms=82, fixed=[82], tags=[82], nads=2, y_relaxed=-366.69039128, pos_relaxed=[82, 3], sid=29213, id=‘0_1000’, oc22=1)”, we can get the info as “{‘bulk_id’: ‘mp-656887’, ‘miller_index’: (2, 0, 3), ‘nads’: 2, ‘traj_id’: ‘Ni3O4_mp-656887_lguoUErSVy_8Nvu8707hy’, ‘bulk_symbols’: ‘Ni6O8’, ‘slab_sid’: 40130, ‘ads_symbols’: ‘O’}”. But the ‘ads_symbols’ is not always present.

For example, corresponding to this " Data(pos=[46, 3], cell=[1, 3, 3], atomic_numbers=[46], natoms=46, fixed=[46], tags=[46], nads=0, y_relaxed=-256.74895213, pos_relaxed=[46, 3], sid=45992, id=‘0_10’, oc22=1) ", the ‘ads_symbols’ is not there: " {‘bulk_id’: ‘mp-2697’, ‘miller_index’: (3, 1, 0), ‘nads’: 0, ‘traj_id’: ‘90423983534_SrO2_mp-2697_clean_4PfBdIquvC’, ‘bulk_symbols’: ‘Sr2O4’} "

  2. Can you explain the parts of the ‘traj_id’? For example, " ‘traj_id’: ‘90423983534_SrO2_mp-2697_clean_4PfBdIquvC’ ". (i) Does ‘SrO2’ mean something in reference to the bulk, which is Sr2O4? (ii) Does the string ‘4PfBdIquvC’ carry any information?

  3. Is there any reference material or link explaining the fixed, tags, and cell tensors (unit cell for PBC?) of the dataset?

Thanks again.

Hi -

  1. Correct, the adsorbate is not always present. In OC22 the dataset contains both clean slabs and adsorbate+slabs. So if you don’t see “ads_symbols” it then corresponds to a clean slab system.

  2. The traj_id has no meaningful information that isn’t already contained in the mapping file. It is merely a unique identifier for us to keep track of the corresponding outputs. The specific naming corresponds to the names of the actual trajectories that can be found here if needed - ocp/DATASET.md at main · Open-Catalyst-Project/ocp · GitHub.

  3. Of course. See this tutorial for a better understanding of some of the dataset attributes - ocp/data_visualization.ipynb at main · Open-Catalyst-Project/ocp · GitHub.
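
For point 1, the clean-slab check can be sketched as follows. The two entries are the ones quoted earlier in this thread; the helper names (adsorbate_symbols, is_clean_slab) are hypothetical and not part of the ocp codebase.

```python
def adsorbate_symbols(entry):
    """Return the adsorbate symbols, or None when the entry is a clean slab."""
    return entry.get("ads_symbols")


def is_clean_slab(entry):
    """An OC22 mapping entry without "ads_symbols" corresponds to a clean slab."""
    return "ads_symbols" not in entry


# Metadata entries copied from the posts above.
with_adsorbate = {
    "bulk_id": "mp-656887",
    "miller_index": (2, 0, 3),
    "nads": 2,
    "traj_id": "Ni3O4_mp-656887_lguoUErSVy_8Nvu8707hy",
    "bulk_symbols": "Ni6O8",
    "slab_sid": 40130,
    "ads_symbols": "O",
}
clean_slab = {
    "bulk_id": "mp-2697",
    "miller_index": (3, 1, 0),
    "nads": 0,
    "traj_id": "90423983534_SrO2_mp-2697_clean_4PfBdIquvC",
    "bulk_symbols": "Sr2O4",
}

for entry in (with_adsorbate, clean_slab):
    kind = "clean slab" if is_clean_slab(entry) else "adsorbate+slab"
    print(entry["bulk_id"], kind, adsorbate_symbols(entry))
```

Using entry.get rather than entry["ads_symbols"] avoids a KeyError on clean slabs, so the same loop can run over the whole mapping.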

This was superbly helpful. In particular, the file ocp/data_visualization.ipynb contains a lot of useful information.

Hello, I am finding it difficult to upload to the eval server the datasets (train, validation, test) that I generated from the lmdb files as .csv files, which are then used in the training/validation/testing scripts to get the MAE/EwT metrics. I constructed new descriptors to transform the lmdb data (train/validation/test) into .csv files, and I carry out the training, validation, and testing steps on the files containing the transformed data. Are there any basic, example-grade guidelines on how to upload the derived datasets and the relevant scripts to the eval server? This link does not clearly say what to do, or how to proceed, when transformed datasets in .csv form are used: ocp/train.md at release · Open-Catalyst-Project/ocp · GitHub. Any detailed guidance would be highly appreciated. Thank you.