Welcome to the discussion board!

lzitnick · January 6, 2021, 10:03pm

Welcome to the Open Catalyst discussion forum! Feel free to ask any questions about the project, data, models, etc. Questions about the code may be posted on our GitHub page.

More information about the project can be found at opencatalystproject.org.

Best,
The Open Catalyst team

rajarshiche · September 16, 2022, 6:15am

Hi OCP team, I am trying to understand the IS2RE-Total data and the associated pickle dictionary.
It’s mentioned that for every OC22 dataset, the adsorbate+slab info can be obtained from the pickle dictionary file with a key. Can you specify where to get this key? I notice that the ‘sid’ of an entry in the lmdb file is not the key for the pickle dictionary. For example, an entry from the lmdb looks like this: ‘Data(edge_index=[2, 4774], pos=[113, 3], cell=[1, 3, 3], atomic_numbers=[113], natoms=113, cell_offsets=[4774, 3], force=[113, 3], distances=[4774], fixed=[113], sid=1189586, tags=[113], y_init=1.767500509999877, y_relaxed=-5.01559060000011, pos_relaxed=[113, 3])’

I cannot use the sid of the above entry to get information on the adsorbate+slab system from the pickle dictionary via ‘dictionary[sid]’. So how is the data related to the adsorbate+slab information, and what is the key?

Thanks!
rajG

mshuaibi · September 16, 2022, 4:47pm

Hi -

To clarify, the data you shared isn’t OC22; it is OC20. For OC20, the key used to index this table must be prefixed with “random”, so in the example you shared you would do dictionary["random1189586"].

If you end up exploring the OC22 dataset, you would index that directly with the sid; no prefix is needed. All details are outlined in the respective metadata sections:

OC20: https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md#oc20-mappings
OC22: https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md#oc22-mappings
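
The two lookup rules above can be sketched as follows. This is only a minimal illustration, assuming the mapping pickles have already been loaded into plain Python dictionaries. The helper names (mapping_key, lookup_metadata), the toy dictionaries, and the sid 12345 are hypothetical, and treating the OC22 key as the sid exactly as stored on the data object is an assumption.

```python
def mapping_key(sid, dataset):
    """Build the mapping key for a system id (hypothetical helper)."""
    if dataset == "oc20":
        # OC20: the sid must be prefixed with "random".
        return f"random{sid}"
    if dataset == "oc22":
        # OC22: the sid is used directly, no prefix.
        return sid
    raise ValueError(f"unknown dataset: {dataset!r}")


def lookup_metadata(mapping, sid, dataset):
    """Return the metadata entry for a given sid (hypothetical helper)."""
    return mapping[mapping_key(sid, dataset)]


# Toy stand-ins for the loaded pickle dictionaries (not real metadata).
oc20_mapping = {"random1189586": {"note": "toy placeholder for an OC20 entry"}}
oc22_mapping = {12345: {"note": "toy placeholder for an OC22 entry"}}

print(lookup_metadata(oc20_mapping, 1189586, "oc20"))
print(lookup_metadata(oc22_mapping, 12345, "oc22"))
```

With the real files, the toy dictionaries would be replaced by the dictionaries loaded from the mapping pickles described in DATASET.md.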

rajarshiche · September 16, 2022, 7:22pm

Thank you, now understood. Some small queries to clarify, if possible:

1. Is the adsorbate species not always specified in the pickle file? If not, is there any way to get the adsorbate species?

For example, corresponding to this “Data(pos=[82, 3], cell=[1, 3, 3], atomic_numbers=[82], natoms=82, fixed=[82], tags=[82], nads=2, y_relaxed=-366.69039128, pos_relaxed=[82, 3], sid=29213, id=‘0_1000’, oc22=1)”, we can get the info as “{‘bulk_id’: ‘mp-656887’, ‘miller_index’: (2, 0, 3), ‘nads’: 2, ‘traj_id’: ‘Ni3O4_mp-656887_lguoUErSVy_8Nvu8707hy’, ‘bulk_symbols’: ‘Ni6O8’, ‘slab_sid’: 40130, ‘ads_symbols’: ‘O’}”. But the ‘ads_symbols’ is not always present.

For example, corresponding to this “Data(pos=[46, 3], cell=[1, 3, 3], atomic_numbers=[46], natoms=46, fixed=[46], tags=[46], nads=0, y_relaxed=-256.74895213, pos_relaxed=[46, 3], sid=45992, id=‘0_10’, oc22=1)”, the ‘ads_symbols’ is not there: “{‘bulk_id’: ‘mp-2697’, ‘miller_index’: (3, 1, 0), ‘nads’: 0, ‘traj_id’: ‘90423983534_SrO2_mp-2697_clean_4PfBdIquvC’, ‘bulk_symbols’: ‘Sr2O4’}”

2. Do the parts of the ‘traj_id’ have a meaning? For example, “‘traj_id’: ‘90423983534_SrO2_mp-2697_clean_4PfBdIquvC’”. (i) Does ‘SrO2’ mean something in reference to the bulk, which is Sr2O4? (ii) Does the string ‘4PfBdIquvC’ carry any information?

3. Is there any reference material or link explaining the fixed, tags and cell tensors (unit cell for PBC?) of the dataset?

Thanks again.

mshuaibi · September 21, 2022, 12:16am

Hi -

1. Correct, the adsorbate is not always present. The OC22 dataset contains both clean slabs and adsorbate+slabs, so if you don’t see “ads_symbols”, the entry corresponds to a clean slab system.
2. The traj_id has no meaningful information that isn’t already contained in the mapping file. It is merely a unique identifier for us to keep track of the corresponding outputs. The specific naming corresponds to the names of the actual trajectories, which can be found in DATASET.md if needed: https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md
3. Of course. See this tutorial for a better understanding of some of the dataset attributes: ocp/data_visualization.ipynb (main branch of the Open-Catalyst-Project/ocp GitHub repository).
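
To illustrate the clean-slab rule above, here is a minimal sketch, assuming each mapping entry is a plain Python dictionary like the two quoted earlier in this thread. The helper names (is_clean_slab, adsorbate_of) are hypothetical and not part of the ocp codebase.

```python
def is_clean_slab(entry):
    """A mapping entry without "ads_symbols" is a clean slab (hypothetical helper)."""
    return "ads_symbols" not in entry


def adsorbate_of(entry):
    """Return the adsorbate symbols, or None for a clean slab (hypothetical helper)."""
    return entry.get("ads_symbols")


# The two mapping entries quoted earlier in this thread.
ads_entry = {
    "bulk_id": "mp-656887",
    "miller_index": (2, 0, 3),
    "nads": 2,
    "traj_id": "Ni3O4_mp-656887_lguoUErSVy_8Nvu8707hy",
    "bulk_symbols": "Ni6O8",
    "slab_sid": 40130,
    "ads_symbols": "O",
}
clean_entry = {
    "bulk_id": "mp-2697",
    "miller_index": (3, 1, 0),
    "nads": 0,
    "traj_id": "90423983534_SrO2_mp-2697_clean_4PfBdIquvC",
    "bulk_symbols": "Sr2O4",
}

for entry in (ads_entry, clean_entry):
    kind = "clean slab" if is_clean_slab(entry) else "adsorbate+slab"
    print(entry["bulk_symbols"], kind, adsorbate_of(entry))
```

Using dict.get avoids a KeyError on clean slabs, so the same loop can run over a mixed set of entries.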

rajarshiche · September 22, 2022, 3:20am

This was superbly helpful. In particular, the file ocp/data_visualization.ipynb contains a lot of useful information.

rajarshiche · October 5, 2022, 3:36am

Hello, I am finding it difficult to work out how to upload to the eval server datasets (train, validation, test) that were generated from the lmdb files as .csv files. I constructed new descriptors to transform the lmdb data (train/validation/test) into .csv files, and I then run the training/validation/testing steps on those files to get the MAE/EwT metrics. Are there any basic, example-grade guidelines on how to upload the derived datasets and the relevant scripts to the eval server? This link does not clearly explain how to proceed when transformed datasets in .csv form are used: ocp/train.md at release · Open-Catalyst-Project/ocp · GitHub. Any detailed guidance would be highly appreciated. Thank you.
