Clarification questions about `sid` field

Hi there, thanks for providing this amazing huge dataset! I’m still getting oriented and understanding how everything corresponds to everything else, and I have a couple questions I haven’t been able to figure out the answers to:

  1. Should every entry in the LMDB files have an entry in the sid field? It seems that when I load it I don’t see sids for the train portions, obviously a large segment of the dataset…I’m not sure if it’s supposed to be that way or if I’ve done something wrong, or perhaps have a damaged version of the file…

  2. In the pickle files with mappings, do those system IDs correspond to the same sid from the LMDB (prepended by random, I guess) or to something else? When I tried just prepending random to one of the sid values I could see, I got a KeyError, so I suspect I’m doing something wrong…maybe the trajectory files have different IDs than the LMDBs? If that’s the case, how do I match them to each other?

Any insights would be greatly appreciated – thanks in advance!

(For additional context: the LMDBs I’m referring to are the ones for the IS2RE task, and I’m loading them with the SinglePointLmdbDataset class)

Hi -

  1. We actually recently updated the datasets due to some minor issues: ocp/DATASET.md at master · Open-Catalyst-Project/ocp · GitHub. As part of the reupload we addressed this and made sure to include sids for all train data points as well.

  2. Correct, these correspond to the same sids as those in the LMDB, prepended by random. Trajectories have the same sids as those in the LMDBs, so there should be no confusion there. This mapping was also recently updated; try redownloading it and see if that works: ocp/DATASET.md at master · Open-Catalyst-Project/ocp · GitHub. If you’re still running into issues, send me the sid and I can take a look.
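
A minimal sketch of the lookup described in point 2, for illustration only. It assumes the LMDB stores `sid` as a bare integer and that the mapping pickle is a plain dict keyed by strings of the form `random<sid>`; the helper names `load_mapping`, `to_mapping_key`, and `lookup` are hypothetical, and the toy dict merely stands in for a real mapping file.

```python
import pickle


def load_mapping(path):
    """Load a mapping pickle into a dict (path is wherever you saved the download)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def to_mapping_key(sid):
    """Return the 'random<sid>' form; accepts an int or an already-prefixed string."""
    text = str(sid)
    return text if text.startswith("random") else f"random{text}"


def lookup(mapping, sid):
    """Look up a sid, raising an error that hints at a stale download."""
    key = to_mapping_key(sid)
    if key not in mapping:
        raise KeyError(f"{key} not found; the mapping file may predate the re-upload")
    return mapping[key]


# Toy stand-in for a loaded mapping (assumption: real keys and values will differ).
toy_mapping = {"random31": "random31", "random42": "random7"}
print(lookup(toy_mapping, 42))
```

If a lookup like this still raises a KeyError against the re-uploaded files, that sid is the one worth reporting.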

Sorry for the confusion this caused, let me know if this works out for you.

Ah okay, thanks so much for the quick reply! I had indeed downloaded before the update, so hopefully this will fix things!

Another followup on this: in these mapping pickles, it seems that the SID indicated as the “slab” usually does have something adsorbed on it? For example, the mapping_adslab_slab.pkl file has an entry at ‘random31’ that also reads ‘random31’ which I would take to mean that this entry is in fact a bare slab. However, the associated structure has *OHCH3 adsorbed onto it. So what is that mapping pickle actually saying?

Related: are the values in the ‘energy’ field the total DFT energies, or adsorption energies relative to these bare slabs?

Not quite: adslabs and slabs each have their own independent set of random IDs. So in the example you provided, “random31” mapping to “random31”, there are two “random31” systems, one in the adslab trajectories and one in the slab trajectories. The one you want to inspect is the one in the slab trajectories (ocp/DATASET.md at master · Open-Catalyst-Project/ocp · GitHub), which in fact contains no adsorbate (verified on my end).
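
To make the two separate ID spaces concrete, here is a small sketch under stated assumptions: the dicts are toy stand-ins (the real data are trajectory files, not strings), and `slab_for_adslab` is a hypothetical helper. The point is only that the mapping's key is read in the adslab set while its value is read in the slab set.

```python
# Toy stand-ins for the two trajectory sets; the same ID string appears in
# both, but refers to different systems (assumed layout, for illustration).
adslab_systems = {"random31": "slab with *OHCH3 adsorbed"}
slab_systems = {"random31": "bare slab"}

# Entry as described for mapping_adslab_slab.pkl: adslab ID -> slab ID.
adslab_to_slab = {"random31": "random31"}


def slab_for_adslab(adslab_sid, mapping, slab_store):
    """Resolve an adslab ID to its bare slab by looking in the slab set only."""
    slab_sid = mapping[adslab_sid]  # value is an ID in the slab namespace
    return slab_store[slab_sid]


print(adslab_systems["random31"])  # the adsorbate-covered system
print(slab_for_adslab("random31", adslab_to_slab, slab_systems))  # the bare slab
```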

I’m assuming you’re referring to the LMDBs here, but yes, those energies are referenced to the bare slabs and the gas-phase references of the adsorbates. If you wanted raw DFT energies, you could get those from the raw ASE trajectories: ocp/DATASET.md at master · Open-Catalyst-Project/ocp · GitHub. Alternatively, we could upload a reference mapping so that someone can easily obtain those values without needing to download the full trajectories, if that’s preferred?
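
As a rough illustration of what "referenced to bare slabs and gas-phase references" means, the usual convention is adslab energy minus slab energy minus adsorbate gas-phase reference energy. The sketch below assumes that convention; the function name and the numbers are made up for the example and are not taken from the dataset.

```python
def adsorption_energy(e_adslab, e_slab, e_gas_ref):
    """Referenced energy: E_adslab - E_slab - E_gas_ref (all in eV)."""
    return e_adslab - e_slab - e_gas_ref


# Made-up raw DFT totals in eV, purely illustrative.
e_ads = adsorption_energy(e_adslab=-310.5, e_slab=-300.0, e_gas_ref=-9.0)
print(f"{e_ads:.2f} eV")
```

Going the other way, recovering a raw total from a referenced value would need the slab and gas-phase reference energies, which is what the proposed reference mapping would supply.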

Hope this helps!

Ahhh okay that clarifies things. Would be nice if this could be a bit more explicit in the README; at least to me it wasn’t clear. Thanks again for the quick response!