LmdbDataset has no attribute "data"

self.normalizers["target"] = Normalizer(
    tensor=self.train_loader.dataset.data.y[
        self.train_loader.dataset.__indices__
    ],
    device=self.device,
)

The above code snippet is from the base_trainer.py file.
When running the S2EF model on the 200k dataset with normalization set to true (I have not provided the mean and standard deviation values), an error arises saying that LmdbDataset has no attribute “data”, even though it does.
Any possible reasons and fixes for this?

Thanks in advance

Hi -

I was able to reproduce this on my end. It does indeed appear to be a bug: our data is not loaded into memory, so you’re unable to access data.y. If you’re using the 200k dataset, I’d advise specifying the mean and standard deviation values we’ve provided in our configs (sketched below). Alternatively, you can set the normalization flag to false. We’ll be resolving this bug in a PR that is soon to be merged in - Unified OCP Trainer by mshuaibii · Pull Request #520 · Open-Catalyst-Project/ocp · GitHub.
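For reference, the relevant dataset section of an S2EF config looks roughly like this. The numeric values below are placeholders, not the actual 200k statistics; copy the real numbers from the configs shipped in the repo:

dataset:
  - src: path/to/s2ef/200k/train
    normalize_labels: True
    target_mean: -0.75      # placeholder; use the value from our configs
    target_std: 2.89        # placeholder
    grad_target_mean: 0.0   # placeholder
    grad_target_std: 2.89   # placeholder

Setting normalize_labels: False here is the alternative if you’d rather skip normalization entirely.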

Oh, okay. Actually, I want to run the model on my own S2EF LMDB dataset, which is why having the trainer compute the values from the tensor, without my calculating them, would have been very convenient.
But okay, I will go with the other method.

Thanks a lot for the info!!

I see. I would recommend computing the statistics before running the model. Something like the following is what we typically do. Depending on your dataset size, you don’t need to loop through the entire dataset; a good sample of it is enough (see the sampling sketch after the snippet).

import numpy as np

from ocpmodels.datasets import LmdbDataset

# Load the training LMDB dataset.
dataset = LmdbDataset({"src": "path/to/your/data"})

# Collect the target energies from every sample.
y = []
for data in dataset:
    y.append(data.y)

# These go into the config as the target mean and std.
mean = np.mean(y)
stdev = np.std(y)
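If your dataset is large, a random subsample along these lines should give stable statistics. This is just a sketch; the sample size of 10,000 is an arbitrary choice, not something we prescribe:

import random

# Draw a random subset of indices instead of iterating the full dataset.
indices = random.sample(range(len(dataset)), k=min(10_000, len(dataset)))
y = [dataset[i].y for i in indices]

mean = np.mean(y)
stdev = np.std(y)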

Yes, I did try this earlier, but I didn’t see any improvement in the results compared to the scenario where normalization is set to false. Hence, I assumed there was something wrong with what I’d calculated, so I thought of the other way…
Also, a quick question: the dataset you mentioned above for calculating the mean and std refers to the training dataset, right? Including the validation set would cause some data leakage, I suppose.

That sounds consistent. We don’t see much improvement with versus without normalization, so it’s unlikely you did anything wrong.

As for the statistics, yes, we compute them from the training set.

Okay
Thank you so much!