LmdbDataset has no attribute "data"

Sejal2002 · November 17, 2023, 3:15pm

self.normalizers["target"] = Normalizer(
    tensor=self.train_loader.dataset.data.y[
        self.train_loader.dataset.__indices__
    ],
    device=self.device,
)

The code snippet above is from the base_trainer.py file.
When I run the S2EF model on the 200k dataset with normalization set to true (without providing the mean and standard deviation values), I get an error saying that LmdbDataset has no attribute "data", even though, as far as I can tell, it does.
What could be causing this, and how can I fix it?

Thanks in advance

mshuaibi · November 17, 2023, 3:42pm

Hi -

I was able to reproduce this on my end. It does appear to be a bug: our data is not loaded into memory, so data.y cannot be accessed. If you’re using the 200k dataset, I advise you to specify the values we’ve provided in our configs. Alternatively, you can set the normalization flag to false. We’ll be resolving this bug in a PR that will be merged soon: Unified OCP Trainer by mshuaibii, Pull Request #520, Open-Catalyst-Project/ocp on GitHub.
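
To illustrate the two workarounds, here is a minimal sketch of how a trainer might choose between user-supplied statistics and no normalization. The key names normalize_labels, target_mean and target_std are written from memory of the OCP dataset configs and are an assumption to check against your version. The helper resolve_target_stats and the numeric values are hypothetical.

```python
from typing import Optional, Tuple


def resolve_target_stats(dataset_cfg: dict) -> Optional[Tuple[float, float]]:
    """Return (mean, std) for target normalization, or None if it is disabled.

    Hypothetical helper; the key names are assumed, not confirmed in this thread.
    """
    if not dataset_cfg.get("normalize_labels", False):
        return None  # normalization switched off
    if "target_mean" in dataset_cfg and "target_std" in dataset_cfg:
        return float(dataset_cfg["target_mean"]), float(dataset_cfg["target_std"])
    # Without explicit values the statistics would have to come from targets
    # held in memory, which an on-disk LMDB dataset does not expose.
    raise ValueError("normalize_labels is set but target_mean/target_std are missing")


# Placeholder values, not the published 200k statistics.
explicit_cfg = {
    "src": "path/to/your/data",
    "normalize_labels": True,
    "target_mean": -1.0,
    "target_std": 2.0,
}
disabled_cfg = {"src": "path/to/your/data", "normalize_labels": False}

explicit_stats = resolve_target_stats(explicit_cfg)
disabled_stats = resolve_target_stats(disabled_cfg)
```

Either config avoids the branch that needs dataset.data.y: the first supplies the statistics directly, the second skips normalization altogether.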

Sejal2002 · November 17, 2023, 3:50pm

Oh, okay. I actually want to run the model on my own S2EF LMDB dataset, which is why passing the tensor without calculating the values myself would have been very convenient.
But okay, I will go with the other method.

Thanks a lot for the info!!

mshuaibi · November 17, 2023, 3:57pm

I see. I would recommend computing the statistics before running the model. Something like the following is what we typically do. Depending on your dataset size, you don’t need to loop through the entire dataset; a good sample of it is enough.

import numpy as np
from ocpmodels.datasets import LmdbDataset

dataset = LmdbDataset({"src": "path/to/your/data"})

# Collect the target value of every sample.
y = []
for data in dataset:
    y.append(data.y)

mean = np.mean(y)
stdev = np.std(y)
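
If looping over the full dataset is too slow, a sampled version might look like the sketch below. It assumes only that the dataset supports len() and integer indexing and that each item has a scalar .y. The function sample_target_stats and the toy dataset are hypothetical stand-ins so the example runs without OCP installed. It uses the standard library instead of NumPy; statistics.pstdev is the population standard deviation, which matches the default behaviour of np.std.

```python
import random
import statistics
from types import SimpleNamespace


def sample_target_stats(dataset, num_samples, seed=0):
    """Estimate the mean and std of `.y` from a random subset of `dataset`.

    `dataset` only needs `__len__` and integer `__getitem__`.
    """
    n = len(dataset)
    rng = random.Random(seed)
    # Draw distinct indices; fall back to the whole dataset if it is small.
    indices = rng.sample(range(n), min(num_samples, n))
    y = [float(dataset[i].y) for i in indices]
    return statistics.fmean(y), statistics.pstdev(y)


# Stand-in for an LMDB dataset: a list of objects with a `.y` attribute.
toy_dataset = [SimpleNamespace(y=float(i)) for i in range(1000)]

mean, stdev = sample_target_stats(toy_dataset, num_samples=200)
full_mean, full_stdev = sample_target_stats(toy_dataset, num_samples=1000)
```

Comparing the sampled values against the full-dataset values on a small subset is a cheap way to judge whether the sample size is large enough.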

Sejal2002 · November 17, 2023, 4:15pm

Yes, I did try this earlier, but I didn’t see any improvement in the results compared to running with normalization set to false. So I assumed there was something wrong with what I had calculated and thought of trying the other way.
Also, just a quick question: the dataset you mentioned above for calculating the mean and std refers to the training dataset, right? Including the validation set would cause some data leakage, I suppose.

mshuaibi · November 17, 2023, 4:30pm

That sounds consistent with what we observe. We don’t see much of a difference with and without normalization, so it’s unlikely you did something wrong.

As for the statistics, yes, we compute them from the training set.
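
A minimal sketch of that convention: the statistics are fitted on training targets only and then reused unchanged for validation targets. TargetNormalizer is a hypothetical stand-in written for this example, not the OCP Normalizer class, and the energies are made-up values.

```python
import statistics


class TargetNormalizer:
    """Minimal stand-in for a target normalizer (not the OCP class)."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def norm(self, y):
        # Shift and scale a raw target into normalized units.
        return (y - self.mean) / self.std

    def denorm(self, y_normed):
        # Map a normalized value back to the original units.
        return y_normed * self.std + self.mean


# Hypothetical energies; the statistics come from the training split only.
train_y = [-2.0, -1.0, 0.0, 1.0, 2.0]
val_y = [0.5, 3.0]

normalizer = TargetNormalizer(statistics.fmean(train_y), statistics.pstdev(train_y))

# Validation targets reuse the training statistics, so nothing leaks from val.
val_normed = [normalizer.norm(y) for y in val_y]
val_restored = [normalizer.denorm(y) for y in val_normed]
```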

Sejal2002 · November 17, 2023, 6:09pm

Okay
Thank you so much!
