[MLPerf/open_catalyst] Runtime Errors

vitduck · May 13, 2024, 6:49am

Hello,

I hope you don’t mind my asking questions related to [MLPerf/HPC/open_catalyst] here.
The benchmark shipped with MLPerf has not been updated for a long time.
After struggling with MLPerf’s poor documentation, I managed to run the benchmark.

The config.yaml file is as follows:

dataset:
    - src: /scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/oc20_data/s2ef/200k/train
      normalize_labels: True
      target_mean: -0.7554450631141663
      target_std: 2.887317180633545
      grad_target_mean: 0.0
      grad_target_std: 2.887317180633545

Is a validation dataset truly optional?

According to the instructions from TRAIN.md:

- src: [Path to training data]
...
# Val data (optional)
- src: [Path to validation data]
# Test data (optional)
- src: [Path to test data]

However, without validation data, the initialization failed with the following error:

Traceback (most recent call last):
  File "main.py", line 125, in <module>
    Runner()(config)
  File "main.py", line 65, in __call__
    self.task.run()
  File "/scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/ocpmodels/tasks/task.py", line 35, in run
    self.trainer.train(
  File "/scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/ocpmodels/trainers/mlperf_forces_trainer.py", line 410, in train
    mllogger.event(key=mllog.constants.EVAL_SAMPLES, value=len(self.val_loader.dataset))
AttributeError: 'NoneType' object has no attribute 'dataset'

Even with --mode train, the script still tries to load the validation dataset.
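
A pre-flight check along the following lines could turn the AttributeError above into a clearer message before training starts. This is only a sketch: `check_datasets` is a hypothetical helper, not part of the benchmark, and it assumes the parsed config is a dict whose `dataset` key holds a list with the training entry first and the validation entry second, as in the TRAIN.md excerpt.

```python
def check_datasets(config):
    """Fail early if the config lacks a training or validation entry.

    Hypothetical helper. Assumes config["dataset"] is a list of dicts,
    training set first and validation set second, each with a "src" key.
    """
    datasets = config.get("dataset") or []
    if len(datasets) < 1 or "src" not in datasets[0]:
        raise ValueError("config has no training dataset entry")
    if len(datasets) < 2 or "src" not in datasets[1]:
        raise ValueError(
            "config has no validation dataset entry; "
            "the trainer reads len(val_loader.dataset) at the start of train()"
        )
    # Return the two paths so the caller can log or verify them.
    return datasets[0]["src"], datasets[1]["src"]
```

Calling it on the dict loaded from config.yaml, before constructing the trainer, would report the missing entry directly.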

Unknown -C option of git command:
After supplying the path of validation dataset in config.yaml, the training did proceed.
But I am concerned with the following message emitted after successfully loading data sets:

val_dataset:
  src: /scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/oc20_data/s2ef/all/val_id

Unknown option: -C
usage: git [--version] [--help] [-c name=value]
       [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
       [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
       [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
       <command> [<args>]

Since the validation stage takes a lot of time even with A100s, I would ideally like to measure the performance of the training stage only.

Your insights are much appreciated.

Regards.

mshuaibi · May 14, 2024, 6:39pm

Hi -

At this time validation is indeed still a requirement for training. However, I don’t recommend using the full validation sets since, as you mentioned, they are large and can be very time consuming. We typically create a small validation set of only ~30k samples to use while we train. This gives us a proper validation set to guard against overfitting, but it’s also small enough to keep training time reasonable.
The git message you’re getting seems to come from ocpmodels/common/utils.py in the FAIR-Chem/fairchem repository on GitHub (commit 9108a87ce383b2982c24eff4178632f01fecb63e). Maybe you need to update your git version. Either way, this isn’t a problem and should not impact your training/results.

Hope this helps!

vitduck · May 16, 2024, 2:30am

Hi @mshuaibi

Thank you very much for the clarifications!

You are correct regarding the git version.
Upgrading from v1.8.3 to v2.35.1 has resolved the warning messages.
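
For anyone hitting the same message, a quick version check can be sketched as below. It assumes, to my understanding, that the global `git -C <path>` option first shipped in Git 1.8.5, which matches the behaviour above (v1.8.3 rejects it, v2.35.1 accepts it). The function names are made up for illustration.

```python
import re
import subprocess

# Assumption: `git -C <path>` is available from Git 1.8.5 onwards.
MIN_GIT_FOR_DASH_C = (1, 8, 5)


def parse_git_version(text):
    """Extract (major, minor, patch) from `git --version` output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def supports_dash_c(version):
    """True if this git version should accept the global -C option."""
    return version >= MIN_GIT_FOR_DASH_C


def installed_git_version():
    """Ask the git executable found on PATH for its version."""
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return parse_git_version(result.stdout)
```

On a login node, `supports_dash_c(installed_git_version())` reports whether the git found on PATH is new enough.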

Could you give us some tips on creating a 30k-sample validation set?

DATASET.md (hpc/open_catalyst/DATASET.md in the mlcommons/hpc repository on GitHub) lists the following 4 options:

val_id
val_ood_ads
val_ood_cat
val_ood_both

I used the download_data.py script provided by MLPerf to download and convert val_id to a set of
data.00{00..63}.lmdb. It is a total black box to me.

There are approximately 10^6/64 ≈ 15k samples per LMDB file.
So does a 30k-sample set correspond to just 2 LMDB files?
If I select only, for instance, data.0000.lmdb and data.0001.lmdb, will there be any issue?
As I understand it, the data loader should not complain in this case.

mshuaibi · May 17, 2024, 8:47pm

Yeah, we did something similar: we just randomly sampled 30k from the full Val-ID. What you proposed should also work: copy a few of the .lmdb files into a separate directory and then use that new directory.
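
The copy-a-few-files approach could be sketched roughly as follows. Note that this picks whole .lmdb files at random, which is not the same as sampling 30k individual entries from the full Val-ID. The helper name, the default of 15,625 samples per file (10^6/64) and the assumption that each .lmdb is a single file rather than a directory are all illustrative, not taken from the benchmark code.

```python
import math
import random
import shutil
from pathlib import Path


def copy_lmdb_subset(src_dir, dst_dir, target_samples=30_000,
                     samples_per_file=15_625, seed=0):
    """Copy just enough randomly chosen .lmdb files to reach target_samples.

    Hypothetical helper. Assumes every file holds about samples_per_file
    entries and that each .lmdb is a regular file.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    # "*.lmdb" does not match the "*.lmdb-lock" companions, so they are skipped.
    shards = sorted(p for p in src.glob("*.lmdb") if p.is_file())
    if not shards:
        raise FileNotFoundError(f"no .lmdb files found in {src}")
    n_files = min(len(shards), math.ceil(target_samples / samples_per_file))
    # A seeded generator keeps the selection reproducible between runs.
    chosen = sorted(random.Random(seed).sample(shards, n_files))
    dst.mkdir(parents=True, exist_ok=True)
    for shard in chosen:
        shutil.copy2(shard, dst / shard.name)
    return [p.name for p in chosen]
```

With the defaults this selects 2 of the 64 files; the validation `src` in config.yaml would then point at the new directory.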

vitduck · May 21, 2024, 2:32am

Thanks for your suggestion.

Validation does indeed proceed quickly when I select just a few of the generated .lmdb files from the full dataset.
