[MLPerf/open_catalyst] Runtime Errors

Hello,

I hope you don’t mind my asking questions related to [MLPerf/HPC/open_catalyst] here.
The benchmark shipped with MLPerf has not been updated for a long time.
After struggling with MLPerf’s sparse documentation, I managed to get the benchmark running.

The config.yaml file is as follows:

dataset:
    - src: /scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/oc20_data/s2ef/200k/train
      normalize_labels: True
      target_mean: -0.7554450631141663
      target_std: 2.887317180633545
      grad_target_mean: 0.0
      grad_target_std: 2.887317180633545
  1. Is a validation dataset truly optional?

    According to the instructions from TRAIN.md:

    - src: [Path to training data]
    ...
    # Val data (optional)
    - src: [Path to validation data]
    # Test data (optional)
    - src: [Path to test data]
    

    However, without validation data, the initialization failed with the following error:

    Traceback (most recent call last):
      File "main.py", line 125, in <module>
        Runner()(config)
      File "main.py", line 65, in __call__
        self.task.run()
      File "/scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/ocpmodels/tasks/task.py", line 35, in run
        self.trainer.train(
      File "/scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/ocpmodels/trainers/mlperf_forces_trainer.py", line 410, in train
        mllogger.event(key=mllog.constants.EVAL_SAMPLES, value=len(self.val_loader.dataset))
    AttributeError: 'NoneType' object has no attribute 'dataset'
    

    Even with --mode train, the script still tries to load a validation dataset.
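
    As a local workaround, I considered guarding the failing call, though I have
    not verified whether later parts of the trainer also assume a validation
    loader, so this is only a sketch:

    # Hypothetical guard around the call at mlperf_forces_trainer.py:410;
    # it skips the MLPerf log event when no validation set is configured.
    if self.val_loader is not None:
        mllogger.event(
            key=mllog.constants.EVAL_SAMPLES,
            value=len(self.val_loader.dataset),
        )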

  2. Unknown -C option of the git command:
    After supplying the path of the validation dataset in config.yaml, training did proceed.
    But I am concerned about the following message, emitted right after the datasets were successfully loaded:

    val_dataset:
      src: /scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/oc20_data/s2ef/all/val_id
    
    Unknown option: -C
    usage: git [--version] [--help] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]
    

Since the validation stage takes a lot of time even on A100s, I would ideally prefer to measure the performance of the training stage only.

Your insights are much appreciated.

Regards.

Hi -

  1. At this time, validation is indeed still a requirement for training. However, I don’t recommend using the full validation sets: as you mentioned, they are large and can be very time consuming. We typically create a small validation set of only ~30k samples to use while we train. This gives us a proper validation set to guard against overfitting, while staying small enough to keep training times reasonable.

  2. The issue you’re seeing seems to come from ocpmodels/common/utils.py in the FAIR-Chem/fairchem repository on GitHub (commit 9108a87ce383b2982c24eff4178632f01fecb63e). You may need to update your git version. Either way, this isn’t a problem and should not impact your training/results.
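
    For context: git -C <path> was only added in git 1.8.5, so older clients
    print that usage message instead. The commit-hash lookup in utils.py boils
    down to a subprocess call of roughly the following shape (paraphrased, so
    treat it as a sketch); the exception is swallowed, which is why training
    continues unaffected:

    import subprocess
    from typing import Optional

    def get_commit_hash(repo_dir: str) -> Optional[str]:
        # "git -C <dir>" runs git as if it were started in <dir>; the -C
        # option requires git >= 1.8.5. On older clients the call fails,
        # git prints its usage text, and we fall back to logging no hash.
        try:
            return (
                subprocess.check_output(
                    ["git", "-C", repo_dir, "describe", "--always"]
                )
                .strip()
                .decode("ascii")
            )
        except Exception:
            return None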

Hope this helps!

Hi @mshuaibi

Thank you very much for the clarifications!

You are correct regarding the git version.
Upgrading from v1.8.3 to v2.35.1 has resolved the warning messages.

Could you give us some tips on creating a ~30k-sample validation set?

DATASET.md (hpc/open_catalyst/DATASET.md in the mlcommons/hpc repository on GitHub) provides the following four options:

  • val_id
  • val_ood_ads
  • val_ood_cat
  • val_ood_both

I used the download_data.py script provided by MLPerf to download val_id and convert it into a set of
data.00{00..63}.lmdb files. The process is a total black box to me.

  • There are approximately 10^6/64 ≈ 15.6k samples per LMDB file.
    So would a 30k-sample set correspond to just 2 LMDB files?
  • If I select only, for instance, data.0000.lmdb and data.0001.lmdb, will there be any issue?
    As I understand it, the data loader should not complain in this case.

Yeah, we did something similar: we just randomly sampled 30k from the full Val-ID. But what you proposed should also work, i.e., copying a few of the .lmdb files into a separate directory and then using that new directory.
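
If it helps, the random sampling can be done directly on the .lmdb files. Below is an untested sketch; it assumes each shard keys samples by their ASCII integer index and stores a pickled "length" entry (as the OCP LMDB format does), and the paths are placeholders, so verify both against your files first:

    # Untested sketch: sample ~30k entries from OC20-style LMDB shards into
    # a single new LMDB. Values are copied as raw bytes, so nothing needs to
    # be unpickled except the "length" bookkeeping entries.
    import glob
    import os
    import pickle
    import random
    from collections import defaultdict

    import lmdb

    SRC_GLOB = "oc20_data/s2ef/all/val_id/data.*.lmdb"   # placeholder paths
    DST_PATH = "oc20_data/s2ef/all/val_30k/data.0000.lmdb"
    N_SAMPLES = 30_000

    # Enumerate (shard, index) pairs across all source files.
    pairs = []
    for path in sorted(glob.glob(SRC_GLOB)):
        with lmdb.open(path, subdir=False, readonly=True, lock=False) as env:
            with env.begin() as txn:
                length = pickle.loads(txn.get(b"length"))
        pairs.extend((path, i) for i in range(length))

    chosen = random.sample(pairs, N_SAMPLES)

    # Group by shard so each source file is opened only once.
    by_shard = defaultdict(list)
    for path, idx in chosen:
        by_shard[path].append(idx)

    # Write the sampled entries into a fresh LMDB, re-keyed 0..N-1.
    os.makedirs(os.path.dirname(DST_PATH), exist_ok=True)
    out = lmdb.open(DST_PATH, subdir=False, map_size=1 << 40)
    new_idx = 0
    with out.begin(write=True) as txn:
        for path, indices in by_shard.items():
            with lmdb.open(path, subdir=False, readonly=True, lock=False) as env:
                with env.begin() as src_txn:
                    for idx in indices:
                        txn.put(
                            str(new_idx).encode("ascii"),
                            src_txn.get(str(idx).encode("ascii")),
                        )
                        new_idx += 1
        txn.put(b"length", pickle.dumps(new_idx))
    out.close()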

Thanks for your suggestion.

Validation indeed proceeds quickly when just a few of the generated .lmdb files are selected from the full dataset.
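
For the record, this is roughly all it took (directory names reflect my local layout):

    # Copy two of the 64 generated shards into a fresh directory, then point
    # the val_dataset src entry in config.yaml at that directory.
    import shutil
    from pathlib import Path

    src = Path("oc20_data/s2ef/all/val_id")
    dst = Path("oc20_data/s2ef/all/val_small")
    dst.mkdir(parents=True, exist_ok=True)

    for name in ("data.0000.lmdb", "data.0001.lmdb"):
        shutil.copy2(src / name, dst / name)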