Hello,
I hope you don't mind my asking questions related to [MLPerf/HPC/open_catalyst] here. The benchmark shipped with MLPerf has not been updated in a long time. After struggling with MLPerf's sparse documentation, I managed to run the benchmark. The `config.yaml` file is as follows:
```yaml
dataset:
  - src: /scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/oc20_data/s2ef/200k/train
    normalize_labels: True
    target_mean: -0.7554450631141663
    target_std: 2.887317180633545
    grad_target_mean: 0.0
    grad_target_std: 2.887317180633545
```
Is a validation dataset truly optional?

According to the instructions from TRAIN.md:

```yaml
- src: [Path to training data]
  ...
# Val data (optional)
- src: [Path to validation data]
# Test data (optional)
- src: [Path to test data]
```
However, without validation data, the initialization failed with the following error:

```
Traceback (most recent call last):
  File "main.py", line 125, in <module>
    Runner()(config)
  File "main.py", line 65, in __call__
    self.task.run()
  File "/scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/ocpmodels/tasks/task.py", line 35, in run
    self.trainer.train(
  File "/scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/ocpmodels/trainers/mlperf_forces_trainer.py", line 410, in train
    mllogger.event(key=mllog.constants.EVAL_SAMPLES, value=len(self.val_loader.dataset))
AttributeError: 'NoneType' object has no attribute 'dataset'
```
Even with `--mode train`, the script still tries to load the validation dataset.
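For what it's worth, the crash comes from the trainer unconditionally reading `self.val_loader.dataset` when logging EVAL_SAMPLES. A minimal defensive sketch of that call site, assuming only the attribute names shown in the traceback (the helper name `log_eval_samples` is my own):

```python
# Hedged sketch: emit the MLLog EVAL_SAMPLES event only when a validation
# loader actually exists. Attribute names follow the traceback above;
# the helper name `log_eval_samples` is hypothetical.
def log_eval_samples(mllogger, constants, val_loader):
    """Return True if the event was emitted, False when there is no val loader."""
    if val_loader is None:
        return False
    mllogger.event(key=constants.EVAL_SAMPLES, value=len(val_loader.dataset))
    return True
```

With a guard like this in `mlperf_forces_trainer.py`, a missing `val_dataset` entry would degrade gracefully instead of raising `AttributeError`.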
Unknown `-C` option of the `git` command:
After supplying the path of the validation dataset in `config.yaml`, the training did proceed. But I am concerned about the following message, emitted after the datasets were successfully loaded:

```
val_dataset:
  src: /scratch/optpar01/work/2024/06-mlperf/01-hpc/hpc/open_catalyst/oc20_data/s2ef/all/val_id
Unknown option: -C
usage: git [--version] [--help] [-c name=value] [--exec-path[=<path>]]
           [--html-path] [--man-path] [--info-path]
           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]
```
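The `Unknown option: -C` message typically means some part of the code shells out to `git -C <repo> ...` (for example, to record the commit hash in the run metadata), but the system git predates 1.8.5, where `-C` was introduced. Assuming that is what is happening here (I have not confirmed this fork logs the hash this way, and the helper name is my own), an equivalent call that avoids `-C` by setting the subprocess working directory instead:

```python
import subprocess

# Hedged sketch: obtain the HEAD commit hash without `git -C`, which is
# unavailable in git < 1.8.5. Equivalent to `git -C repo_dir rev-parse HEAD`.
def get_commit_hash(repo_dir):
    """Return the repo's HEAD commit hash, or None if git fails."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,              # replaces the -C option
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    return out.decode().strip()
```

Since the hash lookup fails softly here, the warning should not affect the training run itself.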
Since the validation stage takes a long time even on A100s, I would ideally prefer to measure the performance of the training stage only.
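One knob worth checking for this: OCP-derived trainers commonly accept an `eval_every` setting under `optim` (I am assuming this fork kept it; please verify against the trainer code). Setting it larger than the total number of training steps would effectively skip mid-run validation:

```yaml
optim:
  ...
  eval_every: 1000000   # assumed knob: larger than total steps, so no mid-run eval
```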
Your insights are much appreciated.
Regards.