IS2RE Leaderboard Concerns

Many of you have reached out regarding the recent update to the IS2RE relaxation numbers (IS2RE Table 4 Corrections). Specifically, many labs don’t have access to the resources necessary to train a model for IS2RE using the relaxation approach and can only train models using the direct approach, while relaxation approaches currently show state-of-the-art results. Following your feedback, we’re sharing some of our thoughts below. These thoughts are not final, so please feel free to join the discussion and share your opinions.

For the IS2RE task, we recognize that there is a substantial difference in the compute needed to train on the IS2RE dataset (460k examples) versus the S2EF All dataset (134M examples). Now that relaxation-based approaches (trained on S2EF) are outperforming the direct IS2RE methods, this may be unsettling for groups that lack the resources to train such models but must still compete against them. The leaderboard currently does not indicate which datasets were used to train each entry, which may lead to unfair comparisons between approaches. To address this, we plan to add a “Training Dataset” column to the leaderboard (IS2RE, S2EF All, IS2RE + S2EF 2M, etc.). This information will be required for entries upon submission. For those with existing submissions on the leaderboard, we kindly ask you to add this information to your submission.

While we want to indicate the data used during training to ensure a fair comparison between different approaches, we don’t want to limit the creativity of the community in solving this problem. Our primary goal is to help the community solve the problem, regardless of the approach taken. For this reason, we didn’t want to add the “approach” (i.e., direct vs. relaxation) to the leaderboard. Many approaches may be hybrids (e.g., training IS2RE direct using S2EF as additional training data) and may not fit into either category.

Some have suggested that entries should provide training and inference times for comparison. Although we agree this would address many of the concerns raised above, it is problematic from a logistical point of view. Since models are trained and evaluated on the user’s end, these would be self-reported metrics, which are tricky to use when comparing models:

1. Labs have different resources, environments, configurations, etc., and normalizing across them is impractical.
2. There can be a lot of grey area when it comes to inference times, particularly for the relaxation-based approaches.
3. We always have to consider that bad-faith actors may falsely report their times. We certainly wouldn’t want to encounter this, and participants would not appreciate it either.

We will continue to think about self-reported metrics, but are currently hesitant to include them.

We also wanted to share some IS2RE relaxation results from models trained using only the S2EF-2M dataset (2M examples vs. 134M in the S2EF All set), which is only about 4x larger than the IS2RE dataset (2M vs. 460k). Results are reported on the validation set for SpinConv and DimeNet++. Even when trained on the 2M dataset, relaxation-based approaches outperform their direct counterparts. For transparency, models trained on the 2M dataset used 16 GPUs for 4-5 days (roughly 64-80 GPU-days). Although these compute resources may still be hefty for some, they are far smaller than those for the All dataset, which used 64 GPUs for 6+ days (at least 384 GPU-days). We present these in case the 2M dataset is more accessible to some groups:

Finally, we are still considering how to select presenters for the NeurIPS Open Catalyst Challenge. We will have two slots for invited talks, and we are leaning towards selecting 1) the best overall approach and 2) the best approach trained only on the IS2RE dataset. If a single team wins by both criteria, we’ll select the next best team to give the second presentation. Please let us know if you have feedback on this.
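
To make the proposed selection rule concrete, here is a minimal Python sketch of one way it could work. Everything in it is an assumption for illustration: the `Entry` fields, the single `score` metric (lower is better), the reading of "next best team" as the next best team overall, and the fallback when no IS2RE-only entry exists. The team names and scores are placeholders, not real leaderboard results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # Hypothetical leaderboard record; field names are illustrative only.
    team: str
    training_dataset: str  # e.g. "IS2RE", "S2EF All", "IS2RE + S2EF 2M"
    score: float           # assumed ranking metric, lower is better


def select_presenters(entries):
    """Return up to two team names for the invited-talk slots."""
    ranked = sorted(entries, key=lambda e: e.score)
    if not ranked:
        return []

    best_overall = ranked[0]
    presenters = [best_overall.team]

    # Slot 2: best entry trained only on the IS2RE dataset.
    is2re_only = [e for e in ranked if e.training_dataset == "IS2RE"]
    if is2re_only and is2re_only[0].team != best_overall.team:
        presenters.append(is2re_only[0].team)
    else:
        # Same team wins both criteria (or, as an assumed fallback, no
        # IS2RE-only entry exists): take the next best team overall.
        for e in ranked[1:]:
            if e.team != best_overall.team:
                presenters.append(e.team)
                break
    return presenters


# Placeholder entries, not real results.
example = [
    Entry("team_a", "S2EF All", 0.40),
    Entry("team_b", "IS2RE", 0.55),
    Entry("team_c", "IS2RE + S2EF 2M", 0.50),
]
print(select_presenters(example))  # ['team_a', 'team_b']
```

If "next best team" were instead read as the next best IS2RE-only team, only the fallback loop would change, which is part of why feedback on the rule is useful.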

Thanks again for your support of the dataset and challenge! Please feel free to share your thoughts, suggestions, or questions.

The Open Catalyst Team
