Hello OCP team.
Thank you for all the great effort in creating the dataset and codebase and for organizing everything around them.
I have a question when comparing the IS2RE results of different methods.
I am curious about the metric of energy within threshold (EwT) and how it is related to energy MAE.
Specifically, as shown in the leaderboard, some methods achieve a lower MAE yet also a lower EwT than others.
Some examples are “GemNet-XL (Finetuned)” versus “GNS + Noisy Nodes (IS2RE Only)” and “SphereNet+” versus “SEGNN”.
I wonder what the reason for this could be. My guess is that EwT measures the percentage of systems whose error falls below a threshold, so reducing the average error does not necessarily improve this metric if the errors remain above the threshold.
Thank you in advance for your help!
While Energy MAE averages the absolute error of all systems, EwT is the % of all systems with absolute errors below 0.02 eV. This threshold is meant to be strict, with the motivation being that systems that fall below this threshold are practically relevant for researchers. While we use a threshold of 0.02 eV as an extreme case, practical relevance can still be achieved at higher thresholds e.g. 0.1 eV.
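To make the two metrics concrete, here is a minimal sketch of how each is computed, using made-up error values for two hypothetical models (not numbers from any leaderboard entry). It also shows the effect the original question points at: a model with lower MAE can still have lower EwT.

```python
# Hypothetical absolute energy errors (eV) for two toy models.
# Model A has uniformly moderate errors; Model B nails most systems
# but misses badly on one outlier.
errors_a = [0.03, 0.03, 0.03, 0.03]
errors_b = [0.00, 0.00, 0.00, 0.20]

THRESHOLD = 0.02  # eV, the EwT threshold used by OCP

def mae(errors):
    # Mean absolute error over all systems.
    return sum(errors) / len(errors)

def ewt(errors, threshold=THRESHOLD):
    # Fraction of systems with absolute error below the threshold.
    return sum(e < threshold for e in errors) / len(errors)

# Model A: MAE ~0.03 eV but EwT = 0% (no system is under 0.02 eV).
# Model B: MAE ~0.05 eV but EwT = 75% (three of four systems are exact).
print(mae(errors_a), ewt(errors_a))
print(mae(errors_b), ewt(errors_b))
```

So Model A "wins" on MAE while Model B "wins" on EwT, which is exactly the pattern seen between some leaderboard entries.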
As far as models are concerned, it’s been interesting to see the different trends across models. Without knowing the full details of the different approaches, we don’t have a concrete answer as to why this may be the case. One design decision that could impact this is the loss function being used. An L2 loss puts more emphasis on outliers, which could help in reducing overall MAE but may hurt performance on other points, particularly the previously low-error points that count toward EwT when their error is below 0.02 eV. An L1 loss, on the other hand, pays less attention to outliers and may therefore help get more points within the threshold rather than accommodating outliers.
There could be other model/training decisions that are contributing to this difference. Hope this answers your question.
Thanks for the response!
The explanation on loss function makes sense to me.
Suppose two models are trained with similar settings: both minimize the MAE between predicted and ground-truth energies, and both are trained only on the same IS2RE data.
Is it reasonable to say one model is better than the other if it has a lower test MAE, regardless of EwT (assuming the EwT difference is small), since the same metric, MAE, is used for both training and testing?
Also, is there any reason the IS2RE leaderboard mainly ranks methods by MAE rather than EwT?
Both are useful. MAE is a general metric for gauging the overall performance of models on the dataset. EwT, on the other hand, is concerned only with the proportion of systems with errors below a tight threshold. For practical relevance, a model with a higher EwT may be more useful for screening applications even if it has a higher energy MAE. However, we’re not there yet and have mainly been focusing on MAE, as it’s more familiar to the community than EwT.
In last year’s OCP Challenge, for instance, winning models were selected based on energy MAE, regardless of EwT performance.
Thanks for answering and providing more details on MAE and EwT.
I think most of my questions are well addressed, and I look forward to future work that applies these methods to screening applications.