Split of Test Challenge Data

Jingtun · September 23, 2021, 8:03pm

Why does the Test Challenge Data have only one .lmdb file with 120K data points, without a split into the 4 subsets (ID, OOD_cat, OOD_ads, OOD_both) that the Val and Test data released earlier had? How can we identify the subset each data point belongs to? I suppose that the final test data should be organized in the same way as the test data released months ago.

abhshkdz · September 23, 2021, 9:34pm

Yes, a similar thought process went into creating the test-challenge split as the test split (i.e. ID, OOD-Adsorbate, OOD-Catalyst, OOD-Both, etc.), but the breakdown itself isn’t meant to be public for the challenge.

This is meant to be a slight nudge towards training a single model for all of these ID / OOD subsplits instead of separate ones. Having said that, it should be easy to infer the breakdown based on the test input distribution if you really need it.

Jingtun · September 23, 2021, 10:10pm

Can you confirm two points from your reply above:

1. The challenge includes the task of “inferring the subset breakdown”
2. The host just released a semi-processed Test Challenge dataset for evaluating participants’ work

???

abhshkdz · September 23, 2021, 10:18pm

The test-challenge dataset does not have a breakdown into individual subsplits. It is a single 120k split.
Not sure what you mean by “semi-processed”, but let me know if (1) clarifies your concerns.

Jingtun · September 23, 2021, 10:44pm

No.
In the topic “Results for 4 splits” you answered that “we allow this flexibility in case it’s possible to do much better on certain splits/surfaces/adsorbates.”, but you did not actually release the subset split of your final test dataset. By “semi-processed” I mean that it is the duty of the organizers to follow their own words and their own data format throughout the whole competition, rather than just giving out a bunch of data that is not consistent in its organization and telling participants to do the remaining necessary data processing steps.

abhshkdz · September 23, 2021, 11:09pm

Thanks for the clarification, I understand your concern.

The question you linked to however — Results for 4 splits - #3 by Jingtun — is about the test split, not the test-challenge split. Indeed, for the OC20 test split, we do release the 4 subsplits here: ocp/DATASET.md at master · Open-Catalyst-Project/ocp · GitHub.

For the OC20 test-challenge split, we hadn’t promised to release a breakdown of subsplits, and there is going to be a single 120k split. Winners will be decided based on energy MAE overall. Apologies if there was miscommunication here.

Now, if your approach does rely on the distribution of bulk / surface / adsorbate elements in each structure, all of that information is available in the tags and atomic_numbers attributes in the released test-challenge LMDB. Apologies once again if this causes additional work on your end, but to be consistent across all teams participating in the challenge, we are not planning on releasing any additional data.
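
As a rough illustration of the point above, the sketch below groups a structure's atoms by tag and counts the elements in each group. It assumes the usual OC20 tag convention (0 = sub-surface, 1 = surface, 2 = adsorbate), which is not stated in this thread, and it uses plain Python lists in place of the tags and atomic_numbers attributes read from the LMDB. The function name and the sample values are hypothetical.

```python
from collections import Counter

# Assumed OC20 tag convention (not confirmed in this thread).
TAG_NAMES = {0: "subsurface", 1: "surface", 2: "adsorbate"}


def composition_by_tag(atomic_numbers, tags):
    """Count atomic numbers separately for each tag group."""
    groups = {name: Counter() for name in TAG_NAMES.values()}
    for z, tag in zip(atomic_numbers, tags):
        groups[TAG_NAMES[int(tag)]][int(z)] += 1
    return groups


# Hypothetical structure: a Cu slab (Z=29) with a CO adsorbate (Z=6, Z=8).
atomic_numbers = [29, 29, 29, 29, 6, 8]
tags = [0, 0, 1, 1, 2, 2]

comp = composition_by_tag(atomic_numbers, tags)
print(dict(comp["adsorbate"]))  # {6: 1, 8: 1}
```

The int() casts are there so the same function would also accept tensor or NumPy scalar values, if that is how the attributes are stored.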

SanZhang · September 28, 2021, 5:48am

For adsorbate elements, it is hard to infer from atomic_numbers, as different adsorbates could have the same number of atoms.
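
A small sketch of this ambiguity, using two illustrative adsorbates with the same elements but different bonding (COH and CHO). The helper name and the atom lists are hypothetical, and the atoms are assumed to be the ones already selected as adsorbate atoms.

```python
from collections import Counter


def composition_key(adsorbate_atomic_numbers):
    """Order-independent element counts, e.g. ((1, 1), (6, 1), (8, 1))."""
    return tuple(sorted(Counter(adsorbate_atomic_numbers).items()))


# Hypothetical adsorbate atoms, given as atomic numbers.
coh = [6, 8, 1]  # COH: hydrogen bonded to oxygen
cho = [6, 1, 8]  # CHO: hydrogen bonded to carbon

# Both reduce to one C, one O and one H, so the keys collide.
print(composition_key(coh) == composition_key(cho))  # True
```

Telling such cases apart would need more than element counts, for example the atomic positions.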

mshuaibi · September 28, 2021, 3:10pm

Moving this discussion here - Data mapping information for test challenge set - #7 by mshuaibi.
