Why does the Test Challenge data have only one .lmdb file with 120K data points, without the split into 4 subsets (ID, OOD_cat, OOD_ads, OOD_both) used for the Val and Test data released earlier? How can we identify which subset each data point belongs to? I expected the final test data to be organized the same way as the test data released months ago.
Yes, a similar thought process went into creating the test-challenge split as the test split (i.e. ID, OOD-Adsorbate, OOD-Catalyst, OOD-Both, etc.), but the breakdown itself isn’t meant to be public for the challenge.
This is meant to be a slight nudge towards training a single model for all of these ID / OOD subsplits instead of separate ones. Having said that, it should be easy to infer the breakdown based on the test input distribution if you really need it.
Can you confirm two points from your reply above:
1. The challenge includes the task of inferring the subset breakdown.
2. The hosts have released only a semi-processed Test Challenge dataset for evaluating participants’ work.
???
- The test-challenge dataset does not have a breakdown into individual subsplits. It is a single 120k split.
- Not sure what you mean by “semi-processed”, but let me know if (1) clarifies your concerns.
No.
In the topic Results for 4 splits, you answered that “we allow this flexibility in case it’s possible to do much better on certain splits/surfaces/adsorbates.”, but you did not actually release the subset split of your final test dataset. By “semi-processed” I mean that it is the organizers’ duty to follow their own words and their own data format throughout the whole competition, rather than handing out a batch of data that is inconsistent in its organization and telling participants to do the remaining data processing themselves.
Thanks for the clarification, I understand your concern.
The question you linked to, however (Results for 4 splits - #3 by Jingtun), is about the test split, not the test-challenge split. Indeed, for the OC20 test split, we do release the 4 subsplits here: ocp/DATASET.md at master · Open-Catalyst-Project/ocp · GitHub.
For the OC20 test-challenge split, we hadn’t promised to release a breakdown of subsplits, and there is going to be a single 120k split. Winners will be decided based on overall energy MAE. Apologies if there was miscommunication here.
Now, if your approach does rely on the distribution of bulk / surface / adsorbate elements in each structure, all of that information is available in the `tags` and `atomic_numbers` attributes in the released test-challenge LMDB. Apologies once again if this causes additional work at your end, but to be consistent across all teams participating in the challenge, we are not planning on releasing any additional data.
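To make the point about `tags` and `atomic_numbers` concrete, here is a minimal sketch of how the catalyst and adsorbate elements of one structure could be separated. It assumes the tag convention described in the OC20 dataset documentation (0 = sub-surface, 1 = surface, 2 = adsorbate); please double-check that against DATASET.md. The function name `split_elements` and the toy structure are hypothetical and are not taken from the released LMDB.

```python
from collections import Counter

# Assumption: OC20 tag convention is 0 = sub-surface, 1 = surface, 2 = adsorbate.
ADSORBATE_TAG = 2


def split_elements(atomic_numbers, tags):
    """Return (catalyst element set, adsorbate composition) for one structure.

    Both arguments are equal-length sequences of integers (plain lists work;
    tensors or arrays can be passed after calling .tolist() on them).
    """
    catalyst = set()
    adsorbate = Counter()
    for z, tag in zip(atomic_numbers, tags):
        z, tag = int(z), int(tag)
        if tag == ADSORBATE_TAG:
            adsorbate[z] += 1      # count atoms per adsorbate element
        else:
            catalyst.add(z)        # bulk and surface atoms form the catalyst
    # Sorted (atomic number, count) pairs give a hashable composition key.
    return frozenset(catalyst), tuple(sorted(adsorbate.items()))


# Toy structure (made up for illustration): four Cu atoms plus a C and an O.
atomic_numbers = [29, 29, 29, 29, 6, 8]
tags = [0, 0, 1, 1, 2, 2]

catalyst, adsorbate = split_elements(atomic_numbers, tags)
print(sorted(catalyst), adsorbate)
```

The two returned keys could then be compared against the element sets and adsorbate compositions seen in the training data to guess whether a structure looks in-distribution or out-of-distribution. That comparison is a heuristic only, not an official subsplit assignment.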
For adsorbate elements, it is hard to infer from `atomic_numbers`, as different adsorbates could have the same number of atoms.
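A small illustration of this ambiguity, as a sketch only: two isomers such as CHO and COH contain exactly the same atoms, so a key built from element counts cannot tell them apart. The helper name `composition`, the Pt slab atoms, and the tag convention (2 = adsorbate) are assumptions made for the example.

```python
from collections import Counter


def composition(atomic_numbers, tags, adsorbate_tag=2):
    """Element counts of the atoms carrying the (assumed) adsorbate tag."""
    counts = Counter(
        int(z) for z, t in zip(atomic_numbers, tags) if int(t) == adsorbate_tag
    )
    return tuple(sorted(counts.items()))


# Two made-up structures on a Pt slab: CHO (H bonded to C) and COH (H bonded to O).
# Bonding is not encoded in atomic_numbers, only which elements are present.
cho = composition([78, 78, 6, 1, 8], [0, 1, 2, 2, 2])
coh = composition([78, 78, 6, 8, 1], [0, 1, 2, 2, 2])

print(cho == coh)  # both reduce to one H, one C, one O
```

Telling such cases apart would need geometric information (for example, interatomic distances from the atom positions), which goes beyond what `atomic_numbers` and `tags` alone provide.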
Moving this discussion here - Data mapping information for test challenge set - #7 by mshuaibi.