The size of preprocessed data compared to raw .xz files

Hello!

I have a question regarding the size of the LMDB files relative to the .xz files. I intend to start working with the 2M subset for the S2EF task, but I cannot tell whether the preprocessing is failing because of limited storage or for some other reason.

If the uncompressed .xz train data for S2EF is 16GB, what sort of size increase should I expect after preprocessing?

Thank you!

Hi -

Preprocessing can be quite storage heavy if you save out edges via the --get-edges flag of the preprocessing script. In your case, the 2M S2EF subset will come to ~200GB. However, not saving out edges by omitting that command-line argument is roughly 10x faster and uses far less storage, ~20GB (recommended).
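As a rough sketch of the two variants (the script path and the flags other than --get-edges are assumptions on my part; please check the preprocessing script's --help in your checkout):

```bash
# Edge-free preprocessing (recommended): ~20GB for the 2M split, roughly 10x faster.
# Script path and the --data-path/--out-path/--num-workers flags are assumed;
# verify them against the argparse help in your copy of the repo.
python scripts/preprocess_ef.py \
    --data-path data/s2ef/2M/train \
    --out-path data/s2ef/2M/train_lmdb \
    --num-workers 8

# Precomputing edges with --get-edges: expect on the order of ~200GB for the same split.
python scripts/preprocess_ef.py \
    --data-path data/s2ef/2M/train \
    --out-path data/s2ef/2M/train_lmdb_edges \
    --num-workers 8 \
    --get-edges
```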

Note - If you don't save out edges, make sure to add otf_graph=True to your yaml file so edges are computed on the fly.
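For reference, a minimal sketch of what that looks like in the config, assuming the model settings live under a model: section (the surrounding keys here are placeholders for whatever model you are training):

```yaml
model:
  name: schnet        # placeholder; use your actual model
  otf_graph: True     # compute edges on the fly instead of reading precomputed ones
```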