The size of preprocessed data compared to raw .xz files

Hello!

I have a question regarding the size of the LMDB files relative to the .xz files. I intend to start working with the 2M subset for the S2EF task, but I cannot tell whether the preprocessing is failing because of limited storage or for some other reason.

If the uncompressed .xz train data for S2EF is 16GB, what sort of size increase should I expect after preprocessing?

Thank you!

Hi -

Preprocessing can be quite storage heavy if you save out edges via the --get-edges flag of the preprocessing script. In your case, the 2M S2EF subset will come to ~200GB. However, not saving out edges by omitting that command-line argument is roughly 10x faster and uses far less storage, ~20GB (recommended).
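As a rough sketch of the two variants (the script path and the flags other than --get-edges are assumptions on my part; please check the preprocessing script's --help in your checkout):

```bash
# Edge-free preprocessing (recommended): ~20GB for the 2M split, roughly 10x faster.
# Script path and the --data-path/--out-path/--num-workers flags are assumed;
# verify them against the argparse help in your copy of the repo.
python scripts/preprocess_ef.py \
    --data-path data/s2ef/2M/train \
    --out-path data/s2ef/2M/train_lmdb \
    --num-workers 8

# Precomputing edges with --get-edges: expect on the order of ~200GB for the same split.
python scripts/preprocess_ef.py \
    --data-path data/s2ef/2M/train \
    --out-path data/s2ef/2M/train_lmdb_edges \
    --num-workers 8 \
    --get-edges
```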

Note - If you don't save out edges, make sure to add otf_graph=True to your yaml file so edges are computed on the fly.
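For reference, a minimal sketch of what that looks like in the config, assuming the model settings live under a model: section (the surrounding keys here are placeholders for whatever model you are training):

```yaml
model:
  name: schnet        # placeholder; use your actual model
  otf_graph: True     # compute edges on the fly instead of reading precomputed ones
```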