Creating and reading large lmdb files

I am currently in the process of generating three large LMDB files for train/validation/test splits for the S2EF task. The dataset consists of roughly 30 million frames from MD simulations.

The issue I’ve encountered is that reading large LMDBs (with the LmdbDataset object) of more than ~100 GB is extremely slow. When iterating through the database, I observe read speeds of roughly 50 it/s for the >100 GB LMDBs, whereas smaller LMDBs of around 5 GB yield about 2000 it/s.

I’ve experimented with adjusting the number of workers in the LMDB dataset object but haven’t seen a significant improvement. Profiling suggests the bottleneck is file access during reads from disk.
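
To check whether the slowdown comes from the LmdbDataset wrapper or from the disk reads themselves, I also timed a raw LMDB cursor scan, roughly like this (a minimal sketch; the path is a placeholder):

    import lmdb
    from tqdm import tqdm

    # Raw cursor scan over one LMDB file, bypassing the dataset wrapper,
    # to see whether the slowdown is in deserialization or in disk access.
    env = lmdb.open(
        "path/to/train.lmdb",  # placeholder path
        subdir=False,
        readonly=True,
        lock=False,
        readahead=False,
        meminit=False,
    )
    with env.begin() as txn:
        for key, value in tqdm(txn.cursor(), total=env.stat()["entries"]):
            continue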

Is such a significant slowdown expected, even on modern SSDs?

Additionally, while examining other LMDBs provided by OCP, I noticed a sharded approach was used, where the data was split across around 40 smaller LMDB subsets. However, I have also read in the documentation that this was done to improve write performance, not read performance.

Could you provide additional tips or best practices for creating LMDB files? The documentation is very sparse on this topic. I’ve seen that around 200k structures/frames were used per LMDB file. Is this a good rule of thumb, or should I consider larger sizes? Also, what map size is recommended? There are conflicting messages regarding the database fill level and its supposed performance benefits.

Any advice would be much appreciated!

Apologies for the delay on this.

Can you share how exactly you are writing and reading your LMDBs? I’m seeing ~700-800 it/s for reading the entire OC20-All dataset (~130M data points):

    from tqdm import tqdm

    from fairchem.core.datasets import LmdbDataset

    dataset = LmdbDataset({"src": "data/s2ef/train/all"})
    for data in tqdm(dataset, total=len(dataset)):
        continue

The above can be parallelized by using a DataLoader -

    from torch.utils.data import DataLoader

    from fairchem.core.datasets import data_list_collater

    loader = DataLoader(
        dataset,
        batch_size=32,
        num_workers=64,
        shuffle=False,
        collate_fn=data_list_collater,
    )
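
Iterating the loader instead of the dataset directly then looks like this (a small sketch, assuming tqdm and the dataset above; each `batch` is a collated batch of 32 structures):

    for batch in tqdm(loader, total=len(loader)):
        continue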

This is the script we use to write LMDBs - `fairchem/src/fairchem/core/scripts/preprocess_ef.py` in the FAIR-Chem/fairchem GitHub repo.
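
If it helps, the core write pattern is roughly the following (a simplified sketch, not the full script; `frames` is a placeholder for your list of torch_geometric `Data` objects, and the output path is illustrative):

    import pickle

    import lmdb

    db = lmdb.open(
        "out/data.0000.lmdb",          # placeholder output path
        map_size=1099511627776 * 2,    # map_size is a virtual upper bound; only written pages use disk
        subdir=False,
        meminit=False,
        map_async=True,
    )

    for idx, data in enumerate(frames):
        txn = db.begin(write=True)
        txn.put(f"{idx}".encode("ascii"), pickle.dumps(data, protocol=-1))
        txn.commit()

    # Store the entry count so the dataset can report its length.
    txn = db.begin(write=True)
    txn.put("length".encode("ascii"), pickle.dumps(len(frames), protocol=-1))
    txn.commit()

    db.sync()
    db.close()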