Reading LMDB files: field interpretations

Adam · August 6, 2023, 3:13am

Reading files

Each LMDB file contains key-value pairs. The keys are ASCII-encoded integers beginning with 0, so accessing an element boils down to

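import lmdb
# Options depend on the file; the Extras section lists the full set I use.
env = lmdb.open("path/to/data.lmdb", subdir=False, readonly=True, lock=False)
txn = env.begin()        # reads go through a transaction, not the environment itself
entry0 = txn.get(b"0")   # raw bytes, or None if the key is missing
...
env.close()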

The returned entry0 is a pickled Python object, so the following turns it into a torch_geometric Data object

import pickle
from torch_geometric.data import Data  # torch_geometric must be importable for unpickling
data = pickle.loads(entry0)

before closing env. The object may then be queried for raw data; for instance, data["pos"] is a tensor of atom coordinates (if data came from the IS2RE dataset).
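To make the key and pickle conventions concrete without needing an actual LMDB file, here is a minimal sketch. The dict `fake_store` and the helpers `encode_key` and `load_entry` are hypothetical names of mine, and the stored values are plain dicts standing in for the pickled Data objects; only the key encoding and the unpickling step mirror what is described above.

```python
import pickle


def encode_key(index: int) -> bytes:
    """Build a key the way the files appear to: the integer as ASCII text."""
    return str(index).encode("ascii")


# Hypothetical stand-in for an LMDB transaction: key bytes -> pickled bytes.
# Real entries are pickled torch_geometric Data objects, not dicts.
fake_store = {
    encode_key(i): pickle.dumps({"pos": [[0.0, 0.0, float(i)]], "natoms": 1})
    for i in range(3)
}


def load_entry(store, index: int):
    """Fetch the raw bytes for `index` and unpickle them."""
    raw = store.get(encode_key(index))
    if raw is None:  # a missing key yields None rather than an exception
        raise KeyError(f"no entry with key {index}")
    return pickle.loads(raw)


entry = load_entry(fake_store, 2)
print(entry["pos"])
```

With a real file, `store` would be the transaction returned by env.begin(), which offers the same get(key) call.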

Field Questions

A datum from the IS2RE dataset has the fields

pos, cell, atomic_numbers, natoms, tags, fixed, sid, config, y_relaxed

Is there documentation for each field? Here’s what I came up with:

pos is a list of atom coordinates.
cell is a 3x3 matrix, say cell=[e1,e2,e3]. The vectors e1 and e2 seem to be the lattice periods; pos defines one unit cell, and it may be translated by e1 or e2 independently to generate all other cells along the face of the lattice.
- What is the significance of e3, or cell[2]? It looks something like the cross product of e1 and e2, but not exactly.
atomic_numbers contains the atomic number of each atom (hydrogen 1…).
natoms is the number of atoms, i.e. len(pos), len(atomic_numbers), etc.
tags has an entry for each atom. It’s 0 for “subdermal” lattice atoms (below the surface). It’s 1 for “dermal” lattice atoms (near the surface). It’s 2 for adsorbate atoms.
- gif_maker_parallelized.py contains the comment "#try and guess which atoms are adsorbates since the tags aren't correct after running in vasp". Is this true for any datasets or entries in particular? Should tags be ignored?
fixed has an entry for each atom. It’s 0 for movable atoms and 1 for stationary ones. Subdermal lattice atoms are fixed, dermal ones and adsorbate ones are dynamic.
sid is a chemical identifier, “Compound ID”.
config is usually 0 or 1.
- Is config meaningful/relevant?
y_relaxed seems to be the adsorption energy of the relaxed state.
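The interpretations above can be turned into small checks. The following is only a sketch of my guesses, not documented behaviour: the `toy` datum and its numbers are made up, and it uses plain lists, whereas the real fields are tensors (which may need `.tolist()` and possibly indexing away a leading dimension first). It checks that the per-atom fields agree with natoms, that fixed is 1 exactly where tags is 0, translates pos by lattice vectors, and measures how closely e3 lines up with the cross product of e1 and e2.

```python
import math


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def translate(pos, cell, n1, n2):
    """Shift every atom by n1*e1 + n2*e2 to get a neighbouring cell image."""
    e1, e2 = cell[0], cell[1]
    shift = [n1 * a + n2 * b for a, b in zip(e1, e2)]
    return [[p + s for p, s in zip(atom, shift)] for atom in pos]


def check_fields(datum):
    """Return (per-atom lengths match natoms, fixed == 1 exactly where tag == 0)."""
    n = datum["natoms"]
    same_length = all(len(datum[k]) == n
                      for k in ("pos", "atomic_numbers", "tags", "fixed"))
    fixed_iff_tag0 = all((f == 1) == (t == 0)
                         for f, t in zip(datum["fixed"], datum["tags"]))
    return same_length, fixed_iff_tag0


def e3_alignment(cell):
    """Cosine of the angle between e3 and e1 x e2 (1.0 means parallel)."""
    normal = cross(cell[0], cell[1])
    e3 = cell[2]
    return dot(normal, e3) / math.sqrt(dot(normal, normal) * dot(e3, e3))


# Made-up datum: two lattice atoms (Cu) and one adsorbate atom (H).
toy = {
    "pos": [[0.0, 0.0, 0.0], [1.0, 1.0, 2.0], [1.0, 1.0, 4.0]],
    "cell": [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.5, 10.0]],
    "atomic_numbers": [29, 29, 1],
    "natoms": 3,
    "tags": [0, 1, 2],
    "fixed": [1, 0, 0],
}

print(check_fields(toy))
print(translate(toy["pos"], toy["cell"], 1, 0))
print(e3_alignment(toy["cell"]))
```

Running check_fields over many real entries would show whether the tags/fixed relationship holds in general, and e3_alignment would quantify the "not exactly" in the question about e3.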

Extras

The precise code for reading a single element from an lmdb is as follows

import pickle
import lmdb
from torch_geometric.data import Data
env = lmdb.open("path/to/data.lmdb", subdir=False, readonly=True, lock=False, readahead=True, meminit=False, max_readers=1)
data = pickle.loads(env.begin().get("0".encode("ascii")))
print(data)
env.close()

Of course, different access patterns warrant different options for opening. The ocp library has utilities for LMDB I/O in lmdb_dataset.py.
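As a rough sketch of varying the options per access pattern, the helper below starts from the options used in the single-entry example and lets the caller override them. The names `READ_ONLY_OPTIONS`, `open_options` and `read_entries` are my own and are not part of ocp or lmdb; lmdb is imported inside the function, so the block can be loaded without it, and unpickling real entries still requires torch_geometric to be installed.

```python
import pickle

# Options copied from the single-entry example above.
READ_ONLY_OPTIONS = dict(
    subdir=False, readonly=True, lock=False,
    readahead=True, meminit=False, max_readers=1,
)


def open_options(**overrides):
    """Merge caller overrides into a copy of the default options."""
    return {**READ_ONLY_OPTIONS, **overrides}


def read_entries(path, indices, **overrides):
    """Open the file, unpickle the entries at `indices`, then close it."""
    import lmdb  # third-party; imported lazily on purpose

    env = lmdb.open(path, **open_options(**overrides))
    try:
        with env.begin() as txn:  # read-only transaction
            entries = []
            for index in indices:
                raw = txn.get(str(index).encode("ascii"))
                if raw is None:
                    raise KeyError(f"no entry with key {index}")
                entries.append(pickle.loads(raw))
            return entries
    finally:
        env.close()
```

For example, read_entries("path/to/data.lmdb", range(10), readahead=False) would read the first ten entries with OS read-ahead turned off, which may suit random access better than sequential scans.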

Maybe an FAQ for formats could be made. Whatever parts of this question are useful should be extracted/retained; I’m fine with whatever reformatting/deletions are appropriate.
