Reading files
Each LMDB file contains key-value pairs. The keys are ASCII-encoded integers beginning with 0, so accessing an element boils down to
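import lmdb
env = lmdb.open("path/to/data.lmdb", subdir=False, readonly=True)  # plus other options as needed
txn = env.begin()  # read-only transaction by default
entry0 = txn.get(b"0")  # get is a method of the transaction, not of env
...
env.close()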
The returned entry0 is a pickled Python object. The following turns it into a torch_geometric Data object (do this before closing env):
import pickle
from torch_geometric.data import Data
data = pickle.loads(entry0)
The object may be queried for raw data; for instance, data["pos"] is a tensor of atom coordinates (if data came from the is2re dataset).
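Since the keys are ASCII-encoded integers starting at 0, reading several entries is just a loop over indices. Here is a minimal sketch of that idea. The helpers index_to_key and load_entries are hypothetical names of mine, not part of lmdb or ocp. load_entries takes any get(key) callable, so it is shown here with a plain dict standing in for the LMDB transaction.

```python
import pickle


def index_to_key(index):
    """Encode an integer index as an ASCII key, e.g. 0 -> b"0"."""
    return str(index).encode("ascii")


def load_entries(get, count):
    """Unpickle entries 0..count-1 using a get(key) callable.

    `get` is assumed to return pickled bytes, or None for a missing key
    (the get method of an lmdb read transaction behaves this way).
    """
    entries = []
    for index in range(count):
        raw = get(index_to_key(index))
        if raw is None:  # key absent: stop instead of unpickling None
            break
        entries.append(pickle.loads(raw))
    return entries


# Stand-in for an LMDB transaction: a dict of key -> pickled object.
fake_db = {index_to_key(i): pickle.dumps({"sid": i}) for i in range(3)}
entries = load_entries(fake_db.get, 5)
print(len(entries), entries[0])
```

With a real file, the analogous call would presumably be load_entries(env.begin().get, n), made before env.close().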
Field Questions
A datum from the is2re dataset has the fields
pos, cell, atomic_numbers, natoms, tags, fixed, sid, config, y_relaxed
Is there documentation for each field? Here’s what I have come up with:
- pos is a list of atom coordinates.
- cell is a 3x3 matrix, say cell=[e1,e2,e3]. The vectors e1 and e2 seem to be the lattice periods; pos defines one unit cell, and it may be translated by e1 or e2 independently to generate all other cells along the face of the lattice.
  - What is the significance of e3, or cell[2]? It looks something like the cross product of e1 and e2, but not exactly.
- atomic_numbers contains the atomic number of each atom (hydrogen 1…).
- natoms is length(pos) and length(atomic_numbers) etc.
- tags has an entry for each atom. It’s 0 for “subdermal” lattice atoms (below the surface). It’s 1 for “dermal” lattice atoms (near the surface). It’s 2 for adsorbate atoms.
  - gif_maker_parallelized.py contains the comment #try and guess which atoms are adsorbates since the tags aren't correct after running in vasp. Is this true for any datasets or entries in particular; should tags be ignored?
- fixed has an entry for each atom. It’s 0 for movable atoms and 1 for stationary ones. Subdermal lattice atoms are fixed, dermal ones and adsorbate ones are dynamic.
- sid is a chemical identifier, “Compound ID”.
- config is usually 0 or 1.
  - Is config meaningful/relevant?
- y_relaxed seems to be adsorption (relaxed) energy.
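To make my reading of the fields concrete, here is a small sketch that checks the per-atom lengths, groups atoms by tag, and translates a cell along e1 and e2. Everything in it is an assumption based on the guesses above: the helper names are hypothetical, the toy values are made up (not taken from any dataset), and plain lists stand in for tensors. With real data the tensors would presumably need converting first, e.g. with .tolist().

```python
def check_lengths(datum):
    """Return True if natoms matches the length of every per-atom field."""
    per_atom = ("pos", "atomic_numbers", "tags", "fixed")
    return all(len(datum[name]) == datum["natoms"] for name in per_atom)


def atoms_by_tag(datum):
    """Group atom indices by tag (assumed: 0 subsurface, 1 surface, 2 adsorbate)."""
    groups = {0: [], 1: [], 2: []}
    for index, tag in enumerate(datum["tags"]):
        groups[tag].append(index)
    return groups


def translate(pos, cell, i, j):
    """Shift every atom by i*e1 + j*e2, assuming e1 = cell[0] and e2 = cell[1]."""
    shift = [i * a + j * b for a, b in zip(cell[0], cell[1])]
    return [[x + s for x, s in zip(atom, shift)] for atom in pos]


# Toy datum with made-up values, shaped like the field list above.
toy = {
    "pos": [[0.0, 0.0, 0.0], [1.0, 1.0, 2.0], [1.0, 1.0, 4.0]],
    "cell": [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 10.0]],
    "atomic_numbers": [29, 29, 1],
    "natoms": 3,
    "tags": [0, 1, 2],
    "fixed": [1, 0, 0],
}

print(check_lengths(toy))
print(atoms_by_tag(toy))
print(translate(toy["pos"], toy["cell"], 1, -1))
```

If the interpretation of cell is right, translate with integer i and j should reproduce the neighbouring cells along the surface.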
Extras
The precise code for reading a single element from an LMDB file is as follows:
import pickle
import lmdb
from torch_geometric.data import Data
env = lmdb.open("path/to/data.lmdb", subdir=False, readonly=True, lock=False,
                readahead=True, meminit=False, max_readers=1)
# Keys are ASCII-encoded integers; values are pickled objects.
data = pickle.loads(env.begin().get("0".encode("ascii")))
print(data)
env.close()
Of course, different data accesses warrant different options for lmdb.open. The ocp library has utilities for IO in lmdb_dataset.py.
Maybe an FAQ for formats could be made. Whatever parts of this question are useful should be extracted and retained; I’m fine with whatever reformatting or deletions are appropriate.