Reading LMDB files: field interpretations

Reading files

Each LMDB file contains key-value pairs. The keys are ASCII-encoded integers beginning with 0, so accessing an element boils down to

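import lmdb
env = lmdb.open("path/to/data.lmdb", subdir=False, readonly=True, lock=False)  # options vary, see Extras
txn = env.begin()  # read-only transaction; get() lives on the transaction, not on env
entry0 = txn.get(b"0")  # raw bytes, or None if the key is missing
...
env.close()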

The returned entry0 is a pickled Python object, so the following turns it into a torch_geometric Data object

import pickle
from torch_geometric.data import Data  # not used directly; torch_geometric must be importable for unpickling
data = pickle.loads(entry0)

before closing env. The object can then be queried for raw data; for instance, data["pos"] is a tensor of atom coordinates (if data came from the is2re dataset).
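The two steps above (look up the ASCII key, then unpickle) can be wrapped in a small helper. This is only a sketch: index_to_key and read_entry are names made up for illustration, and txn is assumed to be a read transaction obtained from env.begin() (any object with a get(key) method returning bytes or None would do).

```python
import pickle


def index_to_key(index):
    """Encode an integer index as the keys are described above: ASCII digits."""
    return str(index).encode("ascii")


def read_entry(txn, index):
    """Fetch and unpickle one entry; txn is assumed to come from env.begin()."""
    raw = txn.get(index_to_key(index))
    if raw is None:  # a missing key yields None rather than an exception
        raise KeyError(f"no entry with key {index}")
    # Unpickling requires the pickled class (here torch_geometric's Data) to be importable.
    return pickle.loads(raw)
```

With an open environment, this would be used as data = read_entry(env.begin(), 0).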

Field Questions

A datum from the is2re dataset has the fields

pos, cell, atomic_numbers, natoms, tags, fixed, sid, config, y_relaxed

Is there documentation for each field? Here’s what I have come up with:

  • pos is a list of atom coordinates.
  • cell is a 3x3 matrix, say cell=[e1,e2,e3]. The vectors e1 and e2 seem to be the lattice periods; pos defines one unit cell, and it may be translated by e1 or e2 independently to generate all other cells along the face of the lattice.
    • What is the significance of e3, or cell[2]? It looks something like the cross product of e1 and e2, but not exactly.
  • atomic_numbers contains the atomic number of each atom (hydrogen 1…).
  • natoms is the number of atoms, i.e. len(pos), len(atomic_numbers), etc.
  • tags has an entry for each atom. It’s 0 for “subdermal” lattice atoms (below the surface). It’s 1 for “dermal” lattice atoms (near the surface). It’s 2 for adsorbate atoms.
    • gif_maker_parallelized.py contains the comment “#try and guess which atoms are adsorbates since the tags aren't correct after running in vasp”. Is this true for particular datasets or entries? Should tags be ignored?
  • fixed has an entry for each atom. It’s 0 for movable atoms and 1 for stationary ones. Subdermal lattice atoms are fixed, dermal ones and adsorbate ones are dynamic.
  • sid is a chemical identifier, “Compound ID”.
  • config is usually 0 or 1.
    • Is config meaningful/relevant?
  • y_relaxed seems to be adsorption (relaxed) energy.
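
The guesses above about cell, natoms, tags and fixed can be checked numerically. The following is a rough sketch in plain Python, assuming the fields have been converted to nested lists (for example with tensor.tolist()) and that cell has been reduced to a 3x3 list of rows [e1, e2, e3]. All function names are made up for illustration; none of this comes from the ocp library.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def translate(pos, cell, n1, n2):
    """Shift every atom by n1*e1 + n2*e2 to get a neighbouring periodic image."""
    e1, e2 = cell[0], cell[1]
    shift = [n1 * e1[k] + n2 * e2[k] for k in range(3)]
    return [[p[k] + shift[k] for k in range(3)] for p in pos]


def is_parallel(a, b, tol=1e-6):
    """True if the cross product of a and b is numerically zero."""
    return all(abs(c) <= tol for c in cross(a, b))


def check_per_atom_fields(pos, atomic_numbers, tags, fixed, natoms):
    """Check the guessed relations: equal lengths, and fixed == 1 exactly when tag == 0."""
    same_length = len(pos) == len(atomic_numbers) == len(tags) == len(fixed) == natoms
    fixed_matches_tags = all((f == 1) == (t == 0) for f, t in zip(fixed, tags))
    return same_length and fixed_matches_tags
```

For example, is_parallel(cell[2], cross(cell[0], cell[1])) tests whether e3 lies exactly along the normal of the e1-e2 plane; a False result would be consistent with the observation that e3 is close to, but not exactly, the cross product direction.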

Extras

The complete code for reading a single element from an LMDB file is as follows:

import pickle
import lmdb
from torch_geometric.data import Data  # not used directly; torch_geometric must be importable for unpickling
env = lmdb.open(
    "path/to/data.lmdb",
    subdir=False, readonly=True, lock=False,
    readahead=True, meminit=False, max_readers=1,
)
with env.begin() as txn:  # read-only transaction
    data = pickle.loads(txn.get("0".encode("ascii")))
print(data)
env.close()

Of course, different access patterns warrant different options for opening. The ocp library has utilities for IO in lmdb_dataset.py.
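
For reading a whole file rather than one element, a loop over the integer keys is one option. The sketch below assumes that every key in the file is an ASCII integer index, so that the entry count reported by env.stat() is an upper bound on the indices; keys that turn out to be missing are skipped. iter_entries is a hypothetical helper, not part of lmdb or ocp.

```python
import pickle


def iter_entries(env):
    """Yield (index, object) for keys "0", "1", ... of an open lmdb environment."""
    n_entries = env.stat()["entries"]  # total number of key/value pairs in the file
    with env.begin() as txn:  # one read transaction for the whole loop
        for index in range(n_entries):
            raw = txn.get(str(index).encode("ascii"))
            if raw is None:  # key missing: the all-integer-keys assumption does not hold here
                continue
            yield index, pickle.loads(raw)
```

Used as for index, data in iter_entries(env): print(index, data), before env.close().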

Maybe an FAQ for formats could be made. Whatever parts of this question are useful should be extracted/retained; I’m fine with whatever reformatting or deletions are appropriate.
