Using our own VASP data with the OCP codebase

Hello,

We would like to test some of the models in the OCP repo on our own data, which was generated with VASP. How do we get that data into the format the models expect? I’ve looked at the dataset creation tutorial and the data preprocessing tutorial, but both seem to assume that the data was generated and saved by ASE. I’ve also looked at importing VASP runs into ASE, but that only seems to import the final, post-relaxation positions and energies, and the format here also requires the pre-relaxation positions. Do I need to write my own script to turn VASP’s outputs into something ASE can parse with all the information we need, and then apply the pipelines shown in those links? Or is there an easier way?

Hi,

Glad to hear you’re interested in using your own data with the OCP repo. All our data was generated with VASP and then converted via ASE. Assuming you have successfully run VASP, you should see an OUTCAR file that corresponds to the results. The following code will allow you to read in your VASP data for a particular system in a format ready to be used in the dataset creation tutorial:

import ase.io
# The ":" index returns a list of Atoms objects, one per ionic step in the OUTCAR
data = ase.io.read("OUTCAR", ":")

Let me know if this works for you. Alternatively, if you share some code snippets from your end I can better help debug what could be different in our workflows.

I see. I had thought that the OUTCAR file only contained the positions for the atoms post-relaxation, and not their initial positions, which we need for the IS2RE/IS2RS tasks.

If you ran a relaxation via VASP your OUTCAR includes all states - initial position, intermediates, and final position. You can index out initial + final positions accordingly:

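data = ase.io.read("OUTCAR", ":")
initial_position = data[0]   # first ionic step: the unrelaxed input structure
relaxed_position = data[-1]  # last ionic step: relaxed only if the run converged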

Note - the presence of a full OUTCAR doesn’t necessarily mean a relaxed state was reached. The run may have hit the maximum number of steps and terminated without converging. We filtered out these systems by checking that the max absolute force on the final state was less than 0.05 eV/Å.
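A minimal sketch of that convergence filter, in case it helps. The function names (`max_abs_force`, `is_converged`, `load_endpoints_if_converged`) are made up for illustration and are not part of the OCP repo. It also assumes "max absolute force" means the largest absolute force component over all atoms; if you prefer the largest per-atom force norm, swap in that definition.

```python
def max_abs_force(forces):
    """Largest absolute force component (eV/Å) over all atoms.

    `forces` is any N x 3 array-like, e.g. the result of Atoms.get_forces().
    Assumption: the filter is applied per component, not per-atom norm.
    """
    return max(abs(float(component)) for row in forces for component in row)


def is_converged(forces, fmax=0.05):
    """True if every force component is below `fmax` (eV/Å)."""
    return max_abs_force(forces) < fmax


def load_endpoints_if_converged(outcar_path, fmax=0.05):
    """Return (initial, relaxed) Atoms from an OUTCAR, or None if unconverged.

    Hypothetical helper: ASE is imported lazily so the two functions above
    can be used on plain arrays without ASE installed.
    """
    import ase.io

    frames = ase.io.read(outcar_path, index=":")  # one Atoms per ionic step
    if not is_converged(frames[-1].get_forces(), fmax):
        return None
    return frames[0], frames[-1]
```

Calling `load_endpoints_if_converged("OUTCAR")` on each system and skipping the ones that return `None` would leave only the relaxations whose final forces fall under the threshold.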

Ah, okay, I must have misunderstood something then. Thanks very much!

Following up on this, I have a question: how do I load the processed data into the OC22LmdbDataset class? I have loaded the class, but I can’t seem to use it to feed data into a model for training.