Load root directories directly
In implementing STEMDataModule, we enabled custom datasets to override prepare_data(), which lets you customize all of the data preprocessing. The result is put into a "root" directory (the --root argument) which holds a directory of .pth or .npy files, along with an index.csv giving locations and other parameters for those files, and hparams and microscope params saved as JSON files.
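To make the layout concrete, here is a minimal sketch of reading the metadata of such a root directory using only the standard library. The JSON file names (hparams.json, microscope.json) and the "path" column in index.csv are assumptions made for illustration; the actual names written by prepare_data() may differ.

```python
import csv
import json
from pathlib import Path


def read_root(root, hparams_name="hparams.json", microscope_name="microscope.json"):
    """Read the metadata of a preprocessed root directory.

    The JSON file names and the "path" column are assumptions for this sketch.
    """
    root = Path(root)

    # index.csv: one row per preprocessed file, with its location and parameters.
    with open(root / "index.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Resolve each file location relative to the root so the directory stays portable.
    for row in rows:
        row["path"] = root / row["path"]

    # hparams and microscope params are stored as JSON next to the index.
    hparams = json.loads((root / hparams_name).read_text())
    microscope = json.loads((root / microscope_name).read_text())
    return rows, hparams, microscope
```

Resolving paths relative to the root is a design choice for this sketch: it keeps the directory self-contained, so it can be copied elsewhere without rewriting index.csv.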
In cases where we have large datasets that have already been processed, it would be nice to simply load these preprocessed root directories instead of specifying all of the datamodule parameters and holding on to the unprocessed data. For this, we could add another datamodule, named preprocessed or something similar, that takes only the default parameters (including --root). This would let us just pass these directories around, e.g. to OLCF, and would simplify our command line arguments considerably.
Related: #4 (closed)
Plan
All that's needed is to implement this inside of STEMDataModule. We'll need to move the --root argument and setup() method there, so that PTODataModule basically only implements prepare_data(). Then we'll add the preprocessed datamodule name to data/__init__.py and test that we can load it properly.
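The plan above could look roughly like the following sketch. It uses plain Python classes so it stays self-contained; in the real project STEMDataModule presumably builds on a Lightning datamodule, and the add_argparse_args helper, the DATAMODULES registry, and the "pto" key are hypothetical stand-ins rather than the project's actual API.

```python
import argparse
from pathlib import Path


class STEMDataModule:
    """Base datamodule sketch: owns --root and setup()."""

    def __init__(self, root):
        self.root = Path(root)
        self.index_path = None

    @staticmethod
    def add_argparse_args(parser):
        # --root lives on the base class, so every datamodule accepts it.
        parser.add_argument("--root", type=str, required=True)
        return parser

    def prepare_data(self):
        # Nothing to do: the root directory is assumed to be preprocessed already.
        pass

    def setup(self, stage=None):
        # Shared loading logic: locate the index written by preprocessing.
        index_path = self.root / "index.csv"
        if not index_path.is_file():
            raise FileNotFoundError(f"no index.csv in {self.root}")
        self.index_path = index_path


class PTODataModule(STEMDataModule):
    def prepare_data(self):
        # Placeholder for dataset-specific preprocessing, which would write
        # the data files and index.csv into self.root.
        self.root.mkdir(parents=True, exist_ok=True)
        (self.root / "index.csv").write_text("path\n")


# Hypothetical registry, standing in for data/__init__.py.
DATAMODULES = {"preprocessed": STEMDataModule, "pto": PTODataModule}
```

With this split, the "preprocessed" name maps directly to the base class: it skips preprocessing and only needs --root, while PTODataModule inherits setup() unchanged.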