# disMultiABM issues
https://code.ornl.gov/groups/disMultiABM/-/issues (updated 2019-02-07T14:40:56Z)

## New Tensorflow I/O pipeline with LMDB
https://code.ornl.gov/disMultiABM/stemdl/-/issues/4 · Laanait, Nouamane · updated 2019-02-07T14:40:56Z

We need a new Tensorflow I/O pipeline.
The current pipeline, `stemdl/inputs.py/datasetTfrecord`, uses TFRecords to read images/labels and a StagingArea for asynchronous get/put.
Large input sizes (i.e., large images) expose an intrinsic limitation of TFRecords: the best bandwidth achieved on Summit is __0.5 GB/sec__ for input size [1024,512,512] (CHW) float32. See `lrn001/nl/dl/tf_io` for all relevant I/O benchmark scripts.
These I/O bandwidths lead to very poor single-GPU performance.
Per Sean, the NERSC team ran into the same problem and moved away from TFRecords to HDF5/NumPy. LMDB should have much better read performance than h5py/NumPy (in PyTorch, @jqyin achieved I/O bandwidths of __2.5 GB/sec__).
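The quoted bandwidth figures come from the benchmark scripts noted above; a minimal sketch of the measurement idea (file path and sizes here are placeholders, not the actual benchmark setup) is:

```python
import os
import tempfile
import time

def measure_read_bandwidth(path, chunk_bytes=64 << 20):
    """Sequentially read `path` and return the observed bandwidth in GB/s."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e9

# Demo on a small scratch file (stand-in for a TFRecord/LMDB shard).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 << 20))  # 8 MB of dummy data
    path = tmp.name
print("%.2f GB/s" % measure_read_bandwidth(path))
os.remove(path)
```

Note that the OS page cache will inflate the figure for freshly written files; a real benchmark needs cold caches (or files larger than node memory).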
To do:
1. Subclass `stemdl/inputs.py/DatasetTFRecords`, override the TFRecords-specific methods (`self.decode_image_label`), and modify `self.minibatch` to use an __lmdb__ file (for I/O with an lmdb/torch implementation, see `stemdl/io_utils_torch.py/ABFDataSet`).
2. Benchmark with a single Python process.
3. Implement and benchmark a version with Python `multiprocessing`.
Done means:
A new TF I/O pipeline with bandwidth >= 1 GB/s.

ACM GB Prize Prep · Yin, Junqi · Starchenko, Vitalii · Yin, Junqi · 2019-02-08

## Implement, train, and benchmark deeplabv3
https://code.ornl.gov/disMultiABM/stemdl/-/issues/3 · Laanait, Nouamane · updated 2019-12-14T19:22:14Z

Sean got some pretty good task accuracy and hardware performance out of DeepLab, surpassing FCDenseNet in both.
Tasks:
* [ ] Implement deeplab.
* [ ] Train deeplab.
* [ ] Benchmark deeplab. This is related to #1.
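For reference while implementing: the distinguishing ingredient of deeplabv3 is atrous (dilated) convolution, which enlarges the receptive field without adding parameters. A framework-free numpy sketch of the operation (single channel, 'valid' padding; the real model of course runs on cuDNN kernels):

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    """2-D cross-correlation with dilation `rate`, 'valid' padding,
    single channel.  The kernel taps are spread `rate` pixels apart,
    so the receptive field grows while the parameter count does not."""
    kh, kw = kernel.shape
    eff_h = (kh - 1) * rate + 1  # effective kernel extent
    eff_w = (kw - 1) * rate + 1
    H, W = x.shape
    out = np.zeros((H - eff_h + 1, W - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + eff_h:rate, j:j + eff_w:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(64, dtype=float).reshape(8, 8)
print(atrous_conv2d(x, np.ones((3, 3)), rate=2).shape)  # (4, 4)
```

deeplabv3's ASPP module runs several such convolutions at different rates in parallel and concatenates the results.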
ACM GB Prize Prep · Laanait, Nouamane · Laanait, Nouamane · 2019-02-05

## FCDenseNet Benchmarks
https://code.ornl.gov/disMultiABM/stemdl/-/issues/1 · Laanait, Nouamane · updated 2019-12-14T19:23:34Z
A reconstruction of EM data using FCDenseNet looks promising. As such, FCDenseNet is a top candidate model for a GB run and/or an SC'19 paper submission.
Carry out performance (single GPU, for FLOPs) and scaling (multiple nodes, for communication) studies of FCDenseNet.
Input sizes from simulation will vary; relevant sizes are [_x_,256,256] with _x_ = 16x16, 32x32, etc.
Output size from simulation is [256,256].
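A quick footprint check for these shapes (float32, 4 bytes/element) helps size batches and the I/O budget:

```python
def tensor_mb(shape, bytes_per_el=4):
    """Size in MB of a dense float32 tensor with the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_el / 2**20

print(tensor_mb((256, 256, 256)))   # x = 16x16 input -> 64.0 MB/sample
print(tensor_mb((1024, 256, 256)))  # x = 32x32 input -> 256.0 MB/sample
print(tensor_mb((256, 256)))        # output -> 0.25 MB/sample
```

Multiply by the batch size to get the per-step I/O volume; at the bandwidths discussed in issue #4 this dominates step time quickly.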
The code needed to build FCDenseNet is in `stemdl/network`.
Benchmarks (might) require the following code mods:
* [x] 1. Modify `stemdl/inputs/DatasetTFRecords` to generate batches of inputs+outputs on the fly.
* [x] 2. Create dummy TFRecords (for the relevant inputs+outputs) to assess the impact of I/O.
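Item 1 can be prototyped without touching TF at all; a numpy sketch of an on-the-fly batch generator (function name is illustrative, shapes taken from the sizes above):

```python
import numpy as np

def synthetic_batches(batch_size, in_shape, out_shape, seed=0):
    """Yield random (input, target) pairs with the FCDenseNet shapes,
    so compute/scaling can be timed with disk I/O taken out."""
    rng = np.random.default_rng(seed)
    while True:
        x = rng.standard_normal((batch_size,) + in_shape, dtype=np.float32)
        y = rng.standard_normal((batch_size,) + out_shape, dtype=np.float32)
        yield x, y

# The x = 16x16 case would be in_shape=(256, 256, 256), out_shape=(256, 256);
# small shapes here keep the demo light.
x, y = next(synthetic_batches(2, (4, 8, 8), (8, 8)))
print(x.shape, y.shape)  # (2, 4, 8, 8) (2, 8, 8)
```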
Benchmarks can be quantified using:
* [ ] 1. GPU timeline traces.
* [ ] 2. Analytical FLOPs.
* [ ] 3. The model's data-processing throughput as a function of rank count.
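Item 2 (analytical FLOPs) reduces to per-layer bookkeeping; a minimal sketch for a stride-1 'same' convolution (the channel counts below are placeholders, not FCDenseNet's actual widths):

```python
def conv2d_flops(h, w, c_in, c_out, k=3):
    """FLOPs of one stride-1 'same' conv layer at resolution h x w:
    one multiply-accumulate per output element per kernel tap,
    counted as 2 FLOPs."""
    macs = h * w * c_in * c_out * k * k
    return 2 * macs

print(conv2d_flops(256, 256, 48, 16) / 1e9)  # ~0.91 GFLOPs
```

Summing this over every layer of the network (plus the transposed convolutions on the upsampling path) gives the analytical number to compare against the measured GPU timeline.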
__Important Notes__:
1. Most of the necessary code is already implemented (in `stemdl`).
2. Coordinate with Sean T. (in particular, Sean has a binary that forces direct convolutions --> 2x performance).

ACM GB Prize Prep · Yin, Junqi · Yin, Junqi · 2019-01-28