# disMultiABM issues
https://code.ornl.gov/groups/disMultiABM/-/issues

## gradient based SNN training in the complex plane (stemdl#7)
https://code.ornl.gov/disMultiABM/stemdl/-/issues/7
Updated 2019-12-15T00:18:26Z. Author/assignee: Laanait, Nouamane.

Implement the diffeqs of the SNN and see if one can avoid singularities by going to the complex plane and picking up principal values?

## recurrent inverter machine (stemdl#6)
https://code.ornl.gov/disMultiABM/stemdl/-/issues/6
Updated 2019-12-15T00:14:33Z. Author/assignee: Laanait, Nouamane.

Reverse-mode AD on the forward model <--> forward-mode AD on the recurrent machine.
Essentially, have a recurrent machine learn the dynamics of the forward model running under reverse-mode AD; there should be an analogous setup to do with forward mode as well.

## recurrent ynet (stemdl#5)
https://code.ornl.gov/disMultiABM/stemdl/-/issues/5
Updated 2019-12-14T19:24:48Z. Author: Laanait, Nouamane.

## New Tensorflow I/O pipeline with LMDB (stemdl#4)
https://code.ornl.gov/disMultiABM/stemdl/-/issues/4
Updated 2019-02-07T14:40:56Z. Author: Laanait, Nouamane.

Need a new Tensorflow I/O pipeline.
Current pipeline `stemdl/inputs.py/datasetTfrecord` uses TFRecords to read images/labels and a `StagingArea` for asynchronous get/put.
Large input sizes (i.e. images) expose an intrinsic limitation of TFRecords: the best bandwidth achieved on Summit is __0.5 GB/sec__ for input sizes [1024,512,512] (CHW) float32. See `lrn001/nl/dl/tf_io` for all relevant scripts to benchmark I/O.
Current I/O bandwidths lead to (very) poor single-GPU performance.
Per Sean, he hit the same limitation working with the NERSC team and moved away from TFRecords to HDF5/NumPy. LMDB should have much better read performance than h5py/numpy (in PyTorch, @jqyin achieved I/O bandwidths of __2.5 GB/sec__).
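To put the quoted figures in perspective, a back-of-the-envelope calculation (assuming GB means 1e9 bytes, as the numbers above appear to) shows why 0.5 GB/sec starves the GPU at these input sizes:

```python
# Rough arithmetic for the I/O figures quoted above (GB taken as 1e9 bytes).
shape = (1024, 512, 512)                    # CHW input size from the benchmark
bytes_per_sample = 1024 * 512 * 512 * 4     # float32 = 4 bytes/element, ~1.07 GB

tfrecord_bw = 0.5e9   # measured TFRecords bandwidth on Summit (bytes/s)
lmdb_bw = 2.5e9       # bandwidth @jqyin reported with LMDB/PyTorch (bytes/s)

print(f"{bytes_per_sample / 1e9:.2f} GB per sample")            # 1.07 GB
print(f"tfrecords: {bytes_per_sample / tfrecord_bw:.2f} s")     # ~2.15 s per sample
print(f"lmdb:      {bytes_per_sample / lmdb_bw:.2f} s")         # ~0.43 s per sample
```

At ~2 seconds of pure reading per sample, a single GPU spends most of its time idle unless reads are heavily overlapped, which motivates the to-do list below.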
To do:
1. Subclass `stemdl/inputs.py/DatasetTFRecords`, overriding the pure-TFRecords methods (`self.decode_image_label`) and modifying `self.minibatch` to use an __lmdb__ file (for I/O with an lmdb/torch implementation see `stemdl/io_utils_torch.py/ABFDataSet`).
2. Benchmark with a single Python process.
3. Implement and benchmark a version with Python `multiprocessing`.
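Step 1 above hinges on (de)serializing image/label pairs as LMDB values. A minimal sketch of that byte layout, assuming float32 arrays with a small shape header per array; the `lmdb` environment and `tf.data` wiring are omitted, and `encode_sample`/`decode_sample` are illustrative names, not the actual stemdl API:

```python
# Sketch: byte-level (de)serialization for image/label pairs stored as LMDB
# values. The lmdb env.put()/txn.get() calls themselves are omitted here.
import struct
import numpy as np

def encode_sample(image: np.ndarray, label: np.ndarray) -> bytes:
    """Pack two float32 arrays into one buffer: [ndim, dims...] header each."""
    out = bytearray()
    for arr in (image, label):
        arr = np.ascontiguousarray(arr, dtype=np.float32)
        out += struct.pack("<B", arr.ndim)
        out += struct.pack(f"<{arr.ndim}I", *arr.shape)
        out += arr.tobytes()
    return bytes(out)

def decode_sample(buf: bytes):
    """Inverse of encode_sample; returns (image, label)."""
    arrays, offset = [], 0
    for _ in range(2):
        ndim = struct.unpack_from("<B", buf, offset)[0]; offset += 1
        shape = struct.unpack_from(f"<{ndim}I", buf, offset); offset += 4 * ndim
        n = int(np.prod(shape))
        arr = np.frombuffer(buf, np.float32, count=n, offset=offset).reshape(shape)
        offset += 4 * n
        arrays.append(arr)
    return arrays[0], arrays[1]
```

In the pipeline itself, an overridden `decode_image_label` would call something like `decode_sample` on values pulled from a read-only `lmdb` transaction, fed into TF via `tf.data.Dataset.from_generator`.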
Done means:
New TF I/O pipeline with bandwidths >= 1 GB/s.

Milestone: ACM GB Prize Prep. Assignees: Yin, Junqi; Starchenko, Vitalii. Due 2019-02-08.

## Implement, train, and benchmark deeplabv3 (stemdl#3)
https://code.ornl.gov/disMultiABM/stemdl/-/issues/3
Updated 2019-12-14T19:22:14Z. Author: Laanait, Nouamane.

Sean got pretty good task accuracy and hardware performance out of deeplab, surpassing FCDenseNet in both.
Tasks:
* [ ] Implement deeplab.
* [ ] Train deeplab.
* [ ] Benchmark deeplab. This is related to #1.

Milestone: ACM GB Prize Prep. Assignee: Laanait, Nouamane. Due 2019-02-05.

## FCDenseNet Benchmarks (stemdl#1)
https://code.ornl.gov/disMultiABM/stemdl/-/issues/1
Updated 2019-12-14T19:23:34Z. Author: Laanait, Nouamane.

A reconstruction of EM data using FCDenseNet looks promising. As such, FCDenseNet is a top candidate model to use in a GB run and/or SC'19 paper submission.
Carry out performance (single gpu for flops) and scaling (multiple nodes for communication) studies of FCDenseNet.
Input sizes from the simulation will vary; some relevant sizes are [_x_,256,256], _x_ = 16x16, 32x32, etc.
Output size from the simulation is [256,256].
Necessary code (to build FCDenseNet) is in `stemdl/network`.
Benchmarks (might) require the following code mods:
* [x] 1. Modify stemdl/inputs/DatasetTFRecords to generate batch of inputs+outputs on the fly.
* [x] 2. Create dummy TFRecords (for relevant inputs+outputs) to assess impact of I/O.
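For mod 2 above, writing real TFRecords needs `tf.io.TFRecordWriter`; as a TF-free stand-in, dummy fixed-size samples can be written as raw binary to isolate read bandwidth from the simulation. A sketch, with illustrative names and the [_x_,256,256] / [256,256] shapes from above as defaults:

```python
# Sketch: write dummy random (input, output) float32 pairs to disk so read
# bandwidth can be measured without the simulation. Raw binary stands in for
# TFRecords here; the real mod would emit tf.train.Example records instead.
import numpy as np

def write_dummy_samples(path: str, n: int, in_shape=(256, 256, 256),
                        out_shape=(256, 256)) -> int:
    """Append n random input/output pairs to path; returns bytes written."""
    rng = np.random.default_rng(0)
    written = 0
    with open(path, "wb") as f:
        for _ in range(n):
            for shape in (in_shape, out_shape):
                buf = rng.random(shape, dtype=np.float32).tobytes()
                f.write(buf)
                written += len(buf)
    return written
```

With the default shapes each pair is ~64 MB, so even a handful of samples is enough to exercise the filesystem rather than the page cache.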
Benchmarks can be quantified using:
* [ ] 1. gpu timeline traces.
* [ ] 2. analytical flops.
* [ ] 3. Model's data processing throughput as a function of ranks.
__Important Notes__:
1. Most of the necessary code is already implemented (in stemdl).
2. Coordinate with Sean T. (in particular, Sean has a binary that forces direct convolutions --> 2x performance).

Milestone: ACM GB Prize Prep. Assignee: Yin, Junqi. Due 2019-01-28.