FCDenseNet Benchmarks

Reconstruction of EM data using FCDenseNet looks promising. As such, FCDenseNet is a top candidate model to use in a GB run and/or an SC'19 paper submission. Carry out performance (single GPU, for FLOPS) and scaling (multiple nodes, for communication) studies of FCDenseNet.
Input sizes from simulation will vary; some relevant sizes are [x,256,256] with x = 16x16, 32x32, etc.
Output size from simulation is [256,256].
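
To get a feel for the data volumes involved, the per-sample footprint of these shapes can be worked out directly. The sketch below reads x=16x16 as x = 256 channels and assumes float32 storage (4 bytes per value); the dtype is an assumption, and `input_shape` and `num_bytes` are illustrative helper names, not part of stemdl.

```python
# Sketch: per-sample tensor shapes and memory footprint for the sizes above.
# Assumption: values are stored as float32 (4 bytes each).
BYTES_PER_FLOAT32 = 4


def input_shape(grid_side, height=256, width=256):
    """Return [x, height, width] with x = grid_side * grid_side (e.g. 16x16)."""
    return (grid_side * grid_side, height, width)


def num_bytes(shape, bytes_per_value=BYTES_PER_FLOAT32):
    """Return the storage needed for one tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_value


for side in (16, 32):
    shape = input_shape(side)
    print(shape, num_bytes(shape) / 2**20, "MiB per input sample")

print((256, 256), num_bytes((256, 256)) / 2**20, "MiB per output sample")
```

Under these assumptions a single 16x16 input is 64 MiB and a 32x32 input is 256 MiB, so the batch size per GPU and the I/O bandwidth per rank are both worth tracking in the benchmarks.
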
Necessary code (to build FCDenseNet) is in stemdl/network.

Benchmarks might require the following code mods:

1. Modify stemdl/inputs/DatasetTFRecords to generate batches of inputs+outputs on the fly.
2. Create dummy TFRecords (for relevant inputs+outputs) to assess the impact of I/O.
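
The first code mod amounts to feeding the model from memory instead of from disk. A minimal sketch of that idea is shown here using only the Python standard library; it is not the stemdl implementation, which would presumably produce TensorFlow tensors instead. The function name `synthetic_batches` and its parameters are hypothetical.

```python
import array
import itertools


def synthetic_batches(batch_size, input_shape, output_shape, num_steps=None):
    """Yield (inputs, outputs) pairs of flat float32 buffers without touching disk.

    The buffers are allocated once and reused on every step, so the cost of
    producing a batch is negligible and the remaining step time can be
    attributed to compute or communication rather than I/O.
    """
    itemsize = array.array("f").itemsize

    def count(shape):
        n = batch_size
        for dim in shape:
            n *= dim
        return n

    # Zero-filled buffers; the values do not matter for timing purposes.
    inputs = array.array("f", bytes(itemsize * count(input_shape)))
    outputs = array.array("f", bytes(itemsize * count(output_shape)))

    steps = itertools.count() if num_steps is None else range(num_steps)
    for _ in steps:
        yield inputs, outputs


# Small shapes keep the example cheap; real runs would use e.g. (256, 256, 256).
for step, (x, y) in enumerate(synthetic_batches(2, (4, 8, 8), (8, 8), num_steps=3)):
    print(step, len(x), len(y))
```

Comparing step times from a run like this against a run that reads the dummy TFRecords gives a direct estimate of the I/O overhead.
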

Benchmarks can be quantified using:

1. GPU timeline traces.
2. Analytical FLOPs.
3. The model's data-processing throughput as a function of the number of ranks.
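
The second and third metrics reduce to simple arithmetic, sketched below. The FLOP count uses the usual convention of 2 FLOPs per multiply-add for a single 2D convolution and ignores biases, normalization, and activations. The layer dimensions, batch size, and timings in the example are made-up illustrative values, not FCDenseNet's actual configuration or measured results, and the function names are hypothetical.

```python
def conv2d_flops(h_out, w_out, c_in, c_out, kernel_h, kernel_w):
    """Forward-pass FLOPs of one 2D convolution (multiply-add = 2 FLOPs)."""
    return 2 * h_out * w_out * c_in * c_out * kernel_h * kernel_w


def throughput(batch_size_per_rank, ranks, step_time_s):
    """Samples processed per second, summed over all ranks."""
    return batch_size_per_rank * ranks / step_time_s


def scaling_efficiency(throughput_n, ranks_n, throughput_1):
    """Measured throughput relative to ideal linear scaling from one rank."""
    return throughput_n / (ranks_n * throughput_1)


# Illustrative values only: a 3x3 convolution mapping 256 input channels
# to 48 output channels on a 256x256 feature map.
flops = conv2d_flops(256, 256, 256, 48, 3, 3)
print(f"{flops / 1e9:.2f} GFLOPs per sample for this layer")

# Illustrative timings: 4 samples per rank, 6 ranks, 0.5 s per step.
print(throughput(4, 6, 0.5), "samples/s")
```

Summing `conv2d_flops` over all layers and dividing by the measured step time gives an achieved FLOP rate that can be compared against the GPU's peak, while `scaling_efficiency` summarizes how throughput holds up as ranks are added.
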

Important Notes:

Most of the necessary code is already implemented (in stemdl).
Coordinate with Sean T. (in particular, Sean has a binary that forces direct convolutions, which gives 2x performance).

Edited Jan 22, 2019 by Laanait, Nouamane