Commit 7fe7768b authored by Yin, Junqi

update descriptions

parent 6c4bf7a0
@@ -13,11 +13,12 @@
### [2. PyTorch Distributed Example](#Section2)
* [4 Communication Methods Setup](#4-setup)
* [Performance Comparisons](#comm-compare)
* [BERT on Summit](#bert-summit)
### [3. TensorFlow Distributed Example](#Section3)
* [Multi-worker Mirrored Strategy](#tf-dist)
* [Add Horovod Support](#add-hvd)
- * [Running on Summit](#run-summit)
+ * [ResNet on Summit](#run-summit)
* [Performance: Training Speed vs Convergence](#perf)
### [4. Scaling considerations](#Section4)
@@ -165,7 +166,23 @@ else:
The job script (`examples/pytorch/job.lsf`) and testing logs (`examples/pytorch/logs`) for the 4 distribution modes are also available. Based on the performance plot, we recommend using Horovod with the NCCL backend as the communication method.
- ![](examples/pytorch/pytorch_comm_batch32.png "Comparisons of communication methods")
+ ![](examples/pytorch/synthetic/pytorch_comm_batch32.png "Comparisons of communication methods")
### <a name="bert-summit"></a>2.3 BERT on Summit
This example is modified from Nvidia's [BERT example](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT) and demonstrates the use of Apex data parallel on Summit for NLP workloads. The key modification is setting up environment variables for each rank to establish the communicator:
```bash
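# Unique compute hosts from the LSF hostfile (login and batch nodes excluded); the first one serves as the master address.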
nodes=($(cat ${LSB_DJOB_HOSTFILE} | sort | uniq | grep -v login | grep -v batch))
head=${nodes[0]}
export RANK=$OMPI_COMM_WORLD_RANK
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
export WORLD_SIZE=$OMPI_COMM_WORLD_SIZE
export MASTER_ADDR=$head
export MASTER_PORT=29500 # default from torch launcher
```
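On the Python side, these variables are consumed when the process group is initialized with the `env://` method and the model is wrapped with Apex data parallel. Below is a minimal sketch of that pattern, not the exact code in NVIDIA's pretraining script; the `torch.nn.Linear` module is only a stand-in for the BERT model.
```python
import os

import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# Pin this rank to its local GPU before creating the process group.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are read from the
# environment when init_method="env://" is used.
dist.init_process_group(backend="nccl", init_method="env://")

# Stand-in model; Apex DDP averages gradients across ranks and overlaps
# the all-reduce with the backward pass.
model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model)

print(f"rank {dist.get_rank()}/{dist.get_world_size()} ready on GPU {local_rank}")
```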
The performance of pre-training BERT on the Wikipedia corpus is shown in the following plot (more [details](./examples/pytorch/BERT/README.md)):
![](examples/pytorch/BERT/bert-summit.png "BERT performance on Summit")
## <a name="Section3"></a>3. TensorFlow Distributed Example
@@ -203,7 +220,7 @@ official/resnet/imagenet_main.py
official/resnet/resnet_run_loop.py
official/utils/misc/distribution_utils.py
```
### <a name="run-summit"></a>3.3 Running on Summit
### <a name="run-summit"></a>3.3 ResNet on Summit
For the built-in `MultiWorkerMirroredStrategy`, the main tuning knob is the choice of communication layer, gRPC or NCCL; NCCL should be used on Summit for better performance.
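As an illustration only, the NCCL choice is made when the strategy is constructed; the sketch below uses the `tf.distribute` experimental API from the TF releases this example targets, and the exact names may differ in newer versions.
```python
import tensorflow as tf

# Use NCCL rather than the gRPC-based RING collectives for inter-worker
# communication; cluster membership is resolved from the TF_CONFIG variable.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

with strategy.scope():
    # Variables created inside the scope are mirrored across all workers.
    model = tf.keras.applications.ResNet50(weights=None)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```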
For Horovod, there are several parameters that can be tuned; the following are the settings we found to work well for ResNet on Summit:
......
# BERT benchmark on Wikipedia corpus
This example is a modified version of Nvidia's [BERT example](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT).
## Requirement
Download and pre-process the Wikipedia corpus following the steps in the original [quick start guide](./README_nv.md#quick-start-guide). Then set `INPUT_DATA` in [submit_pretraining.lsf](./submit_pretraining.lsf) to the data path.
## How to run
Simply submit the [job script](./submit_pretraining.lsf) from the example directory.
The key modifications set up the environment for launching Apex data parallel on Summit:
```bash
nodes=($(cat ${LSB_DJOB_HOSTFILE} | sort | uniq | grep -v login | grep -v batch))
head=${nodes[0]}
export RANK=$OMPI_COMM_WORLD_RANK
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
export WORLD_SIZE=$OMPI_COMM_WORLD_SIZE
export MASTER_ADDR=$head
export MASTER_PORT=29500 # default from torch launcher
echo "Setting env_var RANK=${RANK}"
echo "Setting env_var LOCAL_RANK=${LOCAL_RANK}"
echo "Setting env_var WORLD_SIZE=${WORLD_SIZE}"
echo "Setting env_var MASTER_ADDR=${MASTER_ADDR}"
echo "Setting env_var MASTER_PORT=${MASTER_PORT}"
```
These settings are sourced by each rank running the [task script](./scripts/run_pretraining_summit_32node_phase1.sh).
examples/pytorch/BERT/bert-summit.png (new image, 90.5 KiB)

# PyTorch synthetic benchmark with `NCCL` and `MPI` backends and `DDL` and `Horovod` plugins
- This example is a modified version of Horovod's [PyTorch examples](https://github.com/horovod/horovod/blob/master/examples/pytorch_imagenet_resnet50.py).
+ This example is a modified version of Horovod's [PyTorch examples](https://github.com/horovod/horovod/blob/master/examples/pytorch_synthetic_benchmark.py).
## Requirement
- Horovod and PyTorch need to be installed in your environment.
- You need to have access to the `/gpfs/alpine/world-shared` directory on Summit (all valid Summit users should have access).
+ Horovod (with both NCCL and DDL backends) and PyTorch need to be installed in your environment.
## How to run
1. Navigate to this folder.
- 2. Type `bsub bench.lsf` to submit the job.
+ 2. Type `bsub job.lsf` to submit the job.
The following is the modification for general usage (taken from the [Horovod repository](https://github.com/horovod/horovod/blob/master/docs/pytorch.rst)).
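The elided snippet presumably follows the standard Horovod PyTorch integration described in that documentation; a minimal sketch of the pattern:
```python
import torch
import horovod.torch as hvd

# Initialize Horovod and pin each rank to a single GPU.
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Toy model; scale the learning rate by the number of ranks.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across ranks via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Broadcast initial state from rank 0 so all ranks start identically.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```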
......