Unverified Commit 129295de authored by Tao Lin, committed by GitHub

Merge pull request #1 from IamTao/master

distributed deep learning code, supported by MPI.
parents ff0d7903 9b6a484f
# CHOCO-SGD
The code repository for the main experiments in the papers [Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication](https://arxiv.org/abs/1902.00340) and [Decentralized Deep Learning with Arbitrary Communication Compression](https://arxiv.org/abs/1907.09356).
Please refer to the folders `convex_code` and `dl_code` for more details.

# Reference
If you use the code, please cite the following papers:
```
@inproceedings{ksj2019choco,
  author = {Anastasia Koloskova and Sebastian U. Stich and Martin Jaggi},
  title = {Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication},
  booktitle = {ICML 2019 - Proceedings of the 36th International Conference on Machine Learning},
  url = {http://proceedings.mlr.press/v97/koloskova19a.html},
  series = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  volume = {97},
  pages = {3479--3487},
  year = {2019}
}
```
and
```
@article{koloskova2019decentralized,
  title = {Decentralized Deep Learning with Arbitrary Communication Compression},
  author = {Koloskova, Anastasia and Lin, Tao and Stich, Sebastian U. and Jaggi, Martin},
  journal = {arXiv preprint arXiv:1907.09356},
  year = {2019}
}
```
# CHOCO-SGD
Code for the main experiments of the paper [Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication](https://arxiv.org/abs/1902.00340).
### Datasets and Setup
First, you need to download the datasets from the LIBSVM library and convert them into pickle format. To do so, run:
```
cd data
wget -t inf https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.bz2
wget -t inf https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_test.binary.bz2
cd ../code
python pickle_datasets.py
```
If you get a memory error, you can keep the rcv1 dataset in sparse format, but this will slow down training.
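For reference, the sketch below shows roughly what such a conversion step can look like. It is a minimal, illustrative example (not the repository's `pickle_datasets.py`), using scikit-learn to read a LIBSVM file and pickling the result; the paths in the usage comment are hypothetical.
```python
# Minimal, illustrative LIBSVM -> pickle conversion (not the repo's pickle_datasets.py).
import pickle
from sklearn.datasets import load_svmlight_file

def libsvm_to_pickle(in_path, out_path, densify=False):
    # load features (scipy CSR sparse matrix) and labels from the LIBSVM file;
    # load_svmlight_file can also read .bz2-compressed files directly.
    X, y = load_svmlight_file(in_path)
    if densify:
        # dense arrays are faster to train on but may not fit in memory (see note above)
        X = X.toarray()
    with open(out_path, "wb") as f:
        pickle.dump((X, y), f)

# example (hypothetical paths):
# libsvm_to_pickle("../data/rcv1_test.binary.bz2", "../data/rcv1.pickle", densify=False)
```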
### Reproduce the results
To run the experiments with the `epsilon` dataset:
```
python experiment_epsilon_final.py final
```
# Reference
If you use this code, please cite the following [paper](http://proceedings.mlr.press/v97/koloskova19a.html):
```
@inproceedings{ksj2019choco,
  author = {Anastasia Koloskova and Sebastian U. Stich and Martin Jaggi},
  title = {Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication},
  booktitle = {ICML 2019 - Proceedings of the 36th International Conference on Machine Learning},
  url = {http://proceedings.mlr.press/v97/koloskova19a.html},
  series = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  volume = {97},
  pages = {3479--3487},
  year = {2019}
}
```
*.vscode
Makefile
# CHOCO-SGD
Deep Learning code for the main experiments of the paper [Decentralized Deep Learning with Arbitrary Communication Compression](https://arxiv.org/abs/1907.09356).
## Getting started
Our experiments heavily rely on `Docker` and `Kubernetes`. For the detailed experimental environment setup, please refer to the Dockerfiles under the `environments` folder.
### Use case of distributed training (centralized/decentralized)
A brief explanation of the arguments used in the code:
* Arguments related to *distributed training*:
* The `n_mpi_process` and `n_sub_process` indicate the number of nodes and the number of GPUs per node. The data-parallel wrapper is adapted and applied locally on each node.
* Note that the exact mini-batch size for each MPI process is specified by `batch_size`, while the mini-batch size used on each GPU is `batch_size/n_sub_process` (see the sketch after this list).
* The `world` describes the GPU topology of the distributed training, i.e., the GPU ids used by all MPI processes.
* The `hostfile` used by `mpi` specifies the physical location of the MPI processes.
* We provide two use cases here:
* `n_mpi_process=2`, `n_sub_process=1` and `world=0,0` indicates that two MPI processes run on 2 GPUs with the same GPU id; this can be either one shared GPU on a single node or two GPUs on different nodes, where the exact placement is determined by the `hostfile`.
* `n_mpi_process=2`, `n_sub_process=2` and `world=0,1,0,1` indicates that two MPI processes run on 4 GPUs in total, each MPI process using GPU ids 0 and 1 (on 2 nodes).
* Arguments related to *communication compression*:
* The `graph_topology` specifies the communication topology of the decentralized training, e.g., `ring` as used in the script below.
* The `optimizer` decides the type of distributed training algorithm, e.g., centralized SGD or decentralized SGD; the script below uses `parallel_choco`.
* The `comm_op` specifies the communication compressor to use, e.g., `sign+norm`, `random-k`, `top-k`.
* The `choco_consenus_stepsize` sets the consensus step size used by `parallel_choco`.
* Arguments related to *learning*:
* The `lr_schedule_scheme` and `lr_change_epochs` indicate a stepwise learning rate schedule; in the script below, the learning rate decays by a factor of `10` at epochs `150` and `225`.
* The `lr_scaleup`, `lr_warmup` and `lr_warmup_epochs` decide whether to scale up and/or warm up the learning rate.
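To make the relationship between these flags concrete, here is a small, hypothetical Python sketch (not part of the repository) of how the per-GPU mini-batch size, the GPU assignment of each MPI rank, and a stepwise learning rate schedule follow from `batch_size`, `n_mpi_process`, `n_sub_process`, `world`, `lr_change_epochs` and `lr_decay`.
```python
# Hypothetical helpers (not part of the repository) relating the flags above.
def per_gpu_batch_size(batch_size, n_sub_process):
    # each MPI process consumes `batch_size` samples per step,
    # split evenly across its `n_sub_process` local GPUs
    assert batch_size % n_sub_process == 0
    return batch_size // n_sub_process


def gpus_of_rank(world, n_mpi_process, n_sub_process, rank):
    # `world` lists n_mpi_process * n_sub_process GPU ids, grouped by MPI rank
    ids = [int(g) for g in world.split(",")]
    assert len(ids) == n_mpi_process * n_sub_process
    return ids[rank * n_sub_process:(rank + 1) * n_sub_process]


def multistep_lr(epoch, base_lr=0.1, change_epochs=(150, 225), decay=10.0):
    # stepwise schedule: divide the learning rate by `decay` at each change epoch
    return base_lr / decay ** sum(epoch >= e for e in change_epochs)


# second use case above: 2 MPI processes with 2 GPUs each, on GPU ids 0 and 1
print(per_gpu_batch_size(128, 2))             # -> 64
print(gpus_of_rank("0,1,0,1", 2, 2, rank=1))  # -> [0, 1]
print(multistep_lr(160))                      # -> 0.01
```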
The script below trains `ResNet-20` on `CIFAR-10`, as an example of the decentralized training algorithm `parallel_choco` with the `sign+norm` communication compressor.
```bash
$HOME/conda/envs/pytorch-py3.6/bin/python run.py \
--arch resnet20 --optimizer parallel_choco \
--avg_model True --experiment demo \
--data cifar10 --pin_memory True \
--batch_size 128 --base_batch_size ${base_batch_size[j]} --num_workers 2 --eval_freq 1 \
--num_epochs 300 --partition_data random --reshuffle_per_epoch True --stop_criteria epoch \
--n_mpi_process 16 --n_sub_process 1 --world 0,0,0,0,0,0,0,0 --on_cuda True --use_ipc False --comm_device cuda \
--lr 0.1 --lr_scaleup True --lr_scaleup_factor graph --lr_warmup True --lr_warmup_epochs 5 \
--lr_schedule_scheme custom_multistep --lr_change_epochs 150,225 --lr_decay 10 \
--weight_decay 1e-4 --use_nesterov True --momentum_factor 0.9 \
--comm_op sign --choco_consenus_stepsize 0.5 --compress_ratio 0.9 --quantize_level 16 --is_biased True \
--hostfile iccluster/hostfile --graph_topology ring --track_time True --display_tracked_time True \
--python_path $HOME/conda/envs/pytorch-py3.6/bin/python --mpi_path $HOME/.openmpi/ --evaluate_avg True
```
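For intuition about what one `parallel_choco` step computes, the following is a minimal single-process NumPy simulation of the CHOCO-SGD update on a toy quadratic problem, with a simplified stand-in for the `sign+norm` compressor. It is only an illustrative sketch under simplifying assumptions (toy objective, exact gradients, synchronous gossip) and not the repository's MPI implementation.
```python
# Illustrative single-process simulation of the CHOCO-SGD update on a toy problem.
import numpy as np

n, d = 8, 10                        # number of nodes, parameter dimension
rng = np.random.default_rng(0)
targets = rng.normal(size=(n, d))   # node i holds f_i(x) = 0.5 * ||x - targets[i]||^2
opt = targets.mean(axis=0)          # minimizer of the average objective

# symmetric, doubly stochastic ring mixing matrix (1/3 on self and both neighbors)
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] += 1.0 / 3.0

def compress(v):
    # "sign + norm"-style compressor, applied per node (per row):
    # keep only the sign and rescale it by the mean absolute value.
    return np.mean(np.abs(v), axis=1, keepdims=True) * np.sign(v)

x = np.zeros((n, d))      # local models
x_hat = np.zeros((n, d))  # compressed "public" copies shared with neighbors
eta, gamma = 0.1, 0.5     # learning rate and consensus step size

for t in range(200):
    grads = x - targets                  # exact local gradients (no sampling noise here)
    x = x - eta * grads                  # local SGD step
    x_hat = x_hat + compress(x - x_hat)  # communicate only the compressed difference
    x = x + gamma * (W @ x_hat - x_hat)  # gossip step using the public copies
    if (t + 1) % 50 == 0:
        err = np.linalg.norm(x.mean(axis=0) - opt)
        print(f"step {t + 1:3d}: |averaged model - optimum| = {err:.2e}")
```
Here `eta` and `gamma` roughly play the roles of `lr` and `choco_consenus_stepsize` in the script above.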
# -*- coding: utf-8 -*-
import os
import argparse

import pcode.utils.op_files as op_files
from pcode.tools.show_results import load_raw_info_from_experiments


def get_args():
    """Parse and define arguments for different tasks."""
    # feed them to the parser.
    parser = argparse.ArgumentParser(description='Extract results.')

    # add arguments.
    parser.add_argument('--in_dir', type=str)
    parser.add_argument('--out_name', type=str, default='summary.pickle')

    # parse args.
    args = parser.parse_args()

    # an argument safety check.
    check_args(args)
    return args


def check_args(args):
    assert args.in_dir is not None

    # define out path.
    args.out_path = os.path.join(args.in_dir, args.out_name)


def main(args):
    """Write the results to path."""
    # save the parsed results to path.
    op_files.write_pickle(
        load_raw_info_from_experiments(args.in_dir),
        args.out_path)


if __name__ == '__main__':
    args = get_args()
    main(args)
# the following two lines give a two-line status, with the current window highlighted
hardstatus alwayslastline
hardstatus string '%{= kG}[%{G}%H%? %1`%?%{g}][%= %{= kw}%-w%{+b yk} %n*%t%?(%u)%? %{-}%+w %=%{g}][%{B}%m/%d %{W}%C%A%{g}]'
# huge scrollback buffer
defscrollback 5000
# no welcome message
startup_message off
# 256 colors
attrcolor b ".I"
termcapinfo xterm 'Co#256:AB=\E[48;5;%dm:AF=\E[38;5;%dm'
defbce on
# mouse tracking allows switching region focus by clicking
mousetrack on
# default windows
screen -t Shell1 1 bash
screen -t Shell2 2 bash
screen -t Python 3 python
screen -t Media 4 bash
select 0
bind c screen 1 # window numbering starts at 1 not 0
bind 0 select 10
# get rid of silly xoff stuff
bind s split
# layouts
layout autosave on
layout new one
select 1
layout new two
select 1
split
resize -v +8
focus down
select 4
focus up
layout new three
select 1
split
resize -v +7
focus down
select 3
split -v
resize -h +10
focus right
select 4
focus up
layout attach one
layout select one
# navigating regions with Ctrl-arrows
bindkey "^[[1;5D" focus left
bindkey "^[[1;5C" focus right
bindkey "^[[1;5A" focus up
bindkey "^[[1;5B" focus down
# switch windows with F3 (prev) and F4 (next)
bindkey "^[OR" prev
bindkey "^[OS" next
# switch layouts with Ctrl+F3 (prev layout) and Ctrl+F4 (next)
bindkey "^[O1;5R" layout prev
bindkey "^[O1;5S" layout next
# F2 puts Screen into resize mode. Resize regions using hjkl keys.
bindkey "^[OQ" eval "command -c rsz" # enter resize mode
# use hjkl keys to resize regions
bind -c rsz h eval "resize -h -5" "command -c rsz"
bind -c rsz j eval "resize -v -5" "command -c rsz"
bind -c rsz k eval "resize -v +5" "command -c rsz"
bind -c rsz l eval "resize -h +5" "command -c rsz"
# quickly switch between regions using tab and arrows
bind -c rsz \t eval "focus" "command -c rsz" # Tab
bind -c rsz -k kl eval "focus left" "command -c rsz" # Left
bind -c rsz -k kr eval "focus right" "command -c rsz" # Right
bind -c rsz -k ku eval "focus up" "command -c rsz" # Up
bind -c rsz -k kd eval "focus down" "command -c rsz" # Down
# 0 is too far from ` ;)
set -g base-index 1
# Automatically set window title
set-window-option -g automatic-rename on
set-option -g set-titles on
set-option -g mouse on
#set -g default-terminal screen-256color
set -g status-keys vi
set -g history-limit 10000
setw -g mode-keys vi
setw -g monitor-activity on
bind-key v split-window -h
bind-key s split-window -v
bind-key J resize-pane -D 5
bind-key K resize-pane -U 5
bind-key H resize-pane -L 5
bind-key L resize-pane -R 5
bind-key M-j resize-pane -D
bind-key M-k resize-pane -U
bind-key M-h resize-pane -L
bind-key M-l resize-pane -R
# Vim style pane selection
bind h select-pane -L
bind j select-pane -D
bind k select-pane -U
bind l select-pane -R
# Use Alt-vim keys without prefix key to switch panes
bind -n M-h select-pane -L
bind -n M-j select-pane -D
bind -n M-k select-pane -U
bind -n M-l select-pane -R
# Use Alt-arrow keys without prefix key to switch panes
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D
# Shift arrow to switch windows
bind -n S-Left previous-window
bind -n S-Right next-window
# No delay for escape key press
set -sg escape-time 0
# Reload tmux config
bind r source-file ~/.tmux.conf
# THEME
set -g status-bg black
set -g status-fg white
set -g window-status-current-bg white
set -g window-status-current-fg black
set -g window-status-current-attr bold
set -g status-interval 60
set -g status-left-length 30
set -g status-left '#[fg=green](#S) #(whoami)'
set -g status-right '#[fg=yellow]#(cut -d " " -f 1-3 /proc/loadavg)#[default] #[fg=white]%H:%M#[default]'
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
MAINTAINER Tao Lin <itamtao@gmail.com>
# install some necessary tools.
RUN echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
pkg-config \
software-properties-common
RUN apt-get install -y \
inkscape \
jed \
libsm6 \
libxext-dev \
libxrender1 \
lmodern \
libcurl3-dev \
libfreetype6-dev \
libzmq3-dev \
libcupti-dev \
pkg-config \
libav-tools \
libjpeg-dev \
libpng-dev \
zlib1g-dev \
locales
RUN apt-get install -y \
sudo \
rsync \
cmake \
g++ \
swig \
vim \
git \
curl \
wget \
unzip \
zsh \
git \
screen \
tmux \
openssh-server
RUN apt-get update && \
apt-get install -y pciutils net-tools iputils-ping && \
apt-get install -y htop
RUN add-apt-repository ppa:openjdk-r/ppa \
&& apt-get update \
&& apt-get install -y \
openjdk-7-jdk \
openjdk-7-jre-headless
# install good vim.
RUN curl http://j.mp/spf13-vim3 -L -o - | sh
# configure environments.
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && locale-gen
# configure user.
ENV SHELL=/bin/bash \
NB_USER=user \
NB_UID=1000 \
NB_GROUP=1000 \
NB_GID=1000
ENV HOME=/home/$NB_USER
ADD base/fix-permissions /usr/local/bin/fix-permissions
RUN chmod +x /usr/local/bin/fix-permissions
ADD base/entrypoint.sh /usr/local/bin/entrypoint.sh
RUN chmod +x /usr/local/bin/entrypoint.sh
RUN groupadd $NB_GROUP -g $NB_GID
RUN useradd -m -s /bin/bash -N -u $NB_UID -g $NB_GID $NB_USER && \
echo "${NB_USER}:${NB_USER}" | chpasswd && \
usermod -aG sudo,adm,root ${NB_USER} && \
fix-permissions $HOME
RUN echo 'user ALL=(ALL) NOPASSWD: ALL' | sudo EDITOR='tee -a' visudo
# Default ssh config file that skips (yes/no) question when first login to the host
RUN mkdir /var/run/sshd
RUN sed -i "s/#PasswordAuthentication.*/PasswordAuthentication no/g" /etc/ssh/sshd_config \
&& sed -i "s/#PermitRootLogin.*/PermitRootLogin yes/g" /etc/ssh/sshd_config \
&& sed -ri 's/UsePAM yes/#UsePAM yes/g' /etc/ssh/sshd_config \
&& sed -i "s/#AuthorizedKeysFile/AuthorizedKeysFile/g" /etc/ssh/sshd_config
RUN /usr/bin/ssh-keygen -A
ENV SSHDIR $HOME/.ssh
RUN mkdir -p $SSHDIR \
&& chmod go-w $HOME/ \
&& chmod 700 $SSHDIR \
&& touch $SSHDIR/authorized_keys \
&& chmod 600 $SSHDIR/authorized_keys \
&& chown -R ${NB_USER}:${NB_GROUP} ${SSHDIR} \
&& chown -R ${NB_USER}:${NB_GROUP} /etc/ssh/*
###### switch to user and compile test example.
USER ${NB_USER}
RUN ssh-keygen -b 2048 -t rsa -f $SSHDIR/id_rsa -q -N ""
RUN cat ${SSHDIR}/*.pub >> ${SSHDIR}/authorized_keys
RUN echo "StrictHostKeyChecking no" > ${SSHDIR}/config
# configure screen and tmux
ADD base/.tmux.conf $HOME/
ADD base/.screenrc $HOME/
# expose port for ssh and start ssh service.
EXPOSE 22
# expose port for notebook.
EXPOSE 8888
# expose port for tensorboard.
EXPOSE 6666
#!/bin/bash
sudo service ssh start
exec "$@"
#!/bin/bash
# set permissions on a directory
# after any installation, if a directory needs to be (human) user-writable,
# run this script on it.
# It will make everything in the directory owned by the group $NB_GID
# and writable by that group.
# Deployments that want to set a specific user id can preserve permissions
# by adding the `--group-add users` line to `docker run`.
# uses find to avoid touching files that already have the right permissions,
# which would cause massive image explosion
# right permissions are:
# group=$NB_GID
# AND permissions include group rwX (directory-execute)
# AND directories have setuid,setgid bits set
set -e
for d in "$@"; do
find "$d" \
! \( \
-group $NB_GID \
-a -perm -g+rwX \
\) \
-exec chgrp $NB_GID {} \; \
-exec chmod g+rwX {} \;
# setuid,setgid *on directories only*
find "$d" \
\( \
-type d \
-a ! -perm -6000 \
\) \
-exec chmod +6000 {} \;
done
version: '2'
services:
base:
build:
context: .
dockerfile: base/Dockerfile
image: user/base
pytorch-mpi:
build:
context: .
dockerfile: pytorch-mpi/Dockerfile
image: user/pytorch-mpi
depends_on:
- base
FROM user/base
USER $NB_USER
WORKDIR $HOME
# install openMPI
RUN mkdir $HOME/.openmpi/
RUN wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz
RUN gunzip -c openmpi-3.0.0.tar.gz | tar xf - \
&& cd openmpi-3.0.0 \
&& ./configure --prefix=$HOME/.openmpi/ --with-cuda \
&& make all install
ENV PATH $HOME/.openmpi/bin:$PATH
ENV LD_LIBRARY_PATH $HOME/.openmpi/lib:$LD_LIBRARY_PATH
# install conda
ENV PYTHON_VERSION=3.6
RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
sh miniconda.sh -b -p $HOME/conda && \
rm ~/miniconda.sh
RUN $HOME/conda/bin/conda update -n base conda
RUN $HOME/conda/bin/conda create -y --name pytorch-py$PYTHON_VERSION python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include
RUN $HOME/conda/bin/conda install --name pytorch-py$PYTHON_VERSION -c soumith magma-cuda100
RUN $HOME/conda/bin/conda install --name pytorch-py$PYTHON_VERSION scikit-learn
RUN $HOME/conda/envs/pytorch-py3.6/bin/pip install pytelegraf pymongo influxdb kubernetes jinja2
ENV PATH $HOME/conda/envs/pytorch-py$PYTHON_VERSION/bin:$PATH
# install pytorch, torchvision, torchtext.
RUN git clone --recursive https://github.com/pytorch/pytorch
RUN cd pytorch && \
git submodule update --init && \
TORCH_CUDA_ARCH_LIST="3.5 3.7 5.2 6.0 6.1 7.0+PTX" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which $HOME/conda/bin/conda))/../" \
pip install -v .
RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v .
RUN $HOME/conda/envs/pytorch-py3.6/bin/pip install --upgrade git+https://github.com/pytorch/text
RUN $HOME/conda/envs/pytorch-py3.6/bin/pip install spacy
RUN $HOME/conda/envs/pytorch-py3.6/bin/python -m spacy download en
# install bit2byte.
RUN git clone https://github.com/tvogels/signSGD-with-Majority-Vote.git && \
cd signSGD-with-Majority-Vote/main/bit2byte-extension/ && \
$HOME/conda/envs/pytorch-py3.6/bin/python setup.py develop --user
# install other python related softwares.
RUN $HOME/conda/bin/conda install --name pytorch-py$PYTHON_VERSION -y opencv protobuf
RUN $HOME/conda/bin/conda install --name pytorch-py$PYTHON_VERSION -y networkx
RUN $HOME/conda/envs/pytorch-py3.6/bin/pip install lmdb tensorboard_logger pyarrow msgpack msgpack_numpy mpi4py
RUN $HOME/conda/bin/conda install --name pytorch-py$PYTHON_VERSION -c conda-forge python-blosc
$HOME/conda/envs/pytorch-py3.6/bin/python run.py \