Commit 1c3a1e6a authored by Jared Casper

Merge branch 'main' into t5_pipeline_parallelism

Fixed conflicts with memory optimizations in megatron/schedules.py
parents da1c96e9 8e8d7603
+1 −1
@@ -127,7 +127,7 @@ Further command line arguments are described in the source file [`preprocess_dat
## BERT Pretraining


- The `examples/pretrain_bert.sh` script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations starting at `--lr` to a minimum set by `--min-lr` over `--lr-decay-iters` iterations. The fraction of training iterations used for warmup is set by `--lr-warmup-fraction`. While this is single GPU training, the batch size specified by `--micro-batch-size` is a single forward-backward path batch-size and the code will perform gradient accumulation steps until it reaches `global-batch-size` whcih is the batch size per iteration. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with `--seed`). We use `train-iters` as the training iterations requested. Alternatively, one can provide `--train-samples` which is total number of samples to train on. If this option is present, then instead of providing `--lr-decay-iters`, one will need to provide `--lr-decay-samples`.
+ The `examples/pretrain_bert.sh` script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations, starting at `--lr` and falling to the minimum set by `--min-lr` over `--lr-decay-iters` iterations. The fraction of training iterations used for warmup is set by `--lr-warmup-fraction`. While this is single GPU training, the batch size specified by `--micro-batch-size` is the batch size of a single forward-backward pass, and the code performs gradient accumulation steps until it reaches `global-batch-size`, which is the batch size per iteration. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (the default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with `--seed`). `--train-iters` sets the number of training iterations requested. Alternatively, one can provide `--train-samples`, which is the total number of samples to train on. If this option is present, then instead of `--lr-decay-iters`, one will need to provide `--lr-decay-samples`.

The logging, checkpoint-saving, and evaluation intervals are specified. Checkpointing the activations facilitates the training of larger models and/or batches. Note that the `--data-path` now includes the additional `_text_sentence` suffix added in preprocessing, but does not include the file extensions.
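
As a quick illustration of the gradient-accumulation behavior described above (a minimal sketch with assumed example values, not part of this diff), the number of accumulation steps per iteration is the global batch size divided by the product of the micro batch size and the data-parallel size:

```bash
# Assumed values for illustration: on a single GPU (data-parallel size 1),
# --micro-batch-size 4 and --global-batch-size 16 give 16 / (4 * 1) = 4
# gradient accumulation steps per training iteration.
MICRO_BATCH_SIZE=4
GLOBAL_BATCH_SIZE=16
DATA_PARALLEL_SIZE=1
echo $(( GLOBAL_BATCH_SIZE / (MICRO_BATCH_SIZE * DATA_PARALLEL_SIZE) ))   # prints 4
```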

+1 −0
@@ -23,6 +23,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       --num-attention-heads 16 \
       --micro-batch-size 2 \
+      --global-batch-size 16 \
       --seq-length 512 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --save $CHECKPOINT_PATH \
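
For context (a sketch with assumed single-node values, not part of the diff), `DISTRIBUTED_ARGS` in the example scripts is typically a set of `torch.distributed.launch` flags along the following lines:

```bash
# Assumed single-node values for illustration; adjust the GPU count, master
# address, and port for the actual cluster.
GPUS_PER_NODE=8
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes 1 --node_rank 0 \
                  --master_addr localhost --master_port 6000"
```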
+57 −0
#!/bin/bash


# SLURM options.
export SLURM_PARTITION=<slurm partition, used to feed -p option in slurm>
export SLURM_ACCOUNT=<slurm account, used to feed -A option in slurm>


# Source code.
export MEGATRON_CODE_DIR=<megatron source code directory>


# This variable is used to mount the relevant part of the filesystem
# inside the docker container. Note that the `MEGATRON_CODE_DIR` and the
# launch directory already get mounted; this variable should be used to
# mount the directories that contain the data and tokenizer files.
export DOCKER_MOUNT_DIR=<megatron dataset and bpe tokenizer vocab path>


# Data and tokenizer files.
MEGATRON_DATA=<path to megatron processed data>
BPE_VOCAB_FILE=<path to bpe vocab file>
BPE_MERGE_FILE=<path to bpe merges file>


# Megatron input parameters.
# `MEGATRON_EXTRA_PARAMS` can be used to provide any extra parameters
# that are not listed here. 
export MEGATRON_PARAMS=" ${MEGATRON_EXTRA_PARAMS} \
        --tensor-model-parallel-size ${TP} \
        --pipeline-model-parallel-size ${PP} \
        --micro-batch-size ${MBS} \
        --global-batch-size ${GBS} \
        --num-layers ${NLS} \
        --hidden-size ${HS} \
        --num-attention-heads ${NAH} \
        --DDP-impl ${DDP} \
        --data-path ${MEGATRON_DATA} \
        --vocab-file ${BPE_VOCAB_FILE} \
        --merge-file ${BPE_MERGE_FILE} \
        --log-interval 5 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
        --train-iters 500 \
        --lr-decay-iters 320 \
        --lr 0.0001 \
        --min-lr 0.00001 \
        --lr-decay-style cosine \
        --lr-warmup-fraction 0.01 \
        --split 969,30,1 \
        --eval-iters 100 \
        --eval-interval 1000 \
        --clip-grad 1.0 \
        --fp16 \
        --loss-scale 8192 "

+45 −0
# Reproducing Figures in SC21 Paper


This directory contains some of the scripts that were used to produce the
results in the [Megatron paper](https://arxiv.org/pdf/2104.04473.pdf) that is
to appear at [SuperComputing 2021](https://sc21.supercomputing.org/). These
scripts use [Slurm](https://slurm.schedmd.com/documentation.html) with the
[pyxis plugin](https://github.com/NVIDIA/pyxis), but can be modified for other
schedulers as well.


## Setup

All the cluster-dependent variables are in [`CONFIG.sh`](./CONFIG.sh). Please
update the unspecified values (in angle brackets `<...>`) before launching any
scripts.
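
For example (illustrative values only, not a recommended configuration), the placeholders might be filled in as:

```bash
# Illustrative placeholder values; substitute the partition, account, and
# paths for the actual cluster.
export SLURM_PARTITION=batch
export SLURM_ACCOUNT=my_account
export MEGATRON_CODE_DIR=/path/to/Megatron-LM
export DOCKER_MOUNT_DIR=/path/to/data_and_vocab
```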



## Scripts

Below is a list of scripts that can be used to reproduce various figures in our
[paper](https://arxiv.org/pdf/2104.04473.pdf):

* [run_table_1.sh](./run_table_1.sh): Table 1 showing weak-scaling throughput
for GPT models ranging from 1 billion to 1 trillion parameters.
* [run_figure_11.sh](./run_figure_11.sh): Figure 11 showing the weak-scaling
performance of pipeline parallelism.
* [run_figure_12.sh](./run_figure_12.sh): Figure 12 showing the effect of
the interleaved schedule on a 175B GPT model.
* [run_figure_13.sh](./run_figure_13.sh): Figure 13 showing the effect of
different degrees of pipeline and tensor model parallelism on a model with
162.2 billion parameters.
* [run_figure_14.sh](./run_figure_14.sh): Figure 14 showing the effect of
different degrees of data and pipeline model parallelism on a model with
5.9 billion parameters.
* [run_figure_15.sh](./run_figure_15.sh): Figure 15 showing the effect of
different degrees of data and tensor model parallelism on a model with
5.9 billion parameters.
* [run_figure_16.sh](./run_figure_16.sh): Figure 16 showing the effect of
microbatch size.
* [run_figure_17.sh](./run_figure_17.sh): Figure 17 showing the effect of
activation recomputation.
* [run_figure_18.sh](./run_figure_18.sh): Figure 18 showing the effect of
the scatter-gather communication optimization.
+13 −0
#!/bin/bash


sbatch -p ${SLURM_PARTITION} \
       -A ${SLURM_ACCOUNT} \
       --job-name=${JOB_NAME} \
       --nodes=${NNODES} \
       --export=MEGATRON_CODE_DIR,MEGATRON_PARAMS,DOCKER_MOUNT_DIR SRUN.sh

exit 0
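
A hypothetical invocation sketch (values assumed, not from the repository): this script reads `JOB_NAME` and `NNODES` from the environment and expects `SLURM_PARTITION`, `SLURM_ACCOUNT`, `MEGATRON_CODE_DIR`, `MEGATRON_PARAMS`, and `DOCKER_MOUNT_DIR` to already be exported, e.g. by sourcing `CONFIG.sh` first.

```bash
# Illustrative values only; the per-figure run scripts presumably set these.
source ./CONFIG.sh
export JOB_NAME=sc21_figure_11
export NNODES=8
bash ./SBATCH.sh
```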

