Merge branch 'NVIDIA:main' into main (1064dc48) · Commits · candle / Megatron-LM

README.md

+8 −7

		@@ -13,7 +13,9 @@ Below are some of the projects where we have directly used Megatron:

		Megatron is also used in [NeMo Megatron](https://developer.nvidia.com/nvidia-nemo#nemo-megatron), a framework to help enterprises overcome the challenges of building and training sophisticated natural language processing models with billions and trillions of parameters.

		Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specifc model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that these results are from benchmark runs and these models were not trained to convergence; however, the FLOPs are measured for end-to-end training, i.e., includes all operations including data loading, optimization, and even logging.
		Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. Each cluster node has 8 NVIDIA 80GB A100 GPUs. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that these results are from benchmark runs and these models were not trained to convergence; however, the FLOPs are measured for end-to-end training, i.e., includes all operations including data loading, optimization, and even logging.

		Additionally, the model parallel size column reports the combined tensor and pipeline parallelism degree. For numbers larger than 8, a tensor parallel size of 8 was typically used. So, for example, the 145B model reports a total model parallel size of 64, which means that this setup used TP=8 and PP=8.

		![Cases](images/cases_april2021.png)
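
As a rough illustration of how the model parallel size column decomposes, the sketch below splits a combined size into tensor and pipeline degrees. The helper name `split_model_parallel` is hypothetical, and treating sizes of 8 or less as pure tensor parallelism is an assumption; the text above only states that sizes larger than 8 typically used TP=8.

```python
def split_model_parallel(model_parallel_size, max_tensor_parallel=8):
    """Split a combined model parallel size into (TP, PP).

    Assumption: tensor parallelism is filled first, up to
    max_tensor_parallel, and the remainder is pipeline parallelism.
    """
    tp = min(model_parallel_size, max_tensor_parallel)
    if model_parallel_size % tp != 0:
        raise ValueError(
            f"{model_parallel_size} is not divisible by tensor parallel size {tp}")
    return tp, model_parallel_size // tp


# The 145B example from the text: model parallel size 64 -> TP=8, PP=8.
tp, pp = split_model_parallel(64)
```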

		@@ -29,7 +31,6 @@ All the cases from 1 billion to 1 trillion parameters achieve more than 43% half
		* [Data Preprocessing](#data-preprocessing)
		* [BERT Pretraining](#bert-pretraining)
		* [GPT Pretraining](#gpt-pretraining)
		* [GPT Pretraining](#gpt-pretraining)
		* [T5 Pretraining](#t5-pretraining)
		* [Distributed Pretraining](#distributed-pretraining)
		* [GPT-3 Example](#gpt-3-example)
		@@ -206,7 +207,7 @@ Further command line arguments are described in the source file [`arguments.py`]

		## T5 Pretraining

		Very similar to BERT and GPT, the `examples/pretrain_t5.sh` script runs single GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accomodate the T5 architecture:
		Very similar to BERT and GPT, the `examples/pretrain_t5.sh` script runs single GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accommodate the T5 architecture:

		* `--kv-channels` sets the inner dimension of the "key" and "value" matrices of all attention mechanisms in the model. For BERT and GPT this defaults to the hidden size divided by the number of attention heads, but can be configured for T5.

		@@ -262,7 +263,7 @@ Second, we developed a simple and efficient two-dimensional model-parallel appro

		<!-- The number of microbatches in a per-pipeline minibatch is controlled by the `--num-microbatches-in-minibatch` argument. With `WORLD_SIZE` GPUs, `TENSOR_MP_SIZE` tensor-model-parallel size, `PIPELINE_MP_SIZE` pipeline-model-parallel-size, `WORLD_SIZE`/(`TENSOR_MP_SIZE` * `PIPELINE_MP_SIZE`) GPUs will be used for data parallelism. The default values for `--tensor-model-parallel-size` and `--pipeline-model-parallel-size` is 1, which will not implement either form of model parallelism. -->

		We have examples of how to use these two different forms of model parallelism the example scripts ending in `distributed_with_mp.sh`, note that pipeline parallelism is not currently supported in the T5 model:
		We have examples of how to use these two different forms of model parallelism in the example scripts ending in `distributed_with_mp.sh`:

		Other than these minor changes, the distributed training is identical to the training on a single GPU.

		@@ -399,7 +400,7 @@ python tools/create_doc_index.py \

		We provide several command line arguments, detailed in the scripts listed below, to handle various zero-shot and fine-tuned downstream tasks. However, you can also finetune your model from a pretrained checkpoint on other corpora as desired. To do so, simply add the `--finetune` flag and adjust the input files and training parameters within the original training script. The iteration count will be reset to zero, and the optimizer and internal state will be reinitialized. If the fine-tuning is interrupted for any reason, be sure to remove the `--finetune` flag before continuing, otherwise the training will start again from the beginning.

		Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on a single GPU in downstream tasks. The following script accomplishes this. Currently only tensor model parallelism is supported on input and pipeline model parallelsim on the output. This example reads in a model with 2-way tensor model parallelism and writes out a model with 2-way pipeline model parallelism.
		Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on a single GPU in downstream tasks. The following script accomplishes this. Currently only tensor model parallelism is supported on input and pipeline model parallelism on the output. This example reads in a model with 2-way tensor model parallelism and writes out a model with 2-way pipeline model parallelism.

		<pre>
		TENSOR_MODEL_PARALLEL_SIZE=2
		@@ -484,7 +485,7 @@ python tasks/main.py \


		### LAMBADA Cloze Accuracy
		To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceeding tokens) we utilize a detokenized, processed version of the [LAMBADA dataset](https://github.com/cybertronai/bflm/blob/master/lambada_test.jsonl).
		To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the [LAMBADA dataset](https://github.com/cybertronai/bflm/blob/master/lambada_test.jsonl).

		We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the `--strict-lambada` flag should be used to require whole word matching. Make sure that `lambada` is part of the file path.

examples/msdp/prep_resp_gen.sh

+1 −1

		@@ -14,5 +14,5 @@ PROCESSED_FILE=<PATH_OF_INPUT_FILE_FOR_RESPONSE_GENERATION> \
		python ${DIR}/tasks/msdp/preprocessing.py \
		--func prepare_input \
		--test_file ${TEST_FILE} \
		--knowledge_gen_file ${KNOWLEDGE_FILE} \
		--knwl_gen_file ${KNOWLEDGE_FILE} \
		--processed_file ${PROCESSED_FILE}

megatron/arguments.py

+14 −2

		@@ -526,6 +526,9 @@ def _add_initialization_args(parser):
		group.add_argument('--seed', type=int, default=1234,
		help='Random seed used for python, numpy, '
		'pytorch, and cuda.')
		group.add_argument('--data-parallel-random-init', action='store_true',
		help='Enable random initialization of params '
		'across data parallel ranks')
		group.add_argument('--init-method-std', type=float, default=0.02,
		help='Standard deviation of the zero mean normal '
		'distribution used for weight initialization.')
		@@ -840,11 +843,20 @@ def _add_vit_args(parser):

		group.add_argument('--num-classes', type=int, default=1000,
		help='num of classes in vision classification task')
		group.add_argument('--img-dim', type=int, default=224,
		help='Image size for vision classification task')
		group.add_argument('--img-h', type=int, default=224,
		help='Image height for vision classification task')
		group.add_argument('--img-w', type=int, default=224,
		help='Image width for vision classification task')
		group.add_argument('--num-channels', type=int, default=3,
		help='Number of channels in input image data')
		group.add_argument('--patch-dim', type=int, default=16,
		help='patch dimension used in vit')
		group.add_argument('--classes-fraction', type=float, default=1.0,
		help='training with fraction of classes.')
		group.add_argument('--data-per-class-fraction', type=float, default=1.0,
		help='training with fraction of data per class.')
		group.add_argument('--no-data-sharding', action='store_false',
		help='Disable data sharding.',
		dest='data_sharding')

		return parser
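
The two new boolean flags use opposite `argparse` actions, which is easy to misread in a diff. The following is a minimal standalone sketch (the `build_parser` helper is hypothetical and not part of `megatron/arguments.py`) showing the defaults each one produces: `--no-data-sharding` uses `store_false` with `dest='data_sharding'`, so sharding is on unless the flag is passed.

```python
import argparse


def build_parser():
    """Standalone parser holding only the two flags discussed here."""
    parser = argparse.ArgumentParser()
    # store_true: defaults to False, becomes True when the flag is given.
    parser.add_argument('--data-parallel-random-init', action='store_true',
                        help='Enable random initialization of params '
                             'across data parallel ranks')
    # store_false with an explicit dest: args.data_sharding defaults to True
    # and becomes False when --no-data-sharding is given.
    parser.add_argument('--no-data-sharding', action='store_false',
                        help='Disable data sharding.',
                        dest='data_sharding')
    return parser


defaults = build_parser().parse_args([])
disabled = build_parser().parse_args(['--no-data-sharding'])
```

Here `defaults.data_sharding` is `True` and `disabled.data_sharding` is `False`, while `data_parallel_random_init` stays `False` in both cases.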

megatron/checkpointing.py

+58 −16

		@@ -65,6 +65,8 @@ def check_checkpoint_args(checkpoint_args):
		_compare('make_vocab_size_divisible_by')
		_compare('padded_vocab_size')
		_compare('tokenizer_type')
		if args.data_parallel_random_init:
		_compare('data_parallel_random_init')
		if get_checkpoint_version() < 3.0:
		_compare('tensor_model_parallel_size',
		old_arg_name='model_parallel_size')
		@@ -72,7 +74,6 @@ def check_checkpoint_args(checkpoint_args):
		_compare('tensor_model_parallel_size')
		_compare('pipeline_model_parallel_size')


		def ensure_directory_exists(filename):
		"""Build filename's path if it does not already exist."""
		dirname = os.path.dirname(filename)
		@@ -140,6 +141,32 @@ def read_metadata(tracker_filename):
		return max_iter, release


		def get_rng_state():
		    """ collect rng state across data parallel ranks """
		    args = get_args()
		    rng_state = {
		        'random_rng_state': random.getstate(),
		        'np_rng_state': np.random.get_state(),
		        'torch_rng_state': torch.get_rng_state(),
		        'cuda_rng_state': torch.cuda.get_rng_state(),
		        'rng_tracker_states': mpu.get_cuda_rng_tracker().get_states()}

		    rng_state_list = None
		    if torch.distributed.is_initialized() and \
		            mpu.get_data_parallel_world_size() > 1 and \
		            args.data_parallel_random_init:
		        rng_state_list = \
		            [None for i in range(mpu.get_data_parallel_world_size())]
		        torch.distributed.all_gather_object(
		            rng_state_list,
		            rng_state,
		            group=mpu.get_data_parallel_group())
		    else:
		        rng_state_list = [rng_state]

		    return rng_state_list


		def save_checkpoint(iteration, model, optimizer, lr_scheduler):
		"""Save a model checkpoint."""
		args = get_args()
		@@ -150,6 +177,9 @@ def save_checkpoint(iteration, model, optimizer, lr_scheduler):
		print_rank_0('saving checkpoint at iteration {:7d} to {}'.format(
		iteration, args.save))

		# collect rng state across data parallel ranks
		rng_state = get_rng_state()

		if not torch.distributed.is_initialized() or mpu.get_data_parallel_rank() == 0:

		# Arguments, iteration, and model.
		@@ -173,12 +203,7 @@ def save_checkpoint(iteration, model, optimizer, lr_scheduler):

		# RNG states.
		if not args.no_save_rng:
		state_dict['random_rng_state'] = random.getstate()
		state_dict['np_rng_state'] = np.random.get_state()
		state_dict['torch_rng_state'] = torch.get_rng_state()
		state_dict['cuda_rng_state'] = torch.cuda.get_rng_state()
		state_dict['rng_tracker_states'] \
		= mpu.get_cuda_rng_tracker().get_states()
		state_dict["rng_state"] = rng_state

		# Save.
		checkpoint_name = get_checkpoint_name(args.save, iteration)
		@@ -381,6 +406,23 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True
		# rng states.
		if not release and not args.finetune and not args.no_load_rng:
		try:
		if 'rng_state' in state_dict:
		# access rng_state for data parallel rank
		if args.data_parallel_random_init:

		rng_state = state_dict['rng_state'][mpu.get_data_parallel_rank()]
		else:
		rng_state = state_dict['rng_state'][0]
		random.setstate(rng_state['random_rng_state'])
		np.random.set_state(rng_state['np_rng_state'])
		torch.set_rng_state(rng_state['torch_rng_state'])
		torch.cuda.set_rng_state(rng_state['cuda_rng_state'])
		# Check for empty states array
		if not rng_state['rng_tracker_states']:
		raise KeyError
		mpu.get_cuda_rng_tracker().set_states(
		rng_state['rng_tracker_states'])
		else: # backward compatibility
		random.setstate(state_dict['random_rng_state'])
		np.random.set_state(state_dict['np_rng_state'])
		torch.set_rng_state(state_dict['torch_rng_state'])
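
The load path above has to handle two checkpoint layouts: the new one, where `state_dict['rng_state']` is a list with one entry per data parallel rank (or a single entry when `--data-parallel-random-init` is off), and the older flat one. The sketch below isolates that selection logic using only Python's `random` module; `select_rng_state` and `LEGACY_KEYS` are hypothetical names introduced for illustration, not functions in `megatron/checkpointing.py`.

```python
import random

# Keys of the older flat layout, as they appear in the removed lines of the diff.
LEGACY_KEYS = ('random_rng_state', 'np_rng_state', 'torch_rng_state',
               'cuda_rng_state', 'rng_tracker_states')


def select_rng_state(state_dict, data_parallel_rank, data_parallel_random_init):
    """Return the RNG state dict this rank should restore."""
    if 'rng_state' in state_dict:
        states = state_dict['rng_state']
        # Per-rank entry only when random init across ranks is enabled;
        # otherwise every rank restores the single shared entry.
        index = data_parallel_rank if data_parallel_random_init else 0
        return states[index]
    # Older checkpoints store the states as flat top-level keys.
    return {key: state_dict[key] for key in LEGACY_KEYS if key in state_dict}


# Round trip with Python's RNG only: save, draw, restore, draw again.
saved = {'rng_state': [{'random_rng_state': random.getstate()}]}
expected = random.random()
restored = select_rng_state(saved, data_parallel_rank=3,
                            data_parallel_random_init=False)
random.setstate(restored['random_rng_state'])
replayed = random.random()  # same value as `expected`
```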

megatron/data/data_samplers.py

+55 −13

		@@ -16,8 +16,10 @@
		"""Dataloaders."""


		import torch
		import random
		import torch
		import numpy as np
		from torch.utils.data import Dataset
		from megatron import get_args
		from megatron import mpu

		@@ -39,11 +41,13 @@ def build_pretraining_data_loader(dataset, consumed_samples):
		data_parallel_size=mpu.get_data_parallel_world_size())
		elif args.dataloader_type == 'cyclic':
		batch_sampler = MegatronPretrainingRandomSampler(
		dataset,
		total_samples=len(dataset),
		consumed_samples=consumed_samples,
		micro_batch_size=args.micro_batch_size,
		data_parallel_rank=mpu.get_data_parallel_rank(),
		data_parallel_size=mpu.get_data_parallel_world_size())
		data_parallel_size=mpu.get_data_parallel_world_size(),
		data_sharding=args.data_sharding)
		else:
		raise Exception('{} dataloader type is not supported.'.format(
		args.dataloader_type))
		@@ -103,16 +107,40 @@ class MegatronPretrainingSampler:
		yield batch[start_idx:end_idx]


		class RandomSeedDataset(Dataset):

		    def __init__(self, dataset):
		        args = get_args()
		        self.base_seed = args.seed
		        self.curr_seed = args.seed
		        self.dataset = dataset

		    def __len__(self):
		        return len(self.dataset)

		    def set_epoch(self, epoch):
		        self.curr_seed = self.base_seed + epoch

		    def __getitem__(self, idx):
		        seed = idx + self.curr_seed
		        torch.manual_seed(seed)
		        random.seed(seed)
		        np.random.seed(seed)
		        return self.dataset[idx]
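
To see what the per-index seeding in `RandomSeedDataset` buys, here is a minimal pure-Python sketch of the same idea. `SeededView` and `NoisyDataset` are hypothetical stand-ins, only Python's `random` is seeded (the real class also seeds `torch` and `numpy`), and the base seed is passed in directly instead of being read from `get_args()`.

```python
import random


class NoisyDataset:
    """Toy dataset whose items depend on the global RNG, like a random augmentation."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return idx, random.random()


class SeededView:
    """Reseed the RNG from (index, epoch) before every item lookup."""

    def __init__(self, dataset, base_seed):
        self.dataset = dataset
        self.base_seed = base_seed
        self.curr_seed = base_seed

    def __len__(self):
        return len(self.dataset)

    def set_epoch(self, epoch):
        self.curr_seed = self.base_seed + epoch

    def __getitem__(self, idx):
        random.seed(idx + self.curr_seed)
        return self.dataset[idx]


view = SeededView(NoisyDataset(), base_seed=1234)
first = view[2]
second = view[2]      # identical to `first`: same index, same epoch
view.set_epoch(1)
next_epoch = view[2]  # same index, different epoch, so a different seed
```

Because the seed is a plain sum, the pair (index 2, epoch 0) and the pair (index 1, epoch 1) end up with the same seed in this sketch. Whether that overlap matters depends on the augmentation being applied.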


		class MegatronPretrainingRandomSampler:

		def __init__(self, total_samples, consumed_samples, micro_batch_size,
		data_parallel_rank, data_parallel_size):
		def __init__(self, dataset, total_samples, consumed_samples, micro_batch_size,
		data_parallel_rank, data_parallel_size, data_sharding):
		# Keep a copy of input params for later use.
		self.dataset = dataset
		self.total_samples = total_samples
		self.consumed_samples = consumed_samples
		self.micro_batch_size = micro_batch_size
		self.data_parallel_rank = data_parallel_rank
		self.data_parallel_size = data_parallel_size
		self.data_sharding = data_sharding
		self.micro_batch_times_data_parallel_size = \
		self.micro_batch_size * data_parallel_size
		self.last_batch_size = \
		@@ -136,7 +164,11 @@ class MegatronPretrainingRandomSampler:
		current_epoch_samples = self.consumed_samples % active_total_samples
		assert current_epoch_samples % self.micro_batch_times_data_parallel_size == 0

		if isinstance(self.dataset, RandomSeedDataset):
		self.dataset.set_epoch(self.epoch)

		# data sharding and random sampling
		if self.data_sharding:
		bucket_size = (self.total_samples // self.micro_batch_times_data_parallel_size) \
		* self.micro_batch_size
		bucket_offset = current_epoch_samples // self.data_parallel_size
		@@ -146,6 +178,16 @@ class MegatronPretrainingRandomSampler:
		g.manual_seed(self.epoch)
		random_idx = torch.randperm(bucket_size, generator=g).tolist()
		idx_range = [start_idx + x for x in random_idx[bucket_offset:]]
		else:
		full_bucket_size = (self.total_samples // self.micro_batch_size) \
		* self.micro_batch_size
		full_bucket_offset = current_epoch_samples
		g = torch.Generator()
		g.manual_seed(self.epoch)
		idx_range_total = \
		torch.randperm(full_bucket_size, generator=g).tolist()
		idx_range_active = idx_range_total[full_bucket_offset:]
		idx_range = idx_range_active[self.data_parallel_rank::self.data_parallel_size]

		batch = []
		# Last batch if not complete will be dropped.
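
The two branches above pick indices quite differently: with sharding, each data parallel rank shuffles within its own contiguous bucket; without it, all ranks shuffle the full range with the same seed and then take a strided slice. The sketch below reproduces that arithmetic in plain Python. `epoch_indices` is a hypothetical helper, `random.Random(epoch)` stands in for the seeded `torch.Generator` and `torch.randperm`, and `start_idx = rank * bucket_size` is an assumption because that line falls outside the hunks shown.

```python
import random


def epoch_indices(total_samples, consumed_in_epoch, micro_batch_size,
                  dp_rank, dp_size, epoch, data_sharding):
    """Return the sample indices one data parallel rank visits in an epoch."""
    rng = random.Random(epoch)  # every rank uses the same epoch-based seed
    if data_sharding:
        bucket_size = (total_samples // (micro_batch_size * dp_size)) \
            * micro_batch_size
        bucket_offset = consumed_in_epoch // dp_size
        start_idx = dp_rank * bucket_size  # assumption: not shown in the diff
        perm = list(range(bucket_size))
        rng.shuffle(perm)
        return [start_idx + x for x in perm[bucket_offset:]]
    full_bucket_size = (total_samples // micro_batch_size) * micro_batch_size
    perm = list(range(full_bucket_size))
    rng.shuffle(perm)
    active = perm[consumed_in_epoch:]
    # Strided slice: rank r takes elements r, r + dp_size, r + 2 * dp_size, ...
    return active[dp_rank::dp_size]


sharded = [epoch_indices(16, 0, 2, r, 2, epoch=0, data_sharding=True)
           for r in range(2)]
unsharded = [epoch_indices(16, 0, 2, r, 2, epoch=0, data_sharding=False)
             for r in range(2)]
```

In the sharded case rank 0 only ever sees indices 0 to 7 and rank 1 only 8 to 15; in the unsharded case each rank can draw from the whole range, and the two ranks still never overlap because they slice the same permutation.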