Commit 05620ee4 authored by Neel Kant

Merge branch 'ict-merge' into 'master'

ICT code

See merge request ADLR/megatron-lm!90
parents c20f4d48 5247f24c
README.md: +56 −0
@@ -16,6 +16,7 @@ For BERT training, we swapped the position of the layer normalization and the re
  - [BERT Pretraining](#bert-pretraining)
  - [GPT-2 Pretraining](#gpt-2-pretraining)
  - [Distributed BERT or GPT-2 Pretraining](#distributed-bert-or-gpt-2-pretraining)
- [REALM Pipeline](#realm)
- [Evaluation and Tasks](#evaluation-and-tasks)
  - [GPT-2 Text Generation](#gpt-2-text-generation)
  - [GPT-2 Evaluation](#gpt-2-evaluation)
@@ -263,6 +264,61 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt2.py \

</pre>

<a id="realm"></a>
## REALM Pipeline
The following sections (will) reflect the three stages of training a REALM system. For now, only the ICT code is present.
Loosely, the stages are: pretraining the retriever modules, then jointly training the language model and the retriever, and finally finetuning a question answering head on the language model with a fixed retriever.

### Inverse Cloze Task (ICT) Pretraining
1. Have a corpus in loose JSON format, with the goal of creating a collection of fixed-size blocks of text as the fundamental units of data. For a corpus like Wikipedia, this means multiple sentences per block as well as multiple blocks per document.
Run `tools/preprocess_data.py` with the `--split-sentences` argument to construct one or more indexed datasets in which sentences are the basic unit. For the original REALM system, we construct two datasets: one with the title of every document and another with the body.
Refer to the following script:
<pre>
python tools/preprocess_data.py \
    --input /path/to/corpus.json \
    --json-keys text title \
    --split-sentences \
    --tokenizer-type BertWordPieceLowerCase \
    --vocab-file /path/to/vocab.txt \
    --output-prefix corpus_indexed \
    --workers 5  # works well for 10 CPU cores. Scale up accordingly.
</pre>

2. Use a custom samples mapping function in place of `megatron/data/realm_dataset_utils.get_block_samples_mapping` if required. To do this, you will need to implement a new function in C++ inside of `megatron/data/helpers.cpp`. The samples mapping data structure is used to select, in advance of the training loop, the data that will constitute every training sample.
The samples mapping holds all of the metadata needed to construct a sample from one or more indexed datasets. In REALM, each entry contains the start and end sentence indices, the document index (to find the correct title for a body), and a unique ID for every block. An illustrative sketch of this structure appears after this list.
3. Pretrain a BERT language model using `pretrain_bert.py`, with the sequence length equal to the block size in token ids. This model should be trained on the same indexed dataset that is used to supply the blocks for the information retrieval task.
In REALM, this is an uncased BERT-Base model trained with the standard hyperparameters (an example command is sketched after this list).
4. Use `pretrain_ict.py` to train an `ICTBertModel`, which uses two BERT-based encoders to embed queries and blocks for retrieval.
The script below trains the ICT model from REALM. It references the pretrained BERT model from step 3 via the `--bert-load` argument. The batch size used in the paper is 4096, so this command would need to be run with a data-parallel world size of 32.
<pre>
python pretrain_ict.py \
    --num-layers 12 \
    --num-attention-heads 12 \
    --hidden-size 768 \
    --batch-size 128 \
    --seq-length 256 \
    --max-position-embeddings 256 \
    --ict-head-size 128 \
    --train-iters 100000 \
    --checkpoint-activations \
    --bert-load /path/to/pretrained_bert \
    --load checkpoints \
    --save checkpoints \
    --data-path /path/to/indexed_dataset \
    --titles-data-path /path/to/titles_indexed_dataset \
    --vocab-file /path/to/vocab.txt \
    --lr 0.0001 \
    --num-workers 2 \
    --lr-decay-style linear \
    --weight-decay 1e-2 \
    --clip-grad 1.0 \
    --warmup .01 \
    --save-interval 3000 \
    --query-in-block-prob 0.1 \
    --fp16
</pre>
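
For reference, here is a minimal Python sketch of what one samples-mapping entry carries and how an ICT (query, block) pair can be formed from a block of sentences. The `BlockSample` fields and the `make_ict_example` helper below are illustrative assumptions for exposition, not the actual code in `megatron/data/realm_dataset_utils.py` or `megatron/data/helpers.cpp`.
<pre>
import numpy as np
from collections import namedtuple

# Hypothetical layout of one samples-mapping entry; the real mapping is a
# flat numpy array built in C++ with one row per block.
BlockSample = namedtuple('BlockSample', [
    'start_sent_idx',  # first sentence of the block in the indexed dataset
    'end_sent_idx',    # one past the last sentence of the block
    'doc_idx',         # document index, used to look up the matching title
    'block_idx',       # unique ID for the block
])

def make_ict_example(sentences, np_rng, query_in_block_prob=0.1):
    """Split a block (a list of sentences, each a list of token ids) into a
    query sentence and the remaining block, as in the Inverse Cloze Task."""
    query_idx = np_rng.randint(0, len(sentences))
    query = sentences[query_idx]
    if np_rng.random_sample() < query_in_block_prob:
        # Occasionally keep the query inside the block so the retriever also
        # learns to match a sentence against its own surrounding context.
        block_sents = list(sentences)
    else:
        block_sents = [s for i, s in enumerate(sentences) if i != query_idx]
    block = [token for sent in block_sents for token in sent]
    return query, block

# Toy usage: three short "sentences" of token ids.
sentences = [[101, 7592, 102], [101, 2088, 102], [101, 2003, 102]]
np_rng = np.random.RandomState(seed=1234)
query, block = make_ict_example(sentences, np_rng)
</pre>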
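
For step 3, a minimal `pretrain_bert.py` command might look like the sketch below, assuming the indexed dataset from step 1 and a 256-token block size, so that `--seq-length` matches the ICT block size. The model size and optimization hyperparameters here are placeholders to adapt, not the exact settings used for REALM.
<pre>
python pretrain_bert.py \
    --num-layers 12 \
    --num-attention-heads 12 \
    --hidden-size 768 \
    --batch-size 64 \
    --seq-length 256 \
    --max-position-embeddings 256 \
    --train-iters 1000000 \
    --checkpoint-activations \
    --save /path/to/pretrained_bert \
    --load /path/to/pretrained_bert \
    --data-path /path/to/indexed_dataset \
    --vocab-file /path/to/vocab.txt \
    --lr 0.0001 \
    --lr-decay-style linear \
    --weight-decay 1e-2 \
    --clip-grad 1.0 \
    --warmup .01 \
    --save-interval 3000 \
    --fp16
</pre>
The resulting checkpoint directory is what the ICT command above consumes through `--bert-load`.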

<a id="evaluation-and-tasks"></a>
# Evaluation and Tasks

megatron/__init__.py: +2 −2
@@ -12,6 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch

from .package_info import (
    __description__,
@@ -30,7 +31,6 @@ from .global_vars import get_tensorboard_writer
from .global_vars import get_adlr_autoresume
from .global_vars import get_timers

import torch

def print_rank_0(message):
    """If distributed is initialized print only on rank 0."""
megatron/arguments.py: +30 −0
@@ -37,6 +37,7 @@ def parse_args(extra_args_provider=None, defaults={},
    parser = _add_validation_args(parser)
    parser = _add_data_args(parser)
    parser = _add_autoresume_args(parser)
    parser = _add_realm_args(parser)

    # Custom arguments.
    if extra_args_provider is not None:
@@ -390,3 +391,32 @@ def _add_autoresume_args(parser):
                       'termination signal')

    return parser


def _add_realm_args(parser):
    group = parser.add_argument_group(title='realm')

    # network size
    group.add_argument('--ict-head-size', type=int, default=None,
                       help='Size of block embeddings to be used in ICT and REALM (paper default: 128)')

    # checkpointing
    group.add_argument('--ict-load', type=str, default=None,
                       help='Directory containing an ICTBertModel checkpoint')
    group.add_argument('--bert-load', type=str, default=None,
                       help='Directory containing a BertModel checkpoint (needed to start ICT and REALM)')

    # data
    group.add_argument('--titles-data-path', type=str, default=None,
                       help='Path to titles dataset used for ICT')
    group.add_argument('--query-in-block-prob', type=float, default=0.1,
                       help='Probability of keeping query in block for ICT dataset')
    group.add_argument('--ict-one-sent', action='store_true',
                       help='Whether to use one sentence documents in ICT')

    # training
    group.add_argument('--report-topk-accuracies', nargs='+', default=[],
                       help="Which top-k accuracies to report (e.g. '1 5 20')")

    return parser
megatron/checkpointing.py: +4 −3
@@ -128,14 +128,15 @@ def save_checkpoint(iteration, model, optimizer, lr_scheduler):
    torch.distributed.barrier()


def load_checkpoint(model, optimizer, lr_scheduler):
def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load'):
    """Load a model checkpoint and return the iteration."""
    args = get_args()
    load_dir = getattr(args, load_arg)

    if isinstance(model, torchDDP):
        model = model.module
    # Read the tracker file and set the iteration.
    tracker_filename = get_checkpoint_tracker_filename(args.load)
    tracker_filename = get_checkpoint_tracker_filename(load_dir)

    # If no tracker file, return iteration zero.
    if not os.path.isfile(tracker_filename):
@@ -164,7 +165,7 @@ def load_checkpoint(model, optimizer, lr_scheduler):
        tracker_filename)

    # Checkpoint.
    checkpoint_name = get_checkpoint_name(args.load, iteration, release)
    checkpoint_name = get_checkpoint_name(load_dir, iteration, release)
    if mpu.get_data_parallel_rank() == 0:
        print('global rank {} is loading checkpoint {}'.format(
            torch.distributed.get_rank(), checkpoint_name))
megatron/data/bert_dataset.py: +72 −128
@@ -22,81 +22,14 @@ import numpy as np
import torch
from torch.utils.data import Dataset

from megatron import get_tokenizer
from megatron import mpu
from megatron.data.dataset_utils import build_training_sample
from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset
from megatron import get_tokenizer, get_args
from megatron import print_rank_0


def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
                                    train_valid_test_num_samples,
                                    max_seq_length, masked_lm_prob,
                                    short_seq_prob, seed, skip_warmup):

    # Indexed dataset.
    indexed_dataset = get_indexed_dataset_(data_prefix,
                                           data_impl,
                                           skip_warmup)

    # Get start and end indices of train/valid/train into doc-idx
    # Note that doc-idx is designed to be num-docs + 1 so we can
    # easily iterate over it.
    total_num_of_documents = indexed_dataset.doc_idx.shape[0] - 1
    splits = get_train_valid_test_split_(splits_string, total_num_of_documents)

    # Print stats about the splits.
    print_rank_0(' > dataset split:')

    def print_split_stats(name, index):
        print_rank_0('    {}:'.format(name))
        print_rank_0('     document indices in [{}, {}) total of {} '
                     'documents'.format(splits[index], splits[index + 1],
                                        splits[index + 1] - splits[index]))
        start_index = indexed_dataset.doc_idx[splits[index]]
        end_index = indexed_dataset.doc_idx[splits[index + 1]]
        print_rank_0('     sentence indices in [{}, {}) total of {} '
                     'sentences'.format(start_index, end_index,
                                        end_index - start_index))
    print_split_stats('train', 0)
    print_split_stats('validation', 1)
    print_split_stats('test', 2)

    def build_dataset(index, name):
        dataset = None
        if splits[index + 1] > splits[index]:
            # Get the pointer to the original doc-idx so we can set it later.
            doc_idx_ptr = indexed_dataset.get_doc_idx()
            # Slice the doc-idx
            start_index = splits[index]
            # Add +1 so we can index into the dataset to get the upper bound.
            end_index = splits[index + 1] + 1
            # New doc_idx view.
            indexed_dataset.set_doc_idx(doc_idx_ptr[start_index:end_index])
            # Build the dataset accordingly.
            dataset = BertDataset(
                name=name,
                indexed_dataset=indexed_dataset,
                data_prefix=data_prefix,
                num_epochs=None,
                max_num_samples=train_valid_test_num_samples[index],
                masked_lm_prob=masked_lm_prob,
                max_seq_length=max_seq_length,
                short_seq_prob=short_seq_prob,
                seed=seed)
            # Set the original pointer so dataset remains the main dataset.
            indexed_dataset.set_doc_idx(doc_idx_ptr)
            # Checks.
            assert indexed_dataset.doc_idx[0] == 0
            assert indexed_dataset.doc_idx.shape[0] == \
                (total_num_of_documents + 1)
        return dataset

    train_dataset = build_dataset(0, 'train')
    valid_dataset = build_dataset(1, 'valid')
    test_dataset = build_dataset(2, 'test')

    return (train_dataset, valid_dataset, test_dataset)
from megatron import mpu
from megatron.data.dataset_utils import get_a_and_b_segments
from megatron.data.dataset_utils import truncate_segments
from megatron.data.dataset_utils import create_tokens_and_tokentypes
from megatron.data.dataset_utils import pad_and_convert_to_numpy
from megatron.data.dataset_utils import create_masked_lm_predictions


class BertDataset(Dataset):
@@ -137,11 +70,8 @@ class BertDataset(Dataset):
        return self.samples_mapping.shape[0]

    def __getitem__(self, idx):

        start_index, end_index, seq_length = self.samples_mapping[idx]
        sample = []
        for index in range(start_index, end_index):
            sample.append(self.indexed_dataset[index])
        start_idx, end_idx, seq_length = self.samples_mapping[idx]
        sample = [self.indexed_dataset[i] for i in range(start_idx, end_idx)]
        # Note that this rng state should be numpy and not python since
        # python randint is inclusive whereas the numpy one is exclusive.
        np_rng = np.random.RandomState(seed=(self.seed + idx))
@@ -154,55 +84,6 @@ class BertDataset(Dataset):
                                     self.masked_lm_prob, np_rng)


def get_indexed_dataset_(data_prefix, data_impl, skip_warmup):

    print_rank_0(' > building dataset index ...')

    start_time = time.time()
    indexed_dataset = make_indexed_dataset(data_prefix,
                                           data_impl,
                                           skip_warmup)
    assert indexed_dataset.sizes.shape[0] == indexed_dataset.doc_idx[-1]
    print_rank_0(' > finished creating indexed dataset in {:4f} '
                 'seconds'.format(time.time() - start_time))

    print_rank_0(' > indexed dataset stats:')
    print_rank_0('    number of documents: {}'.format(
        indexed_dataset.doc_idx.shape[0] - 1))
    print_rank_0('    number of sentences: {}'.format(
        indexed_dataset.sizes.shape[0]))

    return indexed_dataset


def get_train_valid_test_split_(splits_string, size):
    """ Get dataset splits from comma or '/' separated string list."""

    splits = []
    if splits_string.find(',') != -1:
        splits = [float(s) for s in splits_string.split(',')]
    elif splits_string.find('/') != -1:
        splits = [float(s) for s in splits_string.split('/')]
    else:
        splits = [float(splits_string)]
    while len(splits) < 3:
        splits.append(0.)
    splits = splits[:3]
    splits_sum = sum(splits)
    assert splits_sum > 0.0
    splits = [split / splits_sum for split in splits]
    splits_index = [0]
    for index, split in enumerate(splits):
        splits_index.append(splits_index[index] +
                            int(round(split * float(size))))
    diff = splits_index[-1] - size
    for index in range(1, len(splits_index)):
        splits_index[index] -= diff
    assert len(splits_index) == 4
    assert splits_index[-1] == size
    return splits_index


def get_samples_mapping_(indexed_dataset,
                         data_prefix,
                         num_epochs,
@@ -286,3 +167,66 @@ def get_samples_mapping_(indexed_dataset,
        samples_mapping.shape[0]))

    return samples_mapping


def build_training_sample(sample,
                          target_seq_length, max_seq_length,
                          vocab_id_list, vocab_id_to_token_dict,
                          cls_id, sep_id, mask_id, pad_id,
                          masked_lm_prob, np_rng):
    """Build training sample.

    Arguments:
        sample: A list of sentences in which each sentence is a list token ids.
        target_seq_length: Desired sequence length.
        max_seq_length: Maximum length of the sequence. All values are padded to
            this length.
        vocab_id_list: List of vocabulary ids. Used to pick a random id.
        vocab_id_to_token_dict: A dictionary from vocab ids to text tokens.
        cls_id: Start of example id.
        sep_id: Separator id.
        mask_id: Mask token id.
        pad_id: Padding token id.
        masked_lm_prob: Probability to mask tokens.
        np_rng: Random number generator. Note that this rng state should be
              numpy and not python since python randint is inclusive for
              the upper bound whereas the numpy one is exclusive.
    """

    # We assume that we have at least two sentences in the sample
    assert len(sample) > 1
    assert target_seq_length <= max_seq_length

    # Divide sample into two segments (A and B).
    tokens_a, tokens_b, is_next_random = get_a_and_b_segments(sample, np_rng)

    # Truncate to `target_sequence_length`.
    max_num_tokens = target_seq_length
    truncated = truncate_segments(tokens_a, tokens_b, len(tokens_a),
                                  len(tokens_b), max_num_tokens, np_rng)

    # Build tokens and tokentypes.
    tokens, tokentypes = create_tokens_and_tokentypes(tokens_a, tokens_b,
                                                      cls_id, sep_id)

    # Masking.
    max_predictions_per_seq = masked_lm_prob * max_num_tokens
    (tokens, masked_positions, masked_labels, _) = create_masked_lm_predictions(
        tokens, vocab_id_list, vocab_id_to_token_dict, masked_lm_prob,
        cls_id, sep_id, mask_id, max_predictions_per_seq, np_rng)

    # Padding.
    tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np \
        = pad_and_convert_to_numpy(tokens, tokentypes, masked_positions,
                                   masked_labels, pad_id, max_seq_length)

    train_sample = {
        'text': tokens_np,
        'types': tokentypes_np,
        'labels': labels_np,
        'is_random': int(is_next_random),
        'loss_mask': loss_mask_np,
        'padding_mask': padding_mask_np,
        'truncated': int(truncated)}
    return train_sample