Unverified Commit 4e2b7b26 authored by David E. Weekly, committed by GitHub

Fix typos

Fixed some small typos in the README
parent 03d28809
[Megatron](https://arxiv.org/pdf/1909.08053.pdf) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel, and multinode training of [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [BERT](https://arxiv.org/pdf/1810.04805.pdf) using mixed precision.

- Our codebase is capable of efficiently training a 72-layer, 8.3 billion parameter GPT-2 language model with 8-way model and 64-way data parallelism across 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak theoritical FLOPs. Using our GPT-2 model we achieve SOTA results on the WikiText-103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets.
+ Our codebase is capable of efficiently training a 72-layer, 8.3 billion parameter GPT-2 language model with 8-way model and 64-way data parallelism across 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak theoretical FLOPs. Using our GPT-2 model we achieve SOTA results on the WikiText-103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets.
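As a sanity check on the figures above, the per-GPU throughput and the 76% scaling efficiency follow directly from the quoted numbers. A back-of-the-envelope sketch (all values copied from the paragraph, none taken from the codebase):

```python
# Back-of-the-envelope check of the scaling-efficiency claim in the README text.
total_flops = 15.1e15        # sustained across the entire application (15.1 PetaFLOPs)
num_gpus = 512
single_gpu_baseline = 39e12  # strong single-GPU baseline (39 TeraFLOPs)

per_gpu = total_flops / num_gpus               # ~29.5 TFLOPs per GPU when scaled out
scaling_efficiency = per_gpu / single_gpu_baseline

print(f"per-GPU sustained: {per_gpu / 1e12:.1f} TFLOPs")
print(f"scaling efficiency: {scaling_efficiency:.0%}")   # ~76%
```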

- For BERT training, we swapped the position of the layer normalization and the residual connection in the model architecture (similar to GPT-2 architucture), which allowed the models to continue to improve as they were scaled up. Our BERT models with 3.9 billion parameters reaches a loss of 1.16, SQuAD 2.0 F1-score of 91.7, and RACE accuracy of 90.9%.
+ For BERT training, we swapped the position of the layer normalization and the residual connection in the model architecture (similar to GPT-2 architucture), which allowed the models to continue to improve as they were scaled up. Our BERT models with 3.9 billion parameters reach a loss of 1.16, SQuAD 2.0 F1-score of 91.7, and RACE accuracy of 90.9%.
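The reordering described above is the familiar post-LN vs. pre-LN distinction. A minimal PyTorch-style sketch of the two residual orderings (illustrative only; the class names and the generic `sublayer` argument are assumptions, not Megatron's actual modules):

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original BERT ordering: sublayer output plus residual, then LayerNorm."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.sublayer = sublayer          # e.g. self-attention or MLP sub-block
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Reordered (GPT-2-style) variant: LayerNorm first, residual add last."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```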

<a id="contents"></a>
# Contents