Commit 90e0a0dd authored by Jared Casper

Merge branch 'github-pr' into 'main'

Pull in some GitHub PRs

See merge request ADLR/megatron-lm!282
parents 82b69e86 4a35d50a
+7 −1
@@ -103,6 +103,11 @@ python tools/preprocess_data.py \

The output will be two files named, in this case, `my-bert_text_sentence.bin` and `my-bert_text_sentence.idx`. The `--data-path` specified in later BERT training is the full path and new filename, but without the file extension.
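For example, if preprocessing wrote the two files into the current directory, a later BERT run would point at that prefix (a sketch; adjust the path to wherever the files were actually written):
<pre>
       --data-path my-bert_text_sentence \
</pre>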

For T5, use the same preprocessing as for BERT, perhaps renaming the output prefix to:
<pre>
       --output-prefix my-t5 \
</pre>

Some minor modifications are required for GPT data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type:
<pre>
python tools/preprocess_data.py \
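       # A sketch of the flags these modifications correspond to (the file name
       # below is an illustrative assumption, not taken from this hunk):
       #   --merge-file gpt2-merges.txt        # merge table
       #   --append-eod                        # end-of-document token
       #   --tokenizer-type GPT2BPETokenizer   # changed tokenizer type
       #   (and drop --split-sentences)        # sentence splitting removed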
@@ -237,13 +242,14 @@ T5_ARGS="--num-layers 24 \
         --micro-batch-size 16 \
         --global-batch-size 2048 \
         --vocab-file $VOCAB_FILE \
         --vocab-extra-ids 100 \
         --split 949,50,1 \
         --fp16"

OUTPUT_ARGS=&#60;same as those in <a href="#bert-pretraining">BERT pretraining</a> above&#62;

python pretrain_t5.py \
       $BERT_ARGS \
       $T5_ARGS \
       $OUTPUT_ARGS \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
+0 −1
@@ -25,7 +25,6 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       --decoder-seq-length 128 \
       --micro-batch-size 16 \
       --global-batch-size 128 \
       --seq-length 512 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --lr-decay-iters 1000000 \