Commit a449d312 authored by Mostofa Patwary's avatar Mostofa Patwary

added readme

parent 59372322
+9 −0
@@ -44,3 +44,12 @@ python remove_group_duplicates.py <file containing similar documents> <cleaned d
shuf <cleaned deduped data file> -o train_data.json
```

# Deduplicating ngrams

To deduplicate the downstream tasks from the training dataset, we run the following command.

```
python filter_ngrams.py <downstream task dataset> <training dataset to deduplicate> <output training dataset>
```

We use 13-grams for the deduplication. When we find a matching 13-gram in a training document, we split the document into two pieces, removing the 13-gram along with 200 characters on each side of it. We also remove any resulting piece shorter than 200 characters, and any document that gets split more than 10 times.
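The splitting rule above can be sketched as follows. This is a minimal illustration, not the actual `filter_ngrams.py` implementation; the helper name `split_on_ngram` and the single-match handling are assumptions for clarity (the real script processes many n-grams and repeated matches).

```python
import re

NGRAM_WINDOW = 200    # characters removed on each side of a matched 13-gram
MIN_PIECE_LEN = 200   # resulting pieces shorter than this are dropped

def split_on_ngram(text, banned_ngram):
    """Hypothetical helper: remove one banned n-gram from `text`.

    Cuts out the match plus NGRAM_WINDOW characters on each side and
    returns the surviving pieces; pieces shorter than MIN_PIECE_LEN
    are filtered out, mirroring the rule described in the README.
    """
    match = re.search(re.escape(banned_ngram), text, flags=re.IGNORECASE)
    if match is None:
        return [text]
    left = text[:max(0, match.start() - NGRAM_WINDOW)]
    right = text[match.end() + NGRAM_WINDOW:]
    return [piece for piece in (left, right) if len(piece) >= MIN_PIECE_LEN]
```

A document that is repeatedly split (more than 10 times in the actual pipeline) would be discarded entirely rather than kept as many small fragments.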
+9 −0
@@ -13,6 +13,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Deduplicate downstream tasks from the training dataset using 13-grams.
Any split piece shorter than 200 characters is filtered out, as is any
document that gets split more than 10 times.
"""

from functools import partial
import json
import multiprocessing
@@ -23,6 +29,7 @@ import sys
import time

def get_words(text):
    # get all the lowercase words from text
    words, positions = [], []
    for match in re.finditer(r'\w+', text.lower()):
        words.append(match.group(0))
@@ -31,6 +38,8 @@ def get_words(text):

def free_ngram(line, ngrams, ngram_size, filter_text_len, 
    splits_count, split_window_each_size):
    # remove all matching ngrams from the document, splitting it as needed

    try:
        myjson = json.loads(line)
        text_buf = [myjson['text']]