Data Augmentation

Back-translation means translating a sentence from the source language into a target language and back into the source language, then training the model on a mix of the original and back-translated sentences.
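As a rough illustration of the round trip, the sketch below assumes a hypothetical `translate(text, src, tgt)` callable; no specific translation service is implied, and the pivot language `"fr"` is only a placeholder. A toy word-substitution table stands in for a real translator so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical translator signature: translate(text, src, tgt) -> translated text.
Translate = Callable[[str, str, str], str]


def back_translate(sentence: str, translate: Translate,
                   src: str = "en", pivot: str = "fr") -> str:
    """Translate src -> pivot -> src and return the round-tripped sentence."""
    return translate(translate(sentence, src, pivot), pivot, src)


def mix_with_back_translations(sentences: List[str], translate: Translate,
                               pivot: str = "fr") -> List[str]:
    """Return the original sentences followed by their back-translated versions."""
    return list(sentences) + [
        back_translate(s, translate, pivot=pivot) for s in sentences
    ]


# Toy stand-in for a real translator (an assumption for demonstration only):
# it swaps words using small lookup tables and leaves unknown words unchanged.
TOY_TABLES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("en", "fr"): {"big": "grand", "house": "maison"},
    ("fr", "en"): {"grand": "large", "maison": "house"},
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    table = TOY_TABLES[(src, tgt)]
    return " ".join(table.get(word, word) for word in text.split())


paraphrase = back_translate("the big house", toy_translate)
mixed = mix_with_back_translations(["the big house"], toy_translate)
```

With the toy tables, the round trip turns "big" into "large", which is the kind of wording variation back-translation is meant to introduce.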

Our technique has 6 basic steps:

  1. We train the model on our labeled data and run a backtest on our test data to establish a baseline.
  2. We augment the data by translating our dataset into two target languages and back to English, which introduces variation in wording.
  3. We then filter the augmented data, keeping only news articles whose word count increased by more than 25%.
  4. We add the filtered augmented data to our training set to produce an augmented training dataset.
  5. We train the model on the augmented training data and run a backtest on our test data.
  6. We compare the backtest of the model trained on the pure training data against that of the model trained on the augmented training data.
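Steps 2 to 4 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not our exact implementation: `back_translate(text, pivot)` is a hypothetical callable, the pivot codes `"fr"` and `"de"` are placeholders for the two target languages, each pivot is assumed to be applied separately, words are counted by whitespace splitting, and the back-translated article is assumed to inherit the label of the original.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]  # (article text, label)


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def word_count_increase(original: str, augmented: str) -> float:
    """Relative growth in word count, e.g. 0.30 means 30% more words."""
    n = word_count(original)
    return (word_count(augmented) - n) / n if n else 0.0


def build_augmented_set(
    train: Sequence[Example],
    back_translate: Callable[[str, str], str],  # hypothetical: (text, pivot) -> text
    pivots: Sequence[str] = ("fr", "de"),       # placeholder pivot languages
    min_increase: float = 0.25,
) -> Tuple[List[Example], List[Example]]:
    """Return (pure + kept augmented examples, kept augmented examples)."""
    kept: List[Example] = []
    for text, label in train:
        for pivot in pivots:
            candidate = back_translate(text, pivot)
            # Keep only candidates whose word count grew by more than 25%.
            if word_count_increase(text, candidate) > min_increase:
                kept.append((candidate, label))
    return list(train) + kept, kept


# Toy stand-in for a real back-translator (demonstration only):
# the "fr" round trip appends three words, the "de" round trip changes nothing.
def toy_back_translate(text: str, pivot: str) -> str:
    return text + " as reported today" if pivot == "fr" else text


train = [
    ("stocks fell sharply on monday", 0),  # 5 words -> 8 words, +60%
    ("a much longer article about markets and rates and bonds and more", 1),  # 12 -> 15, +25%
]
augmented_train, kept = build_augmented_set(train, toy_back_translate)
```

In the toy data, the first article passes the filter and the second does not, because its increase is exactly 25% and the threshold is strict.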

In our implementation, we generate an augmented sample of 14983 articles and add it to our pure training dataset of 60844 news articles.
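For reference, the resulting dataset size and the share of augmented articles follow directly from these two counts. The variable names below are illustrative only.

```python
pure_articles = 60844       # pure training dataset
augmented_articles = 14983  # augmented sample added to it

# Size of the augmented training dataset.
total_articles = pure_articles + augmented_articles

# Fraction of the combined dataset that is back-translated.
augmented_share = augmented_articles / total_articles

print(f"total: {total_articles}, augmented share: {augmented_share:.1%}")
```

The combined training set has 75827 articles, of which roughly one in five is back-translated.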

Results

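  1. Pure data: validation accuracy 54.89 / test accuracy 54.52
  2. Augmented data: validation accuracy 55.90 / test accuracy 54.25

Augmentation raised validation accuracy by 1.01 points but lowered test accuracy by 0.27 points, so the gain on validation did not carry over to the test set.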