**Proof-of-Concept 2: Reuters dataset 2, 2007-2016; 150,802 articles**
Because this dataset was collected by web scraping and is not clean, it presented problems when fed into the models, so it was held out as a second test dataset. The model that performed best on validation, Sentimetre Model 2, was used to predict on this dataset, and the charts are provided below. This dataset was not used in training, testing, or validation of any of the models.
We use a long-short, equally weighted portfolio backtest for all our models. Other papers have tended to build a portfolio from only the top 10 long predictions and the top 10 short predictions, but we prefer to include all predictions in our portfolio. We assume that we are able to enter positions at market open and liquidate them at market close.
Our backtest does not incorporate:
- Use of options/derivatives
- Optimal sizing of trades: all positions are the same size
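
The backtest described above can be sketched roughly as follows. This is a minimal illustration and not the code used to produce our results: the `Trade` record, the function names, and the sample prices are all hypothetical, and we assume that each day's capital is split evenly across every prediction made that day, with transaction costs ignored.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Trade:
    """One model prediction for one stock on one day (hypothetical record)."""
    date: str           # trading day, e.g. "2016-01-04"
    signal: int         # +1 = predicted up (long), -1 = predicted down (short)
    open_price: float   # assumed entry price at market open
    close_price: float  # assumed exit price at market close


def daily_long_short_returns(trades: Iterable[Trade]) -> Dict[str, float]:
    """Equal-weighted open-to-close portfolio return for each day.

    Every prediction is included and every position has the same size,
    so the daily portfolio return is the plain mean of the signed
    intraday returns.
    """
    by_day: Dict[str, list] = {}
    for t in trades:
        intraday = t.close_price / t.open_price - 1.0
        # A short position gains when the price falls, hence the sign flip.
        by_day.setdefault(t.date, []).append(t.signal * intraday)
    return {day: sum(r) / len(r) for day, r in sorted(by_day.items())}


def cumulative_return(daily: Dict[str, float]) -> float:
    """Compound the daily returns into a total return over the period."""
    total = 1.0
    for r in daily.values():
        total *= 1.0 + r
    return total - 1.0


# Made-up prices, for illustration only.
sample = [
    Trade("2016-01-04", +1, 100.0, 102.0),  # long, price rose 2%
    Trade("2016-01-04", -1, 50.0, 49.0),    # short, price fell 2%
    Trade("2016-01-05", +1, 10.0, 9.9),     # long, price fell 1%
]
daily = daily_long_short_returns(sample)
total = cumulative_return(daily)
```

Taking the mean of the signed returns is what makes the portfolio equally weighted. A top-10 long and top-10 short variant, as used in other papers, would filter the trades by prediction confidence before this step.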
*Model accuracy on the validation dataset:*
- NLTK VADER Sentiment Analyzer: N/A
- Linear Classifier: 51%
- Sentimetre Model 2: 56%
*Prediction accuracy on the test dataset:*
- NLTK VADER Sentiment Analyzer: 50.6%
- Linear Classifier: 52.9%
- Sentimetre Model 2: 51.6%