**Proof-of-Concept 2: Reuters dataset 2, 2007-2016; 150,802 articles**
Because this dataset was collected by web scraping and is not clean, it caused problems when fed into the models and was therefore held out as a second test dataset. The best-performing model, Sentimetre Model 2, was used to generate predictions on this dataset; the charts are provided below. This dataset was not used in training, testing, or validation of any of the models.
We assume that we are able to buy at market open and liquidate at market close.
Our backtest does not incorporate:
- Use of options/derivatives
- Optimal sizing of trades: all positions are the same size
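The assumptions above (entry at the open, exit at the close, equal position sizes) can be illustrated with a minimal sketch. The `Position` class, the tickers, and the prices are hypothetical and are not taken from the project code; the sketch also ignores transaction costs, which is an assumption of this example only.

```python
from dataclasses import dataclass


@dataclass
class Position:
    ticker: str         # hypothetical identifier
    side: int           # +1 for a long position, -1 for a short position
    open_price: float   # assumed entry price at market open
    close_price: float  # assumed exit price at market close


def daily_return(positions):
    """Equal-weighted open-to-close return for one day's positions."""
    if not positions:
        return 0.0
    total = 0.0
    for p in positions:
        # A long profits when the price rises; a short profits when it falls.
        total += p.side * (p.close_price - p.open_price) / p.open_price
    # Every position has the same size, so the day's return is a plain average.
    return total / len(positions)


example = [
    Position("AAA", +1, 100.0, 102.0),  # long: price rises 2%
    Position("BBB", -1, 50.0, 51.0),    # short: price rises 2%, so it loses 2%
]
print(daily_return(example))
```

Because all positions are equally sized, the gain on the long and the loss on the short offset each other exactly in this example.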
**Top 5 position selection**
For each day, we select the top 5 long positions and the top 5 short positions based on features and discard the remaining positions.
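As a rough sketch of this selection step, suppose each ticker receives a single score per day, where a higher score favors a long position and a lower score favors a short one. That scoring scheme, the function `select_positions`, and the tickers are assumptions made for illustration; the text does not specify how the features are combined into a ranking.

```python
def select_positions(scores, k=5):
    """Return (longs, shorts): the k highest- and k lowest-scoring tickers.

    `scores` maps ticker -> hypothetical daily score.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    longs = [ticker for ticker, _ in ranked[:k]]
    # Choose shorts only from tickers not already selected as longs,
    # so that no ticker is held long and short on the same day.
    remaining = ranked[k:]
    shorts = [ticker for ticker, _ in reversed(remaining[-k:])]
    return longs, shorts


# Twelve hypothetical tickers with distinct scores from -5.5 to 5.5.
day_scores = {f"T{i}": i - 5.5 for i in range(12)}
longs, shorts = select_positions(day_scores)
print(longs)   # ['T11', 'T10', 'T9', 'T8', 'T7']
print(shorts)  # ['T0', 'T1', 'T2', 'T3', 'T4']
```

The two middle tickers (`T5` and `T6`) fall into neither list and are discarded for that day.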
*Model accuracy on the validation dataset:*
- NLTK VADER Sentiment Analyzer: N/A
- Linear Classifier: 51%
- Sentimetre Model 2: 56%