Random Forest is a powerful machine learning algorithm that can be used as a regressor or as a classifier. It's a meta estimator: it fits a specified number of decision trees and aggregates their predictions.
We're going to use the scikit-learn package in Python, a very useful library that contains many machine learning algorithms and related tools.
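As a quick illustration of the two flavors, here is a minimal sketch on synthetic data (just to show the API; the real dataset comes next):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 4)                   # 100 samples, 4 features
y_class = (X[:, 0] > 0.5).astype(int)  # binary target for classification
y_reg = X.sum(axis=1)                  # continuous target for regression

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y_reg)

print(clf.predict(X[:3]))  # class labels
print(reg.predict(X[:3]))  # continuous values
```

Both estimators share the same fit/predict interface, which is why the rest of this post translates directly to the regression case.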
Data preparation
To see how Random Forest can be applied, we're going to try to predict the S&P 500 futures (E-Mini); you can get the data for free on Quandl. Here is what it looks like:
Date | Open | High | Low | Last | Change | Settle | Volume | Previous Day Open Interest |
---|---|---|---|---|---|---|---|---|
2016-12-30 | 2246.25 | 2252.75 | 2228.0 | 2233.5 | 8.75 | 2236.25 | 1252004.0 | 2752438.0 |
2016-12-29 | 2245.5 | 2250.0 | 2239.5 | 2246.25 | 0.25 | 2245.0 | 883279.0 | 2758174.0 |
2016-12-28 | 2261.25 | 2267.5 | 2243.5 | 2244.75 | 15.75 | 2245.25 | 976944.0 | 2744092.0 |
The column Change needs to be removed since it has missing data, and the same information can be retrieved directly by subtracting day D-1's settle price from day D's.
Since we're using the classifier variant, we need to create a class label for each row: 1 if the future went up that day, -1 if it went down or stayed the same.
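As a sketch, assuming rows are sorted newest-first as in the sample above (note that Quandl's Change column may follow a different sign convention), the signed change can be recomputed from the Settle column:

```python
import pandas as pd

# Rows are newest-first, so the previous trading day is the *next* row
df = pd.DataFrame({'Settle': [2236.25, 2245.0, 2245.25]})
df['Change'] = df['Settle'] - df['Settle'].shift(-1)
print(df['Change'].tolist())  # [-8.75, -0.25, nan]
```

The last row has no previous day in the sample, hence the NaN.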
```python
import pandas as pd

def computeClassification(actual):
    if actual > 0:
        return 1
    else:
        return -1

# pd.DataFrame.from_csv is deprecated; use read_csv instead
data = pd.read_csv('EMini.csv', index_col=0, parse_dates=True)

# Drop the Change column (missing data, redundant with Settle)
data = data.drop('Change', axis=1)

# Compute the daily returns (rows are newest-first,
# so the previous day is the next row: shift(-1))
data['Return'] = (data['Settle'] / data['Settle'].shift(-1) - 1) * 100

# Delete the last line, which contains NaN
data = data.drop(data.tail(1).index)

# Map the last column (Y) to class labels: -1 = down, 1 = up
data.iloc[:, len(data.columns) - 1] = data.iloc[:, len(data.columns) - 1].apply(computeClassification)
```
Now that we have a complete dataset with a predictable value (the last column, "Return", which is now either -1 or 1), let's create the train and test datasets.
```python
half = len(data) // 2      # integer division (Python 3)
testData = data[-half:]    # 2nd half
trainData = data[:-half]   # 1st half

# X is the list of features (Open, High, Low, Settle, ...)
data_X_train = trainData.iloc[:, 0:len(trainData.columns) - 1]
# Y is the value to be predicted
data_Y_train = trainData.iloc[:, len(trainData.columns) - 1]

# Same thing for the test dataset
data_X_test = testData.iloc[:, 0:len(testData.columns) - 1]
data_Y_test = testData.iloc[:, len(testData.columns) - 1]
```
Using the algorithm
Once everything is ready, we can fit the Random Forest classifier on our training dataset:
```python
from sklearn import ensemble

# I picked 100 randomly; we'll see in another post how to find
# the optimal value for the number of estimators
clf = ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(data_X_train, data_Y_train)

predictions = clf.predict(data_X_test)
```
predictions is an array containing the predicted values (-1 or 1) for the features in data_X_test.
You can measure the prediction accuracy with the function accuracy_score, which compares the predicted values with the expected ones.
```python
from sklearn.metrics import accuracy_score

print("Score: " + str(accuracy_score(data_Y_test, predictions)))
```
What's next?
Now you can, for example, create a trading strategy that goes long the future when the predicted value is 1 and goes short when it's -1. This can easily be backtested with a backtest engine such as Zipline in Python.
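As a rough sanity check before wiring up a full backtest engine, the strategy's daily returns can be sketched directly with NumPy. The numbers below are hypothetical placeholders, not output from the model above, and the sketch ignores transaction costs:

```python
import numpy as np

# Hypothetical daily returns (%) and predicted positions (1 = long, -1 = short);
# in practice you'd use the realized returns aligned with data_Y_test and the
# `predictions` array from the classifier
daily_returns = np.array([0.5, -0.3, 0.2, -0.1])
positions = np.array([1, -1, 1, 1])

# Strategy return each day = position taken * realized return
strategy_returns = positions * daily_returns
print(strategy_returns.sum())  # naive cumulative return, in %
```

A real backtest also needs to shift positions so you only trade on information available before the day's return is realized.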
Based on your backtest results, you can add or remove features; perhaps volatility or a 5-day moving average would improve prediction accuracy.
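Such features can be computed with pandas rolling windows. This is only a sketch on made-up prices; the column names MA5 and Vol5 are illustrative, and since the rows are newest-first we reverse the series before computing the rolling statistics:

```python
import pandas as pd

# Hypothetical settle prices, newest-first like the Quandl data
df = pd.DataFrame({'Settle': [2236.25, 2245.0, 2245.25, 2244.75, 2240.0,
                              2238.5, 2241.0, 2239.25, 2242.5, 2237.0]})

s = df['Settle'].iloc[::-1]  # oldest-first, so rolling windows look backward in time
df['MA5'] = s.rolling(window=5).mean()                    # 5-day moving average
df['Vol5'] = s.pct_change().rolling(window=5).std()       # 5-day return volatility

print(df[['Settle', 'MA5', 'Vol5']].head())
```

Assigning the results back aligns them by index, so each row gets the statistic for its own date; the oldest rows are NaN because the window isn't full yet.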