When applying Machine Learning tools to market prediction, the internet is saturated with academic papers and lacking in practical code examples. In this post, it’s my goal to translate one such paper from text to code. Mark Dunne’s Undergraduate Thesis, “Stock Market Prediction”, approaches market forecasting from a sweeping set of angles. He considers fundamental and technical analysis, equity analyst opinions, price time-series, and interrelated market indices.
I’ve recreated a segment of his paper below. It uses major indices from around the world to predict price movements in the Dow Jones. Given that markets in Shanghai open at 01:30 UTC, in Germany at 07:00 UTC, and New York at 14:30 UTC, the movement of earlier markets can be used to estimate the movements of later markets.
Dunne was forthright about a significant flaw in this analysis. Markets in England and Germany close at 16:30 UTC and 19:00 UTC respectively. The close prices of these markets occur after US markets have opened, exposing the strategy to lookahead bias. The predictive power of his analysis is eroded by this bias, as shown when he runs the strategy through Quantopian. But some non-biased indicators – such as the Shanghai Composite – still show predictive power. At the end of this post, I use a mixture of open and close prices as a way to work around this bias.
As Dunne notes throughout his paper, the Efficient Market Hypothesis reinforces that market predictions are no simple task. Though few results of his analysis are trade-able, Dunne’s methods are a useful lesson in practical application.
As a disclaimer, I’m new to the field of Machine Learning. This, in part, is why I’ve chosen to translate existing work instead of performing my own research. The purpose of this post is to explain methodology, and since the results are in line with Dunne’s, I can speak to the accuracy of this methodology. If my technical explanations below are lacking, I apologize.
Gathering and Processing Data
Dunne’s analysis is performed with data from Quandl, a library housing various global financial datasets. Dunne’s goal was to predict movement in the Dow Jones. For indices, he used Germany’s DAX, England’s FTSE, Japan’s Nikkei, and China’s SSE. He added in DAX Futures, Oil Prices, USD/EUR exchange rates, and USD/AUD exchange rates:
import numpy as np
import quandl

djia = quandl.get('YAHOO/INDEX_DJI', transformation="rdiff")
dax_f = quandl.get('CHRIS/EUREX_FDAX1', transformation="rdiff")
dax = quandl.get('YAHOO/INDEX_GDAXI', transformation="rdiff")
nk = quandl.get('YAHOO/INDEX_N225', transformation="rdiff")
sse = quandl.get('YAHOO/L_SSE', transformation="rdiff")
eur = quandl.get('FED/RXI_US_N_B_EU', transformation="rdiff")
aud = quandl.get('FED/RXI_US_N_B_AL', transformation="rdiff")
oil = quandl.get('CHRIS/ICE_B2', transformation="rdiff")
The rdiff “transformation” conveniently converts the prices into daily percent changes.
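To make the transformation concrete, here is a minimal sketch of the same calculation done by hand in pandas. The prices are made up for illustration, and I'm assuming rdiff is the usual row-over-row change, (today - yesterday) / yesterday, which is also what pandas' pct_change computes.

```python
import pandas as pd

# Hypothetical closing prices for four consecutive days (illustrative only)
prices = pd.Series([100.0, 102.0, 99.96, 99.96])

# Fractional day-over-day change: (p_t - p_{t-1}) / p_{t-1}
manual = (prices - prices.shift(1)) / prices.shift(1)

# pandas' built-in equivalent
builtin = prices.pct_change()
```

The first row is NaN in both cases, since there is no prior day to compare against. This is one reason the combined data frame needs a dropna() later on.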
I was unable to find free FTSE data from the year 2000 on Quandl. As an alternative, I went to Yahoo Finance:
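import pandas as pd
# Note: pandas.io.data was later split out of pandas into the
# separate pandas-datareader package.
import pandas.io.data

start = '2000-1-1'
end = '2013-1-1'
ftse = pd.io.data.get_data_yahoo('^FTSE', start, end)
ftse.Volume = ftse.Volume.astype(float)
ftse = ftse.pct_change()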
Combining the close prices from these features into a single data frame:
df = pd.DataFrame([djia['Adjusted Close'], dax_f.Settle, dax['Adjusted Close'],
                   nk['Adjusted Close'], sse['Adjusted Close'], ftse['Adj Close'],
                   eur.Value, aud.Value, oil.Settle],
                  index = ['DJIA','DAX FUT','DAX', 'NK', 'SSE','FTSE','EUR','AUD','OIL'])
df = df.transpose()
df = df.loc['20000104':'20130101']
df = df.dropna()
The algorithm will attempt to use these features to classify Dow Jones price movement as ‘Up’ or ‘Down’. Dunne’s analysis includes a neutral option for intraday changes of 0. Since no intraday Dow movement during this period is exactly 0, this becomes a binary classification. One might be able to define a “neutral” range using that day’s bid-ask spread, but this will be left for more meticulous minds.
def trend(val):
    if (val > 0):
        return 1
    elif (val <= 0):
        return 0

df['Move'] = df['DJIA'].apply(trend)
df = df.dropna()
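For anyone who does want a neutral class, one possible sketch is a three-way labeller with a tolerance band around zero. The function name trend_with_neutral and the band width of 0.0005 (0.05%) are my own placeholders. The band is arbitrary here, not derived from any bid-ask spread data.

```python
def trend_with_neutral(val, band=0.0005):
    """Label a daily return as 1 (up), -1 (down), or 0 (neutral).

    `band` is a hypothetical tolerance: returns whose magnitude is at or
    below it are treated as neutral. The default is an arbitrary choice.
    """
    if val > band:
        return 1
    if val < -band:
        return -1
    return 0


# Example labels for a few illustrative returns
labels_example = [trend_with_neutral(r) for r in (0.012, -0.0001, -0.02, 0.0005)]
```

Note that this turns the task into a three-class problem, so the binary F1 score used later in this post would need an averaging strategy.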
If we were using today’s values to predict tomorrow’s price, it would be necessary to shift this Move column backwards. That way, today’s feature set would be in line with the dependent variable: tomorrow’s price movement. However, since we are using values from earlier in the day to predict values later in the day, no such shifting is needed.
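For reference, the shift described above might look like the following sketch. The toy frame and the NextMove column name are invented for illustration and are not part of Dunne's analysis or my main code.

```python
import pandas as pd

# Toy data: one feature and the same-day Move label (values are made up)
toy = pd.DataFrame(
    {"SSE": [0.010, -0.004, 0.007, -0.012],
     "Move": [1, 0, 1, 0]},
    index=pd.date_range("2012-01-02", periods=4, freq="B"),
)

# Pull tomorrow's label back onto today's row
toy["NextMove"] = toy["Move"].shift(-1)

# The last day has no "tomorrow", so its row is dropped
aligned = toy.dropna()
```

After the shift, each row pairs today's features with the following day's movement, and the final row is lost because its target is unknown.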
Using the k-Nearest Neighbors algorithm, Dunne tests the efficacy of each of his features independently. Below is an implementation of such a test:
import sklearn
from sklearn import metrics
from sklearn import neighbors

labels = []
scores = []
Y = df['Move']

for c in df.columns:
    if (c != 'Move' and c != 'DJIA'):
        # Single-feature frame; first 80% of rows train, last 20% test
        X = pd.DataFrame([df[c]]).transpose()
        split = int(len(X)*.8)
        X_train = X[:split]
        Y_train = Y[:split]
        X_test = X[split:]
        Y_test = Y[split:]

        knn = neighbors.KNeighborsClassifier(n_neighbors=25)
        knn.fit(X_train, Y_train)
        predicted = knn.predict(X_test)

        print(c)
        print(metrics.f1_score(Y_test, predicted))
        print(knn.score(X_test, Y_test))

        labels.append(c)
        scores.append(knn.score(X_test, Y_test))
Plotting the results in matplotlib:
import numpy as np
import matplotlib.pyplot as plt
plt.rcdefaults()

y_pos = np.arange(len(labels))
plt.bar(y_pos, scores, align='center', alpha=0.5)
plt.xticks(y_pos, labels)
plt.ylabel('Scores')
plt.show()
The corresponding bar chart is similar to Dunne’s (note the different order of the features):
I assume here that Dunne used the raw accuracy score for his bar chart, as opposed to the F1 Score. I’ve included both in the above code segment, as I believe both are important metrics.
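To show how the two metrics can differ, here is a small sketch that computes both by hand and checks them against scikit-learn. The labels and predictions are invented for illustration.

```python
from sklearn import metrics

# Made-up ground truth and predictions (1 = Up, 0 = Down)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Confusion-matrix counts, treating "Up" as the positive class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)          # share of all calls that were right
precision = tp / (tp + fp)                  # of predicted "Up", how many were Up
recall = tp / (tp + fn)                     # of actual "Up", how many were caught
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn's versions of the same two numbers
sk_accuracy = metrics.accuracy_score(y_true, y_pred)
sk_f1 = metrics.f1_score(y_true, y_pred)
```

Accuracy counts correct "Down" calls, while F1 ignores true negatives entirely, which is why the two numbers can diverge when one class is predicted much more often than the other.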
Dunne runs the feature set through three models: Gaussian Naive Bayes, Logistic Regression, and k-Nearest Neighbors. All of these are easy to implement in scikit-learn, with relatively few lines of code. Dunne references cross validating his results. Though there are methods of cross validating time series data, I found similar results by simply using a chronological 80%/20% split.
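As an illustration of one such time-series method, here is a sketch using scikit-learn's TimeSeriesSplit, which always trains on earlier rows and tests on later ones. The data is synthetic and the fold count is an arbitrary choice. I don't know which scheme Dunne actually used.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the feature matrix and Move labels
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

fold_scores = []
fold_bounds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    knn = KNeighborsClassifier(n_neighbors=25)
    knn.fit(X[train_idx], y[train_idx])
    fold_scores.append(knn.score(X[test_idx], y[test_idx]))
    # Record the last training row and first test row to confirm ordering
    fold_bounds.append((train_idx.max(), test_idx.min()))

mean_score = float(np.mean(fold_scores))
```

Each fold's training window grows while the test window moves forward in time, so no fold is ever scored on data that precedes its training set.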
Starting first with Logistic Regression:
from sklearn import linear_model

X2 = df.copy()
Y = df['Move']
del X2['Move']
del X2['DJIA']

X_train = X2[:int(len(X2)*.8)]
Y_train = Y[:int(len(X2)*.8)]
X_test = X2[int(len(X2)*.8):]
Y_test = Y[int(len(X2)*.8):]

logit = linear_model.LogisticRegression(C=1)
logit.fit(X_train, Y_train)
predicted = logit.predict(X_test)

print(metrics.f1_score(Y_test, predicted))
print(logit.score(X_test,Y_test))
Logistic Regression is a variant on regression algorithms, designed to tackle binary classification problems. The main parameter to be tuned is the “C”-value, which controls regularization, a method of limiting the complexity of your model. In scikit-learn, C is the inverse of the regularization strength, so smaller values regularize more heavily. I’ve left C at its default value of 1.
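To see what C does in practice, the sketch below fits the same model at three values of C on synthetic data and records the size of the learned coefficients. The data and the particular C values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy binary problem (not market data)
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Length of the coefficient vector: smaller means a more constrained model
    coef_norms[C] = float(np.linalg.norm(model.coef_))
```

With the default L2 penalty, a small C pulls the coefficients toward zero, while a large C lets them grow to fit the training data more closely.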
Next is the k-Nearest Neighbors algorithm, also used above.
from sklearn import neighbors

X2 = df.copy()
Y = df['Move']
del X2['Move']
del X2['DJIA']

X_train = X2[:int(len(X2)*.8)]
Y_train = Y[:int(len(X2)*.8)]
X_test = X2[int(len(X2)*.8):]
Y_test = Y[int(len(X2)*.8):]

knn = neighbors.KNeighborsClassifier(n_neighbors=25)
knn.fit(X_train, Y_train)
predicted = knn.predict(X_test)

print(metrics.f1_score(Y_test, predicted))
print(knn.score(X_test,Y_test))
k-Nearest Neighbors involves constructing a multi-dimensional plot of your data. The algorithm evaluates the distance between new data points and existing ones. The k nearest points effectively “vote” on how this new point should be classified. The main parameter to be tuned here is k, called “n_neighbors”.
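The voting idea is simple enough to sketch by hand. The function below is my own toy version using plain Euclidean distance and a majority vote. It is meant to illustrate the concept, not to reproduce scikit-learn's implementation.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two small, well-separated clusters (made-up points)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict(X_train, y_train, np.array([0.15, 0.15]))
near_ones = knn_predict(X_train, y_train, np.array([0.95, 1.0]))
```

A point near the first cluster is outvoted into class 0 and a point near the second into class 1. Raising k smooths the decision but lets more distant points join the vote.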
Lastly, Gaussian Naive Bayes.
from sklearn.naive_bayes import GaussianNB

X2 = df.copy()
Y = df['Move']
del X2['Move']
del X2['DJIA']

X_train = X2[:int(len(X2)*.8)]
Y_train = Y[:int(len(X2)*.8)]
X_test = X2[int(len(X2)*.8):]
Y_test = Y[int(len(X2)*.8):]

clf = GaussianNB()
clf.fit(X_train, Y_train)
predicted = clf.predict(X_test)

print(metrics.f1_score(Y_test, predicted))
print(clf.score(X_test,Y_test))
Gaussian Naive Bayes hinges on Bayes’ Theorem of conditional probabilities. It assumes that, within each class, every feature follows a normal distribution and is independent of the other features (the “naive” part). From the training data, it estimates each feature’s mean and variance per class, along with how common each class is. It then assigns a new data point to the class with the highest resulting probability.
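As a rough sketch of the mechanics, the functions below fit per-class means, variances, and priors, then pick the class with the highest log-probability. The names gnb_fit and gnb_predict and the sample points are my own inventions, and the code skips the variance smoothing that scikit-learn applies.

```python
import numpy as np


def gnb_fit(X, y):
    """Estimate per-class feature means, variances, and class priors."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[int(label)] = (rows.mean(axis=0), rows.var(axis=0),
                              len(rows) / len(X))
    return params


def gnb_predict(params, x):
    """Return the class with the highest log prior plus log likelihood."""
    best_label, best_score = None, -np.inf
    for label, (mean, var, prior) in params.items():
        # Sum of independent per-feature Gaussian log densities
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var)
                                       + (x - mean) ** 2 / var)
        score = np.log(prior) + log_likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up returns for two features, clustered by class
X = np.array([[-1.0, -0.8], [-1.2, -1.1], [-0.9, -1.0],
              [1.0, 0.9], [1.1, 1.2], [0.8, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

params = gnb_fit(X, y)
pred_down = gnb_predict(params, np.array([-1.0, -1.0]))
pred_up = gnb_predict(params, np.array([1.0, 1.0]))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow with more features.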
Running these algorithms yields the following accuracy scores:
Logistic Regression: 0.712
K-Nearest Neighbors: 0.722
Gaussian Naive Bayes: 0.718
Let’s compare this with Dunne’s results:
The results for KNN and GaussianNB are extremely close. However, the accuracy results for the Logistic Regression algorithm above are far higher. I can only hypothesize as to why this is. It’s possible that Dunne did not use scikit-learn for his analysis, or chose a significantly different set of parameters. Where I chose to drop “NaN” fields from the data, he may have filled the data with zeros. He may have performed additional pre-processing on his data, such as min-max scaling. The exact dates marking the beginning and end of his data may have been different.
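If min-max scaling was part of his pipeline, it could be added along these lines. This is purely a sketch on synthetic data. I have no confirmation that Dunne scaled his features, and the pipeline below is my own guess at a sensible setup, with the scaler fitted on the training rows only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales (not market data)
rng = np.random.RandomState(0)
X = np.column_stack([rng.normal(scale=0.01, size=300),
                     rng.normal(scale=5.0, size=300)])
y = (X[:, 0] > 0).astype(int)

# Same chronological 80/20 split used throughout this post
split = int(len(X) * .8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# The pipeline learns min/max from the training rows only, then
# applies that same scaling to the test rows when scoring
scaled_knn = make_pipeline(MinMaxScaler(),
                           KNeighborsClassifier(n_neighbors=25))
scaled_knn.fit(X_train, y_train)
scaled_acc = scaled_knn.score(X_test, y_test)
```

Fitting the scaler inside the pipeline keeps the test set's minimum and maximum out of training, which would otherwise be a small form of lookahead.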
Using ‘Open’ as an Alternative
As mentioned above, this current dataset is exposed to look-ahead bias. The closes of the DAX, DAX futures, and FTSE all occur after the open of the Dow Jones.
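The timing check can be written down explicitly. In the sketch below, the New York open and the FTSE and DAX close times are the UTC figures quoted earlier in this post. The SSE and Nikkei close times are my own assumptions (15:00 local time in each market), and daylight saving shifts are ignored.

```python
from datetime import time

NY_OPEN_UTC = time(14, 30)  # from the opening times quoted earlier

close_utc = {
    "FTSE": time(16, 30),  # as quoted earlier in this post
    "DAX": time(19, 0),    # as quoted earlier in this post
    "SSE": time(7, 0),     # assumption: 15:00 Shanghai time
    "NK": time(6, 0),      # assumption: 15:00 Tokyo time
}

# A close price is only usable if it is known before New York opens
safe = sorted(name for name, t in close_utc.items() if t < NY_OPEN_UTC)
biased = sorted(name for name, t in close_utc.items() if t >= NY_OPEN_UTC)
```

Under these assumptions the Asian closes are safe to use, while the European closes are not known until after New York has opened.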
To correct this, we can take the percent change in “Open” prices (rather than “Close” prices) for these three fields. As I’m unsure of when the values for Oil, USD/AUD, and USD/EUR are recorded, I will drop these fields altogether.
df = pd.DataFrame([djia['Adjusted Close'], dax_f.Open, dax['Open'],
                   nk['Adjusted Close'], sse['Adjusted Close'], ftse['Open']],
                  index = ['DJIA','DAX FUT','DAX', 'NK', 'SSE','FTSE'])
df = df.transpose()
df = df.loc['20000104':'20130101']
df = df.dropna()
Passing this set through the evaluation used above yields the following chart:
Running these features through the above algorithms yields the following scores:
Logistic Regression: 0.535
K-Nearest Neighbors: 0.579
Gaussian Naive Bayes: 0.586
Since only 52.05% of our Dow data is marked “Up”, all three algorithms are an improvement on pure guesswork. This is likely due to the strength of the Shanghai Stock Exchange (SSE) feature. Though these results are nowhere near as insightful as the earlier dataset, they do help to reinforce a “proof of concept”.
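The "pure guesswork" benchmark can be made explicit with a majority-class baseline: always predict whichever direction was more common in training. The helper name majority_baseline and the sample labels below are invented for illustration.

```python
import numpy as np


def majority_baseline(y_train, y_test):
    """Always predict the most common training label; return it and its accuracy."""
    majority = int(np.mean(y_train) >= 0.5)  # 1 if "Up" is at least half
    accuracy = float(np.mean(np.asarray(y_test) == majority))
    return majority, accuracy


# Made-up labels: 5 of 8 training days are "Up"
y_train = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y_test = np.array([1, 0, 1, 1, 0])

majority, baseline_acc = majority_baseline(y_train, y_test)
```

A model is only interesting if its test accuracy clears this number. It is worth computing the baseline on the test portion itself, since the share of "Up" days there may differ from the share over the full period.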
I don’t believe that these results alone are trade-able. Since the relationships between these variables are intuitive, it’s likely that after-hours movement makes them unprofitable. But further researchers may have luck using this as a single indicator among many.
Last Edited Dec. 24th 2016, 1:44 pm