In my previous post, I rushed through a lot of technical details on how I implemented the strategy. For that I apologize! I am here to make up by providing more on how I approached it and hopefully make my analysis more understandable.

In this post, I want to re-visit energy pairs (XLE vs OIL) trading but with the traditional spread construction approach through regression analysis. My data comes from QuantQuote, all adjusted for dividends and splits. To read in the data, I used the following code:

from matplotlib.pylab import *
import pandas as pd
import numpy as np
import datetime as dt
xle = pd.read_csv('/Users/mg326/xle.csv', header=None,parse_dates = [[0,1]])
xle.columns = ['Timestamp', 'open', 'high', 'low', 'close', 'volume', 'Split Factor', 'Earnings', 'Dividends']
xle['Timestamp'] = xle['Timestamp'].apply(lambda x: dt.datetime.strptime(x, '%Y%m%d %H%M'))
xle = xle.set_index('Timestamp')
xle = xle[["open", "high", "low", "close"]]

For minute data, there are approximately 391 rows of data per day. Taking in to account OHLC, there are a total of 391 * 4 = 1564 data observations per day. Heres a image displaying May 9th, 2013:

If you look in to the data, you may see a price for 9:45AM but the next data point comes in at 9:50AM. This means that there was a 5 minute gap where no shares were traded. To fix and align this, the following function will align the two data sets.

def align_data(data_leg1,data_leg2,symbols):
combined_df = pd.concat([data_leg1,data_leg2],axis=1)
combined_df = combined_df.fillna(method='pad')
data_panel = pd.Panel({symbols[0]: combined_df.ix[:,0:4], symbols[1]:combined_df.ix[:,4:9]}) #dict of dataframes
return(data_panel)

To construct the spread, we will run a rolling regression on the prices to extract the hedge ratio. This is then piped in to the following equation:

Given two series of prices, the following helper function will return a dictionary of the model and the spread. Following is the spread displayed.

def construct_spread(priceY,priceX,ols_lookback):
data = pd.DataFrame({'x':priceX,'y':priceY})
model = {}
model['model_ols'] = pd.ols(y=data.y, x=data.x, window=ols_lookback,intercept=False)
model['spread'] = data.y - (model['model_ols'].beta*data.x)
return model

To normalize it, simply subtract a rolling mean and divide that by the rolling standard deviation. Image of normalized spread follows.

zscore = lambda x: (x[-1] - x.mean()) / x.std(ddof=1)
sprd['zs'] = pd.rolling_apply(sprd, zs_window, zscore) # zscore

Without changing our parameters, a +-2 std will be our trigger point. At this threshold, there is a total of 16 trades. Here is the performance if we took all the trades for the day, frictionless:

Pretty ugly in my opinion but its only a day. Lets display all the daily equity performance distributed for the whole year of 2013.

The flat line initially is for the 60 bar lookback window for each day, unrealistic but it does give a rough picture on the returns. The average final portfolio gain is 0.06 with std of 0.13. The performance is pretty stellar when you look at 2013 as a whole. Comparing this to the other spread construction in my last post, its seems to reduce the variance of returns when incorporating a longer lookback period.

Coming up in the next instalment I want to investigate whether incorporating Garch models for volatility forecasting will help improve the performance of spread trading.

Thanks for reading,

Mike