
Beyond Pairs

I've been writing rather prolifically this past week. Last week marked the end of my midterms (for now!), the continuation of my job search, and preparation for my final 5 weeks of university. I hope my readers are finding my posts interesting and enlightening.

As mentioned in an earlier post on statistical arbitrage, it gets interesting when we consider multi-leg portfolios. The traditional way to construct a multi-leg portfolio is to employ a multivariate linear regression (a factor model). The intuition is that we are trying to estimate a fair value for an asset using various predictors, or independent variables. For example, we know that the S&P 500 is composed of stocks from various sectors, so an intuitive approach is to derive a fair value for the S&P 500 from the 9 sector SPDRs via the following equation:

$$ R_{\mathrm{SPY},t} = \sum_{i=1}^{9} \beta_i R_{i,t} + \alpha_t $$

The residual return that is left over, the "alpha", is considered neutral (uncorrelated) to the sector factors. With this framework, we can essentially make ourselves neutral to any factors we want. For example, we have access to a wide variety of ETFs that mimic underlying asset-class movements. If we want to be neutral to interest rates, credit risk, and volatility, we can employ the ETFs TLT, HYG, and VXX respectively. Below is a chart demonstrating this, showing the estimated fair value of SPY relative to the actual ETF:

[Chart: estimated fair value of SPY vs. the actual ETF]
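To make the idea concrete, here is a minimal sketch of the factor regression in Python. The data is synthetic and the betas are made up; with real data the inputs would be daily returns of SPY and of TLT, HYG, and VXX:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# hypothetical daily returns for the three factor ETFs (TLT, HYG, VXX)
factors = rng.normal(0, 0.01, size=(n, 3))
true_beta = np.array([0.3, 0.5, -0.2])          # assumed exposures, for illustration
spy = factors @ true_beta + rng.normal(0, 0.002, n)

X = np.column_stack([np.ones(n), factors])      # prepend an intercept column
beta, *_ = np.linalg.lstsq(X, spy, rcond=None)  # OLS fit
fair_value = X @ beta                           # estimated fair-value returns
residual = spy - fair_value                     # the "alpha": the tradeable spread
```

Because OLS residuals are orthogonal to the regressors, the leftover return stream is uncorrelated with the factors by construction, which is exactly the neutrality we are after.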

Below is the spread, which can be traded by going long/short the respective legs:

[Chart: the resulting SPY spread]

The ability to control the factors we are exposed to is very appealing, as it allows us to potentially shy away from turbulent events in specific assets. Not only that, these uncorrelated return streams, when combined into a portfolio, allow significant risk reduction. As Dalio said, combining 15 uncorrelated return streams lets us reduce roughly 80% of risk (chart below). Interestingly, from my understanding of what Bridgewater does, I am fairly confident they employ spread trading too, but from a purely fundamental angle: for example, how does a set of asset classes react to movements in economic indicators? From there they construct synthetic spreads to trade off these relationships.

[Chart: risk reduction vs. number of uncorrelated return streams]
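The arithmetic behind that chart is easy to check. Under the idealized assumption of equal-volatility, zero-correlation streams, portfolio vol scales as sigma divided by the square root of N, which for 15 streams works out to roughly a 74% risk reduction; the exact ~80% figure depends on the assumptions baked into the chart:

```python
import numpy as np

sigma = 0.10                           # vol of each individual return stream
for n in (1, 5, 10, 15):
    port_vol = sigma / np.sqrt(n)      # equal-weight, zero-correlation case
    reduction = 1 - port_vol / sigma
    print(f"N={n:2d}  portfolio vol={port_vol:.4f}  risk reduced {reduction:.0%}")
```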

Below is the code that generated the data for this post:

spread.analysis <- function(data, y.symbol, x.symbol, lookback=250){
    y = data$prices[, y.symbol]
    x = data$prices[, x.symbol]
    fv = NA * data$prices[, 1]
    colnames(fv) = c('FairValue')
    lm.holder = list()  # store each window's fitted model

    for( i in (lookback+1):nrow(data$prices) ){
        hist.y = y[(i-lookback):i]
        hist.x = x[(i-lookback):i]
        lm.r = lm(hist.y ~ hist.x)
        lm.holder[[i]] = lm.r
        fv[i,] = lm.r$coefficients[1] + sum(lm.r$coefficients[-1] * x[i,])
    }
    mat = merge( x, y, fv )
    return( list( mat = mat, fv = fv, reg.model = lm.holder ) )
}

Also, here are some links I’ve found to be very informative.

The paper on high-frequency statistical arbitrage is particularly relevant, as it relates to my previous blog posts on energy-related pairs trading. Essentially, the author constructs a meta-algorithm for ranking pairs to trade. This meta-algorithm is composed of the correlation coefficient, the minimum squared distance of the normalized price series, and a co-integration test value. I don’t have intraday equities data (the paper used 15-minute bars), nor do I have the infrastructure to test it, but the idea resonates with me from my research in top-N momentum systems. There are a lot of ways to improve.
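A minimal sketch of such a ranking meta-algorithm is below. The equal weighting of the three components is my own assumption (the paper's weights may differ), and I use the lag-1 autocorrelation of the normalized spread as a crude stand-in for a proper co-integration test such as ADF:

```python
import numpy as np

def rank_pairs(prices, names):
    """prices: (T, k) array of positive price series; returns pairs best-first."""
    scores = []
    T, k = prices.shape
    for i in range(k):
        for j in range(i + 1, k):
            a, b = prices[:, i], prices[:, j]
            # 1) correlation coefficient of log returns
            corr = np.corrcoef(np.diff(np.log(a)), np.diff(np.log(b)))[0, 1]
            # 2) squared distance of the normalized price series
            na = (a - a.mean()) / a.std()
            nb = (b - b.mean()) / b.std()
            dist = np.mean((na - nb) ** 2)
            # 3) crude stationarity proxy: lag-1 autocorrelation of the spread
            #    (a real test would be ADF, e.g. statsmodels' adfuller)
            s = na - nb
            ac1 = np.corrcoef(s[:-1], s[1:])[0, 1]
            # equal weighting is an assumption; higher score = better candidate
            scores.append(((names[i], names[j]), corr - dist - ac1))
    return sorted(scores, key=lambda t: -t[1])
```

On synthetic data where two series share a common random walk, that pair should float to the top of the ranking ahead of an unrelated series.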

Energy Stat Arb Part 2

In my previous post, I rushed through a lot of the technical details of how I implemented the strategy. For that I apologize! I'll make up for it here by providing more on how I approached it, and hopefully make my analysis more understandable.

In this post, I want to revisit energy pairs trading (XLE vs OIL), but with the traditional spread-construction approach via regression analysis. My data comes from QuantQuote, adjusted for dividends and splits. To read in the data, I used the following code:

from matplotlib.pylab import *
import pandas as pd
import numpy as np
import datetime as dt
xle = pd.read_csv('/Users/mg326/xle.csv', header=None,parse_dates = [[0,1]])
xle.columns = ['Timestamp', 'open', 'high', 'low', 'close', 'volume', 'Split Factor', 'Earnings', 'Dividends']
xle['Timestamp'] = xle['Timestamp'].apply(lambda x: dt.datetime.strptime(x, '%Y%m%d %H%M'))
xle = xle.set_index('Timestamp')
xle = xle[["open", "high", "low", "close"]]

For minute data, there are approximately 391 rows of data per day. Taking into account the OHLC fields, that is a total of 391 * 4 = 1564 data observations per day. Here's an image displaying May 9th, 2013:
[Chart: XLE 1-minute bars, May 9th, 2013]

If you look into the data, you may see a price at 9:45 AM while the next data point arrives at 9:50 AM. This means there was a 5-minute gap in which no shares were traded. To fix this, the following function aligns the two data sets and forward-fills the gaps.

def align_data(data_leg1, data_leg2, symbols):
    combined_df = pd.concat([data_leg1, data_leg2], axis=1)
    combined_df = combined_df.fillna(method='pad')  # forward-fill missing bars
    data_panel = pd.Panel({symbols[0]: combined_df.ix[:,0:4], symbols[1]: combined_df.ix[:,4:9]}) #dict of dataframes
    return data_panel

To construct the spread, we will run a rolling regression on the prices to extract the hedge ratio. This is then piped into the following equation:

$$ s_t = y_t - \beta_t x_t $$

Given two price series, the following helper function returns a dictionary containing the fitted model and the spread. The spread is displayed below the code.

def construct_spread(priceY,priceX,ols_lookback):
    data = pd.DataFrame({'x':priceX,'y':priceY})
    model = {}
    model['model_ols'] = pd.ols(y=data.y, x=data.x, window=ols_lookback,intercept=False)
    model['spread'] = data.y - (model['model_ols'].beta*data.x)
    return model
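As an aside, if pd.ols is unavailable (it was removed in later pandas releases), the same no-intercept rolling hedge ratio, beta_t = sum(x*y) / sum(x^2) over the window, can be sketched with rolling sums:

```python
import numpy as np
import pandas as pd

def construct_spread_modern(priceY, priceX, ols_lookback):
    data = pd.DataFrame({'x': priceX, 'y': priceY})
    # rolling no-intercept OLS: beta_t = sum(x*y) / sum(x^2) over the window
    num = (data.x * data.y).rolling(ols_lookback).sum()
    den = (data.x ** 2).rolling(ols_lookback).sum()
    beta = num / den
    return {'beta': beta, 'spread': data.y - beta * data.x}
```

On a toy series where y is exactly twice x, the rolling beta comes out at 2 and the spread at 0, which is a quick sanity check of the formula.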
[Chart: the XLE vs. OIL spread]

To normalize it, simply subtract a rolling mean and divide by the rolling standard deviation. An image of the normalized spread follows.

zscore = lambda x: (x[-1] - x.mean()) / x.std(ddof=1)
sprd['zs'] = pd.rolling_apply(sprd['spread'], zs_window, zscore)  # rolling z-score
[Chart: normalized spread (rolling z-score)]

Without changing our parameters, a +-2 std move will be our trigger point. At this threshold, there are a total of 16 trades. Here is the performance if we took all the trades for the day, frictionless:
[Chart: intraday equity curve, frictionless]

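The trigger logic can be sketched as a simple state machine over the z-score series. This is my own minimal reconstruction of the rules described (enter at +-2, exit at 0 as in the earlier post), not the exact backtest code:

```python
import numpy as np

def zscore_signals(zs, entry=2.0, exit_=0.0):
    """Position series from a z-score path: -1 short spread, +1 long, 0 flat."""
    pos = np.zeros(len(zs))
    for i in range(1, len(zs)):
        if pos[i - 1] == 0:                      # flat: look for an entry
            if zs[i] > entry:
                pos[i] = -1                      # spread rich: short it
            elif zs[i] < -entry:
                pos[i] = 1                       # spread cheap: long it
        elif pos[i - 1] == 1:                    # long: exit when z crosses 0
            pos[i] = 0 if zs[i] >= exit_ else 1
        else:                                    # short: exit when z crosses 0
            pos[i] = 0 if zs[i] <= exit_ else -1
    return pos
```

Delaying execution by one bar, as in these tests, amounts to shifting the resulting position series forward by one step before multiplying by spread returns.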
Pretty ugly in my opinion, but it's only one day. Let's display the distribution of daily equity performance for the whole of 2013.
[Chart: daily equity curves across 2013]

The initial flat line comes from the 60-bar lookback window at the start of each day; unrealistic, but it gives a rough picture of the returns. The average final portfolio gain is 0.06 with a std of 0.13. The performance is pretty stellar when you look at 2013 as a whole. Compared to the spread construction in my last post, incorporating a longer lookback period seems to reduce the variance of returns.

Coming up in the next instalment, I want to investigate whether incorporating GARCH models for volatility forecasting will help improve the performance of spread trading.

Thanks for reading,


Energy Stat Arb

Back to my roots. I haven't tested outright entry/exit trading systems for a while now, since the Mechanica and Tblox days, but I aim to post more about these in the future.

I’ve been reading about market-neutral strategies lately to expand my knowledge. Long-only strategies are great, but constant outright directional exposure may leave your portfolio unprotected on the downside when all assets move in the same direction. A good reminder is May of last year, when gold took a nosedive.

Below are some tests I conducted on trading related energy pairs. Note that I haven't done any elaborate testing of whether the spread is mean-reverting, etc.; I just went with my instincts. No transaction costs. Spread construction is based on the stochastic differential, with a 10-day lookback, +-2/0 std normalized z-score entry/exit, and a 1-bar execution delay.

Crude Oil and Natural Gas Futures (Daily) (daily doesn't seem to work that well anymore):
[Chart: equity curve, Crude Oil vs. Natural Gas futures]

OIL and UNG ETF (1 Min Bar)
[Chart: equity curve, OIL vs. UNG]

XLE and OIL ETF (1 Min Bar)
[Chart: equity curve, XLE vs. OIL]

Pair trading is the simplest form of statistical arbitrage, but things get interesting when you start dealing with a basket of assets. For example, XLE tracks both crude oil and natural gas companies, so a potential 3-legged trade would be XLE against both OIL and UNG. Another well-known trade is to derive value for SPY from TLT (rates), HYG (corporate spreads), and VXX (volatility).

The intuition behind relative-value strategies is to derive a fair value of an asset "relative" to another. In basic pair trading, we use one leg to derive the value of the other, or vice versa. Any deviations are considered opportunities for arbitrage. In the multi-legged case, a set of assets is combined in some way (optimization, factor analysis, PCA) to measure value. See (Avellaneda) for details.
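As a toy illustration of the PCA variant, the sketch below builds synthetic returns driven by one common factor, regresses each asset on the leading principal component, and keeps the residual as the candidate mean-reverting signal. The one-factor structure is assumed for illustration; see Avellaneda for the real treatment:

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_assets = 750, 6
market = rng.normal(0, 0.01, n_obs)              # one hypothetical common factor
loadings = rng.uniform(0.5, 1.5, n_assets)
rets = np.outer(market, loadings) + rng.normal(0, 0.003, (n_obs, n_assets))

X = rets - rets.mean(axis=0)                     # demeaned returns
eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = X @ eigvec[:, -1]                          # scores on the leading component

# strip each asset's PC1 exposure; the residual is the mean-reversion candidate
betas = X.T @ pc1 / (pc1 @ pc1)
resid = X - np.outer(pc1, betas)
```

By construction the residuals are orthogonal to the extracted component, so the resulting return streams are neutral to the common (market-like) factor.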

While the equity lines above look nice, please remember that they don't account for transaction costs and are modelled purely on adjusted last-trade prices. A more realistic simulation would test the sensitivity of entries and order fills given level 1 bid-ask spreads. For that, a more structured backtesting framework should be employed.

(Special thanks to QF for tremendous insight)

Thanks for reading,