Machine Learning

Machine Learning Code

I thought I’d share some code I’ve written for a Machine Learning class I am taking at my school. The project required me to write MATLAB code for various ML algorithms, including KNN, linear classifiers (MSE, batch perceptron), and AdaBoost. Our main objective was to train a classifier for detecting spam/ham emails.



Random Subspace Optimization: Max Sharpe

I was reading David’s post on the idea of Random Subspace Optimization (RSO) and thought I’d provide some code to contribute to the discussion. I’ve always loved ensemble methods, since combining multiple streams of estimates makes for more robust estimates.

In this post, I will show how an RSO overlay performs within a max Sharpe framework. To make things more comparable, I will employ the same assets as David for the backtest. One additional universe I would like to incorporate is the current-day S&P 100, which carries survivorship bias.

The random subspace method is a generalization of the random forest algorithm. Instead of generating random decision trees, the method can employ any desired classifier. Applied to portfolio management: given N asset classes and their return streams, we randomly select `k` assets `s` times. For each of the `s` random asset combinations, we run a user-defined sizing algorithm. The last step is to combine the results through averaging to get the final weights. In R, the problem is easily formulated with `lapply` or a `for` loop as the base iterative procedure, and the function `sample` draws the random integers. Note that my RSO function uses functions from the Systematic Investors Toolbox.

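 # NOTE: the function header was lost from this listing. The name `rso.portfolio`,
 # the `list.param` argument and the `match.fun` call are reconstructed assumptions.
 # Requires the Systematic Investors Toolbox (and `coredata` from zoo/xts) to be loaded.
 rso.portfolio = function(ia, k, s, list.param){
   size.fn = match.fun(list.param$weight.function)
   if(k > ia$n) stop("K is greater than number of assets.")
   space = seq_len(ia$n)
   # s x k matrix: each row holds one random draw of k asset indices
   index.samples = t(replicate(s, sample(space, size=k)))
   weight.holder = matrix(NA, nrow = s, ncol = ia$n)
   colnames(weight.holder) = ia$symbol.names

   hist = coredata(ia$hist.returns)
   # long only: 0 <= x.i <= 1
   constraints = new.constraints(k, lb = 0, ub = 1)
   constraints = add.constraints(diag(k), type='>=', b=0, constraints)
   constraints = add.constraints(diag(k), type='<=', b=1, constraints)

   #SUM x.i = 1
   constraints = add.constraints(rep(1, k), 1, type = '=', constraints)

   for(i in 1:s){
     ia.temp = create.historical.ia(hist[,index.samples[i,]],252)
     weight.holder[i,index.samples[i,]] = size.fn(ia.temp,constraints)
   }
   # na.rm=T averages each asset only over the draws in which it was sampled,
   # so the result is not guaranteed to sum to 1
   final.weight = colMeans(weight.holder,na.rm=T)
   return(final.weight)
 }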


The above function takes in an `ia` object, short for input assumptions, which holds all the statistics that most sizing algorithms need. Also, I’ve opted to focus on long-only portfolios.
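To make the procedure concrete without the Systematic Investors Toolbox, here is a minimal base R sketch. The names `rso.sketch` and `inv.var.weights` are hypothetical, the inverse-variance sizing rule is only a stand-in for max Sharpe, and the returns are simulated. Unlike the function above, it counts an asset that was not drawn as a zero weight, so the averaged weights sum to one.

```r
# Hypothetical stand-in sizing rule: weights proportional to 1 / variance.
inv.var.weights <- function(returns) {
  w <- 1 / apply(returns, 2, var)
  w / sum(w)
}

# Random subspace overlay: draw k of the n assets s times, size each draw,
# then average. Assets left out of a draw get weight 0 in that draw.
rso.sketch <- function(returns, k, s, size.fn = inv.var.weights) {
  n <- ncol(returns)
  if (k > n) stop("k is greater than the number of assets.")
  weights <- matrix(0, nrow = s, ncol = n,
                    dimnames = list(NULL, colnames(returns)))
  for (i in seq_len(s)) {
    idx <- sample(n, size = k)
    weights[i, idx] <- size.fn(returns[, idx, drop = FALSE])
  }
  colMeans(weights)
}

set.seed(1)
# Simulated daily returns for 5 made-up assets with different volatilities.
sim.returns <- sapply(c(0.005, 0.01, 0.01, 0.02, 0.03),
                      function(sd) rnorm(252, mean = 0, sd = sd))
colnames(sim.returns) <- paste0("A", 1:5)

w <- rso.sketch(sim.returns, k = 3, s = 100)
print(round(w, 3))
print(sum(w))
```

Any function that maps a return matrix to a weight vector summing to one could be passed as `size.fn` in this sketch.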

The following are the results for the 8 asset class universe. All backtests hereafter keep `s` equal to 100 while varying `k` from 2 to N-1, where N is the total number of assets. The base comparisons are the simple max Sharpe portfolio and the equal weight portfolio.


The following is for 9 sector asset classes.


Last but not least is the performance for current day S&P 100 stocks.


The RSO method seems to improve all the universes that I’ve thrown at it. For a pure stock universe, it reduces volatility by roughly 300 to 700 basis points, depending on the choice of `k`. In a series of tests across different universes, I have found that the biggest improvements from RSO come from applying it to a universe of instruments that belong to the same asset class. I’ve also found that for a highly similar universe (stocks), a lower `k` is better than a higher `k`.

One explanation: the max Sharpe portfolio of X identical assets is the equal weight portfolio. So when the asset universe is highly similar or approaching equivalence, resampling with a lower `k` Y times, with Y approaching infinity, in a sense approaches the limit of an equally weighted portfolio. This is in line with the idea behind the curse of dimensionality: for better estimates, the data required grows exponentially as the number of assets increases. In this case, with limited data, a simple equal weight portfolio will do better, which is consistent with the better performance for lower `k`.
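The limiting case of identical assets can be checked numerically. The sketch below assumes the textbook unconstrained tangency solution, with weights proportional to the inverse covariance matrix times the expected excess returns, and uses made-up inputs (same expected return, same volatility, same pairwise correlation). It is an illustration, not the backtest code.

```r
# Assumption: n statistically identical assets.
n   <- 5
mu  <- rep(0.06, n)   # identical expected excess returns (made up)
vol <- 0.20           # identical volatility (made up)
rho <- 0.80           # identical pairwise correlation (made up)

# Covariance matrix: vol^2 on the diagonal, rho * vol^2 everywhere else.
Sigma <- vol^2 * (rho * matrix(1, n, n) + (1 - rho) * diag(n))

# Unconstrained max Sharpe (tangency) weights are proportional to
# solve(Sigma, mu); rescale them to sum to one.
w <- solve(Sigma, mu)
w <- w / sum(w)
print(w)
```

Every weight comes out at 1/n, and since all weights are positive the long-only constraint would not bind here.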

For a well-specified universe of assets, RSO with a higher `k` yields better results than a lower `k`. This is most likely because simple random sampling of such a universe with a small `k` yields samples that are themselves highly mis-specified universes. The problem is magnified when diversifying assets like bonds are significantly outnumbered by other assets like equities, as the probability of sampling an asset with diversification benefits is far lower than that of sampling an asset without such benefits. In other words, with a lower `k`, one will most likely end up with a portfolio that contains a lot of risky assets relative to lower-risk assets.
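The sampling argument can be quantified with a simple counting calculation. The universe below (8 assets, 2 of them diversifiers) is purely illustrative, and `p.no.diversifier` is a hypothetical helper name.

```r
# Probability that a random draw of k assets (without replacement)
# contains none of the diversifying assets.
p.no.diversifier <- function(n.assets, n.diversifiers, k) {
  choose(n.assets - n.diversifiers, k) / choose(n.assets, k)
}

k.values <- 2:7
p <- sapply(k.values, function(k) p.no.diversifier(8, 2, k))
print(data.frame(k = k.values, p.no.diversifier = round(p, 3)))
```

In this made-up universe, more than half of the draws with `k` = 2 hold no diversifier at all, while every draw with `k` = 7 must include at least one.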

A possible future direction would be to figure out ways of choosing `k` and `s` in an RSO rather than having to specify them by hand. For example: randomly selecting `k`, OR selecting a `k` that targets a certain risk/return, OR selecting a `k` that maximizes a user-defined performance metric.

Thanks for reading,


Machine Learning

I’m going to do something a bit different today. With this post, I would like to start documenting some of my conquests in Machine Learning (ML). I personally find this subject fascinating and couldn’t wait two more years to take the data mining course offered at my school.

Warning: I am not trained at all in data mining, and all of the following ideas are really bits and pieces of knowledge I’ve accumulated from the web and from books I’ve been reading for the past month or so.

In ML, there are usually two categories into which users can sort what they are trying to accomplish. The first is the regression problem: taking a set of training data and predicting continuous outputs, which are usually numbers. For example, given your height, I would like to predict your weight. The second group is called “classification” problems. These are problems whereby you take training data (inputs) and output which categories they belong to. Although the output can also take numeric values, these are usually dummy variables that represent different factors. So, continuing with our earlier example, given your height and weight, a classification problem will try to predict whether you are male or female.
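The two problem types can be put side by side in a short sketch. Everything here is an assumption for illustration: the data are simulated, and `lm` (linear regression) and `glm` (logistic regression) are just two convenient base R models for the regression and classification cases.

```r
set.seed(42)
# Simulated data, not the data set used later in this post.
n      <- 200
male   <- rbinom(n, 1, 0.5)                                  # 1 = male, 0 = female
height <- rnorm(n, mean = 162 + 13 * male, sd = 7)           # cm
weight <- -60 + 0.75 * height + 6 * male + rnorm(n, sd = 8)  # kg

# Regression: predict a continuous output (weight) from height.
reg.fit     <- lm(weight ~ height)
pred.weight <- predict(reg.fit, newdata = data.frame(height = 175))

# Classification: predict a category (male or female) from height and weight.
cls.fit <- glm(male ~ height + weight, family = binomial)
p.male  <- predict(cls.fit, newdata = data.frame(height = 175, weight = 80),
                   type = "response")
pred.class <- ifelse(p.male > 0.5, "male", "female")

print(pred.weight)
print(pred.class)
```

The regression returns a number in kilograms, while the classifier returns a label, even though a numeric probability sits underneath it.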

There are a lot of tools available to the data scientist. A list of different algorithms can be obtained from the following wikipedia page. (ML Algos)

Today, I would like to share a really simple ML algorithm called K-Nearest Neighbors, or KNN for short. This is a classification algorithm that takes in a training data set and predicts which category a new input belongs to. The process works as follows:

1. Plot training data on a 2D plane

2. Plot the value that you are trying to predict on the plane

3. Find the nearest K point(s) and hold a vote

4. The new point is assigned the category held by the majority of those K points

The K in “KNN” is a user-defined integer parameter that specifies how many of the closest points the algorithm should take into consideration when determining the category of an input.
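The steps above can be sketched from scratch in a few lines of base R. This is a minimal illustration, not the library routine used for the results in this post: `knn.one` is a hypothetical helper, the six training points are made up, and distances are plain Euclidean distances on unscaled data.

```r
# Classify one new point by majority vote among its k nearest training points.
# train.x: numeric matrix (one row per observation), train.y: vector of labels.
knn.one <- function(train.x, train.y, new.x, k = 3) {
  # Euclidean distance from the new point to every training point
  d <- sqrt(rowSums(sweep(train.x, 2, new.x)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train.y[nearest])
  names(votes)[which.max(votes)]  # ties go to the first label in table order
}

# Tiny made-up training set: height (cm) and weight (kg).
train.x <- rbind(c(160, 55), c(165, 60), c(158, 52),
                 c(180, 82), c(178, 80), c(185, 90))
train.y <- c("F", "F", "F", "M", "M", "M")

pred <- knn.one(train.x, train.y, new.x = c(182, 85), k = 3)
print(pred)
```

Here the three nearest neighbors of the new point are all labeled "M", so the vote is unanimous.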

For my little experiment, I will be using weight and height data to predict whether someone is female or male. I’ve separated the data into training (in-sample) and testing (out-of-sample) sets. The graph below shows all the data points, plotted and categorized.

To gauge performance, I will use mis-classification percent (error rate). In the next graph, I graphed out of sample error rate as a function of K.
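As a small illustration of the metric, the error rate is just the share of predictions that disagree with the true labels. The labels below are made up for the example.

```r
# Made-up labels, for illustration only.
actual    <- c("M", "F", "F", "M", "F", "M", "M", "F")
predicted <- c("M", "F", "M", "M", "F", "F", "M", "F")

# Mis-classification percent: share of predictions that differ from the truth.
error.rate <- 100 * mean(predicted != actual)
print(error.rate)

# Confusion table: rows are true labels, columns are predicted labels.
print(table(actual, predicted))
```

Two of the eight predictions are wrong here, so the error rate is 25%.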

As you can see, there is an obvious downward trend in the error rate as K increases. This can be attributed to the fact that with more surrounding information, the model can predict more accurately. In the future, I hope to post more on this and other related methods.

I have found tremendous commonalities between ML and trading system development. Although it would be very naive to use ML outright to predict stock or asset prices, I brainstormed a few ways of using ML to build trading systems and improve existing testing methods. I hope to blog about this in the future. Stay tuned.

Code for the above exercise (drop me an email for the dataset):


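library(class)   # knn()
library(ggplot2) # qplot()
# the data-loading code is not shown; test.raw (test labels in column 1) is an assumed name
result<-knn(train,test,cl=train.raw[,1],k=3) #knn algo

#optimization for different values of K
summary=matrix(NA,nrow=10,ncol=2) #columns: K, out-of-sample error rate (%)
for(i in 1:10){
 result=knn(train, test, cl=train.raw[,1], k=i)
 summary[i,]=c(i,100*mean(as.character(result)!=as.character(test.raw[,1])))
}
qplot(summary[,1],summary[,2],geom='line',xlab="K",ylab="Mis-Classification (%)")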