I’m going to do something a bit different today. With this post, I would like to start documenting some of my conquests in Machine Learning (ML). I personally find this subject fascinating and couldn’t wait two more years until I can take the data mining course offered at my school.

Warning: I have no formal training in data mining, and all of the following ideas are bits and pieces of knowledge I’ve accumulated from the web and from books I’ve been reading for the past month or so.

In ML, problems usually fall into one of two categories. The first is regression, which is about taking a set of training data and predicting continuous outputs, which are usually numbers. For example, given your height, I would like to predict your weight. The second is classification. These are problems where you take training data (inputs) and output which category each input belongs to. Although the output can also take numeric values, those values are usually dummy variables that stand for different factors. So, continuing with our earlier example, given your height and weight, a classification problem will try to predict whether you are male or female.
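To make the two problem types concrete, here is a minimal R sketch. The numbers are made up for illustration and are not the dataset used in this post. The regression half uses `lm`; the classification half uses a simple nearest-centroid rule as a stand-in classifier, chosen only because it is short. The names `people`, `new_person`, and `centroids` are hypothetical.

```r
# Toy, made-up measurements (height in cm, weight in kg); not the post's data
people <- data.frame(
  height = c(150, 160, 165, 170, 175, 180, 185, 190),
  weight = c(50, 56, 61, 66, 72, 79, 85, 92),
  gender = c("F", "F", "F", "F", "M", "M", "M", "M")
)

# Regression: the output is a continuous number (weight predicted from height)
reg_fit <- lm(weight ~ height, data = people)
predicted_weight <- unname(predict(reg_fit, newdata = data.frame(height = 168)))

# Classification: the output is a category (gender from height and weight).
# Stand-in rule: assign the class whose average point (centroid) is closest.
centroids <- aggregate(cbind(height, weight) ~ gender, data = people, FUN = mean)
new_person <- c(height = 168, weight = 64)
dists <- sqrt((centroids$height - new_person[["height"]])^2 +
              (centroids$weight - new_person[["weight"]])^2)
predicted_gender <- as.character(centroids$gender[which.min(dists)])

print(predicted_weight)  # a number
print(predicted_gender)  # a label
```

The point is only the shape of the answer: the regression returns a number, while the classifier returns one of a fixed set of labels.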

There are a lot of tools available to the data scientist. A list of different algorithms can be found on the following Wikipedia page. (*ML Algos*)

Today, I would like to share a really simple ML algorithm called the K-Nearest Neighbor algorithm, or KNN for short. It is a classification algorithm that takes in a training data set and predicts which category a new input belongs to. The process works as follows:

1. Plot training data on a 2D plane

2. Plot the value that you are trying to predict on the plane

3. Find the nearest K point(s) and hold a vote among them

4. The new point is assigned to the category held by the majority of those K points

The K in “KNN” is a user-defined integer parameter that specifies how many of the closest points the algorithm should take into consideration when determining the category of an input.
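The voting procedure can be sketched in a few lines of base R. This is an illustrative version under some assumptions, not the implementation inside the `class` package: it uses Euclidean distance, `knn_predict` is a hypothetical helper name, the toy points are made up, and ties simply go to the first label in table order.

```r
# Predict the category of one new point by majority vote among its
# k nearest training points (Euclidean distance). Hypothetical helper.
knn_predict <- function(train_x, train_labels, new_point, k = 3) {
  train_x <- as.matrix(train_x)
  # Distance from the new point to every training point
  diffs <- sweep(train_x, 2, new_point)  # subtract new_point from each row
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Majority vote; on a tie, which.max picks the first label in table order
  votes <- table(train_labels[nearest])
  names(votes)[which.max(votes)]
}

# Made-up toy points: columns are height (cm) and weight (kg)
toy_x <- rbind(c(160, 55), c(162, 58), c(165, 60),
               c(178, 80), c(182, 85), c(185, 90))
toy_labels <- c("Female", "Female", "Female", "Male", "Male", "Male")

print(knn_predict(toy_x, toy_labels, c(180, 82), k = 3))
print(knn_predict(toy_x, toy_labels, c(161, 56), k = 3))
```

With these toy points, the three nearest neighbours of (180, 82) are all labelled "Male" and those of (161, 56) are all labelled "Female", so the votes are unanimous.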

For my little experiment, I will be using weight and height data to predict whether someone is female or male. I’ve separated the data into training (in-sample) and testing (out-of-sample) sets. The graph below shows the data points plotted and categorized.
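For anyone who wants to make a similar split, one common approach is to sample row indices at random. The sketch below is an assumption about how such a split could be done, not a record of how the files for this post were produced. The 70/30 ratio, the simulated data frame `full`, and its means and standard deviations are all invented for illustration.

```r
set.seed(42)  # fixed seed so the random split is repeatable

# Simulated stand-in data; the real dataset is not reproduced here
n <- 200
full <- data.frame(
  Gender = rep(c("Female", "Male"), each = n / 2),
  Height = c(rnorm(n / 2, mean = 162, sd = 7), rnorm(n / 2, mean = 176, sd = 7)),
  Weight = c(rnorm(n / 2, mean = 62, sd = 9), rnorm(n / 2, mean = 80, sd = 10))
)

# Randomly pick 70% of the rows for training; the rest become the test set
train_idx <- sample(nrow(full), size = round(0.7 * nrow(full)))
train_set <- full[train_idx, ]
test_set <- full[-train_idx, ]

print(c(train = nrow(train_set), test = nrow(test_set)))
```

Keeping the test rows out of the training set is what makes the error rate measured on them an out-of-sample number.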

To gauge performance, I will use the mis-classification percentage (error rate). In the next graph, I plot the out-of-sample error rate as a function of K.
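As a small worked example of the error rate, here is a sketch with two made-up label vectors (`predicted` and `actual` are hypothetical, not results from this experiment). It computes the rate directly and again from the diagonal of a confusion table, which holds the correctly classified counts.

```r
# Made-up predictions and true labels for five people
predicted <- c("Male", "Female", "Female", "Male", "Male")
actual    <- c("Male", "Female", "Male",   "Male", "Female")

# Direct version: share of predictions that differ from the truth
error_rate <- mean(predicted != actual)
error_pct <- 100 * error_rate

# Confusion-table version: correct predictions sit on the diagonal
confusion <- table(Predicted = predicted, Actual = actual)
error_rate_from_table <- (length(actual) - sum(diag(confusion))) / length(actual)

print(confusion)
print(error_pct)
```

Two of the five labels disagree here, so both calculations give an error rate of 0.4, or 40%. The table version only works when every category appears in both vectors, since otherwise the table is not square.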

As you can see, there is an obvious downward trend in the error rate as K increases over the range I tested (K = 1 to 10). This can be attributed to the fact that, with more surrounding information, the model can predict more accurately. In the future, I hope to post more on this and other related methods.

—

I have found tremendous commonalities between ML and trading system development. Although it would be very naive to use ML outright to predict stock or asset prices, I brainstormed a few ways of using ML to build trading systems and improve existing testing methods. I hope to blog about this in the future. Stay tuned.

Code for the above exercise (drop me an email for the dataset: michaelguan326@gmail.com)

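library(class)    # provides knn()
library(ggplot2)
# Column 1 holds the Gender label; columns 2 and 3 hold Height and Weight
train.raw <- read.csv("hw_training.csv")
test.raw <- read.csv("hw_test.csv")
train <- train.raw[, 2:3]
test <- test.raw[, 2:3]
result <- knn(train, test, cl = train.raw[, 1], k = 3)  # single KNN run with K = 3
# Out-of-sample mis-classification percentage for K = 1 to 10
k <- 1:10
err.summary <- data.frame(k = k, Mis_Class = rep(0, 10))
for (i in k) {
  result <- knn(train, test, cl = train.raw[, 1], k = i)
  # Share of test rows whose predicted label differs from the true label
  err.summary[i, 2] <- 100 * mean(as.character(result) != as.character(test.raw[, 1]))
}
# Plot the training data, then the error rate against K
ggplot(train.raw, aes(x = Height, y = Weight, color = Gender)) + geom_point()
ggplot(err.summary, aes(x = k, y = Mis_Class)) + geom_line() + xlab("K") + ylab("Mis-Classification (%)")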