Thursday, April 30, 2015

Relation between 2 distributions

We have all faced a situation where we wanted to know how similar one distribution is to another. Okay, you got me, correlation is an answer, but what if I want to know both how similar the distributions are and how far away they are from each other?
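To see why correlation alone does not answer the second question, here is a tiny sketch with made-up numbers (the names a, b, corr and gap are mine, purely for illustration). The second series is just the first one shifted by 100, so the two move together perfectly while sitting far apart.

```r
a = c(1,2,3,4,5)
b = a + 100               # same shape, shifted far away from a
corr = cor(a, b)          # Pearson correlation: perfect linear relationship
gap = mean(b) - mean(a)   # how far apart the two series sit on average
print(corr)
print(gap)
```

Correlation says nothing about the gap, which is what the distance measures are meant to capture.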

In my earlier blog post, I took a crude approach. It worked because I only needed a rough figure, but if you want statistically backed numbers, there are plenty of methodologies out there:

http://en.wikipedia.org/wiki/Bhattacharyya_distance
http://en.wikipedia.org/wiki/Mahalanobis_distance
http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

We will now compare the results of each of the above methods on various series.

  • series1 = c(1,2,3,4,5)
    series2 = c(1,2,3,4,5)
Bhattacharyya Distance : 0
Mahalanobis Distance : 0
Kullback-Leibler Divergence : 0

  • series1 = c(1,2,3,4,5)
    series2 = c(1,2,3,4,10)
Bhattacharyya Distance : 0.182661
Mahalanobis Distance : 0.03571429
Kullback-Leibler Divergence : 0.444719

  • series1 = c(1,2,3,4,5)
    series2 = c(1,2,3,4,20)
Bhattacharyya Distance : 0.7277557
Mahalanobis Distance : 0.25
Kullback-Leibler Divergence : 1.201438

  • series1 = c(1,2,3,4,5)
    series2 = c(1,2,3,4,50)
Bhattacharyya Distance : 2.305805
Mahalanobis Distance : 1.35
Kullback-Leibler Divergence : 2.191514

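From the above results we can see that the Bhattacharyya distance and the Kullback-Leibler divergence are better measures of the divergence of two series: their values do not change as rapidly with a single outlier. As the outlier goes from 10 to 50, the Mahalanobis figure grows roughly 38-fold, against roughly 13-fold for Bhattacharyya and 5-fold for Kullback-Leibler.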
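The four comparisons above can be reproduced in one go with a small helper. This is only a sketch: the name compare_series is mine, and it uses the same expressions as the code at the end of this post, simplified for single series so that plain variances stand in for the 1x1 covariance matrices.

```r
# Hypothetical helper: same expressions as the post's code, written for
# plain numeric vectors, so var() replaces the 1x1 covariance matrix.
compare_series = function(series1, series2) {
  var1 = var(series1); var2 = var(series2)
  mean1 = mean(series1); mean2 = mean(series2)
  meanAverage = (mean1 + mean2) / 2
  seriescov = (var1 + var2) / 2
  # quadratic term as used in the post: scaled by 1/8, divided by meanAverage
  quad = (1/8) * (mean1 - mean2)^2 / meanAverage
  c(bhatt = quad + 0.5 * log(seriescov / sqrt(var1 * var2)),
    mahal = quad,
    kullback = 0.5 * (var1 / var2 + (mean2 - mean1)^2 / var2 - 1 + log(var2 / var1)))
}

series1 = c(1,2,3,4,5)
outliers = c(5, 10, 20, 50)   # last element of series2 in each comparison
results = t(sapply(outliers, function(x) compare_series(series1, c(1,2,3,4,x))))
rownames(results) = paste0("last=", outliers)
print(round(results, 6))
```

Each row of results corresponds to one of the bullet points above, with the columns in the same order as the three figures listed there.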


Now let us get on to serious business: comparing stock prices. The stocks that I have chosen are
1) Apple : AAPL
2) Amazon : AMZN

I have considered 2 months of daily trading data.

Bhattacharyya Distance : 31.17678
Mahalanobis Distance : 31.10953
Kullback-Leibler Divergence : 2416.31

Let us take another example with a lot of volatility.

Bhattacharyya Distance : 6.910062
Mahalanobis Distance : 6.415947
Kullback-Leibler Divergence : 42.18934

We see that all three values have decreased. This is because the two stock prices are closer to each other. We still need to find a way to plug these distance values into the correlation analysis.

The code for the various distances is given below.

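library(matrixcalc)  # provides matrix.trace()

series1 = c(1,2,3,4,5)
series2 = c(1,2,3,4,50)
# 1x1 covariance matrices (the sample variances) and the means of each series
cov1 = cov(as.matrix(series1))
cov2 = cov(as.matrix(series2))
mean1 = mean(series1)
mean2 = mean(series2)
meanAverage = (mean1+mean2)/2
seriescov = (cov1+cov2) / 2
# Quadratic term shared by the two figures below. It is scaled by 1/8 and
# divided by meanAverage, which is what produced the numbers quoted above.
# The textbook formulas use the averaged covariance (seriescov) in that
# position, and the textbook Mahalanobis distance has no 1/8 factor.
quad = (1/8) * t(as.matrix(mean1-mean2)) %*% solve(as.matrix(meanAverage)) %*% as.matrix(mean1-mean2)
bhatt = quad + 0.5*log(seriescov/sqrt(cov1*cov2))
mahal = quad
# KL(series1 || series2), treating both series as normally distributed
kullback = 0.5 * ( ( matrix.trace(solve(as.matrix(cov2))%*%as.matrix(cov1)) ) + (t(mean2 -mean1) %*% solve(cov2) %*% as.matrix(mean2-mean1) ) - 1 + log(det(as.matrix(cov2))/det(as.matrix(cov1))))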
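
For comparison, here is a minimal sketch of the textbook closed forms for two single series, assuming each one is roughly normally distributed. The function names are mine, and using the averaged variance as the common variance in the Mahalanobis distance is my own assumption for this sketch. Unlike the code above, the quadratic term here is divided by the averaged variance rather than by the average of the means, so the numbers will differ from those quoted earlier.

```r
# Textbook Gaussian closed forms for 1-D series (hypothetical helper names).
bhattacharyya_normal = function(x, y) {
  v = (var(x) + var(y)) / 2                  # averaged variance
  (1/8) * (mean(x) - mean(y))^2 / v + 0.5 * log(v / sqrt(var(x) * var(y)))
}

# Mahalanobis distance between the two means. Taking the averaged variance
# as the common variance is an assumption of this sketch.
mahalanobis_means = function(x, y) {
  v = (var(x) + var(y)) / 2
  sqrt((mean(x) - mean(y))^2 / v)
}

# KL(x || y) for two normal distributions; not symmetric in x and y.
kl_normal = function(x, y) {
  0.5 * (var(x) / var(y) + (mean(y) - mean(x))^2 / var(y) - 1 + log(var(y) / var(x)))
}

series1 = c(1,2,3,4,5)
series2 = c(1,2,3,4,10)
print(bhattacharyya_normal(series1, series2))
print(mahalanobis_means(series1, series2))
print(kl_normal(series1, series2))
print(kl_normal(series2, series1))
```

Swapping the arguments of kl_normal gives a different value, which is one reason the Kullback-Leibler divergence is not a true distance.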




