We have all faced a situation where we wanted to know how similar one distribution is to another. Okay, you got me, correlation is one answer, but what if I want to know how similar or how far apart two distributions really are from each other?
In my earlier blog post I took a rather crude approach. It worked because I only needed a rough figure, but if you want statistically backed numbers, there are several well-established measures out there:
http://en.wikipedia.org/wiki/Bhattacharyya_distance
http://en.wikipedia.org/wiki/Mahalanobis_distance
http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
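For reference, a minimal summary (not spelled out in the links above): if each series is summarised by its sample mean and variance and treated as a univariate Gaussian, the textbook closed forms of these three quantities are

\[
D_B = \frac{1}{8}\,\frac{(\mu_1-\mu_2)^2}{\sigma^2} + \frac{1}{2}\ln\frac{\sigma^2}{\sigma_1\sigma_2},
\qquad \sigma^2 = \frac{\sigma_1^2+\sigma_2^2}{2}
\]
\[
D_M = \frac{|\mu_1-\mu_2|}{\sigma}
\qquad\text{(for a common standard deviation } \sigma\text{)}
\]
\[
D_{KL}(\mathcal{N}_1\,\|\,\mathcal{N}_2) = \frac{1}{2}\left(\frac{\sigma_1^2}{\sigma_2^2}
+ \frac{(\mu_2-\mu_1)^2}{\sigma_2^2} - 1 + \ln\frac{\sigma_2^2}{\sigma_1^2}\right)
\]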
We will now compare the results of each of the above methods on a few test series:
- series1 = c(1,2,3,4,5)
  series2 = c(1,2,3,4,5)
  Mahalanobis Distance : 0
  Kullback-Leibler Divergence : 0
- series1 = c(1,2,3,4,5)
  series2 = c(1,2,3,4,10)
  Mahalanobis Distance : 0.03571429
  Kullback-Leibler Divergence : 0.444719
- series1 = c(1,2,3,4,5)
  series2 = c(1,2,3,4,20)
  Mahalanobis Distance : 0.25
  Kullback-Leibler Divergence : 1.201438
- series1 = c(1,2,3,4,5)
  series2 = c(1,2,3,4,50)
  Mahalanobis Distance : 1.35
  Kullback-Leibler Divergence : 2.191514
From the above results we can see that the Bhattacharyya distance and the Kullback-Leibler divergence are the better measures of the divergence between two series: their values do not change as rapidly when a single outlier is introduced.
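To make this easy to reproduce, here is a minimal sketch, assuming a small helper kl_gauss() that is not part of the original code, which recomputes the Kullback-Leibler column above by treating each series as a univariate Gaussian:

# hypothetical helper (not in the original post): KL divergence between two
# series, each summarised as a univariate Gaussian
kl_gauss = function(s1, s2) {
  m1 = mean(s1); m2 = mean(s2)
  v1 = var(s1);  v2 = var(s2)
  0.5 * (v1 / v2 + (m2 - m1)^2 / v2 - 1 + log(v2 / v1))
}
for (outlier in c(5, 10, 20, 50)) {
  print(kl_gauss(c(1,2,3,4,5), c(1,2,3,4, outlier)))
}
# prints 0, 0.444719, 1.201438, 2.191514 -- the values reported above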
Now let us get on to more serious business: comparing stock prices. The stocks I have chosen are
1) Apple : AAPL
2) Amazon : AMZN
I have considered two months of daily trading data.
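The data source is not shown here; as a minimal sketch, assuming the quantmod package and a placeholder date range (not necessarily how the data above was actually pulled), two months of daily closes could be fetched like this:

# assumption: fetch roughly two months of daily prices with quantmod; the date range is a placeholder
library(quantmod)
getSymbols(c("AAPL", "AMZN"), src = "yahoo", from = "2013-06-01", to = "2013-08-01")
series1 = as.numeric(Cl(AAPL))   # daily closing prices for Apple
series2 = as.numeric(Cl(AMZN))   # daily closing prices for Amazon

For my two-month window, the numbers come out as follows: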
Bhattacharyya Distance : 31.17678
Mahalanobis Distance : 31.10953
Kullback-Leibler Divergence : 2416.31
Let us take another example with a lot of volatility
Bhattacharyya Distance : 6.910062
Mahalanobis Distance : 6.415947
Kullback-Leibler Divergence : 42.18934
We see that almost all the values have decreased. This is because the two stock price series are closer to each other. We now need to find a way to take these distance values and plug them into a correlation-style analysis.
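As a minimal sketch of one possibility (not a method established here), a distance can be squashed into a bounded similarity score that can sit next to a correlation:

# illustration only: map a non-negative distance to a similarity in (0, 1]
distance_to_similarity = function(d) 1 / (1 + d)
distance_to_similarity(42.18934)   # ~0.023  (the volatile pair)
distance_to_similarity(2416.31)    # ~0.0004 (the AAPL/AMZN pair)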
The code for the various distances is given below:
library(matrixcalc)   # provides matrix.trace()
# two sample series; series2 carries one large outlier
series1 = c(1,2,3,4,5)
series2 = c(1,2,3,4,50)
# sample variance and mean of each series
cov1 = cov(as.matrix(series1))
cov2 = cov(as.matrix(series2))
mean1 = mean(series1)
mean2 = mean(series2)
# average of the two means and average of the two covariances
meanAverage = (mean1 + mean2) / 2
seriescov = (cov1 + cov2) / 2
# Bhattacharyya distance: squared mean-difference term (normalised by meanAverage) plus a log term comparing the variances
bhatt = (1/8) * t(as.matrix(mean1 - mean2)) %*% solve(as.matrix(meanAverage)) %*% as.matrix(mean1 - mean2) +
  0.5 * log(seriescov / sqrt(cov1 * cov2))
# the squared mean-difference term on its own (reported above as the Mahalanobis distance)
mahal = (1/8) * t(as.matrix(mean1 - mean2)) %*% solve(as.matrix(meanAverage)) %*% as.matrix(mean1 - mean2)
# Kullback-Leibler divergence between Gaussians fitted to the two series
kullback = 0.5 * (matrix.trace(solve(as.matrix(cov2)) %*% as.matrix(cov1)) +
  t(mean2 - mean1) %*% solve(cov2) %*% as.matrix(mean2 - mean1) -
  1 + log(det(as.matrix(cov2)) / det(as.matrix(cov1))))
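Running this snippet as given (with series2 = c(1,2,3,4,50)) reproduces the last row of the toy comparison above: mahal evaluates to 1.35 and kullback to 2.191514.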