We have all faced a situation where we wanted to know how similar one distribution is to another. Okay, you got me, correlation is one answer, but what if I want to know not just how similar two distributions are, but how far apart they lie?


In my earlier blog post I took a crude approach. It worked because I only needed a rough figure, but if you want statistically backed numbers, there are several well-established measures:

http://en.wikipedia.org/wiki/Bhattacharyya_distance

http://en.wikipedia.org/wiki/Mahalanobis_distance

http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
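For univariate data, all three measures have closed forms once each series is summarized as a Gaussian N(μ, σ²). Here is a sketch in Python (an illustration only; the results below were produced in R, and the pooled variance (σ₁² + σ₂²)/2 is assumed for the Bhattacharyya and Mahalanobis terms):

```python
import math

def bhattacharyya(m1, v1, m2, v2):
    """Bhattacharyya distance between N(m1, v1) and N(m2, v2),
    using the pooled variance (v1 + v2) / 2."""
    v = (v1 + v2) / 2.0
    return (m1 - m2) ** 2 / (8.0 * v) + 0.5 * math.log(v / math.sqrt(v1 * v2))

def mahalanobis(m1, m2, v):
    """Mahalanobis distance between two means under a shared variance v."""
    return abs(m1 - m2) / math.sqrt(v)

def kl_divergence(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)); note that KL is not symmetric."""
    return 0.5 * (v1 / v2 + (m2 - m1) ** 2 / v2 - 1.0 + math.log(v2 / v1))

# Identical distributions sit at distance zero under all three measures
print(bhattacharyya(3, 2.5, 3, 2.5))   # 0.0
print(kl_divergence(3, 2.5, 3, 2.5))   # 0.0
```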

Let us now compare the results of each of these methods on a few simple series.

- series1 = c(1,2,3,4,5), series2 = c(1,2,3,4,5)
  Mahalanobis Distance: 0
  Kullback-Leibler Divergence: 0

- series1 = c(1,2,3,4,5), series2 = c(1,2,3,4,10)
  Mahalanobis Distance: 0.03571429
  Kullback-Leibler Divergence: 0.444719

- series1 = c(1,2,3,4,5), series2 = c(1,2,3,4,20)
  Mahalanobis Distance: 0.25
  Kullback-Leibler Divergence: 1.201438

- series1 = c(1,2,3,4,5), series2 = c(1,2,3,4,50)
  Mahalanobis Distance: 1.35
  Kullback-Leibler Divergence: 2.191514
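As a sanity check, the KL figure for the last pair can be reproduced from the sample means and variances alone, treating each series as a Gaussian (a quick Python check; the post's own numbers come from R):

```python
import math
import statistics

# The last pair of toy series from the comparison above
s1 = [1, 2, 3, 4, 5]
s2 = [1, 2, 3, 4, 50]

# Sample mean and sample variance of each series (same as R's mean() and var())
m1, v1 = statistics.mean(s1), statistics.variance(s1)  # 3, 2.5
m2, v2 = statistics.mean(s2), statistics.variance(s2)  # 12, 452.5

# KL(N(m1, v1) || N(m2, v2)) for univariate Gaussians
kl = 0.5 * (v1 / v2 + (m2 - m1) ** 2 / v2 - 1 + math.log(v2 / v1))
print(round(kl, 6))  # 2.191514
```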

From these results we can see that the Bhattacharyya distance and the Kullback-Leibler divergence are better measures of the divergence between two series: their values do not change as rapidly when a single outlier is introduced.

Let us now get down to serious business: comparing stock prices. The stocks I have chosen are

1) Apple : AAPL

2) Amazon : AMZN

I have considered 2 months of daily trading data

Bhattacharyya Distance: 31.17678

Mahalanobis Distance: 31.10953

Kullback-Leibler Divergence: 2416.31

Let us take another example with a lot of volatility

Bhattacharyya Distance: 6.910062

Mahalanobis Distance: 6.415947

Kullback-Leibler Divergence: 42.18934

We see that almost all the values have decreased. This is because the two stock prices are closer to each other. We still need to find a way to take these distance values and plug them into the correlation analysis.
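One simple option for folding a distance into a correlation-style scale (my own suggestion, not something established above) is to map each non-negative distance through exp(-d), which scores identical distributions as 1 and increasingly divergent ones toward 0:

```python
import math

def similarity(distance):
    """Map a non-negative distance onto a (0, 1] correlation-like scale.
    exp(-d) gives 1 for identical distributions and decays toward 0."""
    return math.exp(-distance)

# The KL figures reported for the toy series above, rescaled to similarities
for d in [0, 0.444719, 1.201438, 2.191514]:
    print(round(similarity(d), 4))
```

This preserves the ordering of the distances while bounding the score, so series pairs can be ranked on a common scale.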

The code for the various distances is mentioned below

library(matrixcalc) # needed for matrix.trace()

series1 = c(1,2,3,4,5)
series2 = c(1,2,3,4,50)

# Sample covariance (a 1x1 matrix here) and mean of each series
cov1 = cov(as.matrix(series1))
cov2 = cov(as.matrix(series2))
mean1 = mean(series1)
mean2 = mean(series2)

# Pooled covariance of the two series
seriescov = (cov1 + cov2) / 2

# Bhattacharyya distance:
# (1/8)(m1-m2)' S^-1 (m1-m2) + 0.5*log(det(S)/sqrt(det(S1)*det(S2))), S = pooled covariance
bhatt = (1/8) * t(as.matrix(mean1-mean2)) %*% solve(seriescov) %*% as.matrix(mean1-mean2) + 0.5*log(det(seriescov)/sqrt(det(cov1)*det(cov2)))

# Mahalanobis distance between the two means under the pooled covariance
mahal = sqrt(t(as.matrix(mean1-mean2)) %*% solve(seriescov) %*% as.matrix(mean1-mean2))

# Kullback-Leibler divergence KL(N(mean1, cov1) || N(mean2, cov2))
kullback = 0.5 * ( matrix.trace(solve(as.matrix(cov2)) %*% as.matrix(cov1)) + t(mean2-mean1) %*% solve(cov2) %*% as.matrix(mean2-mean1) - 1 + log(det(as.matrix(cov2))/det(as.matrix(cov1))) )
