Thursday, April 30, 2015

Relation between 2 distributions

Often we must have faced a situation where we wanted to know, how similar is a distribution compared to the other one. Okay, you got me, correlation is an answer, but what if I want to know how similar and far away are the distributions from each other

In my earlier blog post, I was adopting a crude manner. It worked because I needed a rough figure, but if you want statistically backed numbers, there are a lot of methodologies out there

We will now compare the results for each of the above methods for various series

  • series1 = c(1,2,3,4,5)
    series2 = c(1,2,3,4,5)
Bhattacharya Distance : 0
Mahalanobis Distance : 0
Kullback Liebler Divergence : 0

  • series1 = c(1,2,3,4,5)
    series2 = c(1,2,3,4,10)
Bhattacharya Distance : 0.182661
Mahalanobis Distance : 0.03571429
Kullback Liebler Divergence : 0.444719

  • series1 = c(1,2,3,4,5)
    series2 = c(1,2,3,4,20)
Bhattacharya Distance : 0.7277557
Mahalanobis Distance : 0.25
Kullback Liebler Divergence : 1.201438
  • series1 = c(1,2,3,4,5)
    series2 = c(1,2,3,4,50)
Bhattacharya Distance : 2.305805
Mahalanobis Distance : 1.35
Kullback Liebler Divergence : 2.191514

From the above results we can see that Bhattacharya Distance and Kullback Liebler Divergence is a better measure of the divergence of two series. Their movement does not change very rapidly with one outlier

If we now get on to serious business, comparing stock prices. The stocks that I have chosen are
1) Apple : AAPL
2) Amazon : AMZN

I have considered 2 months of daily trading data

Bhattacharya Distance : 31.17678
Mahalanobis Distance : 31.10953
Kullback Liebler Divergence : 2416.31

Let us take another example with a lot of volatility

Bhattacharya Distance : 6.910062
Mahalanobis Distance : 6.415947
Kullback Liebler Divergence :42.18934

We see that almost all the values have decreased. This is because the two stock prices are closer to each other. WE need to find a way in which we can use the distance values and plug it into the correlation analysis

The code for the various distances is mentioned below

series1 = c(1,2,3,4,5)
series2 = c(1,2,3,4,50)
cov1 = cov(as.matrix(series1))
cov2 = cov(as.matrix(series2))
mean1 = mean(series1)
mean2 = mean(series2)
meanAverage = (mean1+mean2)/2
seriescov = (cov1+cov2) / 2
bhatt = (1/8) * t(as.matrix(mean1-mean2)) %*% solve(as.matrix(meanAverage)) %*% as.matrix(mean1-mean2) + 0.5*log(seriescov/sqrt(cov1*cov2))
mahal = (1/8)*t(as.matrix(mean1-mean2)) %*% solve(as.matrix(meanAverage)) %*% as.matrix(mean1-mean2)
kullback = 0.5 * ( ( matrix.trace(solve(as.matrix(cov2))%*%as.matrix(cov1)) ) + (t(mean2 -mean1) %*% solve(cov2) %*% as.matrix(mean2-mean1) ) - 1 + log(det(as.matrix(cov2))/det(as.matrix(cov1))))

Sunday, April 19, 2015

Coursera Course Distribution

Coursera has revolutionized online education. Like me it has given hope to millions of students worldwide who want to study different subjects. For fun,  I wanted to look at the distribution of languages of the various courses available

  • Distribution by Region ( excluding English Courses )
The file which contains the data is present here

I used R and ggplot to create a pie chart

languages = read.csv("Languages.csv",header=TRUE)
continents = sqldf("select Continent,sum(Number) Count from languages group by Continent")
ggplot(continents, aes(x="", y=Count, fill=Continent)) + geom_bar(width = 1, stat = "identity") + coord_polar(theta="y") + ylab("Number of Courses")

  • Distribution of Asian Courses
  •  Distribution of European Courses

R Tips - SQLDF, Lists

R is a wonderful language, and I hope more and more enthusiasts contribute to it. However like me, almost everybody is stuck with small irritating issues.

  • Joining/Filtering tables using SQLDF in R
SQLDF is a very powerful module. It brings the power of SQL to the already statistically powerful R

crestValues = subset(testdata,Direction=='Crest')  

# Simple Select Query with SQLDF in R
out=sqldf("select Date,Close,JulianDate from crestValues order by JulianDate asc")

#  Inner Join  with SQLDF in R

finaltables11 = sqldf("select out2.lowJDate,out2.highJDate from tables11 inner join out2 where tables11.s2==out2.s2)

# Left Outer Join with SQLDF in R
tempt=sqldf("select t1.Close,max(t2.Close),t1.JulianDate as lowJDate,t2.JulianDate as highJDate from out t1 left outer join out t2 where t2.JulianDate > t1.JulianDate group by t1.Close")

# Fetch First Rows in a SQLDF R Query
tables11 = sqldf("select sum(highJDate-lowJDate),Count,s2 from tables2 group by Count,s2 limit 1")
  •   Accessing Data Frames using strings
There might be occassions where we have made a lot of data frames and we would like to access them based on user input/flow logic. Surprisingly this was a difficult task

Suppose we have 5 data frames created  
tables5 = sqldf("select count(s5) as Count,s5,lowJDate,highJDate from out2 group by s5 having count(s5) = 2 order by count(s5) desc")
  tables4 = sqldf("select count(s4) as Count,s4,lowJDate,highJDate from out2 group by s4 having count(s4) = 2 order by count(s4) desc")
  tables3 = sqldf("select count(s3) as Count,s3,lowJDate,highJDate from out2 group by s3 having count(s3) = 2 order by count(s3) desc")
  tables2 = sqldf("select count(s2) as Count,s2,lowJDate,highJDate from out2 group by s2 having count(s2) = 2 order by count(s2) desc")
  tables1 = sqldf("select count(s1) as Count,s1,lowJDate,highJDate from out2 group by s1 having count(s1) = 2 order by count(s1) desc")


Now I want to access each of the data frames in a for loop. The way I did it was

mylist = list(tables1,tables2,tables3,tables4,tables5)
for(k in 5:1)
    #tables11 = sqldf("select sum(highJDate-lowJDate),Count,s1 from tables1 group by Count,s1 limit 1")


Thursday, April 16, 2015

ExoPlanet Analysis : Numerical to Categorical Data in R

After the recent discovery of a possible man made object by amateur outer space researchers, I thought that I should lay some hands on data from Outer space. I made use of the Exoplanet Orbit Database and the Exoplanet Data Explorer at

I used 2 variables Density and Gravity. We all know that more the density, more the gravity, so I just wanted to plot it. The problem was that both these columns were numerical values. This makes it difficult to plot. So i had to convert one of the variables to a category variable

I used the sapply function which proved to be very powerful

exoplanets$densityCategory = sapply(exoplanets$DENSITY,function(x) 
else if(x<2) {'less than 2'}
else if(x>=2) {'>2'})
We can always add more categories. So after this, it was a matter of plotting the data.

Sunday, April 5, 2015

Finding patterns in the Maxima/Minima Resistance/Support of Stock Prices

I have frequently seen people trying to find a pattern in the highs and lows of a stock. What they search for is a value that defines the upper limit and lower limit of the stock prices for the short term. This banding allows people to have an increased probability of profit making while taking position.This is usually denoted by a straight line drawn manually on the daily stock price chart. Like the one shown below

However this is easier done manually than automatically.

I plan to reproduce this with the help of an R code. I followed the following process

1) Mark all the "Crests" and "Troughs" on the graph
2) Find the "Crests" and "Troughs" with the maximum similar reflection points
3) These undoubtedly become your resistance/support plots
4) However, we need to find out when the resistance/support has changed ( this usually happens due to any micro/macro factor ). But this is the most difficult part
5) We need to find an algorithm for doing this
6) Also the resistance/support lines in many stocks might not be a zero slope line. They might have some gradient. It is difficult to identify them

The above is the plot of AAPL ( Apple Inc ) on NASDAQ. I have been able to Crests/Troughs but it seems the resistance line is going to have a slope. I need to factor this into my code too

After factoring in the slopes for finding the support plots/trends there was significant improvements in the plots. The code can be found here . However the input data required for this consists of daily stock prices. This can be found as a separate R script here
You need to replace the system file paths in the scripts

Some sample graphs from this code are

Some more refinement is required. Any comments or reviews are welcome