Sunday, March 22, 2015

Historical Finance News Feed

I scraped some financial news data for the year 2015. The file is present here. I am now scraping data for the past 15-20 years across all subject lines (Commodities, FX, Bonds, etc.).

Scraping the data is not enough. We also want to link the news to its effect on stock prices so that it can be used for prediction/forecasting.

ANALYSIS 1
I took the daily stock prices and computed the spread between the High and Open prices. Any day on which this spread, once standardised, lies more than 3 SDs from its mean (the threshold used in the code below) is marked as a potential news day. We then take the news from these days and mark it as having POSITIVE sentiment on the stock.
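
To make the rule concrete, here is a minimal sketch on made-up prices. The `toy` data frame and all of its values are assumptions for illustration only, not real GOOG data. It standardises the High minus Open spread and keeps the days beyond 3 standard deviations, the threshold used in the code further down.

```r
# Toy data: 20 trading days, values invented purely for illustration
toy <- data.frame(
  Date = seq(as.Date("2015-01-01"), by = "day", length.out = 20),
  Open = rep(100, 20),
  High = 100 + c(rep(c(1, 2), 9), 1, 30)  # last day has an unusually large move
)

# Spread between the day's High and its Open
toy$OpenHighSpread <- toy$High - toy$Open

# Standardise the spread (z-score): subtract the mean, divide by the SD
toy$OpenHigh <- (toy$OpenHighSpread - mean(toy$OpenHighSpread)) / sd(toy$OpenHighSpread)

# Keep the days more than 3 standard deviations from the mean
toyMajor <- toy[abs(toy$OpenHigh) > 3, ]
print(toyMajor)
```

Only the last day is flagged here. With only a handful of rows no observation can reach 3 SDs, so the rule needs a reasonably long price history.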

Code
# Accessing the news feed content in R
library(XML)
library(RCurl)
# Find the large deviations and check whether there is related news, and vice versa

# We will read data from the master source file of RICS
Tickers = read.csv('C:\\Anant\\MyLearning\\Statistics\\SpreadAnalysis\\WorldTickerList.csv')

# For now we will take the example of GOOGLE in that list
Tickers = Tickers[Tickers$TICKER=='GOOG',]

# We will use the Ticker value and download data from Yahoo Finance
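# You can also customise the date range in the query string:
# a/b/c = start month (0-based), day, year; d/e/f = end month (0-based), day, year; g=d means daily data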
URL = paste(c('http://real-chart.finance.yahoo.com/table.csv?s=',as.character(Tickers$TICKER),'&a=00&b=01&c=2015&d=08&e=30&f=2015&g=d&ignore=.csv'),collapse="")
GOOG = read.csv(URL)

GOOG$OpenHighSpread = GOOG$High - GOOG$Open
GOOG$LowHighSpread = GOOG$High - GOOG$Low
GOOG$OpenHigh = (GOOG$OpenHighSpread - mean(GOOG$OpenHighSpread))/sd(GOOG$OpenHighSpread)
GOOG$LowHigh = (GOOG$LowHighSpread - mean(GOOG$LowHighSpread))/sd(GOOG$LowHighSpread)

MajorPoints = GOOG[GOOG$OpenHigh < -3 | GOOG$OpenHigh > 3,]

# We see that there were 4 dates when there was a lot of deviation in the Open High
# There must have been some news around these dates

#############################################################################################
# Source 1 : GOOGLE

#############################################################################################
# source 2 Reuters
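# Collect the news text for each date in a list keyed by the date string
reutersNews = list()
for(dateValue in as.character(MajorPoints$Date))
{
  newsDateURL = paste(c('http://www.reuters.com/finance/stocks/companyNews?symbol=GOOG.O&date=',format(as.Date(dateValue,"%Y-%m-%d"),"%m%d%Y")),collapse="")
  #newsDateURL = paste(c(newsURL,'&startdate=',dateValue,'&enddate=',dateValue),collapse="")
  print(newsDateURL)
  doc = getURL(newsDateURL)
  doc = htmlParse(doc, asText = TRUE)
  # xmlValue extracts the text of each matching paragraph node
  reutersNews[[dateValue]] = xpathSApply(doc,'//div[@id = "companyNews"]/div/div/div/p',xmlValue)
}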

#############################################################################################
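# Source 3: Yahoo Finance
# Collect the headline text for each date in a list keyed by the date string
yahooNews = list()
for(dateValue in as.character(MajorPoints$Date))
{
  # Assumes the date is passed through the t= query parameter
  newsDateURL = paste(c('http://finance.yahoo.com/q/h?s=',as.character(Tickers$TICKER),'&t=',dateValue),collapse="")
  #newsDateURL = paste(c(newsURL,'&startdate=',dateValue,'&enddate=',dateValue),collapse="")
  print(newsDateURL)
  doc = getURL(newsDateURL)
  doc = htmlParse(doc, asText = TRUE)
  # xmlValue extracts the text of each matching headline link
  yahooNews[[dateValue]] = xpathSApply(doc,'//div[@class = "mod yfi_quote_headline withsky"]/ul/li//a',xmlValue)
}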


The next step is to do some language processing on this data.
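
As a rough idea of what a first pass could look like, here is a minimal sketch in base R that tokenises headlines and counts term frequencies. The `headlines` vector is made up and stands in for the scraped text, and the stop-word list is an illustrative assumption, so treat this as a starting point rather than the actual pipeline.

```r
# Hypothetical headlines standing in for the scraped news text
headlines <- c("Google shares jump after strong earnings",
               "Google earnings beat estimates",
               "Analysts raise Google price target")

# Lower-case, strip everything except letters and spaces, split on whitespace
tokens <- unlist(strsplit(gsub("[^a-z ]", "", tolower(headlines)), " +"))

# Drop a small, purely illustrative stop-word list
stopWords <- c("after", "the", "a", "of")
tokens <- tokens[!tokens %in% stopWords]

# Term frequencies, most common first
termFreq <- sort(table(tokens), decreasing = TRUE)
print(head(termFreq, 5))
```

Terms that keep showing up on the flagged days would be natural candidates for a POSITIVE sentiment word list.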

3 comments:

pumplove said...

Hey Anant, thanks for the post, Sir. I am planning to do some contextual analysis on the same historical news feeds that you are talking about. But here's the deal: this scraping business isn't exactly legal, right? Please don't get me wrong, it's OK as long as you are doing it for a personal project, but what if I want to integrate this into an app that I am building? Are you aware of open source feeds that I could use? Or, if I were to pay Reuters, do you have any experience with subscribing to their services? Sorry if my query came off as rude or whatever, but I am coming from the perspective of being cautious while doing it for a business idea that I have.

Regards,
vikram

Unknown said...

Sorry dear, your link to download the previous 10 years of news feed for any stock is not working.

Not so Smoked said...

Hi Anant, What's the update on your attempt? Were you able to create one?