Monday, September 11, 2017

Bulk download stock data from Yahoo finance with R

stocks

StockScraper

So a slow weekend means working on some of my non-bioinformatics projects. This time it was writing an R script to scrape historical stock data from Yahoo finance. This comes after Yahoo broke everyone’s scripts (including one I had written in bash) by changing their API to require a cookie/crumb pair. I won’t go into detail about how I solved that problem (regex), or much about the script itself, but feel free to have a look at it here. The important thing is that it works.

I should note that another R package that I like very much, quantmod, includes the similar function getSymbols(). There are a few things I dislike about getSymbols(), namely that when downloading multiple stocks it loads each one as a separate variable in the environment. This sucks if you’re downloading an entire exchange like NYSE. The other downside is not being able to limit the date range, which again is useful when dealing with large numbers of stocks.

With that, let’s take StockScraper for a spin. First we need to source it since I’m too lazy to package it:

source("https://raw.githubusercontent.com/ScientistJake/StockScraper.R/master/StockScraper.R")

As you can see, the script includes two functions. The primary function, stockhistoricals(), downloads the data. The helper function, get_stocklists(), retrieves the stock lists for NYSE, AMEX and NASDAQ. It also pulls in a lot of good stock metadata, which could be useful later when building a correlation analysis.

Let’s bulk download NASDAQ as an example. Here I’m spelling out the arguments explicitly, but you can also rely on the defaults listed at the top of the function. I usually run it with verbose=TRUE to monitor progress, but that would look like crap in this markdown!

NASDAQ <- stockhistoricals(stocklist="NASDAQ", start_date = "2016-09-11", end_date = "2017-09-11", verbose = FALSE)

So now we have a year’s worth of historical stock prices for the entire NASDAQ exchange. The data is stored as a list of data frames named for the stock tickers. Check it out:

#list the first ten stocks
names(NASDAQ)[1:10]
##  [1] "PIH"  "TURN" "FLWS" "FCCY" "SRCE" "VNET" "TWOU" "JOBS" "CAFD" "EGHT"

We can retrieve individual stock data using standard R notation:

#check out GOOG
head(NASDAQ$GOOG)
##         Date   Open   High     Low  Close Adj.Close  Volume
## 1 2016-09-12 755.13 770.29 754.000 769.02    769.02 1311000
## 2 2016-09-13 764.48 766.22 755.800 759.69    759.69 1395000
## 3 2016-09-14 759.61 767.68 759.110 762.49    762.49 1087400
## 4 2016-09-15 762.89 773.80 759.960 771.76    771.76 1305100
## 5 2016-09-16 769.75 769.75 764.660 768.88    768.88 2049300
## 6 2016-09-19 772.42 774.00 764.441 765.70    765.70 1172800
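Once a ticker is pulled out of the list it is an ordinary data frame, so narrowing it to a date window is plain row subsetting. The sketch below uses a small made-up data frame (`toy`, with invented prices) rather than the real download, and it assumes the Date column may arrive as text, so it converts it first.

```r
# Toy stand-in for one element of the list returned by stockhistoricals().
# The prices are made up; only the column layout mirrors the post.
toy <- data.frame(
  Date      = c("2016-09-12", "2016-09-13", "2016-09-14", "2016-09-15"),
  Adj.Close = c(10.0, 10.5, 10.2, 10.8),
  stringsAsFactors = FALSE
)

# Convert to Date so comparisons are chronological rather than textual.
# This is harmless if the column is already of class Date.
toy$Date <- as.Date(toy$Date)

# Keep only the rows that fall inside the window of interest (inclusive).
in_window <- toy[toy$Date >= as.Date("2016-09-13") &
                 toy$Date <= as.Date("2016-09-14"), ]
in_window
```

The same pattern should apply to a real element such as NASDAQ$GOOG, provided its Date column converts cleanly.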

If you’re comfortable with lists, you can work directly with the list for simple analyses:

#Get the average adjusted close price for GOOG
mean(NASDAQ$GOOG$Adj.Close)
## [1] 852.4343
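The same idea extends to every ticker at once, which is where the list format pays off. Here is a minimal sketch using sapply() on a toy list (`toy_list`, with hypothetical tickers and made-up prices) that has the same shape as the downloaded data. The na.rm = TRUE is an assumption that some real series may contain missing values.

```r
# Toy list shaped like the output of stockhistoricals():
# one data frame per ticker, each with an Adj.Close column.
# Tickers and prices are invented for illustration.
toy_list <- list(
  AAA = data.frame(Adj.Close = c(10, 11, 12)),
  BBB = data.frame(Adj.Close = c(20, NA, 22)),
  CCC = data.frame(Adj.Close = c(5, 5, 5))
)

# Average adjusted close per ticker, skipping missing values.
# sapply() returns a numeric vector named by ticker.
avg_close <- sapply(toy_list, function(df) mean(df$Adj.Close, na.rm = TRUE))

# Rank tickers from highest to lowest average price.
top <- sort(avg_close, decreasing = TRUE)
top
```

Swapping `toy_list` for the real list should give one average per downloaded stock, assuming every element carries an Adj.Close column.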

Or we can extract stocks and do fun things like plot them.

#extract GOOG
GOOG <- data.frame(NASDAQ$GOOG)
names(GOOG) <- c("Date","Open","High","Low","Close","Adj.Close","Volume")

#plot it out!
library(ggplot2)
ggplot(GOOG, aes(x = Date, y = Close)) +
  geom_line() +
  labs(title = "GOOG Price", y = "Closing Price", x = "")
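Beyond plotting a single ticker, the list format also lends itself to the kind of correlation analysis mentioned earlier. What follows is only a sketch of one way to do it, not something StockScraper provides: it lines up adjusted closes by date, converts them to daily returns, and correlates the returns. The tickers and prices in `toy_list` are invented.

```r
# Toy list shaped like the downloaded data (invented tickers and prices).
toy_list <- list(
  AAA = data.frame(Date = as.Date("2017-01-02") + 0:4,
                   Adj.Close = c(10, 11, 12, 11, 13)),
  BBB = data.frame(Date = as.Date("2017-01-02") + 0:4,
                   Adj.Close = c(20, 22, 24, 22, 26)),
  CCC = data.frame(Date = as.Date("2017-01-02") + 0:4,
                   Adj.Close = c(8, 7, 6, 7, 5))
)

# Reduce each data frame to Date plus one price column named after its ticker.
per_ticker <- Map(function(df, ticker) {
  out <- df[, c("Date", "Adj.Close")]
  names(out)[2] <- ticker
  out
}, toy_list, names(toy_list))

# Merge on Date so every row holds prices from the same trading day.
wide <- Reduce(function(a, b) merge(a, b, by = "Date"), per_ticker)

# Daily simple returns: (today - yesterday) / yesterday, column by column.
prices  <- as.matrix(wide[, -1])
returns <- diff(prices) / prices[-nrow(prices), ]

# Pairwise Pearson correlation of the return series.
cor_mat <- cor(returns)
round(cor_mat, 2)
```

Correlating returns rather than raw prices is a deliberate choice here, since prices that merely trend in the same direction can look correlated even when their day-to-day moves are unrelated. For a whole exchange the repeated merge() will be slow, so this is better suited to a hand-picked subset of tickers.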

In the future I might bundle StockScraper into an R package along with some of my favorite plotting and clustering wrappers. But for now I'll leave it at that.

Good luck and happy data-mining!

Session

sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1  readr_1.1.1    httr_1.2.1     RCurl_1.95-4.8
## [5] bitops_1.0-6   XML_3.98-1.9  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11     knitr_1.16       magrittr_1.5     hms_0.3         
##  [5] munsell_0.4.3    colorspace_1.3-2 R6_2.2.2         rlang_0.1.2     
##  [9] plyr_1.8.4       stringr_1.2.0    tools_3.3.0      grid_3.3.0      
## [13] gtable_0.2.0     htmltools_0.3.6  lazyeval_0.2.0   yaml_2.1.14     
## [17] rprojroot_1.2    digest_0.6.12    tibble_1.3.3     curl_2.6        
## [21] evaluate_0.10    mime_0.5         rmarkdown_1.6    labeling_0.3    
## [25] stringi_1.1.5    scales_0.4.1     backports_1.1.0
