This post is a fun exercise applying PCA (most in tidyverse framework) to stock data from S&P 500 index companies. With PCA on stock data from S&P 500, we can possibly do many interesting things, but we will focus on really simple things and try to see if PCA captures the general company trends from PCA results.
We will use tidyquant R package to get access to stock price data using R and use prcomp() to perform PCA using tidyverse framework with the help of broom package.
We will start with getting daily stock price data for about 500 companies in the S&P 500 stock index for the year 2022. Then we will explore the data a little and compute daily stock returns for each company.
Then we will go ahead and apply PCA on the daily return data for the year 2022 and try to make sense of the PCs.
First let us load the packages needed.
library(tidyquant) library(broom) library(tidyverse) theme_set(theme_bw(16))
Companies in S&P 500 index
S&P 500 index typically contains about 500 companies in its index. As of the end of 2022, there are 503 companies in S&P 500 index.
We can get the list of all the companies in S&P 500 index, using tidyquant’s tq_index() function. We specify the name of the index, here “SP500” as the argument to tidyquant’s tq_index(). It gives us detailed information regarding the companies in S&P 500 index, including stock ticker name, a companies weight in the index as measured by market cap, and the broad sector it belongs to.
# get S&P 500 index coopanies info tq_index("SP500") %>% head()
## Getting holdings for SP500 ## # A tibble: 10 × 8 ## symbol company identifier sedol weight sector shares_held local_currency ## <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> ## 1 AAPL Apple Inc. 03783310 2046… 0.0587 Infor… 165455870 USD ## 2 MSFT Microsoft C… 59491810 2588… 0.0520 Infor… 82480670 USD ## 3 AMZN Amazon.com … 02313510 2000… 0.0232 Consu… 98203210 USD ## 4 BRK.B Berkshire H… 08467070 2073… 0.0177 Finan… 19935340 USD ## 5 GOOGL Alphabet In… 02079K30 BYVY… 0.0162 Commu… 66088936 USD ## 6 JNJ Johnson & J… 47816010 2475… 0.0147 Healt… 28960104 USD
Let us save the information on the S&P 500 companies in a variable for future use.
stock_list_tbl <- tq_index("SP500") %>% arrange(symbol)
Exploring the S&P 500 companies data a bit, let us look at the number of companies in each sector using a simple barplot.
stock_list_tbl %>% count(sector, sort=TRUE) %>% ggplot(aes(y=reorder(sector,n), x=n, fill=sector))+ geom_col()+ labs(y=NULL, subtitle= "2022 S&P 500 Companies Count by Sector")+ theme(legend.position = "none") ggsave("SP500_index_company_counts_by_sector.png")
Here is a look at the top 15 companies in the S&P 500 index by their weight.
stock_list_tbl %>% arrange(desc(weight))%>% head(15) %>% mutate(company=gsub(" Corporation", "", company), company=gsub(" Inc.","", company), company=gsub(" Limited","", company), company=gsub(" Company","", company), company=gsub(" Inc.","", company), company=gsub(" Incorporated","", company)) %>% ggplot(aes(y=reorder(company,weight), x=weight, fill=sector))+ geom_col()+ labs(y=NULL)+ scale_x_continuous(labels=scales::percent_format()) ggsave("SP500_index_company_weight_top15.png")
How to Get Stock price data for a company
We can get the stock price of any listed US company using tidyquant function tq_get(). In the example below, we specify ticker for Apple and also specify the start and end dates we need the stock data. Here we are getting Apple’s stock price data for the whole year 2022.
For each company we specify we get the stock price at opening, high, low, and adjusted price as shown below.
tq_get("AAPL", from = "2022-01-01", to="2023-01-01") %>% head() ## # A tibble: 6 × 8 ## symbol date open high low close volume adjusted ## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 AAPL 2022-01-03 178. 183. 178. 182. 104487900 181. ## 2 AAPL 2022-01-04 183. 183. 179. 180. 99310400 179. ## 3 AAPL 2022-01-05 180. 180. 175. 175. 94537600 174. ## 4 AAPL 2022-01-06 173. 175. 172. 172 96904000 171. ## 5 AAPL 2022-01-07 173. 174. 171. 172. 86709100 171. ## 6 AAPL 2022-01-10 169. 172. 168. 172. 106765600 171.
Let us write a small function to get a company’s stock price. Instead of getting all the data from tq_get() function, we keep just three columns from tq_get()’s output.
get_stock_price <- function(ticker = "AAPL"){ tq_get(ticker, from = "2022-01-01", to="2023-01-01") %>% select(symbol, date, adjusted) }
Here is another function to get all the stock tickers for the companies in S&P 500 index using tq_index() function in tidyquant R package.
get_stock_tickers <- function(stock_index= "SP500"){ tq_index(stock_index) %>% select(symbol,company,weight) %>% arrange(symbol) }
Leet us get all the stock tickers from S&P 500 index and clean up the ticker names a bit.
sp500_tickers <- get_stock_tickers() sp500_tickers <- sp500_tickers %>% mutate(symbol= gsub("\\.","-",symbol))
sp500_tickers %>% head() ## # A tibble: 6 × 3 ## symbol company weight ## <chr> <chr> <dbl> ## 1 A Agilent Technologies Inc. 0.00142 ## 2 AAL American Airlines Group Inc. 0.000280 ## 3 AAP Advance Auto Parts Inc. 0.000290 ## 4 AAPL Apple Inc. 0.0587 ## 5 ABBV AbbVie Inc. 0.00907 ## 6 ABC AmerisourceBergen Corporation 0.000829
How to Get Stock Price data for all the companies in S&P 500 index
With all company tickers, we are now ready to get the stock price data for all companies in S&P 500. Again, we will use tidyquant’s tq_get() function to get the stock data. Here, instead of single company’s ticker, we will use all the tickers from S&P 500 index.
We also specify the start and end date for the year 2022, since we want the data for the whole of 2022.
# get SP500 companies stock price data sp500_2022 <- tq_get(sp500_tickers %>% pull(symbol), from = "2022-01-01", to= "2023-01-01")
Since tq_get() is fetching the data through internet using Yahoo service, it might take some time to complete to get over 500 companies stock price for a year. When it is complete, the output data will look the same as before, but now for all the companies.
# S&P 500 companyies stock price data sp500_2022 %>% head() ## # A tibble: 6 × 8 ## symbol date open high low close volume adjusted ## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 A 2022-01-03 159 159. 154. 156. 1606300 155. ## 2 A 2022-01-04 155. 156. 150. 151. 2234000 150. ## 3 A 2022-01-05 151. 153. 149. 149. 2370500 148. ## 4 A 2022-01-06 149. 150. 146. 149. 2298300 148. ## 5 A 2022-01-07 149. 150. 145. 145. 2058600 144. ## 6 A 2022-01-10 143. 145. 141. 145. 2548100 144.
Wee can make sure we have downloaded all the companies’ stock price data.
sp500_2022 %>% dim() 126002 8
We have over 120k rows of stock price data. And by looking the tail of the data. Since we have sorted the company tickers, we are seeing data from a company that starts with Z.
sp500_2022 %>% tail() ## # A tibble: 6 × 8 ## symbol date open high low close volume adjusted ## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ZTS 2022-12-22 144. 145. 142. 145. 1541800 145. ## 2 ZTS 2022-12-23 145. 146. 144. 146. 1017900 146. ## 3 ZTS 2022-12-27 146. 146. 144. 145. 957900 145. ## 4 ZTS 2022-12-28 145. 147. 144. 144. 1443900 144. ## 5 ZTS 2022-12-29 145. 149. 145. 148. 1298900 148. ## 6 ZTS 2022-12-30 147. 148. 145. 147. 1249500 147.
How to Compute Daily Returns for all the Companies in S&P 500 index
We will compute daily returns for each day and for each company using adjusted price for a day with. tidyquant’s tq_transmute() function.
We apply tq_transmute() for each company in S&P 500 using group_by() function. And tq_transmute() takes four arguments as shown below. We specify tq_transmute() to use “adjusted” stock price for computing daily returns and also rename the return column to “daily_return”.
sp500_daily_return <- sp500_2022 %>% group_by(symbol) %>% tq_transmute(select = adjusted, mutate_fun = periodReturn, period = "daily", col_rename = "daily_return")
And the resulting daily return data is a tidy tibble that looks like this.
sp500_daily_return %>% head() ## # A tibble: 6 × 3 ## # Groups: symbol [1] ## symbol date daily_return ## <chr> <date> <dbl> ## 1 A 2022-01-03 0 ## 2 A 2022-01-04 -0.0338 ## 3 A 2022-01-05 -0.0171 ## 4 A 2022-01-06 0.00350 ## 5 A 2022-01-07 -0.0266 ## 6 A 2022-01-10 0.0000690
How to Compute weekly/monthly/yearly Returns for all the Companies in S&P 500 index
If youa re interested in computing other returns like, weekly, monthly or yearly, we can use tidyquant’s tq_transmute() function in a similar way, but this time we change the “period” argument to “weekly” and rename the return column correctly.
The code below gets weeekly returns of companies in S&P 500 index.
sp500_weekly_return <- sp500_2022 %>% group_by(symbol) %>% tq_transmute(select = adjusted, mutate_fun = periodReturn, period = "weekly", col_rename = "weekly_return") sp500_weekly_return %>% head() ## # A tibble: 6 × 3 ## # Groups: symbol [1] ## symbol date weekly_return ## <chr> <date> <dbl> ## 1 A 2022-01-07 -0.0724 ## 2 A 2022-01-14 -0.00324 ## 3 A 2022-01-21 -0.0496 ## 4 A 2022-01-28 -0.00327 ## 5 A 2022-02-04 0.0296 ## 6 A 2022-02-11 -0.0278
Here we use the same strategy by using “period=monthly” and get monthly returns of companies in S&P 500 index.
sp500_monthly_return <- sp500_2022 %>% group_by(symbol) %>% tq_transmute(select = adjusted, mutate_fun = periodReturn, period = "monthly", col_rename = "monthly_return") sp500_monthly_return %>% head() ## # A tibble: 6 × 3 ## # Groups: symbol [1] ## symbol date monthly_return ## <chr> <date> <dbl> ## 1 A 2022-01-31 -0.110 ## 2 A 2022-02-28 -0.0643 ## 3 A 2022-03-31 0.0151 ## 4 A 2022-04-29 -0.0973 ## 5 A 2022-05-31 0.0695 ## 6 A 2022-06-30 -0.0689
And we can also get yearly returns of companies in S&P 500 index with the following code.
sp500_annual_return <- sp500_2022 %>% group_by(symbol) %>% tq_transmute(select = adjusted, mutate_fun = periodReturn, period = "yearly", col_rename = "annual_return") sp500_annual_return %>% head() ## # A tibble: 6 × 3 ## # Groups: symbol [6] ## symbol date annual_return ## <chr> <date> <dbl> ## 1 A 2022-12-30 -0.0374 ## 2 AAL 2022-12-30 -0.322 ## 3 AAP 2022-12-30 -0.357 ## 4 AAPL 2022-12-30 -0.282 ## 5 ABBV 2022-12-30 0.240 ## 6 ABC 2022-12-30 0.265
PCA on Daily Returns of S&P 500 companies
Let us go ahead to perform PCA on the daily returns of companies in S&P 500 index. First, let us reshape the daily return data tidy/long form to wide form, with companies on rows and dates on columns. We use pivot_wider to reshape. And we also convert the resulting wide tibble to a matrix.
sp500_dmat <- sp500_daily_return %>% pivot_wider(names_from="date", values_from = daily_return) %>% select(-`2022-01-03`) %>% column_to_rownames("symbol") %>% as.matrix()
And our S&P 500 daily return matrix looks like this.
sp500_mat[1:5,1:5] ## 2022-01-04 2022-01-05 2022-01-06 2022-01-07 2022-01-10 ## A -0.033806308 -0.017130581 0.003499327 -0.0266229247 6.899386e-05 ## AAL 0.014400000 -0.017875920 -0.005888651 0.0382337641 -2.541494e-02 ## AAP 0.001140383 -0.002531108 0.021991956 -0.0147320356 -1.663232e-02 ## AAPL -0.012691475 -0.026600016 -0.016693182 0.0009883834 1.160615e-04 ## ABBV -0.001919791 0.005252839 -0.004710368 -0.0025881284 1.119520e-02
Before proceeding, let us remove any missing data in the return matrix. Without dwelling on the reasons of missingness, we simply identify the rows with missing values and remove the rows from the SP500 return matrix.
na_ind <- which(apply(sp500_dmat,1,function(x){sum(is.na(x))})>0) na_ind ## CEG GEHC ## 89 199 sp500_mat <- sp500_dmat[-na_ind,]
dim(sp500_mat) ## [1] 501 250
After removing the two companies with missing values, we have 501 companies from S&P 500 index in our daily return data matrix.
In the PCA, we are interested in about the companies, rather than how the index performed over time. Therefore, transpose the matrix before applying scale() function to mean center and scale the data. After scaling we perform PCA using prcomp() function in R.
pca_fit <- sp500_mat %>% t() %>% scale() %>% prcomp()
We have save the PCA results in a variable called pca_fit. And this is the structure of PCA result object.
str(pca_fit ) ## List of 5 ## $ sdev : num [1:250] 14.87 5.54 4.68 3.76 3.01 ... ## $ rotation: num [1:501, 1:250] -0.0514 -0.0457 -0.0388 -0.0548 -0.027 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:501] "A" "AAL" "AAP" "AAPL" ... ## .. ..$ : chr [1:250] "PC1" "PC2" "PC3" "PC4" ... ## $ center : Named num [1:501] 3.02e-18 -2.21e-17 1.45e-17 -2.84e-17 6.91e-18 ... ## ..- attr(*, "names")= chr [1:501] "A" "AAL" "AAP" "AAPL" ... ## $ scale : Named num [1:501] 0.0223 0.0354 0.0235 0.0225 0.0141 ... ## ..- attr(*, "names")= chr [1:501] "A" "AAL" "AAP" "AAPL" ... ## $ x : num [1:250, 1:250] -9.27 14.59 -2.85 2.48 4.48 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:250] "2022-01-04" "2022-01-05" "2022-01-06" "2022-01-07" ... ## .. ..$ : chr [1:250] "PC1" "PC2" "PC3" "PC4" ... ## - attr(*, "class")= chr "prcomp"
Scree plot from Daily Returns of S&P 500 companies
We will use broom package to extract the results from PCA. With broom’s tidy() function with argument “pcs”, we get the percent variation explained by each PC.
library(broom) pca_fit %>% tidy("pcs") %>% head() ## # A tibble: 6 × 4 ## PC std.dev percent cumulative ## <dbl> <dbl> <dbl> <dbl> ## 1 1 14.9 0.441 0.441 ## 2 2 5.54 0.0612 0.502 ## 3 3 4.68 0.0438 0.546 ## 4 4 3.76 0.0282 0.574 ## 5 5 3.01 0.0180 0.592 ## 6 6 2.82 0.0159 0.608
Using the variance explained, we can make a scree plot to understand how many PCs explain the majority of the variation in S&P 500 stock data.
pca_fit %>% tidy("pcs") %>% filter(PC<50) %>% ggplot(aes(x=PC, y=percent))+ geom_col(fill="dodgerblue", alpha=0.7) + scale_y_continuous(labels=scales::label_percent(), breaks = scales::breaks_pretty(n=6))+ labs(y= "Variance explained", title="Scree plot: 2022 S&P 500 Stock Return data") ggsave("Scree_plot_percent_variance_explained_SP500_data.png")
Let us save the variance explained by each PC for later use in making PCA plots.
variance_exp <- pca_fit %>% tidy("pcs") %>% pull(percent)
variance_exp %>% head() [1] 0.44130 0.06117 0.04376 0.02817 0.01804 0.01586
PCA Plot : PC1 vs PC2 – Daily Returns of S&P 500 companies
To get to quick understanding the top two PCs from PCA on S&P 500 index data, we make a PCA plot with PC1 vs PC2.
First, let us combine the results from PCA with S&P 500 company info data.
pc_df <- pca_fit %>% tidy("loadings") %>% filter(PC %in% c(1:5)) %>% rename(symbol=column) %>% left_join(stock_list_tbl, by="symbol") %>% mutate(PC=paste0("PC",PC)) %>% pivot_wider(names_from = "PC", values_from = "value")
pc_df %>% head() ## # A tibble: 6 × 13 ## symbol company identifier sedol weight sector shares_held local_currency ## <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> ## 1 A Agilent Tec… 00846U10 2520… 1.42e-3 Healt… 3289739 USD ## 2 AAL American Ai… 02376R10 BCV7… 2.80e-4 Indus… 7063898 USD ## 3 AAP Advance Aut… 00751Y10 2822… 2.90e-4 Consu… 667858 USD ## 4 AAPL Apple Inc. 03783310 2046… 5.87e-2 Infor… 165455870 USD ## 5 ABBV AbbVie Inc. 00287Y10 B92S… 9.07e-3 Healt… 19567584 USD ## 6 ABC Amerisource… 03073E10 2795… 8.29e-4 Healt… 1792022 USD ## # … with 5 more variables: PC1 <dbl>, PC2 <dbl>, PC3 <dbl>, PC4 <dbl>, ## # PC5 <dbl>
Now we make the PCA plot, a scatter plot between PC1 and PC2. We also color the points in the scatter plot by the sector it belongs in the S&P 500 index.
pc_df %>% ggplot(aes(x=PC1,y=PC2, color=sector))+ geom_point(size=3, alpha=0.7)+ #coord_fixed() + labs(title= "PCA: 2022 S&P 500 Stock Data", x=paste0("PC1: ", round(variance_exp[1]*100), "%"), y=paste0("PC1: ", round(variance_exp[2]*100), "%")) ggsave("PCA_plot_PC1_vs_PC2_SP500_daily_return.png")
By squinting at the PCA plot a bit we can see some patterns at the sector level, but not easily.
To see what PC1 has captured, we make a boxplot with PC1 on x-axis and S&P 500 index sectors on y-axis.
We have ordered the sectors by the median PC1 values.
pc_df %>% ggplot(aes(x=PC1,y=reorder(sector,PC1), color=sector))+ geom_boxplot(size=1)+ labs(title= "PCA: 2022 S&P 500 Stock Data", x=paste0("PC1: ", round(variance_exp[1]*100), "%"), y="Sector")+ #geom_jitter(width=0.05)+ theme(legend.position="none") ggsave("PCA_plot_PC1_vs_SP500_sectors.png")
We can clearly see the pattern that PC1 has captured. At the one end of the boxplot we have IT companies and at the other end we have energy companies. As we go from the sector IT, that has low PC1 values, to energy sector with high PC1 value, we can also see other sectors have a nice increasing pattern in their median PC1 values.
Going back to the PCA plot between PC1 and PC2, let us remake the plot this time highlighting only the energy and IT sectors.
pc_df %>% ggplot(aes(x=PC1,y=PC2))+ geom_point(color="grey", size=3, alpha=0.3)+ geom_point(data=pc_df %>% filter(sector=="Energy"), aes(PC1,PC2, color=sector), size=3, alpha=0.5)+ geom_point(data=pc_df %>% filter(sector=="Information Technology"), aes(PC1,PC2, color=sector), size=3, alpha=0.5)+ #coord_fixed() + labs(title= "PCA: 2022 S&P 500 Stock Data", x=paste0("PC1: ", round(variance_exp[1]*100), "%"), y=paste0("PC1: ", round(variance_exp[2]*100), "%")) #theme(legend.position = "none") ggsave("PCA_plot_PC1_vs_PC2_SP500_sectors.png")
Now, we can see both PC1 and PC2 seem to separate these sectors nicely. Definitely there is a huge variability within the sectors.
To summarize, in this post, we saw how to get stock price data using tidyquant and perform PCA on S&P 500 stock data using tidyverse framework. We performed a first pass PCA on the daily return data from S&P 500 index companies for the year 2022. And could find the top PCs do explain the general market behaviour on the returns nicely. As mentioned this is really a cursory fun analysis, it will be interesting to do more detailed financial analysis with this data using dimensionality reduction techniques like PCA.