PCA on S&P 500 Stock Return Data

This post is a fun exercise applying PCA (most in tidyverse framework) to stock data from S&P 500 index companies. With PCA on stock data from S&P 500, we can possibly do many interesting things, but we will focus on really simple things and try to see if PCA captures the general company trends from PCA results.

We will use tidyquant R package to get access to stock price data using R and use prcomp() to perform PCA using tidyverse framework with the help of broom package.

We will start with getting daily stock price data for about 500 companies in the S&P 500 stock index for the year 2022. Then we will explore the data a little and compute daily stock returns for each company.

Then we will go ahead and apply PCA on the daily return data for the year 2022 and try to make sense of the PCs.

First let us load the packages needed.

library(tidyquant)
library(broom)
library(tidyverse)
theme_set(theme_bw(16))

Companies in S&P 500 index

S&P 500 index typically contains about 500 companies in its index. As of the end of 2022, there are 503 companies in S&P 500 index.

We can get the list of all the companies in S&P 500 index, using tidyquant’s tq_index() function. We specify the name of the index, here “SP500” as the argument to tidyquant’s tq_index(). It gives us detailed information regarding the companies in S&P 500 index, including stock ticker name, a companies weight in the index as measured by market cap, and the broad sector it belongs to.

# get S&P 500 index coopanies info
tq_index("SP500") %>% head()

## Getting holdings for SP500
## # A tibble: 10 × 8
##    symbol company      identifier sedol weight sector shares_held local_currency
##    <chr>  <chr>        <chr>      <chr>  <dbl> <chr>        <dbl> <chr>         
##  1 AAPL   Apple Inc.   03783310   2046… 0.0587 Infor…   165455870 USD           
##  2 MSFT   Microsoft C… 59491810   2588… 0.0520 Infor…    82480670 USD           
##  3 AMZN   Amazon.com … 02313510   2000… 0.0232 Consu…    98203210 USD           
##  4 BRK.B  Berkshire H… 08467070   2073… 0.0177 Finan…    19935340 USD           
##  5 GOOGL  Alphabet In… 02079K30   BYVY… 0.0162 Commu…    66088936 USD           
##  6 JNJ    Johnson & J… 47816010   2475… 0.0147 Healt…    28960104 USD

Let us save the information on the S&P 500 companies in a variable for future use.

stock_list_tbl <- tq_index("SP500") %>%
  arrange(symbol)

Exploring the S&P 500 companies data a bit, let us look at the number of companies in each sector using a simple barplot.

stock_list_tbl %>% 
  count(sector, sort=TRUE) %>%
  ggplot(aes(y=reorder(sector,n),
             x=n, fill=sector))+
  geom_col()+
  labs(y=NULL, 
       subtitle= "2022 S&P 500 Companies Count by Sector")+
  theme(legend.position = "none") 
ggsave("SP500_index_company_counts_by_sector.png")

Counts of 2022 S&P 500 Index Companies by Sector

Here is a look at the top 15 companies in the S&P 500 index by their weight.

stock_list_tbl %>% 
  arrange(desc(weight))%>%
  head(15) %>%
  mutate(company=gsub(" Corporation", "", company),
         company=gsub(" Inc.","", company),
         company=gsub(" Limited","", company),
         company=gsub(" Company","", company),
         company=gsub(" Inc.","", company),
         company=gsub(" Incorporated","", company)) %>%
  ggplot(aes(y=reorder(company,weight),
             x=weight, fill=sector))+
  geom_col()+
  labs(y=NULL)+
  scale_x_continuous(labels=scales::percent_format())
ggsave("SP500_index_company_weight_top15.png")

Top 15 Companies in S&P 500 (by Market Cap)

How to Get Stock price data for a company

We can get the stock price of any listed US company using tidyquant function tq_get(). In the example below, we specify ticker for Apple and also specify the start and end dates we need the stock data. Here we are getting Apple’s stock price data for the whole year 2022.

For each company we specify we get the stock price at opening, high, low, and adjusted price as shown below.

tq_get("AAPL", from = "2022-01-01", to="2023-01-01") %>% 
  head()

## # A tibble: 6 × 8
##   symbol date        open  high   low close    volume adjusted
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 AAPL   2022-01-03  178.  183.  178.  182. 104487900     181.
## 2 AAPL   2022-01-04  183.  183.  179.  180.  99310400     179.
## 3 AAPL   2022-01-05  180.  180.  175.  175.  94537600     174.
## 4 AAPL   2022-01-06  173.  175.  172.  172   96904000     171.
## 5 AAPL   2022-01-07  173.  174.  171.  172.  86709100     171.
## 6 AAPL   2022-01-10  169.  172.  168.  172. 106765600     171.

Let us write a small function to get a company’s stock price. Instead of getting all the data from tq_get() function, we keep just three columns from tq_get()’s output.

get_stock_price <- function(ticker = "AAPL"){
  tq_get(ticker, from = "2022-01-01", to="2023-01-01") %>%
    select(symbol, date, adjusted)
}

Here is another function to get all the stock tickers for the companies in S&P 500 index using tq_index() function in tidyquant R package.

get_stock_tickers <- function(stock_index= "SP500"){
  tq_index(stock_index) %>%
    select(symbol,company,weight) %>%
    arrange(symbol) 
 }

Leet us get all the stock tickers from S&P 500 index and clean up the ticker names a bit.

sp500_tickers <- get_stock_tickers()
sp500_tickers <- sp500_tickers %>% 
  mutate(symbol= gsub("\\.","-",symbol))

sp500_tickers %>% head()

## # A tibble: 6 × 3
##   symbol company                         weight
##   <chr>  <chr>                            <dbl>
## 1 A      Agilent Technologies Inc.     0.00142 
## 2 AAL    American Airlines Group Inc.  0.000280
## 3 AAP    Advance Auto Parts Inc.       0.000290
## 4 AAPL   Apple Inc.                    0.0587  
## 5 ABBV   AbbVie Inc.                   0.00907 
## 6 ABC    AmerisourceBergen Corporation 0.000829

How to Get Stock Price data for all the companies in S&P 500 index

With all company tickers, we are now ready to get the stock price data for all companies in S&P 500. Again, we will use tidyquant’s tq_get() function to get the stock data. Here, instead of single company’s ticker, we will use all the tickers from S&P 500 index.

We also specify the start and end date for the year 2022, since we want the data for the whole of 2022.

# get SP500 companies stock price data
sp500_2022 <- tq_get(sp500_tickers %>% pull(symbol),
                    from = "2022-01-01",
                    to= "2023-01-01")

Since tq_get() is fetching the data through internet using Yahoo service, it might take some time to complete to get over 500 companies stock price for a year. When it is complete, the output data will look the same as before, but now for all the companies.

# S&P 500 companyies stock price data
sp500_2022 %>% head()

## # A tibble: 6 × 8
##   symbol date        open  high   low close  volume adjusted
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>
## 1 A      2022-01-03  159   159.  154.  156. 1606300     155.
## 2 A      2022-01-04  155.  156.  150.  151. 2234000     150.
## 3 A      2022-01-05  151.  153.  149.  149. 2370500     148.
## 4 A      2022-01-06  149.  150.  146.  149. 2298300     148.
## 5 A      2022-01-07  149.  150.  145.  145. 2058600     144.
## 6 A      2022-01-10  143.  145.  141.  145. 2548100     144.

Wee can make sure we have downloaded all the companies’ stock price data.

sp500_2022  %>% dim()

 126002      8

We have over 120k rows of stock price data. And by looking the tail of the data. Since we have sorted the company tickers, we are seeing data from a company that starts with Z.

sp500_2022  %>% tail()

## # A tibble: 6 × 8
##   symbol date        open  high   low close  volume adjusted
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>
## 1 ZTS    2022-12-22  144.  145.  142.  145. 1541800     145.
## 2 ZTS    2022-12-23  145.  146.  144.  146. 1017900     146.
## 3 ZTS    2022-12-27  146.  146.  144.  145.  957900     145.
## 4 ZTS    2022-12-28  145.  147.  144.  144. 1443900     144.
## 5 ZTS    2022-12-29  145.  149.  145.  148. 1298900     148.
## 6 ZTS    2022-12-30  147.  148.  145.  147. 1249500     147.

How to Compute Daily Returns for all the Companies in S&P 500 index

We will compute daily returns for each day and for each company using adjusted price for a day with. tidyquant’s tq_transmute() function.

We apply tq_transmute() for each company in S&P 500 using group_by() function. And tq_transmute() takes four arguments as shown below. We specify tq_transmute() to use “adjusted” stock price for computing daily returns and also rename the return column to “daily_return”.

sp500_daily_return <- sp500_2022  %>% 
  group_by(symbol) %>%
  tq_transmute(select     = adjusted,
                 mutate_fun = periodReturn, 
                 period     = "daily", 
                 col_rename = "daily_return")

And the resulting daily return data is a tidy tibble that looks like this.

sp500_daily_return %>% head()

## # A tibble: 6 × 3
## # Groups:   symbol [1]
##   symbol date       daily_return
##   <chr>  <date>            <dbl>
## 1 A      2022-01-03    0        
## 2 A      2022-01-04   -0.0338   
## 3 A      2022-01-05   -0.0171   
## 4 A      2022-01-06    0.00350  
## 5 A      2022-01-07   -0.0266   
## 6 A      2022-01-10    0.0000690

How to Compute weekly/monthly/yearly Returns for all the Companies in S&P 500 index

If youa re interested in computing other returns like, weekly, monthly or yearly, we can use tidyquant’s tq_transmute() function in a similar way, but this time we change the “period” argument to “weekly” and rename the return column correctly.

The code below gets weeekly returns of companies in S&P 500 index.

sp500_weekly_return <- sp500_2022  %>% 
  group_by(symbol) %>%
  tq_transmute(select     = adjusted,
                 mutate_fun = periodReturn, 
                 period     = "weekly", 
                 col_rename = "weekly_return")

sp500_weekly_return %>% head()

## # A tibble: 6 × 3
## # Groups:   symbol [1]
##   symbol date       weekly_return
##   <chr>  <date>             <dbl>
## 1 A      2022-01-07      -0.0724 
## 2 A      2022-01-14      -0.00324
## 3 A      2022-01-21      -0.0496 
## 4 A      2022-01-28      -0.00327
## 5 A      2022-02-04       0.0296 
## 6 A      2022-02-11      -0.0278

Here we use the same strategy by using “period=monthly” and get monthly returns of companies in S&P 500 index.

sp500_monthly_return <- sp500_2022  %>% 
  group_by(symbol) %>%
  tq_transmute(select     = adjusted,
                 mutate_fun = periodReturn, 
                 period     = "monthly", 
                 col_rename = "monthly_return")
sp500_monthly_return %>% head()
## # A tibble: 6 × 3
## # Groups:   symbol [1]
##   symbol date       monthly_return
##   <chr>  <date>              <dbl>
## 1 A      2022-01-31        -0.110 
## 2 A      2022-02-28        -0.0643
## 3 A      2022-03-31         0.0151
## 4 A      2022-04-29        -0.0973
## 5 A      2022-05-31         0.0695
## 6 A      2022-06-30        -0.0689

And we can also get yearly returns of companies in S&P 500 index with the following code.

sp500_annual_return <- sp500_2022  %>% 
  group_by(symbol) %>%
  tq_transmute(select     = adjusted,
                 mutate_fun = periodReturn, 
                 period     = "yearly", 
                 col_rename = "annual_return")
sp500_annual_return %>% head()
## # A tibble: 6 × 3
## # Groups:   symbol [6]
##   symbol date       annual_return
##   <chr>  <date>             <dbl>
## 1 A      2022-12-30       -0.0374
## 2 AAL    2022-12-30       -0.322 
## 3 AAP    2022-12-30       -0.357 
## 4 AAPL   2022-12-30       -0.282 
## 5 ABBV   2022-12-30        0.240 
## 6 ABC    2022-12-30        0.265

PCA on Daily Returns of S&P 500 companies

Let us go ahead to perform PCA on the daily returns of companies in S&P 500 index. First, let us reshape the daily return data tidy/long form to wide form, with companies on rows and dates on columns. We use pivot_wider to reshape. And we also convert the resulting wide tibble to a matrix.

sp500_dmat <-  sp500_daily_return %>%
  pivot_wider(names_from="date", values_from = daily_return) %>%
  select(-`2022-01-03`) %>%
  column_to_rownames("symbol") %>%
  as.matrix()

And our S&P 500 daily return matrix looks like this.

sp500_mat[1:5,1:5]

##        2022-01-04   2022-01-05   2022-01-06    2022-01-07    2022-01-10
## A    -0.033806308 -0.017130581  0.003499327 -0.0266229247  6.899386e-05
## AAL   0.014400000 -0.017875920 -0.005888651  0.0382337641 -2.541494e-02
## AAP   0.001140383 -0.002531108  0.021991956 -0.0147320356 -1.663232e-02
## AAPL -0.012691475 -0.026600016 -0.016693182  0.0009883834  1.160615e-04
## ABBV -0.001919791  0.005252839 -0.004710368 -0.0025881284  1.119520e-02

Before proceeding, let us remove any missing data in the return matrix. Without dwelling on the reasons of missingness, we simply identify the rows with missing values and remove the rows from the SP500 return matrix.


na_ind <- which(apply(sp500_dmat,1,function(x){sum(is.na(x))})>0)
na_ind

##  CEG GEHC 
##   89  199

sp500_mat <- sp500_dmat[-na_ind,]

dim(sp500_mat)
## [1] 501 250

After removing the two companies with missing values, we have 501 companies from S&P 500 index in our daily return data matrix.

In the PCA, we are interested in about the companies, rather than how the index performed over time. Therefore, transpose the matrix before applying scale() function to mean center and scale the data. After scaling we perform PCA using prcomp() function in R.

pca_fit <-  sp500_mat %>%
  t() %>%
  scale() %>%
  prcomp()

We have save the PCA results in a variable called pca_fit. And this is the structure of PCA result object.

str(pca_fit )
## List of 5
##  $ sdev    : num [1:250] 14.87 5.54 4.68 3.76 3.01 ...
##  $ rotation: num [1:501, 1:250] -0.0514 -0.0457 -0.0388 -0.0548 -0.027 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:501] "A" "AAL" "AAP" "AAPL" ...
##   .. ..$ : chr [1:250] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:501] 3.02e-18 -2.21e-17 1.45e-17 -2.84e-17 6.91e-18 ...
##   ..- attr(*, "names")= chr [1:501] "A" "AAL" "AAP" "AAPL" ...
##  $ scale   : Named num [1:501] 0.0223 0.0354 0.0235 0.0225 0.0141 ...
##   ..- attr(*, "names")= chr [1:501] "A" "AAL" "AAP" "AAPL" ...
##  $ x       : num [1:250, 1:250] -9.27 14.59 -2.85 2.48 4.48 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:250] "2022-01-04" "2022-01-05" "2022-01-06" "2022-01-07" ...
##   .. ..$ : chr [1:250] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

Scree plot from Daily Returns of S&P 500 companies

We will use broom package to extract the results from PCA. With broom’s tidy() function with argument “pcs”, we get the percent variation explained by each PC.

library(broom)
pca_fit %>%
  tidy("pcs")  %>%
  head()
## # A tibble: 6 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   14.9   0.441       0.441
## 2     2    5.54  0.0612      0.502
## 3     3    4.68  0.0438      0.546
## 4     4    3.76  0.0282      0.574
## 5     5    3.01  0.0180      0.592
## 6     6    2.82  0.0159      0.608

Using the variance explained, we can make a scree plot to understand how many PCs explain the majority of the variation in S&P 500 stock data.

pca_fit %>%
  tidy("pcs") %>% 
  filter(PC<50) %>%
  ggplot(aes(x=PC, y=percent))+
  geom_col(fill="dodgerblue", alpha=0.7) +
  scale_y_continuous(labels=scales::label_percent(),
                     breaks = scales::breaks_pretty(n=6))+
  labs(y= "Variance explained",
       title="Scree plot: 2022 S&P 500 Stock Return data")
ggsave("Scree_plot_percent_variance_explained_SP500_data.png")

Scree plot: Percent Variance explained by PCs in SP500 data

Let us save the variance explained by each PC for later use in making PCA plots.

variance_exp <- pca_fit %>%  
  tidy("pcs") %>% 
  pull(percent)

variance_exp %>% head()
[1] 0.44130 0.06117 0.04376 0.02817 0.01804 0.01586

PCA Plot : PC1 vs PC2 – Daily Returns of S&P 500 companies

To get to quick understanding the top two PCs from PCA on S&P 500 index data, we make a PCA plot with PC1 vs PC2.

First, let us combine the results from PCA with S&P 500 company info data.

pc_df <- pca_fit %>% 
  tidy("loadings") %>% 
  filter(PC %in% c(1:5)) %>% 
  rename(symbol=column) %>%
  left_join(stock_list_tbl, by="symbol") %>%
  mutate(PC=paste0("PC",PC)) %>%  
  pivot_wider(names_from = "PC",
              values_from = "value")

pc_df %>% head()

## # A tibble: 6 × 13
##   symbol company      identifier sedol  weight sector shares_held local_currency
##   <chr>  <chr>        <chr>      <chr>   <dbl> <chr>        <dbl> <chr>         
## 1 A      Agilent Tec… 00846U10   2520… 1.42e-3 Healt…     3289739 USD           
## 2 AAL    American Ai… 02376R10   BCV7… 2.80e-4 Indus…     7063898 USD           
## 3 AAP    Advance Aut… 00751Y10   2822… 2.90e-4 Consu…      667858 USD           
## 4 AAPL   Apple Inc.   03783310   2046… 5.87e-2 Infor…   165455870 USD           
## 5 ABBV   AbbVie Inc.  00287Y10   B92S… 9.07e-3 Healt…    19567584 USD           
## 6 ABC    Amerisource… 03073E10   2795… 8.29e-4 Healt…     1792022 USD           
## # … with 5 more variables: PC1 <dbl>, PC2 <dbl>, PC3 <dbl>, PC4 <dbl>,
## #   PC5 <dbl>

Now we make the PCA plot, a scatter plot between PC1 and PC2. We also color the points in the scatter plot by the sector it belongs in the S&P 500 index.

pc_df %>% 
  ggplot(aes(x=PC1,y=PC2, color=sector))+
  geom_point(size=3, alpha=0.7)+
  #coord_fixed() +
  labs(title= "PCA: 2022 S&P 500 Stock Data",
       x=paste0("PC1: ", round(variance_exp[1]*100), "%"),
       y=paste0("PC1: ", round(variance_exp[2]*100), "%"))
ggsave("PCA_plot_PC1_vs_PC2_SP500_daily_return.png")

By squinting at the PCA plot a bit we can see some patterns at the sector level, but not easily.

To see what PC1 has captured, we make a boxplot with PC1 on x-axis and S&P 500 index sectors on y-axis.
We have ordered the sectors by the median PC1 values.

pc_df %>%
  ggplot(aes(x=PC1,y=reorder(sector,PC1),
             color=sector))+
  geom_boxplot(size=1)+
  labs(title= "PCA: 2022 S&P 500 Stock Data",
       x=paste0("PC1: ", round(variance_exp[1]*100), "%"),
       y="Sector")+
  #geom_jitter(width=0.05)+
  theme(legend.position="none")
ggsave("PCA_plot_PC1_vs_SP500_sectors.png")

We can clearly see the pattern that PC1 has captured. At the one end of the boxplot we have IT companies and at the other end we have energy companies. As we go from the sector IT, that has low PC1 values, to energy sector with high PC1 value, we can also see other sectors have a nice increasing pattern in their median PC1 values.

PCA on S&P 500 Stock data: PC1 vs SP500 Sectors

Going back to the PCA plot between PC1 and PC2, let us remake the plot this time highlighting only the energy and IT sectors.

pc_df %>%
  ggplot(aes(x=PC1,y=PC2))+
  geom_point(color="grey", size=3, alpha=0.3)+
  geom_point(data=pc_df %>% filter(sector=="Energy"),
             aes(PC1,PC2, color=sector), size=3, alpha=0.5)+
  geom_point(data=pc_df %>% filter(sector=="Information Technology"),
             aes(PC1,PC2, color=sector), size=3, alpha=0.5)+
  #coord_fixed() +
  labs(title= "PCA: 2022 S&P 500 Stock Data",
       x=paste0("PC1: ", round(variance_exp[1]*100), "%"),
       y=paste0("PC1: ", round(variance_exp[2]*100), "%"))
  #theme(legend.position = "none")
ggsave("PCA_plot_PC1_vs_PC2_SP500_sectors.png")

Now, we can see both PC1 and PC2 seem to separate these sectors nicely. Definitely there is a huge variability within the sectors.

PCA Plot S&P 500 Data : Highlight Select Sectors

To summarize, in this post, we saw how to get stock price data using tidyquant and perform PCA on S&P 500 stock data using tidyverse framework. We performed a first pass PCA on the daily return data from S&P 500 index companies for the year 2022. And could find the top PCs do explain the general market behaviour on the returns nicely. As mentioned this is really a cursory fun analysis, it will be interesting to do more detailed financial analysis with this data using dimensionality reduction techniques like PCA.