Acknowledgment: the materials below are partially based on the book “Introduction to Data Science Data Analysis and Prediction Algorithms with R” by Rafael A. Irizarry. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.

1 The Story of Oakland A’s in 2002

The movie Moneyball (Wiki, YouTube) is about the story of how the Oakland Athletics baseball team’s general manager, Billy Beane, assembled a competitive team using analytics. The movie is based on Michael Lewis’s 2003 nonfiction book of the same name “Moneyball: The Art of Winning an Unfair Game” (Amazon).

In 2002, Oakland A’s GM, Billy Beane (played by Brad Pitt), and assistant GM, Peter Brand (played by Jonah Hill), faced with the franchise’s limited budget for players. Having lost a few key players to other teams, Billy and Peter built a team of undervalued talent by taking a sophisticated sabermetric approach to scouting and analyzing players. To get a sense of the budgets in 2002, A’s has $40 million payroll while Yankees has $125 million.

Now we pretend to be the A’s in 2002 and try to build a baseball team with a limited budget. Note that in 2002 the Yankees’ payroll of $125 million more than tripled the Oakland A’s $40 million:

#library(tidyverse)#install.packages("tidyverse")
library(gridExtra) #install.packages("gridExtra")
library(Lahman) #install.packages("Lahman")
library(dplyr) #install.packages("dplyr")
library(tidyr) #install.packages("tidyr")
library(tibble) #install.packages("tibble")
library(readr) #install.packages("readr")
library(purrr) #install.packages("purrr")
library(stringr) #install.packages("stringr")
library(forcats) #install.packages("forcats")
library(dslabs) #install.packages("dslabs")
library(ggplot2) #install.packages("ggplot2")
library(broom) #install.packages("broom")
#ds_theme_set()

2 Team Analysis

Let’s start with team level analysis. The data is in R package Lahman [https://cran.r-project.org/web/packages/Lahman/Lahman.pdf] Page 56-58. We begin by exploring the team level data. Yearly statistics and standings for teams

data(Teams)
str(Teams)
## 'data.frame':    2895 obs. of  48 variables:
##  $ yearID        : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
##  $ lgID          : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ teamID        : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
##  $ franchID      : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
##  $ divID         : chr  NA NA NA NA ...
##  $ Rank          : int  3 2 8 7 5 1 9 6 4 2 ...
##  $ G             : int  31 28 29 19 33 28 25 29 32 58 ...
##  $ Ghome         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ W             : int  20 19 10 7 16 21 4 13 15 35 ...
##  $ L             : int  10 9 19 12 17 7 21 15 15 19 ...
##  $ DivWin        : chr  NA NA NA NA ...
##  $ WCWin         : chr  NA NA NA NA ...
##  $ LgWin         : chr  "N" "N" "N" "N" ...
##  $ WSWin         : chr  NA NA NA NA ...
##  $ R             : int  401 302 249 137 302 376 231 351 310 617 ...
##  $ AB            : int  1372 1196 1186 746 1404 1281 1036 1248 1353 2571 ...
##  $ H             : int  426 323 328 178 403 410 274 384 375 753 ...
##  $ X2B           : int  70 52 35 19 43 66 44 51 54 106 ...
##  $ X3B           : int  37 21 40 8 21 27 25 34 26 31 ...
##  $ HR            : int  3 10 7 2 1 9 3 6 6 14 ...
##  $ BB            : num  60 60 26 33 33 46 38 49 48 29 ...
##  $ SO            : int  19 22 25 9 15 23 30 19 13 28 ...
##  $ SB            : num  73 69 18 16 46 56 53 62 48 53 ...
##  $ CS            : num  16 21 8 4 15 12 10 24 13 18 ...
##  $ HBP           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ RA            : int  303 241 341 243 313 266 287 362 303 434 ...
##  $ ER            : int  109 77 116 97 121 137 108 153 137 166 ...
##  $ ERA           : num  3.55 2.76 4.11 5.17 3.72 4.95 4.3 5.51 4.37 2.9 ...
##  $ CG            : int  22 25 23 19 32 27 23 28 32 48 ...
##  $ SHO           : int  1 0 0 1 1 0 1 0 0 1 ...
##  $ SV            : int  3 1 0 0 0 0 0 0 0 1 ...
##  $ IPouts        : int  828 753 762 507 879 747 678 750 846 1548 ...
##  $ HA            : int  367 308 346 261 373 329 315 431 371 573 ...
##  $ HRA           : int  2 6 13 5 7 3 3 4 4 3 ...
##  $ BBA           : int  42 28 53 21 42 53 34 75 45 63 ...
##  $ SOA           : int  23 22 34 17 22 16 16 12 13 77 ...
##  $ E             : int  243 229 234 163 235 194 220 198 218 432 ...
##  $ DP            : int  24 16 15 8 14 13 14 22 20 22 ...
##  $ FP            : num  0.834 0.829 0.818 0.803 0.84 0.845 0.821 0.845 0.85 0.83 ...
##  $ name          : chr  "Boston Red Stockings" "Chicago White Stockings" "Cleveland Forest Citys" "Fort Wayne Kekiongas" ...
##  $ park          : chr  "South End Grounds I" "Union Base-Ball Grounds" "National Association Grounds" "Hamilton Field" ...
##  $ attendance    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BPF           : int  103 104 96 101 90 102 97 101 94 106 ...
##  $ PPF           : int  98 102 100 107 88 98 99 100 98 102 ...
##  $ teamIDBR      : chr  "BOS" "CHI" "CLE" "KEK" ...
##  $ teamIDlahman45: chr  "BS1" "CH1" "CL1" "FW1" ...
##  $ teamIDretro   : chr  "BS1" "CH1" "CL1" "FW1" ...
tail(Teams)
##      yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin
## 2890   2018   NL    SFN      SFG     W    4 162    81 73 89      N     N     N
## 2891   2018   NL    SLN      STL     C    3 162    81 88 74      N     N     N
## 2892   2018   AL    TBA      TBD     E    3 162    81 90 72      N     N     N
## 2893   2018   AL    TEX      TEX     W    5 162    81 67 95      N     N     N
## 2894   2018   AL    TOR      TOR     E    4 162    81 73 89      N     N     N
## 2895   2018   NL    WAS      WSN     E    2 162    81 82 80      N     N     N
##      WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER  ERA CG SHO
## 2890     N 603 5541 1324 255  30 133 448 1467  77 34  49 42 699 641 3.95  1  15
## 2891     N 759 5498 1369 248   9 205 525 1380  63 32  80 48 691 622 3.85  1   8
## 2892     N 716 5475 1415 274  43 150 540 1388 128 51 101 50 646 602 3.74  0  14
## 2893     N 737 5453 1308 266  24 194 555 1484  74 35  88 34 848 783 4.92  1   5
## 2894     N 709 5477 1336 320  16 217 499 1387  47 30  58 37 832 772 4.85  0   3
## 2895     N 771 5517 1402 284  25 191 631 1289 119 33  59 40 682 649 4.04  2   7
##      SV IPouts   HA HRA BBA  SOA   E  DP    FP                 name
## 2890 36   4384 1387 156 524 1269  97 160 0.984 San Francisco Giants
## 2891 43   4366 1354 144 593 1337 133 151 0.978  St. Louis Cardinals
## 2892 52   4345 1236 164 501 1421  85 136 0.986       Tampa Bay Rays
## 2893 42   4293 1516 222 491 1121 120 168 0.980        Texas Rangers
## 2894 39   4301 1476 208 551 1298 101 138 0.983    Toronto Blue Jays
## 2895 40   4338 1320 198 487 1417  64 115 0.989 Washington Nationals
##                               park attendance BPF PPF teamIDBR teamIDlahman45
## 2890                     AT&T Park    3156185  95  96      SFG            SFN
## 2891             Busch Stadium III    3403587  97  96      STL            SLN
## 2892               Tropicana Field    1154973  97  97      TBR            TBA
## 2893 Rangers Ballpark in Arlington    2107107 112 113      TEX            TEX
## 2894                 Rogers Centre    2325281  97  98      TOR            TOR
## 2895                Nationals Park    2529604 106 105      WSN            MON
##      teamIDretro
## 2890         SFN
## 2891         SLN
## 2892         TBA
## 2893         TEX
## 2894         TOR
## 2895         WAS

A data frame with 2895 observations on the following 48 variables.

Variable Description
yearID Year
lgID League; a factor with levels AA AL FL NL PL UA
teamID Team; a factor
franchID Franchise (links to TeamsFranchises table)
divID Team’s division; a factor with levels C E W
Rank Position in final standings
G Games played
Ghome Games played at home
W Wins
L Losses
DivWin Division Winner (Y or N)
WCWin Wild Card Winner (Y or N)
LgWin League Champion(Y or N)
WSWin World Series Winner (Y or N)
R Runs scored
AB At bats
H Hits by batters
X2B Doubles
X3B Triples
HR Homeruns by batters
BB Walks by batters
SO Strikeouts by batters
SB Stolen bases
CS Caught stealing
HBP Batters hit by pitch
SF Sacrifice flies
RA Opponents runs scored
ER Earned runs allowed
ERA Earned run average
CG Complete games
SHO Shutouts
SV Saves
IPouts Outs Pitched (innings pitched x 3)
HA Hits allowed
HRA Homeruns allowed
BBA Walks allowed
SOA Strikeouts by pitchers
E Errors
DP Double Plays
FP Fielding percentage
name Team’s full name
park Name of team’s home ballpark
attendance Home attendance total
BPF Three-year park factor for batters
PPF Three-year park factor for pitchers
teamIDBR Team ID used by Baseball Reference website
teamIDlahman45 Team ID used in Lahman database version 4.5
teamIDretro Team ID used by Retrosheet

2.1 Response Variable

Defining the response variable is one of the key questions. To assemble a winning team, there are many ways to define the response variable. For example, we could use number of wins or number of losses as the response variable and so on. However, for simplicity, we want to focus our analysis on the game level for offense only and ignore defense and pitching. Therefore, our response variable is the number of runs (R). A run represents a batter goes around the bases and arrives home safely and scores.

We use runs (R) as the response variable and plot against all other variables. Clearly, we see clusters. So we tentatively use blue dots for games after 1961 and red dots for games before 1961.

Teams_type=sapply(Teams, class)
col_index=which(Teams_type %in% c("integer","numeric"))
length(col_index)
## [1] 35
par(mfrow=c(7,5))
par(mar = c(4, 4, 0.1, 0.1)) #par(mar = c(bottom, left, top, right))
par(mgp = c(1.5,0.5,0))
for (i in 1:length(col_index))
{
  plot(Teams[,col_index[i]],Teams[,"R"],pch=46,xlab=names(Teams)[col_index[i]],ylab="R",col=ifelse(Teams$yearID<1961,"red","blue"))
}

Since we have observations dated back to 1871, when the league was operating in much different ways. We decide to focus on more recent games. Here is a visulization of number of games played by years. We can see that after 1961, the number of games is more or less fixed at 162.

Teams %>% ggplot(aes(yearID, G)) + geom_point(alpha = 0.5) + 
  geom_vline(xintercept = 1961) + 
  geom_hline(yintercept = 162, linetype="dotted",  color = "blue")

Currently, each observation represent one team per season, we convert the data into game level statistics by dividing a few key variables by the number of games played. The key variables are pre-selected for demonstration purpose, but in practice, it can be selected by model selection techniques together with analyst’s understanding about the game of baseball. Since we are pretending we are A’s in 2002, we exclude all the data after 2002 too. Clearly, there are much more analysis to be done to visualize and clean the data, but these are the basic analysis before any meaningful analysis. We visualize the data as follows.

Teams_1961to2001=filter(Teams, yearID %in% 1961:2001)
#The above is equivalent to: Teams_1961to2001=Teams[(Teams$yearID<=2001)&(Teams$yearID>=1961),]
Teams_perGame = mutate(Teams_1961to2001,
                       R_per_game = R / G,
                       SB_per_game = SB / G, 
                       BB_per_game = BB / G,
                       singles_per_game = (H - X2B - X3B - HR) / G, 
                       doubles_per_game = X2B / G, 
                       triples_per_game = X3B / G,
                       HR_per_game = HR / G, 
                       AB_per_game = AB / G, 
                       )
pairs(Teams_perGame[,c("R_per_game","singles_per_game","doubles_per_game","triples_per_game","HR_per_game","BB_per_game","SB_per_game","AB_per_game")],pch=46,cex=2)

The variables we created as explained as follows.

  • Single (H-2B-3B-HR): Batter hits the ball and gets to first base.
  • Double (2B): Batter hits the ball and gets to second base.
  • Triple (3B): Batter hits the ball and gets to third base.
  • Home Run (HR): Batter hits the ball and goes all the way home and scores a run.
  • Bases on balls (BB): also known as walk, the pitcher fails to throw the ball through a predefined area considered to be hittable (the strikezone), so the batter is permitted to go to first base.
  • Steal a base (SB)
  • Hit (H): Singles, doubles, triples, and home runs are all hits.
  • At bat (AB)= either get a hit or make an out; BB is excluded.

Some other variables that may also be useful are:

  • On base percentage (OBP) = (H+BB)/(AB+BB)
  • Plate appearence (PA) = BB + AB
  • Batting average = H/AB, usually between 20% to 38%.

For each plot above, we generate it individually as follows.

Teams_1961to2001=filter(Teams, yearID %in% 1961:2001)
Teams_1961to2001_added_columns=mutate(Teams_1961to2001, HR_per_game = HR / G, R_per_game = R / G)
plot(Teams_1961to2001_added_columns$HR_per_game, Teams_1961to2001_added_columns$R_per_game, pch=20)

The code above seems a little too much. Note that the code above can be succintly written in the following way.

Teams %>% filter(yearID %in% 1961:2001) %>%
    mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
    ggplot(aes(HR_per_game, R_per_game)) + geom_point(alpha = 0.5)

2.2 Regression Analysis

There are many ways to decide covariates. For the moments, we will focus on five covaraites, BB, singles, doubles, triples, and HR. Obviously, SB was an candidate, but we will come back it later.

# regression with BB, singles, doubles, triples, HR
fit=lm(R_per_game ~ BB_per_game + singles_per_game + doubles_per_game + triples_per_game + HR_per_game, data = Teams_perGame)
summary(fit)
## 
## Call:
## lm(formula = R_per_game ~ BB_per_game + singles_per_game + doubles_per_game + 
##     triples_per_game + HR_per_game, data = Teams_perGame)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50830 -0.09443 -0.00069  0.09345  0.47132 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.76919    0.08618  -32.13   <2e-16 ***
## BB_per_game       0.37121    0.01174   31.61   <2e-16 ***
## singles_per_game  0.51939    0.01272   40.83   <2e-16 ***
## doubles_per_game  0.77114    0.02261   34.11   <2e-16 ***
## triples_per_game  1.23997    0.07679   16.15   <2e-16 ***
## HR_per_game       1.44337    0.02435   59.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1498 on 1020 degrees of freedom
## Multiple R-squared:  0.9355, Adjusted R-squared:  0.9352 
## F-statistic:  2961 on 5 and 1020 DF,  p-value: < 2.2e-16

All covariates are significant with nonzero slopes. The R squared is 0.93. To exam the model adequacy, we check the residuals.

par(mfrow=c(2,2))
plot(fit,pch=20)

The residuals all look OK . There seems to be no obviouis outliers. Normality, linearity seems, constant variance all seem to hold, which means our model assumption is valid, and we safely use the results from our regression output.

Next, we further exam the goodness of fit by plotting the fitted value against the actual response variable. Ideally, if the model fits the data very well, the plot will mostly be an straight line around 45 degree.

ggplot(aes(fit$fitted, R_per_game),data=Teams_perGame) + 
  geom_point() + geom_abline() + xlim(2.5, 6.5) + ylim(2.5, 6.5)

The results seems to be consistent with our expectation. Most of the dots are around 45 degree line. Next we take this model, and exam its out of sample performance, that is, how well it predicts a new observation. Note that we only used the data from 1961 through 2001 to build the linear regression. We can use this regression to predict the number of runs of each team in 2002 and compare with the actual observations. Here is the analysis and scatter plot of predict vs actual.

# predict number of runs for each team in 2002 and plot
Teams %>% 
  filter(yearID %in% 2002) %>% 
  mutate(BB_per_game      = BB/G, 
         singles_per_game = (H-X2B-X3B-HR)/G, 
         doubles_per_game = X2B/G, 
         triples_per_game =X3B/G, 
         HR_per_game      = HR/G,
         R_per_game       = R/G)  %>% 
  mutate(R_per_game_hat = predict(fit, newdata = .)) %>%
  ggplot(aes(R_per_game_hat, R_per_game, label = teamID)) + 
  geom_point() + geom_text(nudge_x=0.1, cex = 2) + 
  geom_abline() + xlim(2.5, 6.5) + ylim(2.5, 6.5)

Put these two figures side by side, we have

Therefore, our conclusion is that bases on balls, single, double, triple, and home run are strong predictors of number of runs. Therefore, we need to use these these variables to predict each individual player’s runs.

3 Player Analysis

For player analysis, we use the data set called Batting which stores yearly batting statistics of each players. This is a data frame with 105861 observations on the following 22 variables.

data(Batting)
str(Batting)
## 'data.frame':    105861 obs. of  22 variables:
##  $ playerID: chr  "abercda01" "addybo01" "allisar01" "allisdo01" ...
##  $ yearID  : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID  : Factor w/ 149 levels "ALT","ANA","ARI",..: 136 111 39 142 111 56 111 24 56 24 ...
##  $ lgID    : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ G       : int  1 25 29 27 25 12 1 31 1 18 ...
##  $ AB      : int  4 118 137 133 120 49 4 157 5 86 ...
##  $ R       : int  0 30 28 28 29 9 0 66 1 13 ...
##  $ H       : int  0 32 40 44 39 11 1 63 1 13 ...
##  $ X2B     : int  0 6 4 10 11 2 0 10 1 2 ...
##  $ X3B     : int  0 0 5 2 3 1 0 9 0 1 ...
##  $ HR      : int  0 0 0 2 0 0 0 0 0 0 ...
##  $ RBI     : int  0 13 19 27 16 5 2 34 1 11 ...
##  $ SB      : int  0 8 3 1 6 0 0 11 0 1 ...
##  $ CS      : int  0 1 1 1 2 1 0 6 0 0 ...
##  $ BB      : int  0 4 2 0 2 0 1 13 0 0 ...
##  $ SO      : int  0 0 5 2 1 1 0 1 0 0 ...
##  $ IBB     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HBP     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SH      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ GIDP    : int  0 0 1 0 0 0 0 1 0 0 ...
tail(Batting)
##         playerID yearID stint teamID lgID   G  AB  R   H X2B X3B HR RBI SB CS
## 105856 zieglbr01   2018     2    ARI   NL  29   0  0   0   0   0  0   0  0  0
## 105857 zimmebr01   2018     1    CLE   AL  34 106 14  24   5   0  2   9  4  1
## 105858 zimmejo02   2018     1    DET   AL  25   2  0   0   0   0  0   0  0  0
## 105859 zimmery01   2018     1    WAS   NL  85 288 33  76  21   2 13  51  1  1
## 105860 zobribe01   2018     1    CHN   NL 139 455 67 139  28   3  9  58  3  4
## 105861 zuninmi01   2018     1    SEA   AL 113 373 37  75  18   0 20  44  0  0
##        BB  SO IBB HBP SH SF GIDP
## 105856  0   0   0   0  0  0    0
## 105857  7  44   0   1  0  0    1
## 105858  0   2   0   0  0  0    0
## 105859 30  55   1   3  0  2   10
## 105860 55  60   1   2  1  7    8
## 105861 24 150   0   6  0  2    7

Here is the description of the variables.

Variable Description
playerID Player ID code
yearID Year
stint player’s stint (order of appearances within a season)
teamID Team; a factor
lgID League; a factor with levels AA AL FL NL PL UA
G Games: number of games in which a player played
AB At Bats
R Runs
H Hits: times reached base because of a batted, fair ball without error by the defense
X2B Doubles: hits on which the batter reached second base safely
X3B Triples: hits on which the batter reached third base safely
HR Homeruns
RBI Runs Batted In
SB Stolen Bases
CS Caught Stealing
BB Base on Balls
SO Strikeouts
IBB Intentional walks
HBP Hit by pitch
SH Sacrifice hits
SF Sacrifice flies
GIDP Grounded into double plays
fit2 = Batting %>% filter(yearID %in% 1996:2001 ) %>% 
  mutate(HR_per_game=HR/G,
         R_per_game=R/G,
         BB_per_game = BB / G,
         singles_per_game = (H - X2B - X3B - HR) / G,
         doubles_per_game = X2B / G,
         triples_per_game = X3B / G) %>%
  lm(R_per_game ~
       BB_per_game + 
       singles_per_game + 
       doubles_per_game + 
       triples_per_game + 
       HR_per_game, data = .)
summary(fit2)
## 
## Call:
## lm(formula = R_per_game ~ BB_per_game + singles_per_game + doubles_per_game + 
##     triples_per_game + HR_per_game, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65852 -0.02355 -0.00200  0.01452  0.99800 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.001995   0.001201   1.661   0.0968 .  
## BB_per_game      0.276183   0.007519  36.733   <2e-16 ***
## singles_per_game 0.328264   0.005372  61.101   <2e-16 ***
## doubles_per_game 0.426012   0.015807  26.951   <2e-16 ***
## triples_per_game 0.992333   0.052120  19.039   <2e-16 ***
## HR_per_game      0.857241   0.017347  49.418   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07668 on 7827 degrees of freedom
## Multiple R-squared:  0.8901, Adjusted R-squared:  0.8901 
## F-statistic: 1.268e+04 on 5 and 7827 DF,  p-value: < 2.2e-16
# average number of team plate appearances per game
pa_per_game <- Batting %>% filter(yearID == 2002) %>% 
  group_by(teamID) %>%
  summarize(pa_per_game = sum(AB+BB)/max(G)) %>% 
  pull(pa_per_game) %>% 
  mean
# compute per-plate-appearance rates for players available in 2002 using previous data
players <- Batting %>% filter(yearID %in% 1999:2001) %>% 
  group_by(playerID) %>%
  mutate(PA = BB + AB) %>%
  summarize(G = sum(PA)/pa_per_game,
    BB_per_game = sum(BB)/G,
    singles_per_game = sum(H-X2B-X3B-HR)/G,
    doubles_per_game = sum(X2B)/G, 
    triples_per_game = sum(X3B)/G, 
    HR_per_game = sum(HR)/G,
    AVG = sum(H)/sum(AB),
    PA = sum(PA)) %>%
  filter(PA >= 300) %>%
  select(-G) %>%
  mutate(R_hat = predict(fit, newdata = .))

# plot player-specific predicted runs
qplot(R_hat, data = players, geom = "histogram", binwidth = 0.5, color = I("black"))

3.1 Salary and Position

Finally, we include the salary and position information of the players and pick the players who are “undervalued”.

data(Salaries)
str(Salaries)
## 'data.frame':    26428 obs. of  5 variables:
##  $ yearID  : int  1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
##  $ teamID  : Factor w/ 35 levels "ANA","ARI","ATL",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lgID    : Factor w/ 2 levels "AL","NL": 2 2 2 2 2 2 2 2 2 2 ...
##  $ playerID: chr  "barkele01" "bedrost01" "benedbr01" "campri01" ...
##  $ salary  : int  870000 550000 545000 633333 625000 800000 150000 483333 772000 250000 ...
tail(Salaries)
##       yearID teamID lgID  playerID   salary
## 26423   2016    WAS   NL scherma01 22142857
## 26424   2016    WAS   NL strasst01 10400000
## 26425   2016    WAS   NL taylomi02   524000
## 26426   2016    WAS   NL treinbl01   524900
## 26427   2016    WAS   NL werthja01 21733615
## 26428   2016    WAS   NL zimmery01 14000000
# add 2002 salary of each player
players <- Salaries %>% 
  filter(yearID == 2002) %>%
  select(playerID, salary) %>%
  right_join(players, by="playerID")

head(players)
##    playerID  salary BB_per_game singles_per_game doubles_per_game
## 1 abbotje01      NA    3.275949         6.213007         2.033348
## 2 abbotku01      NA    2.411612         5.868256         1.929290
## 3 abernbr01  215000    3.160596         6.906488         1.990005
## 4 abreubo01 6333333    6.027244         5.912439         2.391763
## 5 agbaybe01  600000    4.527856         6.309307         1.892792
## 6 alexama02      NA    2.261856         6.392200         1.475123
##   triples_per_game HR_per_game       AVG   PA    R_hat
## 1        0.1129638   0.5648188 0.2515924  343 4.197201
## 2        0.2411612   1.1254190 0.2522124  482 4.585157
## 3        0.1170591   0.5852956 0.2697368  331 4.515780
## 4        0.4783527   1.4541922 0.3128655 2025 7.075567
## 5        0.2226814   1.2989750 0.2841649 1044 5.799263
## 6        0.4917077   0.3933662 0.2398922  394 3.705517
# add defensive position
position_names <- c("G_p","G_c","G_1b","G_2b","G_3b","G_ss","G_lf","G_cf","G_rf")
tmp_tab <- Appearances %>% 
  filter(yearID == 2002) %>% 
  group_by(playerID) %>%
  summarize_at(position_names, sum) %>%
  ungroup()  
pos <- tmp_tab %>%
  select(position_names) %>%
  apply(., 1, which.max) 
players <- data_frame(playerID = tmp_tab$playerID, POS = position_names[pos]) %>%
  mutate(POS = str_to_upper(str_remove(POS, "G_"))) %>%
  filter(POS != "P") %>%
  right_join(players, by="playerID") %>%
  filter(!is.na(POS)  & !is.na(salary))
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
head(players)
## # A tibble: 6 x 11
##   playerID POS   salary BB_per_game singles_per_game doubles_per_game
##   <chr>    <chr>  <int>       <dbl>            <dbl>            <dbl>
## 1 abernbr~ 2B    2.15e5        3.16             6.91             1.99
## 2 abreubo~ RF    6.33e6        6.03             5.91             2.39
## 3 agbaybe~ LF    6.00e5        4.53             6.31             1.89
## 4 alfoned~ 3B    6.20e6        4.81             6.31             2.15
## 5 alicelu~ 2B    8.00e5        3.55             7.16             1.65
## 6 alomaro~ 2B    7.94e6        4.73             7.20             2.22
## # ... with 5 more variables: triples_per_game <dbl>, HR_per_game <dbl>,
## #   AVG <dbl>, PA <int>, R_hat <dbl>
str(players)
## Classes 'tbl_df', 'tbl' and 'data.frame':    342 obs. of  11 variables:
##  $ playerID        : chr  "abernbr01" "abreubo01" "agbaybe01" "alfoned01" ...
##  $ POS             : chr  "2B" "RF" "LF" "3B" ...
##  $ salary          : int  215000 6333333 600000 6200000 800000 7939664 2500000 6000000 200000 5000000 ...
##  $ BB_per_game     : num  3.16 6.03 4.53 4.81 3.55 ...
##  $ singles_per_game: num  6.91 5.91 6.31 6.31 7.16 ...
##  $ doubles_per_game: num  1.99 2.39 1.89 2.15 1.65 ...
##  $ triples_per_game: num  0.1171 0.4784 0.2227 0.0625 0.3871 ...
##  $ HR_per_game     : num  0.585 1.454 1.299 1.437 0.419 ...
##  $ AVG             : num  0.27 0.313 0.284 0.293 0.273 ...
##  $ PA              : int  331 2025 1044 1860 1201 1991 745 1076 1748 2024 ...
##  $ R_hat           : num  4.52 7.08 5.8 6.1 4.62 ...
# add players' first and last names
players <- Master %>%
  select(playerID, nameFirst, nameLast, debut) %>%
  mutate(debut = as.Date(debut)) %>%
  right_join(players, by="playerID")

head(players)
##    playerID nameFirst  nameLast      debut POS  salary BB_per_game
## 1 abernbr01     Brent Abernathy 2001-06-25  2B  215000    3.160596
## 2 abreubo01     Bobby     Abreu 1996-09-01  RF 6333333    6.027244
## 3 agbaybe01     Benny  Agbayani 1998-06-17  LF  600000    4.527856
## 4 alfoned01   Edgardo   Alfonzo 1995-04-26  3B 6200000    4.812074
## 5 alicelu01      Luis    Alicea 1988-04-23  2B  800000    3.548811
## 6 alomaro01   Roberto    Alomar 1988-04-22  2B 7939664    4.728989
##   singles_per_game doubles_per_game triples_per_game HR_per_game       AVG   PA
## 1         6.906488         1.990005       0.11705912   0.5852956 0.2697368  331
## 2         5.912439         2.391763       0.47835270   1.4541922 0.3128655 2025
## 3         6.309307         1.892792       0.22268143   1.2989750 0.2841649 1044
## 4         6.311941         2.145643       0.06249447   1.4373727 0.2934316 1860
## 5         7.162147         1.645358       0.38714307   0.4194050 0.2731439 1201
## 6         7.200518         2.218538       0.33083459   1.2260341 0.3226545 1991
##      R_hat
## 1 4.515780
## 2 7.075567
## 3 5.799263
## 4 6.102253
## 5 4.622360
## 6 6.616837
str(players)
## 'data.frame':    342 obs. of  14 variables:
##  $ playerID        : chr  "abernbr01" "abreubo01" "agbaybe01" "alfoned01" ...
##  $ nameFirst       : chr  "Brent" "Bobby" "Benny" "Edgardo" ...
##  $ nameLast        : chr  "Abernathy" "Abreu" "Agbayani" "Alfonzo" ...
##  $ debut           : Date, format: "2001-06-25" "1996-09-01" ...
##  $ POS             : chr  "2B" "RF" "LF" "3B" ...
##  $ salary          : int  215000 6333333 600000 6200000 800000 7939664 2500000 6000000 200000 5000000 ...
##  $ BB_per_game     : num  3.16 6.03 4.53 4.81 3.55 ...
##  $ singles_per_game: num  6.91 5.91 6.31 6.31 7.16 ...
##  $ doubles_per_game: num  1.99 2.39 1.89 2.15 1.65 ...
##  $ triples_per_game: num  0.1171 0.4784 0.2227 0.0625 0.3871 ...
##  $ HR_per_game     : num  0.585 1.454 1.299 1.437 0.419 ...
##  $ AVG             : num  0.27 0.313 0.284 0.293 0.273 ...
##  $ PA              : int  331 2025 1044 1860 1201 1991 745 1076 1748 2024 ...
##  $ R_hat           : num  4.52 7.08 5.8 6.1 4.62 ...
# top 10 players
players %>% select(nameFirst, nameLast, POS, salary, R_hat) %>% 
  arrange(desc(R_hat)) %>% 
  top_n(10) 
## Selecting by R_hat
##    nameFirst    nameLast POS   salary    R_hat
## 1      Barry       Bonds  LF 15000000 9.052454
## 2       Todd      Helton  1B  5000000 8.228108
## 3      Manny     Ramirez  LF 15462727 8.199474
## 4      Sammy        Sosa  RF 15000000 8.185319
## 5      Larry      Walker  RF 12666667 8.146729
## 6      Jason      Giambi  1B 10428571 7.993795
## 7    Chipper       Jones  LF 11333333 7.640781
## 8      Brian       Giles  LF  8063003 7.572229
## 9     Albert      Pujols  LF   600000 7.536447
## 10     Nomar Garciaparra  SS  9000000 7.505063
# players with a higher metric have higher salaries
plot1 <- players %>% ggplot(aes(salary, R_hat, color = POS)) + 
  geom_point()

plot2 <- players %>% ggplot(aes(salary, R_hat, color = POS)) + 
  geom_point() +
  scale_x_log10()

grid.arrange(plot1, plot2, ncol=2)

# remake plot without players that debuted after 1998
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
players %>% filter(year(debut) < 1998) %>%
 ggplot(aes(salary, R_hat, color = POS, label = nameLast)) + 
  geom_point() +
  geom_text(nudge_x=0.1, cex = 2) + 
  scale_x_log10()