Acknowledgment: the materials below are partially based on the book “Introduction to Data Science Data Analysis and Prediction Algorithms with R” by Rafael A. Irizarry. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.
The movie Moneyball (Wiki, YouTube) is about the story of how the Oakland Athletics baseball team’s general manager, Billy Beane, assembled a competitive team using analytics. The movie is based on Michael Lewis’s 2003 nonfiction book of the same name “Moneyball: The Art of Winning an Unfair Game” (Amazon).
In 2002, Oakland A’s GM, Billy Beane (played by Brad Pitt), and assistant GM, Peter Brand (played by Jonah Hill), faced with the franchise’s limited budget for players. Having lost a few key players to other teams, Billy and Peter built a team of undervalued talent by taking a sophisticated sabermetric approach to scouting and analyzing players. To get a sense of the budgets in 2002, A’s has $40 million payroll while Yankees has $125 million.
Now we pretend to be the A’s in 2002 and try to build a baseball team with a limited budget. Note that in 2002 the Yankees’ payroll of $125 million more than tripled the Oakland A’s $40 million:
#library(tidyverse)#install.packages("tidyverse")
library(gridExtra) #install.packages("gridExtra")
library(Lahman) #install.packages("Lahman")
library(dplyr) #install.packages("dplyr")
library(tidyr) #install.packages("tidyr")
library(tibble) #install.packages("tibble")
library(readr) #install.packages("readr")
library(purrr) #install.packages("purrr")
library(stringr) #install.packages("stringr")
library(forcats) #install.packages("forcats")
library(dslabs) #install.packages("dslabs")
library(ggplot2) #install.packages("ggplot2")
library(broom) #install.packages("broom")
#ds_theme_set()
Let’s start with team level analysis. The data is in R package Lahman
[https://cran.r-project.org/web/packages/Lahman/Lahman.pdf] Page 56-58. We begin by exploring the team level data. Yearly statistics and standings for teams
data(Teams)
str(Teams)
## 'data.frame': 2895 obs. of 48 variables:
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
## $ franchID : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
## $ divID : chr NA NA NA NA ...
## $ Rank : int 3 2 8 7 5 1 9 6 4 2 ...
## $ G : int 31 28 29 19 33 28 25 29 32 58 ...
## $ Ghome : int NA NA NA NA NA NA NA NA NA NA ...
## $ W : int 20 19 10 7 16 21 4 13 15 35 ...
## $ L : int 10 9 19 12 17 7 21 15 15 19 ...
## $ DivWin : chr NA NA NA NA ...
## $ WCWin : chr NA NA NA NA ...
## $ LgWin : chr "N" "N" "N" "N" ...
## $ WSWin : chr NA NA NA NA ...
## $ R : int 401 302 249 137 302 376 231 351 310 617 ...
## $ AB : int 1372 1196 1186 746 1404 1281 1036 1248 1353 2571 ...
## $ H : int 426 323 328 178 403 410 274 384 375 753 ...
## $ X2B : int 70 52 35 19 43 66 44 51 54 106 ...
## $ X3B : int 37 21 40 8 21 27 25 34 26 31 ...
## $ HR : int 3 10 7 2 1 9 3 6 6 14 ...
## $ BB : num 60 60 26 33 33 46 38 49 48 29 ...
## $ SO : int 19 22 25 9 15 23 30 19 13 28 ...
## $ SB : num 73 69 18 16 46 56 53 62 48 53 ...
## $ CS : num 16 21 8 4 15 12 10 24 13 18 ...
## $ HBP : num NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ RA : int 303 241 341 243 313 266 287 362 303 434 ...
## $ ER : int 109 77 116 97 121 137 108 153 137 166 ...
## $ ERA : num 3.55 2.76 4.11 5.17 3.72 4.95 4.3 5.51 4.37 2.9 ...
## $ CG : int 22 25 23 19 32 27 23 28 32 48 ...
## $ SHO : int 1 0 0 1 1 0 1 0 0 1 ...
## $ SV : int 3 1 0 0 0 0 0 0 0 1 ...
## $ IPouts : int 828 753 762 507 879 747 678 750 846 1548 ...
## $ HA : int 367 308 346 261 373 329 315 431 371 573 ...
## $ HRA : int 2 6 13 5 7 3 3 4 4 3 ...
## $ BBA : int 42 28 53 21 42 53 34 75 45 63 ...
## $ SOA : int 23 22 34 17 22 16 16 12 13 77 ...
## $ E : int 243 229 234 163 235 194 220 198 218 432 ...
## $ DP : int 24 16 15 8 14 13 14 22 20 22 ...
## $ FP : num 0.834 0.829 0.818 0.803 0.84 0.845 0.821 0.845 0.85 0.83 ...
## $ name : chr "Boston Red Stockings" "Chicago White Stockings" "Cleveland Forest Citys" "Fort Wayne Kekiongas" ...
## $ park : chr "South End Grounds I" "Union Base-Ball Grounds" "National Association Grounds" "Hamilton Field" ...
## $ attendance : int NA NA NA NA NA NA NA NA NA NA ...
## $ BPF : int 103 104 96 101 90 102 97 101 94 106 ...
## $ PPF : int 98 102 100 107 88 98 99 100 98 102 ...
## $ teamIDBR : chr "BOS" "CHI" "CLE" "KEK" ...
## $ teamIDlahman45: chr "BS1" "CH1" "CL1" "FW1" ...
## $ teamIDretro : chr "BS1" "CH1" "CL1" "FW1" ...
tail(Teams)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 2890 2018 NL SFN SFG W 4 162 81 73 89 N N N
## 2891 2018 NL SLN STL C 3 162 81 88 74 N N N
## 2892 2018 AL TBA TBD E 3 162 81 90 72 N N N
## 2893 2018 AL TEX TEX W 5 162 81 67 95 N N N
## 2894 2018 AL TOR TOR E 4 162 81 73 89 N N N
## 2895 2018 NL WAS WSN E 2 162 81 82 80 N N N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO
## 2890 N 603 5541 1324 255 30 133 448 1467 77 34 49 42 699 641 3.95 1 15
## 2891 N 759 5498 1369 248 9 205 525 1380 63 32 80 48 691 622 3.85 1 8
## 2892 N 716 5475 1415 274 43 150 540 1388 128 51 101 50 646 602 3.74 0 14
## 2893 N 737 5453 1308 266 24 194 555 1484 74 35 88 34 848 783 4.92 1 5
## 2894 N 709 5477 1336 320 16 217 499 1387 47 30 58 37 832 772 4.85 0 3
## 2895 N 771 5517 1402 284 25 191 631 1289 119 33 59 40 682 649 4.04 2 7
## SV IPouts HA HRA BBA SOA E DP FP name
## 2890 36 4384 1387 156 524 1269 97 160 0.984 San Francisco Giants
## 2891 43 4366 1354 144 593 1337 133 151 0.978 St. Louis Cardinals
## 2892 52 4345 1236 164 501 1421 85 136 0.986 Tampa Bay Rays
## 2893 42 4293 1516 222 491 1121 120 168 0.980 Texas Rangers
## 2894 39 4301 1476 208 551 1298 101 138 0.983 Toronto Blue Jays
## 2895 40 4338 1320 198 487 1417 64 115 0.989 Washington Nationals
## park attendance BPF PPF teamIDBR teamIDlahman45
## 2890 AT&T Park 3156185 95 96 SFG SFN
## 2891 Busch Stadium III 3403587 97 96 STL SLN
## 2892 Tropicana Field 1154973 97 97 TBR TBA
## 2893 Rangers Ballpark in Arlington 2107107 112 113 TEX TEX
## 2894 Rogers Centre 2325281 97 98 TOR TOR
## 2895 Nationals Park 2529604 106 105 WSN MON
## teamIDretro
## 2890 SFN
## 2891 SLN
## 2892 TBA
## 2893 TEX
## 2894 TOR
## 2895 WAS
A data frame with 2895 observations on the following 48 variables.
Variable | Description |
---|---|
yearID |
Year |
lgID |
League; a factor with levels AA AL FL NL PL UA |
teamID |
Team; a factor |
franchID |
Franchise (links to TeamsFranchises table) |
divID |
Team’s division; a factor with levels C E W |
Rank |
Position in final standings |
G |
Games played |
Ghome |
Games played at home |
W |
Wins |
L |
Losses |
DivWin |
Division Winner (Y or N) |
WCWin |
Wild Card Winner (Y or N) |
LgWin |
League Champion(Y or N) |
WSWin |
World Series Winner (Y or N) |
R |
Runs scored |
AB |
At bats |
H |
Hits by batters |
X2B |
Doubles |
X3B |
Triples |
HR |
Homeruns by batters |
BB |
Walks by batters |
SO |
Strikeouts by batters |
SB |
Stolen bases |
CS |
Caught stealing |
HBP |
Batters hit by pitch |
SF |
Sacrifice flies |
RA |
Opponents runs scored |
ER |
Earned runs allowed |
ERA |
Earned run average |
CG |
Complete games |
SHO |
Shutouts |
SV |
Saves |
IPouts |
Outs Pitched (innings pitched x 3) |
HA |
Hits allowed |
HRA |
Homeruns allowed |
BBA |
Walks allowed |
SOA |
Strikeouts by pitchers |
E |
Errors |
DP |
Double Plays |
FP |
Fielding percentage |
name |
Team’s full name |
park |
Name of team’s home ballpark |
attendance |
Home attendance total |
BPF |
Three-year park factor for batters |
PPF |
Three-year park factor for pitchers |
teamIDBR |
Team ID used by Baseball Reference website |
teamIDlahman45 |
Team ID used in Lahman database version 4.5 |
teamIDretro |
Team ID used by Retrosheet |
Defining the response variable is one of the key questions. To assemble a winning team, there are many ways to define the response variable. For example, we could use number of wins or number of losses as the response variable and so on. However, for simplicity, we want to focus our analysis on the game level for offense only and ignore defense and pitching. Therefore, our response variable is the number of runs (R). A run represents a batter goes around the bases and arrives home safely and scores.
We use runs (R) as the response variable and plot against all other variables. Clearly, we see clusters. So we tentatively use blue dots for games after 1961 and red dots for games before 1961.
Teams_type=sapply(Teams, class)
col_index=which(Teams_type %in% c("integer","numeric"))
length(col_index)
## [1] 35
par(mfrow=c(7,5))
par(mar = c(4, 4, 0.1, 0.1)) #par(mar = c(bottom, left, top, right))
par(mgp = c(1.5,0.5,0))
for (i in 1:length(col_index))
{
plot(Teams[,col_index[i]],Teams[,"R"],pch=46,xlab=names(Teams)[col_index[i]],ylab="R",col=ifelse(Teams$yearID<1961,"red","blue"))
}
Since we have observations dated back to 1871, when the league was operating in much different ways. We decide to focus on more recent games. Here is a visulization of number of games played by years. We can see that after 1961, the number of games is more or less fixed at 162.
Teams %>% ggplot(aes(yearID, G)) + geom_point(alpha = 0.5) +
geom_vline(xintercept = 1961) +
geom_hline(yintercept = 162, linetype="dotted", color = "blue")
Currently, each observation represent one team per season, we convert the data into game level statistics by dividing a few key variables by the number of games played. The key variables are pre-selected for demonstration purpose, but in practice, it can be selected by model selection techniques together with analyst’s understanding about the game of baseball. Since we are pretending we are A’s in 2002, we exclude all the data after 2002 too. Clearly, there are much more analysis to be done to visualize and clean the data, but these are the basic analysis before any meaningful analysis. We visualize the data as follows.
Teams_1961to2001=filter(Teams, yearID %in% 1961:2001)
#The above is equivalent to: Teams_1961to2001=Teams[(Teams$yearID<=2001)&(Teams$yearID>=1961),]
Teams_perGame = mutate(Teams_1961to2001,
R_per_game = R / G,
SB_per_game = SB / G,
BB_per_game = BB / G,
singles_per_game = (H - X2B - X3B - HR) / G,
doubles_per_game = X2B / G,
triples_per_game = X3B / G,
HR_per_game = HR / G,
AB_per_game = AB / G,
)
pairs(Teams_perGame[,c("R_per_game","singles_per_game","doubles_per_game","triples_per_game","HR_per_game","BB_per_game","SB_per_game","AB_per_game")],pch=46,cex=2)
The variables we created as explained as follows.
Some other variables that may also be useful are:
For each plot above, we generate it individually as follows.
Teams_1961to2001=filter(Teams, yearID %in% 1961:2001)
Teams_1961to2001_added_columns=mutate(Teams_1961to2001, HR_per_game = HR / G, R_per_game = R / G)
plot(Teams_1961to2001_added_columns$HR_per_game, Teams_1961to2001_added_columns$R_per_game, pch=20)
The code above seems a little too much. Note that the code above can be succintly written in the following way.
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
ggplot(aes(HR_per_game, R_per_game)) + geom_point(alpha = 0.5)
There are many ways to decide covariates. For the moments, we will focus on five covaraites, BB, singles, doubles, triples, and HR. Obviously, SB was an candidate, but we will come back it later.
# regression with BB, singles, doubles, triples, HR
fit=lm(R_per_game ~ BB_per_game + singles_per_game + doubles_per_game + triples_per_game + HR_per_game, data = Teams_perGame)
summary(fit)
##
## Call:
## lm(formula = R_per_game ~ BB_per_game + singles_per_game + doubles_per_game +
## triples_per_game + HR_per_game, data = Teams_perGame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50830 -0.09443 -0.00069 0.09345 0.47132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.76919 0.08618 -32.13 <2e-16 ***
## BB_per_game 0.37121 0.01174 31.61 <2e-16 ***
## singles_per_game 0.51939 0.01272 40.83 <2e-16 ***
## doubles_per_game 0.77114 0.02261 34.11 <2e-16 ***
## triples_per_game 1.23997 0.07679 16.15 <2e-16 ***
## HR_per_game 1.44337 0.02435 59.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1498 on 1020 degrees of freedom
## Multiple R-squared: 0.9355, Adjusted R-squared: 0.9352
## F-statistic: 2961 on 5 and 1020 DF, p-value: < 2.2e-16
All covariates are significant with nonzero slopes. The R squared is 0.93. To exam the model adequacy, we check the residuals.
par(mfrow=c(2,2))
plot(fit,pch=20)
The residuals all look OK . There seems to be no obviouis outliers. Normality, linearity seems, constant variance all seem to hold, which means our model assumption is valid, and we safely use the results from our regression output.
Next, we further exam the goodness of fit by plotting the fitted value against the actual response variable. Ideally, if the model fits the data very well, the plot will mostly be an straight line around 45 degree.
ggplot(aes(fit$fitted, R_per_game),data=Teams_perGame) +
geom_point() + geom_abline() + xlim(2.5, 6.5) + ylim(2.5, 6.5)
The results seems to be consistent with our expectation. Most of the dots are around 45 degree line. Next we take this model, and exam its out of sample performance, that is, how well it predicts a new observation. Note that we only used the data from 1961 through 2001 to build the linear regression. We can use this regression to predict the number of runs of each team in 2002 and compare with the actual observations. Here is the analysis and scatter plot of predict vs actual.
# predict number of runs for each team in 2002 and plot
Teams %>%
filter(yearID %in% 2002) %>%
mutate(BB_per_game = BB/G,
singles_per_game = (H-X2B-X3B-HR)/G,
doubles_per_game = X2B/G,
triples_per_game =X3B/G,
HR_per_game = HR/G,
R_per_game = R/G) %>%
mutate(R_per_game_hat = predict(fit, newdata = .)) %>%
ggplot(aes(R_per_game_hat, R_per_game, label = teamID)) +
geom_point() + geom_text(nudge_x=0.1, cex = 2) +
geom_abline() + xlim(2.5, 6.5) + ylim(2.5, 6.5)
Put these two figures side by side, we have
Therefore, our conclusion is that bases on balls, single, double, triple, and home run are strong predictors of number of runs. Therefore, we need to use these these variables to predict each individual player’s runs.
For player analysis, we use the data set called Batting
which stores yearly batting statistics of each players. This is a data frame with 105861 observations on the following 22 variables.
data(Batting)
str(Batting)
## 'data.frame': 105861 obs. of 22 variables:
## $ playerID: chr "abercda01" "addybo01" "allisar01" "allisdo01" ...
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 136 111 39 142 111 56 111 24 56 24 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ G : int 1 25 29 27 25 12 1 31 1 18 ...
## $ AB : int 4 118 137 133 120 49 4 157 5 86 ...
## $ R : int 0 30 28 28 29 9 0 66 1 13 ...
## $ H : int 0 32 40 44 39 11 1 63 1 13 ...
## $ X2B : int 0 6 4 10 11 2 0 10 1 2 ...
## $ X3B : int 0 0 5 2 3 1 0 9 0 1 ...
## $ HR : int 0 0 0 2 0 0 0 0 0 0 ...
## $ RBI : int 0 13 19 27 16 5 2 34 1 11 ...
## $ SB : int 0 8 3 1 6 0 0 11 0 1 ...
## $ CS : int 0 1 1 1 2 1 0 6 0 0 ...
## $ BB : int 0 4 2 0 2 0 1 13 0 0 ...
## $ SO : int 0 0 5 2 1 1 0 1 0 0 ...
## $ IBB : int NA NA NA NA NA NA NA NA NA NA ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SH : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ GIDP : int 0 0 1 0 0 0 0 1 0 0 ...
tail(Batting)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS
## 105856 zieglbr01 2018 2 ARI NL 29 0 0 0 0 0 0 0 0 0
## 105857 zimmebr01 2018 1 CLE AL 34 106 14 24 5 0 2 9 4 1
## 105858 zimmejo02 2018 1 DET AL 25 2 0 0 0 0 0 0 0 0
## 105859 zimmery01 2018 1 WAS NL 85 288 33 76 21 2 13 51 1 1
## 105860 zobribe01 2018 1 CHN NL 139 455 67 139 28 3 9 58 3 4
## 105861 zuninmi01 2018 1 SEA AL 113 373 37 75 18 0 20 44 0 0
## BB SO IBB HBP SH SF GIDP
## 105856 0 0 0 0 0 0 0
## 105857 7 44 0 1 0 0 1
## 105858 0 2 0 0 0 0 0
## 105859 30 55 1 3 0 2 10
## 105860 55 60 1 2 1 7 8
## 105861 24 150 0 6 0 2 7
Here is the description of the variables.
Variable | Description |
---|---|
playerID | Player ID code |
yearID | Year |
stint | player’s stint (order of appearances within a season) |
teamID | Team; a factor |
lgID | League; a factor with levels AA AL FL NL PL UA |
G | Games: number of games in which a player played |
AB | At Bats |
R | Runs |
H | Hits: times reached base because of a batted, fair ball without error by the defense |
X2B | Doubles: hits on which the batter reached second base safely |
X3B | Triples: hits on which the batter reached third base safely |
HR | Homeruns |
RBI | Runs Batted In |
SB | Stolen Bases |
CS | Caught Stealing |
BB | Base on Balls |
SO | Strikeouts |
IBB | Intentional walks |
HBP | Hit by pitch |
SH | Sacrifice hits |
SF | Sacrifice flies |
GIDP | Grounded into double plays |
fit2 = Batting %>% filter(yearID %in% 1996:2001 ) %>%
mutate(HR_per_game=HR/G,
R_per_game=R/G,
BB_per_game = BB / G,
singles_per_game = (H - X2B - X3B - HR) / G,
doubles_per_game = X2B / G,
triples_per_game = X3B / G) %>%
lm(R_per_game ~
BB_per_game +
singles_per_game +
doubles_per_game +
triples_per_game +
HR_per_game, data = .)
summary(fit2)
##
## Call:
## lm(formula = R_per_game ~ BB_per_game + singles_per_game + doubles_per_game +
## triples_per_game + HR_per_game, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65852 -0.02355 -0.00200 0.01452 0.99800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001995 0.001201 1.661 0.0968 .
## BB_per_game 0.276183 0.007519 36.733 <2e-16 ***
## singles_per_game 0.328264 0.005372 61.101 <2e-16 ***
## doubles_per_game 0.426012 0.015807 26.951 <2e-16 ***
## triples_per_game 0.992333 0.052120 19.039 <2e-16 ***
## HR_per_game 0.857241 0.017347 49.418 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07668 on 7827 degrees of freedom
## Multiple R-squared: 0.8901, Adjusted R-squared: 0.8901
## F-statistic: 1.268e+04 on 5 and 7827 DF, p-value: < 2.2e-16
# average number of team plate appearances per game
pa_per_game <- Batting %>% filter(yearID == 2002) %>%
group_by(teamID) %>%
summarize(pa_per_game = sum(AB+BB)/max(G)) %>%
pull(pa_per_game) %>%
mean
# compute per-plate-appearance rates for players available in 2002 using previous data
players <- Batting %>% filter(yearID %in% 1999:2001) %>%
group_by(playerID) %>%
mutate(PA = BB + AB) %>%
summarize(G = sum(PA)/pa_per_game,
BB_per_game = sum(BB)/G,
singles_per_game = sum(H-X2B-X3B-HR)/G,
doubles_per_game = sum(X2B)/G,
triples_per_game = sum(X3B)/G,
HR_per_game = sum(HR)/G,
AVG = sum(H)/sum(AB),
PA = sum(PA)) %>%
filter(PA >= 300) %>%
select(-G) %>%
mutate(R_hat = predict(fit, newdata = .))
# plot player-specific predicted runs
qplot(R_hat, data = players, geom = "histogram", binwidth = 0.5, color = I("black"))
Finally, we include the salary and position information of the players and pick the players who are “undervalued”.
data(Salaries)
str(Salaries)
## 'data.frame': 26428 obs. of 5 variables:
## $ yearID : int 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
## $ teamID : Factor w/ 35 levels "ANA","ARI","ATL",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lgID : Factor w/ 2 levels "AL","NL": 2 2 2 2 2 2 2 2 2 2 ...
## $ playerID: chr "barkele01" "bedrost01" "benedbr01" "campri01" ...
## $ salary : int 870000 550000 545000 633333 625000 800000 150000 483333 772000 250000 ...
tail(Salaries)
## yearID teamID lgID playerID salary
## 26423 2016 WAS NL scherma01 22142857
## 26424 2016 WAS NL strasst01 10400000
## 26425 2016 WAS NL taylomi02 524000
## 26426 2016 WAS NL treinbl01 524900
## 26427 2016 WAS NL werthja01 21733615
## 26428 2016 WAS NL zimmery01 14000000
# add 2002 salary of each player
players <- Salaries %>%
filter(yearID == 2002) %>%
select(playerID, salary) %>%
right_join(players, by="playerID")
head(players)
## playerID salary BB_per_game singles_per_game doubles_per_game
## 1 abbotje01 NA 3.275949 6.213007 2.033348
## 2 abbotku01 NA 2.411612 5.868256 1.929290
## 3 abernbr01 215000 3.160596 6.906488 1.990005
## 4 abreubo01 6333333 6.027244 5.912439 2.391763
## 5 agbaybe01 600000 4.527856 6.309307 1.892792
## 6 alexama02 NA 2.261856 6.392200 1.475123
## triples_per_game HR_per_game AVG PA R_hat
## 1 0.1129638 0.5648188 0.2515924 343 4.197201
## 2 0.2411612 1.1254190 0.2522124 482 4.585157
## 3 0.1170591 0.5852956 0.2697368 331 4.515780
## 4 0.4783527 1.4541922 0.3128655 2025 7.075567
## 5 0.2226814 1.2989750 0.2841649 1044 5.799263
## 6 0.4917077 0.3933662 0.2398922 394 3.705517
# add defensive position
position_names <- c("G_p","G_c","G_1b","G_2b","G_3b","G_ss","G_lf","G_cf","G_rf")
tmp_tab <- Appearances %>%
filter(yearID == 2002) %>%
group_by(playerID) %>%
summarize_at(position_names, sum) %>%
ungroup()
pos <- tmp_tab %>%
select(position_names) %>%
apply(., 1, which.max)
players <- data_frame(playerID = tmp_tab$playerID, POS = position_names[pos]) %>%
mutate(POS = str_to_upper(str_remove(POS, "G_"))) %>%
filter(POS != "P") %>%
right_join(players, by="playerID") %>%
filter(!is.na(POS) & !is.na(salary))
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
head(players)
## # A tibble: 6 x 11
## playerID POS salary BB_per_game singles_per_game doubles_per_game
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 abernbr~ 2B 2.15e5 3.16 6.91 1.99
## 2 abreubo~ RF 6.33e6 6.03 5.91 2.39
## 3 agbaybe~ LF 6.00e5 4.53 6.31 1.89
## 4 alfoned~ 3B 6.20e6 4.81 6.31 2.15
## 5 alicelu~ 2B 8.00e5 3.55 7.16 1.65
## 6 alomaro~ 2B 7.94e6 4.73 7.20 2.22
## # ... with 5 more variables: triples_per_game <dbl>, HR_per_game <dbl>,
## # AVG <dbl>, PA <int>, R_hat <dbl>
str(players)
## Classes 'tbl_df', 'tbl' and 'data.frame': 342 obs. of 11 variables:
## $ playerID : chr "abernbr01" "abreubo01" "agbaybe01" "alfoned01" ...
## $ POS : chr "2B" "RF" "LF" "3B" ...
## $ salary : int 215000 6333333 600000 6200000 800000 7939664 2500000 6000000 200000 5000000 ...
## $ BB_per_game : num 3.16 6.03 4.53 4.81 3.55 ...
## $ singles_per_game: num 6.91 5.91 6.31 6.31 7.16 ...
## $ doubles_per_game: num 1.99 2.39 1.89 2.15 1.65 ...
## $ triples_per_game: num 0.1171 0.4784 0.2227 0.0625 0.3871 ...
## $ HR_per_game : num 0.585 1.454 1.299 1.437 0.419 ...
## $ AVG : num 0.27 0.313 0.284 0.293 0.273 ...
## $ PA : int 331 2025 1044 1860 1201 1991 745 1076 1748 2024 ...
## $ R_hat : num 4.52 7.08 5.8 6.1 4.62 ...
# add players' first and last names
players <- Master %>%
select(playerID, nameFirst, nameLast, debut) %>%
mutate(debut = as.Date(debut)) %>%
right_join(players, by="playerID")
head(players)
## playerID nameFirst nameLast debut POS salary BB_per_game
## 1 abernbr01 Brent Abernathy 2001-06-25 2B 215000 3.160596
## 2 abreubo01 Bobby Abreu 1996-09-01 RF 6333333 6.027244
## 3 agbaybe01 Benny Agbayani 1998-06-17 LF 600000 4.527856
## 4 alfoned01 Edgardo Alfonzo 1995-04-26 3B 6200000 4.812074
## 5 alicelu01 Luis Alicea 1988-04-23 2B 800000 3.548811
## 6 alomaro01 Roberto Alomar 1988-04-22 2B 7939664 4.728989
## singles_per_game doubles_per_game triples_per_game HR_per_game AVG PA
## 1 6.906488 1.990005 0.11705912 0.5852956 0.2697368 331
## 2 5.912439 2.391763 0.47835270 1.4541922 0.3128655 2025
## 3 6.309307 1.892792 0.22268143 1.2989750 0.2841649 1044
## 4 6.311941 2.145643 0.06249447 1.4373727 0.2934316 1860
## 5 7.162147 1.645358 0.38714307 0.4194050 0.2731439 1201
## 6 7.200518 2.218538 0.33083459 1.2260341 0.3226545 1991
## R_hat
## 1 4.515780
## 2 7.075567
## 3 5.799263
## 4 6.102253
## 5 4.622360
## 6 6.616837
str(players)
## 'data.frame': 342 obs. of 14 variables:
## $ playerID : chr "abernbr01" "abreubo01" "agbaybe01" "alfoned01" ...
## $ nameFirst : chr "Brent" "Bobby" "Benny" "Edgardo" ...
## $ nameLast : chr "Abernathy" "Abreu" "Agbayani" "Alfonzo" ...
## $ debut : Date, format: "2001-06-25" "1996-09-01" ...
## $ POS : chr "2B" "RF" "LF" "3B" ...
## $ salary : int 215000 6333333 600000 6200000 800000 7939664 2500000 6000000 200000 5000000 ...
## $ BB_per_game : num 3.16 6.03 4.53 4.81 3.55 ...
## $ singles_per_game: num 6.91 5.91 6.31 6.31 7.16 ...
## $ doubles_per_game: num 1.99 2.39 1.89 2.15 1.65 ...
## $ triples_per_game: num 0.1171 0.4784 0.2227 0.0625 0.3871 ...
## $ HR_per_game : num 0.585 1.454 1.299 1.437 0.419 ...
## $ AVG : num 0.27 0.313 0.284 0.293 0.273 ...
## $ PA : int 331 2025 1044 1860 1201 1991 745 1076 1748 2024 ...
## $ R_hat : num 4.52 7.08 5.8 6.1 4.62 ...
# top 10 players
players %>% select(nameFirst, nameLast, POS, salary, R_hat) %>%
arrange(desc(R_hat)) %>%
top_n(10)
## Selecting by R_hat
## nameFirst nameLast POS salary R_hat
## 1 Barry Bonds LF 15000000 9.052454
## 2 Todd Helton 1B 5000000 8.228108
## 3 Manny Ramirez LF 15462727 8.199474
## 4 Sammy Sosa RF 15000000 8.185319
## 5 Larry Walker RF 12666667 8.146729
## 6 Jason Giambi 1B 10428571 7.993795
## 7 Chipper Jones LF 11333333 7.640781
## 8 Brian Giles LF 8063003 7.572229
## 9 Albert Pujols LF 600000 7.536447
## 10 Nomar Garciaparra SS 9000000 7.505063
# players with a higher metric have higher salaries
plot1 <- players %>% ggplot(aes(salary, R_hat, color = POS)) +
geom_point()
plot2 <- players %>% ggplot(aes(salary, R_hat, color = POS)) +
geom_point() +
scale_x_log10()
grid.arrange(plot1, plot2, ncol=2)
# remake plot without players that debuted after 1998
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
players %>% filter(year(debut) < 1998) %>%
ggplot(aes(salary, R_hat, color = POS, label = nameLast)) +
geom_point() +
geom_text(nudge_x=0.1, cex = 2) +
scale_x_log10()