# Using Regression to Predict Baseball Salaries

- Introduction
- Requirements
- About Regression
- The Data
- Regression Example: Model Team Wins
- Exploring Player Salaries
- Modeling Player Salaries
- Regularization
  - a. Ridge
  - b. LASSO
  - c. ElasticNet
  - d. Results
- A Simplified Model
- Summary
- Conclusions
- The Code

## 1. Introduction

In baseball and other sports, we often wonder what drives player compensation. Highly-sought free agents sign record-breaking contracts in what seems like every off-season. Surely, we think, salaries must be based on some rational measures of productivity on the field.

I used regression to answer the question: What drives player salaries? My working hypothesis is that player salaries are largely determined by on-field statistics, information that is available to all parties, including the player agents and general managers who negotiate contracts.

I will first show basic linear regression, and then dive into regularization, a technique which can be used to perform feature selection, reduce model complexity and prevent over-fitting. We will use these techniques to understand the relationship between salaries and performance on the field.

## 2. Requirements

The code I will show requires Python 3, scikit-learn, pandas, NumPy and statsmodels. The Anaconda distribution, which includes all of these packages, is recommended and can be downloaded here.

## 3. About Regression

Multiple regression is the most basic modeling technique for understanding the relationship between two or more variables. I’m going to assume familiarity with regression, but if you would like a refresher, Analytics Vidhya has an excellent primer.

## 4. The Data

Data going back to the beginning of baseball in 1871 is available in the Lahman database. Initially, I created my own database by scraping data from MLB.com and USAToday.com, but later I found that data as recent as the 2015 season is available on Sean Lahman’s website. It appears that the database is updated at the end of each season, as the last update was March 2016.

## 5. Regression Example: Model Team Wins

First, let us construct a regression model on team wins. What are the key drivers of winning in baseball?

In an iPython notebook or Python shell, first load the data and subset the data frame to include columns which could be predictors:

### Subset
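The loading code is not embedded here; the following is a minimal sketch of the load-and-subset step using a toy stand-in for the Lahman Teams table (with the real data you would read Teams.csv from the Lahman database; the column names below follow the Lahman schema):

```python
import pandas as pd

# Toy stand-in for the Lahman Teams table (real data: Teams.csv from the
# Lahman database); column names follow the Lahman schema.
teams = pd.DataFrame({
    "yearID": [2014, 2014, 2015],
    "teamID": ["NYA", "BOS", "NYA"],
    "W": [84, 71, 87],
    "L": [78, 91, 75],
    "R": [633, 634, 764],
    "RA": [664, 715, 698],
    "G": [162, 162, 162],
    "name": ["New York Yankees", "Boston Red Sox", "New York Yankees"],
})

# Keep only columns that could serve as predictors (drop identifiers/labels)
predictor_cols = ["W", "L", "R", "RA", "G"]
df = teams[predictor_cols].copy()
print(df.shape)
```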

Next, we calculate WPCT (winning percentage):
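WPCT is simply wins divided by games played (wins plus losses); a minimal example:

```python
import pandas as pd

# Winning percentage from wins and losses
df = pd.DataFrame({"W": [84, 71], "L": [78, 91]})
df["WPCT"] = df["W"] / (df["W"] + df["L"])
```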

### Normalize and Replace Missing Values
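A sketch of this step on toy columns, assuming missing values are filled with the column mean (one common choice; the linked notebook has the exact treatment), followed by min-max scaling to [0, 1]:

```python
import pandas as pd

# Toy columns with a missing value
df = pd.DataFrame({"R": [633.0, None, 764.0], "RA": [664.0, 715.0, 698.0]})

# Replace missing values with the column mean, then min-max scale each column
df = df.fillna(df.mean())
df = (df - df.min()) / (df.max() - df.min())
```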

### Train, test

We construct the model and test it on a separate set of data:
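The original fitting code is not embedded here; this is a sketch of the split-fit-score pattern, shown on synthetic data in place of the team table (the R² printed in the article's own run appears below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the team data: the response is driven by the
# first two columns (think normalized R and RA) plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 0.5 + 0.1 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(scale=0.01, size=200)

# Hold out 30% of the data, fit on the rest, and score on the held-out set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out data
```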

*0.915994428193*

### Results

Not bad! An R^2 of 0.916 means we can explain almost 92% of the variance in wins with our model.

Note that we split the data into train and test sets, then scored the resulting model on the test set. This is a standard technique to avoid “overfitting”, and it gives us a more accurate idea of how well the model generalizes to new data. There are even more elaborate techniques, which I will describe in the next section.

In scikit learn, it is a bit difficult to see inside of a regression model. This framework is designed for training and testing machine learning models where one might have hundreds of variables. It is excellent for tuning ML parameters, but for multiple regression, we would like to be able to inspect the coefficients to understand how each predictor variable is related to the response variable.

For those familiar with R, statsmodels uses R-style formulas and prints a summary of the results that looks very similar to the output of R’s summary(fit).

### Inspect Coefficients
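statsmodels' formula API produces the summary shown below. Here is a minimal sketch on synthetic data (the article's actual model regresses WIN_PCT on the full set of team statistics; the two-variable formula here is illustrative only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: winning percentage driven by runs scored and runs allowed
rng = np.random.default_rng(1)
df = pd.DataFrame({"R": rng.normal(size=100), "RA": rng.normal(size=100)})
df["WIN_PCT"] = 0.5 + 0.08 * df["R"] - 0.06 * df["RA"] \
    + rng.normal(scale=0.02, size=100)

# R-style formula; fit.summary() prints output much like R's summary(fit)
fit = smf.ols("WIN_PCT ~ R + RA", data=df).fit()
print(fit.summary())
```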

```
                            OLS Regression Results
==============================================================================
Dep. Variable:               WIN_PCT   R-squared:                       0.901
Model:                           OLS   Adj. R-squared:                  0.900
Method:                Least Squares   F-statistic:                     900.8
Date:               Tue, 01 Nov 2016   Prob (F-statistic):               0.00
Time:                       09:31:52   Log-Likelihood:                 5853.1
No. Observations:               2805   AIC:                        -1.165e+04
Df Residuals:                   2776   BIC:                        -1.148e+04
Df Model:                         28
Covariance Type:           nonrobust
====================================================================================
                       coef    std err          t      P>|t|   [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept           -0.1666      0.071     -2.356      0.019      -0.305    -0.028
R                    0.8042      0.024     33.538      0.000       0.757     0.851
AB                  -0.3332      0.064     -5.189      0.000      -0.459    -0.207
H                   -0.0148      0.031     -0.472      0.637      -0.076     0.047
SECOND_BASE_HITS    -0.0101      0.009     -1.093      0.274      -0.028     0.008
THIRD_BASE_HITS     -0.0215      0.008     -2.853      0.004      -0.036    -0.007
HR                  -0.0489      0.008     -6.023      0.000      -0.065    -0.033
BB                  -0.0424      0.010     -4.316      0.000      -0.062    -0.023
SO                  -0.0010      0.008     -0.127      0.899      -0.016     0.014
SB                  -0.0396      0.007     -5.540      0.000      -0.054    -0.026
CS                   0.0016      0.005      0.315      0.753      -0.008     0.012
HBP                  0.0066      0.006      1.028      0.304      -0.006     0.019
SF                  -0.0128      0.006     -2.006      0.045      -0.025    -0.000
RA                  -0.6026      0.043    -13.955      0.000      -0.687    -0.518
ER                   0.3623      0.043      8.478      0.000       0.278     0.446
ERA                 -0.3191      0.026    -12.472      0.000      -0.369    -0.269
CG                   0.0634      0.009      7.266      0.000       0.046     0.081
SHO                  0.0387      0.006      6.479      0.000       0.027     0.050
SV                   0.1094      0.007     15.279      0.000       0.095     0.123
IPouts               0.0387      0.056      0.685      0.493      -0.072     0.149
HA                  -0.0568      0.033     -1.703      0.089      -0.122     0.009
HRA                  0.0008      0.009      0.088      0.930      -0.017     0.018
BBA                 -0.0086      0.010     -0.826      0.409      -0.029     0.012
SOA                 -0.0212      0.009     -2.314      0.021      -0.039    -0.003
E                    0.0495      0.017      2.979      0.003       0.017     0.082
DP                  -0.0044      0.006     -0.691      0.489      -0.017     0.008
FP                   0.8022      0.075     10.661      0.000       0.655     0.950
BPF                  0.0060      0.001     11.752      0.000       0.005     0.007
PPF                 -0.0060      0.001    -11.723      0.000      -0.007    -0.005
==============================================================================
Omnibus:                     604.327   Durbin-Watson:                   1.994
Prob(Omnibus):                 0.000   Jarque-Bera (JB):            17096.077
Skew:                         -0.339   Prob(JB):                         0.00
Kurtosis:                     15.075   Cond. No.                     2.61e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 2.61e+04. This might indicate that there
are strong multicollinearity or other numerical problems.
```

### Interpret results/coefficients

Wow, there is a lot of information here! I’ll just highlight a few things:

- Note the coefficients shown next to each variable. Not all of these will be used, as I will explain shortly.
- R-squared was slightly lower this time. Remember, we used a test/train split first, then we constructed a model for the entire data set using statsmodels. We did this just to be able to see the coefficients.
- P-values above 0.05 are high enough to be considered statistically insignificant. This means there is a good chance that the observed relationship between the predictor and response variables is the result of random variation.
- Confidence intervals are shown for the 95% confidence level. Variables whose confidence intervals include zero should be thrown out.
- Note the warnings about strong multicollinearity. Some of these variables are related: for example, H (Hits), SECOND_BASE_HITS, THIRD_BASE_HITS and HR (Home Runs) together determine TB (Total Bases).

We will perform feature selection by eliminating those statistically insignificant variables as well as ones with obvious correlations. This will reduce model complexity and potential over-fitting.

Below I show the final model I came up with:

### Final Model

```
                            OLS Regression Results
==============================================================================
Dep. Variable:               WIN_PCT   R-squared:                       0.900
Model:                           OLS   Adj. R-squared:                  0.900
Method:                Least Squares   F-statistic:                     1483.
Date:               Tue, 01 Nov 2016   Prob (F-statistic):               0.00
Time:                       09:30:20   Log-Likelihood:                 5847.3
No. Observations:               2805   AIC:                        -1.166e+04
Df Residuals:                   2787   BIC:                        -1.155e+04
Df Model:                         17
Covariance Type:           nonrobust
===================================================================================
                      coef    std err          t      P>|t|   [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept          -0.1523      0.068     -2.232      0.026      -0.286    -0.019
R                   0.7878      0.013     58.903      0.000       0.762     0.814
AB                 -0.3580      0.020    -17.628      0.000      -0.398    -0.318
THIRD_BASE_HITS    -0.0200      0.007     -2.823      0.005      -0.034    -0.006
HR                 -0.0462      0.006     -7.376      0.000      -0.058    -0.034
BB                 -0.0384      0.007     -5.459      0.000      -0.052    -0.025
SB                 -0.0363      0.006     -6.096      0.000      -0.048    -0.025
SF                 -0.0115      0.004     -2.751      0.006      -0.020    -0.003
RA                 -0.6306      0.039    -16.323      0.000      -0.706    -0.555
ER                  0.3613      0.039      9.185      0.000       0.284     0.438
ERA                -0.3235      0.025    -13.184      0.000      -0.372    -0.275
CG                  0.0710      0.008      9.227      0.000       0.056     0.086
SHO                 0.0380      0.006      6.534      0.000       0.027     0.049
SV                  0.1087      0.007     15.490      0.000       0.095     0.123
E                   0.0589      0.015      4.003      0.000       0.030     0.088
FP                  0.7922      0.073     10.888      0.000       0.650     0.935
BPF                 0.0060      0.001     11.827      0.000       0.005     0.007
PPF                -0.0060      0.001    -11.887      0.000      -0.007    -0.005
==============================================================================
Omnibus:                     609.577   Durbin-Watson:                   1.993
Prob(Omnibus):                 0.000   Jarque-Bera (JB):            17556.500
Skew:                         -0.344   Prob(JB):                         0.00
Kurtosis:                     15.237   Cond. No.                     2.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 2.51e+04. This might indicate that there
are strong multicollinearity or other numerical problems.
```

The R^2 is 0.9, close to the accuracy of the original model, but we’ve simplified it somewhat.

We still see the warning about multi-collinearity among the remaining variables. We could simplify the model further by eliminating some variables, but this is good enough for our purpose, which is to understand regression and the factors that drive wins.

Look at the coefficients. We see a strong relationship between Runs (0.78), Runs Allowed (-0.63) and Winning Percentage. This should be fundamentally obvious to anyone who understands baseball, but it is a good illustration of how we can use regression and interpret the results.

Typically, statsmodels is used in the “statistics” world, while scikit learn is used in machine learning. They each have strengths and weaknesses. The method shown above is one method of feature selection, but we would still want to see how well the model generalizes to new data. For that, the cross_validation module in scikit-learn is well-suited.

## 6. Exploring Player Salaries

If runs are the primary driver of team wins, what factors are most important when it comes to player salaries? The universe of possibilities is vast when it comes to predicting compensation. There are different types of players, playing different positions – some highly specialized – and they each have different strengths and weaknesses. In addition, there are other variables like the team budget. There is a wide range of payrolls by team in Major League Baseball.

To develop an intuition about which features could be predictive, it is helpful to do some exploratory analysis. The full analysis is linked here, but I’ve included some key insights, below.

### Salary vs. Time

As we see from the plot below, average salaries have grown every year during the time period we analyzed:

### Salary Distribution

The distribution of salaries is right-skewed, with a long right tail. A small number of players receive disproportionately large salaries:

When we plot salaries by position, we see a clear relationship between position played and salary:

First and second basemen have higher median salaries than other fielding positions.

MULTIPLE includes players that have played more than one position in a year. This appears to include all outfielders (left, right and center field). Many of these players also played pitcher or other fielding positions.

### Player Statistics vs. Salaries

There are several other variables that appear to have a strong correlation with salary: RBI’s (Runs Batted In), number of All-Star Appearances, Strikeouts and Fielding Games, among others:

In addition, this pairs plot clues us in to predictor variables that appear to be correlated with each other, such as RBI’s and Fielding games. We will want to remove correlated predictor variables from our model.

## 7. Modeling Player Salaries

As with team wins, we will use regression to model player salaries, but I will introduce some automated techniques for feature selection. These techniques come in handy for high-dimensional data and for tuning more complex models.

We will go through the same kind of process for cleaning and normalizing the input variables that we executed above. Rather than embed the full code here, I will link to the iPython notebook and highlight the important modeling steps below.

### The Data

I wrote a script (generate_observations.py) that combines annual salaries with prior year playing statistics, labeled “{Statistic}.Year-1”. I only used the prior year, but I added a career statistic for several important variables. In other words, each row contains the current salary, data for the most recent year, and aggregate stats for the player’s entire career up to and including the previous year of play.

The full data pipeline is documented in the github repo.

In my iPython notebook, I inspect the first few rows as follows:

| | Player Id | Salary Year | Annual Salary | Salary Team | Batting_2B.Year-1 | Batting_3B.Year-1 | Batting_AB.Year-1 | Batting_AVG.Year-1 | Batting_BB.Year-1 | Batting_CS.Year-1 | ... | Pitching_SHO.Year-1 | Pitching_SO.Year-1 | Pitching_SV.Year-1 | Pitching_W.Year-1 | Pitching_WP.Year-1 | Player Id.1 | teamID 1.Year-1 | teamID 2.Year-1 | teamID 3.Year-1 | teamID 4.Year-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | blanche01 | 2011 | 1000000 | ARI | 5.0 | 0.0 | 130.0 | 0.215385 | 11.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | blanche01 | NYN | NaN | NaN | NaN |
| 1 | bloomwi01 | 2011 | 900000 | ARI | 10.0 | 1.0 | 187.0 | 0.267380 | 9.0 | 5.0 | ... | NaN | NaN | NaN | NaN | NaN | bloomwi01 | CIN | KCA | NaN | NaN |
| 2 | blumge01 | 2011 | 1350000 | ARI | 10.0 | 1.0 | 202.0 | 0.267327 | 15.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | blumge01 | HOU | NaN | NaN | NaN |
| 3 | branyru01 | 2011 | 1000000 | ARI | 19.0 | 0.0 | 376.0 | 0.236702 | 46.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | branyru01 | SEA | CLE | NaN | NaN |
| 4 | demelsa01 | 2011 | 417000 | ARI | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | ... | 0.0 | 33.0 | 2.0 | 2.0 | 5.0 | demelsa01 | ARI | NaN | NaN | NaN |

| | Salary Year | Annual Salary | Batting_2B.Year-1 | Batting_3B.Year-1 | Batting_AB.Year-1 | Batting_AVG.Year-1 | Batting_BB.Year-1 | Batting_CS.Year-1 | Batting_Career_2B | Batting_Career_3B | ... | Pitching_IPouts.Year-1 | Pitching_L.Year-1 | Pitching_R.Year-1 | Pitching_SF.Year-1 | Pitching_SH.Year-1 | Pitching_SHO.Year-1 | Pitching_SO.Year-1 | Pitching_SV.Year-1 | Pitching_W.Year-1 | Pitching_WP.Year-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4043.000000 | 4.043000e+03 | 3942.000000 | 3942.000000 | 3942.000000 | 3073.000000 | 3942.000000 | 3942.000000 | 4043.000000 | 4043.000000 | ... | 1928.000000 | 1928.000000 | 1928.000000 | 1928.000000 | 1928.000000 | 1928.000000 | 1928.000000 | 1928.000000 | 1928.000000 | 1928.000000 |
| mean | 2012.987880 | 3.783100e+06 | 9.335363 | 0.965246 | 180.968037 | 0.205732 | 16.521309 | 1.239472 | 58.042048 | 6.523126 | ... | 269.243776 | 4.832988 | 39.820021 | 2.428942 | 2.985996 | 0.150934 | 75.958506 | 2.991701 | 5.311203 | 3.258299 |
| std | 1.417743 | 5.008263e+06 | 12.061593 | 1.888294 | 211.187452 | 0.120388 | 22.446631 | 2.409485 | 98.862958 | 13.221541 | ... | 196.661333 | 4.064620 | 31.519374 | 2.299598 | 3.024088 | 0.492327 | 55.558366 | 8.758179 | 4.788591 | 3.140324 |
| min | 2011.000000 | 4.140000e+05 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2012.000000 | 5.085000e+05 | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 2013.000000 | 1.400000e+06 | NaN | NaN | NaN | NaN | NaN | NaN | 6.000000 | 0.000000 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 2014.000000 | 5.000000e+06 | NaN | NaN | NaN | NaN | NaN | NaN | 77.000000 | 8.000000 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 2015.000000 | 3.257100e+07 | 55.000000 | 16.000000 | 684.000000 | 1.000000 | 135.000000 | 23.000000 | 570.000000 | 120.000000 | ... | 753.000000 | 18.000000 | 128.000000 | 12.000000 | 19.000000 | 6.000000 | 277.000000 | 51.000000 | 24.000000 | 25.000000 |

### Cleaning

My initial code for cleaning the input was quite involved, due to the fact that data was coming from scraped websites, in different character sets, and presentation-level data was mixed in with the underlying playing data. I modified my data pipeline to use the Lahman database, obviating the need for this level of pre-processing.

As a result, most of my pre-processing work revolved around joining the different tables in the Lahman database to get a single table of observations (described in “The Data” section above).

### Feature Engineering

Not enough can be written about this topic, but feature engineering and feature selection are very important. Without the right data, we can’t build an accurate model.

The first step in feature engineering was the data transformation work described above. We calculated career stats and combined previous year and career statistics with annual salaries.

Normalizing input variables makes it easier to interpret regression coefficients. This means scaling the values so that they fall between 0 and 1. The formula for normalization is:

```
z = (x - min(x)) / (max(x) - min(x))
```

Before building the model, I normalized all the continuous input variables.
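Equivalently, scikit-learn's MinMaxScaler implements the formula above; a toy example (the values here are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Min-max normalization: z = (x - min(x)) / (max(x) - min(x))
X = np.array([[10.0], [20.0], [30.0]])
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled.ravel())  # [0.  0.5 1. ]
```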

For our Salary data, consider that salaries from year to year will grow over time. We don’t want to confuse the natural growth in salaries with variance in our input variables. To make apples-to-apples comparisons between salaries in two different years, I created an ‘Adjusted Salary’ variable, which is calculated as follows:

```
Adjusted Salary = Salary / Average Annual Player Salary
```

The Average Annual Player Salary is obtained by grouping salaries on year and calculating their mean.
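In pandas, this per-year adjustment can be sketched with a grouped transform (the table and numbers below are a toy example, not the real salary data):

```python
import pandas as pd

# Hypothetical mini salary table; 'Adjusted Salary' divides each salary by
# that year's league-average salary
df = pd.DataFrame({
    "Salary Year": [2011, 2011, 2012, 2012],
    "Annual Salary": [1_000_000, 3_000_000, 2_000_000, 6_000_000],
})
yearly_mean = df.groupby("Salary Year")["Annual Salary"].transform("mean")
df["Adjusted Salary"] = df["Annual Salary"] / yearly_mean
print(df["Adjusted Salary"].tolist())  # [0.5, 1.5, 0.5, 1.5]
```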

Similarly, Team Payrolls (aggregate pay at the team level in any given year) grow over time. We perform a similar adjustment for team payrolls:

```
Adjusted Team Payroll = Team Payroll / Average Annual Team Payroll
```

Categorical variables have to be converted into ordinal or continuous values. A typical method is to create dummy variables, one for each value of the categorical variable. For player position, the following variables were created: 0.0, MULTIPLE (if a player played more than one position in a season), P (Pitcher), C (Catcher), SS (Short Stop), 1B (First Base), 2B (Second Base), and 3B (Third Base).
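pandas' get_dummies handles this conversion; a small example using a few of the position values from the dataset:

```python
import pandas as pd

# One dummy column per distinct position value
positions = pd.DataFrame({"POS": ["P", "C", "1B", "MULTIPLE", "P"]})
dummies = pd.get_dummies(positions["POS"])
print(sorted(dummies.columns))  # ['1B', 'C', 'MULTIPLE', 'P']
```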

### Creating the Linear Regression Model

After cleaning the data and bringing it into the dataframe ‘df’, we fit a LinearRegression instance as we did with Team Wins, and observe the accuracy using a simple test-train split:

*0.677956364033*

### Results

R^2 = 0.68. This shows a significant degree of variance is explained by the features we included in the model, although it is not nearly as accurate as the Team Wins model we developed above.

### Use Cross-Validation

To get a better test of fit, we will perform 5-fold cross validation. A more detailed explanation of CV is here, but essentially this method involves systematically performing train/test splits, fitting the model on the training set and evaluating it on the test set. We repeat this for several subsets of training/test data and take the average score:
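A sketch of 5-fold cross-validation with scikit-learn, shown on synthetic data in place of the salary table (the article's own run produced the score below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the salary data
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(scale=0.5, size=300)

# 5-fold CV: fit on 4 folds, score on the 5th, repeat, then average
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```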

*Accuracy: 0.65 (+/- 0.08)*

Not bad. This score is a measure of how well the model performs on new data.

Because we include potentially multiple years of data for each player, we have to be concerned about dependence among our observations. We used GroupKFold to ensure that no player appears in both the train and test sets of any fold, which helps prevent over-fitting. While the observations within each fold are not independent, the train and test splits remain independent of each other.
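A minimal illustration of GroupKFold, using hypothetical player ids as the grouping variable:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Each observation belongs to a player; GroupKFold keeps all of a player's
# seasons in the same fold, so no player spans both train and test
X = np.arange(12).reshape(6, 2)
y = np.arange(6)
players = np.array(["a", "a", "b", "b", "c", "c"])

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=players):
    # The two sets of players never overlap
    assert set(players[train_idx]).isdisjoint(set(players[test_idx]))
```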

### Model Complexity

As we discussed in the first part on Team Wins, too many features can make the model unstable. Relationships could be inferred from variations that are just noise, resulting in overfitting. We will try to reduce model complexity through feature selection.

Besides the approach we showed in part one, a well-known approach to reducing model complexity is Regularization. I encourage you to read the tutorial linked here for understanding and motivation for Regularization, if you need an introduction or refresher.

## 8. Regularization

There are two types of regularization for regression problems, and one which combines them:

- Ridge: Performs L2 regularization
- LASSO: Performs L1 regularization
- ElasticNet: Combines L1 and L2

All three of these add a penalty to the regression equation. In Ordinary Least Squares (the regression we performed above), the objective is to minimize the sum of the differences between the observed and predicted values.

L2 minimizes a combination of the sum of the squares of the errors, as in OLS, and the sum of the squares of the coefficients. L1 minimizes a combination of the sum of the squares of the errors, as in OLS, and the sum of absolute value of coefficients.

The equations for L1 and L2 include the parameter alpha:

```
L1 minimization objective = LS Obj + α * (sum of absolute value of coefficients)
L2 minimization objective = LS Obj + α * (sum of square of coefficients)
```

The effect of regularization is to shrink or eliminate coefficients. Note that Ridge regression will retain all variables, while LASSO will eliminate variables. ElasticNet will achieve results in-between these approaches, as it performs a combination of L1 and L2 regularization.

As you can see, the effects of L1 and L2 depend greatly on the value of alpha. If alpha is zero, then L1 and L2 perform exactly as Ordinary Least Squares.

We’ll use GridSearchCV to perform K-fold cross-validation and look for the optimal value of alpha, and in the case of ElasticNet the l1/l2 ratio.

(Note: GridSearchCV will output many warnings about non-convergence for the given values of alpha and l1/l2, but they can be ignored. For each type of regularization, I was able to find parameters for which the models did converge. Those warning messages are omitted from the output below).
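A sketch of the grid search, shown here for Ridge on synthetic data; LASSO and ElasticNet follow the same pattern (ElasticNet's parameter grid would also include `l1_ratio`). The alpha grid below is illustrative, not the article's:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: only 2 of 10 features carry signal
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# 5-fold cross-validated search over candidate alpha values
grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 5.0, 10.0]}, cv=5)
grid.fit(X, y)
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))
```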

### a. Ridge Regression

**The best parameters are {‘alpha’: 5} with a score of 0.62**

We look at the coefficients:

```
{'0.0': -0.39057587143910538, '1B': 0.28963831227462217, '2B': 0.5567740265667418, '3B': -0.38383354872414599, 'Adjusted Team Payroll': 0.56405102067908308, 'Batting_Career_2B': 3.1522634262159572, 'Batting_Career_3B': 1.392063383072341, 'Batting_Career_AVG': -1.3374777290310322, 'Batting_Career_G': -6.0410423257153951, 'Batting_Career_H': 2.8558911690964304, 'Batting_Career_HR': 0.56139270540886621, 'Batting_Career_Num_Seasons': -0.19560965106335085, 'Batting_Career_OBP': 0.16138792259196344, 'Batting_Career_PSN': 1.3988969443216634, 'Batting_Career_R': -2.2951722609059919, 'Batting_Career_RBI': 3.0095553387131413, 'Batting_Career_SB': -1.1627478339088211, 'Batting_Career_SLG': 2.4792002682158638, 'Batting_Career_TB': -1.3751761008729613, 'C': -0.035421691264489852, 'Fielding_Career_A': -0.20547783542690334, 'Fielding_Career_E': 0.65918312303112425, 'Fielding_Career_FPCT': 0.17174074638196138, 'Fielding_Career_G': 2.5841653004703784, 'Fielding_Career_PO': 0.46558699831991984, 'Fielding_G': 0.0, 'Fielding_Num_Seasons': 0.0, 'MULTIPLE': -0.098250785476812943, 'Num_All_Star_Appearances': 2.2108124099949258, 'Num_Post_Season_Appearances': 0.18745542926727987, 'P': -0.088231982270278009, 'Pitching_Career_ER': -10.598771122237048, 'Pitching_Career_ERA': 0.28550993623861992, 'Pitching_Career_GS': -2.3114192891981071, 'Pitching_Career_IP': 5.1199564678367455, 'Pitching_Career_L': 2.6376961419350953, 'Pitching_Career_Num_Seasons': -0.015269958669084009, 'Pitching_Career_SO': 5.7323124475788303, 'Pitching_Career_W': 3.520676452015632, 'SS': 0.14990154035196948}
```

### b. LASSO

**The best parameters are {‘alpha’: 0.001} with a score of 0.62**

```
{'0.0': -0.28592675529735534, '1B': 0.26181856934547587, '2B': 0.35404526164078254, '3B': -0.12171969885924695, 'Adjusted Team Payroll': 0.54231569987963457, 'Batting_Career_2B': 1.3545359477186045, 'Batting_Career_3B': 0.53306092429719154, 'Batting_Career_AVG': -0.24680673375303361, 'Batting_Career_G': -0.0, 'Batting_Career_H': 0.0, 'Batting_Career_HR': 1.284782720712615, 'Batting_Career_Num_Seasons': -1.3358330510491565, 'Batting_Career_OBP': -0.0, 'Batting_Career_PSN': 0.32905664259505402, 'Batting_Career_R': -0.0, 'Batting_Career_RBI': 0.0085320537162230231, 'Batting_Career_SB': -0.0, 'Batting_Career_SLG': 0.98473350955245387, 'Batting_Career_TB': 0.52052229008278961, 'C': 0.018720803576232986, 'Fielding_Career_A': -0.0, 'Fielding_Career_E': 0.0033981909511692017, 'Fielding_Career_FPCT': 0.0, 'Fielding_Career_G': 1.1812941142468409, 'Fielding_Career_PO': 0.070324444252947163, 'Fielding_G': 0.0, 'Fielding_Num_Seasons': 0.0, 'MULTIPLE': -0.00011525840125545478, 'Num_All_Star_Appearances': 2.9686885540278687, 'Num_Post_Season_Appearances': 0.080096705748976094, 'P': -0.025495777914012779, 'Pitching_Career_ER': -0.71225349778559355, 'Pitching_Career_ERA': -0.0, 'Pitching_Career_GS': 0.0, 'Pitching_Career_IP': -0.0, 'Pitching_Career_L': -0.0, 'Pitching_Career_Num_Seasons': -0.0, 'Pitching_Career_SO': 5.2272363822703554, 'Pitching_Career_W': 0.0, 'SS': 0.12223098736421924}
```

**25 variables selected.**

Nice! Many of the 38 variables we originally used were eliminated.

### c. ElasticNet

**The best parameters are {'l1_ratio': 0.6, 'alpha': 0.001} with a score of 0.62**

```
{'0.0': -0.3047726423733812, '1B': 0.25748750699647138, '2B': 0.35889132154662995, '3B': -0.14988674024090118, 'Adjusted Team Payroll': 0.54625359519308125, 'Batting_Career_2B': 0.84243456942856298, 'Batting_Career_3B': 0.59913899613210242, 'Batting_Career_AVG': -0.35894026434519272, 'Batting_Career_G': -0.14729811028642528, 'Batting_Career_H': 0.0, 'Batting_Career_HR': 0.78105719697275222, 'Batting_Career_Num_Seasons': -1.1793233106665073, 'Batting_Career_OBP': 0.0, 'Batting_Career_PSN': 0.52551886237215906, 'Batting_Career_R': -0.0, 'Batting_Career_RBI': 0.61225141161728402, 'Batting_Career_SB': -0.17972207609872703, 'Batting_Career_SLG': 1.1473103292071138, 'Batting_Career_TB': 0.69807716951694032, 'C': 0.0, 'Fielding_Career_A': -0.0, 'Fielding_Career_E': 0.071400782449289851, 'Fielding_Career_FPCT': 0.0, 'Fielding_Career_G': 1.1490235733182153, 'Fielding_Career_PO': 0.17053361139204987, 'Fielding_G': 0.0, 'Fielding_Num_Seasons': 0.0, 'MULTIPLE': -0.015179815039546003, 'Num_All_Star_Appearances': 2.9585877533978362, 'Num_Post_Season_Appearances': 0.22848002839269588, 'P': -0.021741654556822559, 'Pitching_Career_ER': -0.38322604452848602, 'Pitching_Career_ERA': -0.011391062162622766, 'Pitching_Career_GS': 0.0, 'Pitching_Career_IP': 0.0, 'Pitching_Career_L': -0.0, 'Pitching_Career_Num_Seasons': -0.0, 'Pitching_Career_SO': 4.2715317739655374, 'Pitching_Career_W': 0.68767296740585693, 'SS': 0.11953849226009178}
```

**28 variables selected.**

### d. Results

Ridge regression did not eliminate any coefficients, as expected. Nor did the accuracy of the model change much.

LASSO did the best job at simplifying the model, as it selected the fewest variables. The accuracy was slightly reduced, but it is close. If we were concerned about how this model would perform on new data, this tradeoff might be attractive.

Eliminating variables is also a benefit in computationally expensive problems with many observations and features – perhaps millions of variables.

ElasticNet performed in-between Ridge and LASSO.

## 9. A Simplified Model

My favorite approach to this problem was simply fitting an OLS model in statsmodels and eliminating the statistically insignificant variables.

This is what I came up with:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:       Adjusted_Salary   R-squared:                       0.501
Model:                           OLS   Adj. R-squared:                  0.501
Method:                Least Squares   F-statistic:                     3472.
Date:               Tue, 01 Nov 2016   Prob (F-statistic):               0.00
Time:                       09:30:19   Log-Likelihood:                 -31318.
No. Observations:              24188   AIC:                         6.265e+04
Df Residuals:                  24180   BIC:                         6.272e+04
Df Model:                          7
Covariance Type:           nonrobust
============================================================================================
                               coef    std err          t      P>|t|   [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------
Intercept                    0.2996      0.008     37.092      0.000       0.284     0.315
Batting_Career_TB            4.3374      0.064     67.495      0.000       4.211     4.463
Pitching_Career_IP          -0.5935      0.228     -2.598      0.009      -1.041    -0.146
Pitching_Career_SO           8.2155      0.339     24.215      0.000       7.551     8.881
Num_All_Star_Appearances     2.6404      0.096     27.630      0.000       2.453     2.828
NO_POSITION                 -0.2963      0.030     -9.767      0.000      -0.356    -0.237
FIRST_BASE                   0.3226      0.034      9.408      0.000       0.255     0.390
SECOND_BASE                  0.0881      0.040      2.210      0.027       0.010     0.166
==============================================================================
Omnibus:                    4686.175   Durbin-Watson:                   1.946
Prob(Omnibus):                 0.000   Jarque-Bera (JB):            50380.234
Skew:                          0.620   Prob(JB):                         0.00
Kurtosis:                      9.961   Cond. No.                         72.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
```

I like the OLS summary from statsmodels because it’s very easy to interpret:

- Total Bases is the most important batting statistic. Players are paid for their ability to hit.
- Strikeouts is the most important metric for pitchers.
- Innings Pitched is negatively correlated with salaries because the rate of strike-outs is lower for a pitcher with more innings pitched.
- Players with more All-Star Appearances are paid proportionally more, based on the number of games they appeared in.
- NO_POSITION: This might require further investigation, but I suspect this refers to designated hitters. They would play no fielding position. On average, they earn less, hence the negative correlation.
- First and Second base players earn more on average.

## 10. Summary

We looked at multiple regression for modeling a couple of interesting variables in baseball: team wins and player salaries. I showed 3 regularization techniques for performing automated feature selection during cross-validation to score model accuracy. As a result, we have a model that is more robust, parsimonious and less prone to over-fitting.

I also showed how feature selection could be performed by eliminating statistically insignificant variables, with similar benefits as regularization. The statsmodels package makes this very easy by producing summary output that is similar to R’s.

## 11. Conclusions

The most important predictor of winning is runs.

The most important drivers of baseball compensation include a few commonly-used metrics of batting and pitching ability, the number of all-star appearances, and fielding position.

The salaries model is not as accurate as the team wins model. There is a great deal of variation in player salaries, not all of which can be explained by the data. Maybe there is more data that could be added to our regression model. It’s more likely, though, that much of what drives player compensation is random chance, errors in financial judgement and bias.

In Moneyball, Michael Lewis writes about the puzzling way that baseball scouts evaluate players. By using statistical analysis and smart management of the roster, the Oakland A’s fielded a winning team despite having one of the lowest payrolls in major league baseball. One of the book’s key insights is that hitting skill is overvalued. This is consistent with our salaries model, of which TB (Total Bases) is an important component.

When it comes to winning, the ability to get on base is what’s important – whether that is accomplished by walks or hits. By recruiting players with high OBP (on-base percentage) and OK hitting skills, salaries can be minimized while achieving good numbers of runs. Runs – not hits – is the critical driver for winning.

## 12. The Code

The code can be found here.