*Author: Nate Reed*

*Date: August 18, 2016*

The Lahman Baseball Database (available here: http://www.seanlahman.com/baseball-archive/statistics/) is an extensive historical collection of baseball statistics. I will use it to answer the following questions:

- Have batting averages increased or decreased over time?
- Is there a relationship between batting and pitching skills?
- Among pitchers who also bat, how good are they at batting?

This report uses some informal techniques (e.g., exploratory data analysis) and basic statistics to answer these questions.

In [86]:

```
import urllib.request as request
from zipfile import ZipFile

# Download the 2016-03-02 snapshot of the database and unpack it locally
archive_name = "baseballdatabank-master_2016-03-02.zip"
request.urlretrieve('http://seanlahman.com/files/database/' + archive_name, archive_name)
with ZipFile(archive_name) as archive:  # avoids shadowing the built-in zip()
    archive.extractall()
```

In [87]:

```
import pandas as pd
# Forward slashes work on every platform and avoid backslash escape sequences
batting_df = pd.read_csv("baseballdatabank-master/core/Batting.csv")
print("%d observations" % len(batting_df))
```

In [88]:

```
batting_df.head()
```

Out[88]:

To understand batting, we will look at one of the most common metrics of batting ability, the batting average, which is calculated as hits divided by at bats (H / AB). I will select the stint and the team, in addition to hits and at bats:

In [89]:

```
# Copy the selection so that adding a column later does not trigger SettingWithCopyWarning
batting_df = batting_df[['playerID', 'yearID', 'AB', 'H', 'stint', 'teamID']].copy()
batting_df.head(5)
```

Out[89]:

In [90]:

```
# Rows with zero at bats produce NaN here (0 / 0)
batting_df['batting_average'] = batting_df['H'] / batting_df['AB']
batting_df.head(5)
```

Out[90]:

We will look at the distribution of the data to see if we notice anything interesting. First, we look at the distribution of batting averages, which we summarize and then plot below as a density plot.

In [91]:

```
batting_df = batting_df.dropna()  # Drop rows with missing values, including undefined averages
batting_df['batting_average'].describe()
```

Out[91]:
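One source of missing values is a row with zero at bats: 0 hits divided by 0 at bats is undefined, which pandas represents as NaN. As a minimal plain-Python sketch of the same rule (the helper `batting_average` and the rows are made up for illustration and are not part of the original notebook):

```python
import math

def batting_average(hits, at_bats):
    """Return hits / at_bats, or NaN when there are no at bats (hypothetical helper)."""
    if at_bats == 0:
        return math.nan
    return hits / at_bats

# Made-up (hits, at_bats) rows for illustration only
rows = [(30, 100), (0, 0), (1, 4)]
averages = [batting_average(h, ab) for h, ab in rows]

# Rough equivalent of dropna(): keep only the defined averages
defined = [avg for avg in averages if not math.isnan(avg)]
print(defined)
```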

In [92]:

```
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline
# Plot the distribution
batting_df['batting_average'].plot.kde()
plt.title("Batting Avg. Distribution")
plt.xlabel("Avg.")
plt.show()
```

In [93]:

```
batting_df.groupby('yearID')['batting_average'].mean().plot()
plt.title("Batting Average Time Series")
plt.xlabel("Year")
plt.ylabel("Avg.")
plt.show()
```
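Note that the plot above gives every player-season equal weight, so a player with 2 at bats counts as much as one with 600. One possible alternative is to pool hits and at bats per year before dividing. The sketch below shows the difference on made-up rows; the function name `yearly_pooled_average` is hypothetical and this is not part of the original analysis.

```python
from collections import defaultdict

def yearly_pooled_average(rows):
    """Map year -> total hits / total at bats (hypothetical helper)."""
    hits = defaultdict(int)
    at_bats = defaultdict(int)
    for year, h, ab in rows:
        hits[year] += h
        at_bats[year] += ab
    # Skip years with no at bats to avoid dividing by zero
    return {year: hits[year] / at_bats[year] for year in hits if at_bats[year] > 0}

# Made-up (yearID, H, AB) rows for illustration only
rows = [(1990, 150, 500), (1990, 1, 2), (1991, 120, 480)]
pooled = yearly_pooled_average(rows)

# For 1990 the pooled value is 151 / 502, whereas the unweighted mean
# of the two player averages (0.300 and 0.500) would be 0.400
print(pooled)
```

Whether the pooled or the unweighted figure is more appropriate depends on whether the question is about the league as a whole or about the typical player.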

Pitching skill is measured by several statistics, but one of the most common is the Earned Run Average, or ERA. The ERA is the number of earned runs a pitcher allows per nine innings: earned runs divided by innings pitched, multiplied by 9. The definition of an "earned run" is rather long. In short, it is a run for which the pitcher is held accountable (see http://www.baseball-almanac.com/rule10.shtml#anchor11198 for the full definition).

The variable "IPouts" is the number of innings pitched times 3. We can get the number of innings pitched simply by dividing this number by 3.
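Written out, and including the scaling to nine innings that the code below applies, the calculation is:

```latex
\mathrm{IP} = \frac{\mathrm{IPouts}}{3},
\qquad
\mathrm{ERA} = 9 \times \frac{\mathrm{ER}}{\mathrm{IP}} = 27 \times \frac{\mathrm{ER}}{\mathrm{IPouts}}
```

For example, 3 earned runs over 27 outs (9 innings) gives an ERA of 3.00.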

For batting, we use the batting average, although there are other statistics we could consider. For brevity, I only use the batting average we calculated above.

In [94]:

```
import pandas as pd
# Forward slashes work on every platform and avoid backslash escape sequences
pitching_df = pd.read_csv("baseballdatabank-master/core/Pitching.csv")
```

In [95]:

```
pitching_df['innings_pitched'] = pitching_df['IPouts'] / 3  # IPouts counts outs; 3 outs per inning
pitching_df['ERA'] = pitching_df['ER'] / pitching_df['innings_pitched'] * 9  # Earned runs per nine innings
pitching_df = pitching_df[['playerID', 'yearID', 'stint', 'teamID', 'IPouts', 'ER', 'ERA', 'innings_pitched']]
pitching_df.head()
```

Out[95]:

In [96]:

```
pitching_df = pitching_df.dropna() # Drop missing values
pitching_df.describe()
```

Out[96]:

In [97]:

```
import numpy as np
pitching_df[np.isfinite(pitching_df['ERA'])]['ERA'].plot.kde()
plt.title('Distribution of Earned Run Average')
plt.xlabel('ERA')
plt.show()
```

In the above plot, there is a wide range of values, but most observations are between 0 and 5. A "good" ERA is below 4. Between 4 and 5 is OK, but not great. Over 5 is considered unsustainable, as a pitcher with this ERA will likely be replaced.
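The rough bands described above can be expressed as a small helper. This is only a sketch of the rule of thumb in the text: the function name `era_band` and the labels are hypothetical, and the handling of the exact boundary values is an assumption.

```python
def era_band(era):
    """Classify an ERA using the rough thresholds from the text (hypothetical helper)."""
    if era < 4:
        return "good"
    if era <= 5:
        return "ok"
    return "poor"

# Made-up ERA values for illustration only
labels = [era_band(e) for e in (2.75, 4.50, 6.10)]
print(labels)
```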

For this analysis, I am interested in all skill levels. Like the batters, some of these pitchers have exceptionally good or bad metrics, mostly due to a small number of games pitched. For the scatterplot below, I've included those outliers. In addition, I will re-load the batters and include those batters with few at bats. Skills are likely correlated with the number of attempts, as players who perform poorly will be given fewer opportunities.

In [98]:

```
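# Match each pitching record to the same player's batting record for that season and stint;
# merging on playerID alone would pair every pitching row with every batting row of a player
players_df = pd.merge(pitching_df, batting_df, on=["playerID", "yearID", "stint", "teamID"], how="inner")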
players_df = players_df[players_df['AB'] > 0] # Require at least 1 at bat
players_df = players_df[players_df['innings_pitched'] > 0] # Eliminate no innings pitched
players_df.describe()
```

Out[98]:

In [99]:

```
plt.scatter(players_df['ERA'], players_df['batting_average'])
plt.title('Pitching vs. Batting Skills')
plt.xlabel('ERA')
plt.ylabel('Batting Average')
plt.show()
```
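A scatterplot shows the shape of a relationship, and a correlation coefficient is one way to put a number on it. The following is a minimal plain-Python sketch of the Pearson coefficient on made-up pairs; the helper name `pearson_r` is hypothetical and the values are not taken from the database.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up (ERA, batting average) pairs for illustration only
era = [2.5, 3.0, 4.0, 5.5]
avg = [0.200, 0.180, 0.150, 0.120]
r = pearson_r(era, avg)
print(round(r, 3))
```

On the real data, `players_df['ERA'].corr(players_df['batting_average'])` should compute the same quantity, provided the ERA column contains only finite values.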

The "Designated Hitter" rule allows a team to use one non-fielding player as a batter, typically in place of the pitcher. This rule was introduced in 1973 in the American League. In the Lahman database, Appearances.G_dh is the number of games played as designated hitter.

The common understanding is that pitchers are not typically good at batting. This makes me curious: can we quantify how poorly pitchers perform in comparison to other players at batting?

To compare their respective batting skills, we will divide the batters into two groups: those who have pitched and those who haven't.

In [100]:

```
players_df = players_df[players_df['AB'] > 20]
batting_df = batting_df[batting_df['AB'] > 20]
```

In [101]:

```
# Population 1: Pitchers that also bat
pitchers_that_bat = players_df[players_df['innings_pitched'] > 0.0]
# Career batting average per player: total hits divided by total at bats
pitcher_totals = pitchers_that_bat.groupby('playerID')[['H', 'AB']].sum()
batting_averages_for_pitchers = pitcher_totals['H'] / pitcher_totals['AB']
print("Batting pitchers: %d" % len(batting_averages_for_pitchers))
# Population 2: All other batters
non_pitching_batters = batting_df[~batting_df['playerID'].isin(pitchers_that_bat['playerID'])]
non_pitching_batters = non_pitching_batters[non_pitching_batters['AB'] > 20]
non_pitcher_totals = non_pitching_batters.groupby('playerID')[['H', 'AB']].sum()
batting_averages_for_non_pitchers = non_pitcher_totals['H'] / non_pitcher_totals['AB']
print("All other batters: %d" % len(batting_averages_for_non_pitchers))
```

We look at the mean batting average for pitchers that bat vs. all other batters:

In [102]:

```
batting_averages_for_pitchers.mean()
```

Out[102]:

In [103]:

```
batting_averages_for_non_pitchers.mean()
```

Out[103]:

It does appear, at first blush, that pitchers are worse at batting based on the means we calculated for both groups. We can use a t-test to test the null hypothesis that batting pitchers bat no better or worse than all other batters.
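The final cell of this notebook passes `equal_var=False` to `scipy.stats.ttest_ind`, which selects Welch's variant of the t-test (it does not assume that the two groups have equal variances). To make the statistic less of a black box, here is a minimal plain-Python sketch of it. The helper name `welch_t` is hypothetical and the samples are made up for illustration.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 denominator), each divided by its sample size
    va = statistics.variance(sample_a) / len(sample_a)
    vb = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, df

# Made-up batting averages for illustration only
pitchers = [0.150, 0.170, 0.160, 0.140]
others = [0.250, 0.270, 0.260, 0.240, 0.255]
t_stat, dof = welch_t(pitchers, others)
print(t_stat, dof)
```

A large negative statistic here would indicate that the first group's mean is well below the second's relative to the sampling noise; the p-value reported by scipy is derived from this statistic and the degrees of freedom.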

The t-test assumes a normal distribution. Recall that we observed outliers in the batting averages due to players who batted infrequently and had either very low or very high batting averages. Similarly, some pitchers pitched few innings. We removed those outliers in section 3.1 in order to get a more accurate confidence interval for the t-test:

In [115]:

```
import scipy.stats as st
stderr = st.sem(batting_averages_for_pitchers)
interval1 = (batting_averages_for_pitchers.mean() - stderr * 1.96, batting_averages_for_pitchers.mean() + stderr * 1.96)
stderr = st.sem(batting_averages_for_non_pitchers)
interval2 = (batting_averages_for_non_pitchers.mean() - stderr * 1.96, batting_averages_for_non_pitchers.mean() + stderr * 1.96)
batting_averages_for_pitchers.plot.kde(label='Avg. for Pitchers')
ax = batting_averages_for_non_pitchers.plot.kde(label='Avg. for Non-Pitchers')
# For each group, draw the mean and both ends of its 95% confidence interval
ax.vlines(x=batting_averages_for_pitchers.mean(), ymin=-1, ymax=15, color='red', label='Mean and 95% CI (Pitchers)')
ax.vlines(x=interval1[0], ymin=-1, ymax=15, color='red')
ax.vlines(x=interval1[1], ymin=-1, ymax=15, color='red')
ax.vlines(x=batting_averages_for_non_pitchers.mean(), ymin=-1, ymax=15, color='purple', label='Mean and 95% CI (Non-Pitchers)')
ax.vlines(x=interval2[0], ymin=-1, ymax=15, color='purple')
ax.vlines(x=interval2[1], ymin=-1, ymax=15, color='purple')
ax.set_ylim([-1,12])
ax.legend()
plt.title("Batting Average for Pitchers vs. Non-Pitchers")
plt.xlabel("Avg.")
plt.show()
```
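The cell above repeats the same mean ± 1.96 × SEM arithmetic for each group. As a sketch of that calculation in plain Python (the helper name `normal_ci` is hypothetical, the values are made up, and 1.96 is the normal-approximation multiplier for a 95% interval, which is reasonable for large samples):

```python
import math
import statistics

def normal_ci(values, z=1.96):
    """Approximate confidence interval for the mean: mean +/- z * SEM."""
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation (n - 1)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean - z * sem, mean + z * sem

# Made-up batting averages for illustration only
low, high = normal_ci([0.150, 0.170, 0.160, 0.140])
print(low, high)
```

Wrapping the arithmetic in one function avoids reusing the `stderr` variable for two different groups.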

In [105]:

```
import scipy.stats
# equal_var=False selects Welch's t-test, which does not assume equal variances
scipy.stats.ttest_ind(batting_averages_for_pitchers, batting_averages_for_non_pitchers, equal_var=False)
```

Out[105]: