
Predicting the Fantasy Football Draft with Data

As a student of data science and a fan of most things football, I jumped at the chance to test my newfound data science chops by optimizing my fantasy football roster. I could rely on projections and recommendations from fantasy football pros, but where’s the fun in that? With so many sources offering projections and rankings for players, why not use several of them to improve my roster?

This article demonstrates the method I used to select quarterbacks, tight ends, wide receivers, and running backs for my fantasy football team. I want to stress that this is by no means the best method or the only method, but it is a method.

The Data

The data used for this project covers the entirety of the 2019 NFL regular season. Our data set comes from Pro Football Reference and includes the following columns:

  • Player: the name of the player.
  • Tm: three-letter abbreviation for the player’s team.
  • FantPos: the player’s position on the field.
  • Games G: number of games played in the 2019 season.
  • Fantasy FantPt: Pro Football Reference’s official scoring method; total points scored for the whole season.
  • Fantasy PPR: a slightly modified version of the official scoring method that includes additional scoring opportunities; total points scored for the whole season.
  • Fantasy DKPt: DraftKings’ official scoring method; total points scored for the whole season.
  • Fantasy FDPt: FanDuel’s official scoring method; total points scored for the whole season.

Figure 1: Data (Accessible in PDF Version)

Data Cleaning

To properly work with our data, we must clean it and investigate it. In other words, the data must be formatted to account for missing values, outliers, and abnormal text.

For example, the player names should be normalized to remove the asterisks and plus signs. Although this is not necessary for this case, cleaning it now will allow us to make future enrichment much easier. We’ll apply the function below to our player name column to edit the text.

import re  # Python Regular Expression module

def str_fixer(text: str) -> str:
    """Uses a predetermined regex pattern to normalize text."""
    result = re.sub(r"[*+]", "", text)  # strip asterisks and plus signs
    return result

df["Player"] = df["Player"].apply(str_fixer)  # applying the function

Another issue encountered in this particular data set was the presence of missing values. Knowing that the data is recording fantasy football scores, the assumption was made that a player with missing scores likely did not earn any points. We can use the pandas method fillna() to replace missing values within specified columns with 0.
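As a sketch, the fill might look like the snippet below. Note that the column names here are assumptions based on the data description above; the columns in your own download may be named slightly differently.

score_cols = ["FantPt", "PPR", "DKPt", "FDPt"]  # assumed names of the four score columns
df[score_cols] = df[score_cols].fillna(0)  # missing scores become 0 points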

Figure 2: Player Information (Accessible in PDF Version)

After accounting for missing values in the score columns, the next item to consider is the player position. A handful of players in the data set had missing values for their on-field position. These players tended to reside at the bottom of the rankings in terms of fantasy points, so the decision was made to drop them altogether (sorry, Greg Dortch).

df = df[df["FantPos"].notna()]  # removing all instances of position-less players

A Touch of Feature Manipulation

Feature manipulation, or feature engineering, is the process of manipulating your data to make it work better for you and machine learning models. This is the step where having domain knowledge is crucial to improving the performance of said models.

As we all know, NFL players can have ‘hot’ games. As great as this is for whoever has that player on their roster, these games don’t come every Sunday. To mitigate this, we’ll transform our score columns.

Using the games played column, we’ll create new columns that estimate fantasy points scored per game. One way to do this is with a little help from a ‘for’ loop.

A ‘for’ loop lets you run code in a repetitive process. Rather than typing out each new column by hand, we can tell Python to do it. The code below takes each score column, divides it by the number of games played by a particular player, and creates a new column for the per-game value.

for col in df.columns[4:]:  # iterating over the names of the score columns
    df[f"{col} Per Game"] = df[col] / df["Games G"]

Now we have new columns that provide us with an estimate of how many fantasy points each player scored per game they played. We’re ready to begin looking at some patterns with these new features.

Visualization

Let’s take a look at just how our scoring services dish out their ratings. The first thing to notice is the asymmetrical nature of the distribution in Figure 3. Notice how the four peaks sit right around the 2–3 point range? This tells us that a large share of the players in our data set fall into that range, which makes sense since high-scoring players are not the majority.

The main takeaway from this chart is that the four sources included in this analysis disagree on how points should be awarded; each service has its own scoring method.

Figure 3: Distribution of Points (Accessible in PDF Version) 
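A distribution plot like Figure 3 can be sketched with seaborn; the snippet below assumes the per-game columns created earlier and overlays one density curve per scoring service.

import seaborn as sns
import matplotlib.pyplot as plt

per_game_cols = [c for c in df.columns if "Per Game" in c]  # the per-game columns created earlier
for col in per_game_cols:
    sns.kdeplot(df[col], label=col)  # one density curve per scoring service
plt.xlabel("Fantasy Points Per Game")
plt.legend()
plt.show()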

But what does this mean for the average fantasy football enthusiast? It means that choosing the right players can be confusing and costly depending on the source you’re referencing. But there is another way: we can use the power of data science! Rather than looking at only one of these sources or scouring the internet manually, why not group players based on a variety of scores?

Modeling

The goal here is to group players based on a variety of scores. Sounds like a classic clustering problem waiting to be solved.

Cluster analysis, or cluster modeling, is the method of gathering objects that are similar into separate groups.

For example, say we have a sample of NBA players and a sample of elementary school students. One thing we could do to differentiate the students from the NBA players is to record their heights. As we all know, NBA players are much taller than first graders (don’t quote me on this), making height an excellent metric to classify students vs. NBA players. However, in our example, we know that there are two groups, students and NBA players. We need to use an unsupervised method to determine how many groups or tiers exist within our NFL player data. To do this, we’ll use a K-means algorithm.
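As a toy illustration of that example (the heights below are made up), scikit-learn’s KMeans can recover the two groups from height alone:

import numpy as np
from sklearn.cluster import KMeans

# made-up heights in centimeters: five first graders followed by five NBA players
heights = np.array([[118], [121], [115], [124], [120],
                    [198], [206], [201], [211], [203]])
labels = KMeans(n_clusters=2, random_state=1).fit_predict(heights)
print(labels)  # two clean groups, e.g. [0 0 0 0 0 1 1 1 1 1]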

To get a better understanding of what the K-means algorithm does, I’ll refer to an article from Dr. Michael J. Garbade.

  • We define the number of clusters (commonly denoted as ‘k’), which refers to the number of centroids (a centroid represents the center of a given cluster or group).
  • Data points are assigned to clusters so as to minimize the in-cluster sum of squares.
  • The centroids are recomputed and the process repeats until the assignments stop changing.

Figure 4: K-means (Accessible in PDF Version)

To tie this back to our student vs. NBA player example, we know that we have two groups, so we would give the K-means algorithm a value of two (i.e., two clusters). Due to the significant size gap between 8-year-olds and 20-something-year-old professional athletes, our two centroids would fall somewhere around the average height of each group. To be safe, though, the algorithm measures the distance between each person and the centroid and adjusts each centroid to its optimal location.

So how do we do this? And how many clusters should we create? In the real world, we don’t always know how many clusters may exist in our data.

One way we can determine the optimum number of groups is by using the ‘Elbow Method.’ The ‘Elbow Method’ compares the in-cluster sum of squares across different centroid configurations.

Basically, it helps us determine how many clusters we should create. We can use a ‘for’ loop again to our advantage by testing out a list of different cluster options and plotting them.
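A sketch of that loop is below. It assumes the feature matrix X holds the per-game score columns created earlier, and it uses scikit-learn’s inertia_ attribute as the distortion (in-cluster sum of squares) for each candidate k.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = df[[c for c in df.columns if "Per Game" in c]]  # assumed feature matrix: the per-game score columns
distortions = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, random_state=1).fit(X)
    distortions.append(model.inertia_)  # in-cluster sum of squares for this k

plt.plot(k_values, distortions, marker="o")
plt.xlabel("k")
plt.ylabel("Distortion")
plt.show()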

Figure 5: Elbow Method (Accessible in PDF Version) 

Notice how the y-axis says ‘Distortion’? Distortion is another way of saying ‘distance from the centroid (or cluster average).’ As we see above, the plot looks very similar to an elbow. We can use it to determine how many centroids (k on the x-axis) we should include in our model. Five looks to be a pretty good fit, since beyond that point there is no real benefit in reducing distortion as we increase k.

Now that we have our desired number of clusters, we can fit our model and begin to see who I should pick up in the draft! To help with this, we’ll add a column to our data that records the group, or cluster, each player belongs to.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=1)  # using 5 clusters
kmeans.fit(X)  # fitting the model to the per-game feature matrix
tiers = kmeans.labels_  # obtaining group labels
df["tier"] = tiers  # setting the tier column to the group labels

Figure 6: Tier Group (Accessible in PDF Version) 
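A plot like Figure 6 could be produced with a boxplot of points per game by tier. The sketch below uses seaborn, with a melt step to reshape the four per-game columns into a single column first:

import seaborn as sns
import matplotlib.pyplot as plt

melted = df.melt(id_vars="tier",
                 value_vars=[c for c in df.columns if "Per Game" in c],
                 value_name="Points Per Game")
sns.boxplot(data=melted, x="tier", y="Points Per Game")
plt.show()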

The boxplot above shows the distribution of fantasy points per game across all four sources for each tier. As you can see, tier 1 players (the ones we want) tend to score roughly 16–30 points per game. With the tiers clearly outlined, we can begin to pick players to draft. Let’s take a look at the quarterbacks that fall under the tier one group.

df.query('FantPos == "QB" and tier == 1')

Figure 7: Table (Accessible in PDF Version) 

Being a fan of football, I can tell you that players like Lamar Jackson, Russell Wilson, and Aaron Rodgers are very good at what they do, which tells me that our algorithm did what it’s supposed to do: assign players to tiered performance groups.

The results of the table reveal an interesting finding.

Notice item number seven in the table above? Jeff Driskel is a backup quarterback on the Detroit Lions who played three games while the starting quarterback, Matthew Stafford, was injured. Typically, quarterbacks like Driskel would be overlooked, but with our model we see that he isn’t too shabby when he gets some minutes. This could be useful information later in the season in case my primary QB gets injured, or all of the good ones are taken (provided the backup QB is playing that week).

Conclusion

Finding Jeff Driskel in our tier one group is exactly why I wanted to conduct this analysis. Backup quarterbacks are rarely drafted, but now we know that if Stafford is injured (hopefully not), we would have a pretty solid mid-season pick in Driskel. Given his role in the NFL, he’s almost guaranteed to be available throughout the season.

Another benefit of this model is that it makes draft time less stressful. Since the K-means algorithm successfully grouped high-performing players, I’ll be better equipped to build my roster. I won’t have to panic if someone takes Lamar Jackson right away, because players in his tier performed very similarly throughout the 2019 season, meaning I’ll still get a great player.