Saturday, November 21, 2015

Learn Statistics and Data Mining with League of Legends Data - Principal Component Analysis

plot of chunk unnamed-chunk-2

Hi, and welcome to the first of the series of learning statistics and data mining with League of Legends. In this series, I will try to illustrate how League of Legends data can be analyzed using various statistical and/or data mining methodologies - while keeping mathematical details to a minimum. The data and the code (written in R) are both available, so you can try it yourself!

So let's begin. Suppose you are interested in the average number of champion kills and deaths per game for each champion in the game (NA ranked soloqueue). One starting point is typically to draw scatterplot like this:

plot of chunk unnamed-chunk-4

This plot is slightly crowded in the middle, but overall it's not too bad: we see supports like Janna with the lowest average kills and deaths per game; on the other hand, assassins like Akali, Talon, and Katarina tends to have high kills and high deaths per game.

However, we have only looked at only two variables here: champion kills and deaths. This is easy since two variables fit onto a 2-dimensional scatterplot with two axes - which conveniently fits onto your computer screen which is a 2-dimensional surface.

But the Riot API offers a lot of data with more than two variables - for the data of interest (Google Doc Link, Pastebin Link), we have 16 different variables. Obviously we don't have a 16-dimensional surface to make our scatterplot with; drawing a scatterplot for each variable pair also gets unreadable really quick. So what can we do at this point?


One way to do it is through a data mining technique called Principal Component Analysis (PCA), which I will be talking about today.

Here's what we want to do: we cannot make a 16 variables scatterplot since it will require 16 axes and a 16 dimensional plot, but we CAN make a 2 variables scatterplot. So, instead of plotting all 16 variables, we will “compress” our data into 2 variables - then the problem becomes easy.

How do we “compress” the data? In our case, our dataset has 16 variables, all of which contains a certain amount of variance (think variance as “information”). We apply a linear algebra technique called Singular Value Decomposition to construct 16 new variables in a smart way - so that the first two variables will contain most of the variance we care about (again, think variance as “information”). Then we can just plot the first two new variables with a 2-dimensional scatterplot and ignore the rest - with any luck, the first two new variables will capture most of the information you care about.

In case you had no idea what I just said, here's an analogy. You have 16 friends, all of them have (say) 100 dollars. But tracking all 16 people's money is too hard for you, so you redistribute everyone's money such that your best friend has the most money, your second best friend has the second most money, and so on; you can't just give all 1600 dollars to your best friend (which will cause the other 15 to starve to death), but you push your envelope as much as possible such that a majority of the wealth stays with your best and second best friend. Then you leave with your two best friends and ignore the rest.

While it is not impossible to do principal component analysis by hand (it all comes down to linear algebra), we will use a popular programming language / statistical analysis tool called R which will do the heavy lifting for us.

# First download the data from here: http://pastebin.com/3NXpPRp1 and save it as a csv file (e.g. "data_summary_cleansed.csv")

# Load the data into R
data <- read.table("data_summary_cleansed.csv", header = T, row.names = 1, sep = ",", as.is = T) 

# Perform PCA:
pca <- prcomp(data, scale = T)
# Note that we chose to normalize the data - since variables such as number of kills and gold earned are operating on completely different scales.

# Take the first two principal components for plotting:
x <- summary(pca)$x[,1:2]

# Draw the scatterplot:
plot(x, pch = 16)
text(x, labels=row.names(data), cex= 0.8, pos=1) # this adds the labels to the point so we know which champion it is.

plot of chunk unnamed-chunk-9

What is cool about this plot is that champions with similar roles are automatically grouped together; for example, we see that most support champions are on the lower left corner - while AP mid laners are on the lower right corner. Karma and Zyra, which share the playstyle of both support and AP mid lane, are somewhere in between. It's fairly easy to see that champions with similar playstyles are indeed close in this scatterplot (click here for the same plot in high resolution).

What is cooler is that our data does not contain any information about each champion's role in the game - we employed a data mining algorithm which digested a fairly complex set of data and illustrated the answer for us.

You may remember that by performing principal component analysis we concentrate the total variance of the dataset onto the first two principal components. We can illustrate the effect as follows:

plot(pca)

plot of chunk unnamed-chunk-10

As you can see, the first two principal components contain a lot of variance and thus a lot of “information”. The principal components further down are not as valuable because they don't contain as much information as the first two. Since the total amount of variance is 16 (we have 16 original variables) and the first two principal components have roughly \(6.2 + 3 = 9.2\) variance, we can hand-wavingly argue that our scatterplot using principal component analysis captured about 57.5% of the total “information” from our original dataset. We still lost a lot of information from the original dataset, but sometimes sacrifices are needed.

So where did all the original variables go? We can visualize this by using a biplot:

p <- biplot(pca, cex=c(0.5, 1))

plot of chunk unnamed-chunk-11

What we see here is that a support champion generally has high ward killed, ward placed, and assists (arrows to the left); mid laners tend to have high magic damage dealt to champions (arrow to the bottom); junglers and other non-jungle tanks tend to have an assortment of high damage taken and neutral monsters killed (arrows to the top); assassins and ADCs tend to have high gold, high champion kills, high minion kills, and high gold earned (arrows to the right). All of the information we've mined from the data conforms to our intuition about the game.

On a closing note, data mining techniques such as Principal Component Analysis reduces the the complexity of the data by what is called dimension reduction. When we are faced with too many variables (in our case, 16) that we cannot easily analyze, we reduce it to 2 so we can more easily comprehend the data. Do keep in mind there is a series of standard techniques on dimension reduction and this barely scratches the surface of how we can reduce the complexity of the data. In fact, the way which we compute KDA can be seen as a dimension reduction technique where we reduce 3 variables (kills, deaths, and assists) into 1 variable.

Do you want to learn more? Here's a reading list:

  1. If you know little on linear algebra or statistics and would like to learn Principal Component Analysis, try this tutorial which starts from the very basics.

  2. If you think your mathematical background is sufficiently strong, try this tutorial instead.

  3. If you would like to go beyond Principal Component Analysis (including Sparse PCA and ICA), one source I recommend is Hastie et al.'s book Sections 14.5 - 14.7.