#Libraries:
library(dplyr)
library(ggplot2)
E-Sport Analysis - League of Legends
Hello!
Here I'm going to try some things out with the available data from official League of Legends games. The data was extracted from https://oracleselixir.com/tools/downloads.
Some libraries need to be imported. I like to work with only a few libraries, such as dplyr for data manipulation and ggplot2 for visualizations; other libraries may be used throughout the code.
The data is loaded from a CSV file downloaded from the website specified before.
#Loading Data:
data = read.csv("league_data.csv")
#Dimensions:
paste0("# Rows: ", dim(data)[1], "; # Columns: ", dim(data)[2])
[1] "# Rows: 135384; # Columns: 123"
Clearly we have a lot of data to work with and a ton of variables. Since I have played the game for a few years, I will skip the part of understanding the variables and try some exploration with variables that I think should suggest some nice hypotheses for the creation of a possible model.
I'll try to answer some questions that I personally would like answered, using only the CBLOL's data (Brazil's league) from the second split.
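For anyone who hasn't seen this dataset, a quick peek at just the columns used below can help; a minimal sketch (glimpse ships with dplyr):
#A quick look at the columns used in this analysis (a sketch):
data |>
  select(league, split, position, teamname, playoffs, gameid,
         result, kills, deaths, towers, dragons, cspm) |>
  glimpse()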
#Filtering CBLOL Data and only Teams Data:
cblol_teams_data = data |>
  filter(league == "CBLOL" & position == "team" & split == "Split 2")
How many teams does the CBLOL have, and how are they ranked?
#Teams' Ranks:
cblol_teams_data |>
  filter(playoffs == 0) |>
  group_by(teamname) |>
  summarise("Games Played" = sum(game),
            "Win Rate" = sum(result)/sum(game)) |>
  arrange(desc(`Win Rate`)) |>
  mutate_if(is.double, scales::percent, accuracy = 0.01) |>
  rename("Team" = teamname)
# A tibble: 10 × 3
Team `Games Played` `Win Rate`
<chr> <int> <chr>
1 FURIA 18 83.33%
2 RED Canids 18 72.22%
3 LOUD 18 66.67%
4 paiN Gaming 18 66.67%
5 KaBuM! e-Sports 18 61.11%
6 Liberty 18 44.44%
7 Miners 18 44.44%
8 Flamengo Los Grandes 18 33.33%
9 INTZ 18 22.22%
10 Rensga eSports 18 5.56%
So we had a good regular season: 5 teams above a 50% win rate, which is nice. But I think we can probably find some patterns separating those top 5 teams from the bottom ones. Let's go.
How are the kills and deaths distributed across these teams?
To visualize the kills and deaths variables from every team game, we will use boxplots.
#Plotting the Kills and Deaths variables:
## Kills:
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, kills) |>
  ggplot(aes(x = reorder(teamname, kills), y = kills)) +
  geom_boxplot(col = "black", fill = "gray") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, colour = "black")) +
  labs(title = "Teams' Kills Distributions",
       x = "Team", y = "Kills") +
  geom_hline(yintercept = median(filter(cblol_teams_data, playoffs == 0)$kills), linetype = 2, col = "red") #Plotting the general kills median
## Deaths:
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, deaths) |>
  ggplot(aes(x = reorder(teamname, deaths), y = deaths)) +
  geom_boxplot(col = "black", fill = "gray") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, colour = "black")) +
  labs(title = "Teams' Deaths Distributions",
       x = "Team", y = "Deaths") +
  geom_hline(yintercept = median(filter(cblol_teams_data, playoffs == 0)$deaths), linetype = 2, col = "red") #Plotting the general deaths median
Analyzing these boxplots, we can note a few points:
Looking at the kills boxplots, we find that the teams that won more in the regular season have higher kills distributions and also smaller deviations. RED Canids is the only top-5 team with a higher deviation, but they probably just have a more aggressive style. The good news is that all five top teams have a kills median above the overall median, so kills and wins show some correlation;
We can draw the same correlation point for deaths and wins, but looking at the distributions we can see that the teams' deaths have higher deviations in general, even within the top 5.
We can build a scatter plot using each team's average kills and deaths as the axes, look for profiles, and check the correlation hypothesis.
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, deaths, kills) |>
  group_by(teamname) |>
  summarise(KillsAvg = mean(kills), DeathsAvg = mean(deaths)) |>
  ggplot(aes(x = KillsAvg, y = DeathsAvg, label = teamname)) +
  geom_point() +
  geom_text(vjust = -0.8) +
  theme_classic() +
  xlim(c(5, 20)) +
  geom_hline(yintercept = mean(filter(cblol_teams_data, playoffs == 0)$deaths), linetype = 2, col = "red") +
  geom_vline(xintercept = mean(filter(cblol_teams_data, playoffs == 0)$kills), linetype = 2, col = "red") +
  labs(title = "Teams Kills Average by Deaths",
       x = "Kills Average", y = "Deaths Average")
Yes, we definitely have a correlation between the kills and deaths distributions and a better win rate: all the top 5 teams sit in the fourth quadrant, meaning that all of them kill above the average and die below the average per game.
I thought the second split was very competitive in general, but looking at the win rates and the kills and deaths variables, there is a very clear separation: we had 5 teams (the top 5) that were very competitive among themselves, and 5 other teams that were not that good in general.
I'd like to look at some other variables that I think would be nice to have in a model.
What about the towers and the farm?
Well, in League of Legends we need to destroy all 3 towers in at least one lane before reaching the nexus towers, so we need to destroy at minimum 5 towers to win a game (3 in one lane plus the 2 at the nexus). Furthermore, towers also give gold and open the map to new possibilities. Farming (killing creeps) is very important too, because if you farm a lot you'll earn a ton of gold, and therefore you'll have more items and possibly win the game.
#Scatter Plot Creeps per Minute and Towers:
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, towers, cspm, result) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("Lose", "Win"))) |>
  ggplot(aes(x = towers, y = cspm)) +
  geom_point(aes(col = result)) +
  scale_colour_viridis_d(end = 0.6) +
  theme_classic() +
  geom_smooth(method = "lm", se = F) +
  labs(title = "Towers Destroyed versus CS Per Minute by Result",
       x = "Towers Destroyed by Team", y = "Creeps Farmed per Minute")
Looking at the previous plot, we can see that winning a game "depends" on a team's farm and its capacity to destroy towers efficiently; it's very rare to win a game destroying only 5 towers (the minimum). The nice thing about this plot is that I don't even need to see the teams' statistics, because I already know that the more you win, the more you focus on destroying towers and farming a lot.
Just a comment: even though I could have used boxplots to compare the result with these variables, the scatter plot let me use a single visualization and also look at the correlation between the towers and cspm variables.
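To put a number on "very rare", a quick tally of towers destroyed in winning games would do it; a minimal sketch (count comes from dplyr):
#How often do winners take only the minimum 5 towers? (a sketch):
cblol_teams_data |>
  filter(playoffs == 0, result == 1) |>
  count(towers) |>
  mutate(prop = n/sum(n)) #Share of wins by number of towers destroyed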
Dragons are important, aren't they?
Dragons bring buffs for the champions' abilities, but a specific dragon is not always worth a fight with the other team. Let's see if we can add the dragons variable to a possible model.
#Dragons Killed by Game:
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, dragons, result) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("Lose", "Win"))) |>
  ggplot(aes(x = result, y = dragons)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Dragons Distribution by Match Result")
So, both distributions are very symmetric, and even though we can see that more dragons probably means more wins, we can also see that 75% of the matches had at least one dragon, and 50% of the lost matches still had 2 or more dragons.
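The quartiles behind that reading can be checked numerically; a minimal sketch using the same columns:
#Dragons quartiles by match result (a sketch):
cblol_teams_data |>
  filter(playoffs == 0) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("Lose", "Win"))) |>
  group_by(result) |>
  summarise(Q1 = quantile(dragons, 0.25),
            Median = median(dragons),
            Q3 = quantile(dragons, 0.75))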
Build a model
After doing some exploratory and explanatory analysis with the prior knowledge I already had, I'll try to build a model for predicting the winning probability using only the variables seen before. We have sufficient evidence to believe that we should be able to build a nice model using only kills, deaths, towers, CS per minute and dragons to model the result. One important observation: although my dataset has data from all the leagues around the world, I'll use only Brazil's data from splits 1 and 2, because that gives us sufficient data to train and test a model and probably get a good result. Let's see.
#Libraries for fitting a model:
library(tidymodels)
I don't feel the need to use a very sophisticated model to predict the winning probability, so let's try one of the simplest models possible: logistic regression, a (generalized) linear model!
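For reference, logistic regression models the win probability as the inverse logit of a linear combination of the features; a minimal sketch of that mapping:
#What the model estimates (a sketch of the functional form):
#P(win) = inv_logit(b0 + b1*kills + b2*deaths + b3*towers + b4*dragons + b5*cspm)
inv_logit = function(x) 1/(1 + exp(-x))
inv_logit(0) #0.5: any linear score is squashed into a probability between 0 and 1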
#Data Model:
models_data = cblol_teams_data |>
  select(result, kills, deaths, towers, dragons, cspm) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("lose", "win")))
Before dividing into training and testing sets, let's check the class balance.
models_data |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 121 0.5
2 win 121 0.5
As we can see, the classes of the result variable are perfectly balanced.
#Data Splitting:
set.seed(23)
model_splits <- initial_split(models_data)
model_train <- training(model_splits)
model_test <- testing(model_splits)
#Training Set Props:
model_train |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 91 0.503
2 win 90 0.497
#Testing Set Props:
model_test |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 30 0.492
2 win 31 0.508
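The proportions stayed close to 50/50 even without stratification. If we wanted to guarantee it, initial_split accepts a strata argument; a minimal sketch (the strat_* names are just illustrative):
#A stratified alternative to the split above (a sketch; names are hypothetical):
set.seed(23)
strat_splits <- initial_split(models_data, strata = result)
strat_train <- training(strat_splits)
strat_test <- testing(strat_splits)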
#Logistic Model with simple engine:
glm_model = logistic_reg() |>
  set_engine("glm")
glm_model
Logistic Regression Model Specification (classification)
Computational engine: glm
Now we use a workflow for training our models.
model_wf = workflow() |>
  add_formula(result ~ .) #Using all the previous variables as features
model_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: None
── Preprocessor ────────────────────────────────────────────────────────────────
result ~ .
#Logistic regression:
glm_fit = model_wf |>
  add_model(glm_model) |>
  fit(model_train)
tidy(glm_fit)
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -20.8 12.0 -1.73 0.0834
2 kills 0.266 0.136 1.96 0.0499
3 deaths -0.148 0.105 -1.40 0.160
4 towers 0.913 0.342 2.67 0.00758
5 dragons 0.0444 0.452 0.0982 0.922
6 cspm 0.404 0.322 1.25 0.210
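Towers has the strongest effect here. To read the estimates on a more intuitive scale, we can exponentiate them into odds ratios; a minimal sketch on top of the tidy() output:
#Coefficients as odds ratios (a sketch): each unit of a feature multiplies the odds of winning
tidy(glm_fit) |>
  mutate(odds_ratio = exp(estimate))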
Evaluate our model
data_with_preds = model_test |>
  bind_cols(predict(glm_fit, new_data = model_test)) |> #Predicted class
  bind_cols(predict(glm_fit, new_data = model_test, type = "prob")) #Predicted probability
#Conf Matrix:
table(data_with_preds$result, data_with_preds$.pred_class)
lose win
lose 30 0
win 0 31
Well, I'm not very comfortable with this model; a perfect confusion matrix on the test set is very suspicious...
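One way to probe that suspicion without touching the test set would be cross-validation on the training data; a minimal sketch, assuming tidymodels' vfold_cv and fit_resamples (both ship with the tidymodels bundle):
#A 10-fold cross-validation check of the same workflow (a sketch):
set.seed(23)
folds = vfold_cv(model_train, v = 10, strata = result)
cv_results = model_wf |>
  add_model(glm_model) |>
  fit_resamples(folds)
collect_metrics(cv_results) #Average accuracy and ROC AUC across folds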
Either way, let's use all the data available; this will make our training and testing sets much larger. But first, we need to check whether the leagues around the world have very different winning patterns with respect to the features used before.
Looking at the leagues
leagues_data = data |>
  filter(position == "team") |>
  select(league, result, kills, deaths, towers, dragons, cspm) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("lose", "win"))) |>
  na.omit()
dim(leagues_data)
[1] 19080 7
Much more data…
#Number of leagues:
leagues_data |>
  select(league) |> unique() |> dim()
[1] 43 1
With 43 different leagues, it is not practical to compare each one individually, but we can create 2 categories: CBLOL and Others Leagues.
leagues_data = leagues_data |>
  mutate(league_cat = if_else(league == "CBLOL", "CBLOL", "Others Leagues"))
leagues_data |>
  select(league_cat) |> unique() |> dim()
[1] 2 1
Now we have a new category and can compare the leagues. To save some time, we will compare all the variables at once using the median and the mean as measures; this way we get a quick view of each distribution's center and any skewness.
#Comparing Leagues:
leagues_data |>
  group_by(league_cat) |>
  summarise("Kills Median (Mean)" = paste0(round(median(kills), 1), " (", round(mean(kills), 1), ")"),
            "Deaths Median (Mean)" = paste0(round(median(deaths), 1), " (", round(mean(deaths), 1), ")"),
            "Towers Median (Mean)" = paste0(round(median(towers), 1), " (", round(mean(towers), 1), ")"),
            "Dragons Median (Mean)" = paste0(round(median(dragons), 1), " (", round(mean(dragons), 1), ")"))
# A tibble: 2 × 5
league_cat `Kills Median (Mean)` `Deaths Median (Mean)` Towers M…¹ Drago…²
<chr> <chr> <chr> <chr> <chr>
1 CBLOL 15 (14.2) 15 (14.3) 7 (6.3) 2 (2.3)
2 Others Leagues 14 (14.4) 14 (14.5) 7 (6.1) 2 (2.3)
# … with abbreviated variable names ¹`Towers Median (Mean)`,
# ²`Dragons Median (Mean)`
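One feature missing from the table above is cspm; the same pattern would cover it, a minimal sketch:
#Adding the cspm comparison (a sketch, same pattern as the table above):
leagues_data |>
  group_by(league_cat) |>
  summarise("CSPM Median (Mean)" = paste0(round(median(cspm), 1), " (", round(mean(cspm), 1), ")"))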
Good news: we can definitely use all the data to fit a new model, since we don't have any evidence whatsoever that the leagues have different winning patterns for these variables.
New Model
#Data Splitting:
set.seed(23)
model2_splits <- initial_split(select(leagues_data, -league, -league_cat))
model2_train <- training(model2_splits)
model2_test <- testing(model2_splits)
#Training Set Props:
model2_train |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 7119 0.497
2 win 7191 0.503
#Testing Set Props:
model2_test |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 2424 0.508
2 win 2346 0.492
Now we use the same workflow for training our models.
#Logistic regression:
glm_fit2 = model_wf |>
  add_model(glm_model) |>
  fit(model2_train)
tidy(glm_fit2)
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -11.4 1.03 -11.0 2.37e- 28
2 kills 0.324 0.0149 21.8 1.20e-105
3 deaths -0.367 0.0139 -26.4 4.09e-154
4 towers 1.09 0.0410 26.5 2.63e-155
5 dragons 0.0317 0.0546 0.579 5.62e- 1
6 cspm 0.140 0.0278 5.04 4.66e- 7
Evaluate our model
data_with_preds2 = model2_test |>
  bind_cols(predict(glm_fit2, new_data = model2_test)) |> #Predicted class
  bind_cols(predict(glm_fit2, new_data = model2_test, type = "prob")) #Predicted probability
#Conf Matrix:
table(data_with_preds2$result, data_with_preds2$.pred_class)
lose win
lose 2370 54
win 39 2307
#Accuracy:
data_with_preds2 |>
  mutate(right = if_else(.pred_class == result, 1, 0)) |>
  summarise("Accuracy" = scales::percent(sum(right)/n(), accuracy = 0.1))
Accuracy
1 98.1%
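The same number (plus the confusion matrix) can come straight from yardstick, which is loaded with tidymodels; a minimal sketch:
#Accuracy and confusion matrix via yardstick (a sketch):
accuracy(data_with_preds2, truth = result, estimate = .pred_class)
conf_mat(data_with_preds2, truth = result, estimate = .pred_class)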
Wow, we have a very, very good model. Let's take a look at the ROC curve.
roc_curve(data_with_preds2, result, .pred_lose) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty = 3) +
coord_equal() +
theme_bw()
The AUC looks very good as well, which means that we have a very strong model.
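To attach an exact number to the curve, yardstick's roc_auc takes the same arguments as roc_curve; a minimal sketch:
#Computing the AUC explicitly (a sketch):
roc_auc(data_with_preds2, result, .pred_lose)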