#Libraries:
library(dplyr)
library(ggplot2)
E-Sport Analysis - League of Legends
Hello!
Here I'm going to try some things out with the available data from official League of Legends games. The data was extracted from https://oracleselixir.com/tools/downloads.
Some libraries need to be imported. I like to work with only a few libraries, such as dplyr for data manipulation and ggplot2 for visualizations; other libraries may be used throughout the code.
The data is loaded from a CSV file downloaded from the website specified before.
#Loading Data:
data = read.csv("league_data.csv")
#Dimensions:
paste0("# Rows: ", dim(data)[1], "; # Columns: ", dim(data)[2])
[1] "# Rows: 135384; # Columns: 123"
Clearly we have a lot of data to work with and a ton of variables. Since I have played the game for a few years, I will skip the part of understanding the variables and try some exploration with variables that I think should suggest some nice hypotheses for the creation of a possible model.
I'll try to answer some questions that I personally would like answered, using only the CBLOL's data (Brazil's league) from the second split.
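For anyone who hasn't seen this dataset, a quick peek at just the columns used below can help; a minimal sketch (glimpse ships with dplyr):
#A quick look at the columns used in this analysis (a sketch):
data |>
  select(league, split, position, teamname, playoffs, gameid,
         result, kills, deaths, towers, dragons, cspm) |>
  glimpse()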
#Filtering CBLOL Data and only Teams Data:
cblol_teams_data = data |>
  filter(league == "CBLOL" & position == "team" & split == "Split 2")
How many teams does the CBLOL have, and how are they ranked?
#Teams' Ranks:
cblol_teams_data |>
  filter(playoffs == 0) |>
  group_by(teamname) |>
  summarise("Games Played" = sum(game),
            "Win Rate" = sum(result)/sum(game)) |>
  arrange(desc(`Win Rate`)) |>
  mutate_if(is.double, scales::percent, accuracy = 0.01) |>
  rename("Team" = teamname)
# A tibble: 10 × 3
Team `Games Played` `Win Rate`
<chr> <int> <chr>
1 FURIA 18 83.33%
2 RED Canids 18 72.22%
3 LOUD 18 66.67%
4 paiN Gaming 18 66.67%
5 KaBuM! e-Sports 18 61.11%
6 Liberty 18 44.44%
7 Miners 18 44.44%
8 Flamengo Los Grandes 18 33.33%
9 INTZ 18 22.22%
10 Rensga eSports 18 5.56%
So we had a good regular season: 5 teams above a 50% win rate, which is nice. But I think we can probably find some patterns separating those top 5 teams from the bottom ones. Let's go.
How are the kills and deaths distributed across these teams?
To visualize the kills and deaths variables from every team game, we will use boxplots.
#Plotting the Kills and Deaths variables:
## Kills:
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, kills) |>
  ggplot(aes(x = reorder(teamname, kills), y = kills)) +
  geom_boxplot(col = "black", fill = "gray") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, colour = "black")) +
  labs(title = "Teams' Kills Distributions",
       x = "Team", y = "Kills") +
  geom_hline(yintercept = median(filter(cblol_teams_data, playoffs == 0)$kills), linetype = 2, col = "red") #Plotting the general kills median
## Deaths:
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, deaths) |>
  ggplot(aes(x = reorder(teamname, deaths), y = deaths)) +
  geom_boxplot(col = "black", fill = "gray") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, colour = "black")) +
  labs(title = "Teams' Deaths Distributions",
       x = "Team", y = "Deaths") +
  geom_hline(yintercept = median(filter(cblol_teams_data, playoffs == 0)$deaths), linetype = 2, col = "red") #Plotting the general deaths median
Analyzing these boxplots, we can note a few points:
Looking at the kills boxplots, we find that the teams that won more in the regular season have higher kills distributions and also smaller deviations. RED Canids is the only top-5 team with a higher deviation, but they probably just have a more aggressive style. The good news is that all five top teams have a kills median above the overall median, so kills and wins show some correlation;
We can draw the same correlation point for deaths and wins, but looking at the distributions we can see that the teams' deaths have higher deviations in general, even within the top 5.
We can build a scatter plot using each team's average kills and deaths as the axes, look for profiles, and check the correlation hypothesis.
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, deaths, kills) |>
  group_by(teamname) |>
  summarise(KillsAvg = mean(kills), DeathsAvg = mean(deaths)) |>
  ggplot(aes(x = KillsAvg, y = DeathsAvg, label = teamname)) +
  geom_point() +
  geom_text(vjust = -0.8) +
  theme_classic() +
  xlim(c(5, 20)) +
  geom_hline(yintercept = mean(filter(cblol_teams_data, playoffs == 0)$deaths), linetype = 2, col = "red") +
  geom_vline(xintercept = mean(filter(cblol_teams_data, playoffs == 0)$kills), linetype = 2, col = "red") +
  labs(title = "Teams Kills Average by Deaths",
       x = "Kills Average", y = "Deaths Average")
Yes, we definitely have a correlation between the kills and deaths distributions and a better win rate: all the top 5 teams sit in the fourth quadrant, meaning that all of them kill above the average and die below the average per game.
I thought the second split was very competitive in general, but looking at the win rates and the kills and deaths variables, there is a very clear separation: we had 5 teams (the top 5) that were very competitive among themselves, and 5 other teams that were not that good in general.
I'd like to look at some other variables that I think would be nice to have in a model.
What about the towers and the farm?
Well, in League of Legends we need to destroy all 3 towers in at least one lane before reaching the nexus towers, so we need to destroy at minimum 5 towers to win a game (3 in one lane plus the 2 at the nexus). Furthermore, towers also give gold and open the map to new possibilities. Farming (killing creeps) is very important too, because if you farm a lot you'll earn a ton of gold, and therefore you'll have more items and possibly win the game.
#Scatter Plot Creeps per Minute and Towers:
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, towers, cspm, result) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("Lose", "Win"))) |>
  ggplot(aes(x = towers, y = cspm)) +
  geom_point(aes(col = result)) +
  scale_colour_viridis_d(end = 0.6) +
  theme_classic() +
  geom_smooth(method = "lm", se = F) +
  labs(title = "Towers Destroyed versus CS Per Minute by Result",
       x = "Towers Destroyed by Team", y = "Creeps Farmed per Minute")
Looking at the previous plot, we can see that winning a game "depends" on a team's farm and its capacity to destroy towers efficiently; it's very rare to win a game destroying only 5 towers (the minimum). The nice thing about this plot is that I don't even need to see the teams' statistics, because I already know that the more you win, the more you focus on destroying towers and farming a lot.
Just a comment: even though I could have used boxplots to compare the result with these variables, the scatter plot let me use a single visualization and also look at the correlation between the towers and cspm variables.
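To put a number on "very rare", a quick tally of towers destroyed in winning games would do it; a minimal sketch (count comes from dplyr):
#How often do winners take only the minimum 5 towers? (a sketch):
cblol_teams_data |>
  filter(playoffs == 0, result == 1) |>
  count(towers) |>
  mutate(prop = n/sum(n)) #Share of wins by number of towers destroyed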
Dragons are important, aren't they?
Dragons bring buffs for the champions' abilities, but a specific dragon is not always worth a fight with the other team. Let's see if we can add the dragons variable to a possible model.
#Dragons Killed by Game:
cblol_teams_data |>
  filter(playoffs == 0) |>
  select(gameid, teamname, dragons, result) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("Lose", "Win"))) |>
  ggplot(aes(x = result, y = dragons)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Dragons Distribution by Match Result")
So, both distributions are very symmetric, and even though we can see that more dragons probably means more wins, we can also see that 75% of the matches had at least one dragon, and 50% of the lost matches still had 2 or more dragons.
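The quartiles behind that reading can be checked numerically; a minimal sketch using the same columns:
#Dragons quartiles by match result (a sketch):
cblol_teams_data |>
  filter(playoffs == 0) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("Lose", "Win"))) |>
  group_by(result) |>
  summarise(Q1 = quantile(dragons, 0.25),
            Median = median(dragons),
            Q3 = quantile(dragons, 0.75))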
Build a model
After doing some exploratory and explanatory analysis with the prior knowledge I already had, I'll try to build a model for predicting the winning probability using only the variables seen before. We have sufficient evidence to believe that we should be able to build a nice model using only kills, deaths, towers, CS per minute and dragons to model the result. One important observation: although my dataset has data from all the leagues around the world, I'll use only Brazil's data from splits 1 and 2, because that gives us sufficient data to train and test a model and probably get a good result. Let's see.
#Libraries for fitting a model:
library(tidymodels)
I don't feel the need to use a very sophisticated model to predict the winning probability, so let's try one of the simplest models possible: logistic regression, a (generalized) linear model!
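For reference, logistic regression models the win probability as the inverse logit of a linear combination of the features; a minimal sketch of that mapping:
#What the model estimates (a sketch of the functional form):
#P(win) = inv_logit(b0 + b1*kills + b2*deaths + b3*towers + b4*dragons + b5*cspm)
inv_logit = function(x) 1/(1 + exp(-x))
inv_logit(0) #0.5: any linear score is squashed into a probability between 0 and 1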
#Data Model:
models_data = cblol_teams_data |>
  select(result, kills, deaths, towers, dragons, cspm) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("lose", "win")))
Before dividing into training and testing sets, let's check the class balance.
models_data |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 121 0.5
2 win 121 0.5
As we can see, the classes of the result variable are perfectly balanced.
#Data Splitting:
set.seed(23)
model_splits <- initial_split(models_data)
model_train <- training(model_splits)
model_test <- testing(model_splits)
#Training Set Props:
model_train |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 91 0.503
2 win 90 0.497
#Testing Set Props:
model_test |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 30 0.492
2 win 31 0.508
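The proportions stayed close to 50/50 even without stratification. If we wanted to guarantee it, initial_split accepts a strata argument; a minimal sketch (the strat_* names are just illustrative):
#A stratified alternative to the split above (a sketch; names are hypothetical):
set.seed(23)
strat_splits <- initial_split(models_data, strata = result)
strat_train <- training(strat_splits)
strat_test <- testing(strat_splits)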
#Logistic Model with simple engine:
glm_model = logistic_reg() |>
  set_engine("glm")
glm_model
Logistic Regression Model Specification (classification)
Computational engine: glm
Now we use a workflow for training our models.
model_wf = workflow() |>
  add_formula(result ~ .) #Using all the previous variables as features
model_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: None
── Preprocessor ────────────────────────────────────────────────────────────────
result ~ .
#Logistic regression:
glm_fit = model_wf |>
  add_model(glm_model) |>
  fit(model_train)
tidy(glm_fit)
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -20.8 12.0 -1.73 0.0834
2 kills 0.266 0.136 1.96 0.0499
3 deaths -0.148 0.105 -1.40 0.160
4 towers 0.913 0.342 2.67 0.00758
5 dragons 0.0444 0.452 0.0982 0.922
6 cspm 0.404 0.322 1.25 0.210
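Towers has the strongest effect here. To read the estimates on a more intuitive scale, we can exponentiate them into odds ratios; a minimal sketch on top of the tidy() output:
#Coefficients as odds ratios (a sketch): each unit of a feature multiplies the odds of winning
tidy(glm_fit) |>
  mutate(odds_ratio = exp(estimate))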
Evaluate our model
data_with_preds = model_test |>
  bind_cols(predict(glm_fit, new_data = model_test)) |> #Predicted class
  bind_cols(predict(glm_fit, new_data = model_test, type = "prob")) #Predicted probability
#Conf Matrix:
table(data_with_preds$result, data_with_preds$.pred_class)
lose win
lose 30 0
win 0 31
Well, I'm not very comfortable with this model; a perfect confusion matrix on the test set is very suspicious...
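One way to probe that suspicion without touching the test set would be cross-validation on the training data; a minimal sketch, assuming tidymodels' vfold_cv and fit_resamples (both ship with the tidymodels bundle):
#A 10-fold cross-validation check of the same workflow (a sketch):
set.seed(23)
folds = vfold_cv(model_train, v = 10, strata = result)
cv_results = model_wf |>
  add_model(glm_model) |>
  fit_resamples(folds)
collect_metrics(cv_results) #Average accuracy and ROC AUC across folds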
Either way, let's use all the data available; this will make our training and testing sets much larger. But first, we need to check whether the leagues around the world have very different winning patterns with respect to the features used before.
Looking at the leagues
leagues_data = data |>
  filter(position == "team") |>
  select(league, result, kills, deaths, towers, dragons, cspm) |>
  mutate(result = factor(result, levels = c(0, 1), labels = c("lose", "win"))) |>
  na.omit()
dim(leagues_data)
[1] 19080 7
Much more data…
#Number of leagues:
leagues_data |>
  select(league) |> unique() |> dim()
[1] 43 1
With 43 different leagues, it is not practical to compare each one individually, but we can create 2 categories: CBLOL and Others Leagues.
leagues_data = leagues_data |>
  mutate(league_cat = if_else(league == "CBLOL", "CBLOL", "Others Leagues"))
leagues_data |>
  select(league_cat) |> unique() |> dim()
[1] 2 1
Now we have a new category and can compare the leagues. To save some time, we will compare all the variables at once using the median and the mean as measures; this way we get a quick view of each distribution's center and any skewness.
#Comparing Leagues:
leagues_data |>
  group_by(league_cat) |>
  summarise("Kills Median (Mean)" = paste0(round(median(kills), 1), " (", round(mean(kills), 1), ")"),
            "Deaths Median (Mean)" = paste0(round(median(deaths), 1), " (", round(mean(deaths), 1), ")"),
            "Towers Median (Mean)" = paste0(round(median(towers), 1), " (", round(mean(towers), 1), ")"),
            "Dragons Median (Mean)" = paste0(round(median(dragons), 1), " (", round(mean(dragons), 1), ")"))
# A tibble: 2 × 5
league_cat `Kills Median (Mean)` `Deaths Median (Mean)` Towers M…¹ Drago…²
<chr> <chr> <chr> <chr> <chr>
1 CBLOL 15 (14.2) 15 (14.3) 7 (6.3) 2 (2.3)
2 Others Leagues 14 (14.4) 14 (14.5) 7 (6.1) 2 (2.3)
# … with abbreviated variable names ¹`Towers Median (Mean)`,
# ²`Dragons Median (Mean)`
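One feature missing from the table above is cspm; the same pattern would cover it, a minimal sketch:
#Adding the cspm comparison (a sketch, same pattern as the table above):
leagues_data |>
  group_by(league_cat) |>
  summarise("CSPM Median (Mean)" = paste0(round(median(cspm), 1), " (", round(mean(cspm), 1), ")"))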
Good news: we can definitely use all the data to fit a new model, since we don't have any evidence whatsoever that the leagues have different winning patterns for these variables.
New Model
#Data Splitting:
set.seed(23)
model2_splits <- initial_split(select(leagues_data, -league, -league_cat))
model2_train <- training(model2_splits)
model2_test <- testing(model2_splits)
#Training Set Props:
model2_train |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 7119 0.497
2 win 7191 0.503
#Testing Set Props:
model2_test |>
  group_by(result) |> summarise(Qty = n()) |>
  mutate("%" = Qty/sum(Qty))
# A tibble: 2 × 3
result Qty `%`
<fct> <int> <dbl>
1 lose 2424 0.508
2 win 2346 0.492
Now we use the same workflow for training our models.
#Logistic regression:
glm_fit2 = model_wf |>
  add_model(glm_model) |>
  fit(model2_train)
tidy(glm_fit2)
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -11.4 1.03 -11.0 2.37e- 28
2 kills 0.324 0.0149 21.8 1.20e-105
3 deaths -0.367 0.0139 -26.4 4.09e-154
4 towers 1.09 0.0410 26.5 2.63e-155
5 dragons 0.0317 0.0546 0.579 5.62e- 1
6 cspm 0.140 0.0278 5.04 4.66e- 7
Evaluate our model
data_with_preds2 = model2_test |>
  bind_cols(predict(glm_fit2, new_data = model2_test)) |> #Predicted class
  bind_cols(predict(glm_fit2, new_data = model2_test, type = "prob")) #Predicted probability
#Conf Matrix:
table(data_with_preds2$result, data_with_preds2$.pred_class)
lose win
lose 2370 54
win 39 2307
#Accuracy:
data_with_preds2 |>
  mutate(right = if_else(.pred_class == result, 1, 0)) |>
  summarise("Accuracy" = scales::percent(sum(right)/n(), accuracy = 0.1))
Accuracy
1 98.1%
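The same number (plus the confusion matrix) can come straight from yardstick, which is loaded with tidymodels; a minimal sketch:
#Accuracy and confusion matrix via yardstick (a sketch):
accuracy(data_with_preds2, truth = result, estimate = .pred_class)
conf_mat(data_with_preds2, truth = result, estimate = .pred_class)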
Wow, we have a very, very good model. Let's take a look at the ROC curve.
roc_curve(data_with_preds2, result, .pred_lose) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty = 3) +
coord_equal() +
theme_bw()
The AUC looks very good as well, which means that we have a very strong model.
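To attach an exact number to the curve, yardstick's roc_auc takes the same arguments as roc_curve; a minimal sketch:
#Computing the AUC explicitly (a sketch):
roc_auc(data_with_preds2, result, .pred_lose)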