This post is inspired by the model and post written by Matt Dancho, replicated here on a different data set, with the ultimate goal of interpreting the output of a machine learning model in a visual and understandable way. As Matt Dancho put it: "Machine learning is great, until you have to explain it."
This short blog post aims to visually explain the output of a trained machine learning model using the modelStudio
library, which is hosted on CRAN.
modelStudio
is a new R package that makes it easy to interactively explain machine learning models using state-of-the-art techniques like Shapley Values, Break Down plots, and Partial Dependence (Matt Dancho, 2022).
In this blog, we will learn how to make the 4 most important Explainable AI plots using four R packages: modelStudio
, DALEX
, tidyverse
and tidymodels
. The data to be used looks like this:
library(flextable)

# Render the first rows of the data as a formatted table
ft <- flextable(head(data_tbl))
ft <- autofit(ft)
flextable_to_rmd(ft)
| mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | diagnosis |
|---|---|---|---|---|---|
| 17.99 | 10.38 | 122.80 | 1,001.0 | 0.11840 | 0 |
| 20.57 | 17.77 | 132.90 | 1,326.0 | 0.08474 | 0 |
| 19.69 | 21.25 | 130.00 | 1,203.0 | 0.10960 | 0 |
| 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0 |
| 20.29 | 14.34 | 135.10 | 1,297.0 | 0.10030 | 0 |
| 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0 |
We want to understand how breast cancer diagnosis status can be estimated based on the remaining 5 columns.
The best way to understand what affects the cancer diagnosis decision is to build a predictive model (and then explain it). Let's build an xgboost
model using the tidymodels
ecosystem. If you've never heard of tidymodels
, it is to R what Scikit-Learn is to Python, and the successor of the caret ecosystem.
Select the Model Type: We use the boost_tree()
function to establish that we are making a Boosted Tree model.
Set the Mode: Using set_mode()
we select classification because we are predicting the class label of a patient.
Set the Engine: Next we use set_engine()
to tell Tidymodels to use the xgboost
library.
Fit the Model: This performs a simple training of the model to fit each of the 5 predictors to the target diagnosis. Note that we did not perform cross-validation, hyperparameter tuning, or any advanced concepts as they are beyond the scope of this tutorial.
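The four steps above can be sketched as follows. This is a minimal illustration, assuming data_tbl holds the table shown earlier (with diagnosis in column 6) and that the fitted model is stored as fit_xgboost, the name used by the explainer later on:

```r
library(tidymodels)

# diagnosis must be a factor for a classification fit
train_tbl <- data_tbl %>%
    mutate(diagnosis = factor(diagnosis))

fit_xgboost <- boost_tree() %>%           # 1. Select the model type: Boosted Tree
    set_mode("classification") %>%        # 2. Set the mode: predict a class label
    set_engine("xgboost") %>%             # 3. Set the engine: the xgboost library
    fit(diagnosis ~ ., data = train_tbl)  # 4. Fit diagnosis on all 5 predictors
```

No cross-validation or hyperparameter tuning is done here, matching the simple training described above.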
With the above predictive model, we are ready to create an explainer. In basic terms, an explainer is a consistent and unified way to explain predictive models. The explainer can accept many different model types, for example models built with tidymodels, caret, mlr, or h2o.
Now, below is the code to create the explainer.
#---------Explainer
explainer <- DALEX::explain(
  model = fit_xgboost,
  data  = data_tbl[, -6],
  y     = as.numeric(unlist(data_tbl[, 6])),
  label = "Extreme Gradient Boosting Machine (XGBoost)"
)
Preparation of a new explainer is initiated
-> model label : Extreme Gradient Boosting Machine (XGBoost)
-> data : 569 rows 5 cols
-> data : tibble converted into a data.frame
-> target variable : 569 values
-> predict function : yhat.model_fit will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package parsnip , ver. 1.0.3 , task classification ( default )
-> predicted values : numerical, min = 0.00912416 , mean = 0.6253965 , max = 0.9911299
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = 0.1481095 , mean = 1.00202 , max = 1.515258
A new explainer has been created!
# Build the interactive dashboard from the explainer
modStudio <- modelStudio::modelStudio(explainer = explainer)
#modStudio
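The modelStudio dashboard bundles the 4 Explainable AI plots interactively, but DALEX itself can also produce static versions of each one. A sketch, assuming the explainer created above (the plot types are Break Down, Shapley Values, Partial Dependence, and Feature Importance):

```r
library(DALEX)

# One patient to explain: the first row of the predictors
new_obs <- data_tbl[1, -6]

# Break Down plot: how each feature shifts this single prediction
plot(predict_parts(explainer, new_observation = new_obs, type = "break_down"))

# Shapley Values for the same observation
plot(predict_parts(explainer, new_observation = new_obs, type = "shap"))

# Partial Dependence of the prediction on one feature
plot(model_profile(explainer, variables = "mean_radius"))

# Permutation-based Feature Importance across all predictors
plot(model_parts(explainer))
```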
We express our sincere gratitude to the developers of
modelStudio
, Hubert Baniecki and Przemyslaw Biecek. This package is part of the Dr. Why ecosystem of R packages, a collection of tools for Visual Exploration, Explanation and Debugging of Predictive Models. Thank you for everything you do; we owe you much respect for simplifying our work.