This post is inspired by the model and post written by Matt Dancho, replicated here on a different data set, with the ultimate goal of interpreting the output of a machine learning model in a visual and understandable way. As Matt Dancho put it: "Machine learning is great, until you have to explain it."
This short blog post aims to visually explain the output of a trained machine learning model using the modelStudio
library, which is hosted on CRAN.
modelStudio
is a new R package that makes it easy to interactively explain machine learning models using state-of-the-art techniques like Shapley Values, Break Down plots, and Partial Dependence (Matt Dancho, 2022).
In this blog, we will learn how to make the 4 most important Explainable AI plots using four R packages: modelStudio
, DALEX
, tidyverse
and tidymodels
. The data to be used looks like this:
library(flextable)

# Render the first rows of the data as a formatted table
ft <- flextable(head(data_tbl))
ft <- autofit(ft)
flextable_to_rmd(ft)
| mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | diagnosis |
|---|---|---|---|---|---|
| 17.99 | 10.38 | 122.80 | 1,001.0 | 0.11840 | 0 |
| 20.57 | 17.77 | 132.90 | 1,326.0 | 0.08474 | 0 |
| 19.69 | 21.25 | 130.00 | 1,203.0 | 0.10960 | 0 |
| 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0 |
| 20.29 | 14.34 | 135.10 | 1,297.0 | 0.10030 | 0 |
| 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0 |
We want to understand how breast cancer diagnosis status can be estimated based on the remaining 5 columns.
The best way to understand what affects the cancer diagnosis decision is to build a predictive model (and then explain it). Let's build an xgboost
model using the tidymodels
ecosystem. If you've never heard of tidymodels
, it is to R what Scikit-Learn is to Python, and the successor of the caret ecosystem.
Select the Model Type: We use the boost_tree()
function to establish that we are making a Boosted Tree model.
Set the Mode: Using set_mode()
we select classification because we are predicting the class label of a patient.
Set the Engine: Next we use set_engine()
to tell Tidymodels to use the xgboost
library.
Fit the Model: This performs a simple training of the model to fit each of the 5 predictors to the target diagnosis. Note that we did not perform cross-validation, hyperparameter tuning, or any advanced concepts as they are beyond the scope of this tutorial.
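The four steps above can be sketched as follows. This is a minimal illustration, assuming data_tbl holds the table shown earlier (with diagnosis in column 6) and that the fitted model is stored as fit_xgboost, the name used by the explainer later on:

```r
library(tidymodels)

# diagnosis must be a factor for a classification fit
train_tbl <- data_tbl %>%
    mutate(diagnosis = factor(diagnosis))

fit_xgboost <- boost_tree() %>%           # 1. Select the model type: Boosted Tree
    set_mode("classification") %>%        # 2. Set the mode: predict a class label
    set_engine("xgboost") %>%             # 3. Set the engine: the xgboost library
    fit(diagnosis ~ ., data = train_tbl)  # 4. Fit diagnosis on all 5 predictors
```

No cross-validation or hyperparameter tuning is done here, matching the simple training described above.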
With the above predictive model, we are ready to create an explainer. In basic terms, an explainer is a consistent and unified way to explain predictive models. The explainer can accept many different model types, for example models built with tidymodels, caret, mlr, or h2o.
Now, below is the code to create the explainer.
#---------Explainer
explainer <- DALEX::explain(
  model = fit_xgboost,
  data  = data_tbl[, -6],
  y     = as.numeric(unlist(data_tbl[, 6])),
  label = "Extreme Gradient Boosting Machine (XGBoost)"
)
Preparation of a new explainer is initiated
-> model label : Extreme Gradient Boosting Machine (XGBoost)
-> data : 569 rows 5 cols
-> data : tibble converted into a data.frame
-> target variable : 569 values
-> predict function : yhat.model_fit will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package parsnip , ver. 1.0.3 , task classification ( default )
-> predicted values : numerical, min = 0.00912416 , mean = 0.6253965 , max = 0.9911299
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = 0.1481095 , mean = 1.00202 , max = 1.515258
A new explainer has been created!
# Build the interactive dashboard from the explainer
modStudio <- modelStudio::modelStudio(explainer = explainer)
#modStudio
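The modelStudio dashboard bundles the 4 Explainable AI plots interactively, but DALEX itself can also produce static versions of each one. A sketch, assuming the explainer created above (the plot types are Break Down, Shapley Values, Partial Dependence, and Feature Importance):

```r
library(DALEX)

# One patient to explain: the first row of the predictors
new_obs <- data_tbl[1, -6]

# Break Down plot: how each feature shifts this single prediction
plot(predict_parts(explainer, new_observation = new_obs, type = "break_down"))

# Shapley Values for the same observation
plot(predict_parts(explainer, new_observation = new_obs, type = "shap"))

# Partial Dependence of the prediction on one feature
plot(model_profile(explainer, variables = "mean_radius"))

# Permutation-based Feature Importance across all predictors
plot(model_parts(explainer))
```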
We express our sincere gratitude to the developers of
modelStudio
, Hubert Baniecki and Przemyslaw Biecek. This package is part of the Dr. Why ecosystem of R packages, a collection of tools for Visual Exploration, Explanation and Debugging of Predictive Models. Thank you for everything you do; we owe you much respect for simplifying our work.