Machine Learning Interpretability

Introduction

Machine learning algorithms are often said to be black-box models in that there is not a good idea of how the model is arriving at predictions.This has often hindered adopting machine learning models in certain industires where interpretation is key. Examples of such areas include financial institutions who are regulated and have to explain decisions such as rejecting loan application or detecting fraud.Also, A healthcare provider may want to identify what factors are driving each patient’s risk of some disease so they can directly address those risk factors with targeted health interventions . Model interprtability allows you to determine how much you can trust a prediction model as a whole,debug, inform human decision making, Inform feature engineering, data collection and provides insights that can be used to improve the model.

First we need to install the libraries for machine learning intrepatabilty and model building. The dataset fro this post can be found here here The data has features to predict heart disease.

Install Packages

!pip  install   matplotlib 
!pip install eli5
#pdpbox has some problems in their pypi version
!pip install git+https://github.com/SauceCat/PDPbox
!pip install eli5
!pip install pycebox
!pip install ml_insights
!pip install scikit-optimize
!pip install alepython

#or
#!conda install -c conda-forge shap
!pip install shap
!pip install plotly_express
!pip install seaborn
!pip install category_encoders
!conda install -c conda-forge category_encoders
!pip install --upgrade git+https://github.com/scikit-learn-contrib/categorical-encoding
! pip install lime
!pip install graphviz  
!pip install skll

# Load All Libraries
from __future__ import print_function
# future allows compatitbility between python two and three modules
print(__doc__)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.base import TransformerMixin
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lime import lime_text
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
import sklearn
import sklearn.datasets
import sklearn.ensemble
import sklearn
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_boston
import xgboost as xgb
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import *
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn import preprocessing
import plotly_express as px
import datetime
from datetime import *
import seaborn as sns; sns.set()
import shap
import warnings
from eli5.sklearn import PermutationImportance
from sklearn.preprocessing import Imputer
from sklearn.model_selection import  cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestRegressor
#from alepython  import *

from skll.metrics import spearman

from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer


warnings.filterwarnings("ignore", category=FutureWarning)
#from __future__ import print_function

Automatically created module for IPython interactive environment

Loading Data

link to data This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.

Content

Attribute Information: Only 14 attributes used:

age: age in years
sex: sex (1 = male; 0 = female)
cp: chest pain type – Value 1: typical angina – Value 2: atypical angina – Value 3: non-anginal pain – Value 4: asymptomatic
trestbps: resting blood pressure (in mm Hg on admission to the hospital)
chol: serum cholestoral in mg/dl
fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
restecg: resting electrocardiographic results – Value 0: normal – Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) – Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
thalach: maximum heart rate achieved
exang: exercise induced angina (1 = yes; 0 = no)
oldpeak = ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment – Value 1: upsloping – Value 2: flat – Value 3: downsloping
ca: number of major vessels (0-3) colored by flourosopy
thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
target: diagnosis of heart disease (angiographic disease status) – Value 0: < 50% diameter narrowing – Value 1: > 50% diameter narrowing (in any major vessel: attributes 59 through 68 are vessels)

import numpy as np
import pandas as pd
from google.colab import files
import io

uploaded = files.upload()

Saving heart.csv to heart.csv

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

User uploaded file "heart.csv" with length 11328 bytes

#uploaded

heart_data=pd.read_csv(io.StringIO(uploaded['heart.csv'].decode('utf-8')))
heart_data.head()

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
0	63	1	3	145	233	1	0	150	0	2.3	0	1	1
1	37	1	2	130	250	0	1	187	0	3.5	0	2	1
2	41	0	1	130	204	0	0	172	0	1.4	2	2	1
3	56	1	1	120	236	0	1	178	0	0.8	2	2	1
4	57	0	0	120	354	0	1	163	1	0.6	2	2	1

heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB

heart_data.describe()

	age	trestbps	chol	thalach	oldpeak	ca	target
count	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000
mean	54.366337	131.623762	246.264026	149.646865	1.039604	0.729373	0.544554
std	9.082101	17.538143	51.830751	22.905161	1.161075	1.022606	0.498835
min	29.000000	94.000000	126.000000	71.000000	0.000000	0.000000	0.000000
25%	47.500000	120.000000	211.000000	133.500000	0.000000	0.000000	0.000000
50%	55.000000	130.000000	240.000000	153.000000	0.800000	0.000000	1.000000
75%	61.000000	140.000000	274.500000	166.000000	1.600000	1.000000	1.000000
max	77.000000	200.000000	564.000000	202.000000	6.200000	4.000000	1.000000

Model Training

We will train a binary classification model using xgboost model to predict Income of a person using features in the dataset above.

d=pd.DataFrame(heart_data.target.value_counts())

d=d.reset_index()

d.columns=['Prediction','Frequency']

d['Prediction']=d['Prediction'].astype('category')
d['Prediction']=d['Prediction'].astype(str)
d
d['Heart_Diseae']=['Present','Absent']
#d.dtypes()
d.info()

#px.bar(d, x="Heart_Diseae", y="Frequency",orientation="v", color='Heart_Diseae')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
Prediction      2 non-null object
Frequency       2 non-null int64
Heart_Diseae    2 non-null object
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes

bg_color = (0.5, 0.5, 0.5)

sns.set(rc={"font.style":"normal",
            "axes.facecolor":bg_color,
            "axes.titlesize":30,
            "figure.facecolor":bg_color,
            "text.color":"black",
            "xtick.color":"black",
            "ytick.color":"black",
            "axes.labelcolor":"black",
            "axes.grid":False,
            'axes.labelsize':30,
            'figure.figsize':(10.0, 10.0),
            'xtick.labelsize':25,
            'ytick.labelsize':20})

#plt.rcParams.update(params)
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
sns.countplot(heart_data.target,palette=flatui)
plt.xlabel('Heart_Disease ,Present(1), Absent(0)')
plt.title('frequency of Heart_Diseae  ')
plt.show()

png

X=heart_data.drop('target',axis=1)
Y=heart_data.target
scaler = StandardScaler()

#X = scaler.fit_transform(X)


train_X, val_X, train_y, val_y = train_test_split(X, Y,test_size=0.3, random_state=148)

Categorical Variable Encoding

Target encoding is one of the bayesian encoders for categorical variables. One-Hot does not work well for tree-based models and high cardinality features.

from category_encoders import *
import pandas as pd

enc = TargetEncoder(cols=['sex', 'cp','fbs','restecg','exang','slope','thal']).fit(train_X,train_y)

# transform the datasets
training_numeric_dataset = enc.transform(train_X,train_y)
testing_numeric_dataset = enc.transform(val_X)

print(training_numeric_dataset.thal.value_counts())

training_numeric_dataset.head()

773109    119
246753     77
357143     14
513955      2
Name: thal, dtype: int64

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal
114	55	0.454545	0.885714	130	262	0.574586	0.63964	155	0.708333	0.0	0.789474	0	0.773109
152	64	0.454545	0.647059	170	227	0.574586	0.46000	155	0.708333	0.6	0.361111	0	0.246753
100	42	0.454545	0.647059	148	244	0.574586	0.46000	178	0.708333	0.8	0.789474	2	0.773109
87	46	0.454545	0.885714	101	197	0.419355	0.63964	156	0.708333	0.0	0.789474	0	0.246753
72	29	0.454545	0.885714	130	204	0.574586	0.46000	202	0.708333	0.0	0.789474	0	0.773109

Converting Numerical features to Categorical features

The categorical features such as sex etc are represented as numeric. We can convert them to categorical courtesy of pandas function astype(‘category’)

cat_cols=['sex', 'cp','fbs','restecg','exang','slope','thal']

heart_data[cat_cols] = heart_data[cat_cols].apply(lambda x: x.astype('category'))

heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null category
cp          303 non-null category
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null category
restecg     303 non-null category
thalach     303 non-null int64
exang       303 non-null category
oldpeak     303 non-null float64
slope       303 non-null category
ca          303 non-null int64
thal        303 non-null category
target      303 non-null int64
dtypes: category(7), float64(1), int64(6)
memory usage: 19.6 KB

cat_cols=['sex', 'cp','fbs','restecg','exang','slope','thal']

for col_name in cat_cols:
    #if(heart_data[col_name].dtype == 'object'):
        heart_data[col_name]= heart_data[col_name].astype('category')
        #heart_data[col_name] = heart_data[col_name].cat.codes
        
#heart_data[cat_cols] = heart_data[cat_cols].apply(lambda x: x.astype('category'))

heart_data.head(3)
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null category
cp          303 non-null category
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null category
restecg     303 non-null category
thalach     303 non-null int64
exang       303 non-null category
oldpeak     303 non-null float64
slope       303 non-null category
ca          303 non-null int64
thal        303 non-null category
target      303 non-null int64
dtypes: category(7), float64(1), int64(6)
memory usage: 19.6 KB

train_X.columns.tolist()

[‘age’, ‘sex’, ‘cp’, ‘trestbps’, ‘chol’, ‘fbs’, ‘restecg’, ‘thalach’, ‘exang’, ‘oldpeak’, ‘slope’, ‘ca’, ‘thal’]

train_data=lgb.Dataset(train_X ,label=train_y,feature_name=train_X.columns.tolist(), categorical_feature=cat_cols)


lgb_train = lgb.Dataset(train_X ,label=train_y
 ,feature_name =train_X.columns.tolist()
 , categorical_feature = cat_cols
)


#params = {
# 'task': 'train'
# , 'boosting_type': 'gbdt'
# , 'objective': 'regression'
#   ,'objective': 'binary'    
# , 'num_class': num_of_classes
# , 'metric': 'rmsle' 
# , 'min_data': 1
# , 'verbose': -1
#}
 
#lgbm = lgb.train(params, lgb_train, num_boost_round=50)



params = {'num_leaves':150, 
         'objective':'binary',
         'max_depth':7,
         'learning_rate':.05,
         'is_unbalance':True,
         'max_bin':200}
params['metric'] = ['auc', 'binary_logloss']


cv_results = lgb.cv(
        params,
        train_data,
        num_boost_round=100,
        nfold=3,
        #metrics='mae',
        early_stopping_rounds=10
        )

/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py:1205: UserWarning:

Using categorical_feature in Dataset.

# Display results
print('Current parameters:\n', params)
print('\nBest num_boost_round:', len(cv_results['auc-mean']))
print('Best CV score:', max(cv_results['auc-mean']))

#cv_results.best_params_

Current parameters:
 {'num_leaves': 150, 'objective': 'binary', 'max_depth': 7, 'learning_rate': 0.05, 'is_unbalance': True, 'max_bin': 200, 'metric': ['auc', 'binary_logloss']}

Best num_boost_round: 26
Best CV score: 0.8793941273779984

#setting parameters for lightgbm

#train_data=lgb.Dataset(train_X ,label=train_y)

param = {'num_leaves':150, 
         'objective':'binary',
         'max_depth':7,
         'learning_rate':.05,
         'is_unbalance':True,
         'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']


#training our model using light gbm
num_round=50
start=datetime.now()
lgbm=lgb.train(param,train_data,num_round,feature_name=list(train_X.columns), categorical_feature=cat_cols)
stop=datetime.now()


#lgb_estimator = lgb.LGBMClassifier(boosting_type='gbdt')
#Equivalent way of fitting light gbm

lgb_estimator = lgb.LGBMClassifier(boosting_type='gbdt',
                        # class_weight='balanced', #used only in multiclass training
                         #is_unbalance ='True'   #used in binary class training    
                         objective='binary',
                         n_jobs=-1, 
                         verbose=0)

lgb_model = lgb_estimator.fit(X=train_X, y=train_y)

#Execution time of the model
execution_time_lgbm = stop-start
execution_time_lgbm

datetime.timedelta(0, 0, 45042)

#predicting the target label on test set
ypred=lgbm.predict(val_X)
print(ypred[0:5])  # showing first 5 predictions

#predicting the target probability on test set
#yprob=lgbm.predict_proba(val_X)
#print(yprob[0:2])  # showing first 5 predictions

[0.84144207 0.48859939 0.93763193 0.86565699 0.10644411]

#predicting the target label on test set
ypred=lgb_model.predict(val_X)
print(ypred[0:5])  # showing first 5 predictions

#predicting the target probability on test set
yprob=lgb_model.predict_proba(val_X)
print(yprob[0:2])  # showing first 5 predictions

[1 1 1 1 0]
[[0.06968765 0.93031235]
 [0.38360334 0.61639666]]

predictions = lgb_model.predict(val_X)
print("Confusion Matrix:")

cm=confusion_matrix(val_y, predictions)

print(cm)

print()
print("Classification Report")
print(classification_report(val_y, predictions))

# ROC curve and Area-Under-Curve (AUC)
#calculating accuracy
accuracy_lgbm = accuracy_score(predictions,val_y)

print('accuracy score : {:0.3f}'.format( accuracy_lgbm))

roc_auc_lgbm = roc_auc_score(val_y,predictions)

print('roc score : {:0.3f}'.format( roc_auc_lgbm))

Confusion Matrix:
[[33 10]
 [ 7 41]]

Classification Report
              precision    recall  f1-score   support

           0       0.82      0.77      0.80        43
           1       0.80      0.85      0.83        48

   micro avg       0.81      0.81      0.81        91
   macro avg       0.81      0.81      0.81        91
weighted avg       0.81      0.81      0.81        91

accuracy score : 0.813
roc score : 0.811

import seaborn as sn
rcParams['figure.figsize'] = 8,6


df_cm = pd.DataFrame(cm, range(2),
                  range(2))
#plt.figure(figsize = (10,7))
sn.set(font_scale=1.4)#for label size
# Show confusion matrix in a separate window
#plt.matshow(cm)
sn.heatmap(df_cm, annot=True,annot_kws={"size": 16},cmap=plt.cm.Blues)# font size
plt.title('Confusion matrix')
#plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

png

Plotting Model Inferences From lightgbm Library

rcParams['figure.figsize'] = 10,8
lgb.plot_importance(lgb_model,title='Variable Importance')
plt.show()

png

#plt.figure(figsize=(10,10))
lgb.plot_tree(lgb_model)
plt.show()

png

lgb.create_tree_digraph(lgb_model)

png

Permutation Importance

The eli5 package can be used to compute feature importances for any black-box estimator by measuring how score decreases when a feature is not available; the method is also known as “permutation importance” or “Mean Decrease Accuracy (MDA)”. The method picks a feature and randomly shuffles its values whilst keeping the other features fixed. This is done to break the dependence between the feature and target variable. The feature importance can be measured by calculating how much the score (accuracy, F1, R^2, etc. ) decreases when a feature is not available. The greater the decrease in the score the more important that feature is predicting the target variable. This process is repeated for all other features in the model to arrive at feature importance for each variable in the model. The features are arranged in decreasing order of importance. The most important features are the top values, and those towards the bottom matter least. The first number in each row shows how much model performance decreased with a random shuffling (in this case, using “accuracy” as the performance metric).

ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions. It supports popular ML libraries such as scikit-learn, xgboost, LightGBM and lightning.It can be used to compute feature importances for black box estimators using the permutation importance method.

from eli5.sklearn import PermutationImportance
from eli5.sklearn import *
import eli5
from eli5.permutation_importance import get_score_importances


perm = PermutationImportance(lgb_model, random_state=1).fit(train_X ,train_y)
eli5.show_weights(perm, feature_names=train_X.columns.tolist())

png

from eli5.sklearn import PermutationImportance


# set up the met-estimator to calculate permutation importance on our training
# data
perm_train = PermutationImportance(lgb_model, scoring='accuracy',
                                   n_iter=100, random_state=1)
# fit and see the permuation importances
perm_train.fit(train_X, train_y)
eli5.explain_weights_df(perm_train, feature_names=train_X.columns.tolist()).head()

	feature	weight	std
0	ca	0.113679	0.017014
1	thalach	0.059528	0.011779
2	cp	0.046226	0.010484
3	oldpeak	0.043868	0.009724
4	sex	0.042830	0.009774

# plot the distributions
from matplotlib import rcParams

# figure size in inches
rcParams['figure.figsize'] = 15,9

perm_train_df = pd.DataFrame(data=perm.results_,
                                      columns=X.columns)



(sns.boxplot(data=perm_train_df)
        .set(title='Permutation Importance Distributions (training data)',
             ylabel='Importance'));

plt.show()

png

# create our dataframe of feature importances
feat_imp_df = eli5.explain_weights_df(estimator =lgb_model, feature_names=train_X.columns.tolist())
feat_imp_df.head()

	feature	weight
0	cp	0.213396
1	ca	0.152011
2	thalach	0.120438
3	oldpeak	0.088585
4	age	0.087723

estimator =lgb_model

estimator.feature_importances_

array([113,  36,  38,  65,  90,   0,  31, 120,  23,  86,  24,  56,  24])

features=train_X.columns.tolist()

# simple exmaple of a patient with feature values below
example = np.array([70,	1,	3,	100,	200,	1,	0,	150,	0,	2.3,	0,	0,	1])
eli5.explain_prediction_df(lgb_model, example, feature_names=features)
lgb_model.predict(example.reshape(1,-1))

array([1])

Partial Dependence Plots

Partial dependence plots show the marginal effect between the predicted label/ target function from a machine learing model and a set of features. The limits size of the target feature set is usually one or two. The target features are usually chosen among the most important features. A partial dependence plot can capture linear, monotonous and complex relationships between the target variable and the features of interest.

Partial dependence works by marginalizing the machine learning model output over the distribution of the features whose partial dependence function should be plotted, so that the function shows the relationship between those features and the predicted outcome. By marginalizing over the other features in the model , we get a function that depends only on the specified features and its interactions.

from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
# import machine learning algorithms
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

bg_color = (0.5, 0.5, 0.5)

sns.set(rc={"font.style":"normal",
            "axes.facecolor":bg_color,
            "axes.titlesize":20,
            "figure.facecolor":bg_color,
            "text.color":"black",
            "xtick.color":"black",
            "ytick.color":"black",
            "axes.labelcolor":"black",
            "axes.grid":False,
            'axes.labelsize':10,
            'figure.figsize':(10.0, 10.0),
            'xtick.labelsize':10,
            'ytick.labelsize':10})
sns.set_style("whitegrid")
my_model = GradientBoostingClassifier()
my_model.fit(X=train_X, y=train_y)
# Here we make the plot
my_plots = plot_partial_dependence(my_model ,       
                                   features=[0,1, 2,3,4,5], # column numbers of plots we want to show
                                   X=train_X,            # raw predictors data.
                                   feature_names=['age','cp', 'thalach','ca','chol','sex'], # labels on graphs
                                   grid_resolution=10) # number of values to plot on x axis
plt.show()

png

Individual Conditional Expectation (ICE)

from pycebox.ice import ice, ice_plot

# pcyebox likes the data to be in a DataFrame so let's create one with our imputed data
# we first need to impute the missing data
train_X_df = pd.DataFrame(train_X, columns=X.columns)

thalach_ice_df = ice(data=train_X_df, column='thalach', 
                   predict=lgbm.predict)

rcParams['figure.figsize'] = 8,6

ice_plot(thalach_ice_df , c='dimgray', linewidth=0.3)
plt.ylabel('Pred. AV %ile')
plt.xlabel('thalach');
plt.show()

png

#ice_plot(thalach_ice_df, linewidth=0.5,color_by='data_thalach', cmap=cmap2)
#wt_vals
#thalach_ice_df
#plt.show()

# figure size in inches
rcParams['figure.figsize'] = 10,8
# new colormap for ICE plot
cmap2 = plt.get_cmap('OrRd')
# set color_by to Wt, in order to color each curve by that player's weight
ice_plot(thalach_ice_df, linewidth=0.5, color_by='data_thalach', cmap=cmap2)
# ice_plot doesn't return a colorbar so we have to add one
# hack to add in colorbar taken from here:
# https://stackoverflow.com/questions/8342549/matplotlib-add-colorbar-to-a-sequence-of-line-plots/11558629#11558629
wt_vals = thalach_ice_df.columns.get_level_values('data_thalach').values
sm = plt.cm.ScalarMappable(cmap=cmap2, 
                           norm=plt.Normalize(vmin=wt_vals.min(), 
                                              vmax=wt_vals.max()))
# need to create fake array for the scalar mappable or else we get an error
sm._A = []
plt.colorbar(sm, label='thalach')
plt.ylabel('Pred. AV %ile')
plt.xlabel('thalach');
plt.show()

png

# figure size in inches
rcParams['figure.figsize'] = 10,8
ice_plot(thalach_ice_df, linewidth=.5, color_by='data_thalach', cmap=cmap2, plot_pdp=True, 
         pdp_kwargs={'c': 'k', 'linewidth': 5})
plt.colorbar(sm, label='thalach')
plt.ylabel('Pred. AV %ile')
plt.xlabel('thalach');
plt.show()

png

age_ice_df = ice(data=train_X_df, column='age', 
                   predict=lgbm.predict)

# figure size in inches
rcParams['figure.figsize'] = 10,8
ice_plot(age_ice_df, linewidth=.5, color_by='data_age', cmap=cmap2, plot_pdp=True, 
         pdp_kwargs={'c': 'k', 'linewidth': 5})
plt.colorbar(sm, label='age')
plt.ylabel('Pred. AV %ile')
plt.xlabel('age');
plt.show()

png

ca_ice_df = ice(data=train_X_df, column='ca'
                   ,predict=lgbm.predict)

# figure size in inches
rcParams['figure.figsize'] = 10,8
ice_plot(ca_ice_df, linewidth=.5, color_by='data_ca', cmap=cmap2, plot_pdp=True, 
         pdp_kwargs={'c': 'k', 'linewidth': 5})
plt.colorbar(sm, label='ca')
plt.ylabel('Pred. AV %ile')
plt.xlabel('ca');
plt.show()

png

Interpreting Partial Dependence Plots

The top left plot shows the partial dependence between our target varaible Heart disease present or absent, and the age variable in years.

ca: number of major vessels (0-3) colored by flourosopy, having more major vessels colored by flouroscopy reduces your risk of a heart disease.
cp: Having chest pain type 1 (typical angina) increases your average probability of having a heart disease. The increasing chance of heart disease remains constant for chest pain types 2,3 and 4.
thalach: maximum heart rate achieved above 120 increases your risk of heart disease.
Being a male slightly reduces your risk of a heart disease.
chol: serum cholestoral in mg/dl, cholesterol level around 240 reduces your risk of heart disease. Above this level your risk goes up

from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=lgb_model, dataset=train_X,
                            model_features=val_X.columns.tolist(), feature='age')

# plot it
pdp.pdp_plot(pdp_goals, 'age')
plt.show()

png

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=lgb_model, dataset=train_X,
                            model_features=val_X.columns.tolist(), feature='ca')

# plot it
pdp.pdp_plot(pdp_goals, 'ca')
plt.show()

png

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=lgb_model, dataset=train_X,
                            model_features=val_X.columns.tolist(), feature='thalach')

# plot it
pdp.pdp_plot(pdp_goals, 'thalach')
plt.show()

png

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=lgb_model, dataset=train_X,
                            model_features=val_X.columns.tolist(), feature='cp')

# plot it
pdp.pdp_plot(pdp_goals, 'cp')
plt.show()

png

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=lgb_model, dataset=train_X,
                            model_features=val_X.columns.tolist(), feature='sex')

# plot it
pdp.pdp_plot(pdp_goals, 'sex')
plt.show()

png

# Create the data that we will plot
sns.set_style("whitegrid")
pdp_goals = pdp.pdp_isolate(model=lgb_model, dataset=val_X,
                            model_features=val_X.columns.tolist(), feature='chol')

# plot it
pdp.pdp_plot(pdp_goals, 'chol')
plt.show()

png

# figure size in inches
rcParams['figure.figsize'] = 10,13
sns.set_style("whitegrid")
pdp_interaction = pdp.pdp_interact(model=lgb_model, dataset=val_X, model_features=val_X.columns.tolist(), features=['thalach', 'oldpeak'])
pdp.pdp_interact_plot(pdp_interact_out=pdp_interaction, feature_names=['thalach', 'oldpeak'], plot_type='contour')
plt.show()

png

from matplotlib import pyplot as plt
plt.figure(figsize=(15,10))
sns.set_style("whitegrid")
pdp_interaction = pdp.pdp_interact(model=lgb_model, dataset=val_X, model_features=val_X.columns.tolist(), features=['age', 'ca'])
pdp.pdp_interact_plot(pdp_interact_out=pdp_interaction, feature_names=['age', 'ca'], plot_type='contour')
plt.show()

<matplotlib.figure.Figure at 0x7fb4674430f0>

png

from matplotlib import pyplot as plt
plt.figure(figsize=(15,10))
sns.set_style("whitegrid")
pdp_interaction = pdp.pdp_interact(model=lgb_model, dataset=val_X, model_features=val_X.columns.tolist(), features=['age', 'sex'])
pdp.pdp_interact_plot(pdp_interact_out=pdp_interaction, feature_names=['age', 'sex'], plot_type='contour')
plt.show()

<matplotlib.figure.Figure at 0x7f4078e22400>

png

Combined effect of age and ca , number of major vessels (0-3) colored by flourosopy on heart disease prediction is shown above. Any age combined with low ca number increases the risk of heart disease. This is shown by the yellow colors just above the horizontal axis. Any age also with high ca number reduces the risk if a heart disease shown by blue and purple regions higher up the chart. The effect of ca number on heart disease prediction is stronger than the age variable.

SHapley Additive exPlanations (SHAP) Values

SHAP measures the impact of features taking into account the interaction with other features. Shapley values calculate the feature importance by comparing two predictions, one with the feature included and the other without it. The positive SHAP values affect the prediction/target variable positively whereas the negative SHAP values affect the target negatively. The effects of a feature on a single example of the data can also be studied with SHAP values. The SHAP method is used to calculate influences of variables on the particular observation. This technique was borrowed from game theory. SHAP is model agnostic, it works with a variety of supervised machine learning models form xgboost, lightgbm, deep learning models.

import xgboost
import shap

# load JS visualization code to notebook
shap.initjs()
#select a random row between 0 and 80
row_to_show = random.randint(0,val_X.shape[0])
#row_to_show = 10
data_for_prediction = val_X.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)


# explain the model's predictions using SHAP values
# (same syntax works for LightGBM, CatBoost, and scikit-learn models)
explainer = shap.TreeExplainer(lgb_model)
#shap_values = explainer.shap_values(train_X)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
#shap.force_plot(explainer.expected_value, shap_values[0,:], train_X.iloc[0,:])
shap.initjs()
#shap.force_plot(explainer.expected_value, shap_values[row_to_show], data_for_prediction)
shap.force_plot(explainer.expected_value[0], shap_values[1], data_for_prediction)
#shap.force_plot(explainer.expected_value, shap_values[0,:], val_X.iloc[0,:])

Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

The above explanation shows features each contributing to push the model prediction from the base value (the average model prediction over the training dataset we passed) to the model prediction. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue .

Exlanations for the whole dataset can be obtained by combining many explanations , rotating them 90 degrees, and then stacking them horizontally

import random

lgb_model.predict_proba(data_for_prediction_array)

array([[0.84938913, 0.15061087]])

data_for_prediction.as_matrix()
data_for_prediction.values
explainer.expected_value
shap_values[row_to_show]

array([-0.24660923,  0.5942684 , -1.08236257, -0.30082934, -0.9082763 ,
        0.        ,  0.28450406, -1.01859325, -0.6946964 ,  1.19555418,
       -0.65066211,  0.75565923, -0.69125715])

import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(lgb_model)

# Calculate Shap values
shap_values = explainer.shap_values(val_X)


# Create object that can calculate shap values
#explainer = shap.TreeExplainer(my_model)

# Calculate Shap values
#shap_values = explainer.shap_values(data_for_prediction)

If you look carefully at the code where we created the SHAP values, you’ll notice we reference Trees in shap.TreeExplainer(my_model). But the SHAP package has explainers for every type of model.

shap.DeepExplainer works with Deep Learning models. shap.KernelExplainer works with all models, though it is slower than other Explainers and it offers an approximation rather than exact Shap values. Here is an example using KernelExplainer to get similar results. The results aren’t identical because KernelExplainer gives an approximate result. But the results tell the same story.

# use Kernel SHAP to explain test set predictions
shap.initjs()
k_explainer = shap.KernelExplainer(my_model.predict_proba, train_X)
k_shap_values = k_explainer.shap_values(data_for_prediction)
shap.force_plot(k_explainer.expected_value[1], k_shap_values[1], data_for_prediction)

Using 212 background data samples could cause slower run times. Consider using shap.kmeans(data, K) to summarize the background as K weighted samples.

# visualize the training set predictions
# load JS visualization code to notebook
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values, train_X)

Shapley Dependence Plots

# create a SHAP dependence plot to show the effect of a single feature across the whole dataset
sns.set_style("whitegrid")
shap.dependence_plot("ca", shap_values, train_X)

png

import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(my_model)

# calculate shap values. This is what we will plot.
shap_values = explainer.shap_values(X)

# make plot.
shap.dependence_plot('Ball Possession %', shap_values[1], X, interaction_index="Goal Scored")

The most important features for the model contributing to the risk of heart disease can be shown by plotting the SHAP values of every feature for every sample. The plot below sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impacts each feature has on the model output. The color represents the feature value (red high, blue low)

High ca (number of major vessels (0-3) colored by flourosopy) lowers the predicted probabilty of having a heart disease.
Belonging to a low cp (chest pain type) category lowers the predicted probabilty of having a heart disease.
Being a male reduces your chance of a heart disease.

# summarize the effects of all the features
sns.set_style("whitegrid")
shap.summary_plot(shap_values, train_X)

png

Variable importance plot can be plotted by taking the mean absolute value of the SHAP values for each feature.

sns.set_style("whitegrid")
plt.title("Variable Importance Plot")
shap.summary_plot(shap_values, train_X, plot_type="bar")

png

Local Interpretable Model-Agnostic Explanations (LIME )

LIME is model-agnostic, meaning that it can be applied to any machine learning model. The parts of the interpretable input contributing to the prediction is determined by perturbing the input around its neighborhood and observe how the model’s predictions change. The perturbed data is weighed by their proximity to the original example, and learn an interpretable model on those and the associated predictions.

Create the explainer

Tabular explainers use a training set to compute statistics on each feature. For continouous features,statistics such as the mean, standard deviation and discretizing into quartiles are computed.Frequency is computed for each categorical feature. The computed statistics are used to scale the data, so that we can meaningfully compute distances when the attributes are not on the same scale. It is also used to sample perturbed instances - which we do by sampling from a Normal(0,1), multiplying by the std and adding back the mean.

from lime import lime_text
from lime import *
import lime
from sklearn.pipeline import make_pipeline

sns.set_style("whitegrid")
predict_fn_lgbm = lambda x: lgb_model.predict_proba(x).astype(float)

X_val=val_X.as_matrix()

#target_names= ['1','0']
target_names= [str(i) for  i in train_y.unique()]

explainer = lime.lime_tabular.LimeTabularExplainer(X_val, feature_names=train_X.columns.tolist(), class_names=target_names, discretize_continuous=True)

i = np.random.randint(0, val_X.shape[0])
exp = explainer.explain_instance(X_val[i], lgb_model.predict_proba, num_features=10, top_labels=1)

#We now explain a single instance:

exp.show_in_notebook(show_table=True, show_all=False)

Plotting Decision Tree

# a plot of the weights for each feature
exp.as_pyplot_figure();

png

from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
#tree.plot_tree(clf.fit(iris.data, iris.target)) 


import graphviz 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("iris") 

'iris.pdf'

dot_data = tree.export_graphviz(clf, out_file=None, 
                      feature_names=iris.feature_names,  
                      class_names=iris.target_names,  
                      filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 

png

Explaining Predictions

For the particular example/row of the feature dataset picked randomly, the predicted target is not having a heart disease(0) with probability of 0.94. It can be inferred taht the features contributing to this prediction are the Feature Values shown in orange color above. These include ca, sex, cp, slope, thal, oldpeak and exang. The average probability contribution to the target class predicition from the individual features are shown in the midlle barplot above. The remaining features decreases prediction of the target class 0.

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
0	63	1	3	145	233	1	0	150	0	2.3	0	1	1
1	37	1	2	130	250	0	1	187	0	3.5	0	2	1
2	41	0	1	130	204	0	0	172	0	1.4	2	2	1
3	56	1	1	120	236	0	1	178	0	0.8	2	2	1
4	57	0	0	120	354	0	1	163	1	0.6	2	2	1

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
0	63	1	3	145	233	1	0	150	0	2.3	0	1	1
1	37	1	2	130	250	0	1	187	0	3.5	0	2	1
2	41	0	1	130	204	0	0	172	0	1.4	2	2	1
3	56	1	1	120	236	0	1	178	0	0.8	2	2	1
4	57	0	0	120	354	0	1	163	1	0.6	2	2	1

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
0	63	1	3	145	233	1	0	150	0	2.3	0	1	1
1	37	1	2	130	250	0	1	187	0	3.5	0	2	1
2	41	0	1	130	204	0	0	172	0	1.4	2	2	1
3	56	1	1	120	236	0	1	178	0	0.8	2	2	1
4	57	0	0	120	354	0	1	163	1	0.6	2	2	1