Statistical Methods For Data Drift Detection
Introduction
Data quality issues account for the majority of machine learning model flaws. Monitoring the performance of a model after deployment is one of the most important components of machine learning operations, and monitoring and detecting changes in the data is a common way to identify the causes of model performance degradation, drift, decay, or staleness. Concept drift entails changes in the underlying distribution of the data over time.
Why Model Monitoring?
Monitoring procedures help to continuously track the health of a machine learning model against a set of key indicators and to generate event-based alerts. When a machine learning model is deployed in production, it can start degrading in quality fast and without warning, often unnoticed until it is too late. That is why model monitoring is a crucial step in the ML model life cycle and a critical piece of Model Ops. Data skew, by contrast, refers to differences between two static data versions/sources, such as the training set and the serving set.
Concept Drift
Concept drift means that the statistical relationship between the target variable and the features used by the model for prediction has changed over time. When concept drift occurs, the original statistical relationship between the target and the features may no longer hold, so the decision boundary changes and the model produces poor predictions. Machine learning under concept drift involves three components: drift detection (whether drift occurs), drift understanding (when, how, and where it occurs), and drift adaptation (the reaction to the existence of drift).
Causes of Data Drift
- Change in the relationship between features, or covariate shift.
- Unexpected changes in the macroeconomic environment, such as COVID-19’s impact on the economy.
- Natural drift in the data, such as seasonal changes in temperature.
- Data quality issues, such as a broken-down server that does not transmit or store data.
Examples of Data Drift
- A change in the ground truth or input data distribution, e.g. a change in customer preferences such as fashion styles, a natural disaster or pandemic, a product launch in a new market, etc.
General Framework For Drift Detection
The framework for drift detection consists of four stages:
- data retrieval
- data modeling
- test statistics calculation and
- hypothesis testing.
Data retrieval involves getting the data from the storage location onto a processing platform. Data modeling involves selecting the features that will significantly impact the system when they drift. The dissimilarity measurement stage involves designing a metric, such as a distance or test statistic, to quantify the level of drift, together with a test statistic for the hypothesis test; choosing an appropriate and robust measure of dissimilarity remains an open question. The significance test for concept drift detection can be viewed as a two-sample test of the difference in distribution between two samples.
Types of Concept Drift
Concept drift can be distinguished into four types: sudden, gradual, incremental, and recurring. Research on the first three explores how to minimize the drop in model performance and achieve the fastest recovery rate during the concept transformation process. The fourth kind emphasizes how to find the best-matched historical concept in the shortest time.
Concept Drift Detection Algorithms
Error rate-based drift detection
This consists of a class of algorithms that track changes in the online error rate of classifiers. A statistically significant change in the error rate corresponds to the occurrence of drift. Example implementations of this approach include the following:
- Learning with Local Drift Detection (LLDD)
- Early Drift Detection Method (EDDM)
- Hoeffding’s inequality based Drift Detection Method (HDDM)
- Fuzzy Windowing Drift Detection Method (FW-DDM)
- Dynamic Extreme Learning Machine (DELM)
- ADaptive WINdowing (ADWIN) is a popular two-time-window drift detection algorithm. Unlike several counterparts, it does not require the window sizes to be specified beforehand; instead it computes an optimal window cut by examining all possible cuts once the total size n of a sufficiently large window W is specified. The test statistic is the difference of the two sample means (a usage sketch follows the formula below):
$\theta_{ADWIN} = | \hat{\mu}_{hist} - \hat{\mu}_{new} |$
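A hedged usage sketch, assuming the scikit-multiflow package (one of several libraries that implement ADWIN); the simulated stream and the delta value are illustrative assumptions, not part of the original:
import numpy as np
from skmultiflow.drift_detection import ADWIN

rng = np.random.default_rng(42)
# Simulated binary stream whose mean shifts from 0.2 to 0.8 halfway through.
stream = np.concatenate([rng.binomial(1, 0.2, 1000), rng.binomial(1, 0.8, 1000)])
adwin = ADWIN(delta=0.002)  # delta controls the detector's sensitivity
for i, x in enumerate(stream):
    adwin.add_element(x)
    if adwin.detected_change():
        print(f"Change detected at index {i}")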
Data Distribution-based Drift Detection
Data distribution-based drift detection algorithms use distance metrics to quantify the dissimilarity between historical and new data over time. A statistically significant distance between the two data distributions indicates that it is time for a model upgrade. This addresses concept drift at its root source, which is a shift in the distribution of the data. These algorithms require the time windows for the historical and new data to be pre-specified.
Representative distribution-based drift detection methods and algorithms include the following (a minimal distance-based sketch follows the list):
- Statistical Change Detection for multidimensional data (SCD).
- Competence Model-based drift detection (CM)
- SyncStream, a prototype-based classification model for evolving data streams
- PCA-based change detection framework (PCA-CD)
- Equal Density Estimation (EDE)
- Least Squares Density Difference-based Change Detection Test (LSDD-CDT)
- Incremental version of LSDD-CDT (LSDD-INC)
- Local Drift Degree-based Density Synchronized Drift Adaptation (LDD-DSDA)
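Most of these are research algorithms without a standard library implementation, but the underlying idea, computing a distance between the distributions of a historical window and a new window, can be sketched directly. The Hellinger distance example below is an illustrative stand-in: the binning, window data, and any alert threshold are assumptions, not part of any of the algorithms listed above:
import numpy as np

def hellinger_distance(hist_window, new_window, bins=20):
    """Hellinger distance between the empirical distributions of two windows.
    Returns a value in [0, 1]; 0 means identical histograms."""
    # Use common bin edges so the two histograms are comparable.
    edges = np.histogram_bin_edges(np.concatenate([hist_window, new_window]), bins=bins)
    p, _ = np.histogram(hist_window, bins=edges)
    q, _ = np.histogram(new_window, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Illustrative usage: a location shift between two windows.
rng = np.random.default_rng(0)
d = hellinger_distance(rng.normal(0, 1, 1000), rng.normal(0.5, 1, 1000))
print(f"Hellinger distance: {d:.3f}")  # flag drift if d exceeds a chosen threshold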
Multiple Hypothesis Test Drift Detection
These algorithms combine ideas from both error rate-based and data distribution-based drift detection and use multiple hypothesis tests to detect concept drift. One such approach employs Principal Component Analysis (PCA) for feature extraction; drift is then detected in each component and also among combinations of the feature space. Implementations include:
- Linear Four Rate drift detection (LFR)
- three-layer drift detection, based on Information Value and Jaccard similarity (IV-Jac)
- Ensemble of Detectors (e-Detector)
- Hierarchical Linear Four Rate (HLFR)
- Hierarchical Hypothesis Testing with Classification Uncertainty (HHT-CU).
Strategies for Detecting Drift
Population Stability Index(PSI)
The Population Stability Index (PSI) is a measure of the degree to which the distribution of some variable in a population changes between two time periods, say T1 and T2. It is calculated by comparing a score variable divided into several buckets, where the number of buckets is discretionary. The contribution of each bucket is computed as the arithmetic change (P2 - P1) in the percentage of the population assigned to the bucket, weighted by the log of the geometric change, i.e. the ratio (P2 / P1) of the final percentage to the initial percentage, so that $ PSI = \sum_{i} (P_{2,i} - P_{1,i}) \ln(P_{2,i} / P_{1,i}) $. Rules of thumb for interpreting PSI, available in the statistical literature, are listed below, followed by a short implementation sketch:
- If PSI < 0.1, there is no significant change in the data and no action is required.
- If 0.1 < PSI < 0.25, there is minor change in the data and some further monitoring is necessary.
- If PSI > 0.25, there is a major change in the data distribution and appropriate investigation should be initiated.
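A minimal implementation sketch of PSI, assuming deciles of the reference (training) sample as buckets and a small epsilon guard for empty buckets; both choices are illustrative, not prescribed by the rules of thumb above:
import numpy as np
import pandas as pd

def population_stability_index(reference, current, buckets=10, eps=1e-6):
    """PSI between a reference (e.g. training) and a current (e.g. serving) sample."""
    # Bucket edges from the reference sample's quantiles; extend to cover the real line.
    edges = np.unique(np.quantile(reference.dropna(), np.linspace(0, 1, buckets + 1)))
    edges[0], edges[-1] = -np.inf, np.inf
    p1 = pd.cut(reference, edges).value_counts(normalize=True, sort=False).values + eps
    p2 = pd.cut(current, edges).value_counts(normalize=True, sort=False).values + eps
    # Sum over buckets of (P2 - P1) * ln(P2 / P1).
    return float(np.sum((p2 - p1) * np.log(p2 / p1)))

# Illustrative usage with this notebook's splits:
# population_stability_index(train_df['age'], test_df['age'])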
Feature Distribution Plots
For categorical and continuous features:
- A feature distribution plot can be used to compare the empirical distribution of a feature at two or more time points (training distribution vs. serving/prediction data distribution).
- The kernel density estimate is used for continuous features and histograms are used for categorical features.
# Imports actually used in this notebook; the remaining libraries in the original
# import list (gensim, plotly, xgboost, lightgbm, etc.) are not needed for the drift analysis.
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Column names for the thyroid (allbp) dataset
column_names = ['age','sex','on thyroxine','query on thyroxine','on antithyroid medication','sick',
                'pregnant','thyroid surgery','I131 treatment','query hypothyroid','query hyperthyroid','lithium',
                'goitre','tumor','hypopituitary','psych','TSH measured','TSH','T3 measured','T3','TT4 measured',
                'TT4','T4U measured','T4U','FTI measured','FTI','TBG measured','TBG','referral source','Target']
# Read in the dataset
df = pd.read_csv('/Data/allbp.data', skipinitialspace=True, header=None, names=column_names)
# Split the dataset. Do not shuffle for this demo notebook.
train_df, test_df = train_test_split(df, test_size=0.5, shuffle=False)
print(f"size of train set : {train_df.shape}")
print(f"size of test set : {test_df.shape}")
print(f"size of full set : {df.shape}")
print(len(column_names))
test_df.head()
size of train set : (1400, 30)
size of test set : (1400, 30)
size of full set : (2800, 30)
30
age | sex | on thyroxine | query on thyroxine | on antithyroid medication | sick | pregnant | thyroid surgery | I131 treatment | query hypothyroid | query hyperthyroid | lithium | goitre | tumor | hypopituitary | psych | TSH measured | TSH | T3 measured | T3 | TT4 measured | TT4 | T4U measured | T4U | FTI measured | FTI | TBG measured | TBG | referral source | Target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1400 | 62 | M | f | f | f | f | f | f | f | f | f | f | f | f | f | f | t | 1.4 | t | 3.9 | t | 97 | t | 0.84 | t | 115 | f | ? | other | negative.|68 |
1401 | 57 | F | t | f | f | f | f | f | f | f | f | f | f | f | f | f | t | 0.1 | t | 2.2 | t | 150 | t | 1.01 | t | 149 | f | ? | SVI | negative.|183 |
1402 | 65 | F | t | f | f | f | f | f | f | f | f | f | f | f | f | f | t | 2.5 | f | ? | t | 137 | t | 1.06 | t | 129 | f | ? | other | negative.|2930 |
1403 | 90 | M | f | f | f | f | f | f | f | f | f | f | f | f | f | f | t | 0.15 | t | 1.7 | t | 118 | t | 0.82 | t | 144 | f | ? | SVI | negative.|1830 |
1404 | 68 | M | f | f | f | f | f | f | f | f | f | f | f | f | f | f | t | 1.4 | t | 1.9 | t | 98 | t | 0.82 | t | 118 | f | ? | other | negative.|1467 |
df.apply(lambda x: x.nunique())
age 94
sex 3
on thyroxine 2
query on thyroxine 2
on antithyroid medication 2
sick 2
pregnant 2
thyroid surgery 2
I131 treatment 2
query hypothyroid 2
query hyperthyroid 2
lithium 2
goitre 2
tumor 2
hypopituitary 2
psych 2
TSH measured 2
TSH 264
T3 measured 2
T3 65
TT4 measured 2
TT4 218
T4U measured 2
T4U 139
FTI measured 2
FTI 210
TBG measured 1
TBG 1
referral source 5
Target 2800
dtype: int64
# TBG is constant/unmeasured, so drop it from both splits.
train_df.drop(['TBG measured','TBG'], inplace=True, axis=1)
test_df.drop(['TBG measured','TBG'], inplace=True, axis=1)
# Coerce the continuous measurements to numeric ('?' placeholders become NaN).
numeric_cols = ['age','TSH','T3','TT4','T4U','FTI']
for col in numeric_cols:
    train_df[col] = pd.to_numeric(train_df[col], errors='coerce')
    test_df[col] = pd.to_numeric(test_df[col], errors='coerce')
categorical_cols = ['sex', 'on thyroxine', 'query on thyroxine',
                    'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery',
                    'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium',
                    'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH measured',
                    'T3 measured', 'TT4 measured', 'T4U measured', 'FTI measured',
                    'referral source']
train_df[categorical_cols].head()
sex | on thyroxine | query on thyroxine | on antithyroid medication | sick | pregnant | thyroid surgery | I131 treatment | query hypothyroid | query hyperthyroid | lithium | goitre | tumor | hypopituitary | psych | TSH measured | T3 measured | TT4 measured | T4U measured | FTI measured | referral source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | F | f | f | f | f | f | f | f | f | f | f | f | f | f | f | t | t | t | t | t | SVHC |
1 | F | f | f | f | f | f | f | f | f | f | f | f | f | f | f | t | t | t | f | f | other |
2 | M | f | f | f | f | f | f | f | f | f | f | f | f | f | f | t | f | t | t | t | other |
3 | F | t | f | f | f | f | f | f | f | f | f | f | f | f | f | t | t | t | f | f | other |
4 | F | f | f | f | f | f | f | f | f | f | f | f | f | f | f | t | t | t | t | t | SVI |
'''
1. This plots the distribution of a feature in the training data and the test data.
   The plot allows us to visually inspect whether there is a difference in the distribution
   of the feature in the two datasets.
2. For continuous features, the kernel density plot of the feature is drawn
   (see https://seaborn.pydata.org/generated/seaborn.kdeplot.html), whereas for
   categorical features a histogram is drawn.
3. split_column splits the input dataset into parts, e.g. the training/reference
   and the test/deployment dataset.
Example usage:
DistributionPlot(data=data, categorical_cols=categorical_cols, continuous_columns=numeric_cols,
                 save_location=save_location, split_column='Split').feature_dist_plot()
'''
import seaborn as sns
sns.set(rc={'figure.figsize': (13, 10)})
save_location = '/drift/'
def count_plot(column):
    # Proportion of each category level per split (note: val_df only exists if a
    # validation split was created above).
    d1 = pd.DataFrame(train_df[column].value_counts()/train_df[column].shape[0]).reset_index().rename(columns={'index':'Value'})
    d1["Split"] = "Train"
    d2 = pd.DataFrame(val_df[column].value_counts()/val_df[column].shape[0]).reset_index().rename(columns={'index':'Value'})
    d2["Split"] = "Validation"
    d3 = pd.DataFrame(test_df[column].value_counts()/test_df[column].shape[0]).reset_index().rename(columns={'index':'Value'})
    d3["Split"] = "Test"
    d = pd.concat([d1, d2, d3])
    ax = sns.barplot(x="Value", y=column, data=d, hue="Split", palette="viridis")
    plt.ylabel(f'proportion of {column}')
    plt.xlabel("Category Levels")
    plt.xticks(rotation=45)
    plt.yticks(plt.yticks()[0], ['{:,.0%}'.format(x) for x in plt.yticks()[0]])
    plt.savefig(save_location + f'{column}.png')  # save after all labels are set
class DistributionPlot:
    # Constructor
    def __init__(self, data, categorical_cols, continuous_columns, save_location, split_column):
        self.data = data
        self.categorical_cols = categorical_cols
        self.continuous_columns = continuous_columns
        self.save_location = save_location
        self.split_column = split_column
    def feature_dist_plot(self):
        # Use self.data (not the global `data`) so the class is self-contained.
        Train_Data = self.data.loc[self.data[self.split_column] == 'Train']
        Val_Data = self.data.loc[self.data[self.split_column] == 'Validation']
        Test_Data = self.data.loc[self.data[self.split_column] == 'Test']
        for i in self.continuous_columns:
            plt.figure()
            fig = sns.kdeplot(x=Train_Data[i], label='Train', shade=True)
            fig = sns.kdeplot(x=Val_Data[i], label='Validation', shade=True)
            fig = sns.kdeplot(x=Test_Data[i], label='Test', shade=True)
            fig.legend()
            plt.xticks(rotation=45)
            plt.savefig(self.save_location + f'{i}.png')
            plt.show()
        for i in self.categorical_cols:
            plt.figure()
            d1 = pd.DataFrame(Test_Data[i].value_counts()/Test_Data[i].shape[0]).reset_index().rename(columns={'index':'Value'})
            d1["Split"] = "Test"
            d2 = pd.DataFrame(Val_Data[i].value_counts()/Val_Data[i].shape[0]).reset_index().rename(columns={'index':'Value'})
            d2["Split"] = "Validation"
            d3 = pd.DataFrame(Train_Data[i].value_counts()/Train_Data[i].shape[0]).reset_index().rename(columns={'index':'Value'})
            d3["Split"] = "Train"
            d = pd.concat([d1, d2, d3])
            ax = sns.barplot(x="Value", y=i, data=d, hue="Split", palette="viridis")
            plt.ylabel(f'proportion of {i}')
            plt.xlabel("Category Levels")
            plt.yticks(plt.yticks()[0], ['{:,.0%}'.format(x) for x in plt.yticks()[0]])
            plt.xticks(rotation=45)
            plt.savefig(self.save_location + f'{i}.png')  # save after labels are set
            plt.show()
        return fig
save_location = '/drift/'
train_df['Split'] = "Train"
test_df['Split'] = "Test"
# The numeric columns were already coerced with pd.to_numeric above;
# cast to float here for consistent dtypes across the splits.
for name in numeric_cols:
    train_df[name] = train_df[name].astype(float)
    test_df[name] = test_df[name].astype(float)
data = pd.concat([train_df, test_df])
data[numeric_cols].dtypes
DistributionPlot(data=data, categorical_cols=categorical_cols, continuous_columns=numeric_cols,
                 save_location=save_location, split_column='Split').feature_dist_plot()
Statistical Test For Concept Drift Detection
Paired T-test
- Tests whether the mean difference between pairs of measurements is zero or not.
- The paired samples t-test compares the means of two related (paired) units on a continuous outcome that is normally distributed (a usage sketch follows these hypotheses).
- $ H_{0} $ : $ \mu_{1}-\mu_{2} = 0 $ The difference in means of the two populations is equal to 0.
- $ H_{1} $ : $ \mu_{1} - \mu_{2} \neq 0 $ The difference in means of the two populations is different from 0.
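A hedged illustration using scipy; the assumption here is that pairing train and test rows positionally is meaningful (both splits happen to contain 1400 rows in this notebook):
from scipy import stats

# Paired t-test on the 'age' feature; nan_policy='omit' drops pairs with missing values.
t_stat, p_value = stats.ttest_rel(train_df['age'], test_df['age'], nan_policy='omit')
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")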
Kolmogorov-Smirnov Test
- The K-S test is a nonparametric test that compares the cumulative distributions of two random variables.
- It can be used to test for a significant difference between a feature in two time periods.
- This test is for continuous features.
- $ H_{0} $ : The feature distributions in the two populations are equal.
- $ H_{1} $ : The feature distributions in the two populations are not equal.
from scipy import stats
import numpy as np
def kolmogorovsmirnovtest(train_df, test_df, columns, significance_level=0.05):
    Test_Statistic = []
    P_Value = []
    Feature = []
    for col in columns:
        # Two-sample K-S test between the train and test values of each feature.
        val = stats.ks_2samp(train_df[col], test_df[col], mode="exact")
        Test_Statistic.append(val[0])
        P_Value.append(val[1])
        Feature.append(col)
    output = pd.DataFrame()
    output['Feature'] = Feature
    output['Test_Statistic'] = Test_Statistic
    output['P_Value'] = P_Value
    output['Decision'] = np.where(output['P_Value'] < significance_level,
                                  'Reject H0 : significant Difference',
                                  'Do Not Reject H0 : No significant Difference')
    return output
numeric_cols = train_df.select_dtypes(include=np.number).columns
kolmogorovsmirnovtest(train_df=train_df, test_df=test_df, columns=numeric_cols, significance_level=0.05)
Feature | Test_Statistic | P_Value | Decision | |
---|---|---|---|---|
0 | age | 0.032143 | 0.464819 | Do Not Reject H0 : No significant Difference |
1 | TSH | 0.032857 | 0.436584 | Do Not Reject H0 : No significant Difference |
2 | T3 | 0.035000 | 0.357931 | Do Not Reject H0 : No significant Difference |
3 | TT4 | 0.042143 | 0.166343 | Do Not Reject H0 : No significant Difference |
4 | T4U | 0.024286 | 0.803661 | Do Not Reject H0 : No significant Difference |
5 | FTI | 0.020714 | 0.924908 | Do Not Reject H0 : No significant Difference |
Anderson-Darling Test for k-samples
- It is a modification of the Kolmogorov-Smirnov (K-S) test.
- It tests whether the populations from which two or more groups of data were drawn are identical.
- The distribution function of that population does not have to be specified.
- There are three versions of the k-sample Anderson-Darling test: one for continuous distributions and two for discrete distributions.
- $ H_{0} $ : The k samples are drawn from the same population.
- $ H_{1} $ : The populations from which two or more groups of data were drawn are not identical.
from scipy import stats
import numpy as np
#The critical values for significance levels 25%, 10%, 5%, 2.5%, 1%, 0.5%, 0.1%.
out = stats.anderson_ksamp([train_df['age'],test_df['age']])
#print(out[1][-2])
out
Anderson_ksampResult(statistic=-0.4825370980213063, critical_values=array([0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]), significance_level=0.25)
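The single-column call above can be extended to all numeric features in the style of the other helper functions; a hedged sketch (note that scipy floors the returned significance level at 0.1% and caps it at 25%, so it is only an approximate p-value):
def anderson_darling_test(train_df, test_df, columns, significance_level=0.05):
    rows = []
    for col in columns:
        # anderson_ksamp does not handle NaNs, so drop them per column.
        res = stats.anderson_ksamp([train_df[col].dropna(), test_df[col].dropna()])
        decision = ('Reject H0 : significant Difference'
                    if res.significance_level < significance_level
                    else 'Do Not Reject H0 : No significant Difference')
        rows.append({'Feature': col, 'Test_Statistic': res.statistic,
                     'Approx_P_Value': res.significance_level, 'Decision': decision})
    return pd.DataFrame(rows)
anderson_darling_test(train_df, test_df, numeric_cols)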
Wilcoxon Signed-Rank Test
- It tests whether the distribution of the differences of two paired samples is symmetric about zero, or equivalently the null hypothesis that two related paired samples originate from the same distribution. It is a non-parametric version of the paired t-test (difference in location parameter) for continuous variables.
- $ H_{0} $ : The difference between the two populations follows a symmetric distribution around zero.
- $ H_{1} $ : The difference between the pairs does not follow a symmetric distribution around zero.
The null hypothesis can be rejected if the p-value is less than the significance level (0.05). If the interest is in detecting a positive or negative shift in one population relative to the other, the hypothesis can be specified as one-sided. The test procedure computes the differences between the paired observations, ranks the absolute differences from smallest to largest, and compares the sum of the ranks of the positive differences with the sum of the ranks of the negative differences.
from scipy import stats
import numpy as np
import scipy
def wilcoxontest(train_df, test_df, columns, significance_level=0.05):
    Test_Statistic = []
    P_Value = []
    Feature = []
    for col in columns:
        # Wilcoxon signed-rank test on the (positionally) paired train/test values.
        val = scipy.stats.wilcoxon(train_df[col], test_df[col], mode="auto")
        Test_Statistic.append(val[0])
        P_Value.append(val[1])
        Feature.append(col)
    output = pd.DataFrame()
    output['Feature'] = Feature
    output['Test_Statistic'] = Test_Statistic
    output['P_Value'] = P_Value
    output['Decision'] = np.where(output['P_Value'] < significance_level,
                                  'Reject H0 : significant Difference',
                                  'Do Not Reject H0 : No significant Difference')
    return output
wilcoxontest(train_df=train_df, test_df=test_df, columns=numeric_cols, significance_level=0.05)
Feature | Test_Statistic | P_Value | Decision | |
---|---|---|---|---|
0 | age | 465252.5 | 5.364083e-01 | Do Not Reject H0 : No significant Difference |
1 | TSH | 314479.0 | 2.334189e-29 | Reject H0 : significant Difference |
2 | T3 | 170417.5 | 4.000233e-90 | Reject H0 : significant Difference |
3 | TT4 | 352276.5 | 1.393954e-17 | Reject H0 : significant Difference |
4 | T4U | 294315.5 | 2.610237e-35 | Reject H0 : significant Difference |
5 | FTI | 299891.5 | 3.434791e-34 | Reject H0 : significant Difference |
scipy.stats.wilcoxon(train_df['age'], test_df['age'],mode="auto")
WilcoxonResult(statistic=465252.5, pvalue=0.5364083181880943)
Mann–Whitney U test
- $ H_0 $: the distributions of both populations are identical.
- $ H_1 $: the distributions are not identical.
from scipy.stats import mannwhitneyu
# NaNs are filled with 0 so the test can run; note this imputation choice affects the ranks.
U1, p = mannwhitneyu(train_df['age'].fillna(0), test_df['age'].fillna(0), method="auto")
print(U1)
print(p)
mannwhitneyu(train_df['age'].fillna(0), test_df['age'].fillna(0), method="auto")
975127.5
0.8197947010766226
MannwhitneyuResult(statistic=975127.5, pvalue=0.8197947010766226)
from scipy import stats
import numpy as np
from scipy.stats import mannwhitneyu
def mannwhitneyu_test(train_df, test_df, columns, significance_level=0.05):
    Test_Statistic = []
    P_Value = []
    Feature = []
    for col in columns:
        val = mannwhitneyu(train_df[col].fillna(0), test_df[col].fillna(0), method="auto")
        Test_Statistic.append(val[0])
        P_Value.append(val[1])
        Feature.append(col)
    output = pd.DataFrame()
    output['Feature'] = Feature
    output['Test_Statistic'] = Test_Statistic
    output['P_Value'] = P_Value
    output['Decision'] = np.where(output['P_Value'] < significance_level,
                                  'Reject H0 : significant Difference',
                                  'Do Not Reject H0 : No significant Difference')
    return output
numeric_cols = train_df.select_dtypes(include=np.number).columns
mannwhitneyu_test(train_df=train_df, test_df=test_df, columns=numeric_cols, significance_level=0.05)
Feature | Test_Statistic | P_Value | Decision | |
---|---|---|---|---|
0 | age | 975127.5 | 0.819795 | Do Not Reject H0 : No significant Difference |
1 | TSH | 995669.5 | 0.463541 | Do Not Reject H0 : No significant Difference |
2 | T3 | 971038.5 | 0.673651 | Do Not Reject H0 : No significant Difference |
3 | TT4 | 1014083.5 | 0.110984 | Do Not Reject H0 : No significant Difference |
4 | T4U | 1002719.5 | 0.287800 | Do Not Reject H0 : No significant Difference |
5 | FTI | 1003243.5 | 0.276870 | Do Not Reject H0 : No significant Difference |
Chi-squared Test (Categorical Features)
- The Chi-Square test of independence is used to determine if there is a significant relationship between two nominal (categorical) variables.
- $ H_{0} $: There is no relationship between the feature distribution in the two populations.
- $ H_{1} $: There is a relationship between the feature distribution in the two populations.
The frequency of each category for one nominal variable is compared across the categories of the second nominal variable. The data can be displayed in a contingency table where each row represents a category for one variable and each column represents a category for the other variable. For example, say a researcher wants to examine the relationship between gender (male vs. female) and empathy (high vs. low). The chi-square test of independence can be used to examine this relationship. The null hypothesis for this test is that there is no relationship between gender and empathy. The alternative hypothesis is that there is a relationship between gender and empathy (e.g. there are more high-empathy females than high-empathy males).
'''
- This function performs a statistical test of the difference between a feature
  in the training set and the test set.
- For continuous features a Kolmogorov-Smirnov test is performed.
- For categorical features a chi-squared test is performed.
- significance_level = 0.05
'''
from scipy.stats import chi2_contingency
significance_level = 0.05
def hypothesis_test(data, split_column, columns, significance_level):
    Test_Statistic = []
    P_Value = []
    Feature = []
    Test_Type = []
    for col in columns:
        if data[col].dtypes == 'O':
            # Chi-squared test of independence between the feature and the split label.
            stat, p_val, dof, expected = chi2_contingency(pd.crosstab(data[col], data[split_column]))
            Test_Statistic.append(stat)
            P_Value.append(p_val)
            Feature.append(col)
            Test_Type.append("Chi-Square -Test")
        else:
            # Two-sample K-S test between the train and test values of the feature.
            val = stats.ks_2samp(data.loc[data[split_column] == "Train", col],
                                 data.loc[data[split_column] == "Test", col], mode="exact")
            Test_Statistic.append(val[0])
            P_Value.append(val[1])
            Feature.append(col)
            Test_Type.append("Kolmogorov-Smirnov-Test")
    output = pd.DataFrame()
    output['Feature'] = Feature
    output['Test_Statistic'] = Test_Statistic
    output['P_Value'] = P_Value
    output['Decision'] = np.where(output['P_Value'] < significance_level,
                                  'Reject H0 : significant Difference',
                                  'Do Not Reject H0 : No significant Difference')
    output["Test Type"] = Test_Type
    return output
hypothesis_test(data=data, split_column="Split", columns=data.drop('Split', axis=1).columns, significance_level=0.05)
Feature | Test_Statistic | P_Value | Decision | Test Type | |
---|---|---|---|---|---|
0 | age | 0.032143 | 0.464819 | Do Not Reject H0 : No significant Difference | Kolmogorov-Smirnov-Test |
1 | sex | 0.041015 | 0.979701 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
2 | on thyroxine | 0.992762 | 0.319068 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
3 | query on thyroxine | 3.068841 | 0.079806 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
4 | on antithyroid medication | 0.000000 | 1.000000 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
5 | sick | 2.734708 | 0.098189 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
6 | pregnant | 0.000000 | 1.000000 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
7 | thyroid surgery | 0.104013 | 0.747066 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
8 | I131 treatment | 0.021197 | 0.884244 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
9 | query hypothyroid | 2.110597 | 0.146282 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
10 | query hyperthyroid | 1.577218 | 0.209162 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
11 | lithium | 0.071788 | 0.788752 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
12 | goitre | 0.161441 | 0.687833 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
13 | tumor | 0.924860 | 0.336202 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
14 | hypopituitary | 0.000000 | 1.000000 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
15 | psych | 0.778264 | 0.377673 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
16 | TSH measured | 1.728095 | 0.188654 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
17 | TSH | 0.032857 | 0.436584 | Do Not Reject H0 : No significant Difference | Kolmogorov-Smirnov-Test |
18 | T3 measured | 0.008643 | 0.925927 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
19 | T3 | 0.035000 | 0.357931 | Do Not Reject H0 : No significant Difference | Kolmogorov-Smirnov-Test |
20 | TT4 measured | 0.285035 | 0.593420 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
21 | TT4 | 0.042143 | 0.166343 | Do Not Reject H0 : No significant Difference | Kolmogorov-Smirnov-Test |
22 | T4U measured | 1.506610 | 0.219657 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
23 | T4U | 0.024286 | 0.803661 | Do Not Reject H0 : No significant Difference | Kolmogorov-Smirnov-Test |
24 | FTI measured | 1.515613 | 0.218285 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
25 | FTI | 0.020714 | 0.924908 | Do Not Reject H0 : No significant Difference | Kolmogorov-Smirnov-Test |
26 | referral source | 4.227530 | 0.376088 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
27 | Target | 2800.000000 | 0.491115 | Do Not Reject H0 : No significant Difference | Chi-Square -Test |
Confidence Intervals
For continuous features:
- For normally distributed variables:
  - Normal confidence interval (mean and median)
  - Example: difference in means between a feature in the training and prediction datasets.
- Alternative non-parametric methods include bootstrap confidence intervals: the percentile method, the empirical bootstrap, and the normal interval.
For categorical features with two levels:
- For normally distributed variables:
  - Normal confidence interval (proportion)
  - Example: difference in proportions between a feature in the training and serving datasets.
- Alternative non-parametric methods include bootstrap confidence intervals: the percentile method, the empirical bootstrap, and the normal interval.
If the confidence interval contains 0, then at the 95% confidence level there is no evidence of a difference in means between the two variables. The confidence interval of the variable in the training data can also be estimated; if the mean of the variable in the serving set falls outside this interval, we can flag that there is a difference in means between the two variables.
The classical normal 95% confidence interval for the difference of the means, assuming a normal distribution of the mean, is $ (\bar{x}_{1} - \bar{x}_{2}) \pm 1.96\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}} $:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 30)  # so all columns of the DataFrame are visible
m = train_df[numeric_cols].mean() - test_df[numeric_cols].mean()
me = 1.96* np.sqrt(train_df[numeric_cols].var()/train_df.shape[0] + test_df[numeric_cols].var()/test_df.shape[0])
res = pd.DataFrame({'Feature':numeric_cols,
'mean_ci_lower_limit': m- me,
'mean_ci_upper_limit': m+me,
'Train_mean':train_df[numeric_cols].mean(),'Test_mean':test_df[numeric_cols].mean()})
res
Feature | mean_ci_lower_limit | mean_ci_upper_limit | Train_mean | Test_mean | |
---|---|---|---|---|---|
age | age | -1.491609 | 1.540420 | 51.856429 | 51.832023 |
TSH | TSH | -1.810646 | 1.370124 | 4.562983 | 4.783244 |
T3 | T3 | -0.088162 | 0.034031 | 2.011452 | 2.038517 |
TT4 | TT4 | -1.482069 | 3.762244 | 109.640701 | 108.500613 |
T4U | T4U | -0.007827 | 0.020980 | 1.001173 | 0.994596 |
FTI | FTI | -2.270119 | 2.602522 | 110.870388 | 110.704187 |
Alternatively, a bootstrap confidence interval for the mean can be found by resampling with replacement. In bootstrapping we create multiple resamples with replacement from the observed dataset, then compute the effect size of interest (for example the difference in means) on each of these resamples. The bootstrap resamples of the effect size can then be used to determine the 95% confidence interval. The distribution of the mean of the resampled data approaches the normal distribution due to the central limit theorem, even when the original data are not normally distributed. With the percentile method, if we have 1000 bootstrap samples and sort the effect sizes from smallest to largest, the 25th and 975th values delimit the 95% confidence interval:
- Calculate the difference in means on each of the 1000 bootstrap samples.
- Sort the differences in means in ascending order.
- Select the 2.5th percentile (the 25th value) as the lower limit and the 97.5th percentile (the 975th value) as the upper limit of the confidence interval.
%%time
num_iter = 10000
n_sample = train_df.shape[0]
boot_mean_train = np.zeros((num_iter, len(numeric_cols)))
boot_mean_test = np.zeros((num_iter, len(numeric_cols)))
boot_statistic = np.zeros((num_iter, len(numeric_cols)))
for i in range(num_iter):
    # Resample each split with replacement and record the difference in means.
    boot_mean_train[i, :] = list(train_df[numeric_cols].sample(n=n_sample, replace=True).mean())
    boot_mean_test[i, :] = list(test_df[numeric_cols].sample(n=n_sample, replace=True).mean())
    boot_statistic[i, :] = boot_mean_train[i, :] - boot_mean_test[i, :]
# 95% CI using the percentile method (with num_iter = 10000 these are the
# 250th and 9750th sorted values)
lower_limit = np.sort(boot_statistic, axis=0)[250, :]
upper_limit = np.sort(boot_statistic, axis=0)[9750, :]
# equivalently
lower_limit = np.quantile(boot_statistic, q=0.025, axis=0)
upper_limit = np.quantile(boot_statistic, q=0.975, axis=0)
res = pd.DataFrame({'Feature': numeric_cols,
                    'LowerLimit': lower_limit,
                    'UpperLimit': upper_limit,
                    'Train_mean': train_df[numeric_cols].mean(),
                    'Test_mean': test_df[numeric_cols].mean()})
res
CPU times: user 22.1 s, sys: 74.4 ms, total: 22.2 s
Wall time: 22.2 s
Feature | LowerLimit | UpperLimit | Train_mean | Test_mean | |
---|---|---|---|---|---|
age | age | -1.471455 | 1.559323 | 51.856429 | 51.832023 |
TSH | TSH | -1.905514 | 1.468802 | 4.562983 | 4.783244 |
T3 | T3 | -0.096149 | 0.040580 | 2.011452 | 2.038517 |
TT4 | TT4 | -1.583813 | 3.817105 | 109.640701 | 108.500613 |
T4U | T4U | -0.008266 | 0.021863 | 1.001173 | 0.994596 |
FTI | FTI | -2.493965 | 2.701709 | 110.870388 | 110.704187 |
The classical 95% confidence interval for the difference in proportions for categorical variables with two levels, $ (\hat{p}_{1} - \hat{p}_{2}) \pm 1.96\sqrt{\frac{\hat{p}_{1}(1-\hat{p}_{1})}{n_{1}} + \frac{\hat{p}_{2}(1-\hat{p}_{2})}{n_{2}}} $, can be computed as below:
two_level_categorical = ['on thyroxine','query on thyroxine','on antithyroid medication',
                         'sick','pregnant','thyroid surgery','I131 treatment','query hypothyroid',
                         'query hyperthyroid','lithium','goitre','tumor','hypopituitary','psych',
                         'TSH measured','T3 measured','TT4 measured','T4U measured']
n_sample = train_df.shape[0]
lower_limit = []
upper_limit = []
p_test = []
p_train = []
for column in two_level_categorical:
    dtest = pd.DataFrame(test_df[column].value_counts()/n_sample).reset_index().rename(columns={'index':'Value'})
    dtrain = pd.DataFrame(train_df[column].value_counts()/n_sample).reset_index().rename(columns={'index':'Value'})
    # Proportion of the 'f' level in each split.
    p1 = dtest.query('Value == "f"').iloc[0, 1]
    p2 = dtrain.query('Value == "f"').iloc[0, 1]
    margin_error = 1.96*np.sqrt((p1*(1-p1))/n_sample + (p2*(1-p2))/n_sample)
    lower_limit.append((p1-p2) - margin_error)
    upper_limit.append((p1-p2) + margin_error)
    p_test.append(p1)
    p_train.append(p2)
res = pd.DataFrame({'Feature': two_level_categorical, 'LowerLimit': lower_limit, 'UpperLimit': upper_limit,
                    'Train_proportion': p_train, 'Test_proportion': p_test})
res
Feature | LowerLimit | UpperLimit | Train_proportion | Test_proportion | |
---|---|---|---|---|---|
0 | on thyroxine | -0.036739 | 0.011025 | 0.888571 | 0.875714 |
1 | query on thyroxine | -0.000214 | 0.017357 | 0.981429 | 0.990000 |
2 | on antithyroid medication | -0.008114 | 0.008114 | 0.987857 | 0.987857 |
3 | sick | -0.001527 | 0.027241 | 0.954286 | 0.967143 |
4 | pregnant | -0.008184 | 0.009613 | 0.985000 | 0.985714 |
5 | thyroid surgery | -0.010824 | 0.006539 | 0.987143 | 0.985000 |
6 | I131 treatment | -0.008187 | 0.011044 | 0.982143 | 0.983571 |
7 | query hypothyroid | -0.030910 | 0.003767 | 0.948571 | 0.935000 |
8 | query hyperthyroid | -0.029973 | 0.005688 | 0.944286 | 0.932143 |
9 | lithium | -0.003796 | 0.006654 | 0.994286 | 0.995714 |
10 | goitre | -0.004825 | 0.009111 | 0.990000 | 0.992143 |
11 | tumor | -0.005215 | 0.018072 | 0.971429 | 0.977857 |
12 | hypopituitary | -0.000685 | 0.002114 | 0.999286 | 1.000000 |
13 | psych | -0.008010 | 0.023724 | 0.947857 | 0.955714 |
14 | TSH measured | -0.006643 | 0.038071 | 0.093571 | 0.109286 |
15 | T3 measured | -0.027974 | 0.032260 | 0.207857 | 0.210000 |
16 | TT4 measured | -0.012640 | 0.024069 | 0.062857 | 0.068571 |
17 | T4U measured | -0.007805 | 0.037805 | 0.098571 | 0.113571 |
The bootstrap confidence interval for the proportion can also be found as shown below. It basically entails obtaining the distribution of the difference in proportions by resampling with replacement, then calculating the 2.5th percentile as the lower confidence limit and the 97.5th percentile as the upper confidence limit.
two_level_categorical = ['on thyroxine','query on thyroxine','on antithyroid medication',
                         'sick','pregnant','thyroid surgery','I131 treatment','query hypothyroid',
                         'query hyperthyroid','lithium','goitre','tumor','hypopituitary','psych',
                         'TSH measured','T3 measured','TT4 measured','T4U measured']
n_sample = train_df.shape[0]
num_iter = 1000
ptest2 = np.zeros((len(two_level_categorical), 1))
ptrain2 = np.zeros((len(two_level_categorical), 1))
ptrain_statistic = np.zeros((len(two_level_categorical), num_iter))
ptest_statistic = np.zeros((len(two_level_categorical), num_iter))
pdiff_statistic = np.zeros((len(two_level_categorical), num_iter))
for j in range(num_iter):
    for i, column in enumerate(two_level_categorical):
        # Resample each split with replacement and record the proportion of the 'f' level.
        dtest = pd.DataFrame(test_df[column].sample(n=n_sample, replace=True).value_counts()/n_sample).reset_index().rename(columns={'index':'Value'})
        dtrain = pd.DataFrame(train_df[column].sample(n=n_sample, replace=True).value_counts()/n_sample).reset_index().rename(columns={'index':'Value'})
        p1 = dtest.query('Value == "f"').iloc[0, 1]
        p2 = dtrain.query('Value == "f"').iloc[0, 1]
        ptrain2[i] = p2
        ptest2[i] = p1
    ptrain_statistic[:, j] = ptrain2.flatten()
    ptest_statistic[:, j] = ptest2.flatten()
    pdiff_statistic[:, j] = ptrain2.flatten() - ptest2.flatten()
# 95% CI using the percentile method (with num_iter = 1000 these are the
# 25th and 975th sorted values)
lower_limit = np.sort(pdiff_statistic, axis=1)[:, 25]
upper_limit = np.sort(pdiff_statistic, axis=1)[:, 975]
# equivalently
# lower_limit = np.quantile(pdiff_statistic, q=0.025, axis=1)
# upper_limit = np.quantile(pdiff_statistic, q=0.975, axis=1)
res = pd.DataFrame({'Feature': two_level_categorical, 'LowerLimit': lower_limit, 'UpperLimit': upper_limit,
                    'Train_proportion': p_train, 'Test_proportion': p_test})
res
Feature | LowerLimit | UpperLimit | Train_proportion | Test_proportion | |
---|---|---|---|---|---|
0 | on thyroxine | -0.012143 | 0.036429 | 0.888571 | 0.875714 |
1 | query on thyroxine | -0.017143 | -0.000714 | 0.981429 | 0.990000 |
2 | on antithyroid medication | -0.008571 | 0.007857 | 0.987857 | 0.987857 |
3 | sick | -0.027143 | 0.000714 | 0.954286 | 0.967143 |
4 | pregnant | -0.009286 | 0.007857 | 0.985000 | 0.985714 |
5 | thyroid surgery | -0.006429 | 0.011429 | 0.987143 | 0.985000 |
6 | I131 treatment | -0.010714 | 0.008571 | 0.982143 | 0.983571 |
7 | query hypothyroid | -0.003571 | 0.030714 | 0.948571 | 0.935000 |
8 | query hyperthyroid | -0.005000 | 0.029286 | 0.944286 | 0.932143 |
9 | lithium | -0.006429 | 0.003571 | 0.994286 | 0.995714 |
10 | goitre | -0.009286 | 0.005000 | 0.990000 | 0.992143 |
11 | tumor | -0.019286 | 0.005714 | 0.971429 | 0.977857 |
12 | hypopituitary | -0.002143 | 0.000000 | 0.999286 | 1.000000 |
13 | psych | -0.022857 | 0.008571 | 0.947857 | 0.955714 |
14 | TSH measured | -0.037143 | 0.007143 | 0.093571 | 0.109286 |
15 | T3 measured | -0.032143 | 0.028571 | 0.207857 | 0.210000 |
16 | TT4 measured | -0.025000 | 0.012857 | 0.062857 | 0.068571 |
17 | T4U measured | -0.039286 | 0.008571 | 0.098571 | 0.113571 |
Concept Drift Understanding
The severity of concept drift can be measured by using a dissimilarity metric to quantify the difference between the new and the previous concept: the greater the quantified difference between the two distributions, the more severe the drift. Identifying the time at which a concept drift occurs and its duration is key to concept drift understanding, which involves the drift start point, the change period, and the end point.

A drift is detected if the accuracy of a learning system drops below a predefined threshold, or, for data drift, if there is a statistically significant difference between two data samples. The drift detection threshold may be set at a 95% confidence level (roughly $2\sigma$) or a 99% confidence level (roughly $3\sigma$). Data distribution-based drift detection algorithms report a drift alarm when two data samples have a statistically significant difference; two-sample test statistics used here include the generalized Wilcoxon test statistic.

The difference between a benchmark accuracy of a model and the new, degraded accuracy can be used as an indirect measure of the severity of drift. The severity of data drift can also be measured by the magnitude of the distance between the features in the two time frames. The Kullback–Leibler divergence $D_{\text{KL}}(P \parallel Q)$ (also called relative entropy and I-divergence) is one such measure: it is non-negative, equals 0 when the two distributions are identical, and grows as the distributions diverge, so the greater its value above 0, the more severe the drift.
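A minimal sketch of estimating the KL divergence between the train and test distributions of a feature from histograms; the binning, the smoothing epsilon, and the use of scipy.stats.entropy are illustrative choices, not part of the original analysis:
import numpy as np
from scipy.stats import entropy

def kl_divergence(reference, current, bins=20, eps=1e-10):
    """Histogram-based estimate of D_KL(reference || current)."""
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=bins)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(current, bins=edges)
    # Smooth with eps so that empty bins do not cause division by zero.
    p = p.astype(float) + eps
    q = q.astype(float) + eps
    # scipy normalizes the inputs and returns sum(p * log(p / q)) in nats.
    return entropy(p, q)

# Illustrative usage with this notebook's splits:
# kl_divergence(train_df['age'].dropna(), test_df['age'].dropna())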
Drift Adaptation
There are three main techniques for adapting existing machine learning models to the occurrence of drift:
- simple retraining
- ensemble retraining
- model adjusting.
For recurring concept drifts, an old model may be combined with a new one to handle drift when it occurs. Examples of ensemble methods used here include the Adaptive Random Forest (ARF), which extends random forests with the ADWIN concept drift detector to decide when a degrading base model should be replaced, and Dynamic Weighted Majority (DWM); a prequential sketch with ARF follows below. Models can also be trained to adaptively learn from the changing data so that the model updates itself; an example is the online decision tree algorithm called the Very Fast Decision Tree (VFDT) classifier.
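As a hedged sketch of ensemble-based adaptation, the snippet below runs a prequential (test-then-train) loop with scikit-multiflow's Adaptive Random Forest on a synthetic stream; the SEA generator, the number of estimators, and the sample budget are illustrative assumptions, not part of the original:
from skmultiflow.data import SEAGenerator
from skmultiflow.meta import AdaptiveRandomForestClassifier

stream = SEAGenerator(random_state=1)  # synthetic data stream (illustrative)
arf = AdaptiveRandomForestClassifier(n_estimators=10)
correct, n_samples = 0, 2000
for _ in range(n_samples):
    X, y = stream.next_sample()
    if arf.predict(X)[0] == y[0]:  # test on the sample first ...
        correct += 1
    arf.partial_fit(X, y)          # ... then train on it (prequential evaluation)
print(f"prequential accuracy: {correct / n_samples:.3f}")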