Imbalanced Machine Learning For Fraud Detection

Cost-sensitive XGBoost, RusBoost, SMOTE, EasyEnsemble, cost-sensitive logistic regression


Data with an imbalanced target class occurs frequently in several domains, such as credit card fraud detection, insurance claim prediction, email spam detection, anomaly detection and outlier detection. Financial institutions lose millions of dollars every year to fraudulent transactions. It is important that these institutions are able to identify fraud, both to protect their customers and to reduce the financial losses caused by fraudsters. The goal here is to predict fraudulent transactions in order to minimize losses to financial companies. For machine learning with imbalanced target classes, the model evaluation metrics used are the area under the ROC curve (AUC) and the area under the precision-recall curve. Accuracy is not useful in these situations, because the proportion of the positive class is usually so low that even a naive classifier that predicts every transaction as legitimate achieves a high accuracy. In the dataset considered here, for example, over 99% of the examples are negative, so a naive classifier that predicts all transactions as legitimate would be over 99% accurate.
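The accuracy argument can be checked with a few lines of arithmetic. The sketch below uses only the class counts reported for this dataset (492 frauds out of 284,807 transactions) and a hypothetical "classifier" that labels every transaction as legitimate.

```python
# Class counts reported for the credit card dataset used in this article.
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

# A naive rule that labels every transaction as legitimate (class 0):
# every legitimate transaction is correct, every fraud is missed.
true_negatives = n_legit
true_positives = 0

accuracy = (true_negatives + true_positives) / n_total
recall = true_positives / n_fraud

print(f"accuracy: {accuracy:.4%}")
print(f"recall on the fraud class: {recall:.1%}")
```

The accuracy is very high even though not a single fraud is detected, which is why ranking-based metrics are preferred here.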

The packages installed and imported here will be necessary for the analysis later in this project.

Description of Data.

The dataset can be found on Kaggle. It contains transactions made by credit cards in September 2013 by European cardholders. The transactions occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. This was done to preserve the identity and privacy of the people whose transactions this data was gathered from. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable and it takes the value 1 in case of fraud and 0 otherwise.
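To make the idea of example-dependent cost-sensitive learning concrete, here is a minimal sketch of a cost function driven by the transaction amount. The cost matrix (a missed fraud costs the full amount, every raised alert costs a fixed review fee) and the `admin_cost` value are illustrative assumptions and do not come from the dataset or from this analysis. The labels and predictions are toy values.

```python
def total_cost(y_true, y_pred, amounts, admin_cost=5.0):
    """Sum example-dependent costs under an assumed, illustrative cost matrix:
    - flagged transaction (true or false positive): fixed admin_cost for review
    - missed fraud (false negative): the full transaction amount is lost
    - correctly ignored legitimate transaction (true negative): no cost
    """
    cost = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if pred == 1:
            cost += admin_cost
        elif truth == 1:
            cost += amount
    return cost


# Toy example: four transactions, two of which are fraudulent.
y_true = [0, 0, 1, 1]
amounts = [149.62, 2.69, 378.66, 69.99]

flag_nothing = [0, 0, 0, 0]   # misses both frauds
flag_large = [1, 0, 1, 0]     # flags the two largest amounts only

cost_nothing = total_cost(y_true, flag_nothing, amounts)
cost_large = total_cost(y_true, flag_large, amounts)
print(cost_nothing, cost_large)
```

Under such a cost matrix, missing a large fraudulent transaction is penalised more heavily than missing a small one, which a single class weight cannot express.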

file = tf.keras.utils
df = pd.read_csv('')
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 -0.551600 -0.617801 -0.991390 -0.311169 1.468177 -0.470401 0.207971 0.025791 0.403993 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 1.612727 1.065235 0.489095 -0.143772 0.635558 0.463917 -0.114805 -0.183361 -0.145783 -0.069083 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 0.624501 0.066084 0.717293 -0.165946 2.345865 -2.890083 1.109969 -0.121359 -2.261857 0.524980 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 -0.226487 0.178228 0.507757 -0.287924 -0.631418 -1.059647 -0.684093 1.965775 -1.232622 -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 -0.822843 0.538196 1.345852 -1.119670 0.175121 -0.451449 -0.237033 -0.038195 0.803487 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0
df[['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V26', 'V27', 'V28', 'Amount', 'Class']].describe().transpose()
count mean std min 25% 50% 75% max
Time 284807.0 9.481386e+04 47488.145955 0.000000 54201.500000 84692.000000 139320.500000 172792.000000
V1 284807.0 3.919560e-15 1.958696 -56.407510 -0.920373 0.018109 1.315642 2.454930
V2 284807.0 5.688174e-16 1.651309 -72.715728 -0.598550 0.065486 0.803724 22.057729
V3 284807.0 -8.769071e-15 1.516255 -48.325589 -0.890365 0.179846 1.027196 9.382558
V4 284807.0 2.782312e-15 1.415869 -5.683171 -0.848640 -0.019847 0.743341 16.875344
V5 284807.0 -1.552563e-15 1.380247 -113.743307 -0.691597 -0.054336 0.611926 34.801666
V26 284807.0 1.699104e-15 0.482227 -2.604551 -0.326984 -0.052139 0.240952 3.517346
V27 284807.0 -3.660161e-16 0.403632 -22.565679 -0.070840 0.001342 0.091045 31.612198
V28 284807.0 -1.206049e-16 0.330083 -15.430084 -0.052960 0.011244 0.078280 33.847808
Amount 284807.0 8.834962e+01 250.120109 0.000000 5.600000 22.000000 77.165000 25691.160000
Class 284807.0 1.727486e-03 0.041527 0.000000 0.000000 0.000000 0.000000 1.000000

We can see that the target class is highly imbalanced. The minority class is about 0.17% of the target examples.

0    99.827251
1     0.172749
Name: Class, dtype: float64
neg, pos = df.Class.value_counts()
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n  '.format(
    total, pos, 100 * pos / total))

print('Total: {}\n    Negative: {} ({:.2f}% of total)\n  '.format(
    total, neg, 100 * neg / total))
    Total: 284807
    Positive: 492 (0.17% of total)
Total: 284807
    Negative: 284315 (99.83% of total)
#x = raw_df.drop(['Time'],axis=1)

# Use a utility from sklearn to split and shuffle our dataset.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2)
#train_df, val_df = train_test_split(train_df, test_size=0.2)

train_x =train_df.drop(['Time','Class'],axis=1)
test_x = test_df.drop(['Time','Class'],axis=1)
#val_x  =  val_df.drop(['Time'],axis=1)

train_y=  train_df.Class
test_y = test_df.Class
#val_y  = val_df.Class

print('Training dataset size:{}'.format(train_x.shape))
print('Test dataset size:{}'.format(test_x.shape))
#print('Validation dataset size: {}'.format(val_df.shape))
Training dataset size:(227845, 29)
Test dataset size:(56962, 29)

The first model considered here is the extreme gradient boosting (XGBoost) algorithm, which is popular for modeling tabular data. The hyperparameters of the model are left at their default values, except for scale_pos_weight, which is tuned in the cost-sensitive XGBoost case to find the weight that optimizes the model. Leaving the other hyperparameters at their defaults allows for a fair comparison among the machine learning algorithms used in this analysis. The hyperparameter tuning is done by Bayesian optimization using the scikit-optimize package.
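As a reference point for the tuned weight, the XGBoost documentation suggests the ratio of negative to positive examples as a typical value for scale_pos_weight. The sketch below computes that ratio from the class counts reported earlier. In practice it would be computed on the training split only, and using it as a starting point is a suggestion rather than part of the original analysis. The search bounds are the ones used in the search space of this article.

```python
# Class counts for the full dataset, as reported above.
n_neg = 284_315
n_pos = 492

# Heuristic starting value: number of negatives per positive example.
ratio = n_neg / n_pos
print(f"negatives per positive: {ratio:.1f}")

# Bounds of the log-uniform search space used for scale_pos_weight here.
lower, upper = 1e-6, 2000
print("heuristic lies inside the search space:", lower <= ratio <= upper)
```

The heuristic value falls inside the search range, so the Bayesian search is able to explore weights both below and above it.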

# Setting a 5-fold stratified cross-validation (note: shuffle=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

clf = xgb.XGBClassifier(
        n_jobs = -1,
        objective = 'binary:logistic',
        # any further arguments were lost in the export of this cell
)

search_spaces = {
    #'learning_rate': Real(0.01, 1.0, 'log-uniform'),
    #             'min_child_weight': Integer(0, 10),
    #             'max_depth': Integer(1, 50),
    #            'max_delta_step': Integer(0, 20), # Maximum delta step we allow each leaf output
    #             'subsample': Real(0.01, 1.0, 'uniform'),
    #             'colsample_bytree': Real(0.01, 1.0, 'uniform'), # subsample ratio of columns by tree
    #             'colsample_bylevel': Real(0.01, 1.0, 'uniform'), # subsample ratio by level in trees
                 #'reg_lambda': Real(1e-9, 1000, 'log-uniform'), # L2 regularization
                 #'reg_alpha': Real(1e-9, 1.0, 'log-uniform'), # L1 regularization
     #            'gamma': Real(1e-9, 0.5, 'log-uniform'), # Minimum loss reduction for partition/pruning parameter
     #            'n_estimators': Integer(50, 100),
                 'scale_pos_weight': Real(1e-6, 2000, 'log-uniform')
}

# The call below was truncated in the export. Passing search_spaces and cv=skf
# is an assumption based on the objects defined above; other arguments
# (for example the number of iterations) are unknown and left at their defaults.
bayessearch = BayesSearchCV(clf,
                    search_spaces,
                    scoring='roc_auc', #f1
                    cv=skf,
                    optimizer_kwargs={'base_estimator': 'GP'},
)

#xgbm_model =,  y=train_y)

from google.colab import drive
import os 

Build Pandas-Profiling Report

The exploratory analysis of the features in the dataset can be automated with the pandas-profiling package. Its ProfileReport generates exploratory plots of the features in the dataset that is passed to it.

#Inline report without saving object
#Save report to file
pfr = pandas_profiling.ProfileReport(df)

pfr.to_file("/content/drive/My Drive/profilingReport2.html")



The function below is used to calculate the various evaluation metrics, including the area under the ROC curve, the area under the precision-recall curve and the F1-score.

import numpy as np
from numpy import trapz
from sklearn.metrics import (auc, confusion_matrix, f1_score,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score)

def Evaluate(labels, predictions, p=0.5):
    """Print confusion-matrix counts and summary metrics at threshold p."""
    # Hard class labels at the chosen threshold (used consistently below)
    hard_labels = predictions > p
    CM = confusion_matrix(labels, hard_labels)
    TN = CM[0][0]
    FN = CM[1][0]
    TP = CM[1][1]
    FP = CM[0][1]
    print('Legitimate Transactions Detected (True Negatives): {}'.format(TN))
    print('Fraudulent Transactions Missed (False Negatives):  {}'.format(FN))
    print('Fraudulent Transactions Detected (True Positives): {}'.format(TP))
    print('Legitimate Transactions Incorrectly Detected (False Positives):{}'.format(FP))
    print('Total Fraudulent Transactions: ', np.sum(CM[1]))
    # ROC AUC is threshold-free, so it uses the predicted scores directly
    roc_auc = roc_auc_score(labels, predictions)
    prec = precision_score(labels, hard_labels)
    rec = recall_score(labels, hard_labels)
    f1 = f1_score(labels, hard_labels)
    print('auc :{}'.format(roc_auc))
    print('precision :{}'.format(prec))
    print('recall :{}'.format(rec))
    print('f1 :{}'.format(f1))
    # The precision-recall points are computed from the thresholded labels,
    # so the curve has a single operating point between its two end points
    # and the area printed below is a coarse summary, not the average
    # precision over all thresholds.
    precision, recall, thresholds = precision_recall_curve(labels, hard_labels)
    # Use the trapezoidal rule over the (precision, recall) points
    area = trapz(recall, precision)
    print("Area Under Precision Recall  Curve(AP): %0.4f" % area)
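For comparison with the thresholded area reported by Evaluate, a ranking-based summary of the precision-recall curve can be computed from the raw scores. The sketch below is a small pure-Python version of average precision, written for illustration under the assumption that there are no tied scores. scikit-learn provides `sklearn.metrics.average_precision_score` for real use. The labels and scores are toy values.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision values measured at the rank
    of each positive example, after sorting by decreasing score.
    Assumes binary labels (0/1) and no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy example: two positives, one of them ranked below a negative.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(labels, scores)
print(f"average precision: {ap:.4f}")
```

Because it uses the full ranking rather than a single threshold, this kind of summary can differ noticeably from the area computed on thresholded predictions.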

We will investigate the performance of several ML algorithms on classification with an imbalanced target class. The XGBoost algorithm will be used to model the data with no resampling at all, with random undersampling, with the Synthetic Minority Oversampling Technique (SMOTE), and also modified to perform cost-sensitive learning. The other ML algorithms tested include Forest of Randomized Trees, RusBoost, EasyEnsemble and the Bagging classifier. The cost-sensitive XGBoost method involves experimentally determining the optimal weight on the minority target class using Bayesian optimization.

XGBoost No Weights

The first model considered here is an xgboost with default hyperparameter values with no sampling of data.

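# The estimator and the X argument were lost in the export of this cell;
# a default XGBClassifier fitted on train_x is assumed, as described above.
xgb_no_weights = xgb.XGBClassifier().fit(X=train_x, y=train_y)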
#xgb_no_weights_pred = xgb_no_weights.predict_proba(test_x)
xgb_no_weights_pred = xgb_no_weights.predict_proba(test_x)[:,1]
dump(xgb_no_weights, '/content/drive/My Drive/ImbalancedData/xgb_no_weights.joblib')
Evaluate(labels=test_y, predictions=xgb_no_weights_pred, p=0.5)
Legitimate Transactions Detected (True Negatives): 56854
Fraudulent Transactions Missed (False Negatives):  26
Fraudulent Transactions Detected (True Positives): 79
Legitimate Transactions Incorrectly Detected (False Positives):3
Total Fraudulent Transactions:  105
auc :0.9777749860343032
precision :0.9634146341463414
recall :0.7523809523809524
f1 :0.8449197860962566
Area Under Precision Recall  Curve(AP): 0.8563
xgb_no_weights  =  load( '/content/drive/My Drive/ImbalancedData/xgb_no_weights.joblib')
xgb_no_weights_pred = xgb_no_weights.predict_proba(test_x)[:,1]
#Evaluate(labels=test_y, predictions=xgb_no_weights_pred, p=0.5)

Convert these vectors from Python to R vectors. This allows the R library MLmetrics, which evaluates ML models, to be used for finding the area under the precision-recall curve.

 #%R -i  test_y 
#%%R -i xgb_no_weights_pred

#library( MLmetrics)
#d= data.frame(pred=xgb_no_weights_pred[,2],truth=as.factor(test_y))
#glue("Test Set : Area Under Precision-Recall Curve: {yardstick::pr_auc(d, truth, pred)}")
#glue("Test Set : Area Under Precision-Recall Curve: {MLmetrics::PRAUC(as.vector(xgb_no_weights_pred),test_y)}")
dump(xgb_no_weights, '/content/drive/My Drive/ImbalancedData/xgb_no_weights.joblib') 
['/content/drive/My Drive/ImbalancedData/xgb_no_weights.joblib']

Model one : XGBoost with Weights on Label/ No Sampling

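# The fitted object and the X argument were lost in the export of this cell;
# the Bayesian search over scale_pos_weight defined earlier is assumed here.
ns_model = bayessearch.fit(X=train_x, y=train_y)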
dump(ns_model, '/content/drive/My Drive/ImbalancedData/ns_model.joblib') 
['/content/drive/My Drive/ImbalancedData/ns_model.joblib']
#dump(ns_model, '/content/drive/My Drive/ImbalancedData/ns_model.joblib') 

ns_model = load('/content/drive/My Drive/ImbalancedData/ns_model.joblib') 

ns_model_pred = ns_model.predict_proba(test_x)[:,1]

Evaluate(labels=test_y, predictions=ns_model_pred, p=0.5)
Legitimate Transactions Detected (True Negatives): 56839
Fraudulent Transactions Missed (False Negatives):  15
Fraudulent Transactions Detected (True Positives): 90
Legitimate Transactions Incorrectly Detected (False Positives):18
Total Fraudulent Transactions:  105
auc :0.9961524191434317
precision :0.8333333333333334
recall :0.8571428571428571
f1 :0.8450704225352113
Area Under PR Curve(AP): 0.8435

XGBoost with Undersampling

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X=train_x,  y=train_y)
print('Resampled dataset shape %s' % Counter(y_rus))

# The estimator and the X argument were lost in the export of this cell;
# a default XGBClassifier fitted on the under-sampled data is assumed.
xgb_model_rus = xgb.XGBClassifier().fit(X=X_rus, y=y_rus)

dump(xgb_model_rus, '/content/drive/My Drive/ImbalancedData/xgb_model_rus.joblib') 
Resampled dataset shape Counter({0: 402, 1: 402})

['/content/drive/My Drive/ImbalancedData/xgb_model_rus.joblib']
xgb_model_rus = load('/content/drive/My Drive/ImbalancedData/xgb_model_rus.joblib') 

xgb_model_rus_pred  =  xgb_model_rus.predict_proba(test_x.values)[:,1]

p = 0.5  # decision threshold for hard class labels, defined here because it is used below
print(" balanced_accuracy_score {}".format(balanced_accuracy_score(test_y, xgb_model_rus_pred>p) ))

cm=confusion_matrix(test_y, xgb_model_rus_pred>0.5)

print("confusion matrix : {}".format(cm))

Evaluate(labels=test_y, predictions=xgb_model_rus_pred, p=0.5)

import collections

 balanced_accuracy_score 0.9634347489985318
confusion matrix : [[54865  1992]
 [    4   101]]
Legitimate Transactions Detected (True Negatives): 54865
Fraudulent Transactions Missed (False Negatives):  4
Fraudulent Transactions Detected (True Positives): 101
Legitimate Transactions Incorrectly Detected (False Positives):1992
Total Fraudulent Transactions:  105
auc :0.9926581055061278
precision :0.0482560917343526
recall :0.9619047619047619
f1 :0.09190172884440401
Area Under PR Curve(AP): 0.5033
Counter({False: 54869, True: 2093})
print("Classification Report")
print(classification_report(test_y, xgb_model_rus_pred > p))

# ROC curve and Area-Under-Curve (AUC)
#calculating accuracy
accuracy_xgbm_sm= accuracy_score(test_y, xgb_model_rus_pred>p)

print('accuracy score : {:0.3f}'.format( accuracy_xgbm_sm))

roc_auc_sm = roc_auc_score(test_y, xgb_model_rus_pred)

print('roc score : {:0.3f}'.format( roc_auc_sm))

Classification Report
              precision    recall  f1-score   support

           0       1.00      0.96      0.98     56857
           1       0.05      0.96      0.09       105

    accuracy                           0.96     56962
   macro avg       0.52      0.96      0.54     56962
weighted avg       1.00      0.96      0.98     56962

accuracy score : 0.965
roc score : 0.993
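The metrics reported for the under-sampled model can be reproduced by hand from its confusion matrix, which helps to see why precision collapses while recall and accuracy stay high. The sketch below uses only the four counts printed above.

```python
# Confusion-matrix counts reported above for XGBoost with under-sampling.
TN, FP = 54865, 1992
FN, TP = 4, 101

precision = TP / (TP + FP)             # share of alerts that are real fraud
recall = TP / (TP + FN)                # share of frauds that are caught
specificity = TN / (TN + FP)           # share of legitimate transactions left alone
f1 = 2 * TP / (2 * TP + FP + FN)       # harmonic mean of precision and recall
accuracy = (TP + TN) / (TP + TN + FP + FN)
balanced_accuracy = (recall + specificity) / 2

print(f"precision         : {precision:.4f}")
print(f"recall            : {recall:.4f}")
print(f"f1                : {f1:.4f}")
print(f"accuracy          : {accuracy:.4f}")
print(f"balanced accuracy : {balanced_accuracy:.4f}")
```

Almost all of the 2,093 alerts are false alarms, so precision is about 5% even though 101 of the 105 frauds are caught.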


XGBoost with SMOTE

from collections import Counter
from imblearn.over_sampling import SMOTE

print('Original dataset shape %s' % Counter(train_y))

sm = SMOTE(random_state=42)

X_res, y_res = sm.fit_resample(X=train_x,  y=train_y)
print('Resampled dataset shape %s' % Counter(y_res))

# The estimator and the X argument were lost in the export of this cell;
# a default XGBClassifier fitted on the SMOTE-resampled data is assumed.
xgb_model_sm = xgb.XGBClassifier().fit(X=X_res, y=y_res)
xgb_model_sm_pred = xgb_model_sm.predict(test_x.values)
#iba(y_test, y_pred)
#balanced_accuracy_score(y_test, y_pred) 

cm=confusion_matrix(test_y, xgb_model_sm_pred)

print("confusion matrix")
print(cm)
print("Classification Report")
print(classification_report(test_y, xgb_model_sm_pred))

# Accuracy is computed from the hard class labels
accuracy_xgbm_sm= accuracy_score(test_y, xgb_model_sm_pred)
print('accuracy score : {:0.3f}'.format( accuracy_xgbm_sm))
# ROC AUC needs scores rather than hard labels, so use the predicted probabilities
roc_auc_sm = roc_auc_score(test_y, xgb_model_sm.predict_proba(test_x.values)[:,1])
print('roc score : {:0.3f}'.format( roc_auc_sm))

dump(xgb_model_sm, '/content/drive/My Drive/ImbalancedData/xgb_model_sm.joblib') 
xgb_model_sm = load('/content/drive/My Drive/ImbalancedData/xgb_model_sm.joblib') 

xgb_model_sm_pred  =  xgb_model_sm.predict_proba(test_x.values)[:,1]
Evaluate(labels=test_y, predictions=xgb_model_sm_pred, p=0.5)

Legitimate Transactions Detected (True Negatives): 56306
Fraudulent Transactions Missed (False Negatives):  6
Fraudulent Transactions Detected (True Positives): 99
Legitimate Transactions Incorrectly Detected (False Positives):551
Total Fraudulent Transactions:  105
auc :0.9948591160614306
precision :0.1523076923076923
recall :0.9428571428571428
f1 :0.2622516556291391
Area Under PR Curve(AP): 0.5458
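The core idea behind SMOTE, used for the model above, is to create synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates that idea in pure Python on toy 2-D points. The function name `smote_like_samples` and its parameters are hypothetical, and this is not the imbalanced-learn implementation.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    randomly chosen minority point and one of its k nearest minority
    neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by distance to the chosen point.
        others = [point for point in minority if point is not base]
        others.sort(key=lambda point: sum((a - b) ** 2 for a, b in zip(point, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class with four 2-D points.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 3.0)]
new_points = smote_like_samples(minority, n_new=5)
for point in new_points:
    print(point)
```

Every synthetic point stays inside the region spanned by existing minority points, which is why SMOTE adds density to the minority class without copying examples exactly.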

Forest of randomized trees

BalancedRandomForestClassifier is another ensemble method, in which each tree of the forest is provided with a balanced bootstrap sample. This class provides all the functionality of sklearn.ensemble.RandomForestClassifier, notably the feature_importances_ attribute:
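One simple way to picture a balanced bootstrap sample is to draw the same number of examples from each class, with replacement. The sketch below does this for toy labels. The helper name `balanced_bootstrap` is hypothetical, and the details are not necessarily how imbalanced-learn builds its samples internally.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, seed=0):
    """Return row indices of a class-balanced bootstrap sample:
    n draws with replacement from each class, where n is the minority size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    sample = [rng.choice(pos) for _ in range(n)] + [rng.choice(neg) for _ in range(n)]
    rng.shuffle(sample)
    return sample


# Toy labels: 95 legitimate transactions and 5 frauds.
labels = [0] * 95 + [1] * 5
idx = balanced_bootstrap(labels)
counts = Counter(labels[i] for i in idx)
print(counts)
```

Each tree then sees both classes equally often, at the price of discarding most majority examples in any single sample.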

from imblearn.ensemble import BalancedRandomForestClassifier
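# The fit call was truncated in the export of this cell; train_x is assumed as the feature matrix.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0).fit(train_x,  train_y )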

brf_pred = brf.predict(test_x)
balanced_accuracy_score(test_y, brf_pred)  


dump(brf, '/content/drive/My Drive/ImbalancedData/brf.joblib')
brf = load('/content/drive/My Drive/ImbalancedData/brf.joblib') 

brf_pred  =  brf.predict_proba(test_x.values)[:,1]
Evaluate(labels=test_y, predictions= brf_pred, p=0.5)
Legitimate Transactions Detected (True Negatives): 55509
Fraudulent Transactions Missed (False Negatives):  2
Fraudulent Transactions Detected (True Positives): 103
Legitimate Transactions Incorrectly Detected (False Positives):1348
Total Fraudulent Transactions:  105
auc :0.9878704887868227
precision :0.0709855272226051
recall :0.9809523809523809
f1 :0.13239074550128535
Area Under PR Curve(AP): 0.5241


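RusBoost

from imblearn.ensemble import RUSBoostClassifier

# The remaining constructor arguments and the X argument of fit were lost in
# the export of this cell; only random_state is known, and train_x is assumed.
rbt = RUSBoostClassifier(random_state=0)
rbt.fit(X=train_x,  y=train_y)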
dump(rbt, '/content/drive/My Drive/ImbalancedData/rbt.joblib')
rbt = load('/content/drive/My Drive/ImbalancedData/rbt.joblib') 

rbt_pred  =  rbt.predict_proba(test_x.values)[:,1]
Evaluate(labels=test_y, predictions= rbt_pred, p=0.5)
Legitimate Transactions Detected (True Negatives): 55970
Fraudulent Transactions Missed (False Negatives):  5
Fraudulent Transactions Detected (True Positives): 100
Legitimate Transactions Incorrectly Detected (False Positives):887
Total Fraudulent Transactions:  105
auc :0.9918764452507
precision :0.10131712259371833
recall :0.9523809523809523
f1 :0.18315018315018314
Area Under PR Curve(AP): 0.5250


EasyEnsemble

EasyEnsemble is a bagging method whose base learners are AdaBoost classifiers. The EasyEnsembleClassifier bags AdaBoost learners that are each trained on a balanced bootstrap sample. As with the BalancedBaggingClassifier API, the ensemble can be constructed as follows:

from imblearn.ensemble import EasyEnsembleClassifier
# The X argument of fit was lost in the export of this cell; train_x is assumed.
eec = EasyEnsembleClassifier(random_state=0,
                            sampling_strategy='auto').fit(X=train_x,  y=train_y)

dump(eec, '/content/drive/My Drive/ImbalancedData/eec.joblib')
eec = load('/content/drive/My Drive/ImbalancedData/eec.joblib') 

eec_pred  =  eec.predict_proba(test_x.values)[:,1]
Evaluate(labels=test_y, predictions= eec_pred, p=0.5)
Legitimate Transactions Detected (True Negatives): 54565
Fraudulent Transactions Missed (False Negatives):  3
Fraudulent Transactions Detected (True Positives): 102
Legitimate Transactions Incorrectly Detected (False Positives):2292
Total Fraudulent Transactions:  105
auc :0.9912644671636528
precision :0.042606516290726815
recall :0.9714285714285714
f1 :0.08163265306122448
Area Under PR Curve(AP): 0.5052

Bagging classifier

BalancedBaggingClassifier resamples each subset of data before training each estimator of the ensemble. In short, it combines the output of an EasyEnsemble sampler with an ensemble of classifiers (i.e. BaggingClassifier). BalancedBaggingClassifier therefore takes the same parameters as the scikit-learn BaggingClassifier. In addition, there are two more parameters, sampling_strategy and replacement, which control the behaviour of the random under-sampler:

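from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# The X argument of fit was lost in the export of this cell; train_x is assumed.
# Note: recent imbalanced-learn releases name the first parameter `estimator`.
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                               random_state=0).fit(X=train_x,  y=train_y)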
dump(bbc, '/content/drive/My Drive/ImbalancedData/bbc.joblib')

bbc_model = load('/content/drive/My Drive/ImbalancedData/bbc.joblib') 
bbc_pred  =  bbc_model.predict_proba(test_x.values)[:,1]
Evaluate(labels=test_y, predictions= bbc_pred, p=0.5)
Legitimate Transactions Detected (True Negatives): 55701
Fraudulent Transactions Missed (False Negatives):  5
Fraudulent Transactions Detected (True Positives): 100
Legitimate Transactions Incorrectly Detected (False Positives):1156
Total Fraudulent Transactions:  105
auc :0.9845647853386567
precision :0.07961783439490445
recall :0.9523809523809523
f1 :0.1469507714915503
Area Under PR Curve(AP): 0.5142
pred_df = pd.DataFrame(test_y,index=None)
pred_df['baggingclassifier_pred'] =bbc_pred
pred_df['easyensemble_pred'] = eec_pred
pred_df['RusBoost_pred']  =  rbt_pred
pred_df['forest_r_t']  =  brf_pred 
pred_df['xgb_rus_pred'] = xgb_model_rus_pred
pred_df['xgb_smote_pred']  =  xgb_model_sm_pred
pred_df['xgboost_weights'] =  ns_model_pred
#pred_df["test_y"]    = test_y
pred_df["xgb_no_weights_pred"] = xgb_no_weights_pred

pred_df.to_csv('/content/drive/My Drive/ImbalancedData/pred_df.csv')


Plot the AUC ROC

mpl.rcParams['figure.figsize'] = (12, 10)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
from sklearn.metrics import roc_curve

def plot_roc(name, labels, predictions, p=0.5, **kwargs):
  # Use the roc_curve function imported above
  fp, tp, _ = roc_curve(labels, predictions)

  plt.plot(100*fp, 100*tp, label=name, linewidth=2, **kwargs)
  plt.xlabel('False positives [%]')
  plt.ylabel('True positives [%]')
  plt.title('Area Under ROC Curve @{:.2f}'.format(p))
  ax = plt.gca()
#%matplotlib inline
plot_roc("xgboost No Weight", test_y, xgb_no_weights_pred, color=colors[0],linestyle='--')
plot_roc("Xgboost Weight", test_y ,ns_model_pred, color=colors[1])
plot_roc("Xgboost Under-Sampling", test_y, xgb_model_rus_pred, color=colors[2])
plot_roc("Xgboost Smote", test_y ,xgb_model_sm_pred, color=colors[3])
plot_roc("Forest of Randomized Trees", test_y ,brf_pred, color=colors[4])
plot_roc("RusBoost", test_y ,rbt_pred, color=colors[5])
plot_roc("EasyEnsemble Classifier", test_y ,eec_pred, color=colors[6])
plot_roc("Bagging Classifier", test_y ,bbc_pred, color=colors[7])
plt.legend(loc='lower right')
plt.savefig('/content/drive/My Drive/ImbalancedData/all_rocauc.png')


Plot the Area Under Precision-Recall Curve

from sklearn.metrics import precision_recall_curve

def plot_auc_pr(name, labels, predictions,n=0.5, **kwargs):
  # Use the precision_recall_curve function imported above
  p, r, _ = precision_recall_curve(labels, predictions)

  plt.plot(100*r, 100*p, label=name, linewidth=2, **kwargs)
  plt.xlabel('Recall [%]')
  plt.ylabel('Precision [%]')
  plt.title('Area Under Precision-Recall Curve @{:.2f}'.format(n))
  #plt.title('Area Under Precision-Recall Curve: {}' .format(p))
  ax = plt.gca()
#plot_auc_pr("Train  Weight", train_labels, train_predictions_weight, color=colors[1])
plot_auc_pr("xgboost No Weight", test_y, xgb_no_weights_pred, color=colors[0],linestyle='--')
plot_auc_pr("Xgboost Weight", test_y ,ns_model_pred, color=colors[1])
plot_auc_pr("Xgboost Under-Sampling", test_y, xgb_model_rus_pred, color=colors[2])
plot_auc_pr("Xgboost Smote", test_y ,xgb_model_sm_pred, color=colors[3])
plot_auc_pr("Forest of Randomized Trees", test_y ,brf_pred, color=colors[4])
plot_auc_pr("RusBoost", test_y ,rbt_pred, color=colors[5])
plot_auc_pr("EasyEnsemble Classifier", test_y ,eec_pred, color=colors[6])
plot_auc_pr("Bagging Classifier", test_y ,bbc_pred, color=colors[7])
plt.legend(loc='lower left')
plt.savefig('/content/drive/My Drive/ImbalancedData/auc_pr2.png')


Cost-Sensitive Logistic Regression

Logistic regression is a well-known statistical model for binary target values that is often overlooked. It will be interesting to see how it performs when presented with an imbalanced target class. It can be modified to perform cost-sensitive learning with imbalanced data.
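The effect of a class weight on the logistic loss can be illustrated with a small calculation. In the sketch below, each positive example's log-loss term is multiplied by a weight `w_pos`, which is roughly what a setting such as `class_weight={0: 1, 1: w}` does. The helper `class_losses` and the toy probabilities are assumptions made for illustration, and regularisation is ignored.

```python
import math


def class_losses(y_true, p_pred, w_pos=1.0):
    """Return the summed log loss of the positive and negative class,
    with every positive term multiplied by w_pos."""
    pos_loss = 0.0
    neg_loss = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            pos_loss += -w_pos * math.log(p)
        else:
            neg_loss += -math.log(1.0 - p)
    return pos_loss, neg_loss


# Toy batch: one fraud among ten transactions, all scored as unlikely fraud.
y_true = [1] + [0] * 9
p_pred = [0.1] * 10

shares = {}
for w in (1, 10):
    pos_loss, neg_loss = class_losses(y_true, p_pred, w_pos=w)
    shares[w] = pos_loss / (pos_loss + neg_loss)
    print(f"weight {w:>2}: share of the loss due to the missed fraud = {shares[w]:.3f}")
```

With a larger weight, the single missed fraud dominates the loss, so the optimiser is pushed towards raising the predicted probability for fraud-like transactions, at the cost of more false positives.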

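from sklearn import linear_model
from sklearn.linear_model import LogisticRegression

elasticreg = linear_model.SGDClassifier( tol=1e-3,
                                 max_iter = int(1e4), 
                                 warm_start = True, 
                                 n_jobs = -1)

# The lines between the classifier and the plot style were lost in the export.
# The threshold value is an assumption, matching the 0.5 used elsewhere.
threshhold = 0.5
plt.style.use('ggplot')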

fig = plt.figure(figsize=(15,8))
ax1 = fig.add_subplot(1,2,1)
ax1.set_title('Area Under Precision-Recall Curve @{:.2f}'.format(threshhold))

ax2 = fig.add_subplot(1,2,2)
ax2.set_xlabel('False Positive Rate')
ax2.set_ylabel('True Positive Rate')
ax2.set_title('Area Under ROC Curve @{:.2f}'.format(threshhold))

rocauc_vector= []
f1_vector= []
prec_vector= []
rec_vector= []
#cfn_matrix_  =  np.zeros((8, 4))
cfn_matrix_ =[]
pr_auc_vector =[]

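weights = [1,5,10,20,50,100,500,10000]
for w,k in zip(weights,'bgrcmykw'):
    # The X argument of fit was lost in the export; train_x is assumed.
    lr_model = LogisticRegression(class_weight={0:1,1:w}).fit(train_x,train_y) 
    pred_prob = lr_model.predict_proba(test_x)[:,1]

    p,r,_ = precision_recall_curve(test_y,pred_prob)
    # roc_curve returns (fpr, tpr, thresholds), in that order
    fpr,tpr,_ = roc_curve(test_y,pred_prob)
    auc=     roc_auc_score(test_y,pred_prob)
    f1 =     f1_score(test_y,pred_prob >threshhold )
    prec=precision_score(test_y,pred_prob > threshhold)
    rec=recall_score(test_y,pred_prob > threshhold)
    cfn_matrix = confusion_matrix(test_y,pred_prob > threshhold)
    precision, recall, thresholds = precision_recall_curve(test_y, pred_prob > threshhold)
    #use the trapezoidal rule to calculate the area under the precision-recall curve
    area =  trapz(recall, precision)

    # The bookkeeping and plotting calls below are reconstructed from the
    # variables and legends used in this cell; the original lines were lost.
    rocauc_vector.append(auc)
    f1_vector.append(f1)
    prec_vector.append(prec)
    rec_vector.append(rec)
    cfn_matrix_.append(cfn_matrix.flatten())
    pr_auc_vector.append(area)
    ax1.plot(r, p, c=k, label=w)
    ax2.plot(fpr, tpr, c=k, label=w)
ax1.legend(loc='lower left')    
ax2.legend(loc='lower right')
plt.savefig('/content/drive/My Drive/ImbalancedData/logistic.png')

# Reconstructed: collect the per-weight metrics into the table used below.
results = pd.DataFrame(list(zip(weights, rocauc_vector, f1_vector, prec_vector, rec_vector)))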


results.columns = ['Weight','ROC_AUC','F-Score','Precision','Recall']
results['TP'] = [66,83,89,92,95,96,101,108]
results['TN'] = [56846,56831,56829,56805,56750,56613,55668,40576]
results['FP'] = [6,21,23,47,102,239,1184,16276]
# FN for weight 1 is 44, consistent with the recall of 0.6 (66 of 110 frauds)
results['FN'] = [44,27,21,18,15,14,9,2]
results['PR_AUC'] = pr_auc_vector
   Weight   ROC_AUC   F-Score  Precision    Recall   TP     TN     FP  FN    PR_AUC
0       1  0.959148  0.725275   0.916667  0.600000   66  56846      6  44  0.756788
1       5  0.976090  0.775701   0.798077  0.754545   83  56831     21  27  0.774617
2      10  0.978801  0.801802   0.794643  0.809091   89  56829     23  21  0.800120
3      20  0.982562  0.738956   0.661871  0.836364   92  56805     47  18  0.747344
4      50  0.985512  0.618893   0.482234  0.863636   95  56750    102  15  0.671135
5     100  0.986236  0.431461   0.286567  0.872727   96  56613    239  14  0.577839
6     500  0.988871  0.144803   0.078599  0.918182  101  55668   1184   9  0.496538
7   10000  0.986018  0.013096   0.006592  0.981818  108  40576  16276   2  0.492291

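Among the weights tried, cost-sensitive logistic regression with a weight of 10 on the minority class gives the best trade-off: it has the highest F-score (0.80) and the highest area under the precision-recall curve (0.80), with a ROC-AUC of 0.979. Larger weights keep raising recall, but precision falls sharply, so both the F-score and the area under the precision-recall curve decline.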
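The choice of weight can also be read off the results table programmatically. The sketch below copies the Weight, F-Score and PR_AUC columns from the table above and picks the best row for each criterion. Selecting by these two metrics is one reasonable rule, not the only one.

```python
# (weight, F-score, PR_AUC) rows copied from the results table above.
rows = [
    (1, 0.725275, 0.756788),
    (5, 0.775701, 0.774617),
    (10, 0.801802, 0.800120),
    (20, 0.738956, 0.747344),
    (50, 0.618893, 0.671135),
    (100, 0.431461, 0.577839),
    (500, 0.144803, 0.496538),
    (10000, 0.013096, 0.492291),
]

best_by_f1 = max(rows, key=lambda row: row[1])
best_by_pr_auc = max(rows, key=lambda row: row[2])

print("best weight by F-score:", best_by_f1[0])
print("best weight by PR_AUC :", best_by_pr_auc[0])
```

Both criteria point to the same weight, which supports the conclusion drawn from the table.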

