Using Scikit-Learn Pipelines and Converting Them To PMML
Introduction
Pipelining in machine learning involves chaining together all the steps involved in training a model. A pipeline assembles several steps that can be cross-validated together while setting different parameter values, and it is a step toward automating steps such as preprocessing and missing value imputation. The pipeline steps used to get the model ready for training can also be applied to test data in a single step to make it ready for prediction. A Pipeline sequentially applies a list of transforms and a final estimator to a dataset: the intermediate steps implement fit and transform methods, while the final estimator implements fit. The transformers in the pipeline can be cached using the memory argument.
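As a minimal sketch of the idea (toy data here, not the census data used below), an intermediate scaler followed by a final logistic regression:
# A minimal, self-contained sketch of a two-step pipeline on toy data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
X_toy = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 210.0]])
y_toy = np.array([0, 0, 1, 1])
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),   # intermediate step: implements fit and transform
    ('clf', LogisticRegression())   # final estimator: implements fit
])
pipe.fit(X_toy, y_toy)              # scaler.fit_transform runs first, then clf.fit
print(pipe.predict(X_toy))          # the same scaling is reapplied before prediction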
#!pip install category-encoders
#!pip install scipy
#!pip install --upgrade pip --user
#!pip install scipy==1.2 --upgrade --user
#!pip install sklearn2pmml
#!pip install numpy
#!pip install sklearn
#!pip install pandas
%%bash
pip install sklearn2pmml
pip install -q git+https://github.com/gmihaila/ml_things.git
pip install --upgrade scikit-learn
Requirement already satisfied: sklearn2pmml in /usr/local/lib/python3.6/dist-packages (0.61.0)
Requirement already satisfied: sklearn-pandas>=0.0.10 in /usr/local/lib/python3.6/dist-packages (from sklearn2pmml) (1.8.0)
Requirement already satisfied: scikit-learn>=0.18.0 in /usr/local/lib/python3.6/dist-packages (from sklearn2pmml) (0.23.2)
Requirement already satisfied: joblib>=0.13.0 in /usr/local/lib/python3.6/dist-packages (from sklearn2pmml) (0.16.0)
Requirement already satisfied: pandas>=0.11.0 in /usr/local/lib/python3.6/dist-packages (from sklearn-pandas>=0.0.10->sklearn2pmml) (1.1.2)
Requirement already satisfied: numpy>=1.6.1 in /usr/local/lib/python3.6/dist-packages (from sklearn-pandas>=0.0.10->sklearn2pmml) (1.18.5)
Requirement already satisfied: scipy>=0.14 in /usr/local/lib/python3.6/dist-packages (from sklearn-pandas>=0.0.10->sklearn2pmml) (1.4.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.18.0->sklearn2pmml) (2.1.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.11.0->sklearn-pandas>=0.0.10->sklearn2pmml) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.11.0->sklearn-pandas>=0.0.10->sklearn2pmml) (2.8.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.7.3->pandas>=0.11.0->sklearn-pandas>=0.0.10->sklearn2pmml) (1.15.0)
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, classification_report, recall_score
import pandas_profiling
from sklearn.impute import SimpleImputer
Imputer = SimpleImputer(fill_value=-9999999, strategy='constant')
import pandas as pd
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
import xgboost as xgb
from joblib import dump, load
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import plot_confusion_matrix
import numpy as np
from sklearn.metrics import classification_report, accuracy_score
from ml_things import plot_dict, plot_confusion_matrix, fix_text  # note: shadows sklearn's plot_confusion_matrix imported above
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns', None) # or 1000
pd.set_option('display.max_rows', None) # or 1000
pd.set_option('display.max_colwidth', None) # or 19
%autosave 5
Autosaving every 5 seconds
import numpy as np
import tensorflow as tf
from tensorflow import keras
#from tensorflow.keras.preprocessing.image import image_dataset_from_directory
from keras import layers
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
mpl.rcParams['figure.figsize'] = (12, 10)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
%autosave 5
print(tf.__version__)
Autosaving every 5 seconds
2.3.0
from google.colab import drive
import glob
drive.mount('/content/drive', force_remount=True)
Mounted at /content/drive
The dataset is hosted on the UCI Machine Learning Repository and can be downloaded from this link: data. The task is to predict whether income exceeds $50K/yr based on census data. A description of the dataset is below. It consists of 14 features, both numeric and categorical.
- Income: >50K, <=50K. Target variable for income at or below $50,000 versus above $50,000.
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
header=['age',
'workclas',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'Income']
len(header)
15
# Note: adult.data has no header row; with header=0 the first record is consumed as the header.
# Passing header=None would keep that first record as data.
data= pd.read_csv('/content/drive/My Drive/Data/adult.data',header=0,names=header)
data.shape
(32560, 15)
data.head()
age | workclas | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | Income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
#data['income'] = [1 if i ==' >50K' else 0 for i in data.Income]
#from sklearn.preprocessing import LabelEncoder
#labelencoder= LabelEncoder() #initializing an object of class LabelEncoder
#data['Income'] = labelencoder.fit_transform(data['Income']) #fitting and transforming the desired categorical column.
print(data.Income.value_counts())
print(data.Income.value_counts(normalize=True))
<=50K 24719
>50K 7841
Name: Income, dtype: int64
<=50K 0.759183
>50K 0.240817
Name: Income, dtype: float64
#def do_the_job(x):
# ref=[]
# if (x ==' <=50K'):
# ret = 0
# elif (x == ' >50K'):
# ret =1
# return ret
#data['income'] = data['Income'].apply(do_the_job)
#data['native-country'].value_counts()
#data[data[data['native-country']==' Holand-Netherlands']['native-country']=='Germany']
#data[data['native-country']==' Holand-Netherlands']['native-country']
#data[data['native-country']==' Germany']['native-country'].tolist()
#data['native-country'] = data['native-country'].str.replace(' Holand-Netherlands' ,' Germany')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32560 non-null int64
1 workclas 32560 non-null object
2 fnlwgt 32560 non-null int64
3 education 32560 non-null object
4 education-num 32560 non-null int64
5 marital-status 32560 non-null object
6 occupation 32560 non-null object
7 relationship 32560 non-null object
8 race 32560 non-null object
9 sex 32560 non-null object
10 capital-gain 32560 non-null int64
11 capital-loss 32560 non-null int64
12 hours-per-week 32560 non-null int64
13 native-country 32560 non-null object
14 Income 32560 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
data.describe()
age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | |
---|---|---|---|---|---|---|
count | 32560.000000 | 3.256000e+04 | 32560.000000 | 32560.000000 | 32560.000000 | 32560.000000 |
mean | 38.581634 | 1.897818e+05 | 10.080590 | 1077.615172 | 87.306511 | 40.437469 |
std | 13.640642 | 1.055498e+05 | 2.572709 | 7385.402999 | 402.966116 | 12.347618 |
min | 17.000000 | 1.228500e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 28.000000 | 1.178315e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
50% | 37.000000 | 1.783630e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
75% | 48.000000 | 2.370545e+05 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
max | 90.000000 | 1.484705e+06 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
neg, pos = data.Income.value_counts()
total = neg + pos
print('Examples:\n Total: {}\n >50K: {} ({:.2f}% of total)\n '.format(
    total, pos, 100 * pos / total))
print('Total: {}\n <=50K: {} ({:.2f}% of total)\n '.format(
    total, neg, 100 * neg / total))
Examples:
Total: 32560
>50K: 7841 (24.08% of total)
Total: 32560
<=50K: 24719 (75.92% of total)
Check for missing values in the dataframe.
data.isnull().sum(axis=0)
# let df be your dataframe and x be the value you want to fill it with
#heartdata.fillna(-9999.01)
age 0
workclas 0
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 0
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
Income 0
dtype: int64
Remove duplicated columns if any.
data = data.loc[:,~data.columns.duplicated()]
Drop all duplicate rows if any.
data.drop_duplicates(keep=False, inplace=True)
Remove Correlated Features
from tqdm import tqdm
from scipy import stats
from scipy.stats import chi2_contingency
import sys #threshold=0.95
#threshold=0.95
def Remove_correlation(data, threshold):
    '''
    Find highly correlated numeric columns of a dataframe.
    Returns the names of the columns removed and the reduced dataframe
    with those columns dropped.
    '''
    # Create absolute correlation matrix
    corr_matrix = data.corr().abs()
    # Select upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Find features with correlation greater than the threshold
    drop_columns = [column for column in upper.columns if any(upper[column] > threshold)]
    # Drop those features
    red_data = data.drop(drop_columns, axis=1)
    return drop_columns, red_data
drop_columns, data = Remove_correlation(data, threshold=0.95)
print('Features Removed : {}'.format(drop_columns))
print('shape of dataframe {}'.format(data.shape))
Features Removed : []
shape of dataframe (32560, 15)
No correlated features exist in the dataset.
Check the distribution of the target variable.
import seaborn as sns
sns.set_theme(style="darkgrid")
#titanic = sns.load_dataset("titanic")
ax = sns.countplot(x="Income", data=data,palette="Set3")
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('frequency of Income ')
plt.show()
# get the target, separate target and time from predictors
y = data.Income
X= data.drop(['Income'], axis=1, inplace=False)
#X= heartdata.drop(['DEATH_EVENT', 'time'], axis=1, inplace=True)
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32560 non-null int64
1 workclas 32560 non-null object
2 fnlwgt 32560 non-null int64
3 education 32560 non-null object
4 education-num 32560 non-null int64
5 marital-status 32560 non-null object
6 occupation 32560 non-null object
7 relationship 32560 non-null object
8 race 32560 non-null object
9 sex 32560 non-null object
10 capital-gain 32560 non-null int64
11 capital-loss 32560 non-null int64
12 hours-per-week 32560 non-null int64
13 native-country 32560 non-null object
dtypes: int64(6), object(8)
memory usage: 3.5+ MB
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.75, test_size=0.25,random_state=0)
# Display results
print ("Shapes:")
print ("X: {}".format(X.shape))
print ("y: {}".format(y.shape))
print()
print ("X_train: {}".format(X_train.shape))
print ("X_valid: {}".format(X_valid.shape))
print ("y_train: {}".format(y_train.shape))
print ("y_valid: {}\n".format(y_valid.shape))
Shapes:
X: (32560, 14)
y: (32560,)
X_train: (24420, 14)
X_valid: (8140, 14)
y_train: (24420,)
y_valid: (8140,)
Select the numeric and categorical features. The categorical feature processing includes one-hot encoding.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
##
#numerical_features=heartdata.select_dtypes(exclude=['object','category']).drop('DEATH_EVENT',axis=1).columns.tolist()
# Select numeric columns
numeric_features = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]
numeric_features
['age',
'fnlwgt',
'education-num',
'capital-gain',
'capital-loss',
'hours-per-week']
categorical_features=X.select_dtypes(include=['object','category']).columns.tolist()
categorical_features
['workclas',
'education',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'native-country']
#R_test = pd.read_csv('C:\\Users\\admin1\\Desktop\\WORK\\R_test_M_LIN.csv',delimiter="\t")
#R_train = pd.read_csv('C:\\Users\\admin1\\Desktop\\WORK\\R_train_M_LIN.csv',delimiter="\t")
Impute categorical features by filling with a constant.
categorical_transformer = Pipeline(steps=[
    # Note: imputation is usually placed before encoding; the reverse order here
    # only works because this dataset has no missing categorical values.
    #('encoding',LabelEncoder()),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
#('encoding',MyLabelEncoder()),
#('encoding',MultiColumnLabelEncoder())
('imputer', SimpleImputer(fill_value= -9999.01, strategy='constant'))
#('encoding',OrdinalEncoder(categories='auto')),
#
])
Missing value imputation by filling with a constant number.
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(fill_value=-9999999, strategy='constant')),
('scaler', StandardScaler())
])
xgb_classifier = xgb.XGBClassifier(
n_jobs = -1,
max_depth = 6,
#learning_rate= 0.1,
min_child_weight= 2,
#min_samples_split= 0.9,
n_estimators= 100,
eta = 0.1,
verbose = 1,
gamma=0.05,
#nrounds = 100
objective = "binary:logistic",
eval_metric = "auc", #"aucpr", # "aucpr", #aucpr, auc
subsample = 0.7,
colsample_bytree =0.8,
max_delta_step=1,
verbosity=1,
tree_method='approx')
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
# Append classifier to preprocessing pipeline.
pipeline_f = Pipeline(steps=[('preprocessor', preprocessor),
#('dropfeature',UniqueDropper()),
#('anova', SelectPercentile(chi2)),
# ('feature_selection', SelectFromModel(LinearSVC(penalty="l1"))),
('classifier', xgb_classifier)])
xgb_model = pipeline_f.fit(X_train, y_train)
print("model score: %.3f" % pipeline_f.score(X_valid, y_valid))
model score: 0.868
# Get predictions
preds = pipeline_f.predict(X_valid)
# Evaluate the model
score = accuracy_score(y_valid,preds)
# Display the result
print("Accuracy Score: {}".format(score))
pred_prob = pipeline_f.predict_proba(X_valid)
print(pred_prob)
Accuracy Score: 0.8681818181818182
[[0.7959137 0.20408632]
[0.9082163 0.09178371]
[0.9838083 0.01619174]
...
[0.90567094 0.09432907]
[0.9148087 0.08519129]
[0.9606085 0.0393915 ]]
#pipe.predict(X_valid)
#y_valid
y_val = [1 if i ==' >50K' else 0 for i in y_valid]
y_pred = [1 if i ==' >50K' else 0 for i in preds]
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_recall_fscore_support
from numpy import trapz
from scipy.integrate import simps
from sklearn.metrics import f1_score
def Evaluate(y_val, y_pred):
    ''' Compute the confusion matrix, ROC AUC, precision, recall, F1 score,
    and the area under the precision-recall curve. '''
CM = confusion_matrix(y_val, y_pred)
TN = CM[0][0]
FN = CM[1][0]
TP = CM[1][1]
FP = CM[0][1]
print(' (True Negatives): {}'.format(TN))
print(' (False Negatives): {}'.format(FN))
print(' (True Positives): {}'.format(TP))
print(' (False Positives):{}'.format(FP))
print('Total Number from positive class from Confusion Matrix : ', np.sum(CM[1]))
auc = roc_auc_score(y_val, y_pred)
prec=precision_score(y_val, y_pred)
rec=recall_score(y_val, y_pred)
# calculate F1 score
f1 = f1_score(y_val, y_pred)
print('auc :{}'.format(auc))
print('precision :{}'.format(prec))
print('recall :{}'.format(rec))
print('f1 :{}'.format(f1))
# Compute Precision-Recall and plot curve
precision, recall, thresholds = precision_recall_curve(y_val, y_pred)
    # use the trapezoidal rule to calculate the area under the precision-recall curve
area = trapz(recall, precision)
#area = simps(recall, precision)
print("Area Under Precision Recall Curve(AP): %0.4f" % area)
Evaluate(y_val, y_pred)
(True Negatives): 5818
(False Negatives): 698
(True Positives): 1249
(False Positives):375
Total Number from positive class from Confusion Matrix : 1947
auc :0.7904737533993637
precision :0.7690886699507389
recall :0.6414997431946584
f1 :0.6995239428731448
Area Under Precision Recall Curve(AP): 0.5090
# Create the evaluation report.
#evaluation_report = classification_report(y_val, y_pred, labels=['<=50K','>50K'], target_names=['<=50K','>50K'])
evaluation_report = classification_report(y_valid, preds )
# Show the evaluation report.
print(evaluation_report)
precision recall f1-score support
<=50K 0.89 0.94 0.92 6193
>50K 0.77 0.64 0.70 1947
accuracy 0.87 8140
macro avg 0.83 0.79 0.81 8140
weighted avg 0.86 0.87 0.86 8140
from ml_things import plot_dict, plot_confusion_matrix, fix_text
# Plot confusion matrix.
# <=50K 24719
# >50K 7841
# Plot confusion matrix.
plot_confusion_matrix(y_true=y_val, y_pred=y_pred,
classes=['<=50K','>50K'], normalize=False,
magnify=1,
);
plot_confusion_matrix(y_true=y_val, y_pred=y_pred,
classes=['<=50K','>50K'], normalize=True,
magnify=1,
);
#y_true=y_val, y_pred=y_pred,
Confusion matrix, without normalization
Normalized confusion matrix
The intermediate steps in a Pipeline can be accessed by using the “named_steps” attribute:
pipe.named_steps.STEP_NAME.ATTRIBUTE
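For example, a nested transformer such as the fitted OneHotEncoder can be reached by chaining named_steps with the ColumnTransformer's named_transformers_ (a sketch using the step names defined above):
# Drill into the fitted pipeline using the step names defined above.
onehot = (pipeline_f
          .named_steps['preprocessor']        # the ColumnTransformer
          .named_transformers_['cat']         # the categorical sub-pipeline
          .named_steps['onehot'])             # the fitted OneHotEncoder
print(len(onehot.get_feature_names()))        # number of one-hot encoded columns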
#xgb_no_weights_pred = xgb_no_weights.predict_proba(test_x)
#xgb_pred = xgb_model.predict_proba(test_set)[:,1]
#joblib.dump(xgb_model, 'C:\\Users\\admin1\\Desktop\\WORK\\xgbmodel.joblib')
#xgb_model = joblib.load( '/content/drive/My Drive/ImbalancedData/xgb_no_weights.joblib')
#print(pipe.steps[1])
print(pipeline_f.steps[1][1].feature_importances_.shape)
print(pipeline_f.named_steps.classifier)
print(pipeline_f.named_steps.preprocessor)
(108,)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.8, eta=0.1,
eval_metric='auc', gamma=0.05, learning_rate=0.1,
max_delta_step=1, max_depth=6, min_child_weight=2, missing=None,
n_estimators=100, n_jobs=-1, nthread=-1,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=1,
subsample=0.7, tree_method='approx', verbose=1, verbosity=1)
ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=-9999999,
missing_values=nan,
strategy='constant',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True))],
verbose=False),...
OneHotEncoder(categories='auto',
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='ignore',
sparse=True)),
('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=-9999.01,
missing_values=nan,
strategy='constant',
verbose=0))],
verbose=False),
['workclas', 'education', 'marital-status',
'occupation', 'relationship', 'race', 'sex',
'native-country'])],
verbose=False)
preprocessor.fit(X_train)
#preprocessor.get_feature_names()
ColumnTransformer(transformers=[('num', Pipeline(steps=[('imputer', SimpleImputer(fill_value=-9999999, strategy='constant')), ('scaler', StandardScaler())]), ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']), ('cat', Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore')), ('imputer', SimpleImputer(fill_value=-9999.01, strategy='constant'))]), ['workclas', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
SimpleImputer(fill_value=-9999999, strategy='constant')
StandardScaler()
['workclas', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
OneHotEncoder(handle_unknown='ignore')
SimpleImputer(fill_value=-9999.01, strategy='constant')
from sklearn.compose import ColumnTransformer
#preprocessor.get_feature_names()
#pipeline_f.get_feature_names()
#pipeline_f["pipeline_f"]["pipeline_f"][-1]
preprocessor.transform(X).toarray()
array([[ 0.83962322, -1.01883977, 1.13061257, ..., 1. ,
0. , 0. ],
[-0.04263147, 0.24414566, -0.41852807, ..., 1. ,
0. , 0. ],
[ 1.06018689, 0.4261946 , -1.19309838, ..., 1. ,
0. , 0. ],
...,
[ 1.42779301, -0.36414118, -0.41852807, ..., 1. ,
0. , 0. ],
[-1.21897105, 0.10904291, -0.41852807, ..., 1. ,
0. , 0. ],
[ 0.98666566, 0.93398469, -0.41852807, ..., 1. ,
0. , 0. ]])
The pipeline's HTML representation can be exported/saved to a file as follows:
#!pip install scikit-learn==0.23.2
from sklearn.utils import estimator_html_repr
with open('/content/drive/My Drive/Data/pipeline.html','w') as f:
f.write(estimator_html_repr(pipeline_f))
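The fitted pipeline itself can also be persisted and reloaded with joblib (a sketch; the file name below is illustrative):
# Persist and reload the fitted pipeline with joblib (illustrative file name).
from joblib import dump, load
dump(pipeline_f, '/content/drive/My Drive/Data/xgb_pipeline.joblib')
pipeline_loaded = load('/content/drive/My Drive/Data/xgb_pipeline.joblib')
print("reloaded model score: %.3f" % pipeline_loaded.score(X_valid, y_valid))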
The pipeline can also be visualized as a diagram, as shown below:
#plot pipeline
from sklearn import set_config
set_config(display='diagram')
# displays the HTML representation in a Jupyter context
pipeline_f
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num', Pipeline(steps=[('imputer', SimpleImputer(fill_value=-9999999, strategy='constant')), ('scaler', StandardScaler())]), ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']), ('cat', Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore')), ('imputer', SimpleImputer(fill_value=-9999.01, strategy='constant'))]), ['workclas', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])), ('classifier', XGBClassifier(colsample_bytree=0.8, eta=0.1, eval_metric='auc', gamma=0.05, max_delta_step=1, max_depth=6, min_child_weight=2, n_jobs=-1, subsample=0.7, tree_method='approx', verbose=1))])
ColumnTransformer(transformers=[('num', Pipeline(steps=[('imputer', SimpleImputer(fill_value=-9999999, strategy='constant')), ('scaler', StandardScaler())]), ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']), ('cat', Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore')), ('imputer', SimpleImputer(fill_value=-9999.01, strategy='constant'))]), ['workclas', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
SimpleImputer(fill_value=-9999999, strategy='constant')
StandardScaler()
['workclas', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
OneHotEncoder(handle_unknown='ignore')
SimpleImputer(fill_value=-9999.01, strategy='constant')
XGBClassifier(colsample_bytree=0.8, eta=0.1, eval_metric='auc', gamma=0.05, max_delta_step=1, max_depth=6, min_child_weight=2, n_jobs=-1, subsample=0.7, tree_method='approx', verbose=1)
Generate PMML from the pipeline using sklearn2pmml. The make_pmml_pipeline function translates a regular scikit-learn estimator or pipeline into a PMML pipeline.
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
pipeline = make_pmml_pipeline(
pipeline_f
#active_fields = ["x1", "x2", ...], optional Feature name
#target_fields = ["y"] optional target name
)
sklearn2pmml(pipeline, "/content/drive/My Drive/Data/xgbmodel.pmml")
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:213: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
FutureWarning)
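The optional active_fields and target_fields arguments of make_pmml_pipeline can be used to name the input and target fields in the generated PMML. A sketch (the output file name is illustrative):
# Optionally name the PMML input and target fields explicitly
# (the output file name here is illustrative).
pipeline_named = make_pmml_pipeline(
    pipeline_f,
    active_fields = X_train.columns.tolist(),   # feature names
    target_fields = ["Income"]                  # target name
)
sklearn2pmml(pipeline_named, "/content/drive/My Drive/Data/xgbmodel_named.pmml")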
Equivalently, the PMML can be generated by building the pipeline directly with the PMMLPipeline class from sklearn2pmml.
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn_pandas import DataFrameMapper
# it takes a list of tuples as parameter
pipeline = PMMLPipeline(
[('preprocessor', preprocessor),
#('dropfeature',UniqueDropper()),
#('anova', SelectPercentile(chi2))
# ("feat_sel", SelectKBest(10)),
('classifier', xgb_classifier)]
)
# the pipeline object is used similar to how a regular classifier is used in scikit-learn.
pipeline.fit(X=X_train, y=y_train)
from sklearn2pmml import sklearn2pmml
#pipeline = sklearn2pmml.make_pmml_pipeline(pipeline)
sklearn2pmml(pipeline, "/content/drive/My Drive/Data/xgbmodel.pmml", with_repr = True)
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:197: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
FutureWarning)
Adding Hyperparameter Tuning to Pipeline
The steps used above to create a PMML file from the pipeline do not include hyperparameter tuning. The steps below show how to include hyperparameter tuning in the pipeline.
#help(make_pmml_pipeline)
#help(sklearn2pmml)
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
#pipeline = PMMLPipeline([...])
#tuner = RandomizedSearchCV(pipeline, param_grid = {...})
#tuner.fit(X_train, y_train)
# GridSearchCV.best_estimator_ is of type PMMLPipeline
#sklearn2pmml(tuner.best_estimator_, "/content/drive/My Drive/Data/xgbmodel.pmml")
Equivalently, the best estimator from the search can be wrapped with the make_pmml_pipeline function, as shown below.
pipeline.fit(X=X_train, y=y_train)
PMMLPipeline(steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=-9999999,
missing_values=nan,
strategy='constant',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True))],
verbose=False),...
OneHotEncoder(categories='auto',
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='error',
sparse=True)),
('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=-9999.01,
missing_values=nan,
strategy='constant',
verbose=0))],
verbose=False),
['workclas', 'education', 'marital-status',
'occupation', 'relationship', 'race', 'sex',
'native-country'])],
verbose=False)),
('classifier', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.8, eta=0.1,
eval_metric='auc', gamma=0.05, learning_rate=0.1,
max_delta_step=1, max_depth=6, min_child_weight=2, missing=None,
n_estimators=100, n_jobs=-1, nthread=-1,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=1,
subsample=0.7, tree_method='approx', verbose=1, verbosity=1))])
# Define model parameters for grid search
search_spaces = {'classifier__learning_rate': [0.1],
'classifier__n_estimators': [100],
'classifier__max_depth': [ 5, 6],
'classifier__min_child_weight': [1, 2],
'classifier__gamma': [0.05],
'classifier__subsample': [ 0.70, 0.80],
'classifier__colsample_bytree': [ 0.70, 0.80, 0.90],
'classifier__random_state' : [10]}
#RandomizedSearchCV.best_estimator_ is of type PMMLPipeline
tuner = RandomizedSearchCV(pipeline_f, search_spaces, cv=5, n_jobs=-1,scoring='roc_auc')
tuner.fit(X=X_train, y=y_train)
sklearn2pmml(make_pmml_pipeline(tuner.best_estimator_),
"/content/drive/My Drive/Data/xgbmodel.pmml")
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:197: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
FutureWarning)
We can look at the best parameter values from the hyperparameter search.
tuner.best_estimator_.get_params
<bound method Pipeline.get_params of Pipeline(memory=None,
steps=[('preprocessor',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=-9999999,
missing_values=nan,
strategy='constant',
verbose=0)),
('scaler',
StandardScaler(copy=True,
wit...
colsample_bytree=0.8, eta=0.1, eval_metric='auc',
gamma=0.05, learning_rate=0.1, max_delta_step=1,
max_depth=6, min_child_weight=2, missing=None,
n_estimators=100, n_jobs=-1, nthread=-1,
objective='binary:logistic', random_state=10,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=None, silent=1, subsample=0.7,
tree_method='approx', verbose=1, verbosity=1))],
verbose=False)>
Using the second pipeline (the PMMLPipeline) with RandomizedSearchCV to generate PMML is outlined in the commented template below; a runnable sketch follows it.
#import matplotlib.pyplot as plt
#from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
#tuner = RandomizedSearchCV(pipeline, search_spaces)
#tuner.fit(X=X_train, y=y_train)
# GridSearchCV.best_estimator_ is of type Pipeline
#sklearn2pmml(make_pmml_pipeline(tuner.best_estimator_),
#active_fields =
#[...], target_fields = [...]),
# "/content/drive/My Drive/Data/xgbmodel.pmml"
#)
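A runnable sketch of that pattern, reusing the PMMLPipeline (pipeline) and the search_spaces grid defined above (the output file name is illustrative):
# Tune the PMMLPipeline directly and export the best estimator
# (reuses `pipeline` and `search_spaces` from above; output file name is illustrative).
tuner_pmml = RandomizedSearchCV(pipeline, search_spaces, cv=5, n_jobs=-1, scoring='roc_auc')
tuner_pmml.fit(X=X_train, y=y_train)
# best_estimator_ is itself a PMMLPipeline, so it can be exported directly.
sklearn2pmml(tuner_pmml.best_estimator_, "/content/drive/My Drive/Data/xgbmodel_tuned.pmml")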
print("Best parameter (CV score=%0.3f):" % tuner.best_score_)
print('Best Hyperparameter values :{}'.format(tuner.best_params_))
Best parameter (CV score=0.925):
Best Hyperparameter values :{'classifier__subsample': 0.8, 'classifier__random_state': 10, 'classifier__n_estimators': 100, 'classifier__min_child_weight': 1, 'classifier__max_depth': 6, 'classifier__learning_rate': 0.1, 'classifier__gamma': 0.05, 'classifier__colsample_bytree': 0.7}
#fig, ax = plt.subplots(figsize=(12,18))
#xgb.plot_importance(xgb_model, max_num_features=50, height=0.8, ax=ax)
#plt.show()
After the PMML is created, the accuracy of predictions from the PMML file can be compared with that from scikit-learn by using the pypmml library. The library allows you to read a PMML file and make predictions on test data.
#!pip install pypmml
!ls '/content/drive/My Drive/Data/xgbmodel.pmml'
'/content/drive/My Drive/Data/xgbmodel.pmml'
"""
import pmml file into python
"""
from pypmml import Model
import pandas as pd
model = Model.fromFile("/content/drive/My Drive/Data/xgbmodel.pmml")
Use the preprocessor from the pipeline to prepare the test data for prediction
#
test_data = preprocessor.transform(X_valid).toarray()
#Prediction
new_pred = model.predict(test_data)
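To compare the two, the pypmml output can be inspected next to the probabilities from the scikit-learn pipeline. A sketch (the column names returned by pypmml depend on how the output fields are named in the PMML, so they are not hard-coded here):
# Inspect the pypmml output next to the scikit-learn probabilities.
# The pypmml result columns depend on the output field names in the PMML,
# so list them rather than hard-coding a specific column.
print(new_pred.columns.tolist() if hasattr(new_pred, 'columns') else type(new_pred))
# Probabilities for the positive class from the scikit-learn pipeline, for comparison.
sk_prob = pipeline_f.predict_proba(X_valid)[:, 1]
print(sk_prob[:5])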