Automated Machine Learning
Automated Machine Learning (AutoML) has greatly increased the efficiency of building machine learning models. AutoML achieves this by automating, in some applications, data pre-processing, feature engineering, feature extraction, feature selection and hyper-parameter tuning when building machine learning models. AutoML has also reduced the expertise in computer science, statistics and other academic domains hitherto needed to build machine learning models. Repetitive tasks that do not need any machine learning expertise can easily be automated by AutoML. AutoML does have drawbacks, including the amount of time spent in training: since AutoML trains many models and selects the best-performing model as the final model, large computational resources are needed to keep training time reasonable.
Data
The Appliances energy prediction dataset from the UCI Machine Learning Repository will be used for this exercise. A further description of the data, and the data itself, is available here. The goal is to predict home appliance energy use in Wh from several features, including the temperature and humidity in various rooms of the house.
Attribute Information:
- date time year-month-day hour:minute:second
- Appliances, energy use in Wh
- lights, energy use of light fixtures in the house, in Wh
- T1, Temperature in kitchen area, in Celsius
- RH_1, Humidity in kitchen area, in %
- T2, Temperature in living room area, in Celsius
- RH_2, Humidity in living room area, in %
- T3, Temperature in laundry room area, in Celsius
- RH_3, Humidity in laundry room area, in %
- T4, Temperature in office room, in Celsius
- RH_4, Humidity in office room, in %
- T5, Temperature in bathroom, in Celsius
- RH_5, Humidity in bathroom, in %
- T6, Temperature outside the building (north side), in Celsius
- RH_6, Humidity outside the building (north side), in %
- T7, Temperature in ironing room, in Celsius
- RH_7, Humidity in ironing room, in %
- T8, Temperature in teenager room 2, in Celsius
- RH_8, Humidity in teenager room 2, in %
- T9, Temperature in parents room, in Celsius
- RH_9, Humidity in parents room, in %
- To, Temperature outside (from Chievres weather station), in Celsius
- Pressure (from Chievres weather station), in mm Hg
- RH_out, Humidity outside (from Chievres weather station), in %
- Wind speed (from Chievres weather station), in m/s
- Visibility (from Chievres weather station), in km
- Tdewpoint (from Chievres weather station), °C
- rv1, Random variable 1, nondimensional
- rv2, Random variable 2, nondimensional
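The cells below upload the CSV through Colab's file widget. Purely as an alternative sketch (assuming the UCI mirror path below is still current), the file can also be read directly from the repository:

import pandas as pd

# Hypothetical alternative to the Colab upload widget used below;
# the UCI mirror URL may change over time.
UCI_URL = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
           '00374/energydata_complete.csv')
energy_data = pd.read_csv(UCI_URL, parse_dates=['date'])
print(energy_data.shape)  # expect (19735, 29)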
import numpy as np
import pandas as pd
from google.colab import files
import io
from sklearn.model_selection import train_test_split
import rpy2
%load_ext rpy2.ipython
The rpy2.ipython extension is already loaded. To reload it, use:
%reload_ext rpy2.ipython
uploaded = files.upload()
<input type="file" id="files-5333c190-8c44-4f40-82a7-fbb1943d7fea" name="files[]" multiple disabled />
<output id="result-5333c190-8c44-4f40-82a7-fbb1943d7fea">
Upload widget is only available when the cell has been executed in the
current browser session. Please rerun this cell to enable.
</output>
<script src="/nbextensions/google.colab/files.js"></script>
Saving energydata_complete.csv to energydata_complete.csv
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
User uploaded file "energydata_complete.csv" with length 11979363 bytes
dateparse = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S')  # pd.datetime is deprecated
energy_data=pd.read_csv(io.StringIO(uploaded['energydata_complete.csv'].decode('utf-8')),
parse_dates=['date'],
date_parser=dateparse)
energy_data.head()
date | Appliances | lights | T1 | RH_1 | T2 | RH_2 | T3 | RH_3 | T4 | RH_4 | T5 | RH_5 | T6 | RH_6 | T7 | RH_7 | T8 | RH_8 | T9 | RH_9 | T_out | Press_mm_hg | RH_out | Windspeed | Visibility | Tdewpoint | rv1 | rv2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-11 17:00:00 | 60 | 30 | 19.89 | 47.596667 | 19.2 | 44.790000 | 19.79 | 44.730000 | 19.000000 | 45.566667 | 17.166667 | 55.20 | 7.026667 | 84.256667 | 17.200000 | 41.626667 | 18.2 | 48.900000 | 17.033333 | 45.53 | 6.600000 | 733.5 | 92.0 | 7.000000 | 63.000000 | 5.3 | 13.275433 | 13.275433 |
1 | 2016-01-11 17:10:00 | 60 | 30 | 19.89 | 46.693333 | 19.2 | 44.722500 | 19.79 | 44.790000 | 19.000000 | 45.992500 | 17.166667 | 55.20 | 6.833333 | 84.063333 | 17.200000 | 41.560000 | 18.2 | 48.863333 | 17.066667 | 45.56 | 6.483333 | 733.6 | 92.0 | 6.666667 | 59.166667 | 5.2 | 18.606195 | 18.606195 |
2 | 2016-01-11 17:20:00 | 50 | 30 | 19.89 | 46.300000 | 19.2 | 44.626667 | 19.79 | 44.933333 | 18.926667 | 45.890000 | 17.166667 | 55.09 | 6.560000 | 83.156667 | 17.200000 | 41.433333 | 18.2 | 48.730000 | 17.000000 | 45.50 | 6.366667 | 733.7 | 92.0 | 6.333333 | 55.333333 | 5.1 | 28.642668 | 28.642668 |
3 | 2016-01-11 17:30:00 | 50 | 40 | 19.89 | 46.066667 | 19.2 | 44.590000 | 19.79 | 45.000000 | 18.890000 | 45.723333 | 17.166667 | 55.09 | 6.433333 | 83.423333 | 17.133333 | 41.290000 | 18.1 | 48.590000 | 17.000000 | 45.40 | 6.250000 | 733.8 | 92.0 | 6.000000 | 51.500000 | 5.0 | 45.410389 | 45.410389 |
4 | 2016-01-11 17:40:00 | 60 | 40 | 19.89 | 46.333333 | 19.2 | 44.530000 | 19.79 | 45.000000 | 18.890000 | 45.530000 | 17.200000 | 55.09 | 6.366667 | 84.893333 | 17.200000 | 41.230000 | 18.1 | 48.590000 | 17.000000 | 45.40 | 6.133333 | 733.9 | 92.0 | 5.666667 | 47.666667 | 4.9 | 10.084097 | 10.084097 |
energy_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
date 19735 non-null datetime64[ns]
Appliances 19735 non-null int64
lights 19735 non-null int64
T1 19735 non-null float64
RH_1 19735 non-null float64
T2 19735 non-null float64
RH_2 19735 non-null float64
T3 19735 non-null float64
RH_3 19735 non-null float64
T4 19735 non-null float64
RH_4 19735 non-null float64
T5 19735 non-null float64
RH_5 19735 non-null float64
T6 19735 non-null float64
RH_6 19735 non-null float64
T7 19735 non-null float64
RH_7 19735 non-null float64
T8 19735 non-null float64
RH_8 19735 non-null float64
T9 19735 non-null float64
RH_9 19735 non-null float64
T_out 19735 non-null float64
Press_mm_hg 19735 non-null float64
RH_out 19735 non-null float64
Windspeed 19735 non-null float64
Visibility 19735 non-null float64
Tdewpoint 19735 non-null float64
rv1 19735 non-null float64
rv2 19735 non-null float64
dtypes: datetime64[ns](1), float64(26), int64(2)
memory usage: 4.4 MB
energy_data.describe().transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Appliances | 19735.0 | 97.694958 | 102.524891 | 10.000000 | 50.000000 | 60.000000 | 100.000000 | 1080.000000 |
lights | 19735.0 | 3.801875 | 7.935988 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 70.000000 |
T1 | 19735.0 | 21.686571 | 1.606066 | 16.790000 | 20.760000 | 21.600000 | 22.600000 | 26.260000 |
RH_1 | 19735.0 | 40.259739 | 3.979299 | 27.023333 | 37.333333 | 39.656667 | 43.066667 | 63.360000 |
T2 | 19735.0 | 20.341219 | 2.192974 | 16.100000 | 18.790000 | 20.000000 | 21.500000 | 29.856667 |
RH_2 | 19735.0 | 40.420420 | 4.069813 | 20.463333 | 37.900000 | 40.500000 | 43.260000 | 56.026667 |
T3 | 19735.0 | 22.267611 | 2.006111 | 17.200000 | 20.790000 | 22.100000 | 23.290000 | 29.236000 |
RH_3 | 19735.0 | 39.242500 | 3.254576 | 28.766667 | 36.900000 | 38.530000 | 41.760000 | 50.163333 |
T4 | 19735.0 | 20.855335 | 2.042884 | 15.100000 | 19.530000 | 20.666667 | 22.100000 | 26.200000 |
RH_4 | 19735.0 | 39.026904 | 4.341321 | 27.660000 | 35.530000 | 38.400000 | 42.156667 | 51.090000 |
T5 | 19735.0 | 19.592106 | 1.844623 | 15.330000 | 18.277500 | 19.390000 | 20.619643 | 25.795000 |
RH_5 | 19735.0 | 50.949283 | 9.022034 | 29.815000 | 45.400000 | 49.090000 | 53.663333 | 96.321667 |
T6 | 19735.0 | 7.910939 | 6.090347 | -6.065000 | 3.626667 | 7.300000 | 11.256000 | 28.290000 |
RH_6 | 19735.0 | 54.609083 | 31.149806 | 1.000000 | 30.025000 | 55.290000 | 83.226667 | 99.900000 |
T7 | 19735.0 | 20.267106 | 2.109993 | 15.390000 | 18.700000 | 20.033333 | 21.600000 | 26.000000 |
RH_7 | 19735.0 | 35.388200 | 5.114208 | 23.200000 | 31.500000 | 34.863333 | 39.000000 | 51.400000 |
T8 | 19735.0 | 22.029107 | 1.956162 | 16.306667 | 20.790000 | 22.100000 | 23.390000 | 27.230000 |
RH_8 | 19735.0 | 42.936165 | 5.224361 | 29.600000 | 39.066667 | 42.375000 | 46.536000 | 58.780000 |
T9 | 19735.0 | 19.485828 | 2.014712 | 14.890000 | 18.000000 | 19.390000 | 20.600000 | 24.500000 |
RH_9 | 19735.0 | 41.552401 | 4.151497 | 29.166667 | 38.500000 | 40.900000 | 44.338095 | 53.326667 |
T_out | 19735.0 | 7.411665 | 5.317409 | -5.000000 | 3.666667 | 6.916667 | 10.408333 | 26.100000 |
Press_mm_hg | 19735.0 | 755.522602 | 7.399441 | 729.300000 | 750.933333 | 756.100000 | 760.933333 | 772.300000 |
RH_out | 19735.0 | 79.750418 | 14.901088 | 24.000000 | 70.333333 | 83.666667 | 91.666667 | 100.000000 |
Windspeed | 19735.0 | 4.039752 | 2.451221 | 0.000000 | 2.000000 | 3.666667 | 5.500000 | 14.000000 |
Visibility | 19735.0 | 38.330834 | 11.794719 | 1.000000 | 29.000000 | 40.000000 | 40.000000 | 66.000000 |
Tdewpoint | 19735.0 | 3.760707 | 4.194648 | -6.600000 | 0.900000 | 3.433333 | 6.566667 | 15.500000 |
rv1 | 19735.0 | 24.988033 | 14.496634 | 0.005322 | 12.497889 | 24.897653 | 37.583769 | 49.996530 |
rv2 | 19735.0 | 24.988033 | 14.496634 | 0.005322 | 12.497889 | 24.897653 | 37.583769 | 49.996530 |
We will perform some light feature engineering by creating additional time-based features from the date column, then drop the date column.
energy_data['hour'] = energy_data['date'].dt.hour
energy_data['day'] = energy_data['date'].dt.day
energy_data['month'] = energy_data['date'].dt.month
energy_data['week'] = energy_data['date'].dt.week
energy_data['weekday'] = energy_data['date'].dt.weekday
energy_data['quarter'] = energy_data['date'].dt.quarter
energy_data['year'] = energy_data['date'].dt.year
Energy_Data = energy_data.drop(['date'],axis=1)
x= Energy_Data.drop('Appliances',axis=1)
y= Energy_Data.Appliances
train_x, test_x, train_y, test_y = train_test_split(x, y,test_size=0.3, random_state=148)
Cloud AutoML
Google Cloud AutoML is among the earliest and most popular AutoML offerings around. Google's AutoML is, however, not open source, and you need a Google account and GCP credit to train AutoML models on Google Cloud. Google AutoML uses neural architecture search to find an optimal neural network architecture.
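No Cloud AutoML run is included in this notebook. Purely as a hedged sketch of the workflow, training on tabular data with the beta AutoML Tables Python client looked roughly like the following at the time of writing; the client class, method names, project and GCS path here are assumptions drawn from the beta samples, so check the current GCP documentation:

# Sketch only: assumes the google-cloud-automl package, an existing GCP
# project, and a CSV already staged in a GCS bucket (placeholder names).
from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project='my-project', region='us-central1')
dataset = client.create_dataset(dataset_display_name='energy_data')
client.import_data(
    dataset=dataset,
    gcs_input_uris='gs://my-bucket/energydata_complete.csv').result()
client.set_target_column(dataset=dataset, column_spec_display_name='Appliances')
operation = client.create_model(model_display_name='energy_automl',
                                dataset=dataset,
                                train_budget_milli_node_hours=1000)
model = operation.result()  # blocks until training finishes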
Tree-Based Pipeline Optimization Tool (TPOT)
TPOT is a Python Automated Machine Learning tool built on top of the scikit-learn ML library that optimizes machine learning pipelines using genetic programming. It automates the building of ML pipelines by combining a flexible expression-tree representation of pipelines with stochastic search algorithms such as genetic programming. TPOT explores thousands of possible pipelines, selects the best one for your data, and finally provides you with the Python code. My favorite thing about it is that when you export a model, you are actually exporting all the Python code you need to train that model. Both regression and classification models can be built, with TPOTRegressor and TPOTClassifier respectively.
!conda install numpy scipy scikit-learn pandas joblib
!pip install deap update_checker tqdm stopit
!pip install xgboost
!pip install "dask[delayed]" "dask[dataframe]" dask-ml "fsspec>=0.3.3"
!pip install scikit-mdr skrebate
!pip install tpot
!pip install h2o
/bin/bash: conda: command not found
Collecting deap
[?25l Downloading https://files.pythonhosted.org/packages/81/98/3166fb5cfa47bf516e73575a1515734fe3ce05292160db403ae542626b32/deap-1.3.0-cp36-cp36m-manylinux2010_x86_64.whl (151kB)
from tpot import TPOTClassifier
from tpot import TPOTRegressor
# create & fit a TPOT regressor
pipeline_optimizer = TPOTRegressor(generations=5,
                                   population_size=20,
                                   cv=5,
                                   random_state=42,
                                   #n_jobs=-1,
                                   scoring='neg_mean_squared_error',  # or 'r2', 'neg_mean_absolute_error'
                                   max_time_mins=60,
                                   max_eval_time_mins=20,
                                   verbosity=2)
pipeline_optimizer.fit(train_x, train_y)
print(pipeline_optimizer.score(test_x, test_y))
Generation 1 - Current best internal CV score: -5690.647614823849
Generation 2 - Current best internal CV score: -5690.647614823849
Generation 3 - Current best internal CV score: -5526.782440894636
Generation 4 - Current best internal CV score: -5344.624150306781
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.
TPOT closed prematurely. Will use the current best pipeline.
Best pipeline: RandomForestRegressor(CombineDFs(input_matrix, input_matrix), bootstrap=False, max_features=0.05, min_samples_leaf=1, min_samples_split=14, n_estimators=100)
-4983.7293780759865
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_function_transformer.py:97: FutureWarning: The default validate=True will be replaced by validate=False in 0.22.
"validate=False in 0.22.", FutureWarning)
# save our model code
pipeline_optimizer.export('tpot_exported_pipeline.py')
# print the model code to see what it says
#!cat tpot_exported_pipeline.py
pipeline_optimizer.predict(test_x)
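For reference, the exported file is a standalone scikit-learn script. TPOT exports CombineDFs as a feature union of two copies of the input matrix, so the best pipeline reported above amounts to roughly the following plain scikit-learn code (an illustration, not the exact exported file):

from copy import copy
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

# CombineDFs(input_matrix, input_matrix): concatenate two copies of the features
exported_pipeline = make_pipeline(
    make_union(FunctionTransformer(copy), FunctionTransformer(copy)),
    RandomForestRegressor(bootstrap=False, max_features=0.05,
                          min_samples_leaf=1, min_samples_split=14,
                          n_estimators=100)
)
exported_pipeline.fit(train_x, train_y)
results = exported_pipeline.predict(test_x)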
H2O.ai AutoML
H2O's AutoML can be used to automate the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit.
H2O AutoML performs a random search over its algorithms followed by a stacking stage. By default it uses the H2O machine learning package, which supports distributed training. It is available in both Python and R.
import h2o
from h2o.automl import H2OAutoML
# initialize an H2O instance running locally
h2o.init(nthreads=-1)
h2o.cluster().show_status()
# convert our data to an H2OFrame
Energy_Data_h2o = h2o.H2OFrame(Energy_Data)
#train_data = train_data.cbind(y_data)
splits = Energy_Data_h2o.split_frame(ratios = [0.7], seed = 1)
train = splits[0]
test = splits[1]
folds = 5
# Run AutoML for up to 20 base models (capped here at 600 seconds of runtime)
aml = H2OAutoML(max_models=20,
max_runtime_secs=600,
# validation_frame=val_df,
nfolds = folds,
                #balance_classes=True,
                stopping_metric='AUTO',  # regression options include 'deviance', 'RMSE', 'MAE'
sort_metric = "RMSE",
seed=1)
aml.train(y = 'Appliances', training_frame=train
#leaderboard_frame = test
)
# The leader model can be accessed with `aml.leader`
# save the leader model out to disk
h2o.save_model(aml.leader)
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "11.0.4" 2019-07-16; OpenJDK Runtime Environment (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3); OpenJDK 64-Bit Server VM (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3, mixed mode, sharing)
Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpyby0u3va
JVM stdout: /tmp/tmpyby0u3va/h2o_unknownUser_started_from_python.out
JVM stderr: /tmp/tmpyby0u3va/h2o_unknownUser_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O cluster uptime: | 02 secs |
H2O cluster timezone: | Etc/UTC |
H2O data parsing timezone: | UTC |
H2O cluster version: | 3.28.0.1 |
H2O cluster version age: | 1 day |
H2O cluster name: | H2O_from_python_unknownUser_dm09ir |
H2O cluster total nodes: | 1 |
H2O cluster free memory: | 2.938 Gb |
H2O cluster total cores: | 2 |
H2O cluster allowed cores: | 2 |
H2O cluster status: | accepting new members, healthy |
H2O connection url: | http://127.0.0.1:54321 |
H2O connection proxy: | {'http': None, 'https': None} |
H2O internal security: | False |
H2O API Extensions: | Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 |
Python version: | 3.6.9 final |
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
'/content/StackedEnsemble_AllModels_AutoML_20191218_010817'
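The string returned above is the path of the saved binary model. As a quick sketch, it can be reloaded later for scoring with h2o.load_model:

# reload the saved binary model and score the test frame with it
loaded_model = h2o.load_model('/content/StackedEnsemble_AllModels_AutoML_20191218_010817')
loaded_model.predict(test)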
# View the top five models from the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=5)
model_id | rmse | mean_residual_deviance | mse | mae | rmsle |
---|---|---|---|---|---|
StackedEnsemble_AllModels_AutoML_20191218_010817 | 68.7509 | 4726.69 | 4726.69 | 33.606 | 0.396076 |
StackedEnsemble_BestOfFamily_AutoML_20191218_010817 | 68.7863 | 4731.55 | 4731.55 | 33.5578 | 0.396007 |
XRT_1_AutoML_20191218_010817 | 69.6868 | 4856.24 | 4856.24 | 33.9568 | 0.401883 |
DRF_1_AutoML_20191218_010817 | 69.7698 | 4867.82 | 4867.82 | 33.744 | 0.401009 |
XGBoost_2_AutoML_20191218_010817 | 69.8614 | 4880.61 | 4880.61 | 33.9859 | 0.402586 |
# save the model as a MOJO file; a destination path can also be
# passed, e.g. aml.leader.download_mojo(path="./")
aml.leader.download_mojo()
# print the rmse for the cross-validated data
#aml.leader.rmse(xval=True)
'/content/StackedEnsemble_AllModels_AutoML_20191218_010817.zip'
#Predict Using Leader Model
pred = aml.predict(test)
pred.head()
stackedensemble prediction progress: |████████████████████████████████████| 100%
predict |
---|
61.1797 |
96.4973 |
92.1388 |
175.122 |
314.226 |
250.165 |
141.156 |
107.482 |
69.8182 |
89.4973 |
#the standard model_performance() method can be applied to
# the AutoML leader model and a test set to generate an H2O model performance object.
perf = aml.leader.model_performance(test)
perf
ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **
MSE: 5064.8189589784215
RMSE: 71.16754147066219
MAE: 34.120372351242764
RMSLE: 0.3951929658613881
R^2: 0.5334201696104182
Mean Residual Deviance: 5064.8189589784215
Null degrees of freedom: 5865
Residual degrees of freedom: 5860
Null deviance: 63677819.057285145
Residual deviance: 29710228.01336742
AIC: 66698.39903447537
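Individual numbers can also be pulled from the performance object instead of printing the whole block; a small sketch:

# pull individual metrics from the H2O performance object
print(perf.rmse())
print(perf.mae())
print(perf.r2())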
h2o.cluster().shutdown()
R
We will also train the H2O AutoML model in R.
# activate R magic to run R in google colab notebook
import rpy2
%load_ext rpy2.ipython
The rpy2.ipython extension is already loaded. To reload it, use:
%reload_ext rpy2.ipython
%%R
# Next, we download, install and initialize the H2O package for R.
install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/master/1497/R", getOption("repos"))))
library(h2o)
localH2O = h2o.init()
#localH2O = h2o.init(nthreads = -1,strict_version_check = FALSE,startH2O = FALSE)
h2o.no_progress() # Turn off progress bars for notebook readability
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/RtmpDB345V/h2o_UnknownUser_started_from_r.out
/tmp/RtmpDB345V/h2o_UnknownUser_started_from_r.err
Starting H2O JVM and connecting: .. Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 seconds 441 milliseconds
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.26.0.2
H2O cluster version age: 4 months and 21 days !!!
H2O cluster name: H2O_started_from_R_root_uiy706
H2O cluster total nodes: 1
H2O cluster total memory: 2.94 GB
H2O cluster total cores: 2
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.6.1 (2019-07-05)
Convert Energy_Data from a pandas dataframe in Python to an R dataframe.
%%R -i Energy_Data
head(Energy_Data,5)
/usr/local/lib/python3.6/dist-packages/rpy2/robjects/pandas2ri.py:191: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
res = PandasDataFrame.from_items(items)
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4
0 60 30 19.89 47.59667 19.2 44.79000 19.79 44.73000 19.00000
1 60 30 19.89 46.69333 19.2 44.72250 19.79 44.79000 19.00000
2 50 30 19.89 46.30000 19.2 44.62667 19.79 44.93333 18.92667
3 50 40 19.89 46.06667 19.2 44.59000 19.79 45.00000 18.89000
4 60 40 19.89 46.33333 19.2 44.53000 19.79 45.00000 18.89000
RH_4 T5 RH_5 T6 RH_6 T7 RH_7 T8 RH_8
0 45.56667 17.16667 55.20 7.026667 84.25667 17.20000 41.62667 18.2 48.90000
1 45.99250 17.16667 55.20 6.833333 84.06333 17.20000 41.56000 18.2 48.86333
2 45.89000 17.16667 55.09 6.560000 83.15667 17.20000 41.43333 18.2 48.73000
3 45.72333 17.16667 55.09 6.433333 83.42333 17.13333 41.29000 18.1 48.59000
4 45.53000 17.20000 55.09 6.366667 84.89333 17.20000 41.23000 18.1 48.59000
T9 RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint
0 17.03333 45.53 6.600000 733.5 92 7.000000 63.00000 5.3
1 17.06667 45.56 6.483333 733.6 92 6.666667 59.16667 5.2
2 17.00000 45.50 6.366667 733.7 92 6.333333 55.33333 5.1
3 17.00000 45.40 6.250000 733.8 92 6.000000 51.50000 5.0
4 17.00000 45.40 6.133333 733.9 92 5.666667 47.66667 4.9
rv1 rv2 hour day month week weekday quarter year
0 13.27543 13.27543 17 11 1 2 0 1 2016
1 18.60619 18.60619 17 11 1 2 0 1 2016
2 28.64267 28.64267 17 11 1 2 0 1 2016
3 45.41039 45.41039 17 11 1 2 0 1 2016
4 10.08410 10.08410 17 11 1 2 0 1 2016
%%R
# Identify predictors and response
y <- "Appliances"
x <- setdiff(names(Energy_Data), y)
%%R
# convert the R dataframe to an H2OFrame
Energy_Data_h2o = as.h2o(Energy_Data)
# split into train and validation sets
splits = h2o.splitFrame(Energy_Data_h2o,ratios = c(0.7), seed = 1)
train = splits[[1]]
test = splits[[2]]
%%R
library(h2o)
# For binary classification, response should be a factor
#train[,y] <- as.factor(train[,y])
#test[,y] <- as.factor(test[,y])
# import the dataset:
#df<- h2o.importFile("https://")
# set the number of folds for you n-fold cross validation:
folds <- 5
# Run AutoML for up to 20 base models (capped here at 600 seconds of runtime)
model_h2o_automl = h2o.automl(x = x, y = y,
nfolds = folds,
# validation_frame=val_df,
training_frame = train,
max_models = 20,
max_runtime_secs=600,
sort_metric = "RMSE",
#balance_classes = TRUE,
               stopping_metric='AUTO', # regression options include 'deviance', 'RMSE', 'MAE'
seed=1
)
%%R
# Get model ids for all models in the AutoML Leaderboard
model_ids <- as.data.frame(model_h2o_automl@leaderboard$model_id)[,1]
# Get the "All Models" Stacked Ensemble model
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
# Get the Stacked Ensemble metalearner model
metalearner <- h2o.getModel(se@model$metalearner$name)
metalearner
Model Details:
==============
H2ORegressionModel: glm
Model ID: metalearner_AUTO_StackedEnsemble_AllModels_AutoML_20191218_024428
GLM Model: summary
family link regularization
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.1497 )
number_of_predictors_total number_of_active_predictors number_of_iterations
1 19 6 1
training_frame
1 levelone_training_StackedEnsemble_AllModels_AutoML_20191218_024428
Coefficients: glm coefficients
names coefficients
1 Intercept -6.641758
2 XRT_1_AutoML_20191218_024428 0.252550
3 DRF_1_AutoML_20191218_024428 0.173411
4 XGBoost_2_AutoML_20191218_024428 0.198408
5 XGBoost_1_AutoML_20191218_024428 0.137834
6 GBM_4_AutoML_20191218_024428 0.000000
7 GBM_3_AutoML_20191218_024428 0.000000
8 GBM_2_AutoML_20191218_024428 0.000000
9 GBM_5_AutoML_20191218_024428 0.000000
10 GBM_grid_1_AutoML_20191218_024428_model_1 0.128306
11 GBM_1_AutoML_20191218_024428 0.000000
12 XGBoost_3_AutoML_20191218_024428 0.000000
13 DeepLearning_1_AutoML_20191218_024428 0.000000
14 GBM_grid_1_AutoML_20191218_024428_model_3 0.000000
15 GLM_grid_1_AutoML_20191218_024428_model_1 0.000000
16 XGBoost_grid_1_AutoML_20191218_024428_model_2 0.380006
17 DeepLearning_grid_1_AutoML_20191218_024428_model_2 0.000000
18 GBM_grid_1_AutoML_20191218_024428_model_2 0.000000
19 DeepLearning_grid_1_AutoML_20191218_024428_model_1 0.000000
20 XGBoost_grid_1_AutoML_20191218_024428_model_1 0.000000
standardized_coefficients
1 97.560747
2 18.034622
3 12.359363
4 13.761426
5 9.470020
6 0.000000
7 0.000000
8 0.000000
9 0.000000
10 10.556734
11 0.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
16 12.679330
17 0.000000
18 0.000000
19 0.000000
20 0.000000
H2ORegressionMetrics: glm
** Reported on training data. **
MSE: 4651.294
RMSE: 68.20039
MAE: 33.06181
RMSLE: 0.3919449
Mean Residual Deviance : 4651.294
R^2 : 0.5512544
Null Deviance :143753580
Null D.o.F. :13868
Residual Deviance :64508792
Residual D.o.F. :13862
AIC :156496.8
H2ORegressionMetrics: glm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 4664.595
RMSE: 68.29784
MAE: 33.09259
RMSLE: 0.3921523
Mean Residual Deviance : 4664.595
R^2 : 0.5499711
Null Deviance :143772363
Null D.o.F. :13868
Residual Deviance :64693262
Residual D.o.F. :13861
AIC :156538.4
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid
mae 33.082054 0.45924768 33.94574 33.453297
mean_residual_deviance 4659.3423 244.0951 5096.3237 5001.1846
mse 4659.3423 244.0951 5096.3237 5001.1846
null_deviance 2.8754472E7 1749350.6 3.0964342E7 3.1220312E7
r2 0.549355 0.01679809 0.5363588 0.55255246
residual_deviance 1.2938652E7 824356.4 1.4356344E7 1.3968309E7
rmse 68.21254 1.7876369 71.38854 70.719055
rmsle 0.39201766 0.003685681 0.39849618 0.39207774
cv_3_valid cv_4_valid cv_5_valid
mae 33.361572 32.31076 32.338898
mean_residual_deviance 4634.8745 4320.175 4244.153
mse 4634.8745 4320.175 4244.153
null_deviance 2.9694604E7 2.4724726E7 2.716838E7
r2 0.55832946 0.5139777 0.5855564
residual_deviance 1.3107425E7 1.2001446E7 1.1259738E7
rmse 68.07991 65.728035 65.14716
rmsle 0.39180788 0.39492768 0.38277885
%%R
#install.packages("tidyverse")
library(tidyverse)
#The leader model is stored at `model_h2o_automl@leader` and the
#leaderboard at `model_h2o_automl@leaderboard`.
lb = model_h2o_automl@leaderboard %>% as_tibble()
#Now we will view a snapshot of the top models. Here we should see the
#two Stacked Ensembles at or near the top of the leaderboard.
#Stacked Ensembles can almost always outperform a single model.
print(lb)
#To view the entire leaderboard, specify the `n` argument of
#the `print.H2OFrame()` function as the total number of rows:
print(lb, n = nrow(lb))
# A tibble: 21 x 6
model_id mean_residual_devi… rmse mse mae rmsle
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 StackedEnsemble_AllModels_AutoML… 4665. 68.3 4665. 33.1 0.392
2 StackedEnsemble_BestOfFamily_Aut… 4716. 68.7 4716. 33.3 0.393
3 XRT_1_AutoML_20191218_024428 4775. 69.1 4775. 33.6 0.399
4 DRF_1_AutoML_20191218_024428 4849. 69.6 4849. 33.8 0.400
5 XGBoost_2_AutoML_20191218_024428 4880. 69.9 4880. 34.0 0.402
6 XGBoost_1_AutoML_20191218_024428 4964. 70.5 4964. 34.7 0.409
7 GBM_4_AutoML_20191218_024428 5270. 72.6 5270. 36.3 0.424
8 GBM_3_AutoML_20191218_024428 5502. 74.2 5502. 37.5 0.438
9 GBM_2_AutoML_20191218_024428 5543. 74.5 5543. 37.9 0.442
10 GBM_5_AutoML_20191218_024428 5626. 75.0 5626. 39.0 0.452
# … with 11 more rows
# A tibble: 21 x 6
model_id mean_residual_devi… rmse mse mae rmsle
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 StackedEnsemble_AllModels_Auto… 4665. 68.3 4665. 33.1 0.392
2 StackedEnsemble_BestOfFamily_A… 4716. 68.7 4716. 33.3 0.393
3 XRT_1_AutoML_20191218_024428 4775. 69.1 4775. 33.6 0.399
4 DRF_1_AutoML_20191218_024428 4849. 69.6 4849. 33.8 0.400
5 XGBoost_2_AutoML_20191218_0244… 4880. 69.9 4880. 34.0 0.402
6 XGBoost_1_AutoML_20191218_0244… 4964. 70.5 4964. 34.7 0.409
7 GBM_4_AutoML_20191218_024428 5270. 72.6 5270. 36.3 0.424
8 GBM_3_AutoML_20191218_024428 5502. 74.2 5502. 37.5 0.438
9 GBM_2_AutoML_20191218_024428 5543. 74.5 5543. 37.9 0.442
10 GBM_5_AutoML_20191218_024428 5626. 75.0 5626. 39.0 0.452
11 GBM_grid_1_AutoML_20191218_024… 5645. 75.1 5645. 40.7 NA
12 GBM_1_AutoML_20191218_024428 5711. 75.6 5711. 38.7 0.452
13 XGBoost_3_AutoML_20191218_0244… 5816. 76.3 5816. 39.2 0.457
14 DeepLearning_1_AutoML_20191218… 8251. 90.8 8251. 55.0 NA
15 GBM_grid_1_AutoML_20191218_024… 8475. 92.1 8475. 52.1 0.605
16 GLM_grid_1_AutoML_20191218_024… 8643. 93.0 8643. 52.6 0.615
17 XGBoost_grid_1_AutoML_20191218… 9491. 97.4 9491. 55.0 0.795
18 DeepLearning_grid_1_AutoML_201… 9818. 99.1 9818. 59.6 NA
19 GBM_grid_1_AutoML_20191218_024… 10290. 101. 10290. 60.0 0.699
20 DeepLearning_grid_1_AutoML_201… 10584. 103. 10584. 67.7 NA
21 XGBoost_grid_1_AutoML_20191218… 10961. 105. 10961. 61.0 1.68
%%R
#install.packages("yardstick")
#Examine the variable importance of the metalearner (combiner)
#algorithm in the ensemble. This shows us how much each base learner
#contributes to the ensemble. The AutoML Stacked Ensembles use the default
#metalearner algorithm (a GLM with non-negative weights), so the variable
#importance of the metalearner is actually the standardized coefficient
#magnitudes of the GLM.
print(h2o.varimp_plot(metalearner))
#We can also list the base learner contributions to the ensemble.
print(h2o.varimp(metalearner))
## Save Leader Model
#There are two ways to save the leader model -- binary format
#and MOJO format. If you're taking your leader model to production,
# then we'd suggest the MOJO format since it's optimized for production use.
#h2o.saveModel(model_h2o_automl@leader, path = "")
#h2o.download_mojo(model_h2o_automl@leader, path = "")
# Eval ensemble performance on a test set
perf <- h2o.performance(model_h2o_automl@leader, newdata = test)
print(perf)
print('Train set r2 :')
print(h2o.r2(model_h2o_automl@leader))
pred=predict(model_h2o_automl@leader, newdata = test)
#df= bind_cols(truth=test$Appliances,response=pred)
df= h2o.cbind(test$Appliances,pred)
names(df)[1:2] = c("truth","response")
print('Test set r2 :')
print(yardstick::rsq(as_tibble(df),truth,response))
#test_performance = model_h2o_automl.model_performance(test)
#print(test_performance)
df
H2ORegressionMetrics: stackedensemble
MSE: 5102.83
RMSE: 71.43409
MAE: 33.77602
RMSLE: 0.3921076
Mean Residual Deviance : 5102.83
[1] "Train set r2 :"
[1] 0.9321445
[1] "Test set r2 :"
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.530
truth response
1 50 60.92641
2 60 110.36650
3 60 112.35890
4 70 186.47200
5 230 320.81606
6 430 259.65633
[5866 rows x 2 columns]
%%R
# compare the distribution of actual Appliances values (black)
# with the leader model's predictions (red)
preds <- as_tibble(df)
ggplot(preds, aes(x = truth)) +
  geom_density(trim = TRUE) +
  geom_density(aes(x = response), trim = TRUE, col = "red")
## Ensemble Exploration
#To understand how the ensemble works, let's take a peek inside the
#Stacked Ensemble "All Models" model. The "All Models" ensemble is
#an ensemble of all of the individual models in the AutoML run.
#This is often the top performing model on the leaderboard.
# Get model ids for all models in the AutoML Leaderboard
model_ids <- as.data.frame(model_h2o_automl@leaderboard$model_id)[,1]
# Get the "All Models" Stacked Ensemble model
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
# Get the Stacked Ensemble metalearner model
metalearner <- h2o.getModel(se@model$metalearner$name)
auto-sklearn
auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. It provides a scikit-learn-like interface in Python and uses Bayesian optimization to find good machine learning pipelines.
It uses automatic ensemble construction and meta-learning to warm-start the search procedure, which means the search is more likely to start with good pipelines. auto-sklearn can be installed using either pip or conda. Example code for training a regression model is given below.
!ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null
!brew install swig
!swig -version
!pip install swig
!pip install pyrfr
!pip install scikit-learn
!pip install auto-sklearn
/bin/bash: brew: command not found
/bin/bash: swig: command not found
import autosklearn.classification
import autosklearn.regression
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
# classification counterpart (not used for this regression task):
#automl = autosklearn.classification.AutoSklearnClassifier()
#automl.fit(train_x, train_y)
#y_hat = automl.predict(test_x)
#print("Accuracy score", sklearn.metrics.accuracy_score(test_y, y_hat))
feature_types = ['numerical'] * train_x.shape[1]  # one entry per feature column
automl = autosklearn.regression.AutoSklearnRegressor(
time_left_for_this_task=120,
per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_tmp',
    output_folder='/tmp/autosklearn_out',  # must differ from tmp_folder
)
automl.fit(train_x, train_y,
feat_type=feature_types)
print(automl.show_models())
predictions = automl.predict(test_x)
print("R2 score:", sklearn.metrics.r2_score(test_y, predictions))
Automatic Deep Learning
AutoKeras is an open source software library for automated machine learning (AutoML). The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background. AutoKeras provides functions to automatically search for architectures and hyperparameters of deep learning models. Naive architecture search is very slow because it requires fully retraining each candidate model from scratch, so Auto-Keras uses network morphism to reduce training time during the search, guided by Bayesian optimization with a Gaussian process (GP). AutoKeras is mainly useful for (2D) image classification; many other deep learning tasks have not been implemented so far.
!!pip install autokeras
['Collecting autokeras',
 ' Downloading https://files.pythonhosted.org/packages/c2/32/de74bf6afd09925980340355a05aa6a19e7378ed91dac09e76a487bd136d/autokeras-0.4.0.tar.gz (67kB)',
 ...]
from keras.datasets import mnist
from autokeras.image.image_supervised import ImageClassifier
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape + (1,))
x_test = x_test.reshape(x_test.shape + (1,))
clf = ImageClassifier(verbose=True)
clf.fit(x_train, y_train, time_limit=12 * 60 * 60)
clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
y = clf.evaluate(x_test, y_test)
print(y)
The default version of TensorFlow in Colab will soon switch to TensorFlow 2.x. We recommend you upgrade now or ensure your notebook will continue to use TensorFlow 1.x via the %tensorflow_version 1.x magic.
Using TensorFlow backend.
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
Saving Directory: /tmp/autokeras_NEPR81
Preprocessing the images.
Preprocessing finished.
Initializing search.
Initialization finished.
+----------------------------------------------+
| Training model 0 |
+----------------------------------------------+
No loss decrease after 5 epochs.
Saving model.
+--------------------------------------------------------------------------+
| Model ID | Loss | Metric Value |
+--------------------------------------------------------------------------+
| 0 | 0.4400691822171211 | 0.9648 |
+--------------------------------------------------------------------------+
+----------------------------------------------+
| Training model 1 |
+----------------------------------------------+
No loss decrease after 5 epochs.
Saving model.
+--------------------------------------------------------------------------+
| Model ID | Loss | Metric Value |
+--------------------------------------------------------------------------+
| 1 | 0.07899918230250477 | 0.9936 |
+--------------------------------------------------------------------------+
+----------------------------------------------+
| Training model 2 |
+----------------------------------------------+
Epoch-6, Current Metric - 0.97: 77%|███████████████████▎ | 360/465 [01:57<00:35, 2.96 batch/s]
auto_ml
auto_ml is designed for production. It is very simple to use and is turning out to be one of my favorites. More information on this package can be found here. XGBoost, deep learning with TensorFlow & Keras, CatBoost and LightGBM are all integrated with auto_ml.
Generally, just pass one of them in for model_names: ml_predictor.train(data, model_names=['DeepLearningClassifier']).
Available options are DeepLearningClassifier and DeepLearningRegressor, XGBClassifier and XGBRegressor, and LGBMClassifier and LGBMRegressor. Among the numerous automations this package provides are feature engineering such as one-hot encoding, creating new features like day_of_week and minutes from date-time features, feature normalization, etc.
!pip install dill
!pip install -r advanced_requirements.txt
!pip install auto_ml
!pip install xgboost
!pip install catboost
!pip install tensorflow
!pip install keras
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension
Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets) (2.1.3)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets) (0.8.1)
from auto_ml import Predictor
from auto_ml.utils_models import load_ml_model
column_descriptions = {
'Appliances': 'output',
#'CHAS': 'categorical'
}
train_data= pd.concat([train_x,train_y],axis=1)
ml_predictor = Predictor(type_of_estimator='regressor',
column_descriptions=column_descriptions
)
#ml_predictor.train(train_data,model_names=['DeepLearningRegressor']), r2=0.22
#ml_predictor.train(train_data,model_names=['XGBRegressor']) #r2=0.32
#ml_predictor.train(train_data,model_names=['LGBMRegressor']) #0.497
ml_predictor.train(train_data) #0.431
ml_predictor.score(test_x, test_y)
Welcome to auto_ml! We're about to go through and make sense of your data using machine learning, and give you a production-ready pipeline to get predictions with.
If you have any issues, or new feature ideas, let us know at http://auto.ml
You are running on version 2.9.10
Now using the model training_params that you passed in:
{}
After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
{'presort': False, 'learning_rate': 0.1, 'warm_start': True}
Running basic data cleaning
Fitting DataFrameVectorizer
Now using the model training_params that you passed in:
{}
After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
{'presort': False, 'learning_rate': 0.1, 'warm_start': True}
********************************************************************************************
About to fit the pipeline for the model GradientBoostingRegressor to predict Appliances
Started at:
2019-12-16 01:53:15
[1] random_holdout_set_from_training_data's score is: -103.844
[2] random_holdout_set_from_training_data's score is: -102.638
[3] random_holdout_set_from_training_data's score is: -101.687
[4] random_holdout_set_from_training_data's score is: -100.815
[5] random_holdout_set_from_training_data's score is: -100.113
[6] random_holdout_set_from_training_data's score is: -99.56
[7] random_holdout_set_from_training_data's score is: -99.099
[8] random_holdout_set_from_training_data's score is: -98.695
[9] random_holdout_set_from_training_data's score is: -98.152
[10] random_holdout_set_from_training_data's score is: -97.829
[11] random_holdout_set_from_training_data's score is: -97.534
[12] random_holdout_set_from_training_data's score is: -97.089
:
:
:
[9700] random_holdout_set_from_training_data's score is: -78.201
[9800] random_holdout_set_from_training_data's score is: -78.196
[9900] random_holdout_set_from_training_data's score is: -78.183
The number of estimators that were the best for this training dataset: 9200
The best score on the holdout set: -78.17831855991089
Finished training the pipeline!
Total training time:
0:19:50
Here are the results from our GradientBoostingRegressor
predicting Appliances
Calculating feature responses, for advanced analytics.
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
FutureWarning)
The printed list will only contain at most the top 100 features.
+----+----------------+--------------+---------+-------------------+-------------------+-----------+-----------+-----------+-----------+
| | Feature Name | Importance | Delta | FR_Decrementing | FR_Incrementing | FRD_abs | FRI_abs | FRD_MAD | FRI_MAD |
|----+----------------+--------------+---------+-------------------+-------------------+-----------+-----------+-----------+-----------|
| 33 | year | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 32 | quarter | 0.0004 | 0.2454 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 29 | month | 0.0044 | 0.6681 | 1.8807 | -0.5714 | 2.7292 | 0.7609 | 0.1994 | 0.0000 |
| 31 | weekday | 0.0096 | 1.0071 | 1.1695 | -1.1261 | 3.9296 | 4.1527 | 1.0363 | 0.9279 |
| 23 | Visibility | 0.0121 | 5.9874 | 2.5418 | 1.7824 | 5.9795 | 4.3472 | 1.6654 | 1.6835 |
| 28 | day | 0.0149 | 4.2629 | -1.1345 | 8.7127 | 3.9208 | 12.0951 | 1.0589 | 1.2265 |
| 25 | rv1 | 0.0157 | 7.1657 | 56.4534 | 96.4440 | 57.2148 | 97.1504 | 47.4928 | 64.5670 |
| 26 | rv2 | 0.0178 | 7.1657 | 65.4006 | 66.8796 | 66.3326 | 67.5132 | 60.2277 | 60.9206 |
| 1 | T1 | 0.0188 | 0.7852 | 13.8211 | 1.1122 | 19.1298 | 9.2902 | 8.3319 | 3.7755 |
| 17 | T9 | 0.0204 | 0.9972 | 3.8749 | 17.1198 | 12.5666 | 23.9141 | 6.9691 | 13.0813 |
| 30 | week | 0.0213 | 2.8238 | 6.6785 | 8.0968 | 9.4417 | 12.5097 | 3.7337 | 1.9681 |
| 22 | Windspeed | 0.0219 | 1.2516 | 0.3866 | 2.6343 | 7.8313 | 7.5808 | 4.5071 | 2.9881 |
| 13 | T7 | 0.0229 | 1.0471 | 19.2027 | 2.7276 | 24.3899 | 11.8721 | 6.9290 | 5.4307 |
| 21 | RH_out | 0.0230 | 7.5409 | 4.5123 | 5.7414 | 10.1780 | 10.9347 | 4.3237 | 4.2214 |
| 7 | T4 | 0.0234 | 1.0108 | 4.8105 | 3.0581 | 11.4415 | 11.6706 | 6.1204 | 5.2510 |
| 3 | T2 | 0.0235 | 1.0783 | 14.0835 | 0.8992 | 19.7000 | 11.4377 | 6.4855 | 5.7581 |
| 19 | T_out | 0.0239 | 2.6511 | 6.8862 | -4.4735 | 18.3031 | 10.5233 | 10.0484 | 7.6458 |
| 9 | T5 | 0.0240 | 0.9025 | 13.1377 | 3.0813 | 19.8936 | 10.4180 | 7.6748 | 4.9818 |
| 18 | RH_9 | 0.0264 | 2.0675 | 5.6641 | 0.8102 | 13.4686 | 13.8479 | 8.2520 | 8.7596 |
| 24 | Tdewpoint | 0.0299 | 2.0609 | 2.5114 | 5.8576 | 15.2535 | 15.9113 | 7.3289 | 8.4186 |
| 0 | lights | 0.0302 | 3.9307 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 12 | RH_6 | 0.0324 | 15.9312 | 6.9685 | 23.3361 | 14.6424 | 29.8704 | 5.8741 | 9.1803 |
| 16 | RH_8 | 0.0334 | 2.6452 | 5.3547 | 1.5003 | 18.7197 | 13.6153 | 9.8327 | 7.8713 |
| 11 | T6 | 0.0343 | 3.0657 | 0.7589 | 13.1492 | 20.2709 | 21.9876 | 11.7236 | 10.2991 |
| 14 | RH_7 | 0.0353 | 2.5387 | 12.6561 | 2.0688 | 18.6964 | 12.7747 | 8.9771 | 6.7723 |
| 8 | RH_4 | 0.0372 | 2.1545 | 14.7021 | 11.6144 | 21.2902 | 19.7052 | 7.5457 | 6.6670 |
| 15 | T8 | 0.0375 | 0.9570 | -3.6613 | 10.3051 | 15.2607 | 14.0218 | 9.4964 | 8.0907 |
| 10 | RH_5 | 0.0381 | 4.6173 | 34.8231 | 0.3044 | 39.5146 | 8.5136 | 10.9724 | 4.7449 |
| 2 | RH_1 | 0.0390 | 1.9196 | 0.8117 | 12.8458 | 17.0123 | 18.3431 | 9.3164 | 10.8043 |
| 4 | RH_2 | 0.0426 | 1.9942 | 14.9731 | 1.1853 | 17.7433 | 15.2093 | 8.9040 | 8.3585 |
| 6 | RH_3 | 0.0441 | 1.5894 | 8.8671 | 16.8765 | 23.3520 | 19.1881 | 9.8736 | 11.1776 |
| 20 | Press_mm_hg | 0.0484 | 3.6523 | 7.5979 | 12.3749 | 18.9428 | 23.5722 | 7.7871 | 8.6312 |
| 5 | T3 | 0.0491 | 0.9847 | 19.9205 | 5.5366 | 27.4795 | 16.7018 | 10.3640 | 10.0421 |
| 27 | hour | 0.1439 | 3.3978 | 2.3847 | -1.8274 | 20.2401 | 21.1750 | 11.4552 | 14.4227 |
+----+----------------+--------------+---------+-------------------+-------------------+-----------+-----------+-----------+-----------+
*******
Legend:
Importance = Feature Importance
Explanation: A weighted measure of how much of the variance the model is able to explain is due to this column
FR_delta = Feature Response Delta Amount
Explanation: Amount this column was incremented or decremented by to calculate the feature reponses
FR_Decrementing = Feature Response From Decrementing Values In This Column By One FR_delta
Explanation: Represents how much the predicted output values respond to subtracting one FR_delta amount from every value in this column
FR_Incrementing = Feature Response From Incrementing Values In This Column By One FR_delta
Explanation: Represents how much the predicted output values respond to adding one FR_delta amount to every value in this column
FRD_MAD = Feature Response From Decrementing- Median Absolute Delta
Explanation: Takes the absolute value of all changes in predictions, then takes the median of those. Useful for seeing if decrementing this feature provokes strong changes that are both positive and negative
FRI_MAD = Feature Response From Incrementing- Median Absolute Delta
Explanation: Takes the absolute value of all changes in predictions, then takes the median of those. Useful for seeing if incrementing this feature provokes strong changes that are both positive and negative
FRD_abs = Feature Response From Decrementing Avg Absolute Change
Explanation: What is the average absolute change in predicted output values to subtracting one FR_delta amount to every value in this column. Useful for seeing if output is sensitive to a feature, but not in a uniformly positive or negative way
FRI_abs = Feature Response From Incrementing Avg Absolute Change
Explanation: What is the average absolute change in predicted output values to adding one FR_delta amount to every value in this column. Useful for seeing if output is sensitive to a feature, but not in a uniformly positive or negative way
*******
None
***********************************************
Advanced scoring metrics for the trained regression model on this particular dataset:
Here is the overall RMSE for these predictions:
78.12730544452076
Here is the average of the predictions:
99.33728042751186
Here is the average actual value on this validation set:
98.55429826042898
Here is the median prediction:
71.29957584827399
Here is the median actual value:
60.0
Here is the mean absolute error:
39.930776664246636
Here is the median absolute error (robust to outliers):
17.292385522544976
Here is the explained variance:
0.43110768950362766
Here is the R-squared value:
0.4310505453591652
Count of positive differences (prediction > actual):
3353
Count of negative differences:
2568
Average positive difference:
35.947832693155526
Average negative difference:
-45.131248290052
***********************************************
-78.12730544452076
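Since auto_ml is aimed at production, the trained pipeline can be persisted and reloaded. A short sketch using its save/load helpers (load_ml_model was already imported above):

# persist the trained pipeline, then reload it for predictions
file_name = ml_predictor.save()  # returns the name of the saved pipeline file
trained_model = load_ml_model(file_name)
predictions = trained_model.predict(test_x)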
autoxgboost
autoxgboost aims to find an optimal xgboost model automatically using the machine learning framework mlr and the Bayesian optimization framework mlrMBO.
autoxgboost differs from most frameworks on this page in that it does not search over multiple learning algorithms; instead, it restricts itself to finding a good hyperparameter configuration for xgboost. The exception is a preprocessing step for categorical variables, where the specific encoding strategy to use is tuned as well.
%%R
install.packages("devtools")
install.packages("usethis")
install.packages("remotes")
install.packages("githubinstall")
install.packages("ghit")
#remotes::install_github("edwardcooper/automl")
library("devtools")
library(githubinstall)
# library("ghit")
#devtools::install_github("ja-thomas/autoxgboost")
remotes::install_github("ja-thomas/autoxgboost")
#ghit::install_github("cloudyr/ghit")
# githubinstall("autoxgboost")
%%R
library(autoxgboost)
#install.packages("devtools")
#install.packages("usethis")
#install.packages("remotes")
#remotes::install_github("edwardcooper/automl")
library("devtools")
pacman::p_load(lime,DALEX,forcats,ALEPlot,pdp,iBreakDown,localModel,breakDown,
               xfun,clipr,sf,spdep,lubridate)
library(tidyverse)
#devtools::install_github("ja-thomas/autoxgboost")
#remotes::install_github("ja-thomas/autoxgboost")
file= '/Users/energydata_complete.csv'
energy_data= readr::read_csv(file)
energy_data$hour = hour(energy_data$date)
energy_data$day = wday(energy_data$date)
energy_data$month = month(energy_data$date)
energy_data$week = week(energy_data$date)
#energy_data$weekday= weekdays(energy_data$date)
energy_data$quarter=quarter(energy_data$date)
energy_data$year = year(energy_data$date)
energydata = energy_data %>% select(-date)
library(rsample)
data_split <- initial_split(energydata, strata = "Appliances", prop = 0.70)
energydata_train <- training(data_split)
energydata_test <- testing(data_split)
reg_task <- makeRegrTask(data = energydata_train, target = "Appliances")
set.seed(1234)
#system.time(reg_auto <- autoxgboost(reg_task))
# saveRDS(reg_auto, file = "D:/SDIautoxgboost_80.rds")
#data.task = makeRegrTask(data = iris, target = "Appliances")
ctrl = makeMBOControl()
ctrl = setMBOControlTermination(ctrl, iters = 2L) # speed up tuning by doing only 2 MBO iterations
res = autoxgboost(reg_task, control = ctrl, tune.threshold = FALSE)
print(res)
pred=predict(res,data.frame(energydata_test))
library(yardstick)
pred$data
print(yardstick::rmse(pred$data,truth,response))
print(yardstick::rsq(pred$data,truth,response))
/usr/local/lib/python3.6/dist-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
warnings.warn(x, RRuntimeWarning)