AutoML: The Next Wave of Machine Learning

Mercari is a popular shopping app in Japan that has been using AutoML Vision (Google’s AutoML solution) for classifying images. According to Mercari, they’ve been “developing their own ML model that suggests a brand name from 12 major brands in the photo uploading user interface.”

While their own model, trained with TensorFlow, achieved an accuracy of 75%, AutoML Vision in advanced mode with 50,000 training images achieved an accuracy of 91.3%, an improvement of roughly 16 percentage points. With results like these, Mercari has integrated AutoML into their systems.

This is just one example of how AutoML is changing ML-based solutions today: it enables people from diverse backgrounds to build machine learning models that address complex scenarios.

Automated Machine Learning: AutoML

Machine learning has provided some significant breakthroughs in diverse fields in recent years. Areas like financial services, healthcare, retail, transportation, and more have been using machine learning systems in one way or another, and the results have been promising.

Machine learning today is not limited to R&D applications but has made its foray into the enterprise domain. However, the traditional ML process is human-dependent, and not all businesses have the resources to invest in an experienced data science team. AutoML may be the answer to such situations.

Automated machine learning (AutoML) automates the end-to-end process of applying machine learning to real-world problems. It makes machine learning genuinely accessible, even to people without deep expertise in the field.

A typical machine learning workflow consists of four processes: data ingestion, pre-processing, optimization, and prediction.

Traditionally, every step, from ingesting data through pre-processing and optimization to predicting outcomes, is controlled and performed by humans. With AutoML, human effort concentrates on two aspects: data acquisition/collection and prediction. The steps in between can be automated while still delivering a well-optimized model that is ready to make predictions.
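To make the four stages concrete, here is a minimal sketch in plain Python of the manual workflow that AutoML tries to automate. Everything in it is an illustrative assumption rather than part of any AutoML library: the toy data, the function names (ingest, preprocess, optimise, predict), and the choice of a one-feature ridge-style regression tuned on a hold-out split.

```python
import statistics

def ingest():
    # Stage 1: data ingestion (toy, hard-coded data where y is roughly 2*x + 1)
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1]
    return xs, ys

def preprocess(xs):
    # Stage 2: pre-processing (standardise the single feature)
    mean, std = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs], (mean, std)

def fit(xs, ys, penalty):
    # Closed-form one-feature regression with a ridge-style penalty on the slope
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)
    return slope, y_mean - slope * x_mean

def mse(model, xs, ys):
    slope, intercept = model
    return statistics.mean((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))

def optimise(xs, ys, penalties):
    # Stage 3: optimisation (pick the penalty with the lowest hold-out error)
    train_x, train_y = xs[::2], ys[::2]
    val_x, val_y = xs[1::2], ys[1::2]
    best = min(penalties, key=lambda p: mse(fit(train_x, train_y, p), val_x, val_y))
    # Refit on all of the data with the selected penalty
    return best, fit(xs, ys, best)

def predict(model, scaler, raw_x):
    # Stage 4: prediction (apply the same scaling, then the fitted line)
    mean, std = scaler
    slope, intercept = model
    return slope * (raw_x - mean) / std + intercept

xs, ys = ingest()
scaled_xs, scaler = preprocess(xs)
best_penalty, model = optimise(scaled_xs, ys, penalties=[0.0, 0.1, 1.0, 10.0])
prediction = predict(model, scaler, 7.0)
print(best_penalty, round(prediction, 2))
```

In this sketch a person still writes stages 2 and 3 by hand. An AutoML tool would choose the pre-processing and search the model settings itself, leaving the user to supply the data and consume the predictions.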

The Need for AutoML

The demand for machine learning systems has soared over the past few years. This is due to the success of ML in a wide range of applications today. However, even with this clear indication that machine learning can provide boosts to certain businesses, a lot of companies struggle to deploy ML models.

First, they need to set up a team of seasoned data scientists, who command premium salaries. Second, even with a great team in place, deciding which model best fits the problem often depends more on experience than on knowledge.

The success of machine learning in a wide range of applications has led to an ever-growing demand for machine learning systems that can be used off the shelf by non-experts¹. AutoML aims to automate as many steps of the ML pipeline as possible, with minimal human effort and without compromising the model’s performance.

Advantages

The advantages of AutoML can be summed up in three major points:

  • It increases productivity by automating repetitive tasks, which lets a data scientist focus on the problem rather than on the models.
  • It helps avoid the errors that creep into manual work.
  • Ultimately, it is a step towards democratizing machine learning by making the power of ML accessible to everybody.

AutoML Frameworks

Let’s take a look at some of the popular frameworks that aim to automate part or all of the machine learning pipeline. This isn’t an exhaustive list, but I’ve tried to cover the ones that are widely used.

1. MLBox

MLBox is a powerful automated machine learning Python library. According to the official documentation, this library provides the following features:

  • Fast reading and distributed data preprocessing/cleaning/formatting.
  • Highly robust feature selection, leak detection, and accurate hyperparameter optimization
  • State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM,…)
  • Prediction with model interpretation
  • It has already been tested on Kaggle and performs well. (see Kaggle “Two Sigma Connect: Rental Listing Inquiries” | Rank: 85/2488)

Pipeline

MLBox’s main package contains 3 sub-packages for automating the following tasks:

  • Pre-processing: for reading and pre-processing data.
  • Optimization: for testing and cross-validating the models
  • Prediction: for making predictions.
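The optimisation step works over a user-supplied search space. In the demo further down, each hyperparameter is described either as a "choice" among listed values or as a "uniform" range. The sketch below shows one way candidates could be drawn from such a description using plain random sampling. This is an assumption made for illustration: the function sample_configuration is hypothetical, and the article does not describe the search strategy MLBox actually uses.

```python
import random

def sample_configuration(space, rng):
    """Draw one hyperparameter configuration from a search-space description."""
    config = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            # Pick one of the listed values
            config[name] = rng.choice(spec["space"])
        elif spec["search"] == "uniform":
            # Draw a float between the two bounds
            low, high = spec["space"]
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported search type: {spec['search']}")
    return config

# A reduced version of the search space used in the demo
space = {
    "est__strategy": {"search": "choice", "space": ["LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [5, 6, 7, 8, 9]},
    "est__subsample": {"search": "uniform", "space": [0.8, 0.95]},
}

rng = random.Random(0)  # fixed seed so the draws are repeatable
candidates = [sample_configuration(space, rng) for _ in range(15)]
for candidate in candidates[:3]:
    print(candidate)
```

Each candidate would then be scored by cross-validation, and the best-scoring one kept.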

Installation

Currently, MLBox is only compatible with Linux, but Windows and MacOS support will be added very soon.

Demo

The following notebook runs the MLBox AutoML package on the well-known House Prices regression problem.

Inputs and imports: that's all you need to give!

In [1]:
# Explicit imports for np, pd and make_scorer, which the notebook
# otherwise relies on the star imports below to provide.
import numpy as np
import pandas as pd
from sklearn.metrics import make_scorer

from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *

Output:
Using Theano backend.

In [2]:
paths = ["../input/train.csv", "../input/test.csv"]
target_name = "SalePrice"

Now let MLBox do the job!

... to read and clean all the files

In [3]:
rd = Reader(sep=",")
df = rd.train_test_split(paths, target_name)  # reading and preprocessing (dates, ...)

Output:
reading csv : train.csv ...
cleaning data ...
CPU time: 1.150022268295288 seconds

reading csv : test.csv ...
cleaning data ...
CPU time: 1.1377265453338623 seconds

> Number of common features : 80

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 43
> Number of numerical features: 37
> Number of training samples : 1460
> Number of test samples : 1459

> Top sparse features (% missing values on train set):
PoolQC         99.5
MiscFeature    96.3
Alley          93.8
Fence          80.8
FireplaceQu    47.3
dtype: float64

> Task : regression
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

In [4]:
dft = Drift_thresholder()
df = dft.fit_transform(df)  # removing non-stable features (like ID, ...)

Output:
computing drifts ...
CPU time: 1.1901113986968994 seconds

> Top 10 drifts

('Id', 1.0)
('2ndFlrSF', 0.043727406089464349)
('FireplaceQu', 0.042353711516121217)
('Exterior1st', 0.040058391064138776)
('HeatingQC', 0.037907300453223325)
('GrLivArea', 0.034105310873727035)
('TotRmsAbvGrd', 0.030938611129773586)
('BsmtFinType1', 0.030811329215275407)
('FullBath', 0.029647240388988916)
('MSSubClass', 0.028765488729139754)

> Deleted variables : ['Id']
> Drift coefficients dumped into directory : save

... to tune all the hyper-parameters

In [5]:
# Root mean squared error; greater_is_better=False makes the scorer return the negated value
rmse = make_scorer(
    lambda y_true, y_pred: np.sqrt(np.sum((y_true - y_pred)**2) / len(y_true)),
    greater_is_better=False,
    needs_proba=False,
)
opt = Optimiser(scoring=rmse, n_folds=3)

Output (stderr):
/opt/conda/lib/python3.6/site-packages/mlbox/optimisation/optimiser.py:78: UserWarning: Optimiser will save all your fitted models into directory 'save/joblib'. Please clear it regularly.
  +str(self.to_path)+"/joblib'. Please clear it regularly.")

LightGBM

In [6]:
space = {
    'est__strategy': {"search": "choice", "space": ["LightGBM"]},
    'est__n_estimators': {"search": "choice", "space": [150]},
    'est__colsample_bytree': {"search": "uniform", "space": [0.8, 0.95]},
    'est__subsample': {"search": "uniform", "space": [0.8, 0.95]},
    'est__max_depth': {"search": "choice", "space": [5, 6, 7, 8, 9]},
    'est__learning_rate': {"search": "choice", "space": [0.07]},
}

params = opt.optimise(space, df, 15)

Output (first of 15 trials, shown in full):
##### testing hyper-parameters... #####

>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}

>>> CA ENCODER :{'strategy': 'label_encoding'}

>>> ESTIMATOR :{'strategy': 'LightGBM', 'colsample_bytree': 0.8137091996089085, 'learning_rate': 0.07, 'max_depth': 9, 'n_estimators': 150, 'subsample': 0.9324125554458768, 'boosting_type': 'gbdt', 'max_bin': 255, 'min_child_samples': 10, 'min_child_weight': 5, 'min_split_gain': 0, 'nthread': -1, 'num_leaves': 31, 'objective': 'regression', 'reg_alpha': 0, 'reg_lambda': 0, 'seed': 0, 'silent': True, 'subsample_for_bin': 50000, 'subsample_freq': 1}

MEAN SCORE : make_scorer(<lambda>, greater_is_better=False) = -70031.2767813
VARIANCE : 3134.28444012 (fold 1 = -74463.7952605, fold 2 = -67801.1115712, fold 3 = -67828.9235122)
CPU time: 271.8608019351959 seconds

Summary of all 15 trials (values rounded; the NA encoder, CA encoder, and all other estimator settings are the same as in trial 1):

Trial  max_depth  colsample_bytree  subsample  MEAN SCORE  VARIANCE  fold 1    fold 2    fold 3    CPU time (s)
1      9          0.8137            0.9324     -70031.3    3134.3    -74463.8  -67801.1  -67828.9  271.9
2      9          0.9418            0.9481     -70008.8    3154.3    -74469.7  -67786.1  -67770.7  253.3
3      9          0.9181            0.9234     -70010.1    3151.7    -74467.3  -67783.5  -67779.6  121.5
4      5          0.9023            0.8911     -70136.5    3104.3    -74526.6  -67929.8  -67953.1  74.5
5      8          0.9080            0.8186     -69997.8    3134.8    -74431.0  -67800.2  -67762.2  195.2
6      5          0.8767            0.9435     -70145.5    3106.7    -74539.0  -67936.3  -67961.1  85.5
7      8          0.9462            0.9024     -70019.6    3145.2    -74467.6  -67809.7  -67781.4  110.1
8      5          0.8315            0.8566     -70156.9    3095.9    -74535.1  -67944.0  -67991.7  70.1
9      7          0.8211            0.8729     -70032.4    3126.4    -74453.7  -67828.4  -67815.0  238.7
10     5          0.8299            0.8404     -70168.6    3105.1    -74559.8  -67953.5  -67992.5  40.3
11     7          0.9165            0.8243     -70008.3    3134.6    -74441.2  -67799.4  -67784.1  49.8
12     8          0.9205            0.9387     -70016.4    3158.6    -74483.3  -67775.6  -67790.2  68.8
13     7          0.9100            0.8043     -70013.2    3130.8    -74440.8  -67800.3  -67798.5  71.0
14     9          0.8032            0.9400     -70034.6    3133.2    -74465.5  -67801.6  -67836.7  72.5
15     5          0.8535            0.8507     -70157.9    3102.9    -74546.1  -67956.7  -67971.0  43.8

~~~~~ BEST HYPER-PARAMETERS ~~~~~

{'est__colsample_bytree': 0.9080209022062069, 'est__learning_rate': 0.07, 'est__max_depth': 8, 'est__n_estimators': 150, 'est__strategy': 'LightGBM', 'est__subsample': 0.8185595300825068}

But you can also tune the whole pipeline! Indeed, you can choose:

  • different strategies to impute missing values
  • different strategies to encode categorical features (entity embeddings, ...)
  • different strategies and thresholds to select relevant features (random forest feature importance, l1 regularization, ...)
  • to add stacking meta-features!
  • different models and hyper-parameters (XGBoost, Random Forest, Linear, ...)

... to predict

In [7]:
prd = Predictor()
prd.fit_predict(params, df)

Output:
fitting the pipeline ...
CPU time: 27.68521022796631 seconds

[matplotlib figure; image data not available]

> Feature importances dumped into directory : save

predicting...
CPU time: 0.09216570854187012 seconds

> Overview on predictions :

   SalePrice_predicted
0        165921.195890
1        167701.648928
2        175319.175272
3        177518.368928
4        192895.614810
5        176685.509650
6        170356.903631
7        174661.891858
8        181593.424964
9        165921.195890

dumping predictions into directory : save ...

<mlbox.prediction.predictor.Predictor at 0x7f9430dd2780>

Formatting for submission

In [8]:
submit = pd.read_csv("../input/sample_submission.csv", sep=',')
preds = pd.read_csv("save/" + target_name + "_predictions.csv")

submit[target_name] = preds[target_name + "_predicted"].values

submit.to_csv("mlbox.csv", index=False)

That's all!!

2. Auto-Sklearn

Auto-Sklearn is an automated machine learning package built on top of Scikit-learn. Auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It includes feature engineering methods such as one-hot encoding, numeric feature standardization, PCA, and more. The models use sklearn estimators for classification and regression problems.

Auto-sklearn creates a pipeline and optimizes it using Bayesian search. Two components are added to Bayesian hyperparameter optimization of an ML framework: meta-learning for initializing the Bayesian optimizer and automated ensemble construction from configurations evaluated during optimization.
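To make the ensemble-construction step more concrete, here is a minimal pure-Python sketch of greedy ensemble selection: models that were already evaluated are added one at a time (with replacement), always picking the one that most reduces the validation error of the averaged prediction. This is not Auto-sklearn's own code. The function name `greedy_ensemble`, the toy predictions, and the choice of mean squared error are assumptions made for illustration.

```python
from collections import Counter


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def greedy_ensemble(val_preds, y_val, rounds=4):
    """Greedily pick models (with replacement) so that the average of
    their validation predictions has the lowest mean squared error.

    val_preds maps a model name to its predictions on the validation set.
    Returns a Counter of how often each model was picked.
    """
    chosen = []                         # model names picked so far
    running_sum = [0.0] * len(y_val)    # summed predictions of picked models
    for _ in range(rounds):
        best_name, best_err = None, float("inf")
        size = len(chosen) + 1
        for name, preds in val_preds.items():
            # Error of the ensemble if this model were added next
            avg = [(s + p) / size for s, p in zip(running_sum, preds)]
            err = mse(y_val, avg)
            if err < best_err:
                best_name, best_err = name, err
        chosen.append(best_name)
        running_sum = [s + p for s, p in zip(running_sum, val_preds[best_name])]
    return Counter(chosen)


# Toy validation predictions from three hypothetical models
y_val = [1.0, 2.0, 3.0, 4.0]
val_preds = {
    "model_a": [1.5, 2.5, 3.5, 4.5],   # consistently too high
    "model_b": [0.5, 1.5, 2.5, 3.5],   # consistently too low
    "model_c": [4.0, 4.0, 4.0, 4.0],   # poor constant predictor
}
weights = greedy_ensemble(val_preds, y_val, rounds=4)
print(weights)
```

In this toy setup the two models with opposite biases end up sharing the ensemble, because their errors cancel when averaged, while the constant predictor is never selected.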

Auto-sklearn performs well on small and medium-sized datasets, but it cannot be applied to modern deep learning systems that yield state-of-the-art performance on large datasets.

Installation

Auto-sklearn currently only works on Linux machines.

Demo

The following example shows how to fit a simple regression model with Auto-Sklearn.


import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.regression

def main():
    # Note: load_boston was removed in scikit-learn 1.2, so this example
    # needs an older scikit-learn release.
    X, y = sklearn.datasets.load_boston(return_X_y=True)
    # The fourth column (CHAS) is the only categorical feature
    feature_types = (['numerical'] * 3) + ['categorical'] + (['numerical'] * 9)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.regression.AutoSklearnRegressor(
        time_left_for_this_task=120,  # total search budget in seconds
        per_run_time_limit=30,        # budget per candidate model in seconds
        tmp_folder='/tmp/autosklearn_regression_example_tmp',
        output_folder='/tmp/autosklearn_regression_example_out',
    )
    automl.fit(X_train, y_train, dataset_name='boston',
               feat_type=feature_types)

    # Show the models in the final ensemble, then score on held-out data
    print(automl.show_models())
    predictions = automl.predict(X_test)
    print("R2 score:", sklearn.metrics.r2_score(y_test, predictions))


if __name__ == '__main__':
    main()


3. Tree-Based Pipeline Optimization Tool (TPOT)

TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming. TPOT extends the Scikit-learn framework but with its own regressor and classifier methods. TPOT works by exploring thousands of possible pipelines and finding the best one for your data.
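The idea of evolving pipelines can be illustrated with a toy example. The sketch below is a simplified evolutionary search over fixed-length pipeline descriptions, not genetic programming over pipeline trees as TPOT performs it, and it does not use the TPOT library. The component lists, the `fitness` function (a made-up stand-in for a cross-validation score), and all other names are assumptions for illustration only.

```python
import random

# Hypothetical building blocks a pipeline can be assembled from
SCALERS = ["none", "standard", "minmax"]
MODELS = ["tree", "forest", "logreg"]
DEPTHS = [2, 4, 8, 16]
OPTIONS = [SCALERS, MODELS, DEPTHS]


def random_pipeline(rng):
    """Draw one random (scaler, model, depth) combination."""
    return tuple(rng.choice(options) for options in OPTIONS)


def mutate(pipeline, rng):
    """Replace one randomly chosen component of the pipeline."""
    i = rng.randrange(len(OPTIONS))
    parts = list(pipeline)
    parts[i] = rng.choice(OPTIONS[i])
    return tuple(parts)


def crossover(a, b, rng):
    """Take each component from one of the two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def fitness(pipeline):
    """Stand-in for a cross-validation score (made-up numbers)."""
    scaler, model, depth = pipeline
    score = {"tree": 0.70, "forest": 0.80, "logreg": 0.75}[model]
    score += {"none": 0.00, "standard": 0.05, "minmax": 0.03}[scaler]
    score -= abs(depth - 8) * 0.005   # pretend depth 8 is the sweet spot
    return score


def evolve(generations=10, population_size=8, seed=0):
    """Keep the best half each generation and refill with mutated children."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print(best, round(fitness(best), 3))
```

Because the best half of each generation always survives, the best score can only stay the same or improve from one generation to the next. A real tool would replace `fitness` with an actual model evaluation, which is where almost all of the running time goes.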

TPOT cannot automatically process natural language inputs. It also cannot process categorical strings, which must be integer-encoded before being passed in as data.
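Integer encoding simply means replacing each distinct string with a number. The following pure-Python sketch shows one way to do it; the helper name `integer_encode` and the sample data are hypothetical, and in practice an existing encoder such as scikit-learn's `OrdinalEncoder` would usually be a better choice.

```python
def integer_encode(values):
    """Map each distinct string to an integer, in order of first appearance."""
    mapping = {}
    encoded = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)   # next unused integer code
        encoded.append(mapping[value])
    return encoded, mapping


colors = ["red", "green", "blue", "green", "red"]
codes, mapping = integer_encode(colors)
print(codes)    # [0, 1, 2, 1, 0]
print(mapping)  # {'red': 0, 'green': 1, 'blue': 2}
```

The returned mapping should be kept so that the same codes can be applied to new data at prediction time.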

Installation

For detailed steps, see the installation instructions in the TPOT documentation.

Demo

Demonstrating the working of TPOT for classifying MNIST digits.

4. H2O

H2O is a fully open source, distributed in-memory machine learning platform from the company H2O.ai. With support for both R and Python, H2O supports the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning models, and more.

H2O includes an automatic machine learning module that uses its own algorithms to build a pipeline. It performs an exhaustive search over its feature engineering methods and model hyperparameters to optimize its pipelines.
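As a rough illustration of what an exhaustive search over hyperparameters means, the sketch below tries every combination in a small grid and keeps the best one. It shows only the general idea in plain Python and says nothing about H2O's internals. The parameter names, the grid values, and the `toy_score` function (a made-up stand-in for a validation metric) are assumptions.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Made-up validation metric that peaks at max_depth=5, learn_rate=0.1."""
    depth_penalty = abs(params["max_depth"] - 5) * 0.05
    rate_penalty = abs(params["learn_rate"] - 0.1)
    return 1.0 - depth_penalty - rate_penalty


grid = {"max_depth": [3, 5, 7], "learn_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of combinations grows multiplicatively with every added hyperparameter, which is why AutoML tools typically pair this kind of search with a time or model-count budget.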

H2O automates some of the most difficult data science and machine learning workflows, such as feature engineering, model validation, model tuning, model selection and model deployment. In addition to this, it also offers automatic visualizations and machine learning interpretability (MLI).

Installation

Follow the link below to download and install H2O on your systems.

Demo

Here’s an example showing the basic usage of the H2OAutoML class in Python:

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 30 seconds
aml = H2OAutoML(max_runtime_secs = 30)
aml.train(x = x, y = y,
          training_frame = train,
          leaderboard_frame = test)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb

# model_id                                            auc       logloss
# --------------------------------------------------  --------  ---------
#           StackedEnsemble_model_1494643945817_1709  0.780384  0.561501
# GBM_grid__95ebce3d26cd9d3997a3149454984550_model_0  0.764791  0.664823
# GBM_grid__95ebce3d26cd9d3997a3149454984550_model_2  0.758109  0.593887
#                          DRF_model_1494643945817_3  0.736786  0.614430
#                        XRT_model_1494643945817_461  0.735946  0.602142
# GBM_grid__95ebce3d26cd9d3997a3149454984550_model_3  0.729492  0.667036
# GBM_grid__95ebce3d26cd9d3997a3149454984550_model_1  0.727456  0.675624
# GLM_grid__95ebce3d26cd9d3997a3149454984550_model_1  0.685216  0.635137
# GLM_grid__95ebce3d26cd9d3997a3149454984550_model_0  0.685216  0.635137


# The leader model is stored here
aml.leader


# If you need to generate predictions on a test set, you can make
# predictions directly on the `"H2OAutoML"` object, or on the leader
# model object directly

preds = aml.predict(test)

# or:
preds = aml.leader.predict(test)

Output

The AutoML object includes a “leaderboard” of models that were trained in the process, ranked by a default metric based on the problem type (the second column of the leaderboard). An example leaderboard for a binary classification task, ranked by AUC, appears as comments in the code above.
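The ranking itself is easy to reproduce by hand. The sketch below sorts a few rows copied from the example leaderboard in the H2O code above and picks the leader; it uses plain Python tuples rather than H2O objects, and the helper name `rank_models` is hypothetical.

```python
# (model_id, auc, logloss) rows copied from the example leaderboard
rows = [
    ("DRF_model_1494643945817_3", 0.736786, 0.614430),
    ("StackedEnsemble_model_1494643945817_1709", 0.780384, 0.561501),
    ("XRT_model_1494643945817_461", 0.735946, 0.602142),
    ("GBM_grid__95ebce3d26cd9d3997a3149454984550_model_0", 0.764791, 0.664823),
]


def rank_models(rows, metric_index=1, higher_is_better=True):
    """Sort leaderboard rows by the chosen metric column."""
    return sorted(rows, key=lambda row: row[metric_index], reverse=higher_is_better)


leaderboard = rank_models(rows)   # rank by AUC, highest first
leader = leaderboard[0]

for model_id, auc, logloss in leaderboard:
    print(f"{model_id:<52} {auc:.6f} {logloss:.6f}")
print("Leader:", leader[0])
```

Ranking by a different metric only changes the sort key. For example, `rank_models(rows, metric_index=2, higher_is_better=False)` orders the same rows by log loss, where lower is better.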

5. AutoKeras

Auto-Keras is an open source software library for automated machine learning, built by DATA Lab at Texas A&M University. Based on the Keras deep learning framework, it provides functions to automatically search for the architecture and hyperparameters of deep learning models.

The API’s design follows the classic design of the Scikit-Learn API; hence, it’s extremely simple to use. The current version provides functionalities to automatically search for hyperparameters during the deep learning process.

Auto-Keras aims to simplify the ML process through automated Neural Architecture Search (NAS) algorithms. Neural Architecture Search essentially replaces the manual work of a deep learning engineer or practitioner with a set of algorithms that automatically tune the model.

Installation

The installation part is also very simple:

Demo

Here’s a demo of Auto-Keras library on the MNIST dataset:

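from keras.datasets import mnist
from autokeras import ImageClassifier

# Note: this example uses the early (pre-1.0) Auto-Keras interface;
# later releases use a different API.
if __name__ == '__main__':
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    # Add a trailing channel dimension: (n, 28, 28) -> (n, 28, 28, 1)
    x_train = x_train.reshape(x_train.shape + (1,))
    x_test = x_test.reshape(x_test.shape + (1,))
    clf = ImageClassifier(verbose=True, augment=False)
    # Search for an architecture for at most 30 minutes
    clf.fit(x_train, y_train, time_limit=30 * 60)
    # Retrain the best architecture found during the search
    clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
    accuracy = clf.evaluate(x_test, y_test)

    print(accuracy * 100)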

6. Cloud AutoML

Cloud AutoML is a suite of machine learning products from Google that enables developers with limited machine learning expertise to train high-quality models specific to their business needs by leveraging Google’s state-of-the-art transfer learning and Neural Architecture Search technology.

Cloud AutoML provides a simple graphical user interface (GUI) to train, evaluate, improve, and deploy models based on your own data. Currently, the suite provides the following AutoML solutions:

The downside of Google’s AutoML is that it isn’t open source and hence comes with a price. In the case of AutoML Vision, the cost depends both on the time taken to train the model and on how many images you send to AutoML Vision for predictions. The pricing is as follows:

7. TransmogrifAI

TransmogrifAI is an open source automated machine learning library from Salesforce. The company’s flagship ML platform, Einstein, is also powered by TransmogrifAI. It is an end-to-end AutoML library for structured data, written in Scala, that runs on top of Apache Spark. TransmogrifAI is especially useful when you need to:

  • Rapidly train good-quality machine-learned models with minimal hand tuning
  • Build modular, reusable, strongly-typed machine learning workflows

Installation

Some prerequisites, such as Java and Spark, need to be installed first.

Read the documentation for complete installation instructions.

Demo

Predicting Titanic Survivors with TransmogrifAI. See the entire example here.

import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._

// Read Titanic data as a DataFrame
val passengersData = DataReaders.Simple.csvCase[Passenger](path = pathToData).readDataset().toDF()

// Extract response and predictor features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val (pred, raw, prob) = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()

println("Model summary:\n" + model.summaryPretty())

Check out the results of the above code here.

The Future of AutoML

Essentially, the purpose of AutoML is to automate repetitive tasks such as pipeline creation and hyperparameter tuning so that data scientists can spend more of their time on the business problem at hand.

AutoML also aims to make the technology available to everybody rather than a select few. AutoML and data scientists can work in conjunction to accelerate the ML process so that the real effectiveness of machine learning can be utilized.

Whether or not AutoML becomes a success depends mainly on its adoption and the advancements that are made in this sector. However, it’s clear that AutoML is a big part of the future of machine learning.


Fritz

Our team has been at the forefront of Artificial Intelligence and Machine Learning research for more than 15 years and we're using our collective intelligence to help others learn, understand and grow using these new technologies in ethical and sustainable ways.
