Boosting your Machine Learning Models Using XGBoost

In this tutorial we’ll cover XGBoost, a machine learning algorithm that has dominated the applied machine learning space recently.

Plan of Attack

  1. What is XGBoost?
  2. Why would you use XGBoost?
  3. Boosting Vis-a-vis Bagging
  4. Applying XGBoost in Python
  5. XGBoost’s Hyperparameters
  6. Cross Validation when using XGBoost
  7. Visualizing Feature Importance in XGBoost
  8. Conclusion

What is XGBoost?

XGBoost is an open source library that provides gradient boosting for Python, Java, C++, R, and Julia. In this tutorial, our focus will be on Python. Gradient boosting is a machine learning technique for classification and regression problems that produces a prediction from an ensemble of weak learners, typically decision trees.

Why would you use XGBoost?

The primary reasons you’d use this algorithm are its accuracy and efficiency. It offers both a linear model and a tree learning algorithm, and it can run parallel computations on a single machine. It also has extra features for doing cross validation and computing feature importance. Below are some of the main features of the library:

  • Sparsity: It accepts sparse input for both the tree booster and the linear booster.
  • Customization: It supports customized objective and evaluation functions.
  • DMatrix: An optimized data structure that improves its performance and efficiency.

Boosting Vis-a-Vis Bagging

Boosting is an ensemble technique that reduces bias and variance by converting a collection of weak learners into a strong learner: each new learner is trained to correct the errors of the learners that came before it. XGBoost is an example of a boosting algorithm. Bagging, on the other hand, is a technique whereby one trains each learner on a random sample of the data and then averages (or votes on) their predictions.
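
As a quick illustration of the difference, here is a minimal sketch using scikit-learn’s BaggingRegressor and GradientBoostingRegressor (not XGBoost itself, and with an artificial dataset) purely to show the two ideas side by side:

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=500, n_features=10, random_state=0)

# Bagging: many trees trained independently on bootstrap samples; predictions are averaged
bagger = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Boosting: trees trained sequentially, each correcting the errors of the ones before it
booster = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)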

Applying XGBoost in Python

Next let’s show how you can apply XGBoost to your machine learning models. If you don’t have XGBoost installed, follow the installation instructions in the official documentation for your operating system. If you’re using pip for package management, you can install XGBoost by typing this command in the terminal:
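
pip install xgboost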

We’ll use the Boston Housing dataset that ships with Scikit-learn (note that load_boston has been removed from recent versions of Scikit-learn, so you may need an older version to follow along exactly). We assume that the reader has a basic understanding of scientific Python packages such as Pandas, Scikit-learn, and NumPy.

We kick off by loading the dataset from sklearn.datasets. We then import pandas so we can convert the Boston Housing dataset to a DataFrame. Next we set the column names using the feature_names attribute, and we obtain the target variable, which in this case is the price column, using the target attribute.

from sklearn.datasets import load_boston
import pandas as pd

# Load the dataset and convert it to a DataFrame
boston = load_boston()
data = pd.DataFrame(boston.data)

# Use the feature names as column names and add the target as the PRICE column
data.columns = boston.feature_names
data['PRICE'] = boston.target

We’re going to use XGBoost to predict the price column of the dataset. In this case, all the features in the dataset are numerical. It’s important to note that XGBoost works only with numerical values. If we had categorical features, we would have to convert them to numbers using techniques such as one-hot encoding.
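
For illustration, here is a minimal sketch of one-hot encoding with pandas; the neighborhood column is hypothetical, since the Boston dataset itself has no categorical features:

# Hypothetical DataFrame with a categorical 'neighborhood' column
df = pd.DataFrame({'neighborhood': ['A', 'B', 'A', 'C'], 'rooms': [3, 4, 2, 5]})

# get_dummies replaces the categorical column with one binary column per category
df_encoded = pd.get_dummies(df, columns=['neighborhood'])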

Next we import XGBoost, numpy and mean_squared_error, which we’ll use as our evaluation metric to check the performance of the trained model on the test dataset. We then move forward to separate the feature variables from the target variable using pandas iloc utility.

import xgboost as xgb
from sklearn.metrics import mean_squared_error
import numpy as np

# Features are every column except the last (PRICE); the target is the last column
X, y = data.iloc[:, :-1], data.iloc[:, -1]

In order to take full advantage of XGBoost’s performance and efficiency, we convert the dataset into a DMatrix. This is achieved using XGBoost’s DMatrix class.

data_dmatrix = xgb.DMatrix(data=X,label=y)

XGBoost’s Hyperparameters

XGBoost provides a way for us to tune parameters in order to obtain the best results. The most common general parameters for tree-based learners such as XGBoost are listed below (a short sketch of passing them to XGBoost follows the list):

  • booster specifies which booster to use. It can be gbtree, gblinear, or dart. gbtree and dart use tree-based models, while gblinear uses linear functions. gbtree is the default.
  • silent: 0 means printing running messages, 1 means silent mode. The default is 0.
  • nthread is the number of parallel threads used to run XGBoost.
  • disable_default_eval_metric is the flag to disable the default metric. Set it to a value greater than 0 to disable. The default is 0.
  • num_pbuffer is the size of the prediction buffer, normally set to the number of training instances. The buffers are used to save the prediction results of the last boosting step. It’s set automatically by XGBoost, so it doesn’t need to be set by the user.
  • num_feature is the feature dimension used in boosting, set to the maximum dimension of the features. It’s set automatically by XGBoost, so again, it doesn’t need to be set by the user.
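
As a quick sketch, general parameters like these go into a plain dictionary when using XGBoost’s native training API. The values below are illustrative only, and data_dmatrix is the DMatrix we created above:

# Illustrative values; on newer XGBoost versions the silent flag is replaced by verbosity
general_params = {'booster': 'gbtree', 'nthread': 4}

# xgb.train takes the parameter dictionary, a DMatrix, and the number of boosting rounds
bst = xgb.train(params=general_params, dtrain=data_dmatrix, num_boost_round=10)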

We then proceed to split our dataset into training and testing sets using the train_test_split function from the model_selection module. We use a test size of 20% and set the random state to 100 to ensure reproducible results.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

The next step is to create an instance of the XGBoost Regressor class and pass the parameters as arguments. The parameters are explained below:

  1. objective='reg:linear' specifies that the learning task is regression with squared loss (newer versions of XGBoost call this objective reg:squarederror).
  2. colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once in every boosting iteration. This number ranges from 0 to 1.
  3. learning_rate is the step size shrinkage and is used to prevent overfitting. This number ranges from 0 to 1.
  4. max_depth specifies the maximum depth of a tree. Increasing this number makes the model more complex and increases the possibility of overfitting. The default is 6.
  5. alpha is the L1 regularization term on weights. Increasing this number makes the model more conservative.
  6. n_estimators is the number of boosted trees to fit.

xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)

The next step is to fit the regressor and make predictions using it.

xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)

After this we compute the root mean squared error in order to evaluate the performance of our model.

rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

Cross Validation when Using XGBoost

It’s very common practice to use cross validation to make models more robust. XGBoost supports k-fold cross validation via the cv() function. We’ll use this to apply cross validation to our model.

We now specify a new variable params to hold all the parameters apart from n_estimators, because we’ll use num_boost_round from the cv() utility instead. The parameters taken by the cv() utility are explained below:

  1. dtrain is the data to be trained.
  2. params specifies the booster parameters.
  3. nfold is the number of folds in the cross validation function.
  4. num_boost_round is the number of boosting iterations.
  5. early_stopping_rounds activates early stopping. CV error needs to decrease at least every <early_stopping_rounds> round(s) to continue.
  6. metrics are the evaluation metrics to be watched in the cross validation.
  7. as_pandas, if True, returns a pandas DataFrame; if False, it returns a NumPy array.

params = {'objective': 'reg:linear', 'colsample_bytree': 0.3, 'learning_rate': 0.1,
          'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5,
                    num_boost_round=50, early_stopping_rounds=10, metrics="rmse",
                    as_pandas=True, seed=100)

The cv_results variable contains the train and test RMSE for each boosting round. The metric for the final boosting round can be obtained as follows:

print((cv_results["test-rmse-mean"]).tail(1))
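
Since we passed as_pandas=True, cv_results is a regular pandas DataFrame, so the usual inspection methods apply if you want to look at more than the final round:

# Show the first few boosting rounds with their train and test RMSE
print(cv_results.head())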

Visualizing Feature Importance in XGBoost

You may be interested in seeing the most important features in the dataset.

XGBoost has a plot_importance() function that enables you to see all the features in the dataset ranked by their importance. This can be achieved using Matplotlib and by passing in our already fitted regressor.

import matplotlib.pyplot as plt

# Set the figure size before plotting so it applies to the importance plot
plt.rcParams['figure.figsize'] = [5, 5]
xgb.plot_importance(xg_reg)
plt.show()

You can use the above visualization to select the most relevant features for your machine learning model.
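
If you’d rather act on the importance ranking programmatically, one option, sketched here under the assumption that the fitted xg_reg and the X_train split from above are available, is scikit-learn’s SelectFromModel, which keeps only the features whose importance clears a threshold:

from sklearn.feature_selection import SelectFromModel

# Keep features whose importance is at least the median importance; prefit=True reuses xg_reg as-is
selector = SelectFromModel(xg_reg, threshold='median', prefit=True)
X_train_selected = selector.transform(X_train)
print(X_train_selected.shape)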

Conclusion

There are other techniques such as Grid Search that you can use to improve your machine learning model. Grid Search works by doing an exhaustive search over specified parameter values for an estimator. You can learn more about XGBoost by visiting the official documentation.
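
As a parting sketch (the parameter values are illustrative only, and we assume the X_train and y_train splits from above), scikit-learn’s GridSearchCV can wrap an XGBoost regressor directly:

from sklearn.model_selection import GridSearchCV

# Candidate values to search over, chosen purely for illustration
param_grid = {'max_depth': [3, 5, 7], 'learning_rate': [0.05, 0.1, 0.3]}

# Uses the regressor's default regression objective; scores each combination with 3-fold CV
grid = GridSearchCV(xgb.XGBRegressor(n_estimators=10), param_grid,
                    scoring='neg_mean_squared_error', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)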
