Hands-on with Feature Engineering Techniques: Handling Outliers

This post is part of a series about feature engineering techniques for machine learning with Python.

Welcome back! In this post of our series on feature engineering, we’re going to focus on another common issue in most datasets: outliers. We’ll look at what an outlier is and walk through different methods for handling them, along with some code snippets. Let’s get started.

Outliers

An outlier is a data point that’s significantly different from the remaining data.

Another way of saying this is that an outlier is an observation that deviates so much from the other observations that it arouses suspicion it was generated by a different mechanism.

Detecting Outliers

We can detect outliers using various techniques. Some of them include:

  • Using visualization plots like box plots and scatter plots (see the sketch at the end of this section).
  • Using a normal distribution (mean and standard deviation):

In a normal distribution, about 99.7% of the data lies within three standard deviations of the mean. Consequently, if an observation sits more than three standard deviations away from the mean, it’s possibly an outlier.
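
As a minimal sketch of this rule (assuming a pandas DataFrame with a hypothetical numeric column named "age"), we can flag every observation that lies more than three standard deviations from the mean:

# import the needed packages
import pandas as pd

# read your data
data = pd.read_csv("yourData.csv")

# flag observations more than three standard deviations
# from the mean ("age" is a hypothetical column name)
mean = data["age"].mean()
std = data["age"].std()
is_outlier = (data["age"] - mean).abs() > 3 * std

# inspect the flagged values
print(data.loc[is_outlier, "age"])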

Using the interquartile range (IQR) proximity rule:

The concept of the interquartile range (IQR) is used to build box plot graphs. The idea is simple: we divide our data into four parts, and each part is a quartile.

The IQR is the difference between the third quartile, Q3 (the 75th percentile), and the first quartile, Q1 (the 25th percentile).

With IQR, outliers are defined as the observations that are:

  • Below Q1 − 1.5 × IQR.
  • Above Q3 + 1.5 × IQR.
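
Box plots are built from these same quartiles, so they’re a quick way to spot outliers visually. Here’s the sketch promised above (a minimal example, assuming matplotlib is installed and the same hypothetical "age" column):

# import the needed packages
import pandas as pd
import matplotlib.pyplot as plt

# read your data
data = pd.read_csv("yourData.csv")

# the box spans Q1 to Q3, the whiskers reach the last observations
# within 1.5 * IQR, and any points drawn beyond the whiskers are
# the outliers under this rule ("age" is a hypothetical column)
data.boxplot(column="age")
plt.show()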


Handling Outliers

Now that we understand how to detect outliers, it’s time to engineer them. We’re going to explore a few different techniques to achieve that:

  • Trimming: Simply removing the outliers from our dataset.
  • Imputing: We treat outliers as missing data, and we apply missing data imputation techniques.
  • Discretization: We place outliers in edge bins with higher or lower values of the distribution.
  • Censoring: Capping the variable distribution at the maximum and minimum values.

Trimming

Trimming (or truncation) simply means removing outliers from the dataset; all we need to do is decide on a metric for determining which observations count as outliers.

Here are some points to consider when working with the trimming method:

  • This method is fast.
  • It can remove a significant amount of data, so be careful.

Here’s a sample code snippet for trimming outliers with Python:

# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# iterate over the numeric columns only
for variable in data.select_dtypes(include=np.number).columns:
    # calculate the IQR
    IQR = data[variable].quantile(0.75) - data[variable].quantile(0.25)
    
    # calculate the boundaries
    lower = data[variable].quantile(0.25) - (IQR * 1.5)
    upper = data[variable].quantile(0.75) + (IQR * 1.5)
    
    # flag the outliers
    outliers = (data[variable] > upper) | (data[variable] < lower)
    
    # remove the flagged rows from the data
    data = data.loc[~outliers]

Censoring

Censoring (or capping) means setting the maximum and/or the minimum of the distribution at an arbitrary value.

In other words, values bigger than the chosen maximum, or smaller than the chosen minimum, are replaced by that boundary value.

When doing capping, remember that:

  • It does not remove data.
  • It distorts the distributions of the variables.

The values at which to cap the distribution can be determined using various methods, which we’ll cover below.

Arbitrarily

You can choose values to replace outliers arbitrarily; this can be based on the requirements of your use case. Here’s a code snippet:

# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# create arbitrary boundaries (age, for example;
# "age" is a hypothetical column name)
lower = 10
upper = 89

# replace the outliers with the boundary values
data["age"] = np.where(data["age"] > upper, upper,
              np.where(data["age"] < lower, lower, data["age"]))
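
As a side note, pandas offers an equivalent built-in: data["age"].clip(lower, upper) caps the column at the same boundaries in a single call.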

Interquartile range proximity rule

In this method, the boundaries are determined using the IQR proximity rule:

# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# iterate over the numeric columns only
for variable in data.select_dtypes(include=np.number).columns:
    # calculate the IQR
    IQR = data[variable].quantile(0.75) - data[variable].quantile(0.25)
    
    # calculate the boundaries
    lower = data[variable].quantile(0.25) - (IQR * 1.5)
    upper = data[variable].quantile(0.75) + (IQR * 1.5)
    
    # cap the outliers at the boundaries
    data[variable] = np.where(data[variable] > upper, upper,
                     np.where(data[variable] < lower, lower, data[variable]))

Gaussian approximation

Another code snippet, this time setting the boundaries according to the mean and standard deviation:

# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# iterate over the numeric columns only
for variable in data.select_dtypes(include=np.number).columns:
    # calculate the boundaries: mean +/- 3 standard deviations
    lower = data[variable].mean() - 3 * data[variable].std()
    upper = data[variable].mean() + 3 * data[variable].std()
    
    # cap the outliers at the boundaries
    data[variable] = np.where(data[variable] > upper, upper,
                     np.where(data[variable] < lower, lower, data[variable]))

Using quantiles

In the following code snippet, the boundaries are determined using quantiles, so you can cap at any percentage you want:

# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# iterate over the numeric columns only
for variable in data.select_dtypes(include=np.number).columns:
    # calculate the boundaries: the 10th and 90th percentiles
    lower = data[variable].quantile(0.10)
    upper = data[variable].quantile(0.90)
    
    # cap the outliers at the boundaries
    data[variable] = np.where(data[variable] > upper, upper,
                     np.where(data[variable] < lower, lower, data[variable]))

Imputing

Another technique used to handle outliers is to treat them as missing data. We have a range of methods that we can use to replace or impute outliers. If you’d like to explore these techniques in more depth, you can do so here.
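
As a minimal sketch of this idea (using the IQR rule from earlier, the same hypothetical "age" column, and the median as one possible imputation value), we can first convert the outliers to missing values and then fill them:

# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# calculate the IQR boundaries for the column
IQR = data["age"].quantile(0.75) - data["age"].quantile(0.25)
lower = data["age"].quantile(0.25) - (IQR * 1.5)
upper = data["age"].quantile(0.75) + (IQR * 1.5)

# treat the outliers as missing data
data["age"] = data["age"].where(data["age"].between(lower, upper), np.nan)

# impute them, here with the median as an example
data["age"] = data["age"].fillna(data["age"].median())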

Transformation

We can also apply some mathematical transformations, such as the log transformation, to reduce the impact of outliers. There’s a range of transformation techniques, which you can learn more about here.
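
As one example, here’s a minimal sketch of a log transformation (using np.log1p, which handles zeros gracefully and assumes the hypothetical "age" column holds non-negative values):

# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# log1p compresses the right tail of the distribution,
# pulling large outliers closer to the rest of the data
data["age_log"] = np.log1p(data["age"])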

Conclusion

To sum things up, we’ve learned how to detect outliers in our datasets and explored multiple methods we can use to handle them.

I hope this post will get you started with engineering outliers—the practices described here can certainly enhance your data science toolkit.

