Hands-on with Feature Engineering Techniques: Imputing Missing Values

This post is part of a series about feature engineering techniques for machine learning with Python, and you can check out the rest of the articles in the series.

Welcome back! In this post, we're going to cover the different imputation techniques used when dealing with missing data. We'll also explore a few code snippets you can use directly in your machine learning and data science projects.

The objective is to learn multiple techniques and understand their impacts on our variables and machine learning models.

Data Imputation

Data imputation is the act of replacing missing data with statistical estimates of the missing values.

The goal of any imputation technique is to produce a complete dataset to use in the process of training machine learning models.

To help with this process, we’ll need to use a library called feature-engine that can simplify the process of imputing missing values. You can easily pip install it:
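
# install feature-engine from PyPI
pip install feature-engine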

Missing Data Imputation Techniques

We’re going to dive into techniques that apply to numerical and categorical variables, and also some methods that apply to both:

Numerical Variables

  • Mean or median imputation
  • Arbitrary value imputation
  • End of tail imputation

Categorical Variables

  • Frequent category imputation
  • Add a missing category

Both

  • Complete case analysis
  • Add a missing indicator
  • Random sample imputation

Mean or Median Imputation

Mean or median imputation consists of replacing all occurrences of missing values (NA) within a variable with the mean or median of that variable.

Here are some points to consider when using this method:

  • If the variable follows a normal distribution, the mean and median are approximately the same.
  • If the variable has a skewed distribution, then the median is a better representation.

You can use this method when data is missing completely at random, and no more than 5% of the variable contains missing data.

Here’s an example of mean imputation:
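
As a minimal sketch of the idea, assuming a small pandas DataFrame with a hypothetical age column that contains missing values, you can compute the mean on the training data and fill the NAs with it:

import pandas as pd
import numpy as np

# hypothetical variable with missing values
train = pd.DataFrame({'age': [25, 30, np.nan, 40, np.nan, 35]})

# compute the mean on the training data and use it to fill the NAs
mean_age = train['age'].mean()
train['age'] = train['age'].fillna(mean_age)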

Assumptions of mean or median imputation

  • Data is missing at random.
  • The missing observations most likely look like the majority of the observations in the variable.

Advantages of mean or median imputation

  • Easy to implement.
  • Fast way of obtaining complete datasets.
  • It can be used in production, i.e., during model deployment.

Limitations of mean or median imputation

  • It distorts the original variable distribution and variance.
  • It distorts the covariance with the remaining dataset variables.
  • The higher the percentage of missing values, the higher the distortions.

Finally, here’s a Python code snippet:

import numpy as np
from sklearn.impute import SimpleImputer

# create the imputer; the strategy can be 'mean' or 'median'.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit the imputer to the train data
imputer.fit(train)

# apply the transformation to the train and test
train = imputer.transform(train)
test = imputer.transform(test)

Arbitrary Value Imputation

Arbitrary value imputation consists of replacing all occurrences of missing values (NA) within a variable with an arbitrary value. The arbitrary value should be different from the mean or median and not within the normal values of the variable.

We can use arbitrary values such as 0, 999, -999 (or other combinations of 9s) or -1 (if the distribution is positive).

Here is an example using 99 as an arbitrary value:
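
As a rough sketch, assuming a pandas DataFrame with a hypothetical age column, this is simply a fillna with the chosen arbitrary value:

import pandas as pd
import numpy as np

# hypothetical variable with missing values
train = pd.DataFrame({'age': [25, 30, np.nan, 40, np.nan, 35]})

# replace the missing values with the arbitrary value 99
train['age'] = train['age'].fillna(99)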

Assumptions of arbitrary value imputation

  • Data is not missing at random.

Advantages of arbitrary value imputation

  • Easy to implement.
  • It’s a fast way to obtain complete datasets.
  • It can be used in production, i.e., during model deployment.
  • It captures the importance of a value being “missing”, if there is one.

Limitations of arbitrary value imputation

  • Distortion of the original variable distribution and variance.
  • Distortion of the covariance with the remaining dataset variables.
  • If the arbitrary value is at the end of the distribution, it may mask or create outliers.
  • We need to be careful not to choose an arbitrary value too similar to the mean or median (or any other typical value of the variable distribution).
  • The higher the percentage of NA, the higher the distortions.

Here is a code snippet using 999 as an arbitrary value:

import numpy as np
from sklearn.impute import SimpleImputer

# create the imputer, with 999 as the arbitrary fill value
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=999)

# fit the imputer to the train data
imputer.fit(train)

# apply the transformation to the train and test
train = imputer.transform(train)
test = imputer.transform(test)

End of Tail Imputation

End of tail imputation is roughly equivalent to arbitrary value imputation, but it automatically selects the arbitrary values at the end of the variable distributions.

Here are ways to select arbitrary values:

  • If the variable follows a normal distribution, we can use the mean plus or minus 3 times the standard deviation.
  • If the variable is skewed, we can use the IQR proximity rule.

Here is another example using the age variable (which follows a normal distribution):

Normal Distribution

Most observations (~99.7%) of a normally distributed variable lie within the mean plus or minus three standard deviations, so the selected value is: mean ± 3 × standard deviation.

Skewed distributions

The general approach is to calculate the 25th and 75th quantiles, and then the inter-quartile range (IQR), as follows:

  • IQR = 75th quantile – 25th quantile.
  • Upper limit = 75th quantile + 3 × IQR.
  • Lower limit = 25th quantile – 3 × IQR.

The selected value for imputation is then either the upper limit or the lower limit, depending on the tail we impute at.
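
As a rough sketch of how these values can be computed by hand, assuming a pandas DataFrame train with a hypothetical age column:

import pandas as pd
import numpy as np

# hypothetical training variable with missing values
train = pd.DataFrame({'age': [22, 25, 27, 30, 35, 40, np.nan, np.nan]})

# Gaussian approach: mean plus 3 standard deviations (right tail)
gaussian_value = train['age'].mean() + 3 * train['age'].std()

# IQR proximity rule: 75th quantile plus 3 times the IQR (right tail)
iqr = train['age'].quantile(0.75) - train['age'].quantile(0.25)
iqr_value = train['age'].quantile(0.75) + 3 * iqr

# impute with whichever value matches the variable's distribution
train['age'] = train['age'].fillna(gaussian_value)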

Here's a code snippet using feature-engine, assuming a Gaussian distribution and imputing at the right tail of the variable distribution:

from feature_engine.missing_data_imputers import EndTailImputer
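# note: in recent versions of feature-engine, the imputers live in feature_engine.imputation,
# i.e. from feature_engine.imputation import EndTailImputer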

# create the imputer
imputer = EndTailImputer(distribution='gaussian', tail='right')

# fit the imputer to the train set
imputer.fit(train)

# transform the data
train_t = imputer.transform(train)
test_t = imputer.transform(test)

Frequent Category Imputation

Frequent category imputation—or mode imputation—consists of replacing all occurrences of missing values (NA) within a variable with the mode, or the most frequent value.

You can use this method when data is missing completely at random, and no more than 5% of the variable contains missing data.

Here’s an example:
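
As a minimal sketch, assuming a pandas DataFrame with a hypothetical categorical embarked column, the mode can be computed on the training data and used to fill the NAs:

import pandas as pd
import numpy as np

# hypothetical categorical variable with missing values
train = pd.DataFrame({'embarked': ['S', 'S', 'C', np.nan, 'S', 'Q', np.nan]})

# compute the most frequent category and fill the NAs with it
most_frequent = train['embarked'].mode()[0]
train['embarked'] = train['embarked'].fillna(most_frequent)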

Assumptions of frequent category imputation

  • Data is missing at random.
  • The missing observations most likely look like the majority of the observations (i.e. the mode).

Advantages of frequent category imputation

  • Easy to implement.
  • It’s a fast way to obtain a complete dataset.
  • It can be used in production, i.e., during model deployment.

Limitations of frequent category imputation

  • It distorts the relation of the most frequent label with other variables within the dataset.
  • It may lead to an over-representation of the most frequent label if there are a lot of missing observations.

And here's a code snippet for frequent category (mode) imputation:

import numpy as np
from sklearn.impute import SimpleImputer

# create the imputer, with 'most_frequent' as the strategy to fill missing values.
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# fit the imputer to the train data
imputer.fit(train)

# apply the transformation to the train and test
train = imputer.transform(train)
test = imputer.transform(test)

Missing Category Imputation

This method consists of treating missing data as an additional label or category of the variable. Thus, we create a new label or category by filling the missing observations with a Missing category.

Here is an illustration of that concept:
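
As a minimal sketch with a hypothetical categorical embarked column, this is just a fillna with a new "Missing" label:

import pandas as pd
import numpy as np

# hypothetical categorical variable with missing values
train = pd.DataFrame({'embarked': ['S', 'C', np.nan, 'Q', np.nan]})

# fill the NAs with a new "Missing" category
train['embarked'] = train['embarked'].fillna('Missing')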

Advantages of missing category imputation

  • Easy to implement.
  • Fast way of obtaining complete datasets.
  • It can be integrated into production.
  • Captures the importance of “missingness”.
  • No assumption made on the data.

Limitations of missing category imputation

If the number of missing values is small, creating an additional category is just adding another rare label to the variable.

The following code shows us how to fill the missing values with a new category called “Missing”:

import numpy as np
from sklearn.impute import SimpleImputer

# create the imputer, using the constant "Missing" as the fill value.
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value="Missing")

# fit the imputer to the train data
# make sure to select only the categorical variables in the train and test sets.
imputer.fit(train)

# apply the transformation to the train and test
train = imputer.transform(train)
test = imputer.transform(test)

Complete Case Analysis

Complete case analysis (CCA) is a technique that consists of discarding observations where values in any of the variables are missing.

In CCA, we keep only those observations for which there’s information in all of the dataset variables. Observations with any missing data are excluded.

Here is an illustration of the concept:

CCA Assumptions

  • Data is missing at random.

CCA Advantages

  • CCA is simple to implement.
  • No data manipulation required.
  • CCA preserves the distribution of the variables.

CCA Limitations

  • It can exclude a significant fraction of the original dataset (if missing data is significant).
  • Excludes observations that could be informative for the analysis (if data is not missing at random).
  • CCA can create a biased dataset if the complete cases differ from the original data (e.g., when missing information is, in fact, MAR or NMAR).
  • When using this method in production, the model can’t know how to handle missing data.

When to use CCA

  • Data is missing completely at random.
  • No more than 5% of the total dataset contains missing data.
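
Before applying it, you can check whether you're within that range; a minimal sketch, assuming data is your already-loaded pandas DataFrame:

# fraction of missing values per variable
print(data.isnull().mean())

# fraction of rows with at least one missing value
print(data.isnull().any(axis=1).mean())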

It's really straightforward to apply this technique. Here's the code for it:

# read your data first, then drop the incomplete rows
data.dropna(inplace=True)

# dropna will drop any row that has at least one variable with an NA value

Missing Indicator

A missing indicator is an additional binary variable that indicates whether the data was missing for an observation (1) or not (0). The goal here is to capture observations where data is missing.

The missing indicator is used together with methods that assume data is missing at random:

  • Mean, median, mode imputation.
  • Random sample imputation.

Here is an illustration of a missing indicator, alongside a random sample imputation (covered in more detail below):
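
In its simplest form, and assuming a pandas DataFrame with a hypothetical age column, the indicator is just a binary flag derived from isnull (the original variable still needs to be imputed separately):

import pandas as pd
import numpy as np

# hypothetical variable with missing values
train = pd.DataFrame({'age': [25, np.nan, 40, np.nan, 35]})

# binary indicator: 1 where age was missing, 0 otherwise
train['age_NA_IND'] = train['age'].isnull().astype(int)

# the original variable still needs imputing, e.g. with the median
train['age'] = train['age'].fillna(train['age'].median())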

Assumptions of missing indicator

  • Data is NOT missing at random.
  • Missing data are predictive.

Advantages of missing indicator

  • Easy to implement.
  • It can capture the importance of missing data.
  • It can be integrated into production.

Disadvantages of missing indicator

  • It expands the feature space.
  • The original variable still needs to be imputed.
  • Many missing indicators may end up being identical or very highly correlated.

Here’s a code example showing how to add the missing indicator. Note that you still need to impute missing values with a method of your choice:

import pandas as pd
from sklearn.impute import MissingIndicator

# create the indicator, considering only the columns with missing data.
indicator = MissingIndicator(error_on_new=True, features='missing-only')
indicator.fit(train)

# print the columns that contain missing data.
print(train.columns[indicator.features_])

# get the missing indicator columns for the train set.
temporary = indicator.transform(train)

# create a column name for each of the new missing indicators
indicator_columns = [column + '_NA_IND' for column in train.columns[indicator.features_]]
indicator_df = pd.DataFrame(temporary, columns=indicator_columns)

# create the final train data.
train = pd.concat([train.reset_index(), indicator_df], axis=1)

# now the same for the test set
temporary = indicator.transform(test)
indicator_df = pd.DataFrame(temporary, columns=indicator_columns)

# create the final test data.
test = pd.concat([test.reset_index(), indicator_df], axis=1)

Random Sample Imputation

Random sampling consists of taking a random observation from the pool of available observations of the variable and using those randomly selected values to fill in the missing ones.

Here’s an illustration of this technique:
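
As a minimal sketch, assuming a pandas DataFrame with a hypothetical age column, random sample imputation can be done by drawing observed values at random and assigning them to the missing positions:

import pandas as pd
import numpy as np

# hypothetical variable with missing values
train = pd.DataFrame({'age': [25, 30, np.nan, 40, np.nan, 35]})

# draw as many random observations as there are missing values
n_missing = train['age'].isnull().sum()
random_sample = train['age'].dropna().sample(n_missing, random_state=29)

# align the sampled values with the index of the missing observations
random_sample.index = train[train['age'].isnull()].index
train.loc[train['age'].isnull(), 'age'] = random_sample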

Assumptions of random sample imputation

  • Data is missing at random.
  • We’re replacing the missing values with other values within the same distribution of the original variable.

Advantages of random sample imputation

  • Easy to implement and a fast way of obtaining complete datasets.
  • It can be used in production.
  • Preserves the variance of the variable.

Disadvantages of random sample imputation

  • Randomness.
  • The relationship between imputed variables and other variables may be affected if there are a lot of missing values.
  • Memory requirements can be large in deployment, since we need to store the original training set to extract the replacement values from.

Here is a code snippet implementing random sample imputation using feature-engine:

from feature_engine.missing_data_imputers import RandomSampleImputer
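# note: in recent versions of feature-engine, the imputers live in feature_engine.imputation,
# i.e. from feature_engine.imputation import RandomSampleImputer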

# create the imputer
imputer = RandomSampleImputer(random_state = 29)

# fit the imputer to the train set
imputer.fit(train)

# transform the data
train_t = imputer.transform(train)
test_t = imputer.transform(test)

Conclusion

We've explored multiple methods for imputing missing values in a given dataset and looked at the advantages and limitations of each technique. What's left now is to try these methods on your own dataset!

Lastly, it's important to mention that no single method is perfect; you'll want to experiment and see which one offers the best results. You can even automate this search, for example with a grid search over imputation strategies, as sketched below.
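
As a rough sketch of that idea with scikit-learn (where LogisticRegression is just a placeholder model, and X_train and y_train are assumed to be your numerical training features and target), you can wrap a SimpleImputer and a model in a Pipeline and grid-search over the imputation strategy:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# pipeline: impute the missing values first, then fit the model
pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan)),
    ('model', LogisticRegression(max_iter=1000)),
])

# try several imputation strategies and keep the best one by cross-validation
param_grid = {'imputer__strategy': ['mean', 'median', 'most_frequent', 'constant']}

# X_train and y_train are assumed to be your training features and target
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)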
