Predicting Depression from Routine Survey Data using Keras

Depression is among the most common mental illnesses, affecting more than 300 million people globally, according to the World Health Organization (WHO).

Beyond the potentially devastating personal and social effects of depression, the economic costs related to this mental health issue have risen substantially in recent years. In fact, depression is currently estimated to cost the global economy over $1 trillion USD each year.

In this tutorial, we’ll build a simple neural network model using Keras to predict individuals that are likely to be depressed from routine survey data. Keras is a high-level neural networks API written in Python, and it can run on top of TensorFlow, CNTK, or Theano. We evaluate the accuracy of the model and compare it to a binary classification baseline model based on a random forest algorithm.

Loading the Required Dependencies

We begin by loading the required dependencies to process data and build the model:

import pandas as pd
import numpy as np
import re
from keras.models import Sequential
from keras.layers import Dense
from keras.preprocessing import sequence
# scikit-learn utilities used later for encoding, imputation, and splitting
from sklearn.preprocessing import LabelEncoder, Imputer
from sklearn.model_selection import train_test_split

Reading and Understanding the Data

We’re using the Busara Mental Health dataset, which can be downloaded from the Zindi data science competition platform, a site that hosts datasets and challenges focused on solving African problems. We read the train.csv and test.csv files using pandas. The train.csv file contains the data we’ll use to train the model, while test.csv is the dataset we’ll feed to the trained model to make predictions.

#train set
df_train = pd.read_csv('data/busara/train.csv')
# test set
df_test = pd.read_csv('data/busara/test.csv')

#show the shape of the train dataframe
df_train.shape

Using the dataframe’s shape property, we establish that the training set comprises 1,143 rows and 75 columns, among them the target column (‘depressed’), which takes values of 0 or 1.
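Before cleaning the data, it’s also worth checking how the target is distributed, since surveys like this one typically contain far more non-depressed than depressed respondents. A minimal sketch using pandas’ value_counts on the dataframe loaded above:

#show the number of respondents in each class of the target column
print(df_train['depressed'].value_counts())

#show the same counts as proportions of the training set
print(df_train['depressed'].value_counts(normalize=True))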

Data Cleaning and Processing

Poor data quality due to missing data is always a challenge in machine learning. Handling missing data is therefore an important step in ensuring that ML models produce more accurate and valid prediction results. The function below computes and prints the percentage of missing values for each column.

def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("The dataset has " + str(df.shape[1]) + " columns.n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

A good rule of thumb is that a column with more than 50% missing values can be excluded. For our training dataset, 4 columns exceed the 50% threshold, so we drop them.

# Get the columns with > 50% missing
missing_df = missing_values_table(df_train)
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('\n', '%d columns will be deleted.' % len(missing_columns))

# Drop the columns with more than 50% missing data
df_train = df_train.drop(columns = missing_columns)

Because we’re building a simple neural network using Keras, which only works with numeric inputs, we convert the survey_date column, which is stored as text, to numeric format using scikit-learn’s LabelEncoder. This involves creating a label (category) encoder object, fitting the encoder to the survey_date column, and finally applying the fitted encoder to the column to transform the categories into integers.

# Create a label (category) encoder object
encoder = LabelEncoder()

# fitting the encoder to the "survey_date" column
encoder.fit(df_train['survey_date'])

# Apply the fitted encoder to the "survey_date" to transform categories into integers
encoded_train = encoder.transform(df_train['survey_date'])
# encoded_test = encoder.transform(df_test['survey_date'])

#assign the transformed column back to the dataframe
df_train['survey_date'] = encoded_train
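To sanity-check the encoding, the fitted encoder keeps the original date strings in its classes_ attribute, and the integer assigned to each date is simply its index in that array; it can also map integers back to the original strings. A minimal sketch:

# the original survey_date strings, in the order the encoder assigned integers
print(encoder.classes_[:5])

# map the first few encoded values back to their original date strings
print(encoder.inverse_transform(encoded_train[:5]))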

Now that we’ve processed our data to a considerable level, it’s time to specify the input and target variables. Our input will be every column except the depressed column; because that’s what we’re attempting to predict, it serves as our target variable.

We also handle any missing values in the columns with less than 50% null values using scikit-learn’s Imputer class (replaced by SimpleImputer in newer scikit-learn releases), which fills each missing entry with the column mean. The data is then split into train and test sets, and a random seed of 5 is specified so the results can be reproduced.

# split data into train and test sets
X = df_train.drop(["depressed"], axis=1)

# fill missing values with mean column values
imputer = Imputer()
transformed_X = imputer.fit_transform(X)

y = df_train.depressed

seed = 5
test_size = 0.33

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=test_size, random_state=seed)
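As a quick check that the split behaved as expected, printing the shapes of the resulting arrays should show roughly a third of the rows in the test set:

# confirm the 67/33 split of rows between train and test
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)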

Creating the Classification Model

We build a simple neural network classification model with Keras, which defines a model as a linear stack of layers using the Sequential API.

Using the add function, we add two fully connected (Dense) hidden layers and a Dense output layer. Each hidden layer has 36 neurons, and the first layer expects 70 inputs (one per remaining feature column), while the output layer has 1 neuron to predict the outcome. We use the sigmoid activation on the output layer so the network’s output falls between 0 and 1, while the relu activation is used in the hidden layers.

# create model
model = Sequential()
model.add(Dense(36, input_dim=70, activation='relu'))
model.add(Dense(36, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
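Calling model.summary() prints the layer stack and the number of trainable parameters, which is a handy way to confirm the architecture matches the description above:

# print the layers and parameter counts of the network
model.summary()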

Compile the Model

Now that the model is defined, we can compile it. In our case, the TensorFlow backend is used for compiling, and it automatically chooses the best way to represent the network for training and making predictions. Training a network simply means finding the best set of weights to make predictions for this problem.

We therefore specify the loss function to be used to evaluate a set of weights, the optimizer used to search through different weights for the network, and metrics to evaluate the model’s performance. For a binary classification problem like this one, we use binary_crossentropy for the logarithmic loss, adam for gradient descent, and accuracy for the metric parameter.

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Fit the Model

Having created and compiled our model, it’s time to train or fit the model on our loaded data by calling the fit() function. The model will iterate 100 times through the dataset, as defined by the epochs argument. The batch_size argument sets the number of instances that are evaluated before a weight update in the network is performed.

# Fit the model
model.fit(X_train, y_train, epochs=100, batch_size=10)
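If you’d rather monitor generalization during training instead of only at the end, fit() also accepts a validation_split argument. This is an optional variation on the call above that holds out 20% of the training rows for validation each epoch:

# optionally hold out 20% of the training data to track validation accuracy per epoch
history = model.fit(X_train, y_train, epochs=100, batch_size=10, validation_split=0.2)

# history.history contains the per-epoch loss and accuracy for both splits
print(history.history.keys())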

Evaluate the Model

We finally evaluate the model for our test data.

scores = model.evaluate(X_test, y_test)
print("n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

The model achieved an accuracy of 86.51%. This is slightly higher than the random forest baseline model below, whose accuracy evaluated to 85.98%.
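Before turning to the baseline, note that the trained network can also be used to score individual records. The sigmoid output is a probability between 0 and 1, so a simple 0.5 threshold converts it into a depressed / not-depressed label; a minimal sketch on the held-out rows:

# predicted probabilities of depression for the held-out rows
probabilities = model.predict(X_test)

# convert probabilities into 0/1 labels with a 0.5 threshold
predicted_labels = (probabilities > 0.5).astype(int)
print(predicted_labels[:10].ravel())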

Baseline Model

#Baseline Model
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# fit a random forest classifier on the same training split
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# evaluate accuracy on the held-out test split
y_pred = rf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Conclusion

Mental health is a core component of public health and a key determinant of overall health and socio-economic development. Governments around the world continue to invest heavily in measures and strategies to mitigate the burden of mental health problems and disorders such as depression.

Early prediction and detection of depression is an essential step towards implementing effective prevention and clinical intervention strategies. Statistical models, like the one developed in this tutorial, can provide an alternative to clinical screening approaches for detecting depression, which rely mainly on analyzing self-reported patient data.

Our model could be incorporated into public health information systems to screen for risk factors for depression. It could also be used by various organizations, especially non-governmental organizations (NGOs) that routinely conduct surveys related to mental health.
