Using Transfer Learning and Pre-trained Language Models to Classify Spam

Transfer learning, in which a model developed for one task is reused as the starting point for a model on a second task, is an important technique in machine learning: prior knowledge from one domain and task is leveraged in a different domain and task.

Transfer learning therefore draws inspiration from human beings, who are able to transfer and leverage knowledge learned in the past to tackle a wide variety of new tasks.

In computer vision, great advances have been made using the transfer learning approach, with pre-trained models used as a starting point. This has sped up training and improved the performance of deep learning models, and is largely attributed to the availability of huge datasets such as ImageNet, which have enabled the development of state-of-the-art pre-trained models for transfer learning.

Until recently, the natural language processing community lacked its ImageNet equivalent, but the development of transfer learning techniques in NLP continues to gain traction. In NLP, transfer learning is mainly based on pre-trained language models, which repurpose and reuse deep learning models trained on high-resource languages and domains.

The pre-trained models are then fine-tuned for downstream tasks, often in low-resource settings. The downstream tasks include part-of-speech tagging, text classification, and named-entity recognition, among others.

Contextualized Embeddings

Word embeddings play a critical role in the realization of transfer learning in NLP. The intuition behind word embeddings is that words are represented as low-dimensional vectors that capture both the syntax and semantics of the text corpus, and that words with similar meanings tend to occur in similar contexts.

The word representations are learned by exploiting vast amounts of text corpora. A popular implementation of word embeddings is the Word2Vec model, which has two training options: Continuous Bag of Words (CBOW) and the Skip-gram model. Word embeddings are often used as the first data processing layer in a deep learning model.
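
As a minimal sketch of the two training options, assuming the gensim library and a toy corpus (the vector_size argument is the gensim 4.x name; older releases call it size):

from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens
corpus = [
    ["free", "entry", "to", "win", "a", "prize"],
    ["call", "me", "when", "you", "get", "home"],
    ["win", "a", "free", "prize", "today"],
]

# sg=0 trains a Continuous Bag of Words model, sg=1 trains a Skip-gram model
cbow = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1)

# Each word is now a dense, low-dimensional vector
print(skipgram.wv["free"].shape)        # (50,)
print(skipgram.wv.most_similar("win"))  # words occurring in similar contexts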

One limitation of standard word embedding techniques such as Word2Vec, fastText, and GloVe is that they cannot disambiguate between the different senses of a given word. In other words, every instance of a word ends up with the same representation regardless of the context in which it appears.

Recently, contextual word embeddings such as Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT) have emerged. These techniques generate embeddings for a word based on the context in which the word appears, thus generating slightly different embeddings for each occurrence of a word.

ELMo uses a combination of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. BERT representations, on the other hand, are jointly conditioned on both the left and right context and use the Transformer, a neural network architecture based on a self-attention mechanism. The Transformer has been shown to have superior performance in modeling long-term dependencies in text, compared to recurrent neural network architectures.
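
The effect is easy to see with any contextual model. Below is a minimal sketch, assuming the Hugging Face transformers library (not otherwise used in this post) and the bert-base-uncased checkpoint: the same word "bank" receives a different vector in each sentence.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    # Return the contextual vector BERT assigns to `word` in `sentence`
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

# The two "bank" vectors differ because their contexts differ
v1 = embed_word("he sat by the river bank", "bank")
v2 = embed_word("she deposited cash at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))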

The integration of contextual word embeddings into neural architectures has led to consistent improvements in important NLP tasks such as sentiment analysis, question answering, reading comprehension, textual entailment, semantic role labeling, coreference resolution, and dependency parsing.

Language model embeddings can be used as features in a target model, or a language model can be fine-tuned on target task data. Training a model on a large-scale dataset and then fine-tuning the pre-trained model for a target task (transfer learning, if you’ll recall) can be particularly beneficial for low-resource languages where labeled data is limited.

The Flair Library

Flair is a library for state-of-the-art NLP developed by Zalando Research. It’s built in Python on top of the PyTorch framework. Flair allows for the application of state-of-the-art NLP models to text, such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation, and classification.

It is multilingual and allows you to use and combine different word and document embeddings, including the BERT embeddings, ELMo embeddings, and their proposed Flair embeddings. In addition, Flair allows you to train your own language model, targeted to your language or domain, and apply it to the downstream task.
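
As a quick illustration of the library's interface, a pre-trained Flair sequence tagger can be applied in a few lines (a sketch assuming the 'ner' model identifier that ships with Flair):

from flair.data import Sentence
from flair.models import SequenceTagger

# Load a pre-trained named entity recognition model
tagger = SequenceTagger.load('ner')

sentence = Sentence('George Washington went to Washington.')
tagger.predict(sentence)

# Print the sentence with its predicted entity tags
print(sentence.to_tagged_string())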

Spam Classification using Flair

While email continues to be the dominant medium for digital communications for both consumer and business uses, unsolicited bulk emails (i.e., spam) made up approximately 53.5% of global email traffic as of September 2018.

Machine learning-based spam filtering approaches have been applied with success to automatically classify spam and non-spam emails. A crucial component in such approaches is word embeddings, typically trained over very large collections of unlabeled data to assist learning and generalization.

Contextualized word embeddings have been shown to significantly improve the performance of text classifiers because they are able to capture word semantics in context. This means that the same word can have different embeddings depending on its contextual use, thus disambiguating words and addressing the polysemy that affects the accuracy of text classification models.

The following implementation illustrates how to use the Flair library to train a language model and fine-tune it to classify spam.

Getting Started

We begin by installing the Flair library using the pip command.

pip install flair

The required Python libraries are then imported:

import pandas as pd
from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from pathlib import Path

Loading and Pre-processing the Data

We use the SMS Spam Collection, a public dataset of labeled SMS messages collected for mobile phone spam research. The data is read using pandas and basic preprocessing is done: removing duplicates, prefixing the labels with __label__, and splitting the dataset into train, dev, and test sets using an 80/10/10 split.

Flair’s classification dataset needs to be formatted based on Facebook’s FastText format, which requires labels to be defined at the beginning of each line starting with the prefix __label__.
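
For example, lines in the formatted, tab-separated training file look like this (message texts shortened for illustration):

__label__ham	Ok lar... Joking wif u oni...
__label__spam	Free entry in 2 a wkly comp to win FA Cup final tkts ...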

data = pd.read_csv('SMSSpamCollection.txt', delimiter='\t', header=None)
data = data.rename(columns={0: "label", 1: "text"}).drop_duplicates()
data['label'] = '__label__' + data['label'].astype(str)

# Write tab-separated train/test/dev files using an 80/10/10 split
data.iloc[0:int(len(data)*0.8)].to_csv('train.csv', sep='\t',
                                       index=False, header=False)
data.iloc[int(len(data)*0.8):int(len(data)*0.9)].to_csv('test.csv', sep='\t',
                                                        index=False, header=False)
data.iloc[int(len(data)*0.9):].to_csv('dev.csv', sep='\t', index=False, header=False)
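
Before training, it can be worth a quick sanity check of the class balance, since the SMS Spam Collection is heavily skewed towards ham messages:

# Inspect the class distribution (ham messages far outnumber spam)
print(data['label'].value_counts())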

The next step is to train the model. The train, dev, and test splits are loaded into a corpus object, which is then used to train the classifier.

corpus = NLPTaskDataFetcher.load_classification_corpus(Path('./'),
                                                       test_file='test.csv',
                                                       dev_file='dev.csv',
                                                       train_file='train.csv')

# Stack classic GloVe embeddings with forward and backward Flair embeddings
word_embeddings = [WordEmbeddings('glove'),
                   FlairEmbeddings('news-forward-fast'),
                   FlairEmbeddings('news-backward-fast')]

# Pool the word embeddings into a single document embedding with an LSTM
document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512,
                                             reproject_words=True,
                                             reproject_words_dimension=256)

classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary(),
                            multi_label=False)

trainer = ModelTrainer(classifier, corpus)
trainer.train('./', max_epochs=10)
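
The call above relies on Flair's default hyperparameters. trainer.train also accepts the usual knobs if you want to tune the run; the values below are illustrative, not tuned:

# An alternative training call with explicit hyperparameters (illustrative values)
trainer.train('./',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=10)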

Finally, we load the best model saved during training and use it to predict whether a message is spam or not.

from flair.models import TextClassifier
from flair.data import Sentence

# Load the best model saved during training
classifier = TextClassifier.load_from_file('./best-model.pt')

# Wrap an example message in a Sentence object and predict its label
sentence = Sentence("FREE entry into our £250 weekly comp just send the word "
                    "WIN to 80086 NOW. 18 T&C www.txttowin.co.uk")
classifier.predict(sentence)

label = str(sentence.labels[0]).split()[0]
print(f"{label}\t{sentence}")

Model Evaluation

The model achieved an F-score of 0.9845 after 10 epochs, using default parameters. The micro-averaged F-score is calculated globally by counting the total true positives, false negatives, and false positives across all classes.

The F-score is a measure of a test’s accuracy that combines both the precision and the recall of the test. Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that have been retrieved out of the total number of relevant instances. In classification tasks where every test case is guaranteed to be assigned to exactly one class, micro-F is equivalent to accuracy. This won’t be the case in multi-label classification.
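
As a small worked example with illustrative numbers, a precision of 0.98 and a recall of 0.99 give an F-score equal to their harmonic mean:

# F-score as the harmonic mean of precision and recall (illustrative numbers)
precision, recall = 0.98, 0.99
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.985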

Baseline Model

The logistic regression baseline model achieved an F-score of 0.9668, marginally lower than that of the Flair model above.

# Load the required libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Read the data
df = pd.read_csv('SMSSpamCollection.txt', delimiter='\t', header=None)
df.rename(columns={0: 'label', 1: 'text'}, inplace=True)

# Input and output variables
X = df['text']
y = df['label']

seed = 5
test_size = 0.33

# Split the dataset into train and test sets
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y,
                                                            test_size=test_size,
                                                            random_state=seed)

# Convert the text to a matrix of TF-IDF features
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

# Model training
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Prediction
predictions = classifier.predict(X_test)

# Model evaluation
score = accuracy_score(y_test, predictions)
f_score = f1_score(y_test, predictions, average='micro')
print("The accuracy score (Logistic Regression) is:", score)
print("The F score-Micro (Logistic Regression) is:", f_score)

Discussion

Machine learning models are data intensive and require access to large annotated datasets to train good predictive NLP models. The required annotated data will in most cases not be available beforehand for many domains or languages. Annotating such datasets is a time-consuming, expensive, and challenging exercise, yet sufficient and accurately labeled data is a key determinant of a model's prediction accuracy.

In view of these challenges, transfer learning via pre-trained language models is a promising way forward.
