Detecting the Language of a Person’s Name using a PyTorch RNN

In this tutorial, we’ll build a Recurrent Neural Network (RNN) in PyTorch that will classify people’s names by their languages. We assume that the reader has a basic understanding of PyTorch and machine learning in Python.

By the end of this tutorial, we’ll be able to predict the language a name comes from based on its spelling. The dataset of names used in this tutorial can be downloaded here.

This tutorial has been adapted from PyTorch’s official docs, where you can find more details about the implementation.

Data Pre-processing

As is the case with any machine learning task, we’ll kick off by loading and preparing our dataset. Upon downloading the dataset, we notice that there’s a folder called names inside the data folder. It contains text files with surnames in eighteen different languages.

In order to load all the files in one go, we’ll use the Python glob module, which finds all the pathnames matching a specified pattern according to the rules used by the Unix shell (results are returned in arbitrary order). We’ll use it to load all the files in the folder that end with .txt.

import glob

all_text_files = glob.glob('data/names/*.txt')
print(all_text_files)

Currently, the names are in Unicode. We have to convert them to plain ASCII, which removes the diacritics from the words. For example, the French name Béringer will be converted to Beringer.

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicode_to_ascii('Béringer'))

In the next step, we create a dictionary with a list of names for each language.

import os

category_languages = {}
all_categories = []

def readLines(filename):
    # Read a file, split it into lines, and convert each line to ASCII
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicode_to_ascii(line) for line in lines]

for filename in all_text_files:
    # The category is the file name without its extension, e.g. 'French'
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    languages = readLines(filename)
    category_languages[category] = languages

no_of_languages = len(all_categories)
print('There are {} languages'.format(no_of_languages))

We can view the first fifteen names in the French list as shown below.
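
A minimal way to do this, assuming the French file was loaded under the key 'French':

print(category_languages['French'][:15])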

Turning the Names into PyTorch Tensors

When working with data in PyTorch, we have to convert it to PyTorch tensors, which are very similar to NumPy arrays. In our case, we have to convert each letter into a torch tensor: a one-hot vector filled with 0s, except for a 1 at the index of the current letter. Let’s define the conversion functions and then convert the letter M to a one-hot vector.

import torch

def letter_to_tensor(letter):
    # One-hot encode a single letter as a (1, n_letters) tensor
    tensor = torch.zeros(1, n_letters)
    letter_index = all_letters.find(letter)
    tensor[0][letter_index] = 1
    return tensor

def line_to_tensor(line):
    # Encode a whole name as a (line_length, 1, n_letters) tensor
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        letter_index = all_letters.find(letter)
        tensor[li][0][letter_index] = 1
    return tensor

To represent a whole name, we join the one-hot vectors of its letters into a tensor of shape (line_length, 1, n_letters); the extra dimension of size 1 is the batch dimension, since PyTorch expects batched inputs.
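
As a quick check, here is what these functions produce for the letter M and the hypothetical name 'Jones':

print(letter_to_tensor('M'))           # one-hot vector of shape (1, 57)
print(line_to_tensor('Jones').size())  # torch.Size([5, 1, 57])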

Building the RNN

When creating a neural network in PyTorch, we subclass torch.nn.Module, the base class for all neural network modules. torch.autograd provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. torch.nn.LogSoftmax applies the log(Softmax(x)) function to an n-dimensional input tensor.

import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # Concatenate the current letter with the previous hidden state
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self):
        return Variable(torch.zeros(1, self.hidden_size))

Testing the RNN

We kick off by creating an instance of the RNN class and passing in the arguments as required.

n_hidden = 128
rnn = RNN(n_letters, n_hidden, no_of_languages)

We’d like the network to give us the likelihood of each language. To test this, we pass in the tensor for a single letter along with a zero-initialized hidden state.

input = Variable(letter_to_tensor('D'))
hidden = rnn.init_hidden()

output, next_hidden = rnn(input, hidden)
print('output.size =', output.size())

For efficiency, we can also create the tensor for a whole name once and feed the network slices of it, rather than building a new tensor for each letter:

input = Variable(line_to_tensor('Derrick'))
hidden = Variable(torch.zeros(1, n_hidden))

output, next_hidden = rnn(input[0], hidden)
print(output)
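
Since LogSoftmax returns log-probabilities, exponentiating the output recovers probabilities that sum to one; a quick sanity check:

print(output.exp().sum())  # should be (close to) 1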

Training the RNN

To interpret the network’s output, we use Tensor.topk to get the index of the greatest value, which corresponds to the most likely category.

def category_from_output(output):
    top_n, top_i = output.data.topk(1)  # index of the highest log-probability
    category_i = top_i[0][0].item()
    return all_categories[category_i], category_i

print(category_from_output(output))

Next, we need a quick way to get a random training example: a name and the language it belongs to.

import random

def random_training_pair():
    # Pick a random language, then a random name from that language
    category = random.choice(all_categories)
    line = random.choice(category_languages[category])
    # The target is the language's index, wrapped in a LongTensor for NLLLoss
    category_tensor = Variable(torch.LongTensor([all_categories.index(category)]))
    line_tensor = Variable(line_to_tensor(line))
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = random_training_pair()
    print('category =', category, '/ line =', line)

The next step is to define the loss function and create an optimizer that will update the model’s parameters according to its gradients. Since the last layer of our network is LogSoftmax, the matching loss function is the negative log-likelihood loss, nn.NLLLoss. We also specify a learning rate for our model.

criterion = nn.NLLLoss()

learning_rate = 0.005
optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)

Next, we define a function that feeds a name through the network one letter at a time, compares the final output to the target category, and back-propagates the resulting loss.

def train(category_tensor, line_tensor):
    rnn.zero_grad()
    hidden = rnn.init_hidden()

    # Feed the name through the network one letter at a time
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Update the parameters using the computed gradients
    optimizer.step()

    return output, loss.item()

The next step is to run many examples through this train function, keeping track of the losses so we can plot them later.

import time
import math

n_epochs = 100000
print_every = 5000
plot_every = 1000

current_loss = 0
all_losses = []

def time_since(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for epoch in range(1, n_epochs + 1):
    category, line, category_tensor, line_tensor = random_training_pair()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss
    
    if epoch % print_every == 0:
        guess, guess_i = category_from_output(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (epoch, epoch / n_epochs * 100, time_since(start), loss, line, guess, correct))

    if epoch % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

Plotting the Results

We plot the results using Matplotlib’s pyplot. The plot shows the network’s average loss over time, which tells us how well it is learning.

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

plt.figure()
plt.plot(all_losses)

Evaluating the Results

We’ll create a confusion matrix in order to see how the network performed on the different categories. Bright spots off the main diagonal show the languages the network guesses incorrectly.

confusion = torch.zeros(no_of_languages, no_of_languages)
n_confusion = 10000

def evaluate(line_tensor):
    hidden = rnn.init_hidden()
    
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)
    
    return output

for i in range(n_confusion):
    category, line, category_tensor, line_tensor = random_training_pair()
    output = evaluate(line_tensor)
    guess, guess_i = category_from_output(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize each row so that it sums to 1
for i in range(no_of_languages):
    confusion[i] = confusion[i] / confusion[i].sum()

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()

Predicting New Names

We’ll define a function that will take in a name and return the likely languages the name is from.

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    output = evaluate(Variable(line_to_tensor(input_line)))

    # Get the top n_predictions categories
    topv, topi = output.data.topk(n_predictions, 1, True)
    predictions = []

    for i in range(n_predictions):
        value = topv[0][i].item()
        category_index = topi[0][i].item()
        print('(%.2f) %s' % (value, all_categories[category_index]))
        predictions.append([value, all_categories[category_index]])

    return predictions

predict('Austin')

Conclusion

If you’d like to learn more about PyTorch, there are a bunch of tutorials in its official docs. If you’re looking to learn more about RNNs, Brian Mwangi has written a fantastic guide as well.
