“If you talk to a man in a language he understands, that goes to his head. If you talk to him in his own language, that goes to his heart.” – Nelson Mandela
The beauty of language transcends boundaries and cultures. Learning a language other than our mother tongue is a huge advantage. But the path to bilingualism, or multilingualism, can often be a long, never-ending one.
There are so many little nuances that we get lost in the sea of words. Things have, however, become much easier with online translation services (I’m looking at you, Google Translate!).
I have always wanted to learn a language other than English. I tried my hand at learning German (or Deutsch), back in 2014. It was both fun and challenging. I had to eventually quit but I harboured a desire to start again.
Fast-forward to 2019, and I am fortunate to be able to build a language translator for any possible pair of languages. What a boon Natural Language Processing has been!
In this article, we will walk through the steps of building a German-to-English language translation model using Keras. We’ll also take a quick look at the history of machine translation systems with the benefit of hindsight.
This article assumes familiarity with RNN, LSTM, and Keras. Below are a couple of articles to read more about them:
Most of us were introduced to machine translation when Google came up with the service. But the concept has been around since the middle of the last century.
Research work in Machine Translation (MT) started as early as the 1950s, primarily in the United States. These early systems relied on huge bilingual dictionaries, hand-coded rules, and universal principles underlying natural language.
In 1954, IBM held the first-ever public demonstration of machine translation. The system had a pretty small vocabulary of only 250 words and it could translate only 49 hand-picked Russian sentences to English. The number seems minuscule now but the system is widely regarded as an important milestone in the progress of machine translation.
This image has been taken from the research paper describing IBM’s system
Soon, two schools of thought emerged:
In 1964, the Automatic Language Processing Advisory Committee (ALPAC) was established by the United States government to evaluate the progress in Machine Translation. ALPAC did a little prodding around and published a report in November 1966 on the state of MT. Below are the key highlights from that report:
Not exactly a glowing recommendation!
A long dry period followed this miserable report. Finally, in 1981, a new system called the METEO System was deployed in Canada for translation of weather forecasts issued in French into English. It was quite a successful project which stayed in operation until 2001.
The world’s first web translation tool, Babel Fish, was launched by the AltaVista search engine in 1997.
And then came the breakthrough we are all familiar with now – Google Translate. It has since changed the way we work (and even learn) with different languages.
Let’s circle back to where we left off in the introduction section, i.e., learning German. However, this time around I am going to make my machine do this task. The objective is to convert a German sentence to its English counterpart using a Neural Machine Translation (NMT) system.
We will use German-English sentence pairs data from http://www.manythings.org/anki/. You can download it from here.
Sequence-to-Sequence (seq2seq) models are used for a variety of NLP tasks, such as text summarization, speech recognition, and DNA sequence modeling. Our aim is to translate given sentences from one language to another.
Here, both the input and output are sentences. In other words, these sentences are a sequence of words going in and out of a model. This is the basic idea of Sequence-to-Sequence modeling. The figure below tries to explain this method.
A typical seq2seq model has 2 major components –
a) an encoder
b) a decoder
Both these parts are essentially two different recurrent neural network (RNN) models combined into one giant network:
I’ve listed a few significant use cases of Sequence-to-Sequence modeling below (apart from Machine Translation, of course):
It’s time to get our hands dirty! There is no better feeling than learning a topic by seeing the results first-hand. We’ll fire up our favorite Python environment (Jupyter Notebook for me) and get straight down to business.
import string
import re
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, RepeatVector
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from keras import optimizers
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_colwidth', 200)
Our data is a text file (.txt) of English-German sentence pairs. First, we will read the file using the function defined below.
Python Code:
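The snippet that defined these helpers does not appear at this point, so here is a minimal sketch of what `read_text` and `to_lines` could look like. It assumes the file is UTF-8 and holds one tab-separated pair per line, English first and German second. Keeping only the first two fields is an assumption of this sketch, added in case the file carries extra columns.

```python
def read_text(filename):
    """Return the whole file as one string (assumes UTF-8 encoding)."""
    with open(filename, mode="rt", encoding="utf-8") as f:
        return f.read()


def to_lines(text):
    """Split the raw text into [english, german] pairs.

    Assumes one pair per line with tab-separated fields. Any fields
    after the first two are dropped (an assumption of this sketch).
    """
    pairs = []
    for line in text.strip().split("\n"):
        fields = line.split("\t")
        if len(fields) >= 2:
            pairs.append(fields[:2])
    return pairs
```

Dropping the extra fields keeps every row the same width, so the later call to `array()` should produce a regular two-column array rather than an array of lists.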
We can now use these functions to read the text into an array in our desired format.
data = read_text("deu.txt")
deu_eng = to_lines(data)
deu_eng = array(deu_eng)
The actual data contains over 150,000 sentence-pairs. However, we will use only the first 50,000 sentence pairs to reduce the training time of the model. You can change this number as per your system’s computation power (or if you’re feeling lucky!).
deu_eng = deu_eng[:50000,:]
Text pre-processing is an important step in any project, and especially so in NLP. The data we work with is more often than not unstructured, so there are certain things we need to take care of before jumping to the model building part.
(a) Text Cleaning
Let’s first take a look at our data. This will help us decide which pre-processing steps to adopt.
deu_eng
array([['Hi.', 'Hallo!'], ['Hi.', 'Grüß Gott!'], ['Run!', 'Lauf!'], ..., ['Mary has very long hair.', 'Maria hat sehr langes Haar.'], ["Mary is Tom's secretary.", 'Maria ist Toms Sekretärin.'], ['Mary is a married woman.', 'Maria ist eine verheiratete Frau.']], dtype='<U380')
We will get rid of the punctuation marks and then convert all the text to lower case.
# Remove punctuation
deu_eng[:,0] = [s.translate(str.maketrans('', '', string.punctuation)) for s in deu_eng[:,0]]
deu_eng[:,1] = [s.translate(str.maketrans('', '', string.punctuation)) for s in deu_eng[:,1]]
deu_eng
array([['Hi', 'Hallo'], ['Hi', 'Grüß Gott'], ['Run', 'Lauf'], ..., ['Mary has very long hair', 'Maria hat sehr langes Haar'], ['Mary is Toms secretary', 'Maria ist Toms Sekretärin'], ['Mary is a married woman', 'Maria ist eine verheiratete Frau']], dtype='<U380')
# convert text to lowercase
for i in range(len(deu_eng)):
    deu_eng[i,0] = deu_eng[i,0].lower()
    deu_eng[i,1] = deu_eng[i,1].lower()
deu_eng
array([['hi', 'hallo'], ['hi', 'grüß gott'], ['run', 'lauf'], ..., ['mary has very long hair', 'maria hat sehr langes haar'], ['mary is toms secretary', 'maria ist toms sekretärin'], ['mary is a married woman', 'maria ist eine verheiratete frau']], dtype='<U380')
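The two cleaning steps above can also be folded into a single helper. This is only a sketch: the name `clean_sentence` is hypothetical and is not part of the original code.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_sentence(sentence):
    """Strip ASCII punctuation, then lowercase the sentence."""
    return sentence.translate(_PUNCT_TABLE).lower()


print(clean_sentence("Mary is Tom's secretary."))  # mary is toms secretary
print(clean_sentence("Grüß Gott!"))                # grüß gott
```

It could be applied column by column, for example `deu_eng[:, 0] = [clean_sentence(s) for s in deu_eng[:, 0]]`. Keep in mind that `string.punctuation` only covers ASCII characters, so typographic quotes and similar symbols would survive this step.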
(b) Text to Sequence Conversion
A Seq2Seq model requires that we convert both the input and the output sentences into integer sequences of fixed length.
But before we do that, let’s visualise the length of the sentences. We will capture the lengths of all the sentences in two separate lists for English and German, respectively.
# empty lists
eng_l = []
deu_l = []

# populate the lists with sentence lengths
for i in deu_eng[:,0]:
    eng_l.append(len(i.split()))

for i in deu_eng[:,1]:
    deu_l.append(len(i.split()))

length_df = pd.DataFrame({'eng':eng_l, 'deu':deu_l})
length_df.hist(bins = 30)
plt.show()
Quite intuitive – the maximum length of the German sentences is 11 and that of the English phrases is 8.
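Besides the maximum, it can help to know what share of sentences fits within a chosen length, since the tokenizer code that follows fixes both lengths at 8. The sketch below uses a hypothetical helper, `length_coverage`, and toy sentences invented for illustration.

```python
def length_coverage(sentences, max_len):
    """Fraction of sentences with at most max_len whitespace-separated words."""
    if not sentences:
        return 0.0
    fits = sum(1 for s in sentences if len(s.split()) <= max_len)
    return fits / len(sentences)


# Toy sentences, invented for illustration only.
toy = [
    "hallo",
    "ich bin müde",
    "das ist ein sehr langer satz mit vielen kleinen wörtern",
]
print(max(len(s.split()) for s in toy))  # 10
print(length_coverage(toy, 8))           # two of the three sentences fit
```

On the real data, `length_coverage(deu_eng[:, 1], 8)` would show how many German sentences are affected by an 8-word limit.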
Next, vectorize our text data by using Keras’s Tokenizer() class. It will turn our sentences into sequences of integers. We can then pad those sequences with zeros to make all the sequences of the same length.
Note that we will prepare tokenizers for both the German and English sentences:
# function to build a tokenizer
def tokenization(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
# prepare english tokenizer
eng_tokenizer = tokenization(deu_eng[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1

eng_length = 8
print('English Vocabulary Size: %d' % eng_vocab_size)
English Vocabulary Size: 6453
# prepare Deutsch tokenizer
deu_tokenizer = tokenization(deu_eng[:, 1])
deu_vocab_size = len(deu_tokenizer.word_index) + 1

deu_length = 8
print('Deutsch Vocabulary Size: %d' % deu_vocab_size)
Deutsch Vocabulary Size: 10998
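The code block below contains a function to prepare the sequences. It also pads each sequence with zeros up to the given length. Note that pad_sequences truncates longer sequences as well (from the front, by default), so with deu_length set to 8, German sentences longer than 8 words lose their leading words.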
# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    seq = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    seq = pad_sequences(seq, maxlen=length, padding='post')
    return seq
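To make the encoding step concrete, here is a dependency-free sketch of the same idea. It is a simplified stand-in, not the Keras implementation: the function names are hypothetical, and it only mirrors the behaviour relied on here (frequency-ordered indices starting at 1, zero padding at the end, truncation from the front).

```python
from collections import Counter


def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter(word for line in lines for word in line.split())
    # most_common() orders by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode_and_pad(word_index, length, lines):
    """Integer-encode each line, then pad or truncate it to exactly `length`."""
    encoded = []
    for line in lines:
        # Words missing from the index are skipped.
        seq = [word_index[w] for w in line.split() if w in word_index]
        seq = seq[-length:]                    # keep only the last `length` tokens
        seq = seq + [0] * (length - len(seq))  # pad with zeros at the end
        encoded.append(seq)
    return encoded


lines = ["ich bin müde", "ich bin hier", "hallo"]
index = build_word_index(lines)
print(index)  # {'ich': 1, 'bin': 2, 'müde': 3, 'hier': 4, 'hallo': 5}
print(encode_and_pad(index, 4, lines))
# [[1, 2, 3, 0], [1, 2, 4, 0], [5, 0, 0, 0]]
```

With a length of 2, the first sentence becomes `[2, 3]`: the leading word is the one that gets dropped.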
We will now split the data into training and test sets for model training and evaluation, respectively.
from sklearn.model_selection import train_test_split

# split data into train and test set
train, test = train_test_split(deu_eng, test_size=0.2, random_state = 12)
It’s time to encode the sentences. We will encode German sentences as the input sequences and English sentences as the target sequences. This has to be done for both the train and test datasets.
# prepare training data
trainX = encode_sequences(deu_tokenizer, deu_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])

# prepare test data
testX = encode_sequences(deu_tokenizer, deu_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
Now comes the exciting part!
We’ll start off by defining our Seq2Seq model architecture:
Model Architecture
# build NMT model
def define_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
    model = Sequential()
    model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
    model.add(LSTM(units))
    model.add(RepeatVector(out_timesteps))
    model.add(LSTM(units, return_sequences=True))
    model.add(Dense(out_vocab, activation='softmax'))
    return model
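It can help to trace how the tensor shape changes from layer to layer in this architecture. The sketch below does not use Keras at all: `trace_shapes` is a hypothetical helper that simply writes out the output shape each layer in `define_model` should produce, using the vocabulary size and lengths from this article.

```python
def trace_shapes(batch, in_timesteps, out_timesteps, units, out_vocab):
    """Return (layer, output shape) pairs for the stacked layers."""
    return [
        # one `units`-sized vector per input token
        ("Embedding", (batch, in_timesteps, units)),
        # the encoder keeps only its final state: one vector per sentence
        ("LSTM (encoder)", (batch, units)),
        # that vector is copied once per output timestep
        ("RepeatVector", (batch, out_timesteps, units)),
        # the decoder emits one vector per output timestep
        ("LSTM (decoder)", (batch, out_timesteps, units)),
        # a probability distribution over the target vocabulary per timestep
        ("Dense softmax", (batch, out_timesteps, out_vocab)),
    ]


for name, shape in trace_shapes(512, 8, 8, 512, 6453):
    print(f"{name:<16} {shape}")
```

The encoder squeezes the whole German sentence into a single 512-dimensional vector, and the decoder has to produce all 8 English positions from repeated copies of that one vector.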
We are using the RMSprop optimizer in this model as it’s usually a good choice when working with recurrent neural networks.
# model compilation
model = define_model(deu_vocab_size, eng_vocab_size, deu_length, eng_length, 512)

# note: newer Keras releases name this argument learning_rate instead of lr
rms = optimizers.RMSprop(lr=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')
Please note that we have used 'sparse_categorical_crossentropy' as the loss function. This is because the function allows us to use the target sequence as is, instead of the one-hot encoded format. One-hot encoding the target sequences using such a huge vocabulary might consume our system’s entire memory.
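A rough back-of-the-envelope calculation shows the scale of the difference. The sketch below assumes 40,000 training pairs (80% of the 50,000 used here), 8 timesteps, the English vocabulary size of 6,453, and 4 bytes per stored value. The 4-byte figure and the helper name `target_memory_bytes` are assumptions made for illustration.

```python
def target_memory_bytes(n_samples, timesteps, vocab_size, bytes_per_value=4):
    """Return (sparse_bytes, one_hot_bytes) for the target array.

    Sparse targets store one integer per timestep; one-hot targets store
    a full vocabulary-sized vector per timestep.
    """
    sparse = n_samples * timesteps * bytes_per_value
    one_hot = sparse * vocab_size
    return sparse, one_hot


sparse, one_hot = target_memory_bytes(40_000, 8, 6453)
print(f"sparse:  {sparse / 1e6:.2f} MB")   # sparse:  1.28 MB
print(f"one-hot: {one_hot / 1e9:.2f} GB")  # one-hot: 8.26 GB
```

Under these assumptions the integer targets fit in about a megabyte, while the one-hot version would need several gigabytes for the targets alone.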
We are all set to start training our model!
We will train the model for 30 epochs with a batch size of 512 and a validation split of 20%: Keras holds out 20% of the training data for validation and fits the model on the remaining 80%. You may change and play around with these hyperparameters.
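To see how many sentence pairs each stage actually gets, here is the arithmetic as a small sketch. The helper `split_sizes` is hypothetical, and rounding is used in place of the exact rules the libraries apply. The inputs are the values from this article: 50,000 pairs, a 20% test split, a 20% validation split, and a batch size of 512.

```python
import math


def split_sizes(n_pairs, test_fraction, validation_fraction, batch_size):
    """Return sample counts per stage and the number of batches per epoch."""
    n_test = round(n_pairs * test_fraction)
    n_train_total = n_pairs - n_test
    # The validation share is carved out of the training set, not the full data.
    n_val = round(n_train_total * validation_fraction)
    n_fit = n_train_total - n_val
    return {
        "test": n_test,
        "validation": n_val,
        "fit": n_fit,
        "batches_per_epoch": math.ceil(n_fit / batch_size),
    }


print(split_sizes(50_000, 0.2, 0.2, 512))
# {'test': 10000, 'validation': 8000, 'fit': 32000, 'batches_per_epoch': 63}
```

So the weights are updated on 32,000 pairs per epoch, while 8,000 pairs track the validation loss and 10,000 stay untouched for the final check.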
We will also use the ModelCheckpoint() callback to save the model whenever the validation loss reaches a new low. I personally prefer this method over early stopping.
filename = 'model.h1.24_jan_19'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# train model
history = model.fit(trainX, trainY.reshape(trainY.shape[0], trainY.shape[1], 1),
                    epochs=30, batch_size=512, validation_split = 0.2,
                    callbacks=[checkpoint], verbose=1)
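The checkpointing behaviour configured above (save_best_only=True, mode='min') boils down to a simple rule: write the model to disk only when the monitored loss is strictly lower than every earlier value. Here is a plain-Python sketch of that rule, with a hypothetical helper name and made-up loss values.

```python
def best_checkpoint_epochs(val_losses):
    """Return the 1-based epochs at which a save-best-only checkpoint would fire."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved


# Made-up validation losses, for illustration only.
losses = [3.1, 2.6, 2.2, 2.3, 2.0, 2.1, 2.4]
print(best_checkpoint_epochs(losses))  # [1, 2, 3, 5]
```

Training runs for all epochs either way. The file on disk simply ends up holding the weights from the best epoch (epoch 5 in this toy run), which is the model loaded later for prediction.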
Let’s compare the training loss and the validation loss.
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])
plt.show()
As you can see in the above plot, the validation loss stopped decreasing after 20 epochs.
Finally, we can load the saved model and make predictions on the unseen data – testX.
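model = load_model('model.h1.24_jan_19')
# predict_classes() is not available in newer Keras/TensorFlow releases;
# taking the argmax over the vocabulary axis gives the same class indices
preds = argmax(model.predict(testX), axis=-1)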
These predictions are sequences of integers. We need to convert these integers to their corresponding words. Let’s define a function to do this:
def get_word(n, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == n:
            return word
    return None
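The function above scans the whole vocabulary for every lookup. One alternative, sketched here with a hypothetical helper and a toy vocabulary, is to invert the mapping once and then use plain dictionary lookups.

```python
def build_index_word(word_index):
    """Invert a {word: index} mapping into {index: word}."""
    return {index: word for word, index in word_index.items()}


# Toy vocabulary standing in for eng_tokenizer.word_index.
toy_word_index = {"tom": 1, "is": 2, "tired": 3}
index_word = build_index_word(toy_word_index)

print(index_word.get(3))  # tired
print(index_word.get(0))  # None, because 0 is the padding value
```

With several thousand words in the vocabulary and eight positions per test sentence, building the reverse mapping once should make decoding noticeably faster.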
Convert predictions into text (English):
preds_text = []
for i in preds:
    temp = []
    for j in range(len(i)):
        t = get_word(i[j], eng_tokenizer)
        if j > 0:
            # blank out padding and immediate repeats of the previous word
            if (t == get_word(i[j-1], eng_tokenizer)) or (t is None):
                temp.append('')
            else:
                temp.append(t)
        else:
            if t is None:
                temp.append('')
            else:
                temp.append(t)

    preds_text.append(' '.join(temp))
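The same decoding rule can be written more compactly. The sketch below is a variant rather than the original logic verbatim: it takes an index-to-word dictionary (assumed to be built from the tokenizer's vocabulary) and skips padding and repeated words instead of inserting empty strings, so the result has no stray spaces. The function name and toy data are hypothetical.

```python
def decode_sequence(indices, index_word):
    """Turn predicted integers into a sentence.

    Padding (index 0), unknown indices and immediate repeats are skipped.
    """
    words = []
    previous = None
    for n in indices:
        word = index_word.get(int(n))
        if word is not None and word != previous:
            words.append(word)
        previous = word
    return " ".join(words)


toy_index_word = {1: "tom", 2: "is", 3: "tired"}
print(decode_sequence([1, 2, 2, 3, 0, 0, 0, 0], toy_index_word))  # tom is tired
```

Applied to every row of the predictions, this would give a list equivalent to `preds_text`, apart from the whitespace.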
Let’s put the original English sentences in the test dataset and the predicted sentences in a dataframe:
pred_df = pd.DataFrame({'actual' : test[:,0], 'predicted' : preds_text})
We can randomly print some actual vs predicted instances to see how our model performs:
# print 15 rows randomly
pred_df.sample(15)
Our Seq2Seq model does a decent job. But there are several instances where it misses out on understanding the key words. For example, it translates “im tired of boston” to “im am boston”.
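Eyeballing samples is a good start, but a rough number makes it easier to compare runs. The sketch below computes two crude scores with hypothetical helpers: the share of exactly matching sentences, and the share of reference words that show up in a prediction. Neither is a standard translation metric such as BLEU. They are only meant as a quick sanity check.

```python
def exact_match_rate(actual, predicted):
    """Share of sentence pairs that are identical after whitespace normalisation."""
    if not actual:
        return 0.0
    hits = sum(1 for a, p in zip(actual, predicted) if a.split() == p.split())
    return hits / len(actual)


def word_recall(actual_sentence, predicted_sentence):
    """Share of reference words that also appear in the prediction."""
    reference = actual_sentence.split()
    if not reference:
        return 0.0
    predicted_words = set(predicted_sentence.split())
    return sum(1 for w in reference if w in predicted_words) / len(reference)


print(word_recall("im tired of boston", "im am boston"))  # 0.5
```

These could be run over the `actual` and `predicted` columns of `pred_df`. In the example above, only two of the four reference words survive, which matches the impression that the key word was lost.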
These are the challenges you will face on a regular basis in NLP. But these aren’t immovable obstacles. We can mitigate such challenges by using more training data and building a better (or more complex) model.
You can access the full code from this GitHub repo.
Even with a very simple Seq2Seq model, the results are pretty encouraging. We can improve on this performance easily by using a more sophisticated encoder-decoder model on a larger dataset.
Another experiment I can think of is trying out the seq2seq approach on a dataset containing longer sentences. The more you experiment, the more you’ll learn about this vast and complex space.
If you have any feedback on this article or have any doubts/questions, kindly share them in the comments section below.
HI Prateek, It's a wonderful article.One request,can you show us the implementation in R?
Thanks Sayam. I will try to implement it in R as well and share it with you all.
Good one Prateek. Thanks! I am looking for models in life insurance analytics. Since you have experience in BFSI, did you develop any such model like lapsation, claims etc ! Do reply back.
Hi, For some reason,the array function is not working properly.The function should return just an array while it is returning a list of array and shape is also not correct. Can you help me debugging it. By the way...Very good article.
Hi Janmejay, Which part of the code you are referring to?
Hello Prateek, I am getting the error in the line % matplotlib inline It says that the syntax is wrong. I am using Python 3.6
Thanks Dinesh for pointing it out. The correct code is %matplotlib inline. I have changed it in the blog as well.
Hi, I am following this tutorial as a bonus section for an assignment, but I am training on my own dataset which translated French to English. When I get to the training step, I get the error: Received a label value of 15781 which is outside the valid range of [0, 11767) (I have 11767 English words in the English vocabulary and 15789 words in the French vocabulary) so I assume the error is trying to use a value outside of the possible English integer encoding, which makes sense because French words can go > 11767 while English words can't. However, when I tried running it with your dataset, and you also have a difference in the number of words in your English and German vocabulary, you don't have this error. Can you explain to me why and any possible way I can fix this? I am really looking forward to your response!
Hi, please recheck the size of the vocabularies of your inputs and targets, respectively.
Great article, nice help in learning about seq2seq. Can you add a few lines that would allow me to send a message in english to be translated. Would be a nice addition.
Hi, I used this for a different dataset (not language translation). Model runs fine but im getting all same(blank) predictions . Any idea what could be the issue?
I guess the training data is not sufficient. It happened with me also when I was working with a smaller dataset.
Hi Prateek Nice article, I'm trying to use this code in a large sentences dataset so I want to retrain the model multiple times, can you please provide us with the implementation of that. Thanks in advance!
Wonderful article, very helpful!
Nice Article ! I was just wondering could we run the same with a Time Distributed layer?And would it make the training better and faster?
Well article, the approach is very systematic and amazing. BTW, I tried to implement the code for myself and tried to do a little modification, as I wanted to track 'train accuracy' and 'validation accuracy' too, as long 'Loss' that you have mentioned. In order to have that, I added one more argument (metrics = 'accuracy') in the model.compile() function. But it shows some error like [Incompatible shapes: [21504] vs. [1024,21] [[{{node metrics_14/acc/Equal}}]] ]. Can you kindly have a look at this? Thanks in advance.
preds = model.predict_classes(testX.reshape((testX.shape[0],testX.shape[1]))) this line giving error - 'Sequential' object has no attribute 'predict_classes' This function were removed in TensorFlow version 2.6. Can you please update the that would be very helpful to me. thanks!!
Hello, it is good article, thanks all . please , write again article about how to build dataset for different NLP tasks, that is , annotations tools, methods, and .etc. Because , I am new in NLP , so I don't know how to prepare dataset for example Uzbek language .
thanks for your help. I am new learner in the area of Artificial Intelligence and this is my first trial of running NLP project. can you please build speech recognition system using the LAS architecture.