How to Build a Summarizer with the Trax Deep Learning Library
Trax is a full-featured deep learning library with a focus on clean code and fast computation. Its syntax is generally similar to Keras, and a Trax model can be converted to a Keras model. The library is actively developed and supported by the Google Brain team. Trax runs on top of JAX (with TensorFlow available as an alternative backend) and belongs to the same ecosystem; the same code runs on CPU, GPU, and TPU.
A transformer is designed to work with sequences, including textual ones, but unlike recurrent architectures it does not require processing the sequence in order. Simplifying greatly: if you take a Seq2Seq architecture built on LSTMs with attention, keep only the attention mechanism, and add a feed-forward neural network, you get a working transformer.
Here is my experiment in creating a summarizer with the world-famous Trax library: the model receives an article as input and generates a short text conveying its essence. A summary can be as short as a headline. I will try to describe everything in detail. Let’s begin with data analysis!
As the dataset for the experiment, I decided to use the Lenta.Ru news corpus, the latest version of which I found on Kaggle. The corpus contains over 800 thousand news articles in the format (URL, title, text, topic, tags, date). The article text serves as the input, and the title serves as the summary for my model: it is a complete sentence containing the main message of the news article.
First, I filtered out abnormally short and abnormally long articles. Then I extracted the texts and titles, converted everything to lowercase, and saved the result both as a list of (text, title) tuples and as a single full-text file. I split the list of tuples into two parts, one for training (train) and one for evaluation (eval). Then I wrote an “infinite” generator which, on reaching the end of the list, shuffles it and starts over: it is unpleasant when a generator runs dry in the middle of an epoch. This matters most for the evaluation set, for which I took only 5% of the articles, about 36 thousand pairs.
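Here is a minimal sketch of such an infinite generator (the shuffling strategy is my assumption; the article does not list this code):

import random

def infinite_stream(pairs):
    # Yield (text, title) pairs forever, reshuffling after each full pass
    # so that an epoch never runs out of data mid-way.
    while True:
        random.shuffle(pairs)
        for pair in pairs:
            yield pair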
Based on the full text, I trained the tokenizer, using parts of words as tokens. The problem with tokenizing into whole words is that some words occur in the text rarely, perhaps only once, and there are a lot of such words, while the dictionary size is finite and I want to keep it small enough to fit into the memory of the virtual machine. With word-level tokenization you have to replace some words with named templates, constantly use a placeholder for out-of-vocabulary words, and even resort to special techniques like the pointer-generator. Splitting into subwords, by contrast, allows a tokenizer with a small dictionary that works with practically no loss of information.
There are several well-established approaches to such segmentation that are worth getting acquainted with. I chose a Byte Pair Encoding (BPE) model as implemented in the SentencePiece library. BPE is originally a text compression method: a frequently repeated sequence of characters is encoded by a character that does not occur in the original text. Segmentation works the same way, except that a frequently occurring sequence of characters becomes a new token, and the merging continues until the specified dictionary size is reached. My dictionary contains 16,000 tokens.
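To make the merge idea concrete, here is a toy, self-contained illustration of a single BPE merge step (my own sketch, not the SentencePiece implementation):

from collections import Counter

# A tiny corpus: words as character tuples, with their frequencies.
vocab = {('l','o','w'): 5, ('l','o','w','e','r'): 2, ('n','e','w'): 6}

# Count adjacent symbol pairs, weighted by word frequency.
pairs = Counter()
for word, freq in vocab.items():
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += freq

best = max(pairs, key=pairs.get)  # the most frequent pair, e.g. ('l', 'o')

def merge(word, pair):
    # Replace every occurrence of the pair with one fused token.
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

vocab = {merge(w, best): f for w, f in vocab.items()}
# Repeating this until the dictionary reaches 16,000 tokens is the essence of BPE.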
The tokenizer model was trained with this simple construction:
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    '--input=full_text.txt --pad_id=0 --bos_id=-1 --eos_id=1 --unk_id=2 '
    '--model_prefix=bpe --vocab_size=16000 --model_type=bpe'
)
The result is two files: a vocabulary file, useful for inspection, and a model file that can be loaded into a tokenizer wrapper. For the model I have chosen, the article and the title must be converted to sequences of integers and concatenated with the service tokens EOS: 1 and PAD: 0 (end of sequence and padding placeholder).
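As a quick check, the trained model can be loaded and applied directly with SentencePiece. A minimal sketch, assuming the file bpe.model produced by the model_prefix above; article and title are hypothetical strings:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('bpe.model')

# Article + EOS + title + EOS, following the scheme described above.
EOS = 1
sequence = sp.encode_as_ids(article) + [EOS] + sp.encode_as_ids(title) + [EOS]
print(sp.decode_ids(sequence))  # round-trips back to readable text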
After conversion, each sequence is placed into a fixed-length bucket; I have three of them: 256, 512, and 1024. The sequences in a bucket are automatically padded with the placeholder token to the fixed length and collected into batches. The batch size depends on the bucket: 16, 8, and 4 sequences, respectively.
(Figure: sequences longer than 512 tokens.)
Segmentation and concatenation are done in the Trax pipeline:

input_pipeline = trax.data.Serial(
    trax.data.Tokenize(vocab_type='sentencepiece',
                       vocab_dir='/content/drive/MyDrive/',
                       vocab_file='bpe.model'),
    preprocessing,
    trax.data.FilterByLength(1024)
)

train_stream = input_pipeline(train_data_stream())
eval_stream = input_pipeline(eval_data_stream())

Here preprocessing is my concatenation function, implemented as a generator. Sorting into buckets and batching are performed by the following construction:

boundaries = [256, 512]
batch_sizes = [16, 8, 4]

train_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes)(train_stream)
eval_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes)(eval_stream)
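For completeness, here is a minimal sketch of what preprocessing could look like (my assumptions: EOS = 1, and the loss weights mask out the article so training focuses on the title):

import numpy as np

def preprocessing(stream):
    # Each incoming item is a pair of integer token arrays (article, title).
    # The model input is article + EOS + title + EOS; the mask marks the
    # title part so the loss is computed only on the summary.
    EOS = 1
    for article, title in stream:
        joined = np.concatenate([article, [EOS], title, [EOS]])
        mask = np.concatenate([np.zeros(len(article) + 1),
                               np.ones(len(title) + 1)])
        yield joined, joined, mask  # (input, target, loss weights)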
A transformer that works with two sequences, for example for machine translation, includes two blocks, an encoder and a decoder; for summarization, a decoder alone is sufficient. Such an architecture essentially implements a language model, where the probability of the next token is determined by the previous ones. It is also called a Decoder-only Transformer and is similar to GPT (Generative Pre-trained Transformer).
For my case, the Trax library has a separate model class, trax.models.transformer.TransformerLM(...), so a model can be created with a single line of code. In the specialization I mentioned, the model is built from scratch. I chose something in between and assembled the model from ready-made blocks, guided by the code examples.
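For reference, the one-line variant would look roughly like this (a sketch; the hyperparameter values mirror the ones I use in the blocks below):

import trax

model = trax.models.TransformerLM(
    vocab_size=16000, d_model=512, d_ff=2048,
    n_layers=6, n_heads=8, max_len=4096, mode='train')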
The diagram of the model is shown in the figure:
PositionalEncoder() is a block that maps tokens into the embedding vector space and encodes each token’s position in the input sequence. Code:
from trax import layers as tl

def PositionalEncoder(vocab_size, d_model, dropout, max_len, mode):
    return [
        tl.Embedding(vocab_size, d_model),
        tl.Dropout(rate=dropout, mode=mode),
        tl.PositionalEncoding(max_len=max_len, mode=mode)
    ]
- vocab_size (int): vocabulary size
- d_model (int): number of vector space features
- dropout (float): dropout rate
- max_len (int): maximum sequence length for positional encoding
- mode (str): ‘train’ or ‘eval’, used by the dropout and positional encoding layers
FeedForward() is the position-wise feed-forward block:

def FeedForward(d_model, d_ff, dropout, mode, ff_activation):
    return [
        tl.LayerNorm(),
        tl.Dense(d_ff),
        ff_activation(),
        tl.Dropout(rate=dropout, mode=mode),
        tl.Dense(d_model),
        tl.Dropout(rate=dropout, mode=mode)
    ]
- d_model (int): number of vector space features
- d_ff (int): the “width” of the block, i.e. the number of units in the hidden dense layer
- dropout (float): dropout rate
- mode (str): ‘train’ or ‘eval’, so that dropout is not applied when evaluating model quality
- ff_activation (function): activation function, ReLU in my model
DecoderBlock() wraps causal self-attention and the feed-forward block in residual connections:

def DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation):
    return [
        tl.Residual(
            tl.LayerNorm(),
            tl.CausalAttention(d_model, n_heads=n_heads,
                               dropout=dropout, mode=mode)
        ),
        tl.Residual(
            FeedForward(d_model, d_ff, dropout, mode, ff_activation)
        ),
    ]
SumTransformer() assembles the complete model:

def SumTransformer(vocab_size=vocab_size, d_model=512, d_ff=2048,
                   n_layers=6, n_heads=8, dropout=0.1,
                   max_len=4096, mode='train', ff_activation=tl.Relu):
    decoder_blocks = [DecoderBlock(d_model, d_ff, n_heads, dropout,
                                   mode, ff_activation)
                      for _ in range(n_layers)]
    return tl.Serial(
        tl.ShiftRight(mode=mode),
        PositionalEncoder(vocab_size, d_model, dropout, max_len, mode),
        decoder_blocks,
        tl.LayerNorm(),
        tl.Dense(vocab_size),
        tl.LogSoftmax()
    )
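As a sanity check, the assembled model can be instantiated and printed; Trax layers display their nested structure (vocab_size=16000 is assumed, matching the tokenizer):

model = SumTransformer(vocab_size=16000, mode='train')
print(model)  # shows the Serial combinator with all nested blocks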
The training loop looks like this:

import os
from trax.supervised import training

def training_loop(SumTransformer, train_gen, eval_gen, output_dir="~/model"):
    output_dir = os.path.expanduser(output_dir)
    train_task = training.TrainTask(
        labeled_data=train_gen,
        loss_layer=tl.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam(0.0001),
        n_steps_per_checkpoint=100
    )
    eval_task = training.EvalTask(
        labeled_data=eval_gen,
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()]
    )
    loop = training.Loop(SumTransformer(),
                         train_task,
                         eval_tasks=[eval_task],
                         output_dir=output_dir)
    return loop
- SumTransformer (function): the model constructor, returning a trax.layers.combinators.Serial instance
- train_gen (generator): data stream for training
- eval_gen (generator): data stream for quality evaluation
- output_dir (str): folder for the model file, from where it can be copied to Google Drive before the virtual machine shuts down
Then everything is simple:
loop = training_loop(SumTransformer, train_batch_stream, eval_batch_stream)
loop.run(20000)
Evaluation of results
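The decoding step is not listed above; here is a minimal sketch of greedy generation with Trax’s autoregressive sampler (my assumptions: the sp tokenizer from earlier, a hypothetical article_text string, and the default checkpoint name model.pkl.gz):

import os
import numpy as np
from trax.supervised import decoding

model = SumTransformer(vocab_size=16000, mode='eval')
model.init_from_file(os.path.expanduser('~/model/model.pkl.gz'),
                     weights_only=True)

tokens = sp.encode_as_ids(article_text) + [1]   # article + EOS
output = decoding.autoregressive_sample(
    model,
    np.array(tokens)[None, :],                  # batch of one sequence
    temperature=0.0,                            # greedy decoding
    eos_id=1, max_length=64)
print(sp.decode_ids(output[0].tolist()))        # the generated headline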
Examples from the evaluation set:
Model: Audemars Piguet has presented a new model from the royal oak collection
Sample: magician accidentally shot an assistant in front of the audience
Model: at the festival in Pulkovo attacked with a knife
The Trax library is easy to use for simple deep learning projects. With the approach described in this article, you can summarize any text, blog post, or article within seconds. This is a beginner-friendly project!