Building a Machine Learning Model for Title Generation

Sharvari Santosh 24 Sep, 2021


This article was published as a part of the Data Science Blogathon

Introduction

In this article, I will use the YouTube trending videos dataset and the Python programming language to train a text-generating language model with deep learning. The model will be used to generate titles for YouTube videos or for your blog posts.

Title generation is a Natural Language Processing task and a central problem in several areas of machine learning, including text synthesis, speech-to-text, and conversational systems.

To build a title generator or text generator, the model must be trained to learn the likelihood of a word occurring, using the words that already appear in the sequence as context.
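
To make "the likelihood of a word given its context" concrete, here is a minimal sketch that estimates next-word probabilities from simple bigram counts. This is not the neural model built later in the article. The function name `next_word_probabilities` and the toy titles are hypothetical and only serve as an illustration.

```python
from collections import Counter, defaultdict

def next_word_probabilities(titles, context_word):
    """Estimate P(next word | context_word) from bigram counts."""
    counts = defaultdict(Counter)
    for title in titles:
        words = title.lower().split()
        # Count every (current word, following word) pair
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    total = sum(counts[context_word].values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts[context_word].items()}

# Hypothetical toy titles, not taken from the dataset
toy_titles = [
    "the secret of happy people",
    "the secret of success",
    "the story of us",
]
probs = next_word_probabilities(toy_titles, "the")
print(probs)
```

The LSTM model trained below plays the same role, but it conditions on the whole preceding sequence rather than only the previous word.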

What is Natural Language Processing

NLP | Model for Title Generation — Image 2

Natural Language Processing (NLP) is often used for tasks such as text classification (for example, spam detection and sentiment analysis), text generation, and language translation. Text data can be viewed as a sequence of characters, a sequence of words, or a sequence of sentences. In most problems, text data is treated as a sequence of words. In this article, we will walk through the process using simple sample data. However, the steps discussed here apply to any NLP task. In particular, we will use TensorFlow 2 and Keras for text preprocessing, which includes:

Tokenization
Sequence
Padding

Building the Machine Learning Model for Title Generation

I will start this project of building a title generator with Python and machine learning by importing libraries and reading data sets. The data sets I use for this project can be downloaded from here.

Importing the necessary libraries

We import the libraries before we start working with them. Here, I have used Keras and TensorFlow as the main libraries for our model, since Keras is a highly productive interface for solving such problems with a deep learning approach.

import pandas as pd
import string
import numpy as np
import json
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku
import tensorflow as tf
tf.random.set_seed(2)
from numpy.random import seed
seed(1)

Loading the dataset

#load all the datasets 
df1 = pd.read_csv('USvideos.csv')
df2 = pd.read_csv('CAvideos.csv')
df3 = pd.read_csv('GBvideos.csv')

#load the datasets containing the category names
data1 = json.load(open('US_category_id.json'))
data2 = json.load(open('CA_category_id.json'))
data3 = json.load(open('GB_category_id.json'))

Now we need to process our data so that we can use it to train our machine learning model for the task of generating titles. Here are the data cleaning and processing steps we need to follow:

def category_extractor(data):
    i_d = [data['items'][i]['id'] for i in range(len(data['items']))]
    title = [data['items'][i]['snippet']["title"] for i in range(len(data['items']))]
    i_d = list(map(int, i_d))
    category = zip(i_d, title)
    category = dict(category)
    return category

#create a new category column by mapping the category names to their id
df1['category_title'] = df1['category_id'].map(category_extractor(data1))
df2['category_title'] = df2['category_id'].map(category_extractor(data2))
df3['category_title'] = df3['category_id'].map(category_extractor(data3))

#join the dataframes
df = pd.concat([df1, df2, df3], ignore_index=True)

#drop rows based on duplicate videos
df = df.drop_duplicates('video_id')

#collect only titles of entertainment videos
#feel free to use any category of video that you want
entertainment = df[df['category_title'] == 'Entertainment']['title']
entertainment = entertainment.tolist()

#remove punctuations and convert text to lowercase
def clean_text(text):
    text = ''.join(e for e in text if e not in string.punctuation).lower()
    
    text = text.encode('utf8').decode('ascii', 'ignore')
    return text

corpus = [clean_text(e) for e in entertainment]
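
To see what the two helpers above do, here is a small self-contained sketch that runs them on made-up input. The `sample` dictionary only mimics the shape the code expects from the category JSON files (`items`, `id`, `snippet`, `title`); its values and the example title are hypothetical. `category_extractor` is written here as a dictionary comprehension, which should behave the same as the version above.

```python
import string

def category_extractor(data):
    # Map each integer category id to its category title
    return {int(item['id']): item['snippet']['title'] for item in data['items']}

def clean_text(text):
    # Drop ASCII punctuation and lowercase the text
    text = ''.join(ch for ch in text if ch not in string.punctuation).lower()
    # Drop every non-ASCII character (accents, emoji, symbols)
    return text.encode('utf8').decode('ascii', 'ignore')

# Hypothetical data in the shape the code above expects
sample = {'items': [
    {'id': '1', 'snippet': {'title': 'Film & Animation'}},
    {'id': '24', 'snippet': {'title': 'Entertainment'}},
]}
categories = category_extractor(sample)
cleaned = clean_text("Café Tour: Paris!")
print(categories)
print(cleaned)
```

Note that the ASCII step removes accented letters entirely rather than transliterating them, so "Café" becomes "caf". That is acceptable for a demo, but worth keeping in mind for titles that are not in English.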

Generating sequences

Natural language processing tasks require input data in the form of a sequence of tokens. The first step after data cleaning is to generate sequences of n-gram tokens.

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be words, syllables, phonemes, letters, or base pairs. In this case, the n-grams are sequences of words from the corpus of titles.

Tokenizer is an API in TensorFlow Keras that splits sentences into tokens. Our text data is defined as a list of strings, one title per element.

Since deep learning models do not understand text, we need to convert the text into a numerical representation. The first step is tokenization. The Tokenizer API from TensorFlow Keras splits sentences into words and encodes these words as integers. Tokenization is the process of extracting tokens from a corpus:

tokenizer = Tokenizer()
def get_sequence_of_tokens(corpus):
  #get tokens
  tokenizer.fit_on_texts(corpus)
  total_words = len(tokenizer.word_index) + 1

  #convert to sequence of tokens
  input_sequences = []
  for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
      n_gram_sequence = token_list[:i+1]
      input_sequences.append(n_gram_sequence)

  return input_sequences, total_words
inp_sequences, total_words = get_sequence_of_tokens(corpus)
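
The sketch below shows, on a toy corpus, the kind of prefix sequences this step produces. It uses a simplified pure-Python stand-in for the Keras Tokenizer (the helper names `build_word_index` and `prefix_sequences` are hypothetical), so the exact ids may differ from what Keras assigns on real data.

```python
from collections import Counter

def build_word_index(corpus):
    """Assign ids starting at 1, most frequent word first (0 is left free for padding)."""
    counts = Counter(word for line in corpus for word in line.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def prefix_sequences(corpus, word_index):
    """Turn every title into all of its prefixes of length 2 or more."""
    sequences = []
    for line in corpus:
        tokens = [word_index[w] for w in line.split()]
        for i in range(1, len(tokens)):
            sequences.append(tokens[:i + 1])
    return sequences

toy_corpus = ["the secret of happy", "the secret life"]
word_index = build_word_index(toy_corpus)
sequences = prefix_sequences(toy_corpus, word_index)
total_words = len(word_index) + 1  # +1 accounts for the reserved padding id 0
print(word_index)
print(sequences)
```

Each title of length n contributes n - 1 training sequences, which is why even a modest number of titles yields a fairly large training set.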

Padding the sequences

In any raw text data, there will naturally be sentences of different lengths. However, neural networks need inputs of the same size. This is what padding is for. Whether to use 'pre' or 'post' padding depends on the analysis. In some cases padding at the beginning is appropriate, in others it is not. For example, if we use a Recurrent Neural Network (RNN) for spam detection, padding at the beginning may be appropriate, since RNNs struggle to learn long-distance patterns. Pre-padding keeps the actual tokens at the end of the sequence, so the RNN can use them to predict the next word. In any case, the padding strategy should be chosen after careful consideration and with domain knowledge.

Since sequences can vary in length, their lengths must be made equal. When using neural networks, we usually feed an input into the network and wait for the output. In practice, it is better to process data in batches than one example at a time. pad_sequences() is a function in the Keras deep learning library that can be used to pad variable-length sequences.

Batching is done using matrices of shape [batch size x sequence length], where the sequence length corresponds to the longest sequence. We fill the shorter sequences with a padding token (usually 0) to match the size of the matrix. This process of filling the token sequences is called padding. To feed the data into the model for training, I need to create predictors and labels.

I will use the n-gram sequence as the predictor and the next word of the n-gram as the label:

def generate_padded_sequences(input_sequences):
  max_sequence_len = max([len(x) for x in input_sequences])
  input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
  predictors, label = input_sequences[:,:-1], input_sequences[:, -1]
  label = ku.to_categorical(label, num_classes = total_words)
  return predictors, label, max_sequence_len
predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
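
To illustrate what pre-padding and the predictor/label split look like, here is a minimal pure-Python sketch on toy sequences. The helpers `pad_pre`, `split_predictors_labels`, and `one_hot` are hypothetical stand-ins for `pad_sequences`, the array slicing, and `to_categorical` used above; they are meant to show the idea, not to replace the Keras functions.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [[pad_value] * (max_len - len(s)) + s for s in sequences]

def split_predictors_labels(padded):
    """All tokens but the last are the predictor; the last token is the label."""
    predictors = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return predictors, labels

def one_hot(label, num_classes):
    """Encode an integer label as a 0/1 vector of length num_classes."""
    return [1 if i == label else 0 for i in range(num_classes)]

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
padded = pad_pre(toy_sequences)
predictors, labels = split_predictors_labels(padded)
encoded = [one_hot(y, num_classes=5) for y in labels]
print(padded)
print(predictors, labels)
```

Because the padding goes on the left, the label is always a real word and the most recent context words sit right next to it.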

LSTM Model for Title Generation

In recurrent neural networks, the activation outputs are propagated in both directions, i.e. from inputs to outputs and from outputs back to inputs, unlike feed-forward neural networks, where the activation outputs are propagated in only one direction. This creates loops in the architecture of the neural network that act as a "memory state" for the neurons.

Because of this, the RNN retains a state across time steps, or "remembers" what it has learned over time. The memory state has its advantages, but it also has its drawbacks. The vanishing gradient is one of them.

With this problem, when training with a large number of layers, it becomes very difficult for the network to learn and tune the parameters of the earlier layers. To solve this problem, a new type of RNN was developed: LSTM (long short-term memory).
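
A rough way to picture the vanishing gradient is to look at what happens when a gradient is multiplied by a factor smaller than 1 at every step on its way back through the network. The sketch below assumes the same constant factor at every step, which is a strong simplification of what happens in a real RNN; the function name `gradient_scale` and the factor 0.5 are hypothetical.

```python
def gradient_scale(per_step_factor, num_steps):
    """Size of a unit gradient after being multiplied by the same factor at every step."""
    return per_step_factor ** num_steps

# Assumed factor of 0.5 per step, for illustration only
for steps in (1, 10, 50):
    print(steps, gradient_scale(0.5, steps))
```

After a few dozen steps the signal reaching the earliest steps is effectively zero, which is why their parameters barely get updated.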

LSTM model

The LSTM model contains an additional state (the cell state) that allows the network to learn what to store in the long-term state, what to discard, and what to read from it. The model used here consists of four layers:

Input layer: takes the sequence of words as input
LSTM layer: computes the output using LSTM units
Dropout layer: a regularization layer to avoid overfitting
Output layer: computes the probability of each possible next word
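
To make the role of the cell state more tangible, here is a sketch of a single time step of a one-unit LSTM using the standard gate equations (forget, input, and output gates plus a candidate value). It is a scalar toy version, not the Keras implementation, and the weights are hypothetical values picked only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre('forget'))        # how much of the old cell state to keep
    i = sigmoid(pre('input'))         # how much new information to store
    o = sigmoid(pre('output'))        # how much of the cell state to read out
    g = math.tanh(pre('candidate'))   # candidate value to write
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# Hypothetical weights, identical for every gate
weights = {gate: (0.5, 0.5, 0.0)
           for gate in ('forget', 'input', 'output', 'candidate')}
h, c = 0.0, 0.0
for x in (1.0, 0.5, -1.0):
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through a step almost unchanged, which is what lets an LSTM carry information over many time steps.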

I will now use the LSTM model to build a title generator with machine learning:

def create_model(max_sequence_len, total_words):
  input_len = max_sequence_len - 1
  model = Sequential()

  # Add Input Embedding Layer
  model.add(Embedding(total_words, 10, input_length=input_len))

  # Add Hidden Layer 1 - LSTM Layer
  model.add(LSTM(100))
  model.add(Dropout(0.1))

  # Add Output Layer
  model.add(Dense(total_words, activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer='adam')

  return model
model = create_model(max_sequence_len, total_words)
model.fit(predictors, label, epochs=20, verbose=5)
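
As a quick sanity check on the size of this network, the trainable parameters can be counted by hand. The sketch below assumes the layer sizes used above (embedding dimension 10, 100 LSTM units) and the usual Keras defaults with bias terms enabled; the vocabulary size of 5000 is a hypothetical value, since the real `total_words` depends on the dataset. The result can be compared against `model.summary()`.

```python
def count_parameters(total_words, embedding_dim=10, lstm_units=100):
    """Hand-count trainable parameters for Embedding -> LSTM -> Dropout -> Dense."""
    # One embedding vector per word in the vocabulary
    embedding = total_words * embedding_dim
    # Four gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
    # Dropout has no parameters; Dense has a weight matrix plus a bias vector
    dense = lstm_units * total_words + total_words
    return {'embedding': embedding, 'lstm': lstm, 'dense': dense,
            'total': embedding + lstm + dense}

params = count_parameters(total_words=5000)  # hypothetical vocabulary size
print(params)
```

With a large vocabulary, most of the parameters sit in the output layer, so the vocabulary size has a strong effect on training time.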

Now that our title generator model is ready and has been trained on the data, it is time to predict a title based on an input word. The input word is first tokenized, and the resulting sequence is padded before being passed to the trained model, which returns the predicted next word:

def generate_text(seed_text, next_words, model, max_sequence_len):
  for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict_classes is no longer available in recent TensorFlow releases,
    # so take the index of the highest predicted probability instead
    predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]

    output_word = ""
    for word, index in tokenizer.word_index.items():
      if index == predicted:
        output_word = word
        break
    seed_text += " " + output_word
  return seed_text.title()
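
The inner loop above scans the whole vocabulary for every generated word. One possible alternative is to build a reverse lookup (id to word) once and reuse it. The sketch below shows that idea with a greedy generation loop; `toy_predict_next` is a hypothetical stand-in for the trained model and `toy_word_index` is a made-up vocabulary, so the generated text is meaningless and only shows the mechanics.

```python
def build_index_word(word_index):
    """Invert a word -> id mapping into an id -> word mapping."""
    return {index: word for word, index in word_index.items()}

def generate_title(seed_text, next_words, predict_next, word_index):
    """Greedy generation: predict_next takes a list of token ids and returns the next id."""
    index_word = build_index_word(word_index)
    for _ in range(next_words):
        token_ids = [word_index[w] for w in seed_text.lower().split() if w in word_index]
        next_id = predict_next(token_ids)
        seed_text += " " + index_word.get(next_id, "")
    return seed_text.title()

toy_word_index = {"the": 1, "secret": 2, "of": 3, "happy": 4}

def toy_predict_next(token_ids):
    # Hypothetical stand-in for the trained model: cycles through the vocabulary
    last = token_ids[-1] if token_ids else 0
    return last % len(toy_word_index) + 1

title = generate_title("happy", 3, toy_predict_next, toy_word_index)
print(title)
```

With the real model, `predict_next` would pad the ids and return the index of the most probable word, as `generate_text` does.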

Now that we have created the title generation function, let's test our title generator model:

print(generate_text("HAPPY", 5, model, max_sequence_len))

Output:  The Secret Of HAPPY

I hope you enjoyed this article on how to build a title generator model with machine learning and the Python programming language. Feel free to ask your questions in the comments section below.

Thanks For Reading!

About Me:

Hey, I am Sharvari Raut. I love to write!

Connect with me on:

Twitter: https://twitter.com/aree_yarr_sharu

LinkedIn: https://t.co/g0A8rcvcYo?amp=1

Github: https://github.com/sharur7

References :

Image 1: https://unsplash.com/s/photos/machine-learning?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText

Image 2: https://unsplash.com/s/photos/machine-learning?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


I am Sharvari Raut. I love to write. I am a final-year student in Computer Science and Engineering at NCER Pune. I have worked as a freelance technical writer for a few startups and companies. With 2 years of experience in technical writing, I have written over 100 technical articles that have been published so far. Writing for Analytics Vidhya is one of my favourite things to do.
