
Text Classification & Word Representations using FastText (An NLP library by Facebook)


Introduction

If you put a status update on Facebook about purchasing a car – don’t be surprised if Facebook serves you a car ad on your screen. This is not black magic! This is Facebook leveraging the text data to serve you better ads.

The picture below takes a jibe at a challenge while dealing with text data.

[Image: Facebook ad serving using NLP]

Well, it clearly failed in the above attempt to deliver the right ad. It is all the more important to capture the context in which a word has been used. This is a common problem in Natural Language Processing (NLP) tasks.

A single word with the same spelling and pronunciation (a homonym) can be used in multiple contexts, and a potential solution to the above problem is computing word representations.

Now, imagine the challenge for Facebook. Facebook deals with an enormous amount of text data on a daily basis in the form of status updates, comments etc., and it is all the more important for Facebook to utilise this text data to serve its users better. Using the text data generated by billions of users to compute word representations was a very time-consuming task until Facebook developed its own library, FastText, for word representations and text classification.

In this article, we will see how we can calculate word representations and perform text classification in a matter of seconds, compared to existing methods which took days to achieve the same performance.

 

Table of contents

  1. What is FastText?
  2. Installation
  3. Implementation
    • Learning word representations
    • Text Classification
  4. Pros and Cons
  5. End Notes

 

What is FastText?

FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification.

[Image: Uses of FastText]

This library has gained a lot of traction in the NLP community and is a possible substitute for the gensim package, which provides similar word-vector functionality. If you are new to word vectors and word representations in general, I suggest you read this article first.

But the question that we should really be asking is – how is FastText different from gensim word vectors?

FastText differs in the sense that word2vec treats every single word as the smallest unit whose vector representation is to be found, whereas fastText assumes a word to be formed by character n-grams. For example, with n = 3, sunny is composed of the tri-grams [sun, unn, nny] (fastText also pads the word with the boundary symbols < and >, adding <su and ny>), and n could range from 1 up to the length of the word. This new representation of a word by fastText provides the following benefits over word2vec or GloVe.

  1. It is helpful for finding the vector representation of rare words. Since rare words can still be broken into character n-grams, they can share these n-grams with common words. For example, for a model trained on a news dataset, medical terms, e.g. disease names, may be the rare words.
    [Image: Common and rare words for an NLP task]
  2. It can give vector representations for words not present in the dictionary (OOV words), since these can also be broken down into character n-grams. word2vec and GloVe both fail to provide any vector representation for words not in the dictionary.
    For example, for a word like stupedofantabulouslyfantastic, which might never have appeared in any corpus, gensim might return one of the following – a) a zero vector or b) a random vector of low magnitude. But FastText can produce vectors better than random by breaking the above word into chunks and using the vectors of those chunks to create a final vector for the word. In this particular case, the final vector might be close to the vectors of fantastic and fantabulous.
  3. Character n-gram embeddings tend to outperform word2vec and GloVe on smaller datasets.
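The subword idea above can be sketched in plain Python. This is only an illustration of character n-gram extraction, not fastText's exact internals; it mimics fastText's habit of padding each word with the boundary symbols < and >:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams from a word, padding it with the
    boundary markers '<' and '>' as fastText does."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

# With n = 3 only:
print(char_ngrams("sunny", 3, 3))   # ['<su', 'sun', 'unn', 'nny', 'ny>']

# A rare word shares many n-grams with a common one, which is how it
# can still receive a sensible vector:
shared = set(char_ngrams("stupendously")) & set(char_ngrams("tremendously"))
print(sorted(g for g in shared if len(g) == 6))
```

Notice how the two long words, despite being different tokens, overlap heavily in their subword inventory; the shared n-grams are what let fastText transfer information between them.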

We will now look at the steps to install the fastText library below.

 

Installation

To make full use of the FastText library, please make sure you have the following requirements satisfied:

  1. OS – MacOS or Linux
  2. C++ compiler – gcc or clang
  3. Python 2.6+, numpy and scipy.

If you do not have the above pre-requisites, I urge you to go ahead and install the above dependencies first.

To install FastText, type the code below-

git clone https://github.com/facebookresearch/fastText.git
cd fastText
make

You can check whether FastText has been properly installed by typing the below command inside the FastText folder.
./fasttext

If everything was installed correctly then, you should see the list of available commands for FastText as the output.

 

Implementation

As stated earlier, FastText was designed for two specific purposes- Word Representation Learning and Text Classification. We will see each of these steps in detail. Let us get started with learning word representations.

 

Learning Word Representations

Words in their natural form cannot be used directly for most Machine Learning tasks. One way to use words is to transform them into representations that capture some attributes of the word. It is analogous to describing a person as [‘height’: 5.10, ‘weight’: 75, ‘colour’: ‘dusky’, etc.], where height, weight etc. are the attributes of the person. Similarly, word representations capture abstract attributes of words in such a manner that similar words tend to have similar representations. There are primarily two methods used to develop word vectors – Skipgram and CBOW.

We will see how we can implement both these methods to learn vector representations for a sample text file using fasttext.

Learning word representations using Skipgram and CBOW models

  1.  Skipgram
    ./fasttext skipgram -input file.txt -output model
  2. CBOW
    ./fasttext cbow -input file.txt -output model

Let us see the parameters defined above in steps for easy understanding.

./fasttext – invokes the FastText library.
skipgram/cbow – specifies whether skipgram or cbow is to be used to create the word representations.
-input – this option tells fastText that the next argument is the name of the file to be used for training. This argument should be used as is.
file.txt – a sample text file over which we wish to train the skipgram or cbow model. Change this to the name of your own text file.
-output – this option tells fastText that the next argument is the name of the model being created. This argument should be used as is.
model – the name of the model created.

Running the above command will create two files named model.bin and model.vec. model.bin contains the model parameters, dictionary and hyperparameters and can be used to compute word vectors. model.vec is a text file containing one word and its vector per line.
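For illustration, model.vec follows the common word2vec text format: a header line with the vocabulary size and vector dimension, then one word and its vector per line. A minimal Python sketch for parsing it (the file contents below are invented):

```python
import io

def load_vectors(f):
    """Parse a .vec file: the first line gives the vocabulary size and
    vector dimension; each following line is a word and its numbers."""
    n, dim = map(int, f.readline().split())
    vectors = {}
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == n and all(len(v) == dim for v in vectors.values())
    return vectors

# A tiny fake model.vec, just to show the format:
sample = io.StringIO("2 3\nhappy 0.1 0.2 0.3\nsad -0.1 0.0 0.4\n")
vecs = load_vectors(sample)
print(vecs["happy"])  # [0.1, 0.2, 0.3]
```

In practice you would pass `open("model.vec")` instead of the in-memory sample.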

Now that we have created our own word vectors, let’s see if we can do some common tasks, like printing the word vector for a word, finding similar words, analogies etc., using these vectors.

 

Print word vectors of a word

In order to get the word vectors for a word or set of words, save them in a text file. For example, here is a sample text file named queries.txt containing some random words. We will get the vector representations of these words using the model we trained above.

./fasttext print-word-vectors model.bin < queries.txt

To check the word vector for a single word without saving it into a file, you can do

echo "word" | ./fasttext print-word-vectors model.bin

 

Finding similar words

You can also find the words most similar to a given word. This functionality is provided by the nn parameter. Let’s see how we can find the most similar words to “happy”.

./fasttext nn model.bin

After typing the above command, the terminal will ask you to input a query word.

happy

by 0.183204
be 0.0822266
training 0.0522333
the 0.0404951
similar 0.036328
and 0.0248938
The 0.0229364
word 0.00767293
that 0.00138793
syntactic -0.00251774

The above is the result returned for the words most similar to happy. Interestingly, this feature could be used to correct spellings too. For example, when you enter a misspelt word, it shows the correct spelling of the word if it occurred in the training file.

wrd

word 0.481091
words. 0.389373
words 0.370469
word2vec 0.354458
more 0.345805
and 0.333076
with 0.325603
in 0.268813
Word2vec 0.26591
or 0.263104
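Under the hood, nn simply ranks every vocabulary word by the cosine similarity of its vector to the query word's vector. A self-contained sketch of that ranking with invented toy vectors (the words and numbers are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, vectors, k=3):
    """Rank every word (except the query itself) by cosine similarity."""
    q = vectors[query]
    ranked = sorted(((w, cosine(q, v)) for w, v in vectors.items() if w != query),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up 2-d vectors, purely for illustration:
toy = {"happy": [0.9, 0.1], "glad": [0.8, 0.2],
       "sad": [-0.7, 0.1], "table": [0.1, 0.9]}
print(nearest("happy", toy, k=2))
```

The real model works the same way, just over a vocabulary of thousands of words in a 100-dimensional space.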

 

Analogies

FastText word vectors can also be used on analogy tasks of the form A : B :: C : ?, where A, B and C are words.

The analogies functionality is provided by the parameter analogies. Let’s see this with the help of an example.

./fasttext analogies model.bin

The above command will ask you to input a triplet of words in the form A - B + C; we just need to give three words separated by spaces.

happy sad angry

of 0.199229
the 0.187058
context 0.158968
a 0.151884
as 0.142561
The 0.136407
or 0.119725
on 0.117082
and 0.113304
be 0.0996916

Training on a very large corpus will produce better results.
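The analogies command works by simple vector arithmetic: it ranks words by how close they are to vec(A) - vec(B) + vec(C). A toy sketch of that idea, with vectors invented so that the classic king/man/woman example works out:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    return max((w for w in vectors if w not in (a, b, c)),
               key=lambda w: cosine(target, vectors[w]))

# Invented 2-d vectors in which "king - man + woman" lands near "queen":
toy = {"king": [1.0, 1.0], "man": [1.0, 0.0], "woman": [0.0, 0.2],
       "queen": [0.1, 1.1], "apple": [-1.0, -1.0]}
print(analogy("king", "man", "woman", toy))  # queen
```

With vectors trained on a large corpus, the same arithmetic recovers relations like gender, tense and capital cities; on a tiny training file, as the output above shows, the neighbours are mostly noise.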

 

Text Classification

As the name suggests, text classification is tagging each document in a text corpus with a particular class. Sentiment analysis and email classification are classic examples of text classification. In this era of technology, millions of digital documents are generated each day, and it would cost a huge amount of time and human effort to categorise them into reasonable categories like spam and non-spam, important and unimportant, and so on. The text classification techniques of NLP come to our rescue here. Let’s see how by doing hands-on practice on a sentiment analysis problem. I have taken the data for this analysis from Kaggle.

Before we jump into the execution, a word of caution about the training file. The default format of the text file on which we want to train our model is:

__label__<X> <Text>

where __label__ is a prefix to the class and <X> is the class assigned to the document. Also, there should be no quotes around the document, and everything in one document should be on one line.

[Image: sample fastText file format]
In fact, the reason I selected this data for the article is that it is already available in exactly the required default format. If you are completely new to FastText and are implementing text classification for the very first time, I would strongly recommend using the data mentioned above.

In case your data has the label in some other format, don’t worry. FastText will take care of it once you pass a suitable argument. We will see how in a moment. Just stick to the article.
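If your data arrives as plain (label, text) pairs, a small helper like the following (hypothetical, not part of fastText) can rewrite each document into the default format, stripping quotes and newlines as the rules above require:

```python
def to_fasttext_line(label, text):
    """Format one document for fastText: prefix the label with
    '__label__' and flatten the text onto a single unquoted line."""
    clean = text.replace("\n", " ").replace('"', "").strip()
    return f"__label__{label} {clean}"

# Invented example rows; write the result to your training file:
rows = [("2", 'Great product!\nWorks "perfectly".'),
        ("1", "Arrived broken.")]
lines = [to_fasttext_line(lbl, txt) for lbl, txt in rows]
print(lines[0])  # __label__2 Great product! Works perfectly.
```

Writing these lines out with `"\n".join(lines)` produces a file fastText can consume directly.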

After this briefing about text classification, let’s move ahead to the implementation. We will use the train.ft.txt file to train the model and the test.ft.txt file to predict.

#training the classifier
./fasttext supervised -input train.ft.txt -output model_kaggle -label  __label__

Here, the parameters are the same as the ones mentioned while creating word representations. The only additional parameter is -label, which takes care of the format of the label specified. The file that you downloaded contains labels with the prefix __label__.

If you do not wish to use the default parameters for training the model, they can be specified at training time. For example, if you explicitly want to set the learning rate of the training process, you can use the argument -lr.

./fasttext supervised -input train.ft.txt -output model_kaggle -label  __label__ -lr 0.5

The other available parameters that can be tuned are –

  • -lr : learning rate [0.1]
  • -lrUpdateRate : change the rate of updates for the learning rate [100]
  • -dim : size of word vectors [100]
  • -ws : size of the context window [5]
  • -epoch : number of epochs [5]
  • -neg : number of negatives sampled [5]
  • -loss : loss function {ns, hs, softmax} [ns]
  • -thread : number of threads [12]
  • -pretrainedVectors : pretrained word vectors for supervised learning []
  • -saveOutput : whether output params should be saved [0]

The values in the square brackets [] represent the default values of the parameters passed.

# Testing the result
./fasttext test model_kaggle.bin test.ft.txt

N 400000
P@1 0.916
R@1 0.916

Number of examples: 400000
P@1 is the precision
R@1 is the recall
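Since each review in this dataset carries exactly one label, P@1 and R@1 coincide: both reduce to the fraction of documents whose top predicted label matches the true one. A quick sketch of the computation (the labels below are invented):

```python
def precision_at_1(true_labels, predicted_labels):
    """Fraction of documents whose top predicted label is correct.
    With exactly one true label per document, P@1 equals R@1."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

true = ["__label__2", "__label__1", "__label__2", "__label__1"]
pred = ["__label__2", "__label__1", "__label__1", "__label__1"]
print(precision_at_1(true, pred))  # 0.75
```

This is what the `test` command reports as P@1 over the 400,000 test examples.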

# Predicting on the test dataset
./fasttext predict model_kaggle.bin test.ft.txt

# Predicting the top 3 labels
./fasttext predict model_kaggle.bin test.ft.txt 3

 

Computing Sentence Vectors (Supervised)

This model can also be used for computing the sentence vectors. Let us see how we can compute the sentence vectors by using the following commands.

echo "this is a sample sentence" | ./fasttext print-sentence-vectors model_kaggle.bin
0.008204 0.016523 -0.028591 -0.0019852 -0.0043028 0.044917 -0.055856 -0.057333 0.16713 0.079895 0.0034849 0.052638 -0.073566 0.10069 0.0098551 -0.016581 -0.023504 -0.027494 -0.070747 -0.028199 0.068043 0.082783 -0.033781 0.051088 -0.024244 -0.031605 0.091783 -0.029228 -0.017851 0.047316 0.013819 0.072576 -0.004047 -0.10553 -0.12998 0.021245 0.0019761 -0.0068286 0.021346 0.012595 0.0016618 0.02793 0.0088362 0.031308 0.035874 -0.0078695 0.019297 0.032703 0.015868 0.025272 -0.035632 0.031488 -0.027837 0.020735 -0.01791 -0.021394 0.0055139 0.009132 -0.0042779 0.008727 -0.034485 0.027236 0.091251 0.018552 -0.019416 0.0094632 -0.0040765 0.012285 0.0039224 -0.0024119 -0.0023406 0.0025112 -0.0022772 0.0010826 0.0006142 0.0009227 0.016582 0.011488 0.019017 -0.0043627 0.00014679 -0.003167 0.0016855 -0.002838 0.0050221 -0.00078066 0.0015846 -0.0018429 0.0016942 -0.04923 0.056873 0.019886 0.043118 -0.002863 -0.0087295 -0.033149 -0.0030569 0.0063657 0.0016887 -0.0022234
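Conceptually, a supervised fastText sentence vector is an average of the vectors of the tokens in the sentence. A simplified sketch of that averaging over invented toy word vectors (real fastText also folds in subword vectors, which is skipped here):

```python
def sentence_vector(sentence, vectors):
    """Average the vectors of the in-vocabulary words of a sentence
    (a simplified stand-in for fastText's sentence vectors)."""
    words = [w for w in sentence.lower().split() if w in vectors]
    dim = len(next(iter(vectors.values())))
    totals = [0.0] * dim
    for w in words:
        for i, x in enumerate(vectors[w]):
            totals[i] += x
    return [t / len(words) for t in totals]

# Made-up 2-d word vectors, purely for illustration:
toy = {"this": [0.2, 0.0], "is": [0.0, 0.4], "good": [0.4, 0.2]}
print(sentence_vector("this is good", toy))
```

The 100 numbers printed by print-sentence-vectors above are exactly such an aggregate, in the model's 100-dimensional space.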

 

Pros and Cons of FastText

Like every library in development, it has its pros and cons. Let us state them explicitly.

Pros

  1. The library is surprisingly fast compared to other methods achieving the same accuracy. Here is the result published by the Facebook research team in support of this claim.
    [Image: Comparison of fastText with other word representation models]
  2. Sentence vectors (supervised) can be easily computed.
  3. fastText works better on small datasets in comparison to gensim.
  4. fastText outperforms gensim in terms of syntactic performance and fares equally well in semantic performance.

Cons

  1. It is not a standalone library for NLP, since it requires another library for the pre-processing steps.
  2. Though this library has a Python implementation, it is not officially supported.

 

End Notes

This article was aimed at making you aware of the FastText library as an alternative to the word2vec model and also letting you make your first vector representation and text classification model.

For people who want to dig deeper into the difference in performance between fastText and gensim, you can visit this link, where a researcher has carried out the comparison using a Jupyter notebook and some standard text datasets.

Please feel free to try out this library and share your experiences in the comments below.
