Build Text Categorization Model with Spark NLP

guest_blog 13 Jul, 2020 • 7 min read


  • Setting up John Snow Labs Spark NLP on AWS EMR and using the library to perform simple text categorization of BBC articles.


Natural Language Processing is one of the most important processes for data science teams across the globe. With ever-growing data, most organizations have already moved to big data platforms like Apache Hadoop and cloud offerings like AWS, Azure, and GCP. These platforms are more than capable of handling big data, enabling organizations to perform analytics at scale on unstructured data such as text. But when it comes to machine learning, there is still a gap between big data systems and machine learning tools.

Popular Python machine learning libraries like scikit-learn and Gensim are highly optimized for single-node machines and are not designed for distributed environments. Apache Spark MLlib is one of the tools that helps bridge this gap, offering most of the common machine learning models, such as Linear Regression, Logistic Regression, SVM, Random Forest, K-Means, and LDA, to carry out the most common machine learning tasks.

If you are new to Spark, I strongly recommend going through the articles below:

  1. Comprehensive Introduction to Apache Spark, RDDs & Dataframes (using PySpark)

  2. Using PySpark to perform Transformations and Actions on RDD

  3. Complete Guide on DataFrame Operations in PySpark

Apart from machine learning algorithms, Spark MLlib also offers a plethora of feature transformers like Tokenizer, StopWordsRemover, and NGram, and feature extractors like CountVectorizer, TF-IDF, and Word2Vec. Although these transformers and extractors are sufficient to build a basic NLP pipeline, a more comprehensive, production-grade pipeline needs more advanced techniques like stemming, lemmatization, part-of-speech tagging, and named entity recognition.

John Snow Labs' Spark NLP offers a variety of annotators to perform advanced NLP tasks. For more information, check out the list of annotators and their usage on the website.

Setting up the environment

Let’s go ahead and see how to set up Spark NLP on AWS EMR.

1. Before we spin up the EMR cluster, we need to create a bootstrap action. Bootstrap actions are used to install additional software or customize the configuration of cluster nodes. The following bootstrap action can be used to set up Spark NLP on an EMR cluster:

#!/bin/bash
sudo yum install -y python36-devel python36-pip python36-setuptools python36-virtualenv
sudo python36 -m pip install --upgrade pip
sudo python36 -m pip install pandas
sudo python36 -m pip install boto3
# note: "re" ships with the Python standard library and does not need to be installed
sudo python36 -m pip install spark-nlp==2.4.5

Once you create the shell script, copy it to a location in AWS S3. You can also install additional Python packages in this script as per your requirements.


2. We can spin up an EMR cluster using the AWS console, the API, or the boto3 library in Python. The advantage of using Python is that you can reuse the code whenever you want to instantiate a cluster or add it to a workflow.

Following is the Python code to instantiate an EMR cluster:

import boto3

region_name = 'region_name'  # update the value

def get_security_group_id(group_name, region_name):
    ec2 = boto3.client('ec2', region_name=region_name)
    response = ec2.describe_security_groups(GroupNames=[group_name])
    return response['SecurityGroups'][0]['GroupId']

emr = boto3.client('emr', region_name=region_name)

cluster_response = emr.run_job_flow(
    Name='cluster_name',  # update the value
    LogUri='s3_path_for_logs',  # update the value
    ReleaseLabel='emr-5.27.0',  # required by run_job_flow; pick the EMR release you need
    Instances={
        'InstanceGroups': [
            {
                'Name': "Master nodes",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.2xlarge',  # change according to the requirement
                'InstanceCount': 1  # for master node High Availability, set count to more than 1
            },
            {
                'Name': "Slave nodes",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.2xlarge',  # change according to the requirement
                'InstanceCount': 2
            }
        ],
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2KeyName': 'key_pair_name',  # update the value
        'EmrManagedMasterSecurityGroup': get_security_group_id('ElasticMapReduce-master', region_name=region_name),
        # note: the slave group must reference the slave security group, not the master one
        'EmrManagedSlaveSecurityGroup': get_security_group_id('ElasticMapReduce-slave', region_name=region_name)
    },
    BootstrapActions=[
        {
            'Name': 'install_dependencies',  # illustrative name
            'ScriptBootstrapAction': {
                'Path': 'path_to_bootstrapaction_on_s3'  # update the value
            }
        }
    ],
    Steps=[],
    Applications=[
        {'Name': 'hadoop'},
        {'Name': 'spark'},
        {'Name': 'hive'},
        {'Name': 'zeppelin'},
        {'Name': 'presto'}
    ],
    Configurations=[
        # YARN
        {
            "Classification": "yarn-site",
            "Properties": {"yarn.nodemanager.vmem-pmem-ratio": "4",
                           "yarn.nodemanager.pmem-check-enabled": "false",
                           "yarn.nodemanager.vmem-check-enabled": "false"}
        },
        # HADOOP
        {
            "Classification": "hadoop-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Configurations": [],
                    "Properties": {"JAVA_HOME": "/usr/lib/jvm/java-1.8.0"}
                }
            ],
            "Properties": {}
        },
        # SPARK
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Configurations": [],
                    "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3",
                                   "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"}
                }
            ],
            "Properties": {}
        },
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "true"},
            "Configurations": []
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.dynamicAllocation.enabled": "true"  # default is also true
            }
        }
    ],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',  # default EMR roles; create them first if they do not exist
    ServiceRole='EMR_DefaultRole'
)

Note: Ensure that you have proper access to S3 bucket(s) that are used for logging and for storing bootstrap action script.

Text categorization of BBC articles using Spark NLP

Now that we have our cluster ready, let’s build a simple text categorization example on BBC data using Spark NLP and Spark MLlib.

  1. Save the Pipeline Model

    After successfully training, testing, and evaluating the model, you can save it to disk and use it in different Spark applications. To save the model to disk, use the code below (here pipeline_model is the fitted PipelineModel):

    pipeline_model.save('/path/to/storage_location')


Spark NLP provides a plethora of annotators and transformers to build a production-grade data pre-processing pipeline. Spark NLP integrates seamlessly with Spark MLlib, enabling us to build an end-to-end Natural Language Processing project in a distributed environment. In this article, we looked at how to install Spark NLP on AWS EMR and implemented text categorization of BBC data. We also examined different evaluation metrics in Spark MLlib and saw how to store a model for further use.

I hope you enjoyed the article. Keep learning!

About the Author

Satish Silveri
Former Data Engineer and an aspiring Data Scientist. I love to work with textual data and build state-of-the-art Natural Language Processing solutions. I like to research new technologies, and my current area of research is developing ML solutions using big data tools like Apache Spark and AWS SageMaker.

