NLP Case Study: Build Your Own Skill Matching Algorithm

Ganeshi Shreya 27 Mar, 2023 • 8 min read


Building a good resume matters to every student hoping to get hired by their dream company. When a company opens its recruitment process, thousands of people apply through platforms like LinkedIn. It is, of course, practically impossible to interview everyone who applies. This is where an AI-based resume screener, here built on Word2Vec, comes in: it identifies good resumes and shortlists them for interviews.

After cleaning the data with NLP methods such as tokenization and stopword removal, I used Word2Vec from gensim for word embeddings. Using these word embeddings, the K-Means Algorithm is used to generate K Clusters. Some of the clusters in this list contain skills (Tech, Non-tech & soft skills).


Learning Objectives 

In this article, you will:

  1. Identify the layout of the resume and determine the flow of content.
  2. Learn about Word2Vec.
  3. See how Word2Vec helps in extracting skills from resumes.

Table of Contents

  1. Dictionary Approach for Resume Screening
  2. What is Word2Vec?
  3. How is Word2Vec Effective for Skill Matching?
    3.1 Training the word2vec model
    3.2 Reading the resume and performing tokenization
    3.3 Finding the similarities between JD skills and resume tokens.
  4. Drawbacks of Word2Vec Skill Matching
  5. Script
  6. Conclusion

Dictionary Approach for Resume Screening

A resume screener usually includes the following steps:

  1. Reading resume
  2. Layout Classification
    • Identifying the resume’s layout is essential since it determines the flow of content within the resume
  3. Section Segmentation
    • Identifying the section headers and segmenting the resume using these headers like Educational Qualification, Work Experience, Skill Set sections, etc.
  4. Information extraction, which includes:
    • Candidate’s Primary Details
    • Skill Set
    • Academic Details
    • Work Experience
    • Company and job designation
    • Job Location

Skill set extraction involves identifying the technical skills present in the resume and matching them with the mandatory skills of the job description (JD). The easiest way to extract them is to check each resume token against a technical skills dictionary kept in the backend. A JD usually lists domains as skills, so every skill in the dictionary needs to be mapped to its domain.
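The dictionary lookup described above can be sketched in a few lines. This is only an illustration: the `SKILL_DOMAINS` table, the function name, and the sample tokens are hypothetical and do not come from the author's backend.

```python
# Hypothetical skills dictionary mapping each known skill to its domain
SKILL_DOMAINS = {
    "python": "programming",
    "pytorch": "deep learning",
    "tensorflow": "deep learning",
    "mysql": "databases",
}

def match_by_dictionary(resume_tokens, jd_domains):
    """Return, for each JD domain, whether any resume token maps to it."""
    found = {SKILL_DOMAINS[t] for t in resume_tokens if t in SKILL_DOMAINS}
    return {domain: domain in found for domain in jd_domains}

# "keras" is a deep learning skill, but it is missing from the dictionary
resume = ["python", "keras", "mysql"]
result = match_by_dictionary(resume, ["deep learning", "databases"])
print(result)  # {'deep learning': False, 'databases': True}
```

Because "keras" has no dictionary entry, the deep learning requirement is reported as unmet even though the candidate has a relevant skill.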


What if the skills mentioned in the resume are missing in the dictionary? What if a resume skill is not mapped to its domain? Simple, the resume will be rejected!
To solve this problem, instead of checking only for the presence of a skill in the dictionary, it is more effective to check for the presence of the skill or of skills related to it. This article uses a neural word-embedding model, Word2Vec, to match resume skills with JD skills.

What is Word2Vec?


Word2Vec is one of the word embedding architectures for transforming text into numerics, i.e., a vector. Word2Vec is different from other representation techniques like BOW, One-Hot encoding, TF-IDF, etc., as it captures semantic and syntactic relationships between words using a simple neural network with one hidden layer. In short, the words that are related will be placed close to each other in the vector space. The weights obtained in the hidden layer after the convergence of the model are the embeddings. Word2Vec comes in two architectures, which differ in what the network is trained to predict:

  1. Continuous Bag of Words (CBOW)
    Given the surrounding context words, it predicts the target word that is most probable in the middle of that context.
  2. Skip Gram
    It works in the opposite direction to CBOW: given a word, it predicts the context words within a window of t words around it.
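To make the two architectures concrete, the sketch below lists the training examples each one would see for a short sentence. It only builds the (input, output) pairs and does not train anything; the function name and the window size are illustrative assumptions.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_pairs, skipgram_pairs) for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))                 # context words -> target word
        skipgram.extend((target, c) for c in context)  # target word -> each context word
    return cbow, skipgram

cbow, skipgram = training_pairs(["deep", "learning", "with", "python"], window=1)
print(cbow[1])       # (['deep', 'with'], 'learning')
print(skipgram[:3])  # [('deep', 'learning'), ('learning', 'deep'), ('learning', 'with')]
```

The same sentence therefore yields one example per word for CBOW, and one example per (word, neighbour) pair for Skip Gram.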


How is Word2Vec Effective for Skill Matching?

How’s word2vec useful in matching resume skills with JD? The solution is just three simple steps:

  1. Training the word2vec model
  2. Reading the resume and performing tokenization
  3. Finding the similarities between JD skills and resume tokens.

Training the word2vec model

  • Note – Our implementation is limited only to data science resumes. It can further be generalized by improving the data.

Importing all the necessary libraries

import gensim
from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec
import pandas as pd
import joblib

Data Collection:

    1. Web scraping
      • Data is collected by scraping various data science-related websites, e-books, etc., using Python’s Beautiful Soup.
    2. Data Preprocessing
      • Lower case conversion
      • Removal of numerics
      • Removal of stop words
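The three preprocessing steps listed above could look roughly like this. The stop-word set here is a tiny hand-written stand-in chosen for the example, not the list used by the author.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "with", "for", "to"}

def clean(text):
    text = text.lower()               # lower case conversion
    text = re.sub(r"\d+", " ", text)  # removal of numerics
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]  # removal of stop words

cleaned = clean("5 years of experience in Machine Learning and SQL")
print(cleaned)  # ['years', 'experience', 'machine', 'learning', 'sql']
```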

Stemming and lemmatization are not performed, to avoid the loss of vocabulary. For example, when “Machine Learning” is stemmed, the words “machine” and “learning” are processed separately, which yields something like “machin learn”, and the skill is lost.

(Figure: sample of the training data)

N-gram words are created using gensim’s Phrases class. The data is passed to the Phrases class, which returns an object. The returned object can be saved locally and used whenever required.

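# Split every document into whitespace tokens
sent = [row.split() for row in df['data']]
phrases = Phrases(sent, min_count=30, progress_per=10000)
sentences = phrases[sent]  # apply the phrase model: frequent bigrams become one token joined by '_'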


Vocabulary Building using Gensim library:
Word2Vec requires us to build the vocabulary table (simply digesting all the words, keeping the unique ones, and doing some basic counts on them).
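Conceptually, building the vocabulary amounts to counting tokens and dropping the rare ones. The sketch below mimics that with the standard library; it is not gensim's implementation, and the `min_count` value is lowered so that the toy corpus leaves something behind.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok: n for tok, n in counts.items() if n >= min_count}

sentences = [["python", "sql"], ["python", "machine_learning"], ["sql", "python"]]
vocab = build_vocab(sentences, min_count=2)
print(vocab)  # {'python': 3, 'sql': 2}
```

Any token filtered out at this stage gets no embedding, which is the root of the out-of-vocabulary problem discussed later in the article.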

Training the model:
The word2vec model is trained using the gensim library and is saved locally to use whenever required.

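# Only min_count survives in the source; the other hyperparameters were lost, so gensim's defaults apply here
w2v_model = Word2Vec(min_count=20)

#Building Vocabulary
w2v_model.build_vocab(sentences, progress_per=10000)

#Saving the built vocabulary locally (the file name is an assumption)
# index_to_key is the gensim 4.x attribute; gensim 3.x calls it index2word
pd.DataFrame({'vocab': w2v_model.wv.index_to_key}).to_csv('vocab.csv', index=False)

#Training the model
w2v_model.train(sentences, total_examples = w2v_model.corpus_count, epochs = 30, report_delay = 1)

#saving the model
# The source gives only this directory; point path at a file before running
path = "/content/drive/MyDrive"
joblib.dump(w2v_model, path)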

print(w2v_model.wv.similarity('neural_network', 'machine_learning'))



Reading the resume and performing tokenization

Reading a resume
A resume can be of different forms like pdf, docx, image, etc. Different tools are used for extracting information from different forms of resumes.
PDF – using pdfplumber
Image – using OCR

Data preparation
After extracting the data, the next step is preprocessing, creating n-grams, and tokenization.
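The n-gram step matters because the model stores multi-word skills as single tokens such as machine_learning. The sketch below is a plain-Python stand-in for applying a trained Phrases object to resume tokens; the hand-written `known` set replaces the bigrams that gensim would learn from co-occurrence counts.

```python
def join_phrases(tokens, known_phrases):
    """Greedily merge adjacent tokens that form a known bigram into one token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical bigrams; in the article these come from the saved Phrases object
known = {("machine", "learning"), ("neural", "network")}
tokens = "built neural network models using machine learning".split()
joined = join_phrases(tokens, known)
print(joined)  # ['built', 'neural_network', 'models', 'using', 'machine_learning']
```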

Finding the similarities between JD skills and resume tokens

Here comes the final step. After performing the first two steps, we obtain the following things

  1. Word2vec model/Word Embeddings
  2. Phrases object
  3. Data vocabulary
  4. Resume tokens

The JD’s skills are entered manually. Now, we need to find the similarity between the JD skills and the resume tokens: if a JD skill has at least one relevant skill among the resume tokens, it is considered “present” in the resume; otherwise, it is considered “absent”.
How do we check whether a skill is relevant? The answer is cosine similarity, cos(A, B) = A.B / (|A| |B|). A resume token is considered relevant to a JD skill if the cosine similarity between the two embeddings is greater than a certain threshold (0.38 in the script below).
We create two arrays, one of JD skill embeddings and one of resume token embeddings, so that the numerator of the cosine similarity, i.e., A.B, can be computed for all pairs of embeddings simultaneously with a single matrix product. The result is then scaled by the norms of the vectors.
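The present/absent decision can be illustrated with the standard library alone. The two-dimensional vectors below are made up for the example; real embeddings come from the trained model, and 0.38 is the threshold that appears in the article's script.

```python
import math

def cosine(a, b):
    """Cosine similarity A.B / (|A| |B|) of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def skill_present(jd_vec, resume_vecs, threshold=0.38):
    """A JD skill is present if at least one resume token is similar enough."""
    return any(cosine(jd_vec, v) > threshold for v in resume_vecs)

# Toy embeddings: one JD skill and two resume tokens
jd_skill = [1.0, 0.0]
resume_tokens = [[0.0, 1.0], [0.8, 0.6]]

print(skill_present(jd_skill, resume_tokens))       # True: the second token has similarity of about 0.8
print(skill_present(jd_skill, [resume_tokens[0]]))  # False: the only token is orthogonal
```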

Drawbacks of Word2Vec for Skill Matching

What if a JD skill is not present in the vocabulary that was used for building the model? The model will not have an embedding for it; such words are called out-of-vocabulary (OOV) words. This is a major drawback of Word2Vec. Character-level embeddings can address this issue; FastText, for instance, builds its embeddings from character n-grams.

The major difference between Word2Vec and FastText is that Word2Vec feeds whole words into the neural network to learn the embeddings, whereas FastText breaks each word into several character n-grams (sub-words). The embedding vector for a word is then the sum of the vectors of its n-grams.
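The sub-word idea can be seen by listing the character n-grams of a word. The sketch below uses a fixed n of 3 and the `<` and `>` boundary markers; FastText itself combines several n-gram lengths, so treat this as a simplified illustration rather than its exact procedure.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares n-grams with words seen during training
shared = sorted(set(char_ngrams("pytorch")) & set(char_ngrams("torch")))
print(shared)  # ['ch>', 'orc', 'rch', 'tor']
```

Because an out-of-vocabulary word usually shares some n-grams with known words, a vector can still be assembled for it.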


Script

Installing Necessary Packages

!pip install pdfplumber
!pip install pytesseract
!sudo apt install tesseract-ocr
!pip install pdf2image
!sudo apt-get update
!sudo apt-get install poppler-utils
!pip install PyMuPDF
!pip install Aspose.Email-for-Python-via-NET
!pip install aspose-words

Importing Necessary Libraries

import pandas as pd
import os
import warnings
warnings.filterwarnings(action = 'ignore')
import gensim
from gensim.models import Word2Vec
import string
import numpy as np
from itertools import groupby, count
import re
import subprocess
import os.path
import sys
import logging
import joblib
from gensim.models.phrases import Phrases, Phraser
import pytesseract
import cv2
from pdf2image import convert_from_path
from PIL import Image 
Image.MAX_IMAGE_PIXELS = 1000000000 
import aspose.words as aw
import fitz
logger_watchtower = logging.getLogger(__name__)
# On pandas 1.5 and later, this warning is importable from pandas.errors instead
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

Function for reading resume

def _skills_in_box(image_gray,threshold=60):
  """
  Function for identifying boxes and identifying skills in it: Given a grayscale image,
  returns string with text in it.
      image_gray: Grayscale image as a NumPy array
      threshold : Pixel values at or below this are set to 0
  """
  img = image_gray.copy()
  thresh_inv = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)[1]
  # Blur the image
  blur = cv2.GaussianBlur(thresh_inv,(1,1),0)
  thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]
  # find contours
  contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[0]
  mask = np.ones(img.shape[:2], dtype="uint8") * 255
  available = 0
  for c in contours:
    # get the bounding rect
    x, y, w, h = cv2.boundingRect(c)
    if w*h>1000:
        # black out the inside of every sufficiently large box in the mask
        cv2.rectangle(mask, (x+5, y+5), (x+w-5, y+h-5), (0, 0, 255), -1)
        available = 1

  res = ''
  if available == 1:
    # keep only the pixels that lie inside the detected boxes
    res_final = cv2.bitwise_and(img, img, mask=cv2.bitwise_not(mask))
    res_final[res_final<=threshold]=0
    kernel = np.array([[0, -1, 0], [-1, 5,-1], [0, -1, 0]])
    # sharpened copy; as in the source, the OCR call below still reads res_final
    res_fin = cv2.filter2D(src=res_final, ddepth=-1, kernel=kernel)
    vt = pytesseract.image_to_data(255-res_final,output_type='data.frame')
    vt = vt[vt.conf != -1]
    res = ''
    # keep only the words that Tesseract recognised with confidence >= 43
    for i in vt[vt['conf']>=43]['text']:
      res = res + str(i) + ' '
  return res
def _image_to_string(img):
  """
  Function for converting images to grayscale and converting to text: Given an image,
  returns text in it.
      img: Image as a NumPy array
  """
  img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
  res = ''
  string1 = pytesseract.image_to_data(img,output_type='data.frame')
  string1 = string1[string1['conf'] != -1]
  # keep only the words that Tesseract recognised with confidence >= 43
  for i in string1[string1['conf']>=43]['text']:
    res = res + str(i) + ' '
  # also read the text that sits inside boxes on the page
  string3 = _skills_in_box(img)
  return res+string3
def _pdf_to_png(pdf_path):
    """
    Function for converting each page of a pdf to an image and
    converting the image into string
        pdf_path: Path of the pdf
    """
    string = ''
    images = convert_from_path(pdf_path)
    for page_image in images:
        # Run OCR on every page image of the pdf
        image = np.array(page_image)
        string += _image_to_string(image)
        string += '\n'
    return string
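def ocr(paths):
    """
    Function for checking the pdf is image or not. If the file is in .doc it converts it into .pdf
    if the pdf is in image format the function converts .pdf to .png
        paths: path of the resume file
    """
    text = ""
    res = ""
    # The branch condition was lost in the source; an extension check is assumed here
    if paths.lower().endswith('.pdf'):
        doc = fitz.open(paths)
        for page in doc:
            text += page.get_text()
        # Almost no extractable text means the pdf is a scanned image, so fall back to OCR
        if len(text) <=10 :
            res = _pdf_to_png(paths)
        else:
            res = text
    else:
        # Convert .doc/.docx to pdf first, then read it the same way
        doc = aw.Document(paths)
        doc.save("Document.pdf")
        doc = fitz.open("Document.pdf")
        for page in doc:
            text += page.get_text()
        if len(text) <=10 :
            res = _pdf_to_png("Document.pdf")
        else:
            res = text
    return res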

Function for finding Cosine Similarity

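def to_la(L):
  # Column vector holding 1/|x| for every row x of L (body reconstructed)
  l = 1.0 / np.linalg.norm(L, axis=1)
  return l.reshape(-1, 1)

def cos(A, B):
  # Numerator A.B for every pair of rows, scaled by the outer product of the inverse norms
  dot_prod = np.matmul(A, B.T)
  k = np.matmul(to_la(A), to_la(B).T)
  return list(np.multiply(dot_prod,k))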

Function for finding the similarities and returning the final matched skills

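def check(path,skills,l2,w2v_model1,phrases,pattern):
  # Lines marked "reconstructed" were lost in the source and are best-effort guesses:
  # l2 is assumed to be the vocabulary list, pattern a regex for multi-word JD skills.
  text = ocr(path)
  text = re.sub(r'[^\x00-\x7f]',r' ',text)
  text = text.lower()
  text = re.sub(r"\\|,|/|:|\)|\("," ",text)
  t2 = text.split()
  match = set(re.findall(pattern,text)) if pattern else set()  # reconstructed
  for k in match:
    k=k.replace(' ','_')
    t2.append(k)  # reconstructed: keep a multi-word skill as one token
  vocab = set(l2)  # reconstructed
  l4 = [t for t in set(phrases[t2]) if t in vocab]  # reconstructed: resume tokens the model knows
  l6 = [s for s in skills if s in vocab]  # reconstructed: JD skills the model knows
  l6_minus_skills = [s for s in skills if s not in vocab]  # reconstructed: out-of-vocabulary JD skills
  resume_skills_dict = {}
  for i in l6_minus_skills:
    resume_skills_dict[i] = int(i in t2)  # reconstructed: exact-match fallback
  if not l6 or not l4:  # reconstructed guard for the surviving early return
    return resume_skills_dict
  # .wv[...] works on gensim 3.x and 4.x; indexing the model directly fails on 4.x
  arr1=np.array([w2v_model1.wv[i] for i in l6])
  arr2=np.array([w2v_model1.wv[i] for i in l4])
  similarity_values = cos(arr1,arr2)  # reconstructed
  for count, i in enumerate(similarity_values):
    # present (1) if any resume token reaches the 0.38 similarity threshold, else absent (0)
    resume_skills_dict[l6[count]] = int(max(i) >= 0.38)
  return resume_skills_dict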

Functions required for performing JD skills preprocessing

def Convert(string):
    li = list(string.split())
    return list(set(li))

def preprocess(string):
  string = string.replace(",",' ')
  string= string.replace("'",' ')
  string = Convert(string)
  return string

Main Function

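if __name__ == "__main__":
   #Arg 1 = model, Arg 2 = vocabulary, Arg 3 = phrases object, Arg 4 = JD's Mandatory Skills, Arg 5 = Resume Path
   # Lines marked "reconstructed" were lost in the source and are best-effort guesses.
   argv = sys.argv[1:]
   w2v_model1 = joblib.load(argv[0])
   l2 = pd.read_csv(argv[1]).iloc[:, -1].astype(str).tolist()  # reconstructed: vocabulary assumed to be the last CSV column
   phrases = Phrases.load(argv[2])  # reconstructed: assumes the object was saved with Phrases.save
   jd_skills = argv[3]  # reconstructed
   mapper = {}
   underscore = []  # reconstructed: JD skills that already contain '_'
   jd_skills=" ".join(jd_skills.strip().split())
   jd_skills=jd_skills.replace(', ',',')
   for i in jd_skills.split(','):
    if '_' in i:
      mapper[i.lower().replace('_',' ')] = i
      underscore.append(i)  # reconstructed
   jd_skills=jd_skills.replace(' ','_')
   jd_skills=jd_skills.replace(',',', ')
   for i in jd_skills.split(', '):
    if i not in underscore:
      if '_' in i:
        mapper[i.lower().replace('_',' ')] = i.replace('_',' ')
      elif '-' in i:
        mapper[i.lower().replace('-',' ')] = i
      else:  # reconstructed
        mapper[i.lower()] = i
   lines = [preprocess(jd_skills.lower().rstrip())]
   # reconstructed: regex for the multi-word JD skills as they would appear in the resume text
   pattern = '|'.join(re.escape(s.replace('_',' ')) for s in lines[0] if '_' in s)
   path = argv[4]
   res = [check(path, skills, l2, w2v_model1, phrases, pattern) for skills in lines]  # reconstructed
   res_dict = {}  # reconstructed
   for d in res:
    for i in d.keys():
      j=i.replace('_',' ')
      # map the normalised skill back to the spelling used in the JD
      res_dict[mapper.get(j, j)] = d[i]
    print('skills_matched :',res_dict)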

Command Line Argument

(<script>.py stands for the file that contains the code above.)

!python3 <script>.py '/content/drive/MyDrive/Skill_Matching_Files/Model(cbow).joblib' '/content/drive/MyDrive/Skill_Matching_Files/vocab_split.csv' '/content/drive/MyDrive/Skill_Matching_Files/phrases_split.pkl' 'julia, kaggle, ml, mysql, oracle, python, pytorch, r, scikit learn, snowflake, sql, tensorflow' '/content/drive/MyDrive/Skill_Matching_Files/TESTING RESUME/Copy of 0_A.a.aa.pdf'


skills_matched : {'python': 1, 'r': 1, 'oracle': 0, 'snowflake': 1, 'pytorch': 1, 'tensorflow': 1, 'ml': 1, 'sql': 1, 'kaggle': 1, 'mysql': 1, 'julia': 1, 'scikit learn': 1}



Conclusion

I hope this article gave you some insight into extracting skills from resumes. You learned how the Word2Vec word embedding technique is used by companies in the recruitment industry to vet resumes.

Please comment below or connect with me on LinkedIn to drop a query or feedback if you have any doubts.
