Building a resume scanning system using Python


Job seekers browsing advertised positions are often overwhelmed by the number of jobs posted on employment websites. The simple search mechanisms on these pages do not really surface a good match between a resume and the job descriptions. Job seekers need to study each job description carefully and use the same language as the posting if they want to be shortlisted for an interview.

Likewise, companies advertising open positions often receive hundreds of applications handled by a small group of HR professionals, which is why Applicant Tracking Systems (ATS, also known as resume scanners) come to their aid. Resumes with irrelevant information or missing keywords are mostly filtered out by these systems before a hiring manager ever sees them. An ATS may use techniques of varying sophistication to rank candidates, ranging from simple keyword matching to advanced algorithms that analyze candidate skills and employment data in depth.

As a job seeker, it's really challenging to measure how well my resume matches a job description, so this post is about how I created a simple Python project for exactly that task.

Approach

This project lets users upload their own resume file and a specific job description to calculate the similarity between the two text documents. Alternatively, users can check their resume against job posts scraped from seek.com.au, Australia's leading employment marketplace. I used Python to scrape the job posts from Seek, given a job title and location supplied by the user. I also created a word cloud of the resume to get a clear view of its main keywords.

Importing the libraries

#Python libraries for web scraping
from bs4 import BeautifulSoup
import requests

#Python libraries for NLP
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#Python libraries for data handling and text vectorization
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#Python libraries for reading pdf, doc
import re
import io
import docx2txt
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

#Python libraries for wordclouds
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

Loading resume file

PDF and DOC/DOCX are the most common resume formats, so I used pdfminer and docx2txt, two Python modules, to extract the plain text. In my view, job seekers should generally avoid photos, images, or complex structures such as tables and frames unless required to include them: a plain-text resume is much easier for the extraction process.

#Converting pdf file to text
def PDF2Text(pdf_doc):
    rsmgr = PDFResourceManager()
    handler = io.StringIO()
    converter = TextConverter(rsmgr,handler)
    interpreter = PDFPageInterpreter(rsmgr,converter)
    
    with open(pdf_doc,'rb') as fh:
        for page in PDFPage.get_pages(fh,caching=True,check_extractable=True):
            interpreter.process_page(page)

    text = handler.getvalue()
    converter.close()
    handler.close()
    
    if text: 
        return text

#Converting MS document to text
def Doc2Text(doc_cv):
    text = docx2txt.process(doc_cv)
    text = text.replace('\n',' ')
    if text:
        return text
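
To put these two helpers together, a minimal usage sketch could look like the snippet below (the file name resume.pdf and the variable names resume_file and raw_resume are just illustrative assumptions, not part of the project's fixed interface):

#loading the resume into raw text (resume.pdf is an assumed file name)
resume_file = 'resume.pdf'
if resume_file.lower().endswith('.pdf'):
    raw_resume = PDF2Text(resume_file)
else:
    raw_resume = Doc2Text(resume_file)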

Preprocessing the resume file

Data cleansing or preprocessing is a critical part of any data science project, particularly in text analytics, and to my knowledge there is no standard process that fits every case. However, some common tasks include, but are not limited to:
  • Removal of URLs, HTML tags
  • Standardizing abbreviations (dm -> direct message, info -> information)
  • Removal of irrelevant characters: numbers, punctuation, emojis
  • Tokenization (words or sentences)
  • Lowercasing
  • Removal of stopwords (both built-in and user-defined)
  • Spelling correction
  • Stemming and lemmatization
For example, in sentiment analysis, texts are mostly collected from web pages, commercial sites, or social media, so cleaning involves removing emojis 🥰 and hashtags #, and expanding acronyms and abbreviations into proper words (dm to direct message, b4 to before), etc.

The text in a resume and a job description tends to be fairly standardized, so I concentrated on removing emails, URLs, numbers, and stopwords. The stopword list here combines the built-in stopwords from the nltk library with a user-defined list.

def TextPreprocess(text):

    #removing emails
    text = re.sub(r'[\w\.-]+@[\w\.-]+','',text)
    #removing http/https links
    text = re.sub(r'(www|http:|https:)+[^\s]+[\w]','',text)
    #removing web directory, sub-directory urls
    text = re.sub(r'[\w]+[0-9]?\.[\w]+[0-9]?/([\w]+[0-9]?/?)*','',text)
    #removing numbers and telephone numbers
    text = re.sub(r'[\+]?[0-9]+','',text)
    #replacing slash / with space
    text = re.sub('/', ' ',text)
    #removing single-quote suffixes e.g. 's, 're, 'll
    text = re.sub(r"'[\w]+",'', text)
    #lowercasing all the words
    text = text.lower()
    #tokenizing into words
    tokens = word_tokenize(text)
    #removal of stopwords, rare words
    lstopwords = []
    with open('stopwords.csv','r') as file:
        lines=file.readlines()
        for line in lines:
            lstopwords.append(line.strip())
    
    lstopwords.extend(stopwords.words('english'))
    text = ' '.join(token for token in tokens if token not in lstopwords)
    
    #removing special characters
    SpecialC = ['+','*',',',':',';','•','%','?','!','|','$','&','#','etc.','e.g.','.','–','(',')','“','”','’','\uf0b7','\u200b']
    for c in SpecialC:
        text =str(text).replace(c,'')
   
    return text
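
The cleaned text becomes the resume variable used in the similarity step later, and it also feeds the word cloud mentioned in the approach. A minimal sketch, assuming the raw text was loaded into raw_resume as above:

#preprocessing the extracted resume text
resume = TextPreprocess(raw_resume)

#wordcloud of the main resume keywords
wc = WordCloud(width=800, height=400, background_color='white', stopwords=STOPWORDS).generate(resume)
plt.figure(figsize=(10,5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()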

Job ads data scraping

In my previous projects I often used the Selenium library to automate the browser and scrape data, but this time I tried Requests and Beautiful Soup instead: these libraries are easier to start with, and I didn't really need to interact much with dynamic pages and content. You may choose your own scraping strategy depending on the target website and your purpose.

The code snippet below builds a GET request and sends it to the Seek server. The parameters include a job title, location, page number, and sort mode.

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}
url = 'https://www.seek.com.au/{}-jobs{}?page={}&sortmode={}'
title = 'Data Analyst'
location = 'Sydney'
page = '1'
sort = 'KeywordRelevance' #or 'ListedDate'

title = title.strip().replace(' ','-')
location = '/in-' + location.strip().replace(' ','-') if location else ''
url = url.format(title, location, page, sort)
url
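
With the sample parameters above, the formatted request URL works out to https://www.seek.com.au/Data-Analyst-jobs/in-Sydney?page=1&sortmode=KeywordRelevance.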

The program first scrapes all the job ads on the first page, and for each job ad it sends another request to fetch the full job details. Once the job ads on the current page are loaded, a dataframe is constructed for computing similarity scores later.

Please note that the parameters in the "find" and "find_all" calls below may need to change whenever the HTML structure of the target webpage changes, so it is recommended to check them before running the code.
 
#scraping job ads from seeks
page = requests.get(url, headers=headers)

if page.status_code ==200:
    soup = BeautifulSoup(page.content,'html.parser')
    search = soup.find('div',{'data-automation':'searchResults'})
                        
    if not search is None:
        jobs = search.find_all('div',{'class':'_1wkzzau0 a1msqi7e'})
        list_job = []
        
        for job in jobs:
            title = job.find('a',{'data-automation':'jobTitle'}).text 
            link = 'https://www.seek.com.au'+job.find('a')['href'].split('?')[0]
            comobj = job.find('a',{'data-automation':'jobCompany'})
            company = comobj.text if comobj is not None else 'Private Advertiser'
            location = job.find('a',{'data-automation':'jobLocation'}).text
            subclassification = job.find('a',{'data-automation':'jobSubClassification'}).text
            classification = job.find('a',{'data-automation':'jobClassification'}).text.replace('(','').replace(')','')
            salobj = job.find('span',{'data-automation':'jobSalary'})
            salary = salobj.text if salobj is not None else 'N/A' 
            premjob = job.find('span',{'data-automation':'jobPremium'})
            dateobj = job.find('span',{'data-automation':'jobListingDate'})
            listdate = dateobj.text if dateobj is not None else premjob.text
            shortJD = job.find('span',{'data-automation':'jobShortDescription'}).text
                          
            #get the details of job description
            pagejd = requests.get(link, headers = headers)
            soupjd = BeautifulSoup(pagejd.content,'html.parser')
            div_text = soupjd.find('div',{'data-automation':'jobAdDetails'}).find_all(text=True)
            jd = ' '.join(text for text in div_text)
                        
            dict_job = {'Job title':title,'Company':company,'Listed date': listdate,
                        'Location':location,'Classification':classification,
                        'Sub classification':subclassification,'Salary Package':salary,
                        'Short JD':shortJD,'Job Description':jd,'Link':link
                                        }
            list_job.append(dict_job)
                                               
df = pd.DataFrame(list_job)  
df

The scraped dataset of job ads looks like the figure below.


Comparing the similarity between the text documents

Now we come to the final part: computing a similarity score between a resume and the job descriptions. Generally, to calculate the similarity between two objects, we need to know which features to compare, and those features can be extracted from the objects themselves.

We can quantify how similar two text documents are by checking whether they have similar contextual meaning even though they are written differently, or simply by counting how many words they have in common. The first approach takes semantics into account to some extent, and because of the complexity of natural language, handling semantic similarity is very challenging despite the many studies developed in recent years. Within the scope of this project, the similarity relies mainly on measuring the words the two documents have in common.

The feature used for the similarity check is therefore the set of common words, but how can we represent this feature numerically, since raw text cannot be computed on directly or fed efficiently into machine learning algorithms? Two common approaches are Bag of Words and Term Frequency - Inverse Document Frequency (a.k.a. TF-IDF), supported in the scikit-learn package by CountVectorizer and TfidfVectorizer respectively.

The TF-IDF vectorizer takes into account not only how many times a word appears in the documents but also how important that word is, which is a clear advantage over the Count vectorizer, which just counts the words.

def Vectorization(documents, dtype):
    
    vectorizer = None
    if dtype == 'count':
        vectorizer = CountVectorizer()
    elif dtype == 'tfidf':
        vectorizer = TfidfVectorizer()
   
    sparse_matrix = vectorizer.fit_transform(documents)
    #convert to array
    doc_term_array = sparse_matrix.toarray()
    #convert to dataframe
    df = pd.DataFrame(doc_term_array, columns=vectorizer.get_feature_names_out())    
    
    return df, doc_term_array
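
To get a feel for what the vectorizer returns, here is a small toy example (not part of the actual pipeline; the two sample documents are made up for illustration):

#toy example: the resulting dataframe has one column per unique word
docs = ['python data analysis', 'data engineering with python']
df_terms, term_array = Vectorization(docs, 'tfidf')
print(df_terms)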

After transforming the feature into numerical form, it's time to compute the similarity! I am going to use Cosine Similarity (or Cosine Distance), which measures the cosine of the angle between two objects represented as vectors in a multi-dimensional space. The figure below illustrates how the cosine distance measures the level of similarity between two objects A and B: the smaller the angle, the higher the cosine score and the more similar the documents are.


If you want to know more about Cosine Similarity, this article is a useful explanation: https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/

def Cosine_Similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1)*np.linalg.norm(v2))
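
As a quick sanity check of this function (an illustrative example, not part of the pipeline), two identical vectors score 1.0 and two orthogonal vectors score 0.0:

#sanity check for the cosine similarity function
print(Cosine_Similarity(np.array([1, 1, 0]), np.array([1, 1, 0])))   #1.0
print(Cosine_Similarity(np.array([1, 0, 0]), np.array([0, 1, 0])))   #0.0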

The next step is to compute the similarity scores between the resume and all the job descriptions and combine them with the original job ads dataset. By default, this project extracts the first 22 matched jobs, depending on the search criteria and the response from seek.com.au. If you're on the web version of this project (mentioned below), you can navigate to the next batch of jobs rather than re-running the code with new parameters.
         
#get all the job description
job_descriptions = df[['Job Description']].values

list_cosine = []

#traverse all job descriptions in the dataset and compute the cosine similarity
for job_description in job_descriptions:
    _, vector = Vectorization([resume, TextPreprocess(str(job_description))], 'tfidf')
    similarity_score = round(Cosine_Similarity(vector[0], vector[1])*100,2)
    dict_cosine = {'Matching percent':'{}%'.format(similarity_score)}
    list_cosine.append(dict_cosine)

#combine with the job ads dataset
df_matching = pd.DataFrame(list_cosine)
df = pd.concat([df_matching, df], axis = 1)
df


If you just want to compare your own resume with a specific job description, run the code snippet below. For example, the matching score is 20.64% for the Data Analyst job from the Garvan Institute of Medical Research.

#or manually input job description
job_description = input('Input the job description: ')
_,vector = Vectorization([resume, TextPreprocess(job_description)],'tfidf')
similarity_score = round(Cosine_Similarity(vector[0], vector[1])*100,2)
print('\nYour resume and the job description are {}% matched'.format(similarity_score))


I have created a Streamlit version of this project at https://resume-scanning.streamlit.app where you can experiment without worrying about the coding environment, so please try it if you're interested. The full code in a Jupyter notebook is also shared at the GitHub link below.




I hope this lengthy post is useful. If you have any questions or comments, please post them below. Thanks for reading, and cheers!


