Job seekers browsing advertised jobs are sometimes overwhelmed by the number of positions posted on job websites. The simple search mechanisms on these pages do not really surface a good match between a resume and the job descriptions. Job seekers need to study each job description carefully and use the same language as the job posting if they want to be shortlisted for an interview.
Likewise, companies advertising open positions often receive hundreds of applications that must be handled by a small group of HR professionals, which is why a system called an Applicant Tracking System (ATS), or resume scanner, comes to their aid. Resumes with irrelevant information or missing keywords are mostly filtered out by this system before a hiring manager ever sees them. An ATS may use techniques of varying sophistication to rank candidates, from simple keyword matching to advanced algorithms that analyze candidate skills and employment data in more depth.
As a job seeker, it's really challenging to measure how well my resume matches a given job description, so this post is about how I created a simple Python project for exactly this task.
Approach
This project lets users upload their own resume file and a specific job description to calculate the similarity between the two text documents. Alternatively, users can check their resume against job posts scraped from seek.com.au, Australia's leading employment marketplace. I used Python to scrape the job posts from seek, given a job title and location supplied by the user. I also created a word cloud of the resume to give a clear view of its main keywords.
#Python libraries for web scraping
from bs4 import BeautifulSoup
import requests

#Python libraries for NLP
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#Python libraries for NLP tasks
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#Python libraries for reading pdf, doc
import re
import io
import docx2txt
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

#Python libraries for wordclouds
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
Loading resume file
PDF and DOC/DOCX are the most common formats for resume documents, so I used pdfminer and docx2txt, two Python modules, to extract the plain text. In my opinion, job seekers should generally avoid putting photos, images, or complex structures such as tables and frames into their resume unless required to do so; a plain-text resume is easier for the extraction process.
#Converting pdf file to text
def PDF2Text(pdf_doc):
    rsmgr = PDFResourceManager()
    handler = io.StringIO()
    converter = TextConverter(rsmgr, handler)
    interpreter = PDFPageInterpreter(rsmgr, converter)
    with open(pdf_doc, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            interpreter.process_page(page)
        text = handler.getvalue()
    converter.close()
    handler.close()
    if text:
        return text
#Converting MS document to text
def Doc2Text(doc_cv):
    text = docx2txt.process(doc_cv)
    text = text.replace('\n', ' ')
    if text:
        return text
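With both converters defined, a small helper can pick the right one based on the file extension. The sketch below is only illustrative (the LoadResume helper and the 'resume.pdf' file name are not part of the original code):

#illustrative helper (not part of the original project): choose a converter by file extension
def LoadResume(file_path):
    if file_path.lower().endswith('.pdf'):
        return PDF2Text(file_path)
    elif file_path.lower().endswith('.docx'):
        #docx2txt handles .docx files
        return Doc2Text(file_path)
    else:
        raise ValueError('Unsupported resume format: ' + file_path)

raw_resume = LoadResume('resume.pdf')  #placeholder file name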
Preprocessing the resume file
Data cleansing, or preprocessing, is a critical part of any data science project, particularly in text analytics, and to my knowledge there is no standard process that fits all cases. However, the general tasks include, but are not limited to:
- Removal of URLs, HTML tags
- Standardizing abbreviations (dm -> direct message, info -> information)
- Removing irrelevant characters: numbers, punctuation, emojis
- Tokenization (words or sentences)
- Lowercasing
- Removal of stopwords (both built-in and user-defined)
- Spelling correction
- Stemming and lemmatization
For example, in sentiment analysis the texts are mostly collected from web pages, commercial sites, or social media, so the cleaning step deals with removing emojis 🥰 and hashtags #, or standardizing acronyms and abbreviations into full words (dm to direct message, b4 to before), etc.
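As an illustration of that kind of standardization (it is not part of this project's pipeline), a simple lookup table can expand common abbreviations before further cleaning; the mapping below is just an example:

#illustrative only: expanding common abbreviations with a small lookup table
import re

abbreviations = {'dm': 'direct message', 'b4': 'before', 'info': 'information'}

def ExpandAbbreviations(text):
    #replace whole-word occurrences of each abbreviation
    for short, full in abbreviations.items():
        text = re.sub(r'\b{}\b'.format(short), full, text, flags=re.IGNORECASE)
    return text

print(ExpandAbbreviations('dm me b4 5pm'))  #direct message me before 5pm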
The texts from a resume and a job description tend to be fairly standardized, so I just concentrated on removing emails, URL links, numbers, and stopwords. The stopword list here combines the built-in stopwords from the nltk library with a user-defined list.
def TextPreprocess(text):
    #removing emails
    text = re.sub(r'[\w\.-]+@[\w\.-]+', '', text)
    #removing http links
    text = re.sub(r'(www|http:|https:)+[^\s]+[\w]', '', text)
    #removing web directory, sub-directory url
    text = re.sub(r'[\w]+[0-9]?\.[\w]+[0-9]?/([\w]+[0-9]?/?)*', '', text)
    #removing numbers, or telephone
    text = re.sub(r'[\+]?[0-9]+', '', text)
    #replacing slash / with space
    text = re.sub('/', ' ', text)
    #removing single quote with characters e.g. 's, 're, 'll
    text = re.sub(r"'[\w]+", '', text)
    #lowercasing all the words
    text = text.lower()
    #tokenizing into words
    tokens = word_tokenize(text)
    #removal of stopwords, rare words
    lstopwords = []
    with open('stopwords.csv', 'r') as file:
        lines = file.readlines()
        for line in lines:
            lstopwords.append(line.strip())
    lstopwords.extend(stopwords.words('english'))
    text = ' '.join(token for token in tokens if token not in lstopwords)
    #removing special characters
    SpecialC = ['+','*',',',':',';','•','%','?','!','|','$','&','#','etc.','e.g.','.','–','(',')','“','”','’','\uf0b7','\u200b']
    for c in SpecialC:
        text = str(text).replace(c, '')
    return text
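The cleaned resume text used later in the similarity step is simply the output of the converter passed through this function. A minimal sketch, assuming the resume is a PDF ('my_resume.pdf' is a placeholder file name):

#building the cleaned resume text used later for the similarity check
#'my_resume.pdf' is a placeholder file name
resume = TextPreprocess(PDF2Text('my_resume.pdf'))
print(resume[:200])  #preview the first 200 characters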
Job ads data scraping
In my previous projects I often used the Selenium library to automate the browser and scrape data, but this time I tried Requests and Beautiful Soup instead, because these libraries are easier to start with and I didn't really need to interact much with dynamic pages and content. You may choose your own scraping strategy depending on the target website and your purpose.
The code snippet below builds a GET request and sends it to the seek server. The parameters include a job title, location, page, and sort mode.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}
url = 'https://www.seek.com.au/{}-jobs{}?page={}&sortmode={}'
title = 'Data Analyst'
location = 'Sydney'
page = '1'
sort = 'KeywordRelevance'  #or 'ListedDate'
title = title.strip().replace(' ', '-')
location = '/in-' + location.strip().replace(' ', '-') if location else ''
url = url.format(title, location, page, sort)
url
Please note that the parameters in the "find" and "find_all" methods below may need to change whenever the HTML structure of the target webpage changes, so it is recommended to check them before running the code.
#scraping job ads from seek
page = requests.get(url, headers=headers)
if page.status_code == 200:
    soup = BeautifulSoup(page.content, 'html.parser')
    search = soup.find('div', {'data-automation': 'searchResults'})
    if search is not None:
        jobs = search.find_all('div', {'class': '_1wkzzau0 a1msqi7e'})
        list_job = []
        for job in jobs:
            title = job.find('a', {'data-automation': 'jobTitle'}).text
            link = 'https://www.seek.com.au' + job.find('a')['href'].split('?')[0]
            comobj = job.find('a', {'data-automation': 'jobCompany'})
            company = comobj.text if comobj is not None else 'Private Advertiser'
            location = job.find('a', {'data-automation': 'jobLocation'}).text
            subclassification = job.find('a', {'data-automation': 'jobSubClassification'}).text
            classification = job.find('a', {'data-automation': 'jobClassification'}).text.replace('(', '').replace(')', '')
            salobj = job.find('span', {'data-automation': 'jobSalary'})
            salary = salobj.text if salobj is not None else 'N/A'
            premjob = job.find('span', {'data-automation': 'jobPremium'})
            dateobj = job.find('span', {'data-automation': 'jobListingDate'})
            listdate = dateobj.text if dateobj is not None else premjob.text
            shortJD = job.find('span', {'data-automation': 'jobShortDescription'}).text
            #get the details of job description
            pagejd = requests.get(link, headers=headers)
            soupjd = BeautifulSoup(pagejd.content, 'html.parser')
            div_text = soupjd.find('div', {'data-automation': 'jobAdDetails'}).find_all(text=True)
            jd = ' '.join(text for text in div_text)
            dict_job = {'Job title': title, 'Company': company, 'Listed date': listdate,
                        'Location': location, 'Classification': classification,
                        'Sub classification': subclassification, 'Salary Package': salary,
                        'Short JD': shortJD, 'Job Description': jd, 'Link': link}
            list_job.append(dict_job)
        df = pd.DataFrame(list_job)
        df
The scraped dataset of job ads looks like the figure below.
Comparing the similarity between the text documents
Now we come to the final part: getting a similarity score between a resume and the job descriptions. Generally, to calculate the similarity between two objects, we need to decide which features to compare, and these features can be extracted from the objects themselves.
We can quantify how similar two text documents are by checking whether they carry a similar contextual meaning while being written in different ways, or simply by counting how many words they have in common. The first approach takes semantics into account to some extent, and because of the complexity of natural language, handling semantic similarity is very challenging even though many studies have tackled it recently. Within the scope of this project, the similarity relies mainly on measuring the common words between the text documents.
It's clear that the feature used for the similarity check is the common words, but how can we represent this feature numerically, since raw text cannot be fed efficiently into machine learning algorithms? There are two common approaches: Bag of Words and Term Frequency - Inverse Document Frequency (a.k.a. TF-IDF). Both are supported in the scikit-learn package through CountVectorizer and TfidfVectorizer.
The TF-IDF vectorizer takes into account not only how many times a word appears in the documents but also how important that word is, which is a clear advantage over the Count vectorizer, which just counts the words.
def Vectorization(documents, dtype):
    vectorizer = None
    if dtype == 'count':
        vectorizer = CountVectorizer()
    elif dtype == 'tfidf':
        vectorizer = TfidfVectorizer()
    sparse_matrix = vectorizer.fit_transform(documents)
    #convert to array
    doc_term_array = sparse_matrix.toarray()
    #convert to dataframe
    df = pd.DataFrame(doc_term_array, columns=vectorizer.get_feature_names_out())
    return df, doc_term_array
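To see what the vectorizer produces, the function can be called on two tiny documents; the toy sentences below are only for illustration:

#illustrative usage of Vectorization on two toy documents
docs = ['python data analysis sql', 'python sql reporting dashboards']
df_terms, vectors = Vectorization(docs, 'tfidf')
print(df_terms)       #document-term matrix, one row per document
print(vectors.shape)  #(2, number of unique terms)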
After transforming the feature into numerical form, it's time to compute the similarity! I am going to use Cosine Similarity, which measures the cosine of the angle between two objects represented as vectors in a multi-dimensional space. The figure below illustrates how cosine similarity measures the level of similarity between two objects A and B: the smaller the angle, the higher the cosine score and the more similar the documents are.
If you want to learn more about cosine similarity, this article is a useful explanation: https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/
def Cosine_Similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
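As a quick sanity check, the function can be applied to two small vectors: [1, 1, 0] and [1, 0, 1] share one of their two non-zero terms, so the score should be 0.5.

#quick sanity check with two toy vectors
v1 = np.array([1, 1, 0])
v2 = np.array([1, 0, 1])
print(Cosine_Similarity(v1, v2))  #0.5 -> 1 / (sqrt(2) * sqrt(2))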
#get all the job descriptions
job_descriptions = df[['Job Description']].values
list_cosine = []

#traverse all job descriptions in the dataset and compute cosine similarity
for job_description in job_descriptions:
    _, vector = Vectorization([resume, TextPreprocess(str(job_description))], 'tfidf')
    similarity_score = round(Cosine_Similarity(vector[0], vector[1])*100, 2)
    dict_cosine = {'Matching percent': '{}%'.format(similarity_score)}
    list_cosine.append(dict_cosine)

#combine with the job ads dataset
df_matching = pd.DataFrame(list_cosine)
df = pd.concat([df_matching, df], axis=1)
df
If you just want to compare your own resume with a specific job description, run the code snippet below. For example, the matching score is 20.64% for a Data Analyst job from the Garvan Institute of Medical Research.
#or manually input a job description
job_description = input('Input the job description: ')
_, vector = Vectorization([resume, TextPreprocess(job_description)], 'tfidf')
similarity_score = round(Cosine_Similarity(vector[0], vector[1])*100, 2)
print('\nYour resume and the job description are {}% matched'.format(similarity_score))
I have created a Streamlit version of this project at https://resume-scanning.streamlit.app where you can experiment without worrying about the coding environment, so please try it if you're interested. The full code in a Jupyter notebook is also shared at the Github link below.
I hope this lengthy post is useful. If you have any questions or comments, please post them below. Thanks for reading, and cheers!
Github link: https://github.com/phuphan13/Resume-scanning