Every year, thousands of graduates from different institutions enter the job market, looking for the jobs that best fit their qualifications, aspirations and salary expectations. Others are changing career paths and looking for roles where their previous experience and knowledge can be put to better use. Employers, in turn, are hunting for the right candidates to fill their positions, hoping to bring on board people who share their vision and can help grow the business. Given these demands on both sides, how each party can stay competitive and find the right fit is an interesting topic to study.
This project focuses on two main aspects:
Part 1 : Factors that impact salary
- these are decisive factors for both job seekers and employers when considering whether to accept or make an offer
Part 2 : Factors that distinguish the job category
- these are essential for finding candidates whose qualifications and experience best match the job requirements, roles and responsibilities
Data Source
Job Postings @ MyCareersFuture
Analytical Approach
1. Data Collection via Web Scraping
- Collect data-related job postings, scraping at least 1000 postings to obtain the information needed for subsequent analysis and prediction
2. Data Wrangling & Preparation
- Parse the web-scraped data into a dataframe for easy processing
- Clean the data to remove doubtful entries and transform fields into standardized types
3. Exploratory Data Analysis
- Set mean salary as the target variable for prediction
- Check the salary distribution and remove outliers above 15k (senior positions such as HOD or director) so that the data is closer to normally distributed and predictions reflect general norms
4. Pre-Processing & Predictive Model Selection
- Analysing and transforming textual information using Natural Language Processing packages
- Evaluation of various classification models for best predictive model selection
5. Outlining Feature Importance
- Summarizing the features that hold the greatest significance for Salary Prediction and Job Category Classification
Analytical Outcomes :
Web Scraping
Relevant job postings and their links were collected using BeautifulSoup and Selenium
Part 1 : to get basic job data
Part 2 : to get the job description plus roles & responsibilities
Example of the consolidated dataframe storing part of the web-scraped data
| | Company | JobTitle | Location | EmploymentType | Seniority | Category | GovSupport | SalaryRange | JobLink | PostedDate | ClosingDate | RoleResponsibility | Requirements |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASPIRE GLOBAL NETWORK PTE. LTD. | Regional Head, Ad Operations | Central | Full Time | Senior Management | Admin / Secretarial | | $9,000to$13,500Monthly | /job/bde211a5a9f2b9cef2115aa6e8104a36 | 12 Apr 2018 | 12 May 2018 | A global broadcast and entertainment giant is ... | Requirements Min 6 years’ experience with Str... |
| 1 | CYRUS TECHNOLOGY (S) PTE. LTD. | program executive | Central | Full Time | Senior Management | Admin / Secretarial | | $2,000to$2,400Monthly | /job/0662cb940442cd31a25989972255c676 | 12 Apr 2018 | 12 May 2018 | Handle daily enquiries and requests from clie... | Candidate possesses Diploma in any discipline... |
| 2 | NATIONAL UNIVERSITY HOSPITAL (SINGAPORE) PTE LTD | Case Management Officer_RCCM (Contract) | East, Central | Permanent ... | Executive | Admin / Secretarial ... | Government support available | $2,800to$5,600Monthly | /job/ef766282d386e151e6b0b863dbbf1d25 | 12 Apr 2018 | 12 May 2018 | The case management officer reviews, assessing... | Qualification: Diploma or Degree in nursing o... |
| 3 | A*STAR RESEARCH ENTITIES | Scientist / Senior Scientist (ARTC / A*STAR) | East, Central | Permanent ... | Executive | Admin / Secretarial ... | | $5,900to$11,800Monthly | /job/1f70879985e4d7b0c506434c2beb82ee | 12 Apr 2018 | 12 May 2018 | The Agency for Science, Technology and Researc... | Data Scientist (SMG) Senior (at the level of... |
| 4 | A*STAR RESEARCH ENTITIES | IMCB - Research Manager (JEC) | South | Contract ... | Executive | Healthcare / Pharmaceutical | | $6,300to$12,600Monthly | /job/c9c34298cd0dc645e3d6edb902945a4c | 12 Apr 2018 | 12 May 2018 | About the Institute of Molecular and Cell Biol... | Possess MSc in Medical Biochemistry Minimum 5... |
| 5 | TEEKAY MARINE (SINGAPORE) PTE. LTD. | Marine Personnel Officer | South | Contract ... | Executive | Healthcare / Pharmaceutical | Government support available | Salary undisclosed | /job/53701b418e11e3c29cacdc01b483df97 | 12 Apr 2018 | 12 May 2018 | Position Summary The Marine Personnel Officer ... | Diploma in Maritime/Business Administration w... |
| 6 | ST RECRUITMENT CENTRE | Admin Assistant | West | Contract ... | Senior Executive | Sciences / Laboratory / R&D | | $1,800to$2,500Monthly | /job/1ac5d9d42402ff2938608515ebf9923f | 12 Apr 2018 | 12 May 2018 | To issue purchase. Do data entry. Perform sto... | Minimum GCE 'O' level. Knowledge in Excel Spr... |
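The SalaryRange strings in the table above need to be turned into numbers before a mean salary can be computed. The following is a minimal sketch of one way to do that, assuming every disclosed range follows the "$low to $high" pattern seen in the sample rows. The function name `parse_salary_range` and the regular expression are illustrative and not taken from the project code.

```python
import re
from typing import Optional, Tuple

# Matches strings such as "$9,000to$13,500Monthly" (pattern inferred from the sample rows).
SALARY_PATTERN = re.compile(r"\$([\d,]+)\s*to\s*\$([\d,]+)")


def parse_salary_range(text: str) -> Optional[Tuple[int, int, float]]:
    """Return (low, high, mean) for a salary string, or None if no range is present."""
    match = SALARY_PATTERN.search(text)
    if match is None:
        return None  # e.g. "Salary undisclosed"
    low, high = (int(group.replace(",", "")) for group in match.groups())
    return low, high, (low + high) / 2


print(parse_salary_range("$9,000to$13,500Monthly"))
print(parse_salary_range("Salary undisclosed"))
```

Rows that return `None` have no salary to learn from, so they would probably have to be dropped before modeling.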
Part 1 : Factors Impacting Salary Prediction
1a. Normalization of Mean Salary Distribution
After thorough data cleaning, preparation and transformation, the normalized mean salary distribution is shown below :
1b. Textual Transformation using NLP
- Use TF-IDF : JobTitle, Seniority, Category, RoleResponsibility, Requirements
- Use Encoder : EmploymentType, GovSupport
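The split between TF-IDF for free-text columns and an encoder for categorical columns can be sketched as follows. This is an illustration only: it assumes scikit-learn's `TfidfVectorizer` and `OneHotEncoder` (the source does not name the encoder used), handles just one text column and one categorical column, and uses made-up toy strings rather than the scraped data.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the Requirements and EmploymentType columns (hypothetical values).
requirements = [
    "Diploma in any discipline with knowledge of Excel",
    "MSc in biochemistry with minimum 5 years of research experience",
    "Degree in statistics and experience with machine learning models",
]
employment_type = [["Full Time"], ["Contract"], ["Full Time"]]

# Free text becomes a sparse matrix of TF-IDF weights, one column per term.
tfidf = TfidfVectorizer(stop_words="english")
text_features = tfidf.fit_transform(requirements)

# Categorical values become one indicator column per category.
encoder = OneHotEncoder(handle_unknown="ignore")
type_features = encoder.fit_transform(employment_type)

# Stack both blocks side by side into a single feature matrix.
features = hstack([text_features, type_features]).tocsr()
print(features.shape)
```

In the full pipeline each text column would presumably get its own vectorizer, so that the same word in JobTitle and in Requirements can carry different weights.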
An example is shown below :
1c. Predictive Classification Model Selection
Categorize the salary range into the 4 classes below in preparation for Classification Modeling :
- x <= 3000
- 3000 < x <= 6000
- 6000 < x <= 10000
- x > 10000
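The binning into four salary classes can be expressed as a small helper. This sketch assumes the boundaries used in the summary table of this write-up (class 1 up to and including 3000, class 4 above 10000). The function name `salary_class` is illustrative.

```python
from bisect import bisect_left

# Inclusive upper bounds of classes 1 to 3; anything above the last bound is class 4.
UPPER_BOUNDS = [3000, 6000, 10000]


def salary_class(mean_salary: float) -> int:
    """Map a mean monthly salary to a class label from 1 to 4."""
    return bisect_left(UPPER_BOUNDS, mean_salary) + 1


for salary in (2200, 3000, 4200, 10000, 11250):
    print(salary, salary_class(salary))
```

`bisect_left` keeps a value equal to a bound in the lower class, which matches the "<=" on the upper end of each range.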
The data set was observed to be imbalanced, so a sampling method for the minority classes needs to be considered before modeling
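The source does not say which sampling method was used. As one possible option, the sketch below shows plain random oversampling: rows of each smaller class are duplicated at random until every class matches the size of the largest one. The function name `random_oversample` and the toy labels are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all classes are equally large."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the class's own members.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


rows = list(range(7))
labels = [1, 1, 1, 1, 2, 2, 4]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Oversampling is normally applied to the training portion only, otherwise duplicated rows leak into the validation folds and inflate the score.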
Different classification models were evaluated using the same methods, and their cross validation scores are shown below :
Classification Model | Cross Validation Score |
---|---|
LogisticRegressionCV | 0.5850 |
RidgeClassifierCV | 0.5782 |
RandomForestClassifier | 0.5646 |
Support Vector Classifier | 0.5816 |
Among the classification approaches, LogisticRegressionCV has the highest score at 0.585. Its confusion matrix and classification report were therefore examined next
Overall accuracy with Logistic Regression CV was still poor at 0.59. Feature selection should be re-examined and other models studied to assess whether further improvement can be achieved. However, for continuity of this exercise, feature importance was still extracted for the various salary classes.
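A model comparison of this kind can be sketched with scikit-learn's `cross_val_score`. The example below runs on synthetic four-class data from `make_classification`, not on the job postings, so its scores have no relation to the table above. The hyperparameters are placeholders chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV, RidgeClassifierCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and the 4 salary classes.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=6, n_classes=4, random_state=0
)

models = {
    "LogisticRegressionCV": LogisticRegressionCV(Cs=5, cv=3, max_iter=1000),
    "RidgeClassifierCV": RidgeClassifierCV(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "Support Vector Classifier": SVC(),
}

# Mean accuracy over 5 folds for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Using the same folds and the same scoring for every model is what makes the resulting numbers comparable.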
1d. Extraction of Feature Importance
To find out which features influence the salary class prediction, the coefficients for each salary class were extracted. The coefficients of the model can be accessed through its coefs_paths_ attribute:
Example of Features affecting Salary Class 1 :
Part 2 : Factors Impacting Job Category Prediction
2a. Scope Definition to Segregate Target Job Category vs Others
Create a new column to indicate whether the job posting is for a Data Scientist or Data Analyst role
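The source does not state how postings were labelled. One simple possibility, shown here as a sketch, is a case-insensitive substring match on the JobTitle column. The helper name `is_target_job` and the matching rule are assumptions.

```python
# Titles that count as the target category (assumed matching rule).
TARGET_TITLES = ("data scientist", "data analyst")


def is_target_job(job_title: str) -> int:
    """Return 1 if the title mentions a target role, else 0."""
    title = job_title.lower()
    return int(any(target in title for target in TARGET_TITLES))


for title in ("Senior Data Scientist", "DATA ANALYST", "Admin Assistant"):
    print(title, is_target_job(title))
```

A title-only rule misses postings whose title is generic (for example "Scientist / Senior Scientist") but whose description is about data science, so the actual labelling may have looked at more fields.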
2b. Textual Transformation using NLP
- Use TF-IDF : RoleResponsibility, Requirements
An example is shown below :
2c. Predictive Classification Model Selection
Since Logistic Regression CV yielded the highest score, it was used to examine which components of a job posting led to the differentiation of the job category (Data Scientist/Data Analyst) vs others
2d. Extraction of Feature Importance
Examine which features had the greatest influence on job category prediction
Summary
Part 1: Salary Trend Prediction
Classification Approach :
Model | Cross Validation Score |
---|---|
Logistic Regression CV | 0.5850 |
RidgeClassifierCV | 0.5782 |
RandomForestClassifier | 0.5646 |
SupportVectorClassifier | 0.5816 |
Classification approaches still yield poor results, indicating that feature selection should be re-examined and other models studied more deeply to find further improvement. For continuity and completion of the exercise, Logistic Regression CV was chosen as the final model for predicting salary since it yielded the highest score.
Accuracy of the salary range predicted by the chosen model ( Logistic Regression CV ) is as below :
Salary Class | Accuracy |
---|---|
<= $3000 - class 1 | 0.80 |
$3000 - $6000 - class 2 | 0.62 |
$6000 - $10000 - class 3 | 0.53 |
>$10000 - class 4 | 0.40 |
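The source does not say exactly how the per-class figures were computed. One common definition, sketched below, is the fraction of postings in each true class that the model predicted correctly (per-class recall). The function name and the toy labels are illustrative.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: correct[cls] / totals[cls] for cls in sorted(totals)}


# Hypothetical true and predicted salary classes.
y_true = [1, 1, 2, 2, 2, 4]
y_pred = [1, 2, 2, 2, 1, 4]
print(per_class_accuracy(y_true, y_pred))
```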
Summary of the Top 10 features for various Salary Classes as below :
Part 2: Job Category Prediction
Summary of All Model Trials:
Model | Cross Validation Score |
---|---|
LogisticRegressionCV | 0.9388 |
RidgeClassifierCV | 0.9354 |
KNN | 0.9014 |
DecisionTreeClassifier | 0.9014 |
Differentiation Key Words
Key words obtained from the model ( Logistic Regression CV ) that differentiate Data Scientist/Data Analyst job postings from other jobs are :
- Data
- Machine learning
- Data science
- Analytics
- Statistic
- Insights
- Quantitative
- Models
These key words are closely related to Data Scientist/Analyst jobs. The chosen model therefore appears relatively robust, with an accuracy of about 0.94