Every year, thousands of graduates from different institutions enter the job market, looking for the jobs that best fit their qualifications, aspirations and salary expectations. There are also people changing career paths, looking for roles where they can put their previous experience and knowledge to better use. On the other side, employers are hunting for the right candidates to fill positions, hoping to bring on board people who will help grow the business under an aligned vision. Given these bi-directional demands and needs, how both parties can stay competitive and find the right fit for each other becomes an interesting topic to study.
This project focuses on two main aspects:
Part 1 : Factors that impact salary
- the factors that weigh most heavily for both job seeker and employer when an offer is made or accepted
Part 2 : Factors that distinguish the job category
- the factors essential for finding candidates whose qualifications and experience best match a job's requirements, roles and responsibilities
Data Source
Job Postings @ MyCareersFuture
Analytical Approach
1. Data Collection via Web Scraping
- Collect data-related job postings, scraping at least 1,000 records to obtain the information needed for subsequent analysis and prediction
2. Data Wrangling & Preparation
- Parse the scraped data into a dataframe for easy processing
- Perform data cleaning to remove doubtful entries and transform values into standardized types
3. Exploratory Data Analysis
- Set mean monthly salary as the key prediction target
- Check the salary distribution and remove outliers (>15k, typically senior positions such as HOD or director) so the remaining data is approximately normally distributed for better prediction outcomes
4. Pre-Processing & Predictive Model Selection
- Analyse and transform textual information using Natural Language Processing packages
- Evaluate various classification models to select the best predictive model
5. Outlining Feature Importance
- Summarize the features that hold the greatest significance for Salary Prediction and Job Category Classification
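As a sketch of steps 2 and 3, the raw SalaryRange strings (e.g. "$9,000to$13,500Monthly") can be parsed into a numeric monthly mean, with undisclosed salaries dropped and high-end outliers removed. The column names and the 15k cutoff follow this project; the parsing regex itself is an illustrative assumption, not the exact cleaning code used here.

```python
import re

import pandas as pd

def parse_mean_salary(salary_range):
    """Return the mean of the two monthly figures in a raw SalaryRange string, or None."""
    numbers = re.findall(r'\$([\d,]+)', str(salary_range))
    if len(numbers) != 2:
        return None  # e.g. "Salary undisclosed"
    low, high = (float(n.replace(',', '')) for n in numbers)
    return (low + high) / 2

df = pd.DataFrame({'SalaryRange': ['$9,000to$13,500Monthly',
                                   '$2,000to$2,400Monthly',
                                   'Salary undisclosed']})
df['mean_Salary_Monthly'] = df['SalaryRange'].apply(parse_mean_salary)
df = df.dropna(subset=['mean_Salary_Monthly'])       # drop undisclosed salaries
df = df[df['mean_Salary_Monthly'] <= 15000]          # remove senior-position outliers (>15k)
```

After these two steps only rows with a usable, in-range monthly mean remain for the analysis.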
Analytical Outcomes :
Web Scraping
Using BeautifulSoup and Selenium, links to the relevant job postings were collected.
Part 1 : to get basic job data info
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox(executable_path='./geckodriver')
compiled_data = []
for page in range(0, 20):
    url = "https://www.mycareersfuture.sg/search?search=data&page={}".format(page)
    # Visit the relevant page.
    driver.get(url)
    # Wait a few seconds for the page to render.
    sleep(3)
    # Grab the page source.
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    compiled_data.append(list(jobPostingInfo(soup)))
driver.close()
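The helper jobPostingInfo used above is not shown. A plausible sketch is below: since `compiled_data[i][8][j]` later indexes the j-th job link of page i, the helper must yield one list per field across all cards on a results page, with JobLink ninth in field order. The CSS selectors here are purely hypothetical; the real MyCareersFuture markup would need to be inspected to pick the right ones.

```python
# Hypothetical reconstruction of the jobPostingInfo helper.
# It yields one list per field (companies, titles, ..., links) for all job cards on a page.
# Selectors such as 'job-card', 'company', 'job-title' are illustrative assumptions.
def jobPostingInfo(soup):
    cards = soup.find_all('div', {'class': 'job-card'})   # selector is an assumption
    yield [c.find('p', {'class': 'company'}).text for c in cards]     # Company
    yield [c.find('h1', {'class': 'job-title'}).text for c in cards]  # JobTitle
    # ... (Location, EmploymentType, Seniority, Category, GovSupport and SalaryRange
    #      would be extracted the same way, preserving the field order) ...
    yield [c.find('a')['href'] for c in cards]                        # JobLink
```

With all nine fields yielded in order, the JobLink list lands at index 8, matching the indexing used in Part 2 of the scraper.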
Part 2 : to get job description + roles & responsibilities info
driver = webdriver.Firefox(executable_path='./geckodriver')
addOn_data = []
for i in range(0, len(compiled_data)):
    page_temp = []
    for j in range(20):  # one page only has 20 records
        temp = []
        joblink = compiled_data[i][8][j]
        temp.append(joblink)
        joblink_url = "https://www.mycareersfuture.sg" + joblink
        # Visit the relevant page.
        driver.get(joblink_url)
        # Wait a few seconds for the page to render.
        sleep(3)
        # Grab the page source.
        html = driver.page_source
        soup2 = BeautifulSoup(html, 'lxml')
        postDate = soup2.find('span', {'id': 'last_posted_date'}).text
        temp.append(postDate)
        closeDate = soup2.find('span', {'id': 'expiry_date'}).text
        temp.append(closeDate)
        job_content = soup2.find_all('div', {'id': 'content'})
        role_resp = job_content[0].text
        temp.append(role_resp)
        try:
            requirements = job_content[1].text
            temp.append(requirements)
        except IndexError:
            # some postings have no separate requirements section
            temp.append('-')
        page_temp.append(temp)
    addOn_data.append(page_temp)
driver.close()
Example of the consolidated dataframe storing part of the scraped data :
 | Company | JobTitle | Location | EmploymentType | Seniority | Category | GovSupport | SalaryRange | JobLink | PostedDate | ClosingDate | RoleResponsibility | Requirements
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | ASPIRE GLOBAL NETWORK PTE. LTD. | Regional Head, Ad Operations | Central | Full Time | Senior Management | Admin / Secretarial | | $9,000to$13,500Monthly | /job/bde211a5a9f2b9cef2115aa6e8104a36 | 12 Apr 2018 | 12 May 2018 | A global broadcast and entertainment giant is ... | Requirements Min 6 years’ experience with Str...
1 | CYRUS TECHNOLOGY (S) PTE. LTD. | program executive | Central | Full Time | Senior Management | Admin / Secretarial | | $2,000to$2,400Monthly | /job/0662cb940442cd31a25989972255c676 | 12 Apr 2018 | 12 May 2018 | Handle daily enquiries and requests from clie... | Candidate possesses Diploma in any discipline...
2 | NATIONAL UNIVERSITY HOSPITAL (SINGAPORE) PTE LTD | Case Management Officer_RCCM (Contract) | East, Central | Permanent ... | Executive | Admin / Secretarial ... | Government support available | $2,800to$5,600Monthly | /job/ef766282d386e151e6b0b863dbbf1d25 | 12 Apr 2018 | 12 May 2018 | The case management officer reviews, assessing... | Qualification: Diploma or Degree in nursing o...
3 | A*STAR RESEARCH ENTITIES | Scientist / Senior Scientist (ARTC / A*STAR) | East, Central | Permanent ... | Executive | Admin / Secretarial ... | | $5,900to$11,800Monthly | /job/1f70879985e4d7b0c506434c2beb82ee | 12 Apr 2018 | 12 May 2018 | The Agency for Science, Technology and Researc... | Data Scientist (SMG) Senior (at the level of...
4 | A*STAR RESEARCH ENTITIES | IMCB - Research Manager (JEC) | South | Contract ... | Executive | Healthcare / Pharmaceutical | | $6,300to$12,600Monthly | /job/c9c34298cd0dc645e3d6edb902945a4c | 12 Apr 2018 | 12 May 2018 | About the Institute of Molecular and Cell Biol... | Possess MSc in Medical Biochemistry Minimum 5...
5 | TEEKAY MARINE (SINGAPORE) PTE. LTD. | Marine Personnel Officer | South | Contract ... | Executive | Healthcare / Pharmaceutical | Government support available | Salary undisclosed | /job/53701b418e11e3c29cacdc01b483df97 | 12 Apr 2018 | 12 May 2018 | Position Summary The Marine Personnel Officer ... | Diploma in Maritime/Business Administration w...
6 | ST RECRUITMENT CENTRE | Admin Assistant | West | Contract ... | Senior Executive | Sciences / Laboratory / R&D | | $1,800to$2,500Monthly | /job/1ac5d9d42402ff2938608515ebf9923f | 12 Apr 2018 | 12 May 2018 | To issue purchase. Do data entry. Perform sto... | Minimum GCE 'O' level. Knowledge in Excel Spr...
Part 1 : Factors Impacting Salary Prediction
1a. Normalization of Mean Salary Distribution
After thorough data cleaning, preparation and transformation, the normalized mean salary distribution is as below :
import seaborn as sns
sns.distplot(df_MCF_final['mean_Salary_Monthly'], bins=30)
1b. Textual Transformation using NLP
- Use TF-IDF : JobTitle, Seniority, Category, RoleResponsibility, Requirements
- Use Encoder : EmploymentType, GovSupport
An example is shown below :
# TF-IDF on JobTitle
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words=stop_words, min_df=1, ngram_range=(2,2), max_features=1000)
# ngram_range=(2,2) is used because job titles are more informative as word pairs
# ("business analyst") rather than single words ("analyst") with ngram_range=(1,1)
tvec.fit(df_MCF_final['JobTitle'])
df_JobTitle_tvec = pd.DataFrame(tvec.transform(df_MCF_final['JobTitle']).todense(), columns=['Title_'+ v for v in tvec.get_feature_names()], index=df_MCF_final['JobTitle'].index)
# prefix column names with 'Title_' so it is clear these features originate from the
# JobTitle column once this dataframe is combined with the other post-TF-IDF data
# pass index=df_MCF_final['JobTitle'].index to keep the index consistent and avoid
# mismatches when using pd.concat with other dataframes later on
df_JobTitle_tvec.sum().sort_values(ascending=False)
# the resulting TF-IDF features are used for subsequent modeling
# Encoder on EmploymentType:
df_EmpType = pd.get_dummies(df_MCF_final['EmploymentType'], drop_first=True, prefix='EmpType')
df_EmpType.head()
# drop_first=True drops the first category (Contract) as the baseline
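The transformed pieces above are later combined into a single feature matrix. That concat step is not shown explicitly for Part 1, so here is a minimal sketch with stand-in data (the values and column names are illustrative only) showing how the pieces align on the shared index:

```python
import pandas as pd

# stand-ins for the transformed dataframes built earlier; in the project these would be
# df_JobTitle_tvec, df_EmpType and the other post-TF-IDF / encoded dataframes
df_JobTitle_tvec = pd.DataFrame({'Title_business analyst': [0.7, 0.0]}, index=[0, 1])
df_EmpType = pd.DataFrame({'EmpType_Full Time': [1, 0]}, index=[0, 1])

# axis=1 concatenates column-wise; because every piece carries df_MCF_final's index,
# the rows line up correctly without any explicit join keys
X = pd.concat([df_JobTitle_tvec, df_EmpType], axis=1)
```

This is why each transformed dataframe is built with index=df_MCF_final[...].index above: a mismatched index would scatter NaNs through X instead of aligning rows.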
1c. Predictive Classification Model Selection
Categorize the salary range into the 4 classes below in preparation for Classification Modeling :
- x <= 3000
- 3000 < x <= 6000
- 6000 < x <= 10000
- x > 10000
salary_class = []
for v in df_MCF_final['mean_Salary_Monthly']:
    if v <= 3000:
        salary_class.append(1)
    elif 3000 < v <= 6000:
        salary_class.append(2)
    elif 6000 < v <= 10000:
        salary_class.append(3)
    else:
        salary_class.append(4)
df_MCF_final['salary_class'] = salary_class
df_MCF_final['salary_class'].value_counts()
'''
2    468
3    377
1     74
4     59
Name: salary_class, dtype: int64
'''
An imbalanced data set was observed; a sampling method for the minority classes therefore needs to be considered before modeling.
# train-test split for classification modeling:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.3, random_state=42)
# Randomly over-sample the minority class
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(ratio='minority', random_state=42)
X_res_train, y_res_train = ros.fit_sample(X_train, y_train)
from sklearn.linear_model import LogisticRegressionCV
model_LOGR = LogisticRegressionCV(cv=5)
model_LOGR.fit(X_res_train, y_res_train)
y_predictions = model_LOGR.predict(X_test)
print('Model_LogisticRegressionCV')
print('-------------------------')
print('score = {}'.format(model_LOGR.score(X_test, y_test)))
'''
score_LogisticRegressionCV = 0.585034013605
'''
Different classification models were evaluated using the same method, and their cross-validation scores were examined as below :
Classification Model | Cross Validation Score |
---|---|
Logistic RegressionCV | 0.5850 |
RidgeClassifierCV | 0.5782 |
RandomForestClassifier | 0.5646 |
Support Vector Classifier | 0.5816 |
Among the various classification approaches, Logistic Regression CV has the highest score at 0.585. Its confusion matrix and classification report were therefore examined further:
from sklearn.metrics import confusion_matrix, classification_report
model_LOGR = LogisticRegressionCV(cv=5)
model_LOGR.fit(X_res_train, y_res_train)
y_predictions = model_LOGR.predict(X_test)
print('Model_LogisticRegressionCV')
print('-------------------------')
print('score = {}'.format(model_LOGR.score(X_test, y_test)))
print('\n')
print('Confusion Matrix :')
print('-------------------')
print(confusion_matrix(y_test, y_predictions))
print('\n')
print('Classification report :')
print('------------------------')
print(classification_report(y_test, y_predictions))
Overall accuracy with Logistic Regression CV was still poor at 0.59. Feature selection should be re-examined and other models studied to assess whether further improvement can be achieved. For continuity of this exercise, however, the extraction of feature importance for the various salary classes is still demonstrated.
1d. Extraction of Feature Importance
To find out which features influence the salary class prediction, the coefficients for each salary class were extracted. The model's coefficients can be accessed through coefs_paths_:
salary_coeff = model_LOGR.coefs_paths_
# the coefficients for each salary class are indexed by the coordinates below :
salary_coeff[1][0][0]  # coefficient path for salary class 1
salary_coeff[2][0][0]  # coefficient path for salary class 2
salary_coeff[3][0][0]  # coefficient path for salary class 3
salary_coeff[4][0][0]  # coefficient path for salary class 4
Example of features affecting Salary Class 1 :
coeff_range_class1 = salary_coeff[1][0][0]
coeff_class1_column = list(X_train.columns)
coeff_class1_dict = dict(zip(coeff_class1_column, coeff_range_class1[:-1]))
# remove the last coefficient (the mean salary monthly column)
# form a dataframe of features and their coefficients:
df_coeff_class1_raw = pd.DataFrame.from_records([coeff_class1_dict])
df_coeff_class1_LOGRCV = df_coeff_class1_raw.transpose().reset_index()
df_coeff_class1_LOGRCV.rename(columns={'index':'Feature', 0:'Coefficient'}, inplace=True)
# top 10 features affecting salary class 1:
df_coeff_class1_LOGRCV.sort_values('Coefficient', ascending=False).head(10)
Part 2 : Factors Impacting Job Category Prediction
2a. Scope Definition to Segregate Target Job Category vs Others
Create a new column indicating whether a job posting is for a Data Scientist or a Data Analyst :
data_job_list = []
for v in df_MCF_final['JobTitle']:
    if 'data scien' in v.lower() or 'data analy' in v.lower():
        data_job_list.append(1)
    else:
        data_job_list.append(0)
df_MCF_final['Data_JobList'] = data_job_list
2b. Textual Transformation using NLP
- Use TF-IDF : RoleResponsibility, Requirements
An example is shown below :
# TF-IDF on RoleResponsibility
tvec = TfidfVectorizer(stop_words=stop_words, min_df=1, ngram_range=(1,3), max_features=1000)
# ngram_range=(1,3) is used because RoleResponsibility is more informative in longer phrases
# such as "data analysis" or "years professional experience"
tvec.fit(df_MCF_final['RoleResponsibility'])
df_RoleResp_tvec2 = pd.DataFrame(tvec.transform(df_MCF_final['RoleResponsibility']).todense(), columns=['RoleResp_'+ v for v in tvec.get_feature_names()], index=df_MCF_final['RoleResponsibility'].index)
df_RoleResp_tvec2.sum().sort_values(ascending=False)
2c. Predictive Classification Model Selection
# Set up the X matrix for modeling:
X2_raw = pd.concat([df_Requirements_tvec2, df_RoleResp_tvec2], axis=1)
# Set up the y vector for modeling:
y2_raw = df_MCF_final['Data_JobList']
# Calculate the baseline accuracy (majority-class proportion)
baseline = y2_raw.value_counts().max()/float(len(y2_raw))
print('Baseline : {:0.4}'.format(baseline))
'''
Baseline : 0.8824
'''
X2_train, X2_test, y2_train, y2_test = train_test_split(X2_raw, y2_raw, test_size=0.3, random_state=42)
# Randomly over-sample the minority class
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(ratio='minority', random_state=42)
X2_res_train, y2_res_train = ros.fit_sample(X2_train, y2_train)
# 1st model trial : LogisticRegressionCV
model_LOGRCV2 = LogisticRegressionCV(cv=5)
model_LOGRCV2.fit(X2_res_train, y2_res_train)
y2_predictions = model_LOGRCV2.predict(X2_test)
# 2nd model trial : RidgeClassifierCV
from sklearn.linear_model import RidgeClassifierCV
model_RClass = RidgeClassifierCV()
model_RClass.fit(X2_res_train, y2_res_train)
y_pred = model_RClass.predict(X2_test)
# 3rd model trial : KNN (grid search over k and weights)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
model_knn = KNeighborsClassifier()
model_knn_params = {'n_neighbors': range(1, 20, 2),
                    'weights': ['uniform', 'distance']}
knn_Gsearch = GridSearchCV(model_knn, model_knn_params, n_jobs=3, cv=5, verbose=1)
knn_Gsearch.fit(ss.fit_transform(X2_raw), y2_raw)
knn_best = knn_Gsearch.best_estimator_
# 4th model trial : DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
model_dtreec = DecisionTreeClassifier(random_state=42)
model_dtreec.fit(ss.fit_transform(X2_raw), y2_raw)
model_dtreec.predict(X2_test)
# Summary of all model trials:
print('1. Score_LogisticRegressionCV \t= {}'.format(model_LOGRCV2.score(X2_test, y2_test)))
print('2. Score_RidgeClassifierCV \t= {}'.format(model_RClass.score(X2_test, y2_test)))
print('3. Score_KNN \t\t\t= {}'.format(knn_best.score(X2_test, y2_test)))
print('4. Score_DecisionTreeClassifier = {}'.format(model_dtreec.score(X2_test, y2_test)))
Since Logistic Regression CV yielded the highest score, it was used to examine which components of a job posting led to the differentiation of the target job category (Data Scientist / Data Analyst) from the others.
2d. Extraction of Feature Importance
Examination of which features had the highest influence on job category prediction :
model_LOGRCV2.coefs_paths_[1][0][0]
coeff_column = list(X2_train.columns)
coefficients_LOGRCV2 = [v for v in model_LOGRCV2.coefs_paths_[1][0][0]]
coeff_dict = dict(zip(coeff_column, coefficients_LOGRCV2[:-1]))
# drop the last coefficient (the intercept term)
# form a dataframe of features and their coefficients:
df_coeff_raw = pd.DataFrame.from_records([coeff_dict])
df_coeff_LOGRCV = df_coeff_raw.transpose().reset_index()
df_coeff_LOGRCV.rename(columns={'index':'Feature', 0:'Coefficient'}, inplace=True)
df_coeff_LOGRCV.sort_values('Coefficient', ascending=False).head(20)
Summary
Part 1: Salary Trend Prediction
Classification Approach :
Model | Cross Validation Score |
---|---|
Logistic Regression CV | 0.5850 |
RidgeClassifierCV | 0.5782 |
RandomForestClassifier | 0.5646 |
SupportVectorClassifier | 0.5816 |
The classification approaches still yield poor results, indicating that feature selection should be re-examined and other models studied more deeply to assess the next level of improvement. For continuity and completion of the exercise, Logistic Regression CV was chosen as the final model for salary prediction since it yielded the highest score.
Per-class accuracy for the salary ranges predicted using the chosen model (Logistic Regression CV) is as below :
Salary Class | Accuracy |
---|---|
<= $3000 - class 1 | 0.80 |
$3000 - $6000 - class 2 | 0.62 |
$6000 - $10000 - class 3 | 0.53 |
>$10000 - class 4 | 0.40 |
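The per-class accuracies above correspond to each class's recall, which can be read directly off the confusion matrix: the diagonal count divided by the row total. A short sketch with an illustrative matrix (the numbers here are made up, not the project's actual confusion matrix):

```python
# Per-class accuracy (recall) = correct predictions for a class / all true members of it,
# i.e. the confusion-matrix diagonal divided by the row sums.
# The matrix below is illustrative only, not this project's actual confusion matrix.
cm = [[8, 2, 0, 0],     # true class 1
      [3, 90, 7, 0],    # true class 2
      [0, 30, 60, 10],  # true class 3
      [0, 0, 12, 8]]    # true class 4
per_class_accuracy = [row[i] / sum(row) for i, row in enumerate(cm)]
```

Reading accuracy per class this way makes the pattern in the table explicit: the model is weakest on the sparse high-salary class, where few training examples exist.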
Summary of the Top 10 features for the various Salary Classes :
Part 2: Job Category Prediction
Summary of All Model Trials:
Model | Cross Validation Score |
---|---|
LogisticRegressionCV | 0.9388 |
RidgeClassifierCV | 0.9354 |
KNN | 0.9014 |
DecisionTreeClassifier | 0.9014 |
Differentiation Key Words
Key words obtained from the chosen model (Logistic Regression CV) that differentiate Data Scientist / Data Analyst postings from other jobs are :
- Data
- Machine learning
- Data science
- Analytics
- Statistic
- Insights
- Quantitative
- Models
These key words relate closely to data scientist/analyst jobs, so the chosen model is relatively robust, with an accuracy of 0.94.