Job Hunter

Job Hunter is an automated job notification system that keeps job seekers updated with the latest job postings on a company's website. The system uses push, email, or text message notifications to alert users to new job postings. Job Hunter aims to solve the problem of missing out on job opportunities because of late applications or not being informed in time.

In addition to the job notification system, Job Hunter also provides a machine learning-based text classification system that classifies job postings into one of four industry categories. The system uses supervised machine learning methods to assign new job titles to one of four industry categories. The industry classification problem is a multi-class text classification problem, and the solution includes data cleaning, text preprocessing, handling data imbalance, and building a machine learning model using LinearSVM, Multinomial Naive Bayes, and Logistic Regression.

The final machine learning model, LinearSVM, is then deployed behind a Flask API that provides a RESTful service to users. The model is not recompiled or retrained on each request; it simply predicts on the data provided in the request.

What it does

Job Hunter aims to make the job search process easier for job seekers by notifying them about new job postings in a timely manner and assisting them in classifying job postings into relevant industry categories.

Backend

How we built it

Task

The problem is a supervised text classification problem, and our goal is to investigate which supervised machine learning methods are best suited to solve it. Given a new job title, we want to assign it to one of the industry categories.

The classifier assumes that each new job title is assigned to one and only one category. This is a multi-class text classification problem.

Cleaning and Preprocessing

We utilize three datasets in CSV format, each with two variables (Job Title & Industry), totaling more than 25,000 samples.

The dataset is imbalanced: the number of data points available differs from class to class.
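
One quick way to see the imbalance (a sketch; the CSV file name is a placeholder, the column name follows the write-up):

```python
# Inspect the number of samples per class.
import pandas as pd

df = pd.read_csv("jobs.csv")              # placeholder file name
print(df["Industry"].value_counts())      # class counts differ widely
```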

Firstly, we noticed that there were many duplicate data instances (job titles): the data contained only 10,000+ unique job titles out of 25,000+ rows. So the first cleaning step we performed was removing the duplicates. We did this because if those duplicates appeared in both the train and test sets, they would produce a falsely inflated accuracy: the model would already have seen that data, defeating the purpose of a test set, which is to be unseen by the model.
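
Continuing the snippet above, the de-duplication step might look like this (a sketch):

```python
# Duplicates would leak training titles into the test set and inflate accuracy.
print(len(df), df["Job Title"].nunique())  # ~25,000 rows vs ~10,000 unique titles
df = df.drop_duplicates(subset="Job Title")
```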

Secondly, we applied various text preprocessing techniques, including removing stop words (the, a, ...), removing words shorter than two letters, removing numbers, and lowercasing all text. We also tried lemmatizing and stemming the words, but that did not help classification performance.

We import the necessary libraries for data analysis and machine learning, clean and process the job titles and industries dataset, and train a text classification model using the Support Vector Machine (SVM) algorithm.

The clean_text() function cleans the text data by removing stop words and punctuation and converting all text to lowercase.
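
A minimal sketch of what clean_text() might look like, assuming NLTK stop words; the exact implementation may differ:

```python
# A sketch of clean_text() based on the steps described above
# (stop words, punctuation, numbers, short words, lowercasing).
import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

STOP_WORDS = set(stopwords.words("english"))

def clean_text(text):
    """Lowercase, strip punctuation and digits, drop stop words and 1-letter tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and numbers -> spaces
    tokens = [t for t in text.split() if len(t) >= 2 and t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("Sr. JavaScript Developer (2+ yrs)"))  # -> "sr javascript developer yrs"
```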

The train_model() function reads in a CSV file and applies the clean_text() function to the input data. We then split the data into training and testing sets, compute the sample weights for each class, and train the SVM classifier on the training set. The trained model is then used to make predictions on the test set.
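
A hedged sketch of train_model(), reusing the clean_text() above. The signature mirrors how the function is called later in test.py, but the body is our approximation, not the exact source:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_sample_weight

CLASSIFIERS = {"LinearSVM": LinearSVC, "NB": MultinomialNB, "LogReg": LogisticRegression}

def train_model(csv_path, x_col, y_col, model_name="LinearSVM"):
    df = pd.read_csv(csv_path).drop_duplicates(subset=x_col)
    X = df[x_col].apply(clean_text)
    y = df[y_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    # Weight samples inversely to class frequency to counter the imbalance.
    weights = compute_sample_weight("balanced", y_train)
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", CLASSIFIERS[model_name]())])
    pipe.fit(X_train, y_train, clf__sample_weight=weights)
    print("Accuracy:", accuracy_score(y_test, pipe.predict(X_test)))
    return pipe
```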

Modeling

The main body of the code reads in the job titles and industries dataset, removes duplicates, applies the clean_text() function to the job title column, and splits the data into training and testing sets. We then train the SVM classifier on the training set, evaluate its accuracy on the testing set, and save the trained model as a pickle file. Finally, we load the saved model and use it to make predictions on some test job titles.
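
The save and reload steps might look like this (a sketch; the pickle file name and test titles are placeholders):

```python
import pickle

model = train_model("jobs.csv", "Job Title", "Industry", "LinearSVM")

with open("job_classifier.pkl", "wb") as f:
    pickle.dump(model, f)

with open("job_classifier.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(["js developer", "senior accountant"]))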

The code imports the necessary libraries such as Keras, NumPy, TensorFlow, and Pandas. We then read a CSV file using Pandas. After that, a TensorFlow dataset is created from the Pandas DataFrame using tf.data.Dataset.from_tensor_slices(). The code then loops through the elements of the TensorFlow dataset and prints each one, destructured into three variables named idx, cat, and title.
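
A rough reconstruction of that TensorFlow step; the CSV name and the column order behind idx, cat, and title are assumptions, not the original code:

```python
import pandas as pd
import tensorflow as tf

df = pd.read_csv("jobs.csv")
# Each dataset element is a (index, industry, job title) tuple.
ds = tf.data.Dataset.from_tensor_slices(
    (df.index.values, df["Industry"].values, df["Job Title"].values))

for idx, cat, title in ds.take(3):
    print(idx.numpy(), cat.numpy(), title.numpy())
```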

In test.py, we import all the functions from utils.py using a wildcard import.

The script then sets the csv_path variable to the test CSV file, the x_col variable to "Job Title", and the y_col variable to "Role".

Finally, we call the train_model function, passing csv_path, x_col, y_col, and the string "LogReg" as arguments.
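
Put together, test.py looks roughly like this (the test CSV file name is a placeholder):

```python
# test.py as described above.
from utils import *  # wildcard import of all helpers, including train_model

csv_path = "test_jobs.csv"
x_col = "Job Title"
y_col = "Role"

train_model(csv_path, x_col, y_col, "LogReg")
```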

Challenges we ran into

When working on a supervised machine learning problem with a given dataset, we try different algorithms and techniques. The same principle applies to text (or document) classification, where many models can be used to train a text classifier.

The answer to the question "What machine learning model should I use?" is always "It depends." Even the most experienced data scientists can't tell which algorithm will perform best before experimenting with them.

With that in mind, we tried three types of models:

  • LinearSVM
  • Multinomial NB
  • Logistic Regression

We also tried to use the following for text vectorization:

  • Simple features from the text (CountVectorizer and TF-IDF), and complex features (word2vec)
  • While the second was expected to be the more promising, it unfortunately did not give the best results. So we kept only the results of the first three approaches with simple features in the notebook (see the sketch below).
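
A minimal comparison of the two "simple feature" extractors mentioned above; the sample titles are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

titles = ["senior software engineer", "staff nurse", "account manager"]

counts = CountVectorizer().fit_transform(titles)  # raw term counts
tfidf = TfidfVectorizer().fit_transform(titles)   # counts reweighted by rarity
print(counts.shape, tfidf.shape)                  # same vocabulary, different weights
```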

Choice of Classifier

We tried three classifiers (LinearSVM, Multinomial Naive Bayes, LR), experimenting with different approaches to discover what works best for our data. Multinomial NB is a good baseline for the problem, LinearSVM is regarded as one of the best text classification algorithms, and LR is a simple, easy-to-understand algorithm.

We chose as our final model the classifier with the highest accuracy of the three, which was LinearSVM: it beat NB by 1% and LR by 2%, so we would say all three classifiers did well.

Deploying Flask API

We used Flask to create a RESTful API for our model. The model is not recompiled or retrained on each request; it just predicts on the data provided in the request.

To test and use our model's RESTful API service:

  • The server only supports GET requests.
  • The server is run by simply running the script from the terminal with the command `python api_server.py`.
  • After running the server, you can send GET requests to it using Postman or any other tool.

Example request/response:

Request: `http://127.0.0.1:5000/model/api/JS developer` (job title appended to the request URL)
Response: "Junior Software Engineer" (industry predicted by the model)
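
A sketch of what api_server.py might look like: a GET-only endpoint that loads the pickled model once at startup. The route and pickle file names are assumptions based on the example URL above:

```python
import pickle
from flask import Flask, jsonify

app = Flask(__name__)

with open("job_classifier.pkl", "rb") as f:
    model = pickle.load(f)  # loaded once; never retrained per request

@app.route("/model/api/<job_title>", methods=["GET"])
def predict(job_title):
    return jsonify(prediction=str(model.predict([job_title])[0]))

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```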

Deploying MongoDB

We initialized the MongoDB database, installed and structured its dependencies, and connected it to the project.
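
A minimal sketch of wiring MongoDB into the project with PyMongo; the database and collection names here are placeholders:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["job_hunter"]          # placeholder database name
users = db["users"]                # placeholder collection name

users.insert_one({"email": "jane@example.com", "keywords": ["python", "react"]})
print(users.find_one({"email": "jane@example.com"}))
```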

Front-End

How we built it

We were responsible for the frontend design and development of our project, utilizing React to create a responsive and user-friendly interface. We employed a component-based approach to develop reusable UI components and utilized Sass to write more efficient and modular CSS code. We also implemented React Router to enable easy navigation and enhance the user experience, allowing users to move between different pages and components with ease.

However, one of the challenges we encountered was the tight deadline of the hackathon, which required us to optimize our development process and work as efficiently as possible. Also, our team had only two people left at the end, so each member had to handle more work within the limited time. Another challenge was ensuring that the user interface was consistent across different devices and screen sizes, which required careful attention to detail and testing to ensure a seamless user experience. The design of the website was also a challenge, because we needed to integrate all of its features while keeping the frontend visually appealing.

Sass (Syntactically Awesome Style Sheets) is a CSS preprocessor that allows you to write CSS code more efficiently and with more functionality. It provides features such as variables, nesting, inheritance, and more that make it easier to create reusable styles and customize the look and feel of a web application.

Challenges

Working with Hooks and JSX Components

Accomplishments that we're proud of

What we learned

  • [x] Integrating the Front-End and Back-End
  • [x] Utilized the FERM stack (Flask and Python on the backend, React on the frontend, with MongoDB as our non-relational database)
  • [x] Utilized classification algorithms such as NB, SVM, and Logistic Regression
  • [x] Learned how to use MongoDB to query user information and handle authentication
  • [x] Used JSX and React Components to navigate and interact with web pages

What's next for Job Hunter

The limitations are mostly in the data. Although we tried to handle class imbalance with sample weights and by removing duplicates, misclassification still occurs for classes with relatively few samples (the model tends to fail in these cases). Text classification is normally a problem that requires a lot of data; the pretrained models achieving state-of-the-art performance have millions or billions of words to work with.

We think our model's performance could be further improved with more data. Removing the duplicates cut the dataset roughly in half, leaving us with only 10,000+ samples to work with, and after the train-test split that dropped to 5,000+ training samples. We also think using pretrained word embeddings such as word2vec or GloVe on this small dataset would further increase performance.
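
A hedged sketch of that suggested next step: averaging pretrained GloVe vectors (via gensim's downloader) as features; the model name is one of gensim's published options, and the helper is hypothetical:

```python
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # downloads on first use

def embed_title(title):
    """Average the GloVe vectors of the in-vocabulary words in a job title."""
    vecs = [glove[w] for w in title.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

print(embed_title("junior software engineer").shape)  # (100,)
```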

Finally, our original idea was to send email notifications to users to ping them with the latest jobs posted on company websites, but without access to a hosting service we could not implement the email-sending feature.
