HR Technical Guide

This document contains the dataset, the dataset description, useful resources and a technical guide for tackling the HR challenge.

Data Files:

HR challenge data.zip

دليل التصنف المهني السعودي.xlsx (Saudi Occupational Classification Guide)

Data Sources:

This section describes the data shared in the HR challenge, which mainly covers job descriptions and CVs that are useful in the CV screening track and the CV-Job matching track. The table below explains each shared dataset, along with its source, format and language.


*Check the appendix for additional data sources.

*For the background check track, use the APIs of social media platforms such as Twitter and LinkedIn.

Data Extraction and Structuring:

Resume data can come in different formats, such as text documents, PDFs and images. This section is a guide to extracting data from each format and to the tools available for doing so.

Traditional Parsers:

Several parsers allow you to extract text from documents such as PDFs or Word documents. The following are some of the tools available online:

Apache Tika - for parsing PDFs (https://tika.apache.org/)
python-docx - for parsing Word documents (https://pypi.org/project/python-docx/)
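
As a rough illustration of how the two parsers listed above might be combined, the sketch below picks a parser based on the file extension. It is a minimal example under several assumptions: Apache Tika is accessed through the tika Python bindings (pip install tika, which also needs a Java runtime), python-docx is installed, and the function names extract_docx_text, extract_pdf_text and extract_resume_text are hypothetical helpers invented for this guide rather than part of either library.

```python
from pathlib import Path


def extract_docx_text(path):
    """Return the paragraph text of a Word (.docx) file using python-docx."""
    # Imported lazily so the module loads even if python-docx is not installed.
    from docx import Document

    document = Document(str(path))
    # Note: this reads body paragraphs only; text inside tables is not included.
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def extract_pdf_text(path):
    """Return the text of a PDF using the tika Python bindings for Apache Tika."""
    # Assumption: the 'tika' package is installed and a Java runtime is available.
    from tika import parser

    parsed = parser.from_file(str(path))
    # 'content' can be None when no text layer is found (e.g. a scanned PDF).
    return parsed.get("content") or ""


def extract_resume_text(path):
    """Hypothetical dispatcher: choose an extraction method from the extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".docx":
        return extract_docx_text(path)
    if suffix == ".pdf":
        return extract_pdf_text(path)
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    raise ValueError(f"Unsupported resume format: {suffix or '(no extension)'}")
```

Calling extract_resume_text("cv.pdf") on a hypothetical file would return the extracted text as one string. An empty result for a PDF usually suggests the file is a scanned image, in which case OCR is the more suitable route.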

Optical Character Recognition (OCR):

OCR is the process of extracting text from images. Any document that can be converted to an image can then be passed through OCR to extract its text. The table below lists a couple of OCR technologies that can be used in your prototype.