1. Problem Understanding
We set out to build a system that matches resumes to job listings efficiently, balancing rule-based filtering (hard constraints) with semantic similarity (embeddings). The goal was to automate recommendations while ensuring quality and explainability.
Initial Idea (Evolution)
- Started with the thought: "Let's parse resumes and listings, then run LLM comparisons."
- Early iterations used LLM calls for everything: very slow, costly, and with no simple way to derive a deterministic scoring mechanism.
- Quickly settled on: embeddings + structured information extraction (IE) + hard rule-based filtering make the comparisons far more deterministic than relying purely on LLM-based outputs.
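The hybrid idea above can be sketched as hard filters gating a semantic score. This is a minimal illustration, not the actual implementation; the field names (`min_years`, `required_skills`, etc.) are assumptions:

```python
from dataclasses import dataclass
from math import sqrt

# Hypothetical minimal structures; field names are illustrative assumptions.
@dataclass
class Job:
    min_years: int
    required_skills: set
    embedding: list

@dataclass
class Resume:
    years: int
    skills: set
    embedding: list

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def passes_hard_filters(resume, job):
    # Hard constraints: deterministic, explainable pass/fail decisions.
    return resume.years >= job.min_years and job.required_skills <= resume.skills

def match_score(resume, job):
    # Rule-based filter first; embeddings only break ties among eligible pairs.
    if not passes_hard_filters(resume, job):
        return 0.0
    return cosine(resume.embedding, job.embedding)
```

The key design choice is that hard constraints short-circuit the pipeline, so the (cheaper to explain) rule layer rejects candidates before any semantic scoring runs.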
Final Approach
- Built a modular pipeline to convert job listings into structured dictionaries.
- Built a parallel resume pipeline: parsing → structured extraction → enrichment → embeddings → smart filters → ranking.
- Focused on efficiency: precomputed job embeddings, single-pass resume enrichment.
- Added an evaluation layer + proxy metrics for validation.
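The precomputation point above can be sketched as follows. Assuming job embeddings are already L2-normalized and cached offline, ranking a resume reduces to a single pass of dot products (a sketch, not the actual implementation):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_jobs(resume_embedding, job_embeddings, top_k=5):
    """Rank precomputed job embeddings against one resume embedding.

    job_embeddings: {job_id: normalized vector}, computed once offline,
    so each incoming resume costs a single linear scan.
    """
    scored = sorted(
        ((job_id, dot(resume_embedding, emb))
         for job_id, emb in job_embeddings.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:top_k]
```

Example usage: `rank_jobs([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0]}, top_k=1)` returns the closest job ID with its similarity score.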
Current Limitations
- No fine-tuning or ground-truth dataset yet.
- Heavy reliance on proxy metrics.
- Some fields (e.g., education, certifications, skills) are still handled in overly simplified ways.
- Final score weighting needs refinement. Currently, the scoring mechanism rarely lets the score exceed 0.6 unless every component is a perfect match (which will happen very rarely).
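To illustrate why the score rarely clears 0.6, here is a minimal weighted-sum sketch; the weights and component names are assumptions, not the actual configuration:

```python
# Hypothetical component weights (assumed, not the real configuration).
WEIGHTS = {"skills": 0.4, "experience": 0.3, "semantic": 0.3}

def final_score(components):
    # Each component is in [0, 1]. The total reaches 1.0 only when every
    # component is a perfect match, so a strong-but-imperfect candidate
    # (e.g. 0.7 / 0.5 / 0.6) lands at roughly 0.61.
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)
```

Without renormalization (e.g. rescaling against an empirical maximum per job), realistic component values compress the final score into a narrow band well below 1.0.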
2. Data Insights
Two main data sources:
- Resumes: PDF and DOCX files, parsed and then enriched via LLM-based extraction