Hiring Agent Demo

E2E Models + Weave demo. Also serves as the demo project for the EU AI Act.

This repository contains a demo of a hiring agent that evaluates job applications against job offers using LangGraph and multiple LLM providers.

Features

  • Automated evaluation of job applications against requirements
  • Support for multiple LLM providers (OpenAI, AWS Bedrock, Ollama)
  • PDF processing capabilities (multi-modal visualization)
  • Hallucination detection and guardrails (incl. self-reflection and HITL)
  • Expert review system
  • Integration with Weights & Biases Models and Weave

Usage

Once the operator UI is launched through Streamlit, there are five modes that can be selected from the dropdown menu under "Select Mode". Note that the current pipeline expects local PDFs (to make the scenario more realistic), so the evaluation dataset contains paths to local PDF files. If you want to run an evaluation locally, generate your own dataset first by executing the following steps in order (a sketch of the resulting dataset shape follows the list):

  1. Create Dataset
    • Drag in job positions as PDFs (e.g. downloaded wandb job postings)
    • Generate applicant characteristics table
    • Go to next step and calculate R score (no changes needed)
    • Go to next step and generate actual evaluation and fine-tuning dataset
  2. Manage Prompts
    • If it's the first time you're running the project, click "Publish Context Prompt" for every tab (adjust the prompt first if you want)
  3. Single Test
    • Drag in one of the job position PDFs and one of the generated application PDFs (under utils/data/applications)
    • Decide whether to use a hallucination guardrail on the hiring reason in the config panel on the left
    • Decide whether to enable expert reviews (by default triggered if the guardrail fails twice, i.e. even after self-reflecting the first time)
  4. Batch Testing
    • Turn expert review mode off (not yet compatible with parallel evaluation)
    • Paste in the Weave URL of the evaluation dataset you generated
    • Run evaluation
  5. Monitoring Dashboard
    • Explore key monitoring metrics over time, based on the Weave API
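
As referenced above, the evaluation dataset rows point at local PDF files rather than embedding the documents. A minimal sketch of what such a Weave dataset could look like (the project name, field names, and file paths here are assumptions; the actual schema is produced by the "Create Dataset" mode):

    import weave

    # Assumed W&B entity/project; replace with your own.
    weave.init("my-entity/hiring-agent-demo")

    # Hypothetical row schema: the real dataset generated in "Create Dataset"
    # stores paths to local PDFs plus ground-truth labels.
    rows = [
        {
            "job_pdf_path": "utils/data/positions/ml_engineer.pdf",
            "application_pdf_path": "utils/data/applications/applicant_001.pdf",
            "expected_decision": "hire",
        },
    ]

    dataset = weave.Dataset(name="hiring-eval-dataset", rows=rows)
    ref = weave.publish(dataset)
    print(ref.uri())  # paste this Weave URL into the Batch Testing mode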

To run the fine-tuned comparison model, first click "Add Model to Ollama" if you haven't yet installed the model locally, then select custom-wandb-artifact-model in the "Comparison Model" dropdown.

Setup

  1. (Recommended) Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt  # use requirements_relaxed.txt if you hit dependency problems
  2. Create a utils/.env file with your API keys (see the sketch after this list):
OPENAI_API_KEY=your_openai_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key  # If using AWS Bedrock
AWS_SECRET_ACCESS_KEY=your_aws_secret_key  # If using AWS Bedrock
AWS_DEFAULT_REGION=your_aws_region
WANDB_API_KEY=your_wandb_api_key
  3. Run python runapp.py from the repository root
  4. Generate the dataset and create base prompts
    • From the config panel select "Create Dataset" and go through all steps
    • From the config panel select "Manage Prompts" and publish all defaults for every tab
  5. Run a single test or a whole evaluation
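
The app is expected to read these keys from utils/.env. If you want to verify your environment before launching, a minimal check with python-dotenv (an assumption; the project may load the file differently) looks like this:

    import os
    from dotenv import load_dotenv

    # Load API keys from utils/.env (the location used in the setup step above).
    load_dotenv("utils/.env")

    # Fail early if a required key is missing.
    for key in ("OPENAI_API_KEY", "WANDB_API_KEY"):
        if not os.getenv(key):
            raise RuntimeError(f"{key} is missing from utils/.env")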

Use fine-tuned model as comparison model

  1. Based on your dataset, fine-tune your comparison model in this notebook
  2. Paste in artifact path into config panel, select custom-wandb-artifact-model under "Comparison Model" and click the button "Add Model to Ollama"
    • This will download the artifact from wandb
    • It will then call ollama create <model-name> -f Modelfile from the root of the downloaded artifact, where the fine-tuning notebook adds a Modelfile automatically (see the sketch below)
  3. If you want to use parallel calls, make sure to start the Ollama server with OLLAMA_NUM_PARALLEL=<number-of-parallel-calls> ollama serve
  4. You can now run the single test or the batch testing mode with the model
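
The "Add Model to Ollama" button roughly automates the two steps described above. A sketch using the wandb Python API and the Ollama CLI (the artifact path below is a placeholder):

    import subprocess
    import wandb

    # Placeholder artifact path; use the same path you paste into the config panel.
    ARTIFACT_PATH = "my-entity/hiring-agent-demo/finetuned-model:latest"

    api = wandb.Api()
    model_dir = api.artifact(ARTIFACT_PATH).download()

    # The fine-tuning notebook places a Modelfile at the artifact root,
    # so `ollama create` can register the model locally.
    subprocess.run(
        ["ollama", "create", "custom-wandb-artifact-model", "-f", "Modelfile"],
        cwd=model_dir,
        check=True,
    )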

Set fine-tuned model as endpoint for Weave playground

  1. Make sure the model runs on Ollama (following the guide above will suffice)
  2. Start ngrok: ngrok http 11434 --response-header-add "Access-Control-Allow-Origin: *" --host-header rewrite
  3. Configure in Weave UI
    • Add the ngrok address without the /v1 path
    • Add a random secret as the API key
  4. Open Playground and debug with the actual model
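
To sanity-check the tunnel before wiring it into Weave, you can call Ollama's OpenAI-compatible /v1 API through the ngrok address (a sketch; the hostname is a placeholder and the model name assumes the custom model registered above):

    from openai import OpenAI

    # Ollama serves an OpenAI-compatible API under /v1; the address entered in
    # the Weave UI omits this suffix.
    client = OpenAI(
        base_url="https://your-subdomain.ngrok-free.app/v1",
        api_key="ollama",  # any non-empty string; Ollama does not check it
    )

    response = client.chat.completions.create(
        model="custom-wandb-artifact-model",
        messages=[{"role": "user", "content": "Summarize this candidate profile."}],
    )
    print(response.choices[0].message.content)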

Model and Dataset Improvements

GPT-4o-mini Fine-Tuning with Weights & Biases

This repository includes a script for fine-tuning OpenAI's GPT-4o-mini model using datasets stored in W&B:

python utils/finetune_openai.py

Features

  • Retrieves dataset from W&B artifacts
  • Validates and analyzes the dataset (token counts, format checking)
  • Splits data into training and validation sets
  • Recommends optimal epochs based on dataset size
  • Provides token usage estimates for cost planning
  • Logs comprehensive metrics to W&B dashboard
  • Optional evaluation of the fine-tuned model
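
As a rough illustration of the validation and split steps (a sketch only; the actual logic lives in utils/finetune_openai.py, and the local file name here is hypothetical):

    import json
    import random
    import tiktoken

    # o200k_base is the tokenizer family used by GPT-4o models.
    enc = tiktoken.get_encoding("o200k_base")

    # Hypothetical local export of the fine-tuning dataset in OpenAI chat format.
    with open("finetune_dataset.jsonl") as f:
        examples = [json.loads(line) for line in f]

    def n_tokens(example):
        return sum(len(enc.encode(m["content"])) for m in example["messages"])

    total = sum(n_tokens(ex) for ex in examples)
    print(f"{len(examples)} examples, ~{total} training tokens")

    # Simple 90/10 train/validation split.
    random.shuffle(examples)
    cut = int(0.9 * len(examples))
    train, val = examples[:cut], examples[cut:]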

Prerequisites

  • OpenAI API key with fine-tuning access
  • W&B account and properly formatted dataset

Setup

  1. Install dependencies:

    pip install -r requirements.txt
  2. Set environment variables:

    export OPENAI_API_KEY="your_openai_api_key"
    export WANDB_API_KEY="your_wandb_api_key" 
    export WANDB_ENTITY="your_wandb_username_or_team"
    export WANDB_PROJECT="your_wandb_project"
    # Optional: Enable evaluation after fine-tuning
    export RUN_EVALUATION="true"

The fine-tuning progress can be monitored in your W&B dashboard, with logs of training metrics, dataset statistics, and model performance.
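
At its core, such a run comes down to an OpenAI fine-tuning job with the W&B integration enabled. A sketch under that assumption (utils/finetune_openai.py adds dataset validation, epoch recommendation, and richer logging around this; the training file name is hypothetical):

    from openai import OpenAI

    client = OpenAI()

    # Upload the prepared JSONL training file.
    training_file = client.files.create(
        file=open("finetune_train.jsonl", "rb"),
        purpose="fine-tune",
    )

    # Launch the fine-tuning job; the W&B integration mirrors training
    # metrics into the configured project.
    job = client.fine_tuning.jobs.create(
        model="gpt-4o-mini-2024-07-18",
        training_file=training_file.id,
        integrations=[
            {"type": "wandb", "wandb": {"project": "your_wandb_project"}},
        ],
    )
    print(job.id)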

Improve reason labels in datasets

The repository includes a script for enhancing the reason labels in training and evaluation datasets:

python improve_reason_labels.py

This script:

  • Downloads the finetuning and evaluation datasets from W&B artifacts
  • Uses OpenAI's GPT-4o to generate high-quality, detailed hiring reasons
  • Structures reasons to analyze position fit, experience, and values alignment
  • Processes examples in parallel (10 threads) for faster execution
  • Maintains proper artifact lineage with W&B
  • Publishes improved datasets to both W&B (with "annotated" alias) and Weave

Run this script to generate better ground truth reasons for evaluating hiring agent performance.
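
A sketch of the parallel rewrite step (the field name "reason", the prompt, and the placeholder rows are assumptions; the real script pulls rows from the W&B artifacts and maintains artifact lineage):

    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    client = OpenAI()

    # Placeholder rows; the real script downloads these from W&B artifacts.
    examples = [{"reason": "Strong Python background, good culture fit."}]

    def improve_reason(example):
        # Hypothetical prompt; the script structures reasons around
        # position fit, experience, and values alignment.
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Rewrite this hiring reason, covering position fit, experience, and values alignment."},
                {"role": "user", "content": example["reason"]},
            ],
        )
        return {**example, "reason": response.choices[0].message.content}

    # 10 worker threads, matching the parallelism described above.
    with ThreadPoolExecutor(max_workers=10) as pool:
        improved = list(pool.map(improve_reason, examples))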
