E2E Models + Weave demo. Also serves as the demo project for the EU AI Act.
This repository contains a demo of a hiring agent that evaluates job applications against job offers using LangGraph and multiple LLM providers.
- Automated evaluation of job applications against requirements
- Support for multiple LLM providers (OpenAI, AWS Bedrock, Ollama)
- PDF processing capabilities (multi-modal visualization)
- Hallucination detection and guardrails (incl. self-reflection and HITL)
- Expert review system
- Integration with Weights & Biases Models and Weave (see the tracing sketch below)
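Everything the agent does is traced through Weave. As a rough illustration of what that integration looks like (the project name, function name, and return schema below are illustrative, not the repo's actual code), a minimal sketch using `weave.init` and `@weave.op`:

```python
import weave

# Illustrative project name; the repo may log to a different W&B project.
weave.init("hiring-agent-demo")

@weave.op()  # every call is traced in the Weave UI with its inputs and outputs
def evaluate_application(job_offer_text: str, application_text: str) -> dict:
    # Stand-in for the real LangGraph pipeline: returns a stub verdict.
    return {"fit": True, "reason": "Skills overlap with the listed requirements."}

if __name__ == "__main__":
    print(evaluate_application("Senior ML Engineer ...", "CV text ..."))
```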
Once the operator UI is launched through Streamlit, there are four modes that can be selected through the dropdown menu under "Select Mode". Note that the current pipeline expects local PDFs to make it more realistic, which means the evaluation dataset contains paths to local PDF files. If you want to run an evaluation locally, make sure to generate your own dataset first by executing the following steps in order:
**Create Dataset**
- Drag in job positions as PDFs (e.g. by downloading wandb job positions)
- Generate applicant characteristics table
- Go to next step and calculate R score (no changes needed)
- Go to next step and generate the actual evaluation and fine-tuning datasets (a sketch of the resulting dataset format follows these steps)
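The generated evaluation dataset is published to Weave, with rows that reference the local PDF paths mentioned above. A minimal sketch of what publishing such a dataset could look like; the field names, example paths, and project name are assumptions, not the exact schema the app produces:

```python
import weave

weave.init("hiring-agent-demo")  # illustrative project name

# Hypothetical row schema: the real dataset generated by the app may use different fields.
rows = [
    {
        "job_offer_pdf": "utils/data/job_positions/ml_engineer.pdf",
        "application_pdf": "utils/data/applications/applicant_001.pdf",
        "expected_decision": "fit",
    },
]

dataset = weave.Dataset(name="hiring-eval-dataset", rows=rows)
ref = weave.publish(dataset)
print(ref.uri())  # a ref/URL like this is what the Batch Testing mode expects
```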
**Manage Prompts**
- If it's the first time you're running the project, click `Publish Context Prompt` for every tab (change the prompt if you want)
**Single Test**
- Drag in one of the job position PDFs and one of the generated application PDFs (under `utils/data/applications`)
- Decide whether to use a hallucination guardrail on the hiring reason in the config panel on the left
- Decide whether to enable expert reviews (by default an expert review is triggered if the guardrail fails twice, i.e. even after self-reflecting the first time); see the sketch after this list
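The guardrail-plus-expert-review behavior described above is essentially a retry-then-escalate loop. Here is a sketch of that control flow with hypothetical function names and a toy grounding check; the repo's actual LangGraph nodes and checks will differ:

```python
from typing import Optional


def check_hiring_reason(reason: str, application_text: str) -> bool:
    # Hypothetical guardrail: in the real app this is an LLM-based hallucination check.
    return reason.lower() in application_text.lower()


def generate_decision(job_offer: str, application: str, feedback: Optional[str] = None) -> dict:
    # Hypothetical stand-in for the LLM call that produces a decision plus a hiring reason.
    return {"fit": True, "reason": "python experience"}


def evaluate_with_guardrail(job_offer: str, application: str, expert_review: bool) -> dict:
    decision = generate_decision(job_offer, application)
    if check_hiring_reason(decision["reason"], application):
        return decision

    # First guardrail failure: self-reflect once and retry.
    decision = generate_decision(job_offer, application, feedback="Reason was not grounded; revise it.")
    if check_hiring_reason(decision["reason"], application):
        return decision

    # Second failure: escalate to a human expert if expert review is enabled.
    if expert_review:
        decision["needs_expert_review"] = True
    return decision


print(evaluate_with_guardrail("Senior ML Engineer", "CV with Python experience", expert_review=True))
```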
**Batch Testing**
- Turn expert review mode off (not yet compatible with parallel evaluation)
- Paste in the Weave URL of the evaluation dataset you generated (the sketch after this list shows how such a URL can be resolved)
- Run the evaluation
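Behind the scenes, a pasted Weave URL can be resolved back to dataset rows. A minimal sketch of that lookup, assuming the URL is a standard Weave object ref (the app's actual loading code may differ):

```python
import weave

weave.init("hiring-agent-demo")  # illustrative project name

# Assumption: the pasted URL is a Weave object ref of roughly this shape.
dataset_ref = "weave:///<entity>/<project>/object/hiring-eval-dataset:latest"
dataset = weave.ref(dataset_ref).get()

for row in dataset.rows:
    print(row["application_pdf"])  # hypothetical field name, matching the sketch above
```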
To run the fine-tuned comparison model, first click "Add Model to Ollama" if you haven't yet installed the model locally, then select `custom-wandb-artifact-model` in the "Comparison Model" dropdown.
- (Recommended) Create a virtual environment and install dependencies:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
- Create `utils/.env` file with your API keys (a quick check that they are picked up is sketched below):

```
OPENAI_API_KEY=your_openai_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key        # If using AWS Bedrock
AWS_SECRET_ACCESS_KEY=your_aws_secret_key    # If using AWS Bedrock
AWS_DEFAULT_REGION=your_aws_region
WANDB_API_KEY=your_wandb_api_key
```
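If you want to confirm the keys are being read, the configuration loading can be mimicked with `python-dotenv`; this is an assumption about how the repo loads `utils/.env`, so treat it only as a quick local check:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is available in the environment

load_dotenv("utils/.env")
print("OPENAI_API_KEY set:", bool(os.getenv("OPENAI_API_KEY")))
print("WANDB_API_KEY set:", bool(os.getenv("WANDB_API_KEY")))
```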
- Run `python runapp.py` from root
- Generate dataset and create base prompts
- From config panel select "Create Dataset" and go through all steps
- From config panel select "Manage Prompts" and publish all defaults for every tab
- Run single test or whole evaluation
- Based on your dataset, fine-tune your comparison model in this notebook
- Paste the artifact path into the config panel, select `custom-wandb-artifact-model` under "Comparison Model" and click the "Add Model to Ollama" button
  - This will download the artifact from wandb
  - It will then call `ollama create <model-name> -f Modelfile` from the root of the downloaded artifact (the fine-tuning notebook adds a Modelfile automatically); both steps are sketched below
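For reference, those two steps roughly correspond to the following, sketched with the public `wandb` API and a subprocess call; the artifact path is a placeholder and the repo's actual implementation may differ:

```python
import subprocess

import wandb

# Placeholder artifact path; use the one produced by your fine-tuning notebook.
ARTIFACT_PATH = "<entity>/<project>/<artifact-name>:latest"

api = wandb.Api()
artifact_dir = api.artifact(ARTIFACT_PATH).download()

# The fine-tuning notebook places a Modelfile at the artifact root,
# so `ollama create` can run from that directory.
subprocess.run(
    ["ollama", "create", "custom-wandb-artifact-model", "-f", "Modelfile"],
    cwd=artifact_dir,
    check=True,
)
```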
- If you want to use parallel calls, make sure to serve the Ollama server with:

```bash
OLLAMA_NUM_PARALLEL=<number-of-parallel-calls> ollama serve
```
- Now you can use the single test or batch testing mode with the model
- Make sure the model runs on Ollama (following the guide above will suffice)
- Start ngrok (a quick sanity check of the tunneled endpoint is sketched after this list):

```bash
ngrok http 11434 --response-header-add "Access-Control-Allow-Origin: *" --host-header rewrite
```
- Configure in the Weave UI:
  - Add the ngrok address without the `v1` suffix
  - Add a random secret
- Open Playground and debug with the actual model
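To sanity-check that the tunneled endpoint responds before wiring it into Weave, you can call Ollama's OpenAI-compatible API through the ngrok URL. A minimal sketch; the ngrok hostname is a placeholder and the model name assumes the fine-tuned model from above:

```python
from openai import OpenAI  # assumes the openai client package is installed

client = OpenAI(
    base_url="https://<your-ngrok-subdomain>.ngrok-free.app/v1",  # Weave itself gets the URL without /v1
    api_key="ollama",  # Ollama ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="custom-wandb-artifact-model",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```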