If this is helpful, we would appreciate a star ⭐ on the CocoIndex GitHub repository.
This example shows how to use BAML to extract structured data from patient intake PDFs using CocoIndex v1. BAML provides type-safe structured data extraction with native PDF support.
- BAML Schema (baml_src/patient.baml): Defines the data structure and extraction function
- CocoIndex v1 App (main.py): Wraps BAML in a custom function, processes files incrementally, and writes results to JSON files
Install from the project's pyproject.toml:
    pip install -e .

Next, generate the BAML client. This is a required step that produces the Python client code from your BAML schema:

    baml generate

This creates a baml_client/ directory containing the generated Python code.
Create a .env file in the example directory:
    echo "GEMINI_API_KEY=your_api_key_here" > .env

Replace your_api_key_here with your actual Gemini API key.
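Before running the pipeline, it can help to confirm that the two setup steps above actually took effect. The following is a minimal sketch using only the Python standard library; the helper names `read_env_file` and `preflight_problems` are hypothetical and are not part of BAML or CocoIndex. It assumes the .env file contains plain KEY=VALUE lines, as produced by the echo command above.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def preflight_problems(example_dir="."):
    """Return a list of setup problems found in example_dir (empty if none)."""
    root = Path(example_dir)
    problems = []
    # The generated client directory only exists after `baml generate`.
    if not (root / "baml_client").is_dir():
        problems.append("baml_client/ is missing; run `baml generate` first")
    env_path = root / ".env"
    if not env_path.is_file():
        problems.append(".env is missing")
    else:
        key = read_env_file(env_path).get("GEMINI_API_KEY", "")
        # The placeholder value from the echo command does not count as a key.
        if not key or key == "your_api_key_here":
            problems.append("GEMINI_API_KEY is unset or still the placeholder")
    return problems
```

Calling `preflight_problems()` from the example directory returns an empty list when both the generated client and a non-placeholder key are present. It does not check that the key is valid.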
    cocoindex update main.py

This will:
- Read all PDF files from data/patient_forms/
- Extract patient information using BAML
- Write the extracted data as JSON files to output_patients/
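The input-to-output relationship in the steps above can be sketched in plain Python. This is only an illustration under a stated assumption: each output JSON file is named after the stem of its source PDF, which is suggested by the sample output names but is not confirmed here. The helpers `expected_output_path` and `pending_pdfs` are hypothetical and do not reflect how CocoIndex tracks incremental state internally.

```python
from pathlib import Path


def expected_output_path(pdf_path, output_dir="output_patients"):
    """Map an input PDF to the JSON path assumed for its extracted record."""
    # Assumption: output file name = input file stem + ".json".
    return Path(output_dir) / (Path(pdf_path).stem + ".json")


def pending_pdfs(input_dir="data/patient_forms", output_dir="output_patients"):
    """List input PDFs that do not yet have a matching JSON output."""
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    return [p for p in pdfs if not expected_output_path(p, output_dir).exists()]
```

After a successful run, `pending_pdfs()` would be expected to return an empty list if the naming assumption holds.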
After running, check the output_patients/ directory:
    ls -la output_patients/

You should see JSON files such as:
- Patient_Intake_Form_David_Artificial.json
- Patient_Intake_Form_Emily_Artificial.json
- Patient_Intake_Form_Joe_Artificial.json
- Patient_Intake_Form_Jane_Artificial.json
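To inspect the results beyond a directory listing, the JSON files can be loaded with the standard library. The sketch below makes no assumption about the field names, since those are defined in baml_src/patient.baml; it only lists the top-level keys of each record. The function name `load_patient_records` is hypothetical.

```python
import json
from pathlib import Path


def load_patient_records(output_dir="output_patients"):
    """Return {file stem: parsed JSON} for every .json file in output_dir."""
    records = {}
    for json_path in sorted(Path(output_dir).glob("*.json")):
        with json_path.open(encoding="utf-8") as handle:
            records[json_path.stem] = json.load(handle)
    return records


if __name__ == "__main__":
    for name, record in load_patient_records().items():
        # Field names depend on the BAML schema, so only top-level keys
        # are shown rather than assuming any particular field exists.
        keys = sorted(record) if isinstance(record, dict) else type(record).__name__
        print(name, keys)
```

Run from the example directory, this prints one line per extracted patient file, which is a quick way to confirm that every record parsed as valid JSON.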