Extract structured data from patient intake forms with BAML (v1)

We appreciate a star ⭐ at CocoIndex GitHub if this is helpful.

This example shows how to use BAML to extract structured data from patient intake PDFs using CocoIndex v1. BAML provides type-safe structured data extraction with native PDF support.

  • BAML Schema (baml_src/patient.baml) - Defines the data structure and extraction function
  • CocoIndex v1 App (main.py) - Wraps BAML in a custom function, processes files incrementally, and writes results to JSON files

Run

1. Install dependencies

Install from the project's pyproject.toml:

pip install -e .

2. Generate BAML client code

This step is required. It generates the Python client code from your BAML schema:

baml generate

This will create a baml_client/ directory with the generated Python code.
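To confirm that generation succeeded before moving on, a small check like the one below may help. It is a minimal sketch that only assumes what is stated above, namely that the generated Python code lands in baml_client/ inside the project directory. The helper name baml_client_ready is hypothetical and is not part of BAML or CocoIndex.

```python
from pathlib import Path


def baml_client_ready(project_dir: str = ".") -> bool:
    """Return True if baml_client/ exists and holds at least one .py file."""
    client_dir = Path(project_dir) / "baml_client"
    if not client_dir.is_dir():
        return False
    # Any generated Python file counts; no specific file names are assumed.
    return any(True for _ in client_dir.rglob("*.py"))
```

Calling baml_client_ready() from the example directory returns False if the generate step was skipped or wrote no Python files.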

3. Set up environment variables

Create a .env file in the example directory (the command below overwrites any existing .env file there):

echo "GEMINI_API_KEY=your_api_key_here" > .env

Replace your_api_key_here with your actual Gemini API key.
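A common slip is leaving the placeholder in place. The sketch below reads a .env file with the standard library only and reports whether GEMINI_API_KEY looks like a real value. The parser is deliberately simple (KEY=VALUE lines, optional quotes) and is an illustration, not how CocoIndex or BAML load the file. The function names are hypothetical.

```python
from pathlib import Path

PLACEHOLDER = "your_api_key_here"


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of quotes, if any.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_real_gemini_key(path: str = ".env") -> bool:
    """True if GEMINI_API_KEY is present, non-empty, and not the placeholder."""
    key = read_env_file(path).get("GEMINI_API_KEY", "")
    return bool(key) and key != PLACEHOLDER
```

Running has_real_gemini_key() in the example directory returns False until the placeholder has been replaced.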

4. Run the application

cocoindex update main.py

This will:

  1. Read all PDF files from data/patient_forms/
  2. Extract patient information using BAML
  3. Write the extracted data as JSON files to output_patients/
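The three steps above can be pictured with a plain-Python stand-in. This is not the CocoIndex API and it does no incremental processing; it only mirrors the read, extract, write flow. The extract parameter is a hypothetical callable standing in for the generated BAML function, and the one-JSON-file-per-PDF naming is an assumption inferred from the output file names this README lists.

```python
import json
from pathlib import Path
from typing import Callable


def process_forms(
    extract: Callable[[bytes], dict],
    input_dir: str = "data/patient_forms",
    output_dir: str = "output_patients",
) -> list[Path]:
    """Run `extract` on every PDF in input_dir and write one JSON file per PDF."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # Step 2: hand the raw PDF bytes to the (hypothetical) extractor.
        record = extract(pdf.read_bytes())
        # Step 3: write the result next to its source name, e.g. Form.pdf -> Form.json.
        target = out / f"{pdf.stem}.json"
        target.write_text(json.dumps(record, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Unlike this loop, which reprocesses every PDF on each call, the CocoIndex app in main.py processes files incrementally.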

5. Verify the output

After running, check the output_patients/ directory:

ls -la output_patients/

You should see JSON files such as:

  • Patient_Intake_Form_David_Artificial.json
  • Patient_Intake_Form_Emily_Artificial.json
  • Patient_Intake_Form_Joe_Artificial.json
  • Patient_Intake_Form_Jane_Artificial.json
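Beyond listing the directory, it can be useful to confirm that each output file is valid JSON. The following is a minimal sketch using only the standard library. It makes no assumption about the fields inside each file, and the helper name check_output is hypothetical.

```python
import json
from pathlib import Path


def check_output(output_dir: str = "output_patients") -> dict[str, bool]:
    """Map each .json file name in output_dir to whether it parses as JSON."""
    results: dict[str, bool] = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
            results[path.name] = True
        except json.JSONDecodeError:
            results[path.name] = False
    return results
```

An empty result means no JSON files were found, which usually indicates that the run step did not complete or wrote to a different directory.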