Skip to content

Latest commit

 

History

History
297 lines (228 loc) · 8.29 KB

File metadata and controls

297 lines (228 loc) · 8.29 KB

PFVPU Scripts

This folder contains scripts for extracting data forms and transferring the data to Excel spreadsheets. These scripts include advanced features like checkbox detection and multi-destination output.

Files Overview

1. PFVPU_Script.py

Purpose: Extracts data from Word content controls AND checkboxes, writes to a single Excel destination.

Use when: You have a form with text fields and checkboxes, and need to transfer data to one Excel file/sheet.

Key Features:

  • Extracts content from Word document content controls
  • Detects checked checkboxes (☒) and records them as "Yes"
  • Skips empty fields and placeholder text
  • Handles merged cells in Excel safely
  • Writes to the first empty row

2. multiple_outputs.py

Purpose: Extracts data from Word (including checkboxes) and writes to MULTIPLE Excel files/sheets simultaneously.

Use when: You need to send the same extracted data to multiple spreadsheets or different sheets in the same workbook.

Key Features:

  • All features of PFVPU_Script.py
  • Writes to multiple destinations in one execution
  • Validates each destination before writing
  • Provides detailed success/failure reporting
  • Handles different workbooks and sheet names

Setup Instructions

Prerequisites

  1. Python 3.7 or higher installed
  2. Required packages installed (see main requirements.txt)
  3. Word document (.docx) with content controls and/or checkboxes
  4. Excel workbook(s) (.xlsx) with headers in row 2

Installation

# From the repository root
pip install -r requirements.txt

How to Use

For PFVPU_Script.py

  1. Open the script in your text editor or IDE

  2. Update the file paths at the top:

# === INPUTS ===
docx_path = r"C:\path\to\your\form.docx"
excel_path = r"C:\path\to\your\output.xlsx"
sheet_name = "Sheet 1"  # Change to your sheet name
  1. Ensure your Word document has:

    • Content controls with aliases/tags matching Excel headers
    • Checkboxes (☒ for checked, ☐ for unchecked)
    • Checkbox text labels that match Excel headers if you want to capture checkbox states
  2. Ensure your Excel file has:

    • Headers in row 2
    • Column headers matching content control aliases
    • Column headers matching checkbox labels (optional)
  3. Run the script:

python PFVPU/PFVPU_Script.py
  1. Check the output:
Data transferred successfully (checkboxes handled, merged cells safe).

For multiple_outputs.py

  1. Open the script and update the input path:
# === INPUTS ===
docx_path = r"C:\path\to\your\form.docx"
  1. Configure multiple destinations:
# Define multiple output destinations
# Format: (excel_path, sheet_name)
output_destinations = [
    (r"C:\path\to\file1.xlsx", "Sheet 1"),
    (r"C:\path\to\file1.xlsx", "Sheet 2"),
    (r"C:\path\to\file2.xlsx", "Data"),
]

You can add as many destinations as needed!

  1. Run the script:
python PFVPU/multiple_outputs.py
  1. Review the detailed output:
============================================================
DOCX TO MULTIPLE EXCEL FILES
============================================================

Extracting data from: Test Form PHAC-PFVPU.docx
✓ Extracted 15 field(s)

Writing to 3 destination(s)...

→ output.xlsx / 'Sheet 1'
  ✓ Wrote 12 values to row 3

→ output.xlsx / 'Sheet 2'
  ✓ Wrote 8 values to row 3

→ output2.xlsx / 'Data'
  ✓ Wrote 15 values to row 3

============================================================
SUMMARY
============================================================
Total destinations: 3
Successfully written: 3
Failed: 0
============================================================
✓ All data transferred successfully!

Configuration Options

Adding More Destinations (multiple_outputs.py)

Simply add more tuples to the list:

output_destinations = [
    (r"C:\path\to\main_database.xlsx", "Raw Data"),
    (r"C:\path\to\backup.xlsx", "Backup"),
    (r"C:\path\to\reports.xlsx", "Monthly"),
    (r"C:\path\to\archive\2024.xlsx", "Q1"),
]

Customizing Checkbox Detection

The scripts look for these checkbox symbols: ☒ (checked) and ☐ (unchecked)

To modify which checkboxes are captured, edit the pattern in the script:

# Current pattern
pattern = re.compile(r'(☒|☐)\s*(.+)')

for mark, txt in paragraphs:
    if mark == '☒':  # Only take checked boxes
        clean_text = txt.strip()
        if clean_text:
            data[clean_text] = "Yes"

Changing What Value Is Stored for Checkboxes

# Default: Store "Yes" for checked boxes
data[clean_text] = "Yes"

# Alternative: Store "Checked", "True", "X", etc.
data[clean_text] = "X"
data[clean_text] = True

Adjusting Starting Row

# Default (starts at row 3)
next_row = find_first_empty_row(ws, start_row=3)

# Custom (e.g., row 4)
next_row = find_first_empty_row(ws, start_row=4)

Troubleshooting

Issue: Checkboxes not being detected

Solutions:

  • Ensure checkboxes are using the correct symbols: ☒ (checked) or ☐ (unchecked)
  • Verify checkbox labels match Excel headers exactly
  • Check that checkboxes are in the main document (not in headers/footers)

Issue: "Sheet does not exist" error

Solution:

  • Verify the sheet name is spelled correctly (case-sensitive)
  • Check the script output for available sheet names
  • Update sheet_name variable to match exactly

Issue: "File does not exist" error

Solution:

  • Verify file paths are correct
  • Use raw strings (prefix with r)
  • Ensure files exist before running the script

Issue: Multiple outputs - some succeed, some fail

Analysis:

  • Check the detailed output to see which destinations failed
  • Common causes:
    • File doesn't exist
    • Sheet name is wrong
    • File is open in Excel (close it)
    • Incorrect path

Issue: Merged cells cause errors or incorrect placement

Solution: The script handles merged cells automatically. If issues persist:

  • Avoid having headers in merged cells
  • Ensure data columns are not merged
  • Test with non-merged cells first

Issue: Data appears in wrong columns

Possible causes:

  • Content control aliases don't match headers exactly
  • Extra spaces in headers or aliases
  • Case sensitivity (e.g., "Name" vs "name")

Solution:

  • Print headers and data keys to debug: Add this after line that reads headers:
headers = [cell.value for cell in ws[2]]
print(f"Excel headers: {headers}")
print(f"Form data keys: {list(data.keys())}")

Word Document Requirements

Content Controls

  1. Enable Developer tab: File > Options > Customize Ribbon > Developer
  2. For each content control:
    • Click the control
    • Click "Properties"
    • Set "Tag" or "Alias" to match Excel header
    • Click OK

Excel Requirements

  • Format: .xlsx files only
  • Headers: Row 2 by default
  • Matching: Column names must match Word control aliases/checkbox labels exactly
  • State: Close Excel files before running scripts
  • Merged Cells: Supported, but avoid in header row
  • Multiple sheets: Supported (especially useful with multiple_outputs.py)

Example Use Cases

Single Destination (PFVPU_Script.py)

  • Processing forms into a single database
  • Simple data transfer tasks
  • Testing and validation

Multiple Destinations (multiple_outputs.py)

  • Writing to multiple departmental databases
  • Creating backups automatically
  • Distributing data to different teams
  • Writing to different sheets for different data types
  • Maintaining archive copies

Example Workflow

Workflow 1: Single Destination

  1. Create Word form with content controls and checkboxes
  2. Create Excel file with matching headers
  3. Configure PFVPU_Script.py with file paths
  4. Run script and verify data transfer
  5. Use on production forms

Workflow 2: Multiple Destinations

  1. Set up Word form as above
  2. Create multiple Excel files/sheets as needed
  3. Configure output_destinations list in multiple_outputs.py
  4. Test with one form
  5. Review detailed output for any failures
  6. Fix any path/sheet name issues
  7. Run on production forms

Notes

  • Close all Excel files before running
  • Test with sample data first
  • Use multiple_outputs.py even for single destination if you want detailed reporting
  • Checkbox text must appear immediately after the checkbox symbol
  • Empty checkboxes (☐) are not recorded in Excel (no value written)