This repository contains Python scripts for extracting data from Microsoft Word documents (.docx) and transferring it to Excel spreadsheets (.xlsx). The tools are designed to work with Word documents that use content controls (form fields) and checkboxes.
These scripts automate the process of:
- Extracting text from Word document content controls (form fields)
- Detecting and reading checkbox states
- Writing extracted data to Excel spreadsheets in structured formats
- Handling multiple output destinations
- Managing duplicate entries
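
As a rough illustration of the extraction step, the sketch below reads content controls straight from the document XML using only the standard library. It is not taken from the scripts in this repository: the function names `parse_controls` and `read_docx_controls` are hypothetical, and it assumes controls are identified by their alias (falling back to the tag). Duplicate names are kept as a list so the caller can decide between single-row and multi-row output.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def parse_controls(document_xml):
    """Map each content control's alias (or tag) to the list of texts found."""
    root = ET.fromstring(document_xml)
    found = defaultdict(list)
    for sdt in root.iter(f"{W}sdt"):
        props = sdt.find(f"{W}sdtPr")
        content = sdt.find(f"{W}sdtContent")
        if props is None or content is None:
            continue
        # Word marks a control that still shows its prompt with <w:showingPlcHdr/>.
        if props.find(f"{W}showingPlcHdr") is not None:
            continue
        name_el = props.find(f"{W}alias")
        if name_el is None:
            name_el = props.find(f"{W}tag")
        if name_el is None:
            continue
        name = name_el.get(f"{W}val")
        # A control's text may be split across several runs; join them.
        text = "".join(t.text or "" for t in content.iter(f"{W}t")).strip()
        if name and text:  # skip empty fields
            found[name].append(text)
    return dict(found)


def read_docx_controls(docx_path):
    """Open a .docx (a zip archive) and parse its main document part."""
    with zipfile.ZipFile(docx_path) as archive:
        return parse_controls(archive.read("word/document.xml"))
```

For single-row output, a caller could take the first value of each list; for multi-row output, it could emit one row per list position.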
├── OHAF/ # Scripts for OHAF project
│ ├── OHAF_Script.py
│ ├── OHAF_Script_duplicates.py
│ ├── Rich_Text_Filler.py
│ └── README.md
├── PFVPU/ # Scripts for PFVPU project
│ ├── PFVPU_Script.py
│ ├── multiple_outputs.py
│ └── README.md
├── requirements.txt
├── .gitignore
└── README.md
- Forms Processing: Automatically extract data from completed Word forms
- Data Migration: Transfer information from Word documents to Excel databases
- Reporting: Consolidate data from multiple Word documents into Excel reports
- Quality Assurance: Test form functionality by filling controls with test data
- Python 3.7 or higher
- Microsoft Word documents (.docx format)
- Excel workbooks (.xlsx format)
- Clone this repository:
    git clone <repository-url>
    cd <repository-name>
- Install required packages:
    pip install -r requirements.txt
- Choose the appropriate script folder (OHAF or PFVPU) based on your needs
- Open the Python script in a text editor
- Update the file paths in the script:
  - docx_path: Path to your Word document
  - excel_path: Path to your Excel file
  - sheet_name: Name of the target sheet
- Run the script:
    python OHAF/OHAF_Script.py
- Extract text from Word content controls by alias/tag name
- Skip empty fields and placeholder text
- Find and write to the first empty row in Excel
- Preserve Excel formatting and formulas
- Handle duplicate content control tags
- Fill Word documents with test data
- Single-row output (duplicate tags ignored) or multi-row output
- Extract checkbox states (checked/unchecked)
- Handle merged cells in Excel
- Write to multiple Excel destinations from a single Word document
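
Checkbox detection can be sketched in the same way. The example below is an assumption about how the states might be read, not the code used by the PFVPU scripts: it looks for checkbox content controls (the `w14:checkbox` element Word 2010 and later writes into the control properties) and treats a `w14:checked` value of `1` or `true` as checked. The function name `checkbox_states` is hypothetical, and legacy form-field checkboxes are not covered.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
W14 = "{http://schemas.microsoft.com/office/word/2010/wordml}"


def checkbox_states(document_xml):
    """Return {alias_or_tag: bool} for every checkbox content control."""
    states = {}
    for sdt in ET.fromstring(document_xml).iter(f"{W}sdt"):
        props = sdt.find(f"{W}sdtPr")
        if props is None:
            continue
        box = props.find(f"{W14}checkbox")
        if box is None:
            continue  # not a checkbox control
        name_el = props.find(f"{W}alias")
        if name_el is None:
            name_el = props.find(f"{W}tag")
        if name_el is None:
            continue
        checked = box.find(f"{W14}checked")
        # Assumption: a missing element or attribute is treated as unchecked.
        value = checked.get(f"{W14}val") if checked is not None else "0"
        states[name_el.get(f"{W}val")] = value in ("1", "true")
    return states
```

The input is the raw content of `word/document.xml`, which can be read from the .docx archive with `zipfile.ZipFile(docx_path).read("word/document.xml")`.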
See the README files in each folder (OHAF/README.md and PFVPU/README.md) for more details.
- Close the target Excel files before running a script; a workbook that is open in Excel is locked, and the script may fail to save to it
- Content control aliases in Word must match Excel column headers
- Scripts assume that Excel headers are in row 2 and that data starts in row 3
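
The last two notes describe a layout convention that can be sketched as follows. This is a minimal illustration, not the repository's implementation: the helper names are hypothetical, and the worksheet is assumed to expose `max_column` and `cell(row=..., column=...).value` the way an openpyxl worksheet does (whether the scripts use openpyxl is itself an assumption).

```python
HEADER_ROW = 2      # per the notes above: headers are in row 2
FIRST_DATA_ROW = 3  # and data starts in row 3


def header_columns(ws):
    """Map each header text in HEADER_ROW to its 1-based column index."""
    columns = {}
    for col in range(1, ws.max_column + 1):
        value = ws.cell(row=HEADER_ROW, column=col).value
        if value is not None and str(value).strip():
            columns[str(value).strip()] = col
    return columns


def first_empty_row(ws, columns):
    """Return the first row from FIRST_DATA_ROW down with no value under any header."""
    row = FIRST_DATA_ROW
    while any(
        ws.cell(row=row, column=col).value not in (None, "")
        for col in columns.values()
    ):
        row += 1
    return row


def write_record(ws, record):
    """Write {alias: value} into the first empty row and return that row number."""
    columns = header_columns(ws)
    row = first_empty_row(ws, columns)
    for name, value in record.items():
        if name in columns:  # aliases without a matching header are skipped
            ws.cell(row=row, column=columns[name]).value = value
    return row
```

With openpyxl, `ws` would typically come from `openpyxl.load_workbook(excel_path)[sheet_name]`, followed by a call to the workbook's `save(excel_path)` once the record has been written.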