hc-sc-ocdo-bdpd/Grants-Contributions

Word Document to Excel Data Extraction Tools

This repository contains Python scripts for extracting data from Microsoft Word documents (.docx) and transferring it to Excel spreadsheets (.xlsx). The tools are designed to work with Word documents that use content controls (form fields) and checkboxes.

Overview

These scripts automate the process of:

  • Extracting text from Word document content controls (form fields)
  • Detecting and reading checkbox states
  • Writing extracted data to Excel spreadsheets in structured formats
  • Handling multiple output destinations
  • Managing duplicate entries

Repository Structure

├── OHAF/                    # Scripts for OHAF project
│   ├── OHAF_Script.py
│   ├── OHAF_Script_duplicates.py
│   ├── Rich_Text_Filler.py
│   └── README.md
├── PFVPU/                   # Scripts for PFVPU project
│   ├── PFVPU_Script.py
│   ├── multiple_outputs.py
│   └── README.md
├── requirements.txt
├── .gitignore
└── README.md

Use Cases

  • Forms Processing: Automatically extract data from completed Word forms
  • Data Migration: Transfer information from Word documents to Excel databases
  • Reporting: Consolidate data from multiple Word documents into Excel reports
  • Quality Assurance: Test form functionality by filling controls with test data

Prerequisites

  • Python 3.7 or higher
  • Microsoft Word documents (.docx format)
  • Excel workbooks (.xlsx format)

Installation

  1. Clone this repository:
git clone <repository-url>
cd <repository-name>
  2. Install required packages:
pip install -r requirements.txt

Quick Start

  1. Choose the appropriate script folder (OHAF or PFVPU) based on your needs
  2. Open the Python script in a text editor
  3. Update the file paths in the script:
    • docx_path: Path to your Word document
    • excel_path: Path to your Excel file
    • sheet_name: Name of the target sheet
  4. Run the script:
python OHAF/OHAF_Script.py
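
Before running a script, it can help to confirm that the three settings from step 3 point at usable files. The following is a minimal sketch, not part of the repository: the example paths and the `check_inputs` helper are hypothetical, and only the variable names `docx_path`, `excel_path`, and `sheet_name` come from the steps above.

```python
from pathlib import Path

# Hypothetical example values; replace with your own locations.
docx_path = Path("forms/completed_form.docx")
excel_path = Path("tracking/tracker.xlsx")
sheet_name = "Sheet1"


def check_inputs(docx_path, excel_path):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if docx_path.suffix.lower() != ".docx":
        problems.append(f"{docx_path} is not a .docx file")
    if excel_path.suffix.lower() != ".xlsx":
        problems.append(f"{excel_path} is not a .xlsx file")
    for path in (docx_path, excel_path):
        if not path.is_file():
            problems.append(f"{path} was not found")
    return problems


if __name__ == "__main__":
    for problem in check_inputs(docx_path, excel_path):
        print("Problem:", problem)
```

The check only looks at file extensions and existence; it does not verify that `sheet_name` exists in the workbook.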

Features

Common Features

  • Extract text from Word content controls by alias/tag name
  • Skip empty fields and placeholder text
  • Find and write to the first empty row in Excel
  • Preserve Excel formatting and formulas

OHAF Scripts

  • Handle duplicate content control tags
  • Fill Word documents with test data
  • Single row output (ignores duplicates) or multi-row output

PFVPU Scripts

  • Extract checkbox states (checked/unchecked)
  • Handle merged cells in Excel
  • Write to multiple Excel destinations from a single Word document
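
Checkbox detection can be sketched with the standard library as well. This example is an illustration, not the PFVPU implementation: it assumes the checkboxes are Word 2010-style checkbox content controls, which store their state in a `w14:checkbox` element with a `w14:checked` child. Legacy form-field checkboxes use a different structure and are not covered. The function name `extract_checkboxes` is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"


def extract_checkboxes(document_xml):
    """Return {name: True/False} for each checkbox content control."""
    root = ET.fromstring(document_xml)
    states = {}
    for sdt in root.iter(f"{{{W}}}sdt"):
        props = sdt.find(f"{{{W}}}sdtPr")
        if props is None:
            continue
        box = props.find(f"{{{W14}}}checkbox")
        if box is None:
            continue  # not a checkbox control
        # Use the alias if present, otherwise the tag.
        name = None
        for kind in ("alias", "tag"):
            node = props.find(f"{{{W}}}{kind}")
            if node is not None and node.get(f"{{{W}}}val"):
                name = node.get(f"{{{W}}}val")
                break
        if name is None:
            continue
        checked = box.find(f"{{{W14}}}checked")
        value = checked.get(f"{{{W14}}}val") if checked is not None else "0"
        states[name] = value in ("1", "true")
    return states
```

The input is the content of `word/document.xml`, as a string or bytes.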

See the README files in each folder (OHAF/README.md and PFVPU/README.md) for more details.

Important Notes

  • Close the Excel files before running the scripts; a workbook that is open in Excel may be locked, and saving to it can fail
  • Content control aliases in Word must match the Excel column headers
  • The scripts assume Excel headers are in row 2 and data starts in row 3
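
The layout assumptions above (headers in row 2, data from row 3, aliases matched to headers, writing to the first empty row) can be sketched as plain Python over a grid of cell values. This is an illustration only: the function names are hypothetical, and the sheet is modeled as a list of row tuples rather than a real workbook.

```python
HEADER_ROW = 2      # sheet row that holds the column headers
FIRST_DATA_ROW = 3  # first sheet row that may hold data


def first_empty_row(rows):
    """Return the 1-based number of the first empty data row.

    rows[0] corresponds to sheet row 1. If every existing data row
    contains a value, the row after the last one is returned.
    """
    for number in range(FIRST_DATA_ROW, len(rows) + 1):
        if all(cell in (None, "") for cell in rows[number - 1]):
            return number
    return max(len(rows) + 1, FIRST_DATA_ROW)


def cells_to_write(rows, values):
    """Map values keyed by alias to {(row, column): value} cell writes."""
    headers = rows[HEADER_ROW - 1] if len(rows) >= HEADER_ROW else ()
    target = first_empty_row(rows)
    writes = {}
    for column, header in enumerate(headers, start=1):
        if header in values:  # aliases without a matching header are skipped
            writes[(target, column)] = values[header]
    return writes
```

If the workbook is handled with openpyxl (an assumption; check requirements.txt for the library the scripts actually use), `list(ws.iter_rows(values_only=True))` can supply `rows`, and each entry of the result can be applied with `ws.cell(row=r, column=c, value=v)`.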

About

Processes standardized Word templates and automatically populates Excel tracking sheets
