The Multi-Modal Prompt Refinement System is a Python-based command-line tool that translates unstructured, human-provided inputs into a clean, structured, machine-ready Master Prompt in JSON format.
The system is designed to act as a translator between messy real-world inputs—such as text notes, documents, and images with descriptions—and downstream AI systems or engineering workflows that require stable, predictable input formats.
This project is not a UI application and not a model training pipeline. Its primary focus is on deterministic reasoning, validation, and explainability.
In practical AI and product development workflows, requirements are often:
- Scattered across multiple formats
- Incomplete or ambiguous
- Difficult to reuse reliably
This system addresses those challenges by:
- Consolidating information from multiple input modalities
- Structuring it into a predictable, schema-validated JSON format
- Deterministically detecting intent and extracting requirements
- Explicitly surfacing missing or unclear information
- Preventing silent assumptions or hallucinated details
The output is a transparent and auditable Master Prompt that can be safely consumed by downstream systems.
- Clone the repository:

      git clone <repository-url>
      cd multi-modal-prompt-refiner

- (Recommended) Create and activate a virtual environment:

      python -m venv .venv
      source .venv/bin/activate  # On Windows: .venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

The system is executed from the command line. You may provide one or more input files along with optional metadata.

    python main.py [INPUT_FILES...] [--output-file OUTPUT_PATH] [--image-desc IMAGE_FILE DESCRIPTION]

- INPUT_FILES...: One or more input files (.txt, .pdf, .docx, .png, .jpg).
- --output-file (optional): Path where the refined JSON prompt will be saved. Default: outputs/refined_prompt.json
- --image-desc (optional): Associates a human-provided description with an image file. This flag may be used multiple times.

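
As a rough illustration of the interface described above, the arguments could be declared with Python's standard `argparse` module. This is a minimal sketch, not the project's actual `main.py`: the function name `build_parser` and the help strings are assumptions, and only the flag names and the default output path come from the usage text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line interface."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Refine multi-modal inputs into a structured Master Prompt.",
    )
    # One or more input files (.txt, .pdf, .docx, .png, .jpg).
    parser.add_argument("input_files", nargs="+", metavar="INPUT_FILES")
    # Optional output path; the default matches the documented one.
    parser.add_argument("--output-file", default="outputs/refined_prompt.json")
    # Repeatable flag: each use contributes one [IMAGE_FILE, DESCRIPTION] pair.
    parser.add_argument(
        "--image-desc",
        nargs=2,
        action="append",
        default=[],
        metavar=("IMAGE_FILE", "DESCRIPTION"),
    )
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration only.
example_args = build_parser().parse_args(
    ["notes.txt", "sketch.png", "--image-desc", "sketch.png", "Homepage sketch"]
)
print(example_args.input_files, example_args.image_desc, example_args.output_file)
```

Using `action="append"` together with `nargs=2` is one way to let `--image-desc` appear several times while keeping each file and its description paired.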

    python main.py inputs/example_text.txt --output-file outputs/text_example.json

    python main.py inputs/example_text.txt inputs/example_doc.docx inputs/example_image.png \
        --image-desc inputs/example_image.png "Sketch of the homepage layout" \
        --output-file outputs/mixed_example.json

If an input does not describe a buildable task or product, the system rejects it with a clear explanation.

    python main.py inputs/irrelevant.txt --output-file outputs/irrelevant_example.json

The resulting output explicitly indicates the rejection and the reason for it.
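
To make the rejection behavior concrete, here is a minimal sketch of what a deterministic relevance check might look like. Everything in it is an assumption made for illustration: the keyword list, the function name `check_relevance`, and the `status` and `reason` fields are hypothetical and are not taken from the project's code or schema.

```python
import re

# Hypothetical keywords hinting that the text describes something buildable.
ACTION_KEYWORDS = {
    "build", "create", "design", "develop", "implement",
    "app", "website", "api", "feature", "system",
}


def check_relevance(text: str) -> dict:
    """Return an accept/reject decision together with a human-readable reason."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if not words:
        return {"status": "rejected", "reason": "Input is empty."}
    if not words & ACTION_KEYWORDS:
        return {
            "status": "rejected",
            "reason": "No buildable task or product was found in the input.",
        }
    return {"status": "accepted", "reason": None}


print(check_relevance("The weather was nice today."))
print(check_relevance("Build a website for booking appointments."))
```

Because the decision depends only on fixed rules, the same input always yields the same result, and the recorded reason keeps the rejection auditable.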
- Processing: Each input file is read using a modality-specific processor (text, document, or image metadata).
- Consolidation: Extracted content and image descriptions are combined into a unified text representation while preserving source attribution.
- Relevance Checks: Inputs that are empty or non-actionable are flagged and excluded from refinement.
- Refinement:
  - Intent Detection: A deterministic, rule-based process identifies the primary product or task intent.
  - Extraction: Rule-based logic extracts functional requirements and technical constraints.
- Prompt Construction: The extracted information is assembled into a structured JSON object conforming to config/prompt_schema.json. Missing or unclear information is explicitly recorded.
- Validation: The final output is validated against the schema to ensure structural correctness and stability.
- Output: The validated Master Prompt is written to disk. If validation fails, an .invalid.json file is produced for inspection.

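
The steps above can be sketched end to end in a few functions. This is a simplified illustration under stated assumptions, not the project's implementation: the intent rules, the modal-verb heuristic for requirements, the field names (`intent`, `requirements`, `missing_information`, `sources`), and the key-presence check standing in for real validation against config/prompt_schema.json are all hypothetical.

```python
import re

# Hypothetical ordered rules: the first intent whose keyword matches wins.
INTENT_RULES = [
    ("web_application", ("website", "web app", "homepage")),
    ("mobile_application", ("android", "ios", "mobile app")),
    ("api_service", ("rest api", "endpoint")),
]
REQUIRED_KEYS = ("intent", "requirements", "missing_information", "sources")
REQUIREMENT_PATTERN = re.compile(r"\b(must|should|shall)\b", re.IGNORECASE)


def consolidate(sources: dict) -> str:
    """Join all inputs into one text, tagging each block with its source name."""
    return "\n".join(f"[{name}] {text.strip()}" for name, text in sources.items())


def detect_intent(text: str):
    """Return the first matching intent label, or None when nothing matches."""
    for intent, keywords in INTENT_RULES:
        if any(re.search(rf"\b{re.escape(k)}\b", text, re.IGNORECASE) for k in keywords):
            return intent
    return None


def extract_requirements(sources: dict) -> list:
    """Collect lines that contain a modal verb, keeping their source attribution."""
    found = []
    for name, text in sources.items():
        for line in text.splitlines():
            if REQUIREMENT_PATTERN.search(line):
                found.append({"text": line.strip(), "source": name})
    return found


def build_prompt(sources: dict) -> dict:
    """Assemble the prompt and record missing information instead of guessing."""
    intent = detect_intent(consolidate(sources))
    requirements = extract_requirements(sources)
    missing = []
    if intent is None:
        missing.append("intent")
    if not requirements:
        missing.append("requirements")
    return {
        "intent": intent,
        "requirements": requirements,
        "missing_information": missing,
        "sources": list(sources),
    }


def validate(prompt: dict) -> list:
    """Stand-in for schema validation: report any required keys that are absent."""
    return [key for key in REQUIRED_KEYS if key not in prompt]


example = build_prompt({
    "notes.txt": "Build a website.\nUsers must be able to log in.",
    "sketch.png": "Sketch of the homepage layout",
})
print(example, validate(example))
```

Listing gaps in `missing_information` rather than filling them with defaults reflects the stated goal of preventing silent assumptions. A real implementation would presumably replace `validate` with a full JSON Schema check.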
Detailed design rationale, architectural decisions, and trade-offs are documented in:
docs/DESIGN_EXPLANATION.md
Name: Priyank Wadiwala
Institute: Sardar Vallabhbhai Institute of Technology, Vasad