# Visual Symphony - Multimodal Narrative Generator

## Demo

Recording.2025-01-25.020918.1.mp4

## Project Overview

An AI-powered pipeline that transforms visual inputs into immersive audio narratives in three stages:

1. **Image Captioning** (BLIP)
2. **Contextual Story Generation** (Mixtral-8x7B via Groq)
3. **Emotional Speech Synthesis** (Bark)
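The three stages above chain as simple function composition. A minimal sketch with stub stages — in the real pipeline each stub wraps a model call (BLIP, Mixtral via Groq, Bark), so the bodies here are placeholders:

```python
def image2text(image_path: str) -> str:
    """Stage 1: caption the image (BLIP in the real pipeline)."""
    return f"a caption describing {image_path}"

def gen_story(caption: str) -> str:
    """Stage 2: expand the caption into a narrative (Mixtral-8x7B in the real pipeline)."""
    return f"Once upon a time, {caption}."

def gen_tts(story: str) -> bytes:
    """Stage 3: synthesize speech (Bark in the real pipeline); returns audio bytes."""
    return story.encode("utf-8")  # placeholder for real audio data

# The whole pipeline is one composition:
audio = gen_tts(gen_story(image2text("input.jpg")))
```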

## Features

- **Multimodal Processing Chain**
  - Image → Text → Story → Speech conversion
  - Context-aware narrative generation
- **Production Optimizations**
  - TensorFlow/Keras performance tuning
  - Cross-framework compatibility (PyTorch/TF)
  - Memory-efficient inference
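Much of the cross-framework tuning comes down to how the frameworks are initialized. A hedged sketch of the kind of setup involved — the environment variables are standard TensorFlow flags, but exactly which ones this project sets is an assumption:

```python
import os

# Quiet TensorFlow's C++ logging and keep it from claiming all GPU memory
# up front, so PyTorch models can share the same device. These must be set
# BEFORE `import tensorflow`. (Which flags this project actually uses is an
# assumption; the variable names themselves are standard.)
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")          # hide INFO/WARNING logs
os.environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")  # allocate GPU memory lazily
```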

## Installation

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

## Configuration

```ini
# .env template
GROQ_API_KEY=your_api_key_here
```
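The app presumably reads this file at startup (python-dotenv is the usual tool). A dependency-free sketch of what that loading amounts to — `load_env` is an illustrative helper, not the project's code:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv: read KEY=VALUE lines."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
    api_key = os.environ["GROQ_API_KEY"]
```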

## Usage

```python
# Core workflow
image_description = image2text("input.jpg")  # BLIP caption
narrative = gen_story(image_description)     # Mixtral-8x7B via Groq
gen_tts(narrative)                           # Bark speech synthesis
```
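For the story stage, a hedged sketch of what `gen_story` might look like with the Groq Python SDK — the prompt wording, helper name, and model id are illustrative assumptions, not the project's exact code:

```python
def build_story_prompt(caption: str) -> str:
    """Turn a BLIP caption into an LLM prompt (wording is illustrative)."""
    return (
        "Write a short, vivid story (under 100 words) inspired by this scene: "
        f"{caption}"
    )

def gen_story(caption: str) -> str:
    from groq import Groq  # requires GROQ_API_KEY in the environment
    client = Groq()
    resp = client.chat.completions.create(
        model="mixtral-8x7b-32768",  # assumed model id
        messages=[{"role": "user", "content": build_story_prompt(caption)}],
    )
    return resp.choices[0].message.content
```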

## Architecture

```mermaid
graph TD
    A[Image Input] --> B[BLIP Captioning]
    B --> C[LLM Story Generation]
    C --> D[Bark Speech Synthesis]
    D --> E[Audio Output]
```

## License

Apache 2.0 - see the included LICENSE file.