AI-powered pipeline transforming visual inputs into immersive audio narratives through:

- Image Captioning (BLIP)
- Contextual Story Generation (Mixtral-8x7B via Groq)
- Emotional Speech Synthesis (Bark)
- Multimodal Processing Chain
  - Image → Text → Story → Speech conversion
  - Context-aware narrative generation
- Production Optimizations
  - TensorFlow/Keras performance tuning
  - Cross-framework compatibility (PyTorch/TF)
  - Memory-efficient inference
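The captioning stage can be sketched with Hugging Face's BLIP checkpoint. This is a minimal sketch, not the project's exact implementation; the function name `image2text` matches the core workflow snippet, and the lazy import keeps the heavy dependency out of module load.

```python
def image2text(image_path: str) -> str:
    """Caption an image with BLIP (downloads the model on first use)."""
    from transformers import pipeline  # lazy import: heavy dependency

    captioner = pipeline(
        "image-to-text",
        model="Salesforce/blip-image-captioning-base",
    )
    # The pipeline returns a list of dicts, one per generated caption.
    return captioner(image_path)[0]["generated_text"]
```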
```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
```shell
# .env template
GROQ_API_KEY=your_api_key_here
```
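The story-generation stage can be sketched with Groq's chat completions client. This is an illustrative sketch: the prompt wording and the `build_prompt` helper are assumptions, not taken from the project; only `gen_story` and the Mixtral-8x7B model come from the source.

```python
import os


def build_prompt(description: str) -> str:
    """Turn an image caption into a short-story instruction for the LLM."""
    return (
        "You are a storyteller. Write a short, vivid story "
        f"(under 100 words) based on this scene: {description}"
    )


def gen_story(description: str) -> str:
    """Generate a narrative from the caption via Groq's chat API."""
    from groq import Groq  # lazy import; requires GROQ_API_KEY in the environment

    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    resp = client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=[{"role": "user", "content": build_prompt(description)}],
    )
    return resp.choices[0].message.content
```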
```python
# Core workflow
image_description = image2text("input.jpg")
narrative = gen_story(image_description)
gen_tts(narrative)
```
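The speech-synthesis stage can be sketched with Bark through the `transformers` text-to-speech pipeline (the `suno/bark-small` checkpoint and the output filename are assumptions for illustration; the project's `gen_tts` may differ).

```python
def gen_tts(text: str, out_path: str = "story.wav") -> str:
    """Synthesize speech from the narrative with Bark and save it as WAV."""
    import scipy.io.wavfile
    from transformers import pipeline  # lazy import: heavy dependency

    synthesiser = pipeline("text-to-speech", model="suno/bark-small")
    speech = synthesiser(text)
    # The pipeline returns a dict with the raw waveform and its sample rate.
    scipy.io.wavfile.write(
        out_path, rate=speech["sampling_rate"], data=speech["audio"]
    )
    return out_path
```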
```mermaid
graph TD
A[Image Input] --> B[BLIP Captioning]
B --> C[LLM Story Generation]
C --> D[Bark Speech Synthesis]
D --> E[Audio Output]
```
Apache 2.0; see the included `LICENSE` file.