mixpeek/multimodal-tools

🧰 Multimodal Tools

A collection of simple, standalone Python scripts for working with video, audio, image, and text data — designed for developers exploring multimodal AI.

Each utility lives in its own folder with examples and a CLI-friendly interface.


📂 Tools (WIP)

| Tool | Description |
| --- | --- |
| segment_transcript_by_topic/ | Transcribe and cluster audio/video by topic |
| split_video_by_second/ | Split a video file into N-second chunks |
| extract_thumbnails/ | Grab frames from a video every N seconds |
| transcribe_audio/ | Transcribe audio using Whisper |
| search_local_media/ | CLIP-based text search across your media folder |
| scene_change_split/ | Splits a video into separate clips based on detected scene changes |
| caption_search/ | Searches for text within video captions or existing Whisper transcripts |
| generate_video_captions/ | Generates SRT/VTT caption files from video or audio using Whisper |
| video_shot_segmenter/ | Segments a video into individual shots based on visual changes |
| summarize_transcript/ | Summarizes text from a transcript file (e.g., from Whisper output) |
| blur_faces/ | Detects and blurs faces in images or video frames for privacy |
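
To give a feel for what a tool like split_video_by_second/ has to work out, here is a minimal sketch of the chunking arithmetic. It is not the code in this repo: the function names `chunk_boundaries` and `ffmpeg_command` are hypothetical, and it assumes `ffmpeg` is the backend and that you already know the video duration in seconds.

```python
import math
from typing import List, Tuple


def chunk_boundaries(duration: float, chunk_seconds: float) -> List[Tuple[float, float]]:
    """Return (start, length) pairs that cover the video in chunk_seconds steps.

    Hypothetical helper for illustration; the last chunk is shorter when
    the duration is not an exact multiple of chunk_seconds.
    """
    if duration <= 0 or chunk_seconds <= 0:
        raise ValueError("duration and chunk_seconds must be positive")
    count = math.ceil(duration / chunk_seconds)
    return [
        (i * chunk_seconds, min(chunk_seconds, duration - i * chunk_seconds))
        for i in range(count)
    ]


def ffmpeg_command(src: str, start: float, length: float, dst: str) -> List[str]:
    """Build an ffmpeg argument list that copies one chunk without re-encoding.

    Assumes ffmpeg is installed; this function only builds the command,
    it does not run it.
    """
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",   # seek to the chunk start
        "-i", src,
        "-t", f"{length:.3f}",   # keep this many seconds
        "-c", "copy",            # stream copy, no re-encode
        dst,
    ]
```

Each command could then be passed to `subprocess.run(cmd, check=True)`. Because stream copy can only cut on keyframes, chunk boundaries produced this way are approximate; re-encoding instead of `-c copy` trades speed for exact cuts.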

🛠️ Getting Started

# Clone the repo
git clone https://github.com/mixpeek/multimodal-tools.git
cd multimodal-tools

# Pick a tool and follow its README
cd segment_transcript_by_topic
pip install -r requirements.txt
python segment_transcript.py --input path/to/video.mp4
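
If you are writing a new tool and want it to accept the same `--input` flag shown above, a minimal sketch of the entry point could look like this. The parser layout and the `build_parser` and `main` names are assumptions for illustration; check each tool's README for the flags it actually supports.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with a single required --input flag (assumed layout)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical entry point for a standalone media tool"
    )
    parser.add_argument(
        "--input",
        required=True,
        type=Path,
        help="Path to the video or audio file to process",
    )
    return parser


def main(argv: Optional[List[str]] = None) -> Path:
    """Parse arguments and return the input path; real work would start here."""
    args = build_parser().parse_args(argv)
    return args.input
```

Passing `argv` explicitly keeps the function easy to test; when it is left as `None`, `argparse` reads the arguments from the command line as usual.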

🚀 Why use these?

Multimodal content is everywhere, but tooling is scattered. This repo brings together focused scripts with light dependencies to help you get things done without setting up complex pipelines.

Ideal for:

  • Prototyping and experimentation
  • Content analysis workflows
  • ML/AI feature extraction
  • Exploring retrieval use cases

🔌 Looking for hosted feature extractors?

If you want to scale beyond local scripts, Mixpeek offers managed, production-ready multimodal extractors (video, image, audio, and more) you can plug into your stack.


🤝 Contributing

Want to add a new tool or improve an existing one? PRs welcome.
