A reproduction of Google's Gemini demo built on GPT-4 Vision. It supports real-time AI interaction over live video input.
This project explores whether genuinely real-time AI interaction is feasible with OpenAI's GPT-4 Vision API. It captures video frames and processes them as they arrive, so you can talk naturally with the AI about what it sees.
Visit the live demo at: https://gpt-video-jidefr.vercel.app
Note: You'll need an OpenAI API key to use the demo. For security reasons, please remember to remove your API key after use.
- Frontend Framework: Next.js with App Router
- AI Integration: Vercel AI SDK
- APIs: OpenAI's Whisper and GPT-4 Vision
- Audio Processing: Silence-aware recorder
- Media Handling: Custom media recorder hooks
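The silence-aware recorder in the stack above decides when the speaker has finished talking. The sketch below illustrates the general idea only; it is not the API of the `silence-aware-recorder` package. The helper name `findSilenceEnd` and the sample shape are hypothetical, and it assumes the duration is in milliseconds and the threshold in decibels.

```javascript
// Hypothetical helper illustrating silence detection; not the package's API.
const SILENCE_DURATION = 2500; // assumed ms of continuous quiet before stopping
const SILENT_THRESHOLD = -30; // assumed dB; at or below this counts as silence

/**
 * samples: array of { time, volume } in ascending time order
 * (time in ms, volume in dB). Returns the time at which silence
 * has lasted `duration`, or null if that never happens.
 */
function findSilenceEnd(
  samples,
  duration = SILENCE_DURATION,
  threshold = SILENT_THRESHOLD
) {
  let silenceStart = null;
  for (const { time, volume } of samples) {
    if (volume > threshold) {
      silenceStart = null; // speech resets the timer
    } else {
      if (silenceStart === null) silenceStart = time;
      if (time - silenceStart >= duration) return time;
    }
  }
  return null;
}
```

A recorder built this way would stop and send the audio for transcription as soon as `findSilenceEnd` returns a non-null time.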
- Clone the repository:
git clone https://github.com/4mohdisa/Gemini-GPT4V.git
cd Gemini-GPT4V
- Install dependencies:
npm install
# or
yarn install
- Set up environment variables:
Create a .env.local file with:
OPENAI_API_KEY=your_api_key_here
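Next.js loads `.env.local` into `process.env` on the server. As a minimal sketch, server-side code could fail fast when the key is missing or still set to the placeholder. The helper name `requireApiKey` is hypothetical and not part of this project.

```javascript
// Hypothetical helper; the project may read the key differently.
function requireApiKey(env = process.env) {
  const key = env.OPENAI_API_KEY;
  const missing =
    typeof key !== "string" ||
    key.trim() === "" ||
    key.trim() === "your_api_key_here"; // placeholder was never replaced
  if (missing) {
    throw new Error("OPENAI_API_KEY is missing; set it in .env.local");
  }
  return key.trim();
}
```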
gpt-video/
├── src/
│   └── app/
│       ├── api/
│       │   ├── chat/
│       │   ├── speechtotext/
│       │   ├── texttospeech/
│       │   └── upload/
│       ├── layout.js
│       └── page.js
- api/chat/: Handles GPT-4 Vision API interactions
- api/speechtotext/: Processes audio to text using Whisper
- api/texttospeech/: Converts text responses to speech
- api/upload/: Manages image upload and processing
Key constants in /src/app/page.js:
const INTERVAL = 250; // Screenshot interval (ms)
const IMAGE_WIDTH = 512; // Image width for processing (px)
const IMAGE_QUALITY = 0.6; // JPEG quality (0 to 1)
const COLUMNS = 4; // Grid columns
const MAX_SCREENSHOTS = 60; // Maximum screenshots per grid
const SILENCE_DURATION = 2500; // Silence detection duration (ms)
const SILENT_THRESHOLD = -30; // Silence threshold (dB)
Dependencies (from package.json):
{
"dependencies": {
"ai": "^2.2.28",
"merge-images": "^2.0.0",
"next": "14.0.4",
"openai": "^4.21.0",
"react": "^18",
"react-dom": "^18",
"silence-aware-recorder": "^1.0.4",
"@wmik/use-media-recorder": "^1.6.5-beta.0",
"use-camera": "^0.2.4",
"uuid": "^9.0.1"
}
}
- Start the development server:
npm run dev
# or
yarn dev
# or
pnpm dev
# or
bun dev
- Open http://localhost:3000 in your browser
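The constants in /src/app/page.js determine how much video fits into one request. The sketch below shows one way captured frames could be laid out in a grid of `{ x, y }` offsets, the shape that `merge-images` accepts alongside each `src`. The function `gridLayout` and the `frameHeight` parameter are hypothetical, and it assumes `INTERVAL` is in milliseconds and `IMAGE_WIDTH` in pixels; the project's actual layout code may differ.

```javascript
// Values copied from the constants listed in this README.
const INTERVAL = 250;
const IMAGE_WIDTH = 512;
const COLUMNS = 4;
const MAX_SCREENSHOTS = 60;

// Hypothetical helper: compute grid size and per-frame offsets.
function gridLayout(
  frameCount,
  frameHeight,
  columns = COLUMNS,
  frameWidth = IMAGE_WIDTH
) {
  const count = Math.min(frameCount, MAX_SCREENSHOTS); // cap the grid size
  const rows = Math.ceil(count / columns);
  const positions = [];
  for (let i = 0; i < count; i++) {
    positions.push({
      x: (i % columns) * frameWidth, // column offset
      y: Math.floor(i / columns) * frameHeight, // row offset
    });
  }
  return {
    width: Math.min(count, columns) * frameWidth,
    height: rows * frameHeight,
    positions,
  };
}

// Longest stretch of video a single grid can cover, in seconds.
const maxSeconds = (MAX_SCREENSHOTS * INTERVAL) / 1000;
```

With the defaults, a full grid holds 60 frames in 15 rows of 4 and spans 15 seconds of video.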
api/chat/:
- Handles GPT-4 Vision interactions
- Processes image grids and generates responses
- Supports multilingual outputs
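As an illustration of the chat route's job, the sketch below builds a Chat Completions request body that pairs the transcribed question with an image grid, using OpenAI's text and `image_url` content parts. The function name `buildVisionRequest`, the system prompt wording, the `max_tokens` value, and the model name are assumptions, not taken from this project's code.

```javascript
// Hypothetical helper: builds the JSON body for a vision chat request.
function buildVisionRequest(imageUrl, transcript, lang = "English") {
  return {
    model: "gpt-4-vision-preview", // assumed model name; adjust to your account
    max_tokens: 300, // assumed cap to keep spoken answers short
    messages: [
      {
        role: "system",
        // Illustrative prompt; the language hint enables multilingual output.
        content: `The image is a grid of video frames in time order. Answer briefly in ${lang}.`,
      },
      {
        role: "user",
        content: [
          { type: "text", text: transcript },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  };
}
```

The `imageUrl` can be a public URL or a base64 data URL of the merged grid.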
api/speechtotext/:
- Uses OpenAI's Whisper for audio transcription
- Supports multiple languages
- Handles audio file processing
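A transcription request to OpenAI is a multipart form with a `file`, a `model`, and an optional `language` code. The sketch below builds such a form without sending it. The function name `buildTranscriptionForm` and the `audio.webm` filename are hypothetical, and it assumes a runtime with global `FormData` and `Blob` (Node 18 or later, or a browser).

```javascript
// Hypothetical helper: prepares the multipart body for
// POST https://api.openai.com/v1/audio/transcriptions
function buildTranscriptionForm(audioBlob, language) {
  const form = new FormData();
  form.append("file", audioBlob, "audio.webm"); // assumed recording format
  form.append("model", "whisper-1");
  if (language) {
    form.append("language", language); // ISO-639-1 code such as "en" or "fr"
  }
  return form;
}
```

Omitting `language` lets Whisper detect the spoken language itself, at some cost in accuracy and latency.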
api/texttospeech/:
- Converts text responses to speech
- Uses OpenAI's TTS API
- Supports customizable voice and speed
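To illustrate the voice and speed options, the sketch below builds a request body for OpenAI's speech endpoint. The function name `buildSpeechRequest` is hypothetical, and the voice list covers only the six voices originally offered with `tts-1`; your account may offer others.

```javascript
// Voices originally available for tts-1; newer voices may exist.
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];

// Hypothetical helper: builds the JSON body for
// POST https://api.openai.com/v1/audio/speech
function buildSpeechRequest(text, { voice = "alloy", speed = 1.0 } = {}) {
  if (!VOICES.includes(voice)) {
    throw new RangeError(`Unknown voice: ${voice}`);
  }
  // The API accepts speeds from 0.25 to 4.0; clamp anything outside.
  const clamped = Math.min(4.0, Math.max(0.25, speed));
  return { model: "tts-1", input: text, voice, speed: clamped };
}
```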
api/upload/:
- Handles image upload and processing
- Generates unique IDs for files
- Supports base64 image data
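The upload steps above can be sketched as a function that validates a base64 data URL, decodes it, and assigns a unique filename. The function name `parseImageUpload` is hypothetical. The project lists the `uuid` package; to stay dependency-free, this sketch takes the ID generator as a parameter and defaults to the Web Crypto `randomUUID` as a stand-in.

```javascript
// Hypothetical helper: decode a base64 image data URL and name the file.
function parseImageUpload(
  dataUrl,
  generateId = () => globalThis.crypto.randomUUID() // stand-in for uuid's v4()
) {
  const pattern = /^data:image\/(png|jpeg|webp);base64,([A-Za-z0-9+/]+={0,2})$/;
  const match = pattern.exec(dataUrl);
  if (!match) {
    throw new TypeError("Expected a base64 image data URL");
  }
  const [, format, payload] = match;
  const extension = format === "jpeg" ? "jpg" : format;
  return {
    filename: `${generateId()}.${extension}`,
    bytes: Buffer.from(payload, "base64"), // raw image bytes, ready to store
  };
}
```

In the route itself, `generateId` could be `v4` imported from `uuid`.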
Contributions are welcome! Please feel free to submit pull requests or open issues for improvements and bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.
