If you find this project helpful, consider buying me a coffee:
A ComfyUI custom node that integrates Google's Gemini 2.0 Flash Experimental model, enabling multimodal analysis of text, images, video frames, and audio directly within ComfyUI workflows. Now with image generation capabilities!
- Multimodal input support:
- Text analysis
- Image analysis
- Video frame analysis
- Audio analysis
- NEW! Image Generation using gemini-2.0-flash-exp-image-generation model
- Chat mode with conversation history
- Voice chat with a smart Audio Recorder node
- Structured output option
- Temperature and token limit controls
- Proxy support
- Configurable API settings via config.json
Install via ComfyUI Manager
or
Clone this repository into your ComfyUI custom_nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-Gemini_Flash_2.0_Exp.git
Install required dependencies:
# Install BOTH packages (both are required)
pip install google-genai
pip install google-generativeai
# OR
python -m pip install google-genai
python -m pip install google-generativeai
# Other dependencies
pip install pillow
pip install torchaudio
For Ubuntu/Debian-based systems:
sudo apt-get install libportaudio2
Get your free API key from Google AI Studio:
- Visit Google AI Studio
- Log in with your Google account
- Click on "Get API key" or go to settings
- Create a new API key
- Copy the API key for use in config.json
Set up your API key in the config.json file (it will be created automatically on first run). Alternatively, create config.json yourself in the node's main folder:
{
"GEMINI_API_KEY": "your_api_key_here"
}
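The key-loading logic can be sketched roughly as follows. This is a minimal illustration, not the node's actual code: the `load_api_key` helper and the exact first-run fallback behavior are assumptions; only the `GEMINI_API_KEY` field name comes from the config format above.

```python
import json
import os

def load_api_key(config_path="config.json"):
    """Read GEMINI_API_KEY from config.json, creating a stub file if missing."""
    if not os.path.exists(config_path):
        # First run: write a placeholder so the user knows where to paste the key
        with open(config_path, "w") as f:
            json.dump({"GEMINI_API_KEY": ""}, f, indent=2)
        return ""
    with open(config_path) as f:
        return json.load(f).get("GEMINI_API_KEY", "")
```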
- prompt: Main text prompt for analysis or generation
- input_type: Select from ["text", "image", "video", "audio"]
- model_version: Select model including the new image generation model
- operation_mode: Select between "analysis" or "generate_images" mode
- chat_mode: Boolean to enable/disable chat functionality
- clear_history: Boolean to reset chat history
- Additional_Context: Additional text input for context
- images: Multiple image inputs (IMAGE type with list=True)
- video: Video frame sequence input (IMAGE type)
- audio: Audio input (AUDIO type)
- api_key: Directly enter your API key (recommended for WSL/Ubuntu)
- max_output_tokens: Set maximum output length (1-8192)
- temperature: Control response randomness (0.0-1.0)
- structured_output: Enable structured response format
- max_images: Maximum number of images to process (1-16)
- batch_count: Number of images to generate (for image generation mode)
- seed: Random seed for reproducible image generation
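The documented ranges can be enforced with simple clamping before a request is built. The helper name and default values below are illustrative assumptions; the ranges themselves come from the parameter list above.

```python
def clamp_params(temperature=0.4, max_output_tokens=1024, max_images=4):
    """Clamp user-supplied values to the documented parameter ranges."""
    return {
        "temperature": min(max(temperature, 0.0), 1.0),             # 0.0-1.0
        "max_output_tokens": min(max(max_output_tokens, 1), 8192),  # 1-8192
        "max_images": min(max(max_images, 1), 16),                  # 1-16
    }
```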
Text Input Node -> Gemini Flash Node [input_type: "text", operation_mode: "analysis"]
Load Image Node -> Gemini Flash Node [input_type: "image", operation_mode: "analysis"]
Load Video Node -> Gemini Flash Node [input_type: "video", operation_mode: "analysis"]
Load Audio Node -> Gemini Flash Node [input_type: "audio", operation_mode: "analysis"]
Text Input Node -> Gemini Flash Node [model_version: "gemini-2.0-flash-exp-image-generation", operation_mode: "generate_images"]
Load Image Node -> Gemini Flash Node [model_version: "gemini-2.0-flash-exp-image-generation", operation_mode: "generate_images"]
Chat mode maintains conversation history and provides a more interactive experience:
- Enable chat mode by setting chat_mode: true
- Chat history format:
=== Chat History ===
USER: your message
ASSISTANT: Gemini's response
=== End History ===
- Use clear_history: true to start a new conversation
- Chat history persists between calls until cleared
- Works with all input types (text, image, video, audio)
- History is displayed in the output
- Maintains context across multiple interactions
- Clear history when switching topics
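Assembling the displayed history block can be sketched like this (the function name and the `(role, text)` tuple representation are illustrative; the output format matches the one shown above):

```python
def format_history(history):
    """Render a list of (role, text) turns in the chat history format."""
    lines = ["=== Chat History ==="]
    for role, text in history:
        lines.append(f"{role.upper()}: {text}")
    lines.append("=== End History ===")
    return "\n".join(lines)
```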
When processing videos:
- Automatically samples frames evenly throughout the video
- Resizes frames for efficient processing
- Works with both chat and non-chat modes
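Even frame sampling can be sketched as below. This is an assumption about the approach (picking evenly spaced indices across the clip), not the node's actual implementation:

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick evenly spaced frame indices spanning the whole clip."""
    if total_frames <= num_samples:
        return list(range(total_frames))
    if num_samples <= 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]
```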
The new image generation capabilities allow you to:
- Generate images from text descriptions
- Generate variations based on reference images
- Control the generation with seed and temperature parameters
- Generate multiple images with batch_count
- For best results, use the "gemini-2.0-flash-exp-image-generation" model
- Use "generate_images" operation mode
- Provide clear, detailed prompts for better results
- Connect reference images for style guidance
- Use seed parameter for reproducible results
- On Windows, both config file and GUI methods work well
- On Ubuntu/WSL, entering the API key directly in the GUI is more reliable
- If using lowercase filenames on Ubuntu (e.g., gemini_flash_node.py instead of Gemini_Flash_Node.py), the node will still work properly
- If you get "400 Bad Request" errors, try entering your API key directly in the GUI
- Make sure binary data (images, audio) is properly base64 encoded
- Check network connectivity and proxy settings
- Ensure proper file permissions for config files
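Base64-encoding binary data for the request can be sketched as follows. The dict shape loosely mirrors the inline-data parts the Gemini API accepts, but the helper name and exact field names here are illustrative assumptions:

```python
import base64

def encode_binary(data, mime_type):
    """Wrap raw image/audio bytes as a base64 inline-data part (field names illustrative)."""
    return {
        "mime_type": mime_type,
        "data": base64.b64encode(data).decode("ascii"),
    }
```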
The node provides clear error messages for common issues:
- Invalid API key
- Rate limit exceeded
- Invalid input formats
- Network/proxy issues
Default rate limits (from config.json):
- 10 requests per minute (RPM_LIMIT)
- 4 million tokens per minute (TPM_LIMIT)
- 1,500 requests per day (RPD_LIMIT)
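The RPM limit above can be enforced client-side with a sliding-window counter. A minimal sketch, assuming a simple deque of request timestamps (not the node's actual throttling code):

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window request limiter, e.g. for RPM_LIMIT."""

    def __init__(self, max_requests=10, window_seconds=60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self, now=None):
        """Return True if a request may be sent now, recording it if so."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```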
The package includes two nodes for audio handling:
- Audio Recorder Node: Smart audio recording with silence detection
- Gemini Flash Node: Audio content analysis
- Live microphone recording with automatic silence detection
- Smart recording termination after detecting silence
- Configurable silence threshold and duration
- Compatible with most input devices
- Visual recording status indicator (10-second auto-reset)
- Seamless integration with Gemini Flash analysis
Audio Recorder Node -> Gemini Flash Node [input_type: "audio"]
- device: Select input device (microphone)
- sample_rate: Audio quality setting (default: 44100 Hz)
- silence_threshold: Sensitivity for silence detection (0.001-0.1)
- silence_duration: Required silence duration to stop recording (0.5-5.0 seconds)
- Record Button:
- Click to start recording
- Records until silence is detected
- Button resets after 10 seconds automatically
- Visual feedback during recording (red indicator)
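The silence-detection behavior described above can be sketched with an RMS check per audio chunk. This is an illustrative model of the logic, assuming float samples in [-1, 1]; the function name and chunk-based structure are assumptions:

```python
import math

def detect_stop(chunks, sample_rate=44100, silence_threshold=0.01,
                silence_duration=2.0):
    """Return the index of the chunk at which recording would stop.

    A chunk counts as silent when its RMS falls below silence_threshold;
    recording stops once silence persists for silence_duration seconds.
    Returns None if silence never lasts long enough.
    """
    silent_time = 0.0
    for i, chunk in enumerate(chunks):
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        chunk_seconds = len(chunk) / sample_rate
        silent_time = silent_time + chunk_seconds if rms < silence_threshold else 0.0
        if silent_time >= silence_duration:
            return i
    return None
```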
- Add Audio Recorder node to your workflow
- Connect it to Gemini Flash node
- Configure recording settings:
- Choose input device
- Adjust silence detection parameters
- Set sample rate if needed
- Click "Start Recording" to begin
- Speak your message
- Recording automatically stops after detecting silence
- The recorded audio is processed and sent to Gemini for analysis
- Recording button resets after 10 seconds, ready for next recording
Audio Recorder Node [silence_duration: 2.0, silence_threshold: 0.01] ->
Gemini Flash Node [input_type: "audio", prompt: "Transcribe and analyze this audio"]
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
MIT License
- Google's Gemini API
- ComfyUI Community
- All contributors
Note: This node is experimental and based on Gemini 2.0 Flash Experimental model. Features and capabilities may change as the model evolves.