The kit integrates image creation with generative AI, voice activity detection (VAD), automatic speech recognition (ASR), large language models (LLMs), and natural language processing (NLP). A live voice-transcription pipeline feeds an LLM, which decides whether the user is describing a scene for an adventure game. When the LLM detects a new scene, it produces a detailed text prompt suitable for stable diffusion, which the application then uses to generate an illustration. Built on the OpenVINO™ GenAI framework, this kit demonstrates the text2image, LLM pipeline, and Whisper speech2text APIs.
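For orientation, here is a minimal sketch of the three OpenVINO GenAI pipelines involved; the model directory names, devices, and generation parameters below are illustrative placeholders, not the kit's actual configuration:

# Minimal sketch of the three OpenVINO GenAI pipelines this kit combines.
# Paths, devices, and parameters are placeholders, not the kit's actual settings.
import openvino_genai as ov_genai

# Speech-to-text: transcribe a chunk of 16 kHz mono audio samples.
raw_speech = [0.0] * 16000  # one second of silence; replace with real microphone samples
whisper = ov_genai.WhisperPipeline("models/whisper-base", "CPU")
transcript = whisper.generate(raw_speech).texts[0]

# LLM: decide whether the transcript describes a new scene and, if so,
# turn it into a detailed stable diffusion prompt.
llm = ov_genai.LLMPipeline("models/llama-3-8b-instruct", "GPU")
image_prompt = str(llm.generate(transcript, max_new_tokens=128))

# Text-to-image: illustrate the scene with a latent consistency model.
t2i = ov_genai.Text2ImagePipeline("models/lcm-dreamshaper-v7", "GPU")
image_tensor = t2i.generate(image_prompt, width=512, height=512, num_inference_steps=4)

The application wires these stages together behind a GUI and adds super resolution and depth estimation on top.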
This kit uses the following technology stack:
- OpenVINO Toolkit (docs)
- OpenVINO GenAI
- Whisper
- Llama3-8b-Instruct
- Single Image Super Resolution
- Latent Consistency Models
- Depth Anything V2
Check out our AI Reference Kits repository for other kits.
Contributors: Ryan Metcalfe, Garth Long, Arisha Kumar, Ria Cheruvu, Paula Ramos, Dmitriy Pastushenkov, Zhuo Wu, and Raymond Lo.
New updates will be added here.
Table of Contents
Now, let's dive into the steps starting with installing Python.
Star the repository (optional, but recommended :))
This project requires Python 3.10 or higher and a few libraries. If you don't have Python installed on your machine, go to https://www.python.org/downloads/ and download the latest version for your operating system. Follow the prompts to install Python, making sure to check the option to add Python to your PATH environment variable.
Install libraries and tools:
If you're using Ubuntu, install required dependencies like this:
sudo apt install git git-lfs gcc python3-venv python3-dev portaudio19-dev
NOTE: If you are using Windows, you will probably also need to install the Microsoft Visual C++ Redistributable.
To clone the repository, run the following command:
git clone https://github.com/openvinotoolkit/openvino_build_deploy.git
The above will clone the repository into a directory named "openvino_build_deploy" in the current directory. Then, navigate into the directory using the following command:
cd openvino_build_deploy/ai_ref_kits/multimodal_ai_visual_generator
Next, create a virtual environment, activate it, and install the required dependencies for setting up and running the project.
Linux:
python3 -m venv run_env
source run_env/bin/activate
pip install -r requirements.txt
Windows:
python -m venv run_env
run_env/Scripts/activate
pip install -r requirements.txt
Next, you'll download and optimize the required models by running a download script:
- Whisper: Speech recognition
- Llama3-8b-instruct: Intelligent LLM helper
- Latent Consistency Models: Image generation
- Super Resolution: Increase the resolution of the generated image
- Depth Anything v2: Create 3D parallax animations
To run the download script:
python3 download_and_prepare_models.py
cd ..
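Under the hood, the models are exported to the OpenVINO format. If you ever need to (re)export one manually, the optimum-cli exporter can do it, assuming optimum-intel is installed in your environment; for example, a command along these lines converts and int4-quantizes the LLM (the output directory name is only an example, and the Llama 3 weights require Hugging Face access):

optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 models/llama-3-8b-instruct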
To interact with the animated GIF outputs, you'll host a simple web server on your system. To do so, install Node.js (via its Downloads page) and http-server.
Run the following command to start an HTTP server within the repository. You can customize index.html with any additional elements you'd like.
http-server -c10
Open a terminal (or reuse the existing one with the run_env environment activated) and start the GUI:
python app.py
This theme is passed as part of the system message to the LLM and helps it make a more educated decision about whether or not you are describing a scene for your story.
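As a rough illustration of how the theme can steer that decision (the actual prompt text lives in the application code and will differ), a system message might be assembled along these lines:

# Illustrative only: the real system prompt in the application differs.
import openvino_genai as ov_genai

theme = "a medieval fantasy adventure"
system_message = (
    f"You are helping illustrate an interactive story with the theme: {theme}. "
    "If the user's latest utterance describes a new scene in that story, reply with a "
    "detailed stable diffusion prompt for it. Otherwise reply with exactly: IGNORE."
)

transcript = "You find yourself at the gates of a large, abandoned castle."
llm = ov_genai.LLMPipeline("models/llama-3-8b-instruct", "GPU")  # example path
reply = str(llm.generate(system_message + "\nUser: " + transcript, max_new_tokens=128))
if reply.strip() != "IGNORE":
    print("New scene prompt:", reply)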
The start button will activate the listening state (Voice Activity Detection & Whisper Transcription pipelines) on the system's default input device (microphone).
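If you want to see what that listening stage boils down to, here is a minimal sketch that records a fixed-length chunk from the default microphone and transcribes it. It assumes the sounddevice package; the app's own VAD-driven loop is more involved and may use a different audio library:

# Sketch only: records a fixed 5-second chunk instead of using VAD.
import sounddevice as sd           # PortAudio-based, hence portaudio19-dev above
import openvino_genai as ov_genai

SAMPLE_RATE = 16000                # Whisper expects 16 kHz mono audio
SECONDS = 5

audio = sd.rec(SECONDS * SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
sd.wait()                          # block until recording finishes

whisper = ov_genai.WhisperPipeline("models/whisper-base", "CPU")  # example model path
result = whisper.generate(audio.flatten().tolist())
print(result.texts[0])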
Go ahead and describe a scene for your story. For example, "You find yourself at the gates of a large, abandoned castle."
The scene you just described will be passed to the LLM, which should detect it as a new scene in your story. The detailed prompt generated by the LLM will show up in real time in the UI caption box, followed soon after by the illustration generated by the stable diffusion pipeline.
You can test the intelligence of the LLM helper by saying something not relevant to the story. For example, "Hey guys, do you think we should order a pizza?" You should find that the LLM decides to disregard this and does not try to illustrate anything.
To interact with the 3D hoverable parallax animation created from the depth maps, start an HTTP server as explained above.
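For reference, the parallax effect is driven by a per-pixel depth map of the illustration. A simplified sketch of producing one with an OpenVINO-converted Depth Anything V2 model might look like this; the file names, the 518x518 input size, and the minimal preprocessing are assumptions, and the real pipeline normalizes inputs per the model's requirements:

# Simplified sketch: estimate a depth map for the generated illustration.
import cv2
import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("models/depth_anything_v2_small.xml", "GPU")  # example path

img = cv2.imread("illustration.png")
inp = cv2.resize(img, (518, 518)).astype(np.float32) / 255.0   # real pipeline also normalizes
inp = inp.transpose(2, 0, 1)[np.newaxis]                        # HWC -> NCHW

depth = compiled(inp)[0].squeeze()                               # relative depth, larger = closer
depth = (255 * (depth - depth.min()) / (depth.max() - depth.min() + 1e-6)).astype(np.uint8)
cv2.imwrite("depth_map.png", cv2.resize(depth, (img.shape[1], img.shape[0])))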
- Feel free to modify main.py to select different OpenVINO devices for the LLM, stable diffusion pipeline, Whisper, etc. Look toward the bottom of the script for a section that looks like this:

if __name__ == "__main__":
    app = QApplication(sys.argv)

    llm_device = 'GPU'
    sd_device = 'GPU'
    whisper_device = 'CPU'
    super_res_device = 'GPU'
    depth_anything_device = 'GPU'
If you're running on an Intel Core Ultra Series 2 laptop and you want to set llm_device = 'NPU', be sure to have the latest Intel NPU driver installed.
Based on the resolution of your display, you may want to tweak the default resolution of the illustrated image, as well as caption font size. To adjust the resolution of the illustrated image, look for and modify this line:
self.image_label.setFixedSize(1216, 684)
It's recommended to choose a 16:9 ratio resolution. You can find a convenient list here.
The caption font size can be adjusted by modifying this line:
fantasy_font = QFont("Papyrus", 18, QFont.Bold)
- Learn more about OpenVINO
- Explore OpenVINO’s documentation