Multimodal AI Assistant Playground

This project is a playground for building a Multimodal AI Assistant with a Flask backend and a React frontend. The project currently supports real-time messaging, PDF file uploads for context, and speech (both text-to-speech/speech-to-text).

Features

Real-time messaging
PDF File Upload for context
Speech (both text-to-speech/speech-to-text)
System Prompt defined for AI Assistant
Conversation context stored in session cookie
Visual feedback with thinking dots during AI response generation
AI assistant's responses are typed as a human would do it, enhancing the human touch and usability

Tech Stack

Backend: Python, Flask, Redis, Azure OpenAI GPT-4 Omni
Frontend: React, Tailwind CSS, TypeScript
Speech: Microsoft Cognitive Services Speech SDK

Environment Variables

Create a .env file with the following variables:

FLASK_ENV=development
CHOKIDAR_USEPOLLING=true
AZURE_OPENAI_API_KEY=your_azure_openai_api_key
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
AZURE_SPEECH_KEY=your_azure_speech_key
AZURE_SPEECH_REGION=your_azure_speech_region
REDIS_HOST=your_redis_host (default: localhost)
REDIS_PORT=your_redis_port (default: 6379)
REACT_APP_BACKEND_URL=http://localhost:5000
REACT_APP_AZURE_SPEECH_KEY=your_azure_speech_key
REACT_APP_AZURE_SPEECH_REGION=your_azure_speech_region

Project Structure

multimodal-ai-assistant
├── assets
│   └── ai-assistant-screenshot.png
├── backend
│   ├── app
│   │   ├── __init__.py
│   │   ├── chat.py
│   │   ├── config.py
│   │   ├── file.py
│   │   ├── speech.py
│   └── requirements.txt
├── frontend
│   ├── public
│   │   ├── index.html
│   └── src
│       ├── components
│       │   ├── Chat.tsx
│       │   ├── ChatInput.tsx
│       │   ├── Message.tsx
│       │   ├── FileUpload.tsx
│       │   └── Chat.css
│       ├── hooks
│       │   └── useSpeech.ts
│       ├── services
│       │   └── api.ts
│       ├── App.tsx
│       ├── index.tsx
│       ├── index.css
│       └── setupProxy.js
├── docker-compose.yml
├── .env
└── README.md

Backend Setup

Navigate to the backend directory.
Create a virtual environment:
```
python -m venv venv
```
Activate the virtual environment:
- On Windows:
```
venv\Scripts\activate
```
- On macOS/Linux:
```
source venv/bin/activate
```
Install the required dependencies:
```
pip install -r requirements.txt
```
Run the Flask application:
```
python run.py
```

Frontend Setup

Navigate to the frontend directory.
Install the required dependencies:
```
npm install
```
Start the React application:
```
npm start
```

Running with Docker

Ensure Docker is installed and running on your machine.
Navigate to the root directory of the project.
Build and start the services using Docker Compose:
```
docker-compose up --build
```

System Prompt

The AI Assistant is initialized with a system prompt that defines its behavior and response style. This prompt ensures that the assistant provides concise and actionable insights during conversations.

Conversation Context

The AI Assistant maintains the conversation context by storing the conversation history in a session cookie. This allows the assistant to provide coherent and contextually relevant responses throughout the interaction.

Speech-to-Text and Text-to-Speech Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    participant AzureSpeech
    participant OpenAI

    User->>Frontend: Voice Input
    Frontend->>AzureSpeech: Recognize Speech
    AzureSpeech->>Frontend: Speech to Text
    Frontend->>Backend: Send Text Message
    Backend->>OpenAI: Send Message to OpenAI
    OpenAI->>Backend: OpenAI Response
    Backend->>Frontend: Send Text Response
    Frontend->>AzureSpeech: Synthesize Speech
    AzureSpeech->>Frontend: Text to Speech
    Frontend->>User: Audio Output

PDF Upload Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    participant AzureOpenAI

    User->>Frontend: Upload PDF
    Frontend->>Backend: Send PDF File
    Backend->>Backend: Extract Text from PDF
    Backend->>AzureOpenAI: Send Text to OpenAI
    AzureOpenAI->>Backend: OpenAI Response
    Backend->>Frontend: Send Response
    Frontend->>User: Display Response

License

This project is open-source and available under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.devcontainer		.devcontainer
assets		assets
backend		backend
frontend		frontend
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
image.png		image.png
package-lock.json		package-lock.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Multimodal AI Assistant Playground

Features

Tech Stack

Environment Variables

Project Structure

Backend Setup

Frontend Setup

Running with Docker

System Prompt

Conversation Context

Speech-to-Text and Text-to-Speech Flow

PDF Upload Flow

License

About

Uh oh!

Releases

Packages

Languages

License

karoldavid/multimodal-ai-assistant

Folders and files

Latest commit

History

Repository files navigation

Multimodal AI Assistant Playground

Features

Tech Stack

Environment Variables

Project Structure

Backend Setup

Frontend Setup

Running with Docker

System Prompt

Conversation Context

Speech-to-Text and Text-to-Speech Flow

PDF Upload Flow

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages