This project explores multimodal AI by integrating audio transcription, NLP, and finance-oriented data processing. Using OpenAI Whisper for speech-to-text, Sentence Transformers for semantic similarity, and LangChain with OpenAI models for analysis, it extracts insights from three data types: text, audio, and PDFs.
- OpenAI Whisper – Speech-to-text transcription
- LangChain & OpenAI APIs – NLP and LLM-based analysis
- Sentence Transformers – Semantic similarity search
- PDF Processing – Extracting text and images from financial documents
- PyTorch & scikit-learn – Machine learning utilities
- 🔊 Audio Transcription: Convert spoken content into text (see the Whisper sketch below)
- 📄 PDF Parsing: Extract structured data from documents (see the pdf2image sketch below)
- 🔍 Semantic Search: Identify key insights using embeddings (see the Sentence Transformers sketch below)
- 📈 Finance-Oriented Analysis: Apply ML/NLP techniques to financial data (see the LangChain sketch below)
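As a rough illustration of the transcription step, here is a minimal openai-whisper sketch; the model size and audio filename are placeholders, not files from this repository:

```python
import whisper

# Load a small Whisper checkpoint; larger models ("medium", "large") are more
# accurate but slower. The model choice here is illustrative.
model = whisper.load_model("base")

# "earnings_call.mp3" is a hypothetical input file.
result = model.transcribe("earnings_call.mp3")
print(result["text"])
```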
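For PDF parsing, the dependency list points to pdf2image (backed by poppler, installed in the setup commands below); this sketch renders the pages of a hypothetical report to images:

```python
from pdf2image import convert_from_path

# Render each page of a hypothetical PDF to a PIL image
# (requires the poppler-utils system package).
pages = convert_from_path("starbucks_report.pdf", dpi=200)
for i, page in enumerate(pages):
    page.save(f"page_{i + 1}.png", "PNG")
```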
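For semantic search, a small Sentence Transformers sketch; the model name and example passages are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical passages, e.g. chunks of a transcribed earnings call.
passages = [
    "Consolidated net revenues grew 11% year over year.",
    "The company opened 480 net new stores this quarter.",
    "Operating margin contracted due to higher labor costs.",
]
passage_embs = model.encode(passages, convert_to_tensor=True)

# Embed the query and retrieve the closest passage by cosine similarity.
query_emb = model.encode("How did revenue change?", convert_to_tensor=True)
hits = util.semantic_search(query_emb, passage_embs, top_k=1)[0]
print(passages[hits[0]["corpus_id"]], hits[0]["score"])
```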
The project uses Starbucks financial data in multiple modalities: audio, text, and PDF documents.
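Tying the modalities together, a hedged sketch of the analysis step with langchain-openai; the model name, prompt, and input text are assumptions, and an OPENAI_API_KEY environment variable is required:

```python
from langchain_openai import ChatOpenAI

# Model choice is illustrative; temperature=0 keeps summaries deterministic.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# "transcript_text" stands in for Whisper output or extracted PDF text.
transcript_text = "Consolidated net revenues grew 11% year over year..."
response = llm.invoke(
    "Summarize the key financial takeaways from this excerpt:\n" + transcript_text
)
print(response.content)
```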
```bash
# Python dependencies
pip install openai langchain langchain-openai langchain-community openai-whisper sentence-transformers pdf2image

# System dependency required by pdf2image (Debian/Ubuntu; may need sudo)
apt-get install poppler-utils

# Ensure a recent Pillow for image handling
pip install --upgrade Pillow
```
- Clone the repository:
```bash
git clone https://github.com/ctournas/multimodal-starbucks-finance.git
cd multimodal-starbucks-finance
```
- Install dependencies (see installation section above)
- Run the Jupyter notebook (e.g., launch `jupyter notebook` from the project root and open the included notebook)
This project is licensed under the MIT License. Feel free to modify and use it for your own projects!
Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.
For questions or collaborations, feel free to reach out via GitHub or email at [email protected].