This repository contains a comprehensive AI toolkit built as the capstone project for the "Building Generative AI-Powered Applications with Python" course by IBM on Coursera. It showcases the end-to-end development of a multi-functional AI application, from initial scripting in notebooks to a deployable, command-line accessible tool.
You can try the live application here: https://huggingface.co/spaces/cozisoul/ai-vision-toolkit
- **Interactive Web App (`app.py`):** A user-friendly Gradio web interface with two distinct modes for real-time interaction:
  - **Image Captioning:** Generates descriptive captions for uploaded images using the Salesforce BLIP model.
  - **Image Classification:** Identifies objects in uploaded images using a ResNet model, providing the top 3 predictions.
- **Powerful Command-Line Tool (`captioner_cli.py`):** An automated script for batch processing with two modes of operation:
  - **URL Mode:** Scrapes a given webpage, extracts all valid images, and generates captions.
  - **Local Mode:** Processes an entire local folder of images at once.
- **Deployment Ready:** Includes a `Dockerfile` for easy containerization, allowing the application to be deployed to any cloud environment.
*The interactive web application built with Gradio, showcasing its dual functionality.*
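As a rough sketch of how the captioning mode can be wired up with `transformers` (the checkpoint name here is illustrative; see `app.py` for the actual implementation):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and BLIP captioning model (checkpoint name is illustrative).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(image: Image.Image) -> str:
    """Generate a short descriptive caption for a PIL image."""
    inputs = processor(images=image.convert("RGB"), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```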
- **Python** with `argparse` for CLI implementation
- **Gradio** for the interactive web UI
- **Hugging Face `transformers`** for model loading and inference
- **PyTorch** as the deep learning backend
- **BeautifulSoup4** for HTML parsing and web scraping
- **Docker** for containerization
1. **Clone the Repository**

   ```bash
   git clone https://github.com/cozisoul/Captioning-Photos-with-Generative-AI.git
   cd Captioning-Photos-with-Generative-AI
   ```

2. **Create and Activate a Virtual Environment**

   ```bash
   # Create the environment
   python -m venv .venv

   # Activate (example for Git Bash on Windows)
   source .venv/Scripts/activate
   ```

3. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

**A. To Launch the Interactive Web App:**

```bash
python app.py
```

Navigate to the local URL in your browser (e.g., http://127.0.0.1:7860).
**B. To Use the Command-Line Tool:**

- **Caption images from a local folder:**
  1. Place images in the `images` directory.
  2. Run the command:

     ```bash
     python captioner_cli.py --local_dir images
     ```

  (Output is saved to `local_captions.txt`.)

- **Caption images from a URL:**

  ```bash
  python captioner_cli.py --url https://en.wikipedia.org/wiki/Cat
  ```

  (Output is saved to `url_captions.txt`.)
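The skeleton of the two CLI modes might look like the following sketch. The flag names match the usage above, but `build_parser` and `extract_image_urls` are illustrative helper names, and the real `captioner_cli.py` additionally fetches pages and runs the captioning model:

```python
import argparse
from bs4 import BeautifulSoup

def build_parser() -> argparse.ArgumentParser:
    """CLI with the two mutually exclusive modes described above."""
    parser = argparse.ArgumentParser(description="Batch image captioner")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--url", help="web page whose images should be captioned")
    mode.add_argument("--local_dir", help="folder of local images to caption")
    return parser

def extract_image_urls(html: str) -> list[str]:
    """Return absolute image URLs found in a page's <img> tags."""
    urls = []
    for img in BeautifulSoup(html, "html.parser").find_all("img"):
        src = img.get("src", "")
        if src.startswith("//"):       # protocol-relative URL
            src = "https:" + src
        if src.startswith("http"):     # skip relative paths and data: URIs
            urls.append(src)
    return urls
```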
This project was a deep dive into the full lifecycle of an AI application. The journey from simple scripts to a polished toolkit involved overcoming several real-world challenges, which were invaluable learning experiences.
- **Environment Management:** Encountered and resolved `ModuleNotFoundError` by strictly adhering to virtual environment best practices, ensuring all dependencies were installed in the isolated `.venv` and not globally.
- **Operating System Constraints:** Solved a critical `OSError` during installation caused by the "Windows Long Path" limitation. Lacking admin rights, the solution was to restructure the project into a shorter file path, a practical lesson in environment portability.
- **Dependency Nuances:** Debugged installation failures by learning the specific package names required (e.g., `beautifulsoup4` instead of `beautifulsoup`) and understanding that core libraries like `torch` don't automatically include extensions like `torchvision`.
- **Model Integration:** Gained hands-on experience using the Hugging Face `transformers` library to run powerful, pre-trained vision models like BLIP and ResNet.
- **Application Development:** Learned to use Gradio to rapidly build and launch a clean, multi-tab web interface, moving beyond simple scripts to user-facing applications.
- **DevOps & Best Practices:** This project was a practical lesson in essential developer operations:
  - **Version Control:** Using `git` and GitHub to meticulously track changes, manage features, and build a project portfolio.
  - **Reproducibility:** Creating and managing a `requirements.txt` file and a `Dockerfile` to ensure the project is easily reproducible and deployable by others.
  - **Systematic Problem-Solving:** Honed debugging skills by reading tracebacks carefully to identify the root cause (environment vs. code vs. OS) and applying targeted solutions.
- **Async Processing:** Convert the web scraping logic to be asynchronous to process images from a URL much faster.
- **Model Selection UI:** Allow the user to select different captioning models (e.g., BLIP vs. BLIP-2) from a dropdown in the Gradio app.
- **Enhanced Interactivity:** Add VQA (Visual Question Answering) to allow users to ask specific questions about an image.
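For the async idea, one standard-library-only sketch is to run blocking downloads concurrently in threads (a real rework might prefer `aiohttp`; the `fetch` parameter is injectable here purely for illustration and testing):

```python
import asyncio
from urllib.request import urlopen

def fetch(url: str) -> bytes:
    """Blocking download of a single image."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()

async def fetch_all(urls: list[str], fetch=fetch) -> list[bytes]:
    """Download all images concurrently by pushing blocking reads onto threads."""
    return await asyncio.gather(*(asyncio.to_thread(fetch, u) for u in urls))
```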
This project is licensed under the MIT License. See the LICENSE file for details.
- This project was completed as part of the Building Generative AI-Powered Applications with Python course offered by IBM on the Coursera platform.

