A Node.js Express server that generates cinematic narratives and multimedia content from images and videos. It uses Google's Gemini API to create album or vlog scripts with automatic scene generation, narration, and text-to-speech (TTS) integration.
- 📸 Asset Processing: Upload images and videos for analysis
- 🎬 Intelligent Narrative Generation: Uses Gemini AI to create contextual stories
- 🖼️ Collage Creation: Automatically groups and creates collages from media by location
- 🎙️ Voice Synthesis: Generates audio narration with PlayHT voice cloning (fallback to Google TTS)
- 🎨 Digital Annotations: Render facial annotations to HTML/PNG for genealogy context
- 🌍 Location Banners: Fetches relevant background images from Unsplash for each location
- 📤 Temporary File Storage: Uploads generated content to tmpfiles.org
- Node.js (v16+)
- Multer for file uploads
- Dependencies (see package.json)
npm install

Create a .env file in the root directory:
PORT=3000
GEMINI_API_KEY=<your-google-gemini-api-key>
GEMINI_MODEL=gemini-1.5-pro
UNSPLASH_API_ACCESS_KEY=<your-unsplash-api-key>

npm start
# or
node index.js

The server will start on http://localhost:3000
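
The settings from the .env file above could be read at startup roughly as follows. This is a minimal sketch, not the project's actual code: `loadConfig` is a hypothetical helper, and both the fallback for `GEMINI_MODEL` and the decision to fail fast on missing API keys are assumptions.

```javascript
// Hypothetical helper: maps the documented environment variables to a config object.
function loadConfig(env = process.env) {
  const config = {
    port: Number(env.PORT) || 3000, // 3000 matches the documented default URL
    geminiApiKey: env.GEMINI_API_KEY,
    geminiModel: env.GEMINI_MODEL || "gemini-1.5-pro", // assumed fallback
    unsplashAccessKey: env.UNSPLASH_API_ACCESS_KEY,
  };

  // Assumption: both API keys are required, so report missing ones at startup.
  const missing = ["geminiApiKey", "unsplashAccessKey"].filter((key) => !config[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required settings: ${missing.join(", ")}`);
  }
  return config;
}
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain object instead of the real `process.env`.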
Health check endpoint.
Response:
{
"message": "Welcome to the MemoMosaic Backend!"
}

Generate a complete multimedia script from uploaded media files and annotation face images.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Files:
  - assets (max 30 files): Media files (images/videos)
  - annotationFaces (max 50 files): Face images for annotations
- Fields:
  - payload: JSON string containing metadata
Form Fields:
Media file metadata (optional, use if structured metadata needed):
- assets[0].type : "IMAGE" or "VIDEO"
- assets[0].location : Location string (e.g., "Paris")
- assets[0].creation_time : ISO timestamp or date string
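
On the client side, the multipart body described above might be assembled as in the sketch below. It assumes Node 18+ (for the global `FormData` and `Blob`); `buildScriptRequest` and the shape of the `mediaFiles` objects are hypothetical names chosen for illustration.

```javascript
// Sketch only: assembles the multipart body for the script-generation request.
// `mediaFiles` is a hypothetical array of
// { name, data, mimeType, type?, location?, creationTime? } objects.
function buildScriptRequest(mediaFiles, payload) {
  const form = new FormData();

  mediaFiles.forEach((file, i) => {
    // Each media file goes into the repeated "assets" file field.
    form.append("assets", new Blob([file.data], { type: file.mimeType }), file.name);

    // Optional per-asset metadata, using the field names listed above.
    if (file.type) form.append(`assets[${i}].type`, file.type);
    if (file.location) form.append(`assets[${i}].location`, file.location);
    if (file.creationTime) form.append(`assets[${i}].creation_time`, file.creationTime);
  });

  // The payload travels as a single JSON string field.
  form.append("payload", JSON.stringify(payload));
  return form;
}
```

The resulting form could then be sent with `fetch(endpointUrl, { method: "POST", body: form })`, where `endpointUrl` stands for this endpoint's route on your server; `fetch` sets the multipart boundary header itself.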
Annotation face mapping (in payload):
{
"annotations": [
{
"name": "John",
"relation": "Father",
"faceIndex": 0
},
{
"name": "Jane",
"relation": "Mother",
"faceIndex": 1
}
]
}

Payload Schema:
{
"type": "album" or "vlog",
"memorableMoments": "Optional string describing key moments",
"playHTCred": {
"userId": "PlayHT user ID",
"secretKey": "PlayHT API secret key",
"audio": "Base64-encoded sample audio for voice cloning",
"gender": "male" or "female"
},
"annotations": [
{
"name": "Person name",
"relation": "Relationship",
"faceIndex": 0
}
]
}

How it works:
- Upload media files via the assets field
- Upload face images via the annotationFaces field (images are indexed 0, 1, 2, ...)
- In the payload.annotations array, reference face images using faceIndex
- The server converts faceIndex to actual base64 face data before processing
- Face images are cleaned up after processing
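
The faceIndex conversion step can be pictured with the sketch below. It is an illustration under assumptions, not the server's actual code: `resolveAnnotationFaces` is a hypothetical helper, the output property name `face` is invented, and the face images are assumed to be available as Buffers in upload order.

```javascript
// Hypothetical helper: swaps each annotation's faceIndex for base64 image data.
// `faceBuffers[i]` is assumed to hold the i-th uploaded annotationFaces image.
function resolveAnnotationFaces(annotations, faceBuffers) {
  return annotations.map(({ faceIndex, ...rest }) => {
    const buffer = faceBuffers[faceIndex];
    if (!buffer) {
      // A faceIndex with no matching upload is treated as a client error here.
      throw new Error(`No face image at index ${faceIndex} for "${rest.name}"`);
    }
    // `face` is an illustrative property name, not confirmed by the project.
    return { ...rest, face: buffer.toString("base64") };
  });
}
```

Rejecting an out-of-range index early keeps a bad reference from surfacing later as a missing or mismatched face in the rendered annotation.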
Response:
{
"title": "Generated album/vlog title",
"caption": "Short description",
"hashtags": ["tag1", "tag2"],
"scenes": [
{
"scene": "1",
"narrative": "Scene narrative",
"collage": "https://tmpfiles.org/...",
"type": "IMAGE",
"mimeType": "image/png",
"location": "Paris",
"background_image": "https://unsplash.com/...",
"audio": "https://tmpfiles.org/..."
}
]
}

- Upload: Files are saved to /tmp/uploads on disk
- Processing: Files are read and converted to base64 for API processing
- Asset Tracking: Each asset is assigned an index which is preserved through all transformations (collage creation, grouping, etc.)
- Video URI Mapping: Video file URIs are mapped by asset index for reliable lookups regardless of media transformations
- Generation: Collages and audio are generated and uploaded to tmpfiles.org
- Cleanup: Temporary files are automatically deleted after processing
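
One way to guarantee the cleanup step runs even when processing fails is a `try`/`finally` wrapper. The sketch below is an assumption about how this could be done; `withCleanup` is a hypothetical helper and not a function from this project.

```javascript
const fs = require("node:fs");

// Hypothetical helper: runs `work` on the uploaded file paths, then deletes
// the files whether `work` succeeded or threw.
async function withCleanup(filePaths, work) {
  try {
    return await work(filePaths);
  } finally {
    // `force: true` ignores files that are already gone.
    await Promise.all(filePaths.map((p) => fs.promises.rm(p, { force: true })));
  }
}
```

A request handler could wrap its whole processing step in `withCleanup(uploadedPaths, async (paths) => { ... })`, so that files under /tmp/uploads do not accumulate after errors.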
The system tracks assets by their original index throughout the entire processing pipeline:
- Initial indexing: Assets receive an assetIndex property preserving their upload order
- Grouping: Assets are grouped by location and type, but retain their original index
- Video URI mapping: Video Gemini URIs are stored in a map keyed by asset index (videoUriMap[assetIndex])
- Collage generation: When videos are included in collages, their URIs are retrieved using the preserved asset index
- Scene generation: Each scene correctly references the appropriate video URI through the index
This index-based approach ensures:
- ✅ Efficient lookups without string-based key matching
- ✅ Robust URI resolution through grouping and sorting transformations
- ✅ No data loss during collage creation or media grouping
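
The index-tracking idea can be sketched as follows. Only `assetIndex` and `videoUriMap` come from the description above; the function names and the grouping key are assumptions made for illustration.

```javascript
// Tag each asset with its upload position.
function indexAssets(assets) {
  return assets.map((asset, assetIndex) => ({ ...asset, assetIndex }));
}

// Group by location and type; every member keeps its original assetIndex.
function groupAssets(indexedAssets) {
  const groups = new Map();
  for (const asset of indexedAssets) {
    const key = `${asset.location}|${asset.type}`; // illustrative grouping key
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(asset);
  }
  return [...groups.values()];
}

// Resolve video URIs for a group through the preserved index,
// regardless of how the group was reordered or regrouped.
function videoUrisForGroup(group, videoUriMap) {
  return group
    .filter((asset) => asset.type === "VIDEO")
    .map((asset) => videoUriMap[asset.assetIndex])
    .filter((uri) => uri !== undefined);
}
```

Because the lookup key is the upload position rather than a file name or path, grouping and sorting can rearrange assets freely without breaking the link between a video and its URI.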
- @google/generative-ai: Gemini API for AI-powered narratives
- multer: File upload middleware
- express: Web framework
- puppeteer: HTML to image rendering for annotations
- playht: Voice cloning and text-to-speech
- @wylie39/image-collage: Collage generation from images
- unsplash-js: Fetching location banner images
- ejs: Template rendering for annotations
The server handles the following failure cases:
- Failed file uploads are cleaned up automatically
- Collage upload failures fall back to base64 responses
- TTS generation falls back from PlayHT to Google TTS if needed
- All errors are logged to console with descriptive messages
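
The fallback behaviour listed above follows a common pattern: try providers in order and log each failure. The sketch below shows that pattern with stand-in provider functions; `synthesizeWithFallback` and the provider objects are hypothetical and do not call the real PlayHT or Google TTS clients.

```javascript
// Tries each provider in order and returns the first successful result.
// `providers` is a hypothetical array of { name, synthesize(text) } objects.
async function synthesizeWithFallback(text, providers) {
  const errors = [];
  for (const { name, synthesize } of providers) {
    try {
      return { provider: name, audio: await synthesize(text) };
    } catch (err) {
      // Log with a descriptive message, then move on to the next provider.
      console.error(`TTS via ${name} failed: ${err.message}`);
      errors.push(err);
    }
  }
  // Every provider failed: surface all collected errors together.
  throw new AggregateError(errors, "All TTS providers failed");
}
```

In this scheme the PlayHT call would be the first entry and Google TTS the second, so the caller receives audio from whichever one succeeds first.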
See the API endpoints section for detailed payload examples.