Version: 1.0
Last Updated: November 30, 2025
Status: Research Phase
Owner: EcareBots Project Team
This document catalogs open-source datasets relevant to EcareBots development, covering healthcare conversational AI, multi-modal accessibility, voice/gesture recognition, medical knowledge, and elderly care. All datasets listed are publicly available, ethically sourced, and have appropriate licenses for research and/or commercial use. These datasets will support model training, evaluation, and validation across EcareBots' core features: voice-first interaction, gesture control, healthcare information processing, and accessibility-optimized interfaces.
Key Findings:
- Healthcare NLP: MIMIC-III Clinical Notes, PubMedQA, and MedQA provide medical question-answering data
- Voice Recognition: Mozilla Common Voice, LibriSpeech, and VoxCeleb offer diverse accent and speaker data
- Gesture Recognition: Google MediaPipe datasets, NTU RGB+D, and EgoGesture for hand tracking
- Accessibility: No large-scale elderly-specific datasets found; will require synthetic data generation or user studies
- Insurance/Claims: Limited public data due to privacy; CMS synthetic data and CCLF files available
Recommendations:
- Use Mozilla Common Voice for accent-diverse voice training (critical for elderly users)
- Fine-tune on MedDialog for healthcare conversational AI
- Leverage MediaPipe Hands dataset for gesture recognition baseline
- Generate synthetic elderly user data (simulate tremors, slower speech, visual impairments)
- Comply with HIPAA de-identification rules if using any clinical data
Description: MedDialog, an English and Chinese dataset of medical conversations between patients and doctors
Source: https://github.com/UCSD-AI4H/Medical-Dialogue-System
Statistics:
- Size: 3.4 million conversations (Chinese), 0.26 million (English)
- Format: JSON (patient utterance, doctor response)
- Topics: General medicine, symptoms, diagnoses, treatments
License: Apache 2.0 (commercial use permitted)
Use Cases for EcareBots:
- Train conversational AI for medical question-answering
- Fine-tune LLM for healthcare domain (tone, terminology)
- Evaluate intent classification (appointment request, medication inquiry, symptom description)
Example:
{
"patient": "I've been having chest pain for 2 days.",
"doctor": "Can you describe the pain? Is it sharp, dull, or pressure-like? Does it radiate to your arm or jaw?"
}
Citation:
@inproceedings{chen2020meddialog,
title={MedDialog: Large-scale Medical Dialogue Datasets},
author={Chen, Shu and others},
booktitle={EMNLP},
year={2020}
}
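The intent-classification use case above (appointment request, medication inquiry, symptom description) can be prototyped over MedDialog-style records with a trivial keyword tagger before any model training. The keyword lists below are illustrative placeholders, not part of the dataset; the `patient`/`doctor` field names follow the JSON example above.

```python
import json

# Minimal intent tagger for MedDialog-style patient utterances.
# INTENT_KEYWORDS is a hand-written placeholder lexicon.
INTENT_KEYWORDS = {
    "symptom_description": ["pain", "fever", "cough", "dizzy"],
    "medication_inquiry": ["dose", "medication", "pill", "refill"],
    "appointment_request": ["appointment", "schedule", "visit"],
}

def tag_intent(utterance: str) -> str:
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return intent
    return "other"

sample = json.loads('{"patient": "I\'ve been having chest pain for 2 days."}')
print(tag_intent(sample["patient"]))  # symptom_description
```

A keyword baseline like this is also useful later as a sanity check against a fine-tuned classifier's predictions.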
Description: ChatDoctor (HealthCareMagic-100k), real patient-doctor conversations from HealthCareMagic.com
Source: https://huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k
Statistics:
- Size: 100,000 conversations
- Format: JSON (patient question, doctor answer)
- Quality: High (verified doctors, detailed responses)
License: CC BY-NC-SA 4.0 (non-commercial research only)
Use Cases:
- Evaluation benchmark (test if AI responses match doctor quality)
- Few-shot learning examples (show LLM how doctors respond)
Limitation: Non-commercial license (cannot use for production EcareBots without permission)
Description: PubMedQA, a medical question-answering dataset based on PubMed abstracts
Source: https://pubmedqa.github.io/
Statistics:
- Size: 1,000 expert-annotated Q&A pairs, plus ~61,000 unlabeled and ~211,000 artificially generated instances
- Format: JSON (question, context from PubMed abstract, yes/no/maybe answer)
- Topics: All medical specialties
License: MIT (commercial use permitted)
Use Cases:
- Train retrieval-augmented generation (RAG) system (retrieve PubMed articles to answer questions)
- Fact-checking layer (verify LLM responses against medical literature)
Example:
{
"question": "Does aspirin prevent heart attacks?",
"context": "Abstract from PMID 12345678: Aspirin reduces cardiovascular events by 25% in high-risk patients...",
"answer": "yes"
}
Citation:
@inproceedings{jin2019pubmedqa,
title={PubMedQA: A Dataset for Biomedical Research Question Answering},
author={Jin, Qiao and others},
booktitle={EMNLP},
year={2019}
}
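A minimal sketch of the retrieval half of the RAG use case above: rank candidate abstracts by word overlap with the question. A real system would use BM25 or dense embeddings (and the actual PubMedQA split via `datasets.load_dataset("pubmed_qa", "pqa_labeled")`); the abstracts here are invented stand-ins.

```python
# Toy retrieval for a RAG pipeline: score contexts by normalized word overlap.
def overlap_score(question: str, context: str) -> int:
    tokens = lambda s: {w.strip(".,?!").lower() for w in s.split()}
    return len(tokens(question) & tokens(context))

abstracts = [
    "Aspirin reduces cardiovascular events in high-risk patients.",
    "Sleep hygiene improves cognitive outcomes in older adults.",
]
question = "Does aspirin prevent heart attacks in patients?"
best = max(abstracts, key=lambda a: overlap_score(question, a))
print(best)  # the aspirin abstract ranks first
```

Swapping `overlap_score` for an embedding-based scorer leaves the rest of the loop unchanged, which makes this a convenient harness for comparing retrievers.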
Description: MedQA, medical exam questions from the US Medical Licensing Examination (USMLE) and equivalent exams in Mainland China and Taiwan
Source: https://github.com/jind11/MedQA
Statistics:
- Size: 61,000+ multiple-choice questions overall (~12,700 in the English USMLE subset)
- Format: JSON (question, 4-5 options, correct answer)
- Difficulty: Medical student to resident level
License: MIT (commercial use permitted)
Use Cases:
- Benchmark medical knowledge ("Can our AI pass USMLE?")
- Avoid for patient-facing features: too advanced (USMLE targets clinicians, not patients)
Example:
{
"question": "A 55-year-old man with chest pain and shortness of breath. ECG shows ST elevation in leads V1-V4. Most likely diagnosis?",
"options": ["A) Pericarditis", "B) Anterior MI", "C) Pulmonary embolism", "D) Aortic dissection"],
"answer": "B"
}
Warning: Too clinical for EcareBots (patients don't need USMLE-level knowledge). Use for internal testing only.
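For the internal-benchmark use case, scoring reduces to a simple loop over question/options/answer records in the JSON format shown above. The `answer_model` stub below stands in for a real LLM call, and the single item is illustrative.

```python
# Sketch of a MedQA-style multiple-choice evaluation harness.
questions = [
    {"question": "ST elevation in leads V1-V4. Most likely diagnosis?",
     "options": ["A) Pericarditis", "B) Anterior MI", "C) Pulmonary embolism", "D) Aortic dissection"],
     "answer": "B"},
]

def answer_model(item: dict) -> str:
    # Placeholder: a real harness would prompt an LLM with the question and options
    return "B"

accuracy = sum(answer_model(q) == q["answer"] for q in questions) / len(questions)
print(f"MedQA accuracy: {accuracy:.0%}")  # MedQA accuracy: 100%
```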
Description: MIMIC-III, de-identified clinical notes from ICU patients
Source: https://physionet.org/content/mimiciii/
Statistics:
- Size: 2 million+ clinical notes (discharge summaries, nursing notes, radiology reports)
- Patients: ~46,000 distinct patients across ~59,000 ICU admissions
- Format: Text files (unstructured clinical notes)
License: PhysioNet Credentialed Health Data License (requires training, data use agreement)
Use Cases:
- Named entity recognition (extract medication names, dosages, conditions)
- Medical terminology normalization ("MI" = "myocardial infarction" = "heart attack")
- NOT for conversational AI (notes are doctor-to-doctor, not patient-facing)
Access Requirements:
- Complete CITI "Data or Specimens Only Research" course
- Sign data use agreement (DUA)
- Approval process: 1-2 weeks
Citation:
@article{johnson2016mimic,
title={MIMIC-III, a freely accessible critical care database},
author={Johnson, Alistair EW and others},
journal={Scientific Data},
year={2016}
}
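As a sketch of the named-entity-recognition use case, a regex baseline can pull medication names and dosages out of free-text notes. A production pipeline would use a trained clinical NER model evaluated on MIMIC-III; the drug lexicon below is an illustrative placeholder.

```python
import re

# Naive medication + dosage extractor for clinical-note text.
# DRUGS is a toy lexicon; real systems draw on RxNorm or a trained NER model.
DRUGS = ["metformin", "aspirin", "lisinopril"]
PATTERN = re.compile(r"\b(" + "|".join(DRUGS) + r")\b\s*(\d+\s*mg)?", re.IGNORECASE)

def extract_medications(note: str):
    # Returns (drug, dosage) pairs; dosage is None when no "<number> mg" follows
    return [(m.group(1).lower(), m.group(2)) for m in PATTERN.finditer(note)]

note = "Pt started on Metformin 500 mg BID; continue aspirin 81 mg daily."
print(extract_medications(note))  # [('metformin', '500 mg'), ('aspirin', '81 mg')]
```

A baseline like this also doubles as a quick data-quality probe before investing in model training.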
Description: Mozilla Common Voice, a crowd-sourced voice dataset in 100+ languages with diverse accents
Source: https://commonvoice.mozilla.org/
Statistics:
- Size: 30,000+ hours of recorded speech across 100+ languages (English is the largest subset, with roughly 3,000 validated hours)
- Speakers: 400,000+ contributors
- Demographics: Age, gender, accent labeled
- Format: MP3 audio + text transcripts (TSV)
License: CC0 (public domain, no restrictions)
Use Cases for EcareBots:
- Critical: Train accent-robust speech-to-text (e.g., Indian-accented and Spanish-accented English)
- Fine-tune Whisper or build custom ASR model
- Evaluate ASR accuracy across demographics (elderly, non-native speakers)
Elderly-Specific Data:
- Filter by age: 60+ years (limited but available)
- Slower speech, vocal tremors common in elderly recordings
Example Usage:
from datasets import load_dataset
# Load English subset (gated dataset: requires a Hugging Face access token)
common_voice = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="train")
# Filter elderly speakers (60+)
elderly_subset = common_voice.filter(lambda x: x["age"] in ("sixties", "seventies", "eighties", "nineties"))
print(f"Elderly speakers: {len(elderly_subset)} samples")
Citation:
@inproceedings{ardila2020common,
title={Common Voice: A Massively-Multilingual Speech Corpus},
author={Ardila, Rosana and others},
booktitle={LREC},
year={2020}
}
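ASR accuracy on Common Voice subsets is conventionally reported as word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the reference length. A self-contained implementation (libraries such as jiwer provide the same metric):

```python
# Word error rate via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("schedule my appointment", "schedule the appointment"))  # one substitution out of three words
```

Computing WER separately per demographic slice (age, accent) is how the "evaluate across demographics" use case above is realized in practice.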
Description: LibriSpeech, audiobook recordings (clean, high-quality English speech)
Source: https://www.openslr.org/12/
Statistics:
- Size: 1,000 hours
- Speakers: 2,000+ (volunteer LibriVox audiobook narrators)
- Format: FLAC audio + text transcripts
License: CC BY 4.0 (commercial use permitted)
Use Cases:
- Pre-train ASR models (clean speech baseline)
- Benchmark (standard ASR evaluation dataset)
Limitation: Read speech with little accent diversity (predominantly standard American English) and few elderly speakers. Use for baseline, not primary training.
Description: VoxCeleb, a celebrity speech dataset for speaker recognition
Source: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/
Statistics:
- Size: 2,000+ hours
- Speakers: 7,000+ celebrities
- Format: YouTube clips (diverse accents, background noise)
License: Research use only (not for commercial without permission)
Use Cases:
- Voice biometrics (verify user identity by voice)
- Speaker diarization (identify who is speaking)
Limitation: Non-commercial license. Use for research/prototyping only.
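The voice-biometrics use case above reduces to comparing speaker embeddings (e.g., produced by a model trained on VoxCeleb) with cosine similarity against a decision threshold. The embedding vectors and the 0.75 threshold below are illustrative stand-ins.

```python
import math

# Speaker-verification decision: cosine similarity between embeddings.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

enrolled = [0.1, 0.9, 0.3]    # stored embedding for the enrolled user (toy values)
attempt = [0.12, 0.88, 0.31]  # embedding from the current login attempt (toy values)
same_speaker = cosine_similarity(enrolled, attempt) > 0.75
print("Verified" if same_speaker else "Rejected")  # Verified
```

In deployment the threshold is tuned on held-out speaker pairs to balance false accepts against false rejects.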
Description: Google Speech Commands, short voice commands ("yes", "no", "stop", "go", numbers 0-9)
Source: https://www.tensorflow.org/datasets/catalog/speech_commands
Statistics:
- Size: 105,000 utterances
- Commands: 35 keywords
- Format: WAV audio (1 second clips)
License: CC BY 4.0 (commercial use permitted)
Use Cases:
- Wake word detection ("Hey EcareBots")
- Simple command recognition ("Yes", "No", "Help")
Example Commands:
- yes, no, up, down, left, right, on, off, stop, go
- zero, one, two, three, four, five, six, seven, eight, nine
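For the simple-command use case, one cheap robustness trick is to snap noisy ASR transcripts onto the closed command vocabulary with fuzzy string matching (stdlib `difflib`). The 0.6 cutoff below is an illustrative choice to tune on real transcripts.

```python
import difflib

# Map a (possibly mis-recognized) transcript onto the command vocabulary.
COMMANDS = ["yes", "no", "stop", "go", "help", "up", "down"]

def match_command(transcript: str):
    # Returns the closest command, or None if nothing clears the cutoff
    matches = difflib.get_close_matches(transcript.lower(), COMMANDS, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(match_command("Yess"))  # yes
```

This is especially useful for elderly speech, where transcription errors are more frequent and the command set is small enough for fuzzy matching to be reliable.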
Description: MediaPipe Hands, hand landmark annotations for 21 key points per hand
Source: https://google.github.io/mediapipe/solutions/hands.html
Statistics:
- Pre-trained model available (no raw dataset publicly released)
- Landmarks: 21 points (wrist, thumb, index, middle, ring, pinky joints)
- Real-time: 30 FPS on mobile devices
License: Apache 2.0 (commercial use permitted)
Use Cases for EcareBots:
- Hand tracking baseline (use MediaPipe as off-the-shelf solution)
- Gesture recognition ("thumbs up", "wave", "point")
Custom Gesture Training:
- Collect EcareBots-specific gestures ("schedule appointment" = swipe right)
- Use MediaPipe hand landmarks as input features
- Train simple classifier (Random Forest, SVM) on custom gestures
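The custom-gesture pipeline above (landmarks in, simple classifier out) can be sketched with a nearest-centroid classifier over flattened landmark vectors (21 points x 2D = 42 features per frame, matching MediaPipe Hands output). The centroid and frame values below are fabricated toy data, not real MediaPipe output.

```python
import math

# Nearest-centroid gesture classifier over flattened hand-landmark features.
def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# One averaged template per gesture; in practice, average many training frames
centroids = {
    "thumbs_up": [0.2] * 42,
    "wave": [0.8] * 42,
}

def classify(landmarks):
    return min(centroids, key=lambda g: distance(centroids[g], landmarks))

frame = [0.25] * 42  # a new flattened landmark vector (toy values)
print(classify(frame))  # thumbs_up
```

The same feature vectors feed directly into the Random Forest or SVM mentioned above once enough labeled frames are collected.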
Description: NTU RGB+D, a large-scale action recognition dataset (including hand gestures)
Source: https://rose1.ntu.edu.sg/dataset/actionRecognition/
Statistics:
- Size: 56,000 videos
- Actions: 60 action classes (hand gestures, body movements)
- Format: RGB video + depth (Kinect sensor)
License: Research use only (requires agreement)
Use Cases:
- Gesture recognition models ("waving", "pointing", "clapping")
- Body pose estimation (detect if user standing, sitting, lying down)
Limitation: Non-commercial license. Use for research only.
Description: EgoGesture, a first-person (egocentric) hand gesture dataset
Source: http://www.nlpr.ia.ac.cn/iva/yfzhang/datasets/egogesture.html
Statistics:
- Size: 24,000+ gesture samples
- Gestures: 83 classes (swiping, pinching, waving, pointing)
- Format: RGB + depth video
License: Research use (contact authors for commercial use)
Use Cases:
- First-person gesture recognition (webcam perspective)
- Relevant if EcareBots uses device camera (not external camera)
Description: UMLS (Unified Medical Language System), a comprehensive medical terminology database
Source: https://www.nlm.nih.gov/research/umls/
Statistics:
- Size: 4 million+ concepts
- Coverage: Diseases, symptoms, medications, procedures, anatomy
- Languages: 25+ languages
License: Free for research, requires license agreement
Use Cases:
- Medical term normalization ("heart attack" = "myocardial infarction" = "MI")
- Symptom-to-condition mapping
- Drug name standardization
Access:
- Register at https://uts.nlm.nih.gov/uts/
- Approval: instant for research, 1-2 weeks for commercial
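The normalization use case can be prototyped as a synonym-to-canonical-concept lookup. The mapping below is hand-written for illustration; a real system would derive it from the UMLS Metathesaurus synonym tables (MRCONSO).

```python
# UMLS-style term normalization: map surface forms to a canonical concept name.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "high blood pressure": "hypertension",
}

def normalize_term(term: str) -> str:
    # Unknown terms pass through unchanged
    return SYNONYMS.get(term.lower().strip(), term)

print(normalize_term("Heart attack"))  # myocardial infarction
```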
Description: RxNorm, a standardized nomenclature for medication names
Source: https://www.nlm.nih.gov/research/umls/rxnorm/
Statistics:
- Size: 100,000+ drug concepts
- Coverage: Brand names, generic names, dosage forms
License: Free of charge; core content is US-government produced, though the full RxNorm release is distributed under the UMLS license (the RxNav API is openly accessible)
Use Cases for EcareBots:
- Medication name resolution ("Advil" = "ibuprofen")
- Dosage validation ("500mg Metformin" is valid)
- Drug interaction checking (with OpenFDA API)
Example:
import requests
# Resolve a drug name to its RxCUI (RxNorm concept identifier)
response = requests.get("https://rxnav.nlm.nih.gov/REST/rxcui.json?name=Advil")
rxcui = response.json()["idGroup"]["rxnormId"][0]
# Get all related concepts (generic ingredient, brand names, dose forms)
response = requests.get(f"https://rxnav.nlm.nih.gov/REST/rxcui/{rxcui}/allrelated.json")
groups = response.json()["allRelatedGroup"]["conceptGroup"]
names = [c["name"] for g in groups for c in g.get("conceptProperties", [])]
print(names)  # related names, including the generic ingredient ibuprofen
Description: ICD-10-CM, the International Classification of Diseases (diagnosis codes)
Source: https://www.cdc.gov/nchs/icd/icd-10-cm.htm
Statistics:
- Size: 70,000+ diagnosis codes
- Format: Code + description (e.g., "E11.9 = Type 2 diabetes mellitus without complications")
License: Public domain
Use Cases:
- Condition name standardization
- EHR integration (diagnoses stored as ICD-10 codes)
- Insurance claims (ICD-10 required)
Status:
Gap Identified: No large-scale elderly-specific voice dataset publicly available
Solution: Generate synthetic data or conduct user studies
Characteristics to Simulate:
- Slower speech: 20-30% slower than average adult
- Vocal tremors: Shaky voice, pitch variations
- Lower volume: Quieter speech (hearing loss compensation)
- Pauses: Longer pauses between words (cognitive processing)
- Mispronunciations: More frequent errors
Data Augmentation Techniques:
import librosa
import numpy as np
def simulate_elderly_speech(audio, sr=16000):
    # Slow speech to 70% of the original rate (librosa >= 0.9 takes rate as a keyword argument)
    audio_slow = librosa.effects.time_stretch(audio, rate=0.7)
    # Add vocal tremor: 5 Hz amplitude modulation at 10% depth
    tremor = np.sin(2 * np.pi * 5 * np.arange(len(audio_slow)) / sr) * 0.1
    audio_tremor = audio_slow * (1 + tremor)
    # Reduce volume by 20% (simulates quieter elderly speech)
    audio_quiet = audio_tremor * 0.8
    return audio_quiet
WCAG 2.1 Test Cases:
- Manual testing guidelines (not a dataset)
- https://www.w3.org/WAI/WCAG21/quickref/
Screen Reader Compatibility:
- Test with NVDA (Windows), JAWS (Windows), VoiceOver (Mac/iOS)
- No public dataset; requires manual testing
Description: CMS DE-SynPUF, fake but realistic synthetic Medicare claims data
Source: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs
Statistics:
- Size: 2 million synthetic beneficiaries
- Data: Claims (diagnoses, procedures, costs), demographics
- Format: CSV
License: Public domain
Use Cases:
- Test insurance verification logic
- Estimate out-of-pocket costs
- Simulate claims processing
Limitation: Synthetic data (not real patients, patterns may differ from reality)
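A sketch of the claims-testing use case: aggregate per-beneficiary payments from a SynPUF-style CSV. `DESYNPUF_ID` is the SynPUF beneficiary key; the `CLM_PMT_AMT` payment column follows CMS claims-file naming, but verify both against the SynPUF data dictionary before relying on them.

```python
import csv
import io

# Sum claim payment amounts per synthetic beneficiary (inline sample data).
sample = io.StringIO(
    "DESYNPUF_ID,CLM_PMT_AMT\n"
    "A1,120.50\n"
    "A1,80.00\n"
    "B2,40.25\n"
)
totals = {}
for row in csv.DictReader(sample):
    totals[row["DESYNPUF_ID"]] = totals.get(row["DESYNPUF_ID"], 0.0) + float(row["CLM_PMT_AMT"])
print(totals)  # {'A1': 200.5, 'B2': 40.25}
```

The same loop over the real SynPUF files gives rough per-beneficiary cost distributions for testing out-of-pocket estimation logic.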
Description: Real Medicare claims data (de-identified)
Statistics:
- Size: 60+ million Medicare beneficiaries
- Data: All Medicare claims (2015-present)
License: Requires Data Use Agreement (DUA), HIPAA compliance
Access Requirements:
- Submit research proposal to CMS
- Approval process: 6-12 months
- Annual fees: $10,000-50,000
Use Cases:
- Healthcare cost analysis
- Claims pattern analysis
- NOT for MVP (too expensive, too slow)
Description: ALIGN, image-text pairs for vision-language models
Source: https://github.com/google-research/google-research/tree/master/align
Statistics:
- Size: 1.8 billion image-text pairs
- Format: Image URLs + alt text
License: Research use (images from web, copyright varies)
Use Cases:
- Vision + language understanding ("Show me my blood pressure reading" while looking at glucose monitor)
- NOT directly applicable (no healthcare images), but useful for pre-training
Description: Video captioning dataset
Statistics:
- Size: 10,000 videos, 200,000 descriptions
- Format: Video + natural language captions
License: Research use
Use Cases:
- Video understanding (if EcareBots adds video input)
- Gesture recognition + voice (multi-modal)
Gaps in Public Datasets:
- Elderly-specific voice data (limited)
- Healthcare appointment conversations (no public data)
- Insurance card images (privacy-protected)
- Multi-modal interactions (voice + gesture simultaneously)
Synthetic Data Advantages:
- Privacy-preserving (no real patient data)
- Controllable (generate specific scenarios)
- Scalable (unlimited data)
Voice Synthesis:
- ElevenLabs: Generate realistic elderly voices (adjust pitch, speed, tremor)
- Coqui TTS: Open-source text-to-speech
- Azure Speech Studio: Custom neural voices
Text Generation:
- GPT-4 / Claude: Generate realistic patient-doctor conversations
- Prompt: "Generate 100 conversations where elderly patients ask about medication schedules"
Image Synthesis:
- Midjourney / DALL-E: Generate insurance card images (for OCR training)
- Stable Diffusion: Generate elderly user personas (for UI testing)
Example: Generate Elderly Voice Data
from elevenlabs import generate, Voice, VoiceSettings
# Generate an elderly-sounding voice saying "Schedule appointment for tomorrow".
# "elderly_male_001" is a placeholder -- substitute a real or cloned voice ID
# from your ElevenLabs account.
audio = generate(
    text="Schedule appointment for tomorrow",
    voice=Voice(
        voice_id="elderly_male_001",
        settings=VoiceSettings(stability=0.6, similarity_boost=0.7),
    ),
)
# Note: the SDK exposes no direct pitch/speed controls; achieve lower, slower
# speech via voice selection/cloning or audio post-processing (e.g., librosa).
with open("elderly_command.mp3", "wb") as f:
    f.write(audio)
Before using any dataset:
- Read license carefully (MIT, Apache, CC BY, CC BY-NC-SA, etc.)
- Verify commercial use permitted (some datasets are research-only)
- Check attribution requirements (must cite dataset in publications/docs?)
- Understand restrictions (no redistribution? no derivatives?)
- Obtain permission if needed (contact dataset authors for commercial use)
Before training models:
- Inspect samples (manually review 100-1000 examples)
- Check for biases (demographic representation, label imbalance)
- Verify accuracy (labels correct? transcripts match audio?)
- Test edge cases (noisy audio, accented speech, ambiguous gestures)
- Measure metrics (WER for ASR, accuracy for gesture recognition)
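The bias and imbalance checks above can start as simple frequency counts over dataset metadata before any statistical testing. The records below are illustrative.

```python
from collections import Counter

# Quick pre-training audit: demographic representation and label balance.
samples = [
    {"age": "twenties", "label": "yes"},
    {"age": "sixties", "label": "yes"},
    {"age": "twenties", "label": "no"},
    {"age": "twenties", "label": "yes"},
]
age_counts = Counter(s["age"] for s in samples)
label_counts = Counter(s["label"] for s in samples)
print(age_counts)    # Counter({'twenties': 3, 'sixties': 1})
print(label_counts)  # Counter({'yes': 3, 'no': 1})
```

Skewed counts here (e.g., few elderly speakers) flag exactly the gap this document identifies, and motivate the synthetic-data augmentation described earlier.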
- De-identified data only (no real patient names, SSNs, addresses)
- HIPAA compliance (if using clinical data, follow Safe Harbor or Expert Determination)
- Informed consent (if collecting new data, users must consent)
- Data minimization (collect only what's needed)
- Secure storage (encrypt datasets at rest, access controls)
Priority 1:
- Mozilla Common Voice (English, 60+ age filter) → ASR training
- MedDialog → Healthcare conversational AI fine-tuning
- RxNorm → Medication name standardization
- MediaPipe Hands (pre-trained model) → Gesture recognition baseline
Deliverables:
- ASR model with >90% transcription accuracy (word error rate below 10%) on elderly speech
- Healthcare chatbot with medication inquiry support
- Gesture recognition for 5 basic commands ("yes", "no", "help", "back", "home")
Priority 2:
- PubMedQA → Medical fact-checking layer
- UMLS → Medical term normalization
- CMS Synthetic Data → Insurance verification testing
- VoxCeleb → Voice biometrics (user verification)
Deliverables:
- RAG system citing PubMed articles
- Insurance verification with 95%+ accuracy
- Voice-based authentication
Priority 3:
- MIMIC-III (if approved) → Advanced medical NLP
- Custom elderly user studies (collect 1,000+ hours real elderly speech)
- Synthetic gesture dataset (generate 10,000+ custom gestures)
Deliverables:
- Production-ready ASR (>95% accuracy across all demographics)
- Custom gesture recognition (20+ healthcare-specific gestures)
- Medical entity extraction (medications, dosages, conditions)
For AI/ML Engineers:
- Start with Mozilla Common Voice (best accent diversity, free, commercial-friendly)
- Fine-tune on MedDialog (healthcare-specific conversations)
- Generate synthetic elderly data (no large-scale public dataset exists)
- Use MediaPipe as baseline (don't reinvent hand tracking)
For Product Managers:
- Elderly voice data is scarce (expect 6-12 months to collect sufficient data)
- Most healthcare datasets are research-only (budget for licensing or synthetic data)
- Privacy is paramount (never use real patient data without de-identification)
For Executives:
- Data is the moat (quality datasets = competitive advantage)
- Budget for data collection ($50-200K for user studies, synthetic data generation)
- Compliance is non-negotiable (HIPAA, IRB approval for user studies)
- Download Mozilla Common Voice (English, 60+ age subset)
- Register for UMLS license (medication/condition normalization)
- Access MedDialog dataset (healthcare conversations)
- Test MediaPipe Hands (gesture recognition baseline)
- Plan user study (collect real elderly voice data for post-MVP)
Document Status: Research complete, datasets cataloged and prioritized.
This document is a living document and will be updated as new datasets become available and project needs evolve.