This project ingests unstructured drug labeling data from the DailyMed public API, extracts meaningful sections, and structures it into clean JSON for downstream NLP/ML use.
- Python
- requests, lxml, json, spacy
- Basic CLI orchestration
- ingest.py: Downloads HTML files using the DailyMed SPL web service.
- clean.py: Parses HTML sections using lxml and extracts clinical sections like "INDICATIONS", "WARNINGS", etc.
- nlp.py: Analyzes sections and creates entity dicts.
{
"INDICATIONS AND USAGE": "This medication is used for...",
"WARNINGS": "Do not use if you are allergic to...",
...
}
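
As a rough illustration of how the clean.py step could turn label HTML into a dict of this shape, here is a minimal sketch. It is not the project's actual code: it swaps in the standard-library `html.parser` so it runs without extra dependencies (the project itself uses lxml), and it assumes section titles appear in `<h1>`/`<h2>` tags, which may not match the real DailyMed markup. `TARGET_SECTIONS`, `SectionParser`, and `extract_sections` are hypothetical names chosen for the example.

```python
import json
from html.parser import HTMLParser

# Hypothetical set of section titles to keep; the real list would live in clean.py.
TARGET_SECTIONS = {"INDICATIONS AND USAGE", "WARNINGS"}


class SectionParser(HTMLParser):
    """Collects text under each heading, assuming titles sit in <h1>/<h2> tags."""

    HEADING_TAGS = {"h1", "h2"}

    def __init__(self):
        super().__init__()
        self.sections = {}        # section title -> list of text fragments
        self._in_heading = False  # True while inside a heading tag
        self._heading_parts = []  # text fragments of the heading being read
        self._current = None      # title of the section being collected, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._in_heading:
            self._in_heading = False
            # Normalize whitespace and case so "Warnings" matches "WARNINGS".
            title = " ".join("".join(self._heading_parts).split()).upper()
            self._current = title if title in TARGET_SECTIONS else None
            if self._current is not None:
                self.sections.setdefault(self._current, [])

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)
        elif self._current is not None and data.strip():
            self.sections[self._current].append(data.strip())


def extract_sections(html: str) -> dict:
    """Return {section title: section text} for the targeted sections only."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.sections.items()}


# Made-up sample input, not real DailyMed content.
sample = """
<h1>Indications and Usage</h1><p>This medication is used for pain.</p>
<h1>Description</h1><p>White tablets.</p>
<h1>Warnings</h1><p>Do not use if you are <b>allergic</b> to it.</p>
"""
print(json.dumps(extract_sections(sample), indent=2))
```

Text fragments are joined with single spaces, so inline markup such as `<b>` does not split words, though punctuation that directly follows an inline tag will pick up a leading space. Sections outside `TARGET_SECTIONS` (here, "Description") are dropped.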