There's a good list at https://github.com/dterg/biomedical_corpora/wiki/Biomedical-Corpora-Sources that we could draw on. It would be nice to identify corpora there that are missing here and add them. Minimum requirement: publicly accessible.