Project 4: Scaling up a literature processing pipeline to support text analytics of published biomedical articles
The BioC format is a simple, community-driven data structure for sharing text and annotations. The Auto-CORPus natural language processing command-line tool (https://github.com/omicsNLP/Auto-CORPus) converts biomedical publication HTML files to the machine-interpretable BioC JSON format to support biomedical text analytics. Auto-CORPus performs the data standardisation and optimisation step in various text mining pipelines, including those of the ELIXIR Humangenphen project, the Horizon Europe CoDiet programme and the CHIST-ERA FAIRClinical project. Its use across projects investigating different health data domains has produced a set of user requirements. During this Hackathon project, we will work on the highest priority user requirements to improve the user experience, extend the tool’s features and increase the number of potential use cases where Auto-CORPus could make a valuable contribution. This project has two goals:
- Provide a user interface for configuring Auto-CORPus
- Integrate Auto-CORPus with the Cadmus publication download system
An Auto-CORPus configuration file is required to describe the broad structure of the HTML files to be processed. Currently, the config files (in JSON format) are manually edited by the user. We will develop a user interface for generating Auto-CORPus config files in the form of a browser extension. This will enable users to browse their desired journal article pages and select the different HTML sections required in the config. The config file will be built automatically and, once complete, can be exported for immediate use with Auto-CORPus. The accuracy of the configs will be tested by comparing the BioC outputs generated from processing the same publication from different sources (e.g., from preprint repositories, PubMed Central, journals). An accompanying tutorial will also be developed for Auto-CORPus, describing how to install and use the new tool.
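The comparison of BioC outputs described above could be sketched roughly as follows. This is a minimal illustration rather than the project's evaluation method: it assumes each output is a BioC JSON collection already parsed into a dict with a "documents" list whose entries hold "passages" with a "text" field, and the function names and the two metrics are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def passage_texts(bioc_collection):
    """Return whitespace-normalised passage texts from a parsed BioC JSON collection.

    Assumes the layout documents -> passages -> text; missing keys are skipped.
    """
    texts = []
    for document in bioc_collection.get("documents", []):
        for passage in document.get("passages", []):
            text = " ".join(passage.get("text", "").split())
            if text:
                texts.append(text)
    return texts


def compare_outputs(collection_a, collection_b):
    """Compare two BioC outputs of the same publication (hypothetical metrics)."""
    texts_a = passage_texts(collection_a)
    texts_b = passage_texts(collection_b)
    set_a, set_b = set(texts_a), set(texts_b)
    union = set_a | set_b
    return {
        # Share of distinct passages that appear in both outputs.
        "passage_overlap": len(set_a & set_b) / len(union) if union else 1.0,
        # Character-level similarity of the concatenated text.
        "text_similarity": SequenceMatcher(
            None, " ".join(texts_a), " ".join(texts_b)
        ).ratio(),
        # Passages captured by only one config, e.g. page furniture or a missed section.
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
    }


# Made-up example data standing in for two sources of the same article.
pmc = {"documents": [{"passages": [
    {"text": "Title of the paper"},
    {"text": "Methods were applied."},
]}]}
journal = {"documents": [{"passages": [
    {"text": "Title of  the paper"},
    {"text": "Methods were applied."},
    {"text": "Cookie notice"},
]}]}

report = compare_outputs(pmc, journal)
print(report["only_in_b"])
```

In practice the two dicts would come from `json.load` on the files Auto-CORPus writes. Passages listed under "only_in_a" or "only_in_b" point to sections that one config captured and the other did not.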
Auto-CORPus processes input data that have been gathered locally. Cadmus is an automated full-text retrieval system for the generation of large biomedical corpora from published literature for research purposes (https://github.com/biomedicalinformaticsgroup/cadmus). Cadmus outputs publications in the XML formats provided by the publishers, so they are not compliant with the BioC standard. Cadmus is used with Auto-CORPus in text analytics processes to download and standardise articles; however, this currently involves manual intervention, which is not scalable. We will integrate these applications, providing an automatic, scalable and reusable pipeline to create machine-interpretable biomedical corpora that are optimised for NLP and text mining.
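One possible shape for the automated step is a batch loop that passes every retrieved file to a converter and records failures instead of stopping. The sketch below is illustrative only: `run_pipeline` and `wrap_as_bioc` are hypothetical names, the converter is a stand-in that does not call the Auto-CORPus or Cadmus APIs, and the output file naming is an assumption.

```python
import json
from pathlib import Path


def wrap_as_bioc(source_path):
    """Stand-in converter (NOT the Auto-CORPus API): wrap raw text in a minimal BioC-like dict."""
    text = Path(source_path).read_text(encoding="utf-8").strip()
    if not text:
        raise ValueError("empty input file")
    return {
        "source": "example",
        "documents": [
            {"id": Path(source_path).stem, "passages": [{"offset": 0, "text": text}]}
        ],
    }


def run_pipeline(input_dir, output_dir, convert):
    """Apply `convert` to every file in `input_dir`, writing one JSON file per input.

    Returns (written_paths, failures) so failed files can be reviewed afterwards
    without interrupting the rest of the batch.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for source in sorted(p for p in input_dir.iterdir() if p.is_file()):
        try:
            collection = convert(source)
        except Exception as error:  # keep going; report the failure at the end
            failed.append((source.name, str(error)))
            continue
        target = output_dir / f"{source.stem}_bioc.json"
        target.write_text(
            json.dumps(collection, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        written.append(target)
    return written, failed
```

Passing the converter in as an argument keeps the loop independent of either tool, so the real conversion call can be substituted once the integration interface is agreed.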
This project will use data in the form of a corpus of open access publications in XML and JSON formats. Python will be used for coding activities, and outputs will be committed to an open GitHub repository. This project would be ideal for anyone with Python, web app development or NLP skills, or anyone who would like to find out more about biomedical text analytics.
Antoine Lain, Thomas Rowlands