Research using large-scale routinely collected data enables discoveries that improve people's lives. This research is reliant on the use of code lists to define key characteristics such as diseases, drugs, and other variables of interest. However, creating code lists is often a manual, time-consuming process that can be difficult to reproduce if not reported adequately. Furthermore, code lists frequently need to be updated, mapped or revised as coding terminologies/dictionaries can differ between data sources or over time.
While several code list repositories such as the HDR UK phenotype library exist, there is a lack of tools to help researchers reliably develop new code lists or update and adapt existing ones in a transparent and reproducible manner. We recently developed a step-by-step reporting checklist (https://openresearch.nihr.ac.uk/articles/4-20) to support adherence to best practices when creating and sharing code lists.
This project aims to develop a data source agnostic, open-source toolkit designed to facilitate the creation of reproducible code lists. Such a toolkit could encompass multiple features aligning with the checklist. These tools would make code lists creation more efficient and transparent by automatisation of processes and reporting. We may also trial the use of machine learning approaches or large language models to develop some of these functionalities.
Some example features could include:
- A tool for categorizing codes by clinically relevant subtypes (e.g., diagnoses, symptoms, personal or family history, tests and disease monitoring). This would facilitate the inclusion or exclusion of codes in line with the researcher’s requirements.
- A tool for improving checks of parent and descendant codes (e.g., identifying related codes that should be included in a code list). This would simplify the process of updating code lists.
- A tool for mapping code lists across different terminologies.
- A tool for completing the reporting checklist to increase reproducibility and transparency.
We plan to:
- Conceptualize functionalities addressing current challenges.
- Develop targeted features.
- Combine and test these features in a prototype Shiny app.
We will provide several manually created gold standard code lists for evaluation of the created tools, and mapping dictionaries. The checklist we developed is already published in an open access journal, and a previously developed example toolkit can be found here (https://julian-matthewman.shinyapps.io/codelisttools/). This project aims to significantly enhance the development of code lists, ultimately leading to more efficient and reproducible use of routinely collected health data to improve health. The created tools will build upon and extend the existing HDR UK phenotype library.
Julian Matthewman, Marleen Bokern, Anne Suffel, Kirsty Andresen, Helen Strongman