Extract URLs from annotations in any part of the fulltext #1315

Merged
lfoppiano merged 12 commits into master from feature/extract-any-url-in-fulltext on Jan 27, 2026
@lfoppiano lfoppiano commented Jul 17, 2025

This PR extends the existing functionality that recognises URLs and provides a clean target URI, covering the cases where the URL is not identified by the regex (here the DATAmic13 token is annotated with a URL):

image

Here is the result:

<div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>Data availability</head> All 43,191  
                    <p>genomes recovered in this study, the GOMC database containing 
                        <ref type="bibr" target="#b23">24,</ref>195  unique genomes and other supporting data can be interactively accessed at China National GeneBank DataBase (CNGBdb) (
                        <ref type="url" target="https://db.cngb.org/maya/datasets/MDB0000002">https://db.cngb.org/maya/datasets/MDB0000002</ref>). The previously available public marine bacterial and archaeal genomes in NCBI have been also collected and backed up in China National Gen-eBank Sequence Archive (CNSA) under the accession DATAmic13. The two marine microbial genome catalogues OMD and OceanDNA were downloaded from OMD (
                        <ref type="url" target="https://microbiomics.io/ocean/">https://microbiomics.io/ocean/</ref>) and figshare (OceanDNA, 
                        <ref type="url" target="https://doi.org/10.6084/m9.figshare.c.5564844.v1">https://doi.org/10.6084/m9.figshare.c.5564844.  v1</ref>). The Earth's Microbiomes (GEM) catalogue and Tibetan Glacier Genome and Gene (TG2G) catalogue were downloaded from 
                        <ref type="url" target="https://genome.jgi.doe.gov/GEM">https://  genome.jgi.doe.gov/GEM</ref> and 
                        <ref type="url" target="https://www.biosino.org/node/project/detail/OEP003083">https://www.biosino.org/node/project/  detail/OEP003083,</ref> respectively. The BiG-FAM database can be accessed at 
                        <ref type="url" target="https://bigfam.bioinformatics.nl/">https://bigfam.bioinformatics.nl/</ref>. Additional materials generated in this study are available on request.
                    </p>
                </div>
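The idea of picking up a URL from a link annotation, even when the token text does not look like a URL, can be sketched as a bounding-box overlap test. This is a minimal illustration and not the code in this PR: the `Box`, `Token`, and `LinkAnnotation` types, the `link_tokens` function, the `min_overlap` threshold, and the `https://example.org/DATAmic13` target are all hypothetical, since the real target of the DATAmic13 annotation is not shown above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float

    def area(self) -> float:
        return self.width * self.height

    def intersection_area(self, other: "Box") -> float:
        # Overlap along each axis; non-positive means the boxes are disjoint.
        dx = min(self.x + self.width, other.x + other.width) - max(self.x, other.x)
        dy = min(self.y + self.height, other.y + other.height) - max(self.y, other.y)
        return dx * dy if dx > 0 and dy > 0 else 0.0


@dataclass(frozen=True)
class Token:
    text: str
    box: Box


@dataclass(frozen=True)
class LinkAnnotation:
    box: Box
    uri: str


def link_tokens(
    tokens: List[Token],
    annotations: List[LinkAnnotation],
    min_overlap: float = 0.5,  # assumed threshold, not taken from the PR
) -> List[Tuple[str, str]]:
    """Pair each token with the first annotation covering at least
    `min_overlap` of the token's area."""
    linked = []
    for token in tokens:
        token_area = token.box.area()
        if token_area <= 0:
            continue
        for annotation in annotations:
            covered = token.box.intersection_area(annotation.box) / token_area
            if covered >= min_overlap:
                linked.append((token.text, annotation.uri))
                break
    return linked


# Made-up coordinates: only "DATAmic13" sits under the link annotation.
tokens = [
    Token("accession", Box(10, 100, 50, 10)),
    Token("DATAmic13", Box(65, 100, 55, 10)),
    Token(".", Box(120, 100, 3, 10)),
]
annotations = [LinkAnnotation(Box(64, 99, 57, 12), "https://example.org/DATAmic13")]
linked = link_tokens(tokens, annotations)
```

Matching on geometry rather than on the token text is what lets a plain identifier such as DATAmic13 receive a target, while the trailing full stop, which the annotation rectangle barely touches, is left out.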

We've got a workable version; however, this functionality should be integrated with the current one that uses the regex, considering
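One possible way to combine the two sources is to collect candidate spans from both and let annotation spans take precedence wherever they overlap a regex match. The sketch below shows that policy only as an assumption; it is not necessarily how the integration was done in this PR. `UrlSpan`, `regex_spans`, `merge_spans`, the simplified `URL_PATTERN`, and the `https://example.org/...` targets are hypothetical.

```python
import re
from typing import List, NamedTuple


class UrlSpan(NamedTuple):
    start: int   # character offset in the text (inclusive)
    end: int     # character offset in the text (exclusive)
    target: str  # URI to emit as the target
    source: str  # "regex" or "annotation"


# Deliberately simplistic pattern, for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s()<>]+")


def regex_spans(text: str) -> List[UrlSpan]:
    """Find URL-looking substrings, dropping trailing punctuation."""
    spans = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(".,;")
        spans.append(UrlSpan(match.start(), match.start() + len(url), url, "regex"))
    return spans


def merge_spans(regex: List[UrlSpan], annotation: List[UrlSpan]) -> List[UrlSpan]:
    """Keep every annotation span; keep a regex span only when it overlaps
    no annotation span (assumed precedence rule)."""
    merged = list(annotation)
    for span in regex:
        overlaps = any(span.start < a.end and a.start < span.end for a in annotation)
        if not overlaps:
            merged.append(span)
    return sorted(merged, key=lambda s: s.start)


text = "accession DATAmic13. See https://microbiomics.io/ocean/ for OMD."
start = text.index("DATAmic13")
annotation = [
    UrlSpan(start, start + len("DATAmic13"), "https://example.org/DATAmic13", "annotation")
]
merged = merge_spans(regex_spans(text), annotation)
```

Under this policy the annotated token and the regex-detected URL both end up in the result, and a URL covered by both sources is emitted once, with the annotation's target.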

coveralls commented Jul 18, 2025

coverage: 40.482% (+0.09%) from 40.394% when pulling 9f10930 on feature/extract-any-url-in-fulltext into 01fe109 on master.

lfoppiano added this to the 0.9.0 milestone on Nov 6, 2025
lfoppiano self-assigned this on Nov 7, 2025
lfoppiano force-pushed the feature/extract-any-url-in-fulltext branch from 9f10930 to f81c588 on January 26, 2026 13:44
lfoppiano merged commit 75e40a2 into master on Jan 27, 2026
4 checks passed
lfoppiano deleted the feature/extract-any-url-in-fulltext branch on January 27, 2026 05:56