Extract URLs from annotations in any part of the fulltext by lfoppiano · Pull Request #1315 · grobidOrg/grobid

lfoppiano · 2025-07-17T13:14:07Z

This PR extends the existing functionality that recognises URLs and provides a clean target URI, so that it also covers URLs that are not identified by the regex (here the DATAmic13 token is annotated with a URL):

Here is the result:

<div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>Data availability</head> All 43,191  
                    <p>genomes recovered in this study, the GOMC database containing 
                        <ref type="bibr" target="#b23">24,</ref>195  unique genomes and other supporting data can be interactively accessed at China National GeneBank DataBase (CNGBdb) (
                        <ref type="url" target="https://db.cngb.org/maya/datasets/MDB0000002">https://db.cngb.org/maya/datasets/MDB0000002</ref>). The previously available public marine bacterial and archaeal genomes in NCBI have been also collected and backed up in China National Gen-eBank Sequence Archive (CNSA) under the accession DATAmic13. The two marine microbial genome catalogues OMD and OceanDNA were downloaded from OMD (
                        <ref type="url" target="https://microbiomics.io/ocean/">https://microbiomics.io/ocean/</ref>) and figshare (OceanDNA, 
                        <ref type="url" target="https://doi.org/10.6084/m9.figshare.c.5564844.v1">https://doi.org/10.6084/m9.figshare.c.5564844.  v1</ref>). The Earth's Microbiomes (GEM) catalogue and Tibetan Glacier Genome and Gene (TG2G) catalogue were downloaded from 
                        <ref type="url" target="https://genome.jgi.doe.gov/GEM">https://  genome.jgi.doe.gov/GEM</ref> and 
                        <ref type="url" target="https://www.biosino.org/node/project/detail/OEP003083">https://www.biosino.org/node/project/  detail/OEP003083,</ref> respectively. The BiG-FAM database can be accessed at 
                        <ref type="url" target="https://bigfam.bioinformatics.nl/">https://bigfam.bioinformatics.nl/</ref>. Additional materials generated in this study are available on request.
                    </p>
                </div>
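To inspect output like the above, the `ref` elements of type `url` can be listed together with their `target` attributes. The following is a minimal sketch using only the Python standard library; the helper name `list_url_refs` and the shortened sample are illustrative assumptions, not part of GROBID.

```python
import xml.etree.ElementTree as ET

# TEI namespace as declared on the <div> element in the output above.
TEI = "http://www.tei-c.org/ns/1.0"


def list_url_refs(tei_xml: str):
    """Return (target, visible_text) pairs for every <ref type="url">.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tei_xml)
    pairs = []
    for ref in root.iter(f"{{{TEI}}}ref"):
        if ref.get("type") == "url":
            # itertext() joins the text of the element and any children.
            pairs.append((ref.get("target"), "".join(ref.itertext())))
    return pairs


# Shortened sample adapted from the result above.
sample = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <p>downloaded from
    <ref type="url" target="https://genome.jgi.doe.gov/GEM">https://  genome.jgi.doe.gov/GEM</ref>
    (<ref type="bibr" target="#b23">24,</ref>)
  </p>
</div>"""

for target, text in list_url_refs(sample):
    print(target, "<-", repr(text))
```

The visible text keeps the whitespace introduced by the PDF line break, while `target` holds the cleaned URI, so downstream code should read the attribute rather than the element text.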

We have a workable version; however, this functionality should be integrated with the current regex-based one, considering
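One way such an integration could look is sketched below. This is not GROBID's implementation: the `UrlSpan` type, the character-offset representation, the simplified regex, and the `example.org` target are all assumptions made for illustration. Annotation-backed spans take precedence, and a regex-detected span is kept only when it does not overlap any annotated span.

```python
import re
from typing import List, NamedTuple


class UrlSpan(NamedTuple):
    start: int    # character offset where the span begins
    end: int      # character offset just past the span
    target: str   # clean URI to emit as the ref target
    source: str   # "annotation" or "regex"


# Deliberately simple pattern, standing in for the real regex.
URL_RE = re.compile(r"https?://[^\s)]+")


def regex_spans(text: str) -> List[UrlSpan]:
    """Spans found by pattern matching on the text alone."""
    return [UrlSpan(m.start(), m.end(), m.group(0), "regex")
            for m in URL_RE.finditer(text)]


def overlaps(a: UrlSpan, b: UrlSpan) -> bool:
    return a.start < b.end and b.start < a.end


def merge_spans(annotated: List[UrlSpan], detected: List[UrlSpan]) -> List[UrlSpan]:
    """Keep all annotated spans; add regex spans that overlap none of them."""
    merged = list(annotated)
    for span in detected:
        if not any(overlaps(span, a) for a in annotated):
            merged.append(span)
    return sorted(merged, key=lambda s: s.start)


text = "under the accession DATAmic13, see https://microbiomics.io/ocean/ for details"
start = text.index("DATAmic13")
# Hypothetical annotation target; the real URI comes from the PDF annotation.
annotated = [UrlSpan(start, start + len("DATAmic13"),
                     "https://example.org/DATAmic13", "annotation")]

for span in merge_spans(annotated, regex_spans(text)):
    print(span.source, text[span.start:span.end], "->", span.target)
```

Preferring the annotation when spans overlap is the conservative choice here, since the annotation carries the URI the PDF author attached, whereas the regex only sees text that may be broken by line wrapping.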

coveralls · 2025-07-18T12:47:37Z

coverage: 40.482% (+0.09%) from 40.394%
when pulling 9f10930 on feature/extract-any-url-in-fulltext
into 01fe109 on master.

lfoppiano added this to the 0.9.0 milestone Nov 6, 2025

lfoppiano self-assigned this Nov 7, 2025

lfoppiano added 11 commits January 26, 2026 14:43

add method to extract urls without regex, using only PDF annotations

fde32a9

cleanup edges

c52e98d

cleanup

3237d93

consolidate URL extraction with the previous implementation regex based

9815ee0

remove wrong filter

a5d6fa6

avoid running out of tokens

8848afe

fix tests

50049b5

merge non-annotation backed URLs into the annotated-based URLs

15dcc33

more conservative merging

97fdb9e

conservative indexing

93d79a6

reindex figures, tables and equations

f81c588

lfoppiano force-pushed the feature/extract-any-url-in-fulltext branch from 9f10930 to f81c588 January 26, 2026 13:44

chore: remove unused import

0f22ea6

lfoppiano merged commit 75e40a2 into master Jan 27, 2026
4 checks passed

lfoppiano deleted the feature/extract-any-url-in-fulltext branch January 27, 2026 05:56

lfoppiano merged 12 commits into master from feature/extract-any-url-in-fulltext
