1010
1111<p align="center">
1212 <a href="https://pypi.org/project/structflo-ner/"><img src="https://img.shields.io/pypi/pyversions/structflo-ner.svg" alt="Python Versions"></a>
13- <a href="https://pepy.tech/project/structflo-ner"><img src="https://static.pepy.tech/badge/structflo-ner" alt="Downloads"></a>
13+ <a href="https://pepy.tech/projects/structflo-ner"><img src="https://static.pepy.tech/personalized-badge/structflo-ner?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads" alt="PyPI Downloads"></a>
1414 <a href="https://github.com/structflo/structflo-ner/actions"><img src="https://img.shields.io/github/actions/workflow/status/structflo/structflo-ner/ci.yml?label=tests" alt="Tests"></a>
1515 <a href="https://codecov.io/gh/structflo/structflo-ner"><img src="https://codecov.io/gh/structflo/structflo-ner/branch/main/graph/badge.svg" alt="Coverage"></a>
1616 <a href="https://github.com/structflo/structflo-ner/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-Apache%202.0-green.svg" alt="License"></a>
2929
3030---
3131
32- **structflo.ner** is a lightweight NER library specialized for pharmaceutical and biological sciences. It uses [LangExtract](https://github.com/langextract/langextract) and other fuzzy based tools to deliver **zero-configuration** entity extraction.
32+ **structflo.ner** is a lightweight NER library specialized for pharmaceutical and biological sciences. It uses [LangExtract](https://github.com/langextract/langextract) and fuzzy based tools to deliver **zero-configuration** entity extraction.
3333
3434It ships with two extraction engines:
3535
36- | | `NERExtractor` | `FastNERExtractor` |
37- |---|---|---|
38- | Approach | LLM-powered (Gemini, Ollama) | Dictionary-based (YAML gazetteers) |
39- | Speed | ~10-60s per abstract | ~0.4-1s per abstract |
40- | Novel entities | Discovers new entities | Known terms only |
41- | Context awareness | Full contextual understanding | String matching (exact + fuzzy) |
42- | Cost | API costs or local GPU | Free (no API calls) |
43- | Setup | API key or Ollama | Zero config |
44- | Output format | `NERResult` | `NERResult` (identical) |
36+ |                   | `NERExtractor`                | `FastNERExtractor`                 |
37+ | ----------------- | ----------------------------- | ---------------------------------- |
38+ | Approach          | LLM-powered (Gemini, Ollama)  | Dictionary-based (YAML gazetteers) |
39+ | Speed             | ~10-60s per abstract          | ~0.4-1s per abstract               |
40+ | Novel entities    | Discovers new entities        | Known terms only                   |
41+ | Context awareness | Full contextual understanding | String matching (exact + fuzzy)    |
42+ | Cost              | API costs or local GPU        | Free (no API calls)                |
43+ | Setup             | API key or Ollama             | Zero config                        |
44+ | Output format     | `NERResult`                   | `NERResult` (identical)            |
4545
4646## Installation
4747
@@ -205,11 +205,11 @@ result
205205
206206### How matching works
207207
208- | Phase | Method | What it catches |
209- |---|---|---|
210- | 1 | **Exact match** | Case-sensitive and normalized dictionary lookups with word-boundary enforcement |
211- | 1b | **Regex patterns** | Auto-derived patterns from accession number seeds (Rv tags, UniProt, PDB, etc.) |
212- | 2 | **Fuzzy match** | Typos and minor variants via [rapidfuzz](https://github.com/rapidfuzz/rapidfuzz) (configurable threshold) |
208+ | Phase | Method             | What it catches                                                                                           |
209+ | ----- | ------------------ | --------------------------------------------------------------------------------------------------------- |
210+ | 1     | **Exact match**    | Case-sensitive and normalized dictionary lookups with word-boundary enforcement                           |
211+ | 1b    | **Regex patterns** | Auto-derived patterns from accession number seeds (Rv tags, UniProt, PDB, etc.)                           |
212+ | 2     | **Fuzzy match**    | Typos and minor variants via [rapidfuzz](https://github.com/rapidfuzz/rapidfuzz) (configurable threshold) |
213213
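The three phases in the table can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the library's implementation: `GAZETTEER`, `RV_TAG`, and `match_entities` are hypothetical names, the exact phase is simplified to a case-insensitive lookup, and `difflib.SequenceMatcher` stands in for rapidfuzz so the example runs without extra dependencies.

```python
import re
from difflib import SequenceMatcher

# Hypothetical miniature gazetteer; the terms are taken from the README tables.
GAZETTEER = {
    "compound_name": ["Bedaquiline", "Delamanid", "Pretomanid"],
    "target": ["InhA", "DprE1", "MmpL3"],
}
# Hypothetical hand-written pattern for Rv locus tags such as Rv1305.
RV_TAG = re.compile(r"\bRv\d{4}[Ac]?\b")


def match_entities(text, fuzzy_threshold=85):
    """Return (surface, entity_type, phase) tuples found in text."""
    hits, matched = [], set()
    # Phase 1: dictionary lookup with word-boundary enforcement
    # (case-insensitive here, which is a simplification).
    for etype, terms in GAZETTEER.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((m.group(), etype, "exact"))
                matched.add(m.group().lower())
    # Phase 1b: regex patterns for accession-like identifiers.
    for m in RV_TAG.finditer(text):
        hits.append((m.group(), "accession_number", "regex"))
        matched.add(m.group().lower())
    # Phase 2: fuzzy match on the remaining tokens; a threshold of 0 disables it.
    if fuzzy_threshold > 0:
        for token in re.findall(r"[A-Za-z0-9]+", text):
            if token.lower() in matched or len(token) < 5:
                continue
            for etype, terms in GAZETTEER.items():
                for term in terms:
                    ratio = SequenceMatcher(None, token.lower(), term.lower()).ratio()
                    if 100 * ratio >= fuzzy_threshold:
                        hits.append((token, etype, "fuzzy"))
    return hits


hits = match_entities("Bedaqiline inhibits the product of Rv1305, not InhA.")
```

In this sketch the misspelled "Bedaqiline" is only recovered by the fuzzy phase, and passing `fuzzy_threshold=0` leaves just the exact and regex hits, mirroring the strict mode shown in the README.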
214214```python
215215# Fuzzy matching catches typos
@@ -224,17 +224,17 @@ strict = FastNERExtractor(fuzzy_threshold=0)
224224
225225The fast extractor ships with curated gazetteers for TB drug discovery:
226226
227- | Gazetteer | Examples |
228- |---|---|
229- | `accession_number` | Rv1305, B586_RS00005 |
230- | `gene_name` | atpE, InhA, DprE1 |
231- | `screening_method` | whole-cell screening, fragment-based screening |
232- | `target` | InhA, DprE1, MmpL3 |
233- | `compound_name` | Bedaquiline, Delamanid, Pretomanid |
234- | `functional_category` | DNA replication, cell wall biosynthesis |
235- | `strain` | M. tuberculosis H37Rv |
236- | `product` | enoyl-ACP reductase, ATP synthase subunit c |
237- | `disease` | TB, MDR-TB, XDR-TB |
227+ | Gazetteer             | Examples                                       |
228+ | --------------------- | ---------------------------------------------- |
229+ | `accession_number`    | Rv1305, B586_RS00005                           |
230+ | `gene_name`           | atpE, InhA, DprE1                              |
231+ | `screening_method`    | whole-cell screening, fragment-based screening |
232+ | `target`              | InhA, DprE1, MmpL3                             |
233+ | `compound_name`       | Bedaquiline, Delamanid, Pretomanid             |
234+ | `functional_category` | DNA replication, cell wall biosynthesis        |
235+ | `strain`              | M. tuberculosis H37Rv                          |
236+ | `product`             | enoyl-ACP reductase, ATP synthase subunit c    |
237+ | `disease`             | TB, MDR-TB, XDR-TB                             |
238238
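Note that some terms in the table, such as InhA and DprE1, appear under both `gene_name` and `target`. A minimal sketch of how a term-to-type index could represent that overlap is shown below; `GAZETTEERS` and `build_index` are hypothetical names for illustration and say nothing about how the library stores its YAML gazetteers internally.

```python
from collections import defaultdict

# Hypothetical in-memory view of three gazetteers, using examples from the table.
GAZETTEERS = {
    "gene_name": ["atpE", "InhA", "DprE1"],
    "target": ["InhA", "DprE1", "MmpL3"],
    "disease": ["TB", "MDR-TB", "XDR-TB"],
}


def build_index(gazetteers):
    """Map each case-folded term to the sorted list of entity types that list it."""
    index = defaultdict(set)
    for entity_type, terms in gazetteers.items():
        for term in terms:
            index[term.casefold()].add(entity_type)
    return {term: sorted(types) for term, types in index.items()}


INDEX = build_index(GAZETTEERS)
inha_types = INDEX["inha"]      # listed as both a gene name and a target
mdr_types = INDEX["mdr-tb"]     # listed under a single type
```

Keeping every type for an ambiguous term, rather than picking one at load time, leaves the decision to whatever consumes the matches.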
239239### Custom gazetteers
240240
@@ -264,14 +264,14 @@ Profiles control which entity types are extracted. Use them to focus the model o
264264
265265### Built-in profiles
266266
267- | Profile | Entity classes |
268- |---|---|
269- | `FULL` (default) | compounds, targets, diseases, bioactivities, assays, mechanisms |
270- | `CHEMISTRY` | compound names, SMILES, CAS numbers, molecular formulas |
271- | `BIOLOGY` | targets, gene names, protein names |
272- | `BIOACTIVITY` | bioactivity measurements, assays |
273- | `DISEASE` | diseases and clinical indications |
274- | `TB` | TB drug discovery (compounds, targets, diseases, accessions, strains, screening methods, functional categories) |
267+ | Profile          | Entity classes                                                                                                  |
268+ | ---------------- | --------------------------------------------------------------------------------------------------------------- |
269+ | `FULL` (default) | compounds, targets, diseases, bioactivities, assays, mechanisms                                                 |
270+ | `CHEMISTRY`      | compound names, SMILES, CAS numbers, molecular formulas                                                         |
271+ | `BIOLOGY`        | targets, gene names, protein names                                                                              |
272+ | `BIOACTIVITY`    | bioactivity measurements, assays                                                                                |
273+ | `DISEASE`        | diseases and clinical indications                                                                               |
274+ | `TB`             | TB drug discovery (compounds, targets, diseases, accessions, strains, screening methods, functional categories) |
275275
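Conceptually, a profile can be thought of as a named set of entity classes. The sketch below illustrates that idea with a hypothetical `Profile` dataclass and `apply_profile` helper that filter already-extracted entities. The class names, the entity-class strings, and the post-hoc filtering are all assumptions made for illustration; the library's own profile objects may work differently, for example by shaping the extraction prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in: a named set of entity classes to keep."""
    name: str
    entity_classes: frozenset


# Illustrative profiles loosely following the table; the class strings are assumed.
CHEMISTRY_SKETCH = Profile(
    "CHEMISTRY",
    frozenset({"compound_name", "smiles", "cas_number", "molecular_formula"}),
)
BIOLOGY_SKETCH = Profile(
    "BIOLOGY",
    frozenset({"target", "gene_name", "protein_name"}),
)


def apply_profile(entities, profile):
    """Keep only (text, entity_class) pairs whose class the profile allows."""
    return [entity for entity in entities if entity[1] in profile.entity_classes]


entities = [
    ("Bedaquiline", "compound_name"),
    ("InhA", "target"),
    ("MDR-TB", "disease"),
]
chemistry_only = apply_profile(entities, CHEMISTRY_SKETCH)
biology_only = apply_profile(entities, BIOLOGY_SKETCH)
```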
276276```python
277277from structflo.ner import NERExtractor, CHEMISTRY
@@ -338,10 +338,10 @@ result.to_dict()
338338
339339Explore worked examples in the [`notebooks/`](notebooks/) directory:
340340
341- | Notebook | Description |
342- |---|---|
343- | [01_quickstart.ipynb](notebooks/01_quickstart.ipynb) | End-to-end extraction with cloud and local models, profiles, batch extraction |
344- | [02_fast_ner.ipynb](notebooks/02_fast_ner.ipynb) | Fast dictionary-based NER — matching strategies, custom gazetteers, performance |
341+ | Notebook                                             | Description                                                                     |
342+ | ---------------------------------------------------- | ------------------------------------------------------------------------------- |
343+ | [01_quickstart.ipynb](notebooks/01_quickstart.ipynb) | End-to-end extraction with cloud and local models, profiles, batch extraction   |
344+ | [02_fast_ner.ipynb](notebooks/02_fast_ner.ipynb)     | Fast dictionary-based NER — matching strategies, custom gazetteers, performance |
345345
346346## Contributing
347347