Commit 83b82b5

updated documentation
1 parent f446128 commit 83b82b5

5 files changed

Lines changed: 795 additions & 39 deletions

README.md

Lines changed: 39 additions & 39 deletions
@@ -10,7 +10,7 @@
 
 <p align="center">
 <a href="https://pypi.org/project/structflo-ner/"><img src="https://img.shields.io/pypi/pyversions/structflo-ner.svg" alt="Python Versions"></a>
-<a href="https://pepy.tech/project/structflo-ner"><img src="https://static.pepy.tech/badge/structflo-ner" alt="Downloads"></a>
+<a href="https://pepy.tech/projects/structflo-ner"><img src="https://static.pepy.tech/personalized-badge/structflo-ner?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads" alt="PyPI Downloads"></a>
 <a href="https://github.com/structflo/structflo-ner/actions"><img src="https://img.shields.io/github/actions/workflow/status/structflo/structflo-ner/ci.yml?label=tests" alt="Tests"></a>
 <a href="https://codecov.io/gh/structflo/structflo-ner"><img src="https://codecov.io/gh/structflo/structflo-ner/branch/main/graph/badge.svg" alt="Coverage"></a>
 <a href="https://github.com/structflo/structflo-ner/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-Apache%202.0-green.svg" alt="License"></a>
@@ -29,19 +29,19 @@
 
 ---
 
-**structflo.ner** is a lightweight NER library specialized for pharmaceutical and biological sciences. It uses [LangExtract](https://github.com/langextract/langextract) and other fuzzy based tools to deliver **zero-configuration** entity extraction.
+**structflo.ner** is a lightweight NER library specialized for pharmaceutical and biological sciences. It uses [LangExtract](https://github.com/langextract/langextract) and fuzzy based tools to deliver **zero-configuration** entity extraction.
 
 It ships with two extraction engines:
 
-| | `NERExtractor` | `FastNERExtractor` |
-|---|---|---|
-| Approach | LLM-powered (Gemini, Ollama) | Dictionary-based (YAML gazetteers) |
-| Speed | ~10-60s per abstract | ~0.4-1s per abstract |
-| Novel entities | Discovers new entities | Known terms only |
-| Context awareness | Full contextual understanding | String matching (exact + fuzzy) |
-| Cost | API costs or local GPU | Free (no API calls) |
-| Setup | API key or Ollama | Zero config |
-| Output format | `NERResult` | `NERResult` (identical) |
+| | `NERExtractor` | `FastNERExtractor` |
+| ----------------- | ----------------------------- | ---------------------------------- |
+| Approach | LLM-powered (Gemini, Ollama) | Dictionary-based (YAML gazetteers) |
+| Speed | ~10-60s per abstract | ~0.4-1s per abstract |
+| Novel entities | Discovers new entities | Known terms only |
+| Context awareness | Full contextual understanding | String matching (exact + fuzzy) |
+| Cost | API costs or local GPU | Free (no API calls) |
+| Setup | API key or Ollama | Zero config |
+| Output format | `NERResult` | `NERResult` (identical) |
 
 ## Installation
 
@@ -205,11 +205,11 @@ result
 
 ### How matching works
 
-| Phase | Method | What it catches |
-|---|---|---|
-| 1 | **Exact match** | Case-sensitive and normalized dictionary lookups with word-boundary enforcement |
-| 1b | **Regex patterns** | Auto-derived patterns from accession number seeds (Rv tags, UniProt, PDB, etc.) |
-| 2 | **Fuzzy match** | Typos and minor variants via [rapidfuzz](https://github.com/rapidfuzz/rapidfuzz) (configurable threshold) |
+| Phase | Method | What it catches |
+| ----- | ------------------ | --------------------------------------------------------------------------------------------------------- |
+| 1 | **Exact match** | Case-sensitive and normalized dictionary lookups with word-boundary enforcement |
+| 1b | **Regex patterns** | Auto-derived patterns from accession number seeds (Rv tags, UniProt, PDB, etc.) |
+| 2 | **Fuzzy match** | Typos and minor variants via [rapidfuzz](https://github.com/rapidfuzz/rapidfuzz) (configurable threshold) |
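
Reviewer note: the three phases in the matching table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not structflo.ner's implementation. `GAZETTEER`, `ACCESSION_RE`, and `match_entities` are hypothetical names, the Rv pattern is hand-written rather than auto-derived from seeds, only lowercase normalization is shown, and `difflib.SequenceMatcher` from the standard library stands in for rapidfuzz so the snippet has no third-party dependencies. Treating a threshold of 0 as "fuzzy matching off" is also an assumption, modeled on the `fuzzy_threshold=0` argument visible in the next hunk header.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-gazetteer (normalized term -> entity type).
GAZETTEER = {
    "bedaquiline": "compound_name",
    "inha": "gene_name",
    "mdr-tb": "disease",
}
# Hand-written stand-in for an auto-derived accession pattern (Rv locus tags).
ACCESSION_RE = re.compile(r"\bRv\d{4}[Ac]?\b")


def match_entities(text, fuzzy_threshold=85):
    """Return (surface form, entity type, phase) tuples in phase order."""
    tokens = re.findall(r"[\w-]+", text)  # whole tokens only, so word boundaries hold
    hits, seen = [], set()

    # Phase 1: exact lookup on lowercased tokens.
    for tok in tokens:
        key = tok.lower()
        if key in GAZETTEER and key not in seen:
            hits.append((tok, GAZETTEER[key], "exact"))
            seen.add(key)

    # Phase 1b: regex patterns for accession numbers.
    for m in ACCESSION_RE.finditer(text):
        key = m.group().lower()
        if key not in seen:
            hits.append((m.group(), "accession_number", "regex"))
            seen.add(key)

    # Phase 2: fuzzy match on the remaining tokens (assumed: 0 disables it).
    if fuzzy_threshold > 0:
        for tok in tokens:
            key = tok.lower()
            if key in seen or len(key) < 5:  # short tokens match too easily
                continue
            term, score = max(
                ((t, 100 * SequenceMatcher(None, key, t).ratio()) for t in GAZETTEER),
                key=lambda pair: pair[1],
            )
            if score >= fuzzy_threshold:
                hits.append((tok, GAZETTEER[term], "fuzzy"))
                seen.add(key)
    return hits


text = "Bedaquilline inhibits Rv1305 in MDR-TB."
hits = match_entities(text)
strict_hits = match_entities(text, fuzzy_threshold=0)
```

In this sketch the misspelled "Bedaquilline" is picked up only in the fuzzy phase, so it is absent from `strict_hits`, while "MDR-TB" and "Rv1305" are found by the exact and regex phases in both calls.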
 
 ```python
 # Fuzzy matching catches typos
@@ -224,17 +224,17 @@ strict = FastNERExtractor(fuzzy_threshold=0)
 
 The fast extractor ships with curated gazetteers for TB drug discovery:
 
-| Gazetteer | Examples |
-|---|---|
-| `accession_number` | Rv1305, B586_RS00005 |
-| `gene_name` | atpE, InhA, DprE1 |
-| `screening_method` | whole-cell screening, fragment-based screening |
-| `target` | InhA, DprE1, MmpL3 |
-| `compound_name` | Bedaquiline, Delamanid, Pretomanid |
-| `functional_category` | DNA replication, cell wall biosynthesis |
-| `strain` | M. tuberculosis H37Rv |
-| `product` | enoyl-ACP reductase, ATP synthase subunit c |
-| `disease` | TB, MDR-TB, XDR-TB |
+| Gazetteer | Examples |
+| --------------------- | ---------------------------------------------- |
+| `accession_number` | Rv1305, B586_RS00005 |
+| `gene_name` | atpE, InhA, DprE1 |
+| `screening_method` | whole-cell screening, fragment-based screening |
+| `target` | InhA, DprE1, MmpL3 |
+| `compound_name` | Bedaquiline, Delamanid, Pretomanid |
+| `functional_category` | DNA replication, cell wall biosynthesis |
+| `strain` | M. tuberculosis H37Rv |
+| `product` | enoyl-ACP reductase, ATP synthase subunit c |
+| `disease` | TB, MDR-TB, XDR-TB |
 
 ### Custom gazetteers
 
@@ -264,14 +264,14 @@ Profiles control which entity types are extracted. Use them to focus the model o
 
 ### Built-in profiles
 
-| Profile | Entity classes |
-|---|---|
-| `FULL` (default) | compounds, targets, diseases, bioactivities, assays, mechanisms |
-| `CHEMISTRY` | compound names, SMILES, CAS numbers, molecular formulas |
-| `BIOLOGY` | targets, gene names, protein names |
-| `BIOACTIVITY` | bioactivity measurements, assays |
-| `DISEASE` | diseases and clinical indications |
-| `TB` | TB drug discovery (compounds, targets, diseases, accessions, strains, screening methods, functional categories) |
+| Profile | Entity classes |
+| ---------------- | --------------------------------------------------------------------------------------------------------------- |
+| `FULL` (default) | compounds, targets, diseases, bioactivities, assays, mechanisms |
+| `CHEMISTRY` | compound names, SMILES, CAS numbers, molecular formulas |
+| `BIOLOGY` | targets, gene names, protein names |
+| `BIOACTIVITY` | bioactivity measurements, assays |
+| `DISEASE` | diseases and clinical indications |
+| `TB` | TB drug discovery (compounds, targets, diseases, accessions, strains, screening methods, functional categories) |
 
 ```python
 from structflo.ner import NERExtractor, CHEMISTRY
@@ -338,10 +338,10 @@ result.to_dict()
 
 Explore worked examples in the [`notebooks/`](notebooks/) directory:
 
-| Notebook | Description |
-|---|---|
-| [01_quickstart.ipynb](notebooks/01_quickstart.ipynb) | End-to-end extraction with cloud and local models, profiles, batch extraction |
-| [02_fast_ner.ipynb](notebooks/02_fast_ner.ipynb) | Fast dictionary-based NER — matching strategies, custom gazetteers, performance |
+| Notebook | Description |
+| ---------------------------------------------------- | ------------------------------------------------------------------------------- |
+| [01_quickstart.ipynb](notebooks/01_quickstart.ipynb) | End-to-end extraction with cloud and local models, profiles, batch extraction |
+| [02_fast_ner.ipynb](notebooks/02_fast_ner.ipynb) | Fast dictionary-based NER — matching strategies, custom gazetteers, performance |
 
 ## Contributing
 