Commit 4d0e1ed

committed
refactor
1 parent 74e7d9d commit 4d0e1ed

8 files changed: +154 -1 lines changed

.gitignore

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -132,5 +132,5 @@ the_real_final_input/
132 132
.DS_Store
133 133
.vscode/
134 134
.idea/
135-
*.md
135+
/*.md
136 136
!README.md

paper/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
paper/.md

paper/paper.md

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
1+
---
2+
title: 'nomenclator: a Python package for the automated generation of Latin binomials for Bacterial and Archaeal genera'
3+
tags:
4+
- Python
5+
- microbiology
6+
- taxonomy
7+
- nomenclature
8+
- bioinformatics
9+
authors:
10+
- name: Andrea Telatin
11+
orcid: 0000-0001-7619-281X
12+
affiliation: 1
13+
affiliations:
14+
- name: Quadram Institute Bioscience, Norwich, UK
15+
index: 1
16+
date: 28 October 2025
17+
bibliography: paper.bib
18+
---
19+
20+
# Summary
21+
22+
`nomenclator` is a Python package that automates the creation of linguistically valid Latin binomials for bacterial and archaeal taxa, based on the "Great Automated Nomenclator" script [@Pallen2021].
23+
Bacterial nomenclature requires Latin or Latinized names that conform to the rules of the International Code of Nomenclature of Prokaryotes (ICNP) and Latin grammar [@Parker2019; @Oren2019].
24+
The tool generates candidate genus and species names by combinatorially concatenating
25+
Latin and Greek roots drawn from two Excel files of curated root lists.
26+
27+
28+
29+
# Statement of Need
30+
31+
The exponential growth in microbial species discovery through culturomics, genomics, and metagenomics has created an urgent need for millions of new taxonomic names—far exceeding the capacity of manual expert-driven nomenclature.
32+
`nomenclator` addresses this bottleneck by providing pre-generated, grammatically correct names that can be used "off the shelf" as needed.
33+
34+
Creating valid taxonomic names is challenging because it requires:
35+
36+
1. **Classical language expertise**: Names must follow Latin grammar rules with proper gender agreement and declension
37+
2. **ICNP compliance**: The nomenclature code contains 65 rules and numerous recommendations
38+
3. **Manual quality control**: Each name requires expert review, creating a significant bottleneck
39+
40+
41+
# Implementation and Features
42+
43+
## Installation
44+
45+
The package can be installed via `pip`:
46+
47+
```bash
48+
pip install nomenclator
49+
```
50+
51+
## Tools
52+
53+
`nomenclator` is implemented in pure Python (3.8+) with minimal dependencies.
54+
The package exports these CLI tools:
55+
56+
- `gan-genus`: Generates genus names based on user-defined parameters (number of names, roots to use, etc.)
57+
- `gan-init`: Initializes a project directory with necessary files and templates
58+
- `xls2tsv`: Converts Excel files with taxonomic data into TSV format for further processing
59+
60+
## Example input and output
61+
62+
An example of the input Excel file structure is shown below:
63+
64+
| Language | Gender | Part | Word | Root | Definition | Explanation |
65+
|----------|--------|------|-------------|-----------|--------------------------------------|-------------|
66+
| L. | masc. | n. | admissarius | admissari | a stallion used for breeding | horses |
67+
| Gr. | masc. | n. | Arion | ariono | a mythical horse that could speak | horses |
68+
| Gr. | masc. | n. | Balios | Balio | a mythical horse | horses |
69+
| L. | masc. | n. | caballus | caballi | a horse | horses |
70+
71+
The programme's output can be saved as HTML or PDF files. An example is:
72+
73+
* **Admissaristercoricola** - Etymology: *L. masc. n. admissarius*, a stallion used for breeding; *L. neut. n. stercus*, excrement; *N.L. masc./fem. n. cola*, an inhabitant; `Admissaristercoricola`: a microbe of the faeces of horses.
74+
* **Admissaristercoradaptatus** - Etymology: *L. masc. n. admissarius*, a stallion used for breeding; *L. neut. n. stercus*, excrement; *L. masc. n. adaptatus*, something adapted; `Admissaristercoradaptatus`: a microbe of the faeces of horses.
75+
* **Admissaristercorihabitans** - Etymology: *L. masc. n. admissarius*, a stallion used for breeding; *L. neut. n. stercus*, excrement; *L. masc. n. habitans*, an inhabitant; `Admissaristercorihabitans`: a microbe of the faeces of horses.
76+
# Acknowledgments
77+
78+
This software originated from research conducted with Mark J. Pallen and Aharon Oren, published in *Trends in Microbiology* [@Pallen2021], which demonstrated the concept of mass nomenclature generation for prokaryotic taxonomy.
79+
80+
# Funding
81+
82+
The author gratefully acknowledges the support of the Biotechnology and Biological Sciences Research Council (BBSRC); this research was funded
83+
by the BBSRC Core Capability Grant BB/CCG2260/1
84+
and by the BBSRC Institute Strategic Programme Microbes and Food Safety
85+
BB/X011011/1 and its constituent project
86+
BBS/E/QU/230002C.
87+
88+
# References
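
The Summary and the example table in `paper/paper.md` describe names built by combinatorially concatenating curated roots. The following is a minimal Python sketch of that idea, not the package's actual implementation. It assumes the root tables have already been exported to TSV with the column layout shown in the paper; the `stercus` row, its `stercori` root form, the third table, and the helper names `read_roots`, `describe`, and `assemble` are all hypothetical.

```python
import csv
import io
from itertools import product

HEADER = "Language\tGender\tPart\tWord\tRoot\tDefinition\tExplanation\n"

# First-position roots: these two rows are taken from the table in paper.md.
FIRST_TSV = HEADER + (
    "L.\tmasc.\tn.\tadmissarius\tadmissari\ta stallion used for breeding\thorses\n"
    "L.\tmasc.\tn.\tcaballus\tcaballi\ta horse\thorses\n"
)
# Second- and third-position roots: ASSUMED rows, chosen so that the result
# resembles the example output quoted in the paper.
SECOND_TSV = HEADER + "L.\tneut.\tn.\tstercus\tstercori\texcrement\tfaeces\n"
THIRD_TSV = HEADER + (
    "N.L.\tmasc./fem.\tn.\tcola\tcola\tan inhabitant\t\n"
    "L.\tmasc.\tn.\thabitans\thabitans\tan inhabitant\t\n"
)


def read_roots(tsv_text):
    """Parse a tab-separated root table into a list of dict records."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def describe(root):
    """Format one root as an etymology fragment in Markdown."""
    source = f"{root['Language']} {root['Gender']} {root['Part']} {root['Word']}"
    return f"*{source}*, {root['Definition']}"


def assemble(*tables):
    """Yield (name, etymology) for every combination of one root per table."""
    for combo in product(*tables):
        # Concatenate the root forms and capitalise the first letter only.
        name = "".join(root["Root"] for root in combo).capitalize()
        etymology = "; ".join(describe(root) for root in combo)
        yield name, etymology


results = list(assemble(read_roots(FIRST_TSV),
                        read_roots(SECOND_TSV),
                        read_roots(THIRD_TSV)))

for name, etymology in results:
    print(f"* **{name}** - Etymology: {etymology}")
```

With two first-position roots, one second-position root, and two third-position roots, the sketch yields four candidate names. Grammar checks such as gender agreement and connecting vowels, which the paper lists as requirements, are deliberately left out.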

tests/all.sh

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
#!/bin/bash
2+
3+
set -euo pipefail
4+
5+
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
6+
7+
INPUT=$DIR/../input_test/
8+
OUTPUT=$DIR/../output/
9+
SCRIPTS=$DIR/../scripts/
10+
bold=$(tput bold 2>/dev/null || true)  # tolerate a missing $TERM (e.g. in CI) under set -e
11+
normal=$(tput sgr0 2>/dev/null || true)
12+
R1="$INPUT"/test_input_genera_part1.xlsx
13+
R2="$INPUT"/test_input_genera_part2.xlsx
14+
R3="$INPUT"/test_input_genera_part3.xlsx
15+
16+
for i in "$R1" "$R2" "$R3";
17+
do
18+
echo "$bold * Validating: $(basename "$i")$normal"
19+
if [ -e "$i" ]; then
20+
"$SCRIPTS"/gan-validate.py -p -i "$i"
21+
else
22+
echo "ERROR: $i not found."
23+
exit 1
24+
fi
25+
done
26+
27+
echo "$bold * Genera: three roots$normal"
28+
set -x
29+
mkdir -p "$OUTPUT/"
30+
"$SCRIPTS"/gan-genus.py -1 "$R1" -2 "$R2" -3 "$R3" -o "$OUTPUT/" > "$OUTPUT/genera_3.html"
31+
set +x
32+
33+
echo "$bold * Genera: two roots$normal"
34+
"$SCRIPTS"/gan-genus.py -1 "$R1" -2 "$R3" -o "$OUTPUT/" > "$OUTPUT/genera_2.html"
35+

tests/make_suppl.sh

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
#!/bin/bash
2+
set -euxo pipefail
3+
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
4+
JOIN="of"
5+
DATASETS="$DIR/../the_real_final_input"
6+
GAN="$DIR/../scripts/gan-genus.py"
7+
VAL="$DIR/../scripts/gan-validate.py"
8+
for i in "$DATASETS"/*;
9+
do
10+
B=$(basename "$i")
11+
echo "$B"
12+
for FILE in "$i"/[1-3].xlsx; do
13+
[[ -e "$FILE" ]] || continue; "$VAL" -t -i "$FILE" > "$FILE.validation.log" 2>&1
14+
done
15+
done
16+
17+
18+
for i in "$DATASETS"/*;
19+
do
20+
B=$(basename "$i")
21+
echo "$B"
22+
if [[ -e "$i/1.xlsx" && -e "$i/2.xlsx" && -e "$i/3.xlsx" ]]; then
23+
"$GAN" -1 "$i/1.xlsx" -2 "$i/2.xlsx" -3 "$i/3.xlsx" --connector "$JOIN" -o "$i/" --prefix "${B/_/-}"
24+
elif [[ -e "$i/1.xlsx" && -e "$i/2.xlsx" ]]; then
25+
"$GAN" -1 "$i/1.xlsx" -2 "$i/2.xlsx" --connector "$JOIN" -o "$i/" --prefix "${B/_/-}"
26+
27+
fi
28+
done
29+
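
The second loop in `tests/make_suppl.sh` picks a three-root or two-root invocation depending on which of `1.xlsx`, `2.xlsx`, and `3.xlsx` exist in a dataset directory. As a rough illustration, the same decision can be sketched in Python as a function that only builds the argument list. The function name `build_gan_command` and the demo directory name are hypothetical; the flags (`-1`, `-2`, `-3`, `--connector`, `-o`, `--prefix`) are those used in the script above, and the command is not executed here.

```python
import tempfile
from pathlib import Path


def build_gan_command(dataset_dir, script="gan-genus.py", connector="of"):
    """Return the argv list for one dataset, or None if roots 1 and 2 are missing."""
    dataset = Path(dataset_dir)
    first, second, third = (dataset / f"{n}.xlsx" for n in (1, 2, 3))
    if not (first.exists() and second.exists()):
        return None  # mirrors the shell script: neither branch runs
    cmd = [script, "-1", str(first), "-2", str(second)]
    if third.exists():
        cmd += ["-3", str(third)]  # three-root variant
    # "${B/_/-}" in Bash replaces only the FIRST underscore, hence count=1.
    prefix = dataset.name.replace("_", "-", 1)
    cmd += ["--connector", connector, "-o", f"{dataset}/", "--prefix", prefix]
    return cmd


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "horse_gut_v2"
    dataset.mkdir()
    for n in (1, 2):
        (dataset / f"{n}.xlsx").touch()
    two_root_cmd = build_gan_command(dataset)

    (dataset / "3.xlsx").touch()
    three_root_cmd = build_gan_command(dataset)

    empty = Path(tmp) / "empty"
    empty.mkdir()
    no_cmd = build_gan_command(empty)

print(two_root_cmd)
print(three_root_cmd)
```

Returning a list instead of running the command keeps the branching logic testable without the Excel inputs; the list could be passed to `subprocess.run` if execution is wanted.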

tests/table1.xlsx

10.5 KB
Binary file not shown.

tests/table2.xlsx

10.2 KB
Binary file not shown.

tests/table3.xlsx

10.3 KB
Binary file not shown.
