`standardize` gives misleading results on Arden style gene names #98

Hi Yuta,

I was using `tidytcells` to fix some TCR names from older papers when I noticed that it was getting a few of them wrong. On closer inspection, they all seemed to come from older pre-IMGT nomenclatures, particularly that of Arden *et al.* I used these two papers, which include gene ID conversion tables:

* [Lefranc *et al.*](https://doi.org/10.1002/0471142735.ima01os40) (which is basically the supplementary data following the original IMGT nomenclature approval)
* [Arden *et al.*](https://doi.org/10.1007/BF00172176) (which details that older nomenclature, handily providing a bunch of accessions, which are out of date but do link to modern sequence IDs for validation)

I then picked out a couple of example genes:

* TRAV1-2
    * In the Arden nomenclature this is AV7S2 
    * (This is what actually made me start looking, when I noticed that some supposedly MAIT alpha chain antibodies bind 'TCRAV7S2') 
* TRBV19
    * This is BV17S1 under Arden
    * (This is particularly relevant, as 'TCRBV17S1' is an example given in the `tidytcells` paper)

Running these through `tidytcells` gives the wrong answers:

```python
import tidytcells as tt

# Arden-style name for TRAV1-2: emits a warning, and the attempted fix is wrong
tt.tr.standardize('AV7S2')
# /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/tidytcells/_utils/warnings.py:12: UserWarning: Failed to standardize "AV7S2" for species homosapiens: unrecognised gene name. (best attempted fix: "TRAV7-2").
#   warn(warning_message)

# Arden-style name for TRBV19: silently returns the wrong gene
tt.tr.standardize('BV17S1')
# 'TRBV17'
```

I guess the issue is that Arden-style gene IDs look plausibly like old transitional IMGT IDs (which I think we've discussed in the past), so they are simply being updated rather than looked up! It makes me wonder whether there could/should be a way to specify which nomenclature is being used, or perhaps to raise a flag or warning when an ambiguous older ID is supplied?
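To illustrate the kind of check I have in mind, here is a rough sketch of a pre-filter that could run before `standardize`. Everything in it is hypothetical and not part of `tidytcells`: the `ARDEN_PATTERN` regex, the `ARDEN_TO_IMGT` table (which holds only the two conversions mentioned above) and the `check_arden` helper are my own illustrative names, and I've only considered alpha and beta chain IDs.

```python
import re
import warnings

# Hypothetical pattern for Arden-style alpha/beta IDs such as "AV7S2" or "TCRBV17S1".
ARDEN_PATTERN = re.compile(r"^(?:TCR)?([AB])V(\d+)S(\d+)$", re.IGNORECASE)

# Illustrative table with only the two conversions from this issue;
# a real one would need to be built from the published conversion tables.
ARDEN_TO_IMGT = {
    "AV7S2": "TRAV1-2",
    "BV17S1": "TRBV19",
}


def check_arden(name):
    """Return the IMGT name for a known Arden-style ID, else None.

    Warns when the input looks Arden-style but is not in the table,
    since a naive update of such a name may give the wrong gene.
    """
    match = ARDEN_PATTERN.match(name.strip())
    if match is None:
        return None  # does not look Arden-style; leave it to the normal path
    chain, family, member = match.groups()
    key = f"{chain.upper()}V{family}S{member}"
    imgt = ARDEN_TO_IMGT.get(key)
    if imgt is None:
        warnings.warn(
            f'"{name}" looks like an Arden-style ID with no known conversion; '
            "treat any automatic fix with caution."
        )
    return imgt
```

A caller could try `check_arden(name)` first and only fall back to `tt.tr.standardize(name)` when it returns `None` without warning. An explicit nomenclature argument would probably be cleaner, but this shows the flagging idea.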

On a related note, I also noticed (while trying to standardise a bunch of V-gene specific antibodies) that it doesn't seem to handle the (presumably even older style) TCR names with Greek characters, e.g. using old TRAV1-2 protein names again:

```python
# Greek-letter protein name for TRAV1-2: not recognised
tt.tr.standardize('Vα7.2')
# /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/tidytcells/_utils/warnings.py:12: UserWarning: Failed to standardize "Vα7.2" for species homosapiens: unrecognised gene name. (best attempted fix: "TRV").
#   warn(warning_message)
```

This makes sense, but given that these names persist in product names, it might be worth trying to cope with them in a future release.
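As a rough idea of how such names might at least be detected, the sketch below flags inputs containing Greek letters and looks them up in a small table. The helper names and the `GREEK_NAME_TO_IMGT` table are hypothetical (nothing here exists in `tidytcells`), and the table holds only the single example from this issue.

```python
import unicodedata

# Illustrative table with only the one name mentioned in this issue.
GREEK_NAME_TO_IMGT = {
    "Vα7.2": "TRAV1-2",
}


def contains_greek(name):
    """Return True if any character in the name is a Greek letter."""
    return any("GREEK" in unicodedata.name(ch, "") for ch in name)


def lookup_greek_name(name):
    """Return the IMGT name for a known Greek-letter protein name, else None."""
    # Normalise so that composed and decomposed forms compare equal.
    cleaned = unicodedata.normalize("NFC", name.strip())
    if not contains_greek(cleaned):
        return None
    return GREEK_NAME_TO_IMGT.get(cleaned)
```

Even without a full table, `contains_greek` alone could drive a more specific warning than the current "unrecognised gene name".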
