Hi Yuta,
I was just using tidytcells to fix some TCR names from some older papers, when I noticed that it was getting a few of them wrong. When I looked into it they all seemed to be from older pre-IMGT nomenclatures, particularly that from Arden et al. I made use of these two papers with gene ID conversion tables:
- Lefranc et al. (which is basically the supplementary data following the original IMGT nomenclature approval)
- Arden et al. (which details that older nomenclature, handily providing a bunch of accessions, which are out of date but do link to modern sequence IDs for validation)
I then picked out a couple of example genes:
- TRAV1-2
  - In the Arden nomenclature this is AV7S2
  - (This is what actually made me start looking, when I noticed that some supposedly MAIT alpha chain antibodies bind 'TCRAV7S2')
- TRBV19
  - This is BV17S1 under Arden
  - (This is particularly relevant, as 'TCRBV17S1' is an example given in the tidytcells paper)
Running these through tidytcells gives the wrong answers:
import tidytcells as tt
tt.tr.standardize('AV7S2')
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/tidytcells/_utils/warnings.py:12: UserWarning: Failed to standardize "AV7S2" for species homosapiens: unrecognised gene name. (best attempted fix: "TRAV7-2").
warn(warning_message)
tt.tr.standardize('BV17S1')
'TRBV17'
I guess the issue is that Arden-style gene IDs look plausibly like old transitional IMGT IDs (which I think we've discussed in the past), so they're simply being updated by pattern rather than looked up! It makes me wonder if there could/should be a way to stipulate which nomenclature is being used, or perhaps to raise a flag or warning when an ambiguous older ID is given?
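To make the "look up rather than update" idea concrete, here is a rough sketch of what an explicit Arden lookup might look like. It is not how tidytcells works internally: `ARDEN_TO_IMGT`, `ARDEN_PATTERN` and `resolve_arden` are hypothetical names, and the table only holds the two pairs mentioned above (a real one would need the full conversion tables from the papers).

```python
import re
import warnings

# Hypothetical lookup table, containing only the two pairs from this issue.
ARDEN_TO_IMGT = {
    "AV7S2": "TRAV1-2",
    "BV17S1": "TRBV19",
}

# Arden-style IDs such as "AV7S2" or "TCRBV17S1": an optional "TCR" prefix,
# a chain letter, "V", a family number, "S", then a member number.
ARDEN_PATTERN = re.compile(r"^(?:TCR)?([ABGD]V\d+S\d+)$", re.IGNORECASE)


def resolve_arden(symbol):
    """Return the IMGT name for an Arden-style ID, or None if unresolved."""
    match = ARDEN_PATTERN.match(symbol.strip())
    if match is None:
        # Not Arden-shaped, so leave it to the normal standardisation path.
        return None
    imgt = ARDEN_TO_IMGT.get(match.group(1).upper())
    if imgt is None:
        # Arden-shaped but not in the table: flag it rather than guess.
        warnings.warn(
            f'"{symbol}" looks like an Arden-style ID but is not in the '
            "lookup table; its nomenclature may be ambiguous."
        )
    return imgt


print(resolve_arden("AV7S2"))
print(resolve_arden("TCRBV17S1"))
```

Something like this could run before the existing pattern-based fix whenever the user says the input is Arden nomenclature, with the warning covering the ambiguous cases.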
On a related note, I also noticed (while trying to standardise a bunch of V-gene specific antibodies) that it doesn't seem to handle the (presumably even older style) TCR names with Greek characters, e.g. using old TRAV1-2 protein names again:
tt.tr.standardize('Vα7.2')
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/tidytcells/_utils/warnings.py:12: UserWarning: Failed to standardize "Vα7.2" for species homosapiens: unrecognised gene name. (best attempted fix: "TRV").
warn(warning_message)
This makes sense, but given the retention of these names in product names it might be worth trying to cope with them in a future release.
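As a starting point for the Greek-letter names, a small pre-processing step could transliterate them into Arden-style IDs before any lookup. This is only a sketch: `GREEK_TO_CHAIN`, `GREEK_NAME` and `greek_to_arden_style` are made-up names, and it assumes that the "family.member" numbers in names like 'Vα7.2' line up with Arden's "S" numbering, which holds for this example but which I haven't checked in general.

```python
import re

# Assumed mapping from Greek chain letters to the single-letter chain codes.
GREEK_TO_CHAIN = {"α": "A", "β": "B", "γ": "G", "δ": "D"}

# Names such as "Vα7.2" (family 7, member 2) or "Vβ17" (family only).
GREEK_NAME = re.compile(r"^V([αβγδ])(\d+)(?:\.(\d+))?$")


def greek_to_arden_style(name):
    """Rewrite e.g. 'Vα7.2' as 'AV7S2'; return None if the name doesn't fit."""
    match = GREEK_NAME.match(name.strip())
    if match is None:
        return None
    chain = GREEK_TO_CHAIN[match.group(1)]
    family, member = match.group(2), match.group(3)
    if member is None:
        # No member number given, so only the family can be expressed.
        return f"{chain}V{family}"
    return f"{chain}V{family}S{member}"


print(greek_to_arden_style("Vα7.2"))
```

The result would still need resolving against a proper Arden-to-IMGT table, so this only helps if the older nomenclature is handled by lookup too.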