Hello folks, very cool module, thanks for your efforts. While parsing a bunch of heterogeneous nexus files I discovered a couple of potential issues, which you can guess from the title. The parsenexus function appears to split taxon names in the TaxLabels field when they begin with an integer. It also splits taxon names on periods (.). Within the function parsetaxa(token, state, tokens, taxa) this results in a taxa dictionary with more entries than are declared in the ntax field of the nexus file, raising the warning at line 436.
Steps to reproduce:
using Phylo
nex = """#NEXUS
Begin taxa;
Dimensions ntax=3;
TaxLabels
_2109_Nesoenas_picturata_Reunion
2108_Nesoenas_picturata_Reunion
AY529948.1
;
End;
Begin trees;
Translate
1 _2109_Nesoenas_picturata_Reunion,
2 2108_Nesoenas_picturata_Reunion,
3 AY529948.1
;
tree TREE1 = [&R] (2:0.2311195,(1:0.18127275,3:0.18127275)3:0.049846749999999995);
End;
"""
open("/tmp/tmp.tre","w") do io
println(io, nex)
end
ts = open(parsenexus, Phylo.path("/tmp/tmp.tre"))
I updated newick.jl (~line 436) to show the taxa dictionary, so the output of the above call is more informative:
if length(taxa) != ntax
@warn "$taxa"
@warn "Taxa list length ($(length(taxa))) and ntax ($ntax) do not match"
end
And the result is:
┌ Warning: Dict("AY529948" => "AY529948","_Nesoenas_picturata_Reunion" => "_Nesoenas_picturata_Reunion",".1" => ".1","2108" => "2108","_2109_Nesoenas_picturata_Reunion" => "_2109_Nesoenas_picturata_Reunion")
└ @ Phylo /home/isaac/tmp/julia/Phylo.jl/src/newick.jl:443
┌ Warning: Taxa list length (5) and ntax (3) do not match
└ @ Phylo /home/isaac/tmp/julia/Phylo.jl/src/newick.jl:444
...
TreeSet with 0 trees, each with 0 tips.
Tree names are
The parser expects 3 taxa but gets 5 because 2108_Nesoenas_picturata_Reunion and AY529948.1 are each split in two: "2108" plus "_Nesoenas_picturata_Reunion", and "AY529948" plus ".1".
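The split looks like what you would get from a lexer that treats a run of digits as a number and only lets identifiers start with a letter or underscore. Here is a toy sketch of that behaviour using a plain regex. To be clear, `TOY_TOKEN` and `toy_tokens` are hypothetical names made up for illustration, and this is only my guess at the kind of rule involved, not Phylo's actual tokenizer:

```julia
# Toy lexer: an ASSUMPTION about the behaviour, not Phylo's actual code.
# A token is a run of digits, a period followed by digits, or an
# identifier that starts with a letter or underscore.
const TOY_TOKEN = r"\d+|\.\d+|[A-Za-z_]\w*"

# Return every token found in a label, in order of appearance.
toy_tokens(label::AbstractString) =
    [String(m.match) for m in eachmatch(TOY_TOKEN, label)]

labels = ["_2109_Nesoenas_picturata_Reunion",
          "2108_Nesoenas_picturata_Reunion",
          "AY529948.1"]

for label in labels
    println(label, " => ", toy_tokens(label))
end
```

Under these rules the first label stays in one piece, while the second and third each break into two, giving the same five keys that show up in the taxa dictionary above.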
I think ape doesn't allow integers to lead taxon names either, so maybe this is a feature and not a bug, but I don't think it's a constraint of the nexus format, as other packages will handle this fine (e.g. toytree and dendropy).
I can work around it, so not a big deal but thought I'd report it.
Thanks again for all your work.
-isaac