| ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> | ||
| ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of \begin{environment-name} | ||
| Samples | ||
| \end{environment-name} With Data"> |
Unintended \begin{environment-name} … edit here?
| It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted by the presence of END. | ||
| That is, if there exists a sample or allele in which the END computed for that SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record that VCF. | ||
| This approachs maintains backwards compatibility for unproblematic VCFs while attempting to minimise the probability of downstream data errors by making problematic records not valid for earlier versions of VCF (END was required for $<$*$>$ symbolic alleles). |
"approachs" should be "approach" ("This approach maintains ...").
| Those same tools will incorrectly interpret the size of the smaller symbolic structural variants and $<$*$>$ symbolic alleles when END is present. | ||
| It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted by the presence of END. | ||
| That is, if there exists a sample or allele in which the END computed for that SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record that VCF. |
I find the current wording confusing. May I suggest rephrasing along the following lines:
- Clarify that END is a derived field. If it is absent, it can be computed in such and such way. (Therefore, not deprecated. Using the term deprecated raises unnecessary doubt: should newly written software still support END? The answer is yes, it must remain supported. So it’s better to avoid language that implies otherwise.)
- Clarify the handling of inconsistencies. I do not fully understand what the other paragraphs are trying to convey. My interpretation is that they intend to describe what happens if END is computed incorrectly or conflicts with the primary information. Practically speaking, the responsibility lies with the producer to ensure consistency, and each program may choose how to handle discrepancies. If an analysis relies on the END tag, it will not recompute it from the primary fields (otherwise we would not need END in the first place). Conversely, if an analysis works directly from the primary fields, it is expected to ignore END, since END is derived.
- Clarify the comparison of END and LEN. If a comparison between END and LEN is important, the text should explain explicitly in what ways the two differ and in what ways they are equivalent. Although I am fairly familiar with the VCF format, the current paragraph did not make this distinction clear.
The issue is that if an analysis relies on END and multiple ALT alleles end at different positions, the analysis will be silently wrong. E.g., if a 4.5 VCF has something like POS=10 ALT=<DEL>,<DUP>,<*> SVLEN=10,20;END=40 LEN=.,.30, then the SVs will be interpreted as 30bp in length when they are actually shorter.
END is fine until you have multiple ALTs with different lengths. END is deprecated in the literal sense of it not being the preferred field to use. Should END be written in a fully 4.5-compliant ecosystem? No. It's redundant and unnecessary. Will we ever have a fully 4.5-compliant ecosystem? Also no, hence the wording in this PR around still writing it.
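To make the failure mode concrete, here is a minimal Python sketch of the check described above. It is illustrative only: the function names are hypothetical, and it assumes a legacy reader derives one length per record as END - POS (treating POS as the padding base) and applies it to every symbolic SV allele.

```python
from typing import Optional, Sequence


def length_implied_by_end(pos: int, end: int) -> int:
    # Assumption: a pre-4.5 reader treats POS as the padding base, so the
    # length it assigns to every symbolic SV allele in the record is END - POS.
    return end - pos


def end_misrepresents_some_allele(
    pos: int, svlens: Sequence[Optional[int]], end: int
) -> bool:
    # None stands for a missing SVLEN entry ("."), e.g. for the <*> allele.
    implied = length_implied_by_end(pos, end)
    return any(n is not None and abs(n) != implied for n in svlens)


# Record from the example above: POS=10, SVLEN=10,20 and END=40.
print(length_implied_by_end(10, 40))                          # 30
print(end_misrepresents_some_allele(10, [10, 20, None], 40))  # True
```

A writer following the recommendation in this PR could use a check like this to decide whether emitting END is safe; when it returns True, any single END value would misstate the length of at least one allele.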
> I do not fully understand what the other paragraphs are trying to convey.
They're conveying that there are 4.5 records that pre-4.5 software relying on END will misinterpret, so results will silently be incorrect. Writing or not writing END doesn't change the fact that this is a backwards-incompatible change - it just changes what it is that breaks. Would it help if I changed this to a recommendation that if you want pre-4.5 compatibility then don't write symbolic SVs in the same VCF record as gVCF <*> blocks?
Addresses concerns raised in #784