david4096
Member

A modest performance improvement, at the cost of output order: annotated records are no longer written in sorted order. Sorted output could be preserved by processing the records in chunks.
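To illustrate the chunking idea, here is a minimal sketch, not the code in this PR. It assumes records can be annotated independently; `annotate_record`, `chunked`, and `annotate_in_order` are hypothetical names, and the annotation itself is a placeholder. The point is that `Executor.map` returns results in input order, so each chunk is emitted in the order it was read.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def annotate_record(record):
    # Hypothetical stand-in for the per-record VRS annotation work.
    return f"{record};VRS_Allele_IDs=<placeholder>"


def annotate_in_order(records, chunk_size=1000, max_workers=None):
    """Annotate records in parallel while preserving input order."""
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunked(records, chunk_size):
            # Executor.map yields results in the order of its inputs,
            # so records within a chunk come back in the order they were read.
            yield from pool.map(annotate_record, chunk)
```

Chunk size trades memory against parallelism: only one chunk of results is held at a time, but the pool idles briefly at each chunk boundary.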

Includes commits from #533

performance on small VCF

OLD method

(env2) ➜  vrs-python git:(feature/annotator) ✗ time vrs-annotate vcf --vcf-out test-old.vcf NA12878-chr14-AKT1.vcf.gz
Annotating NA12878-chr14-AKT1.vcf.gz with the VCF Annotator...
VCF Annotator finished in 10.25358 seconds
vrs-annotate vcf --vcf-out test-old.vcf NA12878-chr14-AKT1.vcf.gz  7.88s user 2.59s system 98% cpu 10.613 total

NEW method

(env3) ➜  vrs-python git:(feature/thread-ann) ✗ time vrs-annotate vcf --vcf-out test-new.vcf NA12878-chr14-AKT1.vcf.gz
Annotating NA12878-chr14-AKT1.vcf.gz with the VCF Annotator...
VCF Annotator finished in 3.57795 seconds
vrs-annotate vcf --vcf-out test-new.vcf NA12878-chr14-AKT1.vcf.gz  10.65s user 8.09s system 488% cpu 3.835 total

Picking out one record to check that the old (<) and new (>) outputs look the same

< chr14	106779713	.	G	A	50	PASS	AC=1;AF=0.5;AN=2;DP=34;FS=3.133;MQ=250;MQRankSum=4.697;QD=1.47;ReadPosRankSum=2.451;SOR=0.313;FractionInformativeReads=0.971;R2_5P_bias=13.461;VRS_Allele_IDs=ga4gh:VA.R0Y_drBrtNKY97AgFMoOY4XN5SQHOKg2,ga4gh:VA.vpg7ue7_gkI1EL39jv88tdC9V35WqclM	GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB	0/1:12,21:0.636:33:7,9:5,12:46:85,0,45:50,0.00011882,47.602:0,34.77,37.77:8,4,11,10:7,5,11,10
> chr14	106779713	.	G	A	50	PASS	AC=1;AF=0.5;AN=2;DP=34;FS=3.133;MQ=250;MQRankSum=4.697;QD=1.47;ReadPosRankSum=2.451;SOR=0.313;FractionInformativeReads=0.971;R2_5P_bias=13.461;VRS_Allele_IDs=ga4gh:VA.R0Y_drBrtNKY97AgFMoOY4XN5SQHOKg2,ga4gh:VA.vpg7ue7_gkI1EL39jv88tdC9V35WqclM	GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB	0/1:12,21:0.636:33:7,9:5,12:46:85,0,45:50,0.00011882,47.602:0,34.77,37.77:8,4,11,10:7,5,11,10
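Because the new output is unsorted, a line-by-line diff of the two files is noisy. A rough way to check every record rather than one is to compare the data lines as multisets. This is a sketch under the assumption that both files are uncompressed and fit in memory; `data_lines` and `same_records` are hypothetical helper names.

```python
from collections import Counter


def data_lines(vcf_text):
    """Return the non-header, non-empty lines of a VCF given as text."""
    return [
        line
        for line in vcf_text.splitlines()
        if line and not line.startswith("#")
    ]


def same_records(old_text, new_text):
    """True if both VCF texts contain the same records, ignoring order."""
    return Counter(data_lines(old_text)) == Counter(data_lines(new_text))
```

Usage would look like `same_records(open("test-old.vcf").read(), open("test-new.vcf").read())`.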

performance on larger VCF

NEW method

(env3) ➜  vrs-python git:(feature/thread-ann) ✗ time vrs-annotate vcf --vcf-out test-new2.vcf data/ALL.chrX.BI_Beagle.20100804.sites.vcf.gz 
Annotating data/ALL.chrX.BI_Beagle.20100804.sites.vcf.gz with the VCF Annotator...
[W::bcf_hdr_check_sanity] PL should be declared as Number=G
VCF Annotator finished in 213.16505 seconds
vrs-annotate vcf --vcf-out test-new2.vcf   344.08s user 242.52s system 274% cpu 3:33.58 total

OLD method

(env2) ➜  vrs-python git:(feature/annotator) ✗ time vrs-annotate vcf --vcf-out test-old2.vcf data/ALL.chrX.BI_Beagle.20100804.sites.vcf.gz

Annotating data/ALL.chrX.BI_Beagle.20100804.sites.vcf.gz with the VCF Annotator...
[W::bcf_hdr_check_sanity] PL should be declared as Number=G
VCF Annotator finished in 365.42381 seconds
vrs-annotate vcf --vcf-out test-old2.vcf   285.39s user 77.96s system 99% cpu 6:05.75 total
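For reference, the speedups implied by the "VCF Annotator finished in" timings above can be worked out directly. The numbers are copied from the logs; `speedup` is just an illustrative helper.

```python
def speedup(old_seconds, new_seconds):
    """Ratio of old to new wall-clock time (higher means faster)."""
    return old_seconds / new_seconds


small = speedup(10.25358, 3.57795)      # NA12878-chr14-AKT1.vcf.gz
large = speedup(365.42381, 213.16505)   # ALL.chrX.BI_Beagle.20100804.sites.vcf.gz
print(f"small VCF: {small:.2f}x, larger VCF: {large:.2f}x")
# prints "small VCF: 2.87x, larger VCF: 1.71x"
```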

@jsstevenson
Contributor

👍 this is great. I think I'd like to see the output sorted, although it might be worth checking if it's faster/easier to just run the output through bcftools sort.
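One way to try the bcftools route is to post-process the unsorted output. This is a sketch that assumes `bcftools` is installed and on `PATH`; the helper names and file paths are illustrative, not part of vrs-python.

```python
import subprocess


def bcftools_sort_command(unsorted_vcf, sorted_vcf):
    """Build the argument list for `bcftools sort -o <out> <in>`."""
    return ["bcftools", "sort", "-o", sorted_vcf, unsorted_vcf]


def sort_vcf(unsorted_vcf, sorted_vcf):
    """Run bcftools sort; raises CalledProcessError if it fails."""
    subprocess.run(bcftools_sort_command(unsorted_vcf, sorted_vcf), check=True)
```

Timing `sort_vcf("test-new.vcf", "test-new.sorted.vcf")` alongside the annotation run would show whether sorting afterwards eats into the threading gain.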

@bwalsh
Member

bwalsh commented Apr 1, 2025

@david4096 Hey! Good to hear from you.

We did some integration a while ago and noticed the same thing - threading helps. We added threading to our wrapper. I'm curious what parameters (# of threads etc) you used and how much it helped?

https://docs.google.com/presentation/d/1YUTGW3CaXimUE44aMEe9DpP1qN_mESoL7guqDXTweYo/edit#slide=id.p
https://docs.google.com/presentation/d/1hk-c2T2w6X2sh5Dlwqzxi9kjR-c91R2_snyh1q0XCLI/edit#slide=id.p

@david4096
Member Author

Hi @bwalsh!! I put a cursory benchmark in the description above. The code here uses the machine's CPU count as the number of threads, which for me was 12. It took a little less than half the time (which I'm sure could be improved upon).
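As a sketch of how the thread count might be made tunable rather than fixed to the CPU count (an assumption for illustration, not what this PR does), a hypothetical `resolve_worker_count` helper could default to `os.cpu_count()` and accept an override:

```python
import os


def resolve_worker_count(requested=None):
    """Return the requested worker count, or the CPU count if none is given."""
    if requested is not None:
        if requested < 1:
            raise ValueError("worker count must be at least 1")
        return requested
    # os.cpu_count() can return None when the count is undetermined.
    return os.cpu_count() or 1
```

With an override in place, benchmarks at different thread counts would show where the gain levels off.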

if output_vcf_path and vcf_out:
    for k in additional_info_fields:
        record.info[k.value] = [
            value or k.default_value() for value in vrs_field_data[k.value]
Member Author

This part I wasn't sure about

Member

@quinnwai FYI - can you take a look?
