Lk local ancestry subsetting #1774
Merged
Commits (13)
a03296b: Added relatedness python script and wdl and python for local ancestry (ekiernan)
b96341b: added to dockstore (ekiernan)
d7d55a4: updated hail docker (ekiernan)
0c7b3b1: changing machine type for testing (ekiernan)
4b6c9e7: removed tmp_dir (ekiernan)
269896c: adding check point args (ekiernan)
5d2f90a: Update SubsetPhasedVcfsForFLARE.wdl (ekiernan)
670db17: removed index files (ekiernan)
4b8f34b: remove tbi output (ekiernan)
1351c17: adding task for indexing (ekiernan)
7dc5f70: Merge branch 'develop' into lk_local_ancestry_subsetting (ekiernan)
3738fc6: Create SubsetPhasedVcfsForFLARE.changelog.md (ekiernan)
f45f01b: Merge branch 'lk_local_ancestry_subsetting' of https://github.com/bro… (ekiernan)
```python
"""
This script processes Variant Call Format (VCF) files along with Principal
Component Analysis (PCA) projection scores to identify related samples in
genetic datasets. It creates one list:

1. A list of related sample pairs.

NOTE: This script does NOT generate a pruned list of samples to remove from
the first list (e.g., via the Maximal Independent Set (MIS) method) to
account for relatedness in downstream analyses.
"""

import argparse
import logging
import os
import subprocess
import tarfile

import hail as hl


def hail_init(executor_memory: str, executor_cores: str, driver_cores: str,
              driver_memory: str, reference_genome: str) -> None:
    """
    Initialize Hail with specific Apache Spark configuration settings.

    Parameters:
        executor_memory (str): Memory assigned to each Spark executor (e.g., '4g').
        executor_cores (str): Number of cores assigned to each Spark executor.
        driver_cores (str): Number of cores assigned to the Spark driver.
        driver_memory (str): Memory assigned to the Spark driver node.
        reference_genome (str): Reference genome identifier (e.g., 'GRCh38').
    """
    spark_conf = {
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": executor_cores,
        "spark.driver.memory": driver_memory,
        "spark.driver.cores": driver_cores,
    }
    hl.init(default_reference=reference_genome, idempotent=True, spark_conf=spark_conf,
            quiet=False, skip_logging_configuration=False)


def parse_arguments() -> argparse.Namespace:
    """
    Parse and validate command-line arguments required for the script.

    Returns:
        argparse.Namespace: An object containing parsed and validated arguments.
    """
    parser = argparse.ArgumentParser(
        description="Process VCF and PCA projection files for genetic sample optimization.")
    parser.add_argument('--executor_memory', help='Memory assigned to each Spark worker')
    parser.add_argument('--executor_cores', help='CPUs assigned to each Spark worker')
    parser.add_argument('--driver_cores', help='CPUs assigned to the Spark driver')
    parser.add_argument('--driver_memory', help='Memory assigned to the Spark driver node')
    parser.add_argument('--reference_genome', help='Reference genome identifier (e.g., "GRCh38")')
    parser.add_argument('--output_gs_url', help='Output URL for the generated files')
    parser.add_argument('--task_identifier', help='Unique task identifier for the output (e.g., "aou_delta")')
    parser.add_argument('--min_partitions', type=int, help='Minimum number of partitions for Spark parallelization')
    parser.add_argument('--vcf_url', help='URL of the input VCF file')
    parser.add_argument('--pca_scores_url',
                        help='URL of the PCA scores associated with the input VCF (stored in a Hail Table)')
    parser.add_argument('--min_individual_maf', type=float,
                        help='Minimum individual-specific minor allele frequency for relatedness')
    parser.add_argument('--statistics', help='Statistical method for relatedness (e.g., "kin")')
    parser.add_argument('--min_kinship', type=float, help='Minimum kinship threshold for identifying related samples')
    parser.add_argument('--block_size', type=int, help='Block size for matrix operations in Hail')
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_arguments()

    # Initialize Hail with the specified Spark settings
    hail_init(args.executor_memory, args.executor_cores, args.driver_cores,
              args.driver_memory, args.reference_genome)

    # Load the VCF
    variants_mt = hl.import_vcf(args.vcf_url, force_bgz=True, min_partitions=args.min_partitions)

    # If the PCA scores are delivered as a tarball, download and unpack them first
    if args.pca_scores_url.endswith("tar.gz"):
        # Full local path for the tar.gz file
        local_tar_path = "pca_scores.ht.tar.gz"
        local_ht_path = "./"

        # Download the file from GCS
        subprocess.run(['gsutil', 'cp', args.pca_scores_url, local_tar_path], check=True)

        # Decompress the file; pass an explicit extraction path so archive
        # members cannot escape the working directory (directory traversal)
        with tarfile.open(local_tar_path, "r:gz") as tar:
            # List all members of the tar file
            members = tar.getmembers()

            # Assuming the first member is the Hail Table directory we want
            folder_name = members[0].name.split('/')[0]
            tar.extractall(path=local_ht_path)

        # Copy the decompressed Hail Table to the cluster bucket
        subprocess.run(['gsutil', '-m', 'cp', '-r', os.path.join(local_ht_path, folder_name),
                        f'{args.output_gs_url}/'], check=True)

        print(f'{args.output_gs_url}/')
        print(f'{args.output_gs_url}/{folder_name}')

        # Read the Hail Table from the decompressed copy in GCS
        scores_ht = hl.read_table(f'{args.output_gs_url}/{folder_name}')
    else:
        # Directly read the Hail Table from the provided URL
        print(args.pca_scores_url)
        scores_ht = hl.read_table(args.pca_scores_url)

    # Identify related sample pairs using Hail's pc_relate method
    related_samples = hl.pc_relate(variants_mt.GT, min_individual_maf=args.min_individual_maf,
                                   statistics=args.statistics,
                                   scores_expr=scores_ht[variants_mt.col_key].scores,
                                   min_kinship=args.min_kinship, block_size=args.block_size)

    related_samples = related_samples.flatten()

    related_samples.export(f"{args.output_gs_url}/{args.task_identifier}_relatedness.tsv")
    logging.info("Related samples generated successfully")

    logging.info(f"Relatedness outputs successfully written to: {args.output_gs_url}")
```
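As the script's docstring notes, it exports related sample pairs but leaves relatedness pruning to downstream steps. A common follow-up is to compute a (maximal) independent set over the relatedness graph and drop its complement; Hail provides `hl.maximal_independent_set` for doing this at scale. The greedy idea can be sketched with the standard library alone (the sample IDs below are hypothetical, not from this pipeline):

```python
from collections import Counter

def samples_to_remove(related_pairs):
    """Greedy complement of a maximal independent set: repeatedly drop the
    sample that appears in the most remaining related pairs."""
    pairs = {frozenset(p) for p in related_pairs}
    removed = set()
    while pairs:
        # Count how many remaining pairs each sample participates in
        counts = Counter(s for pair in pairs for s in pair)
        worst = max(counts, key=counts.get)  # most-connected sample
        removed.add(worst)
        pairs = {p for p in pairs if worst not in p}
    return removed

# Hypothetical kinship pairs, e.g. parsed from the exported *_relatedness.tsv
pairs = [("NA001", "NA002"), ("NA002", "NA003"), ("NA004", "NA005")]
print(samples_to_remove(pairs))  # NA002 plus one of NA004/NA005
```

Removing the returned samples leaves no related pair intact while keeping as many samples as the greedy heuristic allows.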
all_of_us/local_ancestry/SubsetPhasedVcfsForFLARE.changelog.md (4 additions, 0 deletions)
# aou_10.0.0
2026-02-17 (Date of Last Commit)

* Added initial version of the pipeline
Review comment: The tar.extractall() call on line 95 does not specify a path argument, which means files will be extracted to the current directory. This could potentially lead to directory traversal security vulnerabilities if the tar file contains malicious paths. Consider using tar.extractall(path=local_ht_path) or validating member paths before extraction.
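The concern above can be addressed with an explicit extraction path plus a per-member path check (on Python 3.12+, `tar.extractall(filter="data")` performs similar validation natively). A minimal sketch; the helper name is illustrative, not part of the pipeline:

```python
import os
import tarfile

def safe_extractall(tar_path: str, dest: str) -> None:
    """Extract a tar.gz archive, rejecting members that would escape dest."""
    dest = os.path.realpath(dest)
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Resolve the member's final location and verify it stays under dest
            target = os.path.realpath(os.path.join(dest, member.name))
            if os.path.commonpath([dest, target]) != dest:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(path=dest)
```

In the script, the call site would become `safe_extractall(local_tar_path, local_ht_path)` in place of the bare `tar.extractall()`.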