-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathann_to_bed.py
More file actions
41 lines (32 loc) · 1.41 KB
/
ann_to_bed.py
File metadata and controls
41 lines (32 loc) · 1.41 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
"""
Annotation to BED | RPINerd, 01/18/25
Given the input from:
https://hgdownload.soe.ucsc.edu/goldenPath/hgXX/database/rmsk.txt.gz,
convert the file to a simple BED format with a summary description 4th column
Input file columns:
bin|swScore|milliDiv|milliDel|milliIns|GenoName|genoStart|genoEnd|genoLeft|strand|repName|repClass|repFamily|repStart|repEnd|repLeft|id
BED info column:
rep_type|total_len|unit_len|rep_len|rep_base
"""
import sys
from pathlib import Path
with Path(sys.argv[1]).open("r", encoding="utf-8") as ref_file, Path("out.bed", "w").open("w", encoding="utf-8") as output:
for line in ref_file:
columns = line.split("\t")
genome_name = columns[5]
genome_start = int(columns[6])
genome_end = int(columns[7])
rep_base = str(columns[10])
rep_type = columns[11]
total_len = genome_end - genome_start
if rep_type == "Simple_repeat":
unit_len = len(rep_base.strip("()n"))
rep_len = int(total_len / unit_len)
elif rep_type in {"Low_complexity", "Satellite"}:
unit_len = "."
rep_len = "."
else:
continue
info_column = "|".join([rep_type, str(total_len), str(unit_len), str(rep_len), rep_base])
bed_entry = "\t".join([genome_name, str(genome_start), str(genome_end), info_column])
output.write(bed_entry + "\n")