RunpengLuo
Collaborator

combine_counts.py always uses the str type for #CHR, as shown below.

bbs = [pd.read_table(bb, dtype={"#CHR": str}) for bb in outfiles]
big_bb = pd.concat(bbs)
big_bb = big_bb.sort_values(by=["#CHR", "START", "SAMPLE"])

When nochr is enabled (the reference and BAM files have no chr prefix), to_dataframe() in rd_gccorrect.py automatically infers an int64 dtype for the #CHR column. That is incompatible with the str-typed #CHR column of the dataframe bb in the merge shown below, which is the failure reported in issue #236.

bb = bb.merge(
    BedTool.from_dataframe(bb[["#CHR", "START", "END"]].drop_duplicates())
    .nucleotide_content(fi=ref_genome)
    .to_dataframe(disable_auto_names=True)
    .rename(
        columns={
            "#1_usercol": "#CHR",
            "2_usercol": "START",
            "3_usercol": "END",
            "5_pct_gc": "GC",
        }
    )[["#CHR", "START", "END", "GC"]]
)
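
The mismatch can be reproduced with pandas alone. The sketch below is an illustration, not code from the repository: the toy table is hypothetical, and a plain `pd.read_table` call with default dtype inference stands in for the result of to_dataframe(), on the assumption that both infer an integer dtype when chromosome names are purely numeric.

```python
import io

import pandas as pd

# Hypothetical toy bin table with un-prefixed chromosome names.
text = "#CHR\tSTART\tEND\n1\t0\t100\n1\t100\t200\n2\t0\t100\n"

# Read with #CHR forced to str, as in the combine_counts.py snippet.
bb = pd.read_table(io.StringIO(text), dtype={"#CHR": str})

# Read with default inference (stand-in for to_dataframe()): #CHR becomes int64.
gc = pd.read_table(io.StringIO(text))
gc["GC"] = [0.41, 0.52, 0.47]  # made-up GC fractions

# Merging a string-typed key against an integer-typed key is rejected by pandas.
try:
    bb.merge(gc)
    merge_error = None
except ValueError as exc:
    merge_error = str(exc)

print(gc["#CHR"].dtype)
print(merge_error)
```

With chr-prefixed names both columns would hold strings, which is presumably why the problem only appears when nochr is enabled.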

This fix converts the dtype of #CHR to str after the to_dataframe() call.
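
A minimal sketch of that kind of conversion, using pandas only. The toy data is hypothetical, `pd.read_table` with default inference again stands in for to_dataframe(), and `astype` is one way to do the cast; the merged commit may express it differently.

```python
import io

import pandas as pd

# Hypothetical toy bin table with un-prefixed chromosome names.
text = "#CHR\tSTART\tEND\n1\t0\t100\n1\t100\t200\n2\t0\t100\n"

# bb keeps #CHR as str, matching the combine_counts.py convention.
bb = pd.read_table(io.StringIO(text), dtype={"#CHR": str})

# Stand-in for the to_dataframe() result: #CHR is inferred as int64.
gc = pd.read_table(io.StringIO(text))
gc["GC"] = [0.41, 0.52, 0.47]  # made-up GC fractions

# Cast #CHR to str so both sides of the merge share a key dtype.
gc = gc.astype({"#CHR": str})

# The merge now joins on the shared columns #CHR, START and END.
merged = bb.merge(gc)
print(merged)
```

Casting only the #CHR column leaves START, END and GC with their inferred numeric dtypes, so the rest of the merge keys are unaffected.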

@RunpengLuo RunpengLuo merged commit 39b9ffa into master Feb 26, 2025
6 checks passed