@willdumm willdumm commented Mar 1, 2025

Previously our only bulk data was heavy chain data, so when a pcp file did not distinguish between heavy and light chains (e.g. it had a single parent sequence column), we assumed it contained heavy chains. Now we have light chain data in the same format, and in theory a single pcp file could contain mixed heavy and light chain bulk data.

To handle this, we process pcp_dfs into a format that always contains heavy/light differentiated _h and _l columns. When the input columns are not differentiated, we automatically infer the chain type for each pcp from its v family name.
If the pcp file already has differentiated _h and _l columns, as with paired data, we assume that no inference is necessary. In that case we only check that all required columns are present and that the heavy chain and light chain v families look like heavy or light families, as claimed.
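
To make the two cases concrete, here is a minimal sketch of the logic described above, written on plain dicts rather than a DataFrame so it runs on its own. It is not the code in this PR: the function names, the column names (`parent`, `child`, `v_family`), the empty-string placeholder for the absent chain, and the assumption that v family names carry IMGT-style prefixes (`IGH` for heavy, `IGK`/`IGL` for light) are all illustrative.

```python
# Hypothetical base column names; the real pcp files may use others.
BASE_COLUMNS = ("parent", "child", "v_family")


def infer_chain(v_family: str) -> str:
    """Return 'h' or 'l' from an IMGT-style prefix (assumed naming)."""
    name = v_family.upper()
    if name.startswith("IGH"):
        return "h"
    if name.startswith(("IGK", "IGL")):
        return "l"
    raise ValueError(f"Cannot infer chain type from v family {v_family!r}")


def differentiate_row(row: dict) -> dict:
    """Return a row that always has _h and _l versions of each column."""
    already_split = any(
        f"{col}_{chain}" in row for col in BASE_COLUMNS for chain in "hl"
    )
    if already_split:
        # Paired-style input: no inference, only validation.
        missing = [
            f"{col}_{chain}"
            for col in BASE_COLUMNS
            for chain in "hl"
            if f"{col}_{chain}" not in row
        ]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        for chain in "hl":
            family = row[f"v_family_{chain}"]
            if infer_chain(family) != chain:
                raise ValueError(
                    f"v_family_{chain}={family!r} does not match its chain"
                )
        return dict(row)

    # Undifferentiated input: infer the chain and leave the other side empty.
    chain = infer_chain(row["v_family"])
    other = "l" if chain == "h" else "h"
    out = {}
    for col in BASE_COLUMNS:
        out[f"{col}_{chain}"] = row[col]
        out[f"{col}_{other}"] = ""
    return out
```

Applied row by row, this lets a single file mix heavy and light chain pcps, since each row's chain is decided independently.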

I also added a more informative error message for the case where masked parent-child nt pairs are identical, since I moved that filtering step to pre-processing in dnsm-experiments.
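
As an illustration of the kind of check this refers to, the sketch below raises an error that reports how many pairs are identical and where the first few occur. The function name, signature, and message wording are hypothetical and are not taken from this PR.

```python
def assert_pairs_differ(parents: list, children: list) -> None:
    """Raise if any (already masked) parent equals its child.

    Hypothetical helper: assumes both lists hold masked nucleotide
    sequences in matching order.
    """
    identical = [
        i for i, (p, c) in enumerate(zip(parents, children)) if p == c
    ]
    if identical:
        shown = ", ".join(str(i) for i in identical[:5])
        raise ValueError(
            f"{len(identical)} of {len(parents)} masked parent-child "
            f"nucleotide pairs are identical (first indices: {shown}). "
            "Filter these out during pre-processing."
        )
```

Reporting counts and indices, rather than a bare assertion failure, points the user directly at the rows that need to be filtered upstream.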

@willdumm willdumm marked this pull request as ready for review March 3, 2025 05:52
@willdumm willdumm requested a review from matsen March 3, 2025 05:52
@willdumm willdumm merged commit 25a3a56 into main Mar 4, 2025
2 checks passed
@willdumm willdumm deleted the wd-vanwinkle-data branch March 4, 2025 19:56