Commit 88b2dd9

Merge pull request #1040 from broadinstitute/fix/1037-unique-library-ids

Fix splitcode lookup when library_id_per_sample is unique per sample

2 parents 15940c1 + 54647d4

File tree

6 files changed: +194 additions, -34 deletions

.github/copilot-instructions.md

Lines changed: 31 additions & 0 deletions
````diff
@@ -0,0 +1,31 @@
+# Copilot Instructions
+
+This file provides guidance to GitHub Copilot when working with code in this repository.
+
+**IMPORTANT**: Always read [AGENTS.md](../AGENTS.md) at the start of every session before doing any work. It contains comprehensive project context and development guidelines that are essential for working in this codebase.
+
+## Quick Reference
+
+- **Docker-centric development**: Run tests inside containers, not on host
+- **Import pattern**: `from viral_ngs import core` then `core.samtools.SamtoolsTool()`
+- **Test location**: `tests/unit/<module>/`
+- **Dependencies**: ALL via conda, not pip (see `docker/requirements/*.txt`)
+
+## Running Tests
+
+```bash
+docker run --rm \
+  -v $(pwd):/opt/viral-ngs/source \
+  quay.io/broadinstitute/viral-ngs:main-core \
+  pytest -rsxX -n auto /opt/viral-ngs/source/tests/unit
+```
+
+## Key Files
+
+| File | Purpose |
+|------|---------|
+| [AGENTS.md](../AGENTS.md) | Full AI assistant guidance |
+| [pyproject.toml](../pyproject.toml) | Package configuration |
+| [docker/](../docker/) | Dockerfiles and requirements |
+| [src/viral_ngs/](../src/viral_ngs/) | Source code |
+| [tests/](../tests/) | Test files |
````

AGENTS.md

Lines changed: 15 additions & 0 deletions
````diff
@@ -69,6 +69,21 @@ docker run --rm \
   pytest -rsxX -n auto /opt/viral-ngs/source/tests/unit/classify
 ```

+**Important: Testing source code changes requires re-installing the package.**
+The `-v` mount makes your local files visible on disk, but `viral_ngs` is already installed as a package inside the container image. Python imports resolve to the *installed* copy, not your mounted source files. If you've modified files under `src/viral_ngs/`, you must re-install before running tests:
+
+```bash
+# Run tests with local source changes applied
+docker run --rm \
+  -v $(pwd):/opt/viral-ngs/source \
+  quay.io/broadinstitute/viral-ngs:main-core \
+  bash -c "pip install -e /opt/viral-ngs/source --quiet && pytest -rsxX -n auto /opt/viral-ngs/source/tests/unit"
+```
+
+Changes to test files (`tests/`) and test inputs (`tests/input/`) are picked up automatically via the volume mount — the re-install is only needed when modifying the `src/viral_ngs/` package code.
+
+Running pytest directly on the host will generally not work — most dependencies (bioinformatics tools, conda packages) are only available inside the Docker containers. Always test inside Docker.
+
 **Test conventions:**
 - Uses pytest (not nose or unittest)
 - Test files in `tests/unit/<module>/`
````
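The import-resolution behavior described above can be sanity-checked from inside the container. The following is a minimal standalone sketch (not part of this repository; `resolves_from` is a hypothetical helper) using only the standard library to see which on-disk copy of a module Python would actually import:

```python
import importlib.util

def resolves_from(module_name: str, source_root: str) -> bool:
    """Return True if `module_name` would be imported from under `source_root`.

    Useful for checking whether Python will pick up the mounted source tree
    or the copy baked into the container image.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return False
    return spec.origin.startswith(source_root)

# A stdlib module never lives under a project source mount:
print(resolves_from("json", "/opt/viral-ngs/source"))  # False
```

After `pip install -e /opt/viral-ngs/source`, the editable install points the package's import location at the mounted tree, so a check like this (on the package in question) would flip to True.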

src/viral_ngs/core/splitcode.py

Lines changed: 45 additions & 32 deletions
```diff
@@ -344,21 +344,45 @@ def duplication_check(df, primary_cols, secondary_col, error_message_header=None
     pool_dfs = []
     unmatched_dfs = []

-    for pool in barcodes_df["muxed_pool"].unique():
-        # Get and load splitcode stats report json
-        # Use the full pool name (including run suffix) to match the JSON filename created by splitcode
-        pool_for_file_lookup = pool
-
-        # Try to find and load the splitcode summary JSON file
-        # Add robust error handling since missing/misplaced JSON files are a common issue
+    # Build barcode_group column for JSON lookup: outer barcodes only, no library_id.
+    # Splitcode produces ONE summary JSON per outer barcode group (barcode_1+barcode_2),
+    # but muxed_pool includes library_id_per_sample. When samples have unique library_ids,
+    # there are multiple muxed_pool values per barcode group but only one JSON file.
+    def _barcode_group(row):
+        b1 = row.get("barcode_1", "")
+        b2 = row.get("barcode_2", "")
+        if b2 and str(b2).strip():
+            return f"{b1}-{b2}"
+        return b1
+    barcodes_df["barcode_group"] = barcodes_df.apply(_barcode_group, axis=1)
+
+    for barcode_group in barcodes_df["barcode_group"].unique():
+        samplesheet_rows_for_pool_df = barcodes_df[barcodes_df["barcode_group"] == barcode_group]
+
+        # Find the splitcode summary JSON file for this barcode group.
+        # Try each muxed_pool value in the group until we find a matching JSON.
+        # This handles the case where library_id_per_sample differs per sample:
+        # splitcode produces one JSON named after whichever muxed_pool was used as pool_id.
+        splitcode_summary_file = None
+        tried_patterns = []
         try:
-            summary_pattern = f"{outDir}/{pool_for_file_lookup}_summary.json"
-            matching_files = glob.glob(summary_pattern)
-
-            if not matching_files:
+            for candidate_pool in samplesheet_rows_for_pool_df["muxed_pool"].unique():
+                summary_pattern = f"{outDir}/{candidate_pool}_summary.json"
+                tried_patterns.append(summary_pattern)
+                matching_files = glob.glob(summary_pattern)
+                if matching_files:
+                    splitcode_summary_file = matching_files[0]
+                    if len(matching_files) > 1:
+                        log.warning(f"Multiple summary JSON files match pattern '{summary_pattern}':")
+                        for f in matching_files:
+                            log.warning(f"  - {f}")
+                        log.warning(f"Using first match: {splitcode_summary_file}")
+                    break
+
+            if splitcode_summary_file is None:
                 # JSON file not found - list directory contents for debugging
-                log.error(f"Splitcode summary JSON not found for pool '{pool_for_file_lookup}'")
-                log.error(f"  Expected pattern: {summary_pattern}")
+                log.error(f"Splitcode summary JSON not found for barcode group '{barcode_group}'")
+                log.error(f"  Tried patterns: {tried_patterns}")
                 log.error(f"  Searching in directory: {outDir}")

                 # List all files in the output directory to help debug
@@ -384,20 +408,11 @@ def duplication_check(df, primary_cols, secondary_col, error_message_header=None
                     log.error(f"  Could not list directory contents: {list_err}")

                 raise FileNotFoundError(
-                    f"Splitcode summary JSON not found for pool '{pool_for_file_lookup}'. "
-                    f"Expected file: {summary_pattern}. "
+                    f"Splitcode summary JSON not found for barcode group '{barcode_group}'. "
+                    f"Tried patterns: {tried_patterns}. "
                     f"Check logs above for directory contents."
                 )

-            splitcode_summary_file = matching_files[0]
-
-            # Warn if multiple matches found (shouldn't happen but good to catch)
-            if len(matching_files) > 1:
-                log.warning(f"Multiple summary JSON files match pattern '{summary_pattern}':")
-                for f in matching_files:
-                    log.warning(f"  - {f}")
-                log.warning(f"Using first match: {splitcode_summary_file}")
-
             log.debug(f"Loading splitcode summary from: {splitcode_summary_file}")

             with open(splitcode_summary_file, "r") as f:
@@ -422,14 +437,12 @@ def duplication_check(df, primary_cols, secondary_col, error_message_header=None
                     log.error(f"  Could not read file for debugging: {read_err}")
                 raise
             except Exception as e:
-                log.error(f"Unexpected error loading splitcode summary for pool '{pool_for_file_lookup}'")
-                log.error(f"  File: {splitcode_summary_file if 'splitcode_summary_file' in locals() else 'not determined'}")
+                log.error(f"Unexpected error loading splitcode summary for barcode group '{barcode_group}'")
+                log.error(f"  File: {splitcode_summary_file if splitcode_summary_file else 'not determined'}")
                 log.error(f"  Error type: {type(e).__name__}")
                 log.error(f"  Error message: {e}")
                 raise

-        samplesheet_rows_for_pool_df = barcodes_df[barcodes_df["muxed_pool"] == pool]
-
         # Parse splitcode summary JSON
         # IMPORTANT: The tag_qc array has MULTIPLE entries per barcode tag!
         # Each barcode appears once for each hamming distance level (0, 1, 2, 3).
@@ -475,18 +488,18 @@ def duplication_check(df, primary_cols, secondary_col, error_message_header=None
         else:
             # No reads were processed by splitcode for this pool
             # Create a dataframe with the expected schema but all counts set to 0
-            log.warning(f"Pool {pool} has 0 reads processed by splitcode. Creating empty metrics.")
+            log.warning(f"Barcode group {barcode_group} has 0 reads processed by splitcode. Creating empty metrics.")
             samplesheet_rows_for_pool_hx_df = samplesheet_rows_for_pool_df.copy()
             samplesheet_rows_for_pool_hx_df['count'] = 0
             samplesheet_rows_for_pool_hx_df['count_h1'] = 0

         pool_dfs.append(samplesheet_rows_for_pool_hx_df)

         unmatched_dict = {
-            "sample" : f"{unmatched_name}.{pool}",
+            "sample" : f"{unmatched_name}.{barcode_group}",
             "library_id_per_sample" : list(set(samplesheet_rows_for_pool_hx_df["library_id_per_sample"]))[0],
-            "run" : f"{unmatched_name}.{pool}",
-            "muxed_pool" : pool,
+            "run" : f"{unmatched_name}.{barcode_group}",
+            "muxed_pool" : barcode_group,
             "count" : splitcode_summary["n_processed"] - splitcode_summary["n_assigned"],
             "count_h1" : 0,
             "barcode_1" : list(samplesheet_rows_for_pool_hx_df["barcode_1"])[0],
```
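The core of the fix above is a fallback lookup: splitcode writes a single summary JSON per outer barcode group, named after whichever muxed_pool value it was invoked with, so the code must probe each candidate name until one exists on disk. A simplified standalone sketch of that loop (no logging or multiple-match handling; `find_summary_json` is illustrative, not the repository's function):

```python
import glob
import os
import tempfile

def find_summary_json(out_dir, candidate_pools):
    """Probe '<pool>_summary.json' for each candidate muxed_pool name and
    return the first file that exists, raising with the list of tried
    patterns if none match (mirroring the fallback loop in the diff above)."""
    tried = []
    for pool in candidate_pools:
        pattern = os.path.join(out_dir, f"{pool}_summary.json")
        tried.append(pattern)
        matches = glob.glob(pattern)
        if matches:
            return matches[0]
    raise FileNotFoundError(f"No summary JSON found. Tried: {tried}")

with tempfile.TemporaryDirectory() as d:
    # Three samples share outer barcodes but carry unique library_ids, so
    # there are three muxed_pool values -- yet splitcode wrote only ONE JSON:
    open(os.path.join(d, "ATCGATCG-GCTAGCTA.lL1_summary.json"), "w").close()
    pools = ["ATCGATCG-GCTAGCTA.lL1", "ATCGATCG-GCTAGCTA.lL2", "ATCGATCG-GCTAGCTA.lL3"]
    print(os.path.basename(find_summary_json(d, pools)))
    # -> ATCGATCG-GCTAGCTA.lL1_summary.json
```

The pre-fix code failed here because it looked up exactly one pattern per muxed_pool value and raised as soon as `.lL2` or `.lL3` had no file of its own.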
Lines changed: 36 additions & 0 deletions
```diff
@@ -0,0 +1,36 @@
+{
+  "n_processed": 100000,
+  "n_assigned": 95000,
+  "tag_qc": [
+    {
+      "tag": "Sample1.lL1_R1",
+      "distance": 0,
+      "count": 30000
+    },
+    {
+      "tag": "Sample1.lL1_R1",
+      "distance": 1,
+      "count": 1500
+    },
+    {
+      "tag": "Sample2.lL2_R1",
+      "distance": 0,
+      "count": 45000
+    },
+    {
+      "tag": "Sample2.lL2_R1",
+      "distance": 1,
+      "count": 2000
+    },
+    {
+      "tag": "Sample3.lL3_R1",
+      "distance": 0,
+      "count": 15000
+    },
+    {
+      "tag": "Sample3.lL3_R1",
+      "distance": 1,
+      "count": 1500
+    }
+  ]
+}
```
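As the IMPORTANT comment in splitcode.py notes, `tag_qc` carries one entry per (tag, hamming distance) pair, so per-sample totals require aggregation across distance levels. A minimal sketch over an abbreviated copy of the fixture above (field names taken from the JSON shown; this is illustrative, not the repository's parsing code):

```python
import json
from collections import defaultdict

summary = json.loads("""
{
  "n_processed": 100000,
  "n_assigned": 95000,
  "tag_qc": [
    {"tag": "Sample2.lL2_R1", "distance": 0, "count": 45000},
    {"tag": "Sample2.lL2_R1", "distance": 1, "count": 2000}
  ]
}
""")

# Fold the per-(tag, distance) entries into one map per tag
counts = defaultdict(dict)
for entry in summary["tag_qc"]:
    counts[entry["tag"]][entry["distance"]] = entry["count"]

sample2 = counts["Sample2.lL2_R1"]
total = sum(sample2.values())
unmatched = summary["n_processed"] - summary["n_assigned"]
print(total, unmatched)  # 47000 5000
```

These are exactly the numbers the new unit test asserts for Sample2 (`num_reads_total` of 47000) and the unmatched row (100000 − 95000 = 5000).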
Lines changed: 4 additions & 0 deletions
```diff
@@ -0,0 +1,4 @@
+sample	barcode_1	barcode_2	barcode_3	library_id_per_sample
+Sample1	ATCGATCG	GCTAGCTA	AAAACCCC	L1
+Sample2	ATCGATCG	GCTAGCTA	GGGGTTTT	L2
+Sample3	ATCGATCG	GCTAGCTA	CCCCGGGG	L3
```
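Applied to this sample sheet, the new `_barcode_group` logic collapses three distinct muxed_pool values into one lookup key. A pandas-free sketch (the `<group>.l<library_id>` muxed_pool format is inferred from the fixture filenames in the tests, not confirmed by the diff):

```python
import csv
import io

TSV = """sample\tbarcode_1\tbarcode_2\tbarcode_3\tlibrary_id_per_sample
Sample1\tATCGATCG\tGCTAGCTA\tAAAACCCC\tL1
Sample2\tATCGATCG\tGCTAGCTA\tGGGGTTTT\tL2
Sample3\tATCGATCG\tGCTAGCTA\tCCCCGGGG\tL3
"""

def barcode_group(row):
    # Outer barcodes only; library_id is deliberately excluded (the fix)
    b1, b2 = row["barcode_1"], row.get("barcode_2", "")
    return f"{b1}-{b2}" if b2 and b2.strip() else b1

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
groups = {barcode_group(r) for r in rows}
muxed_pools = {f"{barcode_group(r)}.l{r['library_id_per_sample']}" for r in rows}

print(sorted(groups))   # ['ATCGATCG-GCTAGCTA']
print(len(muxed_pools)) # 3 distinct pools, but only 1 summary JSON on disk
```

One barcode group, three muxed_pool values: iterating over `barcode_group` instead of `muxed_pool` is what lets all three samples share the single JSON.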

tests/unit/core/test_tools_splitcode.py

Lines changed: 63 additions & 2 deletions
```diff
@@ -146,7 +146,7 @@ def test_multi_pool(self):
         # Should have 4 samples + 2 unmatched (one per pool)
         self.assertEqual(len(df), 6)

-        # Verify both pools are present
+        # Verify both barcode groups are present in muxed_pool
         pools = set(df['muxed_pool'].tolist())
         self.assertIn('AAAAAAAA-TTTTTTTT.lLibA', pools)
         self.assertIn('GGGGGGGG-CCCCCCCC.lLibB', pools)
@@ -156,7 +156,8 @@ def test_multi_pool(self):
         self.assertEqual(len(unmatched_rows), 2)

         # Verify LibA unmatched count: 50000 - 48000 = 2000
-        liba_unmatched = df[df['muxed_pool'] == 'AAAAAAAA-TTTTTTTT.lLibA']
+        # Unmatched rows use the barcode group (outer barcodes only) as muxed_pool
+        liba_unmatched = df[df['muxed_pool'] == 'AAAAAAAA-TTTTTTTT']
         liba_unmatched = liba_unmatched[liba_unmatched['sample'].str.contains('Unmatched')].iloc[0]
         self.assertEqual(int(liba_unmatched['num_reads_hdistance0']), 2000)

@@ -190,6 +191,66 @@ def test_append_run_id(self):
         self.assertIn('FLOWCELL123', sample_row['run'])
         self.assertIn('FLOWCELL123', sample_row['muxed_pool'])

+    def test_unique_library_ids_per_sample(self):
+        """Test that lookup works when library_id_per_sample differs per sample
+        within the same barcode_1+barcode_2 group (issue #1037).
+
+        When each sample has a unique library_id_per_sample, each gets its own
+        muxed_pool value, but splitcode produces only ONE summary JSON per outer
+        barcode group. The function must find and reuse that single JSON for all
+        samples in the group.
+        """
+        inDir = viral_ngs.core.file.get_test_input_path(self)
+        sample_sheet = os.path.join(inDir, 'sample_sheet_unique_lib_ids.tsv')
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            csv_out = os.path.join(tmpdir, 'lut.csv')
+
+            # Copy the ONE JSON that exists for this barcode pair
+            # (named with the first muxed_pool value, as splitcode would produce)
+            shutil.copy(
+                os.path.join(inDir, 'ATCGATCG-GCTAGCTA.lL1_summary.json'),
+                os.path.join(tmpdir, 'ATCGATCG-GCTAGCTA.lL1_summary.json')
+            )
+
+            result_path = viral_ngs.core.splitcode.create_splitcode_lookup_table(
+                sample_sheet, csv_out, unmatched_name="Unmatched"
+            )
+
+            # Verify output file was created
+            self.assertTrue(os.path.exists(result_path))
+
+            # Read and validate output CSV
+            import pandas as pd
+            df = pd.read_csv(result_path, dtype=str)
+
+            # Check number of rows (3 samples + 1 unmatched)
+            self.assertEqual(len(df), 4)
+
+            # Verify all sample names present
+            sample_names = set(df['sample'].tolist())
+            self.assertIn('Sample1', sample_names)
+            self.assertIn('Sample2', sample_names)
+            self.assertIn('Sample3', sample_names)
+
+            # Verify unmatched row
+            unmatched_rows = df[df['sample'].str.contains('Unmatched')]
+            self.assertEqual(len(unmatched_rows), 1)
+            unmatched_row = unmatched_rows.iloc[0]
+            # Unmatched count should be n_processed - n_assigned = 100000 - 95000 = 5000
+            self.assertEqual(int(unmatched_row['num_reads_hdistance0']), 5000)
+
+            # Verify read counts for Sample2 (should have highest count)
+            sample2_row = df[df['sample'] == 'Sample2'].iloc[0]
+            self.assertEqual(int(sample2_row['num_reads_hdistance0']), 45000)
+            self.assertEqual(int(sample2_row['num_reads_hdistance1']), 2000)
+            self.assertEqual(int(sample2_row['num_reads_total']), 47000)
+
+            # Verify barcode values are correct
+            self.assertEqual(sample2_row['barcode_1'], 'ATCGATCG')
+            self.assertEqual(sample2_row['barcode_2'], 'GCTAGCTA')
+            self.assertEqual(sample2_row['inline_barcode'], 'GGGGTTTT')
+

 class TestSplitcodeIntegration(TestCaseWithTmp):
     """
```
