
Commit 0af51f7

Implement unified filter for format-agnostic .1aln and PAF filtering
This implementation enables direct .1aln → .1aln filtering without PAF conversion, while reusing the exact same filtering logic for both formats.

Key changes:

- NEW src/unified_filter.rs: Format-agnostic filtering module
  - extract_1aln_metadata(): Read .1aln into RecordMeta using fastga-rs
  - write_1aln_filtered(): Write filtered records preserving .1aln format
  - filter_file(): Main entry point handling both .1aln and PAF inputs
- MODIFIED src/paf_filter.rs:
  - Made RecordMeta struct and fields public (reused by unified filter)
  - Made apply_filters() method public (shared filtering logic)
- MODIFIED src/aln_filter.rs:
  - Updated to use fastga-rs native reader instead of onecode
  - Simplified architecture leveraging fastga-rs sequence name extraction
- MODIFIED src/fastga_integration.rs:
  - Added align_to_temp_1aln() for direct .1aln output from FastGA
  - Enables FASTA → .1aln workflow without PAF conversion
- MODIFIED src/lib.rs, src/main.rs:
  - Added unified_filter module exports
  - Integration into main workflow pending

Benefits:

- No format conversion for .1aln → .1aln workflow
- Same filtering logic for both formats (no code duplication)
- Format-preserving output by default
- Efficient sequence name extraction using fastga-rs
1 parent d59dd62 commit 0af51f7

8 files changed

Lines changed: 832 additions & 423 deletions


CLAUDE.md

Lines changed: 95 additions & 1 deletion
@@ -257,4 +257,98 @@ awk -F'\t' '{
- Each scaffold is a **chain of nearby alignments** (merged within `-j` distance)
- Scaffold members all share the same `ch:Z:chain_N` tag
- 1:1 filtering applies to **scaffolds**, not individual alignments
- Result: Multiple alignments per chromosome pair, organized into non-overlapping scaffold chains

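
The grouping of alignments into scaffolds by their shared `ch:Z:` tag can be sketched in Rust as below. This is only an illustration of the idea, not code from the repository: `chain_tag` and `group_by_chain` are hypothetical helper names, and the sketch assumes standard PAF layout (12 mandatory tab-separated columns followed by optional tags).

```rust
use std::collections::BTreeMap;

/// Return the value of the `ch:Z:` tag of a PAF line, if present.
/// Optional tags start after the 12 mandatory PAF columns.
pub fn chain_tag(paf_line: &str) -> Option<&str> {
    paf_line
        .split('\t')
        .skip(12)
        .find_map(|field| field.strip_prefix("ch:Z:"))
}

/// Group PAF lines by chain tag; lines without a tag are skipped.
/// (Hypothetical helper: the real filter works on parsed records.)
pub fn group_by_chain<'a>(lines: &[&'a str]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut groups: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &line in lines {
        if let Some(tag) = chain_tag(line) {
            groups.entry(tag).or_default().push(line);
        }
    }
    groups
}
```

Each map entry then corresponds to one scaffold, which is the unit that 1:1 filtering operates on.
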
## Format-Agnostic Filtering (Unified Filter Implementation)

### Overview
The `unified_filter` module implements format-preserving filtering for both .1aln and PAF inputs. The key insight is that **records from both formats are represented by the same internal `RecordMeta` structure and pass through the same filtering logic**.

### Architecture
- **Input**: .1aln or PAF files
- **Internal Representation**: `RecordMeta` structure (defined in `paf_filter.rs`)
- **Filtering**: Reuses `PafFilter::apply_filters()` for both formats
- **Output**: Same format as input (format-preserving)

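
The layering above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `Meta`, its fields, and `keep_by_identity` are hypothetical stand-ins for `RecordMeta` and `PafFilter::apply_filters()`, whose real definitions are not shown here. The point is that the filter only sees the shared record type, so it does not need to know which file format produced the records.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for `RecordMeta`.
#[derive(Debug, Clone, PartialEq)]
pub struct Meta {
    pub rank: usize, // position of the record in the input file (assumed 0-based)
    pub query: String,
    pub target: String,
    pub identity: f64,
}

/// Hypothetical stand-in for the shared filtering step.
/// Returns the surviving records keyed by rank, so a writer for either
/// format can later select the matching records from its own input.
pub fn keep_by_identity(records: &[Meta], min_identity: f64) -> HashMap<usize, Meta> {
    records
        .iter()
        .filter(|m| m.identity >= min_identity)
        .map(|m| (m.rank, m.clone()))
        .collect()
}
```

Keying the result by rank mirrors the `passing_ranks: &HashMap<usize, RecordMeta>` parameter that appears in `write_1aln_filtered`.
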
### Key Components

#### 1. Metadata Extraction (`extract_1aln_metadata`)
```rust
pub fn extract_1aln_metadata<P: AsRef<Path>>(path: P)
    -> Result<(Vec<RecordMeta>, HashMap<String, i64>)>
```
- Opens the .1aln file using `fastga_rs::AlnReader`
- Extracts sequence names using `get_all_seq_names()` (resolves numeric IDs to names)
- Converts alignments to the `RecordMeta` structure
- Calculates identity as `matches / block_len`
- Returns the metadata vector and a name-to-ID mapping for writing

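
The identity calculation and the name/ID bookkeeping described above can be sketched like this. The function names are hypothetical and the sketch does not call fastga-rs; it assumes that numeric sequence IDs are 0-based indices into the list of names, and that an empty block is given identity 0.0. Both are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Identity as the fraction of matching positions in the alignment block.
/// An empty block yields 0.0 to avoid dividing by zero (assumed policy).
pub fn identity(matches: u64, block_len: u64) -> f64 {
    if block_len == 0 {
        0.0
    } else {
        matches as f64 / block_len as f64
    }
}

/// Build a name-to-ID map from an ordered list of sequence names,
/// assuming the ID of a sequence is its 0-based position in the list.
pub fn build_name_to_id(names: &[String]) -> HashMap<String, i64> {
    names
        .iter()
        .enumerate()
        .map(|(i, name)| (name.clone(), i as i64))
        .collect()
}

/// Resolve a numeric ID back to its name; None if the ID is out of range.
pub fn resolve_name(names: &[String], id: i64) -> Option<&str> {
    if id < 0 {
        return None;
    }
    names.get(id as usize).map(|s| s.as_str())
}
```
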
#### 2. Filtered Output Writing (`write_1aln_filtered`)
```rust
pub fn write_1aln_filtered<P1, P2>(
    input_path: P1,
    output_path: P2,
    passing_ranks: &HashMap<usize, RecordMeta>,
    _name_to_id: &HashMap<String, i64>
) -> Result<()>
```
- Re-opens input .1aln for reading
- Iterates through alignments by rank
- Writes only records that passed filtering
- Preserves exact .1aln format (no conversion)

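
The second pass over the input can be sketched as a rank-based selection. This is an illustrative sketch, not the actual writer: `select_by_rank` is a hypothetical helper that works on any iterator of records, and it assumes that a record's rank is its 0-based position in the input. The real function reads and writes .1aln records through fastga-rs instead of collecting into a `Vec`.

```rust
use std::collections::HashMap;

/// Keep only the records whose rank is a key of `passing`.
/// Records are yielded in input order, so the output preserves that order.
pub fn select_by_rank<T, V>(
    records: impl IntoIterator<Item = T>,
    passing: &HashMap<usize, V>,
) -> Vec<T> {
    records
        .into_iter()
        .enumerate()
        .filter(|(rank, _)| passing.contains_key(rank))
        .map(|(_, record)| record)
        .collect()
}
```

Because the records themselves are passed through untouched, nothing about their encoding changes, which is what makes the output format-preserving.
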
#### 3. Main Filter Function (`filter_file`)
```rust
pub fn filter_file<P1, P2>(
    input_path: P1,
    output_path: P2,
    config: &FilterConfig,
    force_paf_output: bool
) -> Result<()>
```
- Detects input format (.1aln vs PAF)
- For .1aln:
  1. Extracts metadata to `RecordMeta`
  2. Calls `PafFilter::apply_filters()` (the same logic used for PAF)
  3. Writes filtered .1aln output (or PAF if `force_paf_output` is set, i.e. the `--paf` flag)
- For PAF: delegates directly to existing `PafFilter::filter_paf()`

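
The dispatch logic above can be sketched as two small decisions: which format the input is, and which format to write. The names `AlnFormat`, `detect_format`, and `output_format` are hypothetical, and detection by file extension is an assumption; the actual module may detect the format differently (for example by inspecting file contents).

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AlnFormat {
    OneAln,
    Paf,
}

/// Guess the input format from the file extension (assumed heuristic).
/// Anything that is not `.1aln` is treated as PAF.
pub fn detect_format(path: &Path) -> AlnFormat {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("1aln") => AlnFormat::OneAln,
        _ => AlnFormat::Paf,
    }
}

/// Output keeps the input format unless PAF output is forced.
pub fn output_format(input: AlnFormat, force_paf_output: bool) -> AlnFormat {
    if force_paf_output {
        AlnFormat::Paf
    } else {
        input
    }
}
```
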
### Coordinate Systems
- The .1aln format uses "contig" coordinates (contigs are the sequences between runs of Ns in the FASTA)
- The fastga-rs `AlnReader` handles coordinate conversion automatically
- Sequence name extraction from the embedded GDB resolves numeric IDs to actual names

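
To make the notion of contig coordinates concrete, here is a small sketch of how a contig-local position maps back to a position in the full FASTA sequence. It is not how fastga-rs implements the conversion (that code is not shown here); `contig_starts` and `contig_to_scaffold` are hypothetical helpers, and 0-based positions are assumed.

```rust
/// Start offsets of each contig within a sequence, where a contig is a
/// maximal run of non-N bases (upper- or lower-case N counts as a gap).
pub fn contig_starts(sequence: &[u8]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut in_contig = false;
    for (i, &base) in sequence.iter().enumerate() {
        let is_n = base == b'N' || base == b'n';
        if !is_n && !in_contig {
            starts.push(i); // first base after a gap (or at sequence start)
        }
        in_contig = !is_n;
    }
    starts
}

/// Convert a position within contig `contig` to a position in the full
/// sequence. Returns None if the contig index does not exist.
pub fn contig_to_scaffold(starts: &[usize], contig: usize, pos: usize) -> Option<usize> {
    starts.get(contig).map(|start| start + pos)
}
```

For the sequence `ACGTNNNNGGCC`, the second contig starts at offset 8, so position 2 within it corresponds to offset 10 in the full sequence.
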
### Code Changes
**src/unified_filter.rs** (NEW)
- Complete implementation of format-agnostic filtering

**src/paf_filter.rs** (MODIFIED)
- Made `RecordMeta` struct and its fields public (were private)
- Made `apply_filters()` method public (was private)

**src/lib.rs** (MODIFIED)
- Added `pub mod unified_filter;`

**src/main.rs** (MODIFIED)
- Added `mod unified_filter;` (integration into the main workflow is pending)

### Usage
```rust
use sweepga::unified_filter::filter_file;
use sweepga::paf_filter::FilterConfig;

// `config` is a `FilterConfig` constructed elsewhere (its fields are not shown here).

// Filter .1aln → .1aln (format-preserving)
filter_file("input.1aln", "output.1aln", &config, false)?;

// Filter .1aln → PAF (with --paf flag)
filter_file("input.1aln", "output.paf", &config, true)?;

// Filter PAF → PAF (delegates to existing PAF filter)
filter_file("input.paf", "output.paf", &config, false)?;
```

### Benefits
- **No format conversion** for .1aln → .1aln workflow
- **Same filtering logic** for both formats (no code duplication)
- **Format-preserving** output by default
- **Efficient** sequence name extraction using fastga-rs
