Commit 107f16c

Authored by mandysulli, samcwiley, willchet, and sammysheep
Dev (#57)
* fix changelog
* Check mira version
* update changelog
* Add prepmirareports (#40)
* creating the prepare-mira-reports subprocess
* fix argument passing
* fix error handling
* using polars to read in samplesheet and convert to df
* add code for creating cov df
* adding utils
* Add lib.rs
* add read df creation
* df filtering and converting for vtype df
* Use a maintained version of YAML or switch to something else later
* Binary-only crate
* De-lint some more, but please check
* Continue to prune unwanted features
* working sample name back into df
* going back to Rust data processing in structures, removing polars; polars df code saved in dataframes_polars.rs but not imported into the app
* clean up the process_txt_with_sample function and add allele table processing
* add in indel processing
* renaming dataframe.rs to data_ingest.rs; setting up structure for writing files out
* added prelim logic to write to json
* getting the stage of reads data
* tweaking header and columns situation
* add csv writing-out logic
* adding in parquet writing logic (dicey right now; working to improve)
* full parq fix
* cleaning up
* read in amended_consensus
* added dais structs
* read in dais-ribosome data to structs
* tweak DaisSeqData struct
* working in dais ref struct reading
* read out coverage info; added platform and runid handling
* update structs
* writing more csv and json files; adding more specific types where I can
* scoop up ref lengths
* stripping down the Cargo.toml a bit
* write out ref_data.json with its unique pattern
* clean up
* clean up paths with @sammysheep
* aavar computing for flu; make dais_vars.json and aavars.csv; started data_processing.rs
* organizing, kind of
* hold qc_statement progress
* tweaking container (#34)
* qc_statement processing
* Var of int format tweak (#36)
* tweaking container (#33)
* columns of output tweaked
* tweaked to specific positions not covered
* create positions_of_interest subprocess
* add variant-of-interest logic for tweaking
* tweaked to print all positions whether codon difference or not
* converting read df for irma summary merge and starting the coverage calcs for irma summary
* worked in sc2-spike protein handling
* fixing the ref names for sc2 situations
* tweak dais file reading to ignore the gen files
* add in sc2 and rsv handling for dais data processing
* set up preliminary irma_summary
* tweaking so that we are picking up failed samples
* minor allele and indel count bug squash
* add subtype logic for flu
* adding subtype into irma_summary
* sc2 subtype handling
* bug squash
* adding metadata into irma_summary
* Vec references to slices
* lints
* tweaks
* tweak
* optimizing
* starting the rebuild of summary with qc values
* updating summary pass/fail logic
* trash
* fix append_with_comma and unused variables
* alignment utils and lints
* so many things
* bug squash, clean up, and pass_fail_handling
* calc tweaks
* clear warning
* starting the nt seq processing for fastas
* nt seq df creation for flu
* add in sc2 and rsv handling of the nt_seq_df
* start pass/fail division process
* added in sc2 Illumina handling for pass/fail dividing
* finish all platform and virus pass/fail processing
* working on AA fasta handling; this code is broken, but needed to store it
* amino acid pass/fail processing for fasta
* little restructure; get all csv files in order
* structure update and GH action tweak
* oops
* clearing errors and warnings
* tweak ingest (took way longer than it should have)
* restructure; set up parq write-out
* coverage parq setup
* alleles, indels, nt_seq, and aa_seq parq handling; tweak aa_seq csv output
* final parq file editing, plus typo fix
* update docs; squashed some bugs
* indel parq bug squash
* squash irma summary parq bug; convert med to i32 for cdp compatibility
* alleles json bug fix
* Maybe I fixed everything this time? Who knows. All-allele vs filtered-allele fix for variants; rounding decimals; irma summary tweak for sc2-wgs
* added in samplesheet.parq
* remove prints
* typo fix
* write out coverage plot jsons (per sample)
* fix sc2 and rsv orf boxes in plots
* tweak for rsv and restruct
* starting the sankey plot (it's black though and I gotta make it pretty); fix reads.json for dashboard
* fix sankey color; starting coverage_to_heatmap.rs
* finished coverage heatmap; need to fix sc2-spike though
* fixed sc2-spike with coverage plots; squashed a much larger underlying bug where hmm_position was not being used
* pass/fail heatmap; need to tweak to handle missing data better
* Add error handling for reading in files; adding empty value handling for coverage_to_heatmap.rs
* clear warnings
* handling missing data in pass_fail_heatmap
* remove unnecessary function
* add barcode distribution json creation step
* fix coloring on coverage plots
* fix filtering for sankey plot
* no longer generating ref_data.json; no longer needed for the coverage plot
* Add statichtmls (#52)
* preliminary writing of the statichtml; logo and tables in, but gray
* Got the barcode_distribution, pass/fail heatmap, and coverage heatmap in there
* getting closer to coverage htmls per sample
* coverage and sankey plot htmls per sample created, but link broken in main html
* fix the coverage.html links in the main html
* tweak main html appearance
* fix fasta links in html
* fixed minor variant columns and create aavars csv for download
* format
* Fix indel table in html and correct nulls in runid and instrument
* formatting
* more formatting
* make tables within scrolling window and sankey color update
* update coverage fig so that specific colors are assigned to flu segments
* sankey block colors for flu segments match the coverage plot line colors now
* fixed centering of everything
* Update Coverage and Sankey html to not sit on each other
* all colors in compliance with CDC color palette
* pretty print statements
* pretty prints
* clear warnings
* clean up
* update changelog
* fix Docker build file
* fix Undetermined subtype handling
* make outdir if it doesn't exist
* clean up
* tweak
* tweak 2
* update documentation
* changelog update

--------
Co-authored-by: William Chettleburgh <zcs0@cdc.gov>
Co-authored-by: Samuel Shepard <vfn4@cdc.gov>
Co-authored-by: Sam Wiley <dzw2@cdc.gov>

* fix date and version
* updating action to trigger on tagging
* fix trigger
* update changelog
* MIRA-NF compatibility fix (#55)
* compatibility tweaks and name fixes
* fixing DAIS_ribosome.seq ingest to work with MIRA-NF and fixing null values for spike protein coverages in sc2 wgs data
* update changelog
* samplesheet schema fix
* update changelog

--------
Co-authored-by: Sam Wiley <dzw2@cdc.gov>
Co-authored-by: William Chettleburgh <zcs0@cdc.gov>
Co-authored-by: Samuel Shepard <vfn4@cdc.gov>
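The samplesheet schema fix above drops the nullable "Barcode #" column when writing Illumina samplesheets to parquet (the diff below matches on `Samplesheet::Illumina(data)`). A minimal sketch of the per-platform idea, assuming hypothetical row types and an ONT variant that the diff does not show:

```rust
// Hedged sketch, not the project's actual types: only the
// `Samplesheet::Illumina(data)` match arm appears in the diff. The ONT
// variant and the field names below are assumptions for illustration.
struct IlluminaRow {
    sample_id: String,
}

struct OntRow {
    sample_id: String,
    barcode: String,
}

enum Samplesheet {
    Illumina(Vec<IlluminaRow>),
    Ont(Vec<OntRow>),
}

// Per-platform parquet column layout: only ONT sheets carry a barcode column,
// so Illumina output no longer includes an all-null "Barcode #" column.
fn column_names(sheet: &Samplesheet) -> Vec<&'static str> {
    match sheet {
        Samplesheet::Illumina(_) => vec!["Sample ID", "Sample Type", "Run ID", "Instrument"],
        Samplesheet::Ont(_) => {
            vec!["Barcode #", "Sample ID", "Sample Type", "Run ID", "Instrument"]
        }
    }
}

fn main() {
    let illumina = Samplesheet::Illumina(vec![IlluminaRow { sample_id: "S1".into() }]);
    let ont = Samplesheet::Ont(vec![OntRow { sample_id: "S1".into(), barcode: "BC01".into() }]);
    println!("Illumina columns: {:?}", column_names(&illumina));
    println!("ONT columns: {:?}", column_names(&ont));
}
```

The earlier approach kept a nullable barcode field filled with `None` for Illumina; removing the column entirely avoids writing an all-null column into the parquet output.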
1 parent a617c1a commit 107f16c

File tree

3 files changed (+9, -5 lines changed)


CHANGELOG.md

Lines changed: 8 additions & 0 deletions
```diff
@@ -3,6 +3,14 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.2.2] - 2025-12-23
+
+- [Amanda Sullivan](https://github.com/mandysulli)
+- [Kristine Lacek](https://github.com/kristinelacek)
+
+### `Fixed`
+
+- [PR #57](https://github.com/CDCgov/mira-oxide/pull/57) - Fix samplesheet.parq schema fix for handling Illumina.
+
 ## [1.2.1] - 2025-12-18
 
 - [Amanda Sullivan](https://github.com/mandysulli)
```

Cargo.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 [package]
 name = "mira-oxide"
-version = "1.2.0"
+version = "1.2.2"
 edition = "2024"
 description = "A set of rusty tools for use in MIRA"
 
```

src/io/write_parquet_files.rs

Lines changed: 0 additions & 4 deletions
```diff
@@ -738,7 +738,6 @@ pub fn write_samplesheet_to_parquet(
     match samplesheet {
         Samplesheet::Illumina(data) => {
             // Extract fields from SamplesheetI
-            let barcode_vec: Vec<Option<String>> = vec![None; data.len()];
             let sample_id_vec: Vec<String> = extract_field(&data, |item| item.sample_id.clone());
             let sample_type_vec: Vec<Option<String>> =
                 extract_field(&data, |item| item.sample_type.clone());
@@ -748,15 +747,13 @@ pub fn write_samplesheet_to_parquet(
             let instrument_vec: Vec<String> = vec![instrument.to_string(); data.len()];
 
             // Convert the vectors into Arrow columns
-            let barcode_array: ArrayRef = Arc::new(StringArray::from(barcode_vec));
             let sample_id_array: ArrayRef = Arc::new(StringArray::from(sample_id_vec));
             let sample_type_array: ArrayRef = Arc::new(StringArray::from(sample_type_vec));
             let runid_array: ArrayRef = Arc::new(StringArray::from(runid_vec));
             let instrument_array: ArrayRef = Arc::new(StringArray::from(instrument_vec));
 
             // Define the schema for the Arrow IPC file
             let fields = vec![
-                Field::new("Barcode #", DataType::Utf8, true),
                 Field::new("Sample ID", DataType::Utf8, false),
                 Field::new("Sample Type", DataType::Utf8, true),
                 Field::new("Run ID", DataType::Utf8, false),
@@ -768,7 +765,6 @@ pub fn write_samplesheet_to_parquet(
             let record_batch = RecordBatch::try_new(
                 schema.clone(),
                 vec![
-                    barcode_array,
                     sample_id_array,
                     sample_type_array,
                     runid_array,
```
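The diff above builds each parquet column by calling a helper, `extract_field`, once per closure. The project's actual implementation is not shown, so here is a minimal sketch of how such a helper might look, with a stand-in `SamplesheetI` row type whose fields are assumptions beyond the two the diff references:

```rust
// Hypothetical sketch of an `extract_field`-style helper: map each samplesheet
// row to one value, collecting a column vector ready for Arrow conversion.
fn extract_field<T, U>(data: &[T], f: impl Fn(&T) -> U) -> Vec<U> {
    data.iter().map(f).collect()
}

// Minimal stand-in for the SamplesheetI row type mentioned in the diff comment.
struct SamplesheetI {
    sample_id: String,
    sample_type: Option<String>,
}

fn main() {
    let data = vec![
        SamplesheetI { sample_id: "S1".into(), sample_type: Some("clinical".into()) },
        SamplesheetI { sample_id: "S2".into(), sample_type: None },
    ];
    // One closure per column, mirroring the calls in the diff. A nullable
    // column naturally falls out as Vec<Option<String>>.
    let sample_id_vec: Vec<String> = extract_field(&data, |item| item.sample_id.clone());
    let sample_type_vec: Vec<Option<String>> =
        extract_field(&data, |item| item.sample_type.clone());
    println!("{sample_id_vec:?} {sample_type_vec:?}");
}
```

Because `arrow`'s `StringArray::from` accepts both `Vec<String>` and `Vec<Option<String>>`, the same helper serves required and nullable columns alike.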
