
feat: Migrating existing configs to include frame & ctf metadata #428


Merged
61 commits merged into main on Apr 25, 2025

Conversation


@manasaV3 manasaV3 commented Feb 1, 2025

Relates to chanzuckerberg/cryoet-data-portal#1647

Depends on: #427

Description

  • Manually migrated the ctf entity
  • Updates the generated rawtlt and mdoc files as fallback
  • Adds migration to support frame_dose_rate

Notes:

  1. There is a dependency on the file run_dose_rate_map.tsv existing in https://github.com/czimaginginstitute/czii-data-portal-processing/tree/main/src/data_portal_processing/jensendb, containing a mapping of all run names to frame dose rates.

  2. The failure of the ingestion config validation is expected due to the invalid stub placeholder, which serves as a reminder to update these frame dose rate values. Do not merge until all the errors have been addressed.

@manasaV3 manasaV3 requested review from jgadling, Bento007 and uermel and removed request for jgadling February 1, 2025 00:56
Base automatically changed from mvenkatakrishnan/fnctf_config to main February 4, 2025 00:34
manasaV3 and others added 11 commits February 21, 2025 13:01
…f_config_mig

# Conflicts:
#	ingestion_tools/dataset_configs/10440.yaml
#	ingestion_tools/dataset_configs/10443.yaml
#	schema/api/v2.0.0/codegen/api_models_materialized.yaml
#	schema/core/v2.0.0/codegen/metadata_materialized.yaml
#	schema/core/v2.0.0/codegen/metadata_models.py
#	schema/core/v2.0.0/common.yaml
#	schema/ingestion_config/v1.0.0/codegen/ingestion_config_models.py
#	schema/ingestion_config/v1.0.0/codegen/ingestion_config_models_materialized.yaml
```diff
@@ -13,7 +12,7 @@ def helper_angles_injection_errors(
     remaining_angles = codomain_angles.copy()
     for domain_angle in domain_angles:
         found_match = False
-        for codomain_angle in codomain_angles:
+        for codomain_angle in remaining_angles:
```
Contributor

We need to check for presence in the remaining angles, since we have cases where two angles are very close (below the tolerance value). If we keep checking against the codomain angles this leads to the incorrect one being identified (which should have already been removed).
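A minimal standalone sketch of why matching against the shrinking `remaining_angles` list matters (the function signature mirrors the snippet above, but the tolerance value and error message format are assumptions, not the exact portal code):

```python
import math


def helper_angles_injection_errors(domain_angles, codomain_angles, domain_name, codomain_name, tolerance=0.05):
    """Report domain angles with no counterpart in the codomain.

    Each codomain angle can be consumed only once: matching against the
    remaining (unconsumed) angles prevents two near-identical domain angles
    from both claiming the same codomain angle.
    """
    errors = []
    remaining_angles = codomain_angles.copy()
    for domain_angle in domain_angles:
        found_match = False
        for codomain_angle in remaining_angles:
            if math.isclose(domain_angle, codomain_angle, abs_tol=tolerance):
                remaining_angles.remove(codomain_angle)
                found_match = True
                break
        if not found_match:
            errors.append(f"No match found for angle {domain_angle} of {domain_name} in {codomain_name}")
    return errors
```

With this one-to-one consumption, two angles closer together than the tolerance (e.g. 0.0 and 0.01) each get matched to a distinct codomain angle instead of both resolving to the same one.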

```diff
@@ -84,7 +84,7 @@ def frames_files(frames_dir: str, filesystem: FileSystemApi) -> List[str]:
     """[Dataset]/[ExperimentRun]/Frames/*"""
     files = filesystem.glob(f"{frames_dir}/*")
     # Exclude mdoc files, add s3 prefix
-    refined_files = ["s3://" + file for file in files if ".mdoc" not in file]
+    refined_files = ["s3://" + file for file in files if ".mdoc" not in file and ".json" not in file]
```
Contributor

Frames metadata files exist now and need to be excluded from the count.
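The effect of the added `.json` filter can be seen with a toy file list (the paths are made up for illustration):

```python
# Toy stand-in for the filesystem.glob() result: a frame, its mdoc sidecar,
# and the new frames metadata json that must not be counted as a frame.
files = [
    "10001/run1/Frames/frame_001.tif",
    "10001/run1/Frames/frame_001.tif.mdoc",
    "10001/run1/Frames/frames_metadata.json",
]

# Same filter as the updated frames_files(): drop mdoc and json files, add the s3 prefix.
refined_files = ["s3://" + file for file in files if ".mdoc" not in file and ".json" not in file]
```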

Comment on lines 125 to +134
```diff
 errors = helper_angles_injection_errors(
-    mdoc_data["TiltAngle"].to_list(),
     tiltseries_metadata_range,
-    "mdoc file",
+    mdoc_data["TiltAngle"].to_list(),
     "tiltseries metadata tilt_range",
+    "mdoc file",
 )
 assert len(errors) == 0, (
     "\n".join(errors)
     + f"\nRange: {tiltseries_metadata['tilt_range']['min']} to {tiltseries_metadata['tilt_range']['max']}, "
     f"with step {tiltseries_metadata['tilt_step']}"
```
Contributor

Needed to invert the logic here, see description of the test. helper_angles_injection_errors checks if domain angles are also in the codomain. Since the mdoc may contain more angles than the tilt series, tilt series metadata range needs to be domain, mdoc needs to be codomain.

Comment on lines +292 to +293
```python
    per_run_float_mapping: dict[str, dict[str, float]],
    per_run_string_mapping: dict[str, dict[str, str]],
```
Contributor

Adding overrides for float and string values here.

Comment on lines +315 to +321
```python
        "min": per_run_float_mapping["tilt_series_min_angle"][run_name],
        "max": per_run_float_mapping["tilt_series_max_angle"][run_name],
    }
    tilt_series["tilt_step"] = per_run_float_mapping["tilt_series_tilt_step"][run_name]
    tilt_series["tilting_scheme"] = per_run_string_mapping["tilt_series_tilting_scheme"][run_name]
    tilt_series["tilt_axis"] = per_run_float_mapping["tilt_series_tilt_axis"][run_name]
    tilt_series["total_flux"] = per_run_float_mapping["tilt_series_total_flux"][run_name]
```
Contributor

Since these values need to be overridden for at least some runs, they are all taken from the override map.

Comment on lines +591 to +626
```python
def get_per_run_float_mapping(input_dir: str) -> dict[str, dict[str, float]]:
    """
    Get parameter to run mapping for all runs. The data for this is sourced from per_run_float_param_map.tsv.
    :param input_dir: directory containing per_run_float_param_map.tsv
    :return: dictionary mapping param names to dicts of per-run values {param_name -> {run_name -> value}}
    """
    with open(os.path.join(input_dir, "per_run_float_param_map.tsv")) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t")
        params = reader.fieldnames
        ret = {param_name: {} for param_name in params if param_name != "run_name"}

        for row in reader:
            for param_name in ret:
                try:
                    ret[param_name][row["run_name"]] = float(row[param_name])
                except ValueError:
                    ret[param_name][row["run_name"]] = 0.0
                    print(f"Invalid value for {param_name} in run {row['run_name']}: {row[param_name]}")
    return ret


def get_per_run_string_mapping(input_dir: str) -> dict[str, dict[str, str]]:
    """
    Get parameter to run mapping for all runs. The data for this is sourced from per_run_string_param_map.tsv.
    :param input_dir: directory containing per_run_string_param_map.tsv
    :return: dictionary mapping param names to dicts of per-run values {param_name -> {run_name -> value}}
    """
    with open(os.path.join(input_dir, "per_run_string_param_map.tsv")) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t")
        params = reader.fieldnames
        ret = {param_name: {} for param_name in params if param_name != "run_name"}

        for row in reader:
            for param_name in ret:
                ret[param_name][row["run_name"]] = row[param_name]
    return ret
```
Contributor

Float and string overrides to use during config generation.
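As a rough illustration of the expected file shape, the float reader wants a tab-separated file with a `run_name` column plus one column per parameter; invalid cells fall back to 0.0 with a warning. The column names below come from the snippet above, but the run names and values are made up:

```python
import csv
import io

# In-memory stand-in for per_run_float_param_map.tsv.
tsv = (
    "run_name\ttilt_series_tilt_axis\ttilt_series_total_flux\n"
    "run_001\t84.5\t120.0\n"
    "run_002\tbad\t100.0\n"
)

reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
params = [p for p in reader.fieldnames if p != "run_name"]
ret = {param_name: {} for param_name in params}
for row in reader:
    for param_name in ret:
        try:
            ret[param_name][row["run_name"]] = float(row[param_name])
        except ValueError:
            # Mirror the fallback in get_per_run_float_mapping: default to 0.0.
            ret[param_name][row["run_name"]] = 0.0
```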

Comment on lines +629 to +662
```python
def get_included_run_map(input_dir: str) -> dict[str, bool]:
    """
    Get map of runs to include/exclude during generation.
    :param input_dir: directory containing included_runs.tsv
    :return: dictionary mapping run names to a bool indicating whether to include {run_name -> include}
    """
    with open(os.path.join(input_dir, "included_runs.tsv")) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t", fieldnames=["run_name", "include"])
        # Skip the header row
        next(reader)

        ret = {}
        for row in reader:
            ret[row["run_name"]] = bool(int(row["include"]))
    return ret


def exclude_runs(data: dict[str, Any], run_include_mapping: dict[str, bool]) -> dict[str, Any]:
    """
    Exclude runs based on the run_include_mapping. If a run is not in the mapping, it is included.
    :param data: The data to process
    :param run_include_mapping: The mapping of run names to include/exclude flags
    :return: The processed data with excluded runs removed
    """
    runs = data["runs"]
    runs_out = []
    for entry in runs:
        if run_include_mapping.get(entry["run_name"], True):
            runs_out.append(entry)
        else:
            print(f"Excluding run {entry['run_name']}")

    data["runs"] = runs_out
    return data
```
Contributor

Boolean flags for each run are used to exclude them from the CZII-json data before any further processing.
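The filtering rule in `exclude_runs` can be shown with toy data (run names and flags are made up): runs explicitly flagged 0 are dropped, and runs absent from the mapping are kept by default.

```python
# Made-up include flags, as parsed from included_runs.tsv (1 = keep, 0 = drop).
run_include_mapping = {"run_a": True, "run_b": False}

data = {"runs": [{"run_name": "run_a"}, {"run_name": "run_b"}, {"run_name": "run_c"}]}

# Same rule as exclude_runs(): a run missing from the mapping defaults to included.
data["runs"] = [entry for entry in data["runs"] if run_include_mapping.get(entry["run_name"], True)]
```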

Comment on lines 677 to 716
```python
def handle_per_run_param_maps(
    data: dict[str, Any],
    run_data_map: dict,
    per_run_mapping: dict[str, dict[str, float]] | dict[str, dict[str, str]],
) -> tuple[dict[str, str | float | None], dict]:
    """
    Handle per-run parameter mappings. The function finds distinct values for each parameter in the per_run_mapping
    and either passes a single value (if only one distinct value is found) or creates a formatted string for the field
    and appends the values to the run_data_map.
    :param data: The data for the dataset
    :param run_data_map: The run data map to store the per-run values
    :param per_run_mapping: The mapping of parameter to run name to value {param_name -> {run_name -> value}}
    :return: tuple of the formatted values for the dataset and the updated run_data_map
    """
    distinct_values = {param: {} for param in per_run_mapping}
    for entry in data:
        run_name = entry["run_name"]
        for param_name in per_run_mapping:
            param_value = per_run_mapping[param_name].get(run_name, 0.0)
            if param_value in distinct_values[param_name]:
                distinct_values[param_name][param_value].append(run_name)
            else:
                distinct_values[param_name][param_value] = [run_name]

    ret = {}

    for param_name, values in distinct_values.items():
        if len(values) == 0:
            ret[param_name] = None
        elif len(values) == 1:
            ret[param_name] = next(iter(values.keys()))
        else:
            key = f"float {{{param_name}}}"
            for param_value, runs in distinct_values[param_name].items():
                for run_name in runs:
                    run_data_map[run_name][param_name] = param_value
            ret[param_name] = key

    return ret, run_data_map
```
Contributor

Converting from per-run-map to single value or mapped-value entry.
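The collapse rule can be sketched standalone: one distinct value across runs becomes a scalar in the config, several distinct values become a `float {param}` template key plus per-run entries destined for the csv. Function, run, and parameter names here are illustrative, not the actual portal code:

```python
def collapse(param_name, per_run_values, run_data_map):
    """Collapse {run_name -> value} into a scalar or a template key, mirroring handle_per_run_param_maps."""
    distinct = set(per_run_values.values())
    if len(distinct) == 1:
        # Single distinct value: emit it directly into the config.
        return distinct.pop()
    # Multiple distinct values: record per-run values and emit a template key.
    for run_name, value in per_run_values.items():
        run_data_map.setdefault(run_name, {})[param_name] = value
    return f"float {{{param_name}}}"


run_data_map = {}
same = collapse("tilt_axis", {"r1": 84.5, "r2": 84.5}, run_data_map)
varied = collapse("total_flux", {"r1": 100.0, "r2": 120.0}, run_data_map)
```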

Comment on lines +816 to +819
```python
    exclude_runs_parent_filter(updated_dataset_config.get("frames", []), runs_without_tilt)
    exclude_runs_parent_filter(updated_dataset_config.get("ctfs", []), runs_without_tilt)
    exclude_runs_parent_filter(updated_dataset_config.get("rawtlts", []), runs_without_tilt)
    exclude_runs_parent_filter(updated_dataset_config.get("collection_metadata", []), runs_without_tilt)
```
Contributor

When a run has no tilt series, all of these should have the same exclude filters.

Comment on lines +824 to +830
```python
    # If there are no tiltseries, remove frames, rawtlts, ctfs, and collection_metadata
    if not updated_dataset_config.get("tiltseries"):
        updated_dataset_config.pop("frames", None)
        updated_dataset_config.pop("rawtlts", None)
        updated_dataset_config.pop("ctfs", None)
        updated_dataset_config.pop("collection_metadata", None)
```

Contributor

When a dataset has no tilt series at all, all of these should not be present.

Contributor Author

Could we have a use case where a frames block was added to a manually generated config but no tiltseries is submitted with it?

Contributor

This is possible, but is it relevant for jensen config generation?

Contributor Author

@manasaV3 manasaV3 Apr 17, 2025

Sorry, this response was meant for the comment below, since it was part of the migration for all configs. 😅

Contributor

But for other configs we never need to run this migration again, or am I misunderstanding how these migrations work?

Do we always apply all migration scripts when we do a new one?

Contributor

Do we always apply all migration scripts when we do a new one?

No, we only migrate configs as needed if they are not the latest, and we apply only the migration steps needed to bring them to the latest.
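A toy sketch of that stepping behavior (the version numbers, step functions, and `migrate_to_latest` helper are made up for illustration; the real migration machinery may differ):

```python
def migrate_to_latest(config, steps):
    """Apply, in order, only the steps whose from-version matches the config's current version."""
    for from_version, to_version, step in steps:
        if config["version"] == from_version:
            config = step(config)
            config["version"] = to_version
    return config


# Two hypothetical steps: 1.0 -> 1.1 adds a frames block, 1.1 -> 1.2 adds a ctfs block.
steps = [
    ("1.0", "1.1", lambda c: {**c, "frames": []}),
    ("1.1", "1.2", lambda c: {**c, "ctfs": []}),
]

old = migrate_to_latest({"version": "1.0"}, steps)     # both steps applied
recent = migrate_to_latest({"version": "1.1"}, steps)  # only the second step applied
```

A config already at the latest version passes through untouched, which is why a one-off migration like this one does not re-run on configs migrated later.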

Contributor Author

The only config that should ideally be impacted by this is 10006.yaml. But since it has a tiltseries block defined, it would still generate the frames block for it.

Comment on lines +8 to +23
```python
if "frames" not in config and len(config.get("tiltseries", [])) > 0:
    config["frames"] = [
        {
            "sources": [
                {"literal": {"value": ["default"]}},
            ],
        },
    ]

if "frames" in config:
    for entry in config["frames"]:
        if "metadata" not in entry:
            entry["metadata"] = {
                "dose_rate": frame_dose_rate,
                "is_gain_corrected": "gain" in config,
            }
```
Contributor

If no tiltseries are present for this dataset, a frames block should not exist. But we need to include an additional check on L17 to only do updates when the block does exist.
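A hedged sketch of the migration with that guard applied (the function name and `frame_dose_rate` argument are illustrative, not the actual migration code):

```python
def migrate_frames_block(config: dict, frame_dose_rate) -> None:
    """Add a frames block only when tiltseries exist; fill metadata only when a frames block is present."""
    if "frames" not in config and len(config.get("tiltseries", [])) > 0:
        config["frames"] = [
            {"sources": [{"literal": {"value": ["default"]}}]},
        ]

    # Guard: iterate only when a frames block actually exists.
    for entry in config.get("frames", []):
        if "metadata" not in entry:
            entry["metadata"] = {
                "dose_rate": frame_dose_rate,
                "is_gain_corrected": "gain" in config,
            }
```

A config with no tiltseries block then passes through with no frames block added and no metadata updates attempted.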

@@ -0,0 +1,91 @@
# Generating ingestion config files for Jensen Datasets
Contributor Author

💜

```diff
@@ -581,6 +674,48 @@ def exclude_runs_parent_filter(entities: list, runs_to_exclude: list[str]) -> None:
     source["parent_filters"]["exclude"]["run"].extend(runs_to_exclude)


+def handle_per_run_param_maps(
```
Contributor Author

If I am understanding this method correctly, it collects the distinct values for each field and adds them to run_data_map. We later set that value on the relevant config field while processing the tiltseries.

What are your thoughts on overriding those values in the tiltseries object of the entry directly here instead? I think it might simplify things a little, and the csv files that get generated would not include columns that aren't referenced in the config. For example, run_data_map/10014.csv now includes columns for tilt_series_max_angle, tilt_series_min_angle, tilt_series_quality_score as well as ts-tilt_range-max, ts-tilt_range-min, ts-tilt_series_quality.

While that would work for overrides made to entities such as tiltseries and tomogram, this kind of processing would still be needed for frame_dose_rate, unless we refactor how frame_dose_rate is handled now.

Contributor

So the suggestion would be to:

  1. Add dose rate to float_fields
  2. Override all tilt series values with actual numbers, do not check for distinctness here
  3. Let to_template_by_run handle the distinctness check

tbh, this function is just a generalization of what was being done for dose rate. I didn't realize there was an extra distinctness check called after to_tiltseries, or that these values were added to the csv twice for that reason. I think we can do away with the distinctness check ahead of to_tiltseries.

Contributor Author

iirc, we currently don't do to_template_by_run for the frame dose_rate. So we should either update how the frames entity is generated or retain the above method for just frame_dose_rate.

uermel and others added 4 commits April 16, 2025 16:58
- Add frame_dose_rate to float_fields
- handle_per_run_param_maps no longer returns a modified run_data_map
- rename ds_per_run_mapping to dataset_data_map.
- Run the gjensen_config.py to regenerate csvs.
```diff
@@ -238,6 +238,7 @@ def to_standardization_config(
     "ts-tilt_range-min",
     "ts-tilt_range-max",
     "ts-total_flux",
+    "frame_dose_rate",
```
Contributor Author

I believe the addition here will not have the desired impact, as we haven't changed how we process the frames entity.

```python
            ret[param_name] = key

    return ret, run_data_map
```
Contributor Author

@manasaV3 manasaV3 Apr 23, 2025

As we are removing the addition to run_data_map, the case where runs in a dataset have different values for frame_dose_rate is not handled correctly; i.e., we are still setting the value in the config as {frame_dose_rate}, but we aren't adding the frame_dose_rate value to the csv.

- generalize handle_per_run_param_maps
- use handle_per_run_param_maps to get frame_dose_rate
- add frame_dose_rate to run_data_map/*csv

Signed-off-by: Bento007 <[email protected]>
@Bento007 Bento007 merged commit f7b790e into main Apr 25, 2025
8 checks passed
@Bento007 Bento007 deleted the mvenkatakrishnan/fnctf_config_mig branch April 25, 2025 20:58