
feat: Migrating existing configs to include frame & ctf metadata #428


Merged
61 commits merged into main on Apr 25, 2025

Conversation


@manasaV3 manasaV3 commented Feb 1, 2025

Relates to chanzuckerberg/cryoet-data-portal#1647

Depends on: #427

Description

  • Manually migrated the ctf entity
  • Updates the generated rawtlt and mdoc files as fallback
  • Adds migration to support frame_dose_rate

Notes:

  1. There is a dependency on the file run_dose_rate_map.tsv existing in https://github.com/czimaginginstitute/czii-data-portal-processing/tree/main/src/data_portal_processing/jensendb, containing a mapping of all run names to frame dose rates.

  2. The failure of the ingestion config validation is expected due to the invalid stub placeholder, which serves as a reminder to update these frame dose rate values. Do not merge until all the errors have been addressed.

@manasaV3 manasaV3 requested review from jgadling, Bento007 and uermel and removed request for jgadling February 1, 2025 00:56
Base automatically changed from mvenkatakrishnan/fnctf_config to main February 4, 2025 00:34
manasaV3 and others added 11 commits February 21, 2025 13:01
…f_config_mig

# Conflicts:
#	ingestion_tools/dataset_configs/10440.yaml
#	ingestion_tools/dataset_configs/10443.yaml
#	schema/api/v2.0.0/codegen/api_models_materialized.yaml
#	schema/core/v2.0.0/codegen/metadata_materialized.yaml
#	schema/core/v2.0.0/codegen/metadata_models.py
#	schema/core/v2.0.0/common.yaml
#	schema/ingestion_config/v1.0.0/codegen/ingestion_config_models.py
#	schema/ingestion_config/v1.0.0/codegen/ingestion_config_models_materialized.yaml
```diff
@@ -13,7 +12,7 @@ def helper_angles_injection_errors(
     remaining_angles = codomain_angles.copy()
     for domain_angle in domain_angles:
         found_match = False
-        for codomain_angle in codomain_angles:
+        for codomain_angle in remaining_angles:
```
Contributor

We need to check for presence in the remaining angles, since we have cases where two angles are very close (below the tolerance value). If we keep checking against the codomain angles this leads to the incorrect one being identified (which should have already been removed).
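A minimal standalone sketch of why matching against the shrinking `remaining_angles` list matters (the function signature mirrors the snippet above, but the tolerance value and error message format are assumptions, not the exact portal code):

```python
import math


def helper_angles_injection_errors(domain_angles, codomain_angles, domain_name, codomain_name, tolerance=0.05):
    """Report domain angles with no counterpart in the codomain.

    Each codomain angle can be consumed only once: matching against the
    remaining (unconsumed) angles prevents two near-identical domain angles
    from both claiming the same codomain angle.
    """
    errors = []
    remaining_angles = codomain_angles.copy()
    for domain_angle in domain_angles:
        found_match = False
        for codomain_angle in remaining_angles:
            if math.isclose(domain_angle, codomain_angle, abs_tol=tolerance):
                remaining_angles.remove(codomain_angle)
                found_match = True
                break
        if not found_match:
            errors.append(f"No match found for angle {domain_angle} of {domain_name} in {codomain_name}")
    return errors
```

With this one-to-one consumption, two angles closer together than the tolerance (e.g. 0.0 and 0.01) each get matched to a distinct codomain angle instead of both resolving to the same one.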

```diff
@@ -84,7 +84,7 @@ def frames_files(frames_dir: str, filesystem: FileSystemApi) -> List[str]:
     """[Dataset]/[ExperimentRun]/Frames/*"""
     files = filesystem.glob(f"{frames_dir}/*")
     # Exclude mdoc files, add s3 prefix
-    refined_files = ["s3://" + file for file in files if ".mdoc" not in file]
+    refined_files = ["s3://" + file for file in files if ".mdoc" not in file and ".json" not in file]
```
Contributor

Frames metadata files exist now and need to be excluded from the count.
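The effect of the added `.json` filter can be seen with a toy file list (the paths are made up for illustration):

```python
# Toy stand-in for the filesystem.glob() result: a frame, its mdoc sidecar,
# and the new frames metadata json that must not be counted as a frame.
files = [
    "10001/run1/Frames/frame_001.tif",
    "10001/run1/Frames/frame_001.tif.mdoc",
    "10001/run1/Frames/frames_metadata.json",
]

# Same filter as the updated frames_files(): drop mdoc and json files, add the s3 prefix.
refined_files = ["s3://" + file for file in files if ".mdoc" not in file and ".json" not in file]
```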

Comment on lines 125 to +134
```diff
 errors = helper_angles_injection_errors(
-    mdoc_data["TiltAngle"].to_list(),
     tiltseries_metadata_range,
-    "mdoc file",
+    mdoc_data["TiltAngle"].to_list(),
     "tiltseries metadata tilt_range",
+    "mdoc file",
 )
 assert len(errors) == 0, (
     "\n".join(errors)
     + f"\nRange: {tiltseries_metadata['tilt_range']['min']} to {tiltseries_metadata['tilt_range']['max']}, "
     f"with step {tiltseries_metadata['tilt_step']}"
```
Contributor

Needed to invert the logic here, see description of the test. helper_angles_injection_errors checks if domain angles are also in the codomain. Since the mdoc may contain more angles than the tilt series, tilt series metadata range needs to be domain, mdoc needs to be codomain.

Comment on lines +292 to +293
```python
    per_run_float_mapping: dict[str, dict[str, float]],
    per_run_string_mapping: dict[str, dict[str, str]],
```
Contributor

Adding overrides for float and string values here.

Comment on lines +315 to +321
```python
        "min": per_run_float_mapping["tilt_series_min_angle"][run_name],
        "max": per_run_float_mapping["tilt_series_max_angle"][run_name],
    }
    tilt_series["tilt_step"] = per_run_float_mapping["tilt_series_tilt_step"][run_name]
    tilt_series["tilting_scheme"] = per_run_string_mapping["tilt_series_tilting_scheme"][run_name]
    tilt_series["tilt_axis"] = per_run_float_mapping["tilt_series_tilt_axis"][run_name]
    tilt_series["total_flux"] = per_run_float_mapping["tilt_series_total_flux"][run_name]
```
Contributor

Since these values need to be overridden for at least some runs, they are all taken from the override map.

Comment on lines +591 to +626
```python
def get_per_run_float_mapping(input_dir: str) -> dict[str, dict[str, float]]:
    """
    Get parameter to run mapping for all runs. The data for this is sourced from per_run_float_param_map.tsv.
    :param input_dir: directory containing per_run_float_param_map.tsv
    :return: dictionary mapping param names to dicts of per-run values {param_name -> {run_name -> value}}
    """
    with open(os.path.join(input_dir, "per_run_float_param_map.tsv")) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t")
        params = reader.fieldnames
        ret = {param_name: {} for param_name in params if param_name != "run_name"}

        for row in reader:
            for param_name in ret:
                try:
                    ret[param_name][row["run_name"]] = float(row[param_name])
                except ValueError:
                    ret[param_name][row["run_name"]] = 0.0
                    print(f"Invalid value for {param_name} in run {row['run_name']}: {row[param_name]}")
    return ret


def get_per_run_string_mapping(input_dir: str) -> dict[str, dict[str, str]]:
    """
    Get parameter to run mapping for all runs. The data for this is sourced from per_run_string_param_map.tsv.
    :param input_dir: directory containing per_run_string_param_map.tsv
    :return: dictionary mapping param names to dicts of per-run values {param_name -> {run_name -> value}}
    """
    with open(os.path.join(input_dir, "per_run_string_param_map.tsv")) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t")
        params = reader.fieldnames
        ret = {param_name: {} for param_name in params if param_name != "run_name"}

        for row in reader:
            for param_name in ret:
                ret[param_name][row["run_name"]] = row[param_name]
    return ret
```
Contributor

Float and string overrides to use during config generation.
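As a rough illustration of the expected file shape, the float reader wants a tab-separated file with a `run_name` column plus one column per parameter; invalid cells fall back to 0.0 with a warning. The column names below come from the snippet above, but the run names and values are made up:

```python
import csv
import io

# In-memory stand-in for per_run_float_param_map.tsv.
tsv = (
    "run_name\ttilt_series_tilt_axis\ttilt_series_total_flux\n"
    "run_001\t84.5\t120.0\n"
    "run_002\tbad\t100.0\n"
)

reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
params = [p for p in reader.fieldnames if p != "run_name"]
ret = {param_name: {} for param_name in params}
for row in reader:
    for param_name in ret:
        try:
            ret[param_name][row["run_name"]] = float(row[param_name])
        except ValueError:
            # Mirror the fallback in get_per_run_float_mapping: default to 0.0.
            ret[param_name][row["run_name"]] = 0.0
```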

Comment on lines +629 to +662
```python
def get_included_run_map(input_dir: str) -> dict[str, bool]:
    """
    Get map of runs to include/exclude during generation.
    :param input_dir: directory containing included_runs.tsv
    :return: dictionary mapping run names to a bool indicating whether to include {run_name -> include}
    """
    with open(os.path.join(input_dir, "included_runs.tsv")) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t", fieldnames=["run_name", "include"])
        # Skip the header row
        next(reader)

        ret = {}
        for row in reader:
            ret[row["run_name"]] = bool(int(row["include"]))
    return ret


def exclude_runs(data: dict[str, Any], run_include_mapping: dict[str, bool]) -> dict[str, Any]:
    """
    Exclude runs based on the run_include_mapping. If a run is not in the mapping, it is included.
    :param data: The data to process
    :param run_include_mapping: The mapping of run names to include/exclude flags
    :return: The processed data with excluded runs removed
    """
    runs = data["runs"]
    runs_out = []
    for entry in runs:
        if run_include_mapping.get(entry["run_name"], True):
            runs_out.append(entry)
        else:
            print(f"Excluding run {entry['run_name']}")

    data["runs"] = runs_out
    return data
```
Contributor

Boolean flags for each run are used to exclude them from the CZII-json data before any further processing.
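The filtering rule in `exclude_runs` can be shown with toy data (run names and flags are made up): runs explicitly flagged 0 are dropped, and runs absent from the mapping are kept by default.

```python
# Made-up include flags, as parsed from included_runs.tsv (1 = keep, 0 = drop).
run_include_mapping = {"run_a": True, "run_b": False}

data = {"runs": [{"run_name": "run_a"}, {"run_name": "run_b"}, {"run_name": "run_c"}]}

# Same rule as exclude_runs(): a run missing from the mapping defaults to included.
data["runs"] = [entry for entry in data["runs"] if run_include_mapping.get(entry["run_name"], True)]
```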

Comment on lines 677 to 716
```python
def handle_per_run_param_maps(
    data: dict[str, Any],
    run_data_map: dict,
    per_run_mapping: dict[str, dict[str, float]] | dict[str, dict[str, str]],
) -> tuple[dict[str, str | float | None], dict]:
    """
    Handle per-run parameter mappings. The function finds distinct values for each parameter in the per_run_mapping
    and either passes a single value (if only one distinct value is found) or creates a formatted string for the field
    and appends the values to the run_data_map.
    :param data: The data for the dataset
    :param run_data_map: The run data map to store the per-run values
    :param per_run_mapping: The mapping of parameter to run name to value {param_name -> {run_name -> value}}
    :return: tuple of the formatted values for the dataset and the updated run_data_map
    """
    distinct_values = {param: {} for param in per_run_mapping}
    for entry in data:
        run_name = entry["run_name"]
        for param_name in per_run_mapping:
            param_value = per_run_mapping[param_name].get(run_name, 0.0)
            if param_value in distinct_values[param_name]:
                distinct_values[param_name][param_value].append(run_name)
            else:
                distinct_values[param_name][param_value] = [run_name]

    ret = {}

    for param_name, values in distinct_values.items():
        if len(values) == 0:
            ret[param_name] = None
        elif len(values) == 1:
            ret[param_name] = next(iter(values.keys()))
        else:
            key = f"float {{{param_name}}}"
            for param_value, runs in distinct_values[param_name].items():
                for run_name in runs:
                    run_data_map[run_name][param_name] = param_value
            ret[param_name] = key

    return ret, run_data_map
```
Contributor

Converting from per-run-map to single value or mapped-value entry.
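The collapse rule can be sketched standalone: one distinct value across runs becomes a scalar in the config, several distinct values become a `float {param}` template key plus per-run entries destined for the csv. Function, run, and parameter names here are illustrative, not the actual portal code:

```python
def collapse(param_name, per_run_values, run_data_map):
    """Collapse {run_name -> value} into a scalar or a template key, mirroring handle_per_run_param_maps."""
    distinct = set(per_run_values.values())
    if len(distinct) == 1:
        # Single distinct value: emit it directly into the config.
        return distinct.pop()
    # Multiple distinct values: record per-run values and emit a template key.
    for run_name, value in per_run_values.items():
        run_data_map.setdefault(run_name, {})[param_name] = value
    return f"float {{{param_name}}}"


run_data_map = {}
same = collapse("tilt_axis", {"r1": 84.5, "r2": 84.5}, run_data_map)
varied = collapse("total_flux", {"r1": 100.0, "r2": 120.0}, run_data_map)
```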

Comment on lines +816 to +819
```python
    exclude_runs_parent_filter(updated_dataset_config.get("frames", []), runs_without_tilt)
    exclude_runs_parent_filter(updated_dataset_config.get("ctfs", []), runs_without_tilt)
    exclude_runs_parent_filter(updated_dataset_config.get("rawtlts", []), runs_without_tilt)
    exclude_runs_parent_filter(updated_dataset_config.get("collection_metadata", []), runs_without_tilt)
```
Contributor

When a run has no tilt series, all of these should have the same exclude filters.

Comment on lines +824 to +830
```python
    # If there are no tiltseries, remove frames, rawtlts, ctfs, and collection_metadata
    if not updated_dataset_config.get("tiltseries"):
        updated_dataset_config.pop("frames", None)
        updated_dataset_config.pop("rawtlts", None)
        updated_dataset_config.pop("ctfs", None)
        updated_dataset_config.pop("collection_metadata", None)
```

Contributor

When a dataset has no tilt series at all, all of these should not be present.

Contributor Author

Could we have a use case where a frames block was added to a manually generated config but no tiltseries is submitted with it?

Contributor

This is possible, but is it relevant for jensen config generation?

Contributor Author

@manasaV3 manasaV3 Apr 17, 2025

Sorry, this response was meant for the comment below, since it was part of the migration for all configs. 😅

Contributor

But for other configs we never need to run this migration again, or am I misunderstanding how these migrations work?

Do we always apply all migration scripts when we do a new one?

Contributor

Do we always apply all migration scripts when we do a new one?

No, we only migrate configs as needed if they are not the latest, and we apply only the migration steps needed to bring them to the latest.
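A toy sketch of that stepping behavior (the version numbers, step functions, and `migrate_to_latest` helper are made up for illustration; the real migration machinery may differ):

```python
def migrate_to_latest(config, steps):
    """Apply, in order, only the steps whose from-version matches the config's current version."""
    for from_version, to_version, step in steps:
        if config["version"] == from_version:
            config = step(config)
            config["version"] = to_version
    return config


# Two hypothetical steps: 1.0 -> 1.1 adds a frames block, 1.1 -> 1.2 adds a ctfs block.
steps = [
    ("1.0", "1.1", lambda c: {**c, "frames": []}),
    ("1.1", "1.2", lambda c: {**c, "ctfs": []}),
]

old = migrate_to_latest({"version": "1.0"}, steps)     # both steps applied
recent = migrate_to_latest({"version": "1.1"}, steps)  # only the second step applied
```

A config already at the latest version passes through untouched, which is why a one-off migration like this one does not re-run on configs migrated later.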

Contributor Author

The only config that should ideally be impacted by this is 10006.yaml. But since it has a tiltseries block defined, it would still generate the frames block for it.

Comment on lines +8 to +23
```python
if "frames" not in config and len(config.get("tiltseries", [])) > 0:
    config["frames"] = [
        {
            "sources": [
                {"literal": {"value": ["default"]}},
            ],
        },
    ]

if "frames" in config:
    for entry in config["frames"]:
        if "metadata" not in entry:
            entry["metadata"] = {
                "dose_rate": frame_dose_rate,
                "is_gain_corrected": "gain" in config,
            }
```
Contributor

If no tiltseries are present for this dataset, a frames block should not exist. But we need to include an additional check on L17 to only do updates when the block does exist.
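A hedged sketch of the migration with that guard applied (the function name and `frame_dose_rate` argument are illustrative, not the actual migration code):

```python
def migrate_frames_block(config: dict, frame_dose_rate) -> None:
    """Add a frames block only when tiltseries exist; fill metadata only when a frames block is present."""
    if "frames" not in config and len(config.get("tiltseries", [])) > 0:
        config["frames"] = [
            {"sources": [{"literal": {"value": ["default"]}}]},
        ]

    # Guard: iterate only when a frames block actually exists.
    for entry in config.get("frames", []):
        if "metadata" not in entry:
            entry["metadata"] = {
                "dose_rate": frame_dose_rate,
                "is_gain_corrected": "gain" in config,
            }
```

A config with no tiltseries block then passes through with no frames block added and no metadata updates attempted.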

@@ -0,0 +1,91 @@
# Generating ingestion config files for Jensen Datasets
Contributor Author

💜

```diff
@@ -581,6 +674,48 @@ def exclude_runs_parent_filter(entities: list, runs_to_exclude: list[str]) -> None:
     source["parent_filters"]["exclude"]["run"].extend(runs_to_exclude)


+def handle_per_run_param_maps(
```
Contributor Author

If I am understanding this method correctly, it collects the distinct values for each field and adds them to run_data_map. We later set that value on the relevant config field while processing the tiltseries.

What are your thoughts on overriding those values in the tiltseries object of the entry directly here instead? I think it might simplify things a little, and the csv files that get generated would not include columns that aren't referenced in the config. For example, run_data_map/10014.csv now includes columns for tilt_series_max_angle, tilt_series_min_angle, tilt_series_quality_score as well as ts-tilt_range-max, ts-tilt_range-min, ts-tilt_series_quality.

While that would work for overrides made to entities such as tiltseries and tomogram, this kind of processing would still be needed for frame_dose_rate, unless we refactor how frame_dose_rate is handled now.

Contributor

So the suggestion would be to:

  1. Add dose rate to float_fields
  2. Override all tilt series values with actual numbers, do not check for distinctness here
  3. Let to_template_by_run handle the distinctness check

tbh, this function is just a generalization of what was being done for dose rate. I didn't realize there was an extra distinctness check called after to_tiltseries, or that these values were added to the csv twice for that reason. I think we can do away with the distinctness check ahead of to_tiltseries.

Contributor Author

iirc, we currently don't do to_template_by_run for the frame dose_rate. So we should either update how the frames entity is generated or retain the above method for just frame_dose_rate.

uermel and others added 4 commits April 16, 2025 16:58
- Add frame_dose_rate to float_fields
- handle_per_run_param_maps no longer returns a modified run_data_map
- rename ds_per_run_mapping to dataset_data_map.
- Run the gjensen_config.py to regenerate csvs.
```diff
@@ -238,6 +238,7 @@ def to_standardization_config(
     "ts-tilt_range-min",
     "ts-tilt_range-max",
     "ts-total_flux",
+    "frame_dose_rate",
```
Contributor Author

I believe the addition here will not have the desired impact, as we haven't changed how we process the frames entity.

```python
            ret[param_name] = key

    return ret, run_data_map
```
Contributor Author

@manasaV3 manasaV3 Apr 23, 2025

As we are removing the addition to run_data_map, the case where runs in a dataset have different values for frame_dose_rate is not handled correctly; i.e., we are still setting the value in the config as {frame_dose_rate}, but we aren't adding the frame_dose_rate value to the csv.

- generalize handle_per_run_param_maps
- use handle_per_run_param_maps to get frame_dose_rate
- add frame_dose_rate to run_data_map/*csv

Signed-off-by: Bento007 <[email protected]>
@Bento007 Bento007 merged commit f7b790e into main Apr 25, 2025
8 checks passed
@Bento007 Bento007 deleted the mvenkatakrishnan/fnctf_config_mig branch April 25, 2025 20:58