feat: Migrating existing configs to include frame & ctf metadata #428
Conversation
…f_config_mig

# Conflicts:
#	ingestion_tools/dataset_configs/10440.yaml
#	ingestion_tools/dataset_configs/10443.yaml
#	schema/api/v2.0.0/codegen/api_models_materialized.yaml
#	schema/core/v2.0.0/codegen/metadata_materialized.yaml
#	schema/core/v2.0.0/codegen/metadata_models.py
#	schema/core/v2.0.0/common.yaml
#	schema/ingestion_config/v1.0.0/codegen/ingestion_config_models.py
#	schema/ingestion_config/v1.0.0/codegen/ingestion_config_models_materialized.yaml
@@ -13,7 +12,7 @@ def helper_angles_injection_errors(
     remaining_angles = codomain_angles.copy()
     for domain_angle in domain_angles:
         found_match = False
-        for codomain_angle in codomain_angles:
+        for codomain_angle in remaining_angles:
We need to check for presence in the remaining angles, since we have cases where two angles are very close (below the tolerance value). If we keep checking against the full list of codomain angles, this leads to the wrong one being matched (one that should already have been removed).
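The fix can be illustrated with a minimal sketch of the helper (signature and tolerance handling assumed from the surrounding diff, not the actual implementation):

```python
def helper_angles_injection_errors(domain_angles, codomain_angles,
                                   domain_name, codomain_name, tolerance=0.05):
    """Check that every domain angle maps to a *distinct* codomain angle."""
    errors = []
    remaining_angles = codomain_angles.copy()
    for domain_angle in domain_angles:
        found_match = False
        # Search only the remaining angles so a codomain angle that was
        # matched earlier cannot be matched a second time.
        for codomain_angle in remaining_angles:
            if abs(domain_angle - codomain_angle) <= tolerance:
                remaining_angles.remove(codomain_angle)
                found_match = True
                break
        if not found_match:
            errors.append(
                f"No match for {domain_name} angle {domain_angle} in {codomain_name}"
            )
    return errors
```

With two angles closer than the tolerance (e.g. 0.0 and 0.01), iterating over `remaining_angles` pairs each domain angle with a still-unclaimed codomain angle instead of re-matching the first one.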
@@ -84,7 +84,7 @@ def frames_files(frames_dir: str, filesystem: FileSystemApi) -> List[str]:
     """[Dataset]/[ExperimentRun]/Frames/*"""
     files = filesystem.glob(f"{frames_dir}/*")
     # Exclude mdoc files, add s3 prefix
-    refined_files = ["s3://" + file for file in files if ".mdoc" not in file]
+    refined_files = ["s3://" + file for file in files if ".mdoc" not in file and ".json" not in file]
Frames metadata files exist now and need to be excluded from the count.
errors = helper_angles_injection_errors(
    tiltseries_metadata_range,
    mdoc_data["TiltAngle"].to_list(),
    "tiltseries metadata tilt_range",
    "mdoc file",
)
assert len(errors) == 0, (
    "\n".join(errors)
    + f"\nRange: {tiltseries_metadata['tilt_range']['min']} to {tiltseries_metadata['tilt_range']['max']}, "
    f"with step {tiltseries_metadata['tilt_step']}"
)
Needed to invert the logic here, see the description of the test. helper_angles_injection_errors checks whether the domain angles are also in the codomain. Since the mdoc may contain more angles than the tilt series, the tilt series metadata range needs to be the domain and the mdoc needs to be the codomain.
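The direction of the check matters because the subset relation is asymmetric. A condensed illustration (simplified stand-in for the helper, with an assumed tolerance):

```python
def injection_errors(domain, codomain, tol=0.05):
    """Report domain angles that have no unused close match in the codomain."""
    remaining = list(codomain)
    errors = []
    for a in domain:
        match = next((b for b in remaining if abs(a - b) <= tol), None)
        if match is None:
            errors.append(f"no codomain match for angle {a}")
        else:
            remaining.remove(match)  # each codomain angle may be used once
    return errors

mdoc_angles = [-3.0, 0.0, 3.0, 6.0]   # the mdoc may record extra angles
metadata_angles = [-3.0, 0.0, 3.0]    # tilt series metadata range
```

With metadata as the domain the check passes; swapping the arguments would flag the extra mdoc angle as an error even though the data is valid.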
per_run_float_mapping: dict[str, dict[str, float]],
per_run_string_mapping: dict[str, dict[str, str]],
Adding overrides for float and string values here.
    "min": per_run_float_mapping["tilt_series_min_angle"][run_name],
    "max": per_run_float_mapping["tilt_series_max_angle"][run_name],
}
tilt_series["tilt_step"] = per_run_float_mapping["tilt_series_tilt_step"][run_name]
tilt_series["tilting_scheme"] = per_run_string_mapping["tilt_series_tilting_scheme"][run_name]
tilt_series["tilt_axis"] = per_run_float_mapping["tilt_series_tilt_axis"][run_name]
tilt_series["total_flux"] = per_run_float_mapping["tilt_series_total_flux"][run_name]
Since these values need to be overridden for at least some runs, they are all taken from the override map.
def get_per_run_float_mapping(input_dir: str) -> dict[str, dict[str, float]]:
    """
    Get parameter to run mapping for all runs. The data for this is sourced from per_run_float_param_map.tsv.
    :param input_dir: directory containing the TSV file
    :return: dictionary mapping param names to dicts of per-run values {param_name -> {run_name -> value}}
    """
    with open(os.path.join(input_dir, "per_run_float_param_map.tsv")) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t")
        params = reader.fieldnames
        ret = {param_name: {} for param_name in params if param_name != "run_name"}

        for row in reader:
            for param_name in ret:
                try:
                    ret[param_name][row["run_name"]] = float(row[param_name])
                except ValueError:
                    ret[param_name][row["run_name"]] = 0.0
                    print(f"Invalid value for {param_name} in run {row['run_name']}: {row[param_name]}")
    return ret


def get_per_run_string_mapping(input_dir: str) -> dict[str, dict[str, str]]:
    """
    Get parameter to run mapping for all runs. The data for this is sourced from per_run_string_param_map.tsv.
    :param input_dir: directory containing the TSV file
    :return: dictionary mapping param names to dicts of per-run values {param_name -> {run_name -> value}}
    """
    with open(os.path.join(input_dir, "per_run_string_param_map.tsv")) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t")
        params = reader.fieldnames
        ret = {param_name: {} for param_name in params if param_name != "run_name"}

        for row in reader:
            for param_name in ret:
                ret[param_name][row["run_name"]] = row[param_name]
    return ret
Float and string overrides to use during config generation.
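The TSV layout these readers expect can be sketched as follows (column names here are hypothetical examples; the real files carry whatever parameters need overriding):

```python
import csv
import io

# Hypothetical contents of per_run_float_param_map.tsv: one row per run,
# a "run_name" column plus one column per overridable parameter.
tsv = (
    "run_name\ttilt_series_tilt_step\ttilt_series_total_flux\n"
    "run_001\t3.0\t120.0\n"
    "run_002\t2.0\t100.0\n"
)

# Same parsing shape as get_per_run_float_mapping, against an in-memory file.
reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
per_run = {name: {} for name in reader.fieldnames if name != "run_name"}
for row in reader:
    for param_name in per_run:
        per_run[param_name][row["run_name"]] = float(row[param_name])

# per_run["tilt_series_tilt_step"] -> {"run_001": 3.0, "run_002": 2.0}
```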
def get_included_run_map(input_dir: str) -> dict[str, bool]:
    """
    Get map of runs to include/exclude during generation.
    :param input_dir: directory containing included_runs.tsv
    :return: dictionary mapping run names to a bool indicating whether to include {run_name -> include}
    """
    with open(os.path.join(input_dir, "included_runs.tsv")) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t", fieldnames=["run_name", "include"])
        # Skip the header row
        next(reader)

        ret = {}
        for row in reader:
            ret[row["run_name"]] = bool(int(row["include"]))
    return ret


def exclude_runs(data: dict[str, Any], run_include_mapping: dict[str, bool]) -> dict[str, Any]:
    """
    Exclude runs based on the run_include_mapping. If a run is not in the mapping, it is included.
    :param data: The data to process
    :param run_include_mapping: The mapping of run names to include/exclude
    :return: The processed data with excluded runs removed
    """
    runs = data["runs"]
    runs_out = []
    for entry in runs:
        if run_include_mapping.get(entry["run_name"], True):
            runs_out.append(entry)
        else:
            print(f"Excluding run {entry['run_name']}")

    data["runs"] = runs_out
    return data
Boolean flags for each run are used to exclude them from the CZII-json data before any further processing.
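A condensed example of the include/exclude flow (hypothetical run names; mirrors the default-to-included behavior of exclude_runs above):

```python
# Parsed from a hypothetical included_runs.tsv: run name -> 0/1 include flag.
run_include_mapping = {"run_001": True, "run_002": False}

data = {"runs": [{"run_name": "run_001"},
                 {"run_name": "run_002"},
                 {"run_name": "run_003"}]}  # run_003 is absent from the mapping

# Runs missing from the mapping default to included, matching exclude_runs.
data["runs"] = [r for r in data["runs"]
                if run_include_mapping.get(r["run_name"], True)]
```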
def handle_per_run_param_maps(
    data: dict[str, Any],
    run_data_map: dict,
    per_run_mapping: dict[str, dict[str, float]] | dict[str, dict[str, str]],
) -> tuple[dict[str, str | float | None], dict]:
    """
    Handle per-run parameter mappings. The function finds distinct values for each parameter in the per_run_mapping
    and either passes a single value (if only one distinct value is found) or creates a formatted string for the field
    and appends the values to the run_data_map.
    :param data: The data for the dataset
    :param run_data_map: The run data map to store the per-run values
    :param per_run_mapping: The mapping of parameter to run name to value {param_name -> {run_name -> value}}
    :return: tuple of the formatted values for the dataset and the updated run_data_map
    """
    distinct_values = {param: {} for param in per_run_mapping}
    for entry in data:
        run_name = entry["run_name"]
        for param_name in per_run_mapping:
            param_value = per_run_mapping[param_name].get(run_name, 0.0)
            if param_value in distinct_values[param_name]:
                distinct_values[param_name][param_value].append(run_name)
            else:
                distinct_values[param_name][param_value] = [run_name]

    ret = {}

    for param_name, values in distinct_values.items():
        if len(values) == 0:
            ret[param_name] = None
        elif len(values) == 1:
            ret[param_name] = next(iter(values.keys()))
        else:
            key = f"float {{{param_name}}}"
            for param_value, runs in distinct_values[param_name].items():
                for run_name in runs:
                    run_data_map[run_name][param_name] = param_value
            ret[param_name] = key

    return ret, run_data_map
Converting from per-run-map to single value or mapped-value entry.
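The collapse rule can be shown with a stripped-down stand-in (not the real function; names are illustrative): one distinct value becomes a literal, multiple distinct values become a template key plus per-run entries in the run data map.

```python
def collapse_per_run_values(per_run_mapping, run_names, run_data_map):
    """Collapse {param -> {run -> value}} into literals or template keys."""
    ret = {}
    for param_name, by_run in per_run_mapping.items():
        distinct = {}
        for run in run_names:
            # Group runs by value; missing runs default to 0.0 as in the original.
            distinct.setdefault(by_run.get(run, 0.0), []).append(run)
        if len(distinct) == 1:
            ret[param_name] = next(iter(distinct))          # one value for all runs
        else:
            for value, runs in distinct.items():            # per-run overrides
                for run in runs:
                    run_data_map.setdefault(run, {})[param_name] = value
            ret[param_name] = f"float {{{param_name}}}"     # template placeholder
    return ret
```

The placeholder string is later resolved per run from the run data map; a single shared value skips that indirection entirely.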
exclude_runs_parent_filter(updated_dataset_config.get("frames", []), runs_without_tilt)
exclude_runs_parent_filter(updated_dataset_config.get("ctfs", []), runs_without_tilt)
exclude_runs_parent_filter(updated_dataset_config.get("rawtlts", []), runs_without_tilt)
exclude_runs_parent_filter(updated_dataset_config.get("collection_metadata", []), runs_without_tilt)
When a run has no tilt series, all of these should have the same exclude filters.
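The effect on each entity can be sketched like this (a simplified stand-in for exclude_runs_parent_filter; the real helper, visible in the diff further down, also creates missing intermediate keys):

```python
def add_parent_exclude_filter(entities, runs_to_exclude):
    """Append run names to each source's parent exclude filter."""
    for entity in entities:
        for source in entity.get("sources", []):
            exclude = (source.setdefault("parent_filters", {})
                             .setdefault("exclude", {}))
            exclude.setdefault("run", []).extend(runs_to_exclude)

# A frames entity with a single literal source, as in the generated configs.
frames = [{"sources": [{"literal": {"value": ["default"]}}]}]
add_parent_exclude_filter(frames, ["run_007"])
```

After the call, each source carries `parent_filters.exclude.run: ["run_007"]`, so the run without a tilt series is skipped for frames, ctfs, rawtlts, and collection_metadata alike.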
# If there are no tiltseries, remove frames, rawtlts, ctfs, and collection_metadata
if not updated_dataset_config.get("tiltseries"):
    updated_dataset_config.pop("frames", None)
    updated_dataset_config.pop("rawtlts", None)
    updated_dataset_config.pop("ctfs", None)
    updated_dataset_config.pop("collection_metadata", None)
When a dataset has no tilt series at all, all of these should not be present.
Could we have a use case where a frames block was added for a manually generated config, but they don't submit a tiltseries with it?
This is possible, but is it relevant for jensen config generation?
Sorry, this response was meant for the comment below, since it was part of the migration for all configs. 😅
But for other configs we never need to run this migration again, or am I misunderstanding how these migrations work?
Do we always apply all migration scripts when we do a new one?
> Do we always apply all migration scripts when we do a new one?

No, we only migrate configs as needed if they are not the latest, and we only apply the migration steps needed to get them to the latest.
The only config that should ideally be impacted by this is 10006.yaml. But since it has a tiltseries block defined, it would still generate the frames block for it.
if "frames" not in config and len(config.get("tiltseries", [])) > 0:
    config["frames"] = [
        {
            "sources": [
                {"literal": {"value": ["default"]}},
            ],
        },
    ]

if "frames" in config:
    for entry in config["frames"]:
        if "metadata" not in entry:
            entry["metadata"] = {
                "dose_rate": frame_dose_rate,
                "is_gain_corrected": "gain" in config,
            }
If no tiltseries are present for this dataset, a frames block should not exist. But we need an additional check on L17 to only apply updates when the block does exist.
@@ -0,0 +1,91 @@
# Generating ingestion config files for Jensen Datasets |
💜
ingestion_tools/scripts/data_validation/standardized/fixtures/path.py
@@ -581,6 +674,48 @@ def exclude_runs_parent_filter(entities: list, runs_to_exclude: list[str]) -> None:
        source["parent_filters"]["exclude"]["run"].extend(runs_to_exclude)


def handle_per_run_param_maps(
If I am understanding this method correctly, it collects the distinct values for each field and adds them to run_data_map. We later set that value on the relevant config field while processing the tiltseries.
What are your thoughts on overriding those values directly in the tiltseries object of the entry here instead? I think it might simplify things a little, and the generated csv files would not include columns that never get referenced in the config. For example, run_data_map/10014.csv now includes columns for tilt_series_max_angle, tilt_series_min_angle, tilt_series_quality_score as well as ts-tilt_range-max, ts-tilt_range-min, ts-tilt_series_quality.

While that would work for overrides made to entities such as tiltseries and tomogram, this kind of processing would still be needed for frame_dose_rate, unless we refactor how frame_dose_rate is handled now.
So the suggestion would be to:
- Add dose rate to float_fields
- Override all tilt series values with actual numbers, without checking for distinctness here
- Let to_template_by_run handle the distinctness check

tbh, this function is just a generalization of what was already being done for dose rate. I didn't realize there was an extra distinctness check called after to_tiltseries, or that these values were added to the csv twice for that reason. I think we can do away with the distinctness check ahead of to_tiltseries.
iirc, we currently don't do to_template_by_run for frame dose_rate. So we should either update how the frames entity is generated or retain the above method for just frame_dose_rate.
…path.py Co-authored-by: Manasa Venkatakrishnan <[email protected]>
…g' into mvenkatakrishnan/fnctf_config_mig
- Add frame_dose_rate to float_fields
- handle_per_run_param_maps no longer returns a modified run_data_map
- rename ds_per_run_mapping to dataset_data_map
- Run the gjensen_config.py to regenerate csvs
@@ -238,6 +238,7 @@ def to_standardization_config(
     "ts-tilt_range-min",
     "ts-tilt_range-max",
     "ts-total_flux",
+    "frame_dose_rate",
I believe the addition here will not have the desired impact, as we haven't changed how we process the frames entity.
            ret[param_name] = key

    return ret, run_data_map
As we are removing the addition to run_data_map, the case where runs in a dataset have different values for frame_dose_rate is not handled correctly, i.e. we are still setting the value in the config as {frame_dose_rate}, but we aren't adding the frame_dose_rate value to the csv.
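The problem being pointed out can be sketched with a minimal template-resolution example (names are illustrative of the config generation flow, not the exact code):

```python
# Placeholder written into the generated config for a multi-valued field.
template_value = "float {frame_dose_rate}"

# Per-run values that must land in the run_data_map csv for resolution to work.
run_data_map = {"run_001": {"frame_dose_rate": 2.5}}

# Resolving the placeholder for one run via str.format:
resolved = template_value.format(**run_data_map["run_001"])
# If the csv column is missing, .format raises KeyError instead.
```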
- generalize handle_per_run_param_maps
- use handle_per_run_param_maps to get frame_dose_rate
- add frame_dose_rate to run_data_map/*csv

Signed-off-by: Bento007 <[email protected]>
Relates to chanzuckerberg/cryoet-data-portal#1647
Depends on: #427
Description
Notes:
There is a dependency on the file
run_dose_rate_map.tsv
existing in https://github.com/czimaginginstitute/czii-data-portal-processing/tree/main/src/data_portal_processing/jensendb, containing a mapping of all run names to frame dose rates. The failure of the ingestion config validation is expected due to the invalid stub placeholder, as a reminder to update these frame dose rate values. Do not merge until all the errors have been addressed.