@Luke-Schreiber Luke-Schreiber commented Oct 13, 2025

Does this PR close any open issues?

Closes #305

Give a longer description of what this PR addresses and why it's needed

  • Table counts (Visit, Medication, etc.) that more closely match the Utah database
  • Logging of table counts
  • Unified script recreatedata.py for recreating data (combining migrate, mock50million, generate_parquets)
  • More accurate blood product amounts and transfusions

Provide pictures/videos of the behavior before and after these changes (optional)

Have you added or updated relevant tests?

  • Yes
  • No changes are needed

Have you added or updated relevant documentation?

  • Yes
  • No changes are needed

Are there any additional TODOs before this PR is ready to go?

N/A

@Luke-Schreiber Luke-Schreiber marked this pull request as ready for review October 16, 2025 19:28

@JackWilb JackWilb left a comment

Small fixes

elif hgb < 8: rbc_units = random.randint(0, 2)
elif hgb < 9 and random.random() < 0.4: rbc_units = 1
elif hgb < 10 and random.random() < 0.25: rbc_units = 1
rbc_units = min(rbc_units, 6)

bug

Comment on lines 773 to 795
transfusion_events = [
    (datetime.strptime(lab["lab_draw_dtm"], DATE_FORMAT), lab)
    for lab in v_labs
    if (not has_surg or datetime.strptime(lab["lab_draw_dtm"], DATE_FORMAT) <= datetime.strptime(surg["surgery_end_dtm"], DATE_FORMAT))
    and (
        (score := (
            (max(0, 10 - float(lab["result_value"])) if lab["result_desc"] in ("HGB", "Hemoglobin") else 0) +
            (max(0, float(lab["result_value"]) - 1) if lab["result_desc"] == "INR" else 0) +
            (max(0, (150000 - float(lab["result_value"])) / 50000) if lab["result_desc"] in ("PLT", "Platelet Count") else 0) +
            (max(0, (150 - float(lab["result_value"])) / 50) if lab["result_desc"] == "Fibrinogen" else 0)
        )) > 0.5 and random.random() < min(0.9, 0.1 + score/12)
    )
]
# Extra chance for intra-op/trauma transfusion event
if has_surg and random.random() < 0.1:
    mid_surg = datetime.strptime(surg["surgery_start_dtm"], DATE_FORMAT) + timedelta(minutes=random.randint(30, int(surg_len*60-10)))
    transfusion_events.append((mid_surg, None))
if has_surg and (("Emergent" in surg_type or "Trauma" in surg_type or surg_len > 4) and random.random() < 0.2):
    mid_surg = datetime.strptime(surg["surgery_start_dtm"], DATE_FORMAT) + timedelta(minutes=random.randint(10, int(surg_len*60-10)))
    transfusion_events.append((mid_surg, None))
# Limit to 1–3 transfusion events per visit
if transfusion_events:
    transfusion_events = random.sample(transfusion_events, min(len(transfusion_events), random.choices([1,2,3],[0.8,0.1,0.05])[0]))

Refactor or comment this. Why is this clinically relevant?
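
For illustration only, here is one way the inline scoring expression in the quoted snippet could be pulled into named, commented helpers. This is a sketch, not the change that was actually made in the PR: the function names (`lab_deficit_score`, `transfusion_probability`, `triggers_transfusion`) are hypothetical, the thresholds and weights are copied unchanged from the quoted lines, and nothing here asserts that those numbers are clinically validated.

```python
import random


def lab_deficit_score(result_desc: str, result_value: float) -> float:
    """Score how far one lab result is past the cutoff used in the quoted diff.

    Each lab row has a single result_desc, so at most one branch applies.
    Cutoffs (10, 1, 150000, 150) are copied from the snippet under review.
    """
    if result_desc in ("HGB", "Hemoglobin"):
        return max(0.0, 10 - result_value)                 # lower value -> higher score
    if result_desc == "INR":
        return max(0.0, result_value - 1)                  # higher value -> higher score
    if result_desc in ("PLT", "Platelet Count"):
        return max(0.0, (150000 - result_value) / 50000)   # lower value -> higher score
    if result_desc == "Fibrinogen":
        return max(0.0, (150 - result_value) / 50)         # lower value -> higher score
    return 0.0


def transfusion_probability(score: float) -> float:
    """Map a score to an event probability: 0 at or below 0.5, capped at 0.9."""
    if score <= 0.5:
        return 0.0
    return min(0.9, 0.1 + score / 12)


def triggers_transfusion(lab: dict, rng=random) -> bool:
    """Decide whether a lab row produces a mock transfusion event.

    Keeps the short-circuit of the original expression, so no random
    number is drawn when the score is at or below 0.5.
    """
    score = lab_deficit_score(lab["result_desc"], float(lab["result_value"]))
    return score > 0.5 and rng.random() < transfusion_probability(score)
```

With helpers like these, the list comprehension's filter would reduce to the surgery-time check plus `triggers_transfusion(lab)`, and each cutoff has a single place where a comment explaining its rationale can live.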

Development

Successfully merging this pull request may close these issues.

Mock dataset resembles Utah dataset counts

3 participants