Updating repo for collaborators #28

@Miking98

Comments from Jason --

Top-level Issues

  • Docs are outdated: Following the README often results in broken runs due to missing imports, hard-coded paths, or incorrect instructions.
  • Config handling is messy: Many values are hard-coded, and configs load other configs, which makes settings hard to trace and brittle to edit.
  • Tied to Carina: Deep dependencies on internal infra make it unusable for external collaborators (e.g., ARPA-H).
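
One possible direction for the config and infra points, as a hedged sketch only: a single flat config that is read from one file, never loads another config, and takes machine-specific paths from the caller instead of hard-coding them. The names `RunConfig`, `load_config`, `data_dir`, `output_dir`, `vocab_size`, and `seed` are hypothetical and do not correspond to anything in the repo.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical flat config: no field points at another config file."""
    data_dir: Path    # dataset location, supplied by the user (never hard-coded)
    output_dir: Path  # where outputs are written
    vocab_size: int = 4096
    seed: int = 0


def load_config(path: Path) -> RunConfig:
    """Read one JSON file and fail early on unknown or missing keys."""
    raw = json.loads(Path(path).read_text())
    allowed = {f.name for f in fields(RunConfig)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    for key in ("data_dir", "output_dir"):
        if key not in raw:
            raise ValueError(f"missing required config key: {key}")
        raw[key] = Path(raw[key])
    return RunConfig(**raw)
```

Because every setting lives in one file and unknown keys are rejected, a typo or a stale cluster-specific path would surface at load time rather than midway through a run.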

MEDS Demo – Tokenizer Training Broken

  • Step 7 ("Train tokenizer with cookbook.py") fails due to missing required args.
  • Tokenizer config logic is opaque: it seems to point to existing trained configs rather than provide a spec for training a new one.
  • No clear minimal path to train a tokenizer (e.g., just vocab size + dataset). create_cookbook_k.py seems promising but fails due to import and config issues.
  • Filesystem logic is unpredictable: for example, setting the path to cache/default results in output under cache_4k/default/.
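
To make the "minimal path" request concrete, here is a rough sketch of what such an interface could look like: a spec holding only a vocab size and an output directory, plus a path helper that returns the directory exactly as given. Everything here is an assumption for illustration. `TokenizerSpec`, `resolve_cache_dir`, and `build_vocab` are hypothetical names, and the frequency-based vocabulary is a stand-in, not the repo's cookbook tokenizer logic.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List


@dataclass(frozen=True)
class TokenizerSpec:
    """Hypothetical minimal training spec: a vocab size and an output directory."""
    vocab_size: int
    cache_dir: Path


def resolve_cache_dir(spec: TokenizerSpec) -> Path:
    """Return the directory exactly as given; no vocab-size suffix is injected."""
    return Path(spec.cache_dir)


def build_vocab(patients: Iterable[List[str]], spec: TokenizerSpec) -> List[str]:
    """Keep the `vocab_size` most frequent codes (ties broken alphabetically)."""
    counts = Counter(code for events in patients for code in events)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [code for code, _ in ranked[: spec.vocab_size]]


# Toy usage: each inner list stands in for one patient's sequence of codes.
spec = TokenizerSpec(vocab_size=2, cache_dir=Path("cache/default"))
toy_patients = [["A", "B", "A"], ["C", "A", "B"]]
vocab = build_vocab(toy_patients, spec)   # ["A", "B"]
out_dir = resolve_cache_dir(spec)         # Path("cache/default"), unchanged
```

The point of the sketch is the contract, not the algorithm: the caller supplies two values, and the output lands in the directory that was asked for.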

What Works

  • The inference and patient representation code works: I can tokenize MEDS eventstreams and get patient embeddings from the pretrained models.
