Pre-submission cleanup: README, model name, templates, dead code

max-rosenblattl · max-rosenblattl · commit de72c4c66793 · 2026-05-04T19:56:11.000-07:00
diff --git a/README.md b/README.md
@@ -1,41 +1,122 @@
-# SensorTSLM Core Framework
+# SensorTSLM
 
-Dataset-agnostic captioning pipeline for sensor time-series data.
+Dataset-agnostic pipeline that turns raw multi-channel sensor recordings into
+language-model-ready captions. The current reference implementation targets the
+MyHeartCounts (MHC) wearable dataset at daily and weekly resolution.
 
-## Setup
+## Pipeline overview
+
+```
+HuggingFace dataset ──► Transformer ──► Recording ──► Annotator ──► CaptionResult
+                                                       │
+                                                       ├── StructuralExtractor   (trends, spikes)
+                                                       ├── SemanticExtractor     (active windows)
+                                                       └── CrossChannelExtractor (workout / sleep)
+```
+
+- **`Transformer`** maps a source-dataset row to the internal `Recording` schema
+  (per-channel float arrays + metadata).
+- **`Annotator`** runs a list of `CaptionExtractor`s and returns one
+  `Annotation` per (channel, time-window, caption-type) it observes.
+- **Caption phrasing** is template-driven (`templates/templates.json`,
+  `templates/templates_hourly.json`); each annotation deterministically picks a
+  template variant from a row-derived seed.
 
-Install dependencies and set the dataset path before running:
+Adding a new source dataset means writing one `Dataset`, one `Transformer`, and
+a `ChannelConfig` describing channel names, units, and aggregators — no changes
+to the core pipeline.
+
+## Setup
 
 ```bash
 python3 -m pip install -r requirements.txt
 ```
 
+Point the loader at a HuggingFace MHC export:
+
+```bash
+export MHC_DATASET_DIR=<path-to-daily-hf-dataset>     # for daily
+export MHC_WEEKLY_DATASET_DIR=<path-to-weekly-hf>     # for weekly
+```
+
+## Generating captions
+
+Single-process export:
+
 ```bash
-export MHC_DATASET_DIR="../hf-daily_max-nonwear=50"
+python scripts/export_captions.py \
+    --variant weekly \
+    --out exports/lean_full
 ```
 
-## Usage
+Common flags:
+
+| Flag | Default | Notes |
+|------|---------|-------|
+| `--variant {daily,weekly}` | `weekly` | Which MHC resolution to caption. |
+| `--out <dir>` | `exports/lean_full` | Output directory for Arrow shards. |
+| `--max_rows <n>` | unset | Cap row count for a smoke test. |
+| `--start <i>` / `--end <j>` | full range | Slice for parallel/sharded runs. |
+| `--min_wear_pct <p>` | `0.0` | (daily) drop low-wear days. |
+| `--min_valid_hours <h>` | `0` | (weekly) drop weeks with too few valid hours. |
+| `--min_active_channels <k>` | `0` | Drop rows with fewer active channels. |
+| `--split_file <json>` | unset | Use canonical sharable-user splits. |
+
+Each shard writes an Arrow file under `<out>/recordings_*.arrow` containing the
+`Recording` data plus all `Annotation`s.
+
+### Parallel sharded export
+
+For large jobs the helper script splits the dataset into `N` Slurm jobs:
 
 ```bash
-python3 captionizer.py
+export MHC_WEEKLY_DATASET_DIR=<path>
+./scripts/export_captions_sharded.sh weekly 4 exports/lean_full
 ```
 
-## Explorer
+## Library usage
 
-Use the interactive explorer to inspect one row at a time, switch signals, and see which detector events fired where on the time series.
+```python
+from annotator import Annotator
+from captionizer import Captionizer
+from extractors.semantic import SemanticExtractor
+from extractors.structural import StructuralExtractor
+from mhc.constants import MHC_CHANNEL_CONFIG
+from mhc.cross_channel import default_extractor
+from mhc.dataset import MHCDataset
+from mhc.transformer import MHCTransformer
 
-Start it with:
+dataset = MHCDataset()
+annotator = Annotator([
+    StructuralExtractor(MHC_CHANNEL_CONFIG),
+    SemanticExtractor(MHC_CHANNEL_CONFIG),
+    default_extractor(MHC_CHANNEL_CONFIG),
+])
+captionizer = Captionizer(dataset, MHCTransformer(), annotator)
+result, _ = captionizer.run(max_rows=10)
+```
+
+## Inspecting the output
+
+The interactive explorer steps through one row at a time, switches signals, and
+overlays detector events on the time series:
 
 ```bash
-python3 explorer.py --min-wear-pct=50.0
+python explorer.py --min-wear-pct=50.0
 ```
 
-Useful controls:
+## Layout
 
-- Use the bottom row slider or `<` / `>` buttons to move between dataset rows.
-- Click a signal in the right-hand signal list or in the channel overview heatmap to switch channels.
-- Use the Matplotlib zoom and pan tools on the main plot to inspect parts of the signal in detail.
-- Click `reset` or press `home` to reset the zoom.
-- Use the overlay buttons to toggle `trend`, `spike`, `drop`, and `nonwear` overlays.
-- Use the `stats`, `events`, `captions`, and `help` tabs in the details panel to switch what metadata is shown.
-- Scroll inside the details panel with the mouse wheel or the `^` / `v` buttons.
+```
+captionizer.py        Orchestration (Dataset + Transformer + Annotator)
+annotator.py          Runs a list of extractors over one Recording
+extractors/           Caption extractors (statistical / structural / semantic / cross-channel)
+detectors/            Trend and spike detectors used by structural extractor
+synthesizers/         Cross-channel caption synthesizers (workout, sleep, …)
+templates/            Caption phrasing templates
+mhc/                  MHC daily dataset adapter
+mhc_weekly/           MHC weekly dataset adapter
+timef/                Internal Recording / CaptionResult schema
+exporters/            Arrow shard writer
+scripts/              CLI entry points (caption export, training, eval)
+```
diff --git a/captionizer.py b/captionizer.py
@@ -83,7 +83,7 @@ def run(
     model = ClientModel(
         ClientConfig(
             base_url="https://api.openai.com/v1",
-            model="gpt-5.4",
+            model="gpt-4o",
         ),
         MHC_CHANNEL_CONFIG,
     )
diff --git a/extractors/generative.py b/extractors/generative.py
diff --git a/requirements.txt b/requirements.txt
@@ -1,14 +1,12 @@
+accelerate
 datasets
 matplotlib
 numpy
+openai
+opentslm
+Pillow
 pyarrow
+python-dotenv
 scipy
 torch
-Pillow
 transformers
-accelerate
-datasets
-pyarrow
-openai
-python-dotenv
-opentslm
diff --git a/reviewer.py b/reviewer.py
@@ -10,7 +10,7 @@
 from dataclasses import dataclass, field
 
 from models.base import BaseModel
-from timef.schema import Annotation, CaptionResult
+from timef.schema import CaptionResult
 
 
 EVALUATE_PROMPT = """\
@@ -87,10 +87,6 @@ def evaluate(self, result: CaptionResult, per_channel: bool = False) -> Evaluati
 
         return EvaluationResult(scores=scores)
 
-    def refine(self, result: CaptionResult) -> list[Annotation]:
-        """Identify missing observations and add new captions."""
-        raise NotImplementedError
-
     @staticmethod
     def _parse_score(text: str) -> EvaluationScore:
         """Extract score and feedback from model response."""
diff --git a/templates/templates.json b/templates/templates.json
@@ -2,7 +2,7 @@
   "statistical": [
     "The average {name} value is {mean} {unit}, with extremes at {max} (max) and {min} (min), and a std of {std}.",
     "The {name} data exhibits a mean of {mean} {unit}, a standard deviation of {std}, and its extreme values are {min} and {max}.",
-    "{name} average {mean} {unit}, reaching a maximum of {max} and a minimum of {min}, with a standard deviation of {std}.",
+    "{name} averages {mean} {unit}, reaching a maximum of {max} and a minimum of {min}, with a standard deviation of {std}.",
     "{name} exhibits a mean of {mean} {unit}, with peak and minimal values reaching {max} and {min}, and a standard deviation of {std}.",
     "For the {name} measurements, the mean is {mean} {unit}, the standard deviation is {std}, and the data lies between {min} and {max}.",
     "Across all {name} samples, the mean is {mean} {unit}, the std is {std}, and values range from {min} to {max}.",
@@ -342,18 +342,18 @@
         "The watch tallied {total:.0f} steps for this interval",
         "Total watch steps during this stretch came to {total:.0f}",
         "The watch summed {total:.0f} steps across this interval",
-        "Aggregate watch steps for the interval was {total:.0f}",
+        "Aggregate watch steps for the interval were {total:.0f}",
         "The interval's total watch steps reached {total:.0f}",
         "Across this period the watch logged {total:.0f} steps",
         "Watch steps added up to {total:.0f} over this interval",
         "The watch accumulated {total:.0f} steps during this stretch",
         "Total watch steps for the period reached {total:.0f}",
         "This stretch saw {total:.0f} steps on the watch",
-        "The watch's cumulative steps for the interval was {total:.0f}",
+        "The watch's cumulative steps for the interval were {total:.0f}",
         "Across the interval the watch counted {total:.0f} steps",
         "Sum of watch steps over this period was {total:.0f}",
         "The watch totaled {total:.0f} steps during this interval",
-        "Total watch-recorded steps for this stretch was {total:.0f}"
+        "Total watch-recorded steps for this stretch were {total:.0f}"
       ],
       "active_energy_mean": [
         "The watch recorded an average {name} of {mean:.0f} {unit} during this period",
@@ -424,21 +424,21 @@
       "sleep_stationary": [
         "The person was stationary during this sleep interval",
         "The person remained stationary during this sleep period",
-        "The sleep interval was stationary, with no distance-based movement recorded",
+        "The sleep interval was stationary, with no movement recorded",
         "The person stayed in one place during this sleep interval",
-        "This sleep period showed no distance-based movement away from rest",
+        "This sleep period showed no movement from the resting position",
         "Throughout this sleep interval the person remained still",
         "No movement was detected during this sleep period",
         "The person stayed put across this sleep interval",
         "During this sleep stretch the person did not move",
-        "This sleep period was free of distance-based movement",
+        "This sleep period was free of movement",
         "The person was at rest for the duration of this sleep interval",
         "No locomotor activity occurred during this sleep period",
         "The person remained at rest during this sleep stretch",
         "This sleep interval involved no positional change",
         "The sleep window saw the person remain stationary",
         "During the sleep period the person stayed put",
-        "The person slept without moving in space",
+        "The person slept without changing position",
         "The interval was a still sleep period",
         "Across this sleep interval the person did not move",
         "Stationary sleep was observed across this interval"
diff --git a/templates/templates_hourly.json b/templates/templates_hourly.json
@@ -2,7 +2,7 @@
   "statistical": [
     "The average {name} value is {mean} {unit}, with extremes at {max} (max) and {min} (min), and a std of {std}.",
     "The {name} data exhibits a mean of {mean} {unit}, a standard deviation of {std}, and its extreme values are {min} and {max}.",
-    "{name} average {mean} {unit}, reaching a maximum of {max} and a minimum of {min}, with a standard deviation of {std}.",
+    "{name} averages {mean} {unit}, reaching a maximum of {max} and a minimum of {min}, with a standard deviation of {std}.",
     "{name} exhibits a mean of {mean} {unit}, with peak and minimal values reaching {max} and {min}, and a standard deviation of {std}.",
     "For the {name} measurements, the mean is {mean} {unit}, the standard deviation is {std}, and the data lies between {min} and {max}.",
     "Across all {name} samples, the mean is {mean} {unit}, the std is {std}, and values range from {min} to {max}.",
@@ -538,14 +538,14 @@
         "No movement was detected during this sleep period",
         "The person stayed put across this sleep interval",
         "During this sleep stretch the person did not move",
-        "This sleep period was free of distance-based movement",
+        "This sleep period was free of movement",
         "The person was at rest for the duration of this sleep interval",
         "No locomotor activity occurred during this sleep period",
         "The person remained at rest during this sleep stretch",
         "This sleep interval involved no positional change",
         "The sleep window saw the person remain stationary",
         "During the sleep period the person stayed put",
-        "The person slept without moving in space",
+        "The person slept without changing position",
         "The interval was a still sleep period",
         "Across this sleep interval the person did not move",
         "Stationary sleep was observed across this interval",
@@ -581,18 +581,18 @@
         "The watch tallied {total:.0f} steps for this interval",
         "Total watch steps during this stretch came to {total:.0f}",
         "The watch summed {total:.0f} steps across this interval",
-        "Aggregate watch steps for the interval was {total:.0f}",
+        "Aggregate watch steps for the interval were {total:.0f}",
         "The interval's total watch steps reached {total:.0f}",
         "Across this period the watch logged {total:.0f} steps",
         "Watch steps added up to {total:.0f} over this interval",
         "The watch accumulated {total:.0f} steps during this stretch",
         "Total watch steps for the period reached {total:.0f}",
         "This stretch saw {total:.0f} steps on the watch",
-        "The watch's cumulative steps for the interval was {total:.0f}",
+        "The watch's cumulative steps for the interval were {total:.0f}",
         "Across the interval the watch counted {total:.0f} steps",
         "Sum of watch steps over this period was {total:.0f}",
         "The watch totaled {total:.0f} steps during this interval",
-        "Total watch-recorded steps for this stretch was {total:.0f}",
+        "Total watch-recorded steps for this stretch were {total:.0f}",
         "Total steps on the watch reached {total:.0f} for this interval",
         "Watch logs sum to {total:.0f} steps for this stretch",
         "Combined watch steps for this period reached {total:.0f}"
diff --git a/time_series_datasets/mhc_caption_qa_dataset.py b/time_series_datasets/mhc_caption_qa_dataset.py
@@ -17,7 +17,7 @@ def _get_answer(self, row) -> str:
 
     def _get_post_prompt(self, _row) -> str:
         return (
-            "Please generate a detailed caption for this day of sensor data, "
+            "Generate a detailed caption for this day of sensor data, "
             "describing the person's activity, sleep, and physiological "
             "patterns as accurately as possible."
         )
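
Note on the README's claim that each annotation "deterministically picks a template variant from a row-derived seed": the selection code is not part of this diff, so the following is only a minimal sketch of how such a pick could work, not the repository's implementation. The names `TEMPLATES`, `pick_template`, `render`, and the `row_id`/`channel` seed fields are hypothetical, and the template strings are shortened stand-ins for entries in `templates/templates.json`.

```python
import hashlib
import random

# Hypothetical stand-in for one section of templates/templates.json.
TEMPLATES = {
    "statistical": [
        "The average {name} value is {mean} {unit}.",
        "{name} averages {mean} {unit}.",
        "Across all {name} samples, the mean is {mean} {unit}.",
    ],
}


def pick_template(caption_type: str, row_id: str, channel: str) -> str:
    """Choose a template variant reproducibly from a row-derived seed."""
    variants = TEMPLATES[caption_type]
    # Use hashlib instead of the built-in hash(): string hashes are salted
    # per process, so hash() would not be stable across runs.
    key = f"{row_id}|{channel}|{caption_type}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed).choice(variants)


def render(caption_type: str, row_id: str, channel: str, **fields) -> str:
    """Fill the chosen template with the annotation's values."""
    return pick_template(caption_type, row_id, channel).format(**fields)


caption = render("statistical", "user42-2021-03-01", "heart_rate",
                 name="Heart rate", mean=71, unit="bpm")
print(caption)
```

Seeding from stable row fields means that re-exporting the same shard reproduces identical captions, which matters for the sharded `--start`/`--end` runs described in the README.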
