Commit de72c4c

Pre-submission cleanup: README, model name, templates, dead code
1 parent 345da28 commit de72c4c

8 files changed

Lines changed: 122 additions & 71 deletions

README.md

Lines changed: 100 additions & 19 deletions
@@ -1,41 +1,122 @@
-# SensorTSLM Core Framework
+# SensorTSLM
 
-Dataset-agnostic captioning pipeline for sensor time-series data.
+Dataset-agnostic pipeline that turns raw multi-channel sensor recordings into
+language-model-ready captions. The current reference implementation targets the
+MyHeartCounts (MHC) wearable dataset at daily and weekly resolution.
 
-## Setup
+## Pipeline overview
+
+```
+HuggingFace dataset ──► Transformer ──► Recording ──► Annotator ──► CaptionResult
+
+├── StructuralExtractor (trends, spikes)
+├── SemanticExtractor (active windows)
+└── CrossChannelExtractor (workout / sleep)
+```
+
+- **`Transformer`** maps a source-dataset row to the internal `Recording` schema
+  (per-channel float arrays + metadata).
+- **`Annotator`** runs a list of `CaptionExtractor`s and returns one
+  `Annotation` per (channel, time-window, caption-type) it observes.
+- **Caption phrasing** is template-driven (`templates/templates.json`,
+  `templates/templates_hourly.json`); each annotation deterministically picks a
+  template variant from a row-derived seed.
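
The "row-derived seed" behaviour described in the last bullet could look roughly like the sketch below. This is an illustration under stated assumptions, not the repository's code: the `steps_total` key, the `row_seed` and `render_caption` helpers, and the seed recipe (SHA-256 over row id, channel, and caption type) are all hypothetical. Only the three template strings are copied from `templates/templates.json`.

```python
import hashlib
import random

# Template strings copied from templates/templates.json; the "steps_total"
# key is a hypothetical name chosen for this sketch.
TEMPLATES = {
    "steps_total": [
        "The watch tallied {total:.0f} steps for this interval",
        "Total watch steps during this stretch came to {total:.0f}",
        "The watch summed {total:.0f} steps across this interval",
    ],
}


def row_seed(row_id: str, channel: str, caption_type: str) -> int:
    """Derive a stable integer seed from identifiers of the annotation.

    Assumption: the seed depends on row id, channel, and caption type.
    hashlib is used instead of the built-in hash() because hash() of a
    string is randomized per process and would not be reproducible.
    """
    key = f"{row_id}|{channel}|{caption_type}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def render_caption(row_id: str, channel: str, caption_type: str, **values) -> str:
    """Pick one template variant deterministically and fill in its fields."""
    variants = TEMPLATES[caption_type]
    rng = random.Random(row_seed(row_id, channel, caption_type))
    return rng.choice(variants).format(**values)


if __name__ == "__main__":
    print(render_caption("user42-2020-01-01", "watch_steps", "steps_total", total=8431.6))
```

Calling `render_caption` twice with the same identifiers returns the same sentence, so re-running an export would reproduce identical captions, while different rows still spread across the available variants.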
 
-Install dependencies and set the dataset path before running:
+Adding a new source dataset means writing one `Dataset`, one `Transformer`, and
+a `ChannelConfig` describing channel names, units, and aggregators — no changes
+to the core pipeline.
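
To make the adapter idea concrete, here is a toy sketch of what a channel description and a row-to-recording transformer might look like. Every name in it (`ToyChannelSpec`, `ToyRecording`, `ToyTransformer`, `summarize`, the `heart_rate` and `steps` channels) is hypothetical and chosen for illustration; the real `ChannelConfig`, `Transformer`, and `Recording` interfaces are not shown in this commit and may differ.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import Callable, Sequence


@dataclass(frozen=True)
class ToyChannelSpec:
    """Hypothetical stand-in for one ChannelConfig entry: name, unit, aggregator."""
    name: str
    unit: str
    aggregator: Callable[[Sequence[float]], float]


@dataclass
class ToyRecording:
    """Hypothetical stand-in for Recording: per-channel float arrays + metadata."""
    channels: dict[str, list[float]]
    metadata: dict[str, object] = field(default_factory=dict)


# Assumed example channels: rates are averaged, counts are summed.
TOY_CHANNELS = [
    ToyChannelSpec("heart_rate", "bpm", fmean),
    ToyChannelSpec("steps", "count", sum),
]


class ToyTransformer:
    """Maps one source-dataset row (a dict) to a ToyRecording."""

    def __init__(self, specs: Sequence[ToyChannelSpec]):
        self.specs = list(specs)

    def transform(self, row: dict) -> ToyRecording:
        channels = {s.name: [float(v) for v in row[s.name]] for s in self.specs}
        # Everything that is not a configured channel is kept as metadata.
        metadata = {k: v for k, v in row.items() if k not in channels}
        return ToyRecording(channels=channels, metadata=metadata)


def summarize(recording: ToyRecording, specs: Sequence[ToyChannelSpec]) -> dict[str, float]:
    """Apply each channel's aggregator to its samples."""
    return {s.name: s.aggregator(recording.channels[s.name]) for s in specs}


if __name__ == "__main__":
    row = {"user_id": "u1", "heart_rate": [60, 62, 64], "steps": [0, 10, 5]}
    rec = ToyTransformer(TOY_CHANNELS).transform(row)
    print(summarize(rec, TOY_CHANNELS))
```

The point of the split is that dataset-specific knowledge (which columns exist, their units, how to aggregate them) lives in the spec list and the transformer, so downstream extractors can stay unchanged.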
+
+## Setup
 
 ```bash
 python3 -m pip install -r requirements.txt
 ```
 
+Point the loader at a HuggingFace MHC export:
+
+```bash
+export MHC_DATASET_DIR=<path-to-daily-hf-dataset> # for daily
+export MHC_WEEKLY_DATASET_DIR=<path-to-weekly-hf> # for weekly
+```
+
+## Generating captions
+
+Single-process export:
+
 ```bash
-export MHC_DATASET_DIR="../hf-daily_max-nonwear=50"
+python scripts/export_captions.py \
+    --variant weekly \
+    --out exports/lean_full
 ```
 
-## Usage
+Common flags:
+
+| Flag | Default | Notes |
+|------|---------|-------|
+| `--variant {daily,weekly}` | `weekly` | Which MHC resolution to caption. |
+| `--out <dir>` | `exports/lean_full` | Output directory for Arrow shards. |
+| `--max_rows <n>` | unset | Cap row count for a smoke test. |
+| `--start <i>` / `--end <j>` | full range | Slice for parallel/sharded runs. |
+| `--min_wear_pct <p>` | `0.0` | (daily) drop low-wear days. |
+| `--min_valid_hours <h>` | `0` | (weekly) drop weeks with too few valid hours. |
+| `--min_active_channels <k>` | `0` | Drop rows with fewer active channels. |
+| `--split_file <json>` | unset | Use canonical sharable-user splits. |
+
+Each shard writes an Arrow file under `<out>/recordings_*.arrow` containing the
+`Recording` data plus all `Annotation`s.
+
+### Parallel sharded export
+
+For large jobs the helper script splits the dataset into `N` Slurm jobs:
 
 ```bash
-python3 captionizer.py
+export MHC_WEEKLY_DATASET_DIR=<path>
+./scripts/export_captions_sharded.sh weekly 4 exports/lean_full
 ```
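
The sharded export presumably maps each job onto a `--start` / `--end` slice. The sketch below shows one way such slices could be computed and turned into per-shard commands. It is not taken from `export_captions_sharded.sh`: the `shard_ranges` and `shard_commands` helpers are hypothetical, and it assumes that `--end` is exclusive and that all shards write into the same `--out` directory.

```python
import shlex


def shard_ranges(n_rows: int, n_shards: int) -> list[tuple[int, int]]:
    """Split [0, n_rows) into n_shards contiguous, near-equal [start, end) slices."""
    if n_shards <= 0:
        raise ValueError("n_shards must be positive")
    base, extra = divmod(n_rows, n_shards)
    ranges = []
    start = 0
    for i in range(n_shards):
        # The first `extra` shards each take one additional row.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def shard_commands(n_rows: int, n_shards: int,
                   variant: str = "weekly",
                   out: str = "exports/lean_full") -> list[str]:
    """Build one export command line per shard (flags as listed in the README)."""
    return [
        shlex.join([
            "python", "scripts/export_captions.py",
            "--variant", variant,
            "--start", str(start),
            "--end", str(end),
            "--out", out,
        ])
        for start, end in shard_ranges(n_rows, n_shards)
    ]


if __name__ == "__main__":
    for cmd in shard_commands(n_rows=10, n_shards=4):
        print(cmd)
```

With 10 rows and 4 shards this yields the slices `(0, 3)`, `(3, 6)`, `(6, 8)`, `(8, 10)`, so every row is covered exactly once and shard sizes differ by at most one.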
 
-## Explorer
+## Library usage
 
-Use the interactive explorer to inspect one row at a time, switch signals, and see which detector events fired where on the time series.
+```python
+from annotator import Annotator
+from captionizer import Captionizer
+from extractors.semantic import SemanticExtractor
+from extractors.structural import StructuralExtractor
+from mhc.constants import MHC_CHANNEL_CONFIG
+from mhc.cross_channel import default_extractor
+from mhc.dataset import MHCDataset
+from mhc.transformer import MHCTransformer
 
-Start it with:
+dataset = MHCDataset()
+annotator = Annotator([
+    StructuralExtractor(MHC_CHANNEL_CONFIG),
+    SemanticExtractor(MHC_CHANNEL_CONFIG),
+    default_extractor(MHC_CHANNEL_CONFIG),
+])
+captionizer = Captionizer(dataset, MHCTransformer(), annotator)
+result, _ = captionizer.run(max_rows=10)
+```
+
+## Inspecting the output
+
+The interactive explorer steps through one row at a time, switches signals, and
+overlays detector events on the time series:
 
 ```bash
-python3 explorer.py --min-wear-pct=50.0
+python explorer.py --min-wear-pct=50.0
 ```
 
-Useful controls:
+## Layout
 
-- Use the bottom row slider or `<` / `>` buttons to move between dataset rows.
-- Click a signal in the right-hand signal list or in the channel overview heatmap to switch channels.
-- Use the Matplotlib zoom and pan tools on the main plot to inspect parts of the signal in detail.
-- Click `reset` or press `home` to reset the zoom.
-- Use the overlay buttons to toggle `trend`, `spike`, `drop`, and `nonwear` overlays.
-- Use the `stats`, `events`, `captions`, and `help` tabs in the details panel to switch what metadata is shown.
-- Scroll inside the details panel with the mouse wheel or the `^` / `v` buttons.
+```
+captionizer.py   Orchestration (Dataset + Transformer + Annotator)
+annotator.py     Runs a list of extractors over one Recording
+extractors/      Caption extractors (statistical / structural / semantic / cross-channel)
+detectors/       Trend and spike detectors used by structural extractor
+synthesizers/    Cross-channel caption synthesizers (workout, sleep, …)
+templates/       Caption phrasing templates
+mhc/             MHC daily dataset adapter
+mhc_weekly/      MHC weekly dataset adapter
+timef/           Internal Recording / CaptionResult schema
+exporters/       Arrow shard writer
+scripts/         CLI entry points (caption export, training, eval)
+```

captionizer.py

Lines changed: 1 addition & 1 deletion
@@ -83,7 +83,7 @@ def run(
         model = ClientModel(
             ClientConfig(
                 base_url="https://api.openai.com/v1",
-                model="gpt-5.4",
+                model="gpt-4o",
             ),
             MHC_CHANNEL_CONFIG,
         )

extractors/generative.py

Lines changed: 0 additions & 24 deletions
This file was deleted.

requirements.txt

Lines changed: 5 additions & 7 deletions
@@ -1,14 +1,12 @@
+accelerate
 datasets
 matplotlib
 numpy
+openai
+opentslm
+Pillow
 pyarrow
+python-dotenv
 scipy
 torch
-Pillow
 transformers
-accelerate
-datasets
-pyarrow
-openai
-python-dotenv
-opentslm
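
Judging from the diff, the `requirements.txt` change removes the duplicated `datasets` and `pyarrow` entries and sorts the remaining names case-insensitively. A small sketch of that normalization follows; the `normalize_requirements` helper is hypothetical and not part of the repository, and it assumes plain package names without version pins or comments.

```python
def normalize_requirements(lines):
    """Drop blank and duplicate entries, then sort case-insensitively.

    Assumes each line is a bare package name (no version specifiers,
    extras, or comments), as in the requirements.txt shown in this commit.
    """
    seen = set()
    unique = []
    for line in lines:
        name = line.strip()
        key = name.casefold()
        if not name or key in seen:
            continue
        seen.add(key)
        unique.append(name)
    return sorted(unique, key=str.casefold)


if __name__ == "__main__":
    old = [
        "datasets", "matplotlib", "numpy", "pyarrow", "scipy", "torch",
        "Pillow", "transformers", "accelerate", "datasets", "pyarrow",
        "openai", "python-dotenv", "opentslm",
    ]
    print("\n".join(normalize_requirements(old)))
```

Sorting with `str.casefold` as the key is what places `Pillow` between `opentslm` and `pyarrow`; a plain `sorted()` would put the capitalized name first.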

reviewer.py

Lines changed: 1 addition & 5 deletions
@@ -10,7 +10,7 @@
 from dataclasses import dataclass, field
 
 from models.base import BaseModel
-from timef.schema import Annotation, CaptionResult
+from timef.schema import CaptionResult
 
 
 EVALUATE_PROMPT = """\
@@ -87,10 +87,6 @@ def evaluate(self, result: CaptionResult, per_channel: bool = False) -> Evaluati
 
         return EvaluationResult(scores=scores)
 
-    def refine(self, result: CaptionResult) -> list[Annotation]:
-        """Identify missing observations and add new captions."""
-        raise NotImplementedError
-
     @staticmethod
     def _parse_score(text: str) -> EvaluationScore:
         """Extract score and feedback from model response."""

templates/templates.json

Lines changed: 8 additions & 8 deletions
@@ -2,7 +2,7 @@
   "statistical": [
     "The average {name} value is {mean} {unit}, with extremes at {max} (max) and {min} (min), and a std of {std}.",
     "The {name} data exhibits a mean of {mean} {unit}, a standard deviation of {std}, and its extreme values are {min} and {max}.",
-    "{name} average {mean} {unit}, reaching a maximum of {max} and a minimum of {min}, with a standard deviation of {std}.",
+    "{name} averages {mean} {unit}, reaching a maximum of {max} and a minimum of {min}, with a standard deviation of {std}.",
     "{name} exhibits a mean of {mean} {unit}, with peak and minimal values reaching {max} and {min}, and a standard deviation of {std}.",
     "For the {name} measurements, the mean is {mean} {unit}, the standard deviation is {std}, and the data lies between {min} and {max}.",
     "Across all {name} samples, the mean is {mean} {unit}, the std is {std}, and values range from {min} to {max}.",
@@ -342,18 +342,18 @@
     "The watch tallied {total:.0f} steps for this interval",
     "Total watch steps during this stretch came to {total:.0f}",
     "The watch summed {total:.0f} steps across this interval",
-    "Aggregate watch steps for the interval was {total:.0f}",
+    "Aggregate watch steps for the interval were {total:.0f}",
     "The interval's total watch steps reached {total:.0f}",
     "Across this period the watch logged {total:.0f} steps",
     "Watch steps added up to {total:.0f} over this interval",
     "The watch accumulated {total:.0f} steps during this stretch",
     "Total watch steps for the period reached {total:.0f}",
     "This stretch saw {total:.0f} steps on the watch",
-    "The watch's cumulative steps for the interval was {total:.0f}",
+    "The watch's cumulative steps for the interval were {total:.0f}",
     "Across the interval the watch counted {total:.0f} steps",
     "Sum of watch steps over this period was {total:.0f}",
     "The watch totaled {total:.0f} steps during this interval",
-    "Total watch-recorded steps for this stretch was {total:.0f}"
+    "Total watch-recorded steps for this stretch were {total:.0f}"
   ],
   "active_energy_mean": [
     "The watch recorded an average {name} of {mean:.0f} {unit} during this period",
@@ -424,21 +424,21 @@
   "sleep_stationary": [
     "The person was stationary during this sleep interval",
     "The person remained stationary during this sleep period",
-    "The sleep interval was stationary, with no distance-based movement recorded",
+    "The sleep interval was stationary, with no movement recorded",
     "The person stayed in one place during this sleep interval",
-    "This sleep period showed no distance-based movement away from rest",
+    "This sleep period showed no movement from the resting position",
     "Throughout this sleep interval the person remained still",
     "No movement was detected during this sleep period",
     "The person stayed put across this sleep interval",
     "During this sleep stretch the person did not move",
-    "This sleep period was free of distance-based movement",
+    "This sleep period was free of movement",
     "The person was at rest for the duration of this sleep interval",
     "No locomotor activity occurred during this sleep period",
     "The person remained at rest during this sleep stretch",
     "This sleep interval involved no positional change",
     "The sleep window saw the person remain stationary",
     "During the sleep period the person stayed put",
-    "The person slept without moving in space",
+    "The person slept without changing position",
     "The interval was a still sleep period",
     "Across this sleep interval the person did not move",
     "Stationary sleep was observed across this interval"

templates/templates_hourly.json

Lines changed: 6 additions & 6 deletions
@@ -2,7 +2,7 @@
   "statistical": [
     "The average {name} value is {mean} {unit}, with extremes at {max} (max) and {min} (min), and a std of {std}.",
     "The {name} data exhibits a mean of {mean} {unit}, a standard deviation of {std}, and its extreme values are {min} and {max}.",
-    "{name} average {mean} {unit}, reaching a maximum of {max} and a minimum of {min}, with a standard deviation of {std}.",
+    "{name} averages {mean} {unit}, reaching a maximum of {max} and a minimum of {min}, with a standard deviation of {std}.",
     "{name} exhibits a mean of {mean} {unit}, with peak and minimal values reaching {max} and {min}, and a standard deviation of {std}.",
     "For the {name} measurements, the mean is {mean} {unit}, the standard deviation is {std}, and the data lies between {min} and {max}.",
     "Across all {name} samples, the mean is {mean} {unit}, the std is {std}, and values range from {min} to {max}.",
@@ -538,14 +538,14 @@
     "No movement was detected during this sleep period",
     "The person stayed put across this sleep interval",
     "During this sleep stretch the person did not move",
-    "This sleep period was free of distance-based movement",
+    "This sleep period was free of movement",
     "The person was at rest for the duration of this sleep interval",
     "No locomotor activity occurred during this sleep period",
     "The person remained at rest during this sleep stretch",
     "This sleep interval involved no positional change",
     "The sleep window saw the person remain stationary",
     "During the sleep period the person stayed put",
-    "The person slept without moving in space",
+    "The person slept without changing position",
     "The interval was a still sleep period",
     "Across this sleep interval the person did not move",
     "Stationary sleep was observed across this interval",
@@ -581,18 +581,18 @@
     "The watch tallied {total:.0f} steps for this interval",
     "Total watch steps during this stretch came to {total:.0f}",
     "The watch summed {total:.0f} steps across this interval",
-    "Aggregate watch steps for the interval was {total:.0f}",
+    "Aggregate watch steps for the interval were {total:.0f}",
     "The interval's total watch steps reached {total:.0f}",
     "Across this period the watch logged {total:.0f} steps",
     "Watch steps added up to {total:.0f} over this interval",
     "The watch accumulated {total:.0f} steps during this stretch",
     "Total watch steps for the period reached {total:.0f}",
     "This stretch saw {total:.0f} steps on the watch",
-    "The watch's cumulative steps for the interval was {total:.0f}",
+    "The watch's cumulative steps for the interval were {total:.0f}",
     "Across the interval the watch counted {total:.0f} steps",
     "Sum of watch steps over this period was {total:.0f}",
     "The watch totaled {total:.0f} steps during this interval",
-    "Total watch-recorded steps for this stretch was {total:.0f}",
+    "Total watch-recorded steps for this stretch were {total:.0f}",
     "Total steps on the watch reached {total:.0f} for this interval",
     "Watch logs sum to {total:.0f} steps for this stretch",
     "Combined watch steps for this period reached {total:.0f}"

time_series_datasets/mhc_caption_qa_dataset.py

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ def _get_answer(self, row) -> str:
 
     def _get_post_prompt(self, _row) -> str:
         return (
-            "Please generate a detailed caption for this day of sensor data, "
+            "Generate a detailed caption for this day of sensor data, "
             "describing the person's activity, sleep, and physiological "
             "patterns as accurately as possible."
         )
