|
1 | | -# SensorTSLM Core Framework |
| 1 | +# SensorTSLM |
2 | 2 |
|
3 | | -Dataset-agnostic captioning pipeline for sensor time-series data. |
| 3 | +Dataset-agnostic pipeline that turns raw multi-channel sensor recordings into |
| 4 | +language-model-ready captions. The current reference implementation targets the |
| 5 | +MyHeartCounts (MHC) wearable dataset at daily and weekly resolution. |
4 | 6 |
|
5 | | -## Setup |
| 7 | +## Pipeline overview |
| 8 | + |
| 9 | +``` |
| 10 | +HuggingFace dataset ──► Transformer ──► Recording ──► Annotator ──► CaptionResult
| 11 | +                                                          │
| 12 | +                                                          ├── StructuralExtractor (trends, spikes)
| 13 | +                                                          ├── SemanticExtractor (active windows)
| 14 | +                                                          └── CrossChannelExtractor (workout / sleep)
| 15 | +``` |
| 16 | + |
| 17 | +- **`Transformer`** maps a source-dataset row to the internal `Recording` schema |
| 18 | + (per-channel float arrays + metadata). |
| 19 | +- **`Annotator`** runs a list of `CaptionExtractor`s and returns one |
| 20 | + `Annotation` per (channel, time-window, caption-type) it observes. |
| 21 | +- **Caption phrasing** is template-driven (`templates/templates.json`, |
| 22 | + `templates/templates_hourly.json`); each annotation deterministically picks a |
| 23 | + template variant from a row-derived seed. |
6 | 24 |
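The deterministic template choice described in the last bullet can be illustrated with a minimal sketch. It does not reproduce the repository's code: the function name `pick_template`, the way the seed is combined with the caption type, and the use of SHA-256 are all assumptions made for illustration. The point is only that a stable hash of row-derived values gives the same variant on every run.

```python
import hashlib


def pick_template(variants, seed, caption_type):
    """Pick one template variant deterministically (hypothetical helper).

    `seed` stands in for any row-derived value (e.g. a row index or ID);
    the real pipeline may derive and combine its seed differently.
    """
    if not variants:
        raise ValueError("variants must be non-empty")
    # Use a cryptographic digest instead of Python's built-in hash(),
    # which is salted per process for strings and so is not reproducible.
    key = f"{seed}:{caption_type}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]


variants = [
    "{channel} rose steadily between {start} and {end}.",
    "From {start} to {end}, {channel} showed an upward trend.",
    "An increase in {channel} was observed from {start} to {end}.",
]
chosen = pick_template(variants, seed=12345, caption_type="trend")
```

Calling `pick_template` again with the same `seed` and `caption_type` returns the same variant, which keeps exports reproducible across runs and machines.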
|
7 | | -Install dependencies and set the dataset path before running: |
| 25 | +Adding a new source dataset means writing one `Dataset`, one `Transformer`, and |
| 26 | +a `ChannelConfig` describing channel names, units, and aggregators — no changes |
| 27 | +to the core pipeline. |
| 28 | + |
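As a rough illustration of the three pieces a new adapter needs, the sketch below uses toy stand-ins. `ChannelSpec`, `ToyRecording`, `ToyTransformer`, and the row layout are hypothetical and are not the project's actual `ChannelConfig`, `Recording`, or `Transformer` interfaces, whose signatures are not shown in this README.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class ChannelSpec:
    """Hypothetical per-channel description: name, unit, and aggregator."""
    name: str
    unit: str
    aggregator: Callable[[Sequence[float]], float]


@dataclass
class ToyRecording:
    """Toy stand-in for the internal schema: float arrays plus metadata."""
    channels: Dict[str, List[float]]
    metadata: Dict[str, object] = field(default_factory=dict)


def mean(values: Sequence[float]) -> float:
    """Average of the values, or NaN for an empty sequence."""
    return sum(values) / len(values) if values else float("nan")


# Assumed channel layout for an imaginary source dataset.
TOY_CHANNEL_CONFIG = [
    ChannelSpec("heart_rate", "bpm", mean),
    ChannelSpec("steps", "count", sum),
]


class ToyTransformer:
    """Maps one source row (a plain dict here) to a ToyRecording."""

    def __init__(self, config: List[ChannelSpec]):
        self.config = config

    def transform(self, row: dict) -> ToyRecording:
        channels = {
            spec.name: [float(v) for v in row["signals"][spec.name]]
            for spec in self.config
        }
        return ToyRecording(channels, {"user_id": row.get("user_id")})


row = {"user_id": "u1", "signals": {"heart_rate": [60, 62], "steps": [0, 12]}}
recording = ToyTransformer(TOY_CHANNEL_CONFIG).transform(row)
```

The design point carried over from the text is that everything dataset-specific lives in the adapter and its channel description, so the annotation stage only ever sees the common schema.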
| 29 | +## Setup |
8 | 30 |
|
9 | 31 | ```bash |
10 | 32 | python3 -m pip install -r requirements.txt |
11 | 33 | ``` |
12 | 34 |
|
| 35 | +Point the loader at a HuggingFace MHC export: |
| 36 | + |
| 37 | +```bash |
| 38 | +export MHC_DATASET_DIR=<path-to-daily-hf-dataset> # for daily |
| 39 | +export MHC_WEEKLY_DATASET_DIR=<path-to-weekly-hf> # for weekly |
| 40 | +``` |
| 41 | + |
| 42 | +## Generating captions |
| 43 | + |
| 44 | +Single-process export: |
| 45 | + |
13 | 46 | ```bash |
14 | | -export MHC_DATASET_DIR="../hf-daily_max-nonwear=50" |
| 47 | +python scripts/export_captions.py \ |
| 48 | + --variant weekly \ |
| 49 | + --out exports/lean_full |
15 | 50 | ``` |
16 | 51 |
|
17 | | -## Usage |
| 52 | +Common flags: |
| 53 | + |
| 54 | +| Flag | Default | Notes | |
| 55 | +|------|---------|-------| |
| 56 | +| `--variant {daily,weekly}` | `weekly` | Which MHC resolution to caption. | |
| 57 | +| `--out <dir>` | `exports/lean_full` | Output directory for Arrow shards. | |
| 58 | +| `--max_rows <n>` | unset | Cap row count for a smoke test. | |
| 59 | +| `--start <i>` / `--end <j>` | full range | Slice for parallel/sharded runs. | |
| 60 | +| `--min_wear_pct <p>` | `0.0` | (daily) drop low-wear days. | |
| 61 | +| `--min_valid_hours <h>` | `0` | (weekly) drop weeks with too few valid hours. | |
| 62 | +| `--min_active_channels <k>` | `0` | Drop rows with fewer active channels. | |
| 63 | +| `--split_file <json>` | unset | Use canonical sharable-user splits. | |
| 64 | + |
| 65 | +Each shard is written as an Arrow file matching `<out>/recordings_*.arrow`,
| 66 | +containing the `Recording` data plus all of its `Annotation`s.
| 67 | + |
| 68 | +### Parallel sharded export |
| 69 | + |
| 70 | +For large jobs the helper script splits the dataset into `N` Slurm jobs: |
18 | 71 |
|
19 | 72 | ```bash |
20 | | -python3 captionizer.py |
| 73 | +export MHC_WEEKLY_DATASET_DIR=<path> |
| 74 | +./scripts/export_captions_sharded.sh weekly 4 exports/lean_full |
21 | 75 | ``` |
22 | 76 |
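The `--start` / `--end` flags are what make sharding possible. The sketch below shows one way a row range might be divided into contiguous slices, one per job. It is not taken from `export_captions_sharded.sh`: the helper name `shard_ranges` is hypothetical, and treating `--end` as exclusive is an assumption that should be checked against the script.

```python
def shard_ranges(total_rows, num_shards):
    """Split [0, total_rows) into contiguous half-open (start, end) slices.

    Hypothetical helper: slice sizes differ by at most one row, with the
    earlier shards taking the remainder.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    base, extra = divmod(total_rows, num_shards)
    ranges = []
    start = 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


for start, end in shard_ranges(total_rows=10, num_shards=4):
    # Each pair could be passed to one job as --start / --end.
    print(f"--start {start} --end {end}")
```

With 10 rows and 4 shards this yields the slices `(0, 3)`, `(3, 6)`, `(6, 8)`, and `(8, 10)`, which cover every row exactly once.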
|
23 | | -## Explorer |
| 77 | +## Library usage |
24 | 78 |
|
25 | | -Use the interactive explorer to inspect one row at a time, switch signals, and see which detector events fired where on the time series. |
| 79 | +```python |
| 80 | +from annotator import Annotator |
| 81 | +from captionizer import Captionizer |
| 82 | +from extractors.semantic import SemanticExtractor |
| 83 | +from extractors.structural import StructuralExtractor |
| 84 | +from mhc.constants import MHC_CHANNEL_CONFIG |
| 85 | +from mhc.cross_channel import default_extractor |
| 86 | +from mhc.dataset import MHCDataset |
| 87 | +from mhc.transformer import MHCTransformer |
26 | 88 |
|
27 | | -Start it with: |
| 89 | +dataset = MHCDataset() |
| 90 | +annotator = Annotator([ |
| 91 | + StructuralExtractor(MHC_CHANNEL_CONFIG), |
| 92 | + SemanticExtractor(MHC_CHANNEL_CONFIG), |
| 93 | + default_extractor(MHC_CHANNEL_CONFIG), |
| 94 | +]) |
| 95 | +captionizer = Captionizer(dataset, MHCTransformer(), annotator) |
| 96 | +result, _ = captionizer.run(max_rows=10) |
| 97 | +``` |
| 98 | + |
| 99 | +## Inspecting the output |
| 100 | + |
| 101 | +The interactive explorer lets you step through the dataset one row at a time,
| 102 | +switch signals, and overlay detector events on the time series:
28 | 103 |
|
29 | 104 | ```bash |
30 | | -python3 explorer.py --min-wear-pct=50.0 |
| 105 | +python explorer.py --min-wear-pct=50.0 |
31 | 106 | ``` |
32 | 107 |
|
33 | | -Useful controls: |
| 108 | +## Layout |
34 | 109 |
|
35 | | -- Use the bottom row slider or `<` / `>` buttons to move between dataset rows. |
36 | | -- Click a signal in the right-hand signal list or in the channel overview heatmap to switch channels. |
37 | | -- Use the Matplotlib zoom and pan tools on the main plot to inspect parts of the signal in detail. |
38 | | -- Click `reset` or press `home` to reset the zoom. |
39 | | -- Use the overlay buttons to toggle `trend`, `spike`, `drop`, and `nonwear` overlays. |
40 | | -- Use the `stats`, `events`, `captions`, and `help` tabs in the details panel to switch what metadata is shown. |
41 | | -- Scroll inside the details panel with the mouse wheel or the `^` / `v` buttons. |
| 110 | +``` |
| 111 | +captionizer.py  Orchestration (Dataset + Transformer + Annotator)
| 112 | +annotator.py    Runs a list of extractors over one Recording
| 113 | +extractors/     Caption extractors (statistical / structural / semantic / cross-channel)
| 114 | +detectors/      Trend and spike detectors used by the structural extractor
| 115 | +synthesizers/   Cross-channel caption synthesizers (workout, sleep, …)
| 116 | +templates/      Caption phrasing templates
| 117 | +mhc/            MHC daily dataset adapter
| 118 | +mhc_weekly/     MHC weekly dataset adapter
| 119 | +timef/          Internal Recording / CaptionResult schema
| 120 | +exporters/      Arrow shard writer
| 121 | +scripts/        CLI entry points (caption export, training, eval)
| 122 | +``` |