JSONL record extraction fails on completions with `\r\n` line endings #405

### Priority Level

Medium (Annoying but has workaround)

### Describe the bug

The JSONL record extraction regex `{.+?}(?=\n|$)` in `record_utils.py` fails to match valid records when the model generates Windows-style `\r\n` line endings instead of `\n`. The lookahead `(?=\n|$)` requires `}` to be immediately followed by `\n` or the end of the text, but with `\r\n` endings a `\r` sits in between, so no record matches and the entire completion is treated as non-record text.
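
The mismatch can be reproduced in isolation with the standard `re` module. This is a minimal sketch that uses only the pattern quoted above; the two-record inputs are made up for illustration, and the real extraction code in `record_utils.py` may wrap the pattern differently.

```python
import re

# Pattern as quoted in this issue; how record_utils.py applies it is not shown here.
RECORD_PATTERN = re.compile(r"{.+?}(?=\n|$)")

# Illustrative inputs: the same two records with LF and with CRLF line endings.
lf_text = '{"a": 1}\n{"a": 2}\n'
crlf_text = '{"a": 1}\r\n{"a": 2}\r\n'

lf_matches = RECORD_PATTERN.findall(lf_text)
crlf_matches = RECORD_PATTERN.findall(crlf_text)

print(len(lf_matches), len(crlf_matches))
```

With CRLF input every `}` is followed by `\r`, and `.` cannot cross the `\n` to reach a later boundary, so `findall` returns an empty list.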

## Impact
- All records in an affected completion are silently dropped (zero valid, zero invalid).
- For grouped/time-series generation, this surfaces as "Groupby Generation Failed" errors.
- The issue is non-deterministic: the same model produces a mix of `\n` and `\r\n` completions across a batch, so generation still succeeds overall but with reduced yield.

## How it was discovered
The new generation token statistics feature quantifies non-record tokens per batch. Manual inspection of completions with high non-record token counts revealed the `\r\n` pattern, as in the following completion:

```
<|im_start|>{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":1,"event_type":"Symptom","event_date":"04\\/15\\/2014","event_name":"Immediate eruption after tick bite","provider_name":"Dr. Keiko Sato","reason":"No apparent cause","result":"Eruption of numerous vesicular papulovesicular lesions","details":"{\\"intensity\\": \\"severe\\", \\"location\\": \\"scalp and extremities\\"}","notes":"Patient reported severe itch and stinging, lesion development observed rapidly after tick bite"}\r\n{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":2,"event_type":"Symptom","event_date":"04\\/16\\/2014","event_name":"Progressive vesicular eruption","provider_name":"Dr. Keiko Sato","reason":"Persisting symptom of itching and pain","result":"Multiple, painful vesicles with crusting lesions","details":"{\\"intensity\\": \\"moderate\\", \\"location\\": \\"scalp and Gephardian tonsillar area\\"}","notes":"Lesions observed in exhaustive tack of oval vitality without distinct central cumping. Typical symptoms of MS identified in poison ivy-associated disease."}\r\n{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":3,"event_type":"Diagnosis Test","event_date":"04\\/17\\/2014","event_name":"Antibody test for four-band potassimethylammonium hemotoperoxidase and four-band influenza A and B","provider_name":"Kawasaki General Hospital","reason":"Confirmation of potential MS diagnosis","result":"Negative for MAGP and influenza A and B antibodies","details":"{\\"intensity\\": null, \\"location\\": null}","notes":"Tests conducted confirm absence of MS indications. 
Antibody specific levels respected with control standards. Additional diagnostics recommended"}\r\n{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":4,"event_type":"Treatment","event_date":"04\\/16\\/2014","event_name":"Oral Acyclovir and Heqedaris","provider_name":"Dr. Harumi Ikegami","reason":"Treatment for scratched-specific meningoencephalitis","result":"Acyclovir administration for treatment of differentialiation. Patient\'s favorable response to antiviral therapy initiated","details":"{\\"dosage\\": \\"500 mg\\", \\"frequency\\": \\"every 8 hours\\"}","notes":"Patient responded well, without significant adverse effects. Patient will continue course with regular follow-up and ongoing observation."}\r\n<|im_end|>
```
### Steps/Code to reproduce bug

In an IPython breakpoint after `_generate_batch` in `vllm_backend.py`:
```
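# `outputs` is assumed to be the list of vLLM request outputs in scope at the breakpoint.
for idx, output in enumerate(outputs):
    text = output.outputs[0].text
    crlf_count = text.count("\r\n")  # count CRLF pairs rather than bare \r characters
    if crlf_count:
        print(f"prompt {idx}: has \\r\\n ({crlf_count} occurrences)")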
```
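
To gauge how much of a batch is affected without a live breakpoint, the same check can be run over plain strings. The helper names below (`classify_line_endings`, `summarize_line_endings`) are hypothetical and not part of the codebase; this is a rough diagnostic sketch, not existing tooling.

```python
from collections import Counter
from typing import Iterable


def classify_line_endings(text: str) -> str:
    """Label a completion by the line-ending styles it contains."""
    has_crlf = "\r\n" in text
    # Any "\n" left after removing CRLF pairs is a bare LF.
    has_bare_lf = "\n" in text.replace("\r\n", "")
    if has_crlf and has_bare_lf:
        return "mixed"
    if has_crlf:
        return "crlf"
    if has_bare_lf:
        return "lf"
    return "none"


def summarize_line_endings(texts: Iterable[str]) -> Counter:
    """Count completions per line-ending label across a batch."""
    return Counter(classify_line_endings(t) for t in texts)


# Illustrative batch; in the breakpoint this could be
# [output.outputs[0].text for output in outputs].
sample = ['{"a":1}\n', '{"a":1}\r\n{"a":2}\r\n', '{"a":1}\r\n{"a":2}\n', '{"a":1}']
summary = summarize_line_endings(sample)
print(dict(summary))
```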

### Expected behavior

Records with `\r\n` line endings should be extracted and validated the same as records with `\n`; line-ending style should have no effect on record extraction.

### Additional context

## Suggested fix
Normalize `\r\n` to `\n` in `Processor.__call__` alongside the existing UTF-8 sanitization:
```
text = text.encode("utf-8", "ignore").decode("utf-8")
text = text.replace("\r\n", "\n")
```
This is the narrowest fix. Alternatively, the regex could be updated to `{.+?}(?=\r?\n|$)`, but normalizing early is more defensive since a stray `\r` could affect other downstream parsing too.
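
The two options can be compared side by side on a small CRLF sample. In this sketch, `sanitize` is a hypothetical stand-in for the relevant lines of `Processor.__call__` (the real method is not reproduced here), and the sample completion is invented.

```python
import re

ORIGINAL_PATTERN = re.compile(r"{.+?}(?=\n|$)")        # pattern quoted in this issue
CRLF_AWARE_PATTERN = re.compile(r"{.+?}(?=\r?\n|$)")   # alternative from this section


def sanitize(text: str) -> str:
    """Hypothetical stand-in for the sanitization step plus the proposed normalization."""
    text = text.encode("utf-8", "ignore").decode("utf-8")
    return text.replace("\r\n", "\n")


completion = '{"id": 1}\r\n{"id": 2}\r\n'

# Option 1: normalize early, keep the original pattern unchanged.
via_normalization = ORIGINAL_PATTERN.findall(sanitize(completion))
# Option 2: leave the text as-is, accept an optional \r in the lookahead.
via_regex = CRLF_AWARE_PATTERN.findall(completion)

print(via_normalization == via_regex, len(via_normalization))
```

Both routes recover the same records on this input. The difference is that normalization also removes the `\r` for any later step that reads the text, while the regex change only fixes the record boundary.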
