
JSONL record extraction fails on completions with \r\n line endings #405

@seayang-nv

Priority Level

Medium (Annoying but has workaround)

Describe the bug

The JSONL record extraction regex {.+?}(?=\n|$) in record_utils.py fails to match valid records when the model generates Windows-style \r\n line endings instead of \n. The lookahead (?=\n|$) requires } to be immediately followed by \n or the end of the string. With \r\n endings, \r sits between } and \n, so no record matches and the entire completion is treated as non-record text.
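The mismatch can be reproduced in isolation. The following is a minimal sketch, assuming the pattern is compiled with Python's re module and default flags (the issue does not show how record_utils.py compiles it); the braces are escaped for clarity, and the two-record completions are made-up stand-ins for real model output.

```python
import re

# Assumption: default flags; the actual compilation in record_utils.py is not shown in this issue.
RECORD_PATTERN = re.compile(r"\{.+?\}(?=\n|$)")

# Hypothetical two-record completions that differ only in line-ending style.
lf_completion = '{"event_id": 1}\n{"event_id": 2}\n'
crlf_completion = '{"event_id": 1}\r\n{"event_id": 2}\r\n'

lf_records = RECORD_PATTERN.findall(lf_completion)
crlf_records = RECORD_PATTERN.findall(crlf_completion)

print(len(lf_records))    # both records are found with \n endings
print(len(crlf_records))  # no records are found with \r\n endings
```

With \r\n endings, every } is followed by \r rather than \n, and . cannot cross the \n to extend the match, so findall returns an empty list.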

Impact

  • All records in an affected completion are silently dropped (zero valid, zero invalid).
  • For grouped/time-series generation, this surfaces as "Groupby Generation Failed" errors.
  • The issue is non-deterministic -- the same model produces a mix of \n and \r\n completions across a batch, so generation still succeeds overall but with reduced yield.

How it was discovered

The new generation token statistics feature quantifies non-record tokens per batch. Manual inspection of completions with high non-record token counts revealed the \r\n pattern, as in the following completion:

<|im_start|>{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":1,"event_type":"Symptom","event_date":"04\\/15\\/2014","event_name":"Immediate eruption after tick bite","provider_name":"Dr. Keiko Sato","reason":"No apparent cause","result":"Eruption of numerous vesicular papulovesicular lesions","details":"{\\"intensity\\": \\"severe\\", \\"location\\": \\"scalp and extremities\\"}","notes":"Patient reported severe itch and stinging, lesion development observed rapidly after tick bite"}\r\n{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":2,"event_type":"Symptom","event_date":"04\\/16\\/2014","event_name":"Progressive vesicular eruption","provider_name":"Dr. Keiko Sato","reason":"Persisting symptom of itching and pain","result":"Multiple, painful vesicles with crusting lesions","details":"{\\"intensity\\": \\"moderate\\", \\"location\\": \\"scalp and Gephardian tonsillar area\\"}","notes":"Lesions observed in exhaustive tack of oval vitality without distinct central cumping. Typical symptoms of MS identified in poison ivy-associated disease."}\r\n{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":3,"event_type":"Diagnosis Test","event_date":"04\\/17\\/2014","event_name":"Antibody test for four-band potassimethylammonium hemotoperoxidase and four-band influenza A and B","provider_name":"Kawasaki General Hospital","reason":"Confirmation of potential MS diagnosis","result":"Negative for MAGP and influenza A and B antibodies","details":"{\\"intensity\\": null, \\"location\\": null}","notes":"Tests conducted confirm absence of MS indications. 
Antibody specific levels respected with control standards. Additional diagnostics recommended"}\r\n{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":4,"event_type":"Treatment","event_date":"04\\/16\\/2014","event_name":"Oral Acyclovir and Heqedaris","provider_name":"Dr. Harumi Ikegami","reason":"Treatment for scratched-specific meningoencephalitis","result":"Acyclovir administration for treatment of differentialiation. Patient\'s favorable response to antiviral therapy initiated","details":"{\\"dosage\\": \\"500 mg\\", \\"frequency\\": \\"every 8 hours\\"}","notes":"Patient responded well, without significant adverse effects. Patient will continue course with regular follow-up and ongoing observation."}\r\n<|im_end|>

Steps/Code to reproduce bug

In an IPython breakpoint after _generate_batch in vllm_backend.py:

for idx, output in enumerate(outputs):
    text = output.outputs[0].text
    crlf_count = text.count("\r\n")  # count \r\n pairs rather than bare \r characters
    if crlf_count:
        print(f"prompt {idx}: has \\r\\n ({crlf_count} occurrences)")

Expected behavior

Records with \r\n line endings should be extracted and validated the same as records with \n. Line-ending style should have no effect on record extraction.

Additional context

Suggested fix

Normalize \r\n to \n in Processor.__call__, alongside the existing UTF-8 sanitization:

text = text.encode("utf-8", "ignore").decode("utf-8")
text = text.replace("\r\n", "\n")

This is the narrowest fix. Alternatively, the regex could be updated to {.+?}(?=\r?\n|$), but normalizing early is more defensive since \r could affect other downstream parsing too.
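The two options can be compared side by side. This is a sketch under stated assumptions, not the project's implementation: normalize_completion is a hypothetical helper standing in for the relevant lines of Processor.__call__, both patterns are assumed to use default re flags, and the sample completion is made up.

```python
import re

# Assumption: default flags; braces escaped for clarity.
RECORD_PATTERN = re.compile(r"\{.+?\}(?=\n|$)")
CRLF_TOLERANT_PATTERN = re.compile(r"\{.+?\}(?=\r?\n|$)")


def normalize_completion(text: str) -> str:
    """Hypothetical helper mirroring the suggested fix."""
    # Existing sanitization from the issue: drop anything not encodable as UTF-8.
    text = text.encode("utf-8", "ignore").decode("utf-8")
    # Suggested fix: convert Windows-style line endings before extraction.
    return text.replace("\r\n", "\n")


# Hypothetical completion with \r\n line endings.
completion = '{"event_id": 1}\r\n{"event_id": 2}\r\n'

# Option 1: normalize early, keep the original regex.
records_after_normalizing = RECORD_PATTERN.findall(normalize_completion(completion))

# Option 2: leave the text alone, make the lookahead accept an optional \r.
records_with_tolerant_regex = CRLF_TOLERANT_PATTERN.findall(completion)

print(records_after_normalizing == records_with_tolerant_regex)
```

Both options recover the same records for this input. Normalizing early has the added effect that any later step splitting on \n never sees a trailing \r.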

Labels: bug
