<|im_start|>{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":1,"event_type":"Symptom","event_date":"04\\/15\\/2014","event_name":"Immediate eruption after tick bite","provider_name":"Dr. Keiko Sato","reason":"No apparent cause","result":"Eruption of numerous vesicular papulovesicular lesions","details":"{\\"intensity\\": \\"severe\\", \\"location\\": \\"scalp and extremities\\"}","notes":"Patient reported severe itch and stinging, lesion development observed rapidly after tick bite"}\r\n{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":2,"event_type":"Symptom","event_date":"04\\/16\\/2014","event_name":"Progressive vesicular eruption","provider_name":"Dr. Keiko Sato","reason":"Persisting symptom of itching and pain","result":"Multiple, painful vesicles with crusting lesions","details":"{\\"intensity\\": \\"moderate\\", \\"location\\": \\"scalp and Gephardian tonsillar area\\"}","notes":"Lesions observed in exhaustive tack of oval vitality without distinct central cumping. Typical symptoms of MS identified in poison ivy-associated disease."}\r\n{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":3,"event_type":"Diagnosis Test","event_date":"04\\/17\\/2014","event_name":"Antibody test for four-band potassimethylammonium hemotoperoxidase and four-band influenza A and B","provider_name":"Kawasaki General Hospital","reason":"Confirmation of potential MS diagnosis","result":"Negative for MAGP and influenza A and B antibodies","details":"{\\"intensity\\": null, \\"location\\": null}","notes":"Tests conducted confirm absence of MS indications. 
Antibody specific levels respected with control standards. Additional diagnostics recommended"}\r\n{"patient_id":"pmc-6028416-1","first_name":"Ayomi","last_name":"Yamamoto","date_of_birth":"05\\/23\\/1941","sex":"Female","race":"Asian","weight":140.0,"height":63.0,"event_id":4,"event_type":"Treatment","event_date":"04\\/16\\/2014","event_name":"Oral Acyclovir and Heqedaris","provider_name":"Dr. Harumi Ikegami","reason":"Treatment for scratched-specific meningoencephalitis","result":"Acyclovir administration for treatment of differentialiation. Patient\'s favorable response to antiviral therapy initiated","details":"{\\"dosage\\": \\"500 mg\\", \\"frequency\\": \\"every 8 hours\\"}","notes":"Patient responded well, without significant adverse effects. Patient will continue course with regular follow-up and ongoing observation."}\r\n<|im_end|>
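# Run in an IPython breakpoint after _generate_batch in vllm_backend.py.
for idx, output in enumerate(outputs):
    text = output.outputs[0].text
    if "\r\n" in text:
        # chr(13) is "\r", so this counts carriage returns in the completion.
        print(f"prompt {idx}: has \\r\\n ({text.count(chr(13))} occurrences)")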
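# Existing UTF-8 sanitization in Processor.__call__.
text = text.encode("utf-8", "ignore").decode("utf-8")
# Suggested fix: normalize Windows-style line endings before record extraction.
text = text.replace("\r\n", "\n")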
Priority Level
Medium (Annoying but has workaround)
Describe the bug
The JSONL record extraction regex {.+?}(?=\n|$) in record_utils.py fails to match valid records when the model generates Windows-style \r\n line endings instead of \n. The lookahead (?=\n|$) requires } to be immediately followed by \n or the end of the text, but with \r\n a \r sits between } and \n, so no record matches and the entire completion is treated as non-record text.
Impact
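The effect can be reproduced outside the pipeline. The sketch below is a minimal illustration, not the code in record_utils.py: it assumes the pattern is applied with re.findall and no flags, and the two short records are made up for the example.

```python
import re

# Pattern quoted in this report. Applying it via findall with no flags
# is an assumption; the real call site in record_utils.py is not shown here.
RECORD_PATTERN = re.compile(r"{.+?}(?=\n|$)")

# Two toy JSONL records (hypothetical data), once with LF and once with CRLF.
lf_text = '{"event_id": 1}\n{"event_id": 2}\n'
crlf_text = lf_text.replace("\n", "\r\n")

lf_records = RECORD_PATTERN.findall(lf_text)
crlf_records = RECORD_PATTERN.findall(crlf_text)

print(f"LF records found:   {len(lf_records)}")
print(f"CRLF records found: {len(crlf_records)}")
```

With LF endings both records are matched; with CRLF endings the lookahead never succeeds, so nothing is extracted.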
How it was discovered
The new generation token statistics feature quantifies non-record tokens per batch. Manual inspection of completions with high non-record token counts turned up the \r\n pattern, as in the sample completion at the top of this report.
Steps/Code to reproduce bug
In an IPython breakpoint after _generate_batch in vllm_backend.py, run the for idx, output in enumerate(outputs) loop shown near the top of this report.
Expected behavior
Records with \r\n line endings should be extracted and validated the same as records with \n; line ending style should have no effect on record extraction.
Additional context
Suggested fix
Strip \r in Processor.__call__ by replacing \r\n with \n, alongside the existing UTF-8 sanitization (the two text = ... lines near the top of this report). This is the narrowest fix. Alternatively, the regex could be updated to {.+?}(?=\r?\n|$), but normalizing early is more defensive since \r could affect other downstream parsing too.
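The two options can be compared on a toy input. This is a sketch under stated assumptions rather than the actual Processor.__call__ code: normalize_newlines is a hypothetical helper standing in for the suggested text.replace line, the records are made up, and both patterns are assumed to run with re.findall and no flags.

```python
import re

ORIGINAL = re.compile(r"{.+?}(?=\n|$)")      # pattern quoted in this report
TOLERANT = re.compile(r"{.+?}(?=\r?\n|$)")   # alternative pattern from this report


def normalize_newlines(text: str) -> str:
    """Hypothetical helper mirroring the suggested fix: CRLF becomes LF."""
    return text.replace("\r\n", "\n")


# Toy completion with Windows-style line endings (hypothetical data).
crlf_text = '{"event_id": 1}\r\n{"event_id": 2}\r\n'

# Option 1: normalize early, keep the original pattern.
via_normalize = ORIGINAL.findall(normalize_newlines(crlf_text))

# Option 2: leave the text alone, make the lookahead accept an optional \r.
via_regex = TOLERANT.findall(crlf_text)

# A later step that splits on "\n" still sees the \r unless the text was normalized.
first_line_raw = crlf_text.split("\n")[0]
first_line_normalized = normalize_newlines(crlf_text).split("\n")[0]

print(via_normalize == via_regex)
print(repr(first_line_raw), repr(first_line_normalized))
```

Both approaches recover the same records on this input; the difference is that only normalization removes the stray \r for later steps, such as a plain split on \n.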