WhoColor: template-bleed span on {{multiple issues}} (parity-matches upstream) #1

@ragesoss

Symptom

For some articles whose wikitext contains a {{multiple issues}}
(or similar) template wrapping nested templates separated by
newlines, the WhoColor-extended HTML contains a MediaWiki preview
warning at the top:

Preview warning: Page using Template:Multiple issues with unknown parameter <templatestyles>...</templatestyles><table ...>

Example: en/Icaro at rev 1316552020 (token list captured
2026-05-24). Surfaced as 1/20 articles flagged in the Wiki Experts
course parity suite.

TL;DR

This is not a Rust port bug — our whocolor_wikitext.rs matches
the upstream Python WhoColor.parser.WikiMarkupParser exactly. Both
emit the same <span class="editor-token ...">}}</span> around the
outer }} of {{multiple issues|…}}. Production
(https://wikiwho-api.wmcloud.org) and our deployment
(https://wikiwho-rs.wmcloud.org) return byte-identical HTML modulo
trailing MW parser-cache metadata (server hostname, timestamps,
Lua/CPU timings, Render ID).

So we inherit this bug from upstream. Fixing it would mean
deliberately diverging from production — which we won't do without
explicit reason, per the project's parity-or-die quality bar.

Reproduction

scripts/icaro_compare_prod.py fetches both endpoints for en/Icaro
and confirms:

prod html bytes: 68790
ours html bytes: 68790
prod has "Preview warning": True
ours has "Preview warning": True
prod has "unknown parameter": True
ours has "unknown parameter": True

Total byte diff: 194 bytes, all in the trailing MW parser-cache
footer (server hostname, timestamps, Template:* timing values,
Render ID). The span insertion location and surrounding text are
identical.
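
The comparison above can be sketched as follows. This is a minimal illustration, not the contents of scripts/icaro_compare_prod.py (which are not shown here): compare_payloads is a hypothetical helper, and the two byte strings are made-up stand-ins for the fetched responses. The positional comparison is only meaningful when both payloads have the same length, as they do in the Icaro case.

```python
def compare_payloads(prod: bytes, ours: bytes,
                     markers=(b"Preview warning", b"unknown parameter")):
    """Summarize how two HTML payloads differ (hypothetical helper)."""
    # Offsets where the payloads disagree, over their common length.
    differing = [i for i, (a, b) in enumerate(zip(prod, ours)) if a != b]
    return {
        "prod_len": len(prod),
        "ours_len": len(ours),
        # Any length mismatch also counts as differing bytes.
        "diff_bytes": len(differing) + abs(len(prod) - len(ours)),
        "first_diff": differing[0] if differing else None,
        # For each marker: (present in prod, present in ours).
        "markers": {m.decode(): (m in prod, m in ours) for m in markers},
    }


# Made-up payloads that differ only in a trailing footer comment.
prod = b"<p>Preview warning: unknown parameter</p><!-- host=a -->"
ours = b"<p>Preview warning: unknown parameter</p><!-- host=b -->"
report = compare_payloads(prod, ours)
print(report)
```

If first_diff falls inside the parser-cache footer, every differing byte is in the footer, which is the property the reproduction relies on.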

Minimal Python repro of the upstream bug (using
WhoColor.parser.WikiMarkupParser directly):

input:  "{{outer|{{inner-a|x=1}}\n{{inner-b|y=2}}}}"
output: '{{outer|{{inner-a|x=1}}\n{{inner-b|y=2}}<span class="editor-token token-editor-1" id="token-17">}}</span>'

Without the newline between {{inner-a}} and {{inner-b}}, no
bleed:

input:  "{{Infobox |a = {{nest |[[L1]] |[[L2]]}}|b = end}}"
output: '{{Infobox |a = {{nest |[[L1]] |[[L2]]}}|b = end}}'

Root cause (verified by tracing the Python parser)

In WhoColor/parser.py::WikiMarkupParser.__get_next_special_element:

def __get_next_special_element(self):
    next_ = {}
    for special_markup in SPECIAL_MARKUPS:
        found_markup = self.__get_first_regex(special_markup['start_regex'])
        if found_markup is not None and \
           (not next_ or next_['start'] > found_markup['start']) and \
           found_markup['start'] not in self._jumped_elems:
            next_ = special_markup
            ...

__get_first_regex returns the first match of the regex from
_wiki_text_pos. If that one match is in _jumped_elems, the entire
markup type is dropped for this call — the regex is NOT re-searched
past the jumped position.

Concretely on Icaro: when the parser has just descended into the
outer {{multiple issues at substituted position 54,
_jumped_elems = {0, 43, 54}. The first __get_next_special_element
call inside that new frame at pos=54:

markup            first match from pos=54                    in jumped?  becomes candidate?
{{ template       pos=54 (the multi-issues {{ itself)        yes         no, entire type skipped
(=+|;) heading    pos=100 (the = inside |date=April 2016)    no          yes
(WIKICOLORLB)+    pos=113 (newline after inner }})           no          yes (but later)

So next_special_elem becomes the = at pos=100 — deep inside
the inner {{more citations needed}} body. The parser never
notices {{more citations needed at pos=72 (the nested template
open) and never descends into it.
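
The lookup behavior described above can be modeled in a few lines. This is a simplified sketch, not the upstream code: MARKUPS is a three-entry stand-in for SPECIAL_MARKUPS, a literal newline replaces the WIKICOLORLB substitution, and next_special_first_match is a hypothetical name. It runs on the synthetic input from the Reproduction section, with the outer {{ at offset 0 already marked as jumped.

```python
import re

# Assumed, simplified stand-ins for the upstream markup table.
MARKUPS = [
    ("template", re.compile(r"\{\{")),
    ("heading", re.compile(r"(=+|;)")),
    ("linebreak", re.compile(r"\n+")),
]


def next_special_first_match(text, pos, jumped):
    """Model of the lookup: only the first match per markup type is considered."""
    best = None
    for name, regex in MARKUPS:
        m = regex.search(text, pos)
        # If the first match is missing or already jumped, the whole
        # markup type is dropped for this call (no re-search past it).
        if m is None or m.start() in jumped:
            continue
        if best is None or m.start() < best[1]:
            best = (name, m.start())
    return best


text = "{{outer|{{inner-a|x=1}}\n{{inner-b|y=2}}}}"
result = next_special_first_match(text, pos=0, jumped={0})
print(result)  # ('heading', 19)
```

The nested {{ at offset 8 is never considered, because the template regex's first match (offset 0) is in the jumped set, so the = at offset 19 wins.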

Cascading effect:

  1. Multi-issues frame iterates tokens 9-17 (multiple, issues,
    |, {{, more, citations, needed, |, date). For each,
    next_special.start=100 < token.end is false, so no descent.
  2. At token-18, the = (end=101), the frame descends into the
    heading-marker = at pos=100. Single markup, no_jump=True.
    Consumes the =. Returns at pos=101.
  3. Re-derives next_special from pos=101 — now finds linebreak at
    pos=113.
  4. Continues. Writes tokens 19 (april), 20 (2016), 21 (}},
    the inner }} of more-citations-needed). Cursor → 113.
  5. At token-22 {{ (end=126, the original-research open),
    special_elem_end.end=113 < token.end=126 triggers. The
    multi-issues frame returns at pos=113, treating the inner }}
    as its own end.
  6. Top-level frame resumes at pos=113. Descends into WIKICOLORLB
    (linebreak), then into {{original research…}} normally.
  7. After original-research exits at pos=161, top-level processes
    token-31 }} (the outer multi-issues close) as a regular
    token at top-level with add_spans=True — wrapping it in a
    <span class="editor-token …" id="token-31">…</span>. That's
    the bleed.

Why the no-newline (Curzon) case works: the same
__get_next_special_element lookup returns a wrong markup at first
(an = inside the inner template), but after that markup's recursion
exits at pos=14, the next lookup from pos=14 correctly finds {{ at
pos=15 (not in _jumped_elems, not shadowed by an earlier {{
match). The parser then descends into the inner template normally.
The newline case fails because after the = recursion exits at
pos=101, the cursor has already passed the inner template's open
at pos=72, so that open is never revisited.
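
A quick offset check on the two synthetic inputs from the Reproduction section shows the ordering this explanation depends on. It only computes string offsets and does not run the parser; first_offsets is a hypothetical helper.

```python
def first_offsets(text):
    """Return (offset of the nested '{{', offset of the first '=')."""
    nested_open = text.find("{{", 2)  # skip the outer '{{' at offset 0
    first_eq = text.find("=")
    return nested_open, first_eq


no_newline = "{{Infobox |a = {{nest |[[L1]] |[[L2]]}}|b = end}}"
with_newline = "{{outer|{{inner-a|x=1}}\n{{inner-b|y=2}}}}"

print(first_offsets(no_newline))    # (15, 13)
print(first_offsets(with_newline))  # (8, 19)
```

In the first input the = (offset 13) comes before the nested open (offset 15), so the cursor is still ahead of that open after the = is consumed. In the second the = (offset 19) comes after the nested open (offset 8), so the cursor has already moved past it.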

To fix upstream

__get_next_special_element needs to find the first match that is not
in _jumped_elems, rather than taking the first match and rejecting
the whole markup type when that match is jumped. The minimal change
would be to iterate with finditer and skip jumped positions:

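for special_markup in SPECIAL_MARKUPS:
    # Scan forward from the cursor; take the first match not already jumped.
    for m in special_markup['start_regex'].finditer(self.wiki_text, self._wiki_text_pos):
        if m.start() not in self._jumped_elems:
            found_markup = {'str': m.group(), 'start': m.start()}
            break
    else:
        # No match, or every match of this markup type is jumped.
        continue
    # Keep the earliest candidate across all markup types.
    if not next_ or next_['start'] > found_markup['start']:
        next_ = special_markup
        next_['start'] = found_markup['start']
        ...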

This would make the Icaro case work, and shouldn't regress the
no-newline case since the existing first-match behavior is
preserved for any markup whose first match isn't jumped.
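
A standalone model of the proposed lookup illustrates both claims. As before, this is a sketch under assumptions rather than a patch: MARKUPS is a simplified stand-in for SPECIAL_MARKUPS, a literal newline replaces WIKICOLORLB, and next_special_skip_jumped is a hypothetical name.

```python
import re

# Assumed, simplified stand-ins for the upstream markup table.
MARKUPS = [
    ("template", re.compile(r"\{\{")),
    ("heading", re.compile(r"(=+|;)")),
    ("linebreak", re.compile(r"\n+")),
]


def next_special_skip_jumped(text, pos, jumped):
    """Model of the proposed lookup: first match per markup that is not jumped."""
    best = None
    for name, regex in MARKUPS:
        # Walk matches from the cursor and stop at the first unjumped one.
        start = next(
            (m.start() for m in regex.finditer(text, pos) if m.start() not in jumped),
            None,
        )
        if start is None:
            continue
        if best is None or start < best[1]:
            best = (name, start)
    return best


text = "{{outer|{{inner-a|x=1}}\n{{inner-b|y=2}}}}"
fixed = next_special_skip_jumped(text, 0, {0})
unjumped = next_special_skip_jumped(text, 0, set())
print(fixed)     # ('template', 8)
print(unjumped)  # ('template', 0)
```

With offset 0 jumped, the nested {{ at offset 8 is now selected ahead of the = at offset 19. With nothing jumped, the result is the plain first match, the same as the existing behavior.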

Why we're not fixing it here

Per CLAUDE.md: parity-or-die. Diverging from upstream silently is exactly the
class of change the parity corpus is designed to catch — except
this case is a divergence we'd be introducing intentionally to
fix a bug that all consumers tolerate today. Not worth the risk of
shifting token IDs or breaking a consumer that relies on the
current behavior, unless an actual consumer surfaces a problem.

Right escalation if we ever want to fix it: contribute upstream to
wikimedia/WhoColor (or the current canonical home) and let the fix
flow back.

Reproduction artifacts

  • /tmp/icaro.wt (8,100 bytes) — captured wikitext
  • /tmp/icaro_tokens.json (2,144 tokens) — token list from algorithm
  • scripts/icaro_trace.py — runs the Python upstream parser on the
    captured input and shows the bled span in the output
  • scripts/icaro_compare_prod.py — fetches both production and our
    deployment to confirm byte-level parity
  • scripts/icaro_run_python.py — minimal Curzon-vs-Icaro synthetic
    comparison
  • scripts/icaro_trace_python.py — monkey-patched Python parser
    that logs every __parse_wiki_text / __get_next_special_element
    / __get_special_elem_end call in the multi-issues region

Follow-up notes

The current parity suite (/tmp/whocolor_parity_suite.py) checks for
Preview warning and unknown parameter only in our HTML; it does not
look at whether prod's HTML contains them too. A future tightening
would flag only asymmetric warnings (present in ours but not in
prod's). That change is orthogonal to this issue.
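
One possible shape for that tightening, as a sketch only: asymmetric_markers and MARKERS are hypothetical names, not taken from whocolor_parity_suite.py, and the HTML strings are made up.

```python
MARKERS = ("Preview warning", "unknown parameter")


def asymmetric_markers(ours_html: str, prod_html: str, markers=MARKERS):
    """Return the markers present in our HTML but absent from production's."""
    return [m for m in markers if m in ours_html and m not in prod_html]


# Both sides carry the warning: parity holds, nothing is flagged.
both = asymmetric_markers(
    "<p>Preview warning: unknown parameter</p>",
    "<p>Preview warning: unknown parameter</p>",
)
# Only our side carries the warning: this is a real divergence.
ours_only = asymmetric_markers("<p>Preview warning: x</p>", "<p>clean</p>")
print(both, ours_only)  # [] ['Preview warning']
```

Under this check the Icaro article would no longer be flagged, since production emits the same warning.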

The synthetic regression test
whocolor_wikitext::regression_tests::nested_template_inside_template_does_not_emit_spans
covers the no-newline Curzon-Ultimatum shape and continues to assert
that correctly. This issue is about the newline-separated variant
only.

crates/wikiwho-server/tests/icaro_repro.rs (gated #[ignore]) was
written assuming the bleed was a Rust port bug; its assertion
(span_count == 0) contradicts production behavior. It should be
deleted or rewritten to assert parity-with-production (1 span at
token-31) once we close this issue.
