| 1 | +--- |
| 2 | +layout: default |
| 3 | +nav_active: shared-tasks |
| 4 | +title: PAN at CLEF 2026 - Generated Plagiarism Detection |
| 5 | +description: PAN at CLEF 2026 - Generated Plagiarism Detection |
| 6 | +--- |
| 7 | +<nav class="uk-container"> |
| 8 | +<ul class="uk-breadcrumb"> |
| 9 | +<li><a href="../../index.html">PAN</a></li> |
| 10 | +<li><a href="../../shared-tasks.html">Shared Tasks</a></li> |
| 11 | +<li class="uk-disabled"><a href="#">Generated Plagiarism Detection</a></li> |
| 12 | +</ul> |
| 13 | +</nav> |
| 14 | + |
| 15 | +<main class="uk-section uk-section-default"> |
| 16 | + <div class="uk-container"> |
| 17 | + <div class="uk-container uk-margin-small"> |
| 18 | + <div> |
| 19 | + <h1 class="uk-margin-remove-top">Generative Plagiarism Detection 2026</h1> |
| 20 | + <ul class="uk-list"> |
| 21 | + <li><span data-uk-icon="chevron-down"></span><a class="uk-margin-small-right" href="#synopsis">Synopsis</a></li> |
| 22 | + <li><span data-uk-icon="chevron-down"></span><a class="uk-margin-small-right" href="#task">Task Overview</a></li> |
| 23 | + <li><span data-uk-icon="chevron-down"></span><a class="uk-margin-small-right" href="#data">Data</a></li> |
| 24 | +<!-- <li><span data-uk-icon="chevron-down"></span><a class="uk-margin-small-right" href="#submission">Submission</a></li>--> |
| 25 | + <li><span data-uk-icon="chevron-down"></span><a class="uk-margin-small-right" href="#results">Results</a></li> |
| 26 | + <li><span data-uk-icon="chevron-down"></span><a class="uk-margin-small-right" href="#related-work">Related Work</a></li> |
| 27 | + <li><span data-uk-icon="chevron-down"></span><a class="uk-margin-small-right" href="#task-committee">Task Committee</a></li> |
| 28 | + </ul> |
| 29 | + </div> |
| 30 | + </div> |
| 31 | + |
| 32 | + |
| 33 | + <div class="uk-container uk-margin-medium"> |
| 34 | + <h2 id="synopsis">Synopsis</h2> |
| 35 | + <ul> |
| 36 | + <li>Task: Given a pair of documents, your task is to identify all contiguous maximal-length passages of reused text between them.</li> |
| 37 | + <li>Important dates: |
| 38 | + <ul> |
| 39 | + <li><strong>May 07, 2026:</strong> software submission</li> |
| 40 | + <li><strong>May 28, 2026:</strong> participant notebook submission |
| 41 | + [<a href="../../pan-notebook-paper-template/pan-notebook-paper-template.zip">template</a>] |
| 42 | + [<a href="https://easychair.org/conferences/?conf=clef2026">submission</a> – <em>select "Stylometry and Digital Text Forensics (PAN)"</em> ]</li> |
| 43 | + </ul> |
| 44 | + </li> |
| 45 | +<!-- <li>Input: [<a href="{{ 'data.html#pan25-text-alignment' | relative_url }}">data</a>].</li>--> |
| 46 | +<!-- <li>Baselines: [<a href="https://github.com/pan-webis-de/pan-code/blob/master/clef25/generated-plagiarism-detection" target="_blank">code</a>].</li>--> |
| 47 | +<!-- <li>Evaluation: [<a href="https://github.com/pan-webis-de/pan-code/blob/master/clef25/generated-plagiarism-detection/evaluation" target="_blank">code</a>].</li>--> |
| 48 | +<!-- <li>Submission: Deployment on TIRA [<a href="https://www.tira.io/task-overview/pan25-generated-plagiarism-detection">submit</a>]</li>--> |
| 49 | + </ul> |
| 50 | + |
| 51 | + <h2 id="task">Task Overview</h2> |
| 52 | + <p> |
| 53 | + To develop your software, we provide you with a training and validation corpus that consists of pairs of |
| 54 | + documents, one of which may contain passages of text reused from the other. The reused text has been |
| 55 | + paraphrased automatically by an LLM to hide the fact that it was reused. Multiple LLMs have been used, |
| 56 | + and the documents may also contain genuine LLM-paraphrased text (i.e., text that is not reused). |
| 57 | + The input and output formats are the same as in previous text-alignment tasks. |
| 58 | + <a href="clef14/pan14-web/text-alignment.html">Learn more »</a> |
| 59 | + </p> |
| 60 | + |
| 61 | + |
| 62 | + <h2 id="data">Data</h2> |
| 63 | + <p>The dataset is available via <a href="https://zenodo.org/records/14969012">Zenodo</a>. |
| 64 | + Please register first at <a href="https://www.tira.io/task-overview/pan25-generated-plagiarism-detection">Tira</a>. |
| 65 | + The dataset contains copyrighted material and may be used only for research purposes. <strong>No redistribution allowed.</strong></p> |
| 66 | + |
| 67 | + <p>The training and validation corpora each contain two folders: (1) the text data and (2) the annotation data (with the <code>_truths</code> suffix). |
| 68 | + <ul> |
| 69 | + <li>Text Data: contains a <code>pairs</code> file which lists all pairs of suspicious documents (in the <code>susp</code> folder) and source documents (in the <code>src</code> folder) to be compared.</li> |
| 70 | + <li>Annotation Data: contains one XML file for each pair in the <code>pairs</code> file, giving the locations of the reused text and its source.</li> |
| 71 | + </ul> |
| 72 | + |
| 73 | + The annotation data contains the following information that should be used for training:</p> |
| 74 | + <pre class="prettyprint lang-xml" style="overflow-x:auto"><nobr><document reference="suspicious-documentXYZ.txt"></nobr> |
| 75 | + <feature |
| 76 | + name="plagiarism" |
| 77 | + this_offset="5" |
| 78 | + this_length="1000" |
| 79 | + <nobr>source_reference="source-documentABC.txt"</nobr> |
| 80 | + source_offset="100" |
| 81 | + source_length="1000" |
| 82 | + ... |
| 83 | + /> |
| 84 | + <feature |
| 85 | + name="altered" |
| 86 | + this_offset="5" |
| 87 | + this_length="1000" |
| 88 | + <nobr>source_reference="source-documentABC.txt"</nobr> |
| 89 | + ... |
| 90 | + /> |
| 91 | + ... |
| 92 | + </document></pre> |
| 93 | + <p>The <code>plagiarism</code> feature specifies a passage of text aligned between <code>suspicious-documentXYZ.txt</code> |
| 94 | + and <code>source-documentABC.txt</code>. In this example, the passage is 1000 characters long in both documents, starting at |
| 95 | + character offset 5 in the suspicious document and at character offset 100 in the source |
| 96 | + document. The other attributes allow for a more detailed analysis of the results and can be ignored for training.</p> |
| 97 | + |
| 98 | + <p>The <code>altered</code> feature specifies the location of paraphrased text that was not reused (no plagiarism). This makes it |
| 99 | + possible to distinguish between genuine LLM-generated text and reused text. For the evaluation, only the <code>plagiarism</code> features |
| 100 | + need to be predicted.</p> |
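<p>As a rough illustration of how the annotation format described above might be read, the following sketch uses Python's standard <code>xml.etree.ElementTree</code> module to keep only the <code>plagiarism</code> features and skip <code>altered</code> ones. The function name <code>read_plagiarism_spans</code> and the inline <code>EXAMPLE</code> string are hypothetical; the attribute values mirror the example above, and real annotation files may carry additional attributes.</p>

```python
import xml.etree.ElementTree as ET

# Attributes that locate an aligned passage (names taken from the example above).
SPAN_KEYS = ("this_offset", "this_length", "source_offset", "source_length")


def read_plagiarism_spans(xml_text):
    """Return one (this_offset, this_length, source_offset, source_length)
    tuple of ints per <feature name="plagiarism"> element."""
    root = ET.fromstring(xml_text)
    spans = []
    for feature in root.iter("feature"):
        if feature.get("name") != "plagiarism":
            continue  # "altered" features mark paraphrased but non-reused text
        spans.append(tuple(int(feature.get(key)) for key in SPAN_KEYS))
    return spans


# Hypothetical annotation content, modelled on the example in this section.
EXAMPLE = """<document reference="suspicious-documentXYZ.txt">
  <feature name="plagiarism" this_offset="5" this_length="1000"
           source_reference="source-documentABC.txt"
           source_offset="100" source_length="1000" />
  <feature name="altered" this_offset="5" this_length="1000"
           source_reference="source-documentABC.txt" />
</document>"""

spans = read_plagiarism_spans(EXAMPLE)
print(spans)
```

<p>Assuming offsets count characters from the start of the decoded text, the suspicious passage for a span would then be <code>text[this_offset:this_offset + this_length]</code>.</p>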
| 101 | + |
| 102 | + <p>For each pair <code>suspicious-documentXYZ.txt</code> and <code>source-documentABC.txt</code> in the <code>pairs</code> file, |
| 103 | + your plagiarism detector shall output an XML file <code>suspicious-documentXYZ-source-documentABC.xml</code> |
| 104 | + which specifies the locations of the plagiarism cases detected within. Each feature should be named <code>detected-plagiarism</code> |
| 105 | + and specify the offsets and lengths in the suspicious and the source document. No other attributes are evaluated. For example:</p> |
| 106 | + <pre class="prettyprint lang-xml" style="overflow-x:auto"><nobr><document reference="suspicious-documentXYZ.txt"></nobr> |
| 107 | + <feature |
| 108 | + name="detected-plagiarism" |
| 109 | + this_offset="5" |
| 110 | + this_length="1000" |
| 111 | + <nobr>source_reference="source-documentABC.txt"</nobr> |
| 112 | + source_offset="100" |
| 113 | + source_length="1000" |
| 114 | + /> |
| 115 | + <feature ... /> |
| 116 | + ... |
| 117 | + </document></pre> |
| 118 | + <p>For evaluation, the offset and length attributes of the <code>detected-plagiarism</code> features will be compared against the <code>plagiarism</code> features in the annotation data. |
| 119 | + No other information will be evaluated.</p> |
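<p>The output format above could be produced along the following lines. This is a minimal sketch using Python's standard library only; the helper names <code>output_filename</code> and <code>build_detection_xml</code> are hypothetical, and the detections passed in are placeholder values rather than real results.</p>

```python
import os
import xml.etree.ElementTree as ET


def output_filename(susp_name, src_name):
    """Build the result file name described above, e.g.
    suspicious-documentXYZ-source-documentABC.xml."""
    susp_stem = os.path.splitext(susp_name)[0]
    src_stem = os.path.splitext(src_name)[0]
    return f"{susp_stem}-{src_stem}.xml"


def build_detection_xml(susp_name, src_name, detections):
    """Serialize detections, given as (this_offset, this_length,
    source_offset, source_length) tuples, to an XML string."""
    root = ET.Element("document", reference=susp_name)
    for this_offset, this_length, source_offset, source_length in detections:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(this_offset),
            "this_length": str(this_length),
            "source_reference": src_name,
            "source_offset": str(source_offset),
            "source_length": str(source_length),
        })
    return ET.tostring(root, encoding="unicode")


# Placeholder detection mirroring the example values in this section.
susp, src = "suspicious-documentXYZ.txt", "source-documentABC.txt"
xml_out = build_detection_xml(susp, src, [(5, 1000, 100, 1000)])
print(output_filename(susp, src))
print(xml_out)
```

<p>Writing <code>xml_out</code> to a file named by <code>output_filename</code> for each line of the <code>pairs</code> file would yield one result file per document pair.</p>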
| 120 | + |
| 121 | + <h2 id="results">Results</h2> |
| 122 | + TBA. |
| 123 | + |
| 124 | + |
| 125 | + <h2 id="related-work">Related Work</h2> |
| 126 | + <ol> |
| 127 | + <li> |
| 128 | + <a href="{{ 'publications.html#?q=2014%20plagiarism%20potthast' | relative_url }}">Plagiarism Detection, PAN @ CLEF'14</a> |
| 129 | + </li> |
| 130 | + <li> |
| 131 | + <a href="{{ 'publications.html#?q=2013%20plagiarism%20potthast' | relative_url }}">Plagiarism Detection, PAN @ CLEF'13</a> |
| 132 | + </li> |
| 133 | + <li> |
| 134 | + <a href="{{ 'publications.html#?q=2012%20plagiarism%20potthast' | relative_url }}">Plagiarism Detection, PAN @ CLEF'12</a> |
| 135 | + </li> |
| 136 | + <li> |
| 137 | + <a href="{{ 'publications.html#?q=2011%20plagiarism%20potthast' | relative_url }}">Plagiarism Detection, PAN @ CLEF'11</a> |
| 138 | + </li> |
| 139 | + <li> |
| 140 | + <a href="{{ 'publications.html#?q=2010%20plagiarism%20potthast' | relative_url }}">Plagiarism Detection, PAN @ CLEF'10</a> |
| 141 | + </li> |
| 142 | + <li> |
| 143 | + <a href="{{ 'publications.html#?q=2009%20plagiarism%20potthast' | relative_url }}">Plagiarism Detection, PAN @ SEPLN'09</a> |
| 144 | + </li> |
| 145 | + </ol> |
| 146 | + |
| 147 | + <h2 id="task-committee">Task Committee</h2> |
| 148 | + <div data-uk-grid class="uk-grid uk-grid-match uk-grid-small thumbnail-card-grid"> |
| 149 | + {% include people-cards/greinerpetter.html %} |
| 150 | + {% include people-cards/philipwahle.html %} |
| 151 | + {% include people-cards/ruas.html %} |
| 152 | + {% include people-cards/gipp.html %} |
| 153 | + </div> |
| 154 | + <div class="uk-container uk-padding-large uk-padding-remove-bottom"> |
| 155 | + {% include organizations/clef-organizations-section.html year=2026 %} |
| 156 | + </div> |
| 157 | + </div> |
| 158 | + </div> |
| 159 | +</main> |