NLPCC-2025 Shared Task 1: LLM-Generated Text Detection

Task Introduction

With the rapid advancement of large language models (LLMs), the quality of their generated text is increasingly approaching that of human-written content. However, these models also pose significant challenges: they may produce hallucinated information or harmful content, or be misused in various ways. As a result, effectively distinguishing between text generated by LLMs and text authored by humans has become a critical and pressing issue. While significant progress has been made in detecting LLM-generated text, most of this research has focused on English; studies targeting Chinese remain relatively scarce. This shared task seeks to close this gap by fostering the development of more robust detectors for LLM-generated text, thereby advancing research in this area for Chinese.

Participants are tasked with designing and building detection algorithms, using the provided raw training data, to distinguish between LLM-generated and human-written text. During the evaluation phase, the submitted detectors will be rigorously tested under conditions that closely simulate real-world scenarios, especially out-of-distribution data, to comprehensively assess their practical effectiveness and robustness. To ensure fairness and the traceability of results, participants are strictly prohibited from using external data sources or generating new data samples based on external knowledge. Additionally, all training data and related scripts must be submitted for review to guarantee the fairness, transparency, and reproducibility of the task.

Latest News

❗ We will release the full DetectRL-ZH dataset soon.

  • [ 2025.04.30 ] 📢 Test results have been released. Thank you to all the teams for their support and participation!
  • [ 2025.04.11 ] The test data for NLPCC-2025 Task 1 has been released!
  • [ 2025.02.27 ] We have released detailed task guidelines and training data—start building a reliable detector now!

PS: We have sent notification emails regarding the release of the test data to all participating teams. If you have registered but did not receive the email, please contact us via email as soon as possible so that you do not miss important registration information.

Data Description

We present DetectRL-ZH, a benchmark specifically designed for detecting LLM-generated text in the Chinese domain. It is the Chinese extension of DetectRL, an English benchmark for detecting LLM-generated text in real-world scenarios. DetectRL-ZH is a carefully constructed dataset that simulates real-world conditions, featuring diverse paraphrased, adversarial, and mixed samples. Our test set will also be evaluated under real-world scenarios, and the statistics of the dataset for this shared task are provided below.

Statistics of Data

  • Training Set: The training set includes data from 3 types of LLMs and 3 domains. Specifically, the data sources are ASAP, CNewSum, and CSL, and the generators are GPT-4o, GLM-4-flash, and Qwen-turbo. The training set contains 32,400 samples in total.

| Split | Source  | GPT-4o | GLM  | Qwen | Machine | Human | Total |
|-------|---------|--------|------|------|---------|-------|-------|
| Train | ASAP    | 2700   | 2700 | 2700 | 8100    | 2700  | 32400 |
| Train | CNewSum | 2700   | 2700 | 2700 | 8100    | 2700  |       |
| Train | CSL     | 2700   | 2700 | 2700 | 8100    | 2700  |       |
| Dev   | -       | -      | -    | -    | 1700    | 1100  | 2800  |
| Test  | -       | -      | -    | -    | -       | -     | -     |

Test Data

The test data is entirely out of the training distribution in both domain and generating LLM: it is built from the STORAL dataset (creative writing), with machine text generated by DeepSeek-V3. The test data covers three scenarios: Normal, Attack, and Varying Lengths.

  • Normal
    This scenario evaluates the detector's ability to identify texts directly generated by LLMs. It consists of 2000 pairs of LLM-generated texts and human-written texts, with sample IDs ranging from 1 to 4000.

  • Attack
    This scenario evaluates the detector's robustness and ability to detect common attacks and complex scenarios. It includes the following:

    1. For the last 500 pairs of texts in the Normal scenario, we applied mixed attacks (sample IDs 4001–5000). In these attacks, human-written samples (comprising less than 50% of the content) are mixed into the LLM-generated texts to confuse the detector at the paragraph/document level.
    2. For the last 500 pairs of texts in the Normal scenario, we applied paraphrase attacks (sample IDs 5001–6000). These simulate non-native speakers using machine translation by translating the LLM-generated text into English and back into Chinese.
    3. For the last 500 pairs of texts in the Normal scenario, we applied perturbation attacks (sample IDs 6001–7000). These replace less than 5% of the characters in the LLM-generated texts with visually similar characters, simulating human writing errors or noise to evade detection (a toy illustration appears after this list).
  • Varying Lengths
    This scenario evaluates the detector's ability to handle texts of different lengths and its robustness across them. It consists of 2000 pairs of LLM-generated texts and human-written texts, with sample IDs ranging from 7001 to 11000.
    For the last 500 pairs of texts in the Normal scenario, we performed semantic segmentation to ensure coherence and extracted text lengths closest to 64, 128, 256, and 512 characters. The sample IDs for these lengths are as follows:

    • 64 characters: sample IDs 7001–8000
    • 128 characters: sample IDs 8001–9000
    • 256 characters: sample IDs 9001–10000
    • 512 characters: sample IDs 10001–11000
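
To make the character-level perturbation attack concrete, here is a toy sketch in Python. The confusable-character table and replacement logic below are purely illustrative assumptions and do not reflect the organizers' actual attack implementation.

import random

# Toy, made-up table of visually similar character pairs; the organizers'
# actual confusion set is not released here.
CONFUSABLE = {"己": "已", "未": "末", "人": "入", "土": "士"}

def perturb(text, rate=0.05, seed=0):
    """Replace at most `rate` of the characters that have a visually similar
    counterpart, simulating human writing errors or noise."""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if c in CONFUSABLE]
    budget = int(len(chars) * rate)
    for i in rng.sample(candidates, min(budget, len(candidates))):
        chars[i] = CONFUSABLE[chars[i]]
    return "".join(chars)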

For all three scenarios, the evaluation metric is macro F1 score, and the final score for each team is the average performance across the three scenarios.

We have released test_with_label.json to allow participating teams to conduct in-depth analyses and provide more valuable insights.

Data Download

The training data and development data can be found in the following Google Drive folder link or Github link:

The test data can be found in the following Github link:

Data Restriction

  • Please note that the provided development set is strictly for model tuning and must not be used for model training.

  • To support the development of detection systems, participants are allowed to perform data augmentation based on the provided raw training data. However, data augmentation must be limited to processing or transforming the original data, such as creating new samples through cropping, splitting, word replacement, or format adjustment. Any paraphrasing must strictly preserve the original semantic meaning and must not introduce external knowledge or create entirely new content. In particular, generative LLMs may not be used for paraphrasing, as they could inadvertently introduce out-of-distribution knowledge and lead to unfair advantages; traditional encoder-based models or seq2seq models may still be used for paraphrasing. An illustrative sketch of a permitted rule-based augmentation follows.
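
As a purely illustrative example of augmentation that stays within these rules, the sketch below splits an existing sample into shorter segments at sentence boundaries; the segment length and punctuation set are arbitrary assumptions, not a prescribed method.

import re

def split_augment(sample, min_len=64):
    """Split one sample's text at Chinese sentence-ending punctuation into shorter
    segments that inherit the original label. This only transforms the provided
    data; no new content or external knowledge is introduced."""
    pieces = re.split(r"(?<=[。！？])", sample["text"])
    segments, buffer = [], ""
    for piece in pieces:
        buffer += piece
        if len(buffer) >= min_len:
            segments.append({"text": buffer, "label": sample["label"]})
            buffer = ""
    if buffer:
        segments.append({"text": buffer, "label": sample["label"]})
    return segments

# Example: augmented = [seg for s in train_samples for seg in split_augment(s)]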

Data Format

Each sample is structured as a JSON object.

For training data:

{
  "text": "text generated by a machine or written by a human",
  "label": "label (human text: 0, machine text: 1)",
  "model": "model that generated the data",
  "source": "source (ASAP, CNewSum, CSL)"
}

For dev and test data:

{
  "text": "text generated by a machine or written by a human",
  "label": "label (human text: 0, machine text: 1)"
}
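
A minimal loading sketch, assuming train.json and dev.json each contain a JSON array of such objects (if the files are in JSON Lines format instead, parse them line by line):

import json

def load_samples(path):
    # Assumption: the file holds a JSON array of sample objects as described above.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

train_samples = load_samples("train.json")
dev_samples = load_samples("dev.json")

# "label" may arrive as a string or an integer; normalize to int (0 = human, 1 = machine).
train_texts = [s["text"] for s in train_samples]
train_labels = [int(s["label"]) for s in train_samples]
print(len(train_samples), len(dev_samples))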

Evaluation Metric

The official evaluation metric for the task is the macro-averaged F1 score (Macro-F1), which effectively measures the performance of a classifier on a binary classification task.
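
For reference, a minimal sketch of computing the macro-averaged F1 with scikit-learn; the gold and predicted labels below are placeholders:

from sklearn.metrics import f1_score

# Placeholder labels: 0 = human, 1 = machine.
gold = [0, 1, 1, 0, 1, 0]
pred = [0, 1, 0, 0, 1, 1]

# Macro-F1 averages the per-class F1 scores with equal weight.
print(f1_score(gold, pred, average="macro"))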

Submission & Evaluation

Submission Requirements

When submitting your results, please package the following materials into a zip file and send it to [email protected]:

  1. Test Result File
  • Your test result file must be a JSON file containing all test samples (see the sketch after this list).
  • Please ensure that the text and id fields remain unchanged.
  • Each sample in the JSON file should include the following fields:
    • "id": The unique identifier of the sample
    • "text": The text content of the sample
    • "label": The classification result based on the following rules:
      • Human-written text: label 0
      • Text generated by large language models: label 1
  2. Code and Data
  • The code folder should contain all the code necessary for data augmentation, data processing, model training, and model inference, as well as the complete dataset used to train your detector (no prohibited data may be used for training).
  • As submitted code may be reviewed and reproduced, please include a simple README.md (or equivalent documentation) and an environment configuration file (e.g., requirements.txt, if applicable).
  • In the documentation, briefly explain the reproduction process to ensure that your submitted results can be replicated.
  3. Technical Brief Report
  • The technical brief should describe in detail the methods you used to solve the task, including data processing, data augmentation, and the specific methods, model architectures, and parameters used.
  • If necessary, you may include formulas and pseudocode to aid understanding.
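
A minimal sketch of producing the result file; the file names test.json and result.json and the predict_label helper are illustrative assumptions standing in for the released test file and your own detector:

import json

with open("test.json", encoding="utf-8") as f:   # the released test file (name assumed)
    test_samples = json.load(f)

results = []
for sample in test_samples:
    # predict_label(...) is a placeholder for your detector; it must return 0 (human) or 1 (LLM).
    label = predict_label(sample["text"])
    # Keep "id" and "text" unchanged; only set "label".
    results.append({"id": sample["id"], "text": sample["text"], "label": label})

with open("result.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)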

❗ Please note that this Shared Task does not host a real-time leaderboard on any third-party platform for ranking purposes. All participating teams can submit their results an unlimited number of times before the submission deadline. The final results will be calculated based on the last submitted test results. Additionally, our final evaluation metric will be the macro-averaged F1 score.

Important Dates

  • April 20, 2025: Deadline for submission of test results
  • April 30, 2025: Announcement of evaluation results

Notes

  • Any behavior that violates the competition rules will result in disqualification.
  • Submissions made after the deadline will not be accepted.
  • If you encounter any issues during the competition, feel free to contact us at: [email protected]

Important dates

| Time       | Events |
|------------|--------|
| 2025/02/17 | Announcement of shared tasks and call for participation |
| 2025/02/17 | Registration open |
| 2025/02/28 | Release of detailed task guidelines & training data |
| 2025/03/25 | Registration deadline |
| 2025/04/11 | Release of test data |
| 2025/04/20 | Participants' results submission deadline |
| 2025/04/30 | Evaluation results release and call for system reports and conference papers |
| 2025/05/22 | Conference paper submission deadline (only for shared tasks) |
| 2025/06/12 | Conference paper accept/reject notification |
| 2025/06/25 | Camera-ready paper submission deadline |

Time Zone: All deadlines are based on Beijing Time (GMT+8) and refer to 11:59 PM on the specified day.

Results

| System Name (ID) | Normal | Attacks | Varying Lengths | Avg. F1 Score | Team Members | Team Organization |
|------------------|--------|---------|-----------------|---------------|--------------|-------------------|
| LLM-Generated-System | 1.0000 | 0.9943 | 0.9770 | 0.9904 | Yanan Cao (Team Leader); Zhuoshang Wang; Guoyu Zhao; Xiaowei Zhu; Hao Li; Yubing Ren | ASCII Lab, Institute of Information Engineering, Chinese Academy of Sciences |
| TeleAI | 0.9997 | 0.9823 | 0.8894 | 0.9571 | - | - |
| POLYNLPCC | 0.9662 | 0.9100 | 0.9720 | 0.9494 | - | - |
| AI Text Trackers | 0.9730 | 0.9217 | 0.9389 | 0.9445 | - | - |
| BlueSpace | 0.9865 | 0.9533 | 0.8899 | 0.9432 | - | - |
| MCI^2_LLMGTD | 0.9900 | 0.8429 | 0.9110 | 0.9146 | Huanghao Feng (Team Leader); Houji Jin; Ganyu Gui; Wei Liu; Jian Wang | Meta Computation for Intelligent Instrument Lab (MCI^2), Computer Science and Engineering Dept., Suzhou University of Technology; University of Southern California |
| ZZU-NLP | 0.9982 | 0.8389 | 0.8588 | 0.8987 | - | - |
| DS | 0.9484 | 0.7572 | 0.8727 | 0.8594 | - | - |
| GenAuditor | 0.9440 | 0.8673 | 0.7658 | 0.8590 | - | - |
| Sprinting | 0.8917 | 0.7930 | 0.8483 | 0.8443 | Fu Rao | Beijing Institute of Technology |
| ClassiFire | 0.8922 | 0.8483 | 0.7858 | 0.8421 | Yuqi Zhu (Team Leader); Yuchen Wang; Jinglan Gong; Haoyu Wu; Xiaozhe Ji | Beijing Normal University; Beihang University; University of Science and Technology of China; Renmin University of China; Xiamen University |
| HC | 0.9374 | 0.8219 | 0.7629 | 0.8407 | Qi Huang (Team Leader); Yihang Chen; Wenliang Chen; Zhanwei Huang | Jiangxi Normal University |
| YNU-HPCC | 0.9940 | 0.8341 | 0.6517 | 0.8266 | - | - |
| ZZUNLP | 0.9997 | 0.9010 | 0.4464 | 0.7824 | - | - |
| MSFLab | 0.7357 | 0.7361 | 0.8706 | 0.7808 | - | - |
| LLM-Generated Text Detection | 0.9243 | 0.8149 | 0.5673 | 0.7689 | - | - |
| SEIG-NLP | 0.8782 | 0.7750 | 0.6480 | 0.7671 | - | - |
| DUFL2025 | 0.9195 | 0.5146 | 0.7140 | 0.7160 | - | - |
| ZZUNLP_Han | 0.9597 | 0.5141 | 0.6170 | 0.6969 | - | - |
| NPUNLP Research Group | 0.7495 | 0.7217 | 0.3919 | 0.6211 | - | - |
| PAK NLP | 0.4892 | 0.4508 | 0.4902 | 0.4767 | - | - |
| YouTuLab_Jarvis | 0.3982 | 0.3974 | 0.4077 | 0.4011 | - | - |

Awards

  • The top 3 participating teams in each task and track will be certified by NLPCC and CCF-NLP.

Organizer & Contact

This shared task is organized by the NLP2CT Lab, University of Macau.

If you have any questions about this task, please email [email protected].

FAQ

Q1: Where can I register for this shared task?

A1: The latest registration method is available on the NLPCC 2025 Shared Task official website (http://tcci.ccf.org.cn/conference/2025/cfpt.php). Please fill out the Shared Task 1 registration form (Word document) (http://tcci.ccf.org.cn/conference/2025/sharedTasks/NLPCC2025.SharedTask1.RegistrationForm.doc) as required and send it to [email protected]. If you have any questions, feel free to reach out.

Q2: Is it allowed to use additional data?

A2: Using external data sources is not permitted. However, data augmentation is allowed (see Data Restriction for details).

Q3: There is an inconsistency between the description of "data restrictions" on the task's official GitHub page and the conference's official website:

  • GitHub: Generative large language models (Generative LLMs) cannot be used for paraphrasing; only traditional encoder models or sequence-to-sequence (seq2seq) models are allowed.
  • Conference website: Paraphrasing is limited to open-source models and API-based models. The use of GPT-o1, MoE, or models with parameter sizes exceeding 80B is not allowed.

A3: Please refer to the GitHub guidelines for the correct rules:

  • Only traditional encoder models or sequence-to-sequence (seq2seq) models are allowed for paraphrasing.
  • Generative decoder models are strictly prohibited, even if the model has fewer than 80 billion parameters (e.g., glm-4-9B).
  • This restriction ensures fairness, as our test set already includes data from decoder models, and introducing additional generative models could result in unfair advantages.

Q4: Does the restriction on “paraphrasing” mean that large language models cannot be used throughout the entire task process?

A4:

  • Definition of Paraphrasing: Making semantically equivalent modifications to text, such as polishing, synonym replacement, or adjusting expression style.
  • Restriction on Paraphrasing: During data augmentation (e.g., paraphrasing), only encoder-based or seq2seq models can be used. Generative LLMs are strictly prohibited, including for polishing or revising text.
  • Other Phases of the Task: Outside of data augmentation, LLMs can be used, including fine-tuning LLMs directly to complete the task.

Q5: If paraphrasing does not involve LLMs, is it permissible to use fine-tuned LLMs as detectors (e.g., qwen2.5-3B)?

A5: LLMs are prohibited during data augmentation only. For detector construction phases, LLMs are allowed. This includes:

  • Fine-tuning LLMs directly to complete the task.
  • Using open-source models to extract internal features for assistance in detection.

Q6: Can we use encoder-only models for fine-tuning?

A6: Yes, you can use any method to build your detector, including:

  • Fine-tuning encoder models or decoder models as classifiers.
  • Using statistical methods to extract classification features.
  • Restrictions apply only to data augmentation, where generative LLMs cannot be used for text paraphrasing.

Q7: The development set can only be used for "tuning," not "training." What is the difference?

A7:

  • Training: Use the training set (train.json) to adjust model parameters (e.g., neural network weights) or extract classification features and thresholds.
  • Tuning: Use the development set (dev.json) only after training to adjust hyperparameters (e.g., learning rate), classification features, or model architecture. Evaluate detector performance and optimize the method to improve performance on unseen data.
  • Example: Suppose you fine-tune an encoder model on the training set but find poor performance on the development set. This may indicate overfitting, and you should consider additional measures (e.g., data augmentation or extracting more diverse features) to improve generalization.
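
A hedged end-to-end sketch of this train/tune split with a simple baseline (character n-gram TF-IDF plus logistic regression); the hyperparameter grid is arbitrary, and the baseline is only meant to illustrate the workflow, not to suggest a competitive detector:

import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def load(path):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [s["text"] for s in data], [int(s["label"]) for s in data]

train_x, train_y = load("train.json")   # training: fit model parameters here only
dev_x, dev_y = load("dev.json")         # tuning: evaluate and pick hyperparameters here

best_f1, best_cfg = 0.0, None
for ngram_max in (2, 3):                # hyperparameter choices explored on dev only
    for C in (0.5, 1.0, 4.0):
        vec = TfidfVectorizer(analyzer="char", ngram_range=(1, ngram_max))
        clf = LogisticRegression(max_iter=1000, C=C)
        clf.fit(vec.fit_transform(train_x), train_y)
        f1 = f1_score(dev_y, clf.predict(vec.transform(dev_x)), average="macro")
        if f1 > best_f1:
            best_f1, best_cfg = f1, (ngram_max, C)

print("best dev Macro-F1:", best_f1, "with (ngram_max, C) =", best_cfg)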

Q8: When should the competition code be submitted, and where?

A8: Deadline: By April 20, 2025 (11:59 PM Beijing Time).

  • Submission Content: Final test result file, training data, source code (including a simple README.md or instructions for running the code), and technical brief report.
  • Where to Submit: Email [email protected].
  • Updates or changes to submission details will be announced on the GitHub official website.

Q9: Do all team members need to register individually? Can someone who has graduated and is now working in a company participate?

A9: Only one registration form is needed per team; no need for individual registrations. Team members who have graduated and are working in a company can participate.

Q10: After the test data is released, do we only need to submit the final test result *.json file?

A10: Files to submit:

  • Final test result file (*.json).
  • Detector training data and source code (for method verification).
  • Technical Brief Report.

If additional files are required, the organizers will notify participants via email and the official GitHub repository.

References

If you're new to this field, we believe the following papers can help you quickly get familiar with it (continuously updated):

  • Wu, J., Yang, S., Zhan, R., Yuan, Y., Chao, L. S., & Wong, D. F. (2025). A survey on LLM-generated text detection: Necessity, methods, and future directions. Computational Linguistics, 1-66.
  • Wu, J., Zhan, R., Wong, D. F., Yang, S., Yang, X., Yuan, Y., & Chao, L. S. (2024). DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., & Finn, C. (2023, July). Detectgpt: Zero-shot machine-generated text detection using probability curvature. In International Conference on Machine Learning (pp. 24950-24962). PMLR.
  • Bao, G., Zhao, Y., Teng, Z., Yang, L., & Zhang, Y. (2024). Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature. In The Twelfth International Conference on Learning Representations.
  • Hans, A., Schwarzschild, A., Cherepanova, V., Kazemi, H., Saha, A., Goldblum, M., ... & Goldstein, T. (2024, July). Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text. In International Conference on Machine Learning (pp. 17519-17537). PMLR.
  • Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., ... & Wu, Y. (2023). How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. arXiv preprint arXiv:2301.07597.
  • Wu, J., Zhan, R., Wong, D. F., Yang, S., Liu, X., Chao, L. S., & Zhang, M. (2025, January). Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore. In Proceedings of the 31st International Conference on Computational Linguistics (pp. 10275-10292).
