The rise of Large Language Models (LLMs) has transformed natural language processing, enabling high-quality text generation at scale. As these models are increasingly used to produce publicly available content, a key issue emerges: what happens when LLMs are trained on data largely generated by other LLMs? This recursive use of synthetic data introduces the risk of model collapse, a phenomenon where models progressively lose alignment with real human language.
SelfPlagAI systematically investigates this effect on question-answering tasks. By simulating multiple generations of recursive fine-tuning—each cycle using AI-generated outputs as training data—the project measures how output quality degrades over time in terms of coherence, factual accuracy, and linguistic diversity. The aim is to provide concrete evidence of performance decay and to highlight the risks of uncontrolled self-learning loops in large-scale generative models.
Model collapse occurs when generative models, repeatedly trained on their own outputs, drift away from the true distribution of human language. Early stages lose rare but informative patterns, while later stages produce homogenized, low-variance outputs. This gradual degradation stems from accumulating statistical, expressive, and optimization-related errors across generations, ultimately threatening the sustainability and reliability of future LLMs.
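This collapse dynamic can be illustrated with a toy simulation unrelated to the project's pipeline: repeatedly refitting a Gaussian to samples drawn from the previous fit. The sketch below is only an analogy for the distributional drift described above; SelfPlagAI measures the effect on actual question-answering models.

```python
# Toy illustration of model collapse (not the project's experiment): repeatedly fit a
# Gaussian to samples drawn from the previous generation's fit. With finite samples,
# the estimated variance tends to shrink across generations, so rare (tail) values
# disappear first -- a simple analogue of losing rare-but-informative patterns.
import numpy as np

rng = np.random.default_rng(0)
original = rng.normal(loc=0.0, scale=1.0, size=50)  # stand-in for "human" data
mu, sigma = original.mean(), original.std()

for gen in range(1, 301):
    synthetic = rng.normal(mu, sigma, size=50)     # generation g sees only g-1's outputs
    mu, sigma = synthetic.mean(), synthetic.std()  # refit on purely synthetic samples
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}, std={sigma:.3f}")

# Exact numbers depend on the seed, but the fitted std typically decays toward zero and
# the mean drifts, so later generations no longer match the original distribution.
```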
- **MongoDB Community Server**: install it from the official website: https://www.mongodb.com/try/download/community
- **Python 3.10+**: all dependencies are listed in `requirements.txt`. Install them via `pip install -r requirements.txt`.
SelfPlagAI is organized into several modular components, each responsible for a specific phase of the recursive training pipeline:
- `db_utils.py`: Provides MongoDB utilities for connection, insertion, retrieval, updating, and cleanup of collections. Handles conversion between pandas DataFrames and MongoDB documents, with support for nested arrays and metadata stamping.
- `train_utils.py`:
  - **Data Preparation & Custom Trainer**: Reads SQuAD splits from MongoDB, formats prompts, tokenizes examples, and sets up a bespoke Trainer that combines standard language-modeling loss with a BERTScore-based objective and an early-stopping callback.
  - **Synthetic Data Generation**: Uses the fine-tuned model to produce new answers, validates them against the context, tracks generation success rates, and packages outputs into HuggingFace Datasets.
  - **Iterative Training Loop**: Automates multi-generation workflows, handling model loading (base or previous checkpoint), LoRA-based parameter-efficient fine-tuning, checkpoint management, and memory cleanup between iterations.
  - **Model Evaluation**: Generates predictions on a held-out test set, computes a suite of metrics (Exact Match, token-level F1, BERTScore F1, Jaccard semantic similarity), and saves both a detailed text report and structured JSON/CSV summaries.
  - **Prediction Export**: Exports model outputs to JSON files and optionally to MongoDB, enabling downstream analysis or visualization.
- `load_result_to_mongo.py`: Implements command-line tools to ingest JSON log files (both evaluation results and prediction exports) into MongoDB. Automatically tags each record with source file and directory metadata, and offers flags to clear existing collections before loading.
- `selfTrain.py`: Orchestrates the end-to-end recursive fine-tuning process: it extracts a subset of SQuAD v2 examples, loops through successive training cycles, generates synthetic question-answer pairs with the fine-tuned model, evaluates model performance after each cycle, and exports both data and metadata back to MongoDB. A minimal sketch of this loop is shown after this list.
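For orientation, here is a bare-bones sketch of the recursive loop that `selfTrain.py` automates. The helper names (`fine_tune`, `generate_synthetic_answers`, `evaluate`) are placeholders for illustration, not the project's actual API; the real pipeline adds LoRA fine-tuning, MongoDB storage, and the full metric suite described above.

```python
# Illustrative skeleton of the recursive self-training loop (placeholder functions only).

def fine_tune(model, dataset):
    """Placeholder: fine-tune `model` on `dataset` and return the updated model."""
    return model

def generate_synthetic_answers(model, examples):
    """Placeholder: answer each (question, context) pair with `model`."""
    return [{"question": q, "context": c, "answer": "..."} for q, c in examples]

def evaluate(model, test_set):
    """Placeholder: compute metrics such as Exact Match and F1 on a held-out set."""
    return {"exact_match": 0.0, "f1": 0.0}

def recursive_training(base_model, real_data, test_set, num_generations=5):
    model, train_data, history = base_model, real_data, []
    for gen in range(num_generations):
        model = fine_tune(model, train_data)                   # 1. fine-tune on current data
        history.append(evaluate(model, test_set))              # 2. measure quality this cycle
        pairs = [(ex["question"], ex["context"]) for ex in train_data]
        train_data = generate_synthetic_answers(model, pairs)  # 3. next cycle trains on model output
    return history
```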
- **Configure environment variables**: create a `.env` file with your MongoDB credentials (`MONGO_USERNAME`, `MONGO_PASSWORD`). A connection sketch is shown after this list.
- **Load data into MongoDB**: use the loader script to ingest raw SQuAD v2 JSON files or previous evaluation logs into your database.
- **Run recursive training**: execute `selfTrain.py` with command-line options to specify database names, collection names, the number of generations, and whether to start from a pre-fine-tuned checkpoint. The script will:
  - Sample a controlled subset of SQuAD examples.
  - Fine-tune the model for each generation.
  - Generate synthetic training data and save it back to MongoDB.
- **Evaluate each generation**: invoke the evaluation functions to compute and save metrics on the held-out test set. Evaluation artifacts include per-generation JSON results and a comparative CSV summarizing performance trends (see the metric sketch after this list).
- **Export predictions**: optionally, run the export utilities to dump all model predictions (contexts, questions, references, raw outputs) into JSON files and MongoDB collections for further analysis or for powering a front-end.
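As a companion to the first step, the snippet below sketches how the MongoDB connection and DataFrame round-trip might look with `pymongo` and `python-dotenv`. The connection URI, database, and collection names are placeholders; `db_utils.py` implements this logic (plus nested arrays and metadata stamping) in its own way.

```python
# Minimal sketch: load credentials from .env, connect, and round-trip a DataFrame.
# Assumes pymongo and python-dotenv; URI, database, and collection names are placeholders.
import os

import pandas as pd
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv(".env")  # expects MONGO_USERNAME and MONGO_PASSWORD, as described above

client = MongoClient(
    f"mongodb://{os.environ['MONGO_USERNAME']}:{os.environ['MONGO_PASSWORD']}@localhost:27017/"
)
collection = client["selfplagai"]["squad_subset"]  # hypothetical database/collection names

# DataFrame -> MongoDB documents (the kind of conversion db_utils.py provides).
df = pd.DataFrame({"question": ["Who wrote Hamlet?"], "answer": ["Shakespeare"]})
collection.insert_many(df.to_dict(orient="records"))

# MongoDB documents -> DataFrame.
restored = pd.DataFrame(list(collection.find({}, {"_id": 0})))
print(restored)
```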
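For the evaluation step, Exact Match and token-level F1 follow the standard SQuAD formulation. The sketch below shows a typical implementation; the normalization details in `train_utils.py` may differ, and BERTScore and semantic similarity require separate libraries.

```python
# Standard SQuAD-style Exact Match and token-level F1 (illustrative; details may differ).
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # 1.0 after normalization
print(round(token_f1("in Paris, France", "Paris"), 2))   # 0.5: partial token overlap
```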
A user-friendly Streamlit app provides interactive visualization and exploration of SelfPlagAI’s results. It consists of two main modules:
- `main.py`
  - **Dashboard Layout**: A wide-layout dashboard with three control panels:
    - Model selector to choose among base model names.
    - Database selector to pick the MongoDB evaluation collection.
    - View mode selector to switch between aggregate trends and per-question analysis.
  - **Generations View**: Plots line charts of evaluation metrics (Exact Match, F1, BERTScore F1, Semantic Similarity, and Type-Token Ratio) across all generations for the selected model/database pair.
  - **Question View**:
    - Presents a dropdown of all test-set questions.
    - Displays a multi-metric line chart showing how the selected question's performance evolves across generations.
    - Shows the question's context and reference answer, alongside a table of the model's predictions at each generation.
- `mongo.py`
  - **Data Retrieval**: Establishes a connection to MongoDB using credentials loaded from `key.env` and shared functions in `db_utils.py`. It reads two collections, `evaluation_results` (per-generation metrics and metadata) and `ttr_results` (type-token ratio metrics), into pandas DataFrames, cached for performance (a minimal sketch of this pattern follows this list).
  - **Metric Aggregation and Enrichment**:
    - Iterates over all `generation_*` fields in the evaluation DataFrame to unpack each generation's dictionary of scores (Exact Match, F1, BERTScore, semantic similarity, lengths, etc.) and per-example arrays (predictions, references, questions, contexts).
    - For each generation, it also retrieves individual TTR values (one per test question) and a mean TTR for that generation from the `ttr_results` DataFrame, adding `individual_ttr` and `ttr` columns.
    - Constructs a unified DataFrame where each row represents one generation, combining all aggregated metrics, per-question scores, and the integrated TTR statistics for downstream plotting and analysis.
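The snippet below is a self-contained sketch of the pattern these two modules implement: read an evaluation collection into a cached pandas DataFrame and plot per-generation metrics as a line chart. The database, collection, and column names (`selfplagai`, `evaluation_results`, `generation`, and the metric columns) are assumptions for illustration; the real schema is defined by the project's export utilities.

```python
# streamlit_sketch.py -- illustrative only; collection and column names are assumptions.
import os

import pandas as pd
import streamlit as st
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv("key.env")  # MONGO_USERNAME / MONGO_PASSWORD, as in the project

@st.cache_data
def load_collection(db_name: str, collection_name: str) -> pd.DataFrame:
    """Read an entire MongoDB collection into a DataFrame (cached by Streamlit)."""
    client = MongoClient(
        f"mongodb://{os.environ['MONGO_USERNAME']}:{os.environ['MONGO_PASSWORD']}@localhost:27017/"
    )
    return pd.DataFrame(list(client[db_name][collection_name].find({}, {"_id": 0})))

st.set_page_config(layout="wide")  # wide-layout dashboard, as described above

db_name = st.selectbox("Database", ["selfplagai"])    # placeholder option list
df = load_collection(db_name, "evaluation_results")   # placeholder collection name

# Assume one row per generation with numeric metric columns; plot trends across generations.
metrics = [c for c in ["exact_match", "f1", "bertscore_f1", "semantic_similarity", "ttr"] if c in df.columns]
if metrics and "generation" in df.columns:
    st.line_chart(df.set_index("generation")[metrics])
else:
    st.warning("Adjust the collection/column names above to match your actual schema.")
```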
With this backend-to-frontend pipeline, SelfPlagAI equips researchers and engineers to:
- Monitor model quality over recursive training cycles.
- Investigate individual examples, spotting where self-training helps or fails.
- Compare multiple metrics side by side, supporting data-driven decisions on filtering or data-provenance controls.
To start the Streamlit dashboard, run the following commands from the root of the project:
```bash
# From the root directory (e.g., SelfPlagAI/)
export PYTHONPATH=.
streamlit run streamlit_app/main.py
```

ℹ️ Setting `PYTHONPATH=.` ensures that the application can properly import project-level modules such as `db_utils.py`.
This project is licensed under the GPL-3.0 license. For full terms, see the LICENSE file.
