Project on Model Collapse in LLMs – Big Data Engineering (MSc), supervised by Prof. V. Moscato, PhD G. M. Orlando and PhD D. Russo (2025)

SelfPlag AI logo

The rise of Large Language Models (LLMs) has transformed natural language processing, enabling high-quality text generation at scale. As these models are increasingly used to produce publicly available content, a key issue emerges: what happens when LLMs are trained on data largely generated by other LLMs? This recursive use of synthetic data introduces the risk of model collapse, a phenomenon where models progressively lose alignment with real human language.

SelfPlagAI systematically investigates this effect on question-answering tasks. By simulating multiple generations of recursive fine-tuning—each cycle using AI-generated outputs as training data—the project measures how output quality degrades over time in terms of coherence, factual accuracy, and linguistic diversity. The aim is to provide concrete evidence of performance decay and to highlight the risks of uncontrolled self-learning loops in large-scale generative models.

Model collapse occurs when generative models, repeatedly trained on their own outputs, drift away from the true distribution of human language. Early stages lose rare but informative patterns, while later stages produce homogenized, low-variance outputs. This gradual degradation stems from accumulating statistical, expressive, and optimization-related errors across generations, ultimately threatening the sustainability and reliability of future LLMs.
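
As a self-contained illustration of the statistical side of this effect (a toy sketch, not code from this repository), repeatedly fitting a simple model to samples produced by the previous generation's own fit shows the same loss of variance:

import numpy as np

rng = np.random.default_rng(seed=0)

# Generation 0: "human" data drawn from the true distribution N(0, 1).
samples = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(1, 201):
    # "Train" on the current data: estimate its mean and spread.
    mu, sigma = samples.mean(), samples.std()
    # The next generation is trained only on data sampled from that fit.
    samples = rng.normal(loc=mu, scale=sigma, size=100)
    if generation % 50 == 0:
        print(f"generation {generation}: estimated std = {sigma:.3f}")

# The estimated spread tends to shrink markedly over many generations:
# each refit slightly misestimates the variance and the errors compound,
# which is the simplest statistical analogue of model collapse in LLMs.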

Requirements
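
Explicit dependencies are not listed in this section. Based on the modules described below, a working environment plausibly needs Python 3 with the HuggingFace stack (transformers, datasets, peft for LoRA fine-tuning, and bert-score for BERTScore), pymongo and python-dotenv for the MongoDB integration, pandas, and streamlit for the dashboard; the exact packages and versions should be taken from the repository itself.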

Architecture & Module Overview

SelfPlagAI is organized into several modular components, each responsible for a specific phase of the recursive training pipeline:

  1. db_utils.py: Provides MongoDB utilities for connecting to the database and for inserting, retrieving, updating, and cleaning up collections. It handles conversion between pandas DataFrames and MongoDB documents, with support for nested arrays and metadata stamping.

  2. train_utils.py

  • Data Preparation & Custom Trainer: Reads SQuAD splits from MongoDB, formats prompts, tokenizes examples, and sets up a bespoke Trainer that combines standard language modeling loss with a BERTScore-based objective and an early-stopping callback.
  • Synthetic Data Generation: Uses the fine-tuned model to produce new answers, validates them against context, tracks generation success rates, and packages outputs into HuggingFace Datasets.
  • Iterative Training Loop: Automates multi-generation workflows, handling model loading (base or previous checkpoint), LoRA-based parameter-efficient fine-tuning, checkpoint management, and memory cleanup between iterations.
  • Model Evaluation: Generates predictions on a held-out test set, computes a suite of metrics (Exact Match, token-level F1, BERTScore F1, Jaccard semantic similarity), and saves both a detailed text report and structured JSON/CSV summaries (a minimal metric sketch follows this list).
  • Prediction Export: Exports model outputs to JSON files and optionally to MongoDB, enabling downstream analysis or visualization.
  3. load_result_to_mongo.py: Implements command-line tools to ingest JSON log files—both evaluation results and prediction exports—into MongoDB. Automatically tags each record with source file and directory metadata, and offers flags to clear existing collections before loading.

  4. selfTrain.py: Orchestrates the end-to-end recursive fine-tuning process: it extracts a subset of SQuAD v2 examples, loops through successive training cycles, generates synthetic question-answer pairs using the fine-tuned model, evaluates model performance after each cycle, and exports both data and metadata back to MongoDB.
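
The reference implementations live in train_utils.py; what follows is only the minimal sketch referenced in the Model Evaluation item above, covering three of the reported metrics (Exact Match, token-level F1, and a Jaccard overlap) under standard SQuAD-style answer normalization assumptions. The repository may normalize answers differently, its "semantic similarity" may be computed over embeddings rather than raw tokens, and BERTScore additionally requires the bert-score package.

import re
import string

def normalize(text: str) -> list[str]:
    # SQuAD-style normalization: lowercase, strip punctuation and articles.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred) & set(ref))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def jaccard_similarity(prediction: str, reference: str) -> float:
    pred, ref = set(normalize(prediction)), set(normalize(reference))
    return len(pred & ref) / len(pred | ref) if pred | ref else 0.0

print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.67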

Setup & Usage

  1. Configure environment variables: Create a .env file with your MongoDB credentials (MONGO_USERNAME, MONGO_PASSWORD); a minimal connection sketch appears after this list.

  2. Load data into MongoDB: Use the loader script to ingest raw SQuAD v2 JSON files or previous evaluation logs into your database.

  3. Run recursive training: Execute selfTrain.py with command-line options to specify database names, collection names, number of generations, and whether to start from a pre-fine-tuned checkpoint. The script will:

    • Sample a controlled subset of SQuAD examples.
    • Fine-tune the model for each generation.
    • Generate synthetic training data and save it back to MongoDB.
  4. Evaluate each generation: Invoke the evaluation functions to compute and save metrics on the held-out test set. Evaluation artifacts include per-generation JSON results and a comparative CSV summarizing performance trends.

  5. Export predictions: Optionally, run the export utilities to dump all model predictions (contexts, questions, references, raw outputs) into JSON files and MongoDB collections for further analysis or for powering a front-end.
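
The minimal connection sketch referenced in step 1, assuming db_utils.py builds its connection with pymongo and python-dotenv; the host and database name below are placeholders rather than the project's actual values:

# Contents of .env (keep this file out of version control):
#   MONGO_USERNAME=your_username
#   MONGO_PASSWORD=your_password

import os
import urllib.parse
from dotenv import load_dotenv   # pip install python-dotenv
from pymongo import MongoClient  # pip install pymongo

load_dotenv()  # makes MONGO_USERNAME / MONGO_PASSWORD available via os.environ

user = urllib.parse.quote_plus(os.environ["MONGO_USERNAME"])
password = urllib.parse.quote_plus(os.environ["MONGO_PASSWORD"])

# Placeholder host; replace with your own cluster or local instance.
client = MongoClient(f"mongodb+srv://{user}:{password}@your-cluster.example.mongodb.net/")
db = client["selfplagai"]  # placeholder database name
print(db.list_collection_names())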

Frontend Dashboard (Streamlit)

A user-friendly Streamlit app provides interactive visualization and exploration of SelfPlagAI’s results. It consists of two main modules:

  1. main.py

    • Dashboard Layout: A wide layout with three control panels:

      • Model selector to choose among base model names.
      • Database selector to pick the MongoDB evaluation collection.
      • View mode selector to switch between aggregate trends and per-question analysis.
    • Generations View: Plots line charts of evaluation metrics (Exact Match, F1, BERTScore F1, Semantic Similarity, and Type-Token Ratio) across all generations for the selected model/database pair; a compressed code sketch of this view appears after this module list.

    • Question View:

      • Presents a dropdown of all test-set questions.
      • Displays a multi-metric line chart showing how the selected question’s performance evolves across generations.
      • Shows the question’s context and reference answer, alongside a table of the model’s prediction at each generation.
  2. mongo.py

    • Data Retrieval: Establishes a connection to MongoDB using credentials loaded from key.env and shared functions in db_utils.py. It reads two collections—evaluation_results (per-generation metrics and metadata) and ttr_results (type-token ratio metrics)—into pandas DataFrames, cached for performance.

    • Metric Aggregation and Enrichment:

      • Iterates over all generation_* fields in the evaluation DataFrame to unpack each generation’s dictionary of scores (Exact Match, F1, BERTScore, semantic similarity, lengths, etc.) and per-example arrays (predictions, references, questions, contexts).
      • For each generation, it also retrieves individual TTR values (one per test question) and a mean TTR for that generation from the ttr_results DataFrame, adding individual_ttr and ttr columns.
      • Constructs a unified DataFrame where each row represents one generation, combining all aggregated metrics, per-question scores, and newly integrated TTR statistics for downstream plotting and analysis.
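
The compressed sketch referenced in the Generations View item above condenses this retrieval-and-plotting path, assuming pymongo, pandas, and Streamlit; the connection string, collection schema, and column names are placeholders rather than the project's actual ones:

import os
import pandas as pd
import streamlit as st
from pymongo import MongoClient

st.set_page_config(layout="wide")

@st.cache_data
def load_evaluation_results(db_name: str) -> pd.DataFrame:
    # Placeholder URI; the project instead reads its credentials from key.env.
    client = MongoClient(os.environ.get("MONGO_URI", "mongodb://localhost:27017"))
    docs = list(client[db_name]["evaluation_results"].find({}, {"_id": 0}))
    return pd.DataFrame(docs)

db_name = st.selectbox("Database", ["selfplagai_eval"])  # placeholder option
df = load_evaluation_results(db_name)

# Assume one row per generation with flat metric columns; the real schema
# stores nested generation_* fields that mongo.py unpacks before plotting.
metrics = [c for c in ("exact_match", "f1", "bertscore_f1", "ttr") if c in df.columns]
if "generation" in df.columns and metrics:
    st.line_chart(df.set_index("generation")[metrics])
else:
    st.warning("Adjust the column names to match your evaluation schema.")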

With this backend-to-frontend pipeline, SelfPlagAI equips researchers and engineers to:

  • Monitor model quality over recursive training cycles.
  • Investigate individual examples, spotting where self-training helps or fails.
  • Compare multiple metrics side by side, supporting data-driven decisions on filtering or data-provenance controls.

Launch the Dashboard

To start the Streamlit dashboard, run the following commands from the root of the project:

# From the root directory (e.g., SelfPlagAI/)
export PYTHONPATH=.
streamlit run streamlit_app/main.py

ℹ️ Setting PYTHONPATH=. ensures that the application can properly import project-level modules such as db_utils.py.

Authors


License

This project is licensed under the GPL-3.0 license. For full terms, see the LICENSE file.
