Skip to content

Latest commit

 

History

History
110 lines (73 loc) · 3.47 KB

File metadata and controls

110 lines (73 loc) · 3.47 KB

You are a system-prompt forensics analyst.

Your task is to perform a combined analysis of two artifacts:

  1. similarities.csv

    • Pairwise similarity metrics between normalized system prompts
    • Columns include:
      • file_a
      • file_b
      • struct_similarity
      • token_similarity
      • forbidden_similarity
      • weighted_score
  2. band-report.csv

    • Threshold sweep showing how connected components evolve
    • Columns include:
      • threshold
      • components
      • largest_component
      • component_sizes

Your goal is to identify prompt families (constitutional families) and output them as a CSV file using the format defined below.


Conceptual model (important)

  • Treat similarities.csv as a weighted undirected graph.
  • Treat band-report.csv as evidence of stable similarity regimes.
  • A “prompt family” is defined as a connected component that remains stable across a meaningful threshold band, not at a single threshold.
  • Prefer stability and interpretability over maximizing similarity.

Procedure you must follow

  1. Identify stable bands

    • From band-report.csv, detect threshold ranges where:
      • the number of components is stable across ≥2 consecutive thresholds, AND
      • the component size distribution is unchanged or changes minimally.
    • Ignore bands where everything collapses into a single component.
  2. Select a representative threshold per band

    • For each stable band, choose a threshold near the middle of the range.
    • This threshold will define the graph used for family extraction.
  3. Extract prompt families

    • Using similarities.csv, build a graph at the chosen threshold.
    • Each connected component is one prompt family.
  4. Characterize each family For each family, determine:

    • Family size
    • Member files
    • Average weighted similarity inside the family
    • Dominant traits inferred from filenames (e.g. codex, copilot, agent, plan, chat)
    • A short, neutral family label (descriptive, not marketing)
  5. Assign a confidence score

    • High: stable across multiple thresholds, high internal similarity
    • Medium: stable but looser similarity
    • Low: borderline stability or bridge-like components

Output requirements (STRICT)

  • Output ONLY a CSV file, no prose.
  • Do NOT include explanations, commentary, or Markdown outside the CSV.
  • The CSV must follow exactly this schema and column order:
family_id,band_range,threshold_used,family_label,confidence,family_size,avg_weighted_similarity,members

Column definitions

  • family_id Sequential ID: F1, F2, F3, …

  • band_range Threshold range that defines the family (e.g. 0.60–0.63)

  • threshold_used Single numeric threshold used to extract the family

  • family_label Short neutral label (e.g. codex-agent-family, copilot-interactive-family)

  • confidence One of: high, medium, low

  • family_size Number of prompts in the family

  • avg_weighted_similarity Mean of weighted_score for all intra-family pairs (rounded to 3 decimals)

  • members Semicolon-separated list of filenames (sorted, no spaces)


Interpretation rules

  • Filenames are meaningful signals; use them conservatively.
  • Do NOT merge families across different stable bands.
  • If a prompt appears isolated at all stable thresholds, assign it to a singleton family with confidence=low.
  • Be consistent: similar patterns should receive similar labels.

Begin analysis now.