This repository evaluates ChatGPT security/code advice against a NIST-inspired rubric, then compares automated LLM judgments with prior manual annotations for the same 338 conversations.
Clone the repo, move into it, and install everything with uv:
```bash
git clone https://github.com/DishitaS123/DecodingGPT.git
cd DecodingGPT
uv sync
```

After that, use `uv run` for every command so the synced environment is used automatically.
- Final report PDF: `LLM_Final_Report.pdf`
- Evaluation package: `nist_chatgpt_eval`
- Plotting helpers: `Graphing_Results`
- Manual annotations: `data/manual_annotations.csv`
- Prepared 338-conversation dataset: `data/annotated_conversations.csv`
- Baseline heuristic outputs: `data/heuristic_predictions.csv`, `data/evaluation_summary.json`
- Final model outputs used in the report: `Output/DEEPSEEK_V4_FLASH`, `Output/GEMINI_2_5_FLASH`, `Output/GPT_4_1_NANO`
- Cross-model comparison outputs: `Output/analysis/model_comparison`
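As a quick sanity check that the bundled CSVs are intact, a minimal pandas sketch (it only inspects shapes and column names, so it makes no assumptions about the schema):

```python
# Inspect the bundled datasets; expected row counts come from the
# dataset notes below (339 annotation rows, 338 matched conversations).
import pandas as pd

annotations = pd.read_csv("data/manual_annotations.csv")
prepared = pd.read_csv("data/annotated_conversations.csv")

print(annotations.shape)       # expect 339 rows
print(prepared.shape)          # expect 338 rows
print(list(prepared.columns))  # see which fields the pipeline produces
```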
Install dependencies once:

```bash
uv sync
```

Run tests:

```bash
uv run pytest nist_chatgpt_eval/tests/test_pipeline.py
```

Prepare the annotated subset from the large conversation dump:
```bash
uv run python -m nist_chatgpt_eval.main prepare \
  --conversations /absolute/path/to/conversationDataSet.csv \
  --annotations data/manual_annotations.csv \
  --output data/annotated_conversations.csv
```
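Conceptually, `prepare` keeps only the conversations that have a manual annotation. A rough sketch of that join, assuming pandas and a shared ID column (the name `conversation_id` is a guess, not the package's actual schema):

```python
# Illustrative only: an inner join between the raw dump and the manual
# annotations. The real `prepare` command handles this for you.
import pandas as pd

conversations = pd.read_csv("/absolute/path/to/conversationDataSet.csv")
annotations = pd.read_csv("data/manual_annotations.csv")

# "conversation_id" is an assumed column name; check the CSV headers.
prepared = conversations.merge(annotations, on="conversation_id", how="inner")
prepared.to_csv("data/annotated_conversations.csv", index=False)
```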
Run the offline baseline evaluator:

```bash
uv run python -m nist_chatgpt_eval.main analyze \
  --input data/annotated_conversations.csv \
  --output data/heuristic_predictions.csv \
  --use-mock
```

Compute evaluation metrics for one prediction file against the manual labels:
```bash
uv run python -m nist_chatgpt_eval.main evaluate \
  --predictions data/heuristic_predictions.csv \
  --manual data/annotated_conversations.csv \
  --output data/evaluation_summary.json
```
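If you want a quick agreement number without the CLI, a minimal sketch; the column names `conversation_id` and `score` are assumptions about the CSV schema, so adjust them to whatever the files actually contain:

```python
# Hypothetical column names; check the real CSV headers first.
import pandas as pd

preds = pd.read_csv("data/heuristic_predictions.csv")
manual = pd.read_csv("data/annotated_conversations.csv")

merged = preds.merge(manual, on="conversation_id", suffixes=("_pred", "_manual"))
agreement = (merged["score_pred"] == merged["score_manual"]).mean()
print(f"exact-match agreement: {agreement:.3f}")
```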
Run the full baseline pipeline in one command:

```bash
uv run python -m nist_chatgpt_eval.main full-run \
  --conversations /absolute/path/to/conversationDataSet.csv \
  --annotations data/manual_annotations.csv \
  --prepared-output data/annotated_conversations.csv \
  --predictions-output data/heuristic_predictions.csv \
  --summary-output data/evaluation_summary.json \
  --use-mock
```
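Judging by its flags, `full-run` chains the three steps above. A sketch of that equivalent composition (an illustration via `subprocess`, not the package's actual implementation):

```python
# Equivalent to prepare -> analyze -> evaluate, run sequentially.
import subprocess

def cli(*args: str) -> None:
    subprocess.run(
        ["uv", "run", "python", "-m", "nist_chatgpt_eval.main", *args],
        check=True,
    )

cli("prepare",
    "--conversations", "/absolute/path/to/conversationDataSet.csv",
    "--annotations", "data/manual_annotations.csv",
    "--output", "data/annotated_conversations.csv")
cli("analyze",
    "--input", "data/annotated_conversations.csv",
    "--output", "data/heuristic_predictions.csv",
    "--use-mock")
cli("evaluate",
    "--predictions", "data/heuristic_predictions.csv",
    "--manual", "data/annotated_conversations.csv",
    "--output", "data/evaluation_summary.json")
```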
All final model outputs are already in `Output`, so you can regenerate the comparison metrics, heatmap, and report examples directly from the saved CSVs:

```bash
uv run python -m nist_chatgpt_eval.main compare-models \
  --manual data/annotated_conversations.csv \
  --discover-root Output \
  --output-dir Output/analysis/model_comparison
```

This command writes:
- `Output/analysis/model_comparison/pairwise_metrics.csv`
- `Output/analysis/model_comparison/manual_vs_models_summary.csv`
- `Output/analysis/model_comparison/basic_stats.csv`
- `Output/analysis/model_comparison/score_correlation_matrix.csv`
- `Output/analysis/model_comparison/score_correlation_heatmap.png`
- `Output/analysis/model_comparison/agreement_examples.md`
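To restyle the heatmap without rerunning the comparison, you can re-render it from the saved correlation matrix. This sketch assumes the CSV stores models as both the index and the columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed layout: a square matrix with model names on both axes.
corr = pd.read_csv(
    "Output/analysis/model_comparison/score_correlation_matrix.csv", index_col=0
)

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="score correlation")
fig.tight_layout()
fig.savefig("Output/analysis/model_comparison/heatmap_restyled.png", dpi=200)
```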
The graphing scripts regenerate plots directly from each model's saved `predictions.csv`. All three scripts now default to the top 3 categories rather than hard-coded category IDs.
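The top-3 default is easy to reproduce by hand; this sketch assumes "top" means most frequent and that the predictions CSV has a `category` column (an assumed name):

```python
import pandas as pd

preds = pd.read_csv(
    "Output/DEEPSEEK_V4_FLASH/deepseek-deepseek-v4-flash/predictions.csv"
)
# "category" is a hypothetical column name; check the CSV header.
print(preds["category"].value_counts().head(3))
```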
From the repo root:

```bash
uv run python Graphing_Results/graphing_categories.py Output/DEEPSEEK_V4_FLASH/deepseek-deepseek-v4-flash/predictions.csv
uv run python Graphing_Results/graphing_sub_score_followed.py Output/DEEPSEEK_V4_FLASH/deepseek-deepseek-v4-flash/predictions.csv
uv run python Graphing_Results/graphing_sub_score_violated.py Output/DEEPSEEK_V4_FLASH/deepseek-deepseek-v4-flash/predictions.csv

uv run python Graphing_Results/graphing_categories.py Output/GPT_4_1_NANO/openai-gpt-4-1-nano/predictions.csv
uv run python Graphing_Results/graphing_sub_score_followed.py Output/GPT_4_1_NANO/openai-gpt-4-1-nano/predictions.csv
uv run python Graphing_Results/graphing_sub_score_violated.py Output/GPT_4_1_NANO/openai-gpt-4-1-nano/predictions.csv

uv run python Graphing_Results/graphing_categories.py Output/GEMINI_2_5_FLASH/google-gemini-2-5-flash/predictions.csv
uv run python Graphing_Results/graphing_sub_score_followed.py Output/GEMINI_2_5_FLASH/google-gemini-2-5-flash/predictions.csv
uv run python Graphing_Results/graphing_sub_score_violated.py Output/GEMINI_2_5_FLASH/google-gemini-2-5-flash/predictions.csv
```

These commands save the PNGs into each model's `graph_data` folder.
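To confirm everything landed, you can list the generated PNGs; the glob pattern assumes each `graph_data` folder sits next to its model's `predictions.csv`:

```python
from pathlib import Path

# Assumed layout: Output/<MODEL>/<provider-model>/graph_data/*.png
for png in sorted(Path("Output").glob("*/*/graph_data/*.png")):
    print(png)
```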
- The raw source conversation dump is not checked into this repository because it is large.
- The prepared 338-row subset used for evaluation is included in `data/annotated_conversations.csv`.
- One manual annotation ID, `ankfvn5z`, does not appear in the raw conversation dump, so 339 annotation rows become 338 matched conversations.
The original larger dataset can be accessed here: https://drive.google.com/drive/folders/1J4k2E4dTOXjolCV2Oq4m_-rJj_--TU24?usp=sharing
The final writeup for this project is `LLM_Final_Report.pdf`. The generated comparison artifacts most useful for the report are:

- `Output/analysis/model_comparison/manual_vs_models_summary.csv`
- `Output/analysis/model_comparison/score_correlation_heatmap.png`
- `Output/analysis/model_comparison/agreement_examples.md`