This repository evaluates ChatGPT security/code advice against a NIST-inspired rubric, then compares automated LLM judgments with prior manual annotations for the same 338 conversations.
Clone the repo, move into it, and install everything with uv:
```bash
git clone https://github.com/DishitaS123/DecodingGPT.git
cd DecodingGPT
uv sync
```

After that, use `uv run` for every command so the synced environment is used automatically.
- Final report PDF: `LLM_Final_Report.pdf`
- Evaluation package: `nist_chatgpt_eval`
- Plotting helpers: `Graphing_Results`
- Manual annotations: `data/manual_annotations.csv`
- Prepared 338-conversation dataset: `data/annotated_conversations.csv`
- Baseline heuristic outputs: `data/heuristic_predictions.csv`, `data/evaluation_summary.json`
- Final model outputs used in the report: `Output/DEEPSEEK_V4_FLASH`, `Output/GEMINI_2_5_FLASH`, `Output/GPT_4_1_NANO`
- Cross-model comparison outputs: `Output/analysis/model_comparison`
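As a quick sanity check that the bundled CSVs are intact, a minimal pandas sketch (it only inspects shapes and column names, so it makes no assumptions about the schema):

```python
# Inspect the bundled datasets; expected row counts come from the
# dataset notes below (339 annotation rows, 338 matched conversations).
import pandas as pd

annotations = pd.read_csv("data/manual_annotations.csv")
prepared = pd.read_csv("data/annotated_conversations.csv")

print(annotations.shape)       # expect 339 rows
print(prepared.shape)          # expect 338 rows
print(list(prepared.columns))  # see which fields the pipeline produces
```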
Install dependencies once:

```bash
uv sync
```

Run tests:

```bash
uv run pytest nist_chatgpt_eval/tests/test_pipeline.py
```

Prepare the annotated subset from the large conversation dump:
```bash
uv run python -m nist_chatgpt_eval.main prepare \
  --conversations /absolute/path/to/conversationDataSet.csv \
  --annotations data/manual_annotations.csv \
  --output data/annotated_conversations.csv
```
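Conceptually, `prepare` keeps only the conversations that have a manual annotation. A rough sketch of that join, assuming pandas and a shared ID column (the name `conversation_id` is a guess, not the package's actual schema):

```python
# Illustrative only: an inner join between the raw dump and the manual
# annotations. The real `prepare` command handles this for you.
import pandas as pd

conversations = pd.read_csv("/absolute/path/to/conversationDataSet.csv")
annotations = pd.read_csv("data/manual_annotations.csv")

# "conversation_id" is an assumed column name; check the CSV headers.
prepared = conversations.merge(annotations, on="conversation_id", how="inner")
prepared.to_csv("data/annotated_conversations.csv", index=False)
```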
Run the offline baseline evaluator:

```bash
uv run python -m nist_chatgpt_eval.main analyze \
  --input data/annotated_conversations.csv \
  --output data/heuristic_predictions.csv \
  --use-mock
```

Compute evaluation metrics for one prediction file against the manual labels:
```bash
uv run python -m nist_chatgpt_eval.main evaluate \
  --predictions data/heuristic_predictions.csv \
  --manual data/annotated_conversations.csv \
  --output data/evaluation_summary.json
```
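If you want a quick agreement number without the CLI, a minimal sketch; the column names `conversation_id` and `score` are assumptions about the CSV schema, so adjust them to whatever the files actually contain:

```python
# Hypothetical column names; check the real CSV headers first.
import pandas as pd

preds = pd.read_csv("data/heuristic_predictions.csv")
manual = pd.read_csv("data/annotated_conversations.csv")

merged = preds.merge(manual, on="conversation_id", suffixes=("_pred", "_manual"))
agreement = (merged["score_pred"] == merged["score_manual"]).mean()
print(f"exact-match agreement: {agreement:.3f}")
```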
Run the full baseline pipeline in one command:

```bash
uv run python -m nist_chatgpt_eval.main full-run \
  --conversations /absolute/path/to/conversationDataSet.csv \
  --annotations data/manual_annotations.csv \
  --prepared-output data/annotated_conversations.csv \
  --predictions-output data/heuristic_predictions.csv \
  --summary-output data/evaluation_summary.json \
  --use-mock
```
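Judging by its flags, `full-run` chains the three steps above. A sketch of that equivalent composition (an illustration via `subprocess`, not the package's actual implementation):

```python
# Equivalent to prepare -> analyze -> evaluate, run sequentially.
import subprocess

def cli(*args: str) -> None:
    subprocess.run(
        ["uv", "run", "python", "-m", "nist_chatgpt_eval.main", *args],
        check=True,
    )

cli("prepare",
    "--conversations", "/absolute/path/to/conversationDataSet.csv",
    "--annotations", "data/manual_annotations.csv",
    "--output", "data/annotated_conversations.csv")
cli("analyze",
    "--input", "data/annotated_conversations.csv",
    "--output", "data/heuristic_predictions.csv",
    "--use-mock")
cli("evaluate",
    "--predictions", "data/heuristic_predictions.csv",
    "--manual", "data/annotated_conversations.csv",
    "--output", "data/evaluation_summary.json")
```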
All final model outputs are already in `Output`, so you can regenerate the comparison metrics, heatmap, and report examples directly from the saved CSVs:

```bash
uv run python -m nist_chatgpt_eval.main compare-models \
  --manual data/annotated_conversations.csv \
  --discover-root Output \
  --output-dir Output/analysis/model_comparison
```

This command writes:
- `Output/analysis/model_comparison/pairwise_metrics.csv`
- `Output/analysis/model_comparison/manual_vs_models_summary.csv`
- `Output/analysis/model_comparison/basic_stats.csv`
- `Output/analysis/model_comparison/score_correlation_matrix.csv`
- `Output/analysis/model_comparison/score_correlation_heatmap.png`
- `Output/analysis/model_comparison/agreement_examples.md`
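To restyle the heatmap without rerunning the comparison, you can re-render it from the saved correlation matrix. This sketch assumes the CSV stores models as both the index and the columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed layout: a square matrix with model names on both axes.
corr = pd.read_csv(
    "Output/analysis/model_comparison/score_correlation_matrix.csv", index_col=0
)

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="score correlation")
fig.tight_layout()
fig.savefig("Output/analysis/model_comparison/heatmap_restyled.png", dpi=200)
```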
The graphing scripts regenerate plots directly from each model's saved `predictions.csv`. All three scripts now default to the top 3 categories rather than hard-coded category IDs.
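The top-3 default is easy to reproduce by hand; this sketch assumes "top" means most frequent and that the predictions CSV has a `category` column (an assumed name):

```python
import pandas as pd

preds = pd.read_csv(
    "Output/DEEPSEEK_V4_FLASH/deepseek-deepseek-v4-flash/predictions.csv"
)
# "category" is a hypothetical column name; check the CSV header.
print(preds["category"].value_counts().head(3))
```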
From the repo root:

```bash
uv run python Graphing_Results/graphing_categories.py Output/DEEPSEEK_V4_FLASH/deepseek-deepseek-v4-flash/predictions.csv
uv run python Graphing_Results/graphing_sub_score_followed.py Output/DEEPSEEK_V4_FLASH/deepseek-deepseek-v4-flash/predictions.csv
uv run python Graphing_Results/graphing_sub_score_violated.py Output/DEEPSEEK_V4_FLASH/deepseek-deepseek-v4-flash/predictions.csv

uv run python Graphing_Results/graphing_categories.py Output/GPT_4_1_NANO/openai-gpt-4-1-nano/predictions.csv
uv run python Graphing_Results/graphing_sub_score_followed.py Output/GPT_4_1_NANO/openai-gpt-4-1-nano/predictions.csv
uv run python Graphing_Results/graphing_sub_score_violated.py Output/GPT_4_1_NANO/openai-gpt-4-1-nano/predictions.csv

uv run python Graphing_Results/graphing_categories.py Output/GEMINI_2_5_FLASH/google-gemini-2-5-flash/predictions.csv
uv run python Graphing_Results/graphing_sub_score_followed.py Output/GEMINI_2_5_FLASH/google-gemini-2-5-flash/predictions.csv
uv run python Graphing_Results/graphing_sub_score_violated.py Output/GEMINI_2_5_FLASH/google-gemini-2-5-flash/predictions.csv
```

These commands save the PNGs into each model's `graph_data` folder.
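To confirm everything landed, you can list the generated PNGs; the glob pattern assumes each `graph_data` folder sits next to its model's `predictions.csv`:

```python
from pathlib import Path

# Assumed layout: Output/<MODEL>/<provider-model>/graph_data/*.png
for png in sorted(Path("Output").glob("*/*/graph_data/*.png")):
    print(png)
```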
- The raw source conversation dump is not checked into this repository because it is large.
- The prepared 338-row subset used for evaluation is included in `data/annotated_conversations.csv`.
- One manual annotation ID, `ankfvn5z`, does not appear in the raw conversation dump, so 339 annotation rows become 338 matched conversations.
The original larger dataset can be accessed here: https://drive.google.com/drive/folders/1J4k2E4dTOXjolCV2Oq4m_-rJj_--TU24?usp=sharing
The final writeup for this project is `LLM_Final_Report.pdf`. The generated comparison artifacts most useful for the report are:

- `Output/analysis/model_comparison/manual_vs_models_summary.csv`
- `Output/analysis/model_comparison/score_correlation_heatmap.png`
- `Output/analysis/model_comparison/agreement_examples.md`