Replication package for the paper "Programming by Chat: A Large-Scale Behavioral Analysis of 11,579 Real-World AI-Assisted IDE Sessions". Read the full paper on arXiv.
- intent_classification/: Classifies the behavioral intent of each user message based on an established taxonomy and analyzes the distribution of behavioral intent categories. Results are saved to data/classifications/.
  Note: Intent classification is a prerequisite for most other analyses, which generally depend on data/classifications/classifications_for_analysis.csv. See the full data specification.
- session_clustering/: Clusters user sessions based on their sequence of behavioral intent categories. Results are saved to data/clusters/ (including pre-computed distances). An interactive t-SNE visualization is accessible here.
- sub_classification/: Further classifies messages within specific behavioral intent categories (e.g., sentiment expression) based on finer-grained needs. Results are saved to data/sub_classifications/.
- markov_transition/: Analyzes lift-weighted Markov transition probabilities between behavioral intent categories, both within sessions and across session boundaries.
- session_evolution/: Analyzes the evolution of user messages within sessions and session-level statistics across projects, in terms of behavioral intent category distribution, message/session length, and related metrics.
- lang_detection/: Detects natural and programming languages in text and diff blocks and analyzes their distributions. Results are saved to data/detected_langs/.
- supplementary_stats/: Ad hoc analyses, e.g., message distribution, repository characteristics, classification validation. Some notable files:
  - repo_characteristics.ipynb: Analyzes repository characteristics from the scraped data. Results are saved to data/repo_characteristics.csv (see the data specification).
  - ad_hoc_stats.ipynb: Calculates various statistics from the data, such as message length distribution, opening versus non-opening message distribution, short versus long session distribution, and short-range continuity of behavioral intent categories.
  - all_annotated_labels.csv: Labels for sampled messages, including the raw LLM predictions and the manual labels from two human annotators.
  - correctness_validation.ipynb: Samples messages from each category and calculates inter-rater agreement and the correctness of LLM classifications.
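
For readers unfamiliar with lift weighting, the following is a minimal sketch of one common way to compute it for within-session transitions. It is not the code in markov_transition/. It assumes lift is defined as P(next = b | current = a) divided by the marginal probability of b as a next state, which may differ from the exact definition used in the paper. The function name `lift_transitions` and the placeholder labels "A" and "B" are hypothetical.

```python
from collections import Counter


def lift_transitions(sessions):
    """Return {(a, b): lift} for consecutive category pairs within each session.

    Assumed definition (not taken from the package):
        lift(a -> b) = P(next = b | current = a) / P(next = b)
    A lift above 1 means b follows a more often than its overall rate suggests.
    """
    # Count every consecutive (current, next) pair; pairs never span sessions.
    pair_counts = Counter()
    for seq in sessions:
        pair_counts.update(zip(seq, seq[1:]))

    total = sum(pair_counts.values())
    if total == 0:
        return {}

    # Marginal counts of each category as a source and as a destination.
    from_counts = Counter()
    to_counts = Counter()
    for (a, b), n in pair_counts.items():
        from_counts[a] += n
        to_counts[b] += n

    lifts = {}
    for (a, b), n in pair_counts.items():
        p_b_given_a = n / from_counts[a]
        p_b = to_counts[b] / total
        lifts[(a, b)] = p_b_given_a / p_b
    return lifts


if __name__ == "__main__":
    # Two toy sessions with placeholder category labels.
    toy_sessions = [["A", "B", "A", "B"], ["A", "A"]]
    for pair, value in sorted(lift_transitions(toy_sessions).items()):
        print(pair, round(value, 3))
```

Transitions across session boundaries could be handled the same way by pairing the last category of one session with the first category of the next, provided sessions are ordered within a project.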
All data should be placed under the data/ directory. The full data structure is as follows:
data/
├── classifications/ # Behavioral intent classification results
├── clusters/ # Session clustering results and pre-computed distances
├── sub_classifications/ # Sub-classification results
├── detected_langs/ # Language detection results
├── repo_characteristics.csv # Repository-level characteristics
├── repositories.json # Repository metadata (stars, forks, language, etc.)
├── metadata.json # Dataset metadata (scrape date, author)
├── searches/ # Raw GitHub search results
├── searches.json # Combined search records
├── markdowns/ # Raw chat-history Markdown files
├── markdowns_cli/ # CLI-agent style chat traces
├── parsed_chats/ # Structured parsed chat records
├── parsed_chats_simple/ # Simplified parsed chats
├── parsed_chats_simple_cli/ # Simplified CLI chat outputs
├── contributors/ # Repository contributor lists
├── readmes/ # Repository README files
├── file_trees/ # Repository file trees
├── commits/ # Commit-level payloads with patches
├── commits_path/ # Per-file commit histories
├── commits_history/ # Full repository commit histories
└── languages.json # Repository language statistics
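
Before running an analysis, it may help to check that the expected entries exist under data/. The snippet below is a minimal sketch, not part of the package. The helper name `missing_entries` is hypothetical, and the default list assumes that the four result directories named in the tree above are the ones an analysis needs. Adjust it to the analysis at hand.

```python
from pathlib import Path

# Result directories taken from the tree above; treating them as the
# required set is an assumption, not something the package specifies.
EXPECTED_ENTRIES = (
    "classifications",
    "clusters",
    "sub_classifications",
    "detected_langs",
)


def missing_entries(data_root, expected=EXPECTED_ENTRIES):
    """Return the names in `expected` that do not exist under `data_root`."""
    root = Path(data_root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing under data/:", ", ".join(missing))
    else:
        print("All expected entries are present.")
```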
Due to copyright and privacy considerations (most source repositories do not carry explicit redistribution licenses), raw data (including chat sessions and repository characteristics) are not included in this package; only aggregated analysis results are retained. Researchers interested in accessing the raw data or discussing the project are welcome to contact Ningzhi Tang.
If you use this package, please cite our paper:
@article{tang2026programming,
title={Programming by Chat: A Large-Scale Behavioral Analysis of 11,579 Real-World AI-Assisted IDE Sessions},
author={Tang, Ningzhi and Chen, Chaoran and Fang, Zihan and Xu, Gelei and Dhakal, Maria and Shi, Yiyu and McMillan, Collin and Huang, Yu and Li, Toby Jia-Jun},
journal={arXiv preprint arXiv:2604.00436},
year={2026}
}

This research was supported in part by an NVIDIA Academic Hardware Grant, a Google Cloud Research Credit Award, and NSF grants CCF-2211428, CCF-2315887, and CCF-2100035. Any opinions, findings, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors. The authors thank Yuqi Wang from CREVIK for introducing us to SpecStory, without which this study would not have been possible.