This milestone focuses on the analysis of the experimental data and the generation of a final report. The recorded conversations and chain-of-thought (CoT) data from the simulations will be converted into a graph format. We will analyze these graphs to identify asymmetries in argument rigor, internal contradictions, and logical gaps between the two opposing roles played by each model. The final report will summarize the findings, compare the deceptive-alignment tendencies of GPT-oss-20B and Gemma-3-12B, and propose a set of metrics for detecting deceptive alignment in future models.
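The graph analysis described above could be sketched as follows. This is a minimal, stdlib-only illustration with hypothetical field names (`id`, `role`, `responds_to`, `claims`); the real turn schema and rigor metric are still to be defined.

```python
from collections import defaultdict

def build_debate_graph(turns):
    """Build a directed graph: each turn is a node; an edge u -> v
    means turn v responds to (rebuts or supports) turn u."""
    graph = defaultdict(list)
    for turn in turns:
        if turn["responds_to"] is not None:
            graph[turn["responds_to"]].append(turn["id"])
    return graph

def rigor_asymmetry(turns):
    """Toy asymmetry metric: difference in the average number of
    distinct claims per turn between the two debate roles."""
    totals = defaultdict(lambda: [0, 0])  # role -> [claim count, turn count]
    for turn in turns:
        totals[turn["role"]][0] += len(turn["claims"])
        totals[turn["role"]][1] += 1
    means = {role: c / n for role, (c, n) in totals.items()}
    return abs(means["agree"] - means["disagree"])

# Hypothetical recorded turns (ids, roles, and claim lists are illustrative).
turns = [
    {"id": 0, "role": "agree",    "responds_to": None, "claims": ["c1", "c2"]},
    {"id": 1, "role": "disagree", "responds_to": 0,    "claims": ["c3"]},
    {"id": 2, "role": "agree",    "responds_to": 1,    "claims": ["c4", "c5"]},
]
graph = build_debate_graph(turns)
print(graph[0])               # turns replying to turn 0
print(rigor_asymmetry(turns)) # 2.0 - 1.0 = 1.0
```

A dedicated graph library could replace the plain-dict adjacency structure once the analysis needs shortest paths or cycle detection over the argument graph.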
No due date • 1/3 issues closed

This milestone involves the practical implementation of the designed components and the execution of the primary experiments. We will integrate the two candidate models, GPT-oss-20B and Gemma-3-12B, into the framework. The models will be tested against the self-debate simulation with a defined set of deceptive goals. In each experiment, a model plays both the "agree" and "disagree" roles, with the entire conversation and its chain of thought (CoT) recorded for analysis.
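The experiment loop could be sketched as below. `query_model` is a hypothetical stand-in for the actual inference call to GPT-oss-20B or Gemma-3-12B; here it returns canned text so the control flow (both roles, full transcript plus CoT logging) is runnable on its own.

```python
def query_model(model, role, goal, history):
    """Hypothetical inference call; a real implementation would send the
    role prompt, goal, and history to the model and return its reply
    together with its chain of thought."""
    reply = f"[{role}] position on: {goal}"
    cot = f"[{role}] private reasoning about: {goal}"
    return reply, cot

def run_self_debate(model, goal, n_rounds=2):
    """Have one model argue both sides of `goal`, logging every turn
    and its chain of thought (CoT) for later analysis."""
    transcript = []
    for rnd in range(n_rounds):
        for role in ("agree", "disagree"):
            reply, cot = query_model(model, role, goal, transcript)
            transcript.append({
                "round": rnd, "model": model, "role": role,
                "reply": reply, "cot": cot,
            })
    return transcript

log = run_self_debate("gpt-oss-20b", "example deceptive goal")
print(len(log))  # 2 rounds x 2 roles = 4 turns
```

Passing the running `transcript` into each call lets every turn condition on the whole debate so far, which is what makes the two roles an actual exchange rather than independent generations.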
No due date • 1/3 issues closed

This milestone focuses on designing the specific agents and components within the multi-agent framework to realize the self-debate simulation. The work includes defining the roles, behaviors, and interaction logic for the two opposing debate roles and the observation module. This stage will also establish the data collection and storage protocols necessary for subsequent analysis.
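One simple option for the data collection and storage protocol is a JSON Lines file with one record per debate turn. The schema below is a hypothetical sketch (field names are assumptions, not the final protocol); it round-trips through an in-memory buffer instead of a real file so it is self-contained.

```python
import json, io
from dataclasses import dataclass, asdict

# Hypothetical record schema: one JSON Lines entry per debate turn, so
# transcripts stream to disk and can be re-read turn by turn for analysis.
@dataclass
class TurnRecord:
    experiment_id: str
    model: str        # e.g. "gpt-oss-20b" or "gemma-3-12b"
    role: str         # "agree" or "disagree"
    round: int
    reply: str
    cot: str          # recorded chain of thought

def write_records(stream, records):
    for rec in records:
        stream.write(json.dumps(asdict(rec)) + "\n")

def read_records(stream):
    return [TurnRecord(**json.loads(line)) for line in stream]

# Round-trip through an in-memory buffer in place of a real log file.
buf = io.StringIO()
write_records(buf, [TurnRecord("exp-001", "gemma-3-12b", "agree", 0, "r", "c")])
buf.seek(0)
records = read_records(buf)
print(records[0].model)  # gemma-3-12b
```

An append-only line-per-turn format keeps partial runs usable: if an experiment is interrupted, every turn recorded up to that point can still be parsed.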
No due date • 2/3 issues closed