Commit 3bb7617

feat: add conclusion and future work sections with detailed insights
1 parent 0aa89f3 commit 3bb7617

File tree

2 files changed: +136 −2 lines


paper/content/conclusion.typ

Lines changed: 41 additions & 1 deletion
@@ -1,3 +1,43 @@
= Conclusion

-#lorem(200)
This paper presented Harmonia, an instructor-facing tool for automated, data-driven assessment of student collaboration in team-based software engineering projects, alongside a complementary student-facing feedback system that provides formative code quality guidance. Together, these systems address critical gaps in the current evaluation process for team projects in introductory programming courses.
== Key Contributions
Harmonia introduces several significant innovations to the field of software engineering education. First, it provides a *comprehensive, multi-dimensional assessment* of collaboration through the Collaboration Quality Index (CQI), which combines effort-based metrics (weighted at 55%) with lines of code balance, temporal spread, and file ownership spread. This approach moves beyond simple commit counting to capture the semantic quality of contributions, distinguishing between a high-effort feature implementation and a trivial formatting change through LLM-driven analysis.
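The composite described above can be sketched as a weighted sum. Only the 55% effort weight is stated in this paper; the equal split of the remaining 45% across the other three components, and all names below, are illustrative assumptions rather than Harmonia's actual implementation:

```python
# Hypothetical sketch of the CQI composite. The 0.55 effort weight is from
# the text; the equal split of the remaining 0.45 is an assumption.
WEIGHTS = {
    "effort": 0.55,           # LLM-estimated effort balance (stated weight)
    "loc_balance": 0.15,      # assumed share
    "temporal_spread": 0.15,  # assumed share
    "ownership_spread": 0.15, # assumed share
}

def cqi(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each in [0, 1]) into a single index."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

example = {"effort": 0.8, "loc_balance": 0.6,
           "temporal_spread": 0.7, "ownership_spread": 0.5}
score = cqi(example)  # 0.55*0.8 + 0.15*(0.6 + 0.7 + 0.5)
```

A weighted sum keeps the index interpretable: each component's contribution to the final score can be read off directly.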
Second, Harmonia achieves *transparency and scalability* by automating the assessment of up to 200 teams in a single course run. The real-time analysis pipeline with Server-Sent Events feedback allows instructors to monitor progress and begin reviewing teams before the full analysis completes. The separation of Git analysis from AI analysis enables partial results to be available immediately, reducing instructor wait time.
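A minimal sketch of the streaming pattern described above, assuming a generator-based pipeline (this is not Harmonia's actual code): fast Git-analysis results are emitted as Server-Sent Event frames as soon as they are ready, before the slower AI phase completes for each team.

```python
# Sketch of partial-result streaming via Server-Sent Events. Event names
# and payload fields are illustrative assumptions.
import json
from typing import Iterator

def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Event frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def analysis_stream(teams: list[str]) -> Iterator[str]:
    for team in teams:
        # Fast phase: Git metrics are available immediately.
        yield sse("git_done", {"team": team, "commits": 42})       # placeholder
        # Slow phase: LLM effort estimation arrives later.
        yield sse("ai_done", {"team": team, "effort_score": 0.8})  # placeholder
    yield sse("complete", {"teams": len(teams)})

frames = list(analysis_stream(["A1", "H"]))
```

In a real deployment these frames would be written to an open HTTP response with the `text/event-stream` content type; the instructor's browser can render each team's Git results as the corresponding `git_done` frame arrives.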
Third, the system demonstrates *practical integration* with existing course infrastructure. By connecting directly to the Artemis learning management platform and automating data retrieval, Harmonia eliminates the manual export burden that has historically hindered instructor adoption of assessment tools. Tutors can upload pair programming attendance records and automatically verify compliance across all teams.
Finally, the student-facing feedback system provides a *learning-centric complement* to instructor assessment. By delivering non-graded, priority-categorized feedback in real time, the system encourages early and frequent feedback requests without evaluation anxiety, supporting a growth mindset and deeper engagement with code quality principles throughout the project lifecycle.
== Evaluation Results
Our manual evaluation of five representative teams provided initial validation of Harmonia's approach. The inter-rater reliability study showed strong agreement between independent reviewers across most teams and dimensions, with an average Pearson correlation above 0.70 for effort, complexity, and novelty assessments. Team A1 demonstrated exceptional reviewer agreement (up to 0.99 on some dimensions), suggesting that well-structured commits are consistently recognizable, while Team H showed the greatest disagreement (0.70 on average), highlighting the inherent subjectivity of assessing complex work. These results suggest that Harmonia's LLM-based assessments operate within a reasonable range of human judgment variability.
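For reference, the agreement statistic used above can be computed as follows; the reviewer scores below are made-up illustrations, not the study's evaluation data.

```python
# Pearson correlation between two reviewers' per-commit scores,
# computed from the definition (covariance over product of std devs).
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

reviewer_a = [4.0, 3.0, 5.0, 2.0, 4.0]  # illustrative scores only
reviewer_b = [4.0, 2.5, 5.0, 2.0, 3.5]
r = pearson(reviewer_a, reviewer_b)     # close to 1.0: strong agreement
```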
== Impact and Adoption
Harmonia has been deployed and used in the ITP course in the winter semester 2025/26. The tool has given instructors course-wide visibility into team collaboration patterns, enabling more transparent and consistent grading decisions. By surfacing collaboration imbalances and commit quality issues through the CQI and visual breakdowns, Harmonia has helped instructors identify teams requiring intervention and justify grading decisions with evidence.
The student feedback system has similarly shown positive adoption patterns, with students utilizing the "Request AI Feedback" button frequently throughout the project lifecycle. The priority taxonomy has proven effective in helping students allocate effort, and the integration with the IDE has minimized friction in seeking guidance.
== Limitations and Considerations
Despite its contributions, Harmonia has inherent limitations. The reliance on LLM-based effort estimation introduces a dependency on model quality and training data. While we evaluated Harmonia using GPT-5-Mini, the approach's effectiveness with other models remains unexplored. Additionally, the sequential AI analysis phase causes overall analysis time to scale linearly with team and commit counts, exceeding the initial 15-minute target (currently up to 24 hours for a full course).
The CQI, while comprehensive, remains a composite metric that can mask important qualitative differences in collaboration. A team with high CQI might nevertheless have divided responsibilities without knowledge sharing, or conversely, might share code ownership through effective pair programming despite imbalanced commit counts. Instructors must therefore treat the CQI as an *indicator* warranting investigation rather than a definitive judgment.
Furthermore, Git history records only authorship and timestamps, not the actual effort or time each student invested. Commits pushed by one student may represent collaborative work, and a seemingly balanced history may conceal uneven effort. While pair programming verification and AI effort estimation mitigate this limitation, no purely automated system can fully capture collaboration quality.
== Broader Implications
Harmonia contributes to a growing body of work on using large language models to support software engineering education. The success of LLM-driven effort estimation in this context suggests broader applicability to other educational domains requiring nuanced assessment of student work. The open question of how to calibrate LLM-based scoring against human judgment—and how much variation is acceptable—remains important for future tool development.
The integration of both instructor-facing and student-facing tools demonstrates the complementary roles of summative and formative feedback. While Harmonia provides instructors with evidence for grading, the student feedback system supports learning throughout the process. This two-sided approach aligns with pedagogical best practices and could serve as a model for other courses.
== Summary
Harmonia represents a significant step forward in automating and democratizing the assessment of student collaboration in team-based software projects. By combining automated Git analysis with LLM-driven effort estimation, and by providing both instructor and student perspectives, the tool addresses longstanding challenges in software engineering education. The deployment in ITP demonstrates practical feasibility, while the manual evaluation provides evidence of reasonable alignment with human judgment.
As software engineering education continues to scale and to place greater emphasis on collaboration, tools like Harmonia will become increasingly important for maintaining assessment quality and fairness. The approach pioneered in this work provides a foundation for future research and practice in this area.

paper/content/future_work.typ

Lines changed: 95 additions & 1 deletion
@@ -1,3 +1,97 @@
= Future Work

-#lorem(200)
While Harmonia provides a comprehensive approach to assessing collaboration in team-based software engineering projects, several avenues remain for further research and development. This section outlines key directions for enhancing both the system itself and the underlying methodologies.
== Optimization and Performance
The most pressing technical limitation is the analysis completion time. The current sequential AI analysis phase causes end-to-end runtime to scale linearly with team and commit counts, with full course analysis taking up to 24 hours. Three approaches could address this bottleneck:
1. *Parallel AI Processing*: Rather than processing commits sequentially, the system could batch multiple commits and submit them to the LLM API in parallel. This would require careful orchestration to stay within rate limits and to manage costs, but could reduce runtime by an order of magnitude.
2. *Distributed Analysis*: Deploying multiple analysis workers on separate machines could process different teams truly in parallel, further reducing total runtime. This would require a message queue (e.g., RabbitMQ, Apache Kafka) and careful state management.
3. *Model Caching and Heuristics*: Pre-computing effort estimates for common commit patterns (e.g., boilerplate changes, common library updates) could avoid redundant LLM calls. A hybrid approach using fast heuristics for obvious cases and LLM analysis only for ambiguous commits could dramatically improve performance.
== Model Calibration and Validation
This work used GPT-5-Mini for effort estimation, but the generalizability to other models remains unknown. Future work should:
1. *Comparative Model Study*: Evaluate Harmonia with multiple LLM backends (Claude, Gemini, open-source models like Llama) to understand how model choice affects CQI reliability and whether conclusions generalize.
2. *Extended Ground Truth Evaluation*: Expand the manual evaluation beyond the current five teams to a larger sample, ideally across multiple course offerings and programming languages. This would provide more robust calibration data.
3. *Effort Estimation Validation*: Conduct controlled studies where actual student effort (via time tracking or surveys) is compared against LLM effort estimates, providing direct validation of a key CQI component.
4. *Temporal Analysis*: Investigate whether AI effort ratings remain consistent across semesters and course contexts, or whether periodic recalibration is necessary.
== Enhanced Collaboration Metrics
The current CQI captures four aspects of collaboration, but additional dimensions could enrich the assessment:
1. *Code Review Quality*: Integrating pull request comments and code review patterns (if teams use these) could surface collaborative practices beyond commit history.
2. *Communication Analysis*: Analyzing team chat logs or commit messages for evidence of coordination and knowledge sharing could capture soft collaboration aspects missed by current metrics.
3. *Shared Ownership Dynamics*: Beyond file-level ownership, analyzing whether one student consistently refactors or improves the work of the other could reveal peer learning and mentorship.
4. *Knowledge Breadth*: Tracking whether both team members have contributed to all functional areas of the codebase could identify specialization imbalances.
== Expanded Integration and Interoperability
Currently, Harmonia integrates only with Artemis due to ITP's exclusive use of the platform. However, broader applicability requires:
1. *Multi-Platform Support*: Implement GitLab and GitHub integrations to serve courses using those platforms, as mentioned in the descoped FR-03.
2. *LMS Integration*: Extend integration to other learning management systems (Canvas, Blackboard, Moodle) to reach a broader population of courses.
3. *Data Export and Standards*: Define standardized formats for exporting CQI scores and collaboration evidence, enabling integration with other grading systems and research tools.
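A standardized export record for direction 3 above might look like the following sketch. Every field name here is an illustrative assumption, not an existing standard or Harmonia's current format:

```python
# Hypothetical CQI export record, serialized as JSON for interoperability.
import json

record = {
    "schema_version": "0.1",          # assumed versioning field
    "course": "ITP WS 2025/26",
    "team": "A1",
    "cqi": 0.71,
    "components": {                   # per-dimension scores behind the index
        "effort": 0.80,
        "loc_balance": 0.60,
        "temporal_spread": 0.70,
        "ownership_spread": 0.50,
    },
    "evidence": {"commits_analyzed": 42, "pair_programming_verified": True},
}

exported = json.dumps(record, indent=2, sort_keys=True)
```

Carrying the per-component scores and evidence alongside the composite index would let downstream grading systems and research tools audit how a score was produced rather than trusting a single opaque number.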
== Tutor and Student-Facing Enhancements
Harmonia currently focuses on instructor assessment, but expanding its scope to support other stakeholders could amplify impact:
1. *Tutor Dashboard*: Create a tutor-specific view that highlights problematic teams, shows pair programming compliance status, and surfaces commits requiring investigation. This would reduce the tutor's workload in monitoring team progress.
2. *Student Aggregate Feedback*: Provide teams with a privacy-respecting view of their own CQI over time, showing how their collaboration metrics evolve throughout the project. This could support reflection and improvement.
3. *Peer Feedback Integration*: Incorporate peer evaluation tools (e.g., CATME) and correlate peer feedback with CQI scores, validating whether collaboration metrics align with teammate perceptions.
4. *Visualization Improvements*: Expand the AI Analysis Feed with more detailed visualizations of effort distribution, temporal patterns, and ownership spread, making collaboration patterns more intuitive to grasp.
== Methodological Directions
From a research perspective, several questions remain open:
1. *Fairness and Bias in AI Assessment*: Investigate whether LLM effort ratings reflect inherent biases based on code style, programming language constructs, or commit message length. Conduct fairness audits to ensure ratings are equitable across different student populations and coding styles.
2. *Learning Outcomes Correlation*: Conduct longitudinal studies to correlate Harmonia's collaboration metrics with long-term learning outcomes, code quality in subsequent courses, and career success in team settings.
3. *Manipulation Resistance*: Explore edge cases where students might game the CQI (e.g., by making many small commits with inflated effort claims). Develop detection methods for manipulation attempts.
4. *Cross-Disciplinary Application*: Adapt Harmonia's approach to assess collaboration in non-programming domains such as hardware design, game development, or data science projects where version control usage varies.
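One simple starting point for the manipulation-resistance direction above is a rule that flags histories dominated by tiny commits with high claimed effort. The thresholds and field names below are illustrative assumptions, not a validated detector:

```python
# Sketch: flag teams whose history is dominated by tiny, high-effort commits.
def gaming_suspect(commits: list[dict], min_ratio: float = 0.5) -> bool:
    """commits: [{'lines_changed': int, 'effort': float}, ...]"""
    suspicious = [c for c in commits
                  if c["lines_changed"] < 5 and c["effort"] > 0.7]
    return len(suspicious) / max(len(commits), 1) >= min_ratio

honest = [{"lines_changed": 120, "effort": 0.8},
          {"lines_changed": 3, "effort": 0.1}]
gamed = [{"lines_changed": 2, "effort": 0.9},
         {"lines_changed": 1, "effort": 0.8},
         {"lines_changed": 90, "effort": 0.6}]
```

A production detector would need to distinguish legitimate small-but-hard changes (e.g., a one-line bug fix after long debugging) from inflation, which is exactly why this remains an open research question.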
== Expansion to Other Courses and Contexts
While validated in ITP, Harmonia's design principles could extend to other educational contexts:
1. *Advanced Software Engineering Courses*: Apply Harmonia to upper-level capstone projects where team sizes are larger and project complexity is higher.
2. *Industry and Research Settings*: Adapt Harmonia for use in research labs or software companies to support code review processes and collaboration assessment in professional contexts.
3. *Remote and Asynchronous Teams*: Investigate how Harmonia's metrics perform for fully remote teams with asynchronous collaboration patterns, where pair programming takes different forms.
== Long-Term Vision
The ultimate goal of tools like Harmonia is to support more equitable, transparent, and learning-focused assessment in team-based education. Future work should strive toward:
1. *Fully Automated Grading Pipelines*: Seamlessly integrate Harmonia with grading systems to automatically propose grades with evidence, reducing instructor burden while maintaining human oversight and review.
2. *Adaptive Feedback Systems*: Use CQI and collaboration patterns to dynamically adjust student feedback recommendations in real time (e.g., suggesting students request more feedback if collaboration imbalance is detected).
3. *Multi-Modal Assessment*: Combine Harmonia with other assessment tools (code quality checkers, test coverage analysis, design review systems) into a unified assessment platform that provides a holistic view of student projects.
4. *Ethical and Transparent AI in Education*: Develop best practices for integrating LLMs into educational assessment in ways that are transparent to students, educationally beneficial, and aligned with institutional values around fairness and integrity.
== Summary
The challenges and opportunities outlined above demonstrate that while Harmonia represents a significant advance, the field of automated collaboration assessment in software engineering education is still in its early stages. Addressing these directions will make Harmonia and similar tools more practical, generalizable, and impactful. As large language models continue to evolve and as educational practice increasingly embraces data-driven assessment, the methods pioneered here provide a foundation for future innovation.
