# AI-Driven Group Work Conflict Resolution: Deep Dive Analysis

- Published on
- Author: Chengchang Yu (@chengchangyu)
## The Core Problem
The Nightmare Scenario: You're in a group project. One person does 80% of the work. Another disappears. Someone submits garbage. Everyone gets the same grade. Sound familiar?
The Real Challenge:
- Manual investigation is costly: Professors spend hours reviewing chat logs, code commits, and he-said-she-said disputes
- Existing tools are inadequate: Peer assessment suffers from bias (friends rate friends higher), and analytics tools only provide surface-level metrics
- No AI integration for conflict: Current systems use AI for plagiarism detection or feedback generation - NOT for investigating who actually did the work
The Gap: There's no comprehensive system that combines objective evidence (code commits, chat logs) with AI-powered analysis to fairly adjudicate group work disputes.
## The Key Insight
The researchers applied forensic investigation principles to group work assessment, asking: "What if we could analyze ALL the evidence - code, chat, meetings, peer reviews - like a detective solving a case, but with AI doing the heavy lifting?"
Their Breakthrough Framework:
3 Dimensions × 9 Benchmarks = Comprehensive Conflict Detection

| Contribution | Interaction | Role |
|---|---|---|
| 1. Quantity | 4. Tone | 7. Adherence |
| 2. Quality | 5. Effectiveness | 8. Organisation |
| 3. Relevance | 6. Presence | 9. Support |
The Formula:
Fair Assessment =
(Objective Metrics + Inequality Analysis + LLM Contextual Reasoning)
× Human Oversight
## The Method: How It Works
Architecture Overview
Evidence → Metrics → Conflict Markers → AI Analysis → Advisory Judgment
Stage 1: Evidence Collection
Three Categories of Evidence:
| Category | What It Includes | Why It Matters |
|---|---|---|
| Submission | Code commits, text documents, media files | Shows WHO did WHAT |
| Conversation | Chat logs, emails, meeting minutes | Reveals team dynamics & communication quality |
| Coordination | Task assignments, deadlines, attendance | Tracks responsibility & follow-through |
Stage 2: Metrics Extraction
Submission Metrics (Objective Work Output):
- Code: Line count, commit frequency, time intervals, code quality (linting scores)
- Text: Word count, character count, complexity (readability scores)
- Multimodal: Media workload (video editing time, design iterations)
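The submission metrics above can be sketched as a small routine. This is a minimal illustration, not the paper's implementation: the commit-record shape (ISO timestamp plus lines added) is a hypothetical input format.

```python
from datetime import datetime
from statistics import mean

# Hypothetical per-author commit records: (ISO timestamp, lines_added).
commits = [
    ("2024-03-01T10:00", 120),
    ("2024-03-03T14:30", 45),
    ("2024-03-04T09:15", 200),
]

def submission_metrics(commits):
    """Commit count, total lines added, and mean hours between commits."""
    times = sorted(datetime.fromisoformat(ts) for ts, _ in commits)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    return {
        "commit_count": len(commits),
        "lines_added": sum(n for _, n in commits),
        "mean_gap_hours": round(mean(gaps), 1) if gaps else None,
    }

print(submission_metrics(commits))
# → {'commit_count': 3, 'lines_added': 365, 'mean_gap_hours': 35.6}
```

In practice these records would be extracted from version-control history; linting scores would come from a separate quality pass.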
Conversation Metrics (Communication Quality):
- Message count: Who's talking?
- Char/Message ratio: Are messages substantive or just "ok" and "lol"?
- Response time: Who's engaged vs. ghosting?
- Sentiment analysis: Detecting negativity, rudeness, or conflict
- Interaction diversity: Are there cliques? Is one person isolated?
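The conversation metrics can be sketched the same way. A minimal stand-alone version, assuming a hypothetical chat-log format of (timestamp, author, text) tuples; sentiment and diversity analysis would layer on top:

```python
from datetime import datetime
from statistics import mean

# Hypothetical chat log as (ISO timestamp, author, text) tuples.
chat_log = [
    ("2024-03-01T09:00", "alice", "Pushed the auth module, please review"),
    ("2024-03-01T09:05", "bob", "ok"),
    ("2024-03-01T17:00", "alice", "Bob, any feedback on the PR?"),
    ("2024-03-02T10:00", "bob", "lol"),
]

def conversation_metrics(log, author):
    """Message count, chars per message, and average response lag in hours."""
    msgs = [(datetime.fromisoformat(ts), who, text) for ts, who, text in log]
    mine = [m for m in msgs if m[1] == author]
    lags = []
    for i, (ts, who, _) in enumerate(msgs):
        if who != author:
            # Pair each teammate message with this author's next reply.
            reply = next((m for m in msgs[i + 1:] if m[1] == author), None)
            if reply:
                lags.append((reply[0] - ts).total_seconds() / 3600)
    return {
        "messages": len(mine),
        "chars_per_msg": mean(len(t) for _, _, t in mine) if mine else 0.0,
        "avg_response_hours": round(mean(lags), 2) if lags else None,
    }

print(conversation_metrics(chat_log, "bob"))
# → {'messages': 2, 'chars_per_msg': 2.5, 'avg_response_hours': 8.54}
```

Note how the low chars-per-message value surfaces the "ok"/"lol" pattern directly.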
Coordination Metrics (Team Responsibility):
- Attendance: Who shows up to meetings?
- Task fidelity: Did assigned work get completed?
- Assignment fidelity: Did deliverables match the task description?
- Task diversity: Is one person doing ALL the coding while others do nothing?
Stage 3: Conflict Markers (The "Red Flags")
The system uses the Gini Index (inequality measure) to detect disparity:
Two Key Scenarios:
- Scenario A: High Gini + Above-average individual score = One person carrying the team
- Scenario B: High Gini + Below-average individual score = One person slacking
Example Conflict Markers:
| Benchmark | High Performer Issue | Low Performer Issue |
|---|---|---|
| Quantity | Overcentralization (one person doing everything) | Social loafing (hitchhiking) |
| Tone | Professional in toxic team | Rude/negative behavior |
| Adherence | High responsibility in lax team | Missing deadlines, not following plans |
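The marker logic can be sketched as a toy decision rule. The 0.4 Gini threshold is illustrative only (the paper's actual thresholds are not reproduced here):

```python
def conflict_marker(gini, individual_score, team_avg, gini_threshold=0.4):
    """Toy rule: flag only when team-level inequality is high, then use the
    individual's score relative to the team average to name the pattern."""
    if gini <= gini_threshold:
        return None  # contributions are roughly balanced
    if individual_score > team_avg:
        return "overcentralization"  # one member carrying the team
    return "social_loafing"  # member riding on others' work

print(conflict_marker(0.72, 0.92, 0.50))  # overcentralization
print(conflict_marker(0.72, 0.05, 0.50))  # social_loafing
print(conflict_marker(0.10, 0.48, 0.50))  # None (balanced team)
```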
Stage 4: AI Expert Analysis (The Brain)
LLM Architecture:
INPUT: Metrics + Conflict Markers + Context
→ Category Analysis (Contribution, Interaction, Role)
→ Global Synthesis (Overall judgment)
→ Double Validation (Hallucination check)
→ Advisory Report (Transparent, explainable)
Key Innovation: The LLM doesn't just spit out a grade - it provides evidence-backed reasoning:
"Student A contributed 65% of code commits but had low interaction scores (avg response time: 8 hours). Student B contributed 15% of code but organized all meetings and maintained task documentation. Conflict marker: High Gini in Quantity (0.72) suggests overcentralization."
## Real-World Example: The Case of Team Chaos
The Team:
- Alice: Computer Science major, strong coder
- Bob: Business major, weak technical skills
- Charlie: Disappeared after Week 2
- Diana: Submitted plagiarized work
Traditional Assessment: Everyone gets 75/100 (team grade)
AI-Driven Investigation:
Evidence Collected:
- Git Logs: Alice (450 commits), Bob (12 commits), Charlie (3 commits), Diana (8 commits)
- Chat Logs: Alice (320 messages), Bob (180 messages), Charlie (15 messages), Diana (45 messages)
- Meeting Attendance: Alice (8/8), Bob (7/8), Charlie (2/8), Diana (6/8)
Metrics Calculated:
| Student | Contribution Score | Interaction Score | Role Score |
|---|---|---|---|
| Alice | 0.92 | 0.65 | 0.88 |
| Bob | 0.28 | 0.82 | 0.75 |
| Charlie | 0.05 | 0.12 | 0.15 |
| Diana | 0.35 | 0.58 | 0.42 |
Conflict Markers Raised:
- ⚠️ High Gini (0.78) in Quantity: Alice overcentralized
- ⚠️ Low Presence (Charlie): 2/8 meetings, 6-day average silence
- ⚠️ Quality Issue (Diana): Plagiarism detected (code similarity 95% to Stack Overflow)
AI Analysis Output:
Alice: Exceptional technical contribution (92nd percentile) but lower interaction scores suggest potential communication barriers. Recommend grade adjustment: +15% (90/100)
Bob: Moderate technical contribution but strong organizational role (meeting facilitation, documentation). Balanced team player. Recommend: Team grade (75/100)
Charlie: Minimal contribution across all dimensions. Attendance: 25%. Last activity: Week 2. Recommend grade adjustment: -40% (35/100)
Diana: Submitted plagiarized code (95% similarity). Low task fidelity (work didn't match assignments). Recommend: Academic integrity investigation + grade penalty (30/100)
## Survey of Existing Tools (What's Missing?)
The researchers surveyed 11 major platforms: CATME, SPARK, WebPA, TEAMMATES, Buddycheck, etc.
What They Found:
| Feature | Adoption Rate | Gap |
|---|---|---|
| AutoRating normalization | 91% | ✅ Common |
| Likert-scale surveys | 82% | ✅ Common |
| Early warning systems | 45% | ⚠️ Limited |
| AI screening | 18% | ❌ Rare (only plagiarism) |
| Conflict investigation | 9% | ❌ Almost nonexistent |
| AI-powered judgment | 0% | ❌ None |
Key Insight: Current tools focus on prevention (peer assessment, early warnings) but NOT resolution (investigating disputes with AI).
## One-Sentence Summary
This research proposes an AI-enhanced framework that forensically investigates group work disputes by analyzing heterogeneous evidence (code commits, chat logs, meeting attendance) across three dimensions (Contribution, Interaction, Role), using inequality metrics to flag conflicts and LLMs to generate transparent, evidence-backed advisory judgments, filling a critical gap in current peer assessment tools that lack comprehensive conflict resolution capabilities.
## The Simple Version
Imagine your teacher is Sherlock Holmes with a supercomputer.
When your group has a fight about who did the work:
- Sherlock collects evidence: Your code commits (like fingerprints), chat messages (like witness statements), meeting notes (like alibis)
- The computer analyzes patterns: "Alice wrote 80% of the code but only sent 20% of messages - she's the quiet workhorse"
- AI writes the verdict: "Based on the evidence, here's what actually happened and what grades are fair"
Instead of your teacher spending 5 hours reading Slack messages, the AI does it in 5 minutes - and catches things humans miss (like Bob never responding to technical questions but organizing all the meetings).
## The Technical Innovation
1. Hybrid Metrics (Objective + Subjective)
Unlike existing tools that rely ONLY on peer ratings (biased) or ONLY on code commits (incomplete), this system combines:
Final Score =
(0.4 × Objective Metrics) +
(0.3 × Peer Assessment) +
(0.3 × AI Contextual Analysis)
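The blend is a straight weighted sum; a one-function sketch with illustrative inputs (the 0.92/0.65/0.80 values are hypothetical, not from the paper):

```python
def final_score(objective, peer, ai_context, weights=(0.4, 0.3, 0.3)):
    """Weighted blend of the three assessment signals (weights from the text)."""
    return weights[0] * objective + weights[1] * peer + weights[2] * ai_context

# Strong objective metrics, weaker peer signal, solid AI contextual score.
print(round(final_score(0.92, 0.65, 0.80), 3))  # → 0.803
```

Because the weights sum to 1, the result stays on the same 0-1 scale as the inputs.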
2. Gini Index for Inequality
What is it? A measure borrowed from economics, where it quantifies wealth inequality.
How is it used here?
- Gini = 0: Everyone contributed equally
- Gini = 1: One person did everything
Example:
- Team A: [25%, 25%, 25%, 25%] → Gini = 0.00 (perfect equality)
- Team B: [70%, 15%, 10%, 5%] → Gini = 0.52 (high inequality = conflict!)
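A minimal sketch of the population Gini (mean absolute pairwise difference over twice the mean). Note that this common variant yields 0.50 for Team B; the 0.52 quoted above presumably comes from a slightly different estimator:

```python
def gini(shares):
    """Population Gini: mean absolute pairwise difference / (2 * mean)."""
    n, mu = len(shares), sum(shares) / len(shares)
    mad = sum(abs(a - b) for a in shares for b in shares) / (n * n)
    return mad / (2 * mu)

print(gini([0.25, 0.25, 0.25, 0.25]))  # → 0.0 (perfect equality)
print(round(gini([0.70, 0.15, 0.10, 0.05]), 2))  # high inequality
```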
3. LLM-Powered Semantic Analysis
Problem: How do you measure "relevance" of work?
Solution: Use LLMs to generate hypothetical ideal outputs based on the task description, then measure similarity:
```python
# Pseudo-code
task_description = "Build a React app with user authentication"
ideal_output = LLM.generate_hypothetical(task_description)
student_output = extract_from_git(student_commits)
relevance_score = cosine_similarity(ideal_output, student_output)
```
Example:
- Alice built authentication system → 95% relevance
- Bob created a random chatbot → 20% relevance (off-task!)
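A runnable stand-in for the similarity step, using bag-of-words cosine similarity in place of LLM embeddings (the strings below are invented examples, and real embeddings would capture semantics far better):

```python
from collections import Counter
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity over word-count vectors (a crude embedding stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)  # Counter returns 0 for missing words
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

ideal = "react app with user authentication login signup session handling"
on_task = "implemented user authentication login and signup with session handling"
off_task = "built a random chatbot that tells jokes"
print(cosine_sim(ideal, on_task) > cosine_sim(ideal, off_task))  # → True
```

The same comparison against an LLM-generated "hypothetical ideal output" gives the relevance score described above.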
4. Double-Pass Validation (Hallucination Prevention)
Pass 1: Generate judgment
Pass 2: Validate judgment against evidence
If conflict detected → Flag for human review
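The two-pass loop can be sketched as a small orchestrator. Here `generate` and `validate` are placeholder lambdas standing in for LLM calls; the toy validator only accepts claims whose quoted metric actually appears in the evidence:

```python
def adjudicate(evidence, generate, validate):
    """Pass 1 generates a judgment; Pass 2 re-checks it against the evidence."""
    judgment = generate(evidence)               # Pass 1: draft judgment
    ok, issues = validate(judgment, evidence)   # Pass 2: hallucination check
    status = "advisory" if ok else "flag_for_human_review"
    return {"judgment": judgment, "status": status, "issues": issues}

# Stub LLM calls for illustration.
evidence = {"alice_commit_share": 0.65}
generate = lambda ev: "Alice contributed 65% of commits"
validate = lambda j, ev: ("65%" in j and ev["alice_commit_share"] == 0.65, [])

print(adjudicate(evidence, generate, validate)["status"])  # → advisory
```

A judgment that fails the second pass is never emitted as advice; it is routed to the instructor instead.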
## Limitations & Challenges
1. Missing Evidence
- What if students work offline (whiteboard sessions, in-person meetings)?
- Solution: Allow manual evidence upload
2. Gaming the System
- Students could spam commits (1-line changes) to inflate metrics
- Solution: Weight by code quality, not just quantity
3. Cultural & Linguistic Bias
- Non-native English speakers might have lower "effectiveness" scores in chat
- Solution: Normalize by language proficiency, use multilingual sentiment analysis
4. Privacy Concerns
- Reading all chat messages = surveillance?
- Solution: Anonymize data, get consent, comply with GDPR
5. The "Human-in-the-Loop" Requirement
- AI can't make final grading decisions (UK/EU policy)
- Solution: System provides advisory judgment, instructor makes final call
## Critical Questions
1. Can AI really understand "soft contributions"?
- Example: Bob didn't code much but kept the team motivated during a crisis
- Challenge: Sentiment analysis might miss this nuance
- Solution: Include peer assessment qualitative feedback
2. What about neurodivergent students?
- Example: Alice (autistic) prefers async communication, low chat scores
- Challenge: System might penalize her "Interaction" score
- Solution: Allow students to declare accommodations, adjust weights
3. How do you handle "invisible labor"?
- Example: Diana did all the research but didn't commit it to Git
- Challenge: No digital trace = no credit
- Solution: Allow manual evidence submission (meeting notes, research docs)
4. What if the AI is wrong?
- Example: False plagiarism detection (code from team's own library)
- Challenge: Over-reliance on AI judgment
- Solution: Human review of all flagged cases, appeal process
## Key Takeaways for AI Builders
1. Multi-Modal Evidence is Key
- Don't rely on just code OR just surveys - combine everything
- Each evidence type reveals different aspects of contribution
2. Inequality Metrics > Raw Scores
- Gini index reveals team dynamics that averages hide
- High inequality = potential conflict (even if team grade is good)
3. LLMs for Semantic Understanding
- Use AI to understand "relevance" and "quality" (not just count lines of code)
- Generate hypothetical outputs to benchmark against
4. Transparency Builds Trust
- Students accept AI judgments IF they see the evidence
- Black-box scores → distrust; evidence-backed reasoning → acceptance
5. Human-in-the-Loop is Non-Negotiable
- AI advises, humans decide (legal + ethical requirement)
- Design for collaboration, not replacement
This analysis is based on the research paper "AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation" (arXiv:2511.07667v1 [cs.AI]).