# AI-Driven Group Work Conflict Resolution: Deep Dive Analysis

- Published on
- Author: Chengchang Yu (@chengchangyu)
## The Core Problem
The Nightmare Scenario: You're in a group project. One person does 80% of the work. Another disappears. Someone submits garbage. Everyone gets the same grade. Sound familiar?
The Real Challenge:
- Manual investigation is costly: Professors spend hours reviewing chat logs, code commits, and he-said-she-said disputes
- Existing tools are inadequate: Peer assessment suffers from bias (friends rate friends higher), and analytics tools only provide surface-level metrics
- No AI integration for conflict: Current systems use AI for plagiarism detection or feedback generation - NOT for investigating who actually did the work
The Gap: There's no comprehensive system that combines objective evidence (code commits, chat logs) with AI-powered analysis to fairly adjudicate group work disputes.
## The Key Insight
The researchers applied forensic investigation principles to group work assessment, asking: "What if we could analyze ALL the evidence - code, chat, meetings, peer reviews - like a detective solving a case, but with AI doing the heavy lifting?"
Their Breakthrough Framework:
3 Dimensions × 9 Benchmarks = Comprehensive Conflict Detection

| Contribution | Interaction | Role |
|---|---|---|
| 1. Quantity | 4. Tone | 7. Adherence |
| 2. Quality | 5. Effectiveness | 8. Organisation |
| 3. Relevance | 6. Presence | 9. Support |
The Formula:
Fair Assessment =
(Objective Metrics + Inequality Analysis + LLM Contextual Reasoning)
× Human Oversight
## The Method: How It Works
Architecture Overview
Evidence → Metrics → Conflict Markers → AI Analysis → Advisory Judgment
Stage 1: Evidence Collection
Three Categories of Evidence:
| Category | What It Includes | Why It Matters |
|---|---|---|
| Submission | Code commits, text documents, media files | Shows WHO did WHAT |
| Conversation | Chat logs, emails, meeting minutes | Reveals team dynamics & communication quality |
| Coordination | Task assignments, deadlines, attendance | Tracks responsibility & follow-through |
Stage 2: Metrics Extraction
Submission Metrics (Objective Work Output):
- Code: Line count, commit frequency, time intervals, code quality (linting scores)
- Text: Word count, character count, complexity (readability scores)
- Multimodal: Media workload (video editing time, design iterations)
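The submission metrics above can be sketched as a small routine. This is a minimal illustration, not the paper's implementation: the commit-record shape (ISO timestamp plus lines added) is a hypothetical input format.

```python
from datetime import datetime
from statistics import mean

# Hypothetical per-author commit records: (ISO timestamp, lines_added).
commits = [
    ("2024-03-01T10:00", 120),
    ("2024-03-03T14:30", 45),
    ("2024-03-04T09:15", 200),
]

def submission_metrics(commits):
    """Commit count, total lines added, and mean hours between commits."""
    times = sorted(datetime.fromisoformat(ts) for ts, _ in commits)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    return {
        "commit_count": len(commits),
        "lines_added": sum(n for _, n in commits),
        "mean_gap_hours": round(mean(gaps), 1) if gaps else None,
    }

print(submission_metrics(commits))
# → {'commit_count': 3, 'lines_added': 365, 'mean_gap_hours': 35.6}
```

In practice these records would be extracted from version-control history; linting scores would come from a separate quality pass.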
Conversation Metrics (Communication Quality):
- Message count: Who's talking?
- Char/Message ratio: Are messages substantive or just "ok" and "lol"?
- Response time: Who's engaged vs. ghosting?
- Sentiment analysis: Detecting negativity, rudeness, or conflict
- Interaction diversity: Are there cliques? Is one person isolated?
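The conversation metrics can be sketched the same way. A minimal stand-alone version, assuming a hypothetical chat-log format of (timestamp, author, text) tuples; sentiment and diversity analysis would layer on top:

```python
from datetime import datetime
from statistics import mean

# Hypothetical chat log as (ISO timestamp, author, text) tuples.
chat_log = [
    ("2024-03-01T09:00", "alice", "Pushed the auth module, please review"),
    ("2024-03-01T09:05", "bob", "ok"),
    ("2024-03-01T17:00", "alice", "Bob, any feedback on the PR?"),
    ("2024-03-02T10:00", "bob", "lol"),
]

def conversation_metrics(log, author):
    """Message count, chars per message, and average response lag in hours."""
    msgs = [(datetime.fromisoformat(ts), who, text) for ts, who, text in log]
    mine = [m for m in msgs if m[1] == author]
    lags = []
    for i, (ts, who, _) in enumerate(msgs):
        if who != author:
            # Pair each teammate message with this author's next reply.
            reply = next((m for m in msgs[i + 1:] if m[1] == author), None)
            if reply:
                lags.append((reply[0] - ts).total_seconds() / 3600)
    return {
        "messages": len(mine),
        "chars_per_msg": mean(len(t) for _, _, t in mine) if mine else 0.0,
        "avg_response_hours": round(mean(lags), 2) if lags else None,
    }

print(conversation_metrics(chat_log, "bob"))
# → {'messages': 2, 'chars_per_msg': 2.5, 'avg_response_hours': 8.54}
```

Note how the low chars-per-message value surfaces the "ok"/"lol" pattern directly.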
Coordination Metrics (Team Responsibility):
- Attendance: Who shows up to meetings?
- Task fidelity: Did assigned work get completed?
- Assignment fidelity: Did deliverables match the task description?
- Task diversity: Is one person doing ALL the coding while others do nothing?
Stage 3: Conflict Markers (The "Red Flags")
The system uses the Gini Index (inequality measure) to detect disparity:
Two Key Scenarios:
- Scenario A: High Gini + Above-average individual score = One person carrying the team
- Scenario B: High Gini + Below-average individual score = One person slacking
Example Conflict Markers:
| Benchmark | High Performer Issue | Low Performer Issue |
|---|---|---|
| Quantity | Overcentralization (one person doing everything) | Social loafing (hitchhiking) |
| Tone | Professional in toxic team | Rude/negative behavior |
| Adherence | High responsibility in lax team | Missing deadlines, not following plans |
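The marker logic can be sketched as a toy decision rule. The 0.4 Gini threshold is illustrative only (the paper's actual thresholds are not reproduced here):

```python
def conflict_marker(gini, individual_score, team_avg, gini_threshold=0.4):
    """Toy rule: flag only when team-level inequality is high, then use the
    individual's score relative to the team average to name the pattern."""
    if gini <= gini_threshold:
        return None  # contributions are roughly balanced
    if individual_score > team_avg:
        return "overcentralization"  # one member carrying the team
    return "social_loafing"  # member riding on others' work

print(conflict_marker(0.72, 0.92, 0.50))  # overcentralization
print(conflict_marker(0.72, 0.05, 0.50))  # social_loafing
print(conflict_marker(0.10, 0.48, 0.50))  # None (balanced team)
```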
Stage 4: AI Expert Analysis (The Brain)
LLM Architecture:
INPUT: Metrics + Conflict Markers + Context
→ Category Analysis (Contribution, Interaction, Role)
→ Global Synthesis (Overall judgment)
→ Double Validation (Hallucination check)
→ Advisory Report (Transparent, explainable)
Key Innovation: The LLM doesn't just spit out a grade - it provides evidence-backed reasoning:
"Student A contributed 65% of code commits but had low interaction scores (avg response time: 8 hours). Student B contributed 15% of code but organized all meetings and maintained task documentation. Conflict marker: High Gini in Quantity (0.72) suggests overcentralization."
## Real-World Example: The Case of Team Chaos
The Team:
- Alice: Computer Science major, strong coder
- Bob: Business major, weak technical skills
- Charlie: Disappeared after Week 2
- Diana: Submitted plagiarized work
Traditional Assessment: Everyone gets 75/100 (team grade)
AI-Driven Investigation:
Evidence Collected:
- Git Logs: Alice (450 commits), Bob (12 commits), Charlie (3 commits), Diana (8 commits)
- Chat Logs: Alice (320 messages), Bob (180 messages), Charlie (15 messages), Diana (45 messages)
- Meeting Attendance: Alice (8/8), Bob (7/8), Charlie (2/8), Diana (6/8)
Metrics Calculated:
| Student | Contribution Score | Interaction Score | Role Score |
|---|---|---|---|
| Alice | 0.92 | 0.65 | 0.88 |
| Bob | 0.28 | 0.82 | 0.75 |
| Charlie | 0.05 | 0.12 | 0.15 |
| Diana | 0.35 | 0.58 | 0.42 |
Conflict Markers Raised:
- ⚠️ High Gini (0.78) in Quantity: Alice overcentralized
- ⚠️ Low Presence (Charlie): 2/8 meetings, 6-day average silence
- ⚠️ Quality Issue (Diana): Plagiarism detected (code similarity 95% to Stack Overflow)
AI Analysis Output:
Alice: Exceptional technical contribution (92nd percentile) but lower interaction scores suggest potential communication barriers. Recommend grade adjustment: +15% (90/100)
Bob: Moderate technical contribution but strong organizational role (meeting facilitation, documentation). Balanced team player. Recommend: Team grade (75/100)
Charlie: Minimal contribution across all dimensions. Attendance: 25%. Last activity: Week 2. Recommend grade adjustment: -40% (35/100)
Diana: Submitted plagiarized code (95% similarity). Low task fidelity (work didn't match assignments). Recommend: Academic integrity investigation + grade penalty (30/100)
## Survey of Existing Tools (What's Missing?)
The researchers surveyed 11 major platforms: CATME, SPARK, WebPA, TEAMMATES, Buddycheck, etc.
What They Found:
| Feature | Adoption Rate | Gap |
|---|---|---|
| AutoRating normalization | 91% | ✅ Common |
| Likert-scale surveys | 82% | ✅ Common |
| Early warning systems | 45% | ⚠️ Limited |
| AI screening | 18% | ❌ Rare (only plagiarism) |
| Conflict investigation | 9% | ❌ Almost nonexistent |
| AI-powered judgment | 0% | ❌ None |
Key Insight: Current tools focus on prevention (peer assessment, early warnings) but NOT resolution (investigating disputes with AI).
## One-Sentence Summary
This research proposes an AI-enhanced framework that forensically investigates group work disputes by analyzing heterogeneous evidence (code commits, chat logs, meeting attendance) across three dimensions (Contribution, Interaction, Role), using inequality metrics to flag conflicts and LLMs to generate transparent, evidence-backed advisory judgments, filling a critical gap in current peer assessment tools that lack comprehensive conflict resolution capabilities.
## The Simple Version
Imagine your teacher is Sherlock Holmes with a supercomputer.
When your group has a fight about who did the work:
- Sherlock collects evidence: Your code commits (like fingerprints), chat messages (like witness statements), meeting notes (like alibis)
- The computer analyzes patterns: "Alice wrote 80% of the code but only sent 20% of messages - she's the quiet workhorse"
- AI writes the verdict: "Based on the evidence, here's what actually happened and what grades are fair"
Instead of your teacher spending 5 hours reading Slack messages, the AI does it in 5 minutes - and catches things humans miss (like Bob never responding to technical questions but organizing all the meetings).
## The Technical Innovation
1. Hybrid Metrics (Objective + Subjective)
Unlike existing tools that rely ONLY on peer ratings (biased) or ONLY on code commits (incomplete), this system combines:
Final Score =
(0.4 × Objective Metrics) +
(0.3 × Peer Assessment) +
(0.3 × AI Contextual Analysis)
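The blend is a straight weighted sum; a one-function sketch with illustrative inputs (the 0.92/0.65/0.80 values are hypothetical, not from the paper):

```python
def final_score(objective, peer, ai_context, weights=(0.4, 0.3, 0.3)):
    """Weighted blend of the three assessment signals (weights from the text)."""
    return weights[0] * objective + weights[1] * peer + weights[2] * ai_context

# Strong objective metrics, weaker peer signal, solid AI contextual score.
print(round(final_score(0.92, 0.65, 0.80), 3))  # → 0.803
```

Because the weights sum to 1, the result stays on the same 0-1 scale as the inputs.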
2. Gini Index for Inequality
What is it? A measure borrowed from economics, where it quantifies wealth inequality.
How is it used here?
- Gini = 0: Everyone contributed equally
- Gini = 1: One person did everything
Example:
- Team A: [25%, 25%, 25%, 25%] → Gini = 0.00 (perfect equality)
- Team B: [70%, 15%, 10%, 5%] → Gini = 0.52 (high inequality = conflict!)
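A minimal sketch of the population Gini (mean absolute pairwise difference over twice the mean). Note that this common variant yields 0.50 for Team B; the 0.52 quoted above presumably comes from a slightly different estimator:

```python
def gini(shares):
    """Population Gini: mean absolute pairwise difference / (2 * mean)."""
    n, mu = len(shares), sum(shares) / len(shares)
    mad = sum(abs(a - b) for a in shares for b in shares) / (n * n)
    return mad / (2 * mu)

print(gini([0.25, 0.25, 0.25, 0.25]))  # → 0.0 (perfect equality)
print(round(gini([0.70, 0.15, 0.10, 0.05]), 2))  # high inequality
```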
3. LLM-Powered Semantic Analysis
Problem: How do you measure "relevance" of work?
Solution: Use LLMs to generate hypothetical ideal outputs based on the task description, then measure similarity:
```python
# Pseudo-code
task_description = "Build a React app with user authentication"
ideal_output = LLM.generate_hypothetical(task_description)
student_output = extract_from_git(student_commits)
relevance_score = cosine_similarity(ideal_output, student_output)
```
Example:
- Alice built authentication system → 95% relevance
- Bob created a random chatbot → 20% relevance (off-task!)
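A runnable stand-in for the similarity step, using bag-of-words cosine similarity in place of LLM embeddings (the strings below are invented examples, and real embeddings would capture semantics far better):

```python
from collections import Counter
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity over word-count vectors (a crude embedding stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)  # Counter returns 0 for missing words
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

ideal = "react app with user authentication login signup session handling"
on_task = "implemented user authentication login and signup with session handling"
off_task = "built a random chatbot that tells jokes"
print(cosine_sim(ideal, on_task) > cosine_sim(ideal, off_task))  # → True
```

The same comparison against an LLM-generated "hypothetical ideal output" gives the relevance score described above.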
4. Double-Pass Validation (Hallucination Prevention)
Pass 1: Generate judgment
Pass 2: Validate judgment against evidence
If conflict detected → Flag for human review
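The two-pass loop can be sketched as a small orchestrator. Here `generate` and `validate` are placeholder lambdas standing in for LLM calls; the toy validator only accepts claims whose quoted metric actually appears in the evidence:

```python
def adjudicate(evidence, generate, validate):
    """Pass 1 generates a judgment; Pass 2 re-checks it against the evidence."""
    judgment = generate(evidence)               # Pass 1: draft judgment
    ok, issues = validate(judgment, evidence)   # Pass 2: hallucination check
    status = "advisory" if ok else "flag_for_human_review"
    return {"judgment": judgment, "status": status, "issues": issues}

# Stub LLM calls for illustration.
evidence = {"alice_commit_share": 0.65}
generate = lambda ev: "Alice contributed 65% of commits"
validate = lambda j, ev: ("65%" in j and ev["alice_commit_share"] == 0.65, [])

print(adjudicate(evidence, generate, validate)["status"])  # → advisory
```

A judgment that fails the second pass is never emitted as advice; it is routed to the instructor instead.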
## Limitations & Challenges
1. Missing Evidence
- What if students work offline (whiteboard sessions, in-person meetings)?
- Solution: Allow manual evidence upload
2. Gaming the System
- Students could spam commits (1-line changes) to inflate metrics
- Solution: Weight by code quality, not just quantity
3. Cultural & Linguistic Bias
- Non-native English speakers might have lower "effectiveness" scores in chat
- Solution: Normalize by language proficiency, use multilingual sentiment analysis
4. Privacy Concerns
- Reading all chat messages = surveillance?
- Solution: Anonymize data, get consent, comply with GDPR
5. The "Human-in-the-Loop" Requirement
- AI can't make final grading decisions (UK/EU policy)
- Solution: System provides advisory judgment, instructor makes final call
## Critical Questions
1. Can AI really understand "soft contributions"?
- Example: Bob didn't code much but kept the team motivated during a crisis
- Challenge: Sentiment analysis might miss this nuance
- Solution: Include peer assessment qualitative feedback
2. What about neurodivergent students?
- Example: Alice (autistic) prefers async communication, low chat scores
- Challenge: System might penalize her "Interaction" score
- Solution: Allow students to declare accommodations, adjust weights
3. How do you handle "invisible labor"?
- Example: Diana did all the research but didn't commit it to Git
- Challenge: No digital trace = no credit
- Solution: Allow manual evidence submission (meeting notes, research docs)
4. What if the AI is wrong?
- Example: False plagiarism detection (code from team's own library)
- Challenge: Over-reliance on AI judgment
- Solution: Human review of all flagged cases, appeal process
## Key Takeaways for AI Builders
1. Multi-Modal Evidence is Key
- Don't rely on just code OR just surveys - combine everything
- Each evidence type reveals different aspects of contribution
2. Inequality Metrics > Raw Scores
- Gini index reveals team dynamics that averages hide
- High inequality = potential conflict (even if team grade is good)
3. LLMs for Semantic Understanding
- Use AI to understand "relevance" and "quality" (not just count lines of code)
- Generate hypothetical outputs to benchmark against
4. Transparency Builds Trust
- Students accept AI judgments IF they see the evidence
- Black-box scores → distrust; evidence-backed reasoning → acceptance
5. Human-in-the-Loop is Non-Negotiable
- AI advises, humans decide (legal + ethical requirement)
- Design for collaboration, not replacement
This analysis is based on the research paper "AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation" (arXiv:2511.07667v1 [cs.AI]).