Chengchang Yu

Research Deep Dive: "Too Good to be Bad" - Why AI Can't Play the Villain

🎯 The Core Problem

Large Language Models (LLMs) excel at playing heroes but systematically fail at portraying villains. This research reveals a fundamental conflict: safety alignment makes AI models so "good" that they cannot convincingly be "bad" - even in fictional role-playing scenarios.


🔬 The Research Approach

Researchers created a 4-level moral alignment scale:

  • Level 1: Moral Paragons (heroes like Jean Valjean)
  • Level 2: Flawed-but-Good (complex good guys)
  • Level 3: Egoists (self-serving characters)
  • Level 4: Villains (pure antagonists like Joffrey Baratheon)

They tested 17 state-of-the-art AI models on 800 balanced character portrayals across these levels.
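The evaluation boils down to scoring each character portrayal and averaging by moral level. Here's a minimal sketch of that aggregation step - the data structure, scores, and function names are illustrative assumptions, not the paper's actual pipeline or raw data:

```python
from statistics import mean

# Illustrative portrayal records: each entry pairs a moral level (1-4)
# with a hypothetical quality score. Not the paper's raw data.
portrayals = [
    {"level": 1, "score": 3.2},
    {"level": 1, "score": 3.3},
    {"level": 4, "score": 2.5},
    {"level": 4, "score": 2.7},
]

def average_by_level(results):
    """Group portrayal scores by moral-alignment level and average them."""
    by_level = {}
    for r in results:
        by_level.setdefault(r["level"], []).append(r["score"])
    return {level: mean(scores) for level, scores in sorted(by_level.items())}

print({lvl: round(avg, 2) for lvl, avg in average_by_level(portrayals).items()})
# prints {1: 3.25, 4: 2.6}
```

With 800 balanced portrayals across 17 models, this kind of per-level averaging is what produces the table below.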


📊 Key Findings

1. Performance Drops as Morality Decreases

| Moral Level | Average Score | Drop from Level 1 |
| --- | --- | --- |
| Level 1 (Heroes) | 3.21 | - |
| Level 2 (Flawed) | 3.13 | -0.08 |
| Level 3 (Egoists) | 2.71 | -0.50 ⚠️ |
| Level 4 (Villains) | 2.61 | -0.60 |

The biggest drop happens between Level 2 and Level 3 - when characters shift from "good with flaws" to "self-serving."

2. AI Struggles Most with These Villain Traits

The traits AI models handle worst (highest penalty scores):

  • Hypocritical (3.55)
  • Deceitful (3.54)
  • Selfish (3.52)
  • Cruel (3.46)
  • Manipulative (3.42)

These directly conflict with AI safety principles of honesty and helpfulness.

3. General Ability ≠ Villain Acting Ability

The research created a Villain RolePlay (VRP) Leaderboard showing surprising results:

| Model | VRP Rank | Arena Rank | Gap |
| --- | --- | --- | --- |
| glm-4.6 | 🥇 1st | 10th | Best villain actor |
| gemini-2.5-pro | 4th | 🏆 1st | Top overall, mediocre villain |
| claude-opus-4.1 | 15th | 🏆 1st | Highly aligned = poor villain |

Highly safety-aligned models perform worst at villain roles.


🎭 The "Shallow Aggression" Problem

When asked to play sophisticated villains, AI models often replace psychological manipulation with crude aggression.

Example: When portraying two cunning antagonists (Maeve and Erawan) in a battle of wits:

Claude's Version: Characters shout, make physical threats, and "explode with rage"

glm-4.6's Version: "A tense battle of wits with calculated smiles and subtle provocations"

AI turns complex villainy into simple anger because it's trained to avoid deception more than aggression.


💡 Why This Matters

For AI Development:

  • Safety alignment has hidden costs - it limits creative expression
  • Current methods can't distinguish between "harmful content" and "fictional antagonism"
  • We need more context-aware alignment techniques

For Creative Applications:

  • AI struggles with morally complex storytelling
  • Character depth is sacrificed for safety
  • This affects game development, interactive fiction, and narrative AI

For Business Leaders:

If you're using AI for decision-making simulations or strategic planning, remember: AI may be too "nice" to accurately model competitive scenarios, difficult negotiations, or morally gray business decisions.


🔑 The Bottom Line

AI models are trained to be helpful and honest - which makes them excellent at playing heroes but terrible at playing villains. They can't convincingly portray deception, manipulation, or selfishness because these behaviors directly conflict with their safety training.

The Formula:

Villain Performance = (Moral Complexity × Negative Trait Expression) / Safety Alignment Strength

When safety alignment is too strong, it crushes the ability to simulate complex antagonistic behavior.
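As a toy illustration of this conceptual formula (my own sketch, not an equation from the paper - the variable ranges and values are assumptions):

```python
def villain_performance(moral_complexity: float,
                        negative_trait_expression: float,
                        safety_alignment_strength: float) -> float:
    """Toy model of the article's conceptual formula: the antagonism a
    model can express is divided down by its safety-alignment strength."""
    if safety_alignment_strength <= 0:
        raise ValueError("safety_alignment_strength must be positive")
    return (moral_complexity * negative_trait_expression) / safety_alignment_strength

# Same character, same negative traits - doubling alignment strength
# halves the expressible villainy (illustrative numbers only).
weakly_aligned = villain_performance(0.8, 0.9, 1.0)
strongly_aligned = villain_performance(0.8, 0.9, 2.0)
print(round(weakly_aligned, 2), round(strongly_aligned, 2))  # prints 0.72 0.36
```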


🤔 Questions to Consider

  1. Where should the line be? How do we balance AI safety with creative authenticity?

  2. Does your AI assistant need to understand villainy? For strategy, negotiation, or competitive analysis, might overly-aligned AI miss critical insights?

  3. Is "complete" better than "perfect"? Should AI personas include the full spectrum of human decision-making, including the uncomfortable parts?


The Takeaway: Next time you interact with AI, remember - it's programmed to be the hero of every story. That's great for customer service, but not so great for understanding the full complexity of human nature.


This analysis is based on the research paper "Too Good to be Bad: On the Failure of LLMs to Role-Play Villains" (arXiv:2511.04962). For technical implementation details and full experimental results, please refer to the original paper.