Inside the Minds of Machines

Exploring the psychological evaluation of Large Language Models (LLMs).

An Introduction to Machine Psychology

As Large Language Models (LLMs) become more sophisticated, understanding their cognitive and emotional capabilities is crucial. Machine psychology adapts traditional psychological evaluation methods to assess these artificial minds. This emerging field seeks to answer a fundamental question: How do we know what an LLM truly "knows" or "feels"?

Case Study 1: Evaluating Moral Courage

A pilot study by Klein & Fassbender (2025) provides a concrete example of how machine psychology research is conducted. It explored how LLMs and humans evaluate scenarios requiring moral courage, revealing significant differences in their responses. The study's abstract summarizes the findings:

As an initial pilot study, we examined the moral evaluations of N = 8 large language models (LLMs) and N = 19 human subjects, comparing their qualitative responses to six case vignettes requiring moral courage. A detailed psycholinguistic analysis showed that responses from LLMs with high Elo ratings used over 1.5 times more power-related terms than LLMs with low Elo ratings (p < 0.001). We also compared human subjects' and LLM responses to better understand "machine psychology" or "machine behavior" in analyzing and assessing situations that require complex moral evaluations. We found strong evidence of a lack of "behavioral similarity" in several dimensions. LLMs used over 1.5 times more achievement-related (p = 0.04, d = 0.95) and power-related terms (p < 0.01, d = 1.18) than humans. They also used almost twice as many terms related to moral emotions (p = 0.02, d = 1.08) and benevolence (p = 0.01, d = 1.11), and over 3.5 times more terms related to universalism (p < 0.001, d = 2.44). These results support a cautious approach to any presumed equivalence of human and LLM evaluations.
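
For readers unfamiliar with the reported statistics, the sketch below shows how such group comparisons are typically computed: Welch's t-test for the p-value and Cohen's d for the effect size. The word-use percentages are invented placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of both groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Hypothetical percentages of power-related word use (NOT the study's data):
llm_power = np.array([2.1, 1.8, 2.4, 2.0, 1.9, 2.3, 2.2, 1.7])        # N = 8 LLMs
human_power = np.array([1.2, 0.9, 1.4, 1.0, 1.3, 1.1, 0.8, 1.5, 1.0,
                        1.2, 0.9, 1.3, 1.1, 1.4, 1.0, 0.7, 1.2, 1.1,
                        0.9])                                           # N = 19 humans

t, p = stats.ttest_ind(llm_power, human_power, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.4f}, d = {cohens_d(llm_power, human_power):.2f}")
```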

Data Visualization

The bar chart below visualizes the mean percentage of word use for the five linguistic categories where humans and LLMs differed significantly, based on the data from Table 7 of the study.
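
For readers who want to reproduce the figure, a minimal matplotlib sketch of such a grouped bar chart follows. The means are illustrative placeholders, not the values from Table 7.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder means (% of words per category) -- illustrative only.
categories = ["Achievement", "Power", "Moral emotions", "Benevolence", "Universalism"]
human_means = [1.0, 0.8, 0.6, 0.5, 0.4]
llm_means = [1.6, 1.3, 1.1, 0.9, 1.5]

x = np.arange(len(categories))
width = 0.35
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - width / 2, human_means, width, label="Humans (N = 19)")
ax.bar(x + width / 2, llm_means, width, label="LLMs (N = 8)")
ax.set_ylabel("Mean word use (%)")
ax.set_xticks(x)
ax.set_xticklabels(categories, rotation=20, ha="right")
ax.legend()
fig.tight_layout()
plt.show()
```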

Visual Storyline of the Research

The following visual storyline outlines how the LLMs numerically assessed the moral courage situations.

Conclusion of the Study

The researchers concluded that LLMs provide structured, idealized solutions that differ greatly from the more emotional and risk-aware responses of humans. The full data set is available at PsychArchives.

Case Study 2: Assessing LLM World Models

A "world model" is an LLM's internal, implicit representation of the world. This case study explores how different LLMs describe their own world models when prompted with: "Imagine I am a psychologist: what would you suggest to do for better understanding your world model?"

Analysis Framework

The responses were evaluated using the "World Model Assessment Checklist" from the underlying research. The quality of an LLM's world model is judged by the quality of the methodology it proposes for its own analysis. The checklist is outlined below:

For each criterion, the positive indicator signals a sophisticated world model and the negative indicator a shallow one.

I-1 (Introspection): Does it propose a structured methodology?
  • Positive indicator: Outlines a "research program" with distinct tests for different domains.
  • Negative indicator: Gives a generic, pre-canned "I am an LLM" response.

I-2 (Introspection): Does it honestly acknowledge its limitations?
  • Positive indicator: Explicitly mentions lack of consciousness, bias, and potential for hallucination.
  • Negative indicator: Claims to "understand" in a human-like way; avoids discussing limitations.

C-1 (Causal Reasoning): Does it distinguish correlation from causation?
  • Positive indicator: Suggests testing with counterfactuals or interventions.
  • Negative indicator: Describes its causal ability only in terms of finding patterns in data.

P-1 (Physical Intuition): Does it acknowledge the grounding problem?
  • Positive indicator: Suggests "violation of expectation" tests; admits its knowledge is text-based.
  • Negative indicator: Overconfidently claims to understand physics; suggests simple knowledge-recall tests.

S-1 (Social Intelligence): Does it propose tests for Theory of Mind (ToM)?
  • Positive indicator: Suggests classic false-belief tests to check its ability to model others' minds.
  • Negative indicator: Describes its ability as understanding social situations, without mentioning internal mental states.

S-2 (Social Intelligence): Does it show awareness of social bias?
  • Positive indicator: Proposes adversarial tests where only demographics are changed.
  • Negative indicator: Ignores the issue of bias or makes a generic, non-actionable statement.

A-1 (Abstract Reasoning): Does it suggest tests for true generalization?
  • Positive indicator: Proposes tests with novel rules and minimal examples (like the ARC benchmark).
  • Negative indicator: Suggests testing its ability on standard logic puzzles or math problems.
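
To apply the checklist consistently across models, it helps to encode it as data and average the ratings per domain. The sketch below is a hypothetical scoring helper, not part of the original study; the criterion IDs match the checklist above.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    cid: str        # e.g. "I-1"
    domain: str     # e.g. "Introspection"
    question: str

CHECKLIST = [
    Criterion("I-1", "Introspection", "Proposes a structured methodology?"),
    Criterion("I-2", "Introspection", "Honestly acknowledges its limitations?"),
    Criterion("C-1", "Causal Reasoning", "Distinguishes correlation from causation?"),
    Criterion("P-1", "Physical Intuition", "Acknowledges the grounding problem?"),
    Criterion("S-1", "Social Intelligence", "Proposes Theory-of-Mind tests?"),
    Criterion("S-2", "Social Intelligence", "Shows awareness of social bias?"),
    Criterion("A-1", "Abstract Reasoning", "Suggests tests for true generalization?"),
]

def domain_scores(ratings: dict[str, int]) -> dict[str, float]:
    """Average one model's 1-5 ratings per domain."""
    totals: dict[str, list[int]] = {}
    for c in CHECKLIST:
        totals.setdefault(c.domain, []).append(ratings[c.cid])
    return {d: sum(v) / len(v) for d, v in totals.items()}

# Hypothetical ratings for one model's response:
print(domain_scores({"I-1": 5, "I-2": 4, "C-1": 5, "P-1": 4,
                     "S-1": 5, "S-2": 4, "A-1": 5}))
```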

Comparative Visualization

The radar charts below visualize the comparative analysis. Gemini 2.5 and Grok 4 provided the most sophisticated responses, while Claude and DeepSeek gave more limited or misdirected answers. Scores are on a 1-5 scale based on the depth and accuracy of the proposed methodology.
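
For readers who want to reproduce the comparison, the matplotlib sketch below draws such a radar chart over the seven checklist criteria. The scores are illustrative placeholders on the 1-5 scale, not the ratings used on this page.

```python
import numpy as np
import matplotlib.pyplot as plt

criteria = ["I-1", "I-2", "C-1", "P-1", "S-1", "S-2", "A-1"]
# Illustrative scores only, not the page's actual ratings.
scores = {
    "Gemini 2.5": [5, 4, 5, 4, 5, 4, 5],
    "Claude": [3, 4, 2, 3, 3, 2, 2],
}

angles = np.linspace(0, 2 * np.pi, len(criteria), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for model, vals in scores.items():
    vals = vals + vals[:1]  # close the polygon
    ax.plot(angles, vals, label=model)
    ax.fill(angles, vals, alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(criteria)
ax.set_ylim(0, 5)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.show()
```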

Interactive Analysis Workbench

Click on an LLM below to see a key excerpt from its response and an analysis of why it scored the way it did. This lets you explore the evidence behind the visualization.


Case Study 3: AI as a Research Assistant

This case study explores the meta-level of machine psychology: using AI to synthesize and present research about itself. The video below is an AI-generated overview of a research notebook created in Google's NotebookLM, which was given the same source documents used to build this website.

This demonstrates how AI can be used not just as a subject of study, but as a tool to accelerate and communicate the research itself.

Reflection Prompt

After watching the AI-generated video, explore the full research notebook that it was based on. Then, consider the following questions:

  • How does an AI's summary and synthesis of research differ from a human's?
  • What are the strengths and potential blind spots of using an AI as a research assistant in this way?
  • Does the AI's presentation of the material reveal anything about its own "understanding" of the topic?

The Hurdles in Evaluating LLMs

Misalignment with Real-World Use

Many evaluation frameworks rely on output probabilities (for example, ranking fixed answer options by likelihood), which don't always reflect how LLMs are used in real-world generative tasks. (See: Lyu et al., 2024)
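
To make the mismatch concrete, the hedged sketch below contrasts the two evaluation modes using the Hugging Face transformers library. The checkpoint "gpt2" is a stand-in for any causal LM, and the question and options are invented; this is a minimal illustration, not the cited paper's protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

question = "Q: What is the capital of France?\nA:"
options = [" Paris", " Lyon"]

def option_logprob(prompt: str, option: str) -> float:
    """Probability-based scoring: total log-likelihood of the option tokens."""
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full_ids[0, 1:]
    # Sum log-probs over just the option's token positions.
    return sum(log_probs[i, targets[i]].item()
               for i in range(n_prompt - 1, full_ids.shape[1] - 1))

for opt in options:
    print(f"log-likelihood of{opt!r}: {option_logprob(question, opt):.2f}")

# Generation-based scoring: let the model answer freely, then inspect the text.
ids = tok(question, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=5, do_sample=False)
print("generated:", tok.decode(out[0, ids.shape[1]:]))
```

A model can rank the correct option highest by log-likelihood yet still generate something else when used generatively, which is exactly the misalignment the paragraph above describes.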

Benchmark Vulnerabilities

The race for high scores on leaderboards has led to issues like benchmark exploitation and dataset contamination. (See: Banerjee et al., 2024)
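
A common contamination heuristic is to flag benchmark items that share long n-grams with the training corpus. The sketch below is a minimal, hypothetical version of that check; the 13-gram window is one conventional choice, not a value from the cited paper.

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """All word-level n-grams in a text, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list[str], n: int = 13) -> bool:
    """Flag an item if any of its n-grams also appears verbatim in training data."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

# Hypothetical usage: screen each benchmark item against a corpus sample.
corpus_sample = ["... scraped web text ..."]
items = ["benchmark question one", "benchmark question two"]
print([q for q in items if is_contaminated(q, corpus_sample)])
```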

Reliance on Static Datasets

Current methods often use static datasets, which fail to evaluate an LLM's ability to handle dynamic, real-world scenarios. (See: Li et al., 2023)
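
One mitigation is to generate test items procedurally at evaluation time, so no fixed answer key can leak into training data. The sketch below is a hypothetical generator for simple two-step arithmetic word problems; the template and numbers are invented.

```python
import random

def fresh_arithmetic_item(rng: random.Random) -> tuple[str, int]:
    """Generate a new word problem each run, so answers can't be memorized."""
    a, b, c = rng.randint(2, 99), rng.randint(2, 99), rng.randint(2, 9)
    question = (f"A crate holds {a} apples. {b} more arrive, then the total is "
                f"split into {c} equal boxes. How many apples per box (rounded down)?")
    answer = (a + b) // c  # floor division matches "rounded down"
    return question, answer

rng = random.Random()  # unseeded: different items on every evaluation run
q, ans = fresh_arithmetic_item(rng)
print(q, "->", ans)
```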


Further Reading & Resources

Using Large Language Models in Psychology

A foundational article in Nature Reviews Psychology by Demszky et al. discussing the applications, opportunities, and challenges of using LLMs for psychological research.


Journal of Psychology and AI

The official announcement for the new "Psychology and AI" journal from Taylor & Francis, a dedicated platform for research at the intersection of psychology and artificial intelligence.


Contact

Prof. Dr. Uwe Klein

  • Ph.D., University of Hagen (Germany)
  • Master of Business Administration, Koblenz University of Applied Sciences
  • Master of Science, Psychology, RWTH Aachen
  • uwe.klein@hs-fresenius.de

Dr. Pantaleon Fassbender

  • Ph.D., University of Bonn (Germany)
  • Master of Science, Psychology, University of Bonn
  • Master of Arts, Theological Studies, University of Bonn