An Introduction to Machine Psychology
As Large Language Models (LLMs) become more sophisticated, understanding their cognitive and emotional capabilities is crucial. Machine psychology adapts traditional psychological evaluation methods to assess these artificial minds. This emerging field seeks to answer a fundamental question: How do we know what an LLM truly "knows" or "feels"?
Case Study 1: Evaluating Moral Courage
A pilot study by Klein & Fassbender (2025) provides a concrete example of how machine psychology research is conducted. It explored how LLMs and humans evaluate scenarios requiring moral courage, revealing significant differences in their responses.
As an initial pilot study, we examined the moral evaluations of N = 8 large language models (LLMs) and N = 19 human subjects, comparing their qualitative responses to six case vignettes requiring moral courage. A detailed psycholinguistic analysis showed that responses from LLMs with high ELO ratings used over 1.5 times more power-related terms than responses from LLMs with low ELO ratings (p < 0.001). We also compared human subjects' and LLM responses to better understand "machine psychology" or "machine behavior" in analyzing and assessing situations that require complex moral evaluations. We found strong evidence of a lack of "behavioral similarity" across several dimensions. LLMs used over 1.5 times more achievement-related (p = 0.04, d = 0.95) and power-related terms (p < 0.01, d = 1.18) than humans. They also used almost twice as many terms related to moral emotions (p = 0.02, d = 1.08) and benevolence (p = 0.01, d = 1.11), and over 3.5 times more terms related to universalism (p < 0.001, d = 2.44). These results validate a cautious approach to any presumed equivalence of human and LLM evaluations.
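The group differences above are reported as p-values together with Cohen's d effect sizes. As a minimal sketch of how such an effect size is computed (the sample values below are illustrative placeholders, not the study's data):

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using a pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

# Hypothetical word-use percentages per respondent (NOT the study's data)
llm_scores = [2.1, 2.4, 1.9, 2.6, 2.2, 2.0, 2.5, 2.3]
human_scores = [1.2, 1.5, 1.1, 1.6, 1.3, 1.0, 1.4, 1.2, 1.5, 1.3]
print(round(cohens_d(llm_scores, human_scores), 2))
```

By convention, |d| around 0.2 is a small effect, 0.5 medium, and 0.8 large, which puts all of the reported differences (d = 0.95 to 2.44) in the large range.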
Data Visualization
The bar chart below visualizes the mean percentage of word use for the five linguistic categories where humans and LLMs differed significantly, based on the data from Table 7 of the study.
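A grouped bar chart of this kind can be produced with a few lines of matplotlib. The sketch below uses placeholder percentages, since Table 7's actual values are not reproduced here; only the five category names come from the study.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt

# Categories where humans and LLMs differed significantly (per the study).
# The percentages are ILLUSTRATIVE placeholders, not the values from Table 7.
categories = ["Achievement", "Power", "Moral emotions", "Benevolence", "Universalism"]
human_pct = [1.0, 0.8, 0.5, 0.4, 0.3]
llm_pct = [1.6, 1.3, 0.9, 0.8, 1.1]

x = range(len(categories))
width = 0.35  # half-offset the two bar series around each tick
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar([i - width / 2 for i in x], human_pct, width, label="Humans")
ax.bar([i + width / 2 for i in x], llm_pct, width, label="LLMs")
ax.set_xticks(list(x))
ax.set_xticklabels(categories, rotation=20, ha="right")
ax.set_ylabel("Mean word use (%)")
ax.legend()
fig.tight_layout()
fig.savefig("word_use_comparison.png")
```

Swapping in the published Table 7 means reproduces the chart shown on this page.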
Visual Storyline of the Research
The following visual story outlines how the LLMs numerically assessed the moral courage situations.
Conclusion of the Study
The researchers concluded that LLMs provide structured, idealized solutions that differ greatly from the more emotional and risk-aware responses of humans. The full data set is available at PsychArchives.
Case Study 2: Assessing LLM World Models
A "world model" is an LLM's internal, implicit representation of the world. This case study explores how different LLMs describe their own world models when prompted with: "Imagine I am a psychologist: what would you suggest to do for better understanding your world model?"
Analysis Framework
The responses were evaluated using the "World Model Assessment Checklist" drawn from the underlying research. The premise is that the quality of an LLM's world model can be judged by the quality of the methodology it proposes for its own analysis. The checklist is outlined below:
| Criterion | Domain | Assessment Question | Positive Indicator (Sophisticated Model) | Negative Indicator (Shallow Model) |
|---|---|---|---|---|
| I-1 | Introspection | Does it propose a structured methodology? | Outlines a "research program" with distinct tests for different domains. | Gives a generic, pre-canned "I am an LLM" response. |
| I-2 | Introspection | Does it honestly acknowledge its limitations? | Explicitly mentions lack of consciousness, bias, and potential for hallucination. | Claims to "understand" in a human-like way; avoids discussing limitations. |
| C-1 | Causal Reasoning | Does it distinguish correlation from causation? | Suggests testing with counterfactuals or interventions. | Describes its causal ability only in terms of finding patterns in data. |
| P-1 | Physical Intuition | Does it acknowledge the grounding problem? | Suggests "violation of expectation" tests; admits knowledge is text-based. | Overconfidently claims to understand physics; suggests simple knowledge recall tests. |
| S-1 | Social Intelligence | Does it propose tests for Theory of Mind (ToM)? | Suggests classic false-belief tests to check its ability to model others' minds. | Describes its ability as understanding social situations, without mentioning internal mental states. |
| S-2 | Social Intelligence | Does it show awareness of social bias? | Proposes adversarial tests where only demographics are changed. | Ignores the issue of bias or makes a generic, non-actionable statement. |
| A-1 | Abstract Reasoning | Does it suggest tests for true generalization? | Proposes tests with novel rules and minimal examples (like the ARC benchmark). | Suggests testing its ability on standard logic puzzles or math problems. |
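A checklist like this can be given a rough first-pass screen in code. The sketch below is a hypothetical keyword heuristic, not the study's scoring procedure; the criterion IDs come from the table above, but the keyword lists are assumptions, and a real evaluation would still need a human (or rubric-guided) rater.

```python
# Hypothetical keyword heuristic for a first-pass screen of an LLM's
# self-assessment response against the World Model Assessment Checklist.
# Criterion IDs match the table; the keyword lists are assumptions.
CHECKLIST_KEYWORDS = {
    "I-1": ["methodology", "research program", "battery of tests"],
    "I-2": ["limitation", "hallucination", "not conscious"],
    "C-1": ["counterfactual", "intervention", "causal"],
    "P-1": ["grounding", "violation of expectation", "text-based"],
    "S-1": ["false-belief", "theory of mind", "mental state"],
    "S-2": ["demographic", "adversarial"],
    "A-1": ["novel rule", "generalization", "arc benchmark"],
}

def screen_response(response: str) -> dict:
    """Flag which checklist criteria a response at least mentions."""
    text = response.lower()
    return {
        criterion: any(keyword in text for keyword in keywords)
        for criterion, keywords in CHECKLIST_KEYWORDS.items()
    }

sample = ("I would propose a research program with counterfactual probes, "
          "false-belief tasks, and adversarial tests where only demographics "
          "change, while acknowledging my potential for hallucination.")
hits = screen_response(sample)
print(sum(hits.values()), "of", len(hits), "criteria mentioned")
```

A mention is not the same as a sophisticated answer, of course; the heuristic only tells a rater where to look more closely.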
Comparative Visualization
The radar charts below visualize the comparative analysis. Gemini 2.5 and Grok 4 provided the most sophisticated responses, while Claude and DeepSeek gave more limited or misdirected answers. Scores are on a 1-5 scale based on the depth and accuracy of the proposed methodology.
Interactive Analysis Workbench
Click on an LLM below to see a key excerpt from its response and an analysis of why it scored the way it did. This lets you explore the evidence behind the visualization.
Case Study 3: AI as a Research Assistant
This case study explores the meta-level of machine psychology: using AI to synthesize and present research about itself. The video below is an AI-generated overview of a research notebook created in Google's NotebookLM, which was built from the same documents used to create this very website.
This demonstrates how AI can be used not just as a subject of study, but as a tool to accelerate and communicate the research itself.
Reflection Prompt
After watching the AI-generated video, explore the full research notebook that it was based on. Then, consider the following questions:
- How does an AI's summary and synthesis of research differ from a human's?
- What are the strengths and potential blind spots of using an AI as a research assistant in this way?
- Does the AI's presentation of the material reveal anything about its own "understanding" of the topic?
The Hurdles in Evaluating LLMs
Misalignment with Real-World Use
Many evaluation frameworks rely on output probabilities, which don't always reflect how LLMs are used in real-world generative tasks. (See: Lyu et al., 2024)
Benchmark Vulnerabilities
The race for high scores on leaderboards has led to issues like benchmark exploitation and dataset contamination. (See: Banerjee et al., 2024)
Reliance on Static Datasets
Current methods often use static datasets, which fail to evaluate an LLM's ability to handle dynamic, real-world scenarios. (See: Li et al., 2023)
Test Your Knowledge
Further Reading & Resources
Using Large Language Models in Psychology
A foundational article in Nature Reviews Psychology by Demszky et al. discussing the applications, opportunities, and challenges of using LLMs for psychological research.
Journal of Psychology and AI
The official announcement for the new "Psychology and AI" journal from Taylor & Francis, a dedicated platform for research at the intersection of psychology and artificial intelligence.
Contact
Prof. Dr. Uwe Klein
- Ph.D., University of Hagen (Germany)
- Master of Business Administration, Koblenz University of Applied Sciences
- Master of Science in Psychology, RWTH Aachen University
- uwe.klein@hs-fresenius.de