Divergent Thinking Scoring

Domain-validated multi-dimensional scoring system for divergent thinking tasks, including fluency, flexibility, originality, and automated semantic distance methods

Install skill "Divergent Thinking Scoring" with this command: npx skills add haoxuanlithuai/awesome_cognitive_and_neuroscience_skills/haoxuanlithuai-awesome-cognitive-and-neuroscience-skills-divergent-thinking-scoring


Purpose

This skill encodes expert methodological knowledge for scoring responses from divergent thinking tasks (Alternative Uses Task, Unusual Uses Task, instances tasks, etc.). It covers the four standard scoring dimensions — fluency, flexibility, originality, and elaboration — plus modern automated scoring using semantic distance. A general-purpose programmer would typically count responses (fluency) but would not know the domain-specific decisions around flexibility category systems, originality thresholds, inter-rater reliability requirements, or how to compute semantic distance as a creativity metric.

When to Use This Skill

  • Scoring responses from an AUT, Unusual Uses Task, or similar divergent thinking task
  • Choosing between subjective (human-rated) and objective (automated) scoring approaches
  • Computing semantic distance as an automated creativity metric
  • Establishing inter-rater reliability for creativity coding
  • Deciding how to handle the fluency-originality confound

Research Planning Protocol

Before executing the domain-specific steps below, you MUST:

  1. State the research question — What specific aspect of divergent thinking is being measured?
  2. Justify the method choice — Why these scoring dimensions? What alternatives were considered?
  3. Declare expected outcomes — Which dimensions are expected to show effects?
  4. Note assumptions and limitations — What does each scoring method assume? Where could it mislead?
  5. Present the plan to the user and WAIT for confirmation before proceeding.

For detailed methodology guidance, see the research-literacy skill.

⚠️ Verification Notice

This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.

Scoring Dimensions Overview

| Dimension | What It Measures | Scoring Method | Automation | Source |
|---|---|---|---|---|
| Fluency | Quantity of responses | Count valid responses | Fully automated | Guilford, 1967 |
| Flexibility | Variety of conceptual categories | Count distinct categories | Semi-automated (COWA) | Reiter-Palmon et al., 2019 |
| Originality | Statistical rarity or novelty | Frequency <5% threshold or subjective rating | Semi-automated | Silvia et al., 2008 |
| Elaboration | Detail and development of ideas | Count additional details per response | Manual only | Guilford, 1967 |
| Semantic distance | Conceptual remoteness from prompt | GloVe/word2vec cosine distance | Fully automated | Beaty & Johnson, 2021 |

Fluency Scoring

Definition

Fluency = the total number of valid, non-redundant responses a participant generates.

Scoring Rules

  1. Count each distinct response as one unit
  2. Exclude:
  • Exact duplicates
  • Conventional/typical uses of the object (debated — some protocols include them; Reiter-Palmon et al., 2019)
  • Gibberish or clearly irrelevant responses
  • Responses that are minor variations of each other (e.g., "use as a hammer" and "use to pound nails" = 1 response)
  3. When in doubt, count as separate responses and let originality scoring handle quality

Domain insight: Fluency is the most reliable but least interesting creativity measure. It correlates with personality traits (openness) and general cognitive ability, but does not distinguish truly creative responses from merely numerous ones (Silvia et al., 2008).
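The counting rules above can be sketched as a small helper. This is a minimal illustration, not a prescribed implementation: the function name is hypothetical, and `invalid` stands in for a coder-supplied list of gibberish or irrelevant responses. Merging minor variations (rule 4) still requires human judgment.

```python
import re

def fluency_score(responses, invalid=()):
    """Count valid, non-redundant responses (a minimal sketch).

    Removes exact duplicates after light normalization. Merging
    near-duplicates ("use as a hammer" vs. "use to pound nails")
    still requires manual review.
    """
    seen = set()
    count = 0
    for r in responses:
        norm = re.sub(r"\s+", " ", r.strip().lower())
        if not norm or norm in invalid:  # empty, gibberish, or irrelevant
            continue
        if norm in seen:                 # exact duplicate
            continue
        seen.add(norm)
        count += 1
    return count
```

For example, `fluency_score(["Doorstop", "doorstop ", "weapon", ""])` counts two distinct valid responses.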

Flexibility Scoring

Definition

Flexibility = the number of distinct conceptual categories across a participant's responses.

Category Systems

COWA (Category of Words from AUT) system (Reiter-Palmon et al., 2019):

Provides a standardized taxonomy of response categories for common AUT objects. Example categories for "brick":

| Category | Example Responses |
|---|---|
| Construction/Building | "build a wall," "build a house" |
| Weapon/Violence | "throw at someone," "use as a weapon" |
| Weight/Anchor | "paperweight," "doorstop," "anchor" |
| Art/Decoration | "sculpt into art," "garden decoration" |
| Sport/Exercise | "use as a dumbbell," "exercise weight" |
| Tool | "hammer," "grinding surface" |

Scoring Procedure

  1. Train coders on the category system (minimum 2 coders)
  2. Assign each response to its most appropriate category
  3. Count unique categories per participant = flexibility score
  4. Compute inter-rater reliability: ICC ≥ 0.70 acceptable, ≥ 0.80 good (Lee & Chung, 2024; Shrout & Fleiss, 1979)
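Steps 2-3 can be sketched as follows. The function name and the `category_map` dict are hypothetical; in practice the map is built by trained coders applying the COWA taxonomy, and unmapped responses should be routed back to coders rather than silently dropped.

```python
def flexibility_score(responses, category_map):
    """Flexibility = number of distinct categories across responses.

    `category_map` maps a normalized response to its category label.
    Returns (score, uncoded) so unmapped responses can be hand-coded.
    """
    categories = set()
    uncoded = []
    for r in responses:
        norm = r.strip().lower()
        if norm in category_map:
            categories.add(category_map[norm])
        else:
            uncoded.append(r)
    return len(categories), uncoded

# Hypothetical category assignments for "brick" responses
brick_map = {
    "build a wall": "Construction/Building",
    "build a house": "Construction/Building",
    "paperweight": "Weight/Anchor",
    "throw at someone": "Weapon/Violence",
}
```

Here `flexibility_score(["build a wall", "Build a house", "paperweight"], brick_map)` yields a flexibility score of 2, since the first two responses fall in the same category.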

Originality Scoring

Method 1: Statistical Rarity (Objective)

A response is "original" if it is given by fewer than 5% of the sample (Wallach & Kogan, 1965; Lee & Chung, 2024).

Procedure:

  1. Pool all responses across all participants
  2. Normalize spelling and phrasing (e.g., "door stop" = "doorstop")
  3. Compute the frequency of each unique response
  4. Mark responses given by <5% of participants as original (score = 1; else = 0)
  5. Originality score per participant = sum or proportion of original responses

Alternative thresholds: Some studies use <1% (very strict) or <10% (lenient). The 5% threshold is most common (Reiter-Palmon et al., 2019).
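The procedure can be sketched as below. This is an illustrative implementation, not a validated one: the function name is hypothetical, responses are assumed to be already normalized, and frequency is counted per participant (a participant repeating a response does not inflate its frequency).

```python
from collections import Counter

def originality_scores(responses_by_pid, threshold=0.05):
    """Score originality by statistical rarity.

    `responses_by_pid` maps participant id -> list of normalized
    responses. A response scores 1 if given by fewer than `threshold`
    of participants, else 0; the per-participant score is the sum.
    """
    n = len(responses_by_pid)
    # Count participants (not response tokens) giving each unique response
    freq = Counter()
    for resps in responses_by_pid.values():
        for r in set(resps):
            freq[r] += 1
    return {
        pid: sum(1 for r in set(resps) if freq[r] / n < threshold)
        for pid, resps in responses_by_pid.items()
    }
```

With a toy sample of three participants (and a deliberately lenient threshold, since 5% is meaningless at this N), a response given by only one participant scores as original while a shared response does not.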

Method 2: Subjective Rating (Qualitative)

Human raters judge each response for creativity on a Likert scale (Silvia et al., 2008).

Procedure:

  1. Scale: 1 (not at all creative) to 5 (highly creative) — or 1-7 for finer discrimination
  2. Raters: Minimum 2 independent raters (Lee & Chung, 2024 used 3)
  3. Training: Calibrate raters on anchor examples before scoring begins
  4. Inter-rater reliability: ICC (two-way random, average measures) ≥ 0.70 (Lee & Chung, 2024 achieved ICC = 0.72-0.89)
  5. Scoring: Average across raters for each response; then average or sum per participant
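The ICC variant named in step 4 (two-way random effects, average measures, ICC(2,k) in Shrout & Fleiss's notation) can be computed from a responses-by-raters matrix as sketched below. This is a pure-Python illustration of the textbook formula; for publication, a validated statistics package should be preferred.

```python
def icc2k(ratings):
    """ICC(2,k): two-way random effects, average measures
    (Shrout & Fleiss, 1979).

    `ratings` is a list of rows, one per rated response, with one
    column per rater (no missing values).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # rows (targets)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # columns (raters)
    sse = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)
```

Perfect rater agreement yields an ICC of 1.0; a constant between-rater offset (which average-measures ICC tolerates but does not ignore) yields a value slightly below 1.0.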

Method Selection Decision Logic

Is your sample size large (N > 100)?
|
+-- YES --> Do you need fine-grained creativity distinctions?
|           |
|           +-- YES --> Use subjective rating (richer information)
|           |
|           +-- NO --> Use statistical rarity (objective, faster)
|
+-- NO --> Statistical rarity is unreliable with small N
           (rare responses may be rare by chance)
           --> Use subjective rating
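For use in an analysis pipeline, the decision logic above reduces to a few lines (the function name is hypothetical):

```python
def choose_originality_method(n, need_fine_grained):
    """Pick an originality scoring method per the decision tree above."""
    if n <= 100:
        # Statistical rarity is unreliable with small N:
        # rare responses may be rare by chance.
        return "subjective rating"
    return "subjective rating" if need_fine_grained else "statistical rarity"
```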

Semantic Distance (Automated Scoring)

Overview

Semantic distance measures how conceptually far a response is from the prompt word in a vector space model. More distant = more creative (Beaty & Johnson, 2021).

Method (Beaty & Johnson, 2021; Organisciak et al., 2023)

  1. Embedding model: GloVe (Global Vectors for Word Representation; Pennington et al., 2014) trained on Common Crawl — 300-dimensional vectors
  2. Computation:
  • Represent the prompt word (e.g., "brick") as a vector
  • Represent each response as a vector (average word vectors for multi-word responses)
  • Compute cosine distance = 1 - cosine_similarity(prompt, response)
  3. Per-participant score: Average semantic distance across all responses
  4. Platform: SemDis (https://semdis.wlu.psu.edu/) — web-based tool by Beaty & Johnson (2021)
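The core computation (step 2) is straightforward, as the sketch below shows. In real use, `embeddings` would be loaded from pretrained 300-dimensional GloVe vectors or the computation delegated to SemDis; the tiny 2-D vectors in the example are purely illustrative and carry no semantic meaning.

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def response_vector(words, embeddings):
    """Average word vectors for a multi-word response."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Identical vectors give a distance of 0, orthogonal vectors a distance of 1; a participant's score is the mean of these distances across responses.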

Interpretation

| Semantic Distance | Interpretation |
|---|---|
| Low (~0.3-0.5) | Response is semantically close to the object (e.g., "brick" → "build a wall") |
| Medium (~0.5-0.7) | Moderately creative (e.g., "brick" → "use as a paperweight") |
| High (~0.7-1.0) | Highly creative / remote association (e.g., "brick" → "use as a canvas for art") |

Validation: Semantic distance correlates with subjective originality ratings at r ≈ 0.40-0.60 and predicts real-world creative achievement (Beaty & Johnson, 2021; Organisciak et al., 2023).

Advantages and Limitations

| Advantage | Limitation |
|---|---|
| Fully automated, no rater training | Misses context ("use as food" for a brick is unusual but gets a moderate distance score) |
| Objective and reproducible | Depends on the embedding model's training corpus |
| Scales to large datasets | Multi-word responses require averaging, which loses phrase-level meaning |
| No inter-rater reliability concerns | Not validated for all object types or languages |

Handling the Fluency-Originality Confound

The Problem

Participants who generate more ideas (high fluency) have a higher probability of producing at least one statistically rare idea, inflating their originality scores (Silvia et al., 2008).

Solutions

  1. Ratio-based originality: Divide originality sum by fluency → proportion of original responses (Reiter-Palmon et al., 2019)
  2. Top-N scoring: Score only the top 2-3 most creative responses per participant, equalizing opportunity across fluency levels (Silvia et al., 2008)
  3. Statistical control: Include fluency as a covariate in analyses of originality (Lee & Chung, 2024)
  4. Multilevel modeling: Nest responses within participants, accounting for varying response counts

Recommendation: Use top-2 scoring (average the 2 most creative responses) when the primary interest is creative quality rather than quantity. This method has the best psychometric properties (Silvia et al., 2008).
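Solutions 1 and 2 are one-liners in practice; a minimal sketch (with hypothetical function names):

```python
def ratio_originality(originality_sum, fluency):
    """Proportion of original responses (ratio-based correction)."""
    return originality_sum / fluency if fluency else 0.0

def top_n_score(creativity_ratings, n=2):
    """Average the n most creative responses (top-N scoring)."""
    top = sorted(creativity_ratings, reverse=True)[:n]
    return sum(top) / len(top) if top else 0.0
```

For example, a participant with ratings [1, 4, 5, 2] gets a top-2 score of 4.5, independent of how many additional low-rated responses they produced.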

Common Pitfalls

  1. Scoring originality without normalizing text: "doorstop," "door stop," and "use as a door stop" are the same response. Normalize spelling, capitalization, and phrasing before computing frequency (Reiter-Palmon et al., 2019).

  2. Using statistical rarity with small samples: With N < 50, many responses appear "unique" simply because the sample is small. Use subjective ratings instead, or pool responses with published norms (Reiter-Palmon et al., 2019).

  3. Ignoring inter-rater reliability: Reporting subjective creativity scores without ICC suggests the scores may reflect individual rater bias, not genuine creativity differences. Always report ICC with the model type specified (Lee & Chung, 2024).

  4. Treating semantic distance as a complete creativity measure: Semantic distance captures novelty but not usefulness/appropriateness — the other key dimension of creativity (Runco & Jaeger, 2012). Combine with subjective ratings for a comprehensive assessment.

  5. Averaging semantic distance across all responses including poor ones: Low-quality responses (gibberish, conventional uses) can dilute or inflate average distance. Clean data before computing semantic distance.

  6. Not reporting which scoring method was used: Different methods yield different results. Always specify whether originality is statistical rarity, subjective rating, or semantic distance, and which threshold or scale was used.

Minimum Reporting Checklist

Based on Reiter-Palmon et al. (2019) and Silvia et al. (2008):

  • Scoring dimensions used (fluency, flexibility, originality, elaboration, semantic distance)
  • For fluency: definition of "valid response" and exclusion rules
  • For flexibility: category system used (COWA or custom) and category list
  • For originality: method (statistical rarity threshold or subjective rating scale)
  • For subjective scoring: number of raters, training procedure, ICC values (model type specified)
  • For semantic distance: embedding model, dimensionality, platform/implementation
  • How fluency-originality confound was handled (ratio, top-N, covariate, or acknowledged)
  • Data cleaning steps (text normalization, duplicate removal)
  • Whether scoring was blind to condition

References

  • Beaty, R. E., & Johnson, D. R. (2021). Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behavior Research Methods, 53(2), 757-780.
  • Guilford, J. P. (1967). The nature of human intelligence. McGraw-Hill.
  • Lee, B. C., & Chung, J. (2024). An empirical investigation of the impact of ChatGPT on creativity. Nature Human Behaviour. https://doi.org/10.1038/s41562-024-01953-1
  • Organisciak, P., Acar, S., Dumas, D., & Berthiaume, K. (2023). Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49, 101356.
  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
  • Reiter-Palmon, R., Forthmann, B., & Barbot, B. (2019). Scoring divergent thinking tests: A review and systematic framework. Psychology of Aesthetics, Creativity, and the Arts, 13(2), 144-152.
  • Runco, M. A., & Jaeger, G. J. (2012). The standard definition of creativity. Creativity Research Journal, 24(1), 92-96.
  • Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
  • Silvia, P. J., Winterstein, B. P., Willse, J. T., et al. (2008). Assessing creativity with divergent thinking tasks: Exploring the reliability and validity of new subjective scoring methods. Psychology of Aesthetics, Creativity, and the Arts, 2(2), 68-85.
  • Wallach, M. A., & Kogan, N. (1965). Modes of thinking in young children. Holt, Rinehart and Winston.

See references/scoring-rubric.md for detailed scoring examples and training materials.
