Divergent Thinking Scoring

Domain-validated multi-dimensional scoring system for divergent thinking tasks, including fluency, flexibility, originality, and automated semantic distance methods

Install skill "Divergent Thinking Scoring" with this command: npx skills add haoxuanlithuai/awesome_cognitive_and_neuroscience_skills/haoxuanlithuai-awesome-cognitive-and-neuroscience-skills-divergent-thinking-scoring


Purpose

This skill encodes expert methodological knowledge for scoring responses from divergent thinking tasks (Alternative Uses Task, Unusual Uses Task, instances tasks, etc.). It covers the four standard scoring dimensions — fluency, flexibility, originality, and elaboration — plus modern automated scoring using semantic distance. A general-purpose programmer would typically count responses (fluency) but would not know the domain-specific decisions around flexibility category systems, originality thresholds, inter-rater reliability requirements, or how to compute semantic distance as a creativity metric.

When to Use This Skill

  • Scoring responses from an AUT, Unusual Uses Task, or similar divergent thinking task
  • Choosing between subjective (human-rated) and objective (automated) scoring approaches
  • Computing semantic distance as an automated creativity metric
  • Establishing inter-rater reliability for creativity coding
  • Deciding how to handle the fluency-originality confound

Research Planning Protocol

Before executing the domain-specific steps below, you MUST:

  1. State the research question — What specific aspect of divergent thinking is being measured?
  2. Justify the method choice — Why these scoring dimensions? What alternatives were considered?
  3. Declare expected outcomes — Which dimensions are expected to show effects?
  4. Note assumptions and limitations — What does each scoring method assume? Where could it mislead?
  5. Present the plan to the user and WAIT for confirmation before proceeding.

For detailed methodology guidance, see the research-literacy skill.

⚠️ Verification Notice

This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.

Scoring Dimensions Overview

| Dimension | What It Measures | Scoring Method | Automation | Source |
|---|---|---|---|---|
| Fluency | Quantity of responses | Count valid responses | Fully automated | Guilford, 1967 |
| Flexibility | Variety of conceptual categories | Count distinct categories | Semi-automated (COWA) | Reiter-Palmon et al., 2019 |
| Originality | Statistical rarity or novelty | Frequency <5% threshold or subjective rating | Semi-automated | Silvia et al., 2008 |
| Elaboration | Detail and development of ideas | Count additional details per response | Manual only | Guilford, 1967 |
| Semantic distance | Conceptual remoteness from prompt | GloVe/word2vec cosine distance | Fully automated | Beaty & Johnson, 2021 |

Fluency Scoring

Definition

Fluency = the total number of valid, non-redundant responses a participant generates.

Scoring Rules

  1. Count each distinct response as one unit
  2. Exclude:
  • Exact duplicates
  • Conventional/typical uses of the object (debated — some protocols include them; Reiter-Palmon et al., 2019)
  • Gibberish or clearly irrelevant responses
  • Responses that are minor variations of each other (e.g., "use as a hammer" and "use to pound nails" = 1 response)
  3. When in doubt, count as separate responses and let originality scoring handle quality

Domain insight: Fluency is the most reliable but least interesting creativity measure. It correlates with personality traits (openness) and general cognitive ability, but does not distinguish truly creative responses from merely numerous ones (Silvia et al., 2008).
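The counting rules above can be sketched as a small helper. This is a minimal illustration, not a prescribed implementation: the function name is hypothetical, and `invalid` stands in for a coder-supplied list of gibberish or irrelevant responses. Merging minor variations (rule 4) still requires human judgment.

```python
import re

def fluency_score(responses, invalid=()):
    """Count valid, non-redundant responses (a minimal sketch).

    Removes exact duplicates after light normalization. Merging
    near-duplicates ("use as a hammer" vs. "use to pound nails")
    still requires manual review.
    """
    seen = set()
    count = 0
    for r in responses:
        norm = re.sub(r"\s+", " ", r.strip().lower())
        if not norm or norm in invalid:  # empty, gibberish, or irrelevant
            continue
        if norm in seen:                 # exact duplicate
            continue
        seen.add(norm)
        count += 1
    return count
```

For example, `fluency_score(["Doorstop", "doorstop ", "weapon", ""])` counts two distinct valid responses.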

Flexibility Scoring

Definition

Flexibility = the number of distinct conceptual categories across a participant's responses.

Category Systems

COWA (Category of Words from AUT) system (Reiter-Palmon et al., 2019):

Provides a standardized taxonomy of response categories for common AUT objects. Example categories for "brick":

| Category | Example Responses |
|---|---|
| Construction/Building | "build a wall," "build a house" |
| Weapon/Violence | "throw at someone," "use as a weapon" |
| Weight/Anchor | "paperweight," "doorstop," "anchor" |
| Art/Decoration | "sculpt into art," "garden decoration" |
| Sport/Exercise | "use as a dumbbell," "exercise weight" |
| Tool | "hammer," "grinding surface" |

Scoring Procedure

  1. Train coders on the category system (minimum 2 coders)
  2. Assign each response to its most appropriate category
  3. Count unique categories per participant = flexibility score
  4. Compute inter-rater reliability: ICC ≥ 0.70 acceptable, ≥ 0.80 good (Lee & Chung, 2024; Shrout & Fleiss, 1979)
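Steps 2-3 can be sketched as follows. The function name and the `category_map` dict are hypothetical; in practice the map is built by trained coders applying the COWA taxonomy, and unmapped responses should be routed back to coders rather than silently dropped.

```python
def flexibility_score(responses, category_map):
    """Flexibility = number of distinct categories across responses.

    `category_map` maps a normalized response to its category label.
    Returns (score, uncoded) so unmapped responses can be hand-coded.
    """
    categories = set()
    uncoded = []
    for r in responses:
        norm = r.strip().lower()
        if norm in category_map:
            categories.add(category_map[norm])
        else:
            uncoded.append(r)
    return len(categories), uncoded

# Hypothetical category assignments for "brick" responses
brick_map = {
    "build a wall": "Construction/Building",
    "build a house": "Construction/Building",
    "paperweight": "Weight/Anchor",
    "throw at someone": "Weapon/Violence",
}
```

Here `flexibility_score(["build a wall", "Build a house", "paperweight"], brick_map)` yields a flexibility score of 2, since the first two responses fall in the same category.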

Originality Scoring

Method 1: Statistical Rarity (Objective)

A response is "original" if it is given by fewer than 5% of the sample (Wallach & Kogan, 1965; Lee & Chung, 2024).

Procedure:

  1. Pool all responses across all participants
  2. Normalize spelling and phrasing (e.g., "door stop" = "doorstop")
  3. Compute the frequency of each unique response
  4. Mark responses given by <5% of participants as original (score = 1; else = 0)
  5. Originality score per participant = sum or proportion of original responses

Alternative thresholds: Some studies use <1% (very strict) or <10% (lenient). The 5% threshold is most common (Reiter-Palmon et al., 2019).
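The procedure can be sketched as below. This is an illustrative implementation, not a validated one: the function name is hypothetical, responses are assumed to be already normalized, and frequency is counted per participant (a participant repeating a response does not inflate its frequency).

```python
from collections import Counter

def originality_scores(responses_by_pid, threshold=0.05):
    """Score originality by statistical rarity.

    `responses_by_pid` maps participant id -> list of normalized
    responses. A response scores 1 if given by fewer than `threshold`
    of participants, else 0; the per-participant score is the sum.
    """
    n = len(responses_by_pid)
    # Count participants (not response tokens) giving each unique response
    freq = Counter()
    for resps in responses_by_pid.values():
        for r in set(resps):
            freq[r] += 1
    return {
        pid: sum(1 for r in set(resps) if freq[r] / n < threshold)
        for pid, resps in responses_by_pid.items()
    }
```

With a toy sample of three participants (and a deliberately lenient threshold, since 5% is meaningless at this N), a response given by only one participant scores as original while a shared response does not.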

Method 2: Subjective Rating (Qualitative)

Human raters judge each response for creativity on a Likert scale (Silvia et al., 2008).

Procedure:

  1. Scale: 1 (not at all creative) to 5 (highly creative) — or 1-7 for finer discrimination
  2. Raters: Minimum 2 independent raters (Lee & Chung, 2024 used 3)
  3. Training: Calibrate raters on anchor examples before scoring begins
  4. Inter-rater reliability: ICC (two-way random, average measures) ≥ 0.70 (Lee & Chung, 2024 achieved ICC = 0.72-0.89)
  5. Scoring: Average across raters for each response; then average or sum per participant
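The ICC variant named in step 4 (two-way random effects, average measures, ICC(2,k) in Shrout & Fleiss's notation) can be computed from a responses-by-raters matrix as sketched below. This is a pure-Python illustration of the textbook formula; for publication, a validated statistics package should be preferred.

```python
def icc2k(ratings):
    """ICC(2,k): two-way random effects, average measures
    (Shrout & Fleiss, 1979).

    `ratings` is a list of rows, one per rated response, with one
    column per rater (no missing values).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # rows (targets)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # columns (raters)
    sse = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)
```

Perfect rater agreement yields an ICC of 1.0; a constant between-rater offset (which average-measures ICC tolerates but does not ignore) yields a value slightly below 1.0.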

Method Selection Decision Logic

Is your sample size large (N > 100)?
|
+-- YES --> Do you need fine-grained creativity distinctions?
|           |
|           +-- YES --> Use subjective rating (richer information)
|           |
|           +-- NO --> Use statistical rarity (objective, faster)
|
+-- NO --> Statistical rarity is unreliable with small N
           (rare responses may be rare by chance)
           --> Use subjective rating
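For use in an analysis pipeline, the decision logic above reduces to a few lines (the function name is hypothetical):

```python
def choose_originality_method(n, need_fine_grained):
    """Pick an originality scoring method per the decision tree above."""
    if n <= 100:
        # Statistical rarity is unreliable with small N:
        # rare responses may be rare by chance.
        return "subjective rating"
    return "subjective rating" if need_fine_grained else "statistical rarity"
```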

Semantic Distance (Automated Scoring)

Overview

Semantic distance measures how conceptually far a response is from the prompt word in a vector space model. More distant = more creative (Beaty & Johnson, 2021).

Method (Beaty & Johnson, 2021; Organisciak et al., 2023)

  1. Embedding model: GloVe (Global Vectors for Word Representation; Pennington et al., 2014) trained on Common Crawl — 300-dimensional vectors
  2. Computation:
  • Represent the prompt word (e.g., "brick") as a vector
  • Represent each response as a vector (average word vectors for multi-word responses)
  • Compute cosine distance = 1 - cosine_similarity(prompt, response)
  3. Per-participant score: Average semantic distance across all responses
  4. Platform: SemDis (https://semdis.wlu.psu.edu/) — web-based tool by Beaty & Johnson (2021)
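The core computation (step 2) is straightforward, as the sketch below shows. In real use, `embeddings` would be loaded from pretrained 300-dimensional GloVe vectors or the computation delegated to SemDis; the tiny 2-D vectors in the example are purely illustrative and carry no semantic meaning.

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def response_vector(words, embeddings):
    """Average word vectors for a multi-word response."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Identical vectors give a distance of 0, orthogonal vectors a distance of 1; a participant's score is the mean of these distances across responses.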

Interpretation

| Semantic Distance | Interpretation |
|---|---|
| Low (~0.3-0.5) | Response is semantically close to the object (e.g., "brick" → "build a wall") |
| Medium (~0.5-0.7) | Moderately creative (e.g., "brick" → "use as a paperweight") |
| High (~0.7-1.0) | Highly creative / remote association (e.g., "brick" → "use as a canvas for art") |

Validation: Semantic distance correlates with subjective originality ratings at r ≈ 0.40-0.60 and predicts real-world creative achievement (Beaty & Johnson, 2021; Organisciak et al., 2023).

Advantages and Limitations

| Advantage | Limitation |
|---|---|
| Fully automated, no rater training | Misses context ("use as food" for a brick is unusual but gets a moderate distance score) |
| Objective and reproducible | Depends on the embedding model's training corpus |
| Scales to large datasets | Multi-word responses require averaging, which loses phrase-level meaning |
| No inter-rater reliability concerns | Not validated for all object types or languages |

Handling the Fluency-Originality Confound

The Problem

Participants who generate more ideas (high fluency) have a higher probability of producing at least one statistically rare idea, inflating their originality scores (Silvia et al., 2008).

Solutions

  1. Ratio-based originality: Divide originality sum by fluency → proportion of original responses (Reiter-Palmon et al., 2019)
  2. Top-N scoring: Score only the top 2-3 most creative responses per participant, equalizing opportunity across fluency levels (Silvia et al., 2008)
  3. Statistical control: Include fluency as a covariate in analyses of originality (Lee & Chung, 2024)
  4. Multilevel modeling: Nest responses within participants, accounting for varying response counts

Recommendation: Use top-2 scoring (average the 2 most creative responses) when the primary interest is creative quality rather than quantity. This method has the best psychometric properties (Silvia et al., 2008).
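Solutions 1 and 2 are one-liners in practice; a minimal sketch (with hypothetical function names):

```python
def ratio_originality(originality_sum, fluency):
    """Proportion of original responses (ratio-based correction)."""
    return originality_sum / fluency if fluency else 0.0

def top_n_score(creativity_ratings, n=2):
    """Average the n most creative responses (top-N scoring)."""
    top = sorted(creativity_ratings, reverse=True)[:n]
    return sum(top) / len(top) if top else 0.0
```

For example, a participant with ratings [1, 4, 5, 2] gets a top-2 score of 4.5, independent of how many additional low-rated responses they produced.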

Common Pitfalls

  1. Scoring originality without normalizing text: "doorstop," "door stop," and "use as a door stop" are the same response. Normalize spelling, capitalization, and phrasing before computing frequency (Reiter-Palmon et al., 2019).

  2. Using statistical rarity with small samples: With N < 50, many responses appear "unique" simply because the sample is small. Use subjective ratings instead, or pool responses with published norms (Reiter-Palmon et al., 2019).

  3. Ignoring inter-rater reliability: Reporting subjective creativity scores without ICC suggests the scores may reflect individual rater bias, not genuine creativity differences. Always report ICC with the model type specified (Lee & Chung, 2024).

  4. Treating semantic distance as a complete creativity measure: Semantic distance captures novelty but not usefulness/appropriateness — the other key dimension of creativity (Runco & Jaeger, 2012). Combine with subjective ratings for a comprehensive assessment.

  5. Averaging semantic distance across all responses including poor ones: Low-quality responses (gibberish, conventional uses) can dilute or inflate average distance. Clean data before computing semantic distance.

  6. Not reporting which scoring method was used: Different methods yield different results. Always specify whether originality is statistical rarity, subjective rating, or semantic distance, and which threshold or scale was used.

Minimum Reporting Checklist

Based on Reiter-Palmon et al. (2019) and Silvia et al. (2008):

  • Scoring dimensions used (fluency, flexibility, originality, elaboration, semantic distance)
  • For fluency: definition of "valid response" and exclusion rules
  • For flexibility: category system used (COWA or custom) and category list
  • For originality: method (statistical rarity threshold or subjective rating scale)
  • For subjective scoring: number of raters, training procedure, ICC values (model type specified)
  • For semantic distance: embedding model, dimensionality, platform/implementation
  • How fluency-originality confound was handled (ratio, top-N, covariate, or acknowledged)
  • Data cleaning steps (text normalization, duplicate removal)
  • Whether scoring was blind to condition

References

  • Beaty, R. E., & Johnson, D. R. (2021). Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behavior Research Methods, 53(2), 757-780.
  • Guilford, J. P. (1967). The nature of human intelligence. McGraw-Hill.
  • Lee, B. C., & Chung, J. (2024). An empirical investigation of the impact of ChatGPT on creativity. Nature Human Behaviour. https://doi.org/10.1038/s41562-024-01953-1
  • Organisciak, P., Acar, S., Dumas, D., & Berthiaume, K. (2023). Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49, 101356.
  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
  • Reiter-Palmon, R., Forthmann, B., & Barbot, B. (2019). Scoring divergent thinking tests: A review and systematic framework. Psychology of Aesthetics, Creativity, and the Arts, 13(2), 144-152.
  • Runco, M. A., & Jaeger, G. J. (2012). The standard definition of creativity. Creativity Research Journal, 24(1), 92-96.
  • Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
  • Silvia, P. J., Winterstein, B. P., Willse, J. T., et al. (2008). Assessing creativity with divergent thinking tasks: Exploring the reliability and validity of new subjective scoring methods. Psychology of Aesthetics, Creativity, and the Arts, 2(2), 68-85.
  • Wallach, M. A., & Kogan, N. (1965). Modes of thinking in young children. Holt, Rinehart and Winston.

See references/scoring-rubric.md for detailed scoring examples and training materials.
