Sentence Stimulus Norming

Specifies norming procedures for linguistic stimuli, including cloze probability, plausibility ratings, acceptability judgments, and lexical controls.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "Sentence Stimulus Norming" with this command: npx skills add haoxuanlithuai/awesome_cognitive_and_neuroscience_skills/haoxuanlithuai-awesome-cognitive-and-neuroscience-skills-sentence-stimulus-norming


Purpose

This skill encodes expert methodological knowledge for norming linguistic stimuli before running psycholinguistic experiments. A competent programmer without linguistics training would likely construct stimuli from intuition: failing to control critical lexical variables (word frequency, length, neighborhood density), skipping cloze norming, using inappropriate rating scales, or under-powering the norming study. Poor stimulus norming is the single most common methodological weakness in psycholinguistic research, because confounds in the materials propagate to every analysis.

When to Use

Use this skill when:

  • Creating sentence stimuli for reading experiments (self-paced reading, eye-tracking, ERP)
  • Norming the predictability (cloze probability) of critical words in sentence contexts
  • Collecting plausibility, naturalness, or acceptability ratings for sentence materials
  • Controlling lexical properties of critical words across experimental conditions
  • Designing Latin square counterbalancing for within-item designs
  • Planning filler items and practice trials

Do not use this skill when:

  • Working with single-word stimuli without sentence context (use lexical database tools directly)
  • Designing non-linguistic stimuli (visual search arrays, tones)
  • Analyzing existing normed materials without creating new ones

Research Planning Protocol

Before executing the domain-specific steps below, you MUST:

  1. State the research question -- What specific question is this analysis/paradigm addressing?
  2. Justify the method choice -- Why is this approach appropriate? What alternatives were considered?
  3. Declare expected outcomes -- What results would support vs. refute the hypothesis?
  4. Note assumptions and limitations -- What does this method assume? Where could it mislead?
  5. Present the plan to the user and WAIT for confirmation before proceeding.

For detailed methodology guidance, see the research-literacy skill.

⚠️ Verification Notice

This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.

Cloze Probability Norming

What Is Cloze Probability?

Cloze probability is the proportion of people who complete a sentence fragment with a particular word (Taylor, 1953). It is the standard measure of a word's predictability in context and is a critical control variable in nearly all sentence processing research.

Procedure

  1. Create sentence fragments: Truncate each sentence immediately before the critical word
  2. Present fragments one at a time to participants
  3. Instruct: "Please complete each sentence with the first word that comes to mind. Write only one word."
  4. Score: For each item, cloze probability = (number of completions matching the target word) / (total number of respondents)

Design Parameters

| Parameter | Recommended Value | Citation / Rationale |
|---|---|---|
| N per item | Minimum 30 raters | Taylor, 1953; Bloom & Fischler, 1980; standard minimum for stable estimates |
| Preferred N | 40-50 raters | More stable estimates, especially for medium-cloze items |
| Items per participant | 50-100 fragments per norming session | Avoid fatigue; pilot to calibrate |
| Time limit | ~10-15 seconds per item, or untimed | Untimed is standard; a brief limit prevents overthinking |
| Population | Same as experimental population (e.g., native English speakers, same age range) | Ensures cloze values generalize |

Scoring Conventions

  • Exact match: Only the target word counts (standard)
  • Morphological variants: Decide a priori whether "run" and "running" count as the same completion. Standard practice: count only the exact form (Staub et al., 2015)
  • Spelling errors: Accept obvious misspellings of the target
  • Blank/nonsense responses: Exclude from the denominator (participant did not engage)

Cloze Probability Benchmarks

| Cloze Range | Label | Use Case |
|---|---|---|
| > 0.80 | High cloze / highly predictable | N400 amplitude studies; predictability effects (Kutas & Hillyard, 1984) |
| 0.30 - 0.70 | Medium cloze | Moderate predictability manipulations |
| < 0.10 | Low cloze / unpredictable | Baseline; unexpected completions |
| 0.00 | Zero cloze | Anomalous or implausible continuations |
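A small helper can map observed cloze values onto these conventional bands (the function name is hypothetical; note that the bands leave gaps at 0.10-0.30 and 0.70-0.80, where values have no conventional label):

```python
def cloze_band(p):
    """Label a cloze probability using the benchmark table above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("cloze probability must be in [0, 1]")
    if p == 0.0:
        return "zero cloze"
    if p < 0.10:
        return "low cloze"
    if 0.30 <= p <= 0.70:
        return "medium cloze"
    if p > 0.80:
        return "high cloze"
    return "between conventional bands"

print(cloze_band(0.92))  # → high cloze
print(cloze_band(0.20))  # → between conventional bands
```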

Online vs. Lab Norming

| Aspect | Lab | Online (e.g., Prolific, MTurk) |
|---|---|---|
| Quality control | Direct observation | Must include catch trials and attention checks |
| Sample size | Limited by lab capacity | Easy to reach N = 40-50 per item |
| Population | Typically university students | More diverse; specify inclusion criteria |
| Validity | Gold standard | Comparable for cloze (Schütze & Sprouse, 2014) |
| Cost | Lab time | Participant payment (~$10-15/hour; Prolific standards) |

Recommendation for online norming: Include 10-15% catch trials (sentences with obvious completions, e.g., "The dog chased the ___") and exclude participants who fail > 20% of catch trials.

Plausibility and Naturalness Ratings

When to Collect

  • When cloze probability alone is insufficient (e.g., both conditions have low cloze but differ in plausibility)
  • When manipulating semantic fit or thematic role plausibility
  • When verifying that "anomalous" conditions are genuinely perceived as odd

Rating Scale Design

| Parameter | Recommended | Citation / Rationale |
|---|---|---|
| Scale type | Likert scale | Standard for sentence ratings (Schütze & Sprouse, 2014) |
| Number of points | 7-point scale | Balances sensitivity and reliability; standard in psycholinguistics (Schütze & Sprouse, 2014) |
| Anchors | 1 = "very unnatural/implausible" to 7 = "very natural/plausible" | Labeled endpoints with unlabeled intermediate points |
| N per item | Minimum 20 raters; preferred 30+ | Sufficient for stable means per item (Sprouse & Almeida, 2012) |
| Items per rater | 40-80 items per session | Avoids fatigue effects |
| Practice items | 3-5 items spanning the full range, before data collection | Calibrates scale use |

Instructions Template

"You will read a series of sentences. For each sentence, please rate how natural or plausible it sounds on a scale from 1 to 7, where 1 means 'very unnatural / makes no sense' and 7 means 'perfectly natural / makes complete sense.' There are no right or wrong answers; we are interested in your intuition."

Critical Design Considerations

  • Within-list design: Each rater sees only one version of each item (Latin square). Raters should never see multiple conditions of the same item, or they will rate contrastively rather than absolutely.
  • Filler items: Include filler sentences spanning the full rating range. This prevents range restriction.
  • Order effects: Randomize item order per participant.
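For analysis and reporting, per-item condition means and SDs can be summarized once ratings have been pooled across the Latin-square lists. A sketch assuming ratings are stored as `(item_id, condition, rating)` tuples; the data here are invented and far smaller than a real norming sample (each cell would normally pool 20-30 raters):

```python
from collections import defaultdict
from statistics import mean, stdev

# Pooled across lists; each rater contributed only one condition per item.
ratings = [
    (1, "plausible", 6), (1, "plausible", 7),
    (1, "implausible", 2), (1, "implausible", 3),
    (2, "plausible", 5), (2, "plausible", 6),
    (2, "implausible", 1), (2, "implausible", 2),
]

by_cell = defaultdict(list)
for item, cond, r in ratings:
    by_cell[(item, cond)].append(r)

for (item, cond), rs in sorted(by_cell.items()):
    print(f"item {item:>2} {cond:<12} M = {mean(rs):.2f}  SD = {stdev(rs):.2f}")
```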

Acceptability Judgments

When to Collect

  • When manipulating syntactic structure (grammaticality, island constraints, movement dependencies)
  • When testing formal linguistic predictions about sentence well-formedness
  • For factorial designs crossing syntactic factors (e.g., 2x2 designs testing island effects; Sprouse et al., 2012)

Rating Methods

| Method | Description | Pros | Cons | Citation |
|---|---|---|---|---|
| Likert scale (7-point) | Rate acceptability 1-7 | Simple; familiar; sufficient for most purposes | Ceiling/floor effects possible; ordinal data | Schütze & Sprouse, 2014 |
| Magnitude estimation (ME) | Assign a number proportional to perceived acceptability relative to a reference sentence | Unbounded scale; ratio-level data (in theory) | More complex; participants need training; debated whether it outperforms Likert | Bard et al., 1996; Sprouse, 2011 |
| Forced choice | Choose the more acceptable of two sentences | Binary; easy; avoids scale-use differences | Low sensitivity; many trials needed | Sprouse & Almeida, 2012 |
| Yes/No judgment | "Is this sentence acceptable?" | Simple; binary | Very low sensitivity; cannot distinguish degrees of unacceptability | -- |

Recommendation: Use 7-point Likert as the default. It provides sufficient sensitivity for most research questions and has been shown to replicate formal linguistic judgments as reliably as magnitude estimation (Sprouse & Almeida, 2012; Sprouse, 2011).

Sample Size for Acceptability

| Design | Minimum N | Rationale | Citation |
|---|---|---|---|
| Simple grammatical/ungrammatical | 20 participants | Large effect sizes (d > 1.0 typical) | Sprouse & Almeida, 2012 |
| Factorial (2x2) with interaction | 30-40 participants | Interaction effects are smaller | Sprouse et al., 2012 |
| Subtle contrasts | 50+ participants | Small effect sizes require more power | Power analysis recommended |

Lexical Controls

Variables That Must Be Controlled Across Conditions

Every critical word manipulation must control for confounding lexical variables. The target word and its condition-matched alternatives should be equated on the following:

| Variable | Database / Source | Why It Matters | Citation |
|---|---|---|---|
| Word frequency | SUBTLEX-US (log10 word frequency per million) | Most powerful predictor of reading time; ~30-60 ms effect for high vs. low frequency | Brysbaert & New, 2009 |
| Word length | Character count | Longer words yield longer reading times; ~20-30 ms per character | Rayner, 1998; Rayner, 2009 |
| Orthographic neighborhood density (N) | N-Watch; CLEARPOND | Number of words differing by one letter; affects lexical access (Coltheart et al., 1977) | Andrews, 1997 |
| Concreteness | Brysbaert et al. (2014) ratings | Concrete words are processed faster than abstract words | Brysbaert et al., 2014 |
| Age of acquisition (AoA) | Kuperman et al. (2012) ratings | Earlier-acquired words are processed faster | Kuperman et al., 2012 |
| Number of syllables | Any pronunciation dictionary | Affects phonological processing time | Rayner, 1998 |
| Morphological complexity | Manual coding | Derived words (e.g., un-happi-ness) are processed differently than monomorphemic words | Taft, 2004 |

Frequency Database Selection

| Database | Language | Measure | Recommended? | Citation |
|---|---|---|---|---|
| SUBTLEX-US | English (US) | Subtitle-based frequency per million | Yes -- best predictor of processing times | Brysbaert & New, 2009 |
| SUBTLEX-UK | English (UK) | Subtitle-based frequency | Yes, for British English materials | van Heuven et al., 2014 |
| HAL | English | Usenet corpus frequency | Outdated; SUBTLEX preferred | Lund & Burgess, 1996 |
| CELEX | English, Dutch, German | Mixed corpus frequency | Acceptable, but less predictive than SUBTLEX | Baayen et al., 1995 |

Key recommendation: Use SUBTLEX log frequency values. They explain more variance in lexical decision and naming times than older norms (Brysbaert & New, 2009).

How to Match Across Conditions

  1. Select critical words for each condition
  2. Retrieve lexical metrics from SUBTLEX-US and norming databases
  3. Compute condition means for each metric
  4. Test for differences: Run t-tests or ANOVAs across conditions on each lexical variable
  5. Criterion: No significant differences (p > 0.20 is a reasonable threshold; some use p > 0.30) on any controlled variable
  6. If matching fails: replace items or add the unmatched variable as a covariate in the analysis
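Step 4 can be sketched as a Welch t-test per lexical variable, applying the criterion from step 5. This assumes SciPy is available; the log-frequency values below are invented for illustration, and a real check would loop over every controlled variable:

```python
from scipy import stats

# Log10 SUBTLEX frequencies for the critical words in two conditions
# (illustrative values; retrieve real ones from the database).
freq_a = [3.1, 2.8, 3.4, 2.9, 3.2, 3.0, 2.7, 3.3]
freq_b = [3.0, 2.9, 3.3, 3.1, 2.8, 3.2, 2.9, 3.1]

t, p = stats.ttest_ind(freq_a, freq_b, equal_var=False)  # Welch's t-test
print(f"log frequency: t = {t:.2f}, p = {p:.3f}")
if p <= 0.20:  # matching criterion from step 5
    print("conditions differ on frequency -- replace items or add covariate")
else:
    print("frequency adequately matched (p > .20)")
```

Note that "no significant difference" on a lenient threshold is the conventional practice reported here, not proof of equivalence; with very small item sets the test has little power to detect a real mismatch.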

Latin Square Counterbalancing

Purpose

In a within-item design, each item appears in all conditions, but each participant sees each item in only one condition. A Latin square assigns items to conditions across participant lists.

Construction

For a design with k conditions and n items (where n is divisible by k):

  1. Divide items into k groups of n/k items each
  2. Create k lists; in each list, assign each item group to a different condition
  3. Each participant receives one list
  4. Result: every item appears in every condition across participants; each participant sees an equal number of items per condition
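The construction above can be sketched as a rotation of condition labels over item groups (the helper name is hypothetical):

```python
def latin_square_lists(item_ids, conditions):
    """Build k presentation lists for k conditions (steps 1-3 above).

    Returns one dict per list, mapping item id -> assigned condition.
    """
    k, n = len(conditions), len(item_ids)
    assert n % k == 0, "number of items must be divisible by k"
    groups = [item_ids[i * n // k:(i + 1) * n // k] for i in range(k)]
    lists = []
    for shift in range(k):  # each list rotates the group->condition mapping
        assignment = {}
        for g, group in enumerate(groups):
            for item in group:
                assignment[item] = conditions[(g + shift) % k]
        lists.append(assignment)
    return lists

# 40 items, 2 conditions -- matches the 2-condition example below
lists = latin_square_lists(list(range(1, 41)), ["A", "B"])
print(lists[0][1], lists[0][21])  # List 1: A B
print(lists[1][1], lists[1][21])  # List 2: B A
```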

Example: 2-Condition Design

With 40 items and 2 conditions (A, B):

| List | Items 1-20 | Items 21-40 |
|---|---|---|
| List 1 | Condition A | Condition B |
| List 2 | Condition B | Condition A |

Requirements

| Parameter | Value | Rationale |
|---|---|---|
| Minimum items per condition per list | 16-24 | Standard for psycholinguistic experiments; fewer items = lower power (Brysbaert & Stevens, 2018) |
| Recommended items | 24-40 per condition | More stable estimates, especially for eye-tracking |
| Participants per list | Equal across lists; minimum 4-6 per list | Ensures balanced representation |
| Total participants | Divisible by the number of lists | Critical for a balanced design |

Filler Items

Purpose

Fillers prevent participants from noticing the experimental manipulation and adopting strategies.

Design Parameters

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Filler-to-target ratio | 2:1 or 3:1 (fillers:targets) | Standard in psycholinguistics; prevents pattern detection (Schütze & Sprouse, 2014) |
| Filler diversity | Span the full range of sentence types, lengths, and structures | Prevents target sentences from standing out |
| Filler acceptability range | Include some clearly good and some mildly awkward fillers | Prevents raters from using only part of the scale |
| Filler length | Match the average length of target sentences | Controls for sentence-length expectations |

Filler Construction Tips

  • Use fillers from different syntactic constructions than your targets
  • Include some fillers with comprehension questions (for reading studies) to maintain attentive reading
  • If targets are semantically anomalous, include some fillers that are also slightly odd (but in different ways) so anomaly is not a cue

Practice and Warm-Up Items

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Number of practice items | 4-6 items (minimum 3) | Familiarizes participants with the task and interface |
| Practice item composition | Span the range of difficulty/acceptability | Calibrates participant expectations |
| Practice data | Always exclude from analysis | Practice responses are contaminated by learning effects |
| Warm-up items at start of main experiment | 2-3 additional filler items | Allow settling into the task; exclude from analysis |

Online Norming Considerations

Platform Recommendations

| Platform | Pros | Cons | Typical Pay Rate |
|---|---|---|---|
| Prolific | Diverse participants; pre-screening; good data quality | Smaller pool than MTurk | ~$10-15/hour (Prolific minimum: $8/hour) |
| Amazon MTurk | Large pool; fast recruitment | Lower data quality; less diverse; requires careful screening | ~$10-15/hour recommended |
| PCIbex / Ibex Farm | Free hosting; designed for linguistics | Requires programming; no built-in recruitment | (hosting only) |
| Gorilla | GUI-based; good for complex designs | Subscription cost | (hosting only) |

Quality Control for Online Studies

| Measure | Implementation | Threshold |
|---|---|---|
| Catch trials | Include 10-15% filler items with obvious answers | Exclude participants failing > 20% |
| Completion time | Record total time | Exclude participants completing in < 50% of the median time |
| Straight-lining | Check for the same response on all items | Exclude participants with zero variance in ratings |
| Bot detection | Include reCAPTCHA or similar | Exclude flagged responses |
| Native speaker check | Self-report + brief language background questionnaire | Exclude non-native speakers (unless studying L2) |
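The first three exclusion rules can be applied in one pass over the data. This is a sketch with invented participant records; the thresholds follow the table above, and real pipelines would also log which rule triggered each exclusion for reporting:

```python
from statistics import median, pvariance

# One record per participant (illustrative data).
participants = {
    "p01": {"catch_correct": 10, "catch_total": 10, "minutes": 22, "ratings": [5, 6, 4, 7]},
    "p02": {"catch_correct": 7,  "catch_total": 10, "minutes": 21, "ratings": [4, 5, 6, 5]},
    "p03": {"catch_correct": 10, "catch_total": 10, "minutes": 8,  "ratings": [5, 4, 6, 3]},
    "p04": {"catch_correct": 9,  "catch_total": 10, "minutes": 20, "ratings": [4, 4, 4, 4]},
}

median_time = median(p["minutes"] for p in participants.values())

def keep(p):
    fail_rate = 1 - p["catch_correct"] / p["catch_total"]
    return (fail_rate <= 0.20                       # catch-trial failures
            and p["minutes"] >= 0.5 * median_time   # rushed completion
            and pvariance(p["ratings"]) > 0)        # straight-lining

kept = [pid for pid, p in participants.items() if keep(p)]
print(kept)  # → ['p01']  (p02 fails catch trials, p03 too fast, p04 straight-lines)
```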

Common Pitfalls

  1. Not norming cloze probability: Claiming words are "predictable" or "unpredictable" based on experimenter intuition rather than empirical cloze norms. Always collect cloze data (Taylor, 1953).

  2. Too few raters per item: With N < 20 raters for cloze, individual item estimates are unstable. A word with true cloze of 0.50 could yield observed cloze of 0.20-0.80 with only 10 raters. Use minimum 30 raters (Bloom & Fischler, 1980).

  3. Not controlling word frequency: Frequency is the strongest single predictor of reading time. A 1 log-unit difference in SUBTLEX frequency corresponds to ~30-40 ms in gaze duration (Brysbaert & New, 2009; Rayner, 1998). Always match or control.

  4. Using the wrong frequency database: HAL and Kucera-Francis norms are outdated. SUBTLEX-US explains significantly more variance in behavioral data (Brysbaert & New, 2009).

  5. Showing raters multiple conditions of the same item: This introduces contrastive evaluation. Raters must see each item in only one condition (Latin square for norming too).

  6. Insufficient filler items: A 1:1 target-to-filler ratio makes the manipulation transparent. Use at least 2:1 fillers to targets (Schütze & Sprouse, 2014).

  7. Not piloting the norming study: Always pilot with 5-10 participants to catch unclear instructions, ambiguous items, and timing issues before running the full norming sample.

  8. Ignoring age of acquisition: AoA effects are independent of frequency (Kuperman et al., 2012). Failing to control AoA can introduce confounds, especially for studies comparing concrete vs. abstract words.

Minimum Reporting Checklist

Based on Schütze & Sprouse (2014) and current psycholinguistic standards:

  • Number of items per condition
  • Cloze probability values: mean, SD, and range per condition (if collected)
  • Cloze norming details: N raters, population, procedure, scoring criteria
  • Plausibility/acceptability ratings: scale type, N raters, mean and SD per condition
  • Lexical control variables: list each controlled variable, database source, and condition means
  • Statistical test confirming conditions do not differ on controlled variables
  • Latin square design: number of lists, items per list per condition, participants per list
  • Filler-to-target ratio and description of filler types
  • Number of practice/warm-up items
  • For online norming: platform, pay rate, attention check procedure, exclusion criteria and N excluded
  • Full item list (in supplementary materials or online repository)

References

  • Andrews, S. (1997). The effect of orthographic similarity on lexical retrieval: Resolving neighborhood conflicts. Psychonomic Bulletin & Review, 4, 439-461.
  • Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
  • Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania.
  • Bard, E. G., Robertson, D., & Sorace, A. (1996). Magnitude estimation of linguistic acceptability. Language, 72, 32-68.
  • Bloom, P. A., & Fischler, I. (1980). Completion norms for 329 sentence contexts. Memory & Cognition, 8, 631-642.
  • Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990.
  • Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1, 9.
  • Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904-911.
  • Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance VI. Hillsdale, NJ: Erlbaum.
  • Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44, 978-990.
  • Kutas, M., & Hillyard, S. A. (1984). Brain potentials during reading reflect word expectancy and semantic association. Nature, 307, 161-163.
  • Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203-208.
  • Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372-422.
  • Rayner, K. (2009). Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology, 62, 1457-1506.
  • Schütze, C. T., & Sprouse, J. (2014). Judgment data. In R. J. Podesva & D. Sharma (Eds.), Research methods in linguistics. Cambridge University Press.
  • Sprouse, J. (2011). A test of the cognitive assumptions of magnitude estimation: Commutativity does not hold for acceptability judgments. Language, 87, 274-288.
  • Sprouse, J., & Almeida, D. (2012). Assessing the reliability of textbook data in syntax: Adger's Core Syntax. Journal of Linguistics, 48, 609-652.
  • Sprouse, J., Schütze, C. T., & Almeida, D. (2012). A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001-2010. Lingua, 134, 219-248.
  • Staub, A., Grant, M., Astheimer, L., & Cohen, A. (2015). The influence of cloze probability and item constraint on cloze task response time. Journal of Memory and Language, 82, 1-17.
  • Taft, M. (2004). Morphological decomposition and the reverse base frequency effect. Quarterly Journal of Experimental Psychology, 57A, 745-765.
  • Taylor, W. L. (1953). "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30, 415-433.
  • van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190.

See references/lexical-databases-guide.md for detailed instructions on accessing and querying lexical control databases.
