Sentence Stimulus Norming

Specifies norming procedures for linguistic stimuli, including cloze probability, plausibility ratings, acceptability judgments, and lexical controls.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "Sentence Stimulus Norming" with this command: npx skills add haoxuanlithuai/awesome_cognitive_and_neuroscience_skills/haoxuanlithuai-awesome-cognitive-and-neuroscience-skills-sentence-stimulus-norming


Purpose

This skill encodes expert methodological knowledge for norming linguistic stimuli before running psycholinguistic experiments. A competent programmer without linguistics training would likely construct stimuli from intuition: failing to control critical lexical variables (word frequency, length, neighborhood density), skipping cloze norming, using inappropriate rating scales, or under-powering the norming study. Poor stimulus norming is the single most common methodological weakness in psycholinguistic research, because confounds in the materials propagate to every analysis.

When to Use

Use this skill when:

  • Creating sentence stimuli for reading experiments (self-paced reading, eye-tracking, ERP)
  • Norming the predictability (cloze probability) of critical words in sentence contexts
  • Collecting plausibility, naturalness, or acceptability ratings for sentence materials
  • Controlling lexical properties of critical words across experimental conditions
  • Designing Latin square counterbalancing for within-item designs
  • Planning filler items and practice trials

Do not use this skill when:

  • Working with single-word stimuli without sentence context (use lexical database tools directly)
  • Designing non-linguistic stimuli (visual search arrays, tones)
  • Analyzing existing normed materials without creating new ones

Research Planning Protocol

Before executing the domain-specific steps below, you MUST:

  1. State the research question -- What specific question is this analysis/paradigm addressing?
  2. Justify the method choice -- Why is this approach appropriate? What alternatives were considered?
  3. Declare expected outcomes -- What results would support vs. refute the hypothesis?
  4. Note assumptions and limitations -- What does this method assume? Where could it mislead?
  5. Present the plan to the user and WAIT for confirmation before proceeding.

For detailed methodology guidance, see the research-literacy skill.

⚠️ Verification Notice

This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.

Cloze Probability Norming

What Is Cloze Probability?

Cloze probability is the proportion of people who complete a sentence fragment with a particular word (Taylor, 1953). It is the standard measure of a word's predictability in context and is a critical control variable in nearly all sentence processing research.

Procedure

  1. Create sentence fragments: Truncate each sentence immediately before the critical word
  2. Present fragments one at a time to participants
  3. Instruct: "Please complete each sentence with the first word that comes to mind. Write only one word."
  4. Score: For each item, cloze probability = (number of completions matching the target word) / (total number of respondents)

Design Parameters

| Parameter | Recommended Value | Citation / Rationale |
|---|---|---|
| N per item | Minimum 30 raters | Taylor, 1953; Bloom & Fischler, 1980; standard minimum for stable estimates |
| Preferred N | 40-50 raters | More stable estimates, especially for medium-cloze items |
| Items per participant | 50-100 fragments per norming session | Avoid fatigue; pilot to calibrate |
| Time limit | ~10-15 seconds per item, or untimed | Untimed is standard; a brief limit prevents overthinking |
| Population | Same as experimental population (e.g., native English speakers, same age range) | Ensures cloze values generalize |

Scoring Conventions

  • Exact match: Only the target word counts (standard)
  • Morphological variants: Decide a priori whether "run" and "running" count as the same completion. Standard practice: count only the exact form (Staub et al., 2015)
  • Spelling errors: Accept obvious misspellings of the target
  • Blank/nonsense responses: Exclude from the denominator (participant did not engage)

Cloze Probability Benchmarks

| Cloze Range | Label | Use Case |
|---|---|---|
| > 0.80 | High cloze / highly predictable | N400 amplitude studies; predictability effects (Kutas & Hillyard, 1984) |
| 0.30 - 0.70 | Medium cloze | Moderate predictability manipulations |
| < 0.10 | Low cloze / unpredictable | Baseline; unexpected completions |
| 0.00 | Zero cloze | Anomalous or implausible continuations |
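A small helper can map observed cloze values onto these conventional bands (the function name is hypothetical; note that the bands leave gaps at 0.10-0.30 and 0.70-0.80, where values have no conventional label):

```python
def cloze_band(p):
    """Label a cloze probability using the benchmark table above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("cloze probability must be in [0, 1]")
    if p == 0.0:
        return "zero cloze"
    if p < 0.10:
        return "low cloze"
    if 0.30 <= p <= 0.70:
        return "medium cloze"
    if p > 0.80:
        return "high cloze"
    return "between conventional bands"

print(cloze_band(0.92))  # → high cloze
print(cloze_band(0.20))  # → between conventional bands
```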

Online vs. Lab Norming

| Aspect | Lab | Online (e.g., Prolific, MTurk) |
|---|---|---|
| Quality control | Direct observation | Must include catch trials and attention checks |
| Sample size | Limited by lab capacity | Easy to reach N = 40-50 per item |
| Population | Typically university students | More diverse; specify inclusion criteria |
| Validity | Gold standard | Comparable for cloze (Schütze & Sprouse, 2014) |
| Cost | Lab time | Participant payment (~$10-15/hour; Prolific standards) |

Recommendation for online norming: Include 10-15% catch trials (sentences with obvious completions, e.g., "The dog chased the ___") and exclude participants who fail > 20% of catch trials.

Plausibility and Naturalness Ratings

When to Collect

  • When cloze probability alone is insufficient (e.g., both conditions have low cloze but differ in plausibility)
  • When manipulating semantic fit or thematic role plausibility
  • When verifying that "anomalous" conditions are genuinely perceived as odd

Rating Scale Design

| Parameter | Recommended | Citation / Rationale |
|---|---|---|
| Scale type | Likert scale | Standard for sentence ratings (Schütze & Sprouse, 2014) |
| Number of points | 7-point scale | Balances sensitivity and reliability; standard in psycholinguistics (Schütze & Sprouse, 2014) |
| Anchors | 1 = "very unnatural/implausible" to 7 = "very natural/plausible" | Labeled endpoints with unlabeled intermediate points |
| N per item | Minimum 20 raters; preferred 30+ | Sufficient for stable means per item (Sprouse & Almeida, 2012) |
| Items per rater | 40-80 items per session | Avoids fatigue effects |
| Practice items | 3-5 items spanning the full range, before data collection | Calibrates scale use |

Instructions Template

"You will read a series of sentences. For each sentence, please rate how natural or plausible it sounds on a scale from 1 to 7, where 1 means 'very unnatural / makes no sense' and 7 means 'perfectly natural / makes complete sense.' There are no right or wrong answers; we are interested in your intuition."

Critical Design Considerations

  • Within-list design: Each rater sees only one version of each item (Latin square). Raters should never see multiple conditions of the same item, or they will rate contrastively rather than absolutely.
  • Filler items: Include filler sentences spanning the full rating range. This prevents range restriction.
  • Order effects: Randomize item order per participant.
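For analysis and reporting, per-item condition means and SDs can be summarized once ratings have been pooled across the Latin-square lists. A sketch assuming ratings are stored as `(item_id, condition, rating)` tuples; the data here are invented and far smaller than a real norming sample (each cell would normally pool 20-30 raters):

```python
from collections import defaultdict
from statistics import mean, stdev

# Pooled across lists; each rater contributed only one condition per item.
ratings = [
    (1, "plausible", 6), (1, "plausible", 7),
    (1, "implausible", 2), (1, "implausible", 3),
    (2, "plausible", 5), (2, "plausible", 6),
    (2, "implausible", 1), (2, "implausible", 2),
]

by_cell = defaultdict(list)
for item, cond, r in ratings:
    by_cell[(item, cond)].append(r)

for (item, cond), rs in sorted(by_cell.items()):
    print(f"item {item:>2} {cond:<12} M = {mean(rs):.2f}  SD = {stdev(rs):.2f}")
```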

Acceptability Judgments

When to Collect

  • When manipulating syntactic structure (grammaticality, island constraints, movement dependencies)
  • When testing formal linguistic predictions about sentence well-formedness
  • For factorial designs crossing syntactic factors (e.g., 2x2 designs testing island effects; Sprouse et al., 2012)

Rating Methods

| Method | Description | Pros | Cons | Citation |
|---|---|---|---|---|
| Likert scale (7-point) | Rate acceptability 1-7 | Simple; familiar; sufficient for most purposes | Ceiling/floor effects possible; ordinal data | Schütze & Sprouse, 2014 |
| Magnitude estimation (ME) | Assign a number proportional to perceived acceptability relative to a reference sentence | Unbounded scale; ratio-level data (in theory) | More complex; participants need training; debated whether it outperforms Likert | Bard et al., 1996; Sprouse, 2011 |
| Forced choice | Choose the more acceptable of two sentences | Binary; easy; avoids scale-use differences | Low sensitivity; many trials needed | Sprouse & Almeida, 2012 |
| Yes/No judgment | "Is this sentence acceptable?" | Simple; binary | Very low sensitivity; cannot distinguish degrees of unacceptability | -- |

Recommendation: Use 7-point Likert as the default. It provides sufficient sensitivity for most research questions and has been shown to replicate formal linguistic judgments as reliably as magnitude estimation (Sprouse & Almeida, 2012; Sprouse, 2011).

Sample Size for Acceptability

| Design | Minimum N | Rationale | Citation |
|---|---|---|---|
| Simple grammatical/ungrammatical | 20 participants | Large effect sizes (d > 1.0 typical) | Sprouse & Almeida, 2012 |
| Factorial (2x2) with interaction | 30-40 participants | Interaction effects are smaller | Sprouse et al., 2012 |
| Subtle contrasts | 50+ participants | Small effect sizes require more power | Power analysis recommended |

Lexical Controls

Variables That Must Be Controlled Across Conditions

Every critical word manipulation must control for confounding lexical variables. The target word and its condition-matched alternatives should be equated on the following:

| Variable | Database / Source | Why It Matters | Citation |
|---|---|---|---|
| Word frequency | SUBTLEX-US (log10 word frequency per million) | Most powerful predictor of reading time; ~30-60 ms effect for high vs. low frequency | Brysbaert & New, 2009 |
| Word length | Character count | Longer words yield longer reading times; ~20-30 ms per character | Rayner, 1998; Rayner, 2009 |
| Orthographic neighborhood density (N) | N-Watch; CLEARPOND | Number of words differing by one letter; affects lexical access (Coltheart et al., 1977) | Andrews, 1997 |
| Concreteness | Brysbaert et al. (2014) ratings | Concrete words are processed faster than abstract words | Brysbaert et al., 2014 |
| Age of acquisition (AoA) | Kuperman et al. (2012) ratings | Earlier-acquired words are processed faster | Kuperman et al., 2012 |
| Number of syllables | Any pronunciation dictionary | Affects phonological processing time | Rayner, 1998 |
| Morphological complexity | Manual coding | Derived words (e.g., un-happi-ness) are processed differently than monomorphemic words | Taft, 2004 |

Frequency Database Selection

| Database | Language | Measure | Recommended? | Citation |
|---|---|---|---|---|
| SUBTLEX-US | English (US) | Subtitle-based frequency per million | Yes -- best predictor of processing times | Brysbaert & New, 2009 |
| SUBTLEX-UK | English (UK) | Subtitle-based frequency | Yes, for British English materials | van Heuven et al., 2014 |
| HAL | English | Usenet corpus frequency | Outdated; SUBTLEX preferred | Lund & Burgess, 1996 |
| CELEX | English, Dutch, German | Mixed corpus frequency | Acceptable, but less predictive than SUBTLEX | Baayen et al., 1995 |

Key recommendation: Use SUBTLEX log frequency values. They explain more variance in lexical decision and naming times than older norms (Brysbaert & New, 2009).

How to Match Across Conditions

  1. Select critical words for each condition
  2. Retrieve lexical metrics from SUBTLEX-US and norming databases
  3. Compute condition means for each metric
  4. Test for differences: Run t-tests or ANOVAs across conditions on each lexical variable
  5. Criterion: No significant differences (p > 0.20 is a reasonable threshold; some use p > 0.30) on any controlled variable
  6. If matching fails: replace items or add the unmatched variable as a covariate in the analysis
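Step 4 can be sketched as a Welch t-test per lexical variable, applying the criterion from step 5. This assumes SciPy is available; the log-frequency values below are invented for illustration, and a real check would loop over every controlled variable:

```python
from scipy import stats

# Log10 SUBTLEX frequencies for the critical words in two conditions
# (illustrative values; retrieve real ones from the database).
freq_a = [3.1, 2.8, 3.4, 2.9, 3.2, 3.0, 2.7, 3.3]
freq_b = [3.0, 2.9, 3.3, 3.1, 2.8, 3.2, 2.9, 3.1]

t, p = stats.ttest_ind(freq_a, freq_b, equal_var=False)  # Welch's t-test
print(f"log frequency: t = {t:.2f}, p = {p:.3f}")
if p <= 0.20:  # matching criterion from step 5
    print("conditions differ on frequency -- replace items or add covariate")
else:
    print("frequency adequately matched (p > .20)")
```

Note that "no significant difference" on a lenient threshold is the conventional practice reported here, not proof of equivalence; with very small item sets the test has little power to detect a real mismatch.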

Latin Square Counterbalancing

Purpose

In a within-item design, each item appears in all conditions, but each participant sees each item in only one condition. A Latin square assigns items to conditions across participant lists.

Construction

For a design with k conditions and n items (where n is divisible by k):

  1. Divide items into k groups of n/k items each
  2. Create k lists; in each list, assign each item group to a different condition
  3. Each participant receives one list
  4. Result: every item appears in every condition across participants; each participant sees an equal number of items per condition
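The construction above can be sketched as a rotation of condition labels over item groups (the helper name is hypothetical):

```python
def latin_square_lists(item_ids, conditions):
    """Build k presentation lists for k conditions (steps 1-3 above).

    Returns one dict per list, mapping item id -> assigned condition.
    """
    k, n = len(conditions), len(item_ids)
    assert n % k == 0, "number of items must be divisible by k"
    groups = [item_ids[i * n // k:(i + 1) * n // k] for i in range(k)]
    lists = []
    for shift in range(k):  # each list rotates the group->condition mapping
        assignment = {}
        for g, group in enumerate(groups):
            for item in group:
                assignment[item] = conditions[(g + shift) % k]
        lists.append(assignment)
    return lists

# 40 items, 2 conditions -- matches the 2-condition example below
lists = latin_square_lists(list(range(1, 41)), ["A", "B"])
print(lists[0][1], lists[0][21])  # List 1: A B
print(lists[1][1], lists[1][21])  # List 2: B A
```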

Example: 2-Condition Design

With 40 items and 2 conditions (A, B):

| List | Items 1-20 | Items 21-40 |
|---|---|---|
| List 1 | Condition A | Condition B |
| List 2 | Condition B | Condition A |

Requirements

| Parameter | Value | Rationale |
|---|---|---|
| Minimum items per condition per list | 16-24 | Standard for psycholinguistic experiments; fewer items = lower power (Brysbaert & Stevens, 2018) |
| Recommended items | 24-40 per condition | More stable estimates, especially for eye-tracking |
| Participants per list | Equal across lists; minimum 4-6 per list | Ensures balanced representation |
| Total participants | Divisible by the number of lists | Critical for a balanced design |

Filler Items

Purpose

Fillers prevent participants from noticing the experimental manipulation and adopting strategies.

Design Parameters

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Filler-to-target ratio | 2:1 or 3:1 (fillers:targets) | Standard in psycholinguistics; prevents pattern detection (Schütze & Sprouse, 2014) |
| Filler diversity | Span the full range of sentence types, lengths, and structures | Prevents target sentences from standing out |
| Filler acceptability range | Include some clearly good and some mildly awkward fillers | Prevents raters from using only part of the scale |
| Filler length | Match the average length of target sentences | Controls for sentence-length expectations |

Filler Construction Tips

  • Use fillers from different syntactic constructions than your targets
  • Include some fillers with comprehension questions (for reading studies) to maintain attentive reading
  • If targets are semantically anomalous, include some fillers that are also slightly odd (but in different ways) so anomaly is not a cue

Practice and Warm-Up Items

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Number of practice items | 4-6 items (minimum 3) | Familiarizes participants with the task and interface |
| Practice item composition | Span the range of difficulty/acceptability | Calibrates participant expectations |
| Practice data | Always exclude from analysis | Practice responses are contaminated by learning effects |
| Warm-up items at start of main experiment | 2-3 additional filler items | Allow settling into the task; exclude from analysis |

Online Norming Considerations

Platform Recommendations

| Platform | Pros | Cons | Typical Pay Rate |
|---|---|---|---|
| Prolific | Diverse participants; pre-screening; good data quality | Smaller pool than MTurk | ~$10-15/hour (Prolific minimum: $8/hour) |
| Amazon MTurk | Large pool; fast recruitment | Lower data quality; less diverse; requires careful screening | ~$10-15/hour recommended |
| PCIbex / Ibex Farm | Free hosting; designed for linguistics | Requires programming; no built-in recruitment | (hosting only) |
| Gorilla | GUI-based; good for complex designs | Subscription cost | (hosting only) |

Quality Control for Online Studies

| Measure | Implementation | Threshold |
|---|---|---|
| Catch trials | Include 10-15% filler items with obvious answers | Exclude participants failing > 20% |
| Completion time | Record total time | Exclude participants completing in < 50% of the median time |
| Straight-lining | Check for the same response on all items | Exclude participants with zero variance in ratings |
| Bot detection | Include reCAPTCHA or similar | Exclude flagged responses |
| Native speaker check | Self-report + brief language background questionnaire | Exclude non-native speakers (unless studying L2) |
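The first three exclusion rules can be applied in one pass over the data. This is a sketch with invented participant records; the thresholds follow the table above, and real pipelines would also log which rule triggered each exclusion for reporting:

```python
from statistics import median, pvariance

# One record per participant (illustrative data).
participants = {
    "p01": {"catch_correct": 10, "catch_total": 10, "minutes": 22, "ratings": [5, 6, 4, 7]},
    "p02": {"catch_correct": 7,  "catch_total": 10, "minutes": 21, "ratings": [4, 5, 6, 5]},
    "p03": {"catch_correct": 10, "catch_total": 10, "minutes": 8,  "ratings": [5, 4, 6, 3]},
    "p04": {"catch_correct": 9,  "catch_total": 10, "minutes": 20, "ratings": [4, 4, 4, 4]},
}

median_time = median(p["minutes"] for p in participants.values())

def keep(p):
    fail_rate = 1 - p["catch_correct"] / p["catch_total"]
    return (fail_rate <= 0.20                       # catch-trial failures
            and p["minutes"] >= 0.5 * median_time   # rushed completion
            and pvariance(p["ratings"]) > 0)        # straight-lining

kept = [pid for pid, p in participants.items() if keep(p)]
print(kept)  # → ['p01']  (p02 fails catch trials, p03 too fast, p04 straight-lines)
```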

Common Pitfalls

  1. Not norming cloze probability: Claiming words are "predictable" or "unpredictable" based on experimenter intuition rather than empirical cloze norms. Always collect cloze data (Taylor, 1953).

  2. Too few raters per item: With N < 20 raters for cloze, individual item estimates are unstable. A word with true cloze of 0.50 could yield observed cloze of 0.20-0.80 with only 10 raters. Use minimum 30 raters (Bloom & Fischler, 1980).

  3. Not controlling word frequency: Frequency is the strongest single predictor of reading time. A 1 log-unit difference in SUBTLEX frequency corresponds to ~30-40 ms in gaze duration (Brysbaert & New, 2009; Rayner, 1998). Always match or control.

  4. Using the wrong frequency database: HAL and Kucera-Francis norms are outdated. SUBTLEX-US explains significantly more variance in behavioral data (Brysbaert & New, 2009).

  5. Showing raters multiple conditions of the same item: This introduces contrastive evaluation. Raters must see each item in only one condition (Latin square for norming too).

  6. Insufficient filler items: A 1:1 target-to-filler ratio makes the manipulation transparent. Use at least 2:1 fillers to targets (Schütze & Sprouse, 2014).

  7. Not piloting the norming study: Always pilot with 5-10 participants to catch unclear instructions, ambiguous items, and timing issues before running the full norming sample.

  8. Ignoring age of acquisition: AoA effects are independent of frequency (Kuperman et al., 2012). Failing to control AoA can introduce confounds, especially for studies comparing concrete vs. abstract words.

Minimum Reporting Checklist

Based on Schütze & Sprouse (2014) and current psycholinguistic standards:

  • Number of items per condition
  • Cloze probability values: mean, SD, and range per condition (if collected)
  • Cloze norming details: N raters, population, procedure, scoring criteria
  • Plausibility/acceptability ratings: scale type, N raters, mean and SD per condition
  • Lexical control variables: list each controlled variable, database source, and condition means
  • Statistical test confirming conditions do not differ on controlled variables
  • Latin square design: number of lists, items per list per condition, participants per list
  • Filler-to-target ratio and description of filler types
  • Number of practice/warm-up items
  • For online norming: platform, pay rate, attention check procedure, exclusion criteria and N excluded
  • Full item list (in supplementary materials or online repository)

References

  • Andrews, S. (1997). The effect of orthographic similarity on lexical retrieval: Resolving neighborhood conflicts. Psychonomic Bulletin & Review, 4, 439-461.
  • Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
  • Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania.
  • Bard, E. G., Robertson, D., & Sorace, A. (1996). Magnitude estimation of linguistic acceptability. Language, 72, 32-68.
  • Bloom, P. A., & Fischler, I. (1980). Completion norms for 329 sentence contexts. Memory & Cognition, 8, 631-642.
  • Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990.
  • Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1, 9.
  • Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904-911.
  • Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance VI. Hillsdale, NJ: Erlbaum.
  • Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44, 978-990.
  • Kutas, M., & Hillyard, S. A. (1984). Brain potentials during reading reflect word expectancy and semantic association. Nature, 307, 161-163.
  • Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203-208.
  • Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372-422.
  • Rayner, K. (2009). Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology, 62, 1457-1506.
  • Schütze, C. T., & Sprouse, J. (2014). Judgment data. In R. J. Podesva & D. Sharma (Eds.), Research methods in linguistics. Cambridge University Press.
  • Sprouse, J. (2011). A test of the cognitive assumptions of magnitude estimation: Commutativity does not hold for acceptability judgments. Language, 87, 274-288.
  • Sprouse, J., & Almeida, D. (2012). Assessing the reliability of textbook data in syntax: Adger's Core Syntax. Journal of Linguistics, 48, 609-652.
  • Sprouse, J., Schütze, C. T., & Almeida, D. (2012). A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001-2010. Lingua, 134, 219-248.
  • Staub, A., Grant, M., Astheimer, L., & Cohen, A. (2015). The influence of cloze probability and item constraint on cloze task response time. Journal of Memory and Language, 82, 1-17.
  • Taft, M. (2004). Morphological decomposition and the reverse base frequency effect. Quarterly Journal of Experimental Psychology, 57A, 745-765.
  • Taylor, W. L. (1953). "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30, 415-433.
  • van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190.

See references/lexical-databases-guide.md for detailed instructions on accessing and querying lexical control databases.
