Statistical Analysis
Overview
Statistical analysis is a systematic process for testing hypotheses and quantifying relationships. Conduct hypothesis tests (t-test, ANOVA, chi-square), regression, correlation, and Bayesian analyses with assumption checks and APA reporting. Apply this skill for academic research.
When to Use This Skill
This skill should be used when:
-
Conducting statistical hypothesis tests (t-tests, ANOVA, chi-square)
-
Performing regression or correlation analyses
-
Running Bayesian statistical analyses
-
Checking statistical assumptions and diagnostics
-
Calculating effect sizes and conducting power analyses
-
Reporting statistical results in APA format
-
Analyzing experimental or observational data for research
Core Capabilities
- Test Selection and Planning
-
Choose appropriate statistical tests based on research questions and data characteristics
-
Conduct a priori power analyses to determine required sample sizes
-
Plan analysis strategies including multiple comparison corrections
- Assumption Checking
-
Automatically verify all relevant assumptions before running tests
-
Provide diagnostic visualizations (Q-Q plots, residual plots, box plots)
-
Recommend remedial actions when assumptions are violated
- Statistical Testing
-
Hypothesis testing: t-tests, ANOVA, chi-square, non-parametric alternatives
-
Regression: linear, multiple, logistic, with diagnostics
-
Correlations: Pearson, Spearman, with confidence intervals
-
Bayesian alternatives: Bayesian t-tests, ANOVA, regression with Bayes Factors
- Effect Sizes and Interpretation
-
Calculate and interpret appropriate effect sizes for all analyses
-
Provide confidence intervals for effect estimates
-
Distinguish statistical from practical significance
- Professional Reporting
-
Generate APA-style statistical reports
-
Create publication-ready figures and tables
-
Provide complete interpretation with all required statistics
Workflow Decision Tree
Use this decision tree to determine your analysis path:
START │ ├─ Need to SELECT a statistical test? │ └─ YES → See "Test Selection Guide" │ └─ NO → Continue │ ├─ Ready to check ASSUMPTIONS? │ └─ YES → See "Assumption Checking" │ └─ NO → Continue │ ├─ Ready to run ANALYSIS? │ └─ YES → See "Running Statistical Tests" │ └─ NO → Continue │ └─ Need to REPORT results? └─ YES → See "Reporting Results"
Test Selection Guide
Quick Reference: Choosing the Right Test
Use references/test_selection_guide.md for comprehensive guidance. Quick reference:
Comparing Two Groups:
-
Independent, continuous, normal → Independent t-test
-
Independent, continuous, non-normal → Mann-Whitney U test
-
Paired, continuous, normal → Paired t-test
-
Paired, continuous, non-normal → Wilcoxon signed-rank test
-
Binary outcome → Chi-square or Fisher's exact test
Comparing 3+ Groups:
-
Independent, continuous, normal → One-way ANOVA
-
Independent, continuous, non-normal → Kruskal-Wallis test
-
Paired, continuous, normal → Repeated measures ANOVA
-
Paired, continuous, non-normal → Friedman test
Relationships:
-
Two continuous variables → Pearson (normal) or Spearman correlation (non-normal)
-
Continuous outcome with predictor(s) → Linear regression
-
Binary outcome with predictor(s) → Logistic regression
Bayesian Alternatives: All tests have Bayesian versions that provide:
-
Direct probability statements about hypotheses
-
Bayes Factors quantifying evidence
-
Ability to support null hypothesis
-
See references/bayesian_statistics.md
Assumption Checking
Systematic Assumption Verification
ALWAYS check assumptions before interpreting test results.
Use the provided scripts/assumption_checks.py module for automated checking:
from scripts.assumption_checks import comprehensive_assumption_check
Comprehensive check with visualizations
results = comprehensive_assumption_check( data=df, value_col='score', group_col='group', # Optional: for group comparisons alpha=0.05 )
This performs:
-
Outlier detection (IQR and z-score methods)
-
Normality testing (Shapiro-Wilk test + Q-Q plots)
-
Homogeneity of variance (Levene's test + box plots)
-
Interpretation and recommendations
Individual Assumption Checks
For targeted checks, use individual functions:
from scripts.assumption_checks import ( check_normality, check_normality_per_group, check_homogeneity_of_variance, check_linearity, detect_outliers )
Example: Check normality with visualization
result = check_normality( data=df['score'], name='Test Score', alpha=0.05, plot=True ) print(result['interpretation']) print(result['recommendation'])
What to Do When Assumptions Are Violated
Normality violated:
-
Mild violation + n > 30 per group → Proceed with parametric test (robust)
-
Moderate violation → Use non-parametric alternative
-
Severe violation → Transform data or use non-parametric test
Homogeneity of variance violated:
-
For t-test → Use Welch's t-test
-
For ANOVA → Use Welch's ANOVA or Brown-Forsythe ANOVA
-
For regression → Use robust standard errors or weighted least squares
Linearity violated (regression):
-
Add polynomial terms
-
Transform variables
-
Use non-linear models or GAM
See references/assumptions_and_diagnostics.md for comprehensive guidance.
Running Statistical Tests
Python Libraries
Primary libraries for statistical analysis:
-
scipy.stats: Core statistical tests
-
statsmodels: Advanced regression and diagnostics
-
pingouin: User-friendly statistical testing with effect sizes
-
pymc: Bayesian statistical modeling
-
arviz: Bayesian visualization and diagnostics
Example Analyses
T-Test with Complete Reporting
import pingouin as pg import numpy as np
Run independent t-test
result = pg.ttest(group_a, group_b, correction='auto')
Extract results
t_stat = result['T'].values[0] df = result['dof'].values[0] p_value = result['p-val'].values[0] cohens_d = result['cohen-d'].values[0] ci_lower = result['CI95%'].values[0][0] ci_upper = result['CI95%'].values[0][1]
Report
print(f"t({df:.0f}) = {t_stat:.2f}, p = {p_value:.3f}") print(f"Cohen's d = {cohens_d:.2f}, 95% CI [{ci_lower:.2f}, {ci_upper:.2f}]")
ANOVA with Post-Hoc Tests
import pingouin as pg
One-way ANOVA
aov = pg.anova(dv='score', between='group', data=df, detailed=True) print(aov)
If significant, conduct post-hoc tests
if aov['p-unc'].values[0] < 0.05: posthoc = pg.pairwise_tukey(dv='score', between='group', data=df) print(posthoc)
Effect size
eta_squared = aov['np2'].values[0] # Partial eta-squared print(f"Partial η² = {eta_squared:.3f}")
Linear Regression with Diagnostics
import statsmodels.api as sm from statsmodels.stats.outliers_influence import variance_inflation_factor
Fit model
X = sm.add_constant(X_predictors) # Add intercept model = sm.OLS(y, X).fit()
Summary
print(model.summary())
Check multicollinearity (VIF)
vif_data = pd.DataFrame() vif_data["Variable"] = X.columns vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])] print(vif_data)
Check assumptions
residuals = model.resid fitted = model.fittedvalues
Residual plots
import matplotlib.pyplot as plt fig, axes = plt.subplots(2, 2, figsize=(12, 10))
Residuals vs fitted
axes[0, 0].scatter(fitted, residuals, alpha=0.6) axes[0, 0].axhline(y=0, color='r', linestyle='--') axes[0, 0].set_xlabel('Fitted values') axes[0, 0].set_ylabel('Residuals') axes[0, 0].set_title('Residuals vs Fitted')
Q-Q plot
from scipy import stats stats.probplot(residuals, dist="norm", plot=axes[0, 1]) axes[0, 1].set_title('Normal Q-Q')
Scale-Location
axes[1, 0].scatter(fitted, np.sqrt(np.abs(residuals / residuals.std())), alpha=0.6) axes[1, 0].set_xlabel('Fitted values') axes[1, 0].set_ylabel('√|Standardized residuals|') axes[1, 0].set_title('Scale-Location')
Residuals histogram
axes[1, 1].hist(residuals, bins=20, edgecolor='black', alpha=0.7) axes[1, 1].set_xlabel('Residuals') axes[1, 1].set_ylabel('Frequency') axes[1, 1].set_title('Histogram of Residuals')
plt.tight_layout() plt.show()
Bayesian T-Test
import pymc as pm import arviz as az import numpy as np
with pm.Model() as model: # Priors mu1 = pm.Normal('mu_group1', mu=0, sigma=10) mu2 = pm.Normal('mu_group2', mu=0, sigma=10) sigma = pm.HalfNormal('sigma', sigma=10)
# Likelihood
y1 = pm.Normal('y1', mu=mu1, sigma=sigma, observed=group_a)
y2 = pm.Normal('y2', mu=mu2, sigma=sigma, observed=group_b)
# Derived quantity
diff = pm.Deterministic('difference', mu1 - mu2)
# Sample
trace = pm.sample(2000, tune=1000, return_inferencedata=True)
Summarize
print(az.summary(trace, var_names=['difference']))
Probability that group1 > group2
prob_greater = np.mean(trace.posterior['difference'].values > 0) print(f"P(μ₁ > μ₂ | data) = {prob_greater:.3f}")
Plot posterior
az.plot_posterior(trace, var_names=['difference'], ref_val=0)
Effect Sizes
Always Calculate Effect Sizes
Effect sizes quantify magnitude, while p-values only indicate existence of an effect.
See references/effect_sizes_and_power.md for comprehensive guidance.
Quick Reference: Common Effect Sizes
Test Effect Size Small Medium Large
T-test Cohen's d 0.20 0.50 0.80
ANOVA η²_p 0.01 0.06 0.14
Correlation r 0.10 0.30 0.50
Regression R² 0.02 0.13 0.26
Chi-square Cramér's V 0.07 0.21 0.35
Important: Benchmarks are guidelines. Context matters!
Calculating Effect Sizes
Most effect sizes are automatically calculated by pingouin:
T-test returns Cohen's d
result = pg.ttest(x, y) d = result['cohen-d'].values[0]
ANOVA returns partial eta-squared
aov = pg.anova(dv='score', between='group', data=df) eta_p2 = aov['np2'].values[0]
Correlation: r is already an effect size
corr = pg.corr(x, y) r = corr['r'].values[0]
Confidence Intervals for Effect Sizes
Always report CIs to show precision:
from pingouin import compute_effsize_from_t
For t-test
d, ci = compute_effsize_from_t( t_statistic, nx=len(group1), ny=len(group2), eftype='cohen' ) print(f"d = {d:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
Power Analysis
A Priori Power Analysis (Study Planning)
Determine required sample size before data collection:
from statsmodels.stats.power import ( tt_ind_solve_power, FTestAnovaPower )
T-test: What n is needed to detect d = 0.5?
n_required = tt_ind_solve_power( effect_size=0.5, alpha=0.05, power=0.80, ratio=1.0, alternative='two-sided' ) print(f"Required n per group: {n_required:.0f}")
ANOVA: What n is needed to detect f = 0.25?
anova_power = FTestAnovaPower() n_per_group = anova_power.solve_power( effect_size=0.25, ngroups=3, alpha=0.05, power=0.80 ) print(f"Required n per group: {n_per_group:.0f}")
Sensitivity Analysis (Post-Study)
Determine what effect size you could detect:
With n=50 per group, what effect could we detect?
detectable_d = tt_ind_solve_power( effect_size=None, # Solve for this nobs1=50, alpha=0.05, power=0.80, ratio=1.0, alternative='two-sided' ) print(f"Study could detect d ≥ {detectable_d:.2f}")
Note: Post-hoc power analysis (calculating power after study) is generally not recommended. Use sensitivity analysis instead.
See references/effect_sizes_and_power.md for detailed guidance.
Reporting Results
APA Style Statistical Reporting
Follow guidelines in references/reporting_standards.md .
Essential Reporting Elements
-
Descriptive statistics: M, SD, n for all groups/variables
-
Test statistics: Test name, statistic, df, exact p-value
-
Effect sizes: With confidence intervals
-
Assumption checks: Which tests were done, results, actions taken
-
All planned analyses: Including non-significant findings
Example Report Templates
Independent T-Test
Group A (n = 48, M = 75.2, SD = 8.5) scored significantly higher than Group B (n = 52, M = 68.3, SD = 9.2), t(98) = 3.82, p < .001, d = 0.77, 95% CI [0.36, 1.18], two-tailed. Assumptions of normality (Shapiro-Wilk: Group A W = 0.97, p = .18; Group B W = 0.96, p = .12) and homogeneity of variance (Levene's F(1, 98) = 1.23, p = .27) were satisfied.
One-Way ANOVA
A one-way ANOVA revealed a significant main effect of treatment condition on test scores, F(2, 147) = 8.45, p < .001, η²_p = .10. Post hoc comparisons using Tukey's HSD indicated that Condition A (M = 78.2, SD = 7.3) scored significantly higher than Condition B (M = 71.5, SD = 8.1, p = .002, d = 0.87) and Condition C (M = 70.1, SD = 7.9, p < .001, d = 1.07). Conditions B and C did not differ significantly (p = .52, d = 0.18).
Multiple Regression
Multiple linear regression was conducted to predict exam scores from study hours, prior GPA, and attendance. The overall model was significant, F(3, 146) = 45.2, p < .001, R² = .48, adjusted R² = .47. Study hours (B = 1.80, SE = 0.31, β = .35, t = 5.78, p < .001, 95% CI [1.18, 2.42]) and prior GPA (B = 8.52, SE = 1.95, β = .28, t = 4.37, p < .001, 95% CI [4.66, 12.38]) were significant predictors, while attendance was not (B = 0.15, SE = 0.12, β = .08, t = 1.25, p = .21, 95% CI [-0.09, 0.39]). Multicollinearity was not a concern (all VIF < 1.5).
Bayesian Analysis
A Bayesian independent samples t-test was conducted using weakly informative priors (Normal(0, 1) for mean difference). The posterior distribution indicated that Group A scored higher than Group B (M_diff = 6.8, 95% credible interval [3.2, 10.4]). The Bayes Factor BF₁₀ = 45.3 provided very strong evidence for a difference between groups, with a 99.8% posterior probability that Group A's mean exceeded Group B's mean. Convergence diagnostics were satisfactory (all R̂ < 1.01, ESS > 1000).
Bayesian Statistics
When to Use Bayesian Methods
Consider Bayesian approaches when:
-
You have prior information to incorporate
-
You want direct probability statements about hypotheses
-
Sample size is small or planning sequential data collection
-
You need to quantify evidence for the null hypothesis
-
The model is complex (hierarchical, missing data)
See references/bayesian_statistics.md for comprehensive guidance on:
-
Bayes' theorem and interpretation
-
Prior specification (informative, weakly informative, non-informative)
-
Bayesian hypothesis testing with Bayes Factors
-
Credible intervals vs. confidence intervals
-
Bayesian t-tests, ANOVA, regression, and hierarchical models
-
Model convergence checking and posterior predictive checks
Key Advantages
-
Intuitive interpretation: "Given the data, there is a 95% probability the parameter is in this interval"
-
Evidence for null: Can quantify support for no effect
-
Flexible: No p-hacking concerns; can analyze data as it arrives
-
Uncertainty quantification: Full posterior distribution
Resources
This skill includes comprehensive reference materials:
References Directory
-
test_selection_guide.md: Decision tree for choosing appropriate statistical tests
-
assumptions_and_diagnostics.md: Detailed guidance on checking and handling assumption violations
-
effect_sizes_and_power.md: Calculating, interpreting, and reporting effect sizes; conducting power analyses
-
bayesian_statistics.md: Complete guide to Bayesian analysis methods
-
reporting_standards.md: APA-style reporting guidelines with examples
Scripts Directory
-
assumption_checks.py: Automated assumption checking with visualizations
-
comprehensive_assumption_check() : Complete workflow
-
check_normality() : Normality testing with Q-Q plots
-
check_homogeneity_of_variance() : Levene's test with box plots
-
check_linearity() : Regression linearity checks
-
detect_outliers() : IQR and z-score outlier detection
Best Practices
-
Pre-register analyses when possible to distinguish confirmatory from exploratory
-
Always check assumptions before interpreting results
-
Report effect sizes with confidence intervals
-
Report all planned analyses including non-significant results
-
Distinguish statistical from practical significance
-
Visualize data before and after analysis
-
Check diagnostics for regression/ANOVA (residual plots, VIF, etc.)
-
Conduct sensitivity analyses to assess robustness
-
Share data and code for reproducibility
-
Be transparent about violations, transformations, and decisions
Common Pitfalls to Avoid
-
P-hacking: Don't test multiple ways until something is significant
-
HARKing: Don't present exploratory findings as confirmatory
-
Ignoring assumptions: Check them and report violations
-
Confusing significance with importance: p < .05 ≠ meaningful effect
-
Not reporting effect sizes: Essential for interpretation
-
Cherry-picking results: Report all planned analyses
-
Misinterpreting p-values: They're NOT probability that hypothesis is true
-
Multiple comparisons: Correct for family-wise error when appropriate
-
Ignoring missing data: Understand mechanism (MCAR, MAR, MNAR)
-
Overinterpreting non-significant results: Absence of evidence ≠ evidence of absence
Getting Started Checklist
When beginning a statistical analysis:
-
Define research question and hypotheses
-
Determine appropriate statistical test (use test_selection_guide.md)
-
Conduct power analysis to determine sample size
-
Load and inspect data
-
Check for missing data and outliers
-
Verify assumptions using assumption_checks.py
-
Run primary analysis
-
Calculate effect sizes with confidence intervals
-
Conduct post-hoc tests if needed (with corrections)
-
Create visualizations
-
Write results following reporting_standards.md
-
Conduct sensitivity analyses
-
Share data and code
Support and Further Reading
For questions about:
-
Test selection: See references/test_selection_guide.md
-
Assumptions: See references/assumptions_and_diagnostics.md
-
Effect sizes: See references/effect_sizes_and_power.md
-
Bayesian methods: See references/bayesian_statistics.md
-
Reporting: See references/reporting_standards.md
Key textbooks:
-
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences
-
Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics
-
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models
-
Kruschke, J. K. (2014). Doing Bayesian Data Analysis
Online resources:
-
APA Style Guide: https://apastyle.apa.org/
-
Statistical Consulting: Cross Validated (stats.stackexchange.com)