
Exploratory Data Analysis (EDA)


Systematically explore and understand datasets before formal analysis.

When to Use

  • User provides a data file for analysis (CSV, Excel, HDF5, etc.)

  • User asks to "explore", "analyze", or "summarize" data

  • Starting the ANALYSIS phase of a research project

  • Before running formal statistical tests

  • Assessing data quality and completeness

  • Understanding data distributions and relationships

  • Identifying outliers and anomalies

EDA Workflow

  1. LOAD DATA → Read file, check structure
  2. SUMMARIZE → Basic statistics, data types
  3. QUALITY → Missing values, outliers, duplicates
  4. DISTRIBUTIONS → Visualize variable distributions
  5. RELATIONSHIPS → Correlations, group comparisons
  6. DOCUMENT → Generate EDA report

Step 1: Load and Inspect Data

```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('data.csv')  # Adjust for your file type

# Basic inspection
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst few rows:\n{df.head()}")
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
```
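Since input files may also be Excel or HDF5 (see "When to Use"), the hard-coded `read_csv` call can be replaced by a loader that dispatches on file extension. This is a sketch, not part of the original skill: the `load_data` name is illustrative, and reading Excel/HDF5 requires the optional `openpyxl`/`tables` dependencies.

```python
from pathlib import Path

import pandas as pd

def load_data(path):
    """Load a dataset based on its file extension (sketch; name is illustrative)."""
    suffix = Path(path).suffix.lower()
    if suffix == '.csv':
        return pd.read_csv(path)
    if suffix in ('.xls', '.xlsx'):
        return pd.read_excel(path)  # needs openpyxl for .xlsx
    if suffix in ('.h5', '.hdf5'):
        return pd.read_hdf(path)    # needs tables; assumes a single stored object
    raise ValueError(f"Unsupported file type: {suffix}")
```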

Step 2: Summary Statistics

```python
# Numerical columns
print("Numerical Summary:")
print(df.describe().T[['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']])

# Categorical columns
for col in df.select_dtypes(include=['object', 'category']).columns:
    print(f"\n{col}:")
    print(df[col].value_counts().head(10))
```

Key Statistics to Report

| Statistic | Purpose |
| --- | --- |
| n (count) | Sample size, completeness |
| Mean | Central tendency |
| SD | Spread/variability |
| Min/Max | Range, potential outliers |
| Quartiles | Distribution shape |
| Unique values | Cardinality for categoricals |
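The statistics in the table above can be collected into a single frame for reporting. A minimal sketch, assuming numerical columns only (the `summary_table` name is illustrative):

```python
import numpy as np
import pandas as pd

def summary_table(df):
    """One row per numerical column with the statistics listed above."""
    num = df.select_dtypes(include=[np.number])
    out = num.describe().T[['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']]
    out['unique'] = num.nunique()  # cardinality of each numerical column
    return out
```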

Step 3: Data Quality Assessment

Missing Values

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Missing value summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing': missing,
    'Percent': missing_pct
}).query('Missing > 0').sort_values('Percent', ascending=False)

print("Missing Values:")
print(missing_df)

# Visualize missing pattern
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Data Pattern')
plt.tight_layout()
plt.savefig('results/eda_missing_pattern.png', dpi=150)
```

Outlier Detection

```python
def detect_outliers_iqr(data, column):
    """Detect outliers using the IQR method."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower) | (data[column] > upper)]
    return outliers, lower, upper

# Check all numerical columns
for col in df.select_dtypes(include=[np.number]).columns:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    if len(outliers) > 0:
        print(f"{col}: {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
        print(f"  Range: [{lower:.2f}, {upper:.2f}]")
```

Duplicates

```python
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates} ({duplicates/len(df)*100:.1f}%)")

# Check for duplicate IDs (if applicable)
if 'id' in df.columns:
    dup_ids = df['id'].duplicated().sum()
    print(f"Duplicate IDs: {dup_ids}")
```

Step 4: Distribution Analysis

Numerical Variables

```python
import matplotlib.pyplot as plt
import seaborn as sns

numerical_cols = df.select_dtypes(include=[np.number]).columns

# squeeze=False keeps axes 2-D even when there is only one numerical column
fig, axes = plt.subplots(len(numerical_cols), 2,
                         figsize=(12, 4 * len(numerical_cols)), squeeze=False)

for i, col in enumerate(numerical_cols):
    # Histogram
    axes[i, 0].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[i, 0].set_title(f'{col} - Distribution')
    axes[i, 0].set_xlabel(col)
    axes[i, 0].set_ylabel('Frequency')

    # Box plot
    axes[i, 1].boxplot(df[col].dropna())
    axes[i, 1].set_title(f'{col} - Box Plot')
    axes[i, 1].set_ylabel(col)

plt.tight_layout()
plt.savefig('results/eda_distributions.png', dpi=150)
```
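Alongside the plots, skewness gives a quick numeric flag for variables that may need transformation (which feeds the Recommendations section of the report). A sketch using pandas' `skew()`; the `flag_skewed` name and the |skew| > 1 threshold are illustrative conventions, not fixed standards:

```python
import numpy as np
import pandas as pd

def flag_skewed(df, threshold=1.0):
    """Return {column: skewness} for numeric columns beyond the threshold."""
    flags = {}
    for col in df.select_dtypes(include=[np.number]).columns:
        s = df[col].skew()  # sample skewness; 0 for symmetric distributions
        if abs(s) > threshold:
            flags[col] = round(float(s), 2)
    return flags
```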

Categorical Variables

```python
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

for col in categorical_cols:
    plt.figure(figsize=(10, 4))
    df[col].value_counts().head(15).plot(kind='bar', edgecolor='black')
    plt.title(f'{col} - Value Counts')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig(f'results/eda_{col}_counts.png', dpi=150)
    plt.close()  # Avoid accumulating open figures in the loop
```

Step 5: Relationship Analysis

Correlation Matrix

```python
# Numerical correlations
corr_matrix = df.select_dtypes(include=[np.number]).corr()

plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
            cmap='RdBu_r', center=0, square=True)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('results/eda_correlations.png', dpi=150)

# Identify strong correlations
strong_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.7:
            strong_corr.append({
                'var1': corr_matrix.columns[i],
                'var2': corr_matrix.columns[j],
                'correlation': corr_matrix.iloc[i, j]
            })

if strong_corr:
    print("Strong correlations (|r| > 0.7):")
    for c in strong_corr:
        print(f"  {c['var1']} ↔ {c['var2']}: r = {c['correlation']:.3f}")
```

Pairwise Scatter Plots

```python
# For key variables only (to avoid overwhelming output)
key_vars = ['var1', 'var2', 'var3']  # Adjust to your variables
sns.pairplot(df[key_vars], diag_kind='hist')
plt.savefig('results/eda_pairplot.png', dpi=150)
```

Group Comparisons

```python
# If you have a grouping variable
if 'group' in df.columns:
    for col in df.select_dtypes(include=[np.number]).columns:
        plt.figure(figsize=(8, 5))
        df.boxplot(column=col, by='group')
        plt.title(f'{col} by Group')
        plt.suptitle('')  # Remove automatic title
        plt.tight_layout()
        plt.savefig(f'results/eda_{col}_by_group.png', dpi=150)
        plt.close()  # Avoid accumulating open figures in the loop
```

Step 6: Generate EDA Report

Report Template

Exploratory Data Analysis Report

Dataset: [filename]
Date: [date]
Analyst: [name]

1. Data Overview

  • Rows: X
  • Columns: Y
  • File size: Z MB

2. Variable Summary

| Variable | Type | Non-Null | Unique | Mean | SD |
| --- | --- | --- | --- | --- | --- |
| var1 | float64 | 100 | 50 | 25.3 | 5.2 |
| ... | ... | ... | ... | ... | ... |

3. Data Quality

Missing Values

  • [List variables with missing data and percentages]

Outliers

  • [List variables with outliers detected]

Duplicates

  • [Number of duplicate rows]

4. Key Findings

  1. Finding 1: Description
  2. Finding 2: Description
  3. Finding 3: Description

5. Recommendations

  • Handle missing values in [variable] using [method]
  • Consider transformation for [variable] (skewed distribution)
  • Investigate outliers in [variable]
  • Check data collection for [issue noted]

6. Next Steps

Based on this EDA, the following analyses are recommended:

  1. [Recommended analysis 1]
  2. [Recommended analysis 2]

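The report skeleton above can be pre-filled programmatically so each EDA run starts from a consistent document. A sketch, not part of the original skill: the `write_eda_report` name is illustrative, only the overview and variable summary are auto-filled, and the default output path follows the "Files to Create" convention.

```python
from datetime import date
from pathlib import Path

import pandas as pd

def write_eda_report(df, filename, out_path='results/eda_report.md'):
    """Write the report skeleton, pre-filled with the data overview."""
    mem_mb = df.memory_usage(deep=True).sum() / 1024**2
    lines = [
        "Exploratory Data Analysis Report",
        "",
        f"Dataset: {filename}",
        f"Date: {date.today().isoformat()}",
        "",
        "1. Data Overview",
        f"  - Rows: {df.shape[0]}",
        f"  - Columns: {df.shape[1]}",
        f"  - Memory: {mem_mb:.2f} MB",
        "",
        "2. Variable Summary",
        df.describe(include='all').T.to_string(),
        "",
        "(Fill in sections 3-6 from the quality, distribution,",
        "and relationship checks.)",
    ]
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create results/ if missing
    path.write_text("\n".join(lines))
    return path
```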
Integration with RA Workflow

ANALYSIS Phase Connection

After completing EDA:

  • Document findings in .research/logs/activity.md

  • Update tasks.md with identified issues to address

  • Proceed to formal statistical analysis with /statistical_analysis

  • Save figures to results/intermediate/ or manuscript/figures/

Files to Create

| File | Location | Purpose |
| --- | --- | --- |
| EDA report | results/eda_report.md | Document findings |
| Distribution plots | results/intermediate/ | Quality check |
| Correlation matrix | results/intermediate/ | Relationship overview |
| Missing data pattern | results/intermediate/ | Data quality |

Quick EDA Checklist

  • Loaded data and verified structure

  • Checked data types are correct

  • Calculated summary statistics

  • Identified and documented missing values

  • Detected outliers

  • Checked for duplicates

  • Visualized distributions

  • Examined correlations/relationships

  • Documented key findings

  • Listed recommended next steps

