sweetviz

Sweetviz EDA Comparison Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "sweetviz" with this command: npx skills add vamseeachanta/workspace-hub/vamseeachanta-workspace-hub-sweetviz

Sweetviz EDA Comparison Skill

Master Sweetviz for automated exploratory data analysis with powerful comparison capabilities, target variable analysis, and beautiful HTML reports. Sweetviz excels at comparing datasets (train vs test) and analyzing features against a target variable.

When to Use This Skill

USE Sweetviz when:

  • Dataset comparison - Comparing train vs test, before vs after, or any two datasets

  • Target variable analysis - Understanding how features relate to a target

  • Quick EDA reports - Need comprehensive EDA in one line of code

  • Feature comparison - Analyzing feature distributions across subsets

  • HTML reports - Creating shareable, interactive analysis reports

  • Intra-set analysis - Comparing subpopulations within a dataset

  • Data validation - Checking for data drift between datasets

  • Feature selection - Identifying important features for modeling

DON'T USE Sweetviz when:

  • Very large datasets - Over 1M rows (use sampling)

  • Streaming data - Need real-time analysis

  • Deep statistical tests - Need p-values and hypothesis testing

  • Custom visualizations - Specific chart requirements

  • Interactive dashboards - Use Streamlit or Dash instead

  • Text/NLP analysis - Use dedicated NLP tools

Prerequisites

Basic installation

pip install sweetviz

Using uv (recommended)

uv pip install sweetviz pandas numpy

With Jupyter support

pip install sweetviz pandas numpy jupyter

Verify installation

python -c "import sweetviz as sv; print(f'Sweetviz version: {sv.version}')"

System Requirements

  • Python 3.6 or higher

  • pandas 0.25.3 or higher

  • numpy

  • matplotlib (for internal plotting)

  • Modern web browser (for viewing HTML reports)

Core Capabilities

  1. Basic EDA Report (Analyze)

Single Dataset Analysis:

import sweetviz as sv import pandas as pd import numpy as np

Load data

df = pd.read_csv("data.csv")

Generate basic EDA report

report = sv.analyze(df)

Save HTML report

report.show_html("sweetviz_report.html")

Show in notebook (auto-opens browser)

report.show_notebook()

With Source Name:

import sweetviz as sv import pandas as pd

df = pd.read_csv("sales_data.csv")

Generate report with custom name

report = sv.analyze( source=df, pairwise_analysis="auto" # "on", "off", or "auto" )

report.show_html("sales_analysis.html", open_browser=True)

Sample Dataset for Examples:

import sweetviz as sv import pandas as pd import numpy as np from datetime import datetime, timedelta

Create comprehensive sample dataset

np.random.seed(42) n = 5000

df = pd.DataFrame({ # Numeric features "age": np.random.randint(18, 80, n), "income": np.random.exponential(50000, n), "credit_score": np.random.normal(700, 50, n).clip(300, 850).astype(int), "account_balance": np.random.exponential(10000, n), "transaction_count": np.random.poisson(15, n),

# Categorical features
"gender": np.random.choice(["Male", "Female", "Other"], n, p=[0.48, 0.48, 0.04]),
"education": np.random.choice(
    ["High School", "Bachelor", "Master", "PhD"],
    n, p=[0.3, 0.4, 0.2, 0.1]
),
"employment_status": np.random.choice(
    ["Employed", "Self-employed", "Unemployed", "Retired"],
    n, p=[0.6, 0.2, 0.1, 0.1]
),
"region": np.random.choice(["North", "South", "East", "West"], n),

# Date feature
"join_date": [
    datetime(2020, 1, 1) + timedelta(days=int(d))
    for d in np.random.uniform(0, 1460, n)
],

# Target variable (binary classification)
"churned": np.random.choice([0, 1], n, p=[0.8, 0.2])

})

Add some missing values

df.loc[np.random.choice(n, 200), "income"] = np.nan df.loc[np.random.choice(n, 100), "credit_score"] = np.nan df.loc[np.random.choice(n, 150), "education"] = np.nan

Basic analysis

report = sv.analyze(df) report.show_html("customer_analysis.html")

  1. Target Variable Analysis

Binary Target Analysis:

import sweetviz as sv import pandas as pd import numpy as np

Create dataset with target variable

np.random.seed(42) n = 3000

df = pd.DataFrame({ "feature_1": np.random.randn(n), "feature_2": np.random.exponential(10, n), "feature_3": np.random.choice(["A", "B", "C"], n), "feature_4": np.random.randint(1, 100, n), "target": np.random.choice([0, 1], n, p=[0.7, 0.3]) })

Analyze with target variable

Shows how each feature relates to the target

report = sv.analyze( source=df, target_feat="target" # Specify target column )

report.show_html("target_analysis.html")

Continuous Target Analysis:

import sweetviz as sv import pandas as pd import numpy as np

Regression target example

np.random.seed(42) n = 2000

Features that affect the target

x1 = np.random.randn(n) x2 = np.random.exponential(5, n) x3 = np.random.choice([0, 1], n)

Target is a function of features + noise

target = 10 + 2x1 + 0.5x2 + 3*x3 + np.random.randn(n)

df = pd.DataFrame({ "feature_linear": x1, "feature_exp": x2, "feature_binary": x3, "feature_noise": np.random.randn(n), # Unrelated feature "price": target # Continuous target })

Analyze with continuous target

report = sv.analyze( source=df, target_feat="price" )

report.show_html("regression_target_analysis.html")

Multi-class Target:

import sweetviz as sv import pandas as pd import numpy as np

np.random.seed(42) n = 2500

df = pd.DataFrame({ "feature_1": np.random.randn(n), "feature_2": np.random.uniform(0, 100, n), "category": np.random.choice(["A", "B", "C"], n), "class_label": np.random.choice( ["Class_A", "Class_B", "Class_C", "Class_D"], n, p=[0.4, 0.3, 0.2, 0.1] ) })

Multi-class target analysis

report = sv.analyze( source=df, target_feat="class_label" )

report.show_html("multiclass_analysis.html")

  1. Dataset Comparison (Compare)

Train vs Test Comparison:

import sweetviz as sv import pandas as pd import numpy as np from sklearn.model_selection import train_test_split

Create sample dataset

np.random.seed(42) n = 5000

df = pd.DataFrame({ "feature_1": np.random.randn(n), "feature_2": np.random.exponential(50, n), "feature_3": np.random.choice(["X", "Y", "Z"], n), "feature_4": np.random.randint(1, 100, n), "target": np.random.choice([0, 1], n, p=[0.75, 0.25]) })

Split into train and test

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Train shape: {train_df.shape}") print(f"Test shape: {test_df.shape}")

Compare train vs test datasets

comparison_report = sv.compare( source=[train_df, "Training Data"], compare=[test_df, "Test Data"], target_feat="target" )

comparison_report.show_html("train_test_comparison.html")

Before vs After Comparison:

import sweetviz as sv import pandas as pd import numpy as np

np.random.seed(42)

Original data with issues

df_before = pd.DataFrame({ "value": np.concatenate([ np.random.randn(900), np.array([50, -30, 100, 75, -50]) # Outliers ]), "category": np.random.choice(["A", "B", "C"], 905), "score": np.random.uniform(0, 100, 905) })

Add missing values

df_before.loc[np.random.choice(905, 80), "value"] = np.nan

Cleaned data

df_after = df_before.copy()

Remove outliers using IQR

Q1 = df_after["value"].quantile(0.25) Q3 = df_after["value"].quantile(0.75) IQR = Q3 - Q1 df_after = df_after[ (df_after["value"].isna()) | # Keep NaN for now ((df_after["value"] >= Q1 - 1.5 * IQR) & (df_after["value"] <= Q3 + 1.5 * IQR)) ]

Fill missing values

df_after["value"] = df_after["value"].fillna(df_after["value"].median())

Compare before vs after cleaning

comparison = sv.compare( source=[df_before, "Before Cleaning"], compare=[df_after, "After Cleaning"] )

comparison.show_html("cleaning_comparison.html")

Production vs Development Data:

import sweetviz as sv import pandas as pd import numpy as np

np.random.seed(42)

Development data (historical)

df_dev = pd.DataFrame({ "feature_1": np.random.randn(3000), "feature_2": np.random.exponential(100, 3000), "category": np.random.choice(["A", "B", "C"], 3000, p=[0.5, 0.3, 0.2]) })

Production data (slightly different distribution - data drift)

df_prod = pd.DataFrame({ "feature_1": np.random.randn(1000) * 1.2 + 0.3, # Shifted and scaled "feature_2": np.random.exponential(120, 1000), # Different mean "category": np.random.choice(["A", "B", "C", "D"], 1000, p=[0.4, 0.3, 0.2, 0.1]) # New category })

Detect data drift

drift_report = sv.compare( source=[df_dev, "Development"], compare=[df_prod, "Production"] )

drift_report.show_html("data_drift_analysis.html")

  1. Intra-set Comparison (Compare_Intra)

Compare Subpopulations Within Dataset:

import sweetviz as sv import pandas as pd import numpy as np

np.random.seed(42) n = 4000

Create dataset with a categorical variable for splitting

df = pd.DataFrame({ "age": np.random.randint(18, 70, n), "income": np.random.exponential(50000, n), "purchases": np.random.poisson(10, n), "satisfaction_score": np.random.uniform(1, 5, n), "customer_type": np.random.choice(["Premium", "Standard"], n, p=[0.3, 0.7]), "churned": np.random.choice([0, 1], n, p=[0.85, 0.15]) })

Compare Premium vs Standard customers within same dataset

intra_report = sv.compare_intra( source_df=df, condition_series=df["customer_type"] == "Premium", names=["Premium Customers", "Standard Customers"], target_feat="churned" )

intra_report.show_html("customer_type_comparison.html")

Compare by Boolean Condition:

import sweetviz as sv import pandas as pd import numpy as np

np.random.seed(42) n = 3000

df = pd.DataFrame({ "age": np.random.randint(18, 80, n), "income": np.random.exponential(60000, n), "credit_score": np.random.normal(700, 50, n).astype(int), "loan_amount": np.random.exponential(20000, n), "default": np.random.choice([0, 1], n, p=[0.9, 0.1]) })

Compare high-income vs low-income customers

median_income = df["income"].median()

income_comparison = sv.compare_intra( source_df=df, condition_series=df["income"] > median_income, names=["High Income", "Low Income"], target_feat="default" )

income_comparison.show_html("income_segment_comparison.html")

Multiple Segment Analysis:

import sweetviz as sv import pandas as pd import numpy as np

np.random.seed(42) n = 5000

df = pd.DataFrame({ "age": np.random.randint(18, 80, n), "spend": np.random.exponential(500, n), "visits": np.random.poisson(5, n), "region": np.random.choice(["North", "South", "East", "West"], n), "converted": np.random.choice([0, 1], n, p=[0.8, 0.2]) })

Create age groups

df["age_group"] = pd.cut( df["age"], bins=[0, 30, 50, 100], labels=["Young", "Middle", "Senior"] )

Compare Young vs Senior (Middle excluded)

age_comparison = sv.compare_intra( source_df=df, condition_series=df["age_group"] == "Young", names=["Young (18-30)", "Senior (50+)"], target_feat="converted" )

age_comparison.show_html("age_group_comparison.html")

  1. Feature Configuration

Specifying Feature Types:

import sweetviz as sv import pandas as pd import numpy as np

np.random.seed(42) n = 2000

df = pd.DataFrame({ "id": range(1, n + 1), "zip_code": np.random.randint(10000, 99999, n), # Should be categorical "rating": np.random.randint(1, 6, n), # Ordinal (1-5 stars) "revenue": np.random.exponential(1000, n), "category": np.random.choice(["A", "B", "C"], n), "target": np.random.choice([0, 1], n) })

Configure feature types

feature_config = sv.FeatureConfig( skip=["id"], # Skip ID column force_cat=["zip_code", "rating"], # Force as categorical force_num=[] # Force as numerical (if needed) )

report = sv.analyze( source=df, target_feat="target", feat_cfg=feature_config )

report.show_html("configured_analysis.html")

Skipping Features:

import sweetviz as sv import pandas as pd import numpy as np

np.random.seed(42) n = 1500

df = pd.DataFrame({ "user_id": range(n), "session_id": [f"sess_{i}" for i in range(n)], "email": [f"user_{i}@example.com" for i in range(n)], "feature_1": np.random.randn(n), "feature_2": np.random.exponential(10, n), "outcome": np.random.choice([0, 1], n) })

Skip ID and PII columns

config = sv.FeatureConfig( skip=["user_id", "session_id", "email"] )

report = sv.analyze( source=df, target_feat="outcome", feat_cfg=config )

report.show_html("filtered_analysis.html")

  1. Pairwise Analysis Control

Controlling Correlation Analysis:

import sweetviz as sv import pandas as pd import numpy as np

np.random.seed(42) n = 2000

Dataset with many features

df = pd.DataFrame({ f"feature_{i}": np.random.randn(n) for i in range(20) }) df["target"] = np.random.choice([0, 1], n)

Disable pairwise analysis for speed

report_fast = sv.analyze( source=df, target_feat="target", pairwise_analysis="off" # Faster, no correlation matrix )

Enable pairwise analysis for full correlations

report_full = sv.analyze( source=df, target_feat="target", pairwise_analysis="on" # Shows feature correlations )

Auto mode (default) - enables if < 20 features

report_auto = sv.analyze( source=df, target_feat="target", pairwise_analysis="auto" )

report_full.show_html("full_pairwise.html")

Complete Examples

Example 1: ML Dataset Profiling Pipeline

#!/usr/bin/env python3 """ml_profiling_pipeline.py - Complete ML dataset profiling with Sweetviz"""

import sweetviz as sv import pandas as pd import numpy as np from sklearn.model_selection import train_test_split from datetime import datetime import os import json

def ml_profiling_pipeline( df: pd.DataFrame, target_col: str, output_dir: str, test_size: float = 0.2, id_cols: list = None, cat_cols: list = None ) -> dict: """ Complete ML dataset profiling pipeline.

Args:
    df: Input DataFrame
    target_col: Target variable column name
    output_dir: Directory for output reports
    test_size: Test set proportion
    id_cols: Columns to skip (IDs, etc.)
    cat_cols: Columns to force as categorical

Returns:
    Dictionary with profiling results
"""
os.makedirs(output_dir, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

print(f"Starting ML profiling pipeline")
print(f"Dataset shape: {df.shape}")
print(f"Target column: {target_col}")

results = {
    "timestamp": timestamp,
    "original_shape": df.shape,
    "target_column": target_col,
    "reports": {}
}

# Configure features
feat_cfg = sv.FeatureConfig(
    skip=id_cols or [],
    force_cat=cat_cols or []
)

# 1. Full dataset analysis with target
print("\n1. Generating full dataset analysis...")
full_report = sv.analyze(
    source=df,
    target_feat=target_col,
    feat_cfg=feat_cfg,
    pairwise_analysis="auto"
)

full_path = os.path.join(output_dir, f"full_analysis_{timestamp}.html")
full_report.show_html(full_path, open_browser=False)
results["reports"]["full_analysis"] = full_path
print(f"   Saved: {full_path}")

# 2. Split data
print("\n2. Splitting into train/test...")
train_df, test_df = train_test_split(
    df, test_size=test_size, random_state=42,
    stratify=df[target_col] if df[target_col].nunique() &#x3C; 20 else None
)

results["train_shape"] = train_df.shape
results["test_shape"] = test_df.shape
print(f"   Train: {train_df.shape}, Test: {test_df.shape}")

# 3. Train vs Test comparison
print("\n3. Generating train vs test comparison...")
comparison_report = sv.compare(
    source=[train_df, "Training Set"],
    compare=[test_df, "Test Set"],
    target_feat=target_col,
    feat_cfg=feat_cfg
)

comparison_path = os.path.join(output_dir, f"train_test_comparison_{timestamp}.html")
comparison_report.show_html(comparison_path, open_browser=False)
results["reports"]["train_test_comparison"] = comparison_path
print(f"   Saved: {comparison_path}")

# 4. Target class analysis (for classification)
if df[target_col].nunique() &#x3C;= 10:
    print("\n4. Generating target class comparison...")

    # Get most common classes
    top_classes = df[target_col].value_counts().nlargest(2).index.tolist()

    if len(top_classes) >= 2:
        class_comparison = sv.compare_intra(
            source_df=df,
            condition_series=df[target_col] == top_classes[0],
            names=[f"Class: {top_classes[0]}", f"Class: {top_classes[1]}"],
            feat_cfg=feat_cfg
        )

        class_path = os.path.join(output_dir, f"class_comparison_{timestamp}.html")
        class_comparison.show_html(class_path, open_browser=False)
        results["reports"]["class_comparison"] = class_path
        print(f"   Saved: {class_path}")

# 5. Summary statistics
print("\n5. Calculating summary statistics...")

results["feature_summary"] = {
    "total_features": len(df.columns) - 1,
    "numeric_features": len(df.select_dtypes(include=[np.number]).columns) - 1,
    "categorical_features": len(df.select_dtypes(include=["object", "category"]).columns),
    "missing_values": df.isnull().sum().sum(),
    "missing_percentage": (df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100
}

# Target distribution
target_dist = df[target_col].value_counts(normalize=True).to_dict()
results["target_distribution"] = {str(k): round(v, 4) for k, v in target_dist.items()}

# Save results JSON
results_path = os.path.join(output_dir, f"profiling_results_{timestamp}.json")
with open(results_path, "w") as f:
    json.dump(results, f, indent=2, default=str)

print(f"\n{'='*60}")
print(f"Pipeline Complete!")
print(f"Reports saved to: {output_dir}")
print(f"Results JSON: {results_path}")
print(f"{'='*60}")

return results

def create_sample_ml_dataset(n_samples: int = 5000) -> pd.DataFrame: """Create sample ML dataset for demonstration.""" np.random.seed(42)

df = pd.DataFrame({
    "customer_id": range(1, n_samples + 1),
    "age": np.random.randint(18, 80, n_samples),
    "income": np.random.exponential(50000, n_samples),
    "credit_score": np.random.normal(700, 50, n_samples).clip(300, 850).astype(int),
    "account_age_days": np.random.exponential(365, n_samples).astype(int),
    "num_products": np.random.poisson(3, n_samples),
    "total_transactions": np.random.poisson(50, n_samples),
    "avg_transaction_value": np.random.exponential(100, n_samples),
    "support_tickets": np.random.poisson(2, n_samples),
    "satisfaction_score": np.random.uniform(1, 5, n_samples),
    "tenure_category": np.random.choice(["New", "Regular", "Loyal"], n_samples, p=[0.3, 0.4, 0.3]),
    "channel": np.random.choice(["Online", "Mobile", "Branch"], n_samples),
    "region": np.random.choice(["North", "South", "East", "West"], n_samples),
    "churned": np.random.choice([0, 1], n_samples, p=[0.85, 0.15])
})

# Add missing values
df.loc[np.random.choice(n_samples, 200), "income"] = np.nan
df.loc[np.random.choice(n_samples, 100), "credit_score"] = np.nan
df.loc[np.random.choice(n_samples, 150), "satisfaction_score"] = np.nan

return df

Example usage

if name == "main": # Create sample dataset df = create_sample_ml_dataset(5000)

# Run profiling pipeline
results = ml_profiling_pipeline(
    df=df,
    target_col="churned",
    output_dir="ml_profiling_output",
    test_size=0.2,
    id_cols=["customer_id"],
    cat_cols=["credit_score"]
)

print("\nResults Summary:")
print(f"  Train shape: {results['train_shape']}")
print(f"  Test shape: {results['test_shape']}")
print(f"  Target distribution: {results['target_distribution']}")

Example 2: Data Quality Assessment

#!/usr/bin/env python3 """data_quality_assessment.py - Data quality assessment with Sweetviz"""

import sweetviz as sv import pandas as pd import numpy as np from datetime import datetime import os import json

def assess_data_quality( df: pd.DataFrame, output_dir: str, reference_df: pd.DataFrame = None, target_col: str = None ) -> dict: """ Comprehensive data quality assessment.

Args:
    df: DataFrame to assess
    output_dir: Output directory for reports
    reference_df: Reference/baseline dataset for comparison
    target_col: Optional target variable

Returns:
    Quality assessment results
"""
os.makedirs(output_dir, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

results = {
    "timestamp": timestamp,
    "dataset_shape": df.shape,
    "quality_metrics": {},
    "reports": []
}

# 1. Basic quality metrics
print("Calculating quality metrics...")

# Completeness
total_cells = df.shape[0] * df.shape[1]
missing_cells = df.isnull().sum().sum()
completeness = (1 - missing_cells / total_cells) * 100

# Missing by column
missing_by_col = df.isnull().sum().to_dict()
missing_pct_by_col = {k: round(v / len(df) * 100, 2) for k, v in missing_by_col.items()}

# Uniqueness
duplicate_rows = df.duplicated().sum()
uniqueness = (1 - duplicate_rows / len(df)) * 100

# Cardinality
cardinality = {col: df[col].nunique() for col in df.columns}
high_cardinality = [col for col, nunique in cardinality.items()
                    if nunique / len(df) > 0.9]
constant_cols = [col for col, nunique in cardinality.items() if nunique &#x3C;= 1]

results["quality_metrics"] = {
    "completeness_score": round(completeness, 2),
    "uniqueness_score": round(uniqueness, 2),
    "total_missing_cells": int(missing_cells),
    "missing_by_column": missing_pct_by_col,
    "duplicate_rows": int(duplicate_rows),
    "high_cardinality_columns": high_cardinality,
    "constant_columns": constant_cols,
    "cardinality": cardinality
}

# 2. Generate Sweetviz report
print("Generating Sweetviz analysis report...")

# Configure to skip obvious ID columns
potential_ids = [col for col in df.columns
                 if df[col].nunique() == len(df) or "id" in col.lower()]

feat_cfg = sv.FeatureConfig(skip=potential_ids)

analysis_report = sv.analyze(
    source=df,
    target_feat=target_col,
    feat_cfg=feat_cfg,
    pairwise_analysis="auto"
)

report_path = os.path.join(output_dir, f"quality_analysis_{timestamp}.html")
analysis_report.show_html(report_path, open_browser=False)
results["reports"].append(report_path)

# 3. Reference comparison (if provided)
if reference_df is not None:
    print("Generating reference comparison report...")

    comparison_report = sv.compare(
        source=[reference_df, "Reference/Baseline"],
        compare=[df, "Current Dataset"],
        target_feat=target_col,
        feat_cfg=feat_cfg
    )

    comparison_path = os.path.join(output_dir, f"reference_comparison_{timestamp}.html")
    comparison_report.show_html(comparison_path, open_browser=False)
    results["reports"].append(comparison_path)

    # Calculate drift metrics
    drift_metrics = {}
    numeric_cols = df.select_dtypes(include=[np.number]).columns

    for col in numeric_cols:
        if col in reference_df.columns:
            ref_mean = reference_df[col].mean()
            curr_mean = df[col].mean()
            ref_std = reference_df[col].std()

            if ref_std > 0:
                drift = abs(curr_mean - ref_mean) / ref_std
                drift_metrics[col] = round(drift, 3)

    results["drift_metrics"] = drift_metrics
    results["significant_drift"] = [
        col for col, score in drift_metrics.items() if score > 0.5
    ]

# 4. Save results
results_path = os.path.join(output_dir, f"quality_results_{timestamp}.json")
with open(results_path, "w") as f:
    json.dump(results, f, indent=2, default=str)

# 5. Print summary
print(f"\n{'='*60}")
print("DATA QUALITY ASSESSMENT SUMMARY")
print(f"{'='*60}")
print(f"Dataset Shape: {df.shape[0]} rows x {df.shape[1]} columns")
print(f"\nQuality Scores:")
print(f"  Completeness: {completeness:.1f}%")
print(f"  Uniqueness:   {uniqueness:.1f}%")
print(f"\nIssues Found:")
print(f"  Missing values: {missing_cells:,}")
print(f"  Duplicate rows: {duplicate_rows:,}")
print(f"  Constant columns: {len(constant_cols)}")
print(f"  High cardinality: {len(high_cardinality)}")

if reference_df is not None and results.get("significant_drift"):
    print(f"\nData Drift Warning:")
    print(f"  Significant drift in: {results['significant_drift']}")

print(f"\nReports saved to: {output_dir}")
print(f"{'='*60}")

return results

Example usage

if name == "main": np.random.seed(42) n = 3000

# Create current dataset
df_current = pd.DataFrame({
    "id": range(n),
    "value_a": np.random.randn(n),
    "value_b": np.random.exponential(100, n),
    "category": np.random.choice(["X", "Y", "Z"], n),
    "score": np.random.uniform(0, 100, n)
})

# Add some quality issues
df_current.loc[np.random.choice(n, 200), "value_a"] = np.nan
df_current.loc[np.random.choice(n, 100), "value_b"] = np.nan

# Create reference dataset (slightly different)
df_reference = pd.DataFrame({
    "id": range(n),
    "value_a": np.random.randn(n),
    "value_b": np.random.exponential(100, n),
    "category": np.random.choice(["X", "Y", "Z"], n),
    "score": np.random.uniform(0, 100, n)
})

# Run assessment
results = assess_data_quality(
    df=df_current,
    output_dir="quality_assessment_output",
    reference_df=df_reference
)

Example 3: Feature Selection Analysis

#!/usr/bin/env python3 """feature_selection_analysis.py - Feature analysis for ML with Sweetviz"""

import sweetviz as sv import pandas as pd import numpy as np from sklearn.model_selection import train_test_split import os

def analyze_features_for_selection( df: pd.DataFrame, target_col: str, output_dir: str ) -> dict: """ Analyze features for selection using Sweetviz.

Args:
    df: Input DataFrame with features and target
    target_col: Target variable column name
    output_dir: Output directory

Returns:
    Feature analysis results
"""
os.makedirs(output_dir, exist_ok=True)

# Identify feature types
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=["object", "category"]).columns.tolist()

# Remove target from features
if target_col in numeric_features:
    numeric_features.remove(target_col)
if target_col in categorical_features:
    categorical_features.remove(target_col)

print(f"Analyzing {len(numeric_features)} numeric and {len(categorical_features)} categorical features")

# 1. Full feature analysis with target
print("\n1. Generating full feature analysis...")
full_report = sv.analyze(
    source=df,
    target_feat=target_col,
    pairwise_analysis="on"  # Enable correlation analysis
)
full_report.show_html(
    os.path.join(output_dir, "full_feature_analysis.html"),
    open_browser=False
)

# 2. Analyze by target classes
if df[target_col].nunique() &#x3C;= 10:
    print("\n2. Analyzing feature distributions by target class...")

    target_values = df[target_col].unique()

    if len(target_values) >= 2:
        # Compare top two classes
        class1, class2 = sorted(
            df[target_col].value_counts().nlargest(2).index.tolist()
        )

        class_report = sv.compare_intra(
            source_df=df,
            condition_series=df[target_col] == class1,
            names=[f"Target={class1}", f"Target={class2}"]
        )
        class_report.show_html(
            os.path.join(output_dir, "feature_by_target_class.html"),
            open_browser=False
        )

# 3. Analyze missing patterns
missing_cols = df.columns[df.isnull().any()].tolist()

if missing_cols:
    print(f"\n3. Found {len(missing_cols)} columns with missing values")

    # Create missing indicator for analysis
    df_missing = df.copy()
    for col in missing_cols:
        df_missing[f"{col}_missing"] = df[col].isnull().astype(int)

    # Analyze how missing values relate to target
    missing_cols_indicators = [f"{col}_missing" for col in missing_cols]
    df_subset = df_missing[missing_cols_indicators + [target_col]]

    missing_report = sv.analyze(
        source=df_subset,
        target_feat=target_col
    )
    missing_report.show_html(
        os.path.join(output_dir, "missing_pattern_analysis.html"),
        open_browser=False
    )

# 4. Calculate basic feature importance indicators
print("\n4. Calculating feature statistics...")

feature_stats = []

for col in numeric_features:
    stats = {
        "feature": col,
        "type": "numeric",
        "missing_pct": df[col].isnull().mean() * 100,
        "unique_ratio": df[col].nunique() / len(df),
        "std": df[col].std(),
        "cv": df[col].std() / df[col].mean() if df[col].mean() != 0 else 0
    }

    # Correlation with target if target is numeric
    if df[target_col].dtype in [np.int64, np.float64]:
        stats["target_correlation"] = df[col].corr(df[target_col])

    feature_stats.append(stats)

for col in categorical_features:
    stats = {
        "feature": col,
        "type": "categorical",
        "missing_pct": df[col].isnull().mean() * 100,
        "unique_ratio": df[col].nunique() / len(df),
        "n_categories": df[col].nunique()
    }
    feature_stats.append(stats)

feature_df = pd.DataFrame(feature_stats)
feature_df.to_csv(
    os.path.join(output_dir, "feature_statistics.csv"),
    index=False
)

print(f"\nReports saved to: {output_dir}")

return {
    "numeric_features": numeric_features,
    "categorical_features": categorical_features,
    "feature_stats": feature_stats,
    "missing_columns": missing_cols
}

Example usage

if name == "main": np.random.seed(42) n = 3000

# Create features with varying importance
x1 = np.random.randn(n)  # Important
x2 = np.random.randn(n)  # Important
x3 = np.random.randn(n)  # Noise

df = pd.DataFrame({
    "important_feature_1": x1,
    "important_feature_2": x2,
    "noise_feature": x3,
    "categorical_feature": np.random.choice(["A", "B", "C"], n),
    "target": ((x1 + x2) > 0).astype(int)
})

# Add missing values
df.loc[np.random.choice(n, 100), "important_feature_1"] = np.nan

results = analyze_features_for_selection(
    df=df,
    target_col="target",
    output_dir="feature_analysis_output"
)

Integration Examples

Sweetviz with Streamlit

#!/usr/bin/env python3 """sweetviz_streamlit.py - Streamlit app for Sweetviz reports"""

import streamlit as st import sweetviz as sv import pandas as pd import tempfile import os

st.set_page_config(page_title="Sweetviz EDA", layout="wide") st.title("Sweetviz Exploratory Data Analysis")

File upload

uploaded_file = st.file_uploader("Upload CSV file", type=["csv"])

if uploaded_file: df = pd.read_csv(uploaded_file)

st.subheader("Data Preview")
st.dataframe(df.head(100))

st.subheader("Dataset Info")
col1, col2, col3 = st.columns(3)
col1.metric("Rows", df.shape[0])
col2.metric("Columns", df.shape[1])
col3.metric("Missing Values", df.isnull().sum().sum())

# Analysis options
st.sidebar.header("Analysis Options")

target_col = st.sidebar.selectbox(
    "Target Variable (optional)",
    ["None"] + list(df.columns)
)

pairwise = st.sidebar.selectbox(
    "Pairwise Analysis",
    ["auto", "on", "off"]
)

skip_cols = st.sidebar.multiselect(
    "Columns to Skip",
    list(df.columns)
)

if st.button("Generate Report"):
    with st.spinner("Generating Sweetviz report..."):
        feat_cfg = sv.FeatureConfig(skip=skip_cols) if skip_cols else None

        report = sv.analyze(
            source=df,
            target_feat=target_col if target_col != "None" else None,
            feat_cfg=feat_cfg,
            pairwise_analysis=pairwise
        )

        # Save to temp file
        with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as f:
            report.show_html(f.name, open_browser=False)

            with open(f.name, "r") as html_file:
                html_content = html_file.read()

            os.unlink(f.name)

        # Display in iframe
        st.components.v1.html(html_content, height=800, scrolling=True)

Sweetviz with Jupyter Magic

In Jupyter notebook

import sweetviz as sv import pandas as pd

Load data

df = pd.read_csv("data.csv")

Quick analysis (opens in new tab)

report = sv.analyze(df) report.show_notebook() # Opens in browser from notebook

Inline display (for newer Jupyter versions)

report.show_notebook( w="100%", # Width h="600px", # Height scale=0.8 # Scale factor )

For comparison

train_df = df.sample(frac=0.8) test_df = df.drop(train_df.index)

comparison = sv.compare( [train_df, "Train"], [test_df, "Test"] ) comparison.show_notebook()

Sweetviz in Data Pipeline

#!/usr/bin/env python3 """data_pipeline_sweetviz.py - Integrate Sweetviz in data pipeline"""

import sweetviz as sv import pandas as pd import numpy as np from datetime import datetime import os import logging

logging.basicConfig(level=logging.INFO) logger = logging.getLogger(name)

class DataPipelineProfiler: """Sweetviz profiler for data pipelines."""

def __init__(self, output_dir: str):
    self.output_dir = output_dir
    os.makedirs(output_dir, exist_ok=True)
    self.reports = []

def profile_stage(
    self,
    df: pd.DataFrame,
    stage_name: str,
    target_col: str = None,
    previous_df: pd.DataFrame = None
) -> str:
    """
    Profile data at a pipeline stage.

    Args:
        df: DataFrame at current stage
        stage_name: Name of the pipeline stage
        target_col: Target variable (optional)
        previous_df: DataFrame from previous stage (optional)

    Returns:
        Path to generated report
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    if previous_df is not None:
        # Comparison report
        logger.info(f"Generating comparison report for stage: {stage_name}")

        report = sv.compare(
            source=[previous_df, "Before"],
            compare=[df, "After"],
            target_feat=target_col
        )

        report_path = os.path.join(
            self.output_dir,
            f"{stage_name}_comparison_{timestamp}.html"
        )
    else:
        # Single analysis report
        logger.info(f"Generating analysis report for stage: {stage_name}")

        report = sv.analyze(
            source=df,
            target_feat=target_col
        )

        report_path = os.path.join(
            self.output_dir,
            f"{stage_name}_analysis_{timestamp}.html"
        )

    report.show_html(report_path, open_browser=False)
    self.reports.append(report_path)

    logger.info(f"Report saved: {report_path}")
    return report_path

def generate_summary(self) -> dict:
    """Generate summary of all profiling reports."""
    return {
        "total_reports": len(self.reports),
        "reports": self.reports,
        "output_dir": self.output_dir
    }

Example pipeline usage

def example_pipeline(): """Example data pipeline with profiling.""" profiler = DataPipelineProfiler("pipeline_reports")

# Stage 1: Raw data
np.random.seed(42)
df_raw = pd.DataFrame({
    "value": np.concatenate([np.random.randn(950), [100, -50, np.nan] * 10]),
    "category": np.random.choice(["A", "B", "C"], 980),
    "target": np.random.choice([0, 1], 980)
})

profiler.profile_stage(df_raw, "01_raw_data", target_col="target")

# Stage 2: Missing value handling
df_cleaned = df_raw.copy()
df_cleaned["value"] = df_cleaned["value"].fillna(df_cleaned["value"].median())

profiler.profile_stage(
    df_cleaned, "02_missing_handled",
    target_col="target",
    previous_df=df_raw
)

# Stage 3: Outlier removal
Q1 = df_cleaned["value"].quantile(0.25)
Q3 = df_cleaned["value"].quantile(0.75)
IQR = Q3 - Q1

df_no_outliers = df_cleaned[
    (df_cleaned["value"] >= Q1 - 1.5 * IQR) &#x26;
    (df_cleaned["value"] &#x3C;= Q3 + 1.5 * IQR)
]

profiler.profile_stage(
    df_no_outliers, "03_outliers_removed",
    target_col="target",
    previous_df=df_cleaned
)

# Summary
summary = profiler.generate_summary()
print(f"\nPipeline profiling complete!")
print(f"Generated {summary['total_reports']} reports")

return summary

if name == "main": example_pipeline()

Best Practices

  1. Use Target Analysis for ML Projects

GOOD: Always specify target for ML datasets

report = sv.analyze(df, target_feat="target")

AVOID: Missing target analysis

report = sv.analyze(df) # No target relationship shown

  1. Sample Large Datasets

GOOD: Sample for large datasets

if len(df) > 100000: df_sample = df.sample(n=100000, random_state=42) report = sv.analyze(df_sample) else: report = sv.analyze(df)

AVOID: Analyzing huge datasets directly

report = sv.analyze(df_with_millions_of_rows) # Very slow

  1. Configure Feature Types Properly

GOOD: Force categorical for ID-like numeric columns

config = sv.FeatureConfig( skip=["customer_id", "transaction_id"], force_cat=["zip_code", "area_code", "rating"] ) report = sv.analyze(df, feat_cfg=config)

AVOID: Letting Sweetviz treat zip codes as numeric

  1. Use Comparison for Validation

GOOD: Compare train/test for data leakage detection

comparison = sv.compare( [train_df, "Train"], [test_df, "Test"], target_feat="target" )

AVOID: Only analyzing training data

  1. Control Pairwise Analysis

GOOD: Disable for speed on many features

report = sv.analyze(df, pairwise_analysis="off") # Fast

GOOD: Enable when feature correlations matter

report = sv.analyze(df, pairwise_analysis="on") # Full correlations

Troubleshooting

Common Issues

Issue: Report generation is slow

Solution 1: Disable pairwise analysis

report = sv.analyze(df, pairwise_analysis="off")

Solution 2: Sample data

report = sv.analyze(df.sample(50000))

Solution 3: Skip high-cardinality columns

config = sv.FeatureConfig(skip=["high_card_col"]) report = sv.analyze(df, feat_cfg=config)

Issue: Memory error with large dataset

Solution: Process in chunks or sample

sample_size = min(len(df), 100000) report = sv.analyze(df.sample(sample_size, random_state=42))

Issue: HTML report won't open

Solution: Save and open manually

report.show_html("report.html", open_browser=False)

Then open report.html in browser

Or specify layout

report.show_html("report.html", layout="vertical")

Issue: Categorical variables treated as numeric

Solution: Force categorical type

config = sv.FeatureConfig(force_cat=["zip_code", "rating"]) report = sv.analyze(df, feat_cfg=config)

Or convert before analysis

df["zip_code"] = df["zip_code"].astype(str)

Issue: Date columns not recognized

Solution: Convert to proper datetime

df["date_col"] = pd.to_datetime(df["date_col"]) report = sv.analyze(df)

Issue: Report shows too many categories

Sweetviz automatically limits to top categories

For custom handling, reduce cardinality before analysis

df["category"] = df["category"].apply( lambda x: x if x in top_categories else "Other" )

Version History

  • 1.0.0 (2026-01-17): Initial release

  • Basic EDA report generation (analyze)

  • Target variable analysis

  • Dataset comparison (compare)

  • Intra-set comparison (compare_intra)

  • Feature configuration options

  • Pairwise analysis control

  • ML profiling pipeline example

  • Data quality assessment example

  • Feature selection analysis example

  • Streamlit integration

  • Data pipeline integration

  • Best practices and troubleshooting

Resources

Generate powerful EDA comparison reports with Sweetviz - analyze, compare, and understand your data!

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

echarts

No summary provided by upstream source.

Repository SourceNeeds Review
General

pandoc

No summary provided by upstream source.

Repository SourceNeeds Review
General

mkdocs

No summary provided by upstream source.

Repository SourceNeeds Review
General

gis

No summary provided by upstream source.

Repository SourceNeeds Review