clustering-analyzer

Cluster data using K-Means, DBSCAN, hierarchical clustering. Use for customer segmentation, pattern discovery, or data grouping.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "clustering-analyzer" with this command: npx skills add dkyazzentwatwa/chatgpt-skills/dkyazzentwatwa-chatgpt-skills-clustering-analyzer

Clustering Analyzer

Analyze and cluster data using multiple algorithms with visualization and evaluation.

Features

  • K-Means: Partition-based clustering with elbow method
  • DBSCAN: Density-based clustering for arbitrary shapes
  • Hierarchical: Agglomerative clustering with dendrograms
  • Evaluation: Silhouette scores, cluster statistics
  • Visualization: 2D/3D plots, dendrograms, elbow curves
  • Export: Labeled data, cluster summaries

Quick Start

from clustering_analyzer import ClusteringAnalyzer

analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv")

# K-Means clustering
result = analyzer.kmeans(n_clusters=3)
print(f"Silhouette Score: {result['silhouette_score']:.3f}")

# Visualize
analyzer.plot_clusters("clusters.png")

CLI Usage

# K-Means clustering
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3

# Find optimal clusters (elbow method)
python clustering_analyzer.py --input data.csv --method kmeans --find-optimal

# DBSCAN clustering
python clustering_analyzer.py --input data.csv --method dbscan --eps 0.5 --min-samples 5

# Hierarchical clustering
python clustering_analyzer.py --input data.csv --method hierarchical --clusters 4

# Generate plots
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --plot clusters.png

# Export labeled data
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --output labeled.csv

# Select specific columns
python clustering_analyzer.py --input data.csv --columns age,income,spending --method kmeans --clusters 3

API Reference

ClusteringAnalyzer Class

class ClusteringAnalyzer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, columns: list = None) -> 'ClusteringAnalyzer'
    def load_dataframe(self, df: pd.DataFrame, columns: list = None) -> 'ClusteringAnalyzer'

    # Clustering methods
    def kmeans(self, n_clusters: int, **kwargs) -> dict
    def dbscan(self, eps: float = 0.5, min_samples: int = 5) -> dict
    def hierarchical(self, n_clusters: int, linkage: str = "ward") -> dict

    # Optimal clusters
    def find_optimal_clusters(self, max_k: int = 10) -> dict
    def elbow_plot(self, output: str, max_k: int = 10) -> str

    # Evaluation
    def silhouette_score(self) -> float
    def cluster_statistics(self) -> dict

    # Visualization
    def plot_clusters(self, output: str, dimensions: list = None) -> str
    def plot_dendrogram(self, output: str) -> str
    def plot_silhouette(self, output: str) -> str

    # Export
    def get_labels(self) -> list
    def to_dataframe(self) -> pd.DataFrame
    def save_labeled(self, output: str) -> str

Clustering Methods

K-Means

Best for spherical clusters with known number of groups:

result = analyzer.kmeans(n_clusters=3)

# Returns:
{
    "labels": [0, 1, 2, 0, ...],
    "n_clusters": 3,
    "silhouette_score": 0.65,
    "inertia": 1234.56,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "centroids": [[...], [...], [...]]
}

DBSCAN

Best for arbitrary-shaped clusters:

result = analyzer.dbscan(eps=0.5, min_samples=5)

# Returns:
{
    "labels": [0, 0, 1, -1, ...],  # -1 = noise
    "n_clusters": 3,
    "n_noise": 15,
    "silhouette_score": 0.58,
    "cluster_sizes": {0: 150, 1: 200, 2: 100}
}

Hierarchical (Agglomerative)

Best for understanding cluster hierarchy:

result = analyzer.hierarchical(n_clusters=4, linkage="ward")

# Returns:
{
    "labels": [0, 1, 2, 3, ...],
    "n_clusters": 4,
    "silhouette_score": 0.62,
    "cluster_sizes": {0: 100, 1: 150, 2: 120, 3: 80}
}

Finding Optimal Clusters

Elbow Method

optimal = analyzer.find_optimal_clusters(max_k=10)

# Returns:
{
    "optimal_k": 4,
    "inertias": [1000, 800, 500, 300, 280, ...],
    "silhouettes": [0.5, 0.55, 0.6, 0.65, 0.63, ...]
}

Elbow Plot

analyzer.elbow_plot("elbow.png", max_k=10)

Generates plot showing inertia vs number of clusters.

Cluster Statistics

stats = analyzer.cluster_statistics()

# Returns:
{
    "n_clusters": 3,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "cluster_means": {
        0: {"age": 25.5, "income": 45000, ...},
        1: {"age": 45.2, "income": 75000, ...},
        2: {"age": 35.1, "income": 55000, ...}
    },
    "cluster_std": {
        0: {"age": 5.2, "income": 8000, ...},
        ...
    },
    "overall_silhouette": 0.65
}

Visualization

Cluster Plot

# 2D plot (uses first 2 features or PCA)
analyzer.plot_clusters("clusters_2d.png")

# Specify dimensions
analyzer.plot_clusters("clusters.png", dimensions=["age", "income"])

Dendrogram

# For hierarchical clustering
analyzer.hierarchical(n_clusters=4)
analyzer.plot_dendrogram("dendrogram.png")

Silhouette Plot

analyzer.plot_silhouette("silhouette.png")

Shows silhouette coefficient for each sample.

Export Results

Get Labels

labels = analyzer.get_labels()
# [0, 1, 2, 0, 1, ...]

Save Labeled Data

analyzer.save_labeled("labeled_data.csv")
# Original data + cluster_label column

Get Full DataFrame

df = analyzer.to_dataframe()
# DataFrame with cluster_label column

Example Workflows

Customer Segmentation

analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv", columns=["age", "income", "spending_score"])

# Find optimal number of segments
optimal = analyzer.find_optimal_clusters(max_k=8)
print(f"Optimal segments: {optimal['optimal_k']}")

# Cluster with optimal k
result = analyzer.kmeans(n_clusters=optimal['optimal_k'])

# Get segment characteristics
stats = analyzer.cluster_statistics()
for cluster_id, means in stats["cluster_means"].items():
    print(f"\nSegment {cluster_id}:")
    for feature, value in means.items():
        print(f"  {feature}: {value:.2f}")

# Save segmented data
analyzer.save_labeled("customer_segments.csv")

Anomaly Detection with DBSCAN

analyzer = ClusteringAnalyzer()
analyzer.load_csv("transactions.csv", columns=["amount", "frequency"])

# DBSCAN identifies noise points as potential anomalies
result = analyzer.dbscan(eps=0.3, min_samples=10)

print(f"Found {result['n_noise']} potential anomalies")

# Get anomalous records
df = analyzer.to_dataframe()
anomalies = df[df["cluster_label"] == -1]

Document Clustering

# After TF-IDF transformation
analyzer = ClusteringAnalyzer()
analyzer.load_dataframe(tfidf_matrix)

# Hierarchical clustering to see document relationships
result = analyzer.hierarchical(n_clusters=5)
analyzer.plot_dendrogram("doc_dendrogram.png")

Data Preprocessing

The analyzer automatically:

  • Handles missing values (imputation)
  • Scales features (standardization)
  • Reduces dimensions for visualization (PCA)

For custom preprocessing:

from sklearn.preprocessing import StandardScaler

# Preprocess manually
df = pd.read_csv("data.csv")
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Load preprocessed data
analyzer.load_dataframe(df_scaled)

Dependencies

  • scikit-learn>=1.3.0
  • pandas>=2.0.0
  • numpy>=1.24.0
  • matplotlib>=3.7.0
  • scipy>=1.10.0

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

ocr-document-processor

No summary provided by upstream source.

Repository SourceNeeds Review
General

text-summarizer

No summary provided by upstream source.

Repository SourceNeeds Review
General

document-converter-suite

No summary provided by upstream source.

Repository SourceNeeds Review
General

financial-calculator

No summary provided by upstream source.

Repository SourceNeeds Review