Cluster — Data Clustering Analysis Tool
Cluster is a command-line data clustering analysis tool that supports k-means and hierarchical clustering algorithms. It reads numerical data from CSV/JSONL sources, performs clustering, evaluates cluster quality, and exports results.
Data is stored in ~/.cluster/data.jsonl as JSONL records. Each record represents a clustering run with its parameters, assignments, centroids, and evaluation metrics.
Prerequisites
- Python 3.8+ with standard library (no external packages required for basic operations)
bashshell
Commands
run
Run a clustering algorithm on input data.
Environment Variables:
INPUT(required) — Path to input CSV/JSONL file with numerical dataK— Number of clusters (default: 3)ALGORITHM— Algorithm to use:kmeansorhierarchical(default: kmeans)MAX_ITER— Maximum iterations for k-means (default: 100)SEED— Random seed for reproducibility
Example:
INPUT=/path/to/data.csv K=5 ALGORITHM=kmeans bash scripts/script.sh run
assign
Assign new data points to existing clusters from a previous run.
Environment Variables:
RUN_ID(required) — ID of the clustering run to useINPUT(required) — Path to new data points (CSV/JSONL)
Example:
RUN_ID=abc123 INPUT=/path/to/new_data.csv bash scripts/script.sh assign
centroids
Display or export centroid coordinates for a clustering run.
Environment Variables:
RUN_ID(required) — ID of the clustering runFORMAT— Output format:table,json,csv(default: table)
evaluate
Evaluate clustering quality with silhouette score, inertia, and Davies-Bouldin index.
Environment Variables:
RUN_ID(required) — ID of the clustering run to evaluate
visualize
Generate a text-based or ASCII visualization of cluster assignments.
Environment Variables:
RUN_ID(required) — ID of the clustering runDIMS— Dimensions to plot, comma-separated (default: first two)
export
Export clustering results to a file.
Environment Variables:
RUN_ID(required) — ID of the run to exportOUTPUT— Output file path (default: stdout)FORMAT— Export format:json,csv,jsonl(default: json)
import
Import a previously exported clustering run.
Environment Variables:
INPUT(required) — Path to the file to import
config
View or update configuration settings.
Environment Variables:
KEY— Configuration key to setVALUE— Configuration value
list
List all stored clustering runs with summary info.
Environment Variables:
LIMIT— Maximum runs to display (default: 20)SORT— Sort field:date,k,score(default: date)
stats
Show aggregate statistics across all clustering runs.
help
Display usage information and available commands.
version
Display the current version of the cluster tool.
Data Storage
All clustering runs are stored in ~/.cluster/data.jsonl. Each line is a JSON object with fields:
id— Unique run identifiertimestamp— ISO 8601 creation timealgorithm— Algorithm usedk— Number of clusterscentroids— List of centroid coordinatesassignments— Mapping of data point indices to cluster IDsmetrics— Evaluation metrics (silhouette, inertia, etc.)input_file— Source data file pathnum_points— Number of data points clustered
Configuration
Config is stored in ~/.cluster/config.json. Available keys:
default_k— Default number of clusters (default: 3)default_algorithm— Default algorithm (default: kmeans)max_iterations— Default max iterations (default: 100)random_seed— Default random seed (default: 42)
Powered by BytesAgain | bytesagain.com | hello@bytesagain.com