data-quality

Purpose

This skill ensures data accuracy, completeness, and consistency in pipelines by performing validation, cleaning, and monitoring. It integrates with data engineering workflows to catch issues early, using tools like rule-based checks and automated alerts.

When to Use

Use this skill when building or maintaining data pipelines that handle large datasets, such as ETL processes, to prevent downstream errors. Apply it for compliance checks in regulated industries or when integrating new data sources that might introduce inconsistencies. Avoid it for simple data tasks without validation needs.

Key Capabilities

Validate data schemas and values using predefined rules (e.g., check for nulls, duplicates, or type mismatches).
Clean datasets by applying transformations like trimming strings or imputing missing values.
Monitor pipelines in real-time with metrics and alerts for anomalies.
Support for various data formats (CSV, JSON, Parquet) and integration with storage systems like S3 or databases.
Generate reports or logs for auditing, including summary statistics and error counts.

Usage Patterns

Invoke this skill via CLI for quick checks or integrate it into Python scripts for automated pipelines. Always load configuration files first, then run validations. For monitoring, set up recurring jobs. Example pattern: Load data, apply validation, clean if needed, and monitor outputs.

Common Commands/API

Use the OpenClaw CLI for direct execution or the REST API for programmatic access. Authentication requires setting the environment variable $OPENCLAW_API_KEY before running commands.

CLI Command: Validate a file with rules

openclaw data-quality validate --file data.csv --rules config.json --output report.txt

This checks data against rules in config.json and saves results to report.txt.

CLI Command: Clean and monitor data

openclaw data-quality clean --input data.parquet --rules clean_rules.json --monitor

Applies cleaning rules and enables monitoring for ongoing checks.

API Endpoint: POST /api/data-quality/validate

Send a JSON body like:

{ "file_url": "s3://bucket/data.csv", "rules": { "columns": ["id", "name"], "checks": ["no_nulls"] } }

Response: JSON with status and errors, e.g., { "status": "success", "errors": [] } .

API Endpoint: GET /api/data-quality/monitor/{job_id}

Fetch monitoring status:

curl -H "Authorization: Bearer $OPENCLAW_API_KEY" https://api.openclaw.ai/api/data-quality/monitor/12345

Returns metrics like error rate.

Config format: Use JSON for rules, e.g.,

{ "checks": [ { "type": "null_check", "column": "age", "action": "alert" } ] }

Integration Notes

Integrate with Python using the OpenClaw SDK: Import and initialize with your API key. For example, in a data pipeline:

import openclaw

client = openclaw.Client(api_key=os.environ['OPENCLAW_API_KEY'])

result = client.validate_data(file_path='data.csv', rules={'checks': ['duplicates']})

Ensure data pipelines handle asynchronous API calls by checking response status codes. For cloud integration, configure webhooks for alerts, e.g., POST to your endpoint on validation failures. Always test integrations in a staging environment first.

Error Handling

Handle errors by checking return codes from CLI or API responses. For CLI, if a command fails, it exits with a non-zero code; parse stderr for details. In code, catch exceptions like:

try:

result = client.validate_data(...)

except openclaw.APIError as e:

log_error(e.message) # e.message contains error details

Common errors include authentication failures (e.g., 401 Unauthorized) or invalid configs (e.g., 400 Bad Request). Use retry logic for transient issues, like network errors, with exponential backoff.

Graph Relationships

Related to: data-processing (for pipeline integration), data-monitoring (for shared alerting features)
Depends on: data-storage (for accessing data sources)
Complements: machine-learning (for data prep in ML workflows)

data-quality

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

playwright-scraper

clawflows

tavily-web-search