llmops-platform-engineering | V50.AI

LLMOps Platform Engineering

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "llmops-platform-engineering" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-llmops-platform-engineering

Design and operate an internal LLM platform that supports rapid experimentation without compromising reliability, cost, or compliance.

Outcomes

Standardized path from experiment to production
Safe model rollout with quality and safety gates
Repeatable infra modules for inference, vector DB, and observability
Clear ownership model across platform, app, and security teams

Reference Architecture

Control Plane: model registry, prompt/version catalog, policy checks, eval pipeline.
Data Plane: inference gateway, vector database, cache, feature store.
Ops Plane: telemetry, alerting, SLO dashboards, cost analytics.
Security Plane: IAM boundaries, secret rotation, content filters, audit logs.

Golden Delivery Workflow

Train or fine-tune a model, or onboard a provider model.
Register artifact and metadata (license, intended use, constraints).
Run automated eval suite (quality + safety + latency + cost).
Deploy canary behind gateway with strict traffic policy.
Promote after SLO and business KPI thresholds pass.
Keep rollback target hot for fast reversion.
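The canary, promotion, and rollback steps above can be sketched as a small router. This is a minimal illustration, not a gateway implementation: `CanaryRouter`, its field names, and the hash-bucket scheme are assumptions introduced here, and a real gateway would apply the same idea through its own routing configuration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CanaryRouter:
    """Hypothetical helper: sends a fixed share of traffic to a candidate model."""

    stable: str              # current production model, kept hot as the rollback target
    candidate: str           # model under canary evaluation
    canary_percent: int = 5  # share of requests (0-100) routed to the candidate

    def route(self, request_id: str) -> str:
        # Hash the request ID so a given request always lands on the same model.
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return self.candidate if bucket < self.canary_percent else self.stable

    def promote(self) -> None:
        # Shift all traffic to the candidate; the stable model stays deployed.
        self.canary_percent = 100

    def rollback(self) -> None:
        # Revert in one step by sending all traffic back to the stable model.
        self.canary_percent = 0


router = CanaryRouter(stable="model-v1", candidate="model-v2", canary_percent=10)
chosen = router.route("req-123")
```

Because the stable model is never removed from the router, `rollback()` is a single state change rather than a redeploy, which is the point of keeping the rollback target hot.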

CI/CD Design for AI Services

Build immutable containers with pinned dependencies and model hashes.
Use environment promotion: dev -> stage -> prod.
Fail the deployment if any of the following holds: regression evals drop below baseline, safety tests exceed the risk threshold, or p95 latency exceeds the SLO budget.
Store deployment evidence for audits (commit SHA, eval report, approver).
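One way to express the three fail conditions and the audit evidence is a gate function that a CI job calls before promotion. The sketch below assumes a single scalar per check; `EvalReport`, `GatePolicy`, `gate_failures`, and `evidence_record` are hypothetical names, and the score scales are placeholders for whatever the eval suite actually emits.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class EvalReport:
    quality_score: float   # assumed: higher is better
    safety_risk: float     # assumed: lower is better
    p95_latency_ms: float


@dataclass
class GatePolicy:
    baseline_quality: float
    max_safety_risk: float
    p95_budget_ms: float


def gate_failures(report: EvalReport, policy: GatePolicy) -> List[str]:
    """Return the reasons a deployment should fail; an empty list means pass."""
    failures = []
    if report.quality_score < policy.baseline_quality:
        failures.append("regression evals below baseline")
    if report.safety_risk > policy.max_safety_risk:
        failures.append("safety tests exceed risk threshold")
    if report.p95_latency_ms > policy.p95_budget_ms:
        failures.append("p95 latency exceeds SLO budget")
    return failures


def evidence_record(commit_sha: str, report: EvalReport,
                    approver: str, failures: List[str]) -> str:
    """Serialize the audit evidence (commit SHA, eval report, approver) as JSON."""
    return json.dumps(
        {
            "commit_sha": commit_sha,
            "eval_report": asdict(report),
            "approver": approver,
            "passed": not failures,
            "failures": failures,
        },
        sort_keys=True,
    )


report = EvalReport(quality_score=0.80, safety_risk=0.02, p95_latency_ms=1500.0)
policy = GatePolicy(baseline_quality=0.85, max_safety_risk=0.05, p95_budget_ms=1200.0)
failures = gate_failures(report, policy)
record = evidence_record("abc1234", report, "release-manager", failures)
```

A CI step could exit non-zero whenever `failures` is non-empty and store `record` alongside the build artifacts.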

Operational SLOs

Availability: 99.9% for synchronous inference endpoints.
Latency: p95 under a product-specific target (for example, <1200ms).
Cost: per-request and per-tenant budget ceilings.
Quality: task success rate and groundedness thresholds.
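To make the availability and latency targets concrete, the arithmetic can be sketched as follows. The 30-day window and the nearest-rank percentile method are assumptions chosen for illustration; monitoring systems often use other windows and interpolation methods.

```python
from typing import Sequence


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of observed latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


budget = error_budget_minutes(0.999)           # roughly 43.2 minutes per 30 days
within_slo = p95([300, 450, 800, 1100]) < 1200  # compare against the example target
```

At 99.9% availability the budget works out to about 43 minutes of downtime per 30 days, which is a useful bound when sizing rollback and failover times.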

Platform Guardrails

Enforce tenant quotas and model allow-lists.
Require structured output contracts for automation paths.
Default to low-risk model settings for critical workflows.
Disable unconstrained tool execution in production.
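The quota, allow-list, and structured-output guardrails can be sketched as two checks that a gateway might run around each call. Everything named here (`TenantPolicy`, `authorize`, `parse_contract`, `GuardrailError`, and the token-based quota) is a hypothetical illustration, and the contract check validates only top-level key types rather than a full schema.

```python
import json
from dataclasses import dataclass
from typing import Dict, Set


class GuardrailError(Exception):
    """Raised when a request or response violates a platform guardrail."""


@dataclass
class TenantPolicy:
    allowed_models: Set[str]
    monthly_token_quota: int
    tokens_used: int = 0


def authorize(policy: TenantPolicy, model: str, requested_tokens: int) -> None:
    """Enforce the model allow-list and tenant quota before calling a model."""
    if model not in policy.allowed_models:
        raise GuardrailError(f"model {model!r} is not on the tenant allow-list")
    if policy.tokens_used + requested_tokens > policy.monthly_token_quota:
        raise GuardrailError("tenant token quota exceeded")
    policy.tokens_used += requested_tokens


def parse_contract(raw_output: str, required: Dict[str, type]) -> dict:
    """Reject model output that does not match the expected JSON contract."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise GuardrailError("model output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise GuardrailError("model output must be a JSON object")
    for key, expected_type in required.items():
        if not isinstance(data.get(key), expected_type):
            raise GuardrailError(f"field {key!r} is missing or has the wrong type")
    return data


policy = TenantPolicy(allowed_models={"small-model"}, monthly_token_quota=1000)
authorize(policy, "small-model", requested_tokens=400)
result = parse_contract('{"action": "close_ticket", "ticket_id": 42}',
                        {"action": str, "ticket_id": int})
```

Rejecting malformed output before it reaches downstream automation keeps a bad generation from turning into a bad action.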

Tooling Stack (Example)

Orchestration: Argo Workflows / GitHub Actions / Airflow.
Model Registry: MLflow / custom metadata DB.
Gateway: LiteLLM / Envoy-based API gateway.
Observability: OpenTelemetry + Prometheus + Grafana + Langfuse.
Policy: OPA/Rego for deployment and runtime checks.

Incident Readiness

Runbooks for model outage, provider timeout spikes, and cost surges.
Chaos drills for provider failover and vector DB degradation.
Pre-approved rollback path with one-command execution.
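A provider failover drill can be rehearsed against a small wrapper like the one below, with a deliberately failing provider injected first. This is a sketch under stated assumptions: `call_with_failover` and `AllProvidersFailed` are hypothetical names, providers are modeled as plain callables, and a production gateway would catch narrower exception types and add timeouts and backoff.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every configured provider fails for a request."""


def call_with_failover(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in priority order and return (provider_name, response)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose for the sketch
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_down(prompt: str) -> str:
    # Chaos injection: simulate a provider timeout spike.
    raise TimeoutError("simulated provider timeout")


def backup(prompt: str) -> str:
    return f"echo: {prompt}"


served_by, response = call_with_failover(
    "ping", [("primary", primary_down), ("backup", backup)]
)
```

Running the drill with the failing provider first confirms that traffic reaches the backup and that the collected errors are available for the runbook.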

Related Skills

ai-pipeline-orchestration - Orchestrate ingestion and inference workflows
agent-evals - Build evaluation gates for releases
llm-gateway - Route and control LLM traffic

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
