LLMOps Platform Engineering

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and any repository scripts before running them.

Install skill "llmops-platform-engineering" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-llmops-platform-engineering

Design and operate an internal LLM platform that supports rapid experimentation without compromising reliability, cost, or compliance.

Outcomes

  • Standardized path from experiment to production

  • Safe model rollout with quality and safety gates

  • Repeatable infra modules for inference, vector DB, and observability

  • Clear ownership model across platform, app, and security teams

Reference Architecture

  • Control Plane: model registry, prompt/version catalog, policy checks, eval pipeline.

  • Data Plane: inference gateway, vector database, cache, feature store.

  • Ops Plane: telemetry, alerting, SLO dashboards, cost analytics.

  • Security Plane: IAM boundaries, secret rotation, content filters, audit logs.

Golden Delivery Workflow

  • Train/fine-tune or onboard provider model.

  • Register artifact and metadata (license, intended use, constraints).

  • Run automated eval suite (quality + safety + latency + cost).

  • Deploy canary behind gateway with strict traffic policy.

  • Promote after SLO and business KPI thresholds pass.

  • Keep rollback target hot for fast reversion.
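
The promotion step above can be sketched as a simple gate function. The metric names and default thresholds here are illustrative stand-ins, not values from any specific platform:

```python
# Hypothetical promotion gate: a canary graduates to full traffic only
# when every SLO and business-KPI threshold passes. All names and
# defaults are illustrative.
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    availability: float       # fraction of successful requests
    p95_latency_ms: float     # 95th-percentile latency
    task_success_rate: float  # business KPI, e.g. tasks completed correctly

def should_promote(m: CanaryMetrics,
                   min_availability: float = 0.999,
                   max_p95_ms: float = 1200.0,
                   min_success: float = 0.90) -> bool:
    """Promote only when every gate passes; otherwise keep the canary pinned."""
    return (m.availability >= min_availability
            and m.p95_latency_ms <= max_p95_ms
            and m.task_success_rate >= min_success)
```

A single failing gate blocks promotion, which keeps the decision auditable: each threshold maps to one line of the check.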

CI/CD Design for AI Services

  • Build immutable containers with pinned dependencies and model hashes.

  • Use environment promotion: dev -> stage -> prod.

  • Fail deployment if:

      • regression evals drop below baseline,

      • safety tests exceed risk threshold,

      • p95 latency exceeds SLO budget.

  • Store deployment evidence for audits (commit SHA, eval report, approver).
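
The fail-deployment conditions above can be expressed as one CI gate that returns every failed check, so the deploy log records all reasons at once. The report keys are an assumption about your eval pipeline's output shape:

```python
# Sketch of a CI deployment gate. The `report` dict shape (quality_score,
# safety_risk, p95_latency_ms) is assumed; adapt it to your eval suite.
def deploy_gate(report: dict,
                baseline_quality: float,
                max_safety_risk: float,
                slo_p95_ms: float) -> list[str]:
    """Return the list of failed gates; an empty list means safe to deploy."""
    failures = []
    if report["quality_score"] < baseline_quality:
        failures.append("regression: quality below baseline")
    if report["safety_risk"] > max_safety_risk:
        failures.append("safety: risk threshold exceeded")
    if report["p95_latency_ms"] > slo_p95_ms:
        failures.append("latency: p95 over SLO budget")
    return failures
```

Returning the full failure list (rather than failing fast) also gives you the evidence artifact to store alongside the commit SHA and approver.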

Operational SLOs

  • Availability: 99.9% for synchronous inference endpoints.

  • Latency: p95 under a product-specific target (for example, <1200 ms).

  • Cost: per-request and per-tenant budget ceilings.

  • Quality: task success rate and groundedness thresholds.
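
To make the latency SLO concrete, here is a minimal nearest-rank p95 over a latency sample. In practice this number would come from a metrics backend such as Prometheus rather than raw samples:

```python
# Nearest-rank percentile: sort the sample, take the value at rank
# ceil(0.95 * n). Illustrative only; production SLOs should use the
# metrics backend's percentile query.
import math

def p95(latencies_ms: list[float]) -> float:
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank definition
    return ordered[rank - 1]
```

Note that with small samples p95 is dominated by the worst request, so SLO checks should run over a sufficiently large window.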

Platform Guardrails

  • Enforce tenant quotas and model allow-lists.

  • Require structured output contracts for automation paths.

  • Default to low-risk model settings for critical workflows.

  • Disable unconstrained tool execution in production.

Tooling Stack (Example)

  • Orchestration: Argo Workflows / GitHub Actions / Airflow.

  • Model Registry: MLflow / custom metadata DB.

  • Gateway: LiteLLM / Envoy-based API gateway.

  • Observability: OpenTelemetry + Prometheus + Grafana + Langfuse.

  • Policy: OPA/Rego for deployment and runtime checks.

Incident Readiness

  • Runbooks for model outage, provider timeout spikes, and cost surges.

  • Chaos drills for provider failover and vector DB degradation.

  • Pre-approved rollback path with one-command execution.
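
Because the rollback target is kept hot, the one-command rollback reduces to a pointer swap at the gateway. The classes below are in-memory stand-ins for real registry and gateway clients:

```python
# Sketch of a pre-approved one-command rollback. Gateway and the
# known-good mapping are hypothetical stand-ins for real clients.
class Gateway:
    def __init__(self):
        self.pinned: dict[str, str] = {}

    def pin(self, service: str, version: str) -> None:
        self.pinned[service] = version

def rollback(gateway: Gateway, known_good: dict[str, str], service: str) -> str:
    """Revert `service` to its recorded known-good version in one call."""
    target = known_good[service]  # kept registered and hot, per the runbook
    gateway.pin(service, target)
    return target
```

Keeping the known-good version deployed (not just recorded) is what makes this a seconds-scale operation instead of a redeploy.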

Related Skills

  • ai-pipeline-orchestration - Orchestrate ingestion and inference workflows

  • agent-evals - Build evaluation gates for releases

  • llm-gateway - Route and control LLM traffic
