using-deep-rl

Using Deep RL Meta-Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "using-deep-rl" with this command: npx skills add tachyon-beep/skillpacks/tachyon-beep-skillpacks-using-deep-rl


When to Use This Skill

Invoke this meta-skill when you encounter:

  • RL Implementation: Implementing reinforcement learning algorithms (Q-learning, DQN, PPO, SAC, etc.)

  • Agent Training: Training agents in environments (games, robotics, control systems)

  • Sequential Decision-Making: Problems requiring learning from trial and error

  • Policy Optimization: Learning policies that maximize cumulative rewards

  • RL Debugging: Debugging training issues, agents not learning, reward problems

This is the entry point for the deep-rl pack. It routes to 13 specialized skills based on problem characteristics.

How to Access Reference Sheets

IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.

When this skill is loaded from: skills/using-deep-rl/SKILL.md

Reference sheets like rl-foundations.md are at: skills/using-deep-rl/rl-foundations.md

NOT at: skills/rl-foundations.md ← WRONG PATH
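The path rule above can be sketched in a few lines of Python. This is only an illustration, not part of the skill pack: `resolve_sheet` is a hypothetical helper name, and POSIX-style paths are assumed, as in the examples.

```python
from pathlib import PurePosixPath


def resolve_sheet(skill_md_path: str, sheet_name: str) -> PurePosixPath:
    """Return the path of a reference sheet that sits next to SKILL.md.

    Hypothetical helper: the sheet is resolved against the directory that
    contains SKILL.md, never against the parent `skills/` directory.
    """
    return PurePosixPath(skill_md_path).parent / sheet_name


print(resolve_sheet("skills/using-deep-rl/SKILL.md", "rl-foundations.md"))
# prints: skills/using-deep-rl/rl-foundations.md
```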

Core Principle

Problem type determines algorithm family.

The correct approach depends on:

  • Action Space: Discrete (button presses) vs Continuous (joint angles)

  • Data Regime: Online (interact with environment) vs Offline (fixed dataset)

  • Experience Level: Need foundations vs ready to implement

  • Special Requirements: Multi-agent, model-based, exploration, reward design

Always clarify the problem BEFORE suggesting algorithms.

The 13 Deep RL Skills

  • rl-foundations - MDP formulation, Bellman equations, value vs policy basics

  • value-based-methods - Q-learning, DQN, Double DQN, Dueling DQN, Rainbow

  • policy-gradient-methods - REINFORCE, PPO, TRPO, policy optimization

  • actor-critic-methods - A2C, A3C, SAC, TD3, advantage functions

  • model-based-rl - World models, Dyna, MBPO, planning with learned models

  • offline-rl - Batch RL, CQL, IQL, learning from fixed datasets

  • multi-agent-rl - MARL, cooperative/competitive, communication

  • exploration-strategies - ε-greedy, UCB, curiosity, RND, intrinsic motivation

  • reward-shaping - Reward design, potential-based shaping, inverse RL

  • counterfactual-reasoning - Causal inference, HER, off-policy evaluation, twin networks

  • rl-debugging - Common RL bugs, why not learning, systematic debugging

  • rl-environments - Gym, MuJoCo, custom envs, wrappers, vectorization

  • rl-evaluation - Evaluation methodology, variance, sample efficiency metrics

Routing Decision Framework

Step 1: Assess Experience Level

  • If user asks "what is RL" or "how does RL work" → rl-foundations

  • If confused about value vs policy, on-policy vs off-policy → rl-foundations

  • If user has specific problem and RL background → Continue to Step 2

Why foundations first: You cannot implement algorithms correctly without understanding MDPs, Bellman equations, and the exploration-exploitation tradeoff.

Step 2: Classify Action Space

Discrete Actions (buttons, menu selections, discrete signals)

| Condition | Route To | Why |
|---|---|---|
| Small action space (< 100) + online | value-based-methods (DQN) | Q-networks excel at discrete |
| Large action space OR need policy flexibility | policy-gradient-methods (PPO) | Scales to larger spaces |

Continuous Actions (joint angles, motor forces, steering)

| Condition | Route To | Why |
|---|---|---|
| Sample efficiency critical | actor-critic-methods (SAC) | Off-policy, automatic entropy |
| Stability critical | actor-critic-methods (TD3) | Deterministic, handles overestimation |
| Simplicity preferred | policy-gradient-methods (PPO) | On-policy, simpler |

CRITICAL: NEVER suggest DQN for continuous actions. DQN selects actions by taking a maximum over a finite set of Q-values, so it requires discrete actions.
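One way to see why DQN is tied to discrete actions: its greedy policy picks the action with the highest Q-value, which is a cheap lookup only when there is one value per action. The sketch below uses a plain list in place of a Q-network output; `greedy_action` and the numbers are illustrative assumptions, not taken from the skill pack.

```python
from typing import Sequence


def greedy_action(q_values: Sequence[float]) -> int:
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Made-up Q-values for a 4-action environment, standing in for a network output.
q_values = [0.1, 1.7, -0.3, 0.9]
print(greedy_action(q_values))  # prints: 1
```

With a continuous action such as a joint torque there is no finite list to enumerate, which is why continuous control is routed to SAC, TD3, or PPO.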

Step 3: Identify Data Regime

Online Learning (Agent Interacts with Environment)

  • Discrete → value-based-methods OR policy-gradient-methods

  • Continuous → actor-critic-methods

  • Sample efficiency critical → Consider model-based-rl

Offline Learning (Fixed Dataset, No Interaction)

→ offline-rl (CQL, IQL)

Red Flag: If the user has a fixed dataset and suggests DQN/PPO/SAC, STOP and route to offline-rl. Standard algorithms assume online interaction; trained on a fixed dataset they suffer from distribution shift and typically fail.

Step 4: Special Problem Types

| Problem | Route To | Key Consideration |
|---|---|---|
| Multiple agents | multi-agent-rl | Non-stationarity, credit assignment |
| Sample efficiency extreme | model-based-rl | Learns environment model |
| Counterfactual/causal | counterfactual-reasoning | HER, off-policy evaluation |

Step 5: Debugging and Infrastructure

| Problem | Route To | Why |
|---|---|---|
| "Not learning" / reward flat | rl-debugging FIRST | 80% of issues are bugs, not algorithms |
| Exploration problems | exploration-strategies | Curiosity, RND, intrinsic motivation |
| Reward design issues | reward-shaping | Potential-based shaping, inverse RL |
| Environment setup | rl-environments | Gym API, wrappers, vectorization |
| Evaluation questions | rl-evaluation | Deterministic vs stochastic, multiple seeds |

Red Flag: If user immediately wants to change algorithms because "it's not learning," route to rl-debugging first.
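The Step 5 routing can be approximated with an ordered keyword lookup. This is a rough sketch: the keyword lists and the `route_symptom` name are assumptions made for illustration, and real requests need judgment rather than substring matching. The one property it is meant to show is the ordering, where "not learning" symptoms reach rl-debugging before any other route is considered.

```python
from typing import Optional

# Ordered: "not learning" symptoms are checked first, per the red flag above.
# The keywords are illustrative guesses, not an official list.
SYMPTOM_ROUTES = [
    (("not learning", "reward flat"), "rl-debugging"),
    (("exploration",), "exploration-strategies"),
    (("reward design", "reward shaping"), "reward-shaping"),
    (("environment", "gym", "wrapper"), "rl-environments"),
    (("evaluation", "evaluate", "seeds"), "rl-evaluation"),
]


def route_symptom(description: str) -> Optional[str]:
    """Return the first skill whose keywords appear in the description, else None."""
    text = description.lower()
    for keywords, skill in SYMPTOM_ROUTES:
        if any(keyword in text for keyword in keywords):
            return skill
    return None


print(route_symptom("It's not learning, should I switch from DQN to PPO?"))
# prints: rl-debugging
```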

Rationalization Resistance Table

| Rationalization | Reality | Counter-Guidance |
|---|---|---|
| "Just use PPO for everything" | PPO is general but not optimal for all cases | Clarify: discrete or continuous? Sample efficiency constraints? |
| "DQN for continuous actions" | DQN requires discrete actions | Use SAC or TD3 for continuous |
| "Offline RL is just RL on a dataset" | Offline has distribution shift, needs special algorithms | Route to offline-rl for CQL, IQL |
| "More data always helps" | Sample efficiency and distribution matter | Off-policy vs on-policy matters |
| "My algorithm isn't learning, I need a better one" | Usually bugs, not algorithm | Route to rl-debugging first |
| "I'll discretize continuous actions for DQN" | Discretization loses precision, explodes action space | Use actor-critic-methods |
| "Epsilon-greedy is enough for exploration" | Complex environments need sophisticated exploration | Route to exploration-strategies |
| "I'll just increase the reward when it doesn't learn" | Reward scaling breaks learning | Route to rl-debugging |
| "I can reuse online RL code for offline data" | Offline needs conservative algorithms | Route to offline-rl |
| "Test reward lower than training = overfitting" | Exploration vs exploitation difference | Route to rl-evaluation |

Red Flags Checklist

Watch for these signs of incorrect routing:

  • Algorithm-First Thinking: Recommending algorithm before asking about action space, data regime

  • DQN for Continuous: Suggesting DQN/Q-learning for continuous action spaces

  • Offline Blindness: Not recognizing fixed dataset requires offline-rl

  • PPO Cargo-Culting: Defaulting to PPO without considering alternatives

  • No Problem Characterization: Not asking: discrete vs continuous? online vs offline?

  • Skipping Foundations: Implementing algorithms when user doesn't understand RL basics

  • Debug-Last: Suggesting algorithm changes before systematic debugging

  • Sample Efficiency Ignorance: Not asking about sample constraints

If any red flag is triggered → STOP → Ask diagnostic questions → Route correctly
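Three of the red flags above are mechanical enough to check in code; the rest depend on conversational context. The sketch below is a hedged illustration: `red_flags`, its parameters, and the algorithm sets are hypothetical names chosen here, with the sets limited to algorithms this document names.

```python
from typing import List, Optional

DISCRETE_ONLY = {"DQN", "Q-LEARNING"}      # value-based methods named in the checklist
ONLINE_ALGORITHMS = {"DQN", "PPO", "SAC"}  # the online examples this document names


def red_flags(
    proposed: str,
    action_space: Optional[str] = None,  # "discrete", "continuous", or None if not asked yet
    data_regime: Optional[str] = None,   # "online", "offline", or None if not asked yet
) -> List[str]:
    """Return the names of the red flags a proposed algorithm triggers."""
    flags = []
    algorithm = proposed.upper()
    if action_space is None or data_regime is None:
        # An algorithm was proposed before the problem was characterized.
        flags.append("Algorithm-First Thinking")
    if action_space == "continuous" and algorithm in DISCRETE_ONLY:
        flags.append("DQN for Continuous")
    if data_regime == "offline" and algorithm in ONLINE_ALGORITHMS:
        flags.append("Offline Blindness")
    return flags


print(red_flags("DQN", action_space="continuous", data_regime="offline"))
# prints: ['DQN for Continuous', 'Offline Blindness']
```

An empty list does not mean the routing is right; it only means none of these three checks fired.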

Routing Decision Tree Summary

START: RL problem
│
├─ Need foundations? → rl-foundations
│
├─ DISCRETE actions?
│  ├─ Small space + online → value-based-methods (DQN)
│  └─ Large space → policy-gradient-methods (PPO)
│
├─ CONTINUOUS actions?
│  ├─ Sample efficiency → actor-critic-methods (SAC)
│  ├─ Stability → actor-critic-methods (TD3)
│  └─ Simplicity → policy-gradient-methods (PPO)
│
├─ OFFLINE data? → offline-rl (CQL, IQL) [CRITICAL]
│
├─ MULTI-AGENT? → multi-agent-rl
│
├─ Sample efficiency EXTREME? → model-based-rl
│
├─ COUNTERFACTUAL? → counterfactual-reasoning
│
└─ DEBUGGING?
   ├─ Not learning → rl-debugging
   ├─ Exploration → exploration-strategies
   ├─ Reward design → reward-shaping
   ├─ Environment → rl-environments
   └─ Evaluation → rl-evaluation
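The tree above could be encoded as a small function. The sketch below makes several assumptions that the tree itself does not state: the `Problem` fields and `route` name are hypothetical, the special cases (offline, multi-agent, extreme sample efficiency, counterfactual) are checked before the action space because the offline branch is marked critical, and the debugging branch is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    """Answers to the diagnostic questions (hypothetical field names)."""
    needs_foundations: bool = False
    offline: bool = False
    multi_agent: bool = False
    extreme_sample_efficiency: bool = False
    counterfactual: bool = False
    action_space: Optional[str] = None  # "discrete" or "continuous"
    n_actions: Optional[int] = None     # only meaningful for discrete spaces
    priority: str = "simplicity"        # "sample_efficiency", "stability", or "simplicity"


def route(p: Problem) -> str:
    """Return the name of the skill to load for a characterized problem."""
    if p.needs_foundations:
        return "rl-foundations"
    # Assumed ordering: special cases override the action-space branches.
    if p.offline:
        return "offline-rl"
    if p.multi_agent:
        return "multi-agent-rl"
    if p.extreme_sample_efficiency:
        return "model-based-rl"
    if p.counterfactual:
        return "counterfactual-reasoning"
    if p.action_space == "discrete":
        small = p.n_actions is not None and p.n_actions < 100
        return "value-based-methods" if small else "policy-gradient-methods"
    if p.action_space == "continuous":
        if p.priority in ("sample_efficiency", "stability"):
            return "actor-critic-methods"  # SAC or TD3 respectively
        return "policy-gradient-methods"   # PPO
    raise ValueError("action space unknown: ask the diagnostic questions first")


print(route(Problem(action_space="discrete", n_actions=4)))
# prints: value-based-methods
```

Raising on an unknown action space, rather than falling back to a default algorithm, mirrors the rule that problem characterization comes before algorithm selection.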

Diagnostic Questions

Action Space

  • "Discrete choices or continuous values?"

  • "How many actions? Small (< 100), large, or infinite?"

Data Regime

  • "Can agent interact with environment, or fixed dataset?"

  • "Online learning or offline?"

Experience Level

  • "New to RL, or specific problem?"

  • "Understand MDPs, value functions, policy gradients?"

Special Requirements

  • "Multiple agents? Cooperate or compete?"

  • "Sample efficiency critical? How many episodes?"

  • "Sparse reward (only at goal) or dense (every step)?"
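As a rough sketch of "ask before routing", the questions above can be keyed by the fact each one establishes, so that only the unanswered ones are asked. The field names and the `questions_to_ask` helper are assumptions made for this example; the question text is quoted from the lists above.

```python
from typing import Dict, List, Optional

# One representative question per fact that routing depends on.
QUESTIONS: Dict[str, str] = {
    "action_space": "Discrete choices or continuous values?",
    "data_regime": "Can agent interact with environment, or fixed dataset?",
    "experience": "New to RL, or specific problem?",
    "sample_budget": "Sample efficiency critical? How many episodes?",
}


def questions_to_ask(known: Dict[str, Optional[str]]) -> List[str]:
    """Return the question for every field that is missing or None in `known`."""
    return [question for field, question in QUESTIONS.items() if known.get(field) is None]


for question in questions_to_ask({"action_space": "continuous", "experience": "specific problem"}):
    print(question)
# prints:
# Can agent interact with environment, or fixed dataset?
# Sample efficiency critical? How many episodes?
```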

When NOT to Use This Pack

| User Request | Correct Pack | Reason |
|---|---|---|
| "Train classifier on labeled data" | training-optimization | Supervised learning |
| "Design transformer architecture" | neural-architectures | Architecture design |
| "Deploy model to production" | ml-production | Deployment |
| "Fine-tune LLM with RLHF" | llm-specialist | LLM-specific |

Multi-Skill Scenarios

See multi-skill-scenarios.md for detailed routing sequences:

  • Complete beginner to RL

  • Continuous control (robotics)

  • Offline RL from dataset

  • Multi-agent cooperative task

  • Sample-efficient learning

  • Sparse reward problem

  • RL-controlled neural architecture

Final Reminders

  • Problem characterization BEFORE algorithm selection

  • DQN for discrete ONLY (never continuous)

  • Offline data needs offline-rl (CQL, IQL)

  • PPO is not universal (good general-purpose, not optimal everywhere)

  • Debug before changing algorithms (route to rl-debugging)

  • Ask questions, don't assume (action space? data regime?)

Deep RL Specialist Skills

After routing, load the appropriate specialist skill for detailed guidance:

  • rl-foundations.md - MDP formulation, Bellman equations, value vs policy basics

  • value-based-methods.md - Q-learning, DQN, Double DQN, Dueling DQN, Rainbow

  • policy-gradient-methods.md - REINFORCE, PPO, TRPO, policy optimization

  • actor-critic-methods.md - A2C, A3C, SAC, TD3, advantage functions

  • model-based-rl.md - World models, Dyna, MBPO, planning with learned models

  • offline-rl.md - Batch RL, CQL, IQL, learning from fixed datasets

  • multi-agent-rl.md - MARL, cooperative/competitive, communication

  • exploration-strategies.md - ε-greedy, UCB, curiosity, RND, intrinsic motivation

  • reward-shaping-engineering.md - Reward design, potential-based shaping, inverse RL

  • counterfactual-reasoning.md - Causal inference, HER, off-policy evaluation, twin networks

  • rl-debugging.md - Common RL bugs, why not learning, systematic debugging

  • rl-environments.md - Gym, MuJoCo, custom envs, wrappers, vectorization

  • rl-evaluation.md - Evaluation methodology, variance, sample efficiency metrics

  • multi-skill-scenarios.md - Common problem routing sequences

