
R Data Science

Overview

Generate high-quality R code following tidyverse conventions and modern best practices. This skill covers data manipulation, visualization, statistical analysis, and reproducible research workflows commonly used in public health, epidemiology, and data science.

Core Principles

  • Tidyverse-first: Use tidyverse packages (dplyr, tidyr, ggplot2, purrr, readr) as the default approach

  • Pipe-forward: Use the native pipe |> for chains (R 4.1+); fall back to %>% for older versions (see the sketch after this list)

  • Reproducibility: Structure all work for reproducibility using Quarto, renv, and clear documentation

  • Defensive coding: Validate inputs, handle missing data explicitly, and fail informatively
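For example, a minimal sketch of the two pipe styles (summary_tbl is a placeholder name; the built-in mtcars data stands in for real data):

# Native pipe, R 4.1+ (preferred)
summary_tbl <- mtcars |> subset(cyl == 4)

# magrittr pipe, equivalent fallback on older installations
library(magrittr)
summary_tbl <- mtcars %>% subset(cyl == 4)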

Quick Reference: Common Patterns

Data Import

library(tidyverse)

# CSV (most common)
df <- read_csv("data/raw/dataset.csv")

# Excel
df <- readxl::read_excel("data/raw/dataset.xlsx", sheet = "Sheet1")

# Clean column names immediately
df <- df |> janitor::clean_names()

Data Wrangling Pipeline

analysis_data <- raw_data |>
  # Clean and filter
  filter(!is.na(key_variable)) |>
  # Transform variables
  mutate(
    date = as.Date(date_string, format = "%Y-%m-%d"),
    age_group = cut(
      age,
      breaks = c(0, 18, 45, 65, Inf),
      labels = c("0-17", "18-44", "45-64", "65+"),
      right = FALSE  # left-closed intervals so age 18 falls in "18-44", not "0-17"
    )
  ) |>
  # Summarize
  group_by(region, age_group) |>
  summarize(
    n = n(),
    mean_value = mean(outcome, na.rm = TRUE),
    .groups = "drop"
  )

Basic ggplot2 Visualization

ggplot(df, aes(x = date, y = count, color = category)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Trend Over Time",
    subtitle = "By category",
    x = "Date",
    y = "Count",
    color = "Category",
    caption = "Source: Dataset Name"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

Tidyverse Style Guide Essentials

Naming Conventions

  • snake_case for objects and functions: case_counts, calculate_rate()

  • Verbs for functions: filter_outliers(), compute_summary()

  • Nouns for data: patient_data, surveillance_df

  • Avoid: dots in names (reserved for S3), single letters except in lambdas

Code Formatting

  • Indentation: 2 spaces (never tabs)

  • Line length: 80 characters maximum

  • Operators: Spaces around <-, =, +, and |>, but not around :, ::, or $

  • Commas: Space after, never before

  • Pipes: New line after each |>

# Good
result <- data |>
  filter(year >= 2020) |>
  group_by(county) |>
  summarize(total = sum(cases))

# Bad
result<-data|>filter(year>=2020)|>group_by(county)|>summarize(total=sum(cases))

Assignment

  • Use <- for assignment, never = or ->

  • Use = only for function arguments
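A quick illustration of both rules (cases is a placeholder vector):

# Good: <- for assignment, = for function arguments
total_cases <- sum(cases, na.rm = TRUE)

# Bad: = or -> for assignment
total_cases = sum(cases, na.rm = TRUE)
sum(cases, na.rm = TRUE) -> total_cases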

Comments

# Load and clean surveillance data ------------------------------------------

# Calculate age-adjusted rates
# Using direct standardization method per CDC guidelines
adjusted_rate <- calculate_adjusted_rate(df, standard_pop)

Package Ecosystem

Core Tidyverse (Always Load)

library(tidyverse) # Loads: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats (plus lubridate in tidyverse >= 2.0)

Data Import/Export

Task             Package           Key Functions
CSV/TSV          readr             read_csv(), write_csv()
Excel            readxl, writexl   read_excel(), write_xlsx()
SAS/SPSS/Stata   haven             read_sas(), read_spss(), read_stata()
JSON             jsonlite          read_json(), fromJSON()
Databases        DBI, dbplyr       dbConnect(), tbl()
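A minimal sketch of the database row; the SQLite driver, file path, and table name are assumptions for illustration:

library(DBI)

# Hypothetical local database; swap in your own driver and path
con <- dbConnect(RSQLite::SQLite(), "data/processed/surveillance.db")

# dbplyr keeps the query lazy; rows transfer only at collect()
cases_df <- dplyr::tbl(con, "cases") |>
  dplyr::filter(year >= 2020) |>
  dplyr::collect()

dbDisconnect(con)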

Data Manipulation

Task                Package     Key Functions
Column cleaning     janitor     clean_names(), tabyl()
Date handling       lubridate   ymd(), mdy(), floor_date()
String operations   stringr     str_detect(), str_extract()
Missing data        naniar      vis_miss(), replace_with_na()
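A combined sketch of these helpers; the columns onset_raw and notes are assumptions for illustration:

library(lubridate)
library(naniar)

df <- df |>
  mutate(
    onset_date = ymd(onset_raw),                      # lubridate: parse dates
    onset_week = floor_date(onset_date, "week"),      # bin to week start
    hospitalized = str_detect(notes, "hospitalized")  # stringr: keyword flag
  )

vis_miss(df)  # naniar: visualize the missingness pattern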

Visualization

Task            Package              Key Functions
Core plotting   ggplot2              ggplot(), geom_*()
Extensions      ggrepel, patchwork   geom_text_repel(), + operator
Interactive     plotly               ggplotly()
Tables          gt, kableExtra       gt(), kable()
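A short sketch of the extension workflow; p1 and p2 are illustrative plots built from the df used above:

library(patchwork)

p1 <- ggplot(df, aes(date, count)) + geom_line()
p2 <- ggplot(df, aes(category)) + geom_bar()

p1 + p2               # patchwork: compose layouts with +
plotly::ggplotly(p1)  # drop-in interactive version of a ggplot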

Statistical Analysis

Task              Package       Key Functions
Model summaries   broom         tidy(), glance(), augment()
Regression        stats, lme4   lm(), glm(), lmer()
Survival          survival      Surv(), survfit(), coxph()
Survey data       survey        svydesign(), svymean()
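For instance, a minimal survival sketch, assuming time, status, and group columns exist in df:

library(survival)
library(broom)

# Kaplan-Meier estimate by group, tidied to one row per time point
km_fit <- survfit(Surv(time, status) ~ group, data = df)
tidy(km_fit)

# Cox model with a model-level summary
glance(coxph(Surv(time, status) ~ group, data = df))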

Epidemiology & Public Health

Task                Package                   Key Functions
Epi calculations    epiR                      epi.2by2(), epi.conf()
Outbreak tools      incidence2, epicontacts   incidence(), make_epicontacts()
Disease mapping     SpatialEpi                expected(), EBlocal()
Surveillance        surveillance              sts(), farrington()
Rate calculations   epitools                  riskratio(), oddsratio(), ageadjust.direct()
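As a hedged 2x2 example with made-up counts (rows: exposed/unexposed; columns: cases/non-cases):

library(epiR)

# Illustrative cohort counts only
tab <- as.table(matrix(c(20, 80, 10, 90), nrow = 2, byrow = TRUE))
epi.2by2(tab, method = "cohort.count")  # risk ratio, odds ratio, CIs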

Reproducibility Standards

Project Structure

project/
├── project.Rproj
├── renv.lock
├── CLAUDE.md            # Claude Code configuration
├── README.md
├── data/
│   ├── raw/             # Never modify
│   └── processed/       # Analysis-ready
├── R/                   # Custom functions
├── scripts/             # Pipeline scripts
├── analysis/            # Quarto documents
└── output/
    ├── figures/
    └── tables/

Quarto Document Header


title: "Analysis Title" author: "Your Name" date: today format: html: toc: true code-fold: true embed-resources: true execute: warning: false message: false

Package Management with renv

# Initialize (once per project)
renv::init()

# Snapshot dependencies after installing packages
renv::snapshot()

# Restore environment (for collaborators)
renv::restore()

Workflow Documentation

Always include at the top of scripts:

# ============================================================================
# Title:   Analysis of [Subject]
# Author:  [Name]
# Date:    [Date]
# Purpose: [One-sentence description]
# Input:   data/processed/clean_data.csv
# Output:  output/figures/trend_plot.png
# ============================================================================

Common Analysis Patterns

Descriptive Statistics Table

df |>
  group_by(category) |>
  summarize(
    n = n(),
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE)
  ) |>
  gt::gt() |>
  gt::fmt_number(columns = where(is.numeric), decimals = 2)

Regression with Tidy Output

model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial)

# Tidy coefficients
tidy_results <- broom::tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value)

# Model diagnostics
glance_results <- broom::glance(model)

Epi Curve (Epidemic Curve)

library(incidence2)

# Create incidence object
inc <- incidence(
  df,
  date_index = "onset_date",
  interval = "week",
  groups = "outcome_category"
)

# Plot
plot(inc) +
  labs(
    title = "Epidemic Curve",
    x = "Week of Onset",
    y = "Number of Cases"
  ) +
  theme_minimal()

Rate Calculation

# Age-adjusted rates using direct standardization
library(epitools)

# Stratum-specific counts and populations
result <- ageadjust.direct(
  count = df$cases,
  pop = df$population,
  stdpop = standard_population$pop  # e.g., US 2000 standard
)
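ageadjust.direct() returns a named vector (crude.rate, adj.rate, lci, uci) expressed as proportions, so rescaling is left to you:

# Express the crude and adjusted rates per 100,000 population
round(result * 100000, 1)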

Error Handling

Defensive Data Checks

# Validate data before analysis
stopifnot(
  "Data frame is empty" = nrow(df) > 0,
  "Missing required columns" = all(c("id", "date", "value") %in% names(df)),
  "Duplicate IDs found" = !any(duplicated(df$id))
)

# Informative warnings for data quality issues
if (sum(is.na(df$key_var)) > 0) {
  warning(sprintf(
    "%d missing values in key_var (%.1f%%)",
    sum(is.na(df$key_var)),
    100 * mean(is.na(df$key_var))
  ))
}

Safe File Operations

# Check file exists before reading
if (!file.exists(filepath)) {
  stop(sprintf("File not found: %s", filepath))
}

# Create directories if needed
dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)
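A sketch tying the two together, saving the most recent ggplot into the directory just created:

ggsave(
  filename = "output/figures/trend_plot.png",
  plot = last_plot(),
  width = 8, height = 5, dpi = 300
)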

Performance Tips

For Large Datasets

# Use data.table for >1M rows
library(data.table)
dt <- fread("large_file.csv")

# Or use arrow for very large/parquet files
library(arrow)
df <- read_parquet("data.parquet")

# Lazy evaluation with duckdb
library(duckdb)
con <- dbConnect(duckdb())
df_lazy <- tbl(con, "data.csv")

Vectorization Over Loops

# Good: vectorized
df$rate <- df$cases / df$population * 100000

# Avoid: row-by-row loop
for (i in 1:nrow(df)) {
  df$rate[i] <- df$cases[i] / df$population[i] * 100000
}
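When an operation genuinely cannot be vectorized (e.g., one call per file), purrr is the tidyverse alternative to explicit loops; the directory path here is an assumption:

library(purrr)

# One read_csv() call per file, row-bound into a single data frame
files <- list.files("data/raw", pattern = "\\.csv$", full.names = TRUE)
combined <- map(files, read_csv) |> list_rbind()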

Additional Resources

For detailed patterns, consult:

Version History

  • v1.0.0 (2025-12-04): Initial release for PubHealthAI community
