data-engineering-storage-authentication

Cloud Storage Authentication

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "data-engineering-storage-authentication" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-storage-authentication

Secure authentication patterns for accessing cloud storage (S3, GCS, Azure Blob) and cloud services in data pipelines. Covers IAM roles, service principals, secret managers, and best practices for credential management.

Quick Reference

| Provider  | Recommended Auth                 | Alternative                            |
|-----------|----------------------------------|----------------------------------------|
| AWS       | IAM roles (EC2/ECS/Lambda)       | Environment variables, Secrets Manager |
| GCP       | Workload Identity / ADC          | Service account keys (discouraged)     |
| Azure     | Managed Identity                 | Service principal with certificate     |
| Local Dev | .env files + local credentials   | Static keys (temporary only)           |

Core Principles

  • Least Privilege: Grant only necessary permissions (read-only, specific bucket)

  • Short-lived credentials: Use STS tokens, OIDC, not long-term keys

  • Automatic rotation: Prefer managed identities that rotate automatically

  • Secret management: Never commit credentials; use secret managers

  • Audit everything: Enable CloudTrail (AWS), Azure Monitor logs, and Cloud Audit Logs (GCP)

  • Separate environments: Different credentials for dev/staging/prod
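
To make the least-privilege principle concrete, here is a minimal sketch of a read-only policy scoped to a single bucket. The helper name `read_only_bucket_policy` and the bucket and prefix values are hypothetical, chosen only for illustration; adapt the actions and resources to what your pipeline actually needs.

```python
def read_only_bucket_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document granting read-only access to one bucket.

    `bucket` and `prefix` are illustrative inputs, not names from this skill.
    """
    list_statement = {
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}"],
    }
    if prefix:
        # Limit listing to the same prefix that s3:GetObject is limited to.
        list_statement["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}*"]}}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],  # read only; no s3:* wildcard
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
            list_statement,
        ],
    }
```

Serializing the result with `json.dumps(...)` gives a document that can be attached to a role; using one such policy per environment also keeps dev and prod permissions separate.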

When to Use What?

  • Production on cloud VMs: Use IAM roles/Managed Identities (no credentials in code)

  • CI/CD pipelines: Use workload identity federation (OIDC) or short-lived tokens

  • Local development: .env files with user credentials from aws configure, gcloud auth, az login

  • Third-party integrations: Service principals with scoped permissions

  • Cross-account access: Role assumption (AWS), workload identity (GCP), service principal (Azure)

Skill Dependencies

This skill is foundational for:

  • @data-engineering-storage-remote-access: All cloud storage backends

  • @data-engineering-storage-lakehouse: Delta Lake/Iceberg with cloud catalogs

  • @data-engineering-streaming: Kafka connectors with cloud auth

  • @data-engineering-ai-ml: OpenAI, vector DBs with cloud storage

  • @data-engineering-orchestration: dbt, Prefect, Dagster cloud connectors

Detailed Guides

AWS Authentication

See: aws.md

  • IAM roles (EC2 instance profiles, ECS task roles, Lambda execution roles)

  • IAM users with access keys (discouraged for production)

  • STS temporary credentials (AssumeRole, GetSessionToken)

  • S3 presigned URLs for temporary file access

  • Cross-account access patterns

  • AWS Secrets Manager integration

  • Environment variable resolution (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN)
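
As a rough illustration of the STS AssumeRole pattern listed above, the sketch below exchanges the caller's identity for short-lived credentials. It assumes `sts_client` behaves like `boto3.client("sts")`; the function name `assume_role_credentials` is hypothetical. Passing the client in, rather than creating it inside the function, keeps the helper easy to test with a stub.

```python
from typing import Any


def assume_role_credentials(
    sts_client: Any,
    role_arn: str,
    session_name: str,
    duration_seconds: int = 900,
) -> dict:
    """Call STS AssumeRole and return keyword arguments for a boto3 client.

    `sts_client` is assumed to expose the same assume_role() call as
    boto3.client("sts").
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,  # 900 seconds is the STS minimum
    )
    creds = response["Credentials"]
    # These keyword names are the ones boto3.client(...) accepts.
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


# Usage (requires boto3 and a role you are allowed to assume):
#   import boto3
#   kwargs = assume_role_credentials(boto3.client("sts"), role_arn, "pipeline-run")
#   s3 = boto3.client("s3", **kwargs)
```

The same helper covers cross-account access: the role ARN simply points at a role in the other account that trusts the caller.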

Google Cloud Platform

See: gcp.md

  • Service accounts (JSON keys)

  • Workload Identity Federation (no keys needed!)

  • Application Default Credentials (ADC)

  • Cloud Storage signed URLs

  • Secret Manager integration

  • Environment variables (GOOGLE_APPLICATION_CREDENTIALS)

  • Workload Identity for GKE; attached service accounts for Cloud Run and Compute Engine
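
Since service account keys are discouraged, it can help to check what GOOGLE_APPLICATION_CREDENTIALS actually points at. The sketch below reads the `type` field of the credential file; the function names are hypothetical, and it assumes the file is one of the JSON formats that Application Default Credentials accepts (for example `service_account` for a key file or `external_account` for Workload Identity Federation).

```python
import json
import os
from typing import Mapping, Optional


def adc_credential_type(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the "type" field of the file named by GOOGLE_APPLICATION_CREDENTIALS.

    Returns None when the variable is unset, in which case ADC falls back to
    other sources such as gcloud user credentials or an attached service account.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle).get("type")


def is_long_lived_key(credential_type: Optional[str]) -> bool:
    """A service account key file holds a private key that does not expire on its own."""
    return credential_type == "service_account"
```

A pipeline could log a warning at startup when `is_long_lived_key(adc_credential_type())` is true in a production environment.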

Azure

See: azure.md

  • Managed Identities (system-assigned, user-assigned)

  • Service Principals (client secret, certificate)

  • SAS tokens for Blob Storage

  • Azure Key Vault integration

  • Environment variables (AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY)

  • Microsoft Entra Workload ID (formerly Azure AD workload identity) for AKS; managed identities for App Service and VMs
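
SAS tokens are only as safe as their expiry, so one possible guard is to inspect the `se` (signed expiry) parameter before handing a token to a job. This is a minimal sketch: the function names and the one-hour limit are hypothetical, and it assumes the expiry uses the full `YYYY-MM-DDThh:mm:ssZ` form (SAS also permits shorter date formats, which this sketch does not handle).

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs


def sas_expiry(sas_token: str) -> datetime:
    """Parse the `se` (signed expiry) field of a SAS token as a UTC datetime."""
    params = parse_qs(sas_token.lstrip("?"))
    expiry = params["se"][0]  # raises KeyError if the token has no expiry field
    return datetime.strptime(expiry, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def sas_is_short_lived(
    sas_token: str,
    now: datetime,
    max_lifetime: timedelta = timedelta(hours=1),  # illustrative limit
) -> bool:
    """True if the token is still valid and expires within `max_lifetime` of `now`."""
    remaining = sas_expiry(sas_token) - now
    return timedelta(0) < remaining <= max_lifetime
```

Call it with `datetime.now(timezone.utc)` as `now`; taking the clock as a parameter keeps the check deterministic in tests.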

Patterns & Best Practices

See: patterns.md

  • Secret rotation automation

  • Multi-environment credential management

  • Local development setup without production keys

  • CI/CD pipeline authentication (GitHub Actions, GitLab CI, Jenkins)

  • Testing with mock credentials (Moto, google-cloud-testutils)

  • Credential leakage prevention (.gitignore, pre-commit hooks)
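
For credential leakage prevention, a pre-commit hook can be as small as a script that scans staged files for strings shaped like AWS access key IDs. The sketch below is an assumption about how such a hook might look, not the contents of patterns.md: the names `find_aws_key_ids` and `scan_files` are hypothetical, and the pattern only covers key IDs that start with AKIA (long-term) or ASIA (temporary).

```python
import re
from pathlib import Path
from typing import Iterable, List

# AWS access key IDs are 20 characters: a 4-letter prefix plus 16 uppercase
# letters or digits. Only the AKIA and ASIA prefixes are checked here.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_aws_key_ids(text: str) -> List[str]:
    """Return every substring of `text` that looks like an AWS access key ID."""
    return AWS_ACCESS_KEY_ID.findall(text)


def scan_files(paths: Iterable[str]) -> int:
    """Scan files and return 1 if any suspected key ID is found, else 0."""
    exit_code = 0
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        for match in find_aws_key_ids(text):
            # Print only a prefix so the hook output does not leak the key again.
            print(f"{path}: possible AWS access key ID {match[:8]}...")
            exit_code = 1
    return exit_code
```

Wired into a hook, the script would end with `sys.exit(scan_files(sys.argv[1:]))`, since pre-commit passes the staged file names as arguments. Dedicated secret scanners cover far more credential formats than this pattern does.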

Testing Strategies

See: testing.md

  • Mocking cloud services for unit tests

  • Using local emulators (MinIO, Azurite, LocalStack)

  • Test credential patterns with placeholders

  • Integration test setup with temporary credentials
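
One way to apply the placeholder-credential pattern is a context manager that swaps in dummy AWS environment variables for the duration of a test, so a misconfigured test can never reach a real account. This is a sketch using only the standard library; the name `placeholder_aws_credentials` and the placeholder values are hypothetical.

```python
import os
from contextlib import contextmanager
from unittest import mock

# Obviously fake values; they only need to be present, not valid.
PLACEHOLDER_AWS_ENV = {
    "AWS_ACCESS_KEY_ID": "testing",
    "AWS_SECRET_ACCESS_KEY": "testing",
    "AWS_SESSION_TOKEN": "testing",
    "AWS_DEFAULT_REGION": "us-east-1",
}


@contextmanager
def placeholder_aws_credentials():
    """Temporarily override AWS credential variables with placeholders.

    mock.patch.dict restores the previous environment when the block exits,
    even if the test body raises.
    """
    with mock.patch.dict(os.environ, PLACEHOLDER_AWS_ENV):
        yield
```

Inside the `with` block, a mocking library or a local emulator can be started so that any client created there picks up the placeholders instead of real keys.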

Quick Examples

AWS IAM Role (Production)

    # No credentials in code - automatically from EC2/ECS/Lambda
    import boto3

    s3 = boto3.client('s3')  # Uses the role attached to the instance, task, or function

GCP Workload Identity (Production)

    # Enable workload identity on GKE/Cloud Run
    # Then in Python:
    import google.auth

    credentials, project = google.auth.default()
    # No env vars needed!

Azure Managed Identity (Production)

    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    credential = DefaultAzureCredential()  # Auto-detects managed identity
    client = BlobServiceClient(account_url="...", credential=credential)

Local Development

    # AWS
    aws configure  # Enter keys from IAM user (dev only)

    # GCP
    gcloud auth application-default login

    # Azure
    az login

Common Pitfalls

❌ Hardcoding credentials - If a credential is ever committed to git, rotate it immediately

❌ Using root/admin accounts - Create scoped users/service principals

❌ Long-lived keys - Rotate every 90 days or less

❌ Over-permissive roles - Grant s3:GetObject, not s3:*

❌ Missing environment separation - Dev credentials in prod

❌ Disabling TLS verification - Acceptable only for local MinIO testing

References

  • AWS IAM Best Practices

  • GCP Workload Identity

  • Azure Managed Identities

  • HashiCorp Vault

  • Legacy @data-engineering-storage-remote-access auth notes are deprecated; use this skill as the source of truth.
