google-cloud-configs

- Setting up BigQuery ML for SQL-based machine learning

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the command below and send it to your AI assistant:

Install skill "google-cloud-configs" with this command: npx skills add vanman2024/ai-dev-marketplace/vanman2024-ai-dev-marketplace-google-cloud-configs

Use when:

  • Setting up BigQuery ML for SQL-based machine learning

  • Configuring Vertex AI custom training jobs

  • Setting up GCP authentication for ML workflows

  • Selecting appropriate GPU/TPU configurations

  • Estimating costs for GCP ML training

  • Deploying models to Vertex AI endpoints

  • Configuring distributed training on GCP

  • Optimizing cost vs performance for cloud ML

Platform Overview

BigQuery ML

What it is: SQL-based machine learning directly in BigQuery

Best for:

  • Quick ML prototypes using existing data warehouse data

  • Classification, regression, forecasting on structured data

  • Users familiar with SQL but not Python/ML frameworks

  • Large-scale batch predictions

Available Models:

  • Linear/Logistic Regression

  • XGBoost (BOOSTED_TREE)

  • Deep Neural Networks (DNN)

  • AutoML Tables

  • TensorFlow/PyTorch imported models

Pricing:

  • Based on data processed (same as BigQuery queries)

  • $5 per TB processed for analysis

  • AutoML: $19.32/hour for training

Vertex AI Training

What it is: Fully managed ML training platform

Best for:

  • Custom PyTorch/TensorFlow training

  • Large-scale distributed training

  • GPU/TPU-accelerated workloads

  • Production ML pipelines

Available Compute:

  • CPUs: n1-standard, n1-highmem, n1-highcpu

  • GPUs: NVIDIA T4, P4, V100, P100, A100, L4

  • TPUs: v2, v3, v4, v5e (8 cores to 512 cores)

Pricing:

  • CPU: $0.05-0.30/hour depending on machine type

  • GPU T4: $0.35/hour

  • GPU A100: $3.67/hour (40GB) or $4.95/hour (80GB)

  • TPU v3: $8.00/hour (8 cores)

  • TPU v4: $11.00/hour (8 cores)

GPU/TPU Selection Guide

GPU Selection (Vertex AI)

T4 (16GB VRAM):

  • Use case: Inference, light training, small models

  • Cost: $0.35/hour

  • Good for: BERT-base, small CNNs, inference serving

V100 (16GB VRAM):

  • Use case: Mid-size training, mixed precision training

  • Cost: $2.48/hour

  • Good for: ResNet training, medium transformers

A100 (40GB/80GB VRAM):

  • Use case: Large model training, distributed training

  • Cost: $3.67/hour (40GB), $4.95/hour (80GB)

  • Good for: GPT-style models, large vision models, multi-GPU training

L4 (24GB VRAM):

  • Use case: Modern alternative to T4, better performance

  • Cost: $0.66/hour

  • Good for: Mid-size models, efficient inference

TPU Selection (Vertex AI)

TPU v2 (8 cores):

  • Use case: TensorFlow/JAX training, matrix operations

  • Cost: $4.50/hour

  • Memory: 8GB per core (64GB total)

  • Good for: Legacy TensorFlow models

TPU v3 (8 cores):

  • Use case: Standard TPU training

  • Cost: $8.00/hour

  • Memory: 16GB per core (128GB total)

  • Good for: BERT, T5, image classification

TPU v4 (8 cores):

  • Use case: Latest generation, best performance

  • Cost: $11.00/hour

  • Memory: 32GB per core (256GB total)

  • Good for: Large language models, cutting-edge research

TPU v5e (8 cores):

  • Use case: Cost-optimized TPU

  • Cost: $2.50/hour

  • Good for: Development, training at scale on a budget

Multi-node TPU Pods:

  • v3-32: 32 cores, $32/hour

  • v3-128: 128 cores, $128/hour

  • v4-128: 128 cores, $176/hour

  • Use for: Massive distributed training (GPT-3 scale)
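
As a rough sanity check, the selection guide above can be condensed into a small lookup. This is an illustrative sketch: the prices and VRAM figures are the ones quoted in this guide, not live GCP pricing.

```python
# Illustrative lookup distilled from the selection guide above.
# Prices and VRAM figures are this guide's numbers, not live GCP pricing.
GPU_GUIDE = {
    "T4":        {"vram_gb": 16, "usd_per_hour": 0.35},
    "L4":        {"vram_gb": 24, "usd_per_hour": 0.66},
    "V100":      {"vram_gb": 16, "usd_per_hour": 2.48},
    "A100-40GB": {"vram_gb": 40, "usd_per_hour": 3.67},
    "A100-80GB": {"vram_gb": 80, "usd_per_hour": 4.95},
}

def cheapest_gpu_for(required_vram_gb: float) -> str:
    """Return the cheapest GPU in the guide with enough VRAM."""
    fits = [(spec["usd_per_hour"], name)
            for name, spec in GPU_GUIDE.items()
            if spec["vram_gb"] >= required_vram_gb]
    if not fits:
        raise ValueError("no single GPU fits; consider multi-GPU or a TPU pod")
    return min(fits)[1]

print(cheapest_gpu_for(20))  # L4: cheapest option with at least 20 GB
```

For example, a model needing around 20 GB of memory lands on the L4 rather than the pricier A100.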

Usage

Set Up BigQuery ML Environment

bash scripts/setup-bigquery-ml.sh

Prompts for:

  • GCP Project ID

  • BigQuery dataset name

  • Service account credentials

  • Default model type preference

Creates:

  • bigquery_config.json (project configuration)

  • .bigqueryrc (CLI configuration)

  • Example training SQL in examples/

Set Up Vertex AI Training Environment

bash scripts/setup-vertex-ai.sh

Prompts for:

  • GCP Project ID

  • Region (us-central1, europe-west4, etc.)

  • Service account credentials

  • Default machine type

  • GPU/TPU preference

Creates:

  • vertex_config.yaml (training job configuration)

  • vertex_requirements.txt (Python dependencies)

  • Training script template

Configure GCP Authentication

bash scripts/configure-auth.sh

Prompts for:

  • Authentication method (service account, user account, workload identity)

  • Service account key path (if applicable)

  • IAM roles needed

Creates:

  • .gcp_auth_config (authentication configuration)

  • Sets the GOOGLE_APPLICATION_CREDENTIALS environment variable

  • Validates permissions

Required IAM Roles:

  • BigQuery ML: roles/bigquery.dataEditor, roles/bigquery.jobUser

  • Vertex AI: roles/aiplatform.user, roles/storage.objectAdmin

  • Both: roles/serviceusage.serviceUsageConsumer
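
To illustrate the kind of validation configure-auth.sh performs, the sketch below checks a service-account key file's structure before exporting GOOGLE_APPLICATION_CREDENTIALS. The field names are those of a standard GCP service-account key; per the security notes, only placeholder values should ever appear in checked-in files.

```python
import json
import os

# Fields present in a standard GCP service-account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key_id", "private_key",
                   "client_email", "token_uri"}

def validate_key_file(path: str) -> dict:
    """Sanity-check a key file, then export GOOGLE_APPLICATION_CREDENTIALS."""
    with open(path) as f:
        key = json.load(f)
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"key file missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError("not a service-account key")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.abspath(path)
    return key
```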

Estimate GCP Training Costs

bash scripts/estimate-gcp-cost.sh

Interactive prompts:

  • Platform: BigQuery ML or Vertex AI

  • If BigQuery ML: Data size to process

  • If Vertex AI:

      • Machine type (CPU/GPU/TPU)

      • Number of machines

      • Training duration estimate

      • Storage requirements

Output:

  • Estimated compute cost

  • Storage cost

  • Data transfer cost (if applicable)

  • Total estimated cost

  • Cost comparison with other GCP options
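
The estimator's core arithmetic reduces to a few lines. The rates below are the figures quoted in this guide; confirm against current GCP pricing before committing to a budget.

```python
# Back-of-envelope cost math mirroring estimate-gcp-cost.sh.
# Rates are the figures quoted in this guide, not live GCP pricing.
BQ_USD_PER_TB = 5.00
VERTEX_USD_PER_HOUR = {
    "T4": 0.35, "L4": 0.66, "V100": 2.48,
    "A100-40GB": 3.67, "TPU-v3-8": 8.00,
}

def bigquery_ml_cost(tb_processed: float) -> float:
    """BigQuery ML charges per TB of data processed."""
    return round(tb_processed * BQ_USD_PER_TB, 2)

def vertex_training_cost(hardware: str, machines: int, hours: float) -> float:
    """Vertex AI charges per machine-hour for the chosen hardware."""
    return round(VERTEX_USD_PER_HOUR[hardware] * machines * hours, 2)

print(bigquery_ml_cost(2.5))                      # 12.5
print(vertex_training_cost("A100-40GB", 4, 10))   # 146.8
```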

Templates

BigQuery ML Training Template (templates/bigquery_ml_training.sql)

SQL template for creating and training models:

  • Model creation syntax

  • Feature engineering examples

  • Training options (L1/L2 regularization, learning rate, etc.)

  • Evaluation queries

  • Prediction queries

Supported model types:

  • LINEAR_REG, LOGISTIC_REG

  • BOOSTED_TREE_CLASSIFIER, BOOSTED_TREE_REGRESSOR

  • DNN_CLASSIFIER, DNN_REGRESSOR

  • AUTOML_CLASSIFIER, AUTOML_REGRESSOR
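
The shape of the template's CREATE MODEL statement can be sketched as a small string builder. The dataset, table, and column names below are hypothetical placeholders, not values from the template itself.

```python
# Sketch of the CREATE MODEL statement shape used by the template.
# All dataset/table/column names here are hypothetical placeholders.
def create_model_sql(model: str, model_type: str, label: str,
                     feature_cols: list, source_table: str) -> str:
    """Build a CREATE MODEL statement, listing columns explicitly
    (per the cost tip about avoiding SELECT *)."""
    cols = ", ".join(feature_cols + [label])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type = '{model_type}',\n"
        f"        input_label_cols = ['{label}']) AS\n"
        f"SELECT {cols}\n"
        f"FROM `{source_table}`"
    )

sql = create_model_sql("my_dataset.trip_model", "BOOSTED_TREE_REGRESSOR",
                       "trip_duration", ["trip_miles", "pickup_hour"],
                       "my_dataset.taxi_trips")
print(sql)
```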

Vertex AI Training Job Template (templates/vertex_training_job.py)

Python template for custom training:

  • Training loop structure

  • Distributed training setup (PyTorch DDP)

  • Checkpointing and model saving

  • Metrics logging to Vertex AI

  • Hyperparameter tuning integration

Includes:

  • Single GPU training

  • Multi-GPU training (DataParallel, DistributedDataParallel)

  • TPU training with PyTorch/XLA

  • Cloud Storage integration
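
The checkpoint/resume pattern the template relies on (and which makes preemptible instances safe to use) can be sketched framework-agnostically. A real job would also save model weights (e.g. via torch.save) and write to a Cloud Storage path rather than local disk.

```python
import json
import os

def train(total_steps: int, ckpt_path: str, every: int = 100) -> int:
    """Run (or resume) a training loop, checkpointing every `every` steps.
    Returns the number of steps actually executed in this invocation."""
    start = 0
    if os.path.exists(ckpt_path):          # resume after a preemption
        with open(ckpt_path) as f:
            start = json.load(f)["step"]
    for step in range(start, total_steps):
        # ... one optimization step would run here ...
        if (step + 1) % every == 0:        # persist progress periodically
            with open(ckpt_path, "w") as f:
                json.dump({"step": step + 1}, f)
    return total_steps - start
```

If the job is preempted mid-run, the next invocation reads the last saved step and continues from there instead of restarting from zero.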

GPU Configuration Template (templates/vertex_gpu_config.yaml)

YAML configuration for GPU training jobs:

  • Machine type selection

  • GPU type and count

  • Disk configuration

  • Network configuration

  • Environment variables

Presets included:

  • Single T4 (budget)

  • Single A100 (standard)

  • 4x A100 (distributed)

  • 8x A100 (large-scale)
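
A preset such as "4x A100 (distributed)" ultimately becomes a worker pool spec on the training job. The sketch below builds one as a plain dict; the field names follow the Vertex AI CustomJob workerPoolSpecs schema as commonly documented, and the machine type and image URI are assumptions to adapt to your own project.

```python
# Sketch of a single-replica GPU worker pool spec, mirroring the YAML presets.
# Field names follow the Vertex AI CustomJob workerPoolSpecs schema; the
# machine type and image URI below are assumptions, not values from the skill.
def gpu_worker_pool(machine_type: str, accelerator: str, count: int,
                    image_uri: str) -> dict:
    return {
        "machine_spec": {
            "machine_type": machine_type,
            "accelerator_type": accelerator,
            "accelerator_count": count,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": image_uri},
    }

# "4x A100 (distributed)" preset: one replica with four A100s attached.
spec = gpu_worker_pool("a2-highgpu-4g", "NVIDIA_TESLA_A100", 4,
                       "gcr.io/my-project/trainer:latest")
```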

TPU Configuration Template (templates/vertex_tpu_config.yaml)

YAML configuration for TPU training jobs:

  • TPU type and topology

  • TPU version selection

  • JAX/TensorFlow runtime

  • XLA compilation flags

Presets included:

  • v3-8 (single TPU)

  • v4-32 (TPU pod slice)

  • v5e-8 (cost-optimized)

GCP Authentication Template (templates/gcp_auth.json)

Service account configuration template:

  • Project ID

  • Service account email

  • Key file path

  • Required scopes

  • IAM role assignments

Security notes:

  • Uses placeholders only (never real keys)

  • Documents how to create service accounts

  • Includes .gitignore protection

Examples

BigQuery ML Regression Example (examples/bigquery-regression-example.sql)

Complete example:

  • Dataset: NYC taxi trip data

  • Task: Predict trip duration

  • Model: BOOSTED_TREE_REGRESSOR

  • Includes feature engineering, training, evaluation

Demonstrates:

  • CREATE MODEL syntax

  • TRANSFORM clause for feature engineering

  • Model evaluation with ML.EVALUATE

  • Batch predictions

Vertex AI PyTorch Training Example (examples/vertex-pytorch-training.py)

Complete training script:

  • Dataset: IMDB sentiment analysis

  • Model: DistilBERT fine-tuning

  • Training: Single GPU

  • Logging: Vertex AI experiments

Demonstrates:

  • Loading data from GCS

  • Training loop with mixed precision

  • Checkpointing to GCS

  • Metrics logging

  • Model export to Vertex AI

Vertex AI Distributed Training Example (examples/vertex-distributed-training.py)

Multi-GPU training example:

  • Dataset: ImageNet subset

  • Model: ResNet-50

  • Training: 4x A100 with DDP

  • Scaling: Linear scaling rule

Demonstrates:

  • PyTorch DistributedDataParallel

  • Gradient accumulation

  • Learning rate scaling

  • Synchronized batch norm

  • Multi-node coordination
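
Two of the techniques above, gradient accumulation and the linear learning-rate scaling rule, reduce to simple arithmetic. The base batch size and learning rate below are illustrative values, not ones taken from the example script.

```python
# Gradient accumulation: a larger effective batch without more GPU memory.
def effective_batch(per_gpu: int, gpus: int, accum_steps: int) -> int:
    return per_gpu * gpus * accum_steps

# Linear scaling rule: scale the learning rate with the effective batch size.
def scaled_lr(base_lr: float, base_batch: int, eff_batch: int) -> float:
    return base_lr * eff_batch / base_batch

bs = effective_batch(per_gpu=64, gpus=4, accum_steps=2)    # 512
lr = scaled_lr(base_lr=0.1, base_batch=256, eff_batch=bs)  # 0.2
```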

Hugging Face Fine-tuning on Vertex AI (examples/vertex-huggingface-finetuning.py)

Production fine-tuning template:

  • Dataset: Custom text classification

  • Model: BERT/RoBERTa/DeBERTa

  • Training: Hugging Face Trainer API

  • Deployment: Vertex AI endpoint

Demonstrates:

  • Hugging Face Trainer integration

  • Hyperparameter tuning with Vertex AI

  • Model versioning

  • Endpoint deployment

  • Online predictions

Cost Optimization Tips

BigQuery ML

Reduce data processed:

  • Use partitioned tables

  • Filter data in WHERE clause before training

  • Use table sampling for experimentation

  • Cache intermediate results

Use appropriate model types:

  • Start with LINEAR_REG/LOGISTIC_REG (cheapest)

  • Use BOOSTED_TREE for better accuracy at moderate cost

  • Reserve AutoML for when simpler models fail

Optimize queries:

  • Avoid SELECT * (specify columns)

  • Use clustering on filter columns

  • Materialize views for repeated training

Vertex AI

Machine type selection:

  • Start with CPU for prototyping

  • Use T4 for small models (cheapest GPU)

  • Use A100 only for large models that need it

  • Consider TPU v5e for TensorFlow/JAX (very cost-effective)

Training optimization:

  • Use preemptible instances (60-70% cheaper, can be interrupted)

  • Enable automatic checkpoint/resume for preemptible instances

  • Use mixed precision training (FP16/BF16) for faster training

  • Profile to eliminate CPU bottlenecks

Storage optimization:

  • Store datasets in Cloud Storage (cheaper than persistent disk)

  • Use Filestore only if needed for POSIX filesystem

  • Clean up old model artifacts

  • Use lifecycle policies to archive old data

Multi-GPU efficiency:

  • Ensure near-linear scaling before adding more GPUs

  • Profile inter-GPU communication

  • Use gradient accumulation instead of larger batch sizes

  • Consider 2x GPUs instead of 1x larger GPU (often same cost, better availability)

Integration with ML Training Plugin

This skill integrates with other ml-training components:

  • training-patterns: Provides GCP configs for generated training scripts

  • cost-calculator: Uses GCP pricing data for budget planning

  • monitoring-dashboard: Integrates with Vertex AI TensorBoard

  • validation-scripts: Validates GCP credentials and permissions

  • integration-helpers: Deploys trained models to Vertex AI endpoints

Common Workflows

Workflow 1: Quick BigQuery ML Prototype

  • Run bash scripts/setup-bigquery-ml.sh

  • Copy templates/bigquery_ml_training.sql to your project

  • Modify SQL for your dataset and features

  • Run training query in BigQuery console

  • Evaluate with built-in ML.EVALUATE()

  • Export predictions with ML.PREDICT()

Time: 30 minutes setup + training time

Cost: $5 per TB of data processed

Workflow 2: Custom PyTorch Training on Vertex AI

  • Run bash scripts/configure-auth.sh

  • Run bash scripts/setup-vertex-ai.sh

  • Copy templates/vertex_training_job.py

  • Customize training loop for your model

  • Copy templates/vertex_gpu_config.yaml

  • Submit job: gcloud ai custom-jobs create ...

  • Monitor in Vertex AI console

Time: 1 hour setup + training time

Cost: Depends on GPU/TPU selection

Workflow 3: Large-Scale Distributed Training

  • Set up Vertex AI (Workflow 2)

  • Copy examples/vertex-distributed-training.py

  • Adapt for your model architecture

  • Test locally with 1 GPU

  • Test with 2 GPUs to verify scaling

  • Scale to 4-8 GPUs for full training

  • Use preemptible instances with checkpointing

Time: 2-4 hours setup + training time

Cost: $15-60/hour depending on GPU count

Troubleshooting

BigQuery ML Issues

"Insufficient permissions":

  • Verify roles/bigquery.dataEditor and roles/bigquery.jobUser

  • Check dataset-level permissions

  • Ensure billing is enabled

"Model training failed":

  • Check for NULL values in features

  • Verify data types match model expectations

  • Review feature engineering TRANSFORM clause

  • Check for sufficient training data

Vertex AI Issues

"Service account lacks permissions":

  • Verify roles/aiplatform.user

  • Add roles/storage.objectAdmin for GCS access

  • Check project-level IAM policies

"GPU/TPU quota exceeded":

  • Request quota increase in GCP console

  • Use different region with availability

  • Start with smaller GPU/TPU configuration

  • Use preemptible instances (separate quota)

"Training job crashes":

  • Check for CUDA OOM (reduce batch size)

  • Verify dependencies in requirements.txt

  • Review logs in Cloud Logging

  • Test locally before submitting to Vertex

Security Best Practices

Credentials Management

DO:

  • ✅ Use service accounts with minimal permissions

  • ✅ Store credentials in Secret Manager

  • ✅ Use Workload Identity for GKE deployments

  • ✅ Rotate service account keys regularly

  • ✅ Add *.json key files to .gitignore

DON'T:

  • ❌ Hardcode credentials in code

  • ❌ Commit service account keys to git

  • ❌ Use overly permissive roles (e.g., Owner)

  • ❌ Share service account keys across projects

  • ❌ Use personal credentials for production

IAM Best Practices

  • Use separate service accounts for training vs serving

  • Grant roles at resource level, not project level when possible

  • Use Workload Identity Federation instead of keys when possible

  • Enable Cloud Audit Logs for ML API usage

  • Review IAM permissions quarterly

Performance Benchmarks

BigQuery ML vs Vertex AI

BigQuery ML:

  • Best for: Structured data, SQL users, quick prototypes

  • Training time: Minutes to hours (depends on data size)

  • Scalability: Automatic (serverless)

  • Cost: $5/TB processed

Vertex AI Custom Training:

  • Best for: Deep learning, custom architectures, GPU/TPU workloads

  • Training time: Hours to days (configurable hardware)

  • Scalability: Manual (choose machine type)

  • Cost: $0.35-20/hour depending on hardware

Rule of thumb:

  • Use BigQuery ML for tabular data with < 100M rows

  • Use Vertex AI for images, text, audio, or custom models

  • Use Vertex AI for models requiring GPU/TPU acceleration

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
