Valohai Data I/O Migration
Migrate data loading and model saving to use Valohai's managed input/output system. This eliminates cloud SDK boilerplate, handles authentication automatically, and enables full data lineage tracking.
Philosophy
Valohai separates data access from code. Inputs are declared in valohai.yaml with cloud storage URLs. Valohai downloads them to /valohai/inputs/{name}/ before your code runs. Outputs are saved to /valohai/outputs/ and automatically uploaded. No boto3, no credentials in code, no download logic.
Step-by-Step: Inputs
1. Identify Data Loading Code
Look for patterns like:
- boto3.client('s3') / s3.download_file()
- from google.cloud import storage
- azure.storage.blob
- requests.get() for downloading datasets
- wget or curl in shell commands
- Hardcoded local paths like /data/train/, ./datasets/
- pd.read_csv("path/to/data.csv")
- torch.load("model.pth") for pretrained models
Design Principle: Always Set Default Values for Inputs
IMPORTANT: Whenever possible, give every input a default value so the step can run on its own, outside a pipeline. This lets users test individual steps with vh execution run --adhoc step-name. Pipeline edges override defaults at runtime.
Look for default values in the existing code, README, documentation, config files, or data paths already referenced in the project. Only add a default if you can find a real, meaningful value. Do not invent placeholder URLs or dummy paths.
2. Define Inputs in valohai.yaml
- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command:
      - pip install -r requirements.txt
      - python train.py
    inputs:
      - name: training-data
        default: s3://my-bucket/datasets/train.csv
      - name: pretrained-model
        default: datum://production-model-latest
        filename: model.h5
Input Options
inputs:
  # Single file
  - name: config
    default: s3://bucket/config.json
  # Multiple files with wildcard
  - name: images
    default:
      - s3://bucket/images/*.jpg
      - s3://bucket/images/*.png
  # Multiple cloud sources
  - name: data
    default:
      - s3://aws-bucket/data.csv
      - azure://container/data.csv
      - gs://gcs-bucket/data.csv
  # Rename on download
  - name: model
    default: s3://bucket/models/v42-best.h5
    filename: model.h5
  # Preserve directory structure
  - name: dataset
    default: s3://bucket/dataset/**/*.json
    keep-directories: suffix
  # Optional input (can be empty)
  - name: checkpoint
    optional: true
  # Reference a Valohai datum (output from another execution)
  - name: pretrained
    default: datum://my-model-alias
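Because every URL listed under one input is downloaded into the same directory, files that share a name can overwrite each other under the default keep-directories: none. The following is a minimal pre-flight sketch, not part of Valohai: find_basename_collisions is a hypothetical helper, and the URLs are the ones from the "Multiple cloud sources" example, which all end in data.csv.

```python
from collections import Counter
from urllib.parse import urlparse
import posixpath


def find_basename_collisions(urls):
    """Return file names that appear more than once across the given URLs.

    Wildcard URLs are skipped because their matches are only known at run time.
    """
    names = [
        posixpath.basename(urlparse(url).path)
        for url in urls
        if "*" not in url
    ]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# URLs copied from the "Multiple cloud sources" example above.
urls = [
    "s3://aws-bucket/data.csv",
    "azure://container/data.csv",
    "gs://gcs-bucket/data.csv",
]
collisions = find_basename_collisions(urls)
print(collisions)
```

If the check reports a collision, one option is to split the sources into separately named inputs so each file gets its own directory.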
keep-directories Options
- none (default): All files flat in /valohai/inputs/{name}/
- suffix: Keeps path after the wildcard root
- full: Keeps the full storage path
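With suffix or full, files end up in nested subdirectories, so a flat glob on the input directory would miss them. A minimal sketch of a recursive listing follows; list_input_files is a hypothetical helper name, and "dataset" is the input name taken from the example above.

```python
from pathlib import Path


def list_input_files(input_dir, pattern="*.json"):
    """Recursively list files under an input directory, sorted for a stable order."""
    root = Path(input_dir)
    return sorted(p for p in root.rglob(pattern) if p.is_file())


if __name__ == "__main__":
    # Walks every subdirectory that keep-directories preserved.
    for path in list_input_files("/valohai/inputs/dataset"):
        print(path)
```

The same helper also works for the default flat layout, since rglob matches files directly inside the root as well.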
3. Update Python Code
Replace cloud SDK / download code with simple local file reads:
Before (with boto3)
import boto3
s3 = boto3.client('s3', aws_access_key_id=KEY, aws_secret_access_key=SECRET)
s3.download_file('my-bucket', 'datasets/train.csv', '/tmp/train.csv')
df = pd.read_csv('/tmp/train.csv')
After (Valohai)
import pandas as pd
# Files are already downloaded to /valohai/inputs/
df = pd.read_csv("/valohai/inputs/training-data/train.csv")
Reading Multiple Input Files
import os
import glob
# All files in the input directory
image_dir = "/valohai/inputs/images/"
image_files = glob.glob(os.path.join(image_dir, "*.jpg"))
for image_path in image_files:
    process_image(image_path)
Loading a Model
import tensorflow as tf
# With filename option, the file is always named model.h5
model = tf.keras.models.load_model("/valohai/inputs/pretrained-model/model.h5")
4. Override Inputs at Runtime
# Override via CLI
vh execution run --adhoc train-model --training-data=s3://different-bucket/new-data.csv
# Or use the web UI to browse and select files
Step-by-Step: Outputs
1. Identify Output/Save Code
Look for:
- model.save("model.h5") / torch.save(model, "model.pth")
- df.to_csv("results.csv")
- plt.savefig("plot.png")
- joblib.dump(model, "model.pkl")
- Any code saving files locally
2. Update Save Paths
Prefix each save path with /valohai/outputs/:
Before
model.save("model.h5")
df.to_csv("results.csv")
plt.savefig("loss_curve.png")
After
model.save("/valohai/outputs/model.h5")
df.to_csv("/valohai/outputs/results.csv")
plt.savefig("/valohai/outputs/loss_curve.png")
That's the only change needed. Valohai automatically uploads everything in /valohai/outputs/ to your configured cloud storage when the execution completes.
3. Directory Structure in Outputs
Subdirectories are preserved:
import os
os.makedirs("/valohai/outputs/models", exist_ok=True)
os.makedirs("/valohai/outputs/plots", exist_ok=True)
model.save("/valohai/outputs/models/best_model.h5")
plt.savefig("/valohai/outputs/plots/training_curve.png")
4. Live Upload During Training (Checkpoints)
Files marked read-only are uploaded immediately, without waiting for execution to end:
import os
from stat import S_IREAD, S_IRGRP, S_IROTH
for epoch in range(epochs):
    # Training...
    if epoch % 10 == 0:
        path = f"/valohai/outputs/checkpoint_epoch_{epoch}.pt"
        torch.save(model.state_dict(), path)
        os.chmod(path, S_IREAD | S_IRGRP | S_IROTH)  # Triggers immediate upload
5. Package Large Numbers of Output Files
IMPORTANT: If your code produces thousands of individual files (e.g., preprocessed images, tokenized text shards, tile crops), package them into a single archive before writing to /valohai/outputs/. Do NOT write thousands of small files individually.
Valohai makes a separate HTTPS request with a presigned URL for every file it uploads and downloads. With tens of thousands of small files, you'll spend more time on HTTP overhead than actual data transfer.
Use tar (without compression) to bundle files. This keeps packaging fast while eliminating the per-file HTTP overhead:
import tarfile
import os
# WRONG - 20,000 individual files = 20,000 HTTPS requests on download
for i, image in enumerate(processed_images):
    save_image(image, f"/valohai/outputs/image_{i}.png")
# CORRECT - 1 tar archive = 1 HTTPS request on download
output_dir = "/tmp/processed_images"
os.makedirs(output_dir, exist_ok=True)
for i, image in enumerate(processed_images):
    save_image(image, f"{output_dir}/image_{i}.png")
# Package without compression (fast, no CPU overhead)
with tarfile.open("/valohai/outputs/processed_images.tar", "w") as tar:
    tar.add(output_dir, arcname="processed_images")
Rule of thumb: If you're writing more than ~1,000 files to outputs, tar them up. A few large files generally transfer faster than many small ones on Valohai.
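On the consuming side, the archive arrives as a single input file and has to be unpacked before use. Since input directories are read-only, the sketch below extracts to a scratch directory. unpack_archive is a hypothetical helper, and the input name "images" in the usage note is an assumption.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, target_dir):
    """Extract a tar archive into target_dir and return the extracted file paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Mode "r" reads plain as well as compressed tar archives.
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(target)
    return sorted(p for p in target.rglob("*") if p.is_file())
```

A downstream step might call unpack_archive("/valohai/inputs/images/processed_images.tar", "/tmp/processed_images") and then iterate over the returned paths. Only extract archives that come from a trusted execution, since extractall writes whatever paths the archive contains.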
6. No Need to Declare Outputs in YAML
You do not need to add an outputs section to valohai.yaml. Any file your code writes to /valohai/outputs/ is automatically captured and uploaded. Just save files there and Valohai handles the rest.
Validate the YAML
IMPORTANT: After adding inputs to valohai.yaml, always run vh lint to validate:
vh lint
This catches issues like invalid input names, bad URL formats in defaults, and YAML syntax errors. Fix any errors before running.
Common Migration Patterns
Full Training Script Migration
import argparse
import json
import os
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib
parser = argparse.ArgumentParser()
parser.add_argument("--n_estimators", type=int, default=100)
parser.add_argument("--max_depth", type=int, default=10)
args = parser.parse_args()
# Load data from Valohai inputs
train_df = pd.read_csv("/valohai/inputs/training-data/train.csv")
test_df = pd.read_csv("/valohai/inputs/test-data/test.csv")
X_train, y_train = train_df.drop("target", axis=1), train_df["target"]
X_test, y_test = test_df.drop("target", axis=1), test_df["target"]
# Train model
model = RandomForestClassifier(n_estimators=args.n_estimators, max_depth=args.max_depth)
model.fit(X_train, y_train)
# Evaluate and log metrics
accuracy = accuracy_score(y_test, model.predict(X_test))
print(json.dumps({"accuracy": round(accuracy, 4)}))
# Save model to Valohai outputs
joblib.dump(model, "/valohai/outputs/model.pkl")
Using datum:// References in Pipelines
Outputs from one execution can be referenced by another using datum:// aliases:
inputs:
  - name: model
    default: datum://production-model-latest
This references a specific versioned file by its alias, enabling reproducible pipeline connections.
Creating Datasets with JSONL Metadata
Valohai uses a special metadata file (valohai.metadata.jsonl) to create and version datasets. Write this file to /valohai/outputs/ alongside your output files.
Before Creating a Dataset Version: Check Existing Versions
IMPORTANT: When creating a dataset version that builds on a previous version (using from), always confirm with the user which dataset version or alias to base it on. Do not guess or hardcode version numbers.
Use the CLI to fetch available data and aliases in the project:
# List data files in the project (shows datum URLs)
vh data list
# List aliases (shows named references like "production", "latest")
vh alias list
Ask the user:
- Which dataset and version to base the new version on (e.g., dataset://my-dataset/v2)
- Or which alias to use (e.g., dataset://my-dataset/production)
- Or whether to skip basing it on a previous version entirely and create a fresh dataset
Basic Dataset Creation
import json
# Save your output files to /valohai/outputs/ as usual
df.to_csv("/valohai/outputs/processed_data.csv")
# Create the metadata file to register it as a dataset version
metadata = {
    "processed_data.csv": {
        "valohai.dataset-versions": ["dataset://my-dataset/v1"],
    },
}
with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, f)
        f.write("\n")
Dataset with Multiple Files
import json
metadata = {}
# Save multiple files and register them all in the same dataset version
for split in ["train", "val", "test"]:
    filename = f"{split}.csv"
    # df stands in for this split's DataFrame
    df.to_csv(f"/valohai/outputs/{filename}")
    metadata[filename] = {
        "valohai.dataset-versions": ["dataset://my-dataset/v1"],
    }
with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, f)
        f.write("\n")
Incremental Dataset Updates
Create a new version based on a previous one, adding or excluding files:
import json
metadata = {
    "new_data.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-dataset/v3",
                "from": "dataset://my-dataset/v2",
                "start_fresh": False,
                "exclude": ["bad_file.csv", "old_file.csv"],
            },
        ],
    },
}
with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, f)
        f.write("\n")
Dataset with Aliases
Aliases let you reference datasets by name (e.g., production, latest) instead of version number:
import json
metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-models/v5",
                "from": "dataset://my-models/v4",
                "targeting_aliases": ["production", "stable"],
            },
        ],
    },
}
with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, f)
        f.write("\n")
Packaged Datasets (Large File Collections)
For datasets with 10,000+ small files, use packaging for 10-100x faster downloads:
import json
metadata = {
    "images.tar": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-images/train-v1",
                "packaging": True,
            },
        ],
    },
}
with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, f)
        f.write("\n")
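The dataset examples above all end with the same write loop. One way to avoid repeating it is a small helper, sketched here. write_dataset_metadata is a hypothetical name, not a Valohai API, and it produces the same one-JSON-object-per-line format as the loops above.

```python
import json


def write_dataset_metadata(metadata, path="/valohai/outputs/valohai.metadata.jsonl"):
    """Write one JSON line per output file, matching the examples above."""
    with open(path, "w") as f:
        for file_name, file_metadata in metadata.items():
            line = json.dumps({"file": file_name, "metadata": file_metadata})
            f.write(line + "\n")
```

Each script would then build its metadata dict as before and finish with a single write_dataset_metadata(metadata) call. Opening the file in "w" mode means a second call overwrites the first, so collect all entries into one dict before writing.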
Consuming Datasets as Inputs
Reference datasets in valohai.yaml inputs using the dataset:// URI:
inputs:
  - name: training-data
    default: dataset://my-dataset/latest
  - name: model
    default: dataset://my-models/production
Execution Environment Reference
Every Valohai execution has:
- /valohai/repository/ - Your code (working directory)
- /valohai/inputs/{input-name}/ - Downloaded input files (read-only)
- /valohai/outputs/ - Write output files here
- /valohai/config/parameters.json - Auto-generated parameter config
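As an alternative to argparse, a script could read the auto-generated parameter file directly. The sketch below assumes the file holds a flat JSON object mapping parameter names to values; that layout is an assumption worth checking against a real execution. load_parameters is a hypothetical helper.

```python
import json
from pathlib import Path


def load_parameters(path="/valohai/config/parameters.json", defaults=None):
    """Return defaults overlaid with values from the parameter file, if it exists."""
    params = dict(defaults or {})
    config = Path(path)
    if config.is_file():
        # Assumed layout: {"n_estimators": 200, "max_depth": 5, ...}
        params.update(json.loads(config.read_text()))
    return params
```

Called as load_parameters(defaults={"n_estimators": 100, "max_depth": 10}), this falls back to the defaults when the file is absent, for example when running outside Valohai.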
Edge Cases
- Input directories are read-only - never write to /valohai/inputs/
- Input names in YAML map to directory names under /valohai/inputs/
- If multiple files have the same name across different default URLs, they overwrite each other
- When running locally (not on Valohai), /valohai/inputs/ won't exist - consider adding a fallback path for local development
- No file size limits on outputs
- Any file type is supported for both inputs and outputs
- Wildcard inputs: all matching files are downloaded to the same directory
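The local-development fallback mentioned above could look roughly like this. It is a sketch under assumptions: input_path is a hypothetical helper, and the local layout data/{input-name}/{file} is an arbitrary convention, not something Valohai prescribes.

```python
from pathlib import Path


def input_path(input_name, file_name, local_root="data", valohai_root="/valohai/inputs"):
    """Resolve an input file, preferring the Valohai directory when it exists."""
    valohai_dir = Path(valohai_root) / input_name
    if valohai_dir.is_dir():
        return valohai_dir / file_name
    # Not on Valohai: fall back to a local copy, e.g. data/training-data/train.csv
    return Path(local_root) / input_name / file_name
```

A training script would then call pd.read_csv(input_path("training-data", "train.csv")) and behave the same way in both environments.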