Johnny Fernandes bb3dfb92d5 Clean state
2026-04-30 01:25:39 +01:00


Pipeline

Orchestrates ephemeral Vast.ai GPU instances: searches for an offer, creates the instance, syncs the project, trains, downloads outputs/, and destroys the instance automatically. Generator runs also rsync generator/outputs/ every 50 epochs while training is still running.

One-time setup

Create pipeline/.env:

VAST_API_KEY=your-vast-api-key
VAST_SSH_PRIVATE_KEY=/home/you/.ssh/id_ed25519   # optional, this is the default

The matching .pub file must exist alongside the private key. The pipeline registers the public key with Vast.ai automatically if it is not already registered.
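To make the key-pair requirement concrete, here is a minimal sketch of the kind of check that could run before any instance is created. It is not the pipeline's actual code: find_public_key is a hypothetical helper, and the ~/.ssh/id_ed25519 default is an assumed shorthand for the path shown in the .env example.

```python
from pathlib import Path

# Assumed shorthand for the default private key path shown in the .env example.
DEFAULT_KEY = "~/.ssh/id_ed25519"


def find_public_key(private_key: str = DEFAULT_KEY) -> Path:
    """Return the .pub path next to the private key, or raise if either file is missing."""
    private_path = Path(private_key).expanduser()
    # id_ed25519 -> id_ed25519.pub, in the same directory
    public_path = private_path.with_name(private_path.name + ".pub")
    if not private_path.is_file():
        raise FileNotFoundError(f"private key not found: {private_path}")
    if not public_path.is_file():
        raise FileNotFoundError(f"public key not found: {public_path}")
    return public_path
```

Failing early with a clear message is cheaper than discovering a missing key after an instance has already been rented.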

Commands

run — train on a remote GPU and fetch results

python -m pipeline run <config...> [options]

Accepts one or more config paths, or a single directory (all *.json inside, sorted). Duplicate configs (identical training settings after resolving extends and shared.json) are skipped automatically.
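The argument handling and duplicate skipping described above can be sketched as follows. This is an illustration under assumptions, not the pipeline's implementation: expand_config_args, fingerprint, and drop_duplicates are hypothetical names, and the settings dicts are assumed to be already resolved (extends and shared.json applied elsewhere).

```python
import hashlib
import json
from pathlib import Path


def expand_config_args(args: list[str]) -> list[Path]:
    """A single directory expands to its *.json files, sorted; otherwise paths keep their order."""
    if len(args) == 1 and Path(args[0]).is_dir():
        return sorted(Path(args[0]).glob("*.json"))
    return [Path(a) for a in args]


def fingerprint(settings: dict) -> str:
    """Hash a canonical JSON dump so that key order does not affect equality."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(resolved: list[tuple[Path, dict]]) -> list[Path]:
    """Keep the first config path for each distinct set of resolved settings."""
    seen: set[str] = set()
    unique: list[Path] = []
    for path, settings in resolved:
        key = fingerprint(settings)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```

Comparing fingerprints of the resolved settings, rather than file contents, means two files that differ only in layout or key order count as the same run.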

| Flag | Default | Description |
| --- | --- | --- |
| configs | (required) | One or more config paths, or a directory of JSON configs |
| --download-data | off | Download the DFF dataset via HuggingFace on the remote before training |
| --send-cropped | off | Rsync local cropped/{classifier,generator}/ to remote (picks subdirectory based on config) |
| --select-offer | off | Interactively browse and pick the GPU offer |
| --sort | config | Ranking mode: price, performance, or dlp_per_dollar |
| --region TEXT | any | Filter by region, e.g. europe, Portugal, US |
| --price FLOAT | config | Max hourly price cap in USD |
| --dry-run | off | Print matching offers without creating an instance |
| --keep-on-failure | off | Do not destroy the instance if training fails |
| --no-gpu | off | Disable GPU training on remote (use CPU instead) |
| --select-template | off | Interactively choose a Vast.ai Docker template |
| --template HASH | config | Use a specific template hash ID |
| --pipeline-config PATH | none | JSON file that overrides pipeline/defaults/vast.json |

Examples:

# Best-ranked matching offer in Europe (default sort), download data on remote
python -m pipeline run configs/resnet18.json --region europe --download-data

# Browse offers interactively, sort by price
python -m pipeline run configs/resnet18.json --select-offer --sort price

# Run all configs in a directory sequentially on one instance
python -m pipeline run configs/phase2/ --region europe

# See what offers would be selected without spending money
python -m pipeline run configs/resnet18.json --dry-run --region europe

# Keep the instance alive if something goes wrong (for debugging)
python -m pipeline run configs/resnet18.json --keep-on-failure

# Cap price at $0.12/h
python -m pipeline run configs/resnet18.json --price 0.12
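The automatic teardown, and the way --keep-on-failure changes it, can be pictured as a try/finally around training. This is a minimal sketch rather than the pipeline's actual code: run_with_cleanup, train, and destroy are hypothetical names standing in for whatever the pipeline uses internally.

```python
from typing import Callable


def run_with_cleanup(
    train: Callable[[], None],
    destroy: Callable[[], None],
    keep_on_failure: bool = False,
) -> None:
    """Run train(), then destroy the instance unless a failure should keep it alive.

    Exceptions raised by train() propagate to the caller after cleanup.
    """
    succeeded = False
    try:
        train()
        succeeded = True
    finally:
        # Always destroy after success; after a failure, destroy only
        # when the caller did not ask to keep the instance for debugging.
        if succeeded or not keep_on_failure:
            destroy()
```

With keep_on_failure=True, a failed run leaves the instance running and billing, so it has to be removed by hand with the down command once debugging is finished.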

offers — inspect available GPU offers

python -m pipeline offers [options]

| Flag | Default | Description |
| --- | --- | --- |
| --sort | config | Ranking mode: price, performance, or dlp_per_dollar |
| --region TEXT | any | Region filter |
| --price FLOAT | config | Max hourly price cap |
| --select-offer | off | Interactive offer picker (prints the selected offer as JSON) |
| --list-regions | off | Print a count of available offers per region and exit |
| --limit-output INT | 10 | How many offers to print |
| --pipeline-config PATH | none | Pipeline config override |

Examples:

# See the 20 best-value offers under $0.15/h in Europe
python -m pipeline offers --region europe --price 0.15 --limit-output 20

# List which regions have matching GPUs
python -m pipeline offers --list-regions

# Interactive picker — useful before committing to a run
python -m pipeline offers --select-offer --sort price
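The three ranking modes can be illustrated with a small sorting sketch. Everything here is an assumption made for illustration: rank_offers is a hypothetical function, each offer is assumed to be a dict with a dph_total key (hourly price, mirroring max_dph_total in the config) and a dlperf key (a performance score), and dlp_per_dollar is assumed to mean performance divided by hourly price. The pipeline's real ranking may differ.

```python
def rank_offers(offers, sort_mode="dlp_per_dollar", max_price=None):
    """Filter offers by an hourly price cap, then order them best-first.

    Assumes every offer has a positive "dph_total" and a numeric "dlperf".
    """
    if max_price is not None:
        offers = [o for o in offers if o["dph_total"] <= max_price]
    if sort_mode == "price":
        # Cheapest first
        return sorted(offers, key=lambda o: o["dph_total"])
    if sort_mode == "performance":
        # Highest performance score first
        return sorted(offers, key=lambda o: o["dlperf"], reverse=True)
    if sort_mode == "dlp_per_dollar":
        # Best performance per dollar first
        return sorted(offers, key=lambda o: o["dlperf"] / o["dph_total"], reverse=True)
    raise ValueError(f"unknown sort mode: {sort_mode}")
```

The modes can disagree: the cheapest offer is often not the best value, which is why a value-based default tends to select a mid-priced machine.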

up — create an instance without training

Spins up an instance and prints SSH connection details. Useful for manual experiments or debugging.

python -m pipeline up [options]

| Flag | Default | Description |
| --- | --- | --- |
| --label TEXT | auto | Optional label for the instance |
| --select-template | off | Interactively choose a Vast.ai Docker template |
| --template HASH | config | Use a specific template hash ID |
| --pipeline-config PATH | none | Pipeline config override |

Examples:

python -m pipeline up
python -m pipeline up --label my-debug-session

status — show instance details

python -m pipeline status <instance_id> [--pipeline-config PATH]

down — destroy an instance

python -m pipeline down <instance_id> [--pipeline-config PATH]

Pipeline config overrides

Pass --pipeline-config my_overrides.json to override any field from pipeline/defaults/vast.json. Only the fields you specify are changed; the rest keep their defaults (deep-merged). Useful for switching GPU types or raising the price cap for a single run without editing defaults.

Example — allow RTX 4090, higher price cap:

{
  "search": {
    "gpu_names": ["RTX 4090"],
    "max_dph_total": 0.45
  }
}

Key fields in pipeline/defaults/vast.json:

| Section | Key | Default | Meaning |
| --- | --- | --- | --- |
| search | gpu_names | ["RTX 3090", "RTX 3090 Ti"] | Accepted GPU models |
| search | max_dph_total | 0.40 | Max price per hour |
| search | sort_mode | "dlp_per_dollar" | Default ranking (price, performance, or dlp_per_dollar) |
| search | min_reliability | 0.98 | Minimum host reliability score |
| instance | disk_gb | 48 | Disk size provisioned on the instance |
| instance | image | "vastai/pytorch:latest" | Docker image |
| remote | workspace_dir | "/workspace/DRL_PROJ" | Remote working directory |
| remote | ssh_timeout_seconds | 900 | How long to wait for SSH to become available |
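The deep-merge behaviour of --pipeline-config can be sketched in a few lines, using a subset of the defaults listed in this section and the RTX 4090 override example. deep_merge is a hypothetical helper, not necessarily how the pipeline implements it, and replacing lists wholesale (rather than concatenating them) is an assumption.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where overrides win key by key, recursing into nested dicts."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced outright (assumption: lists are not concatenated)
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "search": {
        "gpu_names": ["RTX 3090", "RTX 3090 Ti"],
        "max_dph_total": 0.40,
        "sort_mode": "dlp_per_dollar",
        "min_reliability": 0.98,
    },
    "instance": {"disk_gb": 48},
}
overrides = {"search": {"gpu_names": ["RTX 4090"], "max_dph_total": 0.45}}

config = deep_merge(defaults, overrides)
```

After the merge, config["search"] carries the new GPU list and price cap, while sort_mode, min_reliability, and the whole instance section keep their default values.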

Full workflow example

# 1. Check what's available and how much it costs
python -m pipeline offers --region europe --list-regions
python -m pipeline offers --region europe --sort price --limit-output 20

# 2. Run training (auto-selects the best-ranked offer, downloads the dataset on the remote first)
python -m pipeline run configs/resnet18.json --region europe --download-data

# 3. Results land in classifier/outputs/ automatically