# Pipeline

Orchestrates ephemeral Vast.ai GPU instances: searches for an offer, creates the instance, syncs the project, trains, downloads `outputs/`, and destroys the instance automatically. Generator runs also rsync `generator/outputs/` every 50 epochs while training is still running.

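The lifecycle described above can be pictured as a create/train/download/destroy sequence with guaranteed cleanup. The following is only a minimal sketch under stated assumptions: `run_lifecycle` and its four callables are hypothetical names for illustration and are not the pipeline's actual functions. It assumes the instance is destroyed on failure unless `--keep-on-failure` is set, as the flag description suggests.

```python
from typing import Any, Callable


def run_lifecycle(
    create: Callable[[], Any],
    train: Callable[[Any], None],
    download: Callable[[Any], None],
    destroy: Callable[[Any], None],
    keep_on_failure: bool = False,
) -> bool:
    """Run one ephemeral-instance cycle (hypothetical helper, not the real API)."""
    instance = create()
    succeeded = False
    try:
        train(instance)
        download(instance)
        succeeded = True
    finally:
        # Always clean up after success; after a failure, clean up unless
        # the caller asked to keep the instance for debugging.
        if succeeded or not keep_on_failure:
            destroy(instance)
    return succeeded
```

Putting the teardown in `finally` means an exception raised during training still propagates to the caller, but only after the instance has been released (or deliberately kept).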
## One-time setup

Create `pipeline/.env`:

```dotenv
VAST_API_KEY=your-vast-api-key
VAST_SSH_PRIVATE_KEY=/home/you/.ssh/id_ed25519 # optional, this is the default
```

The matching `.pub` file must exist alongside the private key. The pipeline registers the public key with Vast.ai automatically if it is not already registered.

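To check the setup before a first run, a small script along these lines may help. It is a sketch, not the pipeline's own loader: `parse_env` and `public_key_path` are hypothetical helpers, the parsing rules (including inline `#` comments) are an assumption, and the default key is assumed to be `~/.ssh/id_ed25519`.

```python
from pathlib import Path

# Assumed default, matching the example above.
DEFAULT_KEY = Path.home() / ".ssh" / "id_ed25519"


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "value # note".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def public_key_path(env: dict[str, str]) -> Path:
    """Return the .pub path expected next to the configured private key."""
    private = Path(env.get("VAST_SSH_PRIVATE_KEY") or DEFAULT_KEY).expanduser()
    return private.with_name(private.name + ".pub")


if __name__ == "__main__":
    env_file = Path("pipeline/.env")
    env = parse_env(env_file.read_text()) if env_file.exists() else {}
    print("VAST_API_KEY set:", bool(env.get("VAST_API_KEY")))
    pub = public_key_path(env)
    print(f"{pub} exists:", pub.exists())
```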
## Commands

### `run`: train on a remote GPU and fetch results

```
python -m pipeline run <config...> [options]
```

Accepts one or more config paths, or a single directory (all `*.json` inside, sorted). Duplicate configs (identical training settings after resolving `extends` and `shared.json`) are skipped automatically.

| Flag | Default | Description |
|------|---------|-------------|
| `configs` | *(required)* | One or more config paths, or a directory of JSON configs |
| `--download-data` | off | Download the DFF dataset via HuggingFace on the remote before training |
| `--send-cropped` | off | Rsync local `cropped/{classifier,generator}/` to remote (picks subdirectory based on config) |
| `--select-offer` | off | Interactively browse and pick the GPU offer |
| `--sort` | config | Ranking mode: `price`, `performance`, or `dlp_per_dollar` |
| `--region TEXT` | any | Filter by region, e.g. `europe`, `Portugal`, `US` |
| `--price FLOAT` | config | Max hourly price cap in USD |
| `--dry-run` | off | Print matching offers without creating an instance |
| `--keep-on-failure` | off | Do not destroy the instance if training fails |
| `--no-gpu` | off | Disable GPU training on remote (use CPU instead) |
| `--select-template` | off | Interactively choose a Vast.ai Docker template |
| `--template HASH` | config | Use a specific template hash ID |
| `--pipeline-config PATH` | none | JSON file that overrides `pipeline/defaults/vast.json` |

**Examples:**

```bash
# Best-ranked matching offer in Europe (ranking mode taken from config), download data on remote
python -m pipeline run configs/resnet18.json --region europe --download-data

# Browse offers interactively, sort by price
python -m pipeline run configs/resnet18.json --select-offer --sort price

# Run all configs in a directory sequentially on one instance
python -m pipeline run configs/phase2/ --region europe

# See what offers would be selected without spending money
python -m pipeline run configs/resnet18.json --dry-run --region europe

# Keep the instance alive if something goes wrong (for debugging)
python -m pipeline run configs/resnet18.json --keep-on-failure

# Cap price at $0.12/h
python -m pipeline run configs/resnet18.json --price 0.12
```

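The config handling described for `run` (directory expansion and duplicate skipping) could work roughly as sketched below. This is an illustration under assumptions, not the pipeline's code: `expand_configs`, `fingerprint`, and `drop_duplicates` are hypothetical names, and the sketch takes settings that have already been resolved, since the exact `extends` and `shared.json` semantics are not specified here.

```python
import hashlib
import json
from pathlib import Path


def expand_configs(args: list[str]) -> list[Path]:
    """A single directory expands to its sorted *.json files; paths pass through."""
    paths = [Path(a) for a in args]
    if len(paths) == 1 and paths[0].is_dir():
        return sorted(paths[0].glob("*.json"))
    return paths


def fingerprint(settings: dict) -> str:
    """Hash a canonical JSON dump so key order does not affect equality."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(resolved: dict[str, dict]) -> list[str]:
    """Keep the first config name for each distinct set of resolved settings."""
    seen: set[str] = set()
    kept: list[str] = []
    for name, settings in resolved.items():
        digest = fingerprint(settings)
        if digest in seen:
            continue  # identical training settings: skip
        seen.add(digest)
        kept.append(name)
    return kept
```

Comparing fingerprints of the resolved settings, rather than file contents, is what lets two differently written files count as the same run.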
### `offers`: inspect available GPU offers

```
python -m pipeline offers [options]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--sort` | config | Ranking mode: `price`, `performance`, or `dlp_per_dollar` |
| `--region TEXT` | any | Region filter |
| `--price FLOAT` | config | Max hourly price cap |
| `--select-offer` | off | Interactive offer picker (prints the selected offer as JSON) |
| `--list-regions` | off | Print a count of available offers per region and exit |
| `--limit-output INT` | 10 | How many offers to print |
| `--pipeline-config PATH` | none | Pipeline config override |

**Examples:**

```bash
# See the 20 best-value offers under $0.15/h in Europe
python -m pipeline offers --region europe --price 0.15 --limit-output 20

# List which regions have matching GPUs
python -m pipeline offers --list-regions

# Interactive picker, useful before committing to a run
python -m pipeline offers --select-offer --sort price
```

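The three ranking modes can be read as different sort keys over the same offer list. The sketch below shows one plausible interpretation. The function `rank_offers` is hypothetical, and the field names `dph_total` (dollars per hour) and `dlperf` (deep-learning performance score) are assumptions about the offer records; the pipeline's actual ranking may differ.

```python
def rank_offers(offers, sort_mode="dlp_per_dollar", max_price=None, limit=10):
    """Filter by price cap, then rank by the chosen mode (illustrative only)."""
    if max_price is not None:
        offers = [o for o in offers if o["dph_total"] <= max_price]

    if sort_mode == "price":
        # Cheapest first.
        ranked = sorted(offers, key=lambda o: o["dph_total"])
    elif sort_mode == "performance":
        # Highest performance score first.
        ranked = sorted(offers, key=lambda o: o["dlperf"], reverse=True)
    elif sort_mode == "dlp_per_dollar":
        # Best performance per dollar first; assumes a positive hourly price.
        ranked = sorted(
            offers, key=lambda o: o["dlperf"] / o["dph_total"], reverse=True
        )
    else:
        raise ValueError(f"unknown sort mode: {sort_mode!r}")
    return ranked[:limit]


if __name__ == "__main__":
    sample = [
        {"id": 1, "dph_total": 0.10, "dlperf": 20.0},
        {"id": 2, "dph_total": 0.20, "dlperf": 50.0},
        {"id": 3, "dph_total": 0.30, "dlperf": 45.0},
    ]
    for mode in ("price", "performance", "dlp_per_dollar"):
        print(mode, [o["id"] for o in rank_offers(sample, sort_mode=mode)])
```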
### `up`: create an instance without training

Spins up an instance and prints SSH connection details. Useful for manual experiments or debugging.

```
python -m pipeline up [options]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--label TEXT` | auto | Optional label for the instance |
| `--select-template` | off | Interactively choose a Vast.ai Docker template |
| `--template HASH` | config | Use a specific template hash ID |
| `--pipeline-config PATH` | none | Pipeline config override |

```bash
python -m pipeline up
python -m pipeline up --label my-debug-session
```

### `status`: show instance details

```
python -m pipeline status <instance_id> [--pipeline-config PATH]
```

### `down`: destroy an instance

```
python -m pipeline down <instance_id> [--pipeline-config PATH]
```

## Pipeline config overrides

Pass `--pipeline-config my_overrides.json` to override any field from `pipeline/defaults/vast.json`. The override file is deep-merged into the defaults: only the fields you specify change, and everything else keeps its default value. This is useful for switching GPU types or raising the price cap for a single run without editing the defaults.

**Example (allow RTX 4090, higher price cap):**

```json
{
  "search": {
    "gpu_names": ["RTX 4090"],
    "max_dph_total": 0.45
  }
}
```

**Key fields in `pipeline/defaults/vast.json`:**

| Section | Key | Default | Meaning |
|---------|-----|---------|---------|
| `search` | `gpu_names` | `["RTX 3090", "RTX 3090 Ti"]` | Accepted GPU models |
| `search` | `max_dph_total` | `0.40` | Max price per hour (USD) |
| `search` | `sort_mode` | `"dlp_per_dollar"` | Default ranking (`price`, `performance`, or `dlp_per_dollar`) |
| `search` | `min_reliability` | `0.98` | Minimum host reliability score |
| `instance` | `disk_gb` | `48` | Disk size provisioned on the instance (GB) |
| `instance` | `image` | `"vastai/pytorch:latest"` | Docker image |
| `remote` | `workspace_dir` | `"/workspace/DRL_PROJ"` | Remote working directory |
| `remote` | `ssh_timeout_seconds` | `900` | How long to wait for SSH to become available |

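The deep-merge behaviour described in this section can be illustrated with a short recursive function. This is a sketch of the general technique, not necessarily how the pipeline implements it: `deep_merge` is a hypothetical name, and treating lists such as `gpu_names` as replaced wholesale (rather than concatenated) is an assumption.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict: nested dicts merge key by key, other values replace."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale (assumed behaviour).
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    defaults = {
        "search": {
            "gpu_names": ["RTX 3090", "RTX 3090 Ti"],
            "max_dph_total": 0.40,
            "sort_mode": "dlp_per_dollar",
        },
        "instance": {"disk_gb": 48},
    }
    overrides = {"search": {"gpu_names": ["RTX 4090"], "max_dph_total": 0.45}}
    print(deep_merge(defaults, overrides))
```

With the override file from the example above, `search.gpu_names` and `search.max_dph_total` change while `search.sort_mode` and the whole `instance` section keep their defaults.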
## Full workflow example

```bash
# 1. Check what's available and how much it costs
python -m pipeline offers --region europe --list-regions
python -m pipeline offers --region europe --sort price --limit-output 20

# 2. Run training (auto-selects best offer, downloads data if needed)
python -m pipeline run configs/resnet18.json --region europe --download-data

# 3. Results land in classifier/outputs/ automatically
```