DRL_PROJ/pipeline/README.md
Johnny Fernandes bb3dfb92d5 Clean state
2026-04-30 01:25:39 +01:00


# Pipeline
Orchestrates ephemeral Vast.ai GPU instances: searches for an offer, creates the instance, syncs the project, trains, downloads `outputs/`, and destroys the instance automatically. Generator runs also rsync `generator/outputs/` every 50 epochs while training is still running.
## One-time setup
Create `pipeline/.env`:
```dotenv
VAST_API_KEY=your-vast-api-key
VAST_SSH_PRIVATE_KEY=/home/you/.ssh/id_ed25519 # optional, this is the default
```
The matching `.pub` file must exist alongside the private key. The pipeline registers the public key with Vast.ai automatically if it is not already registered.
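
A minimal sketch of the check this setup implies, assuming a plain `KEY=value` parser and a default key at `~/.ssh/id_ed25519`. The helper names `load_env` and `resolve_key_pair` are hypothetical and are not taken from the pipeline's code.

```python
from pathlib import Path


def load_env(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blank lines and comments."""
    env: dict[str, str] = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "value # note".
        value = value.split(" #", 1)[0].strip().strip('"').strip("'")
        env[key.strip()] = value
    return env


def resolve_key_pair(env: dict[str, str]) -> tuple[Path, Path]:
    """Return (private, public) key paths, failing early if either is missing."""
    default = Path.home() / ".ssh" / "id_ed25519"  # assumed default location
    private = Path(env.get("VAST_SSH_PRIVATE_KEY") or default).expanduser()
    public = private.with_name(private.name + ".pub")
    for key_file in (private, public):
        if not key_file.is_file():
            raise FileNotFoundError(f"SSH key file not found: {key_file}")
    return private, public
```

Failing before any instance is created avoids paying for a machine that cannot be reached over SSH.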
## Commands
### `run` — train on a remote GPU and fetch results
```
python -m pipeline run <config...> [options]
```
Accepts one or more config paths, or a single directory (all `*.json` inside, sorted). Duplicate configs (identical training settings after resolving `extends` and `shared.json`) are skipped automatically.
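
The argument handling and duplicate skipping described above could look roughly like the following. This is a sketch under assumptions: `expand_config_args` and `drop_duplicates` are hypothetical names, and the settings passed to `drop_duplicates` are assumed to be already resolved (that is, `extends` and `shared.json` have been merged in). Which keys count as training settings is not specified here, so the whole resolved dict is fingerprinted.

```python
import json
from pathlib import Path


def expand_config_args(args: list[str]) -> list[Path]:
    """A single directory expands to its sorted *.json files; other args are kept as given."""
    if len(args) == 1 and Path(args[0]).is_dir():
        return sorted(Path(args[0]).glob("*.json"))
    return [Path(a) for a in args]


def drop_duplicates(resolved: list[tuple[Path, dict]]) -> list[Path]:
    """Keep the first path for each distinct set of resolved settings."""
    seen: set[str] = set()
    unique: list[Path] = []
    for path, settings in resolved:
        # Canonical JSON, so key order does not affect the fingerprint.
        fingerprint = json.dumps(settings, sort_keys=True, separators=(",", ":"))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(path)
    return unique
```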
| Flag | Default | Description |
|------|---------|-------------|
| `configs` | *(required)* | One or more config paths, or a directory of JSON configs |
| `--download-data` | off | Download the DFF dataset via HuggingFace on the remote before training |
| `--send-cropped` | off | Rsync local `cropped/{classifier,generator}/` to remote (picks subdirectory based on config) |
| `--select-offer` | off | Interactively browse and pick the GPU offer |
| `--sort` | config | Ranking mode: `price`, `performance`, or `dlp_per_dollar` |
| `--region TEXT` | any | Filter by region, e.g. `europe`, `Portugal`, `US` |
| `--price FLOAT` | config | Max hourly price cap in USD |
| `--dry-run` | off | Print matching offers without creating an instance |
| `--keep-on-failure` | off | Do not destroy the instance if training fails |
| `--no-gpu` | off | Disable GPU training on remote (use CPU instead) |
| `--select-template` | off | Interactively choose a Vast.ai Docker template |
| `--template HASH` | config | Use a specific template hash ID |
| `--pipeline-config PATH` | none | JSON file that overrides `pipeline/defaults/vast.json` |
**Examples:**
```bash
# Top-ranked RTX 3090 offer in Europe (default sort: dlp_per_dollar), download data on remote
python -m pipeline run configs/resnet18.json --region europe --download-data
# Browse offers interactively, sort by price
python -m pipeline run configs/resnet18.json --select-offer --sort price
# Run all configs in a directory sequentially on one instance
python -m pipeline run configs/phase2/ --region europe
# See what offers would be selected without spending money
python -m pipeline run configs/resnet18.json --dry-run --region europe
# Keep the instance alive if something goes wrong (for debugging)
python -m pipeline run configs/resnet18.json --keep-on-failure
# Cap price at $0.12/h
python -m pipeline run configs/resnet18.json --price 0.12
```
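The cleanup behaviour behind `--keep-on-failure` can be sketched as follows. The function name `run_lifecycle` and its callable parameters are hypothetical stand-ins for the real create, train, download, and destroy steps; the actual pipeline may order or handle these differently, for example by still downloading partial outputs after a failure.

```python
from typing import Callable


def run_lifecycle(
    create: Callable[[], int],
    train: Callable[[int], None],
    download: Callable[[int], None],
    destroy: Callable[[int], None],
    keep_on_failure: bool = False,
) -> None:
    """Create an instance, train, fetch outputs, and clean up."""
    instance_id = create()
    failed = True
    try:
        train(instance_id)
        download(instance_id)
        failed = False
    finally:
        # Destroy on success, and on failure unless the caller wants to debug.
        if not (failed and keep_on_failure):
            destroy(instance_id)
```

Using `finally` rather than `except Exception` means the instance is also destroyed on interrupts such as Ctrl+C, so an aborted run does not keep billing.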
### `offers` — inspect available GPU offers
```
python -m pipeline offers [options]
```
| Flag | Default | Description |
|------|---------|-------------|
| `--sort` | config | Ranking mode: `price`, `performance`, or `dlp_per_dollar` |
| `--region TEXT` | any | Region filter |
| `--price FLOAT` | config | Max hourly price cap in USD |
| `--select-offer` | off | Interactive offer picker (prints the selected offer as JSON) |
| `--list-regions` | off | Print a count of available offers per region and exit |
| `--limit-output INT` | 10 | How many offers to print |
| `--pipeline-config PATH` | none | Pipeline config override |
**Examples:**
```bash
# See the 20 best-value offers under $0.15/h in Europe
python -m pipeline offers --region europe --price 0.15 --limit-output 20
# List which regions have matching GPUs
python -m pipeline offers --list-regions
# Interactive picker — useful before committing to a run
python -m pipeline offers --select-offer --sort price
```
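To illustrate how the three `--sort` modes can rank the same offers differently, here is a small sketch. The field names `dph_total` (price per hour) and `dlperf` (a deep-learning performance score) are assumptions modelled on the `max_dph_total` config key, and `rank_offers` is a hypothetical helper; the pipeline's real ranking logic may differ.

```python
def rank_offers(offers: list[dict], sort_mode: str = "dlp_per_dollar") -> list[dict]:
    """Return offers ordered best-first for the given ranking mode."""
    keys = {
        "price": lambda o: o["dph_total"],                          # cheapest first
        "performance": lambda o: -o["dlperf"],                      # fastest first
        "dlp_per_dollar": lambda o: -o["dlperf"] / o["dph_total"],  # best value first
    }
    if sort_mode not in keys:
        raise ValueError(f"unknown sort mode: {sort_mode!r}")
    return sorted(offers, key=keys[sort_mode])


# Made-up offers for illustration only; prices are assumed to be positive.
offers = [
    {"id": 1, "dph_total": 0.10, "dlperf": 20.0},
    {"id": 2, "dph_total": 0.30, "dlperf": 45.0},
    {"id": 3, "dph_total": 0.20, "dlperf": 44.0},
]
print([o["id"] for o in rank_offers(offers, "price")])           # [1, 3, 2]
print([o["id"] for o in rank_offers(offers, "performance")])     # [2, 3, 1]
print([o["id"] for o in rank_offers(offers, "dlp_per_dollar")])  # [3, 1, 2]
```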
### `up` — create an instance without training
Spins up an instance and prints SSH connection details. Useful for manual experiments or debugging.
```
python -m pipeline up [options]
```
| Flag | Default | Description |
|------|---------|-------------|
| `--label TEXT` | auto | Optional label for the instance |
| `--select-template` | off | Interactively choose a Vast.ai Docker template |
| `--template HASH` | config | Use a specific template hash ID |
| `--pipeline-config PATH` | none | Pipeline config override |

**Examples:**
```bash
python -m pipeline up
python -m pipeline up --label my-debug-session
```
### `status` — show instance details
```
python -m pipeline status <instance_id> [--pipeline-config PATH]
```
### `down` — destroy an instance
```
python -m pipeline down <instance_id> [--pipeline-config PATH]
```
## Pipeline config overrides
Pass `--pipeline-config my_overrides.json` to override any field from `pipeline/defaults/vast.json`. Only the fields you specify are changed; the rest keep their defaults (deep-merged). Useful for switching GPU types or raising the price cap for a single run without editing defaults.
**Example — allow RTX 4090, higher price cap:**
```json
{
"search": {
"gpu_names": ["RTX 4090"],
"max_dph_total": 0.45
}
}
```
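The deep merge can be pictured with the sketch below. `deep_merge` is a hypothetical helper, not the pipeline's own function, and it assumes that lists such as `gpu_names` are replaced wholesale rather than concatenated; the README does not state which of the two the pipeline does.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied; nested dicts merge key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists replace the default entirely (an assumption).
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "search": {
        "gpu_names": ["RTX 3090", "RTX 3090 Ti"],
        "max_dph_total": 0.40,
        "min_reliability": 0.98,
    },
    "instance": {"disk_gb": 48},
}
overrides = {"search": {"gpu_names": ["RTX 4090"], "max_dph_total": 0.45}}

effective = deep_merge(defaults, overrides)
# "min_reliability" and "instance" are untouched; only the two overridden keys change.
```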
**Key fields in `pipeline/defaults/vast.json`:**
| Section | Key | Default | Meaning |
|---------|-----|---------|---------|
| `search` | `gpu_names` | `["RTX 3090", "RTX 3090 Ti"]` | Accepted GPU models |
| `search` | `max_dph_total` | `0.40` | Max price per hour in USD |
| `search` | `sort_mode` | `"dlp_per_dollar"` | Default ranking (`price`, `performance`, or `dlp_per_dollar`) |
| `search` | `min_reliability` | `0.98` | Minimum host reliability score |
| `instance` | `disk_gb` | `48` | Disk size provisioned on the instance, in GB |
| `instance` | `image` | `"vastai/pytorch:latest"` | Docker image |
| `remote` | `workspace_dir` | `"/workspace/DRL_PROJ"` | Remote working directory |
| `remote` | `ssh_timeout_seconds` | `900` | How long to wait for SSH to become available |
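
As an illustration of what `ssh_timeout_seconds` bounds, the sketch below polls a TCP port until it accepts connections or the deadline passes. `wait_for_ssh` and its `interval` parameter are hypothetical, and a successful TCP connect is only a proxy for SSH readiness; the pipeline may probe the instance differently.

```python
import socket
import time


def wait_for_ssh(host: str, port: int, timeout_seconds: float = 900, interval: float = 5.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            # Each attempt has its own short connect timeout.
            with socket.create_connection((host, port), timeout=min(interval, 10.0)):
                return True
        except OSError:
            pass  # not reachable yet (refused, timed out, no route)
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)
```

A monotonic clock is used for the deadline so that system clock adjustments during a long wait do not shorten or extend it.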
## Full workflow example
```bash
# 1. Check what's available and how much it costs
python -m pipeline offers --region europe --list-regions
python -m pipeline offers --region europe --sort price --limit-output 20
# 2. Run training (auto-selects the top-ranked offer, downloads the dataset on the remote)
python -m pipeline run configs/resnet18.json --region europe --download-data
# 3. Results land in classifier/outputs/ automatically
```