# Pipeline

Orchestrates ephemeral Vast.ai GPU instances: searches for an offer, creates the instance, syncs the project, trains, downloads `outputs/`, and destroys the instance automatically. Generator runs also rsync `generator/outputs/` every 50 epochs while training is still running.

## One-time setup

Create `pipeline/.env`:

```dotenv
VAST_API_KEY=your-vast-api-key
VAST_SSH_PRIVATE_KEY=/home/you/.ssh/id_ed25519   # optional, this is the default
```

The matching `.pub` file must exist alongside the private key. The pipeline registers the public key with Vast.ai automatically if it is not already registered.
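
Before spending money on an instance, it can help to confirm that the setup above is complete. The following is a minimal sketch of such a check, not the pipeline's own code: `read_env`, `check_setup`, and `DEFAULT_KEY` are hypothetical names, and the default key path is assumed to be `~/.ssh/id_ed25519` for the current user.

```python
from pathlib import Path

# Assumption: the documented default key path, expanded for the current user.
DEFAULT_KEY = Path.home() / ".ssh" / "id_ed25519"


def read_env(path):
    """Parse simple KEY=VALUE lines, ignoring blanks, comments, and trailing ' # ...' notes."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.split(" #", 1)[0].strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_setup(env_path="pipeline/.env"):
    """Return a list of problems; an empty list means the setup looks complete."""
    env = read_env(env_path)
    problems = []
    if not env.get("VAST_API_KEY"):
        problems.append("VAST_API_KEY is missing")
    private_key = Path(env.get("VAST_SSH_PRIVATE_KEY") or DEFAULT_KEY).expanduser()
    public_key = Path(str(private_key) + ".pub")  # the .pub file sits next to the private key
    if not private_key.is_file():
        problems.append(f"private key not found: {private_key}")
    if not public_key.is_file():
        problems.append(f"public key not found: {public_key}")
    return problems
```

Calling `check_setup()` from the project root returns an empty list when the API key and both key files are in place.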

## Commands

### `run` — train on a remote GPU and fetch results

```
python -m pipeline run <config...> [options]
```

Accepts one or more config paths, or a single directory (all `*.json` inside, sorted). Duplicate configs (identical training settings after resolving `extends` and `shared.json`) are skipped automatically.
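
The expansion and duplicate-skipping rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `expand_configs`, `fingerprint`, and `drop_duplicates` are hypothetical names, and the configs passed to `drop_duplicates` are assumed to be already resolved (that is, `extends` and `shared.json` have been applied elsewhere).

```python
import hashlib
import json
from pathlib import Path


def expand_configs(args):
    """A single directory expands to its *.json files in sorted order; other paths are kept as given."""
    if len(args) == 1 and Path(args[0]).is_dir():
        return sorted(Path(args[0]).glob("*.json"))
    return [Path(a) for a in args]


def fingerprint(resolved):
    """Hash a resolved config dict; sort_keys makes the hash independent of key order."""
    canonical = json.dumps(resolved, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(named_configs):
    """Keep the first config per fingerprint; input is a list of (name, resolved_dict) pairs."""
    seen = set()
    unique = []
    for name, resolved in named_configs:
        digest = fingerprint(resolved)
        if digest in seen:
            continue  # same training settings as an earlier config
        seen.add(digest)
        unique.append((name, resolved))
    return unique
```

Hashing a canonical JSON dump is one simple way to compare settings; two configs that differ only in key order or file name produce the same fingerprint.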

| Flag | Default | Description |
|------|---------|-------------|
| `configs` | *(required)* | One or more config paths, or a directory of JSON configs |
| `--download-data` | off | Download the DFF dataset via HuggingFace on the remote before training |
| `--send-cropped` | off | Rsync local `cropped/{classifier,generator}/` to remote (picks subdirectory based on config) |
| `--select-offer` | off | Interactively browse and pick the GPU offer |
| `--sort` | config | Ranking mode: `price`, `performance`, or `dlp_per_dollar` |
| `--region TEXT` | any | Filter by region, e.g. `europe`, `Portugal`, `US` |
| `--price FLOAT` | config | Max hourly price cap in USD |
| `--dry-run` | off | Print matching offers without creating an instance |
| `--keep-on-failure` | off | Do not destroy the instance if training fails |
| `--no-gpu` | off | Disable GPU training on remote (use CPU instead) |
| `--select-template` | off | Interactively choose a Vast.ai Docker template |
| `--template HASH` | config | Use a specific template hash ID |
| `--pipeline-config PATH` | none | JSON file that overrides `pipeline/defaults/vast.json` |

**Examples:**

```bash
# Top-ranked offer in Europe under the default ranking and GPU list, download data on remote
python -m pipeline run configs/resnet18.json --region europe --download-data

# Browse offers interactively, sort by price
python -m pipeline run configs/resnet18.json --select-offer --sort price

# Run all configs in a directory sequentially on one instance
python -m pipeline run configs/phase2/ --region europe

# See what offers would be selected without spending money
python -m pipeline run configs/resnet18.json --dry-run --region europe

# Keep the instance alive if something goes wrong (for debugging)
python -m pipeline run configs/resnet18.json --keep-on-failure

# Cap price at $0.12/h
python -m pipeline run configs/resnet18.json --price 0.12
```
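
The instance lifecycle behind `run`, including the effect of `--keep-on-failure`, can be pictured with a small sketch. The four callables are hypothetical stand-ins for the pipeline's real steps, and the sketch assumes that a failed training run skips the download step.

```python
def run_on_instance(create, train, download, destroy, keep_on_failure=False):
    """Train and download on a fresh instance, then clean up.

    All four callables are hypothetical placeholders, not pipeline functions.
    """
    instance_id = create()
    succeeded = False
    try:
        train(instance_id)
        download(instance_id)
        succeeded = True
    finally:
        # Always destroy after success; after a failure, destroy unless asked to keep it.
        if succeeded or not keep_on_failure:
            destroy(instance_id)
    return instance_id
```

Putting the cleanup in `finally` means an exception during training still releases the instance by default, so a crashed run does not keep billing.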

### `offers` — inspect available GPU offers

```
python -m pipeline offers [options]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--sort` | config | Ranking mode: `price`, `performance`, or `dlp_per_dollar` |
| `--region TEXT` | any | Region filter |
| `--price FLOAT` | config | Max hourly price cap |
| `--select-offer` | off | Interactive offer picker (prints the selected offer as JSON) |
| `--list-regions` | off | Print a count of available offers per region and exit |
| `--limit-output INT` | 10 | How many offers to print |
| `--pipeline-config PATH` | none | Pipeline config override |

**Examples:**

```bash
# See the 20 best-value offers under $0.15/h in Europe
python -m pipeline offers --region europe --price 0.15 --limit-output 20

# List which regions have matching GPUs
python -m pipeline offers --list-regions

# Interactive picker — useful before committing to a run
python -m pipeline offers --select-offer --sort price
```
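
To make the three ranking modes concrete, here is a hedged sketch of how offers might be filtered and ordered. `rank_offers` is a hypothetical helper, and the offer fields `dph_total` (price per hour) and `dlperf` (deep-learning performance score) are assumed names for illustration; the pipeline's actual ranking may differ.

```python
def rank_offers(offers, sort_mode="dlp_per_dollar", max_price=None, limit=10):
    """Filter offers by hourly price, then order them by the chosen ranking mode."""
    keys = {
        "price": lambda o: o["dph_total"],                          # cheapest first
        "performance": lambda o: -o["dlperf"],                      # fastest first
        "dlp_per_dollar": lambda o: -o["dlperf"] / o["dph_total"],  # best value first
    }
    if sort_mode not in keys:
        raise ValueError(f"unknown sort mode: {sort_mode}")
    eligible = [o for o in offers if max_price is None or o["dph_total"] <= max_price]
    return sorted(eligible, key=keys[sort_mode])[:limit]
```

Under these assumptions, `price` and `dlp_per_dollar` can disagree: a slightly more expensive GPU may rank first on value if it is much faster.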

### `up` — create an instance without training

Spins up an instance and prints SSH connection details. Useful for manual experiments or debugging.

```
python -m pipeline up [options]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--label TEXT` | auto | Optional label for the instance |
| `--select-template` | off | Interactively choose a Vast.ai Docker template |
| `--template HASH` | config | Use a specific template hash ID |
| `--pipeline-config PATH` | none | Pipeline config override |

```bash
python -m pipeline up
python -m pipeline up --label my-debug-session
```

### `status` — show instance details

```
python -m pipeline status <instance_id> [--pipeline-config PATH]
```

### `down` — destroy an instance

```
python -m pipeline down <instance_id> [--pipeline-config PATH]
```

## Pipeline config overrides

Pass `--pipeline-config my_overrides.json` to override any field from `pipeline/defaults/vast.json`. The override is deep-merged into the defaults: only the fields you specify change, and everything else keeps its default value. This is useful for switching GPU types or raising the price cap for a single run without editing the defaults.

**Example — allow RTX 4090, higher price cap:**

```json
{
  "search": {
    "gpu_names": ["RTX 4090"],
    "max_dph_total": 0.45
  }
}
```
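
How an override like this combines with the defaults can be illustrated with a short deep-merge sketch. `deep_merge` is a hypothetical helper, not the pipeline's function, and it assumes that lists such as `gpu_names` are replaced wholesale rather than concatenated.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict: nested dicts merge key by key, any other value is replaced."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A subset of the documented defaults, for illustration only.
defaults = {
    "search": {"gpu_names": ["RTX 3090", "RTX 3090 Ti"], "max_dph_total": 0.40, "min_reliability": 0.98},
    "instance": {"disk_gb": 48},
}
overrides = {"search": {"gpu_names": ["RTX 4090"], "max_dph_total": 0.45}}
effective = deep_merge(defaults, overrides)
```

In this sketch `effective["search"]` carries the new GPU list and price cap while `min_reliability` and the whole `instance` section keep their default values.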

**Key fields in `pipeline/defaults/vast.json`:**

| Section | Key | Default | Meaning |
|---------|-----|---------|---------|
| `search` | `gpu_names` | `["RTX 3090", "RTX 3090 Ti"]` | Accepted GPU models |
| `search` | `max_dph_total` | `0.40` | Max price per hour |
| `search` | `sort_mode` | `"dlp_per_dollar"` | Default ranking (`price`, `performance`, or `dlp_per_dollar`) |
| `search` | `min_reliability` | `0.98` | Minimum host reliability score |
| `instance` | `disk_gb` | `48` | Disk size provisioned on the instance |
| `instance` | `image` | `"vastai/pytorch:latest"` | Docker image |
| `remote` | `workspace_dir` | `"/workspace/DRL_PROJ"` | Remote working directory |
| `remote` | `ssh_timeout_seconds` | `900` | How long to wait for SSH to become available |
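
The `ssh_timeout_seconds` setting implies some form of polling until the instance accepts connections. A minimal sketch of that idea, assuming a plain TCP reachability check, is shown here; `wait_for_ssh` and `poll_interval` are hypothetical, and the pipeline may use a different readiness test.

```python
import socket
import time


def wait_for_ssh(host, port, timeout_seconds=900, poll_interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False if the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll_interval):
                return True
        except OSError:
            # Not reachable yet: wait a little, but never past the deadline.
            time.sleep(min(poll_interval, max(0.0, deadline - time.monotonic())))
    return False
```

An open port only shows that something is listening; a real check would likely also attempt an SSH handshake before starting the sync.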

## Full workflow example

```bash
# 1. Check what's available and how much it costs
python -m pipeline offers --region europe --list-regions
python -m pipeline offers --region europe --sort price --limit-output 20

# 2. Run training (auto-selects the top-ranked offer, downloads the dataset on the remote)
python -m pipeline run configs/resnet18.json --region europe --download-data

# 3. Results are downloaded to classifier/outputs/ automatically
```