jalf/DRL_PROJ

Johnny Fernandes 634150c983 Removing deduplication system of configs

2026-04-30 03:33:42 +01:00

README.md

Pipeline

Orchestrates ephemeral Vast.ai GPU instances: searches for an offer, creates the instance, syncs the project, trains, downloads outputs/, and destroys the instance automatically. Generator runs also rsync generator/outputs/ every 50 epochs while training is still running.
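The create, train, download, destroy lifecycle described above can be sketched as a `try`/`finally` wrapper. This is a minimal illustration, not the code in `orchestrator.py`: `run_with_cleanup` and its four callables are hypothetical stand-ins for the real pipeline steps, and the `keep_on_failure` parameter mirrors the `--keep-on-failure` flag documented below.

```python
from typing import Callable


def run_with_cleanup(
    create: Callable[[], int],
    train: Callable[[int], None],
    download: Callable[[int], None],
    destroy: Callable[[int], None],
    keep_on_failure: bool = False,
) -> bool:
    """Create an instance, train, download outputs, then clean up.

    Returns True if training and download succeeded, False otherwise.
    All callables are hypothetical placeholders for the real steps.
    """
    instance_id = create()
    succeeded = False
    try:
        train(instance_id)
        download(instance_id)
        succeeded = True
    except Exception as exc:
        print(f"run failed on instance {instance_id}: {exc}")
    finally:
        # Destroy the instance unless a failed run should be kept for debugging.
        if succeeded or not keep_on_failure:
            destroy(instance_id)
    return succeeded
```

Putting the destroy call in `finally` means the instance stops billing even when training raises, which is the point of an ephemeral setup.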

One-time setup

Create pipeline/.env:

VAST_API_KEY=your-vast-api-key
VAST_SSH_PRIVATE_KEY=/home/you/.ssh/id_ed25519   # optional, this is the default

The matching .pub file must exist alongside the private key. If that public key is not yet registered with your Vast.ai account, the pipeline registers it automatically.
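To check the key pair locally before a run, a small helper along these lines may be useful. It is a sketch only: `public_key_path` is a hypothetical name, and it assumes the public key is simply the private key path with `.pub` appended, as described above.

```python
from pathlib import Path


def public_key_path(private_key: str) -> Path:
    """Return the .pub path next to a private key, or raise if either file is missing."""
    private = Path(private_key).expanduser()
    # Assumption: the public key is "<private key filename>.pub" in the same directory.
    public = private.with_name(private.name + ".pub")
    if not private.is_file():
        raise FileNotFoundError(f"private key not found: {private}")
    if not public.is_file():
        raise FileNotFoundError(f"public key not found: {public}")
    return public
```

For example, `public_key_path("~/.ssh/id_ed25519")` returns the expanded path to `id_ed25519.pub` when both files exist.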

Commands

`run` — train on a remote GPU and fetch results

python -m pipeline run <config...> [options]

Accepts one or more config paths, or a single directory (all *.json inside, sorted). Duplicate configs (identical training settings after resolving extends and shared.json) are skipped automatically.
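The path-or-directory rule can be sketched as follows. `expand_configs` is a hypothetical helper, not necessarily how `cli.py` does it, and it covers only the expansion step (no `extends` or `shared.json` resolution).

```python
from pathlib import Path


def expand_configs(args: list[str]) -> list[Path]:
    """Expand CLI config arguments into an ordered list of JSON config paths."""
    if len(args) == 1 and Path(args[0]).is_dir():
        # A single directory means: every *.json directly inside it, sorted by name.
        return sorted(Path(args[0]).glob("*.json"))
    # Otherwise keep the paths in the order they were given.
    return [Path(a) for a in args]
```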

| Flag | Default | Description |
| --- | --- | --- |
| `configs` | (required) | One or more config paths, or a directory of JSON configs |
| `--download-data` | off | Download the DFF dataset via HuggingFace on the remote before training |
| `--send-cropped` | off | Rsync local `cropped/{classifier,generator}/` to remote (picks subdirectory based on config) |
| `--select-offer` | off | Interactively browse and pick the GPU offer |
| `--sort` | config | Ranking mode: `price`, `performance`, or `dlp_per_dollar` |
| `--region TEXT` | any | Filter by region, e.g. `europe`, `Portugal`, `US` |
| `--price FLOAT` | config | Max hourly price cap in USD |
| `--dry-run` | off | Print matching offers without creating an instance |
| `--keep-on-failure` | off | Do not destroy the instance if training fails |
| `--no-gpu` | off | Disable GPU training on remote (use CPU instead) |
| `--select-template` | off | Interactively choose a Vast.ai Docker template |
| `--template HASH` | config | Use a specific template hash ID |
| `--pipeline-config PATH` | none | JSON file that overrides `pipeline/defaults/vast.json` |

Examples:

# Best-ranked RTX 3090 offer in Europe (default ranking is dlp_per_dollar, not price), download data on remote
python -m pipeline run configs/resnet18.json --region europe --download-data

# Browse offers interactively, sort by price
python -m pipeline run configs/resnet18.json --select-offer --sort price

# Run all configs in a directory sequentially on one instance
python -m pipeline run configs/phase2/ --region europe

# See what offers would be selected without spending money
python -m pipeline run configs/resnet18.json --dry-run --region europe

# Keep the instance alive if something goes wrong (for debugging)
python -m pipeline run configs/resnet18.json --keep-on-failure

# Cap price at $0.12/h
python -m pipeline run configs/resnet18.json --price 0.12

`offers` — inspect available GPU offers

python -m pipeline offers [options]

| Flag | Default | Description |
| --- | --- | --- |
| `--sort` | config | Ranking mode: `price`, `performance`, or `dlp_per_dollar` |
| `--region TEXT` | any | Region filter |
| `--price FLOAT` | config | Max hourly price cap |
| `--select-offer` | off | Interactive offer picker (prints the selected offer as JSON) |
| `--list-regions` | off | Print a count of available offers per region and exit |
| `--limit-output INT` | 10 | How many offers to print |
| `--pipeline-config PATH` | none | Pipeline config override |

Examples:

# See the 20 best-value offers under $0.15/h in Europe
python -m pipeline offers --region europe --price 0.15 --limit-output 20

# List which regions have matching GPUs
python -m pipeline offers --list-regions

# Interactive picker — useful before committing to a run
python -m pipeline offers --select-offer --sort price
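The three ranking modes could be implemented roughly as below. This is a sketch under stated assumptions: `rank_offers` is a hypothetical function, the offer fields `dph_total` (price per hour) and `dlperf` (deep-learning performance score) are assumed names for what the Vast.ai search returns, and `dlp_per_dollar` is assumed to mean `dlperf / dph_total`. The actual pipeline may compute its ranking differently.

```python
def rank_offers(offers: list[dict], sort_mode: str = "dlp_per_dollar") -> list[dict]:
    """Return offers ordered best-first for the given ranking mode."""
    keys = {
        # Cheapest first.
        "price": lambda o: o["dph_total"],
        # Highest performance score first.
        "performance": lambda o: -o["dlperf"],
        # Most performance per dollar first (assumed definition).
        "dlp_per_dollar": lambda o: -o["dlperf"] / o["dph_total"],
    }
    if sort_mode not in keys:
        raise ValueError(f"unknown sort mode: {sort_mode!r}")
    return sorted(offers, key=keys[sort_mode])
```

Note that the cheapest offer is not necessarily the best value: an offer three times the price can still rank first under `dlp_per_dollar` if its performance score is more than three times higher.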

`up` — create an instance without training

Spins up an instance and prints SSH connection details. Useful for manual experiments or debugging.

python -m pipeline up [options]

| Flag | Default | Description |
| --- | --- | --- |
| `--label TEXT` | auto | Optional label for the instance |
| `--select-template` | off | Interactively choose a Vast.ai Docker template |
| `--template HASH` | config | Use a specific template hash ID |
| `--pipeline-config PATH` | none | Pipeline config override |

python -m pipeline up
python -m pipeline up --label my-debug-session

`status` — show instance details

python -m pipeline status <instance_id> [--pipeline-config PATH]

`down` — destroy an instance

python -m pipeline down <instance_id> [--pipeline-config PATH]

Pipeline config overrides

Pass --pipeline-config my_overrides.json to override any field from pipeline/defaults/vast.json. Only the fields you specify are changed; the rest keep their defaults (deep-merged). Useful for switching GPU types or raising the price cap for a single run without editing defaults.

Example — allow RTX 4090, higher price cap:

{
  "search": {
    "gpu_names": ["RTX 4090"],
    "max_dph_total": 0.45
  }
}
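A deep merge of this kind can be sketched as below. `deep_merge` is a hypothetical helper, not necessarily the function in `config.py`. It assumes nested dictionaries are merged key by key while every other value, lists included, is replaced wholesale, so the override above would swap out the whole `gpu_names` list rather than append to it.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict: defaults with overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists replace the default outright (assumption).
            merged[key] = copy.deepcopy(value)
    return merged
```

With this behaviour, overriding only `search.gpu_names` and `search.max_dph_total` leaves `search.min_reliability` and the whole `instance` section at their defaults.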

Key fields in pipeline/defaults/vast.json:

| Section | Key | Default | Meaning |
| --- | --- | --- | --- |
| `search` | `gpu_names` | `["RTX 3090", "RTX 3090 Ti"]` | Accepted GPU models |
| `search` | `max_dph_total` | `0.40` | Max price per hour |
| `search` | `sort_mode` | `"dlp_per_dollar"` | Default ranking (`price`, `performance`, or `dlp_per_dollar`) |
| `search` | `min_reliability` | `0.98` | Minimum host reliability score |
| `instance` | `disk_gb` | `48` | Disk size provisioned on the instance |
| `instance` | `image` | `"vastai/pytorch:latest"` | Docker image |
| `remote` | `workspace_dir` | `"/workspace/DRL_PROJ"` | Remote working directory |
| `remote` | `ssh_timeout_seconds` | `900` | How long to wait for SSH to become available |
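A timeout such as `ssh_timeout_seconds` is typically enforced with a polling loop. The sketch below is generic and hypothetical (`wait_until` is not taken from `remote.py`): `probe` stands in for whatever check reports that SSH is reachable, and the clock and sleep functions are injectable so the loop can be tested without real waiting.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout_seconds: float,
    interval_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        if probe():
            return True
        if clock() >= deadline:
            # Timed out: the caller decides whether to destroy the instance.
            return False
        sleep(interval_seconds)
```

With the default of `900`, a call like `wait_until(ssh_is_up, 900)` would give a freshly created instance up to 15 minutes to accept connections, where `ssh_is_up` is a placeholder for the real check.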

Full workflow example

# 1. Check what's available and how much it costs
python -m pipeline offers --region europe --list-regions
python -m pipeline offers --region europe --sort price --limit-output 20

# 2. Run training (auto-selects best offer, downloads data if needed)
python -m pipeline run configs/resnet18.json --region europe --download-data

# 3. Results land in classifier/outputs/ automatically
