Releases: dstackai/dstack
0.18.2
On-prem clusters
Network
The dstack pool add-ssh command now supports the --network argument. Use this argument if you want to use multiple instances that share the same private network as a cluster to run multi-node tasks. The --network argument accepts the IP address range (CIDR) of the instances' private network.
Example:
dstack pool add-ssh -i ~/.ssh/id_rsa [email protected] --network 10.0.0.0/24
Once you've added multiple instances with the same network value, you'll be able to use them as a cluster to run multi-node tasks.
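For example, a multi-node task scheduled onto such a cluster might look like this (a minimal sketch; the nodes property and the DSTACK_* variables are described in the 0.18.0 notes below):
type: task
nodes: 2
commands:
  - echo "Running on node $DSTACK_NODE_RANK of $DSTACK_NODES_NUM"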
Private subnets
By default, dstack uses public IPs for SSH access to running instances, which requires public subnets in the VPC. The new update allows AWS instances to use private subnets instead.
To create instances only in private subnets, set public_ips to false in the AWS backend settings:
type: aws
creds:
  type: default
vpc_ids:
  ...
public_ips: false
Note
- Both the dstack server and the dstack CLI should have access to the private subnet in order to reach the instances.
- If you want running instances to access the Internet, the private subnets need to have a NAT gateway.
Gateways
dstack apply
Previously, to create or update gateways, one had to use the dstack gateway create or dstack gateway update commands.
Now, it's possible to define a gateway configuration via YAML and create or update it using the dstack apply command.
Example:
type: gateway
name: example-gateway
backend: gcp
region: europe-west1
domain: example.com
dstack apply -f examples/deployment/gateway.dstack.yml
For now, the dstack apply command only supports the gateway configuration type. Soon, it will also support dev-environment, task, and service, replacing the dstack run command.
The dstack destroy command can be used to delete resources.
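For instance, assuming dstack destroy mirrors the dstack apply interface and accepts the same configuration file (a sketch, not confirmed syntax):
dstack destroy -f examples/deployment/gateway.dstack.yml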
Private gateways
By default, gateways are deployed using public subnets. Since 0.18.2, it is possible to deploy gateways using private subnets. To do this, set public_ip to false and specify the ARN of a certificate from AWS Certificate Manager:
type: gateway
name: example-gateway
backend: aws
region: eu-west-1
domain: "example.com"
public_ip: false
certificate:
  type: acm
  arn: "arn:aws:acm:eu-west-1:3515152512515:certificate/3251511125--1241-1224-121251515125"
In this case, dstack will deploy the gateway in a private subnet behind a load balancer, using the specified certificate.
Note
Private gateways are currently supported only for AWS.
What's changed
- Support multi-node tasks with dstack pool add-ssh instances by @TheBits in #1189
- Fixed the JSON schema errors by @r4victor in #1193
- Support spot instances with runpod by @Bihan in #1119
- Speed up AWS VPC validation by @r4victor in #1196
- [Internal] Optimize ProjectModel loading by @r4victor in #1199
- Support provisioning instances without public IPs on AWS by @r4victor in #1203
- Minor improvements of dstack pool add-ssh by @TheBits in #1202
- Instances cannot be reused by other users by @TheBits in #1204
- Do not create AWS instance profile when launching instances by @r4victor in #1212
- Allow running services without https by @r4victor in #1217
- Implement dstack apply for gateways by @r4victor in #1223
- Support gateways without public IPs on AWS by @r4victor in #1224
- Support --network with dstack pool add-ssh by @TheBits in #1225
- [Internal] Make gateway creation async by @r4victor in #1236
- Use a larger VM type by default for the GCP gateway by @r4victor in #1237
- Properly handle an invalid network passed to dstack pool add-ssh by @TheBits in #1233
- Use valid GCP resource names by @r4victor in #1248
- Always try to restart dstack-shim.service with dstack pool add-ssh by @TheBits in #1253
- [Internal] Improve instance processing by @r4victor in #1251
- Changed dstack pool remove to dstack pool rm by @muddi900 in #1258
- Support gateways behind ALB with ACM certificate by @r4victor in #1264
- Support IP addresses with --network by @TheBits in #1263
- [Internal] Fix double unlocking when processing runs and instances by @r4victor in #1268
- Add dstack destroy command and improve dstack apply by @r4victor in #1271
- Fix instances from pools ignoring regions by @r4victor in #1272
- Add the axolotl example by @deep-diver in #1187
Full Changelog: 0.18.1...0.18.2
0.18.1
On-prem servers
Now you can add your own servers as pool instances:
dstack pool add-ssh -i ~/.ssh/id_rsa [email protected]
Note
The server should be pre-installed with CUDA 12.1 and NVIDIA Docker.
Configuration
All .dstack/profiles.yml properties can now be specified via run configurations:
type: dev-environment
ide: vscode
spot_policy: auto
backends: ["aws"]
regions: ["eu-west-1", "eu-west-2"]
instance_types: ["p3.8xlarge", "p3.16xlarge"]
max_price: 2.0
max_duration: 1d
New examples 🔥🔥
Thanks to the contribution from @deep-diver, we got two new examples:
Other
- Configuring VPCs using their IDs (via vpc_ids in server/config.yml)
- Support for global profiles (via ~/.dstack/profiles.yml)
- Updated the default environment variables (DSTACK_RUN_NAME, DSTACK_GPUS_NUM, DSTACK_NODES_NUM, DSTACK_NODE_RANK, and DSTACK_MASTER_NODE_IP)
- It's now possible to use the NVIDIA A10 GPU on Azure
- More granular permissions for Azure
What's changed
- Fix server freeze on terminate instance by @jvstme in #1132
- Support profile params in run configurations by @r4victor in #1131
- Support global .dstack/profiles.yml by @r4victor in #1134
- Fix No such profile: None when missing .dstack/profiles.yml by @r4victor in #1135
- Make Azure permissions more granular by @r4victor in #1139
- Validate min disk size by @r4victor in #1146
- Fix unexpected error if system Python version is unknown by @r4victor in #1147
- Add request timeouts to prevent code freezes by @jvstme in #1140
- Refactor backends to wait for instance IP address outside run_job/create_instance by @r4victor in #1149
- Fix provisioning Azure instances with A10 GPU by @jvstme in #1150
- [Internal] Move packer -> scripts/packer by @jvstme in #1153
- Added the ability to add your own instances by @TheBits in #1115
- Fix the executor_error check being falsely positive by @TheBits in #1160
- Make user project quota configurable by @r4victor in #1161
- Configure CORS headers on gateway by @r4victor in #1166
- Allow configuring AWS vpc_ids by @r4victor in #1170
- [Internal] Show dstack version in Sentry issues by @jvstme in #1167
- Fix KeyError: 'IpPermissions' when using AWS by @jvstme in #1169
- Create the public SSH key if it does not exist in dstack pool add-ssh by @TheBits in #1173
- Fixed the environment file upload by @TheBits in #1175
- Updated shim status processing by @TheBits in #1174
- Fix bugs in dstack pool add-ssh by @TheBits in #1178
- Fix Cudo Create VM response error by @Bihan in #1179
- Implement API for configuring backends via YAML by @r4victor in #1181
- Allow running gated models with HUGGING_FACE_HUB_TOKEN by @r4victor in #1184
- Pass all dstack runner envs as DSTACK_* by @r4victor in #1185
- Improve the retries in get_host_info and get_shim_healthcheck by @TheBits in #1183
- Example: H4 alignment handbook by @deep-diver in #1180
- Launch the deploy in a ThreadPoolExecutor by @TheBits in #1186
Full Changelog: 0.18.0...0.18.1
0.18.0
RunPod
The update adds the long-awaited integration with RunPod, a distributed GPU cloud that offers GPUs at affordable prices.
To use RunPod, specify your RunPod API key in ~/.dstack/server/config.yml:
projects:
- name: main
  backends:
  - type: runpod
    creds:
      type: api_key
      api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9
Once the server is restarted, go ahead and run workloads.
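For example, a run can be targeted at the new backend via the -b/--backend filter of dstack run (a sketch; whether you pass the flag or rely on automatic backend selection is up to you):
dstack run . -b runpod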
Clusters
Another major change with the update is the ability to run multi-node tasks over an interconnected cluster of instances.
type: task
nodes: 2
commands:
  - git clone https://github.com/r4victor/pytorch-distributed-resnet.git
  - cd pytorch-distributed-resnet
  - mkdir -p data
  - cd data
  - wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  - tar -xvzf cifar-10-python.tar.gz
  - cd ..
  - pip3 install -r requirements.txt torch
  - mkdir -p saved_models
  - torchrun --nproc_per_node=$DSTACK_GPUS_PER_NODE --node_rank=$DSTACK_NODE_RANK --nnodes=$DSTACK_NODES_NUM --master_addr=$DSTACK_MASTER_NODE_IP --master_port=8008 resnet_ddp.py --num_epochs 20
resources:
  gpu: 1
Currently supported providers for this feature include AWS, GCP, and Azure.
Other
- The
commands
property is now not required for tasks and services if you use animage
that has a default entrypoint configured. - The permissions required for using
dstack
with GCP are more granular.
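Here's what a task relying on the image's default entrypoint might look like (a minimal sketch; the image name is a hypothetical placeholder):
type: task
# no `commands` needed: the image below is assumed to define a default entrypoint
image: myorg/serving-image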
What's changed
- Add username filter to /api/runs/list by @r4victor in #1068
- Inherit core models from DualBaseModel by @r4victor in #967
- Fixed the YAML schema validation for replicas by @peterschmidt85 in #1055
- Improve the server/config.yml reference documentation by @peterschmidt85 in #1077
- Add the runpod backend by @Bihan in #1063
- Support JSON log handler by @TheBits in #1085
- Added lock to terminate_idle_instance by @TheBits in #1081
- dstack init doesn't work with a remote Git repo by @peterschmidt85 in #1090
- Minor improvements of dstack server output by @peterschmidt85 in #1088
- Return error information from dstack-shim by @TheBits in #1061
- Replace RetryPolicy.limit with RetryPolicy.duration by @TheBits in #1074
- Make dstack version configurable when deploying docs by @peterschmidt85 in #1095
- dstack init doesn't work with a local Git repo by @peterschmidt85 in #1096
- Fix infinite create_instance() on the cudo provider by @r4victor in #1082
- Do not update the latest Docker image and YAML scheme for pre-release builds by @peterschmidt85 in #1099
- Support multi-node tasks by @r4victor in #1103
- Make commands optional in run configurations by @jvstme in #1104
- Allow the cudo backend to use non-GPU instances by @Bihan in #1092
- Make GCP permissions more granular by @r4victor in #1107
Full changelog: 0.17.0...0.18.0
0.17.0
Service auto-scaling
Previously, dstack always served services as single replicas. While this is suitable for development, in production a service must automatically scale based on the load.
That's why in 0.17.0, we extended dstack with the capability to configure replicas (the number of replicas) as well as scaling (the auto-scaling policy).
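A service configuration using these properties might look roughly like this (a sketch; the replica range, the rps metric, and the target value are illustrative, not defaults):
type: service
commands:
  - python app.py
port: 8000
replicas: 1..4        # autoscale between 1 and 4 replicas
scaling:
  metric: rps         # scale based on requests per second
  target: 10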
Regions and instance types
The update brings support for specifying regions and instance types (in dstack run and .dstack/profiles.yml).
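In .dstack/profiles.yml, this might look as follows (a sketch reusing the property names shown in the 0.18.1 notes above; the default flag is an assumption):
profiles:
  - name: default
    regions: ["eu-west-1", "eu-west-2"]
    instance_types: ["p3.8xlarge", "p3.16xlarge"]
    default: true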
Environment variables
Firstly, it's now possible to configure an environment variable in the configuration without hardcoding its value. Secondly, dstack run now inherits environment variables from the current process.
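For example, declaring an env entry without a value makes dstack take it from the environment in which dstack run is invoked (a sketch; the variable name and command are illustrative):
type: task
env:
  - HUGGING_FACE_HUB_TOKEN  # no value here; inherited from the caller's environment
commands:
  - python train.py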
For more details on these new features, check the changelog.
What's changed
- Support running multiple replicas for a service by @Egor-S in #986 and #1015
- Allow to specify instance_type via CLI and profiles by @r4victor in #1023
- Allow to specify regions via CLI and profiles by @r4victor in #947
- Allow specifying required env variables by @spott in #1003
- Allow configuring CA for gateways by @jvstme in #1022
- Support Python 3.12 by @peterschmidt85 in #1031
- The shm_size property in resources doesn't take effect by @peterschmidt85 in #1007
- Sometimes, runs get stuck at pulling by @TheBits in #1035
- vastai doesn't show any offers since 0.16.0 by @iRohith in #959
- It's not possible to configure projects other than main by @peterschmidt85 in #992
- Spot instances don't work on GCP by @peterschmidt85 in #996
Full changelog: 0.16.5...0.17.0
0.16.5
0.16.4
CUDO Compute
The 0.16.4 update introduces the cudo backend, which allows running workloads with CUDO Compute, a cloud GPU marketplace.
To configure the cudo backend, you simply need to specify your CUDO Compute project ID and API key:
projects:
- name: main
  backends:
  - type: cudo
    project_id: my-cudo-project
    creds:
      type: api_key
      api_key: 7487240a466624b48de22865589
Once it's done, you can restart the dstack server and use the dstack CLI or API to run workloads.
Note
Limitations
- The dstack gateway feature is not yet compatible with cudo, but it is expected to be supported in version 0.17.0, planned for release within a week.
- The cudo backend cannot yet be used with dstack Sky, but it will also be enabled within a week.
Full changelog: 0.16.3...0.16.4
0.16.3
Bug-fixes
- [Bug] The shm_size property in resources doesn't take effect #1006
- [Bug] It's not possible to configure projects other than main via ~/.dstack/server/config.yml #991
- [Bug] Spot instances don't work on GCP if the username has upper case letters #975
Full changelog: 0.16.2...0.16.3
0.16.1
Improvements to dstack pool
- Change the default idle duration for dstack pool add to 72h #964
- Set the default spot policy in dstack pool add to on-demand #962
- Add pool support for lambda, azure, and tensordock #923
- Allow passing idle duration and spot policy in dstack pool add #918
- dstack run does not respect pool-related profiles.yml parameters #949
Bug-fixes
- Runs submitted via Python API have no termination policy #955
- The vastai backend doesn't show any offers since 0.16.0 #958
- Handle permission error when adding Include to ~/.ssh/config #937
- The SSH tunnel fails because of a messy ~/.ssh/config #933
- The PATH is overridden when logging in via SSH #930
- The SSH tunnel fails with Too many authentication failures #927
We've also updated our guide on how to add new backends. It's now available here.
New contributors
- @iRohith made their first contribution in #959
- @spott made their first contribution in #934
- @KevKibe made their first contribution in #917
Full Changelog: 0.16.0...0.16.1
0.16.0
Pools
The 0.16.0 release is the next major update which, in addition to many bug fixes, introduces pools: a major new feature that enables a more efficient way to manage instance lifecycles and reuse instances across runs.
dstack run
Previously, when running a dev environment, task, or service, dstack provisioned an instance in a configured backend and, upon completion of the run, deleted the instance.
Now, when using the dstack run command, dstack tries to reuse an instance from a pool. If no ready instance meets the requirements, dstack automatically provisions a new one and adds it to the pool.
Once the workload finishes, the instance is marked as idle. If the instance remains idle for the configured duration, dstack tears it down.
dstack pool
The dstack pool command allows managing instances within pools.
To manually add an instance to a pool, use dstack pool add:
dstack pool add --gpu 80GB --idle-duration 1d
The dstack pool add command allows specifying resource requirements, along with the spot policy, idle duration, max price, retry policy, and other policies.
If no idle duration is configured, dstack sets it to 72h by default. To override it, use the --idle-duration DURATION argument.
To learn more about pools, refer to the official documentation. To learn more about 0.16.0, refer to the changelog.
What's changed
- Add dstack pool by @TheBits in #880
- Pools: fix failed instance status by @Egor-S in #889
- Add columns to dstack pool show by @TheBits in #898
- Add submit stop by @TheBits in #895
- Add kubernetes logo by @plutov in #900
- Handle exceptions from backend.compute().get_offers by @r4victor in #904
- Fix process_finished_jobs parsing None job_model.job_provisioning_data by @r4victor in #905
- Validate run_name by @r4victor in #906
- Filter out private subnets when provisioning in custom aws vpc by @r4victor in #909
- Rework failed instance status (#894) by @TheBits in #899
- Handle unexpected exceptions from run_job by @r4victor in #911
- Request GPU in docker with --gpus=all by @Egor-S in #913
- Fix CLI arguments for dstack pool add (#918) by @TheBits in #919
- Added router tests for pools by @TheBits in #916
- Fix #921 by @TheBits in #922
Full changelog: 0.15.1...0.16.0
0.15.2rc2
Bug-fixes
- Exclude private subnets when provisioning in AWS #908
- Ollama doesn't detect the GPU (requires --gpus=all instead of --runtime=nvidia) #910
Full changelog: 0.15.1...0.15.2rc2