dstack is a new platform we are working on to help automate data and training workflows.
The platform allows you to define workflows and their infrastructure requirements as code.
It provisions infrastructure on demand, and versions data and models.
You can use any frameworks, experiment trackers, cloud vendors, or hardware.
In this article, I’ll walk you through how dstack works and why we made certain design decisions. To start, I’d like to tell you the story of why we decided to build dstack in the first place.
AI and software development
As a software engineer, I’ve learned to appreciate how easy it is to build software today. Languages, frameworks, and build systems are all free and open-source. The most mature CI/CD and cloud platforms are offered at a price even small companies can afford. If you want to ship software, you just write the code, together with the script that builds and deploys it. You pay per compute hour plus a small fee per user for the service.
When it comes to deployment, common orchestration frameworks (e.g. Kubernetes) solve the problems of scaling and fault tolerance almost perfectly. For training workloads, the core challenges are quite different. First, training deep learning models requires processing and moving huge amounts of data. Second, typical workloads involve piping together numerous tasks that may have different hardware requirements. As a result, it’s extremely difficult for AI researchers to do their job at a reasonable scale without hiring a team of engineers to build custom automation for their workloads.
On top of the problems of managing infrastructure and data come two more challenges: collaboration and reproducibility. In software development, you build on top of libraries created by your community. When there's a bug, you can track it down to a version, reproduce it, then implement and deploy a fix.
In AI, the building blocks also include base models and data. If you want the same level of collaboration and reproducibility as in software development, you have to treat every output of the training pipeline as an artifact that can be versioned and reused.
Looking at the existing solutions for data orchestration, you’ll notice that they are very fragmented and focus on different aspects, from data engineering to machine learning. Zooming in on the use case of training deep learning models, you’ll see two types of solutions. The first is open-source frameworks (e.g. Kubeflow, Airflow, Argo, Pachyderm, etc.) that you have to set up and glue together with other tools. A classic example is Airflow, an amazing piece of technology that was originally designed for data engineering and is often used today for deep learning, even though it brings inconveniences when it comes to experiment tracking, data versioning, and so on. The second type is end-to-end AI platforms (e.g. SageMaker or Vertex AI) that offer an opinionated approach to everything while locking you into one vendor.
Principles
When we started working on dstack, we set the following key requirements:
Designed for collaboration and reuse: It’s not possible to do complex work alone. Everything is easier when it’s done in teams or with the help of the community. Thus, it should be easy to reuse as much as possible, including other users’ code, data, models, and even scripts for managing infrastructure. It should be possible to collaborate not only within the company but also with the community outside of it.
Technology-agnostic: It should be possible to use any language or framework to handle data or train models, and any computing vendor. It should also be possible to use other tools and platforms to facilitate training, including third-party experiment trackers, model registries, profilers, etc. Finally, the solution shouldn't require the user to modify their existing code to work with dstack.
Made for continuous training: Training models doesn’t end when you ship a model to production; it only starts there. Once your model is deployed, it’s critical to observe it, trace issues back to the steps of the training pipeline, fix them, validate, and re-deploy the model. Thus, the solution should be optimized for training and shipping models on a regular basis.
Infrastructure as code
As a software engineer, I learned that the easiest way to make something reproducible is to write it as code. This approach is heavily used today in DevOps, for example, to define infrastructure and provision it across computing vendors. However, unlike deployment, data and training workloads are typically run in batches, which means the infrastructure is only needed for the time a workload is running.
With dstack, we decided to build a tool that describes workflows and specifies their infrastructure as code. This makes it easy to allocate computational resources for training and automatically release them when the workflow is finished. It also opens up a lot of opportunities.
First, this approach makes it very easy to decouple the infrastructure definition from the computing platform where that infrastructure is provisioned. For example, you can declare a workflow that requires setting up a Dask cluster. Since dstack supports multiple computing vendors, you can provision the workflow in any cloud, e.g. AWS, GCP, or Azure, even using spot instances.
Second, a user can not only build a workflow provider that sets up a Dask cluster to run a given workload but also share this provider with others, so they can use the very same workflow template with just a few lines of code, as sketched below.
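Here’s a rough sketch of what using such a shared provider could look like. Note that the dask provider name and its workers parameter are hypothetical, shown only to illustrate the idea:

name: prepare-dataset
provider: dask          # hypothetical shared provider that sets up a Dask cluster
workers: 4              # hypothetical provider parameter: number of Dask workers
script: prepare.py      # the user’s own script to run on the cluster
artifacts:
  - ./dataset
resources:
  memory: 64GB

The provider encapsulates the cluster setup, so the workflow author only declares what to run and which resources it needs.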
How dstack works
Here’s an example:
name: finetune-model
provider: bash
commands:
  - pip install -r requirements.txt
  - python train.py
artifacts:
  - ./checkpoint
deps:
  - workflow: download-base-model
resources:
  memory: 128GB
  gpu:
    name: V100
    count: 1
When you run this workflow, dstack invokes the bash provider, which uses the dstack SDK to submit jobs.
Every job may specify a repo with the sources, a Docker image, commands, exposed ports, the ID of the primary job (in case jobs need to communicate with each other), the input artifacts (e.g. from other runs), the hardware requirements (e.g. the number or name of GPUs, memory, etc.), and finally the output artifacts.
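For illustration only, here’s a rough sketch of what such a job specification could look like if written out by hand; the field names below are assumptions made for the sake of the example, not the actual SDK schema:

repo: <git repo with the sources>    # where the job’s code comes from
image: python:3.9                    # Docker image the job runs in
commands:
  - python train.py
ports: 1                             # number of ports to expose
master: <primary-job-id>             # lets jobs within the same run communicate
deps:
  - workflow: download-base-model    # input artifacts taken from another run
resources:
  gpu: 1
  memory: 128GB
artifacts:
  - ./checkpoint                     # output artifacts saved after the job finishes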
Once jobs are created, dstack provisions the required infrastructure for as long as the workflow is running and tears it down afterward. If the run is successful, you can assign it a tag and then refer to its output artifacts in other workflows.
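For example, assuming the run above was tagged finetuned-model, and assuming tags can be referenced in deps the same way workflows are, a downstream workflow could reuse its artifacts roughly like this:

name: evaluate-model
provider: bash
commands:
  - python evaluate.py
deps:
  - tag: finetuned-model   # the artifacts of the tagged run become inputs here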
The purpose of this design is to significantly reduce complexity and let users build on top of each other’s work, be it specific workflow providers or other workflows that prepare data or base models.
The ability to use third-party workflow providers, together with the ability to depend on versioned artifacts of other workflows (e.g. data or models), facilitates collaboration and reuse.
Looking forward
What about the status of the project?
First of all, we plan to open-source the core components of dstack. You can follow the progress in our repo.
Secondly, an early preview of the product is available as a hosted in-cloud solution. All it takes to try it is signing up with your GitHub account, defining a workflow, connecting dstack to your cloud account or your own hardware (to provision infrastructure), and running workflows.
There are a lot of features that are still in progress or not yet started, and they need your input. If you’d like to try something cool and help us build a great tool, join our community of early adopters and contribute to the development process.
Contribution
Finally and most importantly, we need your help with designing the providers SDK and implementing more providers for specific use cases. Please join our Slack channel and don’t hesitate to contact me personally so I can onboard you and we can discuss issues and tasks together.