Version: 1.37

Advanced deployment options

Advanced configuration options are available for models, explainers, and transformers. These options can be enabled and disabled in the Model, Explainer, and Transformer steps when creating a Deployment, or on the Details tab of a Deployment. Depending on the selected framework, some options may not be available.

Serverless

Models, explainers, and transformers are deployed non-serverless by default. Non-serverless deployment is particularly well suited for real-time applications: because these deployments are always active and continuously consume resources, they can respond to incoming requests without delay.

Serverless models, explainers, or transformers operate differently by not continuously consuming resources. Instead, they activate when the first request is received and shut down completely if there is no further activity for a period of time. This approach is cost-effective when your deployment is not actively used for fetching predictions. However, it's important to be aware of a potential drawback known as a "cold-start." This refers to the delay in serving the first request as the application needs to boot up.

Serverless vs. non-serverless?

Both non-serverless and serverless Deployment methods offer automatic scaling to handle incoming traffic. This means that as the usage of your model increases, additional instances of your model will be deployed to accommodate the growing traffic. Conversely, if there are more instances than required to serve the incoming traffic, your deployment will scale down accordingly.

If your application is a real-time application, choose the non-serverless deployment. If your application is not real-time (e.g. it runs once a day), go with the serverless option. If, for whatever reason, you are not sure, we suggest the non-serverless option.

Customized model resources

Teams on all plans can configure the CPU request and limit and the memory request and limit. Teams on Private Cloud plans can also select a node type. Deeploy reserves a 25% buffer for all resources. For example, on a node type with 4 CPU cores, 1 core is reserved for the Deeploy stack itself, leaving 3 cores available for Deployments.
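
The small sketch below only illustrates the arithmetic of the 25% buffer described above; the node sizes used in it are hypothetical examples, not defaults.

# Illustrative only: shows what remains for Deployments on a node after
# Deeploy's 25% resource buffer. The example node sizes are hypothetical.
RESOURCE_BUFFER = 0.25  # fraction reserved for the Deeploy stack itself


def allocatable(total: float, buffer: float = RESOURCE_BUFFER) -> float:
    """Return the portion of a node resource available to Deployments."""
    return total * (1 - buffer)


print(allocatable(4))   # 4 CPU cores   -> 3.0 cores available
print(allocatable(16))  # 16 GiB memory -> 12.0 GiB available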

Private object storage

Select a Docker or Blob credential so the private object storage can be accessed. Please note that credentials need to be added before creating the Deployment.

Integrated explainer

This option is only available for models. Select it when using a custom Docker image that contains an explainer (e.g. Captum for PyTorch).

Using the Captum text explainer dashboard

If you want to use the Captum text explainer dashboard, make sure to use the following format in your handler.py file.

{
  "explanations": [
    {
      "raw_input_ids": [[ids]],
      "word_attributions": [[word_attributions]],
      "pred_class": [class_string],
      "attr_score": [score_numeric],
      "attr_class": [class_string]
    }
  ]
}
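
As a minimal sketch, the snippet below shows one way a handler.py could assemble this structure from Captum-style outputs. It is illustrative only: the function name and its inputs (token ids, per-token attributions, predicted class, attribution score) are assumptions; only the returned dictionary follows the format above.

# Illustrative sketch: wrap Captum-style attribution output in the response
# format expected by the Captum text explainer dashboard. Function name and
# inputs are assumptions; only the returned structure follows the docs format.
from typing import Dict, List


def build_explanation_response(
    input_ids: List[int],
    word_attributions: List[float],
    pred_class: str,
    attr_score: float,
    attr_class: str,
) -> Dict:
    """Return the explanation payload in the expected dashboard format."""
    return {
        "explanations": [
            {
                "raw_input_ids": [input_ids],
                "word_attributions": [word_attributions],
                "pred_class": [pred_class],
                "attr_score": [attr_score],
                "attr_class": [attr_class],
            }
        ]
    }


if __name__ == "__main__":
    # Dummy values, purely to show the resulting structure.
    response = build_explanation_response(
        input_ids=[101, 2023, 2003, 2307, 102],
        word_attributions=[0.0, 0.12, 0.05, 0.61, 0.0],
        pred_class="positive",
        attr_score=0.78,
        attr_class="positive",
    )
    print(response)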