Model Inference
Inference methods
Run inference on your model in multiple ways after creating a Deployment. First check that everything works on the test page of your Deployment. Then integrate it into your pipelines or applications using Deeploy's Python Client or REST API.
Authentication
Authentication is required for inferencing your Deployment. Authenticate using the Python Client, or authenticate to the REST API:
- Basic auth: create a personal key pair on your account page. Base64 encode the access key and secret key, joined by a colon: echo -n "access_key:secret_key" | base64 (the -n flag keeps the trailing newline out of the encoded value). Use the Base64 encoded string as the Authorization header, e.g. Authorization: Basic <base64 encoded key>
- Bearer token: create a Deployment token on your Deployment's Authentication page. Use the token as the Authorization header, e.g. Authorization: Bearer <token>. Alternatively, you can use a JSON Web Token from your OpenID Connect provider as the Bearer token. See configure OpenID Connect for more information.
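The two header formats above can be sketched in Python as follows. This is a minimal illustration using only the standard library; the helper names basic_auth_header and bearer_auth_header are hypothetical and not part of Deeploy's Python Client.

```python
import base64


def basic_auth_header(access_key: str, secret_key: str) -> dict:
    """Build a Basic auth header from a personal key pair (hypothetical helper)."""
    # Join the keys with a colon and Base64 encode them, without a trailing newline.
    raw = f"{access_key}:{secret_key}".encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


def bearer_auth_header(token: str) -> dict:
    """Build a Bearer header from a Deployment token or an OIDC JSON Web Token."""
    return {"Authorization": f"Bearer {token}"}
```

Either dictionary can then be passed as the headers of a request made with the HTTP client of your choice.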
Accepted content types
Models and explainers accept various content types. Most standard model frameworks only accept JSON input (application/json), but custom Docker artifacts can accept more content types, e.g. an image or PDF. Deeploy accepts the following content types: 'application/json', 'application/pdf', 'application/octet-stream', 'image/*', 'text/*', 'binary/octet-stream'. Make sure to set the correct Content-Type header when making an inference request. If no header is set, it defaults to application/json. If your model or explainer returns a content type other than application/json, the response to your request is in binary format.
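As a rough client-side illustration, the check below resolves the content type to send, applying the application/json default and the wildcard entries from the list above. The function name resolve_content_type is hypothetical, and the matching shown here is an assumption about how the list is meant to be read, not a description of Deeploy's server-side behaviour.

```python
from fnmatch import fnmatch
from typing import Optional

# Content types listed as accepted in this section.
ACCEPTED_CONTENT_TYPES = [
    "application/json",
    "application/pdf",
    "application/octet-stream",
    "image/*",
    "text/*",
    "binary/octet-stream",
]


def resolve_content_type(content_type: Optional[str]) -> str:
    """Return the media type to send, or raise if it is not in the accepted list."""
    if not content_type:
        # No header set: fall back to the documented default.
        return "application/json"
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if any(fnmatch(media_type, pattern) for pattern in ACCEPTED_CONTENT_TYPES):
        return media_type
    raise ValueError(f"Unsupported content type: {content_type}")
```

Running this check before sending a request catches an unsupported type such as video/mp4 early, instead of waiting for the request to be rejected.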
Request and Prediction logs
For each inference request, Deeploy creates a single request log and one or more prediction logs. The request is split into multiple prediction logs if the request body contains multiple entries in the instances, inputs, or prompt (Hugging Face generative model completion) array, provided that the predictions, choices (Hugging Face generative model completion), and explanations in the response body have the same length. If the input and output lengths do not match, or in any other case, the batch is not split and is stored as a single prediction log.
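The splitting rule can be sketched as a small function that returns how many prediction logs a request would produce. This is one interpretation of the rule described above, not Deeploy's actual implementation; the function name count_prediction_logs is hypothetical, and only top-level keys of JSON bodies are considered.

```python
def count_prediction_logs(request_body: dict, response_body: dict) -> int:
    """Estimate the number of prediction logs for one request (illustrative only)."""
    # Find the first batch array in the request body.
    entries = None
    for key in ("instances", "inputs", "prompt"):
        value = request_body.get(key)
        if isinstance(value, list):
            entries = value
            break
    if entries is None or len(entries) < 2:
        return 1  # nothing to split

    # Collect the output arrays that are present in the response body.
    outputs = [
        response_body[key]
        for key in ("predictions", "choices", "explanations")
        if key in response_body
    ]
    if not outputs:
        return 1

    # Split only when every output array lines up with the input entries.
    if all(isinstance(o, list) and len(o) == len(entries) for o in outputs):
        return len(entries)
    return 1  # length mismatch: stored together as one prediction log
```

For example, a request with two instances and a response with two predictions yields two prediction logs, while a response with three predictions for the same request yields one.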
The prediction logs contain the request and response bodies. The bodies are only stored in the log when their content type is recognized as application/json. For other content types, the request body is stored as a binary file in object storage.