
Inference data payload

All managed Deployments must follow the data plane formats described on this page. External Deployments don't have to adhere to these protocols. Managed Deployments that use a different data payload format can make use of custom mapping.

V1

Models that use Data Plane v1 follow the TensorFlow V1 HTTP API protocol.

Info

Supports streaming. See the Streaming section below for more information.

Predict

For the /predict endpoint, the protocol is structured as follows:

Request

{
  "instances": [ <value>|<(nested)list>|<list-of-objects> ]
}

The instances field contains the content of the input tensor.

Response

{
  "predictions": [ <value>|<(nested)list>|<list-of-objects> ]
}

The predictions field contains the content of the output tensor.
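
To illustrate, here is a minimal sketch of calling a V1 /predict endpoint with Python's requests library. The URL, Authorization header, and input values are placeholders rather than part of the protocol; substitute your own Deployment's values.

import requests

# Placeholders; replace with your own Deployment URL and credentials.
PREDICT_URL = "https://<deployment-host>/predict"
HEADERS = {"Authorization": "Bearer <token>"}

# V1 request: the "instances" field carries the input tensor content.
payload = {"instances": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]}

response = requests.post(PREDICT_URL, json=payload, headers=HEADERS)
response.raise_for_status()

# V1 response: the "predictions" field carries the output tensor content.
print(response.json()["predictions"])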

Explain

For the /explain endpoint, the protocol is structured as follows:

Request

{
  "instances": [ <value>|<(nested)list>|<list-of-objects> ]
}

The instances field contains the content of the input tensor.

Response

{
  "explanations": [ <value>|<(nested)list>|<list-of-objects> ]
}

The explanations field contains the content of the output tensor.
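
The call is the same as for /predict apart from the path and the response key; a short sketch under the same placeholder assumptions:

import requests

EXPLAIN_URL = "https://<deployment-host>/explain"  # placeholder

payload = {"instances": [[1.0, 2.0, 3.0]]}
response = requests.post(EXPLAIN_URL, json=payload, headers={"Authorization": "Bearer <token>"})

# The explanation content is returned under the "explanations" key.
print(response.json()["explanations"])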

V2

The V2 protocol provides increased utility and portability. Not all Deployment frameworks support the V2 protocol yet, which is why the supported frameworks currently only use V1. When creating your own custom Docker image, you can adopt V2.

The protocol for both the /predict and /explain endpoints is structured as follows:

Request

{
  "name": $string,
  "shape": [ $number, ... ],
  "datatype": $string,
  "parameters": $parameters,
  "data": [ <value>|<(nested)list>|<list-of-objects> ]
}

Response

{
  "model_name": $string,
  "model_version": $string,
  "id": $string,
  "parameters": $parameters,
  "outputs": [ $response_output, ... ]
}
  • model_name: name of the model
  • model_version: version of the model
  • id: the identifier given to the request
  • parameters (optional): object containing 0 or more parameters as explained in the parameters documentation
  • outputs: see response_output below

Response output

{
  "name": $string,
  "shape": [ $number, ... ],
  "datatype": $string,
  "data": [ $tensor_data, ... ]
}
  • name: name of the output tensor
  • shape: shape of the output tensor. Each dimension is an integer
  • datatype: datatype of tensor output elements as defined in the tensor data types documentation
  • data: content of the output tensor. More information can be found in the tensor data documentation
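
As an example, here is a sketch of sending a V2-style request from Python and reading the response outputs. The endpoint path, tensor name, and FP32 datatype are illustrative assumptions; the exact values depend on your custom Docker image and the tensor data types documentation.

import requests

V2_URL = "https://<deployment-host>/predict"  # placeholder

# V2 request: a named tensor with its shape, datatype, and flattened data.
payload = {
    "name": "input-0",
    "shape": [2, 2],
    "datatype": "FP32",
    "parameters": {},
    "data": [1.0, 2.0, 3.0, 4.0],
}

response = requests.post(V2_URL, json=payload, headers={"Authorization": "Bearer <token>"})
body = response.json()

print(body["model_name"], body["model_version"], body["id"])
for output in body["outputs"]:
    # Each response output carries the tensor name, shape, datatype, and data.
    print(output["name"], output["shape"], output["datatype"], output["data"])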

Completions

The completions protocol is available for:

  • External Deployments (when the endpoint is a /completions endpoint)
  • Hugging Face generative model Deployments
  • Custom Docker Deployments (when the endpoint is a /completions endpoint)
Info

Supports streaming. See the Streaming section below for more information.

You can also pass additional OpenAI-format parameters, such as temperature and max_tokens.

/completions endpoint

Request

{
  "prompt": [ < list-of-prompt-strings > ]
}

For an explain request with a standard explainer deployed:

{
  "prompt": [ < list-of-prompt-strings > ],
  "explain": true
}

Response

{
  "id": < id >,
  "model": < model name >,
  "choices": [
    {
      "index": 0,
      "text": < response >,
      ...
    }
  ]
}
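
For example, a sketch of a completions request from Python; the URL and header are placeholders, and temperature and max_tokens illustrate the optional OpenAI-format parameters mentioned above.

import requests

COMPLETIONS_URL = "https://<deployment-host>/completions"  # placeholder

payload = {
    "prompt": ["Write a one-line summary of server-sent events."],
    # Optional OpenAI-format parameters.
    "temperature": 0.2,
    "max_tokens": 64,
}

response = requests.post(COMPLETIONS_URL, json=payload, headers={"Authorization": "Bearer <token>"})
for choice in response.json()["choices"]:
    print(choice["index"], choice["text"])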

Chat Completions

The chat completion protocol is available for:

  • External Deployments (when the endpoint is a chat completion endpoint)
  • Hugging Face generative model Deployments
  • Custom Docker Deployments (when the endpoint is a chat completion endpoint)
Info

Supports streaming. See the Streaming section below for more information.

You can also pass additional OpenAI-format parameters, such as temperature and max_tokens.

/chat/completions endpoint

Request

{
  "messages": [
    {
      "role": "< role >",
      "content": "< message >"
    }
  ],
  ...
}

Response

{
  "id": < id >,
  "model": < model name >,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "...."
      },
      ...
    },
    ...
  ]
}
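
A sketch of a chat completions request from Python, under the same placeholder assumptions:

import requests

CHAT_URL = "https://<deployment-host>/chat/completions"  # placeholder

payload = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What does this endpoint return?"},
    ],
    # Optional OpenAI-format parameters.
    "temperature": 0.2,
    "max_tokens": 128,
}

response = requests.post(CHAT_URL, json=payload, headers={"Authorization": "Bearer <token>"})
for choice in response.json()["choices"]:
    message = choice["message"]
    print(choice["index"], message["role"], message["content"])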

Embeddings

The embeddings protocol is available for:

  • External Deployments (when the endpoint is an embedding endpoint)
  • Hugging Face generative model Deployments
  • Custom Docker Deployments (when the endpoint is an embedding endpoint)

You can also pass additional parameters from the OpenAI embeddings format.

/embeddings endpoint

Request

{
  "input": [
    < array of text to get embedding for >
  ]
}

Response

{
  "embedding": [
    < array of embeddings >
  ],
  ...
}
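
A sketch of an embeddings request from Python, again with a placeholder URL and header; the response is read using the "embedding" key shown above.

import requests

EMBEDDINGS_URL = "https://<deployment-host>/embeddings"  # placeholder

payload = {"input": ["first sentence to embed", "second sentence to embed"]}

response = requests.post(EMBEDDINGS_URL, json=payload, headers={"Authorization": "Bearer <token>"})
embeddings = response.json()["embedding"]
print(len(embeddings), "embedding vectors returned")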

Streaming

Streaming with Server-Sent Events (SSE) is currently supported for the predict, chat completions, and completions endpoints.

Enabling streaming

Chat completions and completions

To enable streaming for the chat completions and completions endpoints, follow the OpenAI specification by including the stream key in the request body:

{
  "stream": true,
  ...
}
Note

Completions with explanations do not support streaming.

Predict endpoint

For the predict endpoint, streaming can be enabled in two ways:

  1. Query parameter: Add stream=true as a query parameter in the request
  2. Request body: Include "stream": true in the request body

When both are specified, the query parameter takes precedence over the body parameter.
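
For instance, a sketch of both options from Python against a placeholder predict URL:

import requests

PREDICT_URL = "https://<deployment-host>/predict"  # placeholder
payload = {"instances": [[1.0, 2.0, 3.0]]}
headers = {"Authorization": "Bearer <token>"}

# Option 1: enable streaming via the query parameter (takes precedence when both are set).
response = requests.post(PREDICT_URL, params={"stream": "true"}, json=payload, headers=headers, stream=True)

# Option 2: enable streaming via the request body.
response = requests.post(PREDICT_URL, json={**payload, "stream": True}, headers=headers, stream=True)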

Getting the request log ID and prediction log IDs

When streaming is enabled, responses are sent as Server-Sent Events (SSE). At the end of the stream (if logging is not skipped), metadata is sent as a comment event containing the request log ID and prediction log IDs:

: deeploy-metadata {"requestLogId":"2dd07d37-b4b2-4e53-a141-36905beb25af","predictionLogIds":["50ca4b0a-362d-4775-80a7-a6d40aabc7c3"]}
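
To illustrate, here is a sketch that streams a chat completion from Python and extracts the deeploy-metadata comment event at the end of the stream; the URL and header are placeholders.

import json
import requests

CHAT_URL = "https://<deployment-host>/chat/completions"  # placeholder

payload = {
    "messages": [{"role": "user", "content": "Stream a short answer."}],
    "stream": True,  # enables Server-Sent Events
}

with requests.post(CHAT_URL, json=payload, headers={"Authorization": "Bearer <token>"}, stream=True) as response:
    for line in response.iter_lines(decode_unicode=True):
        if not line:
            continue  # SSE events are separated by blank lines
        if line.startswith(": deeploy-metadata"):
            # Comment event at the end of the stream with the log identifiers.
            metadata = json.loads(line.split("deeploy-metadata", 1)[1])
            print("request log:", metadata["requestLogId"])
            print("prediction logs:", metadata["predictionLogIds"])
        elif line.startswith("data:"):
            print("chunk:", line[len("data:"):].strip())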