V1
/predict
All model servers within a Deployment speak the TensorFlow V1 HTTP API.
The protocol looks as follows:
Request
{
"instances": [ <value>|<(nested)list>|<list-of-objects> ]
}
Instances: content of the input tensor
Response
{
"predictions": [ <value>|<(nested)list>|<list-of-objects> ]
}
Predictions: content of the output tensor
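Below is a minimal sketch of a V1 /predict call using Python and requests. The base URL and the input values are placeholders and not part of the protocol definition; substitute the URL of your own deployment.

# Hypothetical example of a V1 /predict request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"  # replace with your deployment's URL

payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}  # content of the input tensor

resp = requests.post(f"{BASE_URL}/predict", json=payload)
resp.raise_for_status()

print(resp.json()["predictions"])  # content of the output tensor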
/explain
All explainer servers within a Deployment speak the TensorFlow V1 HTTP API.
The protocol looks as follows:
Request
{
"instances": [ <value>|<(nested)list>|<list-of-objects> ]
}
Instances: content of the input tensor
Response
{
"explanations": [ <value>|<(nested)list>|<list-of-objects> ]
}
Explanations: content of the explanation output
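The /explain route accepts the same request shape as /predict; only the response key differs. A minimal sketch, again with a placeholder base URL:

# Hypothetical example of a V1 /explain request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"  # replace with your explainer's URL

payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}

resp = requests.post(f"{BASE_URL}/explain", json=payload)
resp.raise_for_status()

print(resp.json()["explanations"])  # content of the explanation output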
V2
The V2 protocol increases utility and portability. Not every deployment framework supports it yet, which is why the currently supported frameworks expose only V1. When you create your own custom Docker deployment, you are free to adopt V2.
The protocol looks as follows:
Request
{
"name": $string,
"shape": [ $number, ... ],
"datatype": $string,
"parameters": $parameters,
"data": [ <value>|<(nested)list>|<list-of-objects> ]
}
- name: name of the input tensor
- shape: shape of the input tensor. Each dimension is an integer
- datatype: datatype of tensor input elements as defined here
- parameters (optional): object containing 0 or more parameters - as explained here
- data: content of the input tensor. More information can be found here
Response
{
"model_name": $string,
"model_version": $string,
"id": $string,
"parameters": $parameters,
"outputs": [$response_output, ... ]
}
- model_name: name of the model that produced the response
- model_version: version of the model that produced the response
- id: the identifier given to the request
- parameters (optional): object containing 0 or more parameters as explained here
- outputs: see response_output below.
Response_output
{
"name": $string,
"shape": [ $number, ... ],
"datatype": $string,
"data": [ $tensor_data, ... ]
}
- name: name of the output tensor
- shape: shape of the output tensor. Each dimension is an integer
- datatype: datatype of tensor output elements as defined here
- data: content of the output tensor. More information can be found here
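As a rough illustration, the sketch below builds the V2 request object described above and posts it from Python. The route (/v2/models/<model>/infer), model name, tensor name, datatype, and values are assumptions for the example and depend on the serving runtime inside your custom Docker image; some runtimes also expect the request object to be wrapped in an "inputs" list.

# Hypothetical example of a V2 inference request; BASE_URL, MODEL_NAME and the
# route are placeholders and depend on the serving runtime inside your image.
import requests

BASE_URL = "http://localhost:8080"
MODEL_NAME = "my-model"

request_body = {
    "name": "input-0",             # name of the input tensor
    "shape": [1, 4],               # each dimension is an integer
    "datatype": "FP32",            # datatype of the input elements
    "data": [1.0, 2.0, 3.0, 4.0]   # content of the input tensor
}

resp = requests.post(f"{BASE_URL}/v2/models/{MODEL_NAME}/infer", json=request_body)
resp.raise_for_status()

body = resp.json()
print(body["model_name"], body["id"])
for output in body["outputs"]:     # each entry is a response_output object
    print(output["name"], output["shape"], output["data"])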
Completion
Available for external deployments (provided the endpoint is a completion endpoint), Hugging Face generative model deployments, and custom Docker deployments (provided the endpoint is a completion endpoint). Other OpenAI-format parameters, such as temperature and max_tokens, can be passed as well.
/completions endpoint
Request
{
"prompt": [ < list-of-prompt-strings > ]
}
Explain request (when a standard explainer is deployed)
{
"prompt": [ < list-of-prompt-strings > ],
"explain": true
}
Response
{
"id": < id >,
"model": < model name >,
"choices":[
{
"index": 0,
"text" : < response >,
...
}
]
}
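A minimal sketch of a /completions call; the prompt and the optional OpenAI-format parameters shown are illustrative only, and BASE_URL is a placeholder for your deployment's URL.

# Hypothetical /completions request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"

payload = {
    "prompt": ["Write a one-line summary of the V2 protocol."],
    "temperature": 0.7,  # optional OpenAI-format parameter
    "max_tokens": 64     # optional OpenAI-format parameter
}

resp = requests.post(f"{BASE_URL}/completions", json=payload)
resp.raise_for_status()

for choice in resp.json()["choices"]:
    print(choice["index"], choice["text"])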
Chat Completion
Available for external deployments (provided the endpoint is a chat completion endpoint), Hugging Face generative model deployments, and custom Docker deployments (provided the endpoint is a chat completion endpoint). Other OpenAI-format parameters, such as temperature and max_tokens, can be passed as well.
/chat/completions endpoint
Request
{
"messages": [
  {
    "role": "< role >",
    "content": "< message >"
  }
],
...
}
Response
{
"id": < id >,
"model": < model name >,
"choices": [
  {
    "index": 0,
    "message": {
      "role": "assistant",
      "reasoning_content": null,
      "content": "< response >"
    },
    ...
  },
  ...
]
}
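A minimal sketch of a /chat/completions call; the messages and the optional parameter are illustrative, and BASE_URL is a placeholder for your deployment's URL.

# Hypothetical /chat/completions request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between /predict and /explain."}
    ],
    "temperature": 0.2  # optional OpenAI-format parameter
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload)
resp.raise_for_status()

print(resp.json()["choices"][0]["message"]["content"])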
Embeddings
Available for external deployments (provided the endpoint is an embedding endpoint), Hugging Face generative model deployments, and custom Docker deployments (provided the endpoint is an embedding endpoint). Other OpenAI-format parameters can be passed as well.
/embeddings endpoint
Request
{
"input": [
< array of text to get embedding for >
]
}
Response
{
"embedding": [
array of embeddings
],
...
}
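A minimal sketch of an /embeddings call; the input strings are illustrative and BASE_URL is a placeholder. The response is read according to the shape documented above.

# Hypothetical /embeddings request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"

payload = {"input": ["first sentence to embed", "second sentence to embed"]}

resp = requests.post(f"{BASE_URL}/embeddings", json=payload)
resp.raise_for_status()

embeddings = resp.json()["embedding"]  # one vector per input string, per the response shape above
print(len(embeddings), len(embeddings[0]))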