V1
/predict
All model servers within a Deployment speak the TensorFlow V1 HTTP API.
The protocol looks as follows:
Request
{
"instances": [ <value>|<(nested)list>|<list-of-objects> ]
}
Instances: content of the input tensor
Response
{
"predictions": [ <value>|<(nested)list>|<list-of-objects> ]
}
Predictions: content of the output tensor
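Below is a minimal sketch of a V1 /predict call using Python and requests. The base URL and the input values are placeholders and not part of the protocol definition; substitute the URL of your own deployment.

# Hypothetical example of a V1 /predict request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"  # replace with your deployment's URL

payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}  # content of the input tensor

resp = requests.post(f"{BASE_URL}/predict", json=payload)
resp.raise_for_status()

print(resp.json()["predictions"])  # content of the output tensor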
/explain
All explainer servers within a Deployment speak the TensorFlow V1 HTTP API.
The protocol looks as follows:
Request
{
"instances": [ <value>|<(nested)list>|<list-of-objects> ]
}
Instances: content of the input tensor
Response
{
"explanations": [ <value>|<(nested)list>|<list-of-objects> ]
}
Explanations: content of the explanation output
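The /explain route accepts the same request shape as /predict; only the response key differs. A minimal sketch, again with a placeholder base URL:

# Hypothetical example of a V1 /explain request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"  # replace with your explainer's URL

payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}

resp = requests.post(f"{BASE_URL}/explain", json=payload)
resp.raise_for_status()

print(resp.json()["explanations"])  # content of the explanation output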
V2
The V2 protocol increases utility and portability. Not every deployment framework supports it yet, which is why the currently supported frameworks expose only V1. When you create your own custom Docker deployment, you are free to adopt V2.
The protocol looks as follows:
Request
{
"name": $string,
"shape": [ $number, ... ],
"datatype": $string,
"parameters": $parameters,
"data": [ <value>|<(nested)list>|<list-of-objects> ]
}
- name: name of the input tensor
- shape: shape of the input tensor. Each dimension is an integer
- datatype: datatype of tensor input elements as defined here
- parameters (optional): object containing 0 or more parameters - as explained here
- data: content of the input tensor. More information can be found here
Response
{
"model_name": $string,
"model_version": $string,
"id": $string,
"parameters": $parameters,
"outputs": [$response_output, ... ]
}
- model_name: name of the model that produced the response
- model_version: version of the model that produced the response
- id: the identifier given to the request
- parameters (optional): object containing 0 or more parameters as explained here
- outputs: see response_output below.
Response_output
{
"name": $string,
"shape": [ $number, ... ],
"datatype": $string,
"data": [ $tensor_data, ... ]
}
- name: name of the output tensor
- shape: shape of the output tensor. Each dimension is an integer
- datatype: datatype of tensor output elements as defined here
- data: content of the output tensor. More information can be found here
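As a rough illustration, the sketch below builds the V2 request object described above and posts it from Python. The route (/v2/models/<model>/infer), model name, tensor name, datatype, and values are assumptions for the example and depend on the serving runtime inside your custom Docker image; some runtimes also expect the request object to be wrapped in an "inputs" list.

# Hypothetical example of a V2 inference request; BASE_URL, MODEL_NAME and the
# route are placeholders and depend on the serving runtime inside your image.
import requests

BASE_URL = "http://localhost:8080"
MODEL_NAME = "my-model"

request_body = {
    "name": "input-0",             # name of the input tensor
    "shape": [1, 4],               # each dimension is an integer
    "datatype": "FP32",            # datatype of the input elements
    "data": [1.0, 2.0, 3.0, 4.0]   # content of the input tensor
}

resp = requests.post(f"{BASE_URL}/v2/models/{MODEL_NAME}/infer", json=request_body)
resp.raise_for_status()

body = resp.json()
print(body["model_name"], body["id"])
for output in body["outputs"]:     # each entry is a response_output object
    print(output["name"], output["shape"], output["data"])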
Completion
Available for external deployments (provided the endpoint is a completion endpoint), Hugging Face generative model deployments, and custom Docker deployments (provided the endpoint is a completion endpoint). Other OpenAI-format parameters, such as temperature and max_tokens, can be passed as well.
/completions endpoint
Request
{
"prompt": [ < list-of-prompt-strings > ]
}
Explain request (when a standard explainer is deployed)
{
"prompt": [ < list-of-prompt-strings > ],
"explain": true
}
Response
{
"id": < id >,
"model": < model name >,
"choices":[
{
"index": 0,
"text" : < response >,
...
}
]
}
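A minimal sketch of a /completions call; the prompt and the optional OpenAI-format parameters shown are illustrative only, and BASE_URL is a placeholder for your deployment's URL.

# Hypothetical /completions request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"

payload = {
    "prompt": ["Write a one-line summary of the V2 protocol."],
    "temperature": 0.7,  # optional OpenAI-format parameter
    "max_tokens": 64     # optional OpenAI-format parameter
}

resp = requests.post(f"{BASE_URL}/completions", json=payload)
resp.raise_for_status()

for choice in resp.json()["choices"]:
    print(choice["index"], choice["text"])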
Chat Completion
Available for external deployments (provided the endpoint is a chat completion endpoint), Hugging Face generative model deployments, and custom Docker deployments (provided the endpoint is a chat completion endpoint). Other OpenAI-format parameters, such as temperature and max_tokens, can be passed as well.
/chat/completions endpoint
Request
{
"messages": [
  {
    "role": "< role >",
    "content": "< message >"
  }
],
...
}
Response
{
"id": < id >,
"model": < model name >,
"choices": [
  {
    "index": 0,
    "message": {
      "role": "assistant",
      "reasoning_content": null,
      "content": "< response >"
    },
    ...
  },
  ...
]
}
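A minimal sketch of a /chat/completions call; the messages and the optional parameter are illustrative, and BASE_URL is a placeholder for your deployment's URL.

# Hypothetical /chat/completions request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between /predict and /explain."}
    ],
    "temperature": 0.2  # optional OpenAI-format parameter
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload)
resp.raise_for_status()

print(resp.json()["choices"][0]["message"]["content"])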
Embeddings
Available for external deployments (provided the endpoint is an embedding endpoint), Hugging Face generative model deployments, and custom Docker deployments (provided the endpoint is an embedding endpoint). Other OpenAI-format parameters can be passed as well.
/embeddings endpoint
Request
{
"input": [
< array of text to get embedding for >
]
}
Response
{
"embedding": [
array of embeddings
],
...
}
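A minimal sketch of an /embeddings call; the input strings are illustrative and BASE_URL is a placeholder. The response is read according to the shape documented above.

# Hypothetical /embeddings request; BASE_URL is a placeholder.
import requests

BASE_URL = "http://localhost:8080"

payload = {"input": ["first sentence to embed", "second sentence to embed"]}

resp = requests.post(f"{BASE_URL}/embeddings", json=payload)
resp.raise_for_status()

embeddings = resp.json()["embedding"]  # one vector per input string, per the response shape above
print(len(embeddings), len(embeddings[0]))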