Use `superserve.serve()` to deploy agents as HTTP endpoints via Ray Serve:

```python
import superserve
from superserve import Agent


@superserve.tool(num_cpus=1)
def search(query: str) -> str:
    """Search for information."""
    return f"Results for: {query}"


class MyAgent(Agent):
    tools = [search]

    async def run(self, query: str) -> str:
        result = await self.call_tool("search", query=query)
        return f"Found: {result}"


superserve.serve(MyAgent, name="my-agent", num_cpus=1, memory="2GB")
```
Run it with `superserve up`, which collects `superserve.serve()` registrations (see the CLI section below).
## Serve Options

| Option | Type | Default | Description |
|---|---|---|---|
| `agent` | `Any` | required | Agent to serve |
| `name` | `str` | `None` | Agent name (inferred from directory if not set) |
| `host` | `str` | `"0.0.0.0"` | Host to bind to |
| `port` | `int` | `8000` | HTTP port |
| `num_cpus` | `int`/`float` | `1` | CPU cores per replica |
| `num_gpus` | `int`/`float` | `0` | GPUs per replica |
| `memory` | `str` | `"2GB"` | Memory per replica |
| `replicas` | `int` | `1` | Number of replicas |
| `route_prefix` | `str` | `/agents/` | URL prefix for this agent |
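For example, combining several of these options (a sketch reusing `MyAgent` from above; the specific values are illustrative):

```python
# Illustrative values only; every parameter comes from the options table above.
superserve.serve(
    MyAgent,
    name="support",
    host="0.0.0.0",
    port=9000,
    route_prefix="/agents/support/",
    num_cpus=2,
    memory="4GB",
    replicas=2,
)
```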
## Endpoints

Each agent gets a `POST` endpoint at its route prefix, e.g. `POST http://localhost:8000/agents/my-agent/`.

Request body (the non-streaming form of the requests shown under Streaming below):
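```json
{
  "query": "Hello!"
}
```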
Response:

```json
{
  "response": "Hi there! How can I help you?"
}
```
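From Python, calling this endpoint might look like the following (a sketch using the third-party `requests` library; `my-agent` is the name from the first example):

```python
import requests

# POST a query to the agent's route prefix and read the JSON response.
resp = requests.post(
    "http://localhost:8000/agents/my-agent/",
    json={"query": "Hello!"},
)
resp.raise_for_status()
print(resp.json()["response"])  # e.g. "Hi there! How can I help you?"
```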
## Streaming

Enable streaming by setting the `stream` field in your request.

| Stream Mode | Response Format |
|---|---|
| `"text"` | Server-sent events with text chunks |
| `"events"` | Server-sent events with JSON objects |
Text streaming request:

```json
{
  "query": "Hello!",
  "stream": "text"
}
```

Event streaming request:

```json
{
  "query": "Hello!",
  "stream": "events"
}
```
Streaming support depends on the agent framework. Most frameworks support both text and event streaming automatically.
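As an illustration, consuming a text stream from Python (a sketch with the third-party `requests` library; assumes standard `data:`-prefixed server-sent-event lines):

```python
import requests

# Request a text stream and print chunks as the server emits them.
with requests.post(
    "http://localhost:8000/agents/my-agent/",
    json={"query": "Hello!", "stream": "text"},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("data: "):
            print(line[len("data: "):], end="", flush=True)
```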
## CLI: superserve up

The `superserve up` command discovers and runs all agents and MCP servers in your project:

```
$ superserve up
Serving at http://localhost:8000

Agents:
  POST http://localhost:8000/agents/myagent/

MCP Servers:
  http://localhost:8000/weather/mcp
```
It automatically:

- Imports all `agents/*/agent.py` files (see the sketch below)
- Imports all `mcp_servers/*/server.py` files
- Collects `superserve.serve()` and `superserve.serve_mcp()` registrations
- Starts Ray Serve with everything on one port
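For reference, a minimal `agents/*/agent.py` that `superserve up` would discover might look like this (a sketch; the directory convention comes from the list above, and the `Agent`/`serve()` usage from the first example):

```python
# agents/search-agent/agent.py (hypothetical path following the convention above)
import superserve
from superserve import Agent


class SearchAgent(Agent):
    async def run(self, query: str) -> str:
        return f"Results for: {query}"


# Registration picked up by `superserve up`, which serves everything on one port.
superserve.serve(SearchAgent, name="search-agent")
```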
## CLI Options

| Option | Default | Description |
|---|---|---|
| `--port` | `8000` | HTTP server port |
| `--host` | `0.0.0.0` | Host to bind to |
| `--agents` | all | Comma-separated list of agents to run |
| `--servers` | all | Comma-separated list of MCP servers to run |
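For instance, to serve two specific agents on a non-default port (flags from the table above; the agent names are illustrative):

```bash
superserve up --port 9000 --agents my-agent,ml-agent
```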
## Scaling with Replicas

Ray Serve automatically load balances requests across replicas:

```python
superserve.serve(
    agent,
    name="high-throughput",
    num_cpus=1,
    memory="2GB",
    replicas=4,
)
```
## GPU Agents

For ML inference workloads, allocate GPUs per replica:

```python
superserve.serve(
    agent,
    name="ml-agent",
    num_gpus=1,
    memory="8GB",
    replicas=2,
)
```
## Resource Planning

When planning cluster resources, consider:

| Agent Config | Cluster Requirement |
|---|---|
| `replicas=3, num_cpus=2` | 6 CPUs total |
| `replicas=2, num_gpus=1` | 2 GPUs total |
| `replicas=4, memory="4GB"` | 16 GB RAM total |
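Each requirement is simply the per-replica resource multiplied by the replica count; a quick sanity check in plain Python (nothing superserve-specific):

```python
# Cluster requirement = per-replica resource × replica count.
for replicas, amount, unit in [(3, 2, "CPUs"), (2, 1, "GPUs"), (4, 4, "GB RAM")]:
    print(f"{replicas} replicas × {amount} {unit} each -> {replicas * amount} {unit} total")
```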
## Monitoring

When running, you can access:

- Agent endpoints: `http://localhost:8000/agents/{agent_name}/`
- Health check: `http://localhost:8000/agents/{agent_name}/health` (probe sketch below)
- Ray Dashboard: `http://localhost:8265`
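A liveness probe can poll the health endpoint; a minimal sketch using the third-party `requests` library, assuming the endpoint returns HTTP 200 when the replica is healthy:

```python
import requests

# Poll the per-agent health endpoint (assumed to return HTTP 200 when healthy).
resp = requests.get("http://localhost:8000/agents/my-agent/health", timeout=5)
resp.raise_for_status()  # raises on 4xx/5xx
print("my-agent is healthy")
```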
The Ray Dashboard provides visibility into:
- Cluster resources and utilization
- Task and actor status
- Logs and metrics