Use `superserve.serve()` to deploy agents as HTTP endpoints via Ray Serve:

```python
import superserve
from superserve import Agent


@superserve.tool(num_cpus=1)
def search(query: str) -> str:
    """Search for information."""
    return f"Results for: {query}"


class MyAgent(Agent):
    tools = [search]

    async def run(self, query: str) -> str:
        result = await self.call_tool("search", query=query)
        return f"Found: {result}"


superserve.serve(MyAgent, name="my-agent", num_cpus=1, memory="2GB")
```
Run it with `superserve up`, which collects `superserve.serve()` registrations (see the CLI section below).
## Serve Options

| Option | Type | Default | Description |
|---|---|---|---|
| `agent` | `Any` | required | Agent to serve |
| `name` | `str` | `None` | Agent name (inferred from directory if not set) |
| `host` | `str` | `"0.0.0.0"` | Host to bind to |
| `port` | `int` | `8000` | HTTP port |
| `num_cpus` | `int`/`float` | `1` | CPU cores per replica |
| `num_gpus` | `int`/`float` | `0` | GPUs per replica |
| `memory` | `str` | `"2GB"` | Memory per replica |
| `replicas` | `int` | `1` | Number of replicas |
| `route_prefix` | `str` | `/agents/` | URL prefix for this agent |
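For example, combining several of these options (a sketch reusing `MyAgent` from above; the specific values are illustrative):

```python
# Illustrative values only; every parameter comes from the options table above.
superserve.serve(
    MyAgent,
    name="support",
    host="0.0.0.0",
    port=9000,
    route_prefix="/agents/support/",
    num_cpus=2,
    memory="4GB",
    replicas=2,
)
```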
## Endpoints

Each agent gets a `POST` endpoint at its route prefix, e.g. `POST http://localhost:8000/agents/my-agent/`.

Request body (the non-streaming form of the requests shown under Streaming below):
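```json
{
  "query": "Hello!"
}
```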
Response:

```json
{
  "response": "Hi there! How can I help you?"
}
```
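From Python, calling this endpoint might look like the following (a sketch using the third-party `requests` library; `my-agent` is the name from the first example):

```python
import requests

# POST a query to the agent's route prefix and read the JSON response.
resp = requests.post(
    "http://localhost:8000/agents/my-agent/",
    json={"query": "Hello!"},
)
resp.raise_for_status()
print(resp.json()["response"])  # e.g. "Hi there! How can I help you?"
```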
## Streaming

Enable streaming by setting the `stream` field in your request.

| Stream Mode | Response Format |
|---|---|
| `"text"` | Server-sent events with text chunks |
| `"events"` | Server-sent events with JSON objects |
Text streaming request:

```json
{
  "query": "Hello!",
  "stream": "text"
}
```

Event streaming request:

```json
{
  "query": "Hello!",
  "stream": "events"
}
```
Streaming support depends on the agent framework. Most frameworks support both text and event streaming automatically.
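As an illustration, consuming a text stream from Python (a sketch with the third-party `requests` library; assumes standard `data:`-prefixed server-sent-event lines):

```python
import requests

# Request a text stream and print chunks as the server emits them.
with requests.post(
    "http://localhost:8000/agents/my-agent/",
    json={"query": "Hello!", "stream": "text"},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("data: "):
            print(line[len("data: "):], end="", flush=True)
```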
## CLI: superserve up

The `superserve up` command discovers and runs all agents and MCP servers in your project:

```
$ superserve up
Serving at http://localhost:8000

Agents:
  POST http://localhost:8000/agents/myagent/

MCP Servers:
  http://localhost:8000/weather/mcp
```
It automatically:

- Imports all `agents/*/agent.py` files (see the sketch below)
- Imports all `mcp_servers/*/server.py` files
- Collects `superserve.serve()` and `superserve.serve_mcp()` registrations
- Starts Ray Serve with everything on one port
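For reference, a minimal `agents/*/agent.py` that `superserve up` would discover might look like this (a sketch; the directory convention comes from the list above, and the `Agent`/`serve()` usage from the first example):

```python
# agents/search-agent/agent.py (hypothetical path following the convention above)
import superserve
from superserve import Agent


class SearchAgent(Agent):
    async def run(self, query: str) -> str:
        return f"Results for: {query}"


# Registration picked up by `superserve up`, which serves everything on one port.
superserve.serve(SearchAgent, name="search-agent")
```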
## CLI Options

| Option | Default | Description |
|---|---|---|
| `--port` | `8000` | HTTP server port |
| `--host` | `0.0.0.0` | Host to bind to |
| `--agents` | all | Comma-separated list of agents to run |
| `--servers` | all | Comma-separated list of MCP servers to run |
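For instance, to serve two specific agents on a non-default port (flags from the table above; the agent names are illustrative):

```bash
superserve up --port 9000 --agents my-agent,ml-agent
```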
## Scaling with Replicas

Ray Serve automatically load balances requests across replicas:

```python
superserve.serve(
    agent,
    name="high-throughput",
    num_cpus=1,
    memory="2GB",
    replicas=4,
)
```
## GPU Agents

For ML inference workloads, allocate GPUs per replica:

```python
superserve.serve(
    agent,
    name="ml-agent",
    num_gpus=1,
    memory="8GB",
    replicas=2,
)
```
## Resource Planning

When planning cluster resources, consider:

| Agent Config | Cluster Requirement |
|---|---|
| `replicas=3, num_cpus=2` | 6 CPUs total |
| `replicas=2, num_gpus=1` | 2 GPUs total |
| `replicas=4, memory="4GB"` | 16 GB RAM total |
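Each requirement is simply the per-replica resource multiplied by the replica count; a quick sanity check in plain Python (nothing superserve-specific):

```python
# Cluster requirement = per-replica resource × replica count.
for replicas, amount, unit in [(3, 2, "CPUs"), (2, 1, "GPUs"), (4, 4, "GB RAM")]:
    print(f"{replicas} replicas × {amount} {unit} each -> {replicas * amount} {unit} total")
```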
## Monitoring

When running, you can access:

- Agent endpoints: `http://localhost:8000/agents/{agent_name}/`
- Health check: `http://localhost:8000/agents/{agent_name}/health` (probe sketch below)
- Ray Dashboard: `http://localhost:8265`
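A liveness probe can poll the health endpoint; a minimal sketch using the third-party `requests` library, assuming the endpoint returns HTTP 200 when the replica is healthy:

```python
import requests

# Poll the per-agent health endpoint (assumed to return HTTP 200 when healthy).
resp = requests.get("http://localhost:8000/agents/my-agent/health", timeout=5)
resp.raise_for_status()  # raises on 4xx/5xx
print("my-agent is healthy")
```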
The Ray Dashboard provides visibility into:
- Cluster resources and utilization
- Task and actor status
- Logs and metrics