Production MCP Deployment Patterns
From development to production
The previous sections covered building and securing individual MCP servers. Production deployment is a different animal: orchestrating dozens of servers, scaling them, enforcing governance, and actually knowing when something breaks.
Bloomberg reduced time-to-production from days to minutes after standardizing on MCP. Block runs over 60 internal MCP servers serving thousands of employees daily. AWS offers fully managed MCP servers for EKS and ECS. These organizations stopped thinking about individual servers and started thinking about infrastructure.
Kubernetes deployment patterns
Kubernetes handles orchestration for production MCP servers. Here's the catch: MCP servers using stdio transport don't expose network ports. Communication happens through stdin/stdout, which Kubernetes doesn't natively understand.
ToolHive for MCP orchestration
ToolHive, developed by Stacklok for OpenShift and Kubernetes, solves this through a proxy architecture.
Two components:
- MCP Server Pod — Your actual MCP server container
- Proxyrunner Pod — Handles protocol conversion, authentication, and telemetry
For stdio-based servers, the proxy converts HTTP requests into JSON-RPC messages sent to the server's stdin. Responses flow back through stdout to the proxy, which returns them as HTTP responses. Network-native transports (Streamable HTTP) flow directly through standard Kubernetes Ingress.
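Concretely, a tools/call that arrives at the proxy as an HTTP request lands on the server's stdin as a JSON-RPC message along these lines (the tool name and arguments here are illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 42,
  "method": "tools/call",
  "params": {
    "name": "get_customer",
    "arguments": { "customer_id": "C-1001" }
  }
}
```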
ToolHive uses StatefulSets as the workload type. All traffic flows exclusively through the proxy. No direct network access to MCP server pods.
```yaml
apiVersion: toolhive.stacklok.io/v1alpha1
kind: MCPServer
metadata:
  name: customer-server
spec:
  image: registry.company.com/mcp/customer-server:v1.2.0
  transport: stdio
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 50m
      memory: 64Mi
```

Container configuration
Production MCP containers need hardening:
```dockerfile
FROM python:3.12-slim

# Run as non-root
RUN useradd -m -u 1000 mcpuser
USER mcpuser
WORKDIR /app

COPY --chown=mcpuser:mcpuser requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY --chown=mcpuser:mcpuser server.py .

# Docker-level health check placeholder; Kubernetes ignores HEALTHCHECK and
# relies on the probes defined in the pod spec (see below)
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s \
    CMD python -c "import sys; sys.exit(0)"

ENTRYPOINT ["python", "server.py"]
```

The requirements: non-root user execution, a minimal base image (slim or distroless), no shell in production images when possible, and multi-stage builds to reduce attack surface.
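A multi-stage variant, as a sketch (the build layout and file names are illustrative; adapt them to your project):

```dockerfile
# Build stage: install dependencies where build tooling is available
FROM python:3.12-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Runtime stage: copy only the installed packages and the server itself
FROM python:3.12-slim
RUN useradd -m -u 1000 mcpuser
COPY --from=build /install /usr/local
COPY --chown=mcpuser:mcpuser server.py /app/server.py
USER mcpuser
WORKDIR /app
ENTRYPOINT ["python", "server.py"]
```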
Health checks
Kubernetes needs to know when MCP servers are healthy.
```yaml
spec:
  containers:
  - name: mcp-server
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```

For stdio servers without HTTP endpoints, use exec probes:
```yaml
livenessProbe:
  exec:
    command:
    - python
    - -c
    - "import sys; sys.exit(0)"   # placeholder; substitute a real self-check
  initialDelaySeconds: 60
  periodSeconds: 30
```

Set a startupProbe with roughly a 60-second window for servers that load large models or establish database connections at startup.
The startup probe handles slow initialization without triggering false liveness failures.
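A sketch of that startup probe, assuming the same /health endpoint; the window comes from failureThreshold × periodSeconds rather than a single timeout value:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 12   # allows up to ~60 seconds of startup time
```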
MCP gateways for enterprise governance
An MCP gateway sits between AI clients and MCP servers. Instead of each server handling its own authentication, logging, and access control, the gateway consolidates everything into a single policy enforcement point.
What gateways do
| Capability | What it means |
|---|---|
| Authentication | OAuth 2.0, OIDC, SAML, SSO integration |
| Authorization | RBAC and ABAC for tool access |
| Audit trails | Centralized logging of all interactions |
| Session management | Preserving conversation context across requests |
| Load balancing | Distributing requests across server instances |
| Tool registry | Centralized discovery and management |
Without a gateway, governance fragments across servers. With one, you get a single chokepoint for policy enforcement.
Gateway architecture
```text
┌─────────────┐     ┌─────────────┐     ┌─────────────────┐
│ Claude Code ├────▶│ MCP Gateway ├────▶│ MCP Server Pool │
└─────────────┘     └──────┬──────┘     └─────────────────┘
                           │
                    ┌──────┴──────┐
                    │   Policy    │
                    │   Engine    │
                    │ ─────────── │
                    │ Auth Server │
                    │  Audit Log  │
                    │ Rate Limits │
                    └─────────────┘
```

The gateway intercepts all MCP traffic. Clients authenticate to the gateway, not individual servers. The gateway validates tokens, enforces policies, logs interactions, and routes requests.
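What "enforces policies" can look like in practice, as a minimal sketch: map roles from the validated token to allowed tools before routing. The role names and policy shape are illustrative, not any particular gateway's API.

```python
# Illustrative policy check a gateway might run before routing a tools/call.
# The policy layout and claim names are assumptions, not a real gateway's API.
ROLE_POLICIES = {
    "support-agent": {"get_customer", "search_tickets"},
    "analyst": {"run_report"},
}

def authorize_tool_call(claims: dict, tool_name: str) -> bool:
    """Return True if any of the caller's roles permits the requested tool."""
    allowed = set()
    for role in claims.get("roles", []):
        allowed |= ROLE_POLICIES.get(role, set())
    return tool_name in allowed

# Example: a validated token carrying roles=["support-agent"] may call get_customer
assert authorize_tool_call({"roles": ["support-agent"]}, "get_customer")
```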
Gateway options
Several gateway solutions exist:
Bifrost is developer-focused with sub-3ms latency overhead. Stateless design enables horizontal scaling. Prometheus metrics and distributed tracing built in.
TrueFoundry MCP Gateway targets enterprise use cases with roughly 10ms latency and 350+ requests per second on a single vCPU. OAuth 2.0 identity injection and credential vault management included.
Kong AI Gateway extends Kong's API gateway with MCP-specific capabilities. Session-aware routing with protocol translation.
AWS AgentCore Gateway handles tool management across MCP servers, REST APIs, and Lambda functions. Integrates with AWS IAM.
If you lack dedicated gateway infrastructure, start with an API gateway like Kong or AWS API Gateway configured as an MCP proxy. Add MCP-specific features as you need them.
Gateway configuration
A typical gateway deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-gateway
  template:
    metadata:
      labels:
        app: mcp-gateway
    spec:
      containers:
      - name: gateway
        image: company/mcp-gateway:latest
        env:
        - name: AUTH_ISSUER
          value: "https://auth.company.com"
        - name: TOOL_REGISTRY_URL
          value: "http://mcp-registry:8080"
        - name: LOG_LEVEL
          value: "info"
        ports:
        - containerPort: 8443
          name: https
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2000m
            memory: 2Gi
```

Run multiple gateway replicas behind a load balancer. Session affinity ensures conversation continuity when clients need stateful interactions.
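One way to get that affinity at the Kubernetes layer is ClientIP session affinity on the gateway Service, sketched below; header- or cookie-based affinity at your ingress controller is the sturdier option when clients sit behind shared NAT.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mcp-gateway
spec:
  selector:
    app: mcp-gateway
  ports:
  - name: https
    port: 443
    targetPort: 8443
  sessionAffinity: ClientIP          # pin a client to one gateway replica
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
```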
Multi-server orchestration
Production environments run dozens of MCP servers. Single-server deployment patterns don't scale to fleets.
Centralized registry
A tool registry enables dynamic discovery:
Registry client in Claude Code configuration:

```json
{
  "mcpServers": {
    "registry": {
      "type": "http",
      "url": "https://mcp-registry.company.com/discover"
    }
  }
}
```

The registry returns available servers based on user identity and permissions. Servers register at startup. Health checks remove unhealthy servers from discovery.
What this buys you: agents discover tools without static configuration, new servers become available automatically, and permission policies apply at discovery time.
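The registration call itself is whatever your registry implements; purely as an illustration, a server might announce itself at startup like this (the URL and payload shape are hypothetical):

```python
# Hypothetical self-registration at startup; the registry URL and payload
# shape are illustrative, not a standard MCP API.
import httpx

async def register_with_registry() -> None:
    payload = {
        "name": "customer-server",
        "endpoint": "https://mcp-customer.internal:8443/mcp",
        "transport": "streamable-http",
        "tools": ["get_customer", "search_tickets"],
    }
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://mcp-registry.company.com/register", json=payload, timeout=10
        )
        resp.raise_for_status()
```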
Namespace isolation
Separate MCP servers by function and trust level:
```text
mcp-production/
├── core-servers/        # Critical business systems
│   ├── database-server
│   └── customer-api
├── utility-servers/     # General-purpose tools
│   ├── filesystem
│   └── search
└── experimental/        # Testing new integrations
    └── prototype-server
```

Kubernetes namespaces with NetworkPolicy enforce boundaries:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: core-servers-isolation
  namespace: mcp-production-core
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: mcp-gateway
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: internal-services
```

Core servers only accept traffic from the gateway namespace. Egress is limited to approved internal services.
Scaling strategies
MCP servers scale differently than typical web services.
Stateless servers (pure tool execution) scale horizontally. Increase replica count based on request volume. Use Horizontal Pod Autoscaler with custom metrics.
Stateful servers (session context, database connections) need more care: session affinity in load balancing, connection pooling at the server level, StatefulSets with persistent volumes.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-database-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-database-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: mcp_tool_requests_per_second
      target:
        type: AverageValue
        averageValue: "50"
```

MCP session management relies on the Mcp-Session-Id header for conversation continuity.
Horizontal scaling requires session-aware load balancing or Redis-backed session storage.
Without this, clients lose context when their requests hit different replicas.
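A minimal sketch of the Redis-backed option, keyed by the session ID so any replica can resume a conversation (key naming and TTL are illustrative):

```python
# Redis-backed session storage keyed by the Mcp-Session-Id value, so any
# replica can pick up an existing conversation. Key names and TTL are
# illustrative.
import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 3600

def save_session(session_id: str, state: dict) -> None:
    r.setex(f"mcp:session:{session_id}", SESSION_TTL_SECONDS, json.dumps(state))

def load_session(session_id: str) -> dict:
    raw = r.get(f"mcp:session:{session_id}")
    return json.loads(raw) if raw else {}
```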
Monitoring and observability
Production MCP deployments need observability. The previous section covered security logging. Here we add performance metrics, distributed tracing, and alerting.
Metrics that matter
Track these for every MCP server:
| Metric | What it tells you | Target |
|---|---|---|
| Tool success rate | Calls completing without error | Over 99% for critical tools |
| Tool latency (p50/p95/p99) | Response time distribution | Set SLOs per tool importance |
| Error rate | Failed requests | Under 0.1% overall |
| Discovery success rate | Tools properly discovered | Over 99.9% |
| Session duration | How long clients stay connected | Track your baseline |
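A minimal sketch of the counters and histogram behind the first three rows, using prometheus_client; the metric names line up with the alerting rule later in this section, while the wrapper itself is illustrative:

```python
# Minimal Prometheus instrumentation for tool calls; metric names match the
# alerting rule shown later, everything else is illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

TOOL_REQUESTS = Counter("mcp_tool_requests_total", "Tool calls", ["tool"])
TOOL_ERRORS = Counter("mcp_tool_errors_total", "Failed tool calls", ["tool"])
TOOL_LATENCY = Histogram("mcp_tool_latency_seconds", "Tool call latency", ["tool"])

async def instrumented_call(tool_name: str, func, *args, **kwargs):
    TOOL_REQUESTS.labels(tool=tool_name).inc()
    start = time.perf_counter()
    try:
        return await func(*args, **kwargs)
    except Exception:
        TOOL_ERRORS.labels(tool=tool_name).inc()
        raise
    finally:
        TOOL_LATENCY.labels(tool=tool_name).observe(time.perf_counter() - start)

start_http_server(9090)  # expose /metrics for Prometheus to scrape
```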
OpenTelemetry integration
OpenTelemetry is the standard for MCP observability. The specification includes semantic conventions for MCP:
```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("mcp-server")

@mcp.tool
async def get_customer(customer_id: str) -> dict:
    with tracer.start_as_current_span(
        "tools/call",
        kind=SpanKind.SERVER,
        attributes={
            "mcp.method.name": "tools/call",
            "gen_ai.tool.name": "get_customer",
            "mcp.session.id": get_session_id(),
        },
    ) as span:
        try:
            result = await database.get_customer(customer_id)
            span.set_attribute("result.size", len(str(result)))
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR)
            raise
```

Key span attributes per the OpenTelemetry semantic conventions: mcp.method.name (the request method), gen_ai.tool.name (target tool), mcp.session.id (session identifier), and mcp.protocol.version (MCP version).
Distributed tracing
MCP interactions span multiple services. Trace context propagation connects the full request path.
```python
# Extract trace context from MCP request
from opentelemetry.propagate import extract

def handle_tool_call(request):
    # Extract trace context from request metadata
    context = extract(request.params.get("_meta", {}))
    with tracer.start_as_current_span(
        "tool_execution",
        context=context,
        kind=SpanKind.SERVER,
    ):
        # Execute tool with trace context propagated
        return execute_tool(request)
```

W3C Trace Context headers (traceparent, tracestate) propagate through the _meta property in MCP messages.
This gives you end-to-end visibility from client through gateway to server.
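On the sending side, the same propagation API injects the current context into _meta before the request goes out; a sketch, with the request dict mirroring a JSON-RPC tools/call message:

```python
# Inject the current trace context into the request's _meta so the downstream
# server can continue the trace.
from opentelemetry.propagate import inject

def forward_tool_call(request: dict) -> dict:
    meta = request.setdefault("params", {}).setdefault("_meta", {})
    inject(meta)  # adds traceparent / tracestate keys to the carrier dict
    return request
```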
Alerting thresholds
Configure alerts based on your Service Level Objectives:
| Alert | Trigger | Action |
|---|---|---|
| High error rate | Over 0.5% errors for 5 minutes | Page on-call |
| Latency degradation | p99 over 3x p50 for 15 minutes | Investigate |
| Discovery failures | Under 99.9% success | Critical alert |
| Task success regression | Under 95% on golden paths | Business impact |
```yaml
# Prometheus alerting rule
groups:
- name: mcp-alerts
  rules:
  - alert: MCPHighErrorRate
    expr: |
      sum(rate(mcp_tool_errors_total[5m])) /
      sum(rate(mcp_tool_requests_total[5m])) > 0.005
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "MCP error rate exceeds 0.5%"
```

Dashboards
A production MCP dashboard should show:
- Overview panel with total requests, error rate, average latency
- Per-tool breakdown with success rate and latency by tool
- Gateway health: authentication success, routing latency
- Resource utilization: CPU, memory, connection pools
- Session metrics: active sessions, session duration distribution
Grafana, Datadog, and similar platforms have MCP-specific dashboard templates. Start with those and customize for your environment.
Enterprise deployment checklist
Before going to production:
- Kubernetes manifests with resource limits and health probes
- MCP gateway configured with authentication and authorization
- Tool registry with dynamic discovery
- Network policies isolating MCP namespaces
- OpenTelemetry instrumentation for tracing and metrics
- Alerting rules for SLO violations
- Runbooks for common incident scenarios
- Credential rotation procedures tested
- Load testing completed with realistic traffic patterns
- DR/failover procedures documented and tested
Production MCP infrastructure requires the same operational rigor as any other service your business depends on. The patterns here are a starting point. Adapt them to fit your organization's existing platform and operational practices.