Production MCP Deployment Patterns
From development to production
The previous sections covered building and securing individual MCP servers. Production deployment is a different animal: orchestrating dozens of servers, scaling them, enforcing governance, and actually knowing when something breaks.
Bloomberg reduced time-to-production from days to minutes after standardizing on MCP. Block runs over 60 internal MCP servers serving thousands of employees daily. AWS offers fully managed MCP servers for EKS and ECS. These organizations stopped thinking about individual servers and started thinking about infrastructure.
Kubernetes deployment patterns
Kubernetes handles orchestration for production MCP servers. Here's the catch: MCP servers using stdio transport don't expose network ports. Communication happens through stdin/stdout, which Kubernetes doesn't natively understand.
ToolHive for MCP orchestration
ToolHive, developed by Stacklok for OpenShift and Kubernetes, solves this through a proxy architecture.
Two components:
- MCP Server Pod — Your actual MCP server container
- Proxyrunner Pod — Handles protocol conversion, authentication, and telemetry
For stdio-based servers, the proxy converts HTTP requests into JSON-RPC messages sent to the server's stdin. Responses flow back through stdout to the proxy, which returns them as HTTP responses. Network-native transports (Streamable HTTP) flow directly through standard Kubernetes Ingress.
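Concretely, a tools/call that arrives at the proxy as an HTTP request lands on the server's stdin as a JSON-RPC message along these lines (the tool name and arguments here are illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 42,
  "method": "tools/call",
  "params": {
    "name": "get_customer",
    "arguments": { "customer_id": "C-1001" }
  }
}
```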
ToolHive uses StatefulSets as the workload type. All traffic flows exclusively through the proxy. No direct network access to MCP server pods.
```yaml
apiVersion: toolhive.stacklok.io/v1alpha1
kind: MCPServer
metadata:
  name: customer-server
spec:
  image: registry.company.com/mcp/customer-server:v1.2.0
  transport: stdio
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 50m
      memory: 64Mi
```

Container configuration
Production MCP containers need hardening:
```dockerfile
FROM python:3.12-slim

# Run as non-root
RUN useradd -m -u 1000 mcpuser
USER mcpuser
WORKDIR /app

COPY --chown=mcpuser:mcpuser requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY --chown=mcpuser:mcpuser server.py .

# Docker-level health check placeholder; Kubernetes ignores HEALTHCHECK and
# relies on the probes defined in the pod spec (see below)
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s \
    CMD python -c "import sys; sys.exit(0)"

ENTRYPOINT ["python", "server.py"]
```

The requirements: non-root user execution, a minimal base image (slim or distroless), no shell in production images when possible, and multi-stage builds to reduce attack surface.
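A multi-stage variant, as a sketch (the build layout and file names are illustrative; adapt them to your project):

```dockerfile
# Build stage: install dependencies where build tooling is available
FROM python:3.12-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Runtime stage: copy only the installed packages and the server itself
FROM python:3.12-slim
RUN useradd -m -u 1000 mcpuser
COPY --from=build /install /usr/local
COPY --chown=mcpuser:mcpuser server.py /app/server.py
USER mcpuser
WORKDIR /app
ENTRYPOINT ["python", "server.py"]
```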
Health checks
Kubernetes needs to know when MCP servers are healthy.
```yaml
spec:
  containers:
  - name: mcp-server
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```

For stdio servers without HTTP endpoints, use exec probes:
```yaml
livenessProbe:
  exec:
    command:
    - python
    - -c
    - "import sys; sys.exit(0)"   # placeholder; substitute a real self-check
  initialDelaySeconds: 60
  periodSeconds: 30
```

Set a startupProbe with roughly a 60-second window for servers that load large models or establish database connections at startup.
The startup probe handles slow initialization without triggering false liveness failures.
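A sketch of that startup probe, assuming the same /health endpoint; the window comes from failureThreshold × periodSeconds rather than a single timeout value:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 12   # allows up to ~60 seconds of startup time
```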
MCP gateways for enterprise governance
An MCP gateway sits between AI clients and MCP servers. Instead of each server handling its own authentication, logging, and access control, the gateway consolidates everything into a single policy enforcement point.
What gateways do
| Capability | What it means |
|---|---|
| Authentication | OAuth 2.0, OIDC, SAML, SSO integration |
| Authorization | RBAC and ABAC for tool access |
| Audit trails | Centralized logging of all interactions |
| Session management | Preserving conversation context across requests |
| Load balancing | Distributing requests across server instances |
| Tool registry | Centralized discovery and management |
Without a gateway, governance fragments across servers. With one, you get a single chokepoint for policy enforcement.
Gateway architecture
```text
┌─────────────┐     ┌─────────────┐     ┌─────────────────┐
│ Claude Code ├────▶│ MCP Gateway ├────▶│ MCP Server Pool │
└─────────────┘     └──────┬──────┘     └─────────────────┘
                           │
                    ┌──────┴──────┐
                    │   Policy    │
                    │   Engine    │
                    │ ─────────── │
                    │ Auth Server │
                    │  Audit Log  │
                    │ Rate Limits │
                    └─────────────┘
```

The gateway intercepts all MCP traffic. Clients authenticate to the gateway, not individual servers. The gateway validates tokens, enforces policies, logs interactions, and routes requests.
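What "enforces policies" can look like in practice, as a minimal sketch: map roles from the validated token to allowed tools before routing. The role names and policy shape are illustrative, not any particular gateway's API.

```python
# Illustrative policy check a gateway might run before routing a tools/call.
# The policy layout and claim names are assumptions, not a real gateway's API.
ROLE_POLICIES = {
    "support-agent": {"get_customer", "search_tickets"},
    "analyst": {"run_report"},
}

def authorize_tool_call(claims: dict, tool_name: str) -> bool:
    """Return True if any of the caller's roles permits the requested tool."""
    allowed = set()
    for role in claims.get("roles", []):
        allowed |= ROLE_POLICIES.get(role, set())
    return tool_name in allowed

# Example: a validated token carrying roles=["support-agent"] may call get_customer
assert authorize_tool_call({"roles": ["support-agent"]}, "get_customer")
```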
Gateway options
Several gateway solutions exist:
Bifrost is developer-focused with sub-3ms latency overhead. Stateless design enables horizontal scaling. Prometheus metrics and distributed tracing built in.
TrueFoundry MCP Gateway targets enterprise use cases with roughly 10ms latency and 350+ requests per second on a single vCPU. OAuth 2.0 identity injection and credential vault management included.
Kong AI Gateway extends Kong's API gateway with MCP-specific capabilities. Session-aware routing with protocol translation.
AWS AgentCore Gateway handles tool management across MCP servers, REST APIs, and Lambda functions. Integrates with AWS IAM.
If you lack dedicated gateway infrastructure, start with an API gateway like Kong or AWS API Gateway configured as an MCP proxy. Add MCP-specific features as you need them.
Gateway configuration
A typical gateway deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-gateway
  template:
    metadata:
      labels:
        app: mcp-gateway
    spec:
      containers:
      - name: gateway
        image: company/mcp-gateway:latest
        env:
        - name: AUTH_ISSUER
          value: "https://auth.company.com"
        - name: TOOL_REGISTRY_URL
          value: "http://mcp-registry:8080"
        - name: LOG_LEVEL
          value: "info"
        ports:
        - containerPort: 8443
          name: https
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2000m
            memory: 2Gi
```

Run multiple gateway replicas behind a load balancer. Session affinity ensures conversation continuity when clients need stateful interactions.
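One way to get that affinity at the Kubernetes layer is ClientIP session affinity on the gateway Service, sketched below; header- or cookie-based affinity at your ingress controller is the sturdier option when clients sit behind shared NAT.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mcp-gateway
spec:
  selector:
    app: mcp-gateway
  ports:
  - name: https
    port: 443
    targetPort: 8443
  sessionAffinity: ClientIP          # pin a client to one gateway replica
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
```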
Multi-server orchestration
Production environments run dozens of MCP servers. Single-server deployment patterns don't scale to fleets.
Centralized registry
A tool registry enables dynamic discovery:
Registry client in Claude Code configuration:

```json
{
  "mcpServers": {
    "registry": {
      "type": "http",
      "url": "https://mcp-registry.company.com/discover"
    }
  }
}
```

The registry returns available servers based on user identity and permissions. Servers register at startup. Health checks remove unhealthy servers from discovery.
What this buys you: agents discover tools without static configuration, new servers become available automatically, and permission policies apply at discovery time.
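The registration call itself is whatever your registry implements; purely as an illustration, a server might announce itself at startup like this (the URL and payload shape are hypothetical):

```python
# Hypothetical self-registration at startup; the registry URL and payload
# shape are illustrative, not a standard MCP API.
import httpx

async def register_with_registry() -> None:
    payload = {
        "name": "customer-server",
        "endpoint": "https://mcp-customer.internal:8443/mcp",
        "transport": "streamable-http",
        "tools": ["get_customer", "search_tickets"],
    }
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://mcp-registry.company.com/register", json=payload, timeout=10
        )
        resp.raise_for_status()
```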
Namespace isolation
Separate MCP servers by function and trust level:
```text
mcp-production/
├── core-servers/        # Critical business systems
│   ├── database-server
│   └── customer-api
├── utility-servers/     # General-purpose tools
│   ├── filesystem
│   └── search
└── experimental/        # Testing new integrations
    └── prototype-server
```

Kubernetes namespaces with NetworkPolicy enforce boundaries:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: core-servers-isolation
  namespace: mcp-production-core
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: mcp-gateway
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: internal-services
```

Core servers only accept traffic from the gateway namespace. Egress is limited to approved internal services.
Scaling strategies
MCP servers scale differently than typical web services.
Stateless servers (pure tool execution) scale horizontally. Increase replica count based on request volume. Use Horizontal Pod Autoscaler with custom metrics.
Stateful servers (session context, database connections) need more care: session affinity in load balancing, connection pooling at the server level, StatefulSets with persistent volumes.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-database-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-database-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: mcp_tool_requests_per_second
      target:
        type: AverageValue
        averageValue: "50"
```

MCP session management relies on the Mcp-Session-Id header for conversation continuity.
Horizontal scaling requires session-aware load balancing or Redis-backed session storage.
Without this, clients lose context when their requests hit different replicas.
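A minimal sketch of the Redis-backed option, keyed by the session ID so any replica can resume a conversation (key naming and TTL are illustrative):

```python
# Redis-backed session storage keyed by the Mcp-Session-Id value, so any
# replica can pick up an existing conversation. Key names and TTL are
# illustrative.
import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 3600

def save_session(session_id: str, state: dict) -> None:
    r.setex(f"mcp:session:{session_id}", SESSION_TTL_SECONDS, json.dumps(state))

def load_session(session_id: str) -> dict:
    raw = r.get(f"mcp:session:{session_id}")
    return json.loads(raw) if raw else {}
```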
Monitoring and observability
Production MCP deployments need observability. The previous section covered security logging. Here we add performance metrics, distributed tracing, and alerting.
Metrics that matter
Track these for every MCP server:
| Metric | What it tells you | Target |
|---|---|---|
| Tool success rate | Calls completing without error | Over 99% for critical tools |
| Tool latency (p50/p95/p99) | Response time distribution | Set SLOs per tool importance |
| Error rate | Failed requests | Under 0.1% overall |
| Discovery success rate | Tools properly discovered | Over 99.9% |
| Session duration | How long clients stay connected | Track your baseline |
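A minimal sketch of the counters and histogram behind the first three rows, using prometheus_client; the metric names line up with the alerting rule later in this section, while the wrapper itself is illustrative:

```python
# Minimal Prometheus instrumentation for tool calls; metric names match the
# alerting rule shown later, everything else is illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

TOOL_REQUESTS = Counter("mcp_tool_requests_total", "Tool calls", ["tool"])
TOOL_ERRORS = Counter("mcp_tool_errors_total", "Failed tool calls", ["tool"])
TOOL_LATENCY = Histogram("mcp_tool_latency_seconds", "Tool call latency", ["tool"])

async def instrumented_call(tool_name: str, func, *args, **kwargs):
    TOOL_REQUESTS.labels(tool=tool_name).inc()
    start = time.perf_counter()
    try:
        return await func(*args, **kwargs)
    except Exception:
        TOOL_ERRORS.labels(tool=tool_name).inc()
        raise
    finally:
        TOOL_LATENCY.labels(tool=tool_name).observe(time.perf_counter() - start)

start_http_server(9090)  # expose /metrics for Prometheus to scrape
```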
OpenTelemetry integration
OpenTelemetry is the standard for MCP observability. The specification includes semantic conventions for MCP:
```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("mcp-server")

@mcp.tool
async def get_customer(customer_id: str) -> dict:
    with tracer.start_as_current_span(
        "tools/call",
        kind=SpanKind.SERVER,
        attributes={
            "mcp.method.name": "tools/call",
            "gen_ai.tool.name": "get_customer",
            "mcp.session.id": get_session_id(),
        },
    ) as span:
        try:
            result = await database.get_customer(customer_id)
            span.set_attribute("result.size", len(str(result)))
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR)
            raise
```

Key span attributes per the OpenTelemetry semantic conventions: mcp.method.name (the request method), gen_ai.tool.name (target tool), mcp.session.id (session identifier), and mcp.protocol.version (MCP version).
Distributed tracing
MCP interactions span multiple services. Trace context propagation connects the full request path.
```python
# Extract trace context from MCP request
from opentelemetry.propagate import extract

def handle_tool_call(request):
    # Extract trace context from request metadata
    context = extract(request.params.get("_meta", {}))
    with tracer.start_as_current_span(
        "tool_execution",
        context=context,
        kind=SpanKind.SERVER,
    ):
        # Execute tool with trace context propagated
        return execute_tool(request)
```

W3C Trace Context headers (traceparent, tracestate) propagate through the _meta property in MCP messages.
This gives you end-to-end visibility from client through gateway to server.
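On the sending side, the same propagation API injects the current context into _meta before the request goes out; a sketch, with the request dict mirroring a JSON-RPC tools/call message:

```python
# Inject the current trace context into the request's _meta so the downstream
# server can continue the trace.
from opentelemetry.propagate import inject

def forward_tool_call(request: dict) -> dict:
    meta = request.setdefault("params", {}).setdefault("_meta", {})
    inject(meta)  # adds traceparent / tracestate keys to the carrier dict
    return request
```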
Alerting thresholds
Configure alerts based on your Service Level Objectives:
| Alert | Trigger | Action |
|---|---|---|
| High error rate | Over 0.5% errors for 5 minutes | Page on-call |
| Latency degradation | p99 over 3x p50 for 15 minutes | Investigate |
| Discovery failures | Under 99.9% success | Critical alert |
| Task success regression | Under 95% on golden paths | Business impact |
```yaml
# Prometheus alerting rule
groups:
- name: mcp-alerts
  rules:
  - alert: MCPHighErrorRate
    expr: |
      sum(rate(mcp_tool_errors_total[5m])) /
      sum(rate(mcp_tool_requests_total[5m])) > 0.005
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "MCP error rate exceeds 0.5%"
```

Dashboards
A production MCP dashboard should show:
- Overview panel with total requests, error rate, average latency
- Per-tool breakdown with success rate and latency by tool
- Gateway health: authentication success, routing latency
- Resource utilization: CPU, memory, connection pools
- Session metrics: active sessions, session duration distribution
Grafana, Datadog, and similar platforms have MCP-specific dashboard templates. Start with those and customize for your environment.
Enterprise deployment checklist
Before going to production:
- Kubernetes manifests with resource limits and health probes
- MCP gateway configured with authentication and authorization
- Tool registry with dynamic discovery
- Network policies isolating MCP namespaces
- OpenTelemetry instrumentation for tracing and metrics
- Alerting rules for SLO violations
- Runbooks for common incident scenarios
- Credential rotation procedures tested
- Load testing completed with realistic traffic patterns
- DR/failover procedures documented and tested
Production MCP infrastructure requires the same operational rigor as any other service your business depends on. The patterns here are a starting point. Adapt them to fit your organization's existing platform and operational practices.