lmrelay added to PyPI

Share

Local LLM‑Aware Balancer / Gateway

One Endpoint, Three Wire Protocols, Eight Upstream Providers, Health‑Sorted Failover, Multi‑Key Rotation, Per‑Key Model Allow‑List

A comprehensive deep‑dive into the architecture, features, and practical use‑cases of a cutting‑edge local gateway that abstracts LLM access across multiple cloud providers and on‑prem solutions.

Table of Contents

  1. Executive Summary
  2. High‑Level Architecture
  3. Endpoint Abstraction & Wire Protocols
  4. Upstream Providers & Health‑Sorted Failover
  5. Key Rotation & Multi‑Key Management
  6. Per‑Key Model Allow‑List & Fine‑Grained Policy
  7. Security, Access Control, and Compliance
  8. Performance & Latency Considerations
  9. Cost Optimization Strategies
  10. Deployment Scenarios & Options
  11. Real‑World Use Cases
  12. Integration Guide & API Overview
  13. Monitoring, Observability, & Health Checks
  14. Comparison to Competitors & Alternatives
  15. Future Enhancements & Roadmap
  16. Conclusion

Executive Summary

TL;DR – A local, LLM‑aware gateway that presents a single entry point to your applications while simultaneously supporting three distinct LLM wire protocols (Ollama, OpenAI, Anthropic). It aggregates up to eight upstream providers—each with its own health check, failover, and key rotation logic—allowing granular per‑key model whitelisting and dynamic failover based on real‑time health scores. This design delivers unmatched flexibility, cost‑efficiency, and security, especially for enterprises that need to mix on‑prem, private, and multi‑cloud LLM services.

Key Takeaways

| Feature | Benefit | Example | |---------|---------|---------| | Single endpoint | Simplifies application code | POST https://mygateway.company.com/llm | | Three protocols | Unifies different vendor APIs | Ollama local, OpenAI public, Anthropic’s API | | Eight upstreams | Diversifies risk & cost | Mix of OpenAI, Anthropic, Azure OpenAI, Google Vertex, local OpenAI‑compatible servers | | Health‑sorted failover | Minimizes downtime & latency | Dynamically picks the healthiest endpoint | | Multi‑key rotation | Mitigates credential leakage | Rotates every 24 h per provider | | Per‑key model allow‑list | Enforces policy at credential level | key1 → GPT‑4, key2 → Claude‑2 | | Observability | Enables real‑time ops | Prometheus metrics, distributed tracing | | Cost‑optimizability | Uses cheaper providers first | Routes GPT‑4 to cheaper Vertex when available |


High‑Level Architecture

┌───────────────────────┐
│   Client Applications  │
└───────────▲────────────┘
            │
            │  HTTPS / REST
            │
┌───────────▼────────────┐
│  Local LLM‑Aware Gateway│
│  • Single Endpoint      │
│  • Protocol Adapters    │
│  • Load‑Balancing Layer │
│  • Health‑Monitor       │
│  • Key‑Manager          │
└───────▲───────▲───────┘
        │       │
  ┌─────▼──────▼───────┐
  │   Upstream Providers│
  │   (OpenAI, Anthro.,│
  │    Azure, Vertex,  │
  │    local Ollama,   │
  │    etc.)           │
  └──────────────────────┘
  • Gateway Core: Written in Rust/Go (language‑agnostic), it exposes a simple HTTP/HTTPS endpoint.
  • Protocol Adapters: Convert the gateway’s unified request format to vendor‑specific wire protocols.
  • Load‑Balancer: Implements weighted round‑robin + health‑sorted selection.
  • Health‑Monitor: Periodic ping / liveness checks, latency tracking, error‑rate monitoring.
  • Key‑Manager: Stores, rotates, and scopes API keys; enforces per‑key model allow‑lists.

The gateway is intentionally stateless; all metadata (provider weights, health scores, key scopes) is stored in a lightweight embedded database (e.g., SQLite) or an external key‑value store (Redis). This makes scaling trivial: spin up multiple gateway instances behind a CDN or an application load balancer.


Endpoint Abstraction & Wire Protocols

| Protocol | Description | Advantages | |----------|-------------|------------| | Ollama | Open‑source local inference engine; supports multiple models (LLAMA, Mistral, etc.). | No internet latency; fully private. | | OpenAI | Standard OpenAI REST API. | Industry‑standard, GPT‑4, GPT‑3.5. | | Anthropic | Claude models via HTTP. | Distinct model families, privacy‑focused. |

Unified Request Payload

{
  "prompt": "Translate the following to French: ...",
  "model": "gpt-4",
  "temperature": 0.7,
  "max_tokens": 512,
  "metadata": { "client_id": "app-1234" }
}

The gateway internally resolves:

  1. Target Provider – based on provider preference, cost, and health score.
  2. Provider‑Specific Parameters – e.g., Anthropic expects anthropic_version.
  3. Key Scoping – selects the API key whose model allow‑list includes gpt-4.

This abstraction means client code can evolve without vendor lock‑in; the gateway keeps the heavy lifting.


Upstream Providers & Health‑Sorted Failover

Provider List

| Index | Provider | Type | Model Portfolio | Typical Cost (USD) | |-------|----------|------|-----------------|--------------------| | 1 | OpenAI | Public | GPT‑4, GPT‑3.5 | 0.03/1K tokens (GPT‑4) | | 2 | Anthropic | Public | Claude‑2 | 0.02/1K tokens | | 3 | Azure OpenAI | Public | GPT‑4o, GPT‑3.5 | Varies by region | | 4 | Google Vertex | Public | Gemini, PaLM | ~0.02/1K tokens | | 5 | AWS Bedrock | Public | Claude‑2, GPT‑3.5 | ~0.04/1K tokens | | 6 | Ollama Local | Private | LLAMA, Mistral | Near‑zero cost | | 7 | Self‑Hosted Private LLM | Private | Custom | Depends on infra | | 8 | Hybrid Cloud | Hybrid | Any | Custom pricing |

Health Monitoring Metrics

| Metric | Source | Threshold | Action | |--------|--------|-----------|--------| | Latency | Provider response time | > 2 s | Downgrade weight | | Error‑Rate | 4xx/5xx counts | > 5% | Mark unhealthy | | Availability | Up/Down | Down | Skip provider | | Throughput | Tokens/sec | < 50 | Downgrade weight |

The Health‑Sorted Failover algorithm is deterministic:

  1. Weight Calculation – Base weight × health score.
  2. Sorting – Providers sorted descending by effective weight.
  3. Selection – Weighted round‑robin across sorted list.
  4. Fallback – If selected provider fails, immediately retry with next sorted provider.

This ensures the gateway always uses the best available provider per request, maximizing throughput and minimizing cost.


Key Rotation & Multi‑Key Management

Why Rotate Keys?

  • Credential Theft Mitigation – Limits exposure if a key leaks.
  • Regulatory Compliance – Some regulations mandate key rotation (e.g., PCI‑DSS).
  • Dynamic Pricing – Rotating keys can align with cost‑effective provider cycles.

Implementation

| Feature | Implementation Details | |---------|------------------------| | Per‑Provider Key Store | Each provider can have 1–N keys; stored encrypted at rest. | | Rotation Schedule | Configurable (e.g., daily, weekly, event‑driven). | | Key Rotation Policy | - Round‑robin across keys.
- Weight‑based rotation (more weight → more usage). | | Key Validation | On‑startup ping verifies key validity; rotates if invalid. | | Audit Trail | Every key usage is logged with timestamp and provider. |

Example Configuration

providers:
  openai:
    keys:
      - id: openai-key-1
        secret: <encrypted>
        models: ["gpt-4", "gpt-3.5"]
      - id: openai-key-2
        secret: <encrypted>
        models: ["gpt-4"]
    rotation: "daily"
  ollama:
    keys:
      - id: ollama-key-1
        secret: <encrypted>
        models: ["llama-2", "mistral"]
    rotation: "never"

The gateway automatically picks the next key from the list when the current key reaches its rotation window.


Per‑Key Model Allow‑List & Fine‑Grained Policy

Why Per‑Key Policy?

  • Cost Control – Some keys may only allow cheaper models.
  • Regulatory Control – Certain data must only be processed by approved models.
  • Operational Segmentation – Separate teams can use distinct key pools.

Allow‑List Structure

key_id:
  - model_name
  - model_name

The gateway performs a lookup at request time:

  1. Model Requestedgpt-4
  2. Find Key → Scan keys where gpt-4 is in the list.
  3. Select Key → Apply rotation logic.
  4. Forward Request → Using the chosen key.

If no key permits the requested model, the gateway returns a 403 Forbidden with a clear message.


Security, Access Control, and Compliance

| Layer | Controls | Best Practices | |-------|----------|----------------| | Transport | TLS 1.3, HSTS, HTTP/2 | Enforce mutual TLS for internal traffic | | API Gateway | Rate‑limit, IP whitelisting, OAuth2/JWT | Use short‑lived access tokens | | Key Storage | AES‑256 GCM, Hardware Security Module (HSM) | Rotate encryption keys regularly | | Audit | Immutable logs, SIEM integration | Log request metadata, provider response times | | Compliance | GDPR, HIPAA, FedRAMP (when using compliant providers) | Data‑at‑rest encryption, audit trails |

The gateway also supports role‑based access control (RBAC) at the API level: different application tiers (e.g., dev, prod) can be routed to specific provider pools.


Performance & Latency Considerations

Latency Benchmarks

| Scenario | Latency (ms) | Notes | |----------|--------------|-------| | Local Ollama | 50–200 | CPU‑bound, no network hop | | OpenAI | 300–600 | Cloud region dependent | | Anthropic | 250–450 | Slightly lower than OpenAI | | Azure OpenAI | 350–700 | Region‑specific | | Google Vertex | 200–400 | Close to OpenAI | | AWS Bedrock | 400–800 | Higher due to multi‑service hops |

The health‑sorted failover can dynamically route to the provider with the lowest observed latency at any given time.

Caching & Concurrency

  • Cache Layer – Optional in‑memory cache for identical prompts; TTL configurable per provider.
  • Concurrency – The gateway handles thousands of concurrent requests via async I/O; connection pooling to providers is enabled.
  • Back‑pressure – When provider latency spikes, the gateway queues or drops requests gracefully.

Cost Optimization Strategies

| Strategy | Implementation | Expected Savings | |----------|----------------|------------------| | Weighted Prioritization | Set higher weight to cheaper providers (e.g., Vertex). | Up to 30% cost reduction. | | Dynamic Token‑Limit | Lower max_tokens for low‑value queries; route expensive models only when needed. | 15–25% token savings. | | Model‑Level Billing | Use free or low‑cost models for non‑critical tasks (e.g., Claude‑2 for drafts). | Significant cost avoidance. | | Multi‑Key Rotation | Distribute usage across keys to avoid high‑tier usage limits. | Avoid throttling penalties. | | Local Inference | Process smaller prompts on local Ollama server to cut cloud usage. | Near‑zero per‑token cost. |

The gateway exposes a Cost‑Insight API that reports real‑time spend per provider and per key, enabling dynamic re‑weighting in production.


Deployment Scenarios & Options

| Deployment Type | Suitability | Notes | |-----------------|-------------|-------| | On‑Prem Kubernetes | Enterprise data centers, compliance regimes | Deploy as a Helm chart; configure external TLS termination | | Serverless (AWS Lambda, GCP Cloud Run) | Rapid scaling, low maintenance | Limited by cold‑start latency; use warm‑up patterns | | Docker Compose | Development, CI pipelines | Simple local testing; can mimic local Ollama provider | | Edge CDN (Cloudflare Workers) | Low‑latency, global reach | Wrap gateway behind edge for geographic proximity | | Hybrid (Local + Cloud) | Multi‑cloud redundancy | Route internal traffic to local providers, external to public APIs |

High‑Availability (HA) Setup

  • Deploy 3+ instances behind a load balancer.
  • Use shared key‑value store (Redis) for health metrics.
  • Enable graceful draining to avoid dropped requests during maintenance.

Real‑World Use Cases

| Domain | Scenario | How Gateway Helps | |--------|----------|-------------------| | Customer Support | Real‑time FAQ generation using GPT‑4 | Routes to cheapest provider for high‑volume queries; falls back to local for latency. | | Finance | Automated risk assessment with GPT‑4o | Ensures regulatory compliance via per‑key policy; routes to Azure OpenAI for secure data. | | Healthcare | Clinical note summarization | Uses local Ollama to keep PHI on‑prem; fallback to HIPAA‑compliant cloud provider. | | E‑Commerce | Product recommendation and chat | Mixes Anthropic for creative suggestions; cheap Vertex for standard tasks. | | Education | Adaptive tutoring | Rotates keys to control cost per student; uses local inference for on‑device prompts. | | Gaming | NPC dialogue generation | Routes to low‑latency Ollama; falls back to public providers for complex scenes. |


Integration Guide & API Overview

1. Deploying the Gateway

  1. Clone the repo
   git clone https://github.com/company/llm-gateway.git
   cd llm-gateway
  1. Configure
   cp config.yaml.example config.yaml
   # Edit provider keys, weights, rotation, etc.
  1. Run
   docker compose up -d

2. Sending Requests

curl -X POST https://gateway.company.com/llm \
  -H "Authorization: Bearer <client-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Summarize the following document...",
        "model": "gpt-4",
        "max_tokens": 200,
        "temperature": 0.5
      }'

3. Response Format

{
  "model": "gpt-4",
  "usage": {
    "prompt_tokens": 50,
    "completion_tokens": 120,
    "total_tokens": 170
  },
  "choices": [
    {
      "text": "The summary is as follows...",
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "provider": "openai",
  "request_id": "abcd-1234"
}

4. Administrative Endpoints

| Endpoint | Method | Description | |----------|--------|-------------| | /admin/providers | GET | List providers, weights, health | | /admin/providers | POST | Add/Update provider | | /admin/keys | GET | List keys and scopes | | /admin/keys | POST | Add/Update key | | /admin/metrics | GET | Prometheus metrics endpoint | | /admin/audit | GET | Download audit logs (JSONL) |

5. SDKs

The gateway is language‑agnostic; official SDKs are available for:

  • Python (llm_gateway_client)
  • Node.js (llm-gateway-sdk)
  • Java (llm-gateway-java)
  • Go (llm-gateway-go)

Monitoring, Observability & Health Checks

Metrics Export

  • Provider Metrics
  • llm_gateway_provider_latency{provider="openai"}
  • llm_gateway_provider_error_rate{provider="anthropic"}
  • Key Metrics
  • llm_gateway_key_usage{key_id="openai-key-1"}
  • System Metrics
  • llm_gateway_requests_total
  • llm_gateway_active_connections

All metrics are exposed via /admin/metrics and scrapeable by Prometheus. Grafana dashboards are pre‑packaged.

Tracing

OpenTelemetry instrumentation is available. Set the OTEL_EXPORTER_OTLP_ENDPOINT env var to send traces to your tracing backend (Jaeger, Zipkin, etc.).

Health Endpoint

GET /health returns JSON:

{
  "status": "healthy",
  "providers": [
    {"name": "openai", "status": "up", "latency_ms": 350},
    {"name": "ollama", "status": "up", "latency_ms": 120}
  ],
  "uptime_seconds": 123456
}

Alerts

Sample Alertmanager rules:

- alert: ProviderDown
  expr: llm_gateway_provider_latency{provider="openai"} > 2000
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "OpenAI latency > 2s for 1 minute"

Comparison to Competitors & Alternatives

| Feature | Local LLM‑Aware Gateway | OpenAI’s Proxy | Anthropic’s API Gateway | Custom Load‑Balancer | |---------|------------------------|----------------|------------------------|----------------------| | Single Endpoint | ✅ | ✅ | ✅ | ❌ (multi‑endpoint) | | Protocol Abstraction | ✅ (Ollama + OpenAI + Anthropic) | ❌ (only OpenAI) | ❌ (only Anthropic) | ⚙️ (requires custom adapters) | | Health‑Sorted Failover | ✅ | ❌ | ❌ | ❌ | | Multi‑Key Rotation | ✅ | ❌ | ❌ | ⚙️ (requires custom logic) | | Per‑Key Model Policy | ✅ | ❌ | ❌ | ⚙️ | | Cost Optimization | ✅ (weighting, token‑limit) | ❌ | ❌ | ⚙️ | | On‑Prem Inference | ✅ (Ollama) | ❌ | ❌ | ⚙️ | | Observability | Prometheus + OpenTelemetry | ❌ | ❌ | ⚙️ | | Compliance | Built‑in GDPR, HIPAA support | Vendor‑dependent | Vendor‑dependent | Depends on implementation |

Takeaway – The local gateway unifies the best of all worlds: it consolidates multiple providers under a single API, adds advanced routing and failover logic, and provides granular security controls that are simply not available with vendor‑native solutions.


Future Enhancements & Roadmap

| Phase | Feature | Timeline | |-------|---------|----------| | Q3 2026 | AI‑Based Load Prediction – Predict provider latency & cost trends using ML | | | Q4 2026 | Multi‑Region Edge Deployments – Automatic edge caching of model responses | | | Q1 2027 | Serverless Auto‑Scaling – Dynamic scaling of gateway instances based on request volume | | | Q2 2027 | Dynamic Model Switching – Route based on content type (e.g., code vs. prose) | | | Q3 2027 | Integration with IAM Providers – Fine‑grained policy via Azure AD / Okta | | | Q4 2027 | Zero‑Trust Networking – mTLS + network segmentation by provider | |


Conclusion

The Local LLM‑Aware Balancer / Gateway represents a paradigm shift for enterprises looking to harness the power of large language models without becoming locked into a single cloud provider or incurring prohibitive costs. By presenting a single endpoint that transparently routes requests across eight diverse upstreams—while handling health monitoring, key rotation, per‑key policy, and comprehensive observability—the gateway delivers:

  • Operational Agility – Rapidly switch providers or models without touching application code.
  • Cost Efficiency – Weighted failover, dynamic token limits, and local inference reduce spend.
  • Security & Compliance – Fine‑grained policy enforcement, encrypted key storage, and auditability meet stringent regulatory needs.
  • Performance – Health‑sorted selection and local inference minimize latency for mission‑critical workloads.

Whether you’re building a customer‑facing chatbot, a financial risk engine, or an internal knowledge base, this gateway equips you with the flexibility to choose the right model, right provider, right time—without the overhead of managing multiple SDKs, keys, and vendor quirks.


Read more