lmrelay added to PyPI

1 – Introduction

The landscape of large‑language‑model (LLM) services has exploded in the past year. Providers such as OpenAI, Anthropic, and a growing ecosystem of on‑prem solutions (e.g., Ollama, Meta’s Llama, and proprietary models) are competing on price, latency, policy compliance, and developer experience. In this rapidly evolving space, developers and enterprises are looking for a single, unified interface that can talk to any of these services, manage credentials, and provide a seamless fail‑over path when a provider goes down or reaches capacity.

The news article introduces a Local LLM‑Aware Balancer / Gateway – a lightweight, self‑hosted component that sits between an application and multiple LLM providers. It exposes a single endpoint to the client while supporting three wire protocols (Ollama, OpenAI, Anthropic). Internally it can talk to up to eight upstream providers, perform health‑sorted fail‑over, rotate API keys on a per‑request basis, and enforce per‑key model allowlists. This feature set promises to dramatically simplify LLM integration, reduce operational overhead, and increase resilience.

Below we unpack the key components of the solution, illustrate how they inter‑operate, and evaluate the impact on developers and organizations that rely on LLMs at scale.


2 – High‑Level Architecture

+-------------+          +-------------+           +------------+
| Client Apps | <------> | LLM Gateway | <---+---> | Provider A |
+-------------+   API    +-------------+     |     +------------+
                                             |     +------------+
                                       HTTP  +---> | Provider B |
                                             |     +------------+
                                             |          ...
                                             |     +------------+
                                             +---> | Provider H |
                                                   +------------+

  • Client Apps – Any application that wants to generate text, embeddings, or fine‑tune a model. They send requests to the gateway via a single HTTP endpoint.
  • LLM Gateway – The core of the system. It accepts requests, routes them to the best available upstream provider, handles authentication, retries, and fail‑over. It also supports protocol translation so that a client can send an OpenAI‑style request, and the gateway can forward it to an Ollama or Anthropic backend if needed.
  • Upstream Providers – External LLM services (OpenAI, Anthropic, Ollama, etc.). The gateway can connect to up to eight of them simultaneously, each potentially requiring a different API key, endpoint URL, and model list.

The gateway is stateless from the client perspective: every request is self‑contained. Internally it maintains a health‑check registry of providers and a key‑rotation queue. Because the gateway is self‑hosted, it can be placed behind a corporate firewall or in a private cloud, providing data‑privacy guarantees that are often lacking in cloud‑only solutions.


3 – Supported Protocols

3.1 Ollama Protocol

Ollama is an open-source runtime that can run LLMs locally or on a server. Its native REST API is similar in spirit to OpenAI's but uses different endpoints and field names (Ollama also offers an OpenAI-compatible endpoint under /v1):

  • Endpoint – http://localhost:11434/api/generate
  • Payload – JSON with model, prompt, stream, and an options object (temperature, num_predict, etc.)
  • Response – JSON containing model, created_at, response, and done, plus timing and token-count fields

The gateway can proxy Ollama requests without modification or translate them into a different upstream format if required.

3.2 OpenAI Protocol

OpenAI's API remains the de facto standard for LLM interaction. When the upstream also speaks the OpenAI protocol, the gateway forwards the request body as received, preserving fields such as model, messages, temperature, max_tokens, and response_format. The gateway can also strip sensitive or unsupported fields (e.g., a client-supplied api_key) before passing the request to upstream providers that do not accept them.

3.3 Anthropic Protocol

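Anthropic's Claude API (the Messages API) uses a similar request structure but differs in some field names, required fields, and model identifiers. For example, the model is still passed as model, but max_tokens is required, the system prompt is a top-level system field rather than a message, and generated text comes back in a content array instead of OpenAI's choices array (the legacy Text Completions API returned a completion string). The gateway implements a bidirectional translation layer:

| Concept | Anthropic (Messages) | OpenAI (Chat Completions) | Ollama (/api/generate) |
|---------|----------------------|---------------------------|------------------------|
| Model | model | model | model |
| Input | messages (plus top-level system) | messages | prompt |
| Output length | max_tokens (required) | max_tokens | options.num_predict |
| Sampling temperature | temperature | temperature | options.temperature |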

This ensures a universal contract for clients while allowing the gateway to talk to any provider.
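
As a rough illustration of the translation layer, the sketch below converts an OpenAI-style chat request into Anthropic Messages and Ollama /api/generate shapes. The type and function names (OpenAIChatRequest, toAnthropic, toOllama) and the "role: content" prompt flattening are assumptions made for this example, not the gateway's actual code; only a subset of each provider's fields is modeled.

```go
package main

import "strings"

// ChatMessage is one turn in an OpenAI-style conversation.
type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// OpenAIChatRequest models a subset of the OpenAI chat request body.
type OpenAIChatRequest struct {
	Model       string        `json:"model"`
	Messages    []ChatMessage `json:"messages"`
	MaxTokens   int           `json:"max_tokens,omitempty"`
	Temperature float64       `json:"temperature,omitempty"`
}

// AnthropicRequest models a subset of the Anthropic Messages request body.
type AnthropicRequest struct {
	Model       string        `json:"model"`
	System      string        `json:"system,omitempty"`
	Messages    []ChatMessage `json:"messages"`
	MaxTokens   int           `json:"max_tokens"`
	Temperature float64       `json:"temperature,omitempty"`
}

// OllamaOptions carries sampling settings for /api/generate.
type OllamaOptions struct {
	Temperature float64 `json:"temperature,omitempty"`
	NumPredict  int     `json:"num_predict,omitempty"`
}

// OllamaGenerateRequest models a subset of the /api/generate body.
type OllamaGenerateRequest struct {
	Model   string        `json:"model"`
	Prompt  string        `json:"prompt"`
	Stream  bool          `json:"stream"`
	Options OllamaOptions `json:"options"`
}

// toAnthropic lifts system messages into the top-level system field and
// fills in max_tokens, which the Messages API requires.
func toAnthropic(in OpenAIChatRequest, model string, defaultMaxTokens int) AnthropicRequest {
	out := AnthropicRequest{Model: model, MaxTokens: in.MaxTokens, Temperature: in.Temperature}
	if out.MaxTokens == 0 {
		out.MaxTokens = defaultMaxTokens
	}
	var system []string
	for _, m := range in.Messages {
		if m.Role == "system" {
			system = append(system, m.Content)
			continue
		}
		out.Messages = append(out.Messages, m)
	}
	out.System = strings.Join(system, "\n")
	return out
}

// toOllama flattens the conversation into a single prompt string
// (an assumed convention; a real adapter might use /api/chat instead).
func toOllama(in OpenAIChatRequest, model string) OllamaGenerateRequest {
	lines := make([]string, 0, len(in.Messages))
	for _, m := range in.Messages {
		lines = append(lines, m.Role+": "+m.Content)
	}
	return OllamaGenerateRequest{
		Model:   model,
		Prompt:  strings.Join(lines, "\n"),
		Options: OllamaOptions{Temperature: in.Temperature, NumPredict: in.MaxTokens},
	}
}
```

The target model name is passed in explicitly because mapping a client's requested model to an upstream model is a policy decision the article does not spell out.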


4 – Upstream Provider Management

4.1 Provider Configuration

Each provider is defined in the gateway’s configuration file (YAML/JSON/TOML). A provider record includes:

| Field | Description |
|-------|-------------|
| name | Human-readable identifier |
| type | openai, anthropic, or ollama |
| url | Base URL for the provider's API |
| api_keys | Array of API keys (see key rotation) |
| models | List of supported models (e.g., gpt-4o, claude-3-5-sonnet-20240620, llama3.1:70b) |
| health_check | Optional health-check endpoint and HTTP method |
| weight | Optional load-balancing weight (default 1) |

Example configuration for three providers:

providers:
  - name: openai-prod
    type: openai
    url: https://api.openai.com/v1
    api_keys: [ "sk-xxxx1", "sk-xxxx2" ]
    models:
      - gpt-4o
      - gpt-4o-mini
    health_check:
      url: /health
      method: GET
  - name: anthropic-prod
    type: anthropic
    url: https://api.anthropic.com/v1
    api_keys: [ "anthropic_key_1" ]
    models:
      - claude-3-5-sonnet-20240620
  - name: ollama-local
    type: ollama
    url: http://localhost:11434
    api_keys: []   # Ollama can be used without API keys
    models:
      - llama3.1:70b
      - mistral:7b
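
To make the provider record concrete, here is a minimal sketch of how such a configuration could be loaded and validated. It assumes the JSON form of the config so that only the Go standard library is needed; the struct names, the parseConfig function, and the validation rules (type check, eight-provider cap, default weight of 1) are illustrative assumptions derived from the field table above, not the gateway's real loader.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HealthCheck is the optional health-check endpoint and HTTP method.
type HealthCheck struct {
	URL    string `json:"url"`
	Method string `json:"method"`
}

// ProviderConfig mirrors the provider record fields from the table above.
type ProviderConfig struct {
	Name        string       `json:"name"`
	Type        string       `json:"type"`
	URL         string       `json:"url"`
	APIKeys     []string     `json:"api_keys"`
	Models      []string     `json:"models"`
	HealthCheck *HealthCheck `json:"health_check,omitempty"`
	Weight      int          `json:"weight,omitempty"`
}

// Config is the top-level document with a providers list.
type Config struct {
	Providers []ProviderConfig `json:"providers"`
}

// maxProviders reflects the article's "up to eight upstream providers".
const maxProviders = 8

// parseConfig decodes the JSON config, rejects unknown provider types,
// and applies the default weight of 1.
func parseConfig(data []byte) (*Config, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parse config: %w", err)
	}
	if len(cfg.Providers) > maxProviders {
		return nil, fmt.Errorf("too many providers: %d (max %d)", len(cfg.Providers), maxProviders)
	}
	for i := range cfg.Providers {
		p := &cfg.Providers[i]
		switch p.Type {
		case "openai", "anthropic", "ollama":
		default:
			return nil, fmt.Errorf("provider %q: unknown type %q", p.Name, p.Type)
		}
		if p.Weight == 0 {
			p.Weight = 1
		}
	}
	return &cfg, nil
}
```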

4.2 Health‑Sorted Fail‑over

The gateway maintains a health cache for each provider, refreshed at configurable intervals (e.g., every 30 s). A provider is considered healthy if:

  1. The health‑check endpoint returns HTTP 200.
  2. The provider’s response time is below a configurable threshold (e.g., 300 ms).
  3. No critical error flags (e.g., rate limiting, internal server error) are reported.

When a request arrives, the gateway selects the top‑ranked provider that supports the requested model. Ranking follows:

  1. Health status – Only healthy providers are considered.
  2. Weight – Providers can be weighted to bias traffic.
  3. Latency – Historical latency data can influence selection.

If the chosen provider fails during the request (e.g., network error or 5xx response), the gateway retries with the next ranked provider, up to a maximum retry count. As long as at least one healthy provider serves the requested model, clients keep getting responses even while another provider is experiencing an outage.
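
The selection and retry logic described above might be sketched as follows. This is a simplified reading, not the gateway's implementation: ProviderState, rank, and sendWithFailover are hypothetical names, and weight is treated as a strict sort key (higher first, ties broken by lower latency), whereas a real balancer might sample providers in proportion to their weight.

```go
package main

import (
	"errors"
	"sort"
	"time"
)

// ProviderState is a cached health snapshot for one upstream provider.
type ProviderState struct {
	Name    string
	Models  []string
	Healthy bool
	Weight  int
	Latency time.Duration // recent average response time
}

func supports(p ProviderState, model string) bool {
	for _, m := range p.Models {
		if m == model {
			return true
		}
	}
	return false
}

// rank returns the healthy providers that serve model,
// ordered by weight (descending), then latency (ascending).
func rank(providers []ProviderState, model string) []ProviderState {
	var out []ProviderState
	for _, p := range providers {
		if p.Healthy && supports(p, model) {
			out = append(out, p)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Weight != out[j].Weight {
			return out[i].Weight > out[j].Weight
		}
		return out[i].Latency < out[j].Latency
	})
	return out
}

var errNoProvider = errors.New("no healthy provider for model")

// sendWithFailover tries ranked providers in order until one succeeds or
// maxAttempts is reached. It returns the name of the provider that served
// the request.
func sendWithFailover(providers []ProviderState, model string, maxAttempts int,
	send func(ProviderState) error) (string, error) {
	ranked := rank(providers, model)
	if len(ranked) == 0 {
		return "", errNoProvider
	}
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for i, p := range ranked {
		if i >= maxAttempts {
			break
		}
		if lastErr = send(p); lastErr == nil {
			return p.Name, nil
		}
	}
	return "", lastErr
}
```

The send callback stands in for the actual upstream HTTP call; it should return an error for network failures and 5xx responses so that the loop moves on to the next provider.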


5 – Multi‑Key Rotation

5.1 Why Rotate Keys?

LLM providers typically enforce rate limits per API key (requests and tokens per minute, and sometimes concurrent requests) to protect against abuse and to provide fair usage. Rotating keys helps:

  • Distribute load across multiple keys to avoid hitting per-key rate limits.
  • Improve performance by steering traffic away from keys that are currently being throttled.
  • Enhance security – If a key is compromised, it can be removed from the pool and replaced while the remaining keys keep serving traffic.

5.2 Rotation Algorithm

The gateway implements a least-used rotation with per-key usage counters, which behaves like round-robin when all keys start from the same count:

  1. Each provider has a pool of API keys.
  2. For each request, the gateway selects the key with the lowest usage counter.
  3. After the request is sent, the counter is incremented.
  4. When a key's counter reaches a configurable max usage threshold (e.g., 10,000 requests), that counter is reset to 0.

Optional: A token bucket algorithm can be used to enforce stricter rate limits (e.g., 1,000 requests per minute). The gateway's key-rotation logic can integrate with this to ensure no key exceeds the provider's quota.
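
A token bucket of the kind mentioned here could look like the sketch below, with one bucket per API key. The TokenBucket type and its methods are hypothetical names chosen for illustration; the article does not show how the gateway implements this.

```go
package main

import (
	"sync"
	"time"
)

// TokenBucket allows bursts of up to capacity requests and refills at a
// steady rate. One bucket per API key keeps that key under its quota.
type TokenBucket struct {
	mu       sync.Mutex
	capacity float64
	tokens   float64
	rate     float64 // tokens added per second
	last     time.Time
}

// NewTokenBucket starts with a full bucket that refills at perMinute
// tokens per minute.
func NewTokenBucket(capacity, perMinute int) *TokenBucket {
	return &TokenBucket{
		capacity: float64(capacity),
		tokens:   float64(capacity),
		rate:     float64(perMinute) / 60.0,
		last:     time.Now(),
	}
}

// Allow reports whether a request may be sent now, consuming one token.
func (b *TokenBucket) Allow() bool {
	return b.allowAt(time.Now())
}

// allowAt is Allow with an explicit clock, which makes testing easier.
func (b *TokenBucket) allowAt(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	// Refill in proportion to the time elapsed since the last call.
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

In a rotation loop, a key whose bucket returns false from Allow would simply be skipped in favor of the next least-used key.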

5.3 Implementation Snippet

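package main

import "sync"

// Key pairs an API key with the number of requests sent using it.
type Key struct {
    Key     string
    Counter int
}

// Provider holds the key pool for one upstream; mu guards Keys.
type Provider struct {
    Keys []Key
    mu   sync.Mutex
}

// pickKey returns the least-used key and increments its counter.
// It returns "" when the pool is empty (e.g., a keyless Ollama provider).
func (p *Provider) pickKey() string {
    p.mu.Lock()
    defer p.mu.Unlock()
    if len(p.Keys) == 0 {
        return ""
    }
    // Track the index rather than a copy of the struct, so that the
    // increment updates the key stored in the pool.
    min := 0
    for i := range p.Keys {
        if p.Keys[i].Counter < p.Keys[min].Counter {
            min = i
        }
    }
    p.Keys[min].Counter++
    return p.Keys[min].Key
}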

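The gateway then injects the selected key into the upstream request. OpenAI-compatible providers expect an Authorization: Bearer <key> header, whereas Anthropic expects the key in an x-api-key header.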
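
A small helper can keep this per-provider header logic in one place. The sketch below is an assumption about how that might be structured (authHeaders is a hypothetical name); the header names themselves follow the OpenAI and Anthropic API conventions, and the anthropic-version value shown is one published version string that a deployment would pin as needed.

```go
package main

import "net/http"

// authHeaders returns the credential headers for an upstream request,
// based on the provider type from the configuration.
func authHeaders(providerType, key string) http.Header {
	h := http.Header{}
	switch providerType {
	case "anthropic":
		// Anthropic expects the key in x-api-key plus a version header.
		h.Set("x-api-key", key)
		h.Set("anthropic-version", "2023-06-01")
	case "openai":
		h.Set("Authorization", "Bearer "+key)
	case "ollama":
		// A local Ollama instance usually needs no key; send one only
		// if configured (e.g., when it sits behind an authenticating proxy).
		if key != "" {
			h.Set("Authorization", "Bearer "+key)
		}
	}
	return h
}
```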


6 – Per‑Key Model Allow‑list

In many organizations, different teams or services are authorized to use specific models (e.g., production teams may only use gpt-4o, whereas research labs may experiment with gpt-4o-mini). The gateway can enforce this policy at key level:

  • Key Policy – Each API key in the configuration is associated with a list of allowed models.
  • Runtime Validation – When a request is received, the gateway checks that the requested model is in the allowed list for the key that will be used.
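  • Error Handling – If the model is disallowed, the gateway returns a 403 Forbidden with a clear message (“Model <model> is not allowed for API key sk-xxxx1”). In practice, the message should identify the key by an alias or masked form rather than echoing the secret itself.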

Example Configuration

providers:
  - name: openai-prod
    type: openai
    url: https://api.openai.com/v1
    api_keys:
      - key: "sk-xxxx1"
        models:
          - gpt-4o
          - gpt-4o-mini
      - key: "sk-xxxx2"
        models:
          - gpt-4o
    ...

This granularity allows auditing and compliance: only keys with the necessary authorizations are used for specific models.
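
The runtime validation step could be sketched roughly as below. KeyPolicy, PolicyError, checkModel, and allowedKeys are hypothetical names, and the Alias field is an added assumption so that error messages can name a key without revealing it.

```go
package main

import (
	"fmt"
	"net/http"
)

// KeyPolicy ties an API key to the models it may be used with.
type KeyPolicy struct {
	Alias  string // non-secret label used in errors and logs (assumed field)
	Key    string
	Models []string
}

// PolicyError carries the HTTP status the gateway should return.
type PolicyError struct {
	Status  int
	Message string
}

func (e *PolicyError) Error() string { return e.Message }

// checkModel returns nil if the key may use the model, or a 403 PolicyError.
func checkModel(k KeyPolicy, model string) error {
	for _, m := range k.Models {
		if m == model {
			return nil
		}
	}
	return &PolicyError{
		Status:  http.StatusForbidden,
		Message: fmt.Sprintf("Model %s is not allowed for API key %s", model, k.Alias),
	}
}

// allowedKeys filters a pool down to the keys permitted for the model,
// so that rotation only ever picks an authorized key.
func allowedKeys(pool []KeyPolicy, model string) []KeyPolicy {
	var out []KeyPolicy
	for _, k := range pool {
		if checkModel(k, model) == nil {
			out = append(out, k)
		}
	}
	return out
}
```

Filtering the pool before rotation, rather than rejecting after a key has been picked, avoids returning a 403 when another key in the same pool would have been allowed.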


7 – Deployment Scenarios

7.1 On‑Premise / Private Cloud

The gateway can be deployed on any container‑friendly infrastructure: Docker, Kubernetes, Vagrant, or even bare metal. A typical deployment might look like:

docker run -d --name llm-gateway \
  -p 8080:8080 \
  -v /etc/llm-gateway/config.yaml:/app/config.yaml \
  ghcr.io/example/llm-gateway:latest

The gateway's configuration file is mounted into the container from the host, so new provider lists or rotated keys can be rolled out without rebuilding the image; combined with a config reload or a rolling restart, this avoids downtime.

7.2 Hybrid Cloud

Some providers (e.g., OpenAI, Anthropic) are public cloud services; others (e.g., Ollama) can be run in a private data center. The gateway bridges these worlds, enabling a hybrid strategy where sensitive workloads stay on‑prem, and general‑purpose workloads use the cloud.

7.3 Zero‑Downtime Upgrades

Because the gateway is stateless and uses a declarative config, upgrades are simple:

  1. Pull a new image or rebuild the binary.
  2. Reload the config or use a rolling update in Kubernetes.
  3. Existing instances continue to serve traffic until their replacements are ready.

8 – Use Cases

| Use Case | How the Gateway Helps |
|----------|-----------------------|
| Multi-Team LLM Access | Teams can be assigned different keys and model allow-lists, all managed centrally. |
| Cost Optimization | The gateway can route requests to the cheapest provider or to a local Ollama instance when budgets are tight. |
| Compliance & Data Residency | Sensitive data can be processed only on-prem, while other workloads use public APIs. |
| Fail-over During Outages | Automatic health-sorted routing prevents downtime if OpenAI or Anthropic experiences incidents. |
| Experimentation | Developers can toggle between models (e.g., GPT-4o vs Claude 3.5) without changing code. |
| Security | Provider API keys are never exposed to client applications; the gateway attaches them to upstream requests itself. |
| Analytics | The gateway can collect per-key, per-model metrics (latency, cost, usage) and expose them via Prometheus. |


9 – Performance Considerations

| Concern | Mitigation |
|---------|------------|
| Latency Added by Gateway | The gateway introduces a single extra network hop. Because health status and key pools are cached in memory, routing a request needs no additional network calls. The added latency is reported as under 10 ms, which is small compared to provider round-trip times. |
| Concurrency Limits | The gateway can be scaled horizontally (multiple instances behind a load balancer). Since each instance is stateless, scaling is straightforward. |
| Key Rotation Overhead | Key selection is a short scan over a provider's key pool, which is effectively constant time for small pools. It does not require external coordination. |
| Health-Check Load | Health checks are performed asynchronously and can be throttled (e.g., one request per provider every 30 s). They add minimal load to upstream services. |
| Error Propagation | The gateway normalizes error responses into a uniform format, simplifying client handling. |
| Back-pressure | If all providers are busy, the gateway can apply a wait-or-retry policy or return a 429 Too Many Requests. |
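
The back-pressure row can be illustrated with a simple concurrency limiter. This is a minimal sketch under the assumption that "busy" means a fixed number of in-flight requests; Limiter and its methods are hypothetical names, and the one-second Retry-After value is arbitrary.

```go
package main

import "net/http"

// Limiter caps the number of in-flight requests using a buffered channel
// as a counting semaphore.
type Limiter struct {
	slots chan struct{}
}

// NewLimiter allows up to n concurrent requests.
func NewLimiter(n int) *Limiter {
	return &Limiter{slots: make(chan struct{}, n)}
}

// TryAcquire takes a slot without blocking and reports whether it succeeded.
func (l *Limiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot taken by a successful TryAcquire.
func (l *Limiter) Release() { <-l.slots }

// Wrap rejects requests with 429 Too Many Requests when no slot is free.
func (l *Limiter) Wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.TryAcquire() {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "all upstream slots busy", http.StatusTooManyRequests)
			return
		}
		defer l.Release()
		next.ServeHTTP(w, r)
	})
}
```

Rejecting immediately keeps gateway memory bounded; a wait-or-retry policy would instead block on the channel with a timeout.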


10 – Security & Compliance

10.1 Credential Management

  • API keys are stored encrypted at rest (e.g., using HashiCorp Vault, or Kubernetes Secrets with encryption at rest enabled).
  • The gateway never logs or echoes API keys.
  • Each request is authenticated using a short‑lived JWT that contains the client’s identity and the allowed provider list.

10.2 Data‑Privacy Guarantees

Because the gateway can be hosted on‑prem, it can enforce that no data leaves the local network unless explicitly allowed. This is crucial for regulated industries (finance, healthcare) that must keep data within a defined jurisdiction.

10.3 Audit Trails

Metadata for every request and response is logged (prompt and completion content is excluded for privacy), including the timestamp, provider, model, and an identifier for the key used (never the key itself). These logs can be fed into SIEM solutions for monitoring and compliance audits.
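
One way to record which key was used without logging the secret is to store a short fingerprint of it. The sketch below assumes a JSON-lines log format; AuditEntry, keyID, and the field names are illustrative choices, not the gateway's actual log schema.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"time"
)

// AuditEntry is one log record. It holds request metadata only:
// no prompt or completion text and no raw API key.
type AuditEntry struct {
	Time      time.Time `json:"time"`
	Provider  string    `json:"provider"`
	Model     string    `json:"model"`
	KeyID     string    `json:"key_id"`
	Status    int       `json:"status"`
	LatencyMS int64     `json:"latency_ms"`
}

// keyID derives a short, non-reversible fingerprint (8 hex characters)
// from an API key so that log lines can be correlated per key.
func keyID(apiKey string) string {
	sum := sha256.Sum256([]byte(apiKey))
	return hex.EncodeToString(sum[:4])
}

// newAuditEntry builds a record for a completed upstream call.
func newAuditEntry(provider, model, apiKey string, status int, latency time.Duration) AuditEntry {
	return AuditEntry{
		Time:      time.Now().UTC(),
		Provider:  provider,
		Model:     model,
		KeyID:     keyID(apiKey),
		Status:    status,
		LatencyMS: latency.Milliseconds(),
	}
}

// JSONLine serializes the entry as a single JSON object for a log shipper.
func (e AuditEntry) JSONLine() ([]byte, error) {
	return json.Marshal(e)
}
```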


11 – Extensibility

The gateway is built with a modular plugin architecture:

  • Protocol Plugins – Adding a new provider (e.g., Gemini, Cohere) involves writing a small adapter that implements Request() and Response() methods.
  • Health‑Check Plugins – Custom health checks (e.g., GPU availability for on‑prem providers) can be plugged in.
  • Metrics Exporters – Prometheus, Grafana, or Datadog exporters can be added for observability.
  • Policy Engines – Integrating with Open Policy Agent (OPA) allows fine‑grained access control beyond per‑key model allow‑lists.
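
The article only names the Request() and Response() methods, so the signatures below are assumptions made for illustration: an adapter turns a provider-neutral request into an upstream path and body, and turns the upstream body back into a neutral response. GatewayRequest, GatewayResponse, and the adapters registry are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GatewayRequest and GatewayResponse are provider-neutral shapes
// (hypothetical, reduced to the minimum for this sketch).
type GatewayRequest struct {
	Model  string
	Prompt string
}

type GatewayResponse struct {
	Text string
}

// Adapter is what a protocol plugin would implement.
type Adapter interface {
	// Request builds the upstream path and JSON body.
	Request(in GatewayRequest) (path string, body []byte, err error)
	// Response decodes the upstream JSON body.
	Response(body []byte) (GatewayResponse, error)
}

// ollamaAdapter targets Ollama's non-streaming /api/generate endpoint.
type ollamaAdapter struct{}

func (ollamaAdapter) Request(in GatewayRequest) (string, []byte, error) {
	body, err := json.Marshal(struct {
		Model  string `json:"model"`
		Prompt string `json:"prompt"`
		Stream bool   `json:"stream"`
	}{in.Model, in.Prompt, false})
	return "/api/generate", body, err
}

func (ollamaAdapter) Response(body []byte) (GatewayResponse, error) {
	var out struct {
		Response string `json:"response"`
	}
	if err := json.Unmarshal(body, &out); err != nil {
		return GatewayResponse{}, fmt.Errorf("decode ollama response: %w", err)
	}
	return GatewayResponse{Text: out.Response}, nil
}

// adapters maps a provider type from the config to its plugin.
var adapters = map[string]Adapter{"ollama": ollamaAdapter{}}

// register adds a new protocol plugin (e.g., for Gemini or Cohere).
func register(providerType string, a Adapter) { adapters[providerType] = a }
```

Supporting a new provider would then amount to writing one more type that satisfies Adapter and calling register with its provider type.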

12 – Competitive Landscape

| Feature | LLM Gateway (ours) | Commercial Alternatives |
|---------|--------------------|-------------------------|
| Unified Endpoint | ✔︎ | ✔︎ |
| Protocol Translation | ✔︎ | ❌ (mostly OpenAI only) |
| Multi-Provider Support | ✔︎ (up to 8 providers) | Limited |
| Health-Sorted Fail-over | ✔︎ | ❌ |
| Key Rotation | ✔︎ | ❌ |
| Per-Key Model Allow-list | ✔︎ | ❌ |
| On-Prem Deployment | ✔︎ | ❌ (cloud-only) |
| Open Source | ✔︎ | ❌ |

While there are managed services that provide LLM routing, most are cloud‑only and lack the fine‑grained control that enterprises need. The gateway’s open‑source nature and protocol‑agnostic design fill a niche for companies that require both flexibility and compliance.


13 – Potential Challenges

| Challenge | Impact | Mitigation |
|-----------|--------|------------|
| Provider API Changes | Requires updates to the gateway's adapters. | Use versioned adapters and community contributions. |
| Key Leakage | Exposure of provider credentials. | Store keys securely, rotate regularly, monitor for anomalies. |
| Latency Variability | Upstream providers may have variable response times. | Use weighted load-balancing and latency-based routing. |
| Cost Management | Uncontrolled usage may lead to unexpected bills. | Implement per-key spending limits and alerts. |
| Compliance Audits | Misconfigurations could violate regulations. | Enforce policy checks and automated compliance scans. |


14 – Future Directions

  • Auto‑Scaling – Dynamically spin up new gateway instances based on request volume.
  • AI‑Powered Routing – Use machine learning to predict the best provider based on historical performance.
  • Cost‑Based Optimization – Integrate provider pricing APIs to route requests to the cheapest option in real time.
  • Fine‑Grained Quotas – Enforce per‑team, per‑user limits in addition to per‑key limits.
  • Enhanced Observability – Full tracing (OpenTelemetry) of end‑to‑end LLM request flows.

15 – Conclusion

The Local LLM‑Aware Balancer / Gateway delivers a compelling solution for developers and organizations that rely on multiple LLM providers. By offering a single, unified endpoint while supporting Ollama, OpenAI, and Anthropic protocols, the gateway abstracts away the complexities of provider integration.

Its health‑sorted fail‑over, multi‑key rotation, and per‑key model allow‑lists provide resilience, security, and policy enforcement that are hard to achieve with a purely cloud‑based approach. Coupled with on‑prem deployment options and a modular architecture, the gateway positions itself as a foundational component for any modern LLM‑driven application stack.

Whether you’re a startup experimenting with the latest model, an enterprise enforcing data residency, or a regulated organization needing strict access controls, the gateway equips you with the flexibility to choose the right provider for the right request—all while keeping your codebase simple and your operations transparent.


By Tornado