jolteon: An LLM Routing Proxy for Discourse

Jun 27, 2026

6 min read

Two weeks ago, someone posted an internal proposal:

My dream for the future is a good proxy that automatically sends requests to the most appropriate model, allowing us to run a heterogeneous AI inference, and have it routed to our customers automatically.

The concrete version of the problem: we'd deployed DeepSeek V4 Flash to a server in one region. It's good for long-form generation - periodic reports, data explorer summaries, dashboard highlights. But it's a single instance and has no vision support. Meanwhile Qwen 3 handles title suggestions and translation just fine on smaller, faster hardware. The gap between "we have these models" and "the right request reaches the right model" was just config - but that config was scattered across individual Discourse AI setups, with no central health checking or fallback logic.

The proposal asked for a centralized smart router that could:

- receive all LLM inference requests from hosted customers
- centralize health checking across all deployments
- route each request to the most appropriate model using context from Discourse - feature name, hosting tier, daily AI credit usage

jolteon is that proxy. It shipped yesterday. Here's how it works.

Built with Fable

The proposal landed two weeks ago. What made it possible to ship this fast was Fable, an experimental Claude model variant that Anthropic ran for a limited period. I used it to write the entire proxy in a few hours.

Fable was faster than Opus and reasoned differently - more willing to work through intermediate steps explicitly. For a project like this, where you're sketching out proxy architecture, async Rust, streaming SSE semantics, and per-pool config schemas all at once, that mattered. It could hold the whole design in mind and produce working code rather than plausible-looking scaffolding.

The reasoning adapter work benefited most. That's where jolteon rewrites OpenAI-style reasoning_effort fields into vLLM-specific knobs before forwarding to Qwen3 and DeepSeek backends, and it involved a lot of "here is the vLLM template behavior, here is what the client sends, figure out the right rewrite" prompts. Fable worked through the token budget math and template key conflicts step by step.

Fable is no longer available, but the result stands: two weeks from proposal to production.

The routing problem

For jolteon to route intelligently, Discourse needed to tell it something about each request. We added a set of headers that discourse-ai attaches when sending inference requests - feature name, hosting tier, whether the site is on a trial, quota usage, and whether the payload contains images:

X-Discourse-AI-Feature: ai_bot
X-Discourse-AI-Tier: standard
X-Discourse-AI-Trial: 1
X-Discourse-AI-Quota-Used: 87
X-Discourse-AI-Vision: 1
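
As a rough illustration of how a proxy might turn these headers into a routing context, here is a minimal sketch. The `RequestContext` struct and `parse_context` function are hypothetical names, not jolteon's actual code, and it assumes that missing or malformed headers fall back to neutral defaults and that the quota value is a percentage.

```rust
/// Routing context carried by the X-Discourse-AI-* request headers.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct RequestContext {
    pub feature: Option<String>,
    pub tier: Option<String>,
    pub trial: bool,
    pub quota_used: u32, // assumed to be a percentage of the quota
    pub vision: bool,
}

/// Build a context from (name, value) header pairs.
/// Assumption: bad or missing values become defaults instead of errors.
pub fn parse_context(headers: &[(&str, &str)]) -> RequestContext {
    let mut ctx = RequestContext::default();
    for (name, value) in headers {
        let value = value.trim();
        // HTTP header names are case-insensitive, so compare in lowercase.
        match name.to_ascii_lowercase().as_str() {
            "x-discourse-ai-feature" => ctx.feature = Some(value.to_string()),
            "x-discourse-ai-tier" => ctx.tier = Some(value.to_string()),
            "x-discourse-ai-trial" => ctx.trial = value == "1",
            "x-discourse-ai-quota-used" => ctx.quota_used = value.parse().unwrap_or(0),
            "x-discourse-ai-vision" => ctx.vision = value == "1",
            _ => {} // unrelated headers are ignored
        }
    }
    ctx
}
```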

The routing logic combines these to land on a quality class: fast, balanced, or frontier. Then it picks a backend pool from a preference ladder for that class.

classes:
  fast:
    ladder:
      - { qwen-small: 70, deepseek-flash: 30 }
  balanced:
    ladder:
      - { deepseek-flash: 80, qwen-large: 20 }
      - { qwen-small: 100 }          # fallback
  frontier:
    ladder:
      - { qwen-large: 100 }
      - { deepseek-flash: 100 }      # fallback
      - { qwen-small: 100 }          # fallback

Some features get a base class in config. ai_bot is frontier. translate is fast with a hard cap so it can't escalate even if tier would normally push it higher. Paid tiers get a +1 modifier. Trials get -1. Over 80% quota gets -1. Over 150% gets -2. The modifiers stack, the result gets clamped, the cap gets applied.
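
The modifier math can be sketched as a small pure function. This is an illustration under stated assumptions, not jolteon's implementation: `Class`, `FeaturePolicy`, and `resolve_class` are hypothetical names, the classes are assumed to be ordered fast < balanced < frontier, and the two quota thresholds are assumed not to be cumulative (over 150% applies -2 in total, not -3).

```rust
/// Quality classes, assumed to be ordered fast < balanced < frontier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Class {
    Fast = 0,
    Balanced = 1,
    Frontier = 2,
}

impl Class {
    /// Clamp an arbitrary level into the valid range and map it to a class.
    fn from_level(level: i32) -> Class {
        match level.clamp(0, 2) {
            0 => Class::Fast,
            1 => Class::Balanced,
            _ => Class::Frontier,
        }
    }
}

/// Per-feature settings (hypothetical shape): a base class and an optional hard cap.
pub struct FeaturePolicy {
    pub base: Class,
    pub cap: Option<Class>,
}

/// Stack the modifiers, clamp the result, then apply the cap.
pub fn resolve_class(policy: &FeaturePolicy, paid: bool, trial: bool, quota_used: u32) -> Class {
    let mut level = policy.base as i32;
    if paid {
        level += 1;
    }
    if trial {
        level -= 1;
    }
    // Assumption: the quota penalties are alternatives, not additive.
    if quota_used > 150 {
        level -= 2;
    } else if quota_used > 80 {
        level -= 1;
    }
    let class = Class::from_level(level);
    match policy.cap {
        Some(cap) => class.min(cap),
        None => class,
    }
}
```

Under these assumptions, an ai_bot request from a trial site at 87% quota drops from frontier to fast (2 - 1 - 1 = 0), while a capped translate request stays at fast even on a paid tier.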

Within a pool, backends are selected by least connections. Vision requests drop non-vision pools from the entire ladder. If a backend fails before the first response byte, jolteon retries the next backend, then the next ladder step. Once streaming starts, retries aren't possible.
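
One way to picture the ladder walk is as a function that produces an ordered list of attempts from the policy and a snapshot of backend state. The sketch below is a guess at the shape, not the real code: `Backend`, `Pool`, `Step`, and `plan_attempts` are hypothetical names, the caller supplies `roll` (a random number) so the weighted choice stays deterministic and testable, and it assumes only the one chosen pool per ladder step is tried before moving down.

```rust
use std::collections::HashMap;

pub struct Backend {
    pub name: &'static str,
    pub healthy: bool,
    pub in_flight: u32,
}

pub struct Pool {
    pub vision: bool,
    pub backends: Vec<Backend>,
}

/// One ladder step: (pool name, weight) pairs.
pub type Step = Vec<(&'static str, u32)>;

/// Return (pool, backend) pairs in the order they would be attempted.
pub fn plan_attempts(
    ladder: &[Step],
    pools: &HashMap<&'static str, Pool>,
    needs_vision: bool,
    roll: u32,
) -> Vec<(&'static str, &'static str)> {
    let mut attempts = Vec::new();
    for step in ladder {
        // Vision requests drop non-vision pools; unknown pools are skipped.
        let eligible: Vec<(&'static str, u32, &Pool)> = step
            .iter()
            .filter_map(|&(name, weight)| pools.get(name).map(|p| (name, weight, p)))
            .filter(|&(_, weight, pool)| weight > 0 && (!needs_vision || pool.vision))
            .collect();
        let total: u32 = eligible.iter().map(|&(_, w, _)| w).sum();
        if total == 0 {
            continue; // nothing usable on this step, fall to the next one
        }
        // Weighted choice among the remaining pools of this step.
        let mut point = roll % total;
        let mut chosen = eligible[0];
        for &entry in &eligible {
            if point < entry.1 {
                chosen = entry;
                break;
            }
            point -= entry.1;
        }
        let (pool_name, _, pool) = chosen;
        // Least connections: healthy backends ordered by in-flight count.
        let mut healthy: Vec<&Backend> = pool.backends.iter().filter(|b| b.healthy).collect();
        healthy.sort_by_key(|b| b.in_flight);
        attempts.extend(healthy.into_iter().map(|b| (pool_name, b.name)));
    }
    attempts
}
```

Because the function only reads the policy and a state snapshot, it can be tested without any network or async runtime. The retry loop would then walk this list until a backend produces its first response byte.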

Why Rust

I wanted a single stateless binary with no runtime dependencies. Rust handles the async proxying well - SSE streams run up to 300 seconds and the proxy can't touch a single byte of the response body. The request body needs to be buffered (so failed attempts can be replayed), but the response must stream through unchanged.

Rust's type system was also useful for the routing logic. The feature-to-class mapping, the modifier math, the ladder walk - all of it is a pure function over the policy and a snapshot of backend state. It's straightforward to test.

Reasoning parameter adaptation

This is the weirdest part of jolteon and the part that took the most iteration.

Discourse is configured as an OpenAI Chat Completions client. It sends reasoning_effort: "medium" when the user has reasoning enabled. But vLLM doesn't implement that field the same way across models. Qwen3 needs thinking_token_budget and chat_template_kwargs.enable_thinking. DeepSeek V4 needs chat_template_kwargs.thinking and its own reasoning_effort inside that object.

jolteon handles this with per-pool reasoning adapters:

pools:
  qwen-small:
    reasoning:
      adapter: qwen3
      forward_reasoning_effort: true
      effort_budgets:
        low: 512
        medium: 2048
        high: 8192

The adapter fires per attempt, not once per request. If a request fails over from qwen-large to deepseek-flash, jolteon rewrites the body again using the DeepSeek adapter before the second attempt. The client sent one request with reasoning_effort: medium. Each backend got the right vLLM fields for its template.
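
A per-attempt rewrite along these lines could look like the sketch below. It is a simplified illustration, not jolteon's adapter code: the `Json` enum stands in for a real JSON library, `Adapter`, `ReasoningConfig`, and `adapt` are hypothetical names, and two details are assumptions: that enabling thinking means setting the template flag to `true`, and that `forward_reasoning_effort: false` means stripping the top-level field.

```rust
use std::collections::BTreeMap;

/// Tiny stand-in for a JSON value, just enough for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Bool(bool),
    Num(u32),
    Str(String),
    Obj(BTreeMap<String, Json>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Adapter {
    Qwen3,
    DeepSeek,
}

pub struct ReasoningConfig {
    pub adapter: Adapter,
    pub forward_reasoning_effort: bool,
    pub effort_budgets: BTreeMap<String, u32>,
}

/// Rewrite a copy of the buffered request body for one attempt.
/// The original is left untouched so a failover can adapt it again.
pub fn adapt(original: &BTreeMap<String, Json>, cfg: &ReasoningConfig) -> BTreeMap<String, Json> {
    let effort = match original.get("reasoning_effort") {
        Some(Json::Str(e)) => e.clone(),
        _ => return original.clone(), // no reasoning requested, nothing to rewrite
    };
    let mut body = original.clone();
    if !cfg.forward_reasoning_effort {
        body.remove("reasoning_effort");
    }
    // Preserve any template kwargs the client already sent.
    let mut kwargs = match body.remove("chat_template_kwargs") {
        Some(Json::Obj(existing)) => existing,
        _ => BTreeMap::new(),
    };
    match cfg.adapter {
        Adapter::Qwen3 => {
            kwargs.insert("enable_thinking".to_string(), Json::Bool(true));
            if let Some(budget) = cfg.effort_budgets.get(&effort) {
                body.insert("thinking_token_budget".to_string(), Json::Num(*budget));
            }
        }
        Adapter::DeepSeek => {
            kwargs.insert("thinking".to_string(), Json::Bool(true));
            kwargs.insert("reasoning_effort".to_string(), Json::Str(effort));
        }
    }
    body.insert("chat_template_kwargs".to_string(), Json::Obj(kwargs));
    body
}
```

Taking the original body by reference and returning a fresh copy is what makes the failover case work: the second attempt starts from what the client sent, not from the first adapter's output.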

Hot reload

The policy is a YAML file. Reload it without dropping connections:

kill -HUP $(pidof jolteon)

Backend health and in-flight counts survive the reload. The only thing that requires a restart is changing the listen address. In practice, tuning weights or adding a backend is a YAML edit, a commit, and a SIGHUP. No downtime.
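
A common way to get this behavior in Rust is to keep the parsed policy behind a swappable `Arc`, so in-flight requests keep the snapshot they started with while new requests see the new one. The sketch below shows that general pattern, not jolteon's internals: `Policy` and `PolicyHandle` are hypothetical names, the signal handling and YAML parsing are left out, and keeping the old policy when a reload fails to parse is an assumption.

```rust
use std::sync::{Arc, RwLock};

/// Placeholder for the parsed YAML policy (hypothetical shape).
#[derive(Debug)]
pub struct Policy {
    pub version: u64,
}

/// Shared handle that request handlers read and the reload path writes.
pub struct PolicyHandle {
    current: RwLock<Arc<Policy>>,
}

impl PolicyHandle {
    pub fn new(policy: Policy) -> Self {
        Self { current: RwLock::new(Arc::new(policy)) }
    }

    /// Cheap snapshot taken once per request; it stays valid across reloads.
    pub fn snapshot(&self) -> Arc<Policy> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Swap in a freshly parsed policy.
    /// Assumption: a candidate that failed to parse leaves the old policy active.
    pub fn reload(&self, candidate: Result<Policy, String>) -> Result<(), String> {
        let policy = candidate?;
        *self.current.write().unwrap() = Arc::new(policy);
        Ok(())
    }
}
```

Health and in-flight counters would live outside `Policy` in this design, which is one way they could survive a reload untouched.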

The --check flag validates the policy without starting the proxy:

jolteon --check routing-policy.yml

Unknown fields are rejected, so misspelled keys fail fast rather than silently doing nothing. That check is wired into our deploy pipeline.

What it looks like in production

Every request produces one structured JSON log line:

{
  "site": "example.com",
  "feature": "ai_bot",
  "tier": "standard",
  "trial": false,
  "quota_used": 0,
  "vision": false,
  "class": "frontier",
  "pool": "qwen-large",
  "backend": "dub-1",
  "attempts": 1,
  "status": 200,
  "ttft_ms": 142,
  "duration_ms": 9214,
  "bytes": 48211
}

ttft_ms is the time from request arrival to the first upstream body byte - the latency number that matters for streaming responses. The line is emitted when the stream ends, including on client disconnect.

Prometheus metrics on /metrics give you histograms for ttft and duration, attempt failure counts by backend and reason, and in-flight gauges. The status endpoint at /jolteon/status shows every pool and backend in JSON.

Open source plans

The proxy is Discourse-specific in one place: the X-Discourse-AI-* header names. Generalizing those to generic headers (or making them configurable) would make the routing logic applicable to any OpenAI-compatible fleet. I'd like to do that and open source it - the ladder routing, vision handling, and reasoning adaptation are useful beyond Discourse.

For now the source is in an internal repo. The architecture, full config schema, and smoke/load/reload/reasoning test scripts are in the README. If you're building something similar and want to compare notes, reach out.

Links

discourse-ai - the Discourse AI plugin jolteon fronts
vLLM - the inference backend
