v0.1 · The control plane for inference engineers

Warp-level control
for inference and self-improving systems

Warply is a Python control plane for disaggregated, self-improving inference. Programmable prefill/decode pools, KV-aware routing, and RL post-training — without the kernel, CRD, or Kubernetes tax. import warply, not YAML.

Start building

serve.py

import warply as wp

# Disaggregated prefill/decode serving on any cloud, in Python.
engine = wp.DisaggEngine(
    model="meta-llama/Llama-3.1-8B",
    prefill=wp.Pool("1xH100", replicas=1),
    decode=wp.Pool("1xH100", replicas=1),
    backend="sglang",
    kv_transfer="nixl",
    cloud="lambda",
)

engine.up()                 # provision + route, no YAML
client = engine.client()    # OpenAI-compatible, falls out for free

client.chat.completions.create(
    model="warply",
    messages=[{"role": "user", "content": "Explain the KV cache."}],
)
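
"OpenAI-compatible" here means the endpoint accepts the standard chat-completions request shape. As a minimal sketch, the same call can be written as a raw HTTP request using only the standard library. The base URL is a placeholder assumption, not an address Warply documents, and `build_chat_request` is a hypothetical helper. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical endpoint: substitute whatever address your deployment exposes.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(BASE_URL, "warply", "Explain the KV cache.")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it. Any client that speaks this request shape should work the same way.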

Composes with the engines you already trust

SGLang · vLLM · TensorRT-LLM · NVIDIA Dynamo · llm-d · NIXL · Ray · SkyPilot · Triton · Hugging Face · LMCache · Mooncake

The thesis

There are plenty of AI researchers and SWEs. There are nowhere near enough inference engineers.

Mechanisms are commoditized

Prefill/decode separation, KV-cache transfer, and attention/FFN splitting now ship in vLLM, SGLang, TensorRT-LLM, and Dynamo. The hard substrate is solved.

The control plane is missing

Driving that substrate still means Dynamo CRDs, Kubernetes YAML, and per-cloud provisioning glue. Researchers become part-time systems engineers.

Warply is that control plane

A clean, programmable Python layer over disagg, KV routing, and rollouts — composable primitives with escape hatches all the way down.

Composable primitives

Real control over disagg, KV cache,
routing, and rollouts.

Every mechanism that production inference teams care about, exposed as a primitive you can read, compose, and override in Python.

Disaggregated by default

Independent prefill and decode pools. Scale each to its own bottleneck instead of over-provisioning one monolith.

engine.scale(decode=2)
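
To make "scale each to its own bottleneck" concrete, here is a rough sizing sketch. It treats prefill load as prompt tokens per second and decode load as generated tokens per second. `replicas_needed` is a hypothetical helper, not part of Warply, and every number is a placeholder rather than a measurement.

```python
import math


def replicas_needed(tokens_per_second: float, per_replica_capacity: float) -> int:
    """Smallest replica count whose combined capacity covers the load."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    return max(1, math.ceil(tokens_per_second / per_replica_capacity))


# Illustrative workload (placeholder numbers, not measurements).
requests_per_second = 4.0
prompt_tokens_per_request = 2000   # consumed by prefill
output_tokens_per_request = 500    # produced by decode

# Placeholder per-replica throughputs; measure these on your own hardware.
prefill_capacity = 10_000.0  # prompt tokens/s one prefill replica can absorb
decode_capacity = 1_500.0    # output tokens/s one decode replica can emit

prefill_replicas = replicas_needed(
    requests_per_second * prompt_tokens_per_request, prefill_capacity
)
decode_replicas = replicas_needed(
    requests_per_second * output_tokens_per_request, decode_capacity
)
print(prefill_replicas, decode_replicas)
```

With these placeholder inputs the decode pool needs a second replica while prefill does not, which is the kind of asymmetry a single monolithic pool cannot express.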

KV-aware routing

Cache-reuse-driven routing across pools, with KV tiering via LMCache and Mooncake. Move the cache, not the compute.

route="kv-reuse"
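
One way to picture cache-reuse-driven routing: send each request to the worker whose cached token prefixes overlap most with the incoming prompt, and break ties by current load. This is a toy sketch under that assumption, not Warply's router. `Worker`, `best_reuse`, and `route` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)


def shared_prefix_len(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_reuse(worker: Worker, tokens: tuple[int, ...]) -> int:
    """Longest cached prefix on this worker that matches the request."""
    return max(
        (shared_prefix_len(p, tokens) for p in worker.cached_prefixes),
        default=0,
    )


def route(workers: list[Worker], tokens: tuple[int, ...]) -> Worker:
    """Prefer the most reusable cache; break ties by the lighter load."""
    return max(workers, key=lambda w: (best_reuse(w, tokens), -w.active_requests))


workers = [
    Worker("a", active_requests=3, cached_prefixes=[(1, 2, 3, 4)]),
    Worker("b", active_requests=1, cached_prefixes=[(1, 2, 9)]),
    Worker("c", active_requests=0),
]
chosen = route(workers, (1, 2, 3, 4, 5))
print(chosen.name)
```

The linear scan keeps the idea visible. A production router would more likely index prefixes with a radix tree or block hashes, and weigh load against reuse rather than using load only as a tie-breaker.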

Cloud-portable

One spec across Lambda, neoclouds, Modal, and local. Chase spot capacity and the cheapest GPUs without a rewrite.

cloud="lambda"
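
Chasing the cheapest available GPU reduces, at its simplest, to filtering a catalog by GPU type and availability and taking the minimum price. The sketch below assumes such a catalog exists. `Offer` and `cheapest_offer` are hypothetical names, and the provider names and prices are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    provider: str
    gpu: str
    hourly_usd: float
    available: bool


def cheapest_offer(offers: list[Offer], gpu: str) -> Optional[Offer]:
    """Return the lowest-priced available offer for the requested GPU, if any."""
    candidates = [o for o in offers if o.gpu == gpu and o.available]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


# Placeholder catalog: provider names and prices are invented for illustration.
offers = [
    Offer("provider-a", "H100", 3.10, available=True),
    Offer("provider-b", "H100", 2.40, available=False),
    Offer("provider-c", "H100", 2.75, available=True),
    Offer("provider-c", "A100", 1.20, available=True),
]
pick = cheapest_offer(offers, "H100")
print(pick.provider if pick else "no capacity")
```

The lowest listed price is skipped here because that offer is unavailable, which is why capacity and cost have to be considered together.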

RL & self-improvement

Rollout generation and gradient updates are separated as first-class stages. Build recursive self-improvement (RSI) flywheels: generate, score, GRPO, update weights, repeat.

wp.SelfImprove(...)
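
Between "score" and "update weights", a GRPO-style loop turns raw rewards into advantages by comparing each completion with the others sampled for the same prompt. The sketch below shows one common formulation of that step and omits the policy update itself. `group_relative_advantages` is a hypothetical helper, not a Warply function, and the rewards are placeholder values.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Scores for four completions sampled from the same prompt (placeholder values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat their group's mean get positive advantages and the rest get negative ones, so no separate value model is needed to provide a baseline.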

Python-first API

Declarative spec plus imperative up / scale / down / client. The SDK is the product — not an afterthought on top of YAML.

wp.DisaggEngine(...)

Escape hatches all the way down

Drop to raw engine flags, custom Triton kernels, or export deployment YAML when you need to. Control, never a black box.

engine.export_yaml()

The developer surface

The whole lifecycle, in a few lines of Python.

From first deploy to RL post-training, Warply stays a library you call — declarative where you want it, imperative where you need it.

serve.py

import warply as wp

engine = wp.DisaggEngine(
    model="meta-llama/Llama-3.1-8B",
    prefill=wp.Pool("1xH100", replicas=1),
    decode=wp.Pool("1xH100", replicas=1),
    backend="sglang",
    kv_transfer="nixl",
    cloud="lambda",
)

engine.up()
client = engine.client()

The open intersection

Portability, disagg intelligence, and RL — behind one API.

The mechanisms are well served. The programmable, cloud-portable, RL-aware control plane over them is the wedge nobody owns yet.

	Warply	SkyPilot	Dynamo	llm-d	Modal
Python-first SDK
Disagg-aware (prefill/decode pools)
KV-aware routing & tiering
Cloud-portable by design
RL / RSI orchestration
No Kubernetes / CRDs required
Open-source core

Roadmap

Open core first. Self-improving systems next.

The control plane and disagg serving are free and open. Advanced RL/RSI and the managed plane are how Warply sustains itself.

Phase 0 · Shipping

Single-cloud disagg serving

DisaggEngine API: up / scale / down / client
Prefill & decode pools, scaled independently
SGLang engine + NIXL KV transfer
OpenAI-compatible client, export_yaml() escape hatch

Phase 1 · Building

Portability + RL anchor

Multicloud provider system with spot/cost awareness
KV-aware routing and tiering (LMCache, Mooncake)
vLLM & TensorRT-LLM adapters
Async RL post-training and RSI primitives

Phase 2 · Exploring

Managed & enterprise

Hosted disagg control plane as a service
SSO, audit logs, and support
Custom kernels; attention/FFN disaggregation (AFD) for MoE on superpod interconnect
Hybrid inference + training memory management

Stop writing YAML.
Start writing Python.

Launch disaggregation-aware inference in a few lines, keep full control of the internals, and move across clouds when you need to.

Read the docs · Star on GitHub
