Warp-level control for inference and self-improving systems
Warply is a Python control plane for disaggregated, self-improving inference. Programmable prefill/decode pools, KV-aware routing, and RL post-training — without the kernel, CRD, or Kubernetes tax. import warply, not YAML.
    import warply as wp

    # Disaggregated prefill/decode serving on any cloud, in Python.
    engine = wp.DisaggEngine(
        model="meta-llama/Llama-3.1-8B",
        prefill=wp.Pool("1xH100", replicas=1),
        decode=wp.Pool("1xH100", replicas=1),
        backend="sglang",
        kv_transfer="nixl",
        cloud="lambda",
    )

    engine.up()                 # provision + route, no YAML
    client = engine.client()    # OpenAI-compatible, falls out for free

    client.chat.completions.create(
        model="warply",
        messages=[{"role": "user", "content": "Explain the KV cache."}],
    )

Composes with the engines you already trust
There are plenty of AI researchers and SWEs. There are nowhere near enough inference engineers.
Mechanisms are commoditized
Prefill/decode separation, KV-cache transfer, and attention/FFN splitting now ship in vLLM, SGLang, TensorRT-LLM, and Dynamo. The hard substrate is solved.
The control plane is missing
Driving that substrate still means Dynamo CRDs, Kubernetes YAML, and per-cloud provisioning glue. Researchers become part-time systems engineers.
Warply is that control plane
A clean, programmable Python layer over disagg, KV routing, and rollouts — composable primitives with escape hatches all the way down.
Real control over disagg, KV cache, routing, and rollouts.
Every mechanism that production inference teams care about, exposed as a primitive you can read, compose, and override in Python.
Disaggregated by default
Independent prefill and decode pools. Scale each to its own bottleneck instead of over-provisioning one monolith.
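As a rough illustration of why the two pools want different sizes, here is a back-of-the-envelope sizing sketch. Every number in it (request rate, token counts, per-replica throughput) is a made-up assumption, and `replicas_needed` is a hypothetical helper, not part of Warply.

```python
import math


def replicas_needed(tokens_per_second, per_replica_tokens_per_second):
    """Smallest replica count whose combined throughput covers the load."""
    return max(1, math.ceil(tokens_per_second / per_replica_tokens_per_second))


# Hypothetical workload: long prompts, short answers.
requests_per_second = 20
prompt_tokens, output_tokens = 2000, 200

# Hypothetical per-replica throughput, measured separately for each phase.
prefill_capacity = 25_000  # prompt tokens/s one prefill replica can process
decode_capacity = 1_500    # output tokens/s one decode replica can generate

prefill_replicas = replicas_needed(requests_per_second * prompt_tokens, prefill_capacity)
decode_replicas = replicas_needed(requests_per_second * output_tokens, decode_capacity)

print(f"prefill={prefill_replicas} decode={decode_replicas}")
```

The two counts come from different loads and different capacities, so a single monolithic replica count would have to cover the larger of the two for both phases.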
    engine.scale(decode=2)

KV-aware routing
Cache-reuse-driven routing across pools, with KV tiering via LMCache and Mooncake. Move the cache, not the compute.
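To make cache-reuse-driven routing concrete, here is a minimal sketch of one common approach: send each request to the worker that already holds the longest matching token prefix, and break ties by load. This is an illustration under stated assumptions, not Warply's router. `Worker`, `reusable_tokens`, and `route` are hypothetical names, and a real router would track cache blocks and evictions rather than whole prompts.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical decode worker that remembers the token prefixes it has cached."""
    name: str
    cached_prefixes: list = field(default_factory=list)
    active_requests: int = 0


def shared_prefix_len(a, b):
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_tokens(worker, prompt):
    """Length of the longest cached prefix on this worker that matches the prompt."""
    return max((shared_prefix_len(p, prompt) for p in worker.cached_prefixes), default=0)


def route(workers, prompt):
    """Pick the worker with the most reusable KV; break ties by lowest load."""
    best = max(workers, key=lambda w: (reusable_tokens(w, prompt), -w.active_requests))
    best.cached_prefixes.append(tuple(prompt))
    best.active_requests += 1
    return best


workers = [Worker("decode-0"), Worker("decode-1")]
system_prompt = [101, 7, 7, 42]  # token ids shared by many requests

first = route(workers, system_prompt + [5, 6])   # cold caches: falls back to load
second = route(workers, system_prompt + [9])     # follows the cached system prompt
```

Here the second request lands on the same worker as the first because four of its tokens are already cached there, even though the other worker is idle. A production policy would weigh reuse against queue depth instead of always preferring reuse.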
    route="kv-reuse"

Cloud-portable
One spec across Lambda, neoclouds, Modal, and local. Chase spot capacity and the cheapest GPUs without a rewrite.
    cloud="lambda"

RL & self-improvement
First-class rollout vs gradient-update separation. Build RSI flywheels: generate, score, GRPO, update weights, repeat.
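The "score, GRPO" step of that loop can be sketched in a few lines. In GRPO, each rollout's reward is normalized against the other rollouts for the same prompt, so no separate value model is needed. The function below is a minimal sketch of that normalization, not Warply's implementation. The name `group_relative_advantages`, the `eps` value, and the use of population standard deviation are assumptions (implementations differ on the last point).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same prompt, scored by a hypothetical reward function.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's mean get a positive advantage and the rest get a negative one. A group where every rollout scores the same contributes zero signal to the gradient update.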
    wp.SelfImprove(...)

Python-first API
Declarative spec plus imperative up / scale / down / client. The SDK is the product — not an afterthought on top of YAML.
    wp.DisaggEngine(...)

Escape hatches all the way down
Drop to raw engine flags, custom Triton kernels, or export deployment YAML when you need to. Control, never a black box.
    engine.export_yaml()

The whole lifecycle, in a few lines of Python.
From first deploy to RL post-training, Warply stays a library you call — declarative where you want it, imperative where you need it.
    import warply as wp

    engine = wp.DisaggEngine(
        model="meta-llama/Llama-3.1-8B",
        prefill=wp.Pool("1xH100", replicas=1),
        decode=wp.Pool("1xH100", replicas=1),
        backend="sglang",
        kv_transfer="nixl",
        cloud="lambda",
    )

    engine.up()
    client = engine.client()

Portability, disagg intelligence, and RL behind one API.
The mechanisms are well served. The programmable, cloud-portable, RL-aware control plane over them is the wedge nobody owns yet.
| Feature | Warply | SkyPilot | Dynamo | llm-d | Modal |
|---|---|---|---|---|---|
| Python-first SDK | | | | | |
| Disagg-aware (prefill/decode pools) | | | | | |
| KV-aware routing & tiering | | | | | |
| Cloud-portable by design | | | | | |
| RL / RSI orchestration | | | | | |
| No Kubernetes / CRDs required | | | | | |
| Open-source core | | | | | |
Open core first. Self-improving systems next.
The control plane and disagg serving are free and open. Advanced RL/RSI and the managed plane are how Warply sustains itself.
Single-cloud disagg serving
- DisaggEngine API: up / scale / down / client
- Prefill & decode pools, scaled independently
- SGLang engine + NIXL KV transfer
- OpenAI-compatible client, export_yaml() escape hatch
Portability + RL anchor
- Multicloud provider system with spot/cost awareness
- KV-aware routing and tiering (LMCache, Mooncake)
- vLLM & TensorRT-LLM adapters
- Async RL post-training and RSI primitives
Managed & enterprise
- Hosted disagg control plane as a service
- SSO, audit logs, and support
- Custom kernels; AFD for MoE on superpod interconnect
- Hybrid inference + training memory management
Stop writing YAML.
Start writing Python.
Launch disaggregation-aware inference in a few lines, keep full control of the internals, and move across clouds when you need to.