In early development · building in the open
Modelplane is the control plane above your inference clusters across cloud, neocloud, and on-premises infrastructure. Platform teams set policy and capacity; developers declare a model and get a serving endpoint. Modelplane continuously reconciles the whole fleet: provisioning, scheduling, autoscaling, routing, and caching. All of it runs entirely under your control.
Any model, any engine, any infrastructure. Modelplane doesn’t replace the inference ecosystem; it sits above the pieces your teams already choose and composes them into a running, self-reconciling fleet.
Modelplane orchestrates:
Models: open weights & custom
Serving: inference engines
Infrastructure: accelerators & providers
Modelplane matches each model’s requirements and serving topology to the available hardware, using expressive CEL selectors and composable API shapes. Because topology is declared as a shape, Modelplane can place anything from a single GPU to multi-node, disaggregated frontier serving, and new parallelism strategies can be described as they emerge.
tensor parallel
Split each layer across GPUs in a node for low-latency single-model serving.
pipeline parallel
Stage a model across nodes so very large models fit beyond a single box.
data / expert
Replicate workers, or shard experts across them for MoE throughput.
prefill / decode
Disaggregate prefill and decode onto separate pools for frontier serving.
+ emerging topology
Described as shape, so future parallelism strategies just work.
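To make "topology as shape" concrete, here is a minimal Python sketch of how parallelism degrees translate into a GPU count. The `TopologyShape` class and its field names are hypothetical illustrations, not Modelplane's API; the arithmetic (tensor-parallel degree times pipeline stages times replicas) is the standard way these strategies compose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyShape:
    """Hypothetical shape: parallelism degrees for one model deployment."""

    tensor_parallel: int = 1    # GPUs that split each layer
    pipeline_parallel: int = 1  # sequential stages the model is cut into
    data_parallel: int = 1      # independent replicas of the whole model

    def gpus_per_replica(self) -> int:
        # One replica needs a full tensor-parallel group for every stage.
        return self.tensor_parallel * self.pipeline_parallel

    def total_gpus(self) -> int:
        return self.gpus_per_replica() * self.data_parallel

    def fits_in_node(self, gpus_per_node: int) -> bool:
        # True when a single replica does not have to span nodes.
        return self.gpus_per_replica() <= gpus_per_node


single_gpu = TopologyShape()
multi_node = TopologyShape(tensor_parallel=8, pipeline_parallel=2, data_parallel=3)
```

With these example values, `multi_node` needs 16 GPUs per replica and 48 in total, so it cannot fit on one 8-GPU node, while `single_gpu` needs exactly one device.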
At its core, Modelplane is a flexible resource model for inference. Each role owns its own resources: developers declare model deployments and expose one service across regions, clouds, and managed vendors, while platform teams declare the fleet of clusters, accelerators, and gateways underneath.
Development & ML teams
Define model deployments: the model, the engine and its configuration, serving topology, hardware request, region, and environment. Then expose them as one service, weighted across regions, clouds, and managed vendors.
kind: ModelService
name: prod-llama
routing: weighted, openai
kind: ModelDeployment
model: llama-4-70b
cluster: aws-us-east
kind: ModelDeployment
model: llama-4-70b
cluster: gcp-eu-west
kind: ModelEndpoint
target: vendor-api
type: managed
Platform teams
Declare the fleet: a gateway over clusters across clouds and regions, each with its own hardware classes and node pools. Set the capacity, accelerators, policy, and cost controls the whole fleet runs within.
kind: InferenceGateway
name: prod-gateway
routes: all endpoints
kind: InferenceCluster
name: aws-us-east
pools: h200, h100
kind: InferenceCluster
name: gcp-eu-west
pools: tpu-v6e, a100
kind: InferenceCluster
name: onprem-dc1
pools: h100, l40s
01 / Provisioning
Provision inference clusters on AWS, GCP, and Azure, or bring your own on any Kubernetes. Each gets hardware classes, node pools, an inference gateway, and the full serving stack, installed and continuously reconciled.
Provisioning
classes: h200-8x, h100-8x · node pools · gateway
✓ GPU operator & drivers
✓ Serving engines
✓ Inference gateway
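"Installed and continuously reconciled" describes a control loop that compares desired state with observed state and acts on the difference. The Python sketch below shows only that diffing step. The component identifiers are hypothetical names based on the checklist above, and `legacy-proxy` is an invented example of drift; none of this is Modelplane's implementation.

```python
def plan_reconcile(desired: set[str], observed: set[str]) -> dict[str, list[str]]:
    """Return the actions that would bring a cluster to its desired state."""
    return {
        "install": sorted(desired - observed),  # declared but missing
        "remove": sorted(observed - desired),   # present but not declared
    }


# Hypothetical component names for one cluster's serving stack.
DESIRED_STACK = {"gpu-operator", "serving-engines", "inference-gateway"}

# Assumed observation: one component installed, one stray component left over.
observed_stack = {"gpu-operator", "legacy-proxy"}

plan = plan_reconcile(DESIRED_STACK, observed_stack)
```

A reconciler would apply the plan and then observe again; once observed state equals desired state, the plan is empty and the loop has nothing to do.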
02 / Scheduling
Modelplane treats every cluster, cloud, and region as one global pool. A fleet scheduler places each model's replicas where its requirements match a cluster's capabilities, then hands off to the cluster's own scheduler and DRA, with support for advanced schedulers like KAI, Kueue, and Volcano.
Two-level scheduling
fleet scheduler
one global pool
tracks requirements
↔ capabilities
places replicas
cluster scheduler
DRA · KAI / Kueue / Volcano
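The fleet-level half of two-level scheduling can be sketched as a matching problem: find a cluster whose free capacity satisfies a model's requirements, reserve it, and leave pod placement to the cluster's own scheduler. The Python below is a simplified first-fit illustration. The class names, the preference-order rule, and the free-device counts are assumptions, and plain Python comparisons stand in for the CEL selectors Modelplane uses.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    free: dict[str, int]  # accelerator class -> free devices (made-up numbers)


@dataclass(frozen=True)
class Requirement:
    accelerators: tuple[str, ...]  # acceptable classes, most preferred first
    devices_per_replica: int


def demo_fleet() -> list[Cluster]:
    # Cluster and pool names follow the examples on this page.
    return [
        Cluster("aws-us-east", {"h200": 8, "h100": 16}),
        Cluster("gcp-eu-west", {"tpu-v6e": 8, "a100": 8}),
        Cluster("onprem-dc1", {"h100": 8, "l40s": 4}),
    ]


def place_replicas(req: Requirement, clusters: list[Cluster], replicas: int) -> list[tuple[str, str]]:
    """First-fit placement: returns (cluster, accelerator) for each placed replica."""
    placements = []
    for _ in range(replicas):
        slot = next(
            ((c, acc) for acc in req.accelerators for c in clusters
             if c.free.get(acc, 0) >= req.devices_per_replica),
            None,
        )
        if slot is None:
            break  # the fleet has no matching capacity left
        cluster, acc = slot
        cluster.free[acc] -= req.devices_per_replica  # reserve the devices
        placements.append((cluster.name, acc))
    return placements


req = Requirement(accelerators=("h200", "h100"), devices_per_replica=8)
placements = place_replicas(req, demo_fleet(), replicas=4)
```

In this example the first replica takes the only free H200 group, the next two use H100s in the same cluster, and the fourth spills over to `onprem-dc1`. A production scheduler would also weigh cost, policy, and region, which this sketch omits.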
03 / Autoscaling
Every model exposes the standard Kubernetes scale subresource, so its replicas scale out across clusters, clouds, and regions, driven by hand or by HPA and KEDA.
Roadmap: scale-to-zero is planned.
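Because each model exposes the scale subresource, an autoscaler only has to compute a replica count and write it. As an illustration, the Python below reproduces the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, including its default 10% tolerance. The function name, the bounds, and the sample metric values are assumptions; how Modelplane distributes the resulting replicas across clusters is not shown.

```python
import math


def desired_replicas(
    current: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,   # scale-to-zero is not assumed here
    max_replicas: int = 10,
    tolerance: float = 0.1,  # Kubernetes HPA default
) -> int:
    """HPA rule: ceil(current * currentMetric / targetMetric), clamped to bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do not scale
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))
```

For example, 4 replicas averaging 90 in-flight requests against a target of 60 scale to 6, while an average of 63 stays within tolerance and leaves the count at 4.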
04 / Routing
A model service is one stable, OpenAI-compatible endpoint over many replicas and model endpoints. Weighted routing spreads traffic across replicas for canary and A/B rollouts, and a managed endpoint can take a weighted share too.
Roadmap: automatic cross-cloud failover is planned.
One service, many endpoints
ModelService · prod-llama
● one OpenAI-compatible endpoint
ModelEndpoint
replica · aws-us-east
ModelEndpoint
replica · gcp-eu-west
ModelEndpoint
managed · vendor
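Weighted routing can be illustrated with smooth weighted round-robin, a well-known selection algorithm (used by nginx, among others) that interleaves endpoints in proportion to their weights. This Python sketch is a stand-in chosen for determinism, not a description of how Modelplane's gateway selects endpoints, and the 6/3/1 weights are invented for the example.

```python
from collections import Counter


def smooth_weighted_round_robin(weights: dict[str, int], n: int) -> list[str]:
    """Pick n endpoints so that each appears in proportion to its weight."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        for name, weight in weights.items():
            current[name] += weight          # every endpoint earns its weight
        chosen = max(current, key=current.get)
        current[chosen] -= total             # the winner pays the full total
        picks.append(chosen)
    return picks


# Hypothetical weights: two self-hosted replicas plus a managed vendor endpoint.
weights = {"aws-us-east": 6, "gcp-eu-west": 3, "vendor-api": 1}
share = Counter(smooth_weighted_round_robin(weights, 10))
```

Over any full cycle of 10 requests, the endpoints receive exactly 6, 3, and 1 requests respectively. Shifting a canary's share is then a matter of changing one weight.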
Modelplane is licensed under Apache 2.0 and open source end to end. The control plane lives entirely in your infrastructure and depends on nothing outside it, so no vendor can restrict, throttle, or revoke access. Donation to a neutral open source foundation is planned.
Built by the team behind Crossplane, the proven open source foundation for infrastructure control planes, trusted at Apple, JPMC, Nike, Elastic, Grafana, and MongoDB.