enterprise · commercial · multi-node

HAL Simulator · Enterprise

Full hardware abstraction across TPU, NPU, CUDA, ROCm, OpenVINO, AMD AI and real quantum accelerators. Automatic kernel-level sharding and fabric-aware scheduling for leaf/spine, Dragonfly+ and HCI clusters.

backends

6 vendors · unified API

max nodes

4,096

auto-shard

kernel-level

fabric

leaf/spine · DF+ · HCI

sla

24/7 · 99.95%

air-gap

FIPS 140-3

telemetry

OTLP · Prom · Parquet

license

Commercial · per-core

Cluster configuration

Accelerator (unified HAL)

Fabric topology

Nodes: 16

Automatic kernel-level sharding: tensor + pipeline + expert parallel

aggregate

1644 TFLOPS

p50 latency

0.53 ms

power

6.8 kW

sched dec/s

397

fabric util

92.1%

Leaf / spine projection

                      ┌──────── spine-01 ────────┐  ┌──── spine-02 ────┐
                      │  400 Gbps · 1.4µs        │  │  ECMP · RoCEv2   │
                      └──────────────┬───────────┘  └────────┬─────────┘
   leaf-01 ── goo · goo · goo · goo
   leaf-02 ── goo · goo · goo · goo
   leaf-03 ── goo · goo · goo · goo
   leaf-04 ── goo · goo · goo · goo

scheduler: auto-shard ON · NUMA + topology-aware
sharding strategy: tensor || pipeline || expert (selected per kernel)

Per-kernel sharding breakdown


✓ plan passes shard, latency and bandwidth checks

Layer / kernel	Kind	Strategy	Slice (MB)	Placement across nodes
embed.tok	embed	replicate	49.0	n0 n1 n2 n3 n4 n5 n6 n7
block.0.attn	attn	tensor	8.4	n0 n1 n2 n3 n4 n5 n6 n7
block.0.mlp	mlp	expert	16.8	n0 n1 n2 n3 n4 n5 n6 n7
block.1.attn	attn	tensor	8.4	n0 n1 n2 n3 n4 n5 n6 n7
block.1.mlp	mlp	expert	16.8	n0 n1 n2 n3 n4 n5 n6 n7
block.2.attn	attn	tensor	8.4	n0 n1 n2 n3 n4 n5 n6 n7
block.2.mlp	mlp	expert	16.8	n0 n1 n2 n3 n4 n5 n6 n7
block.3.attn	attn	tensor	8.4	n0 n1 n2 n3 n4 n5 n6 n7
block.3.mlp	mlp	expert	16.8	n0 n1 n2 n3 n4 n5 n6 n7
block.4.attn	attn	tensor	8.4	n0 n1 n2 n3 n4 n5 n6 n7
block.4.mlp	mlp	expert	16.8	n0 n1 n2 n3 n4 n5 n6 n7
norm.out	norm	pipeline	8.0	n0 n1 n2 n3 n4 n5 n6 n7
lm_head	head	replicate	49.0	n0 n1 n2 n3 n4 n5 n6 n7

replicate · pipeline · tensor · expert — selected per kernel from the model graph. Slices show the per-node memory after split.
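
The per-kernel selection above can be sketched in a few lines of Python. This is only an illustration: the kind-to-strategy mapping is read off the breakdown table, and the names `Kernel`, `STRATEGY_BY_KIND`, `pick_strategy`, `build_plan` and `model_kernels` are hypothetical, not part of the HAL Simulator API. The real sharder presumably also weighs slice size, latency and bandwidth, which this sketch ignores.

```python
from dataclasses import dataclass

# Assumption: one fixed strategy per kernel kind, as seen in the table above.
# The actual product may choose differently depending on the model graph.
STRATEGY_BY_KIND = {
    "embed": "replicate",
    "head": "replicate",
    "attn": "tensor",
    "mlp": "expert",
    "norm": "pipeline",
}


@dataclass(frozen=True)
class Kernel:
    name: str  # e.g. "block.0.attn"
    kind: str  # e.g. "attn"


def pick_strategy(kernel: Kernel) -> str:
    """Return the sharding strategy for one kernel, keyed on its kind."""
    try:
        return STRATEGY_BY_KIND[kernel.kind]
    except KeyError:
        raise ValueError(f"no sharding strategy for kind {kernel.kind!r}") from None


def build_plan(kernels):
    """Map each kernel name to its selected strategy."""
    return {k.name: pick_strategy(k) for k in kernels}


def model_kernels(num_blocks: int = 5):
    """Build the kernel list shown in the table: embed, N x (attn, mlp), norm, head."""
    kernels = [Kernel("embed.tok", "embed")]
    for i in range(num_blocks):
        kernels.append(Kernel(f"block.{i}.attn", "attn"))
        kernels.append(Kernel(f"block.{i}.mlp", "mlp"))
    kernels.append(Kernel("norm.out", "norm"))
    kernels.append(Kernel("lm_head", "head"))
    return kernels


if __name__ == "__main__":
    for name, strategy in build_plan(model_kernels()).items():
        print(f"{name:<14} {strategy}")
```

With the default of five blocks, `model_kernels()` yields the same 13 kernels listed in the table.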

Kernel scheduling timeline


t=0.00 ms · Google TPU v5p · Leaf/spine · 400GbE RoCEv2 · all kernels · t=10.20 ms

0.00 ms	boot	qhal	discover 8 node(s) · Google TPU v5p
0.60 ms	fabric	fabric-mon	probe Leaf/spine · 400GbE RoCEv2 · rtt=1.4µs
1.10 ms	compile	model.compile()	lower embed+attn+mlp+norm+head → 13 kernels
2.00 ms	plan	sharder	auto-shard: tensor+expert+pipeline
2.60 ms	place	scheduler	bind embed+head (replicate) → all nodes
3.00 ms	place	scheduler	bind attn (tensor) → split across nodes
3.40 ms	place	scheduler	bind mlp (expert) → nodes
3.80 ms	place	scheduler	bind norm (pipeline) → node 0
4.20 ms	dispatch	node-00	launch attn shard #0
4.50 ms	dispatch	node-00	launch mlp shard #0
4.80 ms	dispatch	node-01	launch attn shard #1
5.10 ms	dispatch	node-01	launch mlp shard #1
5.40 ms	dispatch	node-02	launch attn shard #2
5.70 ms	dispatch	node-02	launch mlp shard #2
6.00 ms	dispatch	node-03	launch attn shard #3
6.30 ms	dispatch	node-03	launch mlp shard #3
6.60 ms	dispatch	node-04	launch attn shard #4
6.90 ms	dispatch	node-04	launch mlp shard #4
7.20 ms	collective	nccl/rccl	all-reduce attn 32 MB · ring across 8 nodes
7.80 ms	collective	nccl/rccl	all-reduce mlp 64 MB · ring across 8 nodes
8.40 ms	rebalance	fabric-mon	leaf-02 congestion 78% → migrate mlp shard #2
8.90 ms	rebalance	scheduler	evict cold KV cache · reclaim 2.1 GB
9.30 ms	collective	nccl/rccl	all-gather logits · ECMP path
9.80 ms	emit	telemetry	p50=lat ✓ · power ✓ · OTLP → Parquet
10.20 ms	done	runtime	step complete · feedback → policy net
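
The timeline's collective steps use a ring all-reduce across 8 nodes. As a rough back-of-the-envelope aid, the sketch below applies the standard bandwidth cost of a ring all-reduce, in which each node sends 2(N-1)/N times the payload over 2(N-1) steps. The function names are hypothetical, and the model assumes an ideal, uncontended 400 Gbps link per node, decimal megabytes, and the 1.4 µs figure as a per-step latency. It is not how the simulator computes its timestamps and should not be expected to reproduce them.

```python
def ring_allreduce_bytes_per_node(payload_bytes: float, nodes: int) -> float:
    """Bytes each node sends in a ring all-reduce (reduce-scatter + all-gather)."""
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    if nodes == 1:
        return 0.0  # nothing to exchange with a single node
    return 2 * (nodes - 1) / nodes * payload_bytes


def ring_allreduce_seconds(
    payload_bytes: float,
    nodes: int,
    link_gbps: float,
    step_latency_s: float = 0.0,
) -> float:
    """Idealized estimate: bandwidth term plus a fixed latency per ring step."""
    bytes_per_second = link_gbps * 1e9 / 8  # Gbps -> bytes/s
    steps = 2 * (nodes - 1)
    transfer = ring_allreduce_bytes_per_node(payload_bytes, nodes) / bytes_per_second
    return transfer + steps * step_latency_s


if __name__ == "__main__":
    # Assumption: 1 MB = 1e6 bytes; values taken from the timeline above.
    for label, megabytes in (("attn", 32), ("mlp", 64)):
        t = ring_allreduce_seconds(megabytes * 1e6, nodes=8, link_gbps=400,
                                   step_latency_s=1.4e-6)
        print(f"all-reduce {label} {megabytes} MB: ~{t * 1e3:.2f} ms")
```

Under these assumptions the 32 MB attention all-reduce moves 56 MB per node, so the estimate is dominated by the bandwidth term rather than the per-step latency.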

request quote

Per-core licensing, support tiers and air-gap installers via enterprise@adaptive.os.

Try the open-source Lite simulator first — same HAL, capped at 2 shards.