enterprise · commercial · multi-node
HAL Simulator · Enterprise
Full hardware abstraction across TPU, NPU, CUDA, ROCm, OpenVINO, AMD AI and real quantum accelerators. Automatic kernel-level sharding and fabric-aware scheduling for leaf/spine, Dragonfly+ and HCI clusters.
backends: 6 vendors · unified API
max nodes: 4,096
auto-shard: kernel-level
fabric: leaf/spine · DF+ · HCI
sla: 24/7 · 99.95%
air-gap: FIPS 140-3
telemetry: OTLP · Prom · Parquet
license: Commercial · per-core
Cluster configuration
Accelerator (unified HAL)
Fabric topology
Nodes: 16
aggregate: 1644 TFLOPS
p50 latency: 0.53 ms
power: 6.8 kW
sched dec/s: 397
fabric util: 92.1%
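The cluster-wide readouts above are easier to compare across cluster sizes once they are normalised per node. The following is a minimal sketch of that arithmetic, assuming the aggregate throughput and power figures cover all 16 configured nodes (the page does not state this explicitly). The function name `per_node_metrics` is illustrative and not part of the product.

```python
def per_node_metrics(aggregate_tflops: float, power_kw: float, nodes: int) -> dict:
    """Derive per-node and efficiency figures from cluster-wide readouts."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return {
        "tflops_per_node": aggregate_tflops / nodes,
        "watts_per_node": power_kw * 1000 / nodes,
        "tflops_per_kw": aggregate_tflops / power_kw,
    }


# Values taken from the dashboard readouts; the 16-node divisor is an assumption.
metrics = per_node_metrics(aggregate_tflops=1644, power_kw=6.8, nodes=16)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption, each node contributes roughly 103 TFLOPS and draws about 425 W.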
Leaf / spine projection
┌──────── spine-01 ────────┐ ┌──── spine-02 ────┐
│ 400 Gbps · 1.4µs │ │ ECMP · RoCEv2 │
└──────────────┬───────────┘ └────────┬─────────┘
leaf-01 ── goo · goo · goo · goo
leaf-02 ── goo · goo · goo · goo
leaf-03 ── goo · goo · goo · goo
leaf-04 ── goo · goo · goo · goo
scheduler: auto-shard ON · NUMA + topology-aware
sharding strategy: tensor || pipeline || expert (selected per kernel)
Per-kernel sharding breakdown
| Layer / kernel | Kind | Strategy | Slice (MB) | Placement across nodes |
|---|---|---|---|---|
| embed.tok | embed | replicate | 49.0 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.0.attn | attn | tensor | 8.4 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.0.mlp | mlp | expert | 16.8 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.1.attn | attn | tensor | 8.4 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.1.mlp | mlp | expert | 16.8 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.2.attn | attn | tensor | 8.4 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.2.mlp | mlp | expert | 16.8 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.3.attn | attn | tensor | 8.4 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.3.mlp | mlp | expert | 16.8 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.4.attn | attn | tensor | 8.4 | n0 n1 n2 n3 n4 n5 n6 n7 |
| block.4.mlp | mlp | expert | 16.8 | n0 n1 n2 n3 n4 n5 n6 n7 |
| norm.out | norm | pipeline | 8.0 | n0 n1 n2 n3 n4 n5 n6 n7 |
| lm_head | head | replicate | 49.0 | n0 n1 n2 n3 n4 n5 n6 n7 |
replicate · pipeline · tensor · expert: one strategy is selected per kernel from the model graph. Slice sizes are the per-node memory after the split.
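The breakdown above reads as a lookup from kernel kind to sharding strategy. The sketch below reproduces the table under that reading. The mapping in `STRATEGY_BY_KIND` is inferred only from the rows shown, and the names `Kernel`, `pick_strategy` and `build_example_model` are hypothetical. The product's actual selector works from the model graph and is not documented on this page.

```python
from dataclasses import dataclass

# Inferred from the table rows; not the product's documented selection rule.
STRATEGY_BY_KIND = {
    "embed": "replicate",
    "head": "replicate",
    "attn": "tensor",
    "mlp": "expert",
    "norm": "pipeline",
}


@dataclass(frozen=True)
class Kernel:
    name: str
    kind: str
    slice_mb: float  # per-node memory after the split, as listed in the table


def pick_strategy(kernel: Kernel) -> str:
    """Return the sharding strategy for a kernel based on its kind."""
    try:
        return STRATEGY_BY_KIND[kernel.kind]
    except KeyError:
        raise ValueError(f"no strategy known for kind {kernel.kind!r}") from None


def build_example_model(blocks: int = 5) -> list[Kernel]:
    """Rebuild the kernel list shown in the table (5 attn/mlp blocks)."""
    kernels = [Kernel("embed.tok", "embed", 49.0)]
    for i in range(blocks):
        kernels.append(Kernel(f"block.{i}.attn", "attn", 8.4))
        kernels.append(Kernel(f"block.{i}.mlp", "mlp", 16.8))
    kernels.append(Kernel("norm.out", "norm", 8.0))
    kernels.append(Kernel("lm_head", "head", 49.0))
    return kernels


kernels = build_example_model()
plan = {k.name: pick_strategy(k) for k in kernels}
per_node_mb = sum(k.slice_mb for k in kernels)  # sum of the Slice (MB) column
print(len(kernels), "kernels,", round(per_node_mb, 1), "MB per node")
```

Summing the slice column gives a rough per-node footprint for the listed kernels, assuming every kernel is resident on every node as the placement column suggests.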
Kernel scheduling timeline
t=0.00 ms | Google TPU v5p · Leaf/spine · 400GbE RoCEv2 | t=7.35 ms
- 0.00 ms | boot | qhal | discover 8 node(s) · Google TPU v5p
- 0.60 ms | fabric | fabric-mon | probe Leaf/spine · 400GbE RoCEv2 · rtt=1.4µs
- 1.10 ms | compile | model.compile() | lower graph → 13 kernels · pick dtypes (fp16/int8)
- 2.00 ms | plan | sharder | auto-shard: tensor+expert+pipeline
- 2.60 ms | place | scheduler | bind kernels → nodes · NUMA + topology aware
- 3.10 ms | dispatch | node-00 | launch attn+mlp shard #0
- 3.45 ms | dispatch | node-01 | launch attn+mlp shard #1
- 3.80 ms | dispatch | node-02 | launch attn+mlp shard #2
- 4.15 ms | dispatch | node-03 | launch attn+mlp shard #3
- 4.50 ms | dispatch | node-04 | launch attn+mlp shard #4
- 4.85 ms | collective | nccl/rccl | all-reduce 64MB · ring across 8 nodes
- 5.55 ms | rebalance | fabric-mon | leaf-02 congestion 78% → migrate shard #2
- 6.05 ms | rebalance | scheduler | evict cold KV cache · reclaim 2.1GB
- 6.45 ms | collective | nccl/rccl | all-gather logits · ECMP path
- 6.95 ms | emit | telemetry | p50=lat ✓ · power ✓ · OTLP → Parquet
- 7.35 ms | done | runtime | step complete · feedback → policy net
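The collective step in the timeline is a 64 MB ring all-reduce across 8 nodes. In the standard ring algorithm (reduce-scatter followed by all-gather), each node sends 2·(N−1)/N times the payload. The sketch below works through that arithmetic. It assumes decimal megabytes and a single 400 Gbps link per node, and both function names are illustrative. The time is a bandwidth-only back-of-envelope figure and is not expected to match the simulated timestamps.

```python
def ring_allreduce_bytes_per_node(payload_bytes: int, nodes: int) -> float:
    """Bytes each node sends in a ring all-reduce (reduce-scatter + all-gather)."""
    if nodes < 2:
        return 0.0
    # Each of the 2*(N-1) steps moves one chunk of payload/N bytes.
    return 2 * (nodes - 1) * payload_bytes / nodes


def min_transfer_ms(bytes_sent: float, link_gbps: float) -> float:
    """Bandwidth-only lower bound; ignores latency, congestion and protocol overhead."""
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_sent / bytes_per_second * 1e3


payload = 64 * 10**6  # 64 MB, assuming decimal megabytes
sent = ring_allreduce_bytes_per_node(payload, nodes=8)
lower_bound_ms = min_transfer_ms(sent, link_gbps=400)  # assumed one 400 Gbps link per node
print(f"{sent / 1e6:.0f} MB sent per node, at least {lower_bound_ms:.2f} ms at line rate")
```

The per-node traffic approaches twice the payload as the ring grows, which is why ring all-reduce cost is largely independent of node count once bandwidth dominates.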
request quote
Per-core licensing, support tiers and air-gap installers via enterprise@adaptive.os.
Try the open-source Lite simulator first: it uses the same HAL and is capped at 2 shards.