enterprise · commercial · multi-node

HAL Simulator · Enterprise

Full hardware abstraction across TPU, NPU, CUDA, ROCm, OpenVINO, AMD AI and real quantum accelerators. Automatic kernel-level sharding and fabric-aware scheduling for leaf/spine, Dragonfly+ and HCI clusters.

backends: 6 vendors · unified API
max nodes: 4,096
auto-shard: kernel-level
fabric: leaf/spine · DF+ · HCI
sla: 24/7 · 99.95%
air-gap: FIPS 140-3
telemetry: OTLP · Prom · Parquet
license: Commercial · per-core

Cluster configuration

Accelerator (unified HAL): Google TPU v5p
Fabric topology: Leaf/spine · 400GbE RoCEv2
Nodes: 16
aggregate: 1644 TFLOPS
p50 latency: 0.53 ms
power: 6.8 kW
sched dec/s: 397
fabric util: 92.1%
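
The readout above can be normalized per node for a quick sanity check. The following is a minimal sketch, assuming the aggregate throughput and power are spread evenly across the 16 configured nodes (the page does not state this); `per_node_metrics` is an illustrative name, not part of the HAL API.

```python
def per_node_metrics(aggregate_tflops: float, power_kw: float, nodes: int) -> dict:
    """Derive per-node figures from the aggregate readout.

    Assumes an even split across nodes, which is a simplification.
    """
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return {
        "tflops_per_node": aggregate_tflops / nodes,
        "watts_per_node": power_kw * 1000 / nodes,
        "tflops_per_kw": aggregate_tflops / power_kw,
    }


# Values taken from the readout: 1644 TFLOPS, 6.8 kW, 16 nodes.
metrics = per_node_metrics(1644, 6.8, 16)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under the even-split assumption this works out to 102.75 TFLOPS and 425 W per node, or roughly 242 TFLOPS per kW.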

Leaf / spine projection

                      ┌──────── spine-01 ────────┐  ┌──── spine-02 ────┐
                      │  400 Gbps · 1.4µs        │  │  ECMP · RoCEv2   │
                      └──────────────┬───────────┘  └────────┬─────────┘
   leaf-01 ── goo · goo · goo · goo
   leaf-02 ── goo · goo · goo · goo
   leaf-03 ── goo · goo · goo · goo
   leaf-04 ── goo · goo · goo · goo

scheduler: auto-shard ON · NUMA + topology-aware
sharding strategy: tensor || pipeline || expert (selected per kernel)

Per-kernel sharding breakdown

Layer / kernel   Kind    Strategy    Slice (MB)   Placement across nodes
embed.tok        embed   replicate   49.0         n0 n1 n2 n3 n4 n5 n6 n7
block.0.attn     attn    tensor      8.4          n0 n1 n2 n3 n4 n5 n6 n7
block.0.mlp      mlp     expert      16.8         n0 n1 n2 n3 n4 n5 n6 n7
block.1.attn     attn    tensor      8.4          n0 n1 n2 n3 n4 n5 n6 n7
block.1.mlp      mlp     expert      16.8         n0 n1 n2 n3 n4 n5 n6 n7
block.2.attn     attn    tensor      8.4          n0 n1 n2 n3 n4 n5 n6 n7
block.2.mlp      mlp     expert      16.8         n0 n1 n2 n3 n4 n5 n6 n7
block.3.attn     attn    tensor      8.4          n0 n1 n2 n3 n4 n5 n6 n7
block.3.mlp      mlp     expert      16.8         n0 n1 n2 n3 n4 n5 n6 n7
block.4.attn     attn    tensor      8.4          n0 n1 n2 n3 n4 n5 n6 n7
block.4.mlp      mlp     expert      16.8         n0 n1 n2 n3 n4 n5 n6 n7
norm.out         norm    pipeline    8.0          n0 n1 n2 n3 n4 n5 n6 n7
lm_head          head    replicate   49.0         n0 n1 n2 n3 n4 n5 n6 n7

replicate · pipeline · tensor · expert — selected per kernel from the model graph. Slices show the per-node memory after split.
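
The breakdown pairs each kernel kind with one strategy: embed and head replicate, attn uses tensor, mlp uses expert, norm uses pipeline. A minimal sketch of that lookup follows. It assumes that a replicated kernel keeps its full size on every node and that every other strategy divides the kernel evenly across nodes, which is a simplification. The actual selection works from the model graph and is not documented on this page, and `Kernel`, `plan` and the unsharded sizes are hypothetical.

```python
from dataclasses import dataclass

# Kind -> strategy pairs as they appear in the breakdown.
STRATEGY_BY_KIND = {
    "embed": "replicate",
    "head": "replicate",
    "attn": "tensor",
    "mlp": "expert",
    "norm": "pipeline",
}


@dataclass
class Kernel:
    name: str
    kind: str
    total_mb: float  # unsharded size; hypothetical input, not shown on the page


def plan(kernel: Kernel, nodes: int) -> tuple[str, float]:
    """Return (strategy, per-node slice in MB) for one kernel."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    strategy = STRATEGY_BY_KIND[kernel.kind]
    if strategy == "replicate":
        # Every node holds a full copy.
        slice_mb = kernel.total_mb
    else:
        # Simplifying assumption: an even split across all nodes.
        slice_mb = kernel.total_mb / nodes
    return strategy, slice_mb


# 67.2 MB is an assumed unsharded size chosen for illustration.
print(plan(Kernel("block.0.attn", "attn", 67.2), nodes=8))
print(plan(Kernel("embed.tok", "embed", 49.0), nodes=8))
```

A static lookup keeps the example short. A graph-driven selector would also need to weigh tensor shapes and fabric cost per kernel.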

Kernel scheduling timeline

t=0.00 ms   Google TPU v5p · Leaf/spine · 400GbE RoCEv2   t=7.35 ms
   1.  0.00 ms  boot        qhal             discover 8 node(s) · Google TPU v5p
   2.  0.60 ms  fabric      fabric-mon       probe Leaf/spine · 400GbE RoCEv2 · rtt=1.4µs
   3.  1.10 ms  compile     model.compile()  lower graph → 13 kernels · pick dtypes (fp16/int8)
   4.  2.00 ms  plan        sharder          auto-shard: tensor+expert+pipeline
   5.  2.60 ms  place       scheduler        bind kernels → nodes · NUMA + topology aware
   6.  3.10 ms  dispatch    node-00          launch attn+mlp shard #0
   7.  3.45 ms  dispatch    node-01          launch attn+mlp shard #1
   8.  3.80 ms  dispatch    node-02          launch attn+mlp shard #2
   9.  4.15 ms  dispatch    node-03          launch attn+mlp shard #3
  10.  4.50 ms  dispatch    node-04          launch attn+mlp shard #4
  11.  4.85 ms  collective  nccl/rccl        all-reduce 64MB · ring across 8 nodes
  12.  5.55 ms  rebalance   fabric-mon       leaf-02 congestion 78% → migrate shard #2
  13.  6.05 ms  rebalance   scheduler        evict cold KV cache · reclaim 2.1GB
  14.  6.45 ms  collective  nccl/rccl        all-gather logits · ECMP path
  15.  6.95 ms  emit        telemetry        p50=lat ✓ · power ✓ · OTLP → Parquet
  16.  7.35 ms  done        runtime          step complete · feedback → policy net
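
Step 11 of the timeline runs a ring all-reduce across 8 nodes. The sketch below is a pure-Python illustration of the textbook ring algorithm (reduce-scatter followed by all-gather), not the simulator's or NCCL/RCCL's implementation. The function names are illustrative, and the buffer length is assumed to divide evenly by the node count.

```python
def ring_all_reduce(buffers: list[list[float]]) -> list[list[float]]:
    """Sum equal-length buffers so that every node ends with the full sum."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("buffer length must be divisible by node count")
    chunk = size // n
    data = [list(b) for b in buffers]

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: at each step node i sends chunk (i - step) % n to its
    # right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i - step) % n) for i in range(n)]
        payloads = [[data[i][k] for k in span(c)] for i, (_, c) in enumerate(sends)]
        for (dst, c), payload in zip(sends, payloads):
            for k, v in zip(span(c), payload):
                data[dst][k] += v

    # Node i now holds the fully reduced chunk (i + 1) % n.
    # All-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n) for i in range(n)]
        payloads = [[data[i][k] for k in span(c)] for i, (_, c) in enumerate(sends)]
        for (dst, c), payload in zip(sends, payloads):
            for k, v in zip(span(c), payload):
                data[dst][k] = v
    return data


def ring_traffic_per_node(size_mb: float, nodes: int) -> float:
    """MB each node sends: (nodes - 1) chunks in each of the two phases."""
    return 2 * (nodes - 1) / nodes * size_mb


result = ring_all_reduce([[float(i)] * 8 for i in range(4)])
print(result[0])
print(ring_traffic_per_node(64, 8))
```

For the 64 MB buffer in step 11, each of the 8 nodes sends 2 × 7/8 × 64 = 112 MB over the two phases. This per-node volume stays below twice the buffer size however many nodes join the ring.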

Request quote

Per-core licensing, support tiers and air-gap installers via enterprise@adaptive.os.

Try the open-source Lite simulator first — same HAL, capped at 2 shards.