open source · apache 2.0 · single-node

HAL Simulator · Lite

Free, open-source simulator for the Adaptive OS hardware abstraction layer. Unified API across CPU, CUDA, OpenVINO NPU and a 24-qubit QPU simulator. Capped at 2 shards on a single leaf — perfect for laptops, Jetson and Pi 5.

backends
4 (CPU·CUDA·OVN·QPU-sim)
max shards
2
fabric
single leaf · 25GbE
price
$0

Tune the workload

Accelerator
Shards: 1 / 2 (lite cap)
Batch: 16
throughput
42.2 TFLOPS
latency p50
1.42 ms
power
304 W

Fabric view

leaf-01 ── cuda #0  
        └─ 25 GbE · rdma=off · lite profile

Per-kernel sharding breakdown

Layer / kernelKindStrategySlice (MB)Placement across shards
embed.tokembedreplicate49.0
s0
block.0.attnattntensor67.0
s0
block.0.mlpmlpexpert134.0
s0
block.1.attnattntensor67.0
s0
block.1.mlpmlpexpert134.0
s0
block.2.attnattntensor67.0
s0
block.2.mlpmlpexpert134.0
s0
block.3.attnattntensor67.0
s0
block.3.mlpmlpexpert134.0
s0
block.4.attnattntensor67.0
s0
block.4.mlpmlpexpert134.0
s0
norm.outnormpipeline8.0
s0
lm_headheadreplicate49.0
s0

replicate · pipeline · tensor · expert — selected per kernel from the model graph. Slices show the per-shard memory after split.

Kernel scheduling timeline

t=0.00 msNVIDIA CUDA (RTX) · single leaf · 25GbEt=5.95 ms
  1. 0.00 msbootqhaldiscover 1 shard(s) · NVIDIA CUDA (RTX)
  2. 0.60 msfabricfabric-monprobe single leaf · 25GbE · rtt=2.5µs
  3. 1.10 mscompilemodel.compile()lower graph → 13 kernels · pick dtypes (fp16/int8)
  4. 2.00 msplansharderauto-shard: tensor+expert+pipeline
  5. 2.60 msplaceschedulerbind kernels → shards · NUMA + topology aware
  6. 3.10 msdispatchshard-00launch attn+mlp shard #0
  7. 3.45 mscollectivenccl/rcclall-reduce 64MB · ring across 1 shards
  8. 4.15 msrebalancefabric-monleaf-02 congestion 78% → migrate shard #2
  9. 4.65 msrebalanceschedulerevict cold KV cache · reclaim 2.1GB
  10. 5.05 mscollectivenccl/rcclall-gather logits · ECMP path
  11. 5.55 msemittelemetryp50=lat ✓ · power ✓ · OTLP → Parquet
  12. 5.95 msdoneruntimestep complete · feedback → policy net
Need TPU / ROCm / real QPU, multi-rack spine fabric, automatic sharding across ≥8 nodes, or NUMA-aware scheduling? Upgrade to the Enterprise simulator.