open source · apache 2.0 · single-node
HAL Simulator · Lite
Free, open-source simulator for the Adaptive OS hardware abstraction layer. Unified API across CPU, CUDA, OpenVINO NPU and a 24-qubit QPU simulator. Capped at 2 shards on a single leaf — perfect for laptops, Jetson and Pi 5.
backends: 4 (CPU·CUDA·OVN·QPU-sim)
max shards: 2
fabric: single leaf · 25GbE
price: $0
Tune the workload
Accelerator: NVIDIA CUDA (RTX)
Shards: 1 / 2 (lite cap)
Batch: 16
throughput: 42.2 TFLOPS
latency p50: 1.42 ms
power: 304 W
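
As a quick sanity check on the readout above, energy efficiency can be derived from the throughput and power figures. This is a minimal sketch in plain Python arithmetic, not part of the HAL Simulator API; the function name `efficiency_gflops_per_watt` is hypothetical.

```python
def efficiency_gflops_per_watt(throughput_tflops: float, power_watts: float) -> float:
    """Convert a TFLOPS throughput and a wattage into GFLOPS per watt."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    # 1 TFLOPS = 1000 GFLOPS
    return throughput_tflops * 1000.0 / power_watts


# Values copied from the readout above: 42.2 TFLOPS at 304 W.
eff = efficiency_gflops_per_watt(42.2, 304.0)
print(f"{eff:.1f} GFLOPS/W")
```

With the figures shown, this works out to roughly 139 GFLOPS/W for the single CUDA shard.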
Fabric view
leaf-01 ── cuda #0
└─ 25 GbE · rdma=off · lite profile

Per-kernel sharding breakdown
| Layer / kernel | Kind | Strategy | Slice (MB) | Placement across shards |
|---|---|---|---|---|
| embed.tok | embed | replicate | 49.0 | s0 |
| block.0.attn | attn | tensor | 67.0 | s0 |
| block.0.mlp | mlp | expert | 134.0 | s0 |
| block.1.attn | attn | tensor | 67.0 | s0 |
| block.1.mlp | mlp | expert | 134.0 | s0 |
| block.2.attn | attn | tensor | 67.0 | s0 |
| block.2.mlp | mlp | expert | 134.0 | s0 |
| block.3.attn | attn | tensor | 67.0 | s0 |
| block.3.mlp | mlp | expert | 134.0 | s0 |
| block.4.attn | attn | tensor | 67.0 | s0 |
| block.4.mlp | mlp | expert | 134.0 | s0 |
| norm.out | norm | pipeline | 8.0 | s0 |
| lm_head | head | replicate | 49.0 | s0 |
The strategy (replicate, pipeline, tensor, or expert) is selected per kernel from the model graph. Slice sizes show per-shard memory after the split.
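
To see what the breakdown adds up to, the slice sizes can be totalled per shard and per strategy. The sketch below is illustrative only: it copies the rows from the table by hand, and `ROWS` and `memory_by` are hypothetical names, not simulator API.

```python
from collections import defaultdict

# (kernel, strategy, slice_mb, shard) rows copied from the breakdown table.
ROWS = [("embed.tok", "replicate", 49.0, "s0")]
for i in range(5):
    ROWS.append((f"block.{i}.attn", "tensor", 67.0, "s0"))
    ROWS.append((f"block.{i}.mlp", "expert", 134.0, "s0"))
ROWS += [
    ("norm.out", "pipeline", 8.0, "s0"),
    ("lm_head", "replicate", 49.0, "s0"),
]


def memory_by(rows, key_index):
    """Sum slice sizes (MB) grouped by the column at key_index."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


per_shard = memory_by(ROWS, 3)      # group by shard
per_strategy = memory_by(ROWS, 1)   # group by strategy
print(len(ROWS), "kernels")
print(per_shard)
print(per_strategy)
```

With one shard selected, all 13 kernels land on s0 for a total of 1111 MB, of which the expert-sharded MLP kernels account for 670 MB.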
Kernel scheduling timeline
NVIDIA CUDA (RTX) · single leaf · 25GbE · t=0.00 ms to t=5.95 ms

| Time (ms) | Phase | Component | Action |
|---|---|---|---|
| 0.00 | boot | qhal | discover 1 shard(s) · NVIDIA CUDA (RTX) |
| 0.60 | fabric | fabric-mon | probe single leaf · 25GbE · rtt=2.5µs |
| 1.10 | compile | model.compile() | lower graph → 13 kernels · pick dtypes (fp16/int8) |
| 2.00 | plan | sharder | auto-shard: tensor+expert+pipeline |
| 2.60 | place | scheduler | bind kernels → shards · NUMA + topology aware |
| 3.10 | dispatch | shard-00 | launch attn+mlp shard #0 |
| 3.45 | collective | nccl/rccl | all-reduce 64MB · ring across 1 shard |
| 4.15 | rebalance | fabric-mon | leaf-02 congestion 78% → migrate shard #2 |
| 4.65 | rebalance | scheduler | evict cold KV cache · reclaim 2.1GB |
| 5.05 | collective | nccl/rccl | all-gather logits · ECMP path |
| 5.55 | emit | telemetry | p50=lat ✓ · power ✓ · OTLP → Parquet |
| 5.95 | done | runtime | step complete · feedback → policy net |
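
The timeline above lists only start times, so per-phase cost has to be inferred. The sketch below assumes each event runs until the next one starts (an assumption, since the page does not state event durations) and sums the gaps by phase. `EVENTS` and `phase_durations` are hypothetical names used for illustration.

```python
# (start_ms, phase) pairs copied from the timeline.
EVENTS = [
    (0.00, "boot"), (0.60, "fabric"), (1.10, "compile"), (2.00, "plan"),
    (2.60, "place"), (3.10, "dispatch"), (3.45, "collective"),
    (4.15, "rebalance"), (4.65, "rebalance"), (5.05, "collective"),
    (5.55, "emit"), (5.95, "done"),
]


def phase_durations(events):
    """Sum, per phase, the gap between each event and the one after it.

    Assumes an event lasts until the next event starts; the final
    event ("done") therefore gets no duration.
    """
    totals = {}
    for (start, phase), (next_start, _) in zip(events, events[1:]):
        totals[phase] = round(totals.get(phase, 0.0) + (next_start - start), 2)
    return totals


durations = phase_durations(EVENTS)
for phase, ms in sorted(durations.items(), key=lambda kv: -kv[1]):
    print(f"{phase:<10} {ms:.2f} ms")
```

Under that assumption, the two collective steps together take the largest share of the 5.95 ms step (1.20 ms), followed by compile and rebalance at 0.90 ms each.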
Need TPU / ROCm / real QPU, multi-rack spine fabric, automatic sharding across ≥8 nodes, or NUMA-aware scheduling? Upgrade to the Enterprise simulator.