open source · apache 2.0 · single-node

HAL Simulator · Lite

Free, open-source simulator for the Adaptive OS hardware abstraction layer. Unified API across CPU, CUDA, OpenVINO NPU and a 24-qubit QPU simulator. Capped at 2 shards on a single leaf — perfect for laptops, Jetson and Pi 5.

backends

4 (CPU·CUDA·OVN·QPU-sim)

max shards

2

fabric

single leaf · 25GbE

price

free

Tune the workload

Accelerator: NVIDIA CUDA (RTX)

Shards: 1 / 2 (lite cap)

Batch: 16

throughput

42.2 TFLOPS

latency p50

1.42 ms

power

304 W

Fabric view

leaf-01 ── cuda #0  
        └─ 25 GbE · rdma=off · lite profile

Per-kernel sharding breakdown

WARN · oversized per-shard slice (134.0 MB)

Largest kernel slice exceeds the 90 MB working set; risk of device-memory pressure on NVIDIA CUDA (RTX).

↳ suggested: Enable auto-shard (expert split for MLP) or increase shard count.
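The arithmetic behind this warning can be sketched as follows. This is a minimal illustration, not the simulator's own code: it assumes a slice divides evenly across shards (the page does not say how the expert split distributes memory), and the names `WORKING_SET_MB`, `slice_per_shard` and `min_shards` are hypothetical.

```python
import math

WORKING_SET_MB = 90.0   # threshold quoted in the warning
LITE_SHARD_CAP = 2      # Lite edition shard cap stated on this page

def slice_per_shard(unsplit_mb: float, shards: int) -> float:
    """Per-shard slice, ASSUMING an even split (not confirmed by the source)."""
    return unsplit_mb / shards

def min_shards(unsplit_mb: float, limit_mb: float = WORKING_SET_MB) -> int:
    """Smallest shard count whose even split fits within the limit."""
    return max(1, math.ceil(unsplit_mb / limit_mb))

mlp_mb = 134.0  # block.N.mlp slice at 1 shard, from the table on this page
needed = min_shards(mlp_mb)
print(needed, slice_per_shard(mlp_mb, needed), needed <= LITE_SHARD_CAP)
```

Under the even-split assumption, moving from 1 to 2 shards brings the MLP slice to 67.0 MB, which is under the 90 MB working set and still within the Lite cap.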

Layer / kernel	Kind	Strategy	Slice (MB)	Placement across shards
embed.tok	embed	replicate	49.0	s0
block.0.attn	attn	tensor	67.0	s0
block.0.mlp	mlp	expert	134.0	s0
block.1.attn	attn	tensor	67.0	s0
block.1.mlp	mlp	expert	134.0	s0
block.2.attn	attn	tensor	67.0	s0
block.2.mlp	mlp	expert	134.0	s0
block.3.attn	attn	tensor	67.0	s0
block.3.mlp	mlp	expert	134.0	s0
block.4.attn	attn	tensor	67.0	s0
block.4.mlp	mlp	expert	134.0	s0
norm.out	norm	pipeline	8.0	s0
lm_head	head	replicate	49.0	s0

replicate · pipeline · tensor · expert — selected per kernel from the model graph. Slices show the per-shard memory after split.
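To see how the slices in the table add up on shard s0, the rows can be totalled per strategy. The snippet below is only a sketch over the values shown in the table; the `ROWS` structure and variable names are illustrative and not part of the simulator.

```python
from collections import defaultdict

# (kernel, kind, strategy, slice_mb) copied from the table; every row is placed on s0.
ROWS = [("embed.tok", "embed", "replicate", 49.0)]
for i in range(5):
    ROWS.append((f"block.{i}.attn", "attn", "tensor", 67.0))
    ROWS.append((f"block.{i}.mlp", "mlp", "expert", 134.0))
ROWS += [
    ("norm.out", "norm", "pipeline", 8.0),
    ("lm_head", "head", "replicate", 49.0),
]

# Sum slice sizes per sharding strategy.
by_strategy = defaultdict(float)
for _kernel, _kind, strategy, slice_mb in ROWS:
    by_strategy[strategy] += slice_mb

total_mb = sum(by_strategy.values())
for strategy, mb in sorted(by_strategy.items(), key=lambda kv: -kv[1]):
    print(f"{strategy:<10} {mb:7.1f} MB  ({mb / total_mb:.0%})")
print(f"{'total':<10} {total_mb:7.1f} MB across {len(ROWS)} kernels")
```

With one shard, the expert-split MLP kernels account for 670.0 MB of the 1111.0 MB total, which is why the warning targets them first.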

Kernel scheduling timeline

t=0.00 ms · NVIDIA CUDA (RTX) · single leaf · 25GbE · all kernels · t=7.80 ms

0.00 ms	boot	qhal	discover 1 shard(s) · NVIDIA CUDA (RTX)
0.60 ms	fabric	fabric-mon	probe single leaf · 25GbE · rtt=2.5µs
1.10 ms	compile	model.compile()	lower embed+attn+mlp+norm+head → 13 kernels
2.00 ms	plan	sharder	auto-shard: tensor+expert+pipeline
2.60 ms	place	scheduler	bind embed+head (replicate) → all shards
3.00 ms	place	scheduler	bind attn (tensor) → split across shards
3.40 ms	place	scheduler	bind mlp (expert) → shards
3.80 ms	place	scheduler	bind norm (pipeline) → shard 0
4.20 ms	dispatch	shard-00	launch attn shard #0
4.50 ms	dispatch	shard-00	launch mlp shard #0
4.80 ms	collective	nccl/rccl	all-reduce attn 32MB · ring across 1 shards
5.40 ms	collective	nccl/rccl	all-reduce mlp 64MB · ring across 1 shards
6.00 ms	rebalance	fabric-mon	leaf-02 congestion 78% → migrate mlp shard #2
6.50 ms	rebalance	scheduler	evict cold KV cache · reclaim 2.1GB
6.90 ms	collective	nccl/rccl	all-gather logits · ECMP path
7.40 ms	emit	telemetry	p50=lat ✓ · power ✓ · OTLP → Parquet
7.80 ms	done	runtime	step complete · feedback → policy net
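A rough way to read the timeline is to total the time attributed to each phase. The sketch below assumes each event runs until the next event's timestamp, since the page lists start times only; the `EVENTS` list and variable names are illustrative.

```python
from collections import defaultdict

# (start_ms, phase) pairs copied from the timeline above.
EVENTS = [
    (0.00, "boot"), (0.60, "fabric"), (1.10, "compile"), (2.00, "plan"),
    (2.60, "place"), (3.00, "place"), (3.40, "place"), (3.80, "place"),
    (4.20, "dispatch"), (4.50, "dispatch"),
    (4.80, "collective"), (5.40, "collective"),
    (6.00, "rebalance"), (6.50, "rebalance"),
    (6.90, "collective"), (7.40, "emit"), (7.80, "done"),
]

# ASSUMPTION: an event lasts until the next event starts.
# The final "done" marker therefore has no duration of its own.
spent = defaultdict(float)
for (start, phase), (end, _next_phase) in zip(EVENTS, EVENTS[1:]):
    spent[phase] += end - start

for phase, ms in sorted(spent.items(), key=lambda kv: -kv[1]):
    print(f"{phase:<10} {ms:.2f} ms")
```

Under that assumption, collectives (1.70 ms) and placement (1.60 ms) together take up a little over 40% of the 7.80 ms step.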

Need TPU / ROCm / real QPU, multi-rack spine fabric, automatic sharding across ≥8 nodes, or NUMA-aware scheduling? Upgrade to the Enterprise simulator.