An inference engine called "flash-moe," built by Dan Woods, is drawing attention for running the 397B-parameter MoE model Qwen3.5-397B-A17B on a 48GB MacBook Pro without any framework such as PyTorch or MLX.
Flash-MoE · On-Device LLM
A 397-Billion-Parameter Model, Running on a 48GB Laptop
Flash-MoE runs Qwen3.5-397B-A17B on a MacBook Pro (M3 Max) using only C / Objective-C and hand-written Metal shaders, with no frameworks. It streams a 209GB 4-bit model from SSD on demand and still produces working tool calls at 4.36 tokens/sec.
397B: total params, ~17B active per token
48GB: unified memory on the laptop running it
4.36 tok/s: with working tool calling (4-bit + FMA)
209GB: model streamed from SSD on demand
209GB model vs. 48GB of memory
The model never fits in RAM. Memory is partitioned and the experts stream from SSD, with the OS page cache doing the heavy lifting.
Model on disk: 209GB, 4-bit quantized
RAM: 48GB, with 5.5GB weights · 200MB scratch
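The figures above can be turned into a rough budget. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes that everything outside the 5.5GB of resident weights is expert data, that those bytes are split evenly across the 60 layers × 512 experts the article cites, and that the shared expert is already resident. All variable names are illustrative.

```python
# Back-of-the-envelope budget from the article's figures (decimal GB).
GB = 1e9

model_on_disk = 209 * GB      # 4-bit model size on SSD
ram_total = 48 * GB           # unified memory
resident_weights = 5.5 * GB   # weights kept in RAM
scratch = 0.2 * GB            # scratch buffers

# Assumption: everything that is not resident is expert data read from SSD.
streamed = model_on_disk - resident_weights

# Upper bound on page cache: ignores memory used by the OS and other apps.
page_cache_max = ram_total - resident_weights - scratch
cache_fraction = page_cache_max / streamed

# Assumption: streamed bytes are split evenly over 60 layers x 512 experts.
layers, experts_per_layer, k_active = 60, 512, 4
bytes_per_expert = streamed / (layers * experts_per_layer)
bytes_per_token = layers * k_active * bytes_per_expert

ssd_bandwidth = 17.5 * GB     # sequential read speed quoted in the article
cold_tokens_per_sec = ssd_bandwidth / bytes_per_token

print(f"page cache can hold at most {cache_fraction:.0%} of streamed weights")
print(f"~{bytes_per_expert / 1e6:.1f} MB per expert, "
      f"~{bytes_per_token / GB:.2f} GB read per token if nothing is cached")
print(f"SSD-only ceiling: ~{cold_tokens_per_sec:.1f} tokens/sec")
```

Under these assumptions the page cache can cover only about a fifth of the expert data, so most expert reads have to come from the SSD. The resulting SSD-only ceiling is an idealized bound, since it ignores compute time and non-sequential access.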
Speed vs. quality, by quantization
4-bit + FMA: 4.36 tokens/sec, tool calls work
2-bit: 5.74 tokens/sec, JSON / tools break
2-bit is fastest, but only 4-bit + FMA keeps tool calling reliable.
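To give a feel for why fewer bits can cost accuracy, here is a minimal sketch comparing round-trip error at 4 bits and 2 bits. It uses a generic min/max affine quantizer on one hypothetical group of 64 random weights. This is an assumption for illustration only: the article does not describe flash-moe's actual quantization format, and this toy does not show that quantization error is what breaks tool calling.

```python
import random

def quantize_roundtrip(values, bits):
    """Affine (min/max) quantization to `bits` bits, then dequantization."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # avoid a zero step for constant input
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(64)]  # hypothetical weight group

err4 = rms_error(group, quantize_roundtrip(group, 4))
err2 = rms_error(group, quantize_roundtrip(group, 2))
print(f"4-bit RMS error: {err4:.4f}")
print(f"2-bit RMS error: {err2:.4f}")
```

With 4 bits each group has 16 representable values, with 2 bits only 4, so the 2-bit reconstruction is coarser. That is one plausible reason strict outputs such as JSON degrade first.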
How it works
SSD expert streaming pipeline
SSD read (17.5 GB/s sequential) → Load K=4 experts (+ shared expert / token) → FMA Metal kernel (dequant + matmul fused) → Generate token (~400 GB/s unified mem)
60 layers · 512 experts each · no llama.cpp, no MLX, no Python: pure C / Objective-C + Metal.
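The pipeline above can be sketched in miniature. The toy below is not flash-moe's code (which is C / Objective-C + Metal); it is a Python illustration of the same three steps under stated assumptions: experts stored as fixed-size records in one file, top-K selection from router scores, offset reads with `os.pread` (Unix only), and a dot product that dequantizes 4-bit codes inside the accumulation loop instead of materializing full-precision weights first. The file layout, `SCALE`, `ZERO`, and all function names are hypothetical, and gate weighting and the shared expert are omitted.

```python
import os
import tempfile

N_EXPERTS, K, DIM = 8, 4, 16      # toy sizes (article: 512 experts per layer, K=4)
RECORD = DIM // 2                 # bytes per expert: two 4-bit codes per byte
SCALE, ZERO = 0.1, 8              # hypothetical shared quantization parameters

def write_toy_experts(path):
    """Write N_EXPERTS fixed-size records; expert e holds the code (e % 16)."""
    with open(path, "wb") as f:
        for e in range(N_EXPERTS):
            code = e % 16
            f.write(bytes([code | (code << 4)]) * RECORD)

def top_k(scores, k):
    """Indices of the k highest router scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def fused_dequant_dot(packed, x):
    """Dequantize 4-bit codes and accumulate the dot product in one pass."""
    acc = 0.0
    for i, byte in enumerate(packed):
        lo, hi = byte & 0x0F, byte >> 4
        acc += (lo - ZERO) * SCALE * x[2 * i]
        acc += (hi - ZERO) * SCALE * x[2 * i + 1]
    return acc

def run_layer(fd, scores, x):
    """Read only the selected experts from disk and sum their outputs."""
    total = 0.0
    for e in top_k(scores, K):
        # Offset read; the OS may serve it from the page cache if it is warm.
        packed = os.pread(fd, RECORD, e * RECORD)
        total += fused_dequant_dot(packed, x)
    return total

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "experts.bin")
    write_toy_experts(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
        x = [1.0] * DIM
        selected = top_k(scores, K)
        result = run_layer(fd, scores, x)
    finally:
        os.close(fd)

print(selected, round(result, 6))
```

Only the K selected records are ever read, so the working set per token stays small even though the file holds every expert. Repeatedly chosen experts benefit from the OS page cache without any explicit caching code.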
Why it matters
Production-quality tool calling on a laptop alone
Opens the door to on-device MoE execution
Privacy-focused, low-latency local workloads
An iPhone 17 Pro fork has already appeared
PoC-stage limits
2-bit quantization breaks JS