Breaking

flash-moe runs 397B-parameter model on a MacBook with no frameworks

June 29, 2026 at 09:46 EDT

Open Source
Foundation Models
Infra & Chips

An inference engine called "flash-moe," built by Dan Woods, is drawing attention for running the 397B-parameter MoE model Qwen3.5-397B-A17B on a 48GB MacBook Pro without any framework such as PyTorch or MLX.

Flash-MoE · On-Device LLM

A 397-Billion-Parameter Model, Running on a 48GB Laptop

Flash-MoE runs Qwen3.5-397B-A17B on a MacBook Pro (M3 Max) using only C / Objective-C and hand-written Metal shaders, with no frameworks. It streams a 209GB 4-bit model from SSD on demand and still produces working tool calls at 4.36 tokens/sec.

397B · total params, ~17B active per token

48GB · unified memory on the laptop running it

4.36 tok/s · with working tool calling (4-bit + FMA)

209GB · model streamed from SSD on demand

209GB model vs. 48GB of memory

The model never fits in RAM. Memory is partitioned and the experts stream from SSD — the OS page cache does the heavy lifting.
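As a rough illustration of on-demand expert streaming, the sketch below stores fixed-size expert blobs back to back in one file and reads only the experts selected for a token. The file layout, `EXPERT_BYTES`, `NUM_EXPERTS`, and the function names are hypothetical and are not flash-moe's actual format; flash-moe itself is written in C / Objective-C, and Python is used here only for brevity.

```python
import os
import tempfile

EXPERT_BYTES = 64   # hypothetical tiny expert size for the demo
NUM_EXPERTS = 8     # hypothetical; the article cites 512 experts per layer


def write_demo_file(path):
    """Write NUM_EXPERTS blobs back to back; blob i is filled with byte value i."""
    with open(path, "wb") as f:
        for i in range(NUM_EXPERTS):
            f.write(bytes([i]) * EXPERT_BYTES)


def load_experts(path, expert_ids):
    """Read only the requested experts by offset; the rest stays on disk.

    No user-space cache is kept: repeated reads of a hot expert can be
    served by the OS page cache instead of the SSD.
    """
    blobs = {}
    with open(path, "rb") as f:
        for eid in expert_ids:
            f.seek(eid * EXPERT_BYTES)
            blobs[eid] = f.read(EXPERT_BYTES)
    return blobs


def demo():
    """Build a throwaway expert file and load K=4 experts for one token."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "experts.bin")
        write_demo_file(path)
        return load_experts(path, [1, 5, 6, 2])
```

Calling `demo()` returns a dict holding just the four requested blobs. Leaving caching to the OS keeps the loader small, at the cost of giving up control over which experts stay in memory.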

Model on disk: 209GB, 4-bit quantized

48GB RAM: 42GB OS + cache · 5.5GB weights · 200MB scratch
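The memory figures above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: it assumes decimal gigabytes, that the 5.5GB of weights are the part kept resident, and (optimistically) that the whole 42GB OS + cache region could hold model data.

```python
# All sizes in decimal GB, taken from the figures quoted in the article.
model_on_disk = 209.0
ram_total = 48.0
partition = {
    "os_and_cache": 42.0,
    "weights": 5.5,    # assumed to be the always-resident weights
    "scratch": 0.2,    # 200MB
}

used = sum(partition.values())
headroom = ram_total - used
model_to_ram = model_on_disk / ram_total

# Upper bound: assumes every byte of the OS + cache region caches model data.
cacheable_fraction = partition["os_and_cache"] / model_on_disk

print(f"partition total: {used:.1f} GB of {ram_total:.0f} GB")
print(f"headroom: {headroom:.1f} GB")
print(f"model is {model_to_ram:.2f}x larger than RAM")
print(f"at most {cacheable_fraction:.0%} of the model can be cached at once")
```

The three regions add up to 47.7GB, so the partition accounts for nearly all of the 48GB, while the model on disk is more than four times the size of RAM.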

Speed vs. quality, by quantization

3.90 tokens/sec · 4-bit baseline

4.36 tokens/sec · 4-bit + FMA (tool calls work)

5.74 tokens/sec · 2-bit (JSON / tools break)

2-bit is fastest, but only 4-bit + FMA keeps tool calling reliable.
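To put the three throughput numbers on a common footing, the short sketch below converts each to milliseconds per token and to a percentage gain over the 4-bit baseline. It uses only the figures quoted above; the function name `summarize` is illustrative.

```python
# Tokens/sec figures as quoted in the article.
rates = {"4-bit baseline": 3.90, "4-bit + FMA": 4.36, "2-bit": 5.74}


def summarize(rates, baseline_key):
    """Return (name, ms_per_token, percent_gain_over_baseline) rows."""
    base = rates[baseline_key]
    rows = []
    for name, tps in rates.items():
        ms_per_token = 1000.0 / tps
        gain_pct = (tps / base - 1.0) * 100.0
        rows.append((name, ms_per_token, gain_pct))
    return rows


for name, ms, gain in summarize(rates, "4-bit baseline"):
    print(f"{name:15s} {ms:6.1f} ms/token  {gain:+5.1f}% vs baseline")
```

On these numbers the FMA kernel is worth roughly 12% over the 4-bit baseline, and 2-bit roughly 47%, which is the speed that has to be given up to keep tool calling working.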

How it works

SSD expert streaming pipeline

SSD read (17.5 GB/s sequential) → Load K=4 experts + shared expert per token → FMA Metal kernel (dequant + matmul fused) → Generate token (~400 GB/s unified mem)

60 layers · 512 experts each · no llama.cpp, no MLX, no Python — pure C / Objective-C + Metal.
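The layer and expert counts make it possible to estimate how much data one token pulls from the SSD. This is a rough sketch under stated assumptions: all routed experts are the same size, the 5.5GB of weights held in RAM are the non-expert part of the 209GB file, the shared expert is resident, and every routed expert is read cold from SSD at the quoted sequential rate.

```python
# Figures quoted in the article (decimal GB).
model_gb = 209.0
resident_gb = 5.5        # assumed: non-expert weights kept in RAM
layers = 60
experts_per_layer = 512
k_active = 4             # routed experts loaded per layer per token
ssd_gb_per_s = 17.5

# Assumed uniform expert size across all layers.
expert_gb = (model_gb - resident_gb) / (layers * experts_per_layer)

# Worst case: every selected expert misses the page cache.
per_token_gb = layers * k_active * expert_gb
read_seconds = per_token_gb / ssd_gb_per_s
ceiling_tok_per_s = 1.0 / read_seconds

print(f"expert size:     {expert_gb * 1000:.2f} MB")
print(f"read per token:  {per_token_gb:.2f} GB")
print(f"SSD-bound limit: {ceiling_tok_per_s:.1f} tok/s")
```

Under these assumptions the SSD-bound limit comes out near 11 tokens/sec. The reported 4.36 tokens/sec sits below that, which is consistent with a pipeline that has costs beyond raw sequential read bandwidth.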

Why it matters

Production-quality tool calling on a laptop alone
Opens the door to on-device MoE execution
Privacy-focused, low-latency local workloads
An iPhone 17 Pro fork has already appeared

PoC-stage limits

2-bit quantization breaks JSON output and tool calling
