Modulum · BABILong measurement · decode-rate finding

Modulum decodes 17–22 % faster than vanilla Gemma-4-31B-Q4 on qa3 (3-fact reasoning).

Same base model weights, same prompts, same temperature. The difference under test is Hypernym's platform layer on top of llama.cpp; section 03 lists the full configuration, including the max-token cap, which also differs between the two runs. Modulum shows a measurable decode speedup on qa3 (3-fact temporal reasoning) at every context length we tested.

Cell	Speedup	Modulum decode	Vanilla decode
qa3 32k	+21.6 %	49.5 tok/s	40.7 tok/s
qa3 64k	+21.0 %	45.9 tok/s	38.0 tok/s
qa3 128k	+16.9 %	40.2 tok/s	34.4 tok/s

01 · Decode rate — all 9 cells

Where the speedup shows up and where it doesn't.

Median tokens-per-second from llama.cpp's timings block, captured natively by both endpoints. Modulum is faster on every qa3 cell; on qa2 it is slightly slower (−2 % to −7 %); on qa1 at short context it is markedly slower, but that data was captured during a phase-1 endpoint-load incident (78 HTTP 503 retries), so the in-flight 2026-05-18 reruns should give us a clean qa1 baseline.

Cell	Modulum decode	Vanilla decode	Speedup	Note
qa1 32k	35.1 tok/s	50.4 tok/s	−30.3 %	Phase-1 (2026-05-13) — captured during endpoint-load incident
qa1 64k	33.6 tok/s	41.5 tok/s	−18.9 %	Phase-1 — same caveat
qa1 128k	37.1 tok/s	35.9 tok/s	+3.2 %	Phase-1 + 503-storm retries
qa2 32k	39.5 tok/s	40.4 tok/s	−2.2 %	Phase-3 clean
qa2 64k	35.1 tok/s	37.6 tok/s	−6.8 %	Phase-3 clean
qa2 128k	32.7 tok/s	34.9 tok/s	−6.4 %	Phase-3 clean
qa3 32k	49.5 tok/s	40.7 tok/s	+21.6 %	Phase-3 clean — load-bearing
qa3 64k	45.9 tok/s	38.0 tok/s	+21.0 %	Phase-3 clean — load-bearing
qa3 128k	40.2 tok/s	34.4 tok/s	+16.9 %	Phase-3/5/10 clean — load-bearing
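
The per-cell figures above are medians over per-request decode rates. A minimal sketch of that aggregation follows, assuming each request has already been reduced to a `(cell, tokens_per_second)` pair. The record shape and the function name `median_rate_per_cell` are hypothetical, and the sample rates are made up for illustration.

```python
from collections import defaultdict
from statistics import median


def median_rate_per_cell(samples):
    """Group per-request decode rates by cell and return the median for each.

    `samples` is an iterable of (cell, tokens_per_second) pairs. This record
    shape is an assumption made for the example, not the harness's format.
    """
    by_cell = defaultdict(list)
    for cell, rate in samples:
        by_cell[cell].append(rate)
    return {cell: median(rates) for cell, rates in by_cell.items()}


# Made-up rates for illustration only (not measured data).
example = [
    ("qa3 32k", 48.0),
    ("qa3 32k", 50.0),
    ("qa3 32k", 49.0),
    ("qa3 64k", 45.0),
    ("qa3 64k", 47.0),
]
print(median_rate_per_cell(example))
```

A median is less sensitive than a mean to a handful of slow requests, which matters when an endpoint is intermittently overloaded.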

Phase-1 caveat (qa1 cells): The Modulum qa1 cells were our first runs, on 2026-05-13. During the qa1 128k run the Modulum endpoint (single-slot, under sustained load) returned HTTP 503 "backend busy" errors for 78 of 100 requests. The −19 % to −30 % qa1 prefill+decode numbers at 32k and 64k, compared to vanilla, likely reflect endpoint-level load at that time, not inherent Modulum platform overhead. The qa2/qa3 cells ran 24 hours later on the same endpoint with zero errors. Fresh Modulum re-runs for idx 0..49 on all 9 cells are in flight now (PID 62924); we expect a clean 2026-05-18 baseline for qa1 in about 8 hours.
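
For context on how 503 counts like the one above can be tracked, here is a hedged sketch of a retry wrapper that records how many "backend busy" responses a request saw. The source does not describe how the measurement harness retried, so everything here is an assumption: `call_with_retry`, the `send` callable, and the backoff schedule are all hypothetical.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call `send()` until it returns a non-503 status or attempts run out.

    `send` is a hypothetical zero-argument callable returning (status, body).
    Returns (status, body, busy_count), where busy_count is the number of
    HTTP 503 responses seen, so it can be logged next to the timing data.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    busy_count = 0
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body, busy_count
        busy_count += 1
        if attempt < max_attempts - 1:
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            sleep(base_delay_s * 2 ** attempt)
    return status, body, busy_count
```

Storing `busy_count` per request would make it possible to exclude or flag samples taken while the endpoint was saturated.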

02 · The pattern at a glance

The speedup shows up on qa3, the most fact-intensive task, and not on qa1 or qa2.

Decode-rate delta (Modulum − Vanilla) ÷ Vanilla, per cell:

Cell	Delta
qa1 32k	−30.3 %
qa1 64k	−18.9 %
qa1 128k	+3.2 %
qa2 32k	−2.2 %
qa2 64k	−6.8 %
qa2 128k	−6.4 %
qa3 32k	+21.6 %
qa3 64k	+21.0 %
qa3 128k	+16.9 %
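
The delta defined above, (Modulum − Vanilla) ÷ Vanilla, can be sketched as a small helper. The function name `relative_delta_pct` is hypothetical; the inputs are the rounded qa3 medians from the decode-rate table in section 01.

```python
def relative_delta_pct(modulum_tps, vanilla_tps):
    """Return (Modulum - Vanilla) / Vanilla as a percentage."""
    if vanilla_tps <= 0:
        raise ValueError("vanilla_tps must be positive")
    return (modulum_tps - vanilla_tps) / vanilla_tps * 100.0


# Rounded medians for the qa3 cells, copied from the section 01 table.
qa3 = {"32k": (49.5, 40.7), "64k": (45.9, 38.0), "128k": (40.2, 34.4)}
for ctx, (modulum, vanilla) in qa3.items():
    print(f"qa3 {ctx}: {relative_delta_pct(modulum, vanilla):+.1f} %")
```

Recomputing from the rounded medians can differ from the published figure in the last digit (qa3 64k comes out near +20.8 % rather than +21.0 %), presumably because the published percentages were derived from unrounded rates.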

Hypothesis: Modulum's attention conditioning produces a tighter next-token probability distribution on multi-fact reasoning. Tighter distributions would concentrate decode-time compute on a smaller candidate set, shortening outputs and producing the +17–22 % qa3 speedup. On retrieval-style qa1, the conditioning doesn't help and may add a small overhead. We have NOT yet confirmed this hypothesis architecturally; it is offered only because it is consistent with the observed pattern.
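
To make "tighter distribution" concrete, one common measure is the Shannon entropy of the next-token distribution: lower entropy means more probability mass on fewer candidates. The sketch below is only an illustration of that measure. The source does not say how tightness would be quantified, the logits are made up, and nothing here tests the hypothesis.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Made-up logits over a 4-token vocabulary, for illustration only.
flat = softmax([1.0, 1.0, 1.0, 1.0])    # mass spread evenly
peaked = softmax([8.0, 1.0, 1.0, 1.0])  # mass concentrated on one token

print(f"flat:   {entropy_bits(flat):.3f} bits")
print(f"peaked: {entropy_bits(peaked):.3f} bits")
```

Comparing per-token entropy between the two stacks on the same prompts would be one way to check whether the distributions actually differ, assuming both endpoints can return logits or token probabilities.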

03 · How this was measured

Same weights, same prompts, same temperature. The inference stack differs, and so does the max-token cap (256 for Modulum, 64 for vanilla), as the table below shows.

Variable	Modulum	Vanilla
Base model	Gemma-4-31B-it	Gemma-4-31B-it (same)
Quantization	Q4_K_M	Q4_K_M (same)
Inference runtime	llama.cpp + Modulum platform	llama.cpp /completion (raw)
Endpoint	`gemma4.hypernym.ai/v1/chat/completions`	`35.192.66.207:9011/completion` (Hypernym mirror)
Temperature	0	0
Max tokens	256	64
Dataset	`RMT-team/babilong-1k-samples`	`RMT-team/babilong-1k-samples`
Sample indexes (mask comparison)	idx 0..49 (50 prompts per cell)	idx 0..49 (same 50 prompts)
Timing source	llama.cpp `timings.predicted_ms`	llama.cpp `timings.predicted_ms`
Tokens/sec computed as	tokens_predicted · 1000 / predicted_ms	same
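
The last two rows of the table can be sketched as a small function over a parsed response. The field names `tokens_predicted` and `timings.predicted_ms` come from the table; placing `tokens_predicted` at the top level of the JSON is an assumption, and the function name and sample values are hypothetical.

```python
def decode_tokens_per_second(response):
    """Compute tokens_predicted * 1000 / timings.predicted_ms.

    `response` is a parsed JSON dict. The top-level location of
    `tokens_predicted` is an assumption and may differ between endpoints.
    """
    tokens = response["tokens_predicted"]
    predicted_ms = response["timings"]["predicted_ms"]
    if predicted_ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return tokens * 1000.0 / predicted_ms


# Hypothetical response fragment; values are illustrative, not measured.
sample = {"tokens_predicted": 64, "timings": {"predicted_ms": 1600.0}}
print(decode_tokens_per_second(sample))  # 40.0
```

Because `predicted_ms` covers generation only, this rate excludes prompt processing time.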