Modulum · BABILong measurement · decode-rate finding

Modulum decodes 17–22 % faster than vanilla Gemma-4-31B-Q4 on qa3, the 3-fact reasoning task.

Same base model weights, same prompts, same temperature. The only difference is Hypernym's platform layer on top of llama.cpp. Modulum's attention conditioning produces a measurable speedup on qa3 (3-fact temporal reasoning) at every context length we tested.

qa3 32k: +21.6 % (Modulum 49.5 tok/s · Vanilla 40.7 tok/s)
qa3 64k: +21.0 % (Modulum 45.9 tok/s · Vanilla 38.0 tok/s)
qa3 128k: +16.9 % (Modulum 40.2 tok/s · Vanilla 34.4 tok/s)

Where the speedup shows up and where it doesn't.

Figures are median tokens per second from llama.cpp's timings block, captured natively by both endpoints. Modulum is faster on qa3 at all three context lengths. On qa2 it is 2–7 % slower. On qa1 it is slower at 32k and 64k, but that data was captured during a phase-1 endpoint-load incident (78 HTTP 503 retries), so the in-flight 2026-05-18 reruns are needed for a clean qa1 baseline.

| Cell | Modulum decode | Vanilla decode | Speedup | Note |
|---|---|---|---|---|
| qa1 32k | 35.1 tok/s | 50.4 tok/s | −30.3 % | Phase-1 (2026-05-13), captured during endpoint-load incident |
| qa1 64k | 33.6 tok/s | 41.5 tok/s | −18.9 % | Phase-1, same caveat |
| qa1 128k | 37.1 tok/s | 35.9 tok/s | +3.2 % | Phase-1 + 503-storm retries |
| qa2 32k | 39.5 tok/s | 40.4 tok/s | −2.2 % | Phase-3 clean |
| qa2 64k | 35.1 tok/s | 37.6 tok/s | −6.8 % | Phase-3 clean |
| qa2 128k | 32.7 tok/s | 34.9 tok/s | −6.4 % | Phase-3 clean |
| qa3 32k | 49.5 tok/s | 40.7 tok/s | +21.6 % | Phase-3 clean, load-bearing |
| qa3 64k | 45.9 tok/s | 38.0 tok/s | +21.0 % | Phase-3 clean, load-bearing |
| qa3 128k | 40.2 tok/s | 34.4 tok/s | +16.9 % | Phase-3/5/10 clean, load-bearing |

Phase-1 caveat (qa1 cells): The Modulum qa1 cells were the first runs we did, on 2026-05-13. During those runs the Modulum endpoint, a single-slot endpoint under sustained load, returned HTTP 503 "backend busy" errors (78 of 100) on qa1 128k. The −19 % to −30 % qa1 decode-rate deficits relative to vanilla likely reflect endpoint-level load at that time, not inherent Modulum platform overhead. The qa2/qa3 cells ran 24 hours later on the same endpoint with zero errors. Fresh Modulum re-runs for idx 0..49 on all 9 cells are in flight now (PID 62924); we expect a clean 2026-05-18 qa1 baseline in about 8 hours.
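
Because 503 retries are what contaminated the phase-1 cells, it may help to record the retry count next to every timing sample, so that loaded runs can be flagged afterwards. The following is a minimal sketch of that idea, not the harness actually used here: `call_with_retries`, `make_flaky_sender`, the backoff schedule, and the `(status, body)` return shape are all hypothetical names and choices introduced for illustration.

```python
import time


def call_with_retries(send, max_retries=5, base_delay_s=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 503.

    send is assumed to return a (status_code, body) tuple.
    Returns (status_code, body, retries_used) so the caller can log
    how many 503s were absorbed for this sample.
    """
    retries = 0
    while True:
        status, body = send()
        if status != 503:
            return status, body, retries
        if retries >= max_retries:
            raise RuntimeError(f"still 503 after {retries} retries")
        # Exponential backoff: 1x, 2x, 4x, ... the base delay.
        sleep(base_delay_s * 2 ** retries)
        retries += 1


def make_flaky_sender(failures):
    """Hypothetical stand-in for an endpoint: 503 `failures` times, then 200."""
    remaining = [failures]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            return 503, "backend busy"
        return 200, "ok"

    return send


if __name__ == "__main__":
    # The sleep is replaced with a no-op so the demo finishes instantly.
    status, body, retries = call_with_retries(
        make_flaky_sender(2), sleep=lambda seconds: None
    )
    print(status, body, "retries:", retries)
```

A per-cell sum of `retries_used` would give a simple way to separate clean cells from those measured under load.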

The speedup appears on qa3, the hardest task tested, and not on qa1 or qa2.

Decode-rate delta (Modulum − Vanilla) ÷ Vanilla, per cell:

| Task | 32k | 64k | 128k |
|---|---|---|---|
| qa1 | −30.3 % | −18.9 % | +3.2 % |
| qa2 | −2.2 % | −6.8 % | −6.4 % |
| qa3 | +21.6 % | +21.0 % | +16.9 % |
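
The delta formula can be checked against the published medians with a few lines of Python. This is only a sketch: the function and variable names are illustrative, and the inputs are the rounded tok/s figures from the results table. Because those inputs are rounded to one decimal, the recomputed percentages can differ from the published ones by a few tenths of a percentage point.

```python
# Rounded median decode rates (tok/s) and published speedups (%),
# copied from the results table: (Modulum, Vanilla, published speedup).
CELLS = {
    "qa1 32k": (35.1, 50.4, -30.3),
    "qa1 64k": (33.6, 41.5, -18.9),
    "qa1 128k": (37.1, 35.9, 3.2),
    "qa2 32k": (39.5, 40.4, -2.2),
    "qa2 64k": (35.1, 37.6, -6.8),
    "qa2 128k": (32.7, 34.9, -6.4),
    "qa3 32k": (49.5, 40.7, 21.6),
    "qa3 64k": (45.9, 38.0, 21.0),
    "qa3 128k": (40.2, 34.4, 16.9),
}


def speedup_pct(modulum_tps, vanilla_tps):
    """Relative decode-rate delta in percent: (Modulum - Vanilla) / Vanilla * 100."""
    return (modulum_tps - vanilla_tps) / vanilla_tps * 100.0


def recompute(cells):
    """Recompute the speedup for every cell from its two rounded rates."""
    return {name: speedup_pct(m, v) for name, (m, v, _published) in cells.items()}


if __name__ == "__main__":
    for name, value in recompute(CELLS).items():
        published = CELLS[name][2]
        print(f"{name:9s} recomputed {value:+6.1f} %   published {published:+6.1f} %")
```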

Hypothesis: Modulum's attention conditioning produces a tighter probability distribution over the next token on multi-fact reasoning. Tighter distributions concentrate decode-time compute on a smaller candidate set, shortening outputs and producing the +17–22 % qa3 speedup. On retrieval-style qa1, the conditioning doesn't help and may add a small overhead. We cannot yet confirm this hypothesis architecturally; it is offered only because it is consistent with the observed pattern.

Same weights, same prompts, same temperature. The inference stack differs; so do the endpoint API and the max-tokens cap, as listed below.

| Variable | Modulum | Vanilla |
|---|---|---|
| Base model | Gemma-4-31B-it | Gemma-4-31B-it (same) |
| Quantization | Q4_K_M | Q4_K_M (same) |
| Inference runtime | llama.cpp + Modulum platform | llama.cpp /completion (raw) |
| Endpoint | gemma4.hypernym.ai/v1/chat/completions | 35.192.66.207:9011/completion (Hypernym mirror) |
| Temperature | 0 | 0 |
| Max tokens | 256 | 64 |
| Dataset | RMT-team/babilong-1k-samples | RMT-team/babilong-1k-samples |
| Sample indexes (mask comparison) | idx 0..49 (50 prompts per cell) | idx 0..49 (same 50 prompts) |
| Timing source | llama.cpp timings.predicted_ms | llama.cpp timings.predicted_ms |
| Tokens/sec computed as | tokens_predicted · 1000 / predicted_ms | same |
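
The tokens/sec definition and the per-cell median can be sketched as follows. The response layout assumed here (a top-level `tokens_predicted` plus a nested `timings.predicted_ms`) follows the field names in the table. The helper names and the sample values are hypothetical, and the chat-completions endpoint may expose the token count under a different key.

```python
from statistics import median


def decode_tok_per_sec(tokens_predicted, predicted_ms):
    """Decode rate in tokens per second: tokens_predicted * 1000 / predicted_ms."""
    if predicted_ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return tokens_predicted * 1000.0 / predicted_ms


def cell_median(responses):
    """Median decode rate over all responses collected for one cell."""
    rates = [
        decode_tok_per_sec(r["tokens_predicted"], r["timings"]["predicted_ms"])
        for r in responses
    ]
    return median(rates)


# Hypothetical responses, not measured data.
SAMPLE = [
    {"tokens_predicted": 64, "timings": {"predicted_ms": 1600.0}},  # 40 tok/s
    {"tokens_predicted": 32, "timings": {"predicted_ms": 1000.0}},  # 32 tok/s
    {"tokens_predicted": 50, "timings": {"predicted_ms": 1000.0}},  # 50 tok/s
]

if __name__ == "__main__":
    print(f"median decode rate: {cell_median(SAMPLE):.1f} tok/s")
```

Using the median rather than the mean keeps a single slow sample, for example one that waited behind a busy slot, from dominating a cell.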