8-bit quantization on a mixture-of-experts LLM: does protecting key layers help?

Key findings

We built two 8-bit quantizations of the same 122B mixture-of-experts model, differing only in which layers stay at full 16-bit precision, and benchmarked them head-to-head on an agentic tool-calling task: same hardware, same settings, 456 simulations each.
The higher-precision "agentic" build, which keeps embeddings and attention projections at bf16 following DeepSeek-V3's validated mixed-precision recipe, never scored higher on any metric: it was lower on six of seven quality metrics (first-try success −1.8 points, write-action accuracy −2.4, database match −0.9) and tied on the seventh, Pass^4.
The extra precision's only measurable effect was a cost: a 30% slower prefill that contributed five additional infrastructure timeout errors. There was no quality gain on any metric.
Why: at 8-bit the per-weight error is roughly 0.4%, and for continuous-valued tensors (embeddings, attention) the model's residual connections, normalization, and softmax absorb it. The DeepSeek recipe was validated for FP8 training, where error compounds over millions of updates, not for a single inference pass.
The one protection that survived scrutiny: keeping the MoE routing gates at bf16, because a tiny gate error is discontinuous and can flip which expert fires. Both builds keep the gates protected, and both beat a naive uniform 8-bit quantization.

Figure: The two builds, side by side. Orange blocks stay at full bf16 precision; everything else is quantized to 8-bit. The base build keeps only the routing gates at bf16, while the agentic build also keeps the embeddings, the attention projections in the 12 full-attention layers, and the output head at bf16.

Why this study exists

Teams quantize large models to run them locally, whether on a workstation, on-premises, or inside a compliance boundary, and when they do they inherit precision recipes from the research literature. A landmark technical report says to keep certain components at higher precision, a community guide repeats it, and the recipe ships without anyone re-checking whether it applies to their model, their bit width, and their workload. It sounds principled, so it goes in.

This is a re-check. We took a widely cited mixed-precision recipe from open large-model work, DeepSeek-V3's FP8 framework, validated by the team that trained a 671B mixture-of-experts model, and tested whether it improves an 8-bit inference build on a real agentic task. It didn't, and it cost speed. A negative result like that is worth publishing, because it is what stops a team from spending bits, latency, and engineering effort on an optimization that reads well on paper and does nothing in production. This is one operator measuring its own system, so read the limitations before generalizing. But the comparison is clean, and the lesson travels.

What quantization and "experts" actually mean

Two pieces of background, in plain terms, because the rest of the paper rests on them.

Quantization compresses a model's weights from 16-bit floating-point numbers down to smaller representations, such as 8-bit or 4-bit, to cut memory use and speed up inference. The trade-off is precision: each weight is stored less exactly, which can degrade output quality. The question is always how much precision you can give up before quality moves, and where the model is sensitive to that loss.

Mixture-of-experts (MoE) is the architecture of the model we tested, Qwen3.5-122B-A10B. Each layer holds 256 specialist sub-networks ("experts"), but a small router picks only 8 of them to run for any given token. So a model with 122 billion total parameters does the work of a roughly 10-billion-parameter model on each token, which is why it fits and runs on a single Mac: quantized to 8-bit it is about 117 GB on disk, and a single Apple M3 Ultra has 256 GB of unified memory. That router, the gate that decides which experts fire, is the part of this story that matters most.

What we built

Two variants of the same model, identical except for which tensors keep their full 16-bit (bf16) precision instead of being quantized to 8-bit.

Build	Size	What stays at full bf16 precision
Base	8.25 bits per weight · 117 GB	Everything is 8-bit except the MoE routing gates, the tiny tensors that decide which 8 of 256 experts fire per token. They stay at bf16 because quantizing them can flip which expert activates, a discontinuous error.
Agentic	8.41 bits per weight · 120 GB	Everything in the base build, plus: the embedding matrix and output head (which map between token IDs and vectors, where special tokens like `<tool_call>` live), and all four attention projections (Q, K, V, O) in the 12 full-attention layers (whose K and V values get baked into the KV cache and read back at every later token).
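The sizes in the table follow from the bits-per-weight figures by simple arithmetic. The sketch below is a sanity check rather than part of our pipeline: it assumes exactly 122 billion weights (the true tensor count differs slightly) and assumes the reported "GB" are binary gigabytes (GiB), which is what makes the numbers line up.

```python
# Assumption: exactly 122e9 weights; the real parameter count differs slightly.
TOTAL_PARAMS = 122e9
GIB = 2**30  # assumption: the reported "GB" are binary gigabytes


def size_gib(bits_per_weight, params=TOTAL_PARAMS):
    """On-disk size in GiB implied by an average bits-per-weight figure."""
    return params * bits_per_weight / 8 / GIB


base = size_gib(8.25)
agentic = size_gib(8.41)
print(f"base    ~{base:.1f} GiB")
print(f"agentic ~{agentic:.1f} GiB")
print(f"extra   ~{agentic - base:.1f} GiB spent on the bf16 protections")
```

Under those assumptions both results land within about a gigabyte of the table, so the 0.16 extra bits per weight cost only a couple of gigabytes on disk. The price of the protections is latency, not storage.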

The agentic build's extra protections aren't arbitrary. They follow DeepSeek-V3's mixed-precision architecture (arXiv:2412.19437), where the team that built a 671B MoE model empirically determined that these exact components need higher precision when training in FP8. The hypothesis we set out to test was straightforward: those protections should produce better tool-calling accuracy, because embeddings at bf16 preserve the distinction between structurally meaningful tokens, attention projections at bf16 write more accurate KV-cache entries that persist across the whole generation, and the whole configuration is the one a 671B-scale team already validated.

The benchmark

We measured tool-calling accuracy on τ²-bench retail, a benchmark from Sierra Research that builds on the original τ-bench (arXiv:2406.12045) and simulates customer-service conversations between an AI agent and a simulated customer. The agent has to use API tools (look up orders, modify accounts, process returns) while following domain policies, and the benchmark verifies the actual database state after the conversation, not just whether the agent said the right words. That makes it a genuine test of whether the model can act correctly through tools, which is exactly the workload quantization is supposed to leave intact.

107 tasks, 456 simulations per model (multiple trials per task).
User simulator: GPT-4o-mini via the OpenAI API.
Agent: the local model on the M3 Ultra, served through LM Studio.
Identical settings for both variants.

On comparability: published τ²-bench leaderboard numbers use GPT-4o as the user simulator; we used the cheaper GPT-4o-mini, so our absolute scores are not comparable to leaderboard figures. That doesn't matter for this study. The only comparison that counts here is base versus agentic, run on the same simulator, hardware, and benchmark configuration, and that comparison is clean.

The results

The higher-precision build never won a metric: it lost on six of the seven quality metrics and tied on Pass^4. No single gap was large, but the direction never reversed.

Metric	Base (8.25 BPW)	Agentic (8.41 BPW)	Delta
Pass^1 (first-try success)	51.1%	49.3%	−1.8
Pass^2 (consistent over 2 trials)	39.9%	37.7%	−2.2
Pass^4 (consistent over 4 trials)	28.9%	28.9%	0.0
Average reward	0.597	0.584	−0.013
Read actions (API lookups)	91.8%	89.2%	−2.6
Write actions (API mutations)	62.1%	59.7%	−2.4
DB match (final state correct)	61.8%	60.9%	−0.9
Infra errors	66	71	+5
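For readers unfamiliar with the Pass^k rows, the sketch below shows how such a score can be computed. It assumes τ²-bench uses the estimator defined in the τ-bench paper: for each task with c successes in n trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function name and the toy outcomes are ours and are not benchmark data.

```python
from math import comb


def pass_hat_k(results, k):
    """Mean over tasks of C(c, k) / C(n, k), for c successes in n trials."""
    per_task = []
    for successes, trials in results:
        if trials < k:
            raise ValueError("each task needs at least k trials")
        per_task.append(comb(successes, k) / comb(trials, k))
    return sum(per_task) / len(per_task)


# Hypothetical (successes, trials) for three tasks, for illustration only.
toy = [(4, 4), (2, 4), (0, 4)]
for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k(toy, k):.3f}")
```

The metric can only fall as k grows, because a task that succeeds on some trials but not all contributes less and less. That is why Pass^4 sits well below Pass^1 for both builds.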

Figure: Tool-calling accuracy, base vs the higher-precision build, across five tool-calling metrics. The agentic build is lower or equal on every one; no bar moved in its favor.

And the protections weren't free. Keeping the attention projections at bf16 turned fast quantized integer matrix multiplies into slower floating-point ones, with a sharply different cost at prefill versus decode:

Metric	Base	Agentic	Delta
Decode (tok/s)	37.8	35.8	−5.3%
Prefill (tok/s)	38.8	27.0	−30.4%

Figure: The cost of the bf16 attention projections: a modest decode hit, a steep prefill hit. Decode throughput falls 5.3% and prefill throughput falls 30.4% for the higher-precision build.

So the agentic build matched or trailed the base build on every quality metric and ran 30% slower at prefill, a slowdown that contributed five extra infrastructure timeouts (66 → 71) as more conversations brushed up against the benchmark's per-turn time window. The bf16 precision bumps produced no quality gain at 8-bit. Their only measurable effect was the slowdown.

Why the precision bumps changed nothing

At 8-bit quantization, the error introduced into each weight is approximately 0.4% of the range being quantized: one step of a 256-level grid is 1/256 ≈ 0.39% of that range, and round-to-nearest keeps the error within half a step. For continuous-valued tensors such as embeddings and attention projections, an error that small is already absorbed downstream: the model's residual connections add the original signal back around each block, layer normalization re-centers the activations, and softmax flattens small perturbations before they reach an output token. By the time a 0.4% wobble in an attention projection would change which token the model emits, three different mechanisms have already smoothed it out. Spending extra bits to make those tensors more exact is spending precision the architecture was already going to recover for free.
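The 0.4% figure can be checked with a few lines of arithmetic. The sketch below assumes a simple affine, round-to-nearest quantizer over the min-to-max range of one group of weights; the quantizer in our pipeline works per group with a scale and offset, but the details may differ from this toy. The group of 64 Gaussian weights is hypothetical.

```python
import random


def quantize_roundtrip(values, bits=8):
    """Affine round-to-nearest quantization over [min, max], then dequantize."""
    lo, hi = min(values), max(values)
    levels = 2**bits - 1          # number of steps between the two extremes
    scale = (hi - lo) / levels    # width of one quantization step
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes], hi - lo


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # hypothetical group
restored, span = quantize_roundtrip(weights)

step = 1 / (2**8 - 1)
worst = max(abs(a - b) for a, b in zip(weights, restored)) / span
print(f"one step    = {step:.3%} of the range")
print(f"worst error = {worst:.3%} of the range")
```

One step is about 0.39% of the range, and the worst rounding error is bounded by half of that, so 0.4% is best read as an upper bound on the per-weight error rather than its typical size.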

Why the prefill got slower

The slowdown, unlike the quality, is mechanical and predictable. In the 12 full-attention layers, the Q/K/V/O projections now run as bf16 matrix multiplications instead of quantized int8 ones. During prefill, every token in the prompt is pushed through those projections at once, so the cost scales with prompt length × projection size × number of protected layers, and that is where the 30% appears. During single-token decode, the same projections process one token at a time, so the same bf16 penalty costs only 5.3%. The extra infrastructure errors follow directly: a slower prefill on every turn of a multi-turn conversation pushes more runs past the timeout window.
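To see how a throughput gap turns into timeouts, the sketch below recomputes the deltas from the throughput table and converts them into wall-clock prefill time. The 2,000-token prompt is a hypothetical length, and treating throughput as constant across prompt lengths is a simplification.

```python
# Measured throughput in tokens per second, taken from the table above.
PREFILL_TOK_S = {"base": 38.8, "agentic": 27.0}
DECODE_TOK_S = {"base": 37.8, "agentic": 35.8}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


def prefill_seconds(prompt_tokens, build):
    """Wall-clock prefill time, assuming throughput does not vary with length."""
    return prompt_tokens / PREFILL_TOK_S[build]


print(f"prefill delta: {pct_change(*PREFILL_TOK_S.values()):.1f}%")
print(f"decode delta:  {pct_change(*DECODE_TOK_S.values()):.1f}%")

prompt_tokens = 2000  # hypothetical prompt length for one turn
for build in PREFILL_TOK_S:
    print(f"{build}: {prefill_seconds(prompt_tokens, build):.0f} s of prefill")
```

Because the whole conversation history is prefilled again on every turn, the absolute gap grows with each turn. That is consistent with a handful of extra runs crossing a fixed per-turn time window.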

Why this matters

Three things follow from the comparison, and they're the reason a negative result like this is worth publishing.

The DeepSeek precedent didn't transfer, and the reason is clean. DeepSeek-V3's team validated those higher-precision components for FP8 training, where numerical errors compound over millions of gradient updates and a small bias in one step becomes a large drift many steps later. Inference is a single forward pass; the error has nowhere to accumulate. What is critical during training is not automatically critical during inference quantization, and treating a training-time recipe as an inference-time rule is how a well-motivated optimization becomes dead weight.
The one 8-bit protection that survived scrutiny is the routing gates. They are the single component where quantization error is discontinuous: a tiny numerical change in a gate weight can flip which of 256 experts fires for a token. That is a qualitatively wrong choice (the wrong specialist) rather than a quantitatively worse one (the right specialist, marginally less precise). Both builds keep the gates at bf16, and both beat a naive uniform 8-bit quantization. The practical rule is to spend precision where the error is discontinuous, not where it is continuous and self-correcting.
At 8-bit, the simpler build was the better engineering decision. The quantization is already near-lossless for continuous-valued tensors, so a mixed-precision optimization that is well-motivated in the literature, and genuinely validated at training time, produced no measurable benefit at inference and cost real latency. Shipping the smaller, faster model would have been the correct call. None of that is specific to this model: a best practice borrowed from one regime has to be re-validated in yours before it earns a place in production.
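The difference between a discontinuous and a continuous error is easy to show on toy numbers. The sketch below is an illustration, not the model's router: it uses four hypothetical expert scores and picks the top two (the real model picks 8 of 256), and for simplicity it applies the 0.4% error to a router score directly instead of to a gate weight.

```python
def top_k(scores, k):
    """Indices of the k highest scores, returned as a set."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


REL_ERR = 0.004  # the roughly 0.4% 8-bit error discussed in the text

# Hypothetical router scores; experts 1 and 2 are nearly tied.
scores = [0.900, 0.501, 0.500, 0.100]
noisy = list(scores)
noisy[1] *= 1 - REL_ERR  # 0.501 drops to about 0.499, just below expert 2

print("experts before:", sorted(top_k(scores, 2)))
print("experts after: ", sorted(top_k(noisy, 2)))

# Continuous contrast: the same error inside an ordinary weighted sum.
values = [1.0, 2.0, 3.0, 4.0]
exact = sum(s * v for s, v in zip(scores, values))
approx = sum(s * v for s, v in zip(noisy, values))
print(f"weighted sum shifts by {abs(approx - exact) / exact:.3%}")
```

The same perturbation that barely moves the weighted sum swaps one selected expert for another. Selection is a threshold, so near a tie any error, however small, changes the outcome completely.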

How to reproduce this

For practitioners quantizing MoE models on Apple Silicon, the implementation details that mattered:

Quantization tool: a custom quantize_streaming.py built on mlx-lm's convert(), with mx.set_default_device(mx.cpu) to avoid macOS Metal GPU watchdog timeouts during conversion.
Batched-expert predicate: Qwen3.5 stores 256 experts per layer as single 3D tensors (shape [256, 2048, 768]). A naive predicate using isinstance(module, nn.Linear) misses them and produces 15.6 BPW instead of 8.25 BPW. The fix is a minimal predicate that quantizes the expert tensors and only skips the routing gates.
Full-attention layer detection: the quantizer reads layer_types from config.json to find the 12 full-attention layers (layers 3, 7, 11, … 47) and protects Q/K/V/O only in those, not globally. The 36 GatedDeltaNet (linear-attention) layers are never touched.
Precision audit: the quantizer emits a per-tensor report of exactly what was protected and why. The agentic build has 146 protected modules — 1 embed_tokens, 1 lm_head, 96 routing gates (48 layers × 2), and 48 attention projections (12 layers × 4).
Known ecosystem limitation: all Qwen3.5 MLX models currently have broken KV-cache reuse because of the hybrid attention architecture (full attention + GatedDeltaNet). This is an mlx-lm issue, not specific to our quantization (GitHub issues #903, #980, #965).
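The protection logic described in the list above can be sketched as a path-based rule plus an audit count. This is not our quantize_streaming.py and it does not call mlx-lm. The module path names, the two gate names per layer, and the layer_types values are assumptions made for illustration; check them against the real checkpoint before reusing anything here.

```python
NUM_LAYERS = 48
# Assumed config.json content: every fourth layer (3, 7, ..., 47) is full attention.
LAYER_TYPES = ["full_attention" if (i + 1) % 4 == 0 else "linear_attention"
               for i in range(NUM_LAYERS)]

GATE_NAMES = ("mlp.gate", "mlp.shared_expert_gate")    # hypothetical names
ATTN_NAMES = ("q_proj", "k_proj", "v_proj", "o_proj")  # hypothetical names


def keep_bf16(path, layer_types, agentic):
    """Return the reason a module stays at bf16, or None if it is quantized."""
    if path.endswith(GATE_NAMES):
        return "routing gate"  # protected in both builds
    if not agentic:
        return None
    if path in ("embed_tokens", "lm_head"):
        return "token mapping"
    parts = path.split(".")
    if parts[0] == "layers" and parts[-1] in ATTN_NAMES:
        if layer_types[int(parts[1])] == "full_attention":
            return "full-attention projection"
    return None


def module_paths():
    """Hypothetical module list; the real model has more modules per layer."""
    yield "embed_tokens"
    yield "lm_head"
    for i, kind in enumerate(LAYER_TYPES):
        for gate in GATE_NAMES:
            yield f"layers.{i}.{gate}"
        yield f"layers.{i}.mlp.switch_mlp"  # batched 3D expert tensors, quantized
        if kind == "full_attention":
            for name in ATTN_NAMES:
                yield f"layers.{i}.self_attn.{name}"


for agentic in (False, True):
    kept = [p for p in module_paths() if keep_bf16(p, LAYER_TYPES, agentic)]
    print("agentic" if agentic else "base", "protected modules:", len(kept))
```

With these assumed names the agentic count comes to 146, matching the audit: 1 + 1 + 96 + 48. Deciding by path instead of by module type is what lets the batched expert tensors fall through to 8-bit, avoiding the 15.6 BPW failure described above.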

Frequently asked questions

Does 8-bit quantization hurt tool-calling accuracy?

In our benchmark, very little. At 8-bit the per-weight error is about 0.4% of each weight's dynamic range, and for continuous-valued tensors that error is absorbed by the model's residual connections, layer normalization, and softmax. Spending extra bits to protect those tensors at full bf16 precision produced no measurable accuracy gain on τ²-bench retail across 456 simulations per model.

Should I keep embeddings and attention layers at higher precision when quantizing to 8-bit?

On our agentic benchmark, no. We built a variant that kept the embedding matrix, output head, and the Q/K/V/O attention projections of the full-attention layers at bf16, following DeepSeek-V3's mixed-precision recipe. It never scored higher on any metric (lower on six of seven, tied on Pass^4) and ran about 30% slower at prefill, with no quality upside. That recipe was validated for FP8 training, where numerical error compounds over millions of gradient updates, not for a single inference forward pass.

What precision matters most for a mixture-of-experts (MoE) model?

The routing gates. They are the one component where quantization error is discontinuous: a tiny change in a gate weight can flip which expert fires for a token, which is a qualitatively wrong choice rather than a slightly-less-accurate one. Keeping the routing gates at full bf16 precision is the protection that earned its place; both builds in our test did so, and both beat a naive uniform 8-bit quantization.

Does the DeepSeek-V3 FP8 recipe apply to inference quantization?

Not directly. DeepSeek-V3's team validated keeping certain components at higher precision for FP8 training, where small numerical errors accumulate across millions of gradient updates. Inference is a single forward pass, so the error does not compound the same way. What is critical during training is not necessarily critical during inference quantization, which is why the recipe produced no benefit when we transferred it to an 8-bit inference build.

Can a 122-billion-parameter model run on a single Mac?

Yes. Qwen3.5-122B-A10B is a mixture-of-experts model with 122 billion total parameters but only about 10 billion active per token. Quantized to 8-bit it occupies roughly 117 GB on disk and runs on a single Apple M3 Ultra Mac Studio with 256 GB of unified memory, served through LM Studio on Apple's MLX library.

Limitations

This is a single operator measuring its own builds, and it should be read that way. We tested one model family (Qwen3.5-122B-A10B), one bit width (8-bit), and one benchmark (τ²-bench retail); the conclusion that protecting continuous-valued tensors buys nothing at 8-bit may not hold at 4-bit or below, where the quantization step is roughly 16 times coarser and the residual stream has less margin to absorb the error. Our user simulator was GPT-4o-mini rather than the leaderboard's GPT-4o, so our absolute scores are not comparable to published τ²-bench numbers; only the base-versus-agentic comparison is. The deltas on individual metrics are small and within the range where trial-to-trial noise matters; the finding we stand behind is the direction (the protections never helped on any metric) and the speed cost (which is mechanical and reproducible), not the precise size of any single quality gap. Inference ran on a self-hosted stack (LM Studio on one M3 Ultra), and the hybrid-attention KV-cache limitation noted above affects both builds equally. We report this as a field comparison; we'd treat the quality numbers as directional until corroborated by other quantization setups, while expecting the prefill-cost mechanism to generalize.
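A rough sense of how much noise surrounds the quality deltas comes from the standard error of a difference between two proportions. The sketch below is a back-of-envelope check, not an analysis we ran on the raw results. It treats the 456 simulations per build as independent, which they are not (several trials share each task), and it ignores runs lost to infrastructure errors.

```python
from math import sqrt

N = 456  # simulations per build, treated here as independent trials


def diff_standard_error(p_a, p_b, n=N):
    """Standard error of the difference between two independent proportions."""
    return sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)


base_pass1, agentic_pass1 = 0.511, 0.493  # Pass^1 from the results table
gap = base_pass1 - agentic_pass1
se = diff_standard_error(base_pass1, agentic_pass1)
print(f"gap = {gap:.3f}, standard error = {se:.3f}, ratio = {gap / se:.2f}")
```

Under those assumptions the 1.8-point Pass^1 gap is smaller than one standard error. That supports reading the quality numbers as "no measurable gain" and not as evidence that the protections actively hurt accuracy.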

In one line: at 8-bit, the only quantization protection that earned its place on this mixture-of-experts model was keeping the routing gates at full precision; the rest was optimization theater, a verdict reached by measuring rather than by theorizing.

How to cite this

Ahmed, F. (2026). Does protecting a quantized model's key layers improve agent accuracy? A mixed-precision 8-bit benchmark on a 122B MoE model. ideius. https://www.ideius.com/papers/agentic-quantization-benchmark/

The figures are derived from a controlled benchmark of two model builds on identical hardware. For questions about methodology or the quantization pipeline, reach out — and if you're choosing how to deploy a model locally and want this kind of measured, evidence-first comparison rather than an inherited recipe, that's where our LLM, RAG, and evaluation work starts.

Key references

DeepSeek-V3 Technical Report (arXiv:2412.19437) — Section 3.3: the FP8 mixed-precision framework, validated at 671B scale, that the agentic build follows.
τ-bench (arXiv:2406.12045) — Sierra Research's benchmark for tool-agent-user interaction, which introduced the database-state verification and Pass^k method τ²-bench builds on.
mlx-lm (github.com/ml-explore/mlx-lm) — Apple's MLX inference and quantization library.
