
CPU-only LLM Inference
In this article, we’ll be putting our second-hand AMD Threadripper 1950X through some inference tests 🔥 - can you already smell overheated plastic? No? That’s because be quiet! hardware is at the heart of our QuietBee 🐝😎
Quick intro - Niche AI trajectory
Alright, let’s take a step back - we’ve built the base for our AI home-lab (called QuietBee), but as long as the GPUs aren’t here yet, QuietBee just sits there without any work 😟. However, while doing some Nichebench testing we had to use some local models, so we ran some random, non-optimized CPU inference on QuietBee - and it was quite okay (yielding 5-10 tokens/s, depending on the model).
But wait - why are we doing this at all, you ask? I got you → we’re on our path to building open-weight Niche AI models - fine-tuned flavors that can perform exceptionally well in niche domains, think: Drupal, WordPress, Laravel, etc. We started the series with this super tiny fine-tuning experiment, and we’re still experimenting.
Alright, now that you’re up to speed, let’s dive into the article. Specs for the rest of the article: we’re running an AMD Threadripper 🧵 1950X with 94 GB of RAM (I need 32 GB more to make it full quad-channel). I also capped the CPU at 3.6 GHz - this is important: I’m NOT running at full speed, the reason being that my PSU can’t take it - it’s due to be replaced next month 🤫
So, our results are RELATIVE - and the article is less about the results and more about the knobs, builds, and overall approach.
Targets: 🧠 What you’re actually optimizing for
For most of this we’ll be using llama.cpp and a CPU-focused fork.
When you build llama.cpp, you get → the llama-bench executable, which you can use to run benchmarks.
There are 3 metrics we can use:
- Prefill (pp) - how fast the model processes your prompt (input). Matters for RAG, retrieval, few-shot, and long-context summarization. In our case we care a lot, since for programming we usually feed in a LOT of context.
- Decode (tg) - how fast the model generates (output). Matters for chatbots, code completion, streaming.
- Mixed (pp+tg) - a combined test that’s closer to real-world workloads (both long prompts and continuous generation).
When turning various knobs on CPU, you’ll usually optimize for one of these metrics - it’s rare that you can bump all three at the same time.
Here’s a sample output:
```
nikro@quietbee:~/projects$ ./ik_llama.cpp/build/bin/llama-bench --model ./models/gpt-oss-20b-Q4_K_M.gguf -t 16 -ngl 0 -pg 256,512
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | CPU        |      16 |         pp512 |     87.92 ± 1.70 |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | CPU        |      16 |         tg128 |     16.25 ± 0.18 |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | CPU        |      16 |   pp256+tg512 |     21.44 ± 0.40 |
```
Also:
- Different models behave differently - in all my tests we used the same model: openai/gpt-oss-20B @ Q4
- Different quantizations also behave differently - to dive deep into the rabbit-hole see this discussion.
- Apparently, pinning which CPU cores / memory nodes your runs use can also improve consistency and stabilize generation speeds.
Process & knobs: 🔧 What we tuned and how
🛠️ Build variants:
These are the various build types - different sources (i.e. original vs. fork) and different build flags set before compilation:
- Vanilla - a simple llama.cpp build - default CPU flags - reference here;
- BLAS / BLIS - enables optimized linear-algebra libraries - same reference as above - sometimes these help token generation (output), but only by a little;
- IK native - ik_llama.cpp built with -march=native (still could not enable FANCY SIMD) - this is a fork of llama.cpp with specialized optimizations here and there;
- … plus various other combinations and flags you can try out (a build sketch follows this list).
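To make those build variants concrete, here’s a minimal sketch of how we’d compile them. Treat it as a hedged starting point: the cmake options (GGML_BLAS, GGML_BLAS_VENDOR) follow recent llama.cpp docs and may be named differently in older versions, and passing -march=native through CMAKE_C_FLAGS / CMAKE_CXX_FLAGS for ik_llama.cpp is our assumption - check each repo’s build docs for the authoritative flags.

```bash
# Hedged build sketch - verify flag names against the docs of the version you clone.

# 1) Vanilla CPU build of llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build-vanilla
cmake --build llama.cpp/build-vanilla --config Release -j

# 2) BLAS build (swap the vendor for BLIS, e.g. -DGGML_BLAS_VENDOR=FLAME)
cmake -S llama.cpp -B llama.cpp/build-blas \
  -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build llama.cpp/build-blas --config Release -j

# 3) ik_llama.cpp fork, forcing -march=native via generic compiler flags
#    (the fork may expose its own toggles - this is just one way to do it)
git clone https://github.com/ikawrakow/ik_llama.cpp
cmake -S ik_llama.cpp -B ik_llama.cpp/build-cpu \
  -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native"
cmake --build ik_llama.cpp/build-cpu --config Release -j
```

Each build drops a llama-bench binary under its build-*/bin/ directory - that’s what we point the benchmarks at below.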
🎛️ Runtime knobs
- Threads – scaling beyond 16 cores didn’t help decode, 14-16 was best.
- NUMA – controls which CPU cores / memory nodes are used.
- uBatch and Batch – playing with these can increase performance a little.
- KV cache – q8_0/f16 was the biggest decode (output) win (though not in ik_llama);
- Flash Attention – modest prefill boost (~3–5%), no decode gain (there are also flavors) - see the combined example right after this list.
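Putting several of those runtime knobs together, a single tuned llama-bench run looks roughly like this. Flag syntax follows how we invoked the ik_llama.cpp fork in this article (mainline llama.cpp may spell some flags differently, e.g. -fa taking an explicit 0/1), and the paths are our local ones - a sketch, not a prescription.

```bash
# One llama-bench run with several runtime knobs dialed in (ik_llama.cpp syntax).
numactl -N 0,1 -m 0,1 --physcpubind=0-15 \
  ./ik_llama.cpp/build-cpu/bin/llama-bench \
    -m ./models/gpt-oss-20b-Q4_K_M.gguf \
    -t 16 \
    -b 2048 -ub 512 \
    -ctk q8_0 -ctv f16 \
    -fa \
    -p 512 -n 128 -pg 256,512
# numactl      pins the run to both NUMA nodes and the first 16 logical cores
# -t           threads (14-16 was our sweet spot on the 1950X)
# -b / -ub     batch and micro-batch sizes
# -ctk / -ctv  KV cache types (quantized K, f16 V)
# -fa          flash attention
# -p / -n / -pg  prefill, decode, and mixed test sizes
```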
The process overall looks like this - step-by-step of sorts:
- You clone the llama.cpp (or ik_llama.cpp);
- You read the docs on how to run a BUILD - get all the dependencies installed;
- You run the build → you get your build/bin/llama-bench;
- You can now run the bench with various parameters / knobs.
After experimenting A LOT, we realized that to do this reliably - given the small variances here and there - it’s better to create a wrapper shell command, run a bulk batch, and then compare the runs.
The normal 1 run output looks something like:
```
nikro@quietbee:~/projects$ ./ik_llama.cpp/build/bin/llama-bench --model ./models/gpt-oss-20b-Q4_K_M.gguf -t 16 -ngl 0 -pg 256,512
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
=======================================
 HAVE_FANCY_SIMD is NOT defined
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | CPU        |      16 |         pp512 |     87.92 ± 1.70 |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | CPU        |      16 |         tg128 |     16.25 ± 0.18 |
| gpt-oss ?B Q4_K - Medium       |  10.81 GiB |    20.91 B | CPU        |      16 |   pp256+tg512 |     21.44 ± 0.40 |
```
Here you get 512-token prefill (input processing), 128-token decode (output generation), and a mixed run of 256 prefill + 512 generation.
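In its crudest form, the “wrapper” idea is just a loop that repeats each configuration a couple of times and dumps machine-readable results you can compare later - something like the sketch below (paths and output layout are hypothetical, purely to illustrate the idea):

```bash
#!/usr/bin/env bash
# Naive bulk-run sketch: two repetitions per build, JSON output per run.
set -euo pipefail

MODEL=./models/gpt-oss-20b-Q4_K_M.gguf
BUILDS=(
  ./llama.cpp/build-vanilla/bin/llama-bench
  ./ik_llama.cpp/build-cpu/bin/llama-bench
)

mkdir -p runs
for BIN in "${BUILDS[@]}"; do
  TAG=$(echo "${BIN#./}" | tr '/' '_')
  for REP in 1 2; do
    numactl -N 0,1 -m 0,1 --physcpubind=0-15 \
      "$BIN" -m "$MODEL" -t 16 -p 512 -n 128 -pg 256,512 -o json \
      > "runs/${TAG}_rep${REP}.json"
  done
done
```

Each JSON file contains the t/s numbers llama-bench reports, so you can diff the runs afterwards.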
By going back and forth, we ended up with a huge table of these runs, and it got very hard to compare things. So we decided to code (with some help from Claude) a llama-bench wrapper - https://github.com/HumanFace-Tech/hft-cpu-test/ - this tool 🛠️ lets you:
- Run basic computer/setup checks ✅
- Define a config.yml with the 🔭 exploratory (broad but shallow) setups you want to test…
- Run these in one batch and get a final comparison output 📊
- Use the output from the exploratory run → to create a more in-depth run (more knobs) and find the ultimate configs for the best runs 🎯
Looks something like:
```
(venv) nikro@quietbee:~/projects/hft-cpu-test$ ./run_bench.sh configs/qb-exploratory.yaml
🔍 Pre-flight checks...
🚀 Starting benchmark harness...
📊 Report directory: reports/2025-10-15-182000-exploratory
🚀 Starting EXPLORATORY benchmark run
📋 Test matrix: 36 unique configs × 2 reps = 72 runs

[1/36] vanilla / all_cores_16t_01 / pp512
  Rep 1: numactl -N 0,1 -m 0,1 --physcpubind=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 /home/nikro/projects/llama.cpp/build-vanilla/bin/llama-bench -m /home/nikro/projects/models/gpt-oss-20b-Q4_K_M.gguf -p 512 -n 0 -o json -t 16
    ✓ 56.880251 t/s
  Rep 2: numactl -N 0,1 -m 0,1 --physcpubind=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 /home/nikro/projects/llama.cpp/build-vanilla/bin/llama-bench -m /home/nikro/projects/models/gpt-oss-20b-Q4_K_M.gguf -p 512 -n 0 -o json -t 16
    ✓ 56.750169 t/s
✓ reports/2025-10-15-182000-exploratory/summary.md
...
```
We use it ourselves (and plan to keep using it once I get a proper PSU to handle an extra CPU EPS connector and can enable XMP on my RAM) - hope it will be useful for someone else out there 🌎 - if you’re that someone, I’d appreciate a ⭐ on that repo 😅.
Results: 📈 What actually happened
Let’s talk about my exploratory → deep tests.
🔭 Exploratory:
We made 6 different builds:
- llama.cpp (vanilla) - simple default llama.cpp CPU build
- llama.cpp + BLAS
- llama.cpp + BLIS
- ik_llama.cpp - the specialized fork - vanilla
- ik_llama.cpp + BLIS
- ik_llama.cpp + additional flags - flags that I hoped would enable FANCY SIMD, but they didn’t - and I’m not sure whether they had any influence on the build’s performance at all
Then we just targeted 2 different modes (after experimentation):
- with numactl enabled - forcing everything onto the first 16 logical cores (one thread per physical core) - aka all_cores_16t_01
- with numactl disabled - basically leaving placement dynamic (NUMA balancing is still off, but threads can land on whichever cores are free first) - aka no_node_16t
```yaml
mode: exploratory
repetitions: 2
model_path: /home/nikro/projects/models/gpt-oss-20b-Q4_K_M.gguf
model_info: "gpt-oss-20b-Q4_K_M.gguf"

# Builds to test - edit paths to match your setup
builds:
  vanilla:
    binary: /home/nikro/projects/llama.cpp/build-vanilla/bin/llama-bench
    label: "vanilla (baseline)"
  blas:
    binary: /home/nikro/projects/llama.cpp/build-blas/bin/llama-bench
    label: "vanilla-BLAS"
  blis:
    binary: /home/nikro/projects/llama.cpp/build-blis/bin/llama-bench
    label: "vanilla-BLIS"
  ik_vanilla:
    binary: /home/nikro/projects/ik_llama.cpp/build-cpu/bin/llama-bench
    label: "IK-vanilla"
  ik_blis:
    binary: /home/nikro/projects/ik_llama.cpp/build-blis/bin/llama-bench
    label: "IK-BLIS"
  ik_fancy:
    binary: /home/nikro/projects/ik_llama.cpp/build-fancy/bin/llama-bench
    label: "IK-fancy"

# Which builds to include in this run
builds_select:
  - vanilla
  - blas
  - blis
  - ik_vanilla
  - ik_blis
  - ik_fancy

# Test matrix - exploratory sweep across different NUMA strategies
# Threadripper 1950X: 16 physical cores (0-15), 32 logical (16-31 are SMT siblings)
test_matrix:
  # Test 1: All 16 physical cores, 16 threads
  - name: "all_cores_16t_01"
    numactl: "-N 0,1 -m 0,1 --physcpubind=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15"
    env:
      OMP_NUM_THREADS: "16"
    extra_args: "-t 16"

  # Test 2: No NUMA
  - name: "no_node_16t"
    numactl: ""
    env:
      OMP_NUM_THREADS: "16"
    extra_args: "-t 16"

# Standard llama-bench metrics
metrics:
  - pp512   # Prompt processing: 512 tokens
  - tg128   # Text generation: 128 tokens
  - mixed   # Prompt=512 + Generate=128

# Output directory
output_dir: ./reports
```
Results are:
(chart: exploratory results overview)
Raws:
```
# Benchmark Summary - Exploratory

**Date:** 2025-10-15 20:29:40
**Config:** configs/qb-exploratory.yaml
**Model:** gpt-oss-20b-Q4_K_M.gguf

## PP512

| Build      | Config           | t/s (our reps)   | llama-bench σ   | Reps |
|:-----------|:-----------------|:-----------------|:----------------|-----:|
| ik_vanilla | all_cores_16t_01 | 91.91 ± 0.06     | ±0.17           |    2 |
| ik_fancy   | all_cores_16t_01 | 91.40 ± 0.17     | ±0.22           |    2 |
| ik_fancy   | no_node_16t      | 89.14 ± 1.37     | ±3.11           |    2 |
| ik_vanilla | no_node_16t      | 86.80 ± 4.23     | ±8.05           |    2 |
| vanilla    | all_cores_16t_01 | 56.82 ± 0.07     | ±0.17           |    2 |
| vanilla    | no_node_16t      | 56.74 ± 0.18     | ±0.25           |    2 |
| blas       | no_node_16t      | 50.40 ± 0.81     | ±0.81           |    2 |
| blas       | all_cores_16t_01 | 38.58 ± 0.03     | ±0.18           |    2 |
| blis       | no_node_16t      | 31.79 ± 0.02     | ±0.08           |    2 |
| blis       | all_cores_16t_01 | 31.66 ± 0.01     | ±0.19           |    2 |
| ik_blis    | all_cores_16t_01 | 27.41 ± 0.03     | ±0.09           |    2 |
| ik_blis    | no_node_16t      | 25.90 ± 0.64     | ±0.95           |    2 |

## TG128

| Build      | Config           | t/s (our reps)   | llama-bench σ   | Reps |
|:-----------|:-----------------|:-----------------|:----------------|-----:|
| ik_blis    | no_node_16t      | 18.14 ± 0.49     | ±0.17           |    2 |
| ik_vanilla | no_node_16t      | 18.05 ± 0.51     | ±0.29           |    2 |
| ik_vanilla | all_cores_16t_01 | 17.03 ± 0.99     | ±0.48           |    2 |
| ik_blis    | all_cores_16t_01 | 16.72 ± 0.48     | ±0.25           |    2 |
| ik_fancy   | all_cores_16t_01 | 16.01 ± 1.88     | ±0.69           |    2 |
| ik_fancy   | no_node_16t      | 15.99 ± 2.01     | ±0.10           |    2 |
| vanilla    | all_cores_16t_01 | 15.55 ± 0.01     | ±0.01           |    2 |
| blis       | all_cores_16t_01 | 15.47 ± 0.06     | ±0.02           |    2 |
| vanilla    | no_node_16t      | 14.30 ± 0.03     | ±0.04           |    2 |
| blas       | no_node_16t      | 13.25 ± 0.22     | ±0.22           |    2 |
| blas       | all_cores_16t_01 | 13.19 ± 0.86     | ±0.10           |    2 |
| blis       | no_node_16t      | 12.27 ± 0.02     | ±0.09           |    2 |

## MIXED

| Build      | Config           | t/s (our reps)   | llama-bench σ   | Reps |
|:-----------|:-----------------|:-----------------|:----------------|-----:|
| ik_vanilla | all_cores_16t_01 | 91.66 ± 0.02     | ±0.23           |    2 |
| ik_fancy   | all_cores_16t_01 | 91.40 ± 0.04     | ±0.33           |    2 |
| ik_vanilla | no_node_16t      | 79.63 ± 4.12     | ±13.58          |    2 |
| ik_fancy   | no_node_16t      | 73.54 ± 12.04    | ±8.76           |    2 |
| vanilla    | all_cores_16t_01 | 56.95 ± 0.02     | ±0.12           |    2 |
| vanilla    | no_node_16t      | 56.80 ± 0.05     | ±0.13           |    2 |
| blas       | no_node_16t      | 50.13 ± 0.37     | ±0.81           |    2 |
| blas       | all_cores_16t_01 | 38.48 ± 0.02     | ±0.18           |    2 |
| blis       | no_node_16t      | 31.84 ± 0.01     | ±0.05           |    2 |
| blis       | all_cores_16t_01 | 31.76 ± 0.09     | ±0.15           |    2 |
| ik_blis    | all_cores_16t_01 | 27.41 ± 0.04     | ±0.07           |    2 |
| ik_blis    | no_node_16t      | 26.34 ± 0.15     | ±0.65           |    2 |

---
```
Alright, here’s what we learned from this:
- Our custom ik_llama builds (i.e. ik_vanilla) perform very well - on average 60-90% faster than the normal vanilla builds.
- NUMA-aware execution boosts throughput - and at times, by a lot.
- TG128 (generation/decoding) tells a slightly different story - if you want that number as high as possible, you might be interested in BLIS, but to be fair, ik_vanilla isn’t far behind (the variance might be what sets them apart).
Okay okay, let’s dive into a deeper analysis now that we know what we want to focus on. The report, by the way, generated a promoted.yml config that we used as our basis - we adjusted it slightly:
```yaml
mode: deep
repetitions: 2
metrics:
  - pp512
  - tg128
  - mixed
output_dir: ./reports
builds:
  ik_vanilla:
    binary: /home/nikro/projects/ik_llama.cpp/build-cpu/bin/llama-bench
    label: IK-vanilla
  ik_fancy:
    binary: /home/nikro/projects/ik_llama.cpp/build-fancy/bin/llama-bench
    label: IK-fancy
  ik_blis:
    binary: /home/nikro/projects/ik_llama.cpp/build-blis/bin/llama-bench
    label: IK-BLIS
builds_select:
  - ik_vanilla
  # - ik_fancy - skip this one, as I am not sure our flags do anything different.
  - ik_blis
test_matrix:
  - name: all_cores_16t_01
    numactl: -N 0,1 -m 0,1 --physcpubind=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
    env:
      OMP_NUM_THREADS: '16'
    extra_args: -t 16
  - name: no_node_16t
    numactl: ''
    env:
      OMP_NUM_THREADS: '16'
    extra_args: -t 16
model_path: /home/nikro/projects/models/gpt-oss-20b-Q4_K_M.gguf
model_info: gpt-oss-20b-Q4_K_M.gguf

# ============================================================================
# PARAMETER SWEEP (this is what makes "deep" mode deep!)
# ============================================================================
parameter_sweep:
  # KV Cache Type Variations
  kv_cache:
    - name: "f16_f16"
      args: "-ctk f16 -ctv f16"     # Baseline
    #- name: "f8_f16"
    #  args: "-ctk q8_0 -ctv f16"   # Quantize K cache (common optimization)

  # MLA/Attention Variants
  mla_variants:
    - name: "baseline"
      args: ""
    - name: "mla2_fa_fmoe"
      args: "-mla 2 -fa -fmoe"
    - name: "mla3_fa_fmoe"
      args: "-mla 3 -fa -fmoe"

  # Batch Size Variations
  batch_sizes:
    - name: "std"
      args: "-b 2048 -ub 512"
    - name: "small_128"
      args: "-b 256 -ub 128"
    - name: "small_64"
      args: "-b 256 -ub 64"
    - name: "mid"
      args: "-b 512 -ub 256"
```
I also ran this as 2 separate runs for different KV params - initially only f16/f16 (classic), and later the q8_0 / f16:
```
# PP512 — Top 10 (Deep)

| Build      | Config                                          | t/s   | σ      |
|------------|-------------------------------------------------|-------|--------|
| ik_vanilla | all_cores_16t_01_f16_f16_mla3_fa_fmoe_std       | 91.83 | ±0.27  |
| ik_vanilla | all_cores_16t_01_f16_f16_baseline_std           | 91.71 | ±0.41  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla2_fa_fmoe_std       | 91.70 | ±0.27  |
| ik_vanilla | all_cores_16t_01_f16_f16_baseline_mid           | 91.02 | ±0.22  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla2_fa_fmoe_mid       | 90.59 | ±0.24  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla3_fa_fmoe_mid       | 90.23 | ±0.28  |
| ik_vanilla | all_cores_16t_01_f16_f16_baseline_small_128     | 87.20 | ±0.43  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla2_fa_fmoe_small_128 | 86.91 | ±0.24  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla3_fa_fmoe_small_128 | 86.81 | ±0.26  |
| ik_vanilla | no_node_16t_f16_f16_mla3_fa_fmoe_std            | 82.26 | ±13.91 |

…others removed for brevity…

# TG128 — Top 10 (Deep)

| Build      | Config                                          | t/s   | σ      |
|------------|-------------------------------------------------|-------|--------|
| ik_vanilla | no_node_16t_f16_f16_baseline_mid                | 18.28 | ±0.18  |
| ik_vanilla | no_node_16t_f16_f16_mla2_fa_fmoe_small_64       | 18.21 | ±0.10  |
| ik_vanilla | all_cores_16t_01_f16_f16_baseline_small_64      | 18.18 | ±0.24  |
| ik_vanilla | no_node_16t_f16_f16_mla3_fa_fmoe_std            | 18.18 | ±0.22  |
| ik_vanilla | no_node_16t_f16_f16_mla2_fa_fmoe_small_128      | 18.15 | ±0.12  |
| ik_blis    | all_cores_16t_01_f16_f16_baseline_small_64      | 17.80 | ±0.43  |
| ik_blis    | no_node_16t_f16_f16_baseline_mid                | 17.58 | ±0.23  |
| ik_blis    | no_node_16t_f16_f16_mla2_fa_fmoe_mid            | 17.58 | ±0.32  |
| ik_vanilla | no_node_16t_f16_f16_mla2_fa_fmoe_mid            | 17.43 | ±0.09  |
| ik_blis    | all_cores_16t_01_f16_f16_mla3_fa_fmoe_small_64  | 17.42 | ±0.23  |

…others removed for brevity…

# MIXED — Top 10 (Deep)

| Build      | Config                                          | t/s   | σ      |
|------------|-------------------------------------------------|-------|--------|
| ik_vanilla | all_cores_16t_01_f16_f16_mla3_fa_fmoe_std       | 91.76 | ±0.25  |
| ik_vanilla | all_cores_16t_01_f16_f16_baseline_std           | 91.63 | ±0.36  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla2_fa_fmoe_std       | 91.49 | ±0.19  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla3_fa_fmoe_mid       | 90.82 | ±0.30  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla2_fa_fmoe_mid       | 90.71 | ±0.36  |
| ik_vanilla | all_cores_16t_01_f16_f16_baseline_mid           | 90.54 | ±0.23  |
| ik_vanilla | all_cores_16t_01_f16_f16_baseline_small_128     | 87.49 | ±0.19  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla3_fa_fmoe_small_128 | 87.10 | ±0.28  |
| ik_vanilla | all_cores_16t_01_f16_f16_mla2_fa_fmoe_small_128 | 86.98 | ±0.30  |
| ik_vanilla | no_node_16t_f16_f16_mla3_fa_fmoe_std            | 85.80 | ±7.81  |

…others removed for brevity…
```



Great - so what about quantizing the K in the KV cache (q8_0/f16)? It does make the model slightly worse quality-wise (≤1%) - so keep that in mind:
```
## PP512 (top 10)
1) ik_vanilla — all_cores_16t_01_f8_f16_baseline_std — 92.50 t/s
2) ik_vanilla — all_cores_16t_01_f8_f16_mla2_fa_fmoe_std — 92.47 t/s
3) ik_vanilla — all_cores_16t_01_f8_f16_mla3_fa_fmoe_std — 92.30 t/s
4) ik_vanilla — all_cores_16t_01_f8_f16_mla3_fa_fmoe_mid — 91.31 t/s
5) ik_vanilla — all_cores_16t_01_f8_f16_baseline_mid — 91.17 t/s
6) ik_vanilla — all_cores_16t_01_f8_f16_mla2_fa_fmoe_mid — 91.06 t/s
7) ik_vanilla — all_cores_16t_01_f8_f16_mla2_fa_fmoe_small_128 — 87.62 t/s
8) ik_vanilla — all_cores_16t_01_f8_f16_mla3_fa_fmoe_small_128 — 87.36 t/s
9) ik_vanilla — all_cores_16t_01_f8_f16_baseline_small_128 — 87.03 t/s
10) ik_vanilla — no_node_16t_f8_f16_baseline_std — 84.15 t/s
… others removed for brevity …

## TG128 (top 10)
1) ik_vanilla — no_node_16t_f8_f16_mla2_fa_fmoe_small_128 — 18.47 t/s
2) ik_blis — no_node_16t_f8_f16_mla2_fa_fmoe_mid — 18.03 t/s
3) ik_blis — no_node_16t_f8_f16_baseline_small_128 — 17.98 t/s
4) ik_blis — all_cores_16t_01_f8_f16_mla3_fa_fmoe_mid — 17.86 t/s
5) ik_blis — no_node_16t_f8_f16_mla3_fa_fmoe_std — 17.82 t/s
6) ik_vanilla — no_node_16t_f8_f16_baseline_small_128 — 17.70 t/s
7) ik_vanilla — no_node_16t_f8_f16_mla3_fa_fmoe_std — 17.61 t/s
8) ik_vanilla — no_node_16t_f8_f16_mla2_fa_fmoe_std — 17.58 t/s
9) ik_blis — no_node_16t_f8_f16_mla3_fa_fmoe_small_64 — 17.57 t/s
10) ik_vanilla — no_node_16t_f8_f16_baseline_mid — 17.49 t/s
… others removed for brevity …

## MIXED (top 10)
1) ik_vanilla — all_cores_16t_01_f8_f16_mla2_fa_fmoe_std — 92.72 t/s
2) ik_vanilla — all_cores_16t_01_f8_f16_baseline_std — 92.63 t/s
3) ik_vanilla — all_cores_16t_01_f8_f16_mla3_fa_fmoe_std — 92.42 t/s
4) ik_vanilla — all_cores_16t_01_f8_f16_baseline_mid — 91.27 t/s
5) ik_vanilla — all_cores_16t_01_f8_f16_mla2_fa_fmoe_mid — 91.14 t/s
6) ik_vanilla — all_cores_16t_01_f8_f16_mla3_fa_fmoe_mid — 91.01 t/s
7) ik_vanilla — all_cores_16t_01_f8_f16_mla2_fa_fmoe_small_128 — 87.90 t/s
8) ik_vanilla — all_cores_16t_01_f8_f16_mla3_fa_fmoe_small_128 — 87.73 t/s
9) ik_vanilla — all_cores_16t_01_f8_f16_baseline_small_128 — 87.41 t/s
10) ik_vanilla — no_node_16t_f8_f16_baseline_mid — 86.67 t/s
… others removed for brevity …
```
Or visually 📊:
(charts: the q8_0/f16 top-10 results, plotted)
So, we can conclude:
- K-quantization gives a small but consistent speed bump.
- MIXED best improved from ~91.7 t/s (f16/f16) to ~92.7 t/s (q8_0/f16), roughly +1%.
- PP512 (prefill/input) moved from ~91.8 to ~92.5 t/s (+0.7%). TG128 best nudged up from ~18.3 to ~18.5 t/s.
- The “all_cores_16t_01 + ik_vanilla” family still dominates PP512 and MIXED. Whether baseline or with MLA/FA/FMoE toggled, those variants occupy the top slots. The feature flags don’t change ranking much → gains are within ~0–1%.
- TG128 (generation/decoding) remains more sensitive to pinning and small-batch configs. Top TG128 entries skew toward no_node_16t and “small_128/64” variants (both ik_vanilla and ik_blis show up), suggesting generation kernels benefit from tight per-token work and sometimes less aggressive NUMA/pinning.
🥇 Our “everyday” preset (and why)
After running multiple rounds of deep CPU benchmarking - hundreds of iterations with different kernel builds, memory pinning strategies, attention implementations, KV cache types, and batching configurations → we can now confidently say we’ve found the sweet spot. This is the setup that delivers top-tier performance across all workload types without sacrificing model quality.
- Build: ik_vanilla (the ik_llama.cpp fork, with default settings)
- Threads: 16 (pinned across NUMA nodes, e.g. numactl --physcpubind=0-15)
- KV cache: default (which is f16/f16)
- Flags: -fa -fmoe -mla 2 (flash attention, fused MoE, and the fork’s MLA attention mode)
- Batching: standard (-b 2048 -ub 512)
- Execution profile: NUMA balanced, threads pinned, page locality enforced (spelled out as a command right after this list)
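Spelled out as a single command (ik_llama.cpp flag syntax - -fmoe and -mla are fork-specific and won’t exist in mainline llama.cpp; the same flags should carry over to the fork’s other binaries for actual inference, but verify against your build):

```bash
# Our everyday preset, as a llama-bench run (ik_llama.cpp fork, default f16/f16 KV cache).
numactl -N 0,1 -m 0,1 --physcpubind=0-15 \
  ./ik_llama.cpp/build-cpu/bin/llama-bench \
    -m ./models/gpt-oss-20b-Q4_K_M.gguf \
    -t 16 \
    -b 2048 -ub 512 \
    -fa -fmoe -mla 2 \
    -p 512 -n 128 -pg 256,512
```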
This single configuration emerged as the most consistent and balanced across all three benchmark dimensions we care about:
- PP512 (prompt processing):
- 🚀 ~91.7 t/s - within 0.2 % of the absolute best run.
- 📈 +61–62 % faster than vanilla (which averages ~56.8 t/s).
- TG128 (generation):
- ⚡ ~17.0 t/s - only ~3 % slower than the most extreme TG-optimized build.
- 📈 ~12–14 % faster than vanilla (~15.0–15.2 t/s).
- MIXED:
- 🔥 ~91.6–91.8 t/s - best overall throughput when simulating realistic multi-stage workloads.
- 📈 ~61 % faster than vanilla (~56.8 t/s).
🧭 Closing thoughts
I think it was a pretty good ride - we can now lock in some parameters, swap the model, and check again. Or install the new PSU (and one more RAM stick) and try again. The general rules should translate to other CPU+RAM setups - but the real result of the ride is the repo: a wrapper around llama-bench that can, in a few hours, give us a nice & solid comparison table - and that’s 🔥
If you’re curious - feel free to explore and use the tool, test things yourself - I’d love to hear what results you get, leave a comment 👇.
Also, please consider supporting our effort on Ko-Fi - any donation goes straight into the GPU budget and more articles → https://ko-fi.com/nikrosergiu
And if you want to get in touch for some projects - check out HumanFace Tech and book a meeting.