
Nichebench - Benchmarking AIs vs Drupal 10-11
Introduction
Finally! This article took me about two months to write up - mostly because of coming up with Nichebench, creating the test cases, and slowly running the tests against many existing LLMs.
Let's start with this: not all LLMs are created equal. The data they are trained on, the formatting of that data, the labelling, the underlying architecture and many other parameters heavily influence the final result. Some LLMs might be good at coding 🧑‍💻, but when prompted about a Drupal 10 or Drupal 11 implementation, might sneak in some Drupal 7-flavored solutions - or plainly hallucinate some slop as a reply 😵
Most of us get around these issues by just throwing more money at the problem - i.e. picking Claude Sonnet 4 (or Opus) - or burning through GitHub Copilot's paid credits.
Very few of us play with smaller LLMs (be it via OpenRouter, Together.AI, Groq or Ollama) - and that's understandable, because after a few tries you'd quickly realize that they perform pretty poorly - specifically in the Drupal 10/11 setting.
You normally end up re-prompting and explaining in detail what you want the AI agent to achieve - constantly thinking "wouldn't it be faster if I'd just do it myself?". And you continue re-prompting, like "Do it, or..."

As part of my experiment (started in this article), I want to fine-tune a niche LLM - one that focuses on Drupal 10/11 knowledge and recent best practices - and see if I could reach (or even outperform) current SOTA models.
The next stepping stone on this journey is to find a good, solid, open-weight model to build upon - enter Nichebench by HumanFace Tech.

Introducing Nichebench
To test how well AI models know Drupal 10-11, we've cooked up 🧑‍🍳 Nichebench.
I originally wanted to base it on LightEval, but quickly realized that the multi-turn complexity of LLM-as-a-Judge didn't let me proceed the way I wanted - so instead I embraced DeepEval.
Using DeepEval offered us:
- Consistency - it's a relatively simple-to-adopt framework that keeps things flexible, but also enforces consistency in how the tests are written and evaluated.
- Future-proofing - right now I focus specifically on Drupal knowledge, but I'd like to run other tests against the model in parallel (MMLU, etc.) - and there are many baked-in tests (Task Completion, Tool Correctness, etc.).
- Parallel task execution - it lets us run tests concurrently, to speed things up.
Although I didn't stay 100% loyal to the framework - I wanted my own formatted outputs / CLI tool, so I had to go "around" the traditional deepeval CLI (which handles concurrency and other things) - but that's okay (maybe I'll reverse this in the future).
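To make the "going around the CLI" part concrete, here's a minimal sketch of what driving DeepEval programmatically can look like - the question, judge model and output format are illustrative placeholders, not actual Nichebench internals:

```python
# Minimal sketch: build a metric and a test case in plain Python and call
# measure() ourselves, so we control formatting instead of `deepeval test run`.
# The question, expected behavior and judge model are placeholders, not real tasks.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Does the actual output correctly answer the Drupal question in the input?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",  # placeholder judge model
)

case = LLMTestCase(
    input="What replaced hook_menu() for defining routes in Drupal 8+?",
    actual_output="Routes are declared in *.routing.yml files, with controllers handling them.",
)

correctness.measure(case)  # one LLM-as-a-Judge call
print(f"score={correctness.score:.2f} | {correctness.reason}")  # our own output format
```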
Nichebench so far supports two types of tests (there's a third, but I'll ignore it for now): quiz and code_generation.
Quizzes are simple - they usually have A, B, C, D, E, F answers - and we STILL use LLM-as-a-Judge to verify the answer. Why? Some LLMs might give a random explanation or blabber, and then say "Okay, so, saying all that - I think the actual answer is A" - and I'd like an LLM to figure that out.
Code generation tests are a bit more complex: they have a detailed system prompt, then the context of the task at hand, then what's expected from the MUT (model under testing). The model produces the end-to-end implementation, which is forwarded to the LLM-as-a-Judge (usually gpt-5@medium in my case) - and the judge also sees a list of criteria and judge hints: what to look for and why.
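A hedged sketch of how such a check can be wired up, assuming DeepEval's GEval metric - the task, judge hints and judge model below are invented for illustration, not actual Nichebench test cases:

```python
# Illustrative code_generation check: the judge sees the task plus a list of
# criteria / judge hints (via evaluation_steps) describing what to look for and why.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

code_judge = GEval(
    name="Drupal code generation",
    evaluation_steps=[
        "Routes must be declared in a *.routing.yml file, never via hook_menu().",
        "Services should be injected through the container, not fetched with \\Drupal::service() inside classes.",
        "Penalize Drupal 7-era APIs such as drupal_set_message() or variable_get().",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",  # placeholder; the real judge in my runs is gpt-5@medium
)

case = LLMTestCase(
    input="System prompt + task context: implement a custom block plugin listing the 5 newest nodes.",
    actual_output="(the MUT's full end-to-end answer, all file changes in one textual output)",
)

code_judge.measure(case)
print(code_judge.score, code_judge.reason)
```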

⚠️ NOTE: I didn't ship the actual quiz & code tasks as part of this repo, because I don't want newer models to use my questions / answers for their training (so the folder will be empty) - but you can ask me and I can add you to the private repo if you're interested (or if you wish to contribute).
Results: Drupal Quiz
Here's the result - top to bottom (best to worst):

I tested a lot of models, because in general this test is relatively fast and cheap (unless it's a reasoning model). You'll see that some SOTA models didn't get 95-100% but 90-91% - and that's fine. It's not about an absolute perfect score, it's about the relative score - that's why I also tested proprietary SOTA models: GPT-5 and Claude Sonnet 4.
My take-aways from this test:
- phi4:14b from Microsoft scored 88% - that was impressive!
- gpt-oss-120B scored 90% (and the 20B - 80%)
- devstral:24b scored 88%
- qwen3-coder-30b-a3b scored 86%
Some models scored better than others partly because of recency - some of my tests (both Quiz & Code Gen) include relatively fresh Drupal 11 APIs (though only a small percentage). Sure, that might NOT be relevant since we'll use fine-tuning to fix those gaps - but wouldn't we rather start with a model that is already better at some of them?
Results: Drupal Code Generation
Here's the result - top to bottom (best to worst):

Now this test tells a totally different story. Important to note: this is NOT agentic - it's a one-shot output test (multiple file changes in one textual output).
These tests run for a long time and are pretty expensive. For reference, GPT-5@Medium (judged by GPT-5 itself) scored 75%. I think Claude Sonnet 4 or GPT-5@High would probably get 80-90%.
And now let's switch focus back to our open-weight models - and the GAP is HUGE. Take-aways:
- gpt-oss-120B is the highest, hitting 40%
- gpt-oss-20B is the next best, reaching 32-35% - a really good candidate!
- phi4:14B scored only 22-23% - disappointing
- the Qwens: qwen3-32b scored very well (32%), while qwen3-coder-30b-a3b got 27%
I know people complained about the gpt-oss models NOT being amazing or up to their expectations - but somehow these models did pretty well on my tests.
One IMPORTANT thing to note here: the gpt-oss models were pretty weak at generating complex, properly escaped function calls (JSON). That's normal for smaller models, but somehow it stings particularly badly with gpt-oss (the 120B of course performs way better, but even it screws up).
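To give a feel for this failure mode, here's a small, hedged sketch (not Nichebench code) of the kind of defensive check flaky tool-call JSON pushes you toward - validate the arguments before dispatching, and treat a parse failure as a re-prompt signal:

```python
# Sketch of a defensive check around flaky tool-call JSON (illustrative only).
import json


def parse_tool_arguments(raw_arguments: str) -> dict | None:
    """Return the parsed arguments dict, or None if the model emitted broken JSON."""
    try:
        parsed = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return None  # typical cause: unescaped quotes or newlines inside string values
    return parsed if isinstance(parsed, dict) else None


# A typical failure: PHP code with unescaped double quotes inside a JSON string value.
broken = '{"path": "src/Plugin/Block/HelloBlock.php", "content": "echo "hi";"}'
valid = '{"path": "src/Plugin/Block/HelloBlock.php", "content": "echo \\"hi\\";"}'

print(parse_tool_arguments(broken))  # None -> ask the model to re-emit the call
print(parse_tool_arguments(valid))   # {'path': '...', 'content': 'echo "hi";'}
```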
RAW results can be found in this spreadsheet.
Analysis, Picking Finalists
After these tests - leaning more heavily on the code generation results - here's my analysis of four smallish LLMs.
| Model | Arch | Context (native) | License | Tool Use | JSON/YAML/XML Gen | Coding Specialization |
|---|---|---|---|---|---|---|
| GPT-OSS-20B | MoE | 128K | Apache-2.0 | Native function calling, browsing, Python tools - but flaky reliability (JSON/tool-call errors reported) | JSON/YAML supported, but frequent parsing errors | Generalist; decent coding & reasoning, not coding-specialized |
| Phi-4 (14B) | Dense | 16K (32K reasoning variant) | MIT | Supports (basic) function calling but limited depth; better in small iterative tasks | Good structured extraction, less reliable for complex JSON | Strong at math & reasoning, weaker at repo-scale coding |
| Qwen3-32B | Dense | 128K | Apache-2.0 | Standard tool calling, reliable & stable | JSON/YAML generation very consistent | Balanced - not coding-only, but very stable all-rounder |
| Qwen3-Coder-30B-A3B | MoE | 256K (up to 1M with YaRN) | Apache-2.0 | Advanced agentic tool use (XML format for calls; more reliable on quantized runs) | Excels in structured output for dev workflows (JSON/YAML/XML) | Specialized for repo-scale coding, strong at multi-file agentic workflows |
Another "downfall" of Phi-4 is its tiny context window and its more basic (less agentic) tool use compared to newer models.
GPT-OSS-20B still looks good in many regards - but we WILL need to address the JSON/XML/YAML serialization issue. Its downside is being a generalist - but even so, it outperformed qwen3-coder-30b-a3b in both test types.
I like both Qwen3 models, but for different reasons: the MoE flavor is code-specific and agentic - I'd expect exceptional tool use, serialization and other aspects (plus a huge context window) - while the dense one generally outperforms it in many regards, even though it's a generalist.
I also slightly favor GPT-OSS-20B because of its guardrails 🛡️ - I know the OpenAI team invested a lot into them, so in general I'd feel safer playing with it. Another reason is its huge popularity (at least for now), which means I could throw my LoRAs onto almost any server and have it running - and at very fast speeds. And finally, I also want to consider LATER the option of transferring the fine-tuning to its larger brother: the 120B.
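To illustrate that portability point - purely as a sketch, with the adapter name being hypothetical and the actual fine-tuning left for a later article - a LoRA is just a small add-on attached to the unchanged base weights, roughly like this:

```python
# Hedged sketch of attaching a LoRA adapter to the stock base model with PEFT.
# "humanfacetech/drupal-lora" is a hypothetical adapter name, not a published artifact.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "openai/gpt-oss-20b"
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# The base weights stay untouched; only the small adapter is swapped in or out.
model = PeftModel.from_pretrained(base, "humanfacetech/drupal-lora")

# (For a real run you'd apply the model's chat template; kept raw for brevity.)
prompt = "How do I define a custom permission in Drupal 11?"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```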
And, here's a full video covering the article:
Next Steps
Building on top of the fine-tuning article, the next step is to generate proper synthetic data - I'm talking about 10k-20k proper technical Drupal-related tasks and solutions, properly described, annotated, labelled and so on.
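To make "properly described, annotated, labelled" a bit more tangible, here's a rough idea of what a single record could look like in a chat-style JSONL training file - the schema, tags and example task are my assumptions, not a final format:

```python
# Rough sketch of one synthetic training record (schema and tags are assumptions).
import json

record = {
    "messages": [
        {"role": "system",
         "content": "You are a senior Drupal 11 developer. Use current core APIs and coding standards."},
        {"role": "user",
         "content": "Add a custom permission 'administer fancy settings' and restrict a route to it."},
        {"role": "assistant",
         "content": "Declare it in my_module.permissions.yml:\n\n"
                    "administer fancy settings:\n  title: 'Administer fancy settings'\n\n"
                    "Then require it in my_module.routing.yml via requirements: { _permission: 'administer fancy settings' }."},
    ],
    # Extra labels that make filtering, deduplication and curriculum building easier later.
    "meta": {"drupal_version": "11.x", "topic": "permissions", "difficulty": "easy"},
}

with open("drupal_sft.jsonl", "a", encoding="utf-8") as fh:
    fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```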
And then, we start cooking a proper NicheAI - Drupal AI LLM, and see if we can reach SOTA-quality level.
Hereās what to expect:
- 🧪 Hardware - we'll dive into building the best cost-optimized AI rig at home, capable of fine-tuning serious models, all under $2000 🤯.
- 🧪 Generating a proper dataset - how can we generate a large-enough and good-enough dataset for our fine-tuning process?
- 🧪 Niche LLM models - how we can improve existing coding models to be incredibly fluent in a niche programming language or framework - looking at you, Drupal
- And much more!
If these articles are useful and/or entertaining for you - please consider supporting the effort via Ko-Fi donations.
💡 Inspired by this article?
If you found this article helpful and want to discuss your specific needs, I'd love to help! Whether you need personal guidance or are looking for professional services for your business, I'm here to assist.
Comments:
Feel free to ask any question or share any suggestion!