
AI Fine-tuning Experiment
Introduction
Hi there! I’m diving into the unknown 🤿 - fine‑tuning open‑weight models to see what goodies I can bake. Quick heads‑up: I’m a web dev who tinkers with AI (LangChain / LangGraph, etc.), not a formal ML researcher, but I’m pretty curious about AI and its impact on our lives 🤖.
Before we begin - if you're more of a hands-on, visual person, I also made a video ▶ that goes along with this article.
Okay, but why go through with all of that? Well, there’s a lot to unpack:
- 🤓 Just learn fine-tuning, get a bit into the weeds of AI training, etc.
- 🏋️♂️ Challenge - see if I can get some new knowledge & facts baked into existing LLM models
- 🛠️ Build on top of that - what’s my “North Star”? I’m targeting building niche specialized models - think: models that are great at Drupal coding, Wordpress, hell, even Svelte with Runes (who knows, maybe even Fireship will be impressed?).
- 📊 Assess - check if my new adjustments actually translate into real gains.
- 🤝 Contribute & share - if it’ll go well, I’m planning to release some of these publicly - so the niche communities can benefit from these.
- 📈 Incorporate - we have some products in our pipeline (@ HumanFace Tech) that will really benefit from these niche models.
The idea is to run this as an experiment, document things (text, code, videos) and share them - maybe someone’s on a similar journey or is just curious - I’m pretty sure we’re not alone.
Why did I use 🦥 Unsloth?
I browse HuggingFace a lot, so Unsloth was already on my radar (I’ve used a lot of their models) - and inevitably I was aware of their library. I’d been itching to try it out, but first, here’s a breakdown of the available approaches:
- Hugging Face (HF) Transformers + PEFT (LoRA / QLoRA) - the “default” route that slots straight into the HF ecosystem, delivering parameter-efficient fine-tuning with minimal code changes but only baseline speed and memory savings.
- DeepSpeed ZeRO - shards model states and optimizer tensors across GPUs/CPU so you can fit or accelerate very large models, at the cost of extra configuration and a multi-GPU (or offload) setup.
- Axolotl - a lightweight wrapper around HF that recently back-ported Unsloth-style Triton kernels to trim VRAM and claims up to ~1.5-2× faster LoRA/QLoRA runs while keeping a simple CLI.
- NVIDIA NeMo - an end-to-end generative-AI stack optimised for NVIDIA hardware that scales from single GPUs to DGX SuperPODs, but is heavier and ties you more tightly to the NVIDIA ecosystem.
Alright, now that we covered alternatives - here’s what I actually needed:
- I want to do QLoRA - for the training‑time and memory win. The process spits out a tiny LoRA adapter and, if I choose, a merged checkpoint (GGUF) ready for Ollama/LM Studio.
- I want to use my RTX 3060 (12GB VRAM) - so I would definitely benefit from a framework that can work in restricted VRAM scenarios (and restricted RAM scenarios too).
- I want quick iterations, because I suspected (and it was indeed the case) that I’d have to do many of them to check things and get them right.
Here’s how Unsloth fits my requirements: it’s open-source, HF-compatible, supports the same PEFT techniques (LoRA / QLoRA) and the same model zoo (Llama, Mistral, Gemma, Qwen, DeepSeek, etc.).
Where Unsloth shines
- Speed - custom CUDA/Triton kernels give roughly 2-5 × faster training than HF + Flash-Attention 2 baselines 🚀
- Memory - up to ~80 % less VRAM, letting you fine-tune a 9 B model in only 6-7 GB or run on free Colab (3 GB) notebooks 💾
- Simplicity - single-GPU friendly, no distributed plumbing, and ready-made notebooks for Colab, Kaggle, or local use ⚙️
They also have excellent (step-by-step) notebooks and examples. Sure, I had to use an AI to guide me through the AI slang, variables, etc. to get a deeper understanding, but it still helped a lot.
The process

Here, in their official docs, you can find a quick-start fine-tuning guide. It covers all the basics:
- Picking the Model + Method 🎯
- Data - getting the dataset 📥
- Install → Train → Evaluate → Run (Inference) ▶️
Now let’s jump into notebooks (there are VS Code extensions for these) - a notebook is a single page mixing code and markdown cells, and it lets you rerun code cells on demand, from top to bottom, and inspect their output as you go.

Unsloth offers a TON of example notebooks - each of them is basically a good starting platform for whatever experiment you want to do.
Model & Memory:
I initially picked Llama 3.2-3B, but I got quite a lot of Out of Memory (OOM) errors - during training, the whole model also has to sit in your system RAM (not just VRAM), and if you’re running Chrome + Docker on top, even swap memory won’t save you (my Ubuntu defaults to killing processes that hoard >50% of memory for at least 1 minute). Sadly I have only 16GB of RAM and I hit the ceiling pretty fast - eventually I had to configure my OOM killer to tolerate 90% usage and ~5-minute loads.
Eventually, since this was my first experiment, I decided to drop to Llama 3.2-1B (quantized to q4) - it allowed faster iterations and fewer memory headaches. So if you have an 8-12GB GPU, I still recommend starting with a small model.
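For reference, loading such a model with Unsloth looks roughly like this - the exact checkpoint name below is just an example (Unsloth publishes pre-quantized 4-bit variants of most popular models):

from unsloth import FastLanguageModel

max_seq_length = 2048  # plenty for short lore-style samples

# Load a small, pre-quantized 4-bit model to keep both RAM and VRAM usage low.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",  # example repo name
    max_seq_length=max_seq_length,
    load_in_4bit=True,   # QLoRA-style 4-bit loading
    dtype=None,          # auto-detect: bf16 on newer GPUs, fp16 otherwise
)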

Dataset:
You don’t need a big dataset to get started - I just asked ChatGPT to generate JSONL for me (a file with multiple JSON objects, one per line). I gave it examples (taken from Unsloth’s notebook), gave it a theme, and asked it to generate 20 demo samples.
Are 20 examples enough? It depends - generally “no”, but you can still use them to see if some new knowledge “sticks” to the AI model (i.e., to validate the experiment). It’s also important to prep the data (make sure you’re using the LLM-specific format - each LLM has certain expectations, special tokens, etc.).
In my case, I generated some sample facts about a new goddess - Afinika 😉. The lore talked about worship rituals, days of the year, month, week, representations and whatnot. Once I generated the lore, I decided to indoctrinate the AI into learning about this new religion, goddess, rituals, etc.

Whatever your use case or dataset is, you will most probably need to format (and load) the data properly, according to the requirements of your model.
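Here’s a minimal sketch of what that looked like in my case - the JSONL schema (instruction/output fields) and the file name are my own choices, and the chat template comes from the tokenizer loaded earlier:

from datasets import load_dataset

# Each JSONL line is one training sample, e.g.:
# {"instruction": "Who is Afinika?", "output": "Afinika is the goddess of ..."}
dataset = load_dataset("json", data_files="afinika_lore.jsonl", split="train")

def formatting_prompts_func(examples):
    # Turn raw instruction/output pairs into the exact chat format
    # (special tokens included) that the model expects.
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        messages = [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": output},
        ]
        texts.append(tokenizer.apply_chat_template(messages, tokenize=False))
    return texts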
Training, hyperparameters and monitoring:
Once you have your data properly loaded, you can proceed with the model itself. Following Unsloth’s guides, if you’re doing QLoRA you’ll need to load the model and then apply the LoRA adapter on top.
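In Unsloth’s workflow, that second step is a single call to get_peft_model - here’s a sketch using the same hyperparameters as the trainer below (the target_modules list is the usual set for Llama-style models):

model = FastLanguageModel.get_peft_model(
    model,
    r=8,                  # LoRA rank - matches the peft_config below
    lora_alpha=8,
    lora_dropout=0.02,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving checkpointing
    random_state=3407,
)

With the adapter attached, the actual training run looks like this: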
from transformers import TrainingArguments, DataCollatorForSeq2Seq, EarlyStoppingCallback
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

# Split 80/20 so we see generalization
ds_train, ds_eval = dataset.train_test_split(test_size=0.2, seed=42).values()
print(f"📊 Train: {len(ds_train)} examples, Eval: {len(ds_eval)} examples")

# Setup trainer using tutorial-style TrainingArguments with formatting_func
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds_train,
    eval_dataset=ds_eval,
    formatting_func=formatting_prompts_func,
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=56,
        learning_rate=1e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=5,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="outputs",
        eval_strategy="steps",
        eval_steps=4,
        save_steps=4,
        save_total_limit=6,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    ),
    peft_config=dict(r=8, lora_alpha=8, lora_dropout=0.02),  # small rank keeps the adapter tiny
    data_collator=DataCollatorForSeq2Seq(tokenizer, return_tensors="pt"),
    packing=True,
)
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))

print("🚀 Starting training...")
trainer.train()
print("✅ Training complete!")
I won’t dive into the weeds of this but TL;DR is - LoRA applies a thin layer on top of the original model, and your training will alter this layer, not the original model.
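To get an intuition for why that thin layer is so cheap to train: for a weight matrix of shape d × k, LoRA only learns two small matrices of rank r, so the trainable parameter count drops from d × k to r × (d + k). A quick back-of-the-envelope check (the numbers are illustrative):

# For one 4096x4096 attention projection and rank r = 8:
d, k, r = 4096, 4096, 8
full_ft = d * k           # parameters touched by full fine-tuning
lora = r * (d + k)        # parameters in the low-rank A (r x k) and B (d x r) pair
print(full_ft, lora, f"{100 * lora / full_ft:.2f}% of the original")  # ~0.39%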
Do note that there are different training methods - you might want to experiment with these (we will, but not in this article).
Okay, back to LoRA + hyperparameters - even here Unsloth has an excellent guide (far better than what AIs randomly gave me; AI usually tries to give a few examples and be done with it) - so check it out.
For monitoring and log-keeping, I recommend Weights & Biases. It lets you visually track the progress of your training - and understand whether it went off the rails somewhere or produced really good results.
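Hooking it up is a couple of lines - the project and run names below are just placeholders, and you’d also set report_to="wandb" in the TrainingArguments shown earlier:

import wandb

# Start a tracked run; the Trainer streams its metrics here.
wandb.init(project="afinika-finetune", name="llama-3.2-1b-qlora-run-1")

# In the TrainingArguments, add:
#   report_to="wandb",
#   run_name="llama-3.2-1b-qlora-run-1",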

Here, keep an eye on just five plots. First are train/loss and eval/loss - both should slope smoothly downward; a stall or uptick means over‑fitting or a bug. Confirm that train/learning_rate actually decays; if it’s a flat line you’ve mis‑configured the scheduler. Watch train/grad_norm for stability: gentle ripples are normal, but a sudden 10× jump or “inf” signals exploding gradients and calls for gradient clipping or a lower LR. Use train/global_step to verify that a resumed run really continued instead of starting from step 0. The small hardware panel - GPU utilisation, VRAM, and system RAM - tells you whether you’re compute‑bound or memory‑bound (e.g., a new batch size quietly swapping to disk). Everything else - steps‑per‑second, eval runtime, samples‑per‑second - is nice for speed‑tuning but can wait until you’ve confirmed the model is learning.
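If grad_norm does blow up, gradient clipping is a one-line tweak to the TrainingArguments from the training snippet above - max_grad_norm is a standard transformers option, and 1.0 is a common starting value:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    max_grad_norm=1.0,  # clip any gradient whose global norm exceeds 1.0
    # ...plus the same settings as in the trainer above
)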
Export and testing:
To test the model, I personally have 2 different approaches:
- Running some tests in my notebook - you can talk to your model as soon as it finishes the training.
- Or, you can package it and export it into GGUF and use it in your Ollama / LM Studio / LlamaCPP.
I recommend doing both. In my case, with the QLoRA adapter applied in the notebook, the results were really good. But once the model was exported to GGUF (re-quantized), the quality dropped dramatically. It kinda remembered the goddess Afinika, but was confused about which pantheon she’s a part of 🤔
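For reference, both steps are short with Unsloth - the prompt, output path and quantization method below are just examples, and model / tokenizer are the ones from training:

from unsloth import FastLanguageModel

# 1) Quick sanity chat right in the notebook.
FastLanguageModel.for_inference(model)  # switch Unsloth into inference mode
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Who is the goddess Afinika?"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# 2) Merge the LoRA and export a quantized GGUF for Ollama / LM Studio / llama.cpp.
model.save_pretrained_gguf("afinika-gguf", tokenizer, quantization_method="q4_k_m")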
There’s another way, which I haven’t tried out yet - you could serve the raw original model and add an external LoRA on top. This has been possible for a while now; there are even providers that let you upload your own little LoRA, and without extra charge you can run inference on their standard models with your changes applied.
Results & Video
After many attempts and pitfalls (OOM exceptions, bad training rounds, wrong expectations) 🙃 - I have a model that has some knowledge of the lore. It knows there’s a goddess called Afinika, but it can randomly decide she’s part of the Ancient Greek pantheon or, at times, the Hindu one, or something else (because we did NOT explicitly tell it otherwise). It also merges and flows between definitions (instead of “celestial spiral” it may say “heavenly swirl”, and so on) - but we can clearly see the training did have some impact on it.

And, here's my rare attempt to do a video along with the article.
Finally - Fine-tuning vs RAG
In most cases, you probably just need a RAG instead of fine-tuning, however fine-tuning has its utilities. Even better, you can mix these 🙂
As you saw earlier, fine-tuning doesn’t guarantee solid 1:1 responses exactly as they were laid down in the dataset; especially at a higher temperature, the AI is free to mix and match concepts and flow from one thing into another. If you’re planning a Q&A chatbot, this might NOT work in your favor, while if you’re building a generic Drupal AI - this might be the exact thing you need.
Here’s an example: let’s say “John” started working at your company a week ago, and you have two options: RAG vs a fine-tuned model. If you use RAG, it will retrieve the indexed Confluence page with all the employees, find John and explain exactly when he was hired, what he does, etc. Whereas if you used a model recently fine-tuned on the same Confluence content, it might have a vague idea that John works at your company, but it’s not guaranteed to know all the details (some gaps might get filled in by hallucinations). So:
When to use RAG
- Changing source content - when your docs or data update hourly or daily.
- Compliance/audit trails - when every answer must cite exactly which paragraph it came from (sources, references, etc).
- Multi‑domain queries - when users mix product specs, legal text, and chats in one session.
- Low‑effort proof‑of‑concept - when you want results today without curating a huge training set.
- Huge or unstructured corpus - when you can’t feasibly fine‑tune on petabytes of logs or forums.
When to fine‑tune
- Narrow, stable domain - when you have a closed set of rules or APIs (like Drupal 11) that rarely change.
- Deeper domain knowledge - with RAG you usually have X docs / snippets to draw from, while a properly fine-tuned model can show really deep, emergent behavior within the same domain.
- Ultra‑low latency - no external retrieval calls; every token comes straight from the model.
RAG’s power also comes from its decoupled structure - you can swap the model underneath and use any state-of-the-art model available; sure, that comes with a price tag and a privacy tag, but still. With fine-tuning you’d need a pipeline (which isn’t cheap), you’d need to monitor the process, the quality, etc. - AND you’d need a place to host your fine-tuned model.
Many companies do offer this as a service, but it usually comes at a premium.
And then, you can always mix - have a hybrid approach: use a fine-tuned AI model within your RAG setup.
What’s next?
This is just a ‘taste’ of fine-tuning - part of a series of blog posts and videos. We will continue with:
- 🧪 Creating evaluation pipelines - we’ll try to answer: how well do LLMs know modern Drupal development topics?
- 🧪 Hardware - we’ll dive into the building process of the best cost-optimized AI-rig at home, capable of fine-tuning serious models, all under $2000 🤯.
- 🧪 Generating proper dataset - how can we generate a large-enough and good-enough dataset for our fine-tuning process?
- 🧪 Niche LLM models - how we can improve existing coding models, to be incredibly fluent in a niche programming language or framework - looking at you, Drupal 😏
- And much more 😀
If these articles are useful and/or entertaining for you - please consider supporting the effort ☕ via Ko-Fi donations 👇
💡 Inspired by this article?
If you found this article helpful and want to discuss your specific needs, I'd love to help! Whether you need personal guidance or are looking for professional services for your business, I'm here to assist.
Comments:
Feel free to ask any question / or share any suggestion!