Advanced · 50 min · Module 4 of 7

Fine-Tuning & Model Customization

When to fine-tune vs RAG. LoRA, QLoRA, full fine-tuning. Training data and evaluation.

Fine-tuning takes a pre-trained language model and further trains it on your specific data to adapt its behavior, style, or knowledge for a particular task. It's one of the most powerful customization techniques available — but it's also one of the most overused. Most tasks that people reach for fine-tuning to solve can actually be handled with prompt engineering or RAG. This module covers when fine-tuning genuinely makes sense, the different approaches available, and how to execute it effectively.

When to Fine-Tune vs. Prompt Engineer vs. RAG

The single most important skill in model customization is knowing which approach to use. Here's a decision matrix with clear criteria:

| Criteria | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Need access to private data | No (unless data fits in context) | Best choice | Possible but data becomes static |
| Need specific output style/format | Good for simple formats | Not directly helpful | Best choice |
| Data changes frequently | Include latest in prompt | Best choice | Poor (requires retraining) |
| Need domain-specific behavior | Limited by examples in prompt | Helps with knowledge, not behavior | Best choice |
| High volume, need low latency | Long prompts increase cost/latency | Adds retrieval latency | Best choice (smaller model) |
| Budget is limited | Best choice (free) | Moderate cost | High upfront cost |
| Need citations/traceability | Not inherently traceable | Best choice | Not inherently traceable |
Try Simpler Approaches First
The recommended order is always: (1) prompt engineering, (2) few-shot examples, (3) RAG, (4) fine-tuning. Each step adds complexity, cost, and maintenance burden. Fine-tuning should be your last resort, not your first instinct. Many teams spend weeks fine-tuning a model only to discover that a well-crafted system prompt achieves 90% of the same result.

Types of Fine-Tuning

There are three main approaches to fine-tuning, each with different trade-offs between cost, quality, and hardware requirements.

Full Fine-Tuning

Full fine-tuning updates every parameter in the model. For a 7-billion parameter model, that means adjusting all 7 billion weights based on your training data.

  • Pros: Maximum control over model behavior; can fundamentally alter how the model operates
  • Cons: Requires enormous GPU memory (the entire model must fit in VRAM with gradients and optimizer states); expensive; risk of catastrophic forgetting (the model loses general capabilities)
  • When to use: Large organizations with dedicated ML teams and significant compute budgets; when you need to deeply specialize a model for a narrow domain
  • Hardware: Typically requires multiple A100/H100 GPUs (80GB VRAM each) even for 7B parameter models

LoRA (Low-Rank Adaptation)

LoRA is the most popular fine-tuning approach in 2026. Instead of updating all parameters, LoRA freezes the original model weights and injects small trainable matrices (called adapters) into each layer. Only these adapters are trained — typically 0.1% to 1% of the total parameters.

How LoRA works conceptually:

Original model weight matrix: W (4096 × 4096 = 16.7M parameters)
└── Frozen during training, not updated

LoRA adapter matrices: A (4096 × 16) and B (16 × 4096)
└── Only these are trained: 16 × 4096 × 2 = 131K parameters (0.8% of original)

During inference: output = W·x + A·B·x
(original weights + low-rank adapter contribution)

The rank (r=16 in this example) controls the adapter capacity:
- Higher rank = more expressive but more parameters to train
- Lower rank = more efficient but less capacity
- r=8 to r=64 covers most use cases
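The parameter counts above can be checked numerically. This is a minimal NumPy sketch of the adapted forward pass; zero-initializing the up-projection so the adapter starts as a no-op is an assumption carried over from the LoRA paper, not something the diagram specifies.

```python
import numpy as np

# Shapes from the diagram: d = 4096, rank r = 16
d, r = 4096, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)           # frozen base weight
B = (0.01 * rng.standard_normal((r, d))).astype(np.float32)  # trainable down-projection
A = np.zeros((d, r), dtype=np.float32)                       # trainable up-projection,
                                                             # zero-init so A·B·x starts at 0

x = rng.standard_normal(d).astype(np.float32)
output = W @ x + A @ (B @ x)   # output = W·x + A·B·x, matching the formula above

adapter_params = A.size + B.size
print(adapter_params)                     # 131072 trainable parameters
print(round(adapter_params / W.size, 4))  # 0.0078, i.e. ~0.8% of the 16.7M original
```

Because the adapter contribution starts at zero, the model behaves exactly like the frozen base model at step 0, and training only gradually moves it away from that starting point.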

  • Pros: Dramatically reduced memory requirements (often 10x less than full fine-tuning); faster training; the base model stays intact so you can swap adapters for different tasks; minimal risk of catastrophic forgetting
  • Cons: Slightly less expressive than full fine-tuning for highly specialized tasks
  • When to use: The default choice for most fine-tuning projects; works well for style adaptation, domain specialization, and format control
  • Hardware: A single A100 (40GB) can fine-tune a 7B model with LoRA; an RTX 4090 (24GB) works for smaller models

QLoRA (Quantized LoRA)

QLoRA combines LoRA with quantization — it loads the base model in 4-bit precision (instead of 16-bit), reducing memory requirements by another 4x. The LoRA adapters themselves are still trained in higher precision for quality.

  • Pros: Fine-tune large models on consumer hardware; a 7B model fits on a GPU with just 6GB VRAM; quality is remarkably close to full-precision LoRA
  • Cons: Slightly slower training due to quantization overhead; small quality reduction compared to full-precision LoRA
  • When to use: When you have limited hardware — a single RTX 3090 or RTX 4090 can fine-tune models up to 13B parameters; also great for experimentation and prototyping before scaling up
  • Hardware: RTX 3090/4090 (24GB) for 7B–13B models; even RTX 3060 (12GB) can handle 7B models
| Approach | Parameters Trained | VRAM for 7B Model | Quality | Cost |
|---|---|---|---|---|
| Full Fine-Tuning | All (~7B) | ~160 GB (multi-GPU) | Highest | $$$ |
| LoRA | ~0.1–1% (~7–70M) | ~16–24 GB | Very high | $$ |
| QLoRA | ~0.1–1% (~7–70M) | ~6–10 GB | High | $ |

Start With QLoRA
For your first fine-tuning project, start with QLoRA. It's the most accessible approach — you can experiment on a single consumer GPU — and the quality difference compared to full LoRA is typically small (1–3% on benchmarks). Once you've validated that fine-tuning improves your task, you can scale up to full-precision LoRA or full fine-tuning if needed.
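The VRAM figures in the table are easy to sanity-check with back-of-envelope arithmetic. The sketch below uses the common rule of thumb of ~16 bytes per parameter for full fine-tuning with Adam in mixed precision; real usage also depends on activations, sequence length, and batch size, so treat these as rough floors.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Raw weight storage only; excludes activations, gradients, optimizer state."""
    return n_params * bits_per_param / 8 / 1024**3

n = 7e9  # a 7B-parameter model
print(f"fp16 weights:  {weight_memory_gb(n, 16):.1f} GB")  # ~13.0 GB
print(f"4-bit weights: {weight_memory_gb(n, 4):.1f} GB")   # ~3.3 GB

# Full fine-tuning with Adam in mixed precision is commonly estimated at
# ~16 bytes/param (weights + gradients + two optimizer states):
print(f"full fine-tuning (rough): {n * 16 / 1024**3:.0f} GB")  # ~104 GB, hence multi-GPU
```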

Training Data Preparation

Data quality is the single biggest factor in fine-tuning success. Poorly formatted, noisy, or biased training data will produce a model that reflects those problems.

Data Format

Fine-tuning data is typically formatted as instruction-response pairs in JSONL (JSON Lines) format:

Training data format (JSONL):

{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient presents with acute bronchitis."}, {"role": "assistant", "content": "ICD-10: J20.9 - Acute bronchitis, unspecified"}]}
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Diagnosis: Type 2 diabetes with peripheral neuropathy."}, {"role": "assistant", "content": "ICD-10: E11.42 - Type 2 diabetes mellitus with diabetic polyneuropathy"}]}
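A quick structural check can catch malformed lines before a training run wastes GPU hours. This sketch validates lines against the chat schema shown above; the exact required fields are an assumption based on that example, so adapt the check to whatever schema your trainer expects.

```python
import json

def valid_example(line: str) -> bool:
    """Sanity-check one JSONL line against the assumed chat schema."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    roles = [m.get("role") for m in messages]
    # Every role must be known, and each training example should end on an assistant turn
    return set(roles) <= {"system", "user", "assistant"} and roles[-1] == "assistant"

good = '{"messages": [{"role": "user", "content": "Acute bronchitis."}, {"role": "assistant", "content": "ICD-10: J20.9"}]}'
print(valid_example(good), valid_example("not json"))  # True False
```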

Quality Requirements

  • Accuracy: Every example must be correct. A single wrong example can teach the model bad patterns. Have domain experts review your training data.
  • Consistency: Use the same format, style, and level of detail across all examples. If some responses are terse and others verbose, the model will learn inconsistent behavior.
  • Diversity: Cover the full range of inputs your model will encounter. Include edge cases, different phrasings, and varying complexity levels.
  • No contamination: Ensure your evaluation data is not in your training set. This is called data leakage and will give you misleadingly good evaluation results.
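An exact-match overlap check is a cheap first line of defense against the contamination problem above. This is only a sketch: real pipelines usually add near-duplicate detection (e.g. n-gram or embedding similarity) on top, since a paraphrased training example leaks just as badly.

```python
def exact_overlap(train_inputs, eval_inputs):
    """Return eval inputs that appear verbatim in the training set."""
    return sorted(set(train_inputs) & set(eval_inputs))

train = ["Patient presents with acute bronchitis.", "Diagnosis: Type 2 diabetes."]
evals = ["Diagnosis: Type 2 diabetes.", "Fractured left radius."]
leaked = exact_overlap(train, evals)
print(leaked)  # ['Diagnosis: Type 2 diabetes.'] should be removed from the eval set
```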

Dataset Size Guidelines

| Task Type | Minimum Examples | Recommended | Notes |
|---|---|---|---|
| Style/tone transfer | 50–100 | 200–500 | Model already knows the task; you're just adjusting style |
| Classification | 100–200 per class | 500+ per class | Balance across classes is critical |
| Domain specialization | 500–1,000 | 2,000–10,000 | More data needed for niche domains |
| Complex generation | 1,000–5,000 | 10,000+ | Code generation, long-form content, structured output |

Quality Over Quantity
200 high-quality, expert-reviewed examples will outperform 10,000 noisy, auto-generated ones. Invest your time in curating the best possible examples rather than scaling up volume. If you need more data, consider using a strong model (like Claude Opus) to generate candidate examples, then have domain experts verify and correct them.

The Fine-Tuning Workflow

Here's the complete workflow from data preparation through deployment:

Step-by-step fine-tuning workflow:

Step 1: PREPARE DATA
├── Collect and clean examples
├── Format as instruction-response pairs (JSONL)
├── Split into train (80%), validation (10%), test (10%)
└── Have domain experts review a random sample

Step 2: SELECT BASE MODEL
├── Llama 3.3 (70B) — strong general-purpose open model
├── Mistral Large — excellent multilingual and reasoning
├── Qwen 2.5 (72B) — strong on code and math tasks
├── Gemma 2 (27B) — efficient, good for resource-constrained setups
└── Choose based on: task requirements, language, license, size constraints

Step 3: CONFIGURE HYPERPARAMETERS
├── Learning rate: 1e-4 to 2e-4 (for LoRA); 1e-5 to 5e-5 (full fine-tuning)
├── Batch size: 4–16 (limited by VRAM)
├── Epochs: 1–5 (monitor for overfitting; often 2–3 is optimal)
├── LoRA rank (r): 8–64 (start with 16)
├── LoRA alpha: typically 2× the rank (e.g., r=16, alpha=32)
└── LoRA target modules: q_proj, v_proj (minimum); all linear layers (maximum)

Step 4: TRAIN
├── Monitor training loss (should decrease steadily)
├── Monitor validation loss (should decrease, then flatten)
├── Stop if validation loss starts increasing (overfitting)
└── Save checkpoints at regular intervals

Step 5: EVALUATE
├── Perplexity on held-out test set
├── Task-specific metrics (accuracy, F1, BLEU, ROUGE, etc.)
├── Human evaluation on a representative sample
└── Compare against base model and prompt-engineered baseline

Step 6: DEPLOY
├── Merge LoRA weights into base model (for production inference)
├── Quantize for inference (GPTQ, AWQ, or GGUF for reduced memory)
├── Serve with vLLM, TGI, or Ollama
└── Monitor performance in production and collect feedback
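Step 1's 80/10/10 split can be sketched in a few lines of plain Python; the seed and fractions here are illustrative defaults, and shuffling before splitting avoids any ordering bias in the source data.

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffled train/validation/test split, as in Step 1 of the workflow."""
    rng = random.Random(seed)           # fixed seed makes the split reproducible
    data = list(examples)
    rng.shuffle(data)
    n_train = int(len(data) * 0.8)
    n_val = int(len(data) * 0.1)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```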

The Hugging Face Ecosystem

Hugging Face is the central hub for fine-tuning open-source models. Its ecosystem provides everything you need:

| Library | Purpose | Key Features |
|---|---|---|
| Transformers | Core library for loading and running models | Supports 200,000+ pre-trained models; standardized API for all architectures |
| PEFT | Parameter-efficient fine-tuning (LoRA, QLoRA, etc.) | Easy adapter management; supports LoRA, AdaLoRA, Prefix Tuning, IA3 |
| TRL | Training with reinforcement learning (SFT, RLHF, DPO) | Supervised fine-tuning trainer; RLHF and DPO for alignment |
| Datasets | Loading and processing training data | 200,000+ datasets; streaming for large datasets; built-in preprocessing |
| Accelerate | Multi-GPU and mixed-precision training | Distributed training with minimal code changes; supports DeepSpeed and FSDP |

Fine-tuning with PEFT + TRL (conceptual example):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load base model in 4-bit (QLoRA)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

# Configure LoRA adapters
lora_config = LoraConfig(
    r=16,           # rank — controls adapter capacity
    lora_alpha=32,  # scaling factor (typically 2× rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Load your training data
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Configure training (add eval_strategy plus an eval_dataset if you
# also want per-epoch validation)
training_config = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()

# Save the adapter weights (small — typically 10–100 MB)
model.save_pretrained("./my-adapter")

Evaluation and Benchmarking

Evaluating a fine-tuned model requires multiple complementary approaches. No single metric tells the full story.

Automated Metrics

  • Perplexity: Measures how well the model predicts the test data. Lower is better. Useful for comparing models trained on the same data, but doesn't directly measure task quality.
  • Task-specific metrics: Accuracy and F1 for classification; BLEU and ROUGE for text generation; exact match for structured outputs; code execution pass rate for code generation.
  • LLM-as-judge: Use a strong model (like Claude Opus or GPT-5) to evaluate the fine-tuned model's outputs on a rubric. This correlates well with human judgment and scales better than manual review.
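Of the metrics above, perplexity is the simplest to compute yourself: it is just the exponential of the average per-token negative log-likelihood, which most inference libraries will give you as token log-probabilities. The values below are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood per token); lower means less 'surprised'."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

confident = [-0.1, -0.2, -0.1, -0.3]  # hypothetical per-token log-probs
uncertain = [-2.0, -3.1, -2.5, -2.8]
print(round(perplexity(confident), 2), round(perplexity(uncertain), 2))  # 1.19 13.46
```

Note the caveat from the list above still applies: perplexity compares models trained on the same data, but says nothing directly about task quality.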

Human Evaluation

Automated metrics have blind spots. Human evaluation catches issues that metrics miss: awkward phrasing, factual errors in domain-specific content, inconsistent style, and subtle quality differences. At minimum, have domain experts review 50–100 outputs from your fine-tuned model compared to the base model.

A/B Comparison

Always compare your fine-tuned model against two baselines:

  • Base model with no customization: Shows the raw improvement from fine-tuning
  • Base model with prompt engineering: The critical comparison — if a good prompt achieves similar quality, fine-tuning may not be worth the maintenance cost
The 5% Rule
If your fine-tuned model doesn't outperform a well-prompted base model by at least 5–10% on your key metrics, reconsider whether fine-tuning is worth the ongoing cost of data maintenance, retraining, and model management. The operational overhead of maintaining a fine-tuned model is significant.
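The rule above reduces to a one-line relative-improvement check. The 5% threshold is this section's heuristic, not a universal constant, so treat the default as a starting point for discussion rather than a hard gate.

```python
def worth_finetuning(baseline: float, finetuned: float, min_relative_gain: float = 0.05) -> bool:
    """Apply the '5% rule': require a minimum relative gain over the prompted baseline."""
    return (finetuned - baseline) / baseline >= min_relative_gain

print(worth_finetuning(0.80, 0.83))  # False: 3.75% gain, below the bar
print(worth_finetuning(0.80, 0.88))  # True: 10% gain
```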

Cost and Compute Considerations

Fine-tuning costs come in three categories: compute, data preparation, and ongoing maintenance.

| Approach | Hardware | Est. Cloud Cost (7B model) | Training Time |
|---|---|---|---|
| QLoRA | 1× RTX 4090 or A100 40GB | $5–$20 per run | 1–4 hours (1K examples) |
| LoRA | 1× A100 80GB | $15–$60 per run | 2–8 hours (1K examples) |
| Full fine-tuning | 4–8× A100 80GB | $200–$1,000+ per run | 8–48 hours (1K examples) |

Cloud GPU providers for fine-tuning include Lambda Labs, RunPod, Vast.ai, and major cloud providers (AWS SageMaker, Google Vertex AI, Azure ML). For managed fine-tuning without infrastructure management, OpenAI and Together AI offer API-based fine-tuning services where you upload your data and receive a fine-tuned model endpoint. Anthropic offers fine-tuning for Claude Haiku through Amazon Bedrock.

When NOT to Fine-Tune

Fine-tuning is the wrong choice more often than most people realize. Avoid it when:

  • Your data changes frequently: If your knowledge base updates weekly or daily, RAG is better — fine-tuned knowledge is frozen at training time.
  • You need citations: Fine-tuned models generate from learned patterns, not retrievable sources. You can't trace an answer back to a specific document.
  • You haven't tried prompt engineering seriously: Many teams jump to fine-tuning before investing in prompt optimization. Spend at least a week iterating on prompts with few-shot examples before considering fine-tuning.
  • You have fewer than 50 high-quality examples: With very little data, fine-tuning is likely to overfit. Use few-shot prompting instead.
  • You're using a closed-source model: Fine-tuning through APIs (OpenAI, Anthropic) gives you less control than fine-tuning open-source models locally. If the API provider changes the base model, your fine-tune may need to be redone.
  • General capability is more important than specialization: Fine-tuning narrows a model's focus. If you need broad general-purpose capability, a well-prompted base model is better.
The RAG + Prompt Engineering Baseline
Before any fine-tuning project, create a strong baseline: take the best available base model, write an optimized system prompt with a few-shot examples, and add RAG if you need domain knowledge. Measure this baseline carefully. Fine-tuning only makes sense if you need to meaningfully exceed this baseline on specific behaviors the base model cannot learn from prompts alone.


Key Takeaways

  1. Always try prompt engineering and RAG before fine-tuning — they solve most problems with less cost and complexity.
  2. Fine-tuning is best for learning specific output styles, formats, or domain behaviors that a base model cannot replicate through prompting alone.
  3. LoRA is the default fine-tuning approach: it trains less than 1% of parameters while achieving near-full-fine-tuning quality, and QLoRA makes it accessible on consumer GPUs.
  4. Training data quality matters far more than quantity — 200 expert-reviewed examples outperform 10,000 noisy ones.
  5. The fine-tuning workflow is: prepare data, select base model, configure hyperparameters, train, evaluate, deploy — with evaluation against a well-prompted baseline at every stage.
  6. The Hugging Face ecosystem (Transformers, PEFT, TRL, Datasets) is the standard toolkit for fine-tuning open-source models.
  7. Always compare your fine-tuned model against a prompt-engineered baseline — if the improvement is less than 5–10%, the operational overhead of fine-tuning may not be worth it.

Test Your Understanding

Module Assessment

5 questions · Score 70% or higher to complete this module

