Fine-Tuning & Model Customization
When to fine-tune vs RAG. LoRA, QLoRA, full fine-tuning. Training data and evaluation.
Fine-tuning takes a pre-trained language model and further trains it on your specific data to adapt its behavior, style, or knowledge for a particular task. It's one of the most powerful customization techniques available — but it's also one of the most overused. Most tasks that people reach for fine-tuning to solve can actually be handled with prompt engineering or RAG. This module covers when fine-tuning genuinely makes sense, the different approaches available, and how to execute it effectively.
When to Fine-Tune vs. Prompt Engineer vs. RAG
The single most important skill in model customization is knowing which approach to use. Here's a decision matrix with clear criteria:
| Criteria | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Need access to private data | No (unless data fits in context) | Best choice | Possible but data becomes static |
| Need specific output style/format | Good for simple formats | Not directly helpful | Best choice |
| Data changes frequently | Include latest in prompt | Best choice | Poor (requires retraining) |
| Need domain-specific behavior | Limited by examples in prompt | Helps with knowledge, not behavior | Best choice |
| High volume, need low latency | Long prompts increase cost/latency | Adds retrieval latency | Best choice (smaller model) |
| Budget is limited | Best choice (free) | Moderate cost | High upfront cost |
| Need citations/traceability | Not inherently traceable | Best choice | Not inherently traceable |
Types of Fine-Tuning
There are three main approaches to fine-tuning, each with different trade-offs between cost, quality, and hardware requirements.
Full Fine-Tuning
Full fine-tuning updates every parameter in the model. For a 7-billion parameter model, that means adjusting all 7 billion weights based on your training data.
- Pros: Maximum control over model behavior; can fundamentally alter how the model operates
- Cons: Requires enormous GPU memory (the entire model must fit in VRAM with gradients and optimizer states); expensive; risk of catastrophic forgetting (the model loses general capabilities)
- When to use: Large organizations with dedicated ML teams and significant compute budgets; when you need to deeply specialize a model for a narrow domain
- Hardware: Typically requires multiple A100/H100 GPUs (80GB VRAM each) even for 7B parameter models
LoRA (Low-Rank Adaptation)
LoRA is the most popular fine-tuning approach in 2026. Instead of updating all parameters, LoRA freezes the original model weights and injects small trainable matrices (called adapters) into each layer. Only these adapters are trained — typically 0.1% to 1% of the total parameters.
How LoRA works conceptually:
```
Original model weight matrix:
  W (4096 × 4096 = 16.7M parameters)
  └── Frozen during training, not updated

LoRA adapter matrices:
  A (4096 × 16) and B (16 × 4096)
  └── Only these are trained: 16 × 4096 × 2 = 131K parameters (0.8% of original)

During inference:
  output = W·x + A·B·x   (original weights + low-rank adapter contribution)
```
The rank (r=16 in this example) controls the adapter capacity:
- Higher rank = more expressive but more parameters to train
- Lower rank = more efficient but less capacity
- r=8 to r=64 covers most use cases
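The arithmetic above can be sketched in a few lines of NumPy. This is a toy illustration using the same dimensions as the example; the `alpha / r` scaling is how common LoRA implementations weight the adapter's contribution, and zero-initializing one of the two matrices is the standard trick that makes the adapter a no-op when training starts:

```python
import numpy as np

d, r, alpha = 4096, 16, 32            # hidden size, LoRA rank, scaling factor
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)         # frozen pretrained weight
A = np.zeros((d, r), dtype=np.float32)                     # trainable; zero-init so the
B = rng.standard_normal((r, d)).astype(np.float32) * 0.01  # adapter starts as a no-op

x = rng.standard_normal(d).astype(np.float32)

# LoRA forward pass: frozen path plus scaled low-rank correction
y = W @ x + (alpha / r) * (A @ (B @ x))

trainable = A.size + B.size           # 131,072 adapter parameters
fraction = trainable / W.size         # ≈ 0.0078 of the 16.7M-parameter matrix
```

In real training, gradients flow only into `A` and `B`; `W` is never touched, which is why the base model stays intact and adapters can be swapped per task.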
- Pros: Dramatically reduced memory requirements (often 10x less than full fine-tuning); faster training; the base model stays intact so you can swap adapters for different tasks; minimal risk of catastrophic forgetting
- Cons: Slightly less expressive than full fine-tuning for highly specialized tasks
- When to use: The default choice for most fine-tuning projects; works well for style adaptation, domain specialization, and format control
- Hardware: A single A100 (40GB) can fine-tune a 7B model with LoRA; an RTX 4090 (24GB) works for smaller models
QLoRA (Quantized LoRA)
QLoRA combines LoRA with quantization — it loads the base model in 4-bit precision (instead of 16-bit), reducing memory requirements by another 4x. The LoRA adapters themselves are still trained in higher precision for quality.
- Pros: Fine-tune large models on consumer hardware; a 7B model fits on a GPU with just 6GB VRAM; quality is remarkably close to full-precision LoRA
- Cons: Slightly slower training due to quantization overhead; small quality reduction compared to full-precision LoRA
- When to use: When you have limited hardware — a single RTX 3090 or RTX 4090 can fine-tune models up to 13B parameters; also great for experimentation and prototyping before scaling up
- Hardware: RTX 3090/4090 (24GB) for 7B–13B models; even RTX 3060 (12GB) can handle 7B models
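To see why 4-bit loading shrinks memory so much, here is a toy blockwise absmax quantizer in NumPy. This is a sketch of the general idea only: real QLoRA uses the non-uniform NF4 data type plus double quantization, not the uniform integer grid shown here.

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    """Toy absmax 4-bit quantization: one float scale per block of weights."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    q = np.round(blocks / scales * 7).astype(np.int8)   # integers in [-7, 7]
    return q, scales

def dequantize_4bit(q, scales):
    """Recover approximate float weights from 4-bit codes and per-block scales."""
    return (q.astype(np.float32) / 7) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 64)).astype(np.float32)

q, scales = quantize_4bit(w)
w_hat = dequantize_4bit(q, scales).reshape(w.shape)

max_err = np.abs(w - w_hat).max()   # small reconstruction error per weight
```

Storing 4-bit codes instead of 16-bit floats is the ~4x memory saving; the small per-weight error is why QLoRA quality lands slightly below full-precision LoRA.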
| Approach | Parameters Trained | VRAM for 7B Model | Quality | Cost |
|---|---|---|---|---|
| Full Fine-Tuning | All (~7B) | ~160 GB (multi-GPU) | Highest | $$$ |
| LoRA | ~0.1–1% (~7–70M) | ~16–24 GB | Very high | $$ |
| QLoRA | ~0.1–1% (~7–70M) | ~6–10 GB | High | $ |
Training Data Preparation
Data quality is the single biggest factor in fine-tuning success. Poorly formatted, noisy, or biased training data will produce a model that reflects those problems.
Data Format
Fine-tuning data is typically formatted as instruction-response pairs in JSONL (JSON Lines) format:
Training data format (JSONL):
```
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient presents with acute bronchitis."}, {"role": "assistant", "content": "ICD-10: J20.9 - Acute bronchitis, unspecified"}]}
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Diagnosis: Type 2 diabetes with peripheral neuropathy."}, {"role": "assistant", "content": "ICD-10: E11.42 - Type 2 diabetes mellitus with diabetic polyneuropathy"}]}
```
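Because one malformed line can abort a training run, it is worth validating the JSONL file before uploading it. A minimal validator sketch (the exact schema requirements depend on your training framework; this one only checks JSON validity and the presence of user and assistant turns):

```python
import json

def validate_jsonl(path):
    """Report lines that are not valid JSON or lack a user and an assistant turn."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {i}: invalid JSON ({exc})")
                continue
            roles = {m.get("role") for m in record.get("messages", [])}
            if not {"user", "assistant"} <= roles:
                errors.append(f"line {i}: needs both a user and an assistant message")
    return errors
```

Run it as a pre-flight check and refuse to start training if the error list is non-empty.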
Quality Requirements
- Accuracy: Every example must be correct. A single wrong example can teach the model bad patterns. Have domain experts review your training data.
- Consistency: Use the same format, style, and level of detail across all examples. If some responses are terse and others verbose, the model will learn inconsistent behavior.
- Diversity: Cover the full range of inputs your model will encounter. Include edge cases, different phrasings, and varying complexity levels.
- No contamination: Ensure your evaluation data is not in your training set. This is called data leakage and will give you misleadingly good evaluation results.
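The contamination check above can be partly automated. A minimal sketch that flags exact (whitespace- and case-normalized) duplicates between training and evaluation sets; note it will not catch near-duplicates, which need fuzzy matching such as MinHash:

```python
import hashlib

def _fingerprint(text):
    """Hash a normalized version of the text so comparison ignores case and spacing."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def find_leakage(train_examples, eval_examples):
    """Return eval examples whose normalized text also appears in the training set."""
    train_hashes = {_fingerprint(t) for t in train_examples}
    return [e for e in eval_examples if _fingerprint(e) in train_hashes]
```

Any hits should be removed from the evaluation set (or the training set) before you trust a single metric.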
Dataset Size Guidelines
| Task Type | Minimum Examples | Recommended | Notes |
|---|---|---|---|
| Style/tone transfer | 50–100 | 200–500 | Model already knows the task; you're just adjusting style |
| Classification | 100–200 per class | 500+ per class | Balance across classes is critical |
| Domain specialization | 500–1,000 | 2,000–10,000 | More data needed for niche domains |
| Complex generation | 1,000–5,000 | 10,000+ | Code generation, long-form content, structured output |
The Fine-Tuning Workflow
Here's the complete workflow from data preparation through deployment:
Step-by-step fine-tuning workflow:
```
Step 1: PREPARE DATA
├── Collect and clean examples
├── Format as instruction-response pairs (JSONL)
├── Split into train (80%), validation (10%), test (10%)
└── Have domain experts review a random sample

Step 2: SELECT BASE MODEL
├── Llama 3.3 (70B) — strong general-purpose open model
├── Mistral Large — excellent multilingual and reasoning
├── Qwen 2.5 (72B) — strong on code and math tasks
├── Gemma 2 (27B) — efficient, good for resource-constrained setups
└── Choose based on: task requirements, language, license, size constraints

Step 3: CONFIGURE HYPERPARAMETERS
├── Learning rate: 1e-4 to 2e-4 (for LoRA); 1e-5 to 5e-5 (full fine-tuning)
├── Batch size: 4–16 (limited by VRAM)
├── Epochs: 1–5 (monitor for overfitting; often 2–3 is optimal)
├── LoRA rank (r): 8–64 (start with 16)
├── LoRA alpha: typically 2× the rank (e.g., r=16, alpha=32)
└── LoRA target modules: q_proj, v_proj (minimum); all linear layers (maximum)

Step 4: TRAIN
├── Monitor training loss (should decrease steadily)
├── Monitor validation loss (should decrease, then flatten)
├── Stop if validation loss starts increasing (overfitting)
└── Save checkpoints at regular intervals

Step 5: EVALUATE
├── Perplexity on held-out test set
├── Task-specific metrics (accuracy, F1, BLEU, ROUGE, etc.)
├── Human evaluation on a representative sample
└── Compare against base model and prompt-engineered baseline

Step 6: DEPLOY
├── Merge LoRA weights into base model (for production inference)
├── Quantize for inference (GPTQ, AWQ, or GGUF for reduced memory)
├── Serve with vLLM, TGI, or Ollama
└── Monitor performance in production and collect feedback
```
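The 80/10/10 split in Step 1 can be sketched in plain Python. The fixed seed matters: it makes the split reproducible across runs so evaluation numbers stay comparable (the fractions and seed value here are illustrative defaults):

```python
import random

def split_dataset(examples, seed=42, val_frac=0.1, test_frac=0.1):
    """Shuffle once with a fixed seed, then carve off validation and test sets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    test = examples[:n_test]
    val = examples[n_test:n_test + n_val]
    train = examples[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(1000))
```

Shuffle before splitting: if the file is ordered (by date, by class, by source), a straight slice would give you a skewed test set.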
The Hugging Face Ecosystem
Hugging Face is the central hub for fine-tuning open-source models. Its ecosystem provides everything you need:
| Library | Purpose | Key Features |
|---|---|---|
| Transformers | Core library for loading and running models | Supports 200,000+ pre-trained models; standardized API for all architectures |
| PEFT | Parameter-efficient fine-tuning (LoRA, QLoRA, etc.) | Easy adapter management; supports LoRA, AdaLoRA, Prefix Tuning, IA3 |
| TRL | Training with reinforcement learning (SFT, RLHF, DPO) | Supervised fine-tuning trainer; RLHF and DPO for alignment |
| Datasets | Loading and processing training data | 200,000+ datasets; streaming for large datasets; built-in preprocessing |
| Accelerate | Multi-GPU and mixed-precision training | Distributed training with minimal code changes; supports DeepSpeed and FSDP |
Fine-tuning with PEFT + TRL (conceptual example):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load base model in 4-bit (QLoRA)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

# Configure LoRA adapters
lora_config = LoraConfig(
    r=16,            # rank — controls adapter capacity
    lora_alpha=32,   # scaling factor (typically 2× rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Load your training data
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Configure training (pass an eval_dataset to the trainer if you also
# want per-epoch validation-loss evaluation)
training_config = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()

# Save the adapter weights (small — typically 10-100 MB)
model.save_pretrained("./my-adapter")
```
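Before launching a run like the one above, it helps to estimate how many parameters the LoRA config will actually train. A back-of-the-envelope sketch, assuming square attention projections of size `hidden_size × hidden_size` (Llama-7B-style dimensions are used for illustration; real models also have non-square MLP projections, so treat this as a lower bound):

```python
def lora_param_count(hidden_size, num_layers, rank, targets_per_layer=4):
    """Rough LoRA adapter size when targeting q/k/v/o projections.

    Each targeted matrix gets an A (rank x hidden) and a B (hidden x rank),
    so 2 * rank * hidden_size trainable parameters per matrix.
    """
    per_matrix = 2 * rank * hidden_size
    return per_matrix * targets_per_layer * num_layers

# 7B-class model: 32 layers, hidden size 4096, r=16, 4 target modules per layer
n = lora_param_count(hidden_size=4096, num_layers=32, rank=16)
print(f"{n:,} trainable parameters")   # ≈16.8M, roughly 0.24% of 7B
```

Doubling the rank doubles this count, which is why `r` is the first knob to turn when an adapter underfits.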
Evaluation and Benchmarking
Evaluating a fine-tuned model requires multiple complementary approaches. No single metric tells the full story.
Automated Metrics
- Perplexity: Measures how well the model predicts the test data. Lower is better. Useful for comparing models trained on the same data, but doesn't directly measure task quality.
- Task-specific metrics: Accuracy and F1 for classification; BLEU and ROUGE for text generation; exact match for structured outputs; code execution pass rate for code generation.
- LLM-as-judge: Use a strong model (like Claude Opus or GPT-5) to evaluate the fine-tuned model's outputs on a rubric. This correlates well with human judgment and scales better than manual review.
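Perplexity, the first metric above, is simply the exponential of the mean per-token negative log-likelihood on held-out text. A minimal sketch (the per-token NLL values would come from your evaluation loop):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns each token probability 1/4 has per-token NLL ln(4),
# so its perplexity is 4: "as uncertain as choosing uniformly among 4 tokens"
ppl = perplexity([math.log(4.0)] * 10)
```

This intuition is why perplexity only compares models on the same data and tokenizer: change either, and the numbers are no longer on the same scale.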
Human Evaluation
Automated metrics have blind spots. Human evaluation catches issues that metrics miss: awkward phrasing, factual errors in domain-specific content, inconsistent style, and subtle quality differences. At minimum, have domain experts review 50–100 outputs from your fine-tuned model compared to the base model.
A/B Comparison
Always compare your fine-tuned model against two baselines:
- Base model with no customization: Shows the raw improvement from fine-tuning
- Base model with prompt engineering: The critical comparison — if a good prompt achieves similar quality, fine-tuning may not be worth the maintenance cost
Cost and Compute Considerations
Fine-tuning costs come in three categories: compute, data preparation, and ongoing maintenance.
| Approach | Hardware | Est. Cloud Cost (7B model) | Training Time |
|---|---|---|---|
| QLoRA | 1× RTX 4090 or A100 40GB | $5–$20 per run | 1–4 hours (1K examples) |
| LoRA | 1× A100 80GB | $15–$60 per run | 2–8 hours (1K examples) |
| Full fine-tuning | 4–8× A100 80GB | $200–$1,000+ per run | 8–48 hours (1K examples) |
Cloud GPU providers for fine-tuning include Lambda Labs, RunPod, Vast.ai, and major cloud providers (AWS SageMaker, Google Vertex AI, Azure ML). For managed fine-tuning without infrastructure management, OpenAI and Together AI offer API-based fine-tuning services where you upload your data and receive a fine-tuned model endpoint. Anthropic offers fine-tuning for Claude Haiku through Amazon Bedrock.
When NOT to Fine-Tune
Fine-tuning is the wrong choice more often than most people realize. Avoid it when:
- Your data changes frequently: If your knowledge base updates weekly or daily, RAG is better — fine-tuned knowledge is frozen at training time.
- You need citations: Fine-tuned models generate from learned patterns, not retrievable sources. You can't trace an answer back to a specific document.
- You haven't tried prompt engineering seriously: Many teams jump to fine-tuning before investing in prompt optimization. Spend at least a week iterating on prompts with few-shot examples before considering fine-tuning.
- You have fewer than 50 high-quality examples: With very little data, fine-tuning is likely to overfit. Use few-shot prompting instead.
- You're using a closed-source model: Fine-tuning through APIs (OpenAI, Anthropic) gives you less control than fine-tuning open-source models locally. If the API provider changes the base model, your fine-tune may need to be redone.
- General capability is more important than specialization: Fine-tuning narrows a model's focus. If you need broad general-purpose capability, a well-prompted base model is better.
Resources
- **Hugging Face PEFT Documentation** (Hugging Face): Official documentation for Parameter-Efficient Fine-Tuning (PEFT), covering LoRA, QLoRA, and other adapter methods with practical examples and API reference.
- **Practical Tips for Fine-Tuning LLMs** (Sebastian Raschka): In-depth articles on fine-tuning best practices, LoRA configuration, evaluation strategies, and practical lessons from extensive fine-tuning experiments.
- **Neural Networks — Zero to Hero** (Andrej Karpathy): Foundational video series covering how neural networks work from the ground up, including backpropagation, attention, and training dynamics essential for understanding fine-tuning.
Key Takeaways
1. Always try prompt engineering and RAG before fine-tuning — they solve most problems with less cost and complexity.
2. Fine-tuning is best for learning specific output styles, formats, or domain behaviors that a base model cannot replicate through prompting alone.
3. LoRA is the default fine-tuning approach: it trains less than 1% of parameters while achieving near-full-fine-tuning quality, and QLoRA makes it accessible on consumer GPUs.
4. Training data quality matters far more than quantity — 200 expert-reviewed examples outperform 10,000 noisy ones.
5. The fine-tuning workflow is: prepare data, select base model, configure hyperparameters, train, evaluate, deploy — with evaluation against a well-prompted baseline at every stage.
6. The Hugging Face ecosystem (Transformers, PEFT, TRL, Datasets) is the standard toolkit for fine-tuning open-source models.
7. Always compare your fine-tuned model against a prompt-engineered baseline — if the improvement is less than 5–10%, the operational overhead of fine-tuning may not be worth it.