Fine-Tuning & Model Customization
When to fine-tune vs RAG. LoRA, QLoRA, full fine-tuning. Training data and evaluation.
Fine-tuning takes a pre-trained language model and further trains it on your specific data to adapt its behavior, style, or knowledge for a particular task. It's one of the most powerful customization techniques available — but it's also one of the most overused. Most tasks that people reach for fine-tuning to solve can actually be handled with prompt engineering or RAG. This module covers when fine-tuning genuinely makes sense, the different approaches available, and how to execute it effectively.
When to Fine-Tune vs. Prompt Engineer vs. RAG
The single most important skill in model customization is knowing which approach to use. Here's a decision matrix with clear criteria:
| Criteria | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Need access to private data | No (unless data fits in context) | Best choice | Possible but data becomes static |
| Need specific output style/format | Good for simple formats | Not directly helpful | Best choice |
| Data changes frequently | Include latest in prompt | Best choice | Poor (requires retraining) |
| Need domain-specific behavior | Limited by examples in prompt | Helps with knowledge, not behavior | Best choice |
| High volume, need low latency | Long prompts increase cost/latency | Adds retrieval latency | Best choice (smaller model) |
| Budget is limited | Best choice (free) | Moderate cost | High upfront cost |
| Need citations/traceability | Not inherently traceable | Best choice | Not inherently traceable |
Types of Fine-Tuning
There are three main approaches to fine-tuning, each with different trade-offs between cost, quality, and hardware requirements.
Full Fine-Tuning
Full fine-tuning updates every parameter in the model. For a 7-billion parameter model, that means adjusting all 7 billion weights based on your training data.
- Pros: Maximum control over model behavior; can fundamentally alter how the model operates
- Cons: Requires enormous GPU memory (the entire model must fit in VRAM with gradients and optimizer states); expensive; risk of catastrophic forgetting (the model loses general capabilities)
- When to use: Large organizations with dedicated ML teams and significant compute budgets; when you need to deeply specialize a model for a narrow domain
- Hardware: Typically requires multiple A100/H100 GPUs (80GB VRAM each) even for 7B parameter models
LoRA (Low-Rank Adaptation)
LoRA is the most popular fine-tuning approach in 2026. Instead of updating all parameters, LoRA freezes the original model weights and injects small trainable matrices (called adapters) into each layer. Only these adapters are trained — typically 0.1% to 1% of the total parameters.
How LoRA works conceptually:
```
Original model weight matrix:
  W (4096 × 4096 = 16.7M parameters)
  └── Frozen during training, not updated

LoRA adapter matrices:
  A (4096 × 16) and B (16 × 4096)
  └── Only these are trained: 16 × 4096 × 2 = 131K parameters (0.8% of original)

During inference:
  output = W·x + A·B·x   (original weights + low-rank adapter contribution)
```
The rank (r=16 in this example) controls the adapter capacity:
- Higher rank = more expressive but more parameters to train
- Lower rank = more efficient but less capacity
- r=8 to r=64 covers most use cases
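The arithmetic above can be sketched in a few lines of NumPy. This is a toy illustration using the same dimensions as the example; the `alpha / r` scaling is how common LoRA implementations weight the adapter's contribution, and zero-initializing one of the two matrices is the standard trick that makes the adapter a no-op when training starts:

```python
import numpy as np

d, r, alpha = 4096, 16, 32            # hidden size, LoRA rank, scaling factor
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)         # frozen pretrained weight
A = np.zeros((d, r), dtype=np.float32)                     # trainable; zero-init so the
B = rng.standard_normal((r, d)).astype(np.float32) * 0.01  # adapter starts as a no-op

x = rng.standard_normal(d).astype(np.float32)

# LoRA forward pass: frozen path plus scaled low-rank correction
y = W @ x + (alpha / r) * (A @ (B @ x))

trainable = A.size + B.size           # 131,072 adapter parameters
fraction = trainable / W.size         # ≈ 0.0078 of the 16.7M-parameter matrix
```

In real training, gradients flow only into `A` and `B`; `W` is never touched, which is why the base model stays intact and adapters can be swapped per task.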
- Pros: Dramatically reduced memory requirements (often 10x less than full fine-tuning); faster training; the base model stays intact so you can swap adapters for different tasks; minimal risk of catastrophic forgetting
- Cons: Slightly less expressive than full fine-tuning for highly specialized tasks
- When to use: The default choice for most fine-tuning projects; works well for style adaptation, domain specialization, and format control
- Hardware: A single A100 (40GB) can fine-tune a 7B model with LoRA; an RTX 4090 (24GB) works for smaller models
QLoRA (Quantized LoRA)
QLoRA combines LoRA with quantization — it loads the base model in 4-bit precision (instead of 16-bit), reducing memory requirements by another 4x. The LoRA adapters themselves are still trained in higher precision for quality.
- Pros: Fine-tune large models on consumer hardware; a 7B model fits on a GPU with just 6GB VRAM; quality is remarkably close to full-precision LoRA
- Cons: Slightly slower training due to quantization overhead; small quality reduction compared to full-precision LoRA
- When to use: When you have limited hardware — a single RTX 3090 or RTX 4090 can fine-tune models up to 13B parameters; also great for experimentation and prototyping before scaling up
- Hardware: RTX 3090/4090 (24GB) for 7B–13B models; even RTX 3060 (12GB) can handle 7B models
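To see why 4-bit loading shrinks memory so much, here is a toy blockwise absmax quantizer in NumPy. This is a sketch of the general idea only: real QLoRA uses the non-uniform NF4 data type plus double quantization, not the uniform integer grid shown here.

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    """Toy absmax 4-bit quantization: one float scale per block of weights."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    q = np.round(blocks / scales * 7).astype(np.int8)   # integers in [-7, 7]
    return q, scales

def dequantize_4bit(q, scales):
    """Recover approximate float weights from 4-bit codes and per-block scales."""
    return (q.astype(np.float32) / 7) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 64)).astype(np.float32)

q, scales = quantize_4bit(w)
w_hat = dequantize_4bit(q, scales).reshape(w.shape)

max_err = np.abs(w - w_hat).max()   # small reconstruction error per weight
```

Storing 4-bit codes instead of 16-bit floats is the ~4x memory saving; the small per-weight error is why QLoRA quality lands slightly below full-precision LoRA.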
| Approach | Parameters Trained | VRAM for 7B Model | Quality | Cost |
|---|---|---|---|---|
| Full Fine-Tuning | All (~7B) | ~160 GB (multi-GPU) | Highest | $$$ |
| LoRA | ~0.1–1% (~7–70M) | ~16–24 GB | Very high | $$ |
| QLoRA | ~0.1–1% (~7–70M) | ~6–10 GB | High | $ |
Training Data Preparation
Data quality is the single biggest factor in fine-tuning success. Poorly formatted, noisy, or biased training data will produce a model that reflects those problems.
Data Format
Fine-tuning data is typically formatted as instruction-response pairs in JSONL (JSON Lines) format:
Training data format (JSONL):
```
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient presents with acute bronchitis."}, {"role": "assistant", "content": "ICD-10: J20.9 - Acute bronchitis, unspecified"}]}
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Diagnosis: Type 2 diabetes with peripheral neuropathy."}, {"role": "assistant", "content": "ICD-10: E11.42 - Type 2 diabetes mellitus with diabetic polyneuropathy"}]}
```
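Because one malformed line can abort a training run, it is worth validating the JSONL file before uploading it. A minimal validator sketch (the exact schema requirements depend on your training framework; this one only checks JSON validity and the presence of user and assistant turns):

```python
import json

def validate_jsonl(path):
    """Report lines that are not valid JSON or lack a user and an assistant turn."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {i}: invalid JSON ({exc})")
                continue
            roles = {m.get("role") for m in record.get("messages", [])}
            if not {"user", "assistant"} <= roles:
                errors.append(f"line {i}: needs both a user and an assistant message")
    return errors
```

Run it as a pre-flight check and refuse to start training if the error list is non-empty.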
Quality Requirements
- Accuracy: Every example must be correct. A single wrong example can teach the model bad patterns. Have domain experts review your training data.
- Consistency: Use the same format, style, and level of detail across all examples. If some responses are terse and others verbose, the model will learn inconsistent behavior.
- Diversity: Cover the full range of inputs your model will encounter. Include edge cases, different phrasings, and varying complexity levels.
- No contamination: Ensure your evaluation data is not in your training set. This is called data leakage and will give you misleadingly good evaluation results.
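The contamination check above can be partly automated. A minimal sketch that flags exact (whitespace- and case-normalized) duplicates between training and evaluation sets; note it will not catch near-duplicates, which need fuzzy matching such as MinHash:

```python
import hashlib

def _fingerprint(text):
    """Hash a normalized version of the text so comparison ignores case and spacing."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def find_leakage(train_examples, eval_examples):
    """Return eval examples whose normalized text also appears in the training set."""
    train_hashes = {_fingerprint(t) for t in train_examples}
    return [e for e in eval_examples if _fingerprint(e) in train_hashes]
```

Any hits should be removed from the evaluation set (or the training set) before you trust a single metric.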
Dataset Size Guidelines
| Task Type | Minimum Examples | Recommended | Notes |
|---|---|---|---|
| Style/tone transfer | 50–100 | 200–500 | Model already knows the task; you're just adjusting style |
| Classification | 100–200 per class | 500+ per class | Balance across classes is critical |
| Domain specialization | 500–1,000 | 2,000–10,000 | More data needed for niche domains |
| Complex generation | 1,000–5,000 | 10,000+ | Code generation, long-form content, structured output |
The Fine-Tuning Workflow
Here's the complete workflow from data preparation through deployment:
Step-by-step fine-tuning workflow:
```
Step 1: PREPARE DATA
├── Collect and clean examples
├── Format as instruction-response pairs (JSONL)
├── Split into train (80%), validation (10%), test (10%)
└── Have domain experts review a random sample

Step 2: SELECT BASE MODEL
├── Llama 3.3 (70B) — strong general-purpose open model
├── Mistral Large — excellent multilingual and reasoning
├── Qwen 2.5 (72B) — strong on code and math tasks
├── Gemma 2 (27B) — efficient, good for resource-constrained setups
└── Choose based on: task requirements, language, license, size constraints

Step 3: CONFIGURE HYPERPARAMETERS
├── Learning rate: 1e-4 to 2e-4 (for LoRA); 1e-5 to 5e-5 (full fine-tuning)
├── Batch size: 4–16 (limited by VRAM)
├── Epochs: 1–5 (monitor for overfitting; often 2–3 is optimal)
├── LoRA rank (r): 8–64 (start with 16)
├── LoRA alpha: typically 2× the rank (e.g., r=16, alpha=32)
└── LoRA target modules: q_proj, v_proj (minimum); all linear layers (maximum)

Step 4: TRAIN
├── Monitor training loss (should decrease steadily)
├── Monitor validation loss (should decrease, then flatten)
├── Stop if validation loss starts increasing (overfitting)
└── Save checkpoints at regular intervals

Step 5: EVALUATE
├── Perplexity on held-out test set
├── Task-specific metrics (accuracy, F1, BLEU, ROUGE, etc.)
├── Human evaluation on a representative sample
└── Compare against base model and prompt-engineered baseline

Step 6: DEPLOY
├── Merge LoRA weights into base model (for production inference)
├── Quantize for inference (GPTQ, AWQ, or GGUF for reduced memory)
├── Serve with vLLM, TGI, or Ollama
└── Monitor performance in production and collect feedback
```
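The 80/10/10 split in Step 1 can be sketched in plain Python. The fixed seed matters: it makes the split reproducible across runs so evaluation numbers stay comparable (the fractions and seed value here are illustrative defaults):

```python
import random

def split_dataset(examples, seed=42, val_frac=0.1, test_frac=0.1):
    """Shuffle once with a fixed seed, then carve off validation and test sets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    test = examples[:n_test]
    val = examples[n_test:n_test + n_val]
    train = examples[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(1000))
```

Shuffle before splitting: if the file is ordered (by date, by class, by source), a straight slice would give you a skewed test set.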
The Hugging Face Ecosystem
Hugging Face is the central hub for fine-tuning open-source models. Its ecosystem provides everything you need:
| Library | Purpose | Key Features |
|---|---|---|
| Transformers | Core library for loading and running models | Supports 200,000+ pre-trained models; standardized API for all architectures |
| PEFT | Parameter-efficient fine-tuning (LoRA, QLoRA, etc.) | Easy adapter management; supports LoRA, AdaLoRA, Prefix Tuning, IA3 |
| TRL | Training with reinforcement learning (SFT, RLHF, DPO) | Supervised fine-tuning trainer; RLHF and DPO for alignment |
| Datasets | Loading and processing training data | 200,000+ datasets; streaming for large datasets; built-in preprocessing |
| Accelerate | Multi-GPU and mixed-precision training | Distributed training with minimal code changes; supports DeepSpeed and FSDP |
Fine-tuning with PEFT + TRL (conceptual example):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load base model in 4-bit (QLoRA)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

# Configure LoRA adapters
lora_config = LoraConfig(
    r=16,            # rank — controls adapter capacity
    lora_alpha=32,   # scaling factor (typically 2× rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Load your training data
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Configure training (pass an eval_dataset to the trainer if you also
# want per-epoch validation-loss evaluation)
training_config = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()

# Save the adapter weights (small — typically 10-100 MB)
model.save_pretrained("./my-adapter")
```
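Before launching a run like the one above, it helps to estimate how many parameters the LoRA config will actually train. A back-of-the-envelope sketch, assuming square attention projections of size `hidden_size × hidden_size` (Llama-7B-style dimensions are used for illustration; real models also have non-square MLP projections, so treat this as a lower bound):

```python
def lora_param_count(hidden_size, num_layers, rank, targets_per_layer=4):
    """Rough LoRA adapter size when targeting q/k/v/o projections.

    Each targeted matrix gets an A (rank x hidden) and a B (hidden x rank),
    so 2 * rank * hidden_size trainable parameters per matrix.
    """
    per_matrix = 2 * rank * hidden_size
    return per_matrix * targets_per_layer * num_layers

# 7B-class model: 32 layers, hidden size 4096, r=16, 4 target modules per layer
n = lora_param_count(hidden_size=4096, num_layers=32, rank=16)
print(f"{n:,} trainable parameters")   # ≈16.8M, roughly 0.24% of 7B
```

Doubling the rank doubles this count, which is why `r` is the first knob to turn when an adapter underfits.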
Evaluation and Benchmarking
Evaluating a fine-tuned model requires multiple complementary approaches. No single metric tells the full story.
Automated Metrics
- Perplexity: Measures how well the model predicts the test data. Lower is better. Useful for comparing models trained on the same data, but doesn't directly measure task quality.
- Task-specific metrics: Accuracy and F1 for classification; BLEU and ROUGE for text generation; exact match for structured outputs; code execution pass rate for code generation.
- LLM-as-judge: Use a strong model (like Claude Opus or GPT-5) to evaluate the fine-tuned model's outputs on a rubric. This correlates well with human judgment and scales better than manual review.
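Perplexity, the first metric above, is simply the exponential of the mean per-token negative log-likelihood on held-out text. A minimal sketch (the per-token NLL values would come from your evaluation loop):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns each token probability 1/4 has per-token NLL ln(4),
# so its perplexity is 4: "as uncertain as choosing uniformly among 4 tokens"
ppl = perplexity([math.log(4.0)] * 10)
```

This intuition is why perplexity only compares models on the same data and tokenizer: change either, and the numbers are no longer on the same scale.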
Human Evaluation
Automated metrics have blind spots. Human evaluation catches issues that metrics miss: awkward phrasing, factual errors in domain-specific content, inconsistent style, and subtle quality differences. At minimum, have domain experts review 50–100 outputs from your fine-tuned model compared to the base model.
A/B Comparison
Always compare your fine-tuned model against two baselines:
- Base model with no customization: Shows the raw improvement from fine-tuning
- Base model with prompt engineering: The critical comparison — if a good prompt achieves similar quality, fine-tuning may not be worth the maintenance cost
Cost and Compute Considerations
Fine-tuning costs come in three categories: compute, data preparation, and ongoing maintenance.
| Approach | Hardware | Est. Cloud Cost (7B model) | Training Time |
|---|---|---|---|
| QLoRA | 1× RTX 4090 or A100 40GB | $5–$20 per run | 1–4 hours (1K examples) |
| LoRA | 1× A100 80GB | $15–$60 per run | 2–8 hours (1K examples) |
| Full fine-tuning | 4–8× A100 80GB | $200–$1,000+ per run | 8–48 hours (1K examples) |
Cloud GPU providers for fine-tuning include Lambda Labs, RunPod, Vast.ai, and major cloud providers (AWS SageMaker, Google Vertex AI, Azure ML). For managed fine-tuning without infrastructure management, OpenAI and Together AI offer API-based fine-tuning services where you upload your data and receive a fine-tuned model endpoint. Anthropic offers fine-tuning for Claude Haiku through Amazon Bedrock.
When NOT to Fine-Tune
Fine-tuning is the wrong choice more often than most people realize. Avoid it when:
- Your data changes frequently: If your knowledge base updates weekly or daily, RAG is better — fine-tuned knowledge is frozen at training time.
- You need citations: Fine-tuned models generate from learned patterns, not retrievable sources. You can't trace an answer back to a specific document.
- You haven't tried prompt engineering seriously: Many teams jump to fine-tuning before investing in prompt optimization. Spend at least a week iterating on prompts with few-shot examples before considering fine-tuning.
- You have fewer than 50 high-quality examples: With very little data, fine-tuning is likely to overfit. Use few-shot prompting instead.
- You're using a closed-source model: Fine-tuning through APIs (OpenAI, Anthropic) gives you less control than fine-tuning open-source models locally. If the API provider changes the base model, your fine-tune may need to be redone.
- General capability is more important than specialization: Fine-tuning narrows a model's focus. If you need broad general-purpose capability, a well-prompted base model is better.
Resources
- **Hugging Face PEFT Documentation** (Hugging Face): Official documentation for Parameter-Efficient Fine-Tuning (PEFT), covering LoRA, QLoRA, and other adapter methods with practical examples and API reference.
- **Practical Tips for Fine-Tuning LLMs** (Sebastian Raschka): In-depth articles on fine-tuning best practices, LoRA configuration, evaluation strategies, and practical lessons from extensive fine-tuning experiments.
- **Neural Networks — Zero to Hero** (Andrej Karpathy): Foundational video series covering how neural networks work from the ground up, including backpropagation, attention, and training dynamics essential for understanding fine-tuning.
Key Takeaways
1. Always try prompt engineering and RAG before fine-tuning — they solve most problems with less cost and complexity.
2. Fine-tuning is best for learning specific output styles, formats, or domain behaviors that a base model cannot replicate through prompting alone.
3. LoRA is the default fine-tuning approach: it trains less than 1% of parameters while achieving near-full-fine-tuning quality, and QLoRA makes it accessible on consumer GPUs.
4. Training data quality matters far more than quantity — 200 expert-reviewed examples outperform 10,000 noisy ones.
5. The fine-tuning workflow is: prepare data, select base model, configure hyperparameters, train, evaluate, deploy — with evaluation against a well-prompted baseline at every stage.
6. The Hugging Face ecosystem (Transformers, PEFT, TRL, Datasets) is the standard toolkit for fine-tuning open-source models.
7. Always compare your fine-tuned model against a prompt-engineered baseline — if the improvement is less than 5–10%, the operational overhead of fine-tuning may not be worth it.