LLM Engineer Interview Questions: Fine-Tuning, LoRA, QLoRA, PEFT, and Instruction Tuning

This section focuses on fine-tuning methods and inference optimization techniques that are pivotal for deploying LLMs in production. Key concepts include LoRA, QLoRA, and speculative decoding, which help engineers fine-tune and serve models efficiently.

What is fine-tuning and when should you do it?

Fine-tuning is the process of further training a pretrained model on a smaller, task-specific dataset to improve its performance on a particular task or domain. You should fine-tune when prompt engineering and RAG cannot achieve the desired output style, when you need consistent structured output, when you want to teach domain-specific reasoning, or when you want a smaller model to match a larger one's behavior.

What is the difference between full fine-tuning and parameter-efficient fine-tuning?

Full fine-tuning updates every weight in the model, which requires huge memory and storage and risks catastrophic forgetting. Parameter-efficient fine-tuning, or PEFT, freezes the original weights and trains only a small number of new parameters, typically less than one percent of the full model. PEFT methods like LoRA make fine-tuning practical on consumer GPUs.
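To make the "less than one percent" claim concrete, here is a back-of-the-envelope sketch of the trainable fraction when LoRA adapters are attached to the attention projections. The model dimensions below (hidden size 4096, 32 layers, 4 attention projection matrices per layer, rank 8) are illustrative assumptions for a 7B-class model, not exact figures for any specific checkpoint.

```python
def lora_param_fraction(d_model: int, n_layers: int, n_matrices: int,
                        rank: int, total_params: int) -> float:
    """Fraction of parameters that are trainable when each targeted
    weight matrix gets a pair of low-rank adapters (d x r and r x d)
    and everything else is frozen."""
    per_matrix = 2 * d_model * rank            # A (r x d) plus B (d x r)
    trainable = per_matrix * n_matrices * n_layers
    return trainable / total_params

# Illustrative 7B-class model: d_model=4096, 32 layers,
# adapters on the 4 attention projections, rank 8.
frac = lora_param_fraction(4096, 32, 4, 8, 7_000_000_000)
print(f"trainable fraction: {frac:.4%}")   # on the order of 0.1%
```

With these assumptions the adapters amount to roughly eight million parameters, about 0.1% of the model, which is why PEFT fits on consumer hardware.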

What is LoRA?

LoRA stands for Low-Rank Adaptation. It freezes the pretrained model weights and inserts small trainable matrices into each transformer layer. These matrices have low rank, meaning few parameters, and their product is added to the original weights at inference. LoRA achieves results similar to full fine-tuning while training only a tiny fraction of the parameters.
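The mechanics can be sketched in a few lines of NumPy: the frozen weight W is left untouched, and the output of the low-rank path B·A·x (scaled by alpha/r) is added to it. The dimensions and the alpha value here are arbitrary toy choices; the zero-initialization of B, so the adapter starts as a no-op, follows the standard LoRA recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                          # hidden size, low rank (r << d)

W = rng.normal(size=(d, d))           # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01    # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init
alpha = 16                            # LoRA scaling hyperparameter

def lora_forward(x):
    # Frozen path plus low-rank update; B @ A has rank at most r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# Because B starts at zero, the adapter changes nothing before training.
assert np.allclose(lora_forward(x), W @ x)
```

After training, B·A can be merged into W once, so inference pays no extra latency.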

How does LoRA reduce memory requirements?

LoRA reduces memory by freezing the base model so its gradients and optimizer states are not stored, then training only small adapter matrices. For a 7-billion-parameter model, full fine-tuning requires roughly 80 gigabytes of GPU memory, while LoRA can fit on a 16-gigabyte consumer GPU. The savings come mainly from not storing optimizer states for the frozen weights.
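A rough accounting sketch shows where the savings come from. The byte counts assume 16-bit weights and gradients with fp32 Adam state (momentum, variance, and a master copy, about 12 bytes per trainable parameter); real totals also depend on activations, batch size, and sequence length, so treat these as order-of-magnitude estimates only.

```python
def full_ft_memory_gb(params: float) -> float:
    """Rough memory for full fine-tuning with Adam in 16-bit:
    weights (2 bytes) + gradients (2) + fp32 optimizer state (~12)."""
    return params * (2 + 2 + 12) / 1e9

def lora_memory_gb(params: float, adapter_params: float) -> float:
    """Frozen 16-bit base weights; gradients and optimizer state
    exist only for the small adapter matrices."""
    return (params * 2 + adapter_params * (2 + 2 + 12)) / 1e9

full = full_ft_memory_gb(7e9)              # roughly 100+ GB
lora = lora_memory_gb(7e9, 8.4e6)          # roughly 14 GB
```

The gap is almost entirely the optimizer state: 16 bytes per parameter for seven billion parameters versus 16 bytes for a few million.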

What is QLoRA?

QLoRA combines quantization with LoRA. The base model is loaded in 4-bit precision instead of 16-bit, dramatically reducing memory, while LoRA adapters are trained in higher precision. QLoRA enables fine-tuning models with tens of billions of parameters on a single consumer GPU. It introduced techniques like NF4 quantization and double quantization to maintain quality.
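The headline saving is easy to quantify for the weights alone. This sketch compares 16-bit and 4-bit storage for a 70B-parameter model; it deliberately ignores quantization constants, adapters, and activations, which add a small overhead on top.

```python
def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Memory for the base-model weights alone at a given precision."""
    return params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(70e9, 16)   # 16-bit base model: 140 GB
nf4_gb = weight_memory_gb(70e9, 4)     # 4-bit NF4 base model: 35 GB
```

A 4x reduction in weight memory is what moves a 70B model from a multi-GPU server into range of a single high-memory GPU; double quantization then shaves off a further fraction of a bit per parameter by quantizing the quantization constants themselves.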

What is the difference between LoRA and prefix tuning?

LoRA adds low-rank update matrices to the attention weights of every layer. Prefix tuning instead prepends learnable virtual tokens to the input of each layer, leaving all model weights untouched. Both are parameter-efficient, but LoRA is more widely used because it is simpler to implement, stable to train, and works across more architectures.
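The structural difference is easy to see in code. In prefix tuning the only trainable tensor is a small block of virtual-token embeddings concatenated in front of the layer input; nothing is added to any weight matrix. The shapes below are toy values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq_len, n_prefix = 8, 5, 3

# Frozen hidden states entering one transformer layer.
hidden = rng.normal(size=(seq_len, d))

# Prefix tuning: the ONLY trainable parameters are these virtual-token
# embeddings, prepended to the layer input. No weight matrix changes.
prefix = rng.normal(size=(n_prefix, d)) * 0.01
layer_input = np.concatenate([prefix, hidden], axis=0)
```

Note one practical cost: the prefix consumes sequence positions at every layer, whereas a merged LoRA adapter adds no tokens and no inference overhead.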

What is instruction tuning?

Instruction tuning is the process of fine-tuning a pretrained language model on examples of instructions paired with desired outputs. It teaches the model to follow user instructions rather than just continue text. Instruction tuning, followed by RLHF, is what turned raw GPT-3 into the helpful chatbot ChatGPT, and it is the bridge between pretraining and conversational behavior.
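A single instruction-tuning example typically looks like the record below, serialized one-per-line as JSON Lines. The exact field names (instruction/input/output) follow the common Alpaca-style convention; datasets vary, so treat the schema as illustrative.

```python
import json

# One supervised instruction-tuning example: an instruction
# (plus optional input context) paired with the desired response.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "LoRA freezes pretrained weights and trains small adapters...",
    "output": "LoRA fine-tunes a model by training only small low-rank adapters.",
}
line = json.dumps(example)   # one line of a .jsonl training file
```

Training on many such pairs teaches the model that an instruction should be answered, not merely continued.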

What is RLHF?

RLHF stands for Reinforcement Learning from Human Feedback. It is a multi-stage process: first train a reward model on human preference rankings of model outputs, then use reinforcement learning, typically PPO, to fine-tune the language model to produce outputs that maximize the reward. RLHF is what makes models like ChatGPT and Claude follow instructions and refuse harmful requests.
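The first stage, training the reward model on preference rankings, uses a Bradley-Terry-style loss: the reward of the human-preferred output should exceed the reward of the rejected one. A minimal sketch of that per-pair loss:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair: minimized by pushing
    the chosen output's reward above the rejected output's reward."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# At a zero margin the loss is log(2); it falls as the margin grows.
```

Once the reward model is trained, the second stage runs PPO against it, with a KL penalty keeping the policy close to the original model.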

What is DPO and how does it differ from RLHF?

DPO stands for Direct Preference Optimization. It achieves the same alignment goal as RLHF without the complexity of training a separate reward model and running reinforcement learning. DPO directly optimizes the language model on preference pairs using a simple classification-style loss. It is faster, more stable, and now widely used in place of full RLHF.
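The "classification-style loss" can be written directly. For one preference pair, DPO compares how much the policy has moved away from a frozen reference model on the chosen versus the rejected response, and applies a logistic loss to that margin. This is a per-pair sketch of the loss from the DPO paper; beta is the usual temperature hyperparameter.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: logistic loss on the difference
    of policy-vs-reference log-ratios. No reward model, no RL loop."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))
```

All quantities are ordinary sequence log-probabilities from two forward passes, which is why DPO trains with a standard supervised loop instead of PPO.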

What is catastrophic forgetting?

Catastrophic forgetting is when a model loses previously learned capabilities after being fine-tuned on new data. For example, fine-tuning a general model heavily on legal documents might make it forget how to write code. It can be mitigated by mixing some general data into the fine-tuning set, using lower learning rates, or using PEFT methods that preserve original weights.
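The data-mixing mitigation is simple to implement: keep a fixed fraction of general-purpose examples in the fine-tuning set. The helper below is a hypothetical sketch (the function name and the 20% default are illustrative choices, not a standard recipe).

```python
import random

def mix_datasets(task_data: list, general_data: list,
                 general_fraction: float = 0.2, seed: int = 0) -> list:
    """Blend general examples into a task-specific set so the final mix
    contains roughly `general_fraction` general data, reducing forgetting."""
    n_general = int(len(task_data) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)
    mixed = task_data + rng.sample(general_data, min(n_general, len(general_data)))
    rng.shuffle(mixed)
    return mixed

mixed = mix_datasets(list(range(100)), list(range(1000, 1200)))
```

The right fraction is task-dependent; the point is that even a modest share of replayed general data anchors the original capabilities.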

What dataset size do you need for fine-tuning?

Dataset size depends on the goal. For style transfer or output formatting, a few hundred high-quality examples can be enough. For domain adaptation, thousands to tens of thousands. For learning new capabilities, hundreds of thousands or more. Quality matters more than quantity: a small clean dataset usually beats a large noisy one.

How do you prepare data for fine-tuning a chat model?

Format data using the model's chat template, which structures messages as alternating user and assistant turns with role markers. Each example should be a complete conversation, not just a single response. Apply the same template at inference time to match training distribution. Hugging Face tokenizers provide apply_chat_template to handle this consistently across model families.
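To show what a chat template actually does, here is a minimal stand-in for `apply_chat_template`. The `<|role|>` / `<|end|>` markers below are invented for illustration; real templates are model-specific (ChatML, Llama, etc.), which is exactly why you should let the tokenizer apply them rather than hard-coding a format.

```python
def apply_simple_chat_template(messages: list[dict]) -> str:
    """Toy chat template: wrap each turn in role markers so training
    and inference see identical formatting. Illustrative only; use the
    tokenizer's apply_chat_template with real models."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages]
    return "\n".join(parts)

chat = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A parameter-efficient fine-tuning method."},
]
text = apply_simple_chat_template(chat)
```

If training data is formatted one way and inference prompts another, the model sees out-of-distribution input and quality drops, so the same template must be applied in both places.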

What is the difference between fine-tuning and continued pretraining?

Continued pretraining extends the original pretraining objective on new general or domain text, updating all weights to absorb new knowledge. Fine-tuning typically uses task-specific data with supervised objectives to teach behavior. Continued pretraining is for adding knowledge; fine-tuning is for shaping behavior. Both can be combined when adapting a model to a new domain.

How do you evaluate a fine-tuned model?

Evaluate fine-tuned models on a held-out test set with task-specific metrics, plus checks for capability regression on general benchmarks. For chat models, also run human evaluation or LLM-as-judge comparisons against the base model. Always compare fine-tuned versus base versus base-plus-prompt-engineering to verify that fine-tuning actually beats simpler alternatives.