LLM Engineer Interview Questions: Choosing Between OpenAI, Anthropic, Open Source Models, and Self-Hosting

Learn how to evaluate LLMs effectively, addressing concerns like hallucinations and production monitoring. This section also covers the decision-making process for choosing between various models, including open-source options and self-hosting.

12 audio · 4:06

What are the major LLM providers in 2026?

0:21
The major closed-source LLM providers in 2026 are OpenAI with GPT and o-series, Anthropic with Claude, Google with Gemini, and to a lesser extent xAI with Grok. The major open-source families are Meta Llama, Mistral, Qwen from Alibaba, DeepSeek, and Microsoft Phi. Most production systems use a mix depending on task, cost, and quality requirements.

How do you choose between closed-source and open-source LLMs?

0:21
Choose closed-source models for the highest quality, best instruction following, fastest iteration, and zero infrastructure overhead. Choose open-source models for data privacy, predictable cost at high volume, customization through fine-tuning, freedom from vendor lock-in, and offline deployment. Many teams prototype on closed models and switch to open models when scale or cost demands it.

When should you self-host an LLM versus using an API?

0:21
Self-host when you need data sovereignty, when API costs exceed self-hosting at your volume, when you need custom fine-tuning at low latency, when air-gapped deployment is required, or when you need hardware-level control. Use APIs when you want zero ops, frontier quality, fast iteration, or unpredictable load. The break-even point typically arrives at millions of tokens per day.
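The break-even point above can be sketched as a simple calculation. The dollar figures here are hypothetical placeholders, not real provider or hardware prices:

```python
def monthly_api_cost(tokens_per_day, price_per_million):
    """API spend per month at a blended per-million-token price."""
    return tokens_per_day * 30 * price_per_million / 1_000_000

def breakeven_tokens_per_day(gpu_monthly_cost, price_per_million):
    """Daily token volume where API spend matches a fixed self-hosting bill."""
    return gpu_monthly_cost / 30 / price_per_million * 1_000_000

# Hypothetical numbers: $3 per million blended tokens vs. a $2,000/month GPU server.
print(f"{breakeven_tokens_per_day(2000, 3.0):,.0f} tokens/day")
```

With these illustrative inputs the break-even lands around 22 million tokens per day, consistent with the "millions of tokens per day" rule of thumb; plug in your own contract prices and amortized hardware costs before deciding.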

What hardware do you need to self-host LLMs?

0:20
For 7-billion-parameter models, a single consumer GPU with 16 to 24 gigabytes of VRAM is enough for quantized inference. For 70-billion-parameter models, you need either multiple consumer GPUs or one professional GPU like an H100. For 400-billion-parameter and larger models, you need multi-GPU servers with high-bandwidth interconnects like NVLink or InfiniBand.
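A back-of-the-envelope VRAM estimate makes these tiers concrete. The 20 percent overhead factor for KV cache and activations is a rough assumption, not a measured value:

```python
def vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate: weight memory at the given precision,
    plus ~20% headroom (a hypothetical fudge factor) for KV cache
    and activations."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params @ 8-bit ~ 1 GB
    return weight_gb * overhead

for params in (7, 70, 405):
    print(f"{params}B: fp16 ~ {vram_gb(params, 16):.0f} GB, "
          f"4-bit ~ {vram_gb(params, 4):.0f} GB")
```

The estimate matches the tiers above: a 4-bit 7B model fits comfortably in 16 to 24 GB, fp16 70B needs multiple GPUs, and 405B-class models need a multi-GPU server even when quantized.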

What is the difference between vLLM, TGI, and TensorRT-LLM?

0:21
vLLM is an open-source inference server known for PagedAttention and continuous batching, popular for ease of use. TGI, Text Generation Inference from Hugging Face, is another open server with strong production features. TensorRT-LLM from Nvidia is highly optimized for Nvidia hardware with the best raw performance, at the cost of more setup complexity. All three are widely used in 2026.

How do you compare LLM cost across providers?

0:20
Compare LLM cost by per-token pricing for input and output, with output usually two to three times more expensive than input. Account for prompt caching discounts, batch API discounts which are typically 50 percent cheaper, and the actual token usage of your prompts after tokenization. Don't compare based on parameter count or marketing benchmarks; compare on your real workload.
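A per-request cost function captures the factors above. The default cache and batch discounts are illustrative, not any provider's actual terms:

```python
def request_cost(input_tokens, output_tokens, in_price, out_price,
                 cached_fraction=0.0, cache_discount=0.5, batch=False):
    """Dollar cost of one request; prices are per million tokens.
    cached_fraction, cache_discount, and the 50% batch discount are
    illustrative defaults, not real pricing."""
    cached = input_tokens * cached_fraction
    cost = (cached * in_price * (1 - cache_discount)      # discounted cached prefix
            + (input_tokens - cached) * in_price           # uncached input
            + output_tokens * out_price) / 1_000_000       # output, usually priciest
    return cost * 0.5 if batch else cost
```

Running your actual prompts through each provider's tokenizer and this kind of function, rather than comparing parameter counts, is what the answer above recommends.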

What is the difference between batch and real-time inference?

0:21
Real-time inference responds immediately to each request, prioritizing low latency. Batch inference processes many requests together later, prioritizing throughput and cost. OpenAI and Anthropic both offer batch APIs at roughly half the price of real-time, with results delivered within 24 hours. Use batch for bulk processing like data labeling, content generation, or evaluation.
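For bulk workloads, batch APIs typically take a file of requests in JSONL form. A minimal sketch of preparing one, in the general shape OpenAI's Batch API expects (the model name here is just a placeholder):

```python
import json

def build_batch_file(prompts, model="gpt-4o-mini", path="batch_input.jsonl"):
    """Write one JSONL line per request for a batch submission.
    The model name and path are placeholders."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            line = {
                "custom_id": f"req-{i}",   # lets you match results back later
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {"model": model,
                         "messages": [{"role": "user", "content": prompt}]},
            }
            f.write(json.dumps(line) + "\n")
```

The file is then uploaded and submitted as a batch job, and results arrive asynchronously, keyed by `custom_id`.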

How do you handle rate limits and quotas?

0:18
Handle rate limits with exponential backoff retry, request queuing, multiple API keys for parallelism, distributing load across providers, and caching to reduce request volume. Monitor your quota usage and request increases proactively. For high-volume production, negotiate enterprise contracts that provide custom rate limits and dedicated capacity.
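Exponential backoff with jitter, the first item above, can be sketched as a small wrapper. `RateLimitError` here is a stand-in for whatever exception your client library raises:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider SDK's rate-limit exception."""

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `call` on rate limiting, doubling the delay each attempt
    and adding jitter so concurrent clients don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise                              # out of retries: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))
```

In production you would also inspect any `Retry-After` header the provider returns rather than relying on the computed delay alone.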

What is multi-model routing?

0:20
Multi-model routing sends each request to the most appropriate model based on the request type. Simple queries go to small fast models, complex queries to powerful expensive models, and specific tasks to fine-tuned specialists. Routing can be rule-based by request type or learned by training a small classifier. Done well, it cuts costs by 50 percent or more without quality loss.
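The rule-based variant can be sketched in a few lines. The model names, keywords, and length threshold are all illustrative choices, not a recommended configuration:

```python
COMPLEX_KEYWORDS = ("prove", "analyze", "step by step", "plan", "debug")

def route(request: str) -> str:
    """Toy rule-based router: cheap model by default, expensive model
    for requests that look like multi-step reasoning or long context."""
    text = request.lower()
    if any(kw in text for kw in COMPLEX_KEYWORDS):
        return "large-model"        # complex reasoning -> powerful model
    if len(request.split()) > 200:
        return "large-model"        # long input -> powerful model
    return "small-model"            # default: small, fast, cheap
```

The learned variant replaces these rules with a small classifier trained on labeled request/model pairs, which generalizes better than keyword lists.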

How do you ensure data privacy when using LLM APIs?

0:21
For data privacy with LLM APIs, use providers that offer zero data retention agreements, never send PII unless necessary, use enterprise tiers with stronger contractual protections, redact sensitive fields before sending, deploy on-premises for the most sensitive workloads, and audit data flows regularly. Most major providers now offer SOC 2, HIPAA, and similar compliance options.
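Redacting sensitive fields before sending can be sketched with regular expressions. These patterns are illustrative only; production redaction should use a vetted PII-detection library, since regexes miss many real-world formats:

```python
import re

# Illustrative patterns, not a complete PII taxonomy.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def redact(text: str) -> str:
    """Replace each match with a typed placeholder before the text
    leaves your infrastructure."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Keeping a mapping from placeholder to original value lets you restore redacted fields in the model's response if the workflow needs them.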

What is the difference between temperature, top-p, and top-k sampling?

0:24
Temperature scales the logits before softmax, with lower values making the distribution sharper and more deterministic. Top-k sampling restricts the choice to the K most probable tokens. Top-p, or nucleus sampling, restricts the choice to the smallest set of tokens whose cumulative probability exceeds P. Temperature controls randomness; top-p and top-k control diversity. Setting temperature to zero makes generation deterministic.
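How the three knobs interact is easiest to see in a toy sampler over a `{token: logit}` dict, a minimal sketch rather than how any real inference engine orders these operations:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Toy sampler: temperature rescales logits, softmax normalizes,
    then top-k and top-p prune the distribution before drawing."""
    if temperature == 0:                          # greedy argmax: deterministic
        return max(logits, key=logits.get)
    scaled = {t: l / temperature for t, l in logits.items()}
    z = max(scaled.values())                      # subtract max for stability
    probs = {t: math.exp(l - z) for t, l in scaled.items()}
    total = sum(probs.values())
    probs = {t: p / total for t, p in probs.items()}
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    if top_k:
        ranked = ranked[:top_k]                   # keep the k most probable
    kept, cum = [], 0.0
    for t, p in ranked:                           # nucleus: smallest set whose
        kept.append((t, p))                       # cumulative probability >= p
        cum += p
        if cum >= top_p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights)[0]
```

With `temperature=0` or `top_k=1` the most probable token is always returned; raising temperature flattens the distribution, and lowering `top_p` shrinks the candidate set.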

What is the difference between streaming and non-streaming LLM responses?

0:18
Non-streaming returns the entire response at once after generation completes. Streaming sends tokens as they are produced, giving the user immediate feedback. Streaming dramatically improves perceived latency and is essential for chat applications. Implementation uses server-sent events or WebSockets, and all major LLM APIs support streaming.
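The server-sent-events format the major APIs stream is simple to parse. This sketch assumes a `token` field in each JSON payload, which is illustrative; real APIs nest the text deeper in the event object:

```python
import json

def iter_sse_tokens(lines):
    """Yield tokens from SSE lines of the general shape LLM APIs stream:
    'data: {json}' per event, 'data: [DONE]' to terminate.
    The 'token' field name is a simplification of real payloads."""
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip comments, blank keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]

stream = [
    'data: {"token": "Hel"}',
    'data: {"token": "lo"}',
    "data: [DONE]",
]
print("".join(iter_sse_tokens(stream)))
```

In a chat UI, each yielded token is appended to the display immediately, which is where the perceived-latency win comes from.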