LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization

What is GGUF and llama.cpp?

GGUF is a binary model file format used by llama.cpp, a high-performance C/C++ inference engine for LLMs. A GGUF file packages the model's weights (typically quantized, e.g. to 4-bit or 8-bit precision) together with tokenizer data and metadata in a single file designed for fast, memory-mapped loading. llama.cpp is the most popular way to run open-source LLMs locally on Mac, Windows, and Linux: it runs entirely on CPU with no mandatory GPU dependency, and can optionally offload layers to a GPU via Metal, CUDA, or Vulkan.
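A minimal sketch of the typical workflow with llama.cpp's own command-line tools (`llama-quantize` and `llama-cli` are real binaries shipped with the project; the model file names and paths here are placeholders, and the exact flags may vary by version):

```shell
# Quantize a full-precision GGUF model down to 4-bit (Q4_K_M is a common
# quality/size trade-off); input/output file names are hypothetical examples.
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Run inference on the quantized model:
#   -m   path to the GGUF file
#   -p   prompt text
#   -n   number of tokens to generate
#   -ngl number of layers to offload to the GPU (0 = pure CPU)
llama-cli -m model-q4_k_m.gguf -p "Explain KV cache in one sentence." -n 128 -ngl 0
```

Raising `-ngl` moves transformer layers onto the GPU for speed while the rest stays on CPU, which is how llama.cpp runs models larger than available VRAM.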