RAG & Vector DB Interview: Milvus Architecture, Sharding, Indexes, GPU Support


Part of a broader comparison of popular vector databases such as Pinecone, Qdrant, Weaviate, and Milvus, this section focuses on Milvus: its architecture, sharding model, index types, GPU support, and deployment options. Understanding these details is crucial for making informed decisions.



What is Milvus and what problems does it solve?

Milvus is an open-source, cloud-native vector database designed for large-scale similarity search. It is maintained by Zilliz and is a graduated project of the LF AI & Data Foundation. It separates storage from compute, scales horizontally to billions of vectors, and supports multiple index types, including HNSW, IVF, DiskANN, and GPU-accelerated indexes. Milvus targets production workloads in recommendation, image retrieval, semantic search, and RAG, where query volume and data size exceed what single-node databases can handle. It is also the engine underlying the Zilliz Cloud managed service.

What is the architecture of Milvus?

Milvus follows a cloud-native architecture with four layers: access, coordinator, worker, and storage. The access layer handles client requests through a proxy. Coordinators manage cluster metadata, data placement, and scheduling. Workers handle data ingest, indexing, and query execution. Storage includes object storage like S3 for vectors, a log broker like Kafka or Pulsar for streaming writes, and a metadata store like etcd. This separation lets each component scale independently and gives Milvus its signature ability to handle billion-vector deployments.

What index types does Milvus support?

Milvus supports a wide range of index types including FLAT for exact search, IVF_FLAT, IVF_SQ8, IVF_PQ for inverted file indexes with various quantization, HNSW for graph-based search, SCANN for Google's research-backed ANN, DiskANN for disk-based large-scale search, and GPU-accelerated variants like GPU_IVF_FLAT and GPU_CAGRA. This variety lets teams choose the best index for their data size, recall target, and hardware, from high-recall HNSW in RAM to massive DiskANN indexes on SSD to GPU-accelerated batch processing.
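The trade-off described above can be sketched as a small decision helper. This is an illustrative heuristic, not a Milvus API: the `pick_index` function and its thresholds are assumptions for this sketch, but the returned dicts follow the `index_type` / `metric_type` / `params` shape that pymilvus index creation expects.

```python
def pick_index(n_vectors: int, fits_in_ram: bool, has_gpu: bool) -> dict:
    """Illustrative heuristic mapping workload traits to a Milvus index config.

    The thresholds are rough placeholders; real tuning depends on recall
    targets and hardware. Returned dicts mirror the index-params shape
    used by pymilvus.
    """
    if has_gpu:
        # GPU graph index for high-throughput batch retrieval.
        return {"index_type": "GPU_CAGRA", "metric_type": "L2",
                "params": {"intermediate_graph_degree": 64, "graph_degree": 32}}
    if n_vectors < 100_000:
        # Small data: exact search is cheap and gives perfect recall.
        return {"index_type": "FLAT", "metric_type": "L2", "params": {}}
    if not fits_in_ram:
        # Dataset larger than RAM: disk-based graph index on SSD.
        return {"index_type": "DISKANN", "metric_type": "L2", "params": {}}
    # Default in-RAM choice: high-recall graph index.
    return {"index_type": "HNSW", "metric_type": "L2",
            "params": {"M": 16, "efConstruction": 200}}

print(pick_index(50_000, fits_in_ram=True, has_gpu=False)["index_type"])      # FLAT
print(pick_index(10_000_000, fits_in_ram=True, has_gpu=False)["index_type"])  # HNSW
```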

What is the difference between Milvus standalone and distributed deployment?

Milvus standalone runs all components in a single process; it is simple to set up and suitable for development or small production workloads up to a few million vectors. Milvus distributed separates the access layer, coordinators, workers, and storage into independent services, enabling horizontal scaling to billions of vectors and high query throughput. Distributed deployment requires Kubernetes for orchestration plus external dependencies like etcd, object storage, and a message broker. Choose standalone for quick starts and distributed for serious production scale.

How does Milvus handle sharding?

Milvus splits each collection into logical partitions and physical shards. Partitions are user-defined logical groupings useful for filtering subsets like per-tenant or per-date data. Shards are physical divisions that distribute data across worker nodes for parallel processing, with the shard count set at collection creation. Queries can target specific partitions to reduce the search space, while the shard layer handles parallelization transparently. This two-level model lets users organize data logically while leaving physical distribution to Milvus.

What is Milvus Lite and when should you use it?

Milvus Lite is a lightweight embedded version of Milvus that runs in-process with Python applications, requiring no separate server or dependencies. It stores data in a local file and supports most of the standard Milvus API, making it ideal for prototyping, local development, unit testing, and embedded use cases like desktop apps or notebooks. Milvus Lite supports datasets up to a few million vectors depending on available memory. For production workloads, migrate to Milvus standalone or distributed using the same client code.

How does Milvus support GPU-accelerated vector search?

Milvus supports GPU indexes including GPU_IVF_FLAT, GPU_IVF_PQ, GPU_BRUTE_FORCE, and GPU_CAGRA, the last developed by NVIDIA for high-throughput batch retrieval. GPU indexes dramatically accelerate both indexing and search on large datasets, particularly useful for offline batch workloads and high query-per-second applications. GPU support requires NVIDIA hardware, the CUDA runtime, and the appropriate Milvus container image. For interactive low-QPS workloads, CPU HNSW often has better latency per query, so GPU pays off at scale.
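As a sketch of what a GPU index configuration looks like, the dicts below follow the parameter shape Milvus documents for GPU_CAGRA (build params `intermediate_graph_degree` and `graph_degree`; search params such as `itopk_size` and `search_width`). The concrete numbers are illustrative placeholders, not tuned values, and building this index for real requires an NVIDIA-GPU-enabled Milvus deployment.

```python
# Build-time parameters for a GPU_CAGRA index, in the shape passed to
# index creation. Values here are illustrative, not tuned.
cagra_index = {
    "index_type": "GPU_CAGRA",
    "metric_type": "L2",
    "params": {
        "intermediate_graph_degree": 64,  # graph degree used during construction
        "graph_degree": 32,               # degree of the final search graph
    },
}

# Query-time parameters controlling the recall/throughput trade-off.
cagra_search = {
    "params": {
        "itopk_size": 128,   # size of the intermediate result buffer
        "search_width": 4,   # entry points explored per iteration
    },
}
```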

What is Zilliz Cloud and how does it relate to Milvus?

Zilliz Cloud is the managed cloud service built on Milvus, operated by Zilliz, the primary maintainer of the Milvus open-source project. It offers dedicated and serverless deployment options on AWS, GCP, and Azure, with automated scaling, backups, monitoring, and upgrades. Zilliz Cloud adds enterprise features like single sign-on, audit logging, and private networking, plus optimized performance tuning that comes from running Milvus at scale for many customers. It is the production-ready hosted option for teams that prefer managed services.

What are Milvus consistency levels?

Milvus supports four consistency levels: Strong, which guarantees reads see all previous writes; Bounded staleness, the default, which tolerates a small delay between writes becoming visible; Session, where a client sees its own writes but not necessarily other clients' recent writes; and Eventually, which offers highest throughput by allowing any read to lag behind writes. Lower consistency levels reduce query latency because they skip synchronization with the log broker. Choose based on whether fresh data or low latency matters more for each query.

How does Milvus handle metadata filtering?

Milvus supports scalar filtering with boolean expressions over non-vector fields like strings, integers, and floats. Fields must be declared in the collection schema with appropriate types, and can have scalar indexes created on them for faster filtering. Queries combine a vector search with filter expressions using standard comparison and logical operators. Milvus applies filters during or before vector search depending on the index and filter selectivity, similar to other vector databases, and supports arrays, JSON fields, and nested property access.

What is the difference between Milvus collections, partitions, and shards?

A collection is the top-level container for related data, similar to a table, with a defined schema and index configuration. Partitions are user-defined logical groupings inside a collection, used to organize data so queries can target specific subsets like per-date or per-tenant without full-collection scans. Shards are physical divisions of a collection distributed across worker nodes for parallel query processing, with a fixed count set at creation. Partitions are for logical organization and query filtering, shards are for physical scaling.

How does Milvus support multi-tenancy?

Milvus supports multi-tenancy through three patterns at different granularities. Database-level isolation creates separate Milvus databases per tenant, giving the strongest isolation. Collection-level isolation creates one collection per tenant within a shared database, suitable for moderate tenant counts. Partition-key isolation shares a single collection across tenants and uses a partition key field to automatically route data, which works well for many small tenants. Choose based on tenant count, isolation requirements, and operational overhead you can accept.