RAG & Vector DB Interview: Pinecone Pods, Serverless, Namespaces, Metadata Filters


12 audio · 5:57

Nortren

What is Pinecone and what is it used for?

0:31
Pinecone is a managed vector database service for storing and searching high-dimensional embeddings at scale, used primarily for semantic search, recommendation, and Retrieval-Augmented Generation. It handles indexing, sharding, replication, and backups automatically, exposing a simple REST and gRPC API for upsert, query, and delete operations. Pinecone supports metadata filtering, hybrid sparse-dense search, namespaces for logical separation, and serverless or pod-based deployment. It integrates with LangChain, LlamaIndex, and major cloud providers including AWS, GCP, and Azure.
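At its core, the query operation a vector database exposes is nearest-neighbor search over embeddings. A toy in-memory version (pure Python, not the Pinecone SDK; 2-D vectors and document IDs are invented for illustration) looks like:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Tiny "index": id -> embedding (2-D here; real embeddings are 768+ dims).
index = {
    "doc-cat": [0.9, 0.1],
    "doc-dog": [0.8, 0.3],
    "doc-car": [0.1, 0.95],
}

def query(vector, top_k=2):
    # Score every stored vector and return the top_k closest IDs.
    scored = sorted(((cosine(vector, v), id_) for id_, v in index.items()),
                    reverse=True)
    return [id_ for _, id_ in scored[:top_k]]

print(query([1.0, 0.0]))  # -> ['doc-cat', 'doc-dog']
```

Pinecone replaces this brute-force scan with an approximate nearest-neighbor (ANN) index so the same operation works across billions of vectors.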

What is the difference between Pinecone pods and serverless?

0:31
Pinecone pods are provisioned dedicated resources with fixed capacity, billed per hour regardless of usage, giving predictable performance and low cold-start latency. Serverless indexes auto-scale compute and storage based on actual load, billed per read unit, write unit, and storage, with no infrastructure to manage. Serverless separates storage from compute, using object storage for durability, which cuts cost dramatically for low-traffic or spiky workloads. Pods remain available for existing customers with high steady throughput, but serverless is the recommended default for new Pinecone projects.
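The cost trade-off can be made concrete with a break-even calculation. All prices below are hypothetical placeholders, not Pinecone's actual rates; the point is only the shape of the comparison:

```python
# Hypothetical prices for illustration only -- check Pinecone's pricing page.
POD_PER_HOUR = 0.10             # dedicated pod, billed whether or not it serves traffic
SERVERLESS_PER_1K_READS = 0.01  # per 1000 read units (hypothetical rate)

def monthly_pod_cost():
    # A pod bills every hour of the month regardless of traffic.
    return POD_PER_HOUR * 24 * 30

def monthly_serverless_cost(queries_per_month):
    # Serverless bills only for the reads actually performed.
    return queries_per_month / 1000 * SERVERLESS_PER_1K_READS

# Query volume below which serverless is cheaper than a fixed pod:
break_even = monthly_pod_cost() / SERVERLESS_PER_1K_READS * 1000
print(break_even)  # ~7.2 million queries/month at these placeholder rates
```

Below the break-even volume the idle pod dominates cost, which is why serverless wins for low-traffic or spiky workloads.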

What is a Pinecone namespace and when should you use it?

0:28
A namespace is a logical partition within a Pinecone index, letting you isolate vectors by user, tenant, document set, or language without creating separate indexes. Queries are scoped to a single namespace, which improves performance because the search only considers vectors in that namespace. Use namespaces for multi-tenant applications, per-user personal data, or A/B testing different embedding models. Namespaces are free to create, and a single index can hold thousands of them with no pre-declaration required.
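Namespace scoping can be sketched as one index whose vectors are partitioned per tenant, so a query only ever scans its own partition. This is a pure-Python mock of the behavior, not the Pinecone SDK:

```python
from collections import defaultdict

# One "index" partitioned by namespace: namespace -> {id: vector}.
index = defaultdict(dict)

def upsert(namespace, id_, vector):
    index[namespace][id_] = vector

def query_ids(namespace):
    # Scoped to one namespace: other tenants' vectors are never considered.
    return sorted(index[namespace])

upsert("tenant-a", "a1", [0.1, 0.2])
upsert("tenant-b", "b1", [0.3, 0.4])
print(query_ids("tenant-a"))  # -> ['a1'] -- tenant-b is invisible here
```

The same isolation with separate indexes would cost an index per tenant; namespaces give it for free within one index.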

How does metadata filtering work in Pinecone?

0:30
Pinecone stores arbitrary JSON metadata with each vector and supports filtering queries by metadata conditions using operators like equals, in, greater than, and less than. Filters are applied during the vector search itself, so the database never returns items that fail the filter; Pinecone calls this single-stage filtering, and it avoids the recall loss of post-filtering a fixed top-k. Effective filtering requires metadata fields to be indexed, which Pinecone handles automatically in serverless indexes. Highly selective filters can hurt recall if the ANN index cannot find enough matching candidates, so cardinality matters in filter design.
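Pinecone's filters are JSON objects using Mongo-style operators such as `$eq`, `$in`, `$gt`, and `$lt`. A minimal evaluator for that filter shape (a sketch of the semantics only; in Pinecone the filter runs inside the search, not client-side) is:

```python
# Minimal evaluator for Pinecone-style metadata filters ($eq, $in, $gt, $lt).
def matches(metadata, filter_):
    for field, cond in filter_.items():
        value = metadata.get(field)
        for op, target in cond.items():
            if op == "$eq" and value != target:
                return False
            if op == "$in" and value not in target:
                return False
            if op == "$gt" and not (value is not None and value > target):
                return False
            if op == "$lt" and not (value is not None and value < target):
                return False
    return True

doc = {"genre": "drama", "year": 2020}
print(matches(doc, {"genre": {"$eq": "drama"}, "year": {"$gt": 2015}}))  # True
print(matches(doc, {"genre": {"$in": ["comedy", "action"]}}))            # False
```

The same filter dict is what you would pass as the `filter` argument of a Pinecone query.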

What embedding dimensions does Pinecone support?

0:34
Pinecone supports arbitrary embedding dimensions up to 20000, though practical limits depend on the chosen index type and performance targets. Common dimensions match popular embedding models: 1536 for OpenAI text-embedding-3-small and ada-002, 3072 for text-embedding-3-large, 1024 for Cohere embed-v3, and 768 for many open-source sentence-transformers. Higher dimensions increase storage cost and search latency roughly linearly, so choose the smallest dimension that meets your retrieval quality target. The dimension is fixed at index creation and cannot be changed later.
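The linear storage scaling is easy to quantify: raw float32 vector data is 4 bytes per value, so doubling the dimension doubles the footprint (metadata and ANN index overhead excluded):

```python
# Raw vector storage scales linearly with dimension: float32 is 4 bytes/value.
def raw_storage_gb(num_vectors, dimension, bytes_per_value=4):
    return num_vectors * dimension * bytes_per_value / 1024**3

# 1M vectors at common embedding model dimensions:
for dim in (768, 1536, 3072):
    print(dim, round(raw_storage_gb(1_000_000, dim), 2))
# 768  -> ~2.86 GB   (many sentence-transformers)
# 1536 -> ~5.72 GB   (text-embedding-3-small / ada-002)
# 3072 -> ~11.44 GB  (text-embedding-3-large)
```

Halving the dimension roughly halves both storage cost and per-query compute, which is the argument for choosing the smallest dimension that meets your retrieval quality bar.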

What distance metrics does Pinecone support?

0:31
Pinecone supports three distance metrics: cosine similarity, Euclidean distance, and dot product. The metric is chosen at index creation and cannot be changed later. Cosine is the default for text embeddings because it compares angle regardless of magnitude, dot product is faster when vectors are normalized and often used for recommendation, and Euclidean is used for some image and biology embeddings where raw magnitude matters. Most modern embedding models expect either cosine or dot product, so choose based on the model's documentation.
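The three metrics, and the key identity that cosine equals dot product once vectors are unit-normalized, can be shown in a few lines of plain Python:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
# For unit-normalized vectors, cosine and dot product give identical scores:
ua = [x / norm(a) for x in a]
ub = [x / norm(b) for x in b]
print(round(cosine(a, b), 2), round(dot(ua, ub), 2))  # 0.96 0.96
```

This identity is why normalizing embeddings and using dot product is a common speed optimization: it skips the per-query norm computation while preserving cosine ranking.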

How does Pinecone serverless separate storage and compute?

0:29
Pinecone serverless stores vector data in object storage like S3, while compute nodes load data on demand to serve queries. When a query arrives for a namespace or shard not currently in memory, the compute layer fetches it from storage, pays a cold-start latency penalty, and caches it for subsequent queries. This architecture scales storage near-infinitely at object storage cost, avoids idle compute fees, and handles spiky workloads without overprovisioning. It trades some tail latency for dramatically lower total cost at rest.
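The cold-start-then-cache read path can be sketched as a simple cache in front of object storage. This is a conceptual mock of the architecture described above, not Pinecone internals:

```python
# Sketch of the serverless read path: compute keeps an in-memory cache;
# a cache miss pays a "cold start" fetch from object storage, after which
# subsequent queries for the same data are served warm.
object_storage = {"ns-a": ["vec1", "vec2"]}  # stand-in for S3
cache = {}
fetches = 0

def serve_query(namespace):
    global fetches
    if namespace not in cache:           # cold start: data not in memory
        fetches += 1
        cache[namespace] = object_storage[namespace]
    return cache[namespace]              # warm path: served from memory

serve_query("ns-a")
serve_query("ns-a")
print(fetches)  # 1 -- only the first query paid the storage fetch
```

The latency profile follows directly: the first query to a cold namespace is slow, everything after it is fast, and idle data costs only object-storage rates.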

What is sparse-dense hybrid search in Pinecone?

0:28
Pinecone supports hybrid search by storing both a dense vector and a sparse vector per record, with the sparse vector typically generated by BM25 or a learned sparse model like SPLADE. Queries include both a dense and sparse vector plus an alpha weight to blend scores, returning results ranked by the combined score. Hybrid search improves recall on queries with rare terms, product names, or exact phrases that dense embeddings alone miss. Sparse indexes in Pinecone are stored alongside dense ones without separate infrastructure.

How do you handle upserts and deletes at scale in Pinecone?

0:28
Pinecone upserts insert or overwrite vectors by ID in batches, typically 100 to 1000 vectors per request for best throughput. Deletes can target specific IDs, all vectors in a namespace, or filtered subsets by metadata. Writes are eventually consistent, so a just-written vector may not appear in queries for a short window, typically under a second in serverless indexes. For high-volume ingest, use concurrent batches and the async client, and monitor index stats to confirm vector counts reach expected values.
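The batching pattern is a simple chunking loop. The helper below is an illustrative sketch (record shape abbreviated), not an SDK function:

```python
# Split a large ingest into fixed-size batches; a single Pinecone upsert
# request is capped (on the order of 1000 vectors / a few MB), so clients
# send many batches, often concurrently.
def batches(vectors, batch_size=100):
    for i in range(0, len(vectors), batch_size):
        yield vectors[i:i + batch_size]

records = [{"id": f"vec-{i}", "values": [0.0]} for i in range(250)]
sizes = [len(b) for b in batches(records, batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Each yielded batch would be one `upsert` call; after ingest, comparing `describe_index_stats` vector counts against `len(records)` confirms the writes landed.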

What is a Pinecone index and what limits does it have?

0:30
A Pinecone index is the top-level data structure that holds vectors, their metadata, and the ANN structure for search. Each index has a fixed dimension, metric, and deployment type (serverless or pod) set at creation. Serverless indexes scale to billions of vectors and thousands of namespaces per index, with individual record metadata limited to 40 kilobytes. Per-project and per-organization quotas apply depending on plan. Starter tier allows a limited number of indexes, while paid tiers allow many more and support higher throughput.
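The 40-kilobyte metadata cap is easy to guard against client-side by measuring the serialized payload before upserting. This check is a sketch, not a built-in SDK feature:

```python
import json

# Serverless records cap metadata at 40 KB; reject oversized payloads
# before sending them rather than failing the upsert call.
MAX_METADATA_BYTES = 40 * 1024

def metadata_ok(metadata):
    return len(json.dumps(metadata).encode("utf-8")) <= MAX_METADATA_BYTES

print(metadata_ok({"title": "short and fine"}))  # True
print(metadata_ok({"blob": "x" * 50_000}))       # False -- over 40 KB
```

A common consequence of this limit: store document text in your own datastore keyed by vector ID, and keep only filterable fields and a short snippet in Pinecone metadata.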

How does Pinecone handle vector updates and versioning?

0:26
Pinecone treats upsert as insert-or-replace by vector ID, so writing a new vector with an existing ID replaces the old one atomically. There is no built-in versioning or history, so if you need to track changes, store version information in metadata like a version number or timestamp and query with a metadata filter. For A/B testing embedding models, use separate namespaces or indexes rather than updating in place, which lets you compare retrieval quality before fully switching.
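Insert-or-replace semantics plus version-in-metadata can be mocked in a few lines (a conceptual sketch with an in-memory dict, not the SDK):

```python
# Upsert is insert-or-replace by ID; version tracking lives in metadata.
store = {}  # id -> {"values": [...], "metadata": {...}}

def upsert(id_, values, metadata):
    store[id_] = {"values": values, "metadata": metadata}  # old record replaced

upsert("doc-1", [0.1], {"version": 1})
upsert("doc-1", [0.2], {"version": 2})  # same ID: silently overwrites version 1
print(len(store), store["doc-1"]["metadata"]["version"])  # 1 2
```

Because the old record is gone after the second upsert, rollback or comparison requires keeping the previous generation elsewhere, which is exactly why A/B tests belong in a separate namespace or index.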

How does Pinecone handle high-throughput query workloads?

0:31
Pinecone scales query throughput through replicas in pod-based indexes and automatic scaling in serverless, with typical single-pod throughput in the hundreds of queries per second range. Use multiple replicas for horizontal query scaling, batch query requests when possible, and route read-heavy workloads to dedicated indexes if they compete with write-heavy ones. Serverless indexes handle bursty traffic without replica tuning, but cold starts on rarely-queried namespaces can add latency spikes. Monitoring index latency and replica utilization identifies scaling needs.
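Replica sizing for pod-based indexes is back-of-envelope arithmetic. The per-replica QPS figure below is a hypothetical placeholder in the "hundreds of QPS" range mentioned above; benchmark your own workload before provisioning:

```python
import math

# Rough replica sizing: replicas scale read throughput linearly.
def replicas_needed(target_qps, qps_per_replica=300):
    # qps_per_replica is a hypothetical figure -- measure it for your index.
    return math.ceil(target_qps / qps_per_replica)

print(replicas_needed(1000))  # 4 replicas to serve ~1000 QPS at ~300 QPS each
```

Serverless removes this calculation entirely, at the cost of less control over the latency floor for cold namespaces.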