MCP Advanced: Sampling, Server-to-Client LLM Requests, Cost Shifting Explained

Explore sampling in MCP: how a server can request text generation from a language model through the connected client, rather than calling the model directly. This section also touches on logging, progress notifications, and context methods for long-running tasks.

8 audio · 3:53

Nortren

What is sampling in MCP and what problem does it solve?

0:29
Sampling is a mechanism that allows an MCP server to access a language model like Claude through the connected MCP client, instead of calling the model directly. Without sampling, a server that needs text generation would require its own API key, authentication logic, cost management, and full Claude integration code. With sampling, the server creates a prompt and asks the client to make the Claude call on its behalf. The client, which already has a connection to Claude, handles the request and returns the generated text. This shifts both complexity and cost from the server to the client.

How does the sampling flow work step by step in MCP?

0:30
The sampling flow follows six steps. First, the server completes its work, such as fetching data from external sources. Second, the server creates a prompt asking for text generation. Third, the server sends a sampling request to the client through the MCP session. Fourth, the client calls Claude with the provided prompt using its own API credentials. Fifth, the client returns the generated text to the server. Sixth, the server uses the generated text in its final response. The server never touches the language model API directly throughout this entire process.
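The six steps above can be followed end to end in a minimal sketch. The `FakeClient` and `research_tool` names here are hypothetical stand-ins, not the real MCP SDK; the fake client simply substitutes a canned string for the actual Claude call:

```python
class FakeClient:
    """Stands in for an MCP client that owns the Claude connection."""

    def handle_sampling_request(self, prompt: str, max_tokens: int) -> str:
        # Step 4: the client calls Claude with its own API credentials.
        # Step 5: it returns the generated text to the server.
        # (Faked here with a canned response.)
        return f"[summary of: {prompt[:40]}]"


def research_tool(client: FakeClient, topic: str) -> str:
    # Step 1: the server completes its own work first (e.g. fetching data).
    fetched = f"raw article text about {topic}"
    # Step 2: the server creates a prompt asking for text generation.
    prompt = f"Summarize the following:\n{fetched}"
    # Step 3: the server sends a sampling request to the client.
    summary = client.handle_sampling_request(prompt, max_tokens=200)
    # Step 6: the server uses the generated text in its final response.
    return f"Research result for {topic}: {summary}"


print(research_tool(FakeClient(), "MCP sampling"))
```

Note that `research_tool` never touches a model API itself; everything model-related happens inside the client object.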

How do you implement sampling on the MCP server side?

0:29
On the server side, you use the create_message method available through the Context object in your tool function. You construct a SamplingMessage with a role of user and a TextContent object containing your prompt. You pass this message along with parameters like max_tokens and an optional system_prompt to the create_message call on the context session. The method returns a result object whose content property contains the generated text. If the content type is text, you extract and return it. This keeps the server code minimal since all model interaction happens on the client side.
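The call shape described above can be sketched as follows. The type and method names (`SamplingMessage`, `TextContent`, `session.create_message`) mirror the MCP Python SDK, but `FakeSession` is a local stand-in so the example runs on its own; in a real FastMCP tool you would call `ctx.session.create_message` on the injected `Context`:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TextContent:
    type: str
    text: str


@dataclass
class SamplingMessage:
    role: str
    content: TextContent


class FakeSession:
    """Stand-in for the MCP session that forwards requests to the client."""

    async def create_message(self, messages, max_tokens, system_prompt=None):
        # The real session sends this to the client, which calls Claude.
        prompt = messages[0].content.text
        return SamplingMessage(
            role="assistant",
            content=TextContent(type="text", text=f"summary of: {prompt}"),
        )


async def summarize(session, article: str) -> str:
    # Build a user message wrapping the prompt as text content.
    message = SamplingMessage(
        role="user",
        content=TextContent(type="text", text=f"Summarize:\n{article}"),
    )
    result = await session.create_message(
        messages=[message],
        max_tokens=300,
        system_prompt="You are a concise summarizer.",
    )
    # Extract the generated text only if the content type is text.
    if result.content.type == "text":
        return result.content.text
    return "unsupported content type"


print(asyncio.run(summarize(FakeSession(), "long article body")))
```

The server-side logic stays this small precisely because all model interaction is delegated through the session.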

How do you implement sampling on the MCP client side?

0:29
On the client side, you create a sampling callback function that receives request parameters and calls Claude using the Anthropic SDK or any other language model provider. The callback receives a RequestContext and CreateMessageRequestParams containing the messages from the server. You pass those messages to your Claude integration, get the response, and return a CreateMessageResult with the assistant role, the model name, and a TextContent object with the generated text. You then pass this callback when initializing your ClientSession using the sampling_callback parameter.
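The callback shape can be sketched like this. The names mirror the MCP SDK's `CreateMessageRequestParams` and `CreateMessageResult`, but the dataclasses, the `call_claude` helper, and the model name are hypothetical stand-ins so the example runs standalone; a real client would call the Anthropic SDK inside `call_claude` and pass the callback as `ClientSession(..., sampling_callback=sampling_callback)`:

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class TextContent:
    type: str
    text: str


@dataclass
class CreateMessageRequestParams:
    messages: list = field(default_factory=list)  # forwarded from the server


@dataclass
class CreateMessageResult:
    role: str
    model: str
    content: TextContent


def call_claude(messages) -> str:
    # Stand-in for an Anthropic SDK call made with the client's own key.
    return "generated text"


async def sampling_callback(context, params: CreateMessageRequestParams) -> CreateMessageResult:
    # Forward the server's messages to the model and wrap the reply
    # in the result shape the server expects.
    completion = call_claude(params.messages)
    return CreateMessageResult(
        role="assistant",
        model="claude-sonnet",  # hypothetical model name
        content=TextContent(type="text", text=completion),
    )


result = asyncio.run(sampling_callback(None, CreateMessageRequestParams()))
print(result.content.text)
```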

Why is sampling essential for publicly accessible MCP servers?

0:28
Sampling is essential for public MCP servers because without it, the server operator would pay for every language model call made by every user. If your research tool fetches Wikipedia articles and then needs to summarize them with Claude, each user request would cost you API tokens. With sampling, each client pays for their own AI usage while still benefiting from your server's tools and data processing. This makes it economically viable to build and share MCP servers publicly without worrying about runaway AI generation costs from unknown users.

What are the main benefits of using sampling instead of direct API access?

0:30
Sampling provides four key benefits. First, it reduces server complexity because the server does not need to integrate with language model APIs directly. Second, it shifts the cost burden so the client pays for token usage rather than the server. Third, the server needs no API keys or credentials for Claude, simplifying deployment and security. Fourth, it is perfect for public servers because you avoid the risk of random users generating unlimited text at your expense. The technique moves AI integration complexity to the client, which typically already has the necessary connections in place.

What is the difference between sampling and direct tool use in MCP?

0:29
Direct tool use means the client calls Claude, Claude decides to use a server tool, the server executes it and returns data, and Claude generates the final response. Sampling reverses one part of this flow. During tool execution, the server itself needs language model help to process intermediate results, so it asks the client to call Claude on its behalf. The key difference is direction: in tool use, the client initiates the model call. In sampling, the server initiates a model call through the client. Sampling is a server-to-client request for AI generation.

When should you use sampling versus handling summarization on the client side?

0:29
Use sampling when your server performs complex data processing and needs AI-generated summaries, transformations, or analysis as an intermediate step before returning results. If the summarization is the final step that the client can handle directly with its own Claude call, sampling is unnecessary. Sampling shines when the server needs to combine fetched data with generated text internally, for example researching multiple sources and producing a unified report, before the final response reaches the user. It keeps the server's processing pipeline self-contained.