Langchain Wrap Model Call
@wrap_model_call is one of the most powerful hooks in Middleware.
Unlike before/after which only observe, @wrap_model_call can completely control the model's execution process β retry, fallback, cache, or even skip the model and use preset responses.
* * *
## Understanding the handler callback
The core of @wrap_model_call is a handler callback function. Calling handler(request) will actually execute the model; not calling it will skip the model.
## Example
# Basic structure of wrap_model_call
# request: contains all information such as model, messages, tools
# handler: a callable object, executing it will actually call the model
@wrap_model_call
def my_middleware(request, handler):
# Can do anything before model call
print("Model is about to be called...")
# Calling handler(request) actually executes the model
response = handler(request)
# Can do anything after model call
print("Model call completed")
return response
* * *
## Scenario 1: Retry Mechanism
This is the most common scenario β model calls may fail due to network issues, automatic retry can improve reliability:
## Example
from dotenv import load_dotenv
load_dotenv()
from langchain.agents import create_agent
from langchain.agents.middleware import wrap_model_call
from langchain.chat_models import init_chat_model
from langchain.messages import HumanMessage
@wrap_model_call
def retry_on_error(request, handler):
"""Automatically retry on model call failure, up to 3 times"""
max_retries =3
last_error =None
for attempt in range(max_retries):
try:
result = handler(request)
if attempt >0:
print(f" Attempt {attempt + 1}")
return result
except Exception as e:
last_error = e
if attempt request.override() is an immutable method β it returns a new copy of the request, does not modify the original object. This ensures each call is independent and safe.
* * *
## Scenario 3: Caching Model Responses
For repeated queries, you can cache model responses to reduce API call costs:
## Example
from langchain.agents.middleware import wrap_model_call
from langchain.messages import AIMessage
# Simple in-memory cache
cache ={}
@wrap_model_call
def cache_responses(request, handler):
"""Cache model responses, don't call model for same questions"""
# Use the last user message content as cache key
messages = request.messages
if not messages:
return handler(request)
# Generate cache key
last_content =str(messages.content)if hasattr(messages,'content')else""
cache_key = last_content[:200]# Truncate too long content
# Check cache
if cache_key in cache:
print(f" Returning cached result directly")
cached = cache
return AIMessage(content=f"{cached}nn*(From cache)*")
# Cache miss, call model
result = handler(request)
# AIMessage content may be in a list, get the first one directly
if hasattr(result,'content'):
cache= result.content
print(f" Stored in cache, currently {len(cache)} items")
elif hasattr(result,'model_response'):
# For ExtendedModelResponse
pass
return result
model = init_chat_model("deepseek:deepseek-v4-flash", temperature=0)
agent = create_agent(
model=model,
middleware=,
system_prompt="You are Tutorial's assistant. Keep answers concise.",
)
# First query (cache miss)
result = agent.invoke({
"messages": [{"role": "user","content": "What is Python?"}]
})
print(f"First time: {result['messages'].content[:80]}...n")
# Second query with same question (cache hit)
result = agent.invoke({
"messages": [{"role": "user","content": "What is Python?"}]
})
print(f"Second time: {result['messages'].content[:80]}...")
Running result:
Stored in cache, currently 1 itemsFirst time: Python is a high-level programming language known for its concise and readable syntax... Returning cached result directlySecond time: Python is a high-level programming language known for its concise and readable syntax...*(From cache)*
* * *
## Scenario 4: Modifying request β Dynamically Injecting System Messages
## Example
from datetime import datetime
from langchain.agents.middleware import wrap_model_call
from langchain.messages import SystemMessage
YouTip