Caching Strategies for AI Agent Traffic

Agentic AI is already reshaping how we think about and implement caching. Traditional caching simply stores assets or responses to reduce latency and improve efficiency. Agentic AI changes the game in some important ways: you have to think differently when every call potentially costs money and every cached asset needs to be findable, fast.

Those constraints are already changing how caching gets done. In this article, we’re going to walk through several caching strategies for AI agents to help ensure your products remain competitive and efficient in the agentic age.

Semantic Caching

Semantic caching is one of the most significant shifts in how caching is thought about and implemented in APIs. Traditional caching relies on exact matches, pairing a specific asset with a unique identifier like a key. Semantic caching, on the other hand, considers context and meaning, so a cached response can be reused even when the incoming request isn’t an exact match.

Imagine someone asks an AI assistant, “How can I retrieve my password?” or “I can’t log into my account.” Both essentially mean the same thing, but that’s a difficult thing to express in code. Semantic caching uses embeddings, which are vector representations of text. This allows semantic caching systems to return assets that are similar to a specific request if an exact match can’t be found.

To set up semantic caching properly, you need to tune the similarity threshold. Make it too strict and you’ll miss valid cache hits; make it too loose and you risk returning responses that don’t actually answer the question.
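
Here’s a minimal sketch of that trade-off, using the same sentence-transformers library this article installs later (the 0.85 threshold is an illustrative starting point, not a recommendation):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two different phrasings of the same underlying request
cached_query = "How can I retrieve my password?"
new_query = "I can't log into my account."

# Cosine similarity between the two embeddings
score = util.cos_sim(model.encode(cached_query), model.encode(new_query))[0][0].item()

THRESHOLD = 0.85  # stricter = fewer, safer hits; looser = more hits, more risk
if score >= THRESHOLD:
    print(f"Cache hit (similarity {score:.2f})")
else:
    print(f"Cache miss (similarity {score:.2f})")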

Optimizing for Cost and Performance

Agentic AI usually makes requests or takes actions through an LLM, and most LLM providers charge per call, even if it’s just drawing down your account’s credits. Caching requests and responses wherever possible keeps those calls from piling up; otherwise, your product or service can get very expensive, very quickly.

Properly configuring time-to-live (TTL) settings is a good way to balance data freshness, speed, and budget. A news app might refresh cached headlines once every 24 hours, for example, while an e-commerce app might keep cached pricing until the catalog changes or a much shorter TTL expires.
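
Here’s a minimal sketch of per-category TTLs; the categories and durations are purely illustrative:

import time

# TTLs in seconds, chosen per data type rather than one global value
TTLS = {"news": 24 * 60 * 60, "pricing": 15 * 60}

cache = {}  # key -> (value, expires_at)

def cache_set(key, value, category):
    cache[key] = (value, time.time() + TTLS[category])

def cache_get(key):
    entry = cache.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.time() > expires_at:
        del cache[key]  # stale: evict and force a refresh
        return None
    return value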

Tiered caching can also be set up to improve both cost and speed. Frequently accessed data might live in in-memory storage for lightning-fast retrieval, while less frequently accessed data sits in slower storage that costs less. This is another way caching can balance performance and affordability.
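
Here’s a sketch of the pattern, using a plain dict as the fast in-memory tier and the standard library’s shelve module as a stand-in for the slower, cheaper tier (in production that might be Redis or object storage):

import shelve

hot_cache = {}  # fastest tier: process memory

def tiered_get(key):
    # Tier 1: in-memory (keys must be strings so shelve can store them too)
    if key in hot_cache:
        return hot_cache[key]
    # Tier 2: slower on-disk store; promote to the hot tier on a hit
    with shelve.open("cold_cache.db") as cold:
        if key in cold:
            hot_cache[key] = cold[key]
            return cold[key]
    return None

def tiered_set(key, value):
    hot_cache[key] = value
    with shelve.open("cold_cache.db") as cold:
        cold[key] = value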

Types of Caching

Different types of agentic AI require different types of caching. Response caching is the most common and straightforward. In response caching, an LLM’s response is stored so that it can be reused. This is a good approach for queries on relatively static knowledge bases like FAQs.
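
A minimal sketch of exact-match response caching, keyed on a hash of the prompt; call_llm here is just a stub standing in for whatever model call your agent actually makes:

from hashlib import sha256

response_cache = {}

def call_llm(prompt):
    # Stub standing in for your real LLM call
    return f"Answer to: {prompt}"

def cached_answer(prompt):
    key = sha256(prompt.encode("utf-8")).hexdigest()
    if key in response_cache:
        return response_cache[key]  # reuse a previous answer, no LLM call
    answer = call_llm(prompt)
    response_cache[key] = answer
    return answer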

Response caching works best when it’s paired with semantic caching, so the right answer can be returned even when the question’s wording varies. You can also version cached entries, tagging each response with when (or against which knowledge-base revision) it was generated, so stale answers can be refreshed without throwing away the whole cache.

We’ve talked about embeddings a few times at this point. Generating embeddings for every query can be prohibitively slow as well as expensive, and embedding caching removes some of that load by caching the vectors for known inputs. Say you’re building a product recommendation engine: the same queries and product descriptions get embedded over and over, so caching those vectors saves real work. The drawback is drift. As your embedding model or catalog changes over time, cached vectors fall out of sync, so you’ll want to regenerate them periodically to keep them relevant and accurate.
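
A minimal sketch, reusing the same sentence-transformers model from earlier and tagging each cached vector with the model name so out-of-date entries can be detected and regenerated:

from sentence_transformers import SentenceTransformer

MODEL_NAME = "all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

embedding_cache = {}  # text -> {"model": ..., "vector": ...}

def get_embedding(text):
    entry = embedding_cache.get(text)
    # Recompute if missing or produced by a different model version
    if entry is None or entry["model"] != MODEL_NAME:
        entry = {"model": MODEL_NAME, "vector": model.encode(text)}
        embedding_cache[text] = entry
    return entry["vector"]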

Many AI agents involve several components. Workflow-level caching is a style of caching that saves the returned assets from each component. Think of a travel app, for instance. The app might return the cheapest price for a flight from one destination to another. This could entail returning information about routes, flight data, pricing, and more. A workflow-level caching solution could cache resources from each component, making them faster, easier, and cheaper to retrieve.
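
Here’s a sketch of that idea, caching each step of the workflow under its own key; the step functions are illustrative stubs, not a real flight API:

workflow_cache = {}

def cached_step(step_name, key, compute):
    # Cache one workflow component's result, computing it only on a miss
    cache_key = (step_name, key)
    if cache_key not in workflow_cache:
        workflow_cache[cache_key] = compute()
    return workflow_cache[cache_key]

def cheapest_flight(dep, arr):
    # Each component is cached independently, so a repeat request skips every lookup
    routes = cached_step("routes", (dep, arr), lambda: [f"{dep}-{arr}"])          # stub route lookup
    flights = cached_step("flights", (dep, arr), lambda: [{"flight": "XY123"}])   # stub flight data
    prices = cached_step("pricing", (dep, arr), lambda: {"XY123": 199.0})         # stub pricing
    return min(prices.values())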

Example of Caching For AI Agents

Let’s close this article out with an example so you can see these principles in action. We’re going to build a quick travel app to show you how caching can make your app more efficient. To start, you’ll need to register with AviationStack to get an API key, which we’ll use to retrieve real-time flight data.

Next, create a folder for your app in your programming directory. We’ve titled ours TravelApp, but feel free to name yours whatever you like. Once you’ve created the new folder, navigate inside it and make a blank file and name it TravelApp.py.

Next, you’re going to install the libraries you’ll need for the app. Run the following command via terminal:

pip install requests sentence-transformers transformers torch

sentence-transformers provides the sentence embeddings we’ll use for semantic caching. transformers and torch are the machine learning libraries used to load and run the pretrained models the app relies on.

Now you’ll need to set the environment variable for your API key so the app can call AviationStack. On Windows, open PowerShell and execute the following command:

$env:AVIATIONSTACK_KEY="your_key_here"

Now we’re going to start building the app in earnest. Start by importing the libraries you’ll need at the top of your TravelApp.py script:

import os
import json
import requests
import numpy as np
from hashlib import sha256
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
import random

You’ll also want to load the models you’ll be using, one for embeddings and one for text generation:

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
generator = pipeline("text2text-generation", model="google/flan-t5-base")
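
You’ll also need to pull in the API key you set a moment ago and define the endpoint the app will call. Add these lines below the imports (double-check the base URL against your AviationStack dashboard, since it can vary by plan):

AVIATIONSTACK_KEY = os.environ["AVIATIONSTACK_KEY"]  # the variable set in PowerShell earlier
BASE_URL = "https://api.aviationstack.com/v1/flights"  # verify against your AviationStack account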

Now you’re going to create the variables that you’ll use for caching:

embedding_cache = {}
response_cache = {}
workflow_cache = {}
semantic_cache = []
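
The functions below rely on a get_query_embedding helper, so define that next. Here’s a minimal version that doubles as the embedding cache, encoding each distinct query only once:

def get_query_embedding(query):
    # Embedding cache: only encode queries we haven't seen before
    if query not in embedding_cache:
        embedding_cache[query] = embedding_model.encode(query, convert_to_tensor=True)
    return embedding_cache[query]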

Now you’re going to define the semantic search function that powers the semantic cache. Input the following code:

def search_semantic_cache(query, threshold=0.85):
    query_emb = get_query_embedding(query)
    best_score = 0
    best_response = None
    for item in semantic_cache:
        score = util.cos_sim(query_emb, item['embedding'])[0][0].item()
        if score > threshold and score > best_score:
            best_score = score
            best_response = item['response']
    return best_response

This allows your travel app to convert natural language queries into numerical vector representations using the SentenceTransformer model “all-MiniLM-L6-v2,” so it can infer meaning from a query rather than simply looking for exact matches. Once created, the vector is compared against the embeddings of previous queries. If the best similarity score clears the threshold (0.85 here), a suitable response already exists, so the app knows it doesn’t need to make an additional call.

Now you’re going to write the code to call the AviationStack API:

def get_flight_data(dep_iata, arr_iata):
    params = {
        "access_key": AVIATIONSTACK_KEY,
        "dep_iata": dep_iata,
        "arr_iata": arr_iata
    }
    r = requests.get(BASE_URL, params=params)
    if r.status_code == 200:
        return r.json()
    else:
        return {"error": "Failed to fetch flight data"}

Finally, wrap things up with the main body of the code, which can now use semantic search and caching:

def plan_trip(query, dep_iata, arr_iata):
    # Check semantic cache
    cached = search_semantic_cache(query)
    if cached:
        print("[CACHE HIT] Semantic match found.")
        return cached

    # Otherwise fetch new data
    data = get_flight_data(dep_iata, arr_iata)
    response = json.dumps(data, indent=2)

    # Cache the embedding and response
    semantic_cache.append({
        "query": query,
        "embedding": get_query_embedding(query),
        "response": response
    })

    return response

if __name__ == "__main__":
    query = "Find flights from New York to Los Angeles"
    result = plan_trip(query, "JFK", "LAX")
    print(result)

At this point, TravelApp.py should contain, in order: the imports, the model loading, the API key and base URL, the cache variables, get_query_embedding, search_semantic_cache, get_flight_data, and the plan_trip function with the __main__ block above.

Running that script should return a result similar to the following:

 "flight_date": "2025-07-27",
 "flight_status": "active",
 "departure": {
 "airport": "John F Kennedy International",
 "timezone": "America/New_York",
 "iata": "JFK",
 "icao": "KJFK",
 "terminal": "8",
 "gate": "4",
 "delay": 19,
 "scheduled": "2025-07-27T06:00:00+00:00",
 "estimated": "2025-07-27T06:00:00+00:00",
 "actual": null,
 "estimated_runway": null,
 "actual_runway": null

If you replace the __main__ block at the end of the script with a second, semantically similar query, you can see the semantic caching in action:

if __name__ == "__main__":
    # First search (populates cache)
    query1 = "Find flights from New York to Los Angeles"
    result1 = plan_trip(query1, "JFK", "LAX")
    print(result1)

    print("\n---\n")

    # Second search (semantically similar)
    query2 = "Show me flights heading to LA from NYC"
    result2 = plan_trip(query2, "JFK", "LAX")
    print(result2)

The first query populates the semantic cache. Because the second query is semantically similar to what’s already cached, the app knows it doesn’t need to make another API call. Even something as simple as deduplicating semantically similar queries, rather than firing off a call for every individual request, can eliminate a sizable chunk of unnecessary calls. That protects your quota if you only get so many requests per month, and it keeps your tools and apps running as quickly and smoothly as possible.

Final Thoughts on Caching for AI Agents

Caching for AI agents isn’t just a matter of picking a new tool. It involves a fundamental shift in thinking. When every request can cost money as well as time, it’s in everybody’s interest to keep those calls to a minimum. It also requires understanding embeddings, the vector representations that underpin so much of modern machine learning.

To review, semantic caching allows systems to recognize and reuse answers based on meaning, rather than strict textual matches. Workflow-level caching takes things even further, making complex agentic AI systems as affordable and efficient as possible. Setting up caching properly can help make your agentic AI fast, affordable, and dependable for your users.
