Building Semantic Search for Restaurant Menus
A practical guide to replacing keyword search with semantic search in food delivery apps. Two integration paths, pre-computed embeddings, and real code.
The vocabulary mismatch problem
We covered why keyword search fails for food delivery in depth already. Customers and restaurants speak different languages for the same food, and string matching can't bridge the gap. This post is about how to fix it.
You embed queries and menu items into the same vector space, then rank by cosine similarity. "Cold coffee" lands near "Iced Americano." "Something sweet" lands near Gulab Jamun.
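To make the ranking step concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The three-dimensional vectors are invented for illustration (real embeddings come from the API and have hundreds of dimensions), and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, items):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only.
toy_menu = {
    "Iced Americano": [0.9, 0.1, 0.0],
    "Gulab Jamun": [0.1, 0.9, 0.1],
    "Chicken Biryani": [0.0, 0.2, 0.9],
}
cold_coffee = [0.8, 0.2, 0.1]  # pretend embedding of "cold coffee"

ranking = rank_by_similarity(cold_coffee, toy_menu)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy numbers, "Iced Americano" ranks first because its vector points in nearly the same direction as the query vector, even though the two strings share no words.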
Two paths, same response format
The POST /search endpoint accepts your corpus two ways.
Path A: Send raw text per call. You pass the query and menu items. The API embeds everything, runs the search, returns ranked results. Zero infrastructure. Best for prototyping or menus under a few hundred items.
Path B: Pre-compute embeddings. You embed the corpus once (POST /embed, or POST /embed/batch for a whole catalog), store the vectors, and pass them with each search. The API skips corpus encoding entirely. At 100 items, search drops from ~80ms to ~15ms because only the query string needs embedding. Pre-computed embeddings are almost always the right choice for production.
Path A in 20 lines
import requests

API_KEY = "YOUR_KEY"
BASE = "https://dish-embed.latimal.com"
headers = {"X-API-Key": API_KEY, "Content-Type": "application/json"}

menu = [
    "Gulab Jamun", "Chicken Biryani", "Rasmalai",
    "Brownie Sundae", "Masala Dosa", "Mango Lassi",
    "Paneer Tikka", "Matcha Latte",
]

response = requests.post(
    f"{BASE}/search",
    headers=headers,
    json={"query": "something sweet", "corpus": menu, "top_k": 3},
)
for result in response.json()["results"]:
    print(f"{result['text']} (score: {result['score']:.3f})")

This returns Gulab Jamun, Rasmalai, and Brownie Sundae. None contain the word "sweet." Swap the query to "healthy snack" or "coffee" and the results shift accordingly.
Each result carries a score (cosine similarity, 0 to 1) and a reranker_score from a precision reranking stage. Use reranker_score for hard thresholds: above 0.8 is a strong match, 0.5 to 0.8 is plausible, below 0.5 is noise. The API also filters weak matches before they reach you, so you may get fewer results than the top_k you asked for.
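The threshold bands above translate into a few lines of client-side code. This is a sketch rather than anything the API ships: `classify_match` and `keep_confident` are hypothetical helpers, and the sample results are made up to match the field names mentioned in the text.

```python
def classify_match(reranker_score):
    """Map a reranker score to the bands described in the text."""
    if reranker_score > 0.8:
        return "strong"
    if reranker_score >= 0.5:
        return "plausible"
    return "noise"

def keep_confident(results, minimum=0.5):
    """Drop results whose reranker_score falls below the cutoff."""
    return [r for r in results if r["reranker_score"] >= minimum]

# Made-up response data; only the field names come from the post.
sample_results = [
    {"text": "Gulab Jamun", "score": 0.71, "reranker_score": 0.93},
    {"text": "Mango Lassi", "score": 0.55, "reranker_score": 0.62},
    {"text": "Paneer Tikka", "score": 0.31, "reranker_score": 0.12},
]

kept = keep_confident(sample_results)
labels = [classify_match(r["reranker_score"]) for r in sample_results]
```

A stricter surface, such as a "top pick" badge, could pass `minimum=0.8` and show only strong matches.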
Path B: pre-compute, store, search
Embed your catalog
The POST /embed/batch endpoint accepts up to 5,000 items in one call and handles chunking internally. For catalogs up to that size, you need no batching logic on your side; larger catalogs take one call per 5,000 items.
import requests

API_KEY = "YOUR_KEY"
BASE = "https://dish-embed.latimal.com"
headers = {"X-API-Key": API_KEY, "Content-Type": "application/json"}

# load_menu_from_database() is your DB query; returns a list of strings.
# Stub it with a hardcoded list if you're prototyping.
menu_items = load_menu_from_database()  # e.g., 2,000 items

resp = requests.post(
    f"{BASE}/embed/batch",
    headers=headers,
    json={"items": menu_items, "dimension": 384},
)
resp.raise_for_status()  # fail loudly instead of storing an error payload
embeddings = resp.json()["embeddings"]

# save_embeddings() is your storage layer (Postgres, Redis, a file on disk).
save_embeddings(menu_items, embeddings)

Search with stored vectors
menu_items, embeddings = load_embeddings()

response = requests.post(
    f"{BASE}/search",
    headers=headers,
    json={
        "query": "spicy noodles",
        "corpus": menu_items,
        "corpus_embeddings": embeddings,
        "top_k": 10,
    },
)
for result in response.json()["results"]:
    print(f"{result['text']} (score: {result['score']:.3f})")

Choosing your vector dimension
| Dimension | Vector size | Accuracy drop vs 384 | Best for |
|---|---|---|---|
| 64 | 256 bytes | ~15% | High-speed filtering, pre-screening |
| 128 | 512 bytes | 3-5% | High-speed filtering, catalogs over 1M items |
| 256 | 1 KB | 1-2% | Balanced speed and accuracy |
| 384 | 1.5 KB | Baseline | Maximum accuracy, fine-grained distinction |
For most search use cases, 384 is correct. Storage is cheap and the accuracy gain matters when you need to distinguish "Flat White" from "Cappuccino." Drop to 128 only if you're storing millions of vectors and retrieval latency is your bottleneck. See the Matryoshka dimensions guide for full benchmarks.
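The storage side of this trade-off is simple arithmetic. The sketch below assumes each vector component is a 4-byte float32, which matches the vector sizes in the table (64 dimensions at 256 bytes); `vector_bytes` and `catalog_megabytes` are hypothetical helper names.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def vector_bytes(dimension):
    """Raw size of one embedding vector in bytes."""
    return dimension * BYTES_PER_FLOAT32

def catalog_megabytes(item_count, dimension):
    """Raw vector storage in MB (10**6 bytes), excluding index or JSON overhead."""
    return item_count * vector_bytes(dimension) / 1_000_000

for dim in (64, 128, 256, 384):
    print(f"{dim} dims: {vector_bytes(dim)} bytes per vector, "
          f"{catalog_megabytes(1_000_000, dim):,.0f} MB per 1M items")
```

Serialized formats such as JSON take more space than the raw float size, so treat these figures as a lower bound.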
Messy menus, handled
Real POS data is noisy. "**BESTSELLER** Margherita Pizza [Medium, Cheesy Crust, Buy 1 Get 1]" instead of "Margherita Pizza." "Chkn Bry" instead of "Chicken Biryani." "दाल मखानी" and "Dal Makhani" and "Daal Makhni" on the same platform. The API strips promotional noise, resolves POS abbreviations, and handles 100+ languages with transliteration awareness before encoding. The query_preprocessed response field shows what text the API actually worked with. For the full breakdown of how this works, see the first post in this series.
At 10,000 items: real numbers
Single city. 50 restaurants. 200 items each. 10,000 total. Here's what the numbers actually look like.
Initial embed cost. 10,000 items at 0.05 credits each = 500 credits. Two calls to /embed/batch (5,000 items per call). Total time: roughly 4 seconds. At 384 dimensions, the full vector set is about 15 MB. Store it in Postgres with a JSONB column, in Redis, or load it into memory on startup. At this scale, all three work.
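Splitting a 10,000-item catalog into two 5,000-item calls can be sketched as below. The embedding call is injected as a plain function so the example runs offline; `chunked`, `embed_catalog`, and `fake_embed_batch` are hypothetical names, and in practice the injected function would wrap a POST to /embed/batch.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_catalog(items, embed_batch, batch_limit=5000):
    """Embed a catalog larger than one request, one call per chunk.

    embed_batch is any callable that takes a list of strings and
    returns one vector per string, in the same order.
    """
    embeddings = []
    for chunk in chunked(items, batch_limit):
        vectors = embed_batch(chunk)
        if len(vectors) != len(chunk):
            raise ValueError("embedding count does not match item count")
        embeddings.extend(vectors)
    return embeddings

# Stand-in embedder so the sketch runs without network access.
calls = []

def fake_embed_batch(chunk):
    calls.append(len(chunk))
    return [[float(len(text))] for text in chunk]

catalog = [f"item {i}" for i in range(10_000)]
vectors = embed_catalog(catalog, fake_embed_batch)
print(f"{len(calls)} calls, {len(vectors)} vectors")
```

The length check is there because items and vectors are stored side by side; a silent mismatch would pair dishes with the wrong embeddings.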
Per-query cost. With pre-computed embeddings, each search costs 0.05 credits (only the query string gets embedded). That scales linearly: 500 credits at 10K queries/day, 500K credits at 10M. Run the math against your credit rate.
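The credit math is easy to script for your own volumes. This sketch uses the 0.05-credit rate quoted in this post; `embed_cost`, `daily_query_cost`, and `period_credits` are hypothetical helpers, and the 30-day period is an assumption.

```python
CREDITS_PER_EMBEDDING = 0.05  # rate quoted in this post

def embed_cost(item_count):
    """Credits to embed (or re-embed) a number of menu items."""
    return item_count * CREDITS_PER_EMBEDDING

def daily_query_cost(queries_per_day):
    """Credits per day when only the query string is embedded per search."""
    return queries_per_day * CREDITS_PER_EMBEDDING

def period_credits(queries_per_day, changed_items_per_day, days=30):
    """Query credits plus re-embedding credits over a billing period."""
    per_day = daily_query_cost(queries_per_day) + embed_cost(changed_items_per_day)
    return days * per_day

print(f"Initial embed, 10,000 items: {embed_cost(10_000):,.0f} credits")
print(f"10K queries/day: {daily_query_cost(10_000):,.0f} credits/day")
print(f"10M queries/day: {daily_query_cost(10_000_000):,.0f} credits/day")
```

At any real query volume the per-query term dominates, so the one-time embed cost rarely decides the budget.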
Latency. POST /search with pre-computed embeddings: ~80ms p50, ~150ms p95 at 100 corpus items. At 10,000 items passed in a single call, expect higher times because cosine similarity still runs over the full corpus. Benchmark this for your geography; India-to-API round trips add 30-60ms over US-based calls. You should NOT use this approach for typeahead. Typeahead needs sub-50ms. For that, pre-filter by restaurant or category before hitting the API.
def search_menu(query, restaurant_id=None):
    """Search one restaurant or the whole city catalog."""
    if restaurant_id:
        items, embeddings = load_restaurant_embeddings(restaurant_id)
    else:
        items, embeddings = load_all_embeddings()
    try:
        resp = requests.post(
            f"{BASE}/search",
            headers=headers,
            json={
                "query": query,
                "corpus": items,
                "corpus_embeddings": embeddings,
                "top_k": 20,
            },
            timeout=2.0,
        )
        resp.raise_for_status()
        return resp.json()["results"]
    except requests.RequestException:
        # Covers timeouts, HTTP errors, and connection failures.
        # Semantic search is down. Fall back to keyword matching.
        return keyword_fallback(query, items)

# Single restaurant: fast, scoped
results = search_menu("spicy chicken", restaurant_id="rest_042")

# Cross-restaurant: find the best Pad Thai in the city
results = search_menu("Pad Thai")

The keyword_fallback function is your existing text search (SQL LIKE, Elasticsearch, whatever you already have). Semantic search should augment your stack, not replace it entirely with a single point of failure.
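If you have no existing text search to plug in while prototyping, a stand-in for keyword_fallback might look like the sketch below. It is a naive token-overlap match, not a substitute for SQL LIKE or Elasticsearch, and it assumes results are dicts with "text" and "score" keys so callers can treat both paths the same way. The score here is a token-overlap ratio, not a cosine similarity.

```python
def keyword_fallback(query, items, top_k=20):
    """Naive token-overlap search returning dicts with 'text' and 'score'."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return []
    results = []
    for text in items:
        item_tokens = set(text.lower().split())
        overlap = len(query_tokens & item_tokens)
        if overlap:
            # Fraction of query tokens found in the item name.
            results.append({"text": text, "score": overlap / len(query_tokens)})
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:top_k]

matches = keyword_fallback(
    "spicy chicken",
    ["Chicken Biryani", "Masala Dosa", "Spicy Chicken Wings"],
)
for match in matches:
    print(f"{match['text']} (score: {match['score']:.2f})")
```

Because the scores are on a different scale from semantic scores, avoid applying the same thresholds to fallback results.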
Keeping embeddings fresh
Menus change. Lunch specials rotate. Seasonal items appear. A nightly re-embed job is the minimum, but menus on platforms like Swiggy and Zomato can change mid-day. If your data pipeline already emits menu-change events, hook embedding updates to those events. If it doesn't, hourly polling with diff detection works.
def update_menu_embeddings(current_menu, cached_menu, cached_embeddings):
    """Re-embed only new or changed items; drop items no longer on the menu."""
    # Build a lookup dict: O(1) per membership check instead of O(n) list scan
    cache_index = {item: i for i, item in enumerate(cached_menu)}

    # Keep unchanged items with their existing embeddings. Building this from
    # current_menu also drops items that were removed from the menu.
    updated_menu = [item for item in current_menu if item in cache_index]
    updated_embeddings = [cached_embeddings[cache_index[item]] for item in updated_menu]

    new_items = [item for item in current_menu if item not in cache_index]
    if not new_items:
        # Nothing to embed, but removed items must still be dropped.
        return updated_menu, updated_embeddings

    resp = requests.post(
        f"{BASE}/embed/batch",
        headers=headers,
        json={"items": new_items, "dimension": 384},
    )
    resp.raise_for_status()
    new_embeddings = resp.json()["embeddings"]

    # Merge: append the newly embedded items after the unchanged ones
    updated_menu.extend(new_items)
    updated_embeddings.extend(new_embeddings)
    return updated_menu, updated_embeddings

A restaurant that swaps 20 items out of 200 means 20 new embeddings and one API call. Cost: 1 credit.
When you outgrow 10K items
The 50-restaurant example above is a single city. A national aggregator has 500K+ merchants. At that scale, you don't send 10M vectors to an external API per search call. You need a vector database.
Vector storage. Use Latimal's /embed or /embed/batch to generate vectors, then store them in pgvector (if you're already on Postgres), Qdrant, or Pinecone. Your search path becomes: embed the query via /embed (single item, ~15ms), then run approximate nearest neighbor search against your vector index locally. This scales to tens of millions of items with sub-50ms retrieval.
Hybrid retrieval. Pure semantic search misses when the user types an exact dish name. Pure keyword search misses when they type "something spicy." The production pattern most teams land on: keyword retrieval (Elasticsearch, Algolia, or even Postgres full-text) produces a candidate set, then semantic reranking scores those candidates by meaning. This gives you the recall of keywords plus the understanding of embeddings, and it integrates cleanly with existing infrastructure.
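The two-stage pattern can be sketched in a few lines. Both stages here are stand-ins: the keyword stage is a simple token match instead of Elasticsearch or Postgres full-text, and the semantic scorer is injected as a function with hand-picked numbers. In a real integration the second stage could be a POST /search call with the candidate list as the corpus. All function names are hypothetical.

```python
def keyword_candidates(query, catalog, limit=50):
    """Stage 1: cheap recall. Keep items sharing at least one token with the query."""
    tokens = set(query.lower().split())
    hits = [item for item in catalog if tokens & set(item.lower().split())]
    return hits[:limit]

def hybrid_search(query, catalog, semantic_score, top_k=10):
    """Stage 2: rerank the keyword candidates with a semantic scorer.

    semantic_score(query, item) -> float is supplied by the caller.
    """
    candidates = keyword_candidates(query, catalog)
    ranked = sorted(candidates, key=lambda item: semantic_score(query, item), reverse=True)
    return ranked[:top_k]

# Stand-in scorer with invented numbers; not real model output.
toy_scores = {"Chicken Biryani": 0.9, "Chicken Soup": 0.4, "Chicken Tikka": 0.7}

def toy_semantic_score(query, item):
    return toy_scores.get(item, 0.0)

catalog = ["Chicken Soup", "Chicken Biryani", "Masala Dosa", "Chicken Tikka"]
top = hybrid_search("chicken rice", catalog, toy_semantic_score, top_k=2)
print(top)
```

The candidate limit keeps the corpus passed to the semantic stage small, which is what makes this pattern fit inside a per-call corpus ceiling.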
Business signal reranking. Semantic similarity tells you a dish matches the query. It doesn't tell you which restaurant to show first. In production, you'll rerank by restaurant rating, delivery distance, estimated time, and availability. Semantic score is one input to your ranking function. Probably not even the most important one in tier-3 cities where menus overlap 80% and the real differentiator is restaurant quality.
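One way to fold these signals together is a weighted blend, sketched below. The weights, the 10 km distance cap, and the `Candidate` fields are all assumptions made for illustration; estimated delivery time is left out to keep the example short, and real weights should come from your own click and order data.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    dish: str
    restaurant: str
    semantic_score: float  # 0 to 1, from the search response
    rating: float          # 0 to 5 stars
    distance_km: float
    available: bool

# Hypothetical weights; tune against your own data.
WEIGHTS = {"semantic": 0.4, "rating": 0.4, "distance": 0.2}
MAX_DISTANCE_KM = 10.0  # assumed delivery radius

def business_score(c):
    """Blend semantic relevance with rating and proximity into one number."""
    proximity = max(0.0, 1.0 - c.distance_km / MAX_DISTANCE_KM)
    return (WEIGHTS["semantic"] * c.semantic_score
            + WEIGHTS["rating"] * (c.rating / 5.0)
            + WEIGHTS["distance"] * proximity)

def rerank(candidates):
    """Drop unavailable dishes, then sort by blended score, best first."""
    live = [c for c in candidates if c.available]
    return sorted(live, key=business_score, reverse=True)

candidates = [
    Candidate("Pad Thai", "Thai Corner", 0.95, 3.2, 8.0, True),
    Candidate("Pad Thai", "Bangkok Kitchen", 0.90, 4.6, 2.0, True),
    Candidate("Pad Thai", "Noodle Bar", 0.97, 4.8, 1.0, False),
]
ordered = rerank(candidates)
for c in ordered:
    print(f"{c.restaurant}: {business_score(c):.3f}")
```

In this toy data the restaurant with the slightly lower semantic score wins on rating and distance, and the best semantic match never appears because it is unavailable.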
The decision you're actually making
You have two architectures to choose between, and the right one depends on where you are today.
Architecture 1: API-managed search. Send corpus + query to POST /search. Latimal handles embedding, similarity, and reranking. You store pre-computed vectors but no search index. Good up to roughly 500 items per search call, which covers single-restaurant search and small multi-restaurant deployments.
Architecture 2: Embed + own index. Use /embed/batch for vector generation. Store vectors in pgvector or a dedicated vector DB. Run ANN search yourself. Use Latimal for embedding quality, not for the search loop. This is the path for any platform doing cross-city or cross-chain search.
Most teams start with Architecture 1 and migrate when they hit the corpus-size ceiling. The vectors are the same either way, so migration is a storage change, not a data change. The search integration guide walks through both paths in detail. Try the search playground to see the results before writing any integration code.