How Food Embeddings Work
What are embeddings?
An embedding is a numeric vector (a list of numbers) that captures the meaning of a piece of text. Texts with similar meanings produce similar vectors, so you can measure how similar two items are by computing the cosine similarity of their vectors. Scores range from -1 to 1; as a rough guide:
- 1.0 = identical meaning
- 0.8+ = very similar (likely the same dish)
- 0.5-0.7 = related (same category or cuisine)
- Below 0.3 = unrelated
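For readers who want to see the computation itself, here is a minimal sketch of cosine similarity in plain Python. The function name and the toy 3-dimensional vectors are illustrative assumptions, not dish-embed output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration (real embeddings have 128 to 384 dimensions)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # approximately 1.0
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # 0.0
```

Because the score depends only on direction, scaling a vector does not change it. That is why the first pair scores approximately 1.0 even though the vectors differ in length.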
Why food needs specialized embeddings
General-purpose embedding models (the kind you'd use for document search or chatbot retrieval) fail on food data in specific ways:
Transliteration blindness
"Murgh" is Hindi for chicken. "Murgh Makhani" and "Butter Chicken" are the same dish. Generic models treat "Murgh" as an unknown token and produce low similarity scores. dish-embed maps transliterations correctly across languages and scripts.
Noise sensitivity
Real menu data looks like this:
**NEW** 50% OFF Chicken Biryani (Serves 2) [Non-Veg]
A generic model embeds all that noise as part of the meaning. dish-embed strips it before embedding, so this matches "Chicken Biryani" with high confidence.
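To make the idea of noise stripping concrete, the sketch below removes a few kinds of menu noise with regular expressions. It is an assumption-laden illustration: the patterns and the `strip_menu_noise` name are hypothetical, and dish-embed's actual preprocessing rules are not described here.

```python
import re

# Hypothetical noise patterns, chosen to cover the example line above.
NOISE_PATTERNS = [
    r"\*\*[^*]*\*\*",   # promo badges wrapped in asterisks, e.g. **NEW**
    r"\d+\s*%\s*off",   # discount text, e.g. 50% OFF
    r"\([^)]*\)",       # parenthesised notes, e.g. (Serves 2)
    r"\[[^\]]*\]",      # bracketed tags, e.g. [Non-Veg]
]

def strip_menu_noise(raw: str) -> str:
    """Remove promotional and annotation noise, then collapse whitespace."""
    cleaned = raw
    for pattern in NOISE_PATTERNS:
        cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    return " ".join(cleaned.split())

cleaned_name = strip_menu_noise("**NEW** 50% OFF Chicken Biryani (Serves 2) [Non-Veg]")
# cleaned_name == "Chicken Biryani"
```

Note that this naive version discards the `[Non-Veg]` tag entirely. A real pipeline would likely want to record such dietary tags before removing them from the text.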
Cross-lingual understanding
"Pollo Asado" (Spanish), "Grilled Chicken" (English), "Murgh Tandoori" (Hindi) are all grilled chicken preparations. dish-embed produces similar embeddings for them across 100+ languages.
Dietary signal preservation
Generic models don't know that "Paneer Tikka" is vegetarian and "Chicken Tikka" is not. They see high text overlap and produce high similarity. dish-embed understands that protein differences change the fundamental nature of a dish.
What dish-embed knows
dish-embed has food-specific knowledge baked in:
- Which items are the same dish under different names
- Which items are related but distinct (Butter Chicken vs Dal Makhani)
- Cross-lingual equivalences across 100+ languages
- Cuisine and category relationships across Indian, East Asian, Southeast Asian, Middle Eastern, European, Latin American, and American cuisines
Using embeddings directly
If you want to store embeddings in your own vector database for custom search or clustering, request them from the /embed endpoint:
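import requests

# BASE (the API base URL) and headers (your auth headers) are assumed
# to be defined already, as in your client setup.
resp = requests.post(f"{BASE}/embed", headers=headers,
                     json={"items": ["Chicken Biryani", "Murgh Biryani", "Veg Pulao"], "dimension": 384})
resp.raise_for_status()  # fail early on HTTP errors
embeddings = resp.json()["embeddings"]
# Each embedding is a list of 384 floats
# Store in Pinecone, Weaviate, pgvector, FAISS, etc.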
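Once you have the vectors, a custom search comes down to ranking stored items by cosine similarity to a query vector. The sketch below shows one way to do that in plain Python. The vectors are made-up 3-dimensional stand-ins rather than real dish-embed output, and `rank_by_similarity` is a hypothetical helper name. For large collections, a vector database would do this ranking for you.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_by_similarity(query_vec, names, vectors):
    """Return (name, score) pairs sorted from most to least similar to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in zip(names, vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up stand-ins; in practice use the vectors from resp.json()["embeddings"]
names = ["Chicken Biryani", "Murgh Biryani", "Veg Pulao"]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2]]

# Rank the other two items against "Chicken Biryani"
ranked = rank_by_similarity(vectors[0], names[1:], vectors[1:])
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```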
You can choose your embedding dimension (128, 256, or 384) depending on your quality and storage requirements. See Matryoshka Dimensions for trade-offs.
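As a back-of-the-envelope view of the storage side, the sketch below estimates raw vector storage for each dimension. It assumes vectors are stored as 32-bit floats and ignores index overhead and metadata, both of which depend on your vector database. `index_size_bytes` is a hypothetical helper name.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def index_size_bytes(num_items: int, dimension: int) -> int:
    """Raw vector storage only, ignoring index overhead and metadata."""
    return num_items * dimension * BYTES_PER_FLOAT32

for dim in (128, 256, 384):
    size_mb = index_size_bytes(1_000_000, dim) / 1_000_000
    print(f"{dim} dims: {size_mb:.0f} MB per million items")
# 128 dims: 512 MB per million items
# 256 dims: 1024 MB per million items
# 384 dims: 1536 MB per million items
```

Under these assumptions, 128 dimensions needs a third of the storage of 384.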