Introducing FoodEval: a benchmark for food domain embeddings
MTEB and BEIR contain zero food evaluations, so a model can top the leaderboard and still miss that Paneer Tikka and Cottage Cheese Tikka are the same dish. FoodEval measures the food gap. It is public.
By Aditya Patni
Ask a top-ranked embedding model whether "Paneer Tikka" and "Cottage Cheese Tikka" are the same dish. They are. Paneer is cottage cheese. A line cook knows this, a hungry customer knows this, and a model sitting near the top of the MTEB leaderboard often does not. It will happily rank "Pad See Ew" as a strong match for "Pad Thai," merge "Bibimbap" with "Kimbap" because both look Korean, and lose the thread entirely when the same dish shows up in two scripts.
These aren't exotic edge cases. This is what a real menu looks like. Food delivery platforms handle millions of items across dozens of cuisines and writing systems, and the basic operations all lean on text embeddings: searching the menu, deduplicating near-identical items, classifying cuisines, recommending add-ons. If the embedding doesn't understand food, every one of those features quietly degrades.
The frustrating part is you can't see it coming. The big general benchmarks, MTEB and BEIR, contain zero food evaluations. A model can place top-ten on MTEB and still post 0.14 NDCG on diet-specific search, or fail to connect "Khubz Arabi" with "Arabic Flatbread." There was no standard way to catch this before you shipped it.
So we built one. It's called FoodEval, and it's public.
What FoodEval is
FoodEval is the first benchmark for evaluating text embedding models on food and menu tasks. Twelve tasks, 5,868 evaluation examples, 26 cuisine classes, multiple scripts and languages. The data comes from real production menus, with graded relevance judgments where it counts. It's built and published by Latimal.
The twelve tasks split into three families.
Search (scored with NDCG@10): how well the model ranks the right menu items for a query.
- Food search: ranked retrieval across Indian, global, and beverage menus.
- Concept search: abstract queries like "warm comfort food" or "crispy appetizer," where lexical overlap won't save you.
- Diet and allergen search: "celiac friendly," "halal food," "keto" against items that rarely spell those properties out.
- Noisy search: real typos and shorthand like "bibimbab" and "paner tikka."
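For readers who haven't met the metric, here is a minimal sketch of NDCG@10 in plain Python. It assumes linear gains (the graded relevance value itself) and a log2 position discount; FoodEval's own scorer may use a different gain function, and the function names are illustrative rather than part of the package.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k=10):
    """DCG of the model's ordering divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments (2 = exact dish, 1 = related, 0 = irrelevant),
# listed in the order the model ranked the candidates.
ranked = [2, 0, 1, 0]
print(round(ndcg_at_k(ranked), 3))
```

The related item sits one position too low, so the ranking scores about 0.95 rather than a perfect 1.0.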
Matching (scored with best F1): can the model tell when two menu lines are the same dish?
- Indian, global, beverage, and bakery matching, each its own task because each has its own traps.
- Portion size: "Large Pizza" vs "Small Pizza" is the same item; "Large Coffee" vs "Large Smoothie" is not.
- Noisy menu matching: `$14.99 Butter Chicken` vs `Butter Chicken`, `***HOT*** Spicy Ramen` vs `Spicy Ramen`.
- Cross-lingual matching: the same dish across romanized, bilingual, and cross-script forms.
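"Best F1" is usually computed by sweeping a similarity threshold and keeping the highest F1 found. The sketch below assumes that reading: each pair has a similarity score and a same-dish label, and a pair is predicted "same" when its score clears the threshold. The function name and the toy scores are hypothetical, not taken from the FoodEval code or data.

```python
def best_f1(scores, labels):
    """Return (best F1, threshold) over every distinct score used as a cutoff.

    A pair is predicted 'same dish' when its score >= threshold.
    """
    best = (0.0, None)
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[0]:
            best = (f1, threshold)
    return best

# Hypothetical cosine similarities for five menu-line pairs; 1 = same dish.
scores = [0.95, 0.90, 0.80, 0.60, 0.40]
labels = [1, 0, 1, 1, 0]
f1, threshold = best_f1(scores, labels)
print(round(f1, 3), threshold)
```

Because the threshold is chosen per model, the metric rewards how well the scores separate matches from non-matches, not how well a model happens to be calibrated around a fixed cutoff.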
Classification (macro-F1): a probe trained on frozen embeddings sorts 3,053 items into 26 cuisines. This measures how cleanly the embedding space separates cuisines on its own.
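Macro-F1 averages per-class F1 without weighting by class size, so a rare cuisine counts as much as a common one. Here is a small plain-Python sketch of the metric itself (not of the probe, and not FoodEval's implementation); the labels are made up for illustration.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 across every class that appears."""
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    f1s = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical predictions: the single Thai item is mislabeled as Indian.
y_true = ["indian", "indian", "indian", "thai", "korean"]
y_pred = ["indian", "indian", "indian", "indian", "korean"]
print(round(macro_f1(y_true, y_pred), 3))
```

Accuracy on this toy set is 0.8, but macro-F1 comes out near 0.62 because the Thai class scores zero. That is the behavior you want when 26 cuisines are unevenly represented.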
The headline finding
We ran eight models at 384 dimensions and ranked them by FoodEval Score, the equal-weighted average of Search, Matching, and Classification category means (one-third each).
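As a worked example of that formula, the snippet below recomputes the score from the per-task numbers reported for Latimal in this post. The dictionary keys and the function name are just labels chosen for this sketch.

```python
from statistics import mean

# Per-task scores for the Latimal row, copied from the tables in this post.
search = {"food": 0.610, "concept": 0.423, "diet": 0.206, "noisy": 0.658}
matching = {
    "indian": 0.814, "global": 0.859, "beverage": 0.748, "bakery": 0.757,
    "portion": 0.972, "noisy_match": 0.913, "cross_lingual": 0.881,
}
classification = {"cuisine": 0.737}

def foodeval_score(*categories):
    """Average the tasks within each category, then average the categories."""
    return mean(mean(tasks.values()) for tasks in categories)

score = foodeval_score(search, matching, classification)
print(round(score, 3))  # 0.687
```

Averaging by category rather than by task means the single classification task carries as much weight as all seven matching tasks combined, so a model can't climb the leaderboard on matching alone.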
Overall
| Rank | Model | Search | Matching | Classification | FoodEval Score |
|---|---|---|---|---|---|
| 1 | Latimal | 0.474 | 0.849 | 0.737 | 0.687 |
| 2 | Cohere embed-v4 | 0.517 | 0.741 | 0.737 | 0.665 |
| 3 | Alibaba GTE-large v1.5 | 0.474 | 0.699 | 0.716 | 0.630 |
| 4 | Nomic embed-text v1.5 | 0.436 | 0.739 | 0.710 | 0.629 |
| 5 | BAAI/bge-m3 | 0.416 | 0.718 | 0.701 | 0.612 |
| 6 | Lexical (TF) | 0.285 | 0.728 | 0.689 | 0.567 |
| 7 | Cohere multilingual-v3 | 0.390 | 0.682 | 0.506 | 0.526 |
| 8 | multilingual-e5-large | 0.394 | 0.704 | 0.399 | 0.499 |
The category breakdown tells a sharper story than a single number.
Search (NDCG@10)
| Rank | Model | Food | Concept | Diet | Noisy | Avg |
|---|---|---|---|---|---|---|
| 1 | Cohere embed-v4 | 0.644 | 0.494 | 0.183 | 0.748 | 0.517 |
| 2 | Alibaba GTE-large v1.5 | 0.602 | 0.469 | 0.227 | 0.599 | 0.474 |
| 3 | Latimal | 0.610 | 0.423 | 0.206 | 0.658 | 0.474 |
| 4 | Nomic embed-text v1.5 | 0.575 | 0.366 | 0.158 | 0.646 | 0.436 |
| 5 | BAAI/bge-m3 | 0.553 | 0.336 | 0.148 | 0.625 | 0.416 |
| 6 | multilingual-e5-large | 0.536 | 0.316 | 0.139 | 0.583 | 0.394 |
| 7 | Cohere multilingual-v3 | 0.513 | 0.348 | 0.137 | 0.560 | 0.390 |
| 8 | Lexical (TF) | 0.535 | 0.201 | 0.089 | 0.316 | 0.285 |
Matching (Best F1)
| Rank | Model | Indian | Global | Bev | Bakery | Portion | Noisy Match | X-Lingual | Avg |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Latimal | 0.814 | 0.859 | 0.748 | 0.757 | 0.972 | 0.913 | 0.881 | 0.849 |
| 2 | Cohere embed-v4 | 0.732 | 0.829 | 0.710 | 0.691 | 0.835 | 0.667 | 0.721 | 0.741 |
| 3 | Nomic embed-text v1.5 | 0.731 | 0.732 | 0.715 | 0.684 | 0.855 | 0.750 | 0.707 | 0.739 |
| 4 | Lexical (TF) | 0.687 | 0.687 | 0.706 | 0.682 | 0.804 | 0.822 | 0.707 | 0.728 |
| 5 | BAAI/bge-m3 | 0.711 | 0.716 | 0.706 | 0.684 | 0.821 | 0.674 | 0.717 | 0.718 |
| 6 | multilingual-e5-large | 0.681 | 0.716 | 0.706 | 0.688 | 0.757 | 0.648 | 0.731 | 0.704 |
| 7 | Alibaba GTE-large v1.5 | 0.705 | 0.695 | 0.710 | 0.682 | 0.725 | 0.672 | 0.707 | 0.699 |
| 8 | Cohere multilingual-v3 | 0.694 | 0.671 | 0.706 | 0.682 | 0.669 | 0.640 | 0.714 | 0.682 |
Classification (Macro F1)
| Rank | Model | Cuisine |
|---|---|---|
| 1 | Latimal | 0.737 |
| 2 | Cohere embed-v4 | 0.737 |
| 3 | Alibaba GTE-large v1.5 | 0.716 |
| 4 | Nomic embed-text v1.5 | 0.710 |
| 5 | BAAI/bge-m3 | 0.701 |
| 6 | Lexical (TF) | 0.689 |
| 7 | Cohere multilingual-v3 | 0.506 |
| 8 | multilingual-e5-large | 0.399 |
General models hold their own on open-ended search. Cohere embed-v4 tops the Search category, leading food search, concept search, and noisy search. That makes sense: those tasks reward broad language understanding, the thing big general models are trained for. If your only job were a search box, several of these models would serve you fine.
The food-specific work is where they fall down. Look at the Matching category. Latimal leads every one of the seven tasks, often by a wide margin. Cross-lingual matching: 0.881 against 0.731 for the next best. Noisy menu matching: 0.913 against 0.822 for the lexical baseline and 0.750 for the next-best embedding model. Portion size: 0.972 against 0.855. These are the tasks that decide whether your dedup pipeline collapses two legitimately different dishes, or whether the same item written in Hindi and English ends up as two rows in your catalog. General embeddings have a blind spot here, and it's consistent across the field.
Two more things worth flagging. Diet and allergen search is hard for everyone; the best score is 0.227, and five of the eight models score below 0.16. There's real headroom for the whole field. And cuisine classification surprised us: Cohere embed-v4 ties Latimal at 0.737, which says a strong general embedding space can separate cuisines well even without food-specific tuning. We're not claiming a clean sweep. We're claiming a specific, measurable gap on the food-specific tasks, and the numbers back it.
What the failures actually look like
A few concrete cases from the data.
Cross-script matching is where general multilingual models struggle most. They tend to key on "both of these are non-English food text" instead of the actual dish. So "Bibimbap 비빔밥" gets merged with "Kimbap 김밥" (different dishes), while a genuine match like "Beef Bowl" and "Gyudon" gets missed. Same failure mode, both directions.
Romanization trips them too. "Arabic Flatbread" and "Khubz Arabi" are the same bread. "Arabic Meat Pastry" and "Fatayer Laham" are the same pastry. A model with no food grounding has no reason to connect them, because there's almost no character overlap to lean on.
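To put a number on "almost no character overlap," here is a quick character-trigram Jaccard check. This is only an illustration of surface overlap; it is not the Lexical (TF) baseline from the tables, and the helper names are made up for this sketch.

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams, spaces included."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=3):
    """Shared n-grams divided by all n-grams across both strings."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return len(x & y) / len(x | y) if x | y else 0.0

pairs = [
    ("Arabic Flatbread", "Khubz Arabi"),      # same bread
    ("Arabic Meat Pastry", "Fatayer Laham"),  # same pastry
    ("Paneer Tikka", "Paner Tikka"),          # typo of the same dish
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {jaccard(left, right):.2f}")
```

The typo pair shares most of its trigrams, which is why string matching copes with that kind of noise. The pastry pair shares none, and the flatbread pair overlaps only through "Arabi", so any model that connects them has to know what the words mean.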
And the within-category confusables are brutal. "Baked Falafel" vs "Fried Falafel" are different items. "Large Coffee" vs "Large Smoothie" are different items. But "Regular Pepsi" vs "Large Pepsi" are the same item in two sizes. Getting these right needs an understanding of which modifiers change identity and which don't, and that's exactly what general training never teaches.
Run it on your own model
FoodEval ships as a pip package. The data is bundled, so there's nothing to download separately.
```bash
pip install foodeval
foodeval run --model BAAI/bge-m3 --dim 384
```

Or from Python:

```python
from foodeval.evaluate import run_benchmark
from foodeval.adapters.sentence_transformer import SentenceTransformerAdapter

# Wrap a sentence-transformers model and truncate its embeddings to 384 dimensions.
adapter = SentenceTransformerAdapter("BAAI/bge-m3", truncate_dim=384)

# Run the benchmark on the bundled data and print the results as a markdown table.
result = run_benchmark(adapter)
print(result.to_markdown())
```

The leaderboard standardizes on 384 dimensions for a fair comparison and a realistic production operating point, but you can evaluate at any dimension. There are adapters for sentence-transformers, OpenAI, and Bedrock, plus a simple protocol if you want to wrap something else. Every model in the table above is reproducible from the bundled data.
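If you're wondering what evaluating "at 384 dimensions" does to a larger embedding, the sketch below shows one common approach: keep the leading components and rescale to unit length. Whether every FoodEval adapter truncates exactly this way is an assumption here, and the helper names and toy vectors are illustrative.

```python
import math

def truncate(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 4-dimensional "embeddings" truncated to 2 dimensions.
a = [3.0, 4.0, 1.0, 2.0]
b = [4.0, 3.0, -2.0, 0.5]
print(round(cosine(a, b), 3))
print(round(cosine(truncate(a, 2), truncate(b, 2)), 3))
```

The two similarities differ, which is the reason to pin one dimension across the leaderboard: a pair that looks like a match at full width may not at a truncated width, and the reverse.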
Why it's evaluation-only
FoodEval is released under CC-BY-NC 4.0 with an evaluation-only addendum. You can use it to benchmark, compare, and publish research. You can't use the data to train models. That line matters: the moment a benchmark leaks into training sets, the scores stop meaning anything. Keeping the data eval-only is how it stays a real measurement instead of a target everyone quietly overfits. The tooling and code are open for commercial use; the data is for evaluation.
Try it
If you run search, dedup, or recommendations over a menu, FoodEval will tell you something your current metrics won't. Point it at whatever you're using today and see where the food gap shows up.
- Data and benchmark on HuggingFace: huggingface.co/datasets/latimal/foodeval
- Code and leaderboard on GitHub: github.com/latimal/foodeval
```bash
pip install foodeval
```
Run your model. If it beats the table, open a PR with your result file and we'll add you to the leaderboard. We'd genuinely like to be dethroned on the food-specific tasks. That's the whole point of putting a number on it.