June 10, 2026

Introducing FoodEval: a benchmark for food domain embeddings

MTEB and BEIR have zero food evaluations, so nothing flags a top model that misses that Paneer Tikka and Cottage Cheese Tikka are the same dish. FoodEval measures that gap. It's public.

By Aditya Patni

TL;DR

FoodEval is a public benchmark for food embeddings: 12 tasks, 5,868 examples, 26 menu classes drawn from real menus.
General models hold up on open-ended search and fall down on food-specific matching.
Latimal is #1 overall at 0.718 and leads every matching task in the F1 sweep.

Ask a top-ranked embedding model whether "Paneer Tikka" and "Cottage Cheese Tikka" are the same dish. They are. Paneer is cottage cheese. A line cook knows this, a hungry customer knows this, and a model sitting near the top of the MTEB leaderboard often does not. It will happily rank "Pad See Ew" as a strong match for "Pad Thai," merge "Bibimbap" with "Kimbap" because both look Korean, and lose the thread entirely when the same dish shows up in two scripts.

This is what a real menu looks like. Food delivery platforms handle millions of items across dozens of cuisines and writing systems, and the basic operations all lean on text embeddings: searching the menu, deduplicating near-identical items, classifying cuisines, recommending add-ons. If the embedding doesn't understand food, every one of those features quietly degrades.

And you can't see it coming. The big general benchmarks, MTEB and BEIR, contain zero food evaluations. A model can place top-ten on MTEB and still post an NDCG@10 of 0.14 on diet-specific search, or fail to connect "Khubz Arabi" with "Arabic Flatbread." There was no standard way to catch this before you shipped it.

So we built one. It's called FoodEval, and it's public.

What FoodEval is

FoodEval is a food-domain benchmark for text embedding models. Twelve tasks, 5,868 evaluation examples, 26 menu classes, multiple scripts and languages. The data comes from real production menus, with graded relevance judgments where it counts. It's built and published by Latimal.

The twelve tasks split into three families.

Search (scored with NDCG@10): how well the model ranks the right menu items for a query.

Food search: ranked retrieval across Indian, global, and beverage menus.
Concept search: abstract queries like "warm comfort food" or "crispy appetizer," where lexical overlap won't save you.
Diet and allergen search: "celiac friendly," "halal food," "keto" against items that rarely spell those properties out.
Noisy search: real typos and shorthand like "bibimbab" and "paner tikka."
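
For readers who have not worked with NDCG@10, here is a minimal sketch of the metric. It assumes linear gain (the raw relevance grade) with a log2 rank discount, and it builds the ideal ranking from the returned items only. The post does not say which gain function FoodEval's harness uses, and the grades in the example are invented for illustration.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    # Position 0 (rank 1) is divided by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the model's ranking divided by the DCG of the ideal ordering."""
    # Simplification: the ideal ordering is built from the returned items only.
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical grades for the items returned for the query "paner tikka",
# in ranked order: 2 = exact dish, 1 = related dish, 0 = irrelevant.
model_ranking = [2, 0, 1]
score = ndcg_at_k(model_ranking)
print(round(score, 3))
```

In this toy case the model puts an irrelevant item above a related one, so the score lands at roughly 0.95 instead of 1.0. The penalty grows the higher up the ranking the mistake sits.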

Matching (scored with best F1): can the model tell when two menu lines are the same dish?

Indian, global, beverage, and bakery matching, each its own task because each has its own traps.
Portion size: "Large Pizza" and "Small Pizza" are the same item in two sizes; "Large Coffee" and "Large Smoothie" are different items.
Noisy menu matching: the same dish with price or promo noise attached, like "$14.99 Butter Chicken" vs "Butter Chicken" or "***HOT*** Spicy Ramen" vs "Spicy Ramen."
Cross-lingual matching: the same dish across romanized, bilingual, and cross-script forms.
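
"Best F1" can be read as: score every pair with cosine similarity, try each candidate threshold, and report the highest F1 any threshold achieves. The sketch below is one plausible way to do that sweep, not FoodEval's actual implementation. The dish pairs come from this post; the similarity values are invented for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every pair with similarity >= threshold is predicted 'same dish'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1(scores, labels):
    """Sweep every observed similarity as a threshold; return (best F1, threshold)."""
    return max((f1_at_threshold(scores, labels, t), t) for t in sorted(set(scores)))

# (item A, item B, label, similarity); label 1 = same dish, 0 = different dish.
# The similarity values are made up for this example.
pairs = [
    ("Regular Pepsi", "Large Pepsi", 1, 0.90),
    ("Pad See Ew", "Pad Thai", 0, 0.71),
    ("Paneer Tikka", "Cottage Cheese Tikka", 1, 0.62),
    ("Baked Falafel", "Fried Falafel", 0, 0.55),
]
labels = [label for _, _, label, _ in pairs]
scores = [sim for _, _, _, sim in pairs]
f1, threshold = best_f1(scores, labels)
print(f1, threshold)
```

Here the hypothetical model scores a non-match (Pad See Ew / Pad Thai) above a true match (Paneer Tikka / Cottage Cheese Tikka), so no threshold separates them cleanly and the best F1 tops out at 0.8. Because the threshold is chosen per model, the metric rewards how well similarities are ordered, not how they are calibrated.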

Classification (scored with macro-F1): a probe trained on frozen embeddings sorts 3,053 items into 26 cuisines. Because the encoder is never fine-tuned, this measures how cleanly the embedding space already separates cuisines.
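
Macro-F1 averages per-class F1 with equal weight, so a rare cuisine counts as much as a common one. The sketch below shows only the metric, on invented labels; the post does not specify what kind of probe FoodEval trains, so that part is left out.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Invented example: the probe gets every Indian item right but misses the only Korean one.
y_true = ["Indian", "Indian", "Indian", "Korean", "Thai"]
y_pred = ["Indian", "Indian", "Indian", "Thai", "Thai"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred), 3))
```

Accuracy is 0.8 here, but macro-F1 is about 0.556, because the single missed Korean item zeroes out an entire class. That is the behavior you want when 26 cuisines are unlikely to be evenly represented.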

The headline finding

We ran ten models and ranked them by FoodEval Score, the equal-weighted average across all twelve tasks (1/12 each). The roster spans open-source embedders, commercial APIs including OpenAI text-embedding-3-large and Voyage 4 Large, and a lexical baseline as the floor. Every model, Latimal included, runs the identical harness: prompt-free encode at 384 dimensions with cosine similarity, no rerankers, and the same matching F1 threshold sweep.
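
As a worked check of the scoring rule, the snippet below recomputes Latimal's FoodEval Score from its twelve per-task numbers in the tables below. The dictionary keys are illustrative labels, not FoodEval's task identifiers.

```python
# Latimal's per-task scores, copied from the tables in this post.
# The keys are hypothetical labels chosen for readability.
latimal = {
    "food_search": 0.613,
    "concept_search": 0.435,
    "diet_search": 0.201,
    "noisy_search": 0.665,
    "indian_match": 0.817,
    "global_match": 0.867,
    "beverage_match": 0.746,
    "bakery_match": 0.755,
    "portion_match": 0.972,
    "noisy_match": 0.916,
    "cross_lingual_match": 0.886,
    "cuisine_classification": 0.738,
}

# FoodEval Score: every task weighted 1/12.
foodeval_score = sum(latimal.values()) / len(latimal)

# For contrast: averaging the three family averages would weight each family 1/3.
family_averages = [0.478, 0.851, 0.738]  # Search, Matching, Classification
mean_of_families = sum(family_averages) / len(family_averages)

print(round(foodeval_score, 3), round(mean_of_families, 3))
```

The per-task average comes out at 0.718, matching the table, while the mean of the three family averages would be 0.689. Seven of the twelve tasks are matching tasks, so matching carries 7/12 of the FoodEval Score; keep that in mind when reading the overall ranking.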

Overall

Rank	Model	Search	Matching	Classification	FoodEval Score
1	Latimal	0.478	0.851	0.738	0.718
2	OpenAI text-embedding-3-large	0.554	0.758	0.833	0.696
3	Voyage 4 Large	0.558	0.741	0.790	0.684
4	Cohere embed-v4	0.517	0.741	0.737	0.666
5	Nomic embed-text v1.5	0.436	0.739	0.710	0.636
6	GTE-large v1.5	0.474	0.699	0.716	0.626
7	BGE-M3	0.416	0.718	0.701	0.616
8	Lexical (TF)	0.285	0.728	0.689	0.577
9	Microsoft e5-large	0.394	0.704	0.399	0.575
10	Cohere multilingual-v3	0.390	0.682	0.506	0.570

The category breakdown tells a sharper story than a single number.

Search (NDCG@10)

Rank	Model	Food	Concept	Diet	Noisy	Avg
1	Voyage 4 Large	0.678	0.562	0.238	0.754	0.558
2	OpenAI text-embedding-3-large	0.691	0.550	0.216	0.757	0.554
3	Cohere embed-v4	0.644	0.494	0.183	0.748	0.517
4	Latimal	0.613	0.435	0.201	0.665	0.478
5	GTE-large v1.5	0.602	0.469	0.227	0.599	0.474
6	Nomic embed-text v1.5	0.575	0.366	0.158	0.646	0.436
7	BGE-M3	0.553	0.336	0.148	0.625	0.416
8	Microsoft e5-large	0.536	0.316	0.139	0.583	0.394
9	Cohere multilingual-v3	0.513	0.348	0.137	0.560	0.390
10	Lexical (TF)	0.535	0.201	0.089	0.316	0.285

Matching (Best F1)

Rank	Model	Indian	Global	Beverage	Bakery	Portion	Noisy	Cross-lingual	Avg
1	Latimal	0.817	0.867	0.746	0.755	0.972	0.916	0.886	0.851
2	OpenAI text-embedding-3-large	0.745	0.828	0.715	0.735	0.849	0.685	0.748	0.758
3	Voyage 4 Large	0.718	0.783	0.719	0.715	0.791	0.640	0.820	0.741
4	Cohere embed-v4	0.732	0.829	0.710	0.691	0.835	0.667	0.721	0.741
5	Nomic embed-text v1.5	0.731	0.732	0.715	0.684	0.855	0.750	0.707	0.739
6	Lexical (TF)	0.687	0.687	0.706	0.682	0.804	0.822	0.707	0.728
7	BGE-M3	0.711	0.716	0.706	0.684	0.821	0.674	0.717	0.718
8	Microsoft e5-large	0.681	0.716	0.706	0.688	0.757	0.648	0.731	0.704
9	GTE-large v1.5	0.705	0.695	0.710	0.682	0.725	0.672	0.707	0.699
10	Cohere multilingual-v3	0.694	0.671	0.706	0.682	0.669	0.640	0.714	0.682

Classification (Macro F1)

Rank	Model	Cuisine
1	OpenAI text-embedding-3-large	0.833
2	Voyage 4 Large	0.790
3	Latimal	0.738
4	Cohere embed-v4	0.737
5	GTE-large v1.5	0.716
6	Nomic embed-text v1.5	0.710
7	BGE-M3	0.701
8	Lexical (TF)	0.689
9	Cohere multilingual-v3	0.506
10	Microsoft e5-large	0.399

General models hold their own on open-ended search. Voyage 4 Large and text-embedding-3-large take the top two spots in the Search category, with Cohere embed-v4 close behind. That makes sense: those tasks reward broad language understanding, the thing big general models are trained for. If your only job were a search box, several of these models would serve you fine.

The food-specific work is where they fall down. Look at the Matching category. Latimal leads every one of the seven tasks, often by a wide margin.

Cross-lingual matching: 0.886 against 0.820 for the next best (Voyage 4 Large).
Noisy menu matching: 0.916 against 0.822 for the next best, which is the lexical baseline.
Portion size: 0.972 against 0.855 for the next best (Nomic embed-text v1.5).

These are the tasks that decide whether your dedup pipeline collapses two legitimately different dishes, or whether the same item written in Hindi and English ends up as two rows in your catalog. General embeddings have a blind spot here, and it's consistent across the field.

You can poke Latimal's menu intelligence API directly in the playground.

Diet and allergen search is hard for everyone: the best score is 0.238 (Voyage 4 Large), and six of the ten models score below 0.19. There's real headroom for the whole field. Cuisine classification surprised us: text-embedding-3-large leads at 0.833, well ahead of Latimal at 0.738, which suggests a strong general embedding space can separate cuisines well even without food-specific tuning.

What the failures actually look like

Cross-script matching is where general multilingual models struggle most. They tend to key on "both of these are non-English food text" instead of the actual dish. So "Bibimbap 비빔밥" gets merged with "Kimbap 김밥" (different dishes), while a genuine match like "Beef Bowl" and "Gyudon" gets missed. Same failure mode, both directions.

Romanization trips them too. "Arabic Flatbread" and "Khubz Arabi" are the same bread. "Arabic Meat Pastry" and "Fatayer Laham" are the same pastry. A model with no food grounding has no reason to connect them, because there's almost no character overlap to lean on.

And the within-category confusables are brutal. "Baked Falafel" vs "Fried Falafel" are different items. "Large Coffee" vs "Large Smoothie" are different items. But "Regular Pepsi" vs "Large Pepsi" are the same item in two sizes. Getting these right needs an understanding of which modifiers change identity and which don't, and that's exactly what general training never teaches.

Run it on your own model

FoodEval ships as a pip package. The data is bundled, so there's nothing to download separately.

pip install foodeval
foodeval run --model BAAI/bge-m3 --dim 384

Or from Python:

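from foodeval.evaluate import run_benchmark
from foodeval.adapters.sentence_transformer import SentenceTransformerAdapter

# Wrap a sentence-transformers model; truncate_dim=384 matches the leaderboard setting.
adapter = SentenceTransformerAdapter("BAAI/bge-m3", truncate_dim=384)
# Run the benchmark against the bundled data.
result = run_benchmark(adapter)
# Print the results as a markdown table.
print(result.to_markdown())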

The leaderboard standardizes on 384 dimensions for a fair comparison and a realistic production operating point, but you can evaluate at any dimension. There are adapters for sentence-transformers, OpenAI, and Bedrock, plus a simple protocol if you want to wrap something else. Every baseline in the table is reproducible from the bundled data. Latimal runs through its public API under the identical harness.
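
To make the 384-dimension operating point concrete, here is a generic sketch of truncating an embedding, rescaling it to unit length, and comparing two vectors with cosine similarity. This is not FoodEval's code: whether its adapters renormalize after truncation is an assumption, and the toy vectors are invented and cut to 4 dimensions so the numbers stay readable.

```python
import math

def truncate_and_normalize(vector, dim=384):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

# Toy 6-dimensional "embeddings", truncated to 4 dimensions for the demo.
full_a = [0.5, 0.1, -0.3, 0.8, 0.2, -0.6]
full_b = [0.4, 0.2, -0.2, 0.7, -0.9, 0.3]
a = truncate_and_normalize(full_a, dim=4)
b = truncate_and_normalize(full_b, dim=4)

print(round(cosine_similarity(full_a, full_b), 3))  # similarity at full width
print(round(cosine_similarity(a, b), 3))            # similarity after truncation
```

The two similarities differ, which is why comparing models at mismatched dimensions can mislead. If you evaluate at a dimension other than 384, compare against baselines rerun at that same dimension.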

Why it's evaluation-only

FoodEval is released under CC-BY-NC 4.0 with an evaluation-only addendum. You can use it to benchmark, compare, and publish research. You can't use the data to train models. That line matters: the moment a benchmark leaks into training sets, the scores stop meaning anything. Keeping the data eval-only is how it stays a real measurement instead of a target everyone quietly overfits. The tooling and code are open for commercial use; the data is for evaluation.

Try it

If you run search, dedup, or recommendations over a menu, FoodEval will tell you something your current metrics won't. Point it at whatever you're using today and see where the food gap shows up.

Browse the full leaderboard
Data and benchmark on HuggingFace: huggingface.co/datasets/latimal/foodeval
Code and leaderboard on GitHub: github.com/latimal/foodeval
pip install foodeval

Run your model. If your model beats the table, open a PR with your result file and we'll add it to the leaderboard.