Menu Deduplication Pipeline

Menu data from multiple sources (POS systems, aggregators, manual entry) inevitably contains duplicates. "Chicken Biryani", "Chiken Biryani", and "Murgh Biryani" are all the same dish. This guide walks through the full dedup pipeline.

See the POST /dedup API reference for the full request/response schema. Try it live in the playground →

How it works

Load your menu data (CSV, database export, POS feed)
Send all items to POST /dedup
The API clusters duplicates and picks a canonical name per cluster
Update your database with canonical references

Full example

import requests
import csv

API_KEY = "YOUR_KEY"
BASE = "https://dish-embed.latimal.com"
headers = {"X-API-Key": API_KEY, "Content-Type": "application/json"}

# Load menu from CSV
with open("menu.csv") as f:
    items = [row["item_name"] for row in csv.DictReader(f)]

# Deduplicate
resp = requests.post(f"{BASE}/dedup", headers=headers, json={"items": items})
data = resp.json()

print(f"Found {len(data['clusters'])} duplicate groups")
print(f"{data['duplicate_items']} excess items to remove")

for cluster in data["clusters"]:
    print(f"\nCanonical: {cluster['canonical']}")
    print(f"  Duplicates: {', '.join(m for m in cluster['members'] if m != cluster['canonical'])}")

Handling large menus

The /dedup endpoint accepts up to 2,000 items per request. For larger menus, chunk your data:

def dedup_chunked(items, chunk_size=2000):
    all_clusters = []
    for i in range(0, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        resp = requests.post(f"{BASE}/dedup", headers=headers, json={"items": chunk})
        all_clusters.extend(resp.json()["clusters"])
    return all_clusters

Note that cross-chunk duplicates won't be caught. If you have more than 2,000 items, consider running /match on suspected pairs across chunk boundaries.

Threshold tuning

The default threshold works well for most menus. You can adjust it with the cosine_threshold parameter:

0.80 - Aggressive dedup. Catches more duplicates but may merge similar-but-different items (e.g., "Latte" and "Mocha").
0.85 - Default. Good balance of precision and recall.
0.90 - Conservative. Only merges near-identical items. Use this if false merges are costly.

resp = requests.post(f"{BASE}/dedup", headers=headers,
    json={"items": items, "cosine_threshold": 0.80})

Tips

Send raw menu text as-is. The API handles noise stripping and spelling normalization internally.
The canonical name in each cluster is the cleanest, most complete form. Use it as your display name.
Run dedup after every menu import, not just once. New data sources introduce new duplicates.
Dietary conflicts (e.g., "Chicken Burger" vs "Veg Burger") are never merged, regardless of threshold.