Commit dae0eb6

adding itgi as a draft working paper
1 parent dbeecfa commit dae0eb6

8 files changed

Lines changed: 671 additions & 0 deletions

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
{
"hash": "2e3cf2ec08bd763de82181afde795692",
"result": {
"engine": "jupyter",
"markdown": "---\ntitle: Indian Food Ingredients & Label Variants\nauthor:\n - name: Lalitha A R\n affiliation: iSRL\n orcid: 0009-0001-7466-3531\n email: lalithaar.research@gmail.com\n corresponding: true\nx-contributor:\n - name: Subrat Sethi\n - name: Purnendu Shukla\ndoc-id: iSRL-26-02-DS-Variants\ndescription: \"A mapping of 2,500+ regional ingredient variations observed on Indian food labels, linking label variants to a canonical vocabulary. Note: this dataset has been superseded — the v1 approach was abandoned after finding it conflated noise reduction with meaningful cultural and linguistic variation.\"\ndate: February 2026\ndoi: 10.5281/zenodo.1871452\nlicense: \"CC BY\"\ncitation:\n type: dataset\n publisher: iSRL\n number: iSRL-26-02-DS-Variants\n\nabstract: |\n **This dataset has been superseded.** The v1 mapping approach — standardising ingredient\n label variants to a canonical vocabulary — was found to conflate noise reduction with\n meaningful cultural and linguistic variation. 
This document explains why the approach\n was abandoned and what replaced it.\n\n A mapping of 2500+ regional ingredient variations as observed in Indian labels.\n This dataset provides a structured mapping of the diverse ways ingredients are named\n on Indian food packaging, linking variants (the actual text found on labels) to a\n canon (a standardised, clean category).\n\n Example mapping: Canon: Acetic Acid (INS 260) — Variants: acidity regulator 260,\n vinegar, ins 260, acetic acid (260).\nresources:\n - data/ingredients.csv\nfilters:\n - ../../contributors.lua\nother-links:\n - text: \"Dataset (CSV)\"\n href: \"data/ingredients.csv\"\n icon: \"file-earmark-spreadsheet\"\n---\n\n```{=html}\n<script type=\"application/ld+json\">\n{\n \"@context\": \"https://schema.org\",\n \"@type\": \"Dataset\",\n \"name\": \"Indian Food Ingredients & Label Variants\",\n \"@id\": \"https://doi.org/10.5281/zenodo.1871452\",\n \"identifier\": [\n \"https://doi.org/10.5281/zenodo.1871452\",\n \"iSRL-26-02-DS-Variants\"\n ],\n \"description\": \"A mapping of 2,500+ regional ingredient variations observed on Indian food labels, linking label variants to a canonical vocabulary. 
Note: this dataset has been superseded — the v1 approach was abandoned after finding it conflated noise reduction with meaningful cultural and linguistic variation.\",\n \"creativeWorkStatus\": \"Superseded\",\n \"license\": \"https://creativecommons.org/licenses/by/4.0/\",\n \"url\": \"https://isrl-research.github.io/pub/2026-02-ds-variants/\",\n \"author\": {\n \"@type\": \"Person\",\n \"name\": \"Lalitha A R\",\n \"identifier\": \"https://orcid.org/0009-0001-7466-3531\",\n \"sameAs\": \"https://orcid.org/0009-0001-7466-3531\",\n \"email\": \"lalithaar.research@gmail.com\"\n },\n \"publisher\": {\n \"@type\": \"ResearchOrganization\",\n \"name\": \"iSRL\",\n \"url\": \"https://isrl-research.github.io\"\n }\n}\n</script>\n```\n\n::: {.callout-important}\n## This version has been superseded\n\nThis dataset is no longer maintained. The v1 approach was found to be structurally\ninadequate for the problem it was designed to solve. The full reasoning is documented\nbelow. The dataset remains available for reference at the link above.\n\nFor current work, see the\n[Identity, Transformation, and Function framework](https://doi.org/10.5281/zenodo.18714526)\nand its [justification companion](https://doi.org/10.5281/zenodo.18713318).\n:::\n\nWe released [Indian Food Ingredients & Label Variants](https://doi.org/10.34740/KAGGLE/DSV/14783287) (v1) with the goal of making ingredient label text parseable by machines. The dataset standardised ingredient names — mapping `kashmiri chilli` to `chilli`, for instance — on the assumption that a normalised vocabulary would make automated parsing tractable.\n\nTwo problems emerged as data collection continued.\n\nFirst, the approach trades away information the project is now explicitly committed to preserving. 
The data makes this concrete.\n\n::: {#load-data .cell execution_count=1}\n\n::: {.cell-output .cell-output-stdout}\n```\n canon variant\n0 A2 Protein a2 protein\n1 Acesulfame Potassium (INS 950) acesulfame k\n2 Acesulfame Potassium (INS 950) acesulfame potassium\n3 Acesulfame Potassium (INS 950) sweetener ins 950\n4 Acetic Acid (INS 260) acetic acid\n```\n:::\n:::\n\n\n::: {#chilli-canon .cell execution_count=2}\n``` {.python .cell-code code-fold=\"false\"}\nimport pandas as pd\n\ndf = pd.read_csv(\"data/ingredients.csv\", header=None, names=[\"canon\", \"variant\"])\n\n# All variants that map to Chilli in v1\nchilli = df[df[\"canon\"] == \"Chilli\"].copy()\n\n# The ones that carry regional and variety-level identity\nregional = chilli[chilli[\"variant\"].str.contains(\n \"kashmiri|mathania|jalapeño|lal mirch\", case=False\n)].reset_index(drop=True)\n\nprint(regional.to_string(index=False))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n canon variant\nChilli kashmiri chilli\nChilli kashmiri lal mirch\nChilli mild jalapeño\nChilli salt with spices and condiments chillies and capsicum lal mirchi\nChilli spices and condiments kashmiri red chilli powder\nChilli spices and condiments mathania red chilli powder\nChilli stalkless kashmiri chillies\n```\n:::\n:::\n\n\nIn v1, every row above maps to `Chilli`. `Kashmiri lal mirch`, `mathania red chilli powder`, `stalkless kashmiri chillies` — all collapsed into the same canon as `chili powder` and `red chilly flakes`.\n\nThe brands that wrote these labels did not have to. `Kashmiri chilli` could have been declared as `chilli` — it would have been legally compliant. The choice to name it specifically was a choice to preserve something: a regional identity, a flavour profile, a cultural referent that Indian consumers recognise and reach for. The v1 mapping erases that choice.\n\nThis is not only a question of cultural fidelity. Ingredient identity has legal and fiscal consequences. 
Fresh alphonso mangoes attract 0% GST as an agricultural produce; mango pulp processed from a specific GI-tagged variety enters a different regulatory category. `Kashmiri chilli` carries a Geographical Indication; a generic `chilli` does not. When a mapping table collapses these into one canon, it does not simplify the data — it destroys the signal that downstream regulatory, taxation, and traceability systems depend on. Respecting the taste of India is not a sentiment; it is a data integrity requirement.\n\nSecond, the ingredient name space in Indian packaged food is too diverse for automated mapping to be reliable. The problem splits into two structurally different cases:\n\n- **Semantic variants** — spelling differences, typos, punctuation variation — can be resolved with a comprehensive mapping table, because the variation is noise around a stable referent. `Chenna`, `bengal gram flour`, and `chickpea flour` are different names for the same thing. `Palmitate` and `palm oil` are not — they are similar-sounding but distinct ingredients.\n- **Cultural and linguistic variants** — regional names, transliterations, variety-level distinctions (like alphonso mango) — cannot be mapped reliably because the variation itself carries meaning. A model trained on such a mapping would not learn the differences; it would erase them.\n\nMaintaining a single mapping table that handles both cases conflates the problem. In practice, it means tracking every normalisation decision made during data cleaning — effectively a log of every typo fixed across thousands of rows — with no mechanism to distinguish meaningful variation from noise.\n\nThe ingredient substrate under development makes this mapping unnecessary. A deterministic identity layer — one that assigns canonical identifiers to ingredients independent of how they are written on any given label — eliminates the need for probabilistic name matching at parse time. 
Labels are parsed against the substrate, not against a maintained vocabulary of variants.\n\nThe v1 dataset will remain available for reference. The label variants mapping will not be maintained going forward.\n\n---\n\nThis brings us to the question of how we extract the variants in a way that preserves\nthe signal.\n\nHow do we formalise that milk solids feels like it should be under milk while butter\nfeels different? How do we measure the distance between a variant and its source\ningredient?\n\nThese questions led to a food classification framework inspired by Ranganathan's 1933\nColon Classification[^cc][^cce] and grounded in Indian judicial and regulatory\nprecedents — FSSAI, ITC-HS, court rulings.\n\n[^cc]: Colon Classification (Faceted Classification) by S R Ranganathan, Father of Indian Library Science.\n\n[^cce]: Instead of a flat list, faceted classification lets us express a single object as a set of values across independent dimensions — the way filtering by price, type, and brand on Amazon works, rather than browsing a single ranked list.\n\n- [Identity, Transformation, and Function: A Tri-Axial Model for the Classification of Food Ingredient Identity](https://doi.org/10.5281/zenodo.18714526)\n- [Justification companion](https://doi.org/10.5281/zenodo.18713318)\n\n",
"supporting": [
"index_files"
],
"filters": [],
"includes": {}
}
}
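
A note on the distinction the frozen markdown above draws between surface noise and meaning-bearing variation: the following is a minimal Python sketch of one way to keep the two apart, not the authors' implementation. The names `REGIONAL_QUALIFIERS`, `normalise_noise`, `parse_variant`, and `ParsedVariant` are hypothetical and do not appear in the dataset or in this commit, and the qualifier list is illustrative only. The sketch cleans case, punctuation, and whitespace, and records regional or variety qualifiers next to the canon instead of discarding them.

```python
import re
from dataclasses import dataclass

# Hypothetical qualifier list, for illustration only; not taken from the dataset.
REGIONAL_QUALIFIERS = ("kashmiri", "mathania", "alphonso")


@dataclass(frozen=True)
class ParsedVariant:
    canon: str         # canonical ingredient name, supplied by the caller
    qualifiers: tuple  # regional or variety terms preserved from the label
    raw: str           # original label text, kept untouched


def normalise_noise(text: str) -> str:
    """Remove surface noise only: case, punctuation, repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def parse_variant(raw: str, canon: str) -> ParsedVariant:
    """Attach a canon to a label variant without erasing its qualifiers."""
    tokens = normalise_noise(raw).split()
    found = tuple(q for q in REGIONAL_QUALIFIERS if q in tokens)
    return ParsedVariant(canon=canon, qualifiers=found, raw=raw)


if __name__ == "__main__":
    print(parse_variant("Stalkless Kashmiri Chillies", "Chilli"))
    print(parse_variant("chili powder", "Chilli"))
```

Under these assumptions, `stalkless kashmiri chillies` still resolves to the `Chilli` canon but carries `kashmiri` as a separate field, while `chili powder` carries no qualifier. A real system would need a curated source for the qualifier list; the hard-coded tuple here is only a placeholder.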

_freeze/site_libs/clipboard/clipboard.min.js

Lines changed: 7 additions & 0 deletions

_freeze/site_libs/quarto-listing/list.min.js

Lines changed: 2 additions & 0 deletions
