---
title: "Address Element Extraction - Shopee Code League 2021"
date: 2025-07-31T09:53:11+08:00
description: "A summary of my solution at the Shopee Code League 2021 Data Science Challenge"
tags: ["Data Science", "Competition"]
showToc: true
draft: false
---

## 🎯 Problem Statement

Unstructured, incomplete, and often misspelled Indonesian addresses make accurate geocoding for last-mile delivery a major challenge. In the Shopee Code League 2021 Data Science round, we were given:

- **300,000** training samples & **50,000** test addresses
- The task: **extract** the Point of Interest (POI) and Street from raw address text
- The goal: **enable** downstream geocoding to optimize delivery routes and improve customer experience

| Raw address                                                                     | POI             | Street               |
| ------------------------------------------------------------------------------- | --------------- | -------------------- |
| `cipinang besar selatan lintas ibadah, cipi jaya 1a no 3 rw 7 13410 jatinegara` | `lintas ibadah` | `cipinang jaya 1a`   |
| `puri kemb timur`                                                               | _None_          | `puri kembang timur` |

---

## 🔍 NER Training Pipeline

We formulated POI/Street extraction as a token-level Named Entity Recognition (NER) problem:

1. **Tokenisation & Alignment**

   - Clean and split addresses into word tokens using regular expressions
   - Align ground-truth POI/Street spans to tokens via a simple linear substring search with **prefix matching** (to tolerate truncation or misspelling)
   - Alignment failures accounted for only ~1,000 rows

   | Field           | Value                                                 |
   | --------------- | ----------------------------------------------------- |
   | **Raw address** | `law stat, hayam wuruk, sumerta kelod denpasar timur` |
   | **POI**         | `lawson station`                                      |
   | **Street**      | `hayam wuruk`                                         |

2. **IOBES + `{SHORT}` Tagging Scheme**

   - **B/I/E/S** tags mark Beginning / Inside / End / Single-token entities
   - **O** marks Outside tokens
   - **SHORT** marks clipped or misspelled tokens that need correction

   | Field                | Value                                                                          |
   | -------------------- | ------------------------------------------------------------------------------ |
   | **Raw address**      | `law stat, hayam wuruk, sumerta kelod denpasar timur`                          |
   | **POI**              | `lawson station`                                                               |
   | **Street**           | `hayam wuruk`                                                                  |
   | **Individual words** | `['law', 'stat,', 'hayam', 'wuruk,', 'sumerta', 'kelod', 'denpasar', 'timur']` |
   | **Individual tags**  | `['B-POI-SHORT', 'E-POI-SHORT', 'B-STR', 'E-STR', 'O', 'O', 'O', 'O']`         |

3. **Model Fine-tuning**

   - Pretrained transformers: **IndoBERT** (Indonesian) and **XLM** (multilingual)
   - Per-token multi-class classification (one tag predicted per token)
   - Optimiser: Adam with cross-entropy loss
   - Trained for **5 epochs** (sufficient for convergence)
   - Stabilisation & speed-ups via:
     - Cyclic learning-rate scheduler with warm-up
     - Mixed-precision training

4. **Post-processing: SHORT Reconstruction**

   - Build a one-to-one "fixer" dictionary from training data: observed SHORT tokens → full tokens (chosen by frequency)
   - At inference, replace each SHORT token with its dictionary lookup
   - Simple but surprisingly effective, despite occasional unseen or ambiguous SHORT tokens

   
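The alignment, tagging, and fixer-dictionary steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the competition code; the helper names (`tokenize`, `align`, `build_fixer`) and the tiny training-pair sample are hypothetical:

```python
import re
from collections import Counter, defaultdict

def tokenize(address: str) -> list[str]:
    # Lowercase and split on whitespace; punctuation stays attached,
    # matching the "stat," token in the example table above.
    return address.lower().split()

def norm(token: str) -> str:
    # Strip punctuation for matching purposes only.
    return re.sub(r"[^\w]", "", token)

def align(tokens: list[str], span: str, label: str, tags: list[str]) -> bool:
    """Linear search for `span` over `tokens`; a truncated token counts as
    a match if it is a prefix of the gold word (and is then tagged SHORT)."""
    words = span.split()
    for start in range(len(tokens) - len(words) + 1):
        window = [norm(t) for t in tokens[start:start + len(words)]]
        if all(w and g.startswith(w) for w, g in zip(window, words)):
            for i, (w, g) in enumerate(zip(window, words)):
                pos = ("S" if len(words) == 1 else
                       "B" if i == 0 else
                       "E" if i == len(words) - 1 else "I")
                tags[start + i] = f"{pos}-{label}" + ("" if w == g else "-SHORT")
            return True
    return False  # alignment failure (~1,000 rows in the real data)

tokens = tokenize("law stat, hayam wuruk, sumerta kelod denpasar timur")
tags = ["O"] * len(tokens)
align(tokens, "lawson station", "POI", tags)
align(tokens, "hayam wuruk", "STR", tags)
print(tags)
# ['B-POI-SHORT', 'E-POI-SHORT', 'B-STR', 'E-STR', 'O', 'O', 'O', 'O']

def build_fixer(pairs):
    """Step 4 sketch: map each observed SHORT token to its most
    frequent full form among the training alignments."""
    counts = defaultdict(Counter)
    for short_tok, full_tok in pairs:
        counts[short_tok][full_tok] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

fixer = build_fixer([("law", "lawson"), ("stat", "station"), ("kemb", "kembang")])
print(fixer["kemb"])  # kembang
```

Prefix matching is what lets `law` pair up with `lawson` during alignment, and the same (short, full) pairs collected there are exactly the training data for the fixer dictionary.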
---

## 🔄 Data Augmentation

To diversify the training set and improve generalisation:

- **Intra-sentence swaps:** randomly swap POI ↔ Street within the same address
- **Cross-sentence swaps:** exchange POI/Street phrases between different addresses
- **Result:** nearly a **2×** increase in training examples

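A cross-sentence swap can be sketched as below, assuming each training row stores the raw address plus its POI/Street strings (the row layout and the `cross_swap` helper are illustrative, not the competition code). Note that this simple version only applies when the labeled phrase appears verbatim in the address, i.e. when it contains no SHORT tokens:

```python
def cross_swap(row_a: dict, row_b: dict, field: str) -> tuple[dict, dict]:
    """Exchange the `field` ('poi' or 'street') phrase between two
    addresses, producing two new synthetic training rows."""
    a, b = dict(row_a), dict(row_b)
    if not (row_a[field] and row_b[field]):
        return a, b  # one side has no entity: nothing to swap
    a["address"] = row_a["address"].replace(row_a[field], row_b[field], 1)
    b["address"] = row_b["address"].replace(row_b[field], row_a[field], 1)
    a[field], b[field] = row_b[field], row_a[field]
    return a, b

# Hypothetical rows, in the spirit of the competition data:
row1 = {"address": "toko sinar jaya, jalan merdeka 5",
        "poi": "toko sinar jaya", "street": "jalan merdeka"}
row2 = {"address": "warung bu dewi, hayam wuruk 10",
        "poi": "warung bu dewi", "street": "hayam wuruk"}
new1, new2 = cross_swap(row1, row2, "poi")
print(new1["address"])  # warung bu dewi, jalan merdeka 5
print(new2["address"])  # toko sinar jaya, hayam wuruk 10
```

Because the gold POI/Street strings are swapped along with the address text, the tags realign automatically when the augmented rows pass back through the alignment step.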
---

## 🛠️ Model Ensembling & Results

- Trained multiple model checkpoints
- **Averaged logits** across checkpoints → **+0.02** absolute accuracy boost
- **Final test accuracy:** ~**70%**
- **Rank:** 1st out of 1,034 teams
- [Leaderboard](https://www.kaggle.com/competitions/scl-2021-ds/leaderboard)
- [Solution write-up](https://www.kaggle.com/competitions/scl-2021-ds/writeups/student-voidandtwotsts-1st-place-solution-scl-ds-2)

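Logit averaging reduces to a few lines of NumPy, sketched here with a toy `id2tag` map and made-up logit values: average the per-token logits from each checkpoint, then take the argmax per token.

```python
import numpy as np

def ensemble_tags(logits_per_model: list[np.ndarray],
                  id2tag: dict[int, str]) -> list[str]:
    """Average per-token logits across checkpoints, then argmax.
    Each array in logits_per_model has shape (num_tokens, num_tags)."""
    avg = np.mean(logits_per_model, axis=0)
    return [id2tag[i] for i in avg.argmax(axis=1)]

id2tag = {0: "O", 1: "B-STR", 2: "E-STR"}
m1 = np.array([[2.0, 1.0, 0.0],   # checkpoint 1: token 0 leans "O"
               [0.0, 3.0, 1.0]])
m2 = np.array([[0.5, 2.5, 0.0],   # checkpoint 2: token 0 leans "B-STR"
               [0.0, 2.0, 1.0]])
print(ensemble_tags([m1, m2], id2tag))  # ['B-STR', 'B-STR']
```

Averaging raw logits rather than hard predictions lets a confident checkpoint outvote an uncertain one, which is where the small but consistent accuracy gain comes from.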

---

## 💭 Reflections & Future Directions

- **Data processing & augmentation** provided the largest gains in 2021
- Today's state-of-the-art pretrained models, and even LLMs, could recast this as a **text-generation** task rather than token classification
- **Potential improvements:**
  - More sophisticated augmentation (synonym substitution, paraphrasing)
  - Replace the simple fixer dictionary with a **contextual language model** to "repair" SHORT tokens

## 🔗 Download Slides

You can download the summary slides here:
[**Shopee Code League 2021 – Address Elements Extraction (PDF)**](/projects/scl_2021/pdfs/scl_2021.pdf)