Commit 2027202

Add scl2021 project
1 parent 5e52a81 commit 2027202

23 files changed: +216 -29 lines changed
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
---
title: "Address Element Extraction - Shopee Code League 2021"
date: 2025-07-31T09:53:11+08:00
description: "A summary of my solution at the Shopee Code League 2021 Data Science Challenge"
tags: ["Data Science", "Competition"]
showToc: true
draft: false
---

## 🎯 Problem Statement

Unstructured, incomplete, and often misspelled Indonesian addresses make accurate geocoding for last-mile delivery a major challenge. In the Shopee Code League 2021 Data Science round, we were given:

- **300,000** training samples & **50,000** test addresses
- The task: **Extract** Point of Interest (POI) and Street from raw address text
- The goal: **Enable** downstream geocoding to optimize delivery routes and improve customer experience

| Raw address                                                                     | POI             | Street               |
| ------------------------------------------------------------------------------- | --------------- | -------------------- |
| `cipinang besar selatan lintas ibadah, cipi jaya 1a no 3 rw 7 13410 jatinegara` | `lintas ibadah` | `cipinang jaya 1a`   |
| `puri kemb timur`                                                               | _None_          | `puri kembang timur` |

---

## 🔍 NER Training Pipeline

We formulated POI/Street extraction as a token-level Named Entity Recognition (NER) problem:

1. **Tokenisation & Alignment**

   - Clean and split addresses into word tokens using regular expressions
   - Align ground-truth POI/Street spans to tokens via a simple linear substring search with **prefix matching** (to tolerate truncation or misspelling)
   - Alignment failed on only ~1,000 rows (roughly 0.3% of the 300,000 training samples)

   | Field           | Value                                                 |
   | --------------- | ----------------------------------------------------- |
   | **Raw address** | `law stat, hayam wuruk, sumerta kelod denpasar timur` |
   | **POI**         | `lawson station`                                      |
   | **Street**      | `hayam wuruk`                                         |
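
The alignment step can be sketched roughly as follows. This is a minimal illustration, not the competition code: the helper names `tokenize` and `align` are hypothetical, and it assumes the label has the same number of tokens as the matching span in the raw address. Prefix matching of this kind handles truncated tokens (`law` for `lawson`); other misspellings would need a looser comparison. On the example above it finds the POI at token 0 and the Street at token 2.

```python
import re


def tokenize(text):
    """Lower-case, split on whitespace, strip punctuation around each token."""
    tokens = [re.sub(r"^\W+|\W+$", "", t) for t in text.lower().split()]
    return [t for t in tokens if t]


def align(raw_tokens, label_tokens):
    """Linear scan for the first window where every raw token is a prefix of
    the corresponding label token. Returns the start index, or None."""
    if not label_tokens:
        return None
    n, m = len(raw_tokens), len(label_tokens)
    for start in range(n - m + 1):
        window = raw_tokens[start:start + m]
        if all(full.startswith(raw) for raw, full in zip(window, label_tokens)):
            return start
    return None  # counted as an alignment failure


raw = tokenize("law stat, hayam wuruk, sumerta kelod denpasar timur")
poi_start = align(raw, tokenize("lawson station"))   # 0
street_start = align(raw, tokenize("hayam wuruk"))   # 2
```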

2. **IOBES + `{SHORT}` Tagging Scheme**

   - **B/I/E/S** tags mark Beginning/Inside/End/Single-token entities
   - **O** marks Outside tokens
   - A **SHORT** suffix marks clipped or misspelled tokens that need correction

   | Field                | Value                                                                          |
   | -------------------- | ------------------------------------------------------------------------------ |
   | **Raw address**      | `law stat, hayam wuruk, sumerta kelod denpasar timur`                          |
   | **POI**              | `lawson station`                                                               |
   | **Street**           | `hayam wuruk`                                                                  |
   | **Individual words** | `['law', 'stat,', 'hayam', 'wuruk,', 'sumerta', 'kelod', 'denpasar', 'timur']` |
   | **Individual tags**  | `['B-POI-SHORT', 'E-POI-SHORT', 'B-STR', 'E-STR', 'O', 'O', 'O', 'O']`         |
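
As a rough sketch of how such tags could be produced once a span has been aligned: the function `tag_span` below is a hypothetical helper, and it assumes punctuation has already been stripped from the words (the table above keeps the commas) and that a token counts as SHORT whenever it differs from its ground-truth counterpart. For the example address it reproduces the tag sequence shown in the table.

```python
def tag_span(tags, raw_tokens, label_tokens, start, entity):
    """Write IOBES tags (with an optional -SHORT suffix) for one aligned span.

    `tags` is modified in place; `start` is the index of the span's first token.
    """
    m = len(label_tokens)
    span = raw_tokens[start:start + m]
    for offset, (raw, full) in enumerate(zip(span, label_tokens)):
        if m == 1:
            position = "S"          # single-token entity
        elif offset == 0:
            position = "B"          # beginning
        elif offset == m - 1:
            position = "E"          # end
        else:
            position = "I"          # inside
        suffix = "-SHORT" if raw != full else ""
        tags[start + offset] = f"{position}-{entity}{suffix}"


words = ["law", "stat", "hayam", "wuruk", "sumerta", "kelod", "denpasar", "timur"]
tags = ["O"] * len(words)
tag_span(tags, words, ["lawson", "station"], start=0, entity="POI")
tag_span(tags, words, ["hayam", "wuruk"], start=2, entity="STR")
print(tags)
```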

3. **Model Fine-tuning**

   - Pretrained transformers: **IndoBERT** (Indonesian) and **XLM** (multilingual)
   - Per-token multi-class classification (one tag per token)
   - Optimiser: Adam; loss: cross-entropy
   - Trained for **5 epochs** (sufficient for convergence)
   - Stabilisation & speedups via:
     - Cyclic learning-rate scheduler with warm-up
     - Mixed-precision training
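
To make the scheduler idea concrete, here is one possible shape for a cyclic learning rate with warm-up, written as a plain function of the training step. Everything specific in it is an assumption for illustration: the function name, the triangular cycle shape, and all default values (`base_lr`, `max_lr`, `warmup_steps`, `cycle_steps`) are hypothetical and are not the settings used in the competition.

```python
def cyclic_lr_with_warmup(step, base_lr=1e-5, max_lr=5e-5,
                          warmup_steps=500, cycle_steps=2000):
    """Linear warm-up from 0 to max_lr, then a triangular cycle that dips to
    base_lr at the middle of each cycle and returns to max_lr at its end."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Position inside the current cycle, in [0, 1).
    phase = ((step - warmup_steps) % cycle_steps) / cycle_steps
    # Triangle wave: 1 at the cycle start, 0 at the midpoint, back towards 1.
    triangle = abs(2.0 * phase - 1.0)
    return base_lr + (max_lr - base_lr) * triangle


# Sample the schedule at a few steps to see its shape.
for s in (0, 250, 500, 1500, 2500):
    print(s, cyclic_lr_with_warmup(s))
```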

4. **Post-processing: SHORT Reconstruction**

   - Build a one-to-one “fixer” dictionary from training data: each observed SHORT token maps to the full token it most frequently stood for
   - At inference, replace each SHORT token with its dictionary lookup
   - Simple but surprisingly effective, despite occasional unseen or ambiguous SHORT tokens
     ![short_word_reconstruction](/projects/scl_2021/images/short_word_reconstruction.png)
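
A frequency-based fixer of this kind fits in a few lines. The sketch below is illustrative only: `build_fixer` and `reconstruct` are hypothetical names, the training pairs are made up, and leaving unseen SHORT tokens unchanged is an assumed fallback rather than a documented one. With the toy pairs shown, `stat` resolves to `station` because that expansion is seen twice versus once for `stationery`.

```python
from collections import Counter, defaultdict


def build_fixer(pairs):
    """Map each SHORT token to the full token it most often stood for."""
    counts = defaultdict(Counter)
    for short, full in pairs:
        counts[short][full] += 1
    return {short: c.most_common(1)[0][0] for short, c in counts.items()}


def reconstruct(tokens, tags, fixer):
    """Replace tokens tagged SHORT via the dictionary; unseen ones stay as-is."""
    return [
        fixer.get(tok, tok) if tag.endswith("-SHORT") else tok
        for tok, tag in zip(tokens, tags)
    ]


# Made-up (short, full) pairs standing in for those mined from training data.
pairs = [("law", "lawson"), ("stat", "station"),
         ("stat", "stationery"), ("stat", "station")]
fixer = build_fixer(pairs)
poi = reconstruct(["law", "stat"], ["B-POI-SHORT", "E-POI-SHORT"], fixer)
print(" ".join(poi))  # lawson station
```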

---

## 🔄 Data Augmentation

To diversify the training set and improve generalisation:

- **Intra-sentence swaps:** randomly swap POI ↔ Street within the same address
- **Cross-sentence swaps:** exchange POI/Street phrases between different addresses
- **Result:** nearly **** increase in training examples
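
A cross-sentence swap can be sketched on tagged token sequences as below. The helper names `entity_span` and `cross_swap` are hypothetical, the two samples are built from the examples earlier in this post, and the sketch assumes each address contains at most one contiguous span per entity type. An intra-sentence swap would follow the same pattern, exchanging the POI and Street spans of a single sample.

```python
def entity_span(tags, entity):
    """Return (start, end) covering all tokens tagged with `entity`, or None.

    Assumes the entity occupies one contiguous span.
    """
    idx = [i for i, t in enumerate(tags) if t != "O" and t.split("-")[1] == entity]
    return (idx[0], idx[-1] + 1) if idx else None


def cross_swap(sample_a, sample_b, entity):
    """Build a new sample: sample_a with its `entity` span (tokens and tags)
    replaced by the `entity` span taken from sample_b."""
    (tok_a, tag_a), (tok_b, tag_b) = sample_a, sample_b
    span_a, span_b = entity_span(tag_a, entity), entity_span(tag_b, entity)
    if span_a is None or span_b is None:
        return None  # nothing to swap
    (a0, a1), (b0, b1) = span_a, span_b
    tokens = tok_a[:a0] + tok_b[b0:b1] + tok_a[a1:]
    tags = tag_a[:a0] + tag_b[b0:b1] + tag_a[a1:]
    return tokens, tags


a = (["law", "stat", "hayam", "wuruk", "sumerta"],
     ["B-POI-SHORT", "E-POI-SHORT", "B-STR", "E-STR", "O"])
b = (["lintas", "ibadah", "cipi", "jaya", "1a"],
     ["B-POI", "E-POI", "B-STR-SHORT", "I-STR", "E-STR"])
augmented = cross_swap(a, b, "STR")
print(augmented)
```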

---

## 🛠️ Model Ensembling & Results

- Trained multiple model checkpoints
- **Averaged logits** across checkpoints → **+0.02** absolute accuracy boost
- **Final test accuracy:** ~**70%**
- **Rank:** 1st out of 1,034 teams
  - [Leaderboard](https://www.kaggle.com/competitions/scl-2021-ds/leaderboard)
  - [Solution](https://www.kaggle.com/competitions/scl-2021-ds/writeups/student-voidandtwotsts-1st-place-solution-scl-ds-2)
    ![leaderboard](/projects/scl_2021/images/leaderboard.png)
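
Logit averaging itself is simple; a dependency-free sketch is shown below. The function names, the three-tag label set, and the logit values are all made up for illustration, and an unweighted mean is assumed. In the toy example, the first checkpoint alone would tag the first token as `O`, but the averaged logits favour `S-POI`.

```python
def average_logits(checkpoint_logits):
    """Average logits shaped [checkpoint][token][tag] over the checkpoint axis."""
    n = len(checkpoint_logits)
    return [
        [sum(tag_scores) / n for tag_scores in zip(*token_rows)]
        for token_rows in zip(*checkpoint_logits)
    ]


def predict(checkpoint_logits, tag_names):
    """Pick the highest-scoring tag per token after averaging."""
    averaged = average_logits(checkpoint_logits)
    return [tag_names[max(range(len(row)), key=row.__getitem__)] for row in averaged]


tag_names = ["O", "S-POI", "S-STR"]          # toy label set
ckpt1 = [[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]]   # two tokens, three tags
ckpt2 = [[0.0, 3.0, 0.5], [0.5, 0.0, 2.0]]
predicted = predict([ckpt1, ckpt2], tag_names)
print(predicted)  # ['S-POI', 'S-STR']
```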

---

## 💭 Reflections & Future Directions

- **Data processing & augmentation** provided the largest gains in 2021
- Today’s state-of-the-art pretrained models, and even LLMs, could recast this as a **text-generation** task rather than token classification
- **Potential improvements:**
  - More sophisticated augmentation (synonym substitution, paraphrasing)
  - Replace the simple fixer dictionary with a **contextual language model** to “repair” SHORT tokens

## 🔗 Download Slides

You can download the summary slides here:
[**Shopee Code League 2021 – Address Elements Extraction (PDF)**](/projects/scl_2021/pdfs/scl_2021.pdf)

public/index.xml

Lines changed: 8 additions & 1 deletion
@@ -11,8 +11,15 @@
 </image>
 <generator>Hugo -- 0.148.2</generator>
 <language>en-us</language>
-<lastBuildDate>Sun, 25 Jun 2023 17:04:32 +0800</lastBuildDate>
+<lastBuildDate>Thu, 31 Jul 2025 09:53:11 +0800</lastBuildDate>
 <atom:link href="https://ncduy.github.io/index.xml" rel="self" type="application/rss+xml" />
+<item>
+<title>Address Element Extraction - Shopee Code League 2021</title>
+<link>https://ncduy.github.io/projects/scl_2021/scl_2021/</link>
+<pubDate>Thu, 31 Jul 2025 09:53:11 +0800</pubDate>
+<guid>https://ncduy.github.io/projects/scl_2021/scl_2021/</guid>
+<description>A summary of my solution at the Shopee Code League 2021 Data Science Challenge</description>
+</item>
 <item>
 <title>Y1S2 Module Review</title>
 <link>https://ncduy.github.io/posts/module-reviews/y1s2/</link>

public/projects/index.html

Lines changed: 3 additions & 1 deletion
@@ -1,7 +1,9 @@
11
<!doctype html><html lang=en dir=auto><head><meta charset=utf-8><meta http-equiv=X-UA-Compatible content="IE=edge"><meta name=viewport content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta name=robots content="index, follow"><title>Projects | Duy Nguyen</title><meta name=keywords content><meta name=description content="Projects - Duy Nguyen"><meta name=author content><link rel=canonical href=https://ncduy.github.io/projects/><meta name=google-site-verification content="XYZabc"><meta name=yandex-verification content="XYZabc"><meta name=msvalidate.01 content="XYZabc"><link crossorigin=anonymous href=/assets/css/stylesheet.8fe10233a706bc87f2e08b3cf97b8bd4c0a80f10675a143675d59212121037c0.css integrity="sha256-j+ECM6cGvIfy4Is8+XuL1MCoDxBnWhQ2ddWSEhIQN8A=" rel="preload stylesheet" as=style><link rel=icon href=https://ncduy.github.io/%3Clink%20/%20abs%20url%3E><link rel=icon type=image/png sizes=16x16 href=https://ncduy.github.io/%3Clink%20/%20abs%20url%3E><link rel=icon type=image/png sizes=32x32 href=https://ncduy.github.io/%3Clink%20/%20abs%20url%3E><link rel=apple-touch-icon href=https://ncduy.github.io/%3Clink%20/%20abs%20url%3E><link rel=mask-icon href=https://ncduy.github.io/%3Clink%20/%20abs%20url%3E><meta name=theme-color content="#2e2e33"><meta name=msapplication-TileColor content="#2e2e33"><link rel=alternate type=application/rss+xml href=https://ncduy.github.io/projects/index.xml><link rel=alternate hreflang=en href=https://ncduy.github.io/projects/><noscript><style>#theme-toggle,.top-link{display:none}</style><style>@media(prefers-color-scheme:dark){:root{--theme:rgb(29, 30, 32);--entry:rgb(46, 46, 51);--primary:rgb(218, 218, 219);--secondary:rgb(155, 156, 157);--tertiary:rgb(65, 66, 68);--content:rgb(196, 196, 197);--code-block-bg:rgb(46, 46, 51);--code-bg:rgb(55, 56, 62);--border:rgb(51, 51, 51)}.list{background:var(--theme)}.list:not(.dark)::-webkit-scrollbar-track{background:0 
0}.list:not(.dark)::-webkit-scrollbar-thumb{border-color:var(--theme)}}</style></noscript><meta property="og:url" content="https://ncduy.github.io/projects/"><meta property="og:site_name" content="Duy Nguyen"><meta property="og:title" content="Projects"><meta property="og:description" content="Year 4 Computer Science and Data Science undergraduate student at the National University of Singapore."><meta property="og:locale" content="en-us"><meta property="og:type" content="website"><meta property="og:image" content="https://ncduy.github.io/%3Clink%20or%20path%20of%20image%20for%20opengraph,%20twitter-cards%3E"><meta name=twitter:card content="summary_large_image"><meta name=twitter:image content="https://ncduy.github.io/%3Clink%20or%20path%20of%20image%20for%20opengraph,%20twitter-cards%3E"><meta name=twitter:title content="Projects"><meta name=twitter:description content="Year 4 Computer Science and Data Science undergraduate student at the National University of Singapore."><script type=application/ld+json>{"@context":"https://schema.org","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":1,"name":"Projects","item":"https://ncduy.github.io/projects/"}]}</script></head><body class=list id=top><script>localStorage.getItem("pref-theme")==="dark"?document.body.classList.add("dark"):localStorage.getItem("pref-theme")==="light"?document.body.classList.remove("dark"):window.matchMedia("(prefers-color-scheme: dark)").matches&&document.body.classList.add("dark")</script><header class=header><nav class=nav><div class=logo><a href=https://ncduy.github.io/ accesskey=h title="Duy Nguyen (Alt + H)"><img src=https://ncduy.github.io/images/profile.jpg alt aria-label=logo height=35>Duy Nguyen</a><div class=logo-switches><button id=theme-toggle accesskey=t title="(Alt + T)" aria-label="Toggle theme">
22
<svg id="moon" width="24" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M21 12.79A9 9 0 1111.21 3 7 7 0 0021 12.79z"/></svg>
33
<svg id="sun" width="24" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="12" cy="12" r="5"/><line x1="12" y1="1" x2="12" y2="3"/><line x1="12" y1="21" x2="12" y2="23"/><line x1="4.22" y1="4.22" x2="5.64" y2="5.64"/><line x1="18.36" y1="18.36" x2="19.78" y2="19.78"/><line x1="1" y1="12" x2="3" y2="12"/><line x1="21" y1="12" x2="23" y2="12"/><line x1="4.22" y1="19.78" x2="5.64" y2="18.36"/><line x1="18.36" y1="5.64" x2="19.78" y2="4.22"/></svg></button></div></div><ul id=menu><li><a href=https://ncduy.github.io/posts/ title=Posts><span>Posts</span></a></li><li><a href=https://ncduy.github.io/projects/ title=Projects><span class=active>Projects</span></a></li><li><a href=https://ncduy.github.io/profile/about/ title=About><span>About</span></a></li><li><a href=https://ncduy.github.io/archives/ title=Archives><span>Archives</span></a></li><li><a href=https://ncduy.github.io/tags/ title=Tags><span>Tags</span></a></li><li><a href=https://ncduy.github.io/search/ title="Search (Alt + /)" accesskey=/><span>Search</span></a></li></ul></nav></header><main class=main><header class=page-header><h1>Projects
4-
<a href=/projects/index.xml title=RSS aria-label=RSS><svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" height="23"><path d="M4 11a9 9 0 019 9"/><path d="M4 4a16 16 0 0116 16"/><circle cx="5" cy="19" r="1"/></svg></a></h1></header></main><footer class=footer><span>&copy; 2025 <a href=https://ncduy.github.io/>Duy Nguyen</a></span> ·
4+
<a href=/projects/index.xml title=RSS aria-label=RSS><svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" height="23"><path d="M4 11a9 9 0 019 9"/><path d="M4 4a16 16 0 0116 16"/><circle cx="5" cy="19" r="1"/></svg></a></h1></header><article class=post-entry><header class=entry-header><h2 class=entry-hint-parent>Address Element Extraction - Shopee Code League 2021</h2></header><div class=entry-content><p>🎯 Problem Statement Unstructured, incomplete-and often misspelled-Indonesian addresses make accurate geocoding for last-mile delivery a major challenge. In the Shopee Code League 2021 Data Science round, we were given:
5+
300,000 training samples & 50,000 test addresses The task: Extract Point of Interest (POI) and Street from raw address text Enable downstream geocoding to optimize delivery routes and improve customer experience Raw address POI Street cipinang besar selatan lintas ibadah, cipi jaya 1a no 3 rw 7 13410 jatinegara lintas ibadah cipinang jaya 1a puri kemb timur None puri kembang timur 🔍 NER Training Pipeline We formulated POI/Street extraction as a token-level Named Entity Recognition (NER) problem:
6+
...</p></div><footer class=entry-footer><span title='2025-07-31 09:53:11 +0800 +08'>July 31, 2025</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;442 words</footer><a class=entry-link aria-label="post link to Address Element Extraction - Shopee Code League 2021" href=https://ncduy.github.io/projects/scl_2021/scl_2021/></a></article></main><footer class=footer><span>&copy; 2025 <a href=https://ncduy.github.io/>Duy Nguyen</a></span> ·
57
<span>Powered by
68
<a href=https://gohugo.io/ rel="noopener noreferrer" target=_blank>Hugo</a> &
79
<a href=https://github.com/adityatelange/hugo-PaperMod/ rel=noopener target=_blank>PaperMod</a></span></footer><a href=#top aria-label="go to top" title="Go to Top (Alt + G)" class=top-link id=top-link accesskey=g><svg viewBox="0 0 12 6" fill="currentColor"><path d="M12 6H0l6-6z"/></svg>

public/projects/index.xml

Lines changed: 8 additions & 1 deletion
@@ -11,7 +11,14 @@
 </image>
 <generator>Hugo -- 0.148.2</generator>
 <language>en-us</language>
-<lastBuildDate></lastBuildDate>
+<lastBuildDate>Thu, 31 Jul 2025 09:53:11 +0800</lastBuildDate>
 <atom:link href="https://ncduy.github.io/projects/index.xml" rel="self" type="application/rss+xml" />
+<item>
+<title>Address Element Extraction - Shopee Code League 2021</title>
+<link>https://ncduy.github.io/projects/scl_2021/scl_2021/</link>
+<pubDate>Thu, 31 Jul 2025 09:53:11 +0800</pubDate>
+<guid>https://ncduy.github.io/projects/scl_2021/scl_2021/</guid>
+<description>A summary of my solution at the Shopee Code League 2021 Data Science Challenge</description>
+</item>
 </channel>
 </rss>