---
title: "Address Element Extraction - Shopee Code League 2021"
date: 2025-07-31T09:53:11+08:00
description: "A summary of my solution at the Shopee Code League 2021 Data Science Challenge"
tags: ["Data Science", "Competition"]
showToc: true
draft: false
---

## 🎯 Problem Statement

Unstructured, incomplete, and often misspelled Indonesian addresses make accurate geocoding for last-mile delivery a major challenge. In the Shopee Code League 2021 Data Science round, we were given:

- **300,000** training samples & **50,000** test addresses
- The task: **extract** the Point of Interest (POI) and Street from raw address text
- The goal: **enable** downstream geocoding to optimize delivery routes and improve customer experience

| Raw address                                                                     | POI             | Street               |
| ------------------------------------------------------------------------------- | --------------- | -------------------- |
| `cipinang besar selatan lintas ibadah, cipi jaya 1a no 3 rw 7 13410 jatinegara` | `lintas ibadah` | `cipinang jaya 1a`   |
| `puri kemb timur`                                                               | _None_          | `puri kembang timur` |

---

## 🔍 NER Training Pipeline

We formulated POI/Street extraction as a token-level Named Entity Recognition (NER) problem:

1. **Tokenisation & Alignment**

   - Clean and split addresses into word tokens using regular expressions
   - Align ground-truth POI/Street spans to tokens via a simple linear substring search with **prefix matching** (to tolerate truncation or misspelling)
   - Alignment failures accounted for only ~1,000 rows

   | Field           | Value                                                 |
   | --------------- | ----------------------------------------------------- |
   | **Raw address** | `law stat, hayam wuruk, sumerta kelod denpasar timur` |
   | **POI**         | `lawson station`                                      |
   | **Street**      | `hayam wuruk`                                         |

2. **IOBES + `{SHORT}` Tagging Scheme**

   - **B/I/E/S** tags mark Beginning / Inside / End / Single-token entities
   - **O** marks Outside tokens
   - **SHORT** marks clipped or misspelled tokens that need correction

   | Field                | Value                                                                          |
   | -------------------- | ------------------------------------------------------------------------------ |
   | **Raw address**      | `law stat, hayam wuruk, sumerta kelod denpasar timur`                          |
   | **POI**              | `lawson station`                                                               |
   | **Street**           | `hayam wuruk`                                                                  |
   | **Individual words** | `['law', 'stat,', 'hayam', 'wuruk,', 'sumerta', 'kelod', 'denpasar', 'timur']` |
   | **Individual tags**  | `['B-POI-SHORT', 'E-POI-SHORT', 'B-STR', 'E-STR', 'O', 'O', 'O', 'O']`         |

3. **Model Fine-tuning**

   - Pretrained transformers: **IndoBERT** (Indonesian) and **XLM** (multilingual)
   - Per-token multi-class classification (one tag predicted per token)
   - Optimiser: Adam with cross-entropy loss
   - Trained for **5 epochs** (sufficient for convergence)
   - Stabilisation & speed-ups via:
     - Cyclic learning-rate scheduler with warm-up
     - Mixed-precision training

4. **Post-processing: SHORT Reconstruction**

   - Build a one-to-one "fixer" dictionary from training data: observed SHORT tokens → full tokens (chosen by frequency)
   - At inference, replace each SHORT token with its dictionary lookup
   - Simple but surprisingly effective, despite occasional unseen or ambiguous SHORT tokens

   
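The alignment, tagging, and fixer-dictionary steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the competition code; the helper names (`tokenize`, `align`, `build_fixer`) and the tiny training-pair sample are hypothetical:

```python
import re
from collections import Counter, defaultdict

def tokenize(address: str) -> list[str]:
    # Lowercase and split on whitespace; punctuation stays attached,
    # matching the "stat," token in the example table above.
    return address.lower().split()

def norm(token: str) -> str:
    # Strip punctuation for matching purposes only.
    return re.sub(r"[^\w]", "", token)

def align(tokens: list[str], span: str, label: str, tags: list[str]) -> bool:
    """Linear search for `span` over `tokens`; a truncated token counts as
    a match if it is a prefix of the gold word (and is then tagged SHORT)."""
    words = span.split()
    for start in range(len(tokens) - len(words) + 1):
        window = [norm(t) for t in tokens[start:start + len(words)]]
        if all(w and g.startswith(w) for w, g in zip(window, words)):
            for i, (w, g) in enumerate(zip(window, words)):
                pos = ("S" if len(words) == 1 else
                       "B" if i == 0 else
                       "E" if i == len(words) - 1 else "I")
                tags[start + i] = f"{pos}-{label}" + ("" if w == g else "-SHORT")
            return True
    return False  # alignment failure (~1,000 rows in the real data)

tokens = tokenize("law stat, hayam wuruk, sumerta kelod denpasar timur")
tags = ["O"] * len(tokens)
align(tokens, "lawson station", "POI", tags)
align(tokens, "hayam wuruk", "STR", tags)
print(tags)
# ['B-POI-SHORT', 'E-POI-SHORT', 'B-STR', 'E-STR', 'O', 'O', 'O', 'O']

def build_fixer(pairs):
    """Step 4 sketch: map each observed SHORT token to its most
    frequent full form among the training alignments."""
    counts = defaultdict(Counter)
    for short_tok, full_tok in pairs:
        counts[short_tok][full_tok] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

fixer = build_fixer([("law", "lawson"), ("stat", "station"), ("kemb", "kembang")])
print(fixer["kemb"])  # kembang
```

Prefix matching is what lets `law` pair up with `lawson` during alignment, and the same (short, full) pairs collected there are exactly the training data for the fixer dictionary.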
---

## 🔄 Data Augmentation

To diversify the training set and improve generalisation:

- **Intra-sentence swaps:** randomly swap POI ↔ Street within the same address
- **Cross-sentence swaps:** exchange POI/Street phrases between different addresses
- **Result:** nearly a **2×** increase in training examples

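A cross-sentence swap can be sketched as below, assuming each training row stores the raw address plus its POI/Street strings (the row layout and the `cross_swap` helper are illustrative, not the competition code). Note that this simple version only applies when the labeled phrase appears verbatim in the address, i.e. when it contains no SHORT tokens:

```python
def cross_swap(row_a: dict, row_b: dict, field: str) -> tuple[dict, dict]:
    """Exchange the `field` ('poi' or 'street') phrase between two
    addresses, producing two new synthetic training rows."""
    a, b = dict(row_a), dict(row_b)
    if not (row_a[field] and row_b[field]):
        return a, b  # one side has no entity: nothing to swap
    a["address"] = row_a["address"].replace(row_a[field], row_b[field], 1)
    b["address"] = row_b["address"].replace(row_b[field], row_a[field], 1)
    a[field], b[field] = row_b[field], row_a[field]
    return a, b

# Hypothetical rows, in the spirit of the competition data:
row1 = {"address": "toko sinar jaya, jalan merdeka 5",
        "poi": "toko sinar jaya", "street": "jalan merdeka"}
row2 = {"address": "warung bu dewi, hayam wuruk 10",
        "poi": "warung bu dewi", "street": "hayam wuruk"}
new1, new2 = cross_swap(row1, row2, "poi")
print(new1["address"])  # warung bu dewi, jalan merdeka 5
print(new2["address"])  # toko sinar jaya, hayam wuruk 10
```

Because the gold POI/Street strings are swapped along with the address text, the tags realign automatically when the augmented rows pass back through the alignment step.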
---

## 🛠️ Model Ensembling & Results

- Trained multiple model checkpoints
- **Averaged logits** across checkpoints → **+0.02** absolute accuracy boost
- **Final test accuracy:** ~**70%**
- **Rank:** 1st out of 1,034 teams
- [Leaderboard](https://www.kaggle.com/competitions/scl-2021-ds/leaderboard)
- [Solution write-up](https://www.kaggle.com/competitions/scl-2021-ds/writeups/student-voidandtwotsts-1st-place-solution-scl-ds-2)

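Logit averaging reduces to a few lines of NumPy, sketched here with a toy `id2tag` map and made-up logit values: average the per-token logits from each checkpoint, then take the argmax per token.

```python
import numpy as np

def ensemble_tags(logits_per_model: list[np.ndarray],
                  id2tag: dict[int, str]) -> list[str]:
    """Average per-token logits across checkpoints, then argmax.
    Each array in logits_per_model has shape (num_tokens, num_tags)."""
    avg = np.mean(logits_per_model, axis=0)
    return [id2tag[i] for i in avg.argmax(axis=1)]

id2tag = {0: "O", 1: "B-STR", 2: "E-STR"}
m1 = np.array([[2.0, 1.0, 0.0],   # checkpoint 1: token 0 leans "O"
               [0.0, 3.0, 1.0]])
m2 = np.array([[0.5, 2.5, 0.0],   # checkpoint 2: token 0 leans "B-STR"
               [0.0, 2.0, 1.0]])
print(ensemble_tags([m1, m2], id2tag))  # ['B-STR', 'B-STR']
```

Averaging raw logits rather than hard predictions lets a confident checkpoint outvote an uncertain one, which is where the small but consistent accuracy gain comes from.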

---

## 💭 Reflections & Future Directions

- **Data processing & augmentation** provided the largest gains in 2021
- Today's state-of-the-art pretrained models, and even LLMs, could recast this as a **text-generation** task rather than token classification
- **Potential improvements:**
  - More sophisticated augmentation (synonym substitution, paraphrasing)
  - Replace the simple fixer dictionary with a **contextual language model** to "repair" SHORT tokens

## 🔗 Download Slides

You can download the summary slides here:
[**Shopee Code League 2021 – Address Elements Extraction (PDF)**](/projects/scl_2021/pdfs/scl_2021.pdf)