# Case Study: Large-Scale Text Classification with LLMs

*Tentative time: 10 minutes*

## Learning Objectives

By the end of this section, you will be able to:

- **Understand how LLMs enable ambitious policy research** with limited resources
- **Apply practical validation strategies** to ensure research integrity when using LLMs at scale
- **Recognize common challenges** in LLM classification projects (including unexpected censorship issues)
- **Appreciate the importance of transparency** in documenting methods for others to build upon
- **Implement an iterative approach** to developing and testing classification systems

## The Policy Challenge: Understanding China's Role in the Energy Transition

Last year, Yunnan Chen (Research Fellow at ODI) and I set out to answer critical questions about China's evolving role in global development finance. China has been a key source of lending to developing countries, but recent policy pronouncements suggested major shifts:

- Movement toward a "Green Belt and Road Initiative"
- Emphasis on "small and beautiful" projects
- Transition from policy bank lending to co-financing with state-owned commercial banks (SOCBs)

We needed empirical evidence: Was China actually supporting the green transition in developing countries? As lending shifted toward co-financing models, who exactly was participating in green projects? What types of projects were being funded, and at what scale?

These weren't merely academic questions. Understanding China's actual role—not just the rhetoric—was essential for policymakers working on climate finance and energy transition in developing countries.

## The Classification Challenge

We needed to classify 18,000 Chinese overseas lending projects from AidData's GCDF 3.0 dataset into environmental categories:

- **🟢 Green**: Solar, wind, hydro, nuclear, and other renewable energy
- **🟫 Brown**: Coal, oil, and fossil fuel infrastructure
- **🔘 Grey**: Projects with indirect impacts (transmission lines, natural gas)
- **⚪ Neutral**: Non-energy projects

A traditional manual approach would have required:

- 1,500 hours of work (5 minutes per project × 18,000 projects)
- $22,500 in research assistant costs (roughly $15 per hour)
- Large grant funding to support such an effort

We completed it in 15 hours for $1.58.

## The Reality of Human vs. LLM Classification

Let's be honest about manual classification at scale. I've done this work myself. After a few hours of coding projects, your eyes glaze over. You start questioning whether you're applying criteria consistently. Are you coding things the same way you did yesterday? Last week?

Research assistants face the same challenges—and who can blame them if attention wanders during hour six of classifying infrastructure projects? This isn't about capability; it's about the mind-numbing nature of repetitive classification tasks.

LLMs bring something humans can't sustain: endless patience and perfect consistency. They apply the same criteria to project 17,000 as they did to project 1. No fatigue, no drift in standards, no bad days.

The question isn't whether LLMs are perfect—they're not. It's whether they can achieve good-enough accuracy with perfect consistency at a scale that makes ambitious research possible.

## From Keywords to Context: Why LLMs Were Essential

### The Keyword Approach Failed

I started where most researchers would: keyword searches. I wrote regular expressions to find "solar," "wind," "coal," and other energy terms.

It quickly became clear this wouldn't work:

**Example**: "Development of 500MW solar power plant with backup diesel generator"

- Keyword search sees: "diesel" → classifies as brown
- Reality: This is a green project with minimal fossil fuel backup

Keywords couldn't understand context. They couldn't distinguish between a solar plant with diesel backup (green) and a diesel plant with solar panels on the roof (brown). The sketch below illustrates the failure mode.
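
To make the failure concrete, here is a minimal sketch of a keyword classifier in Python. The regular expressions and the "fossil keyword wins" rule are illustrative assumptions, not the actual patterns we used, but they reproduce the misclassification described above.

```python
import re

# Illustrative keyword lists -- not the actual rules used in the project
BROWN_TERMS = re.compile(r"\b(coal|oil|diesel|gas-fired)\b", re.IGNORECASE)
GREEN_TERMS = re.compile(r"\b(solar|wind|hydro|nuclear)\b", re.IGNORECASE)

def keyword_classify(description: str) -> str:
    """Naive rule: any fossil-fuel keyword makes the project 'brown'."""
    if BROWN_TERMS.search(description):
        return "brown"
    if GREEN_TERMS.search(description):
        return "green"
    return "neutral"

text = "Development of 500MW solar power plant with backup diesel generator"
print(keyword_classify(text))  # -> 'brown', even though the project is primarily solar
```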

### LLMs Understand Context

Large Language Models can read an entire project description and understand the primary purpose. This contextual understanding was exactly what we needed for accurate classification at scale.
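
For comparison, here is a hedged sketch of how a single project description might be classified through an OpenAI-compatible chat API (Deepseek documents this style of endpoint). The system prompt, the `classify_project` helper, and the environment variable name are simplified placeholders rather than our actual prompt and code, which live in the GitHub repository linked later in this section.

```python
import os
from openai import OpenAI  # Deepseek exposes an OpenAI-compatible API

# Assumes a DEEPSEEK_API_KEY environment variable; endpoint and model name may change over time
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

SYSTEM_PROMPT = (
    "You classify Chinese overseas lending projects into one of four categories: "
    "GREEN (renewables, hydro, nuclear), BROWN (coal, oil, fossil fuel infrastructure), "
    "GREY (indirect impacts such as transmission lines or natural gas), NEUTRAL (non-energy). "
    "Judge the project's primary purpose, not incidental mentions. Reply with the category only."
)

def classify_project(description: str) -> str:
    """Return the model's category label for one project description."""
    response = client.chat.completions.create(
        model="deepseek-chat",
        temperature=0,  # deterministic output helps consistency and auditability
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": description},
        ],
    )
    return response.choices[0].message.content.strip().upper()

print(classify_project("Development of 500MW solar power plant with backup diesel generator"))
# A capable model typically answers GREEN here, because the primary purpose is solar generation.
```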

## Building a Validation Framework

Finding no established best practices for validating LLM classifications in policy research, we developed our own multi-stage approach that balanced pragmatism with rigor.

### Stage 1: Inter-Model Agreement

First, we tested how well different LLMs agreed with each other on classifications. This revealed important patterns:

**Key insights from inter-model agreement** (a sketch of how such agreement rates can be computed follows this list):

- High agreement on NEUTRAL (94.4%) and GREEN (94.8%) projects
- Lower agreement on GREY (84.1%) and BROWN (85.5%) categories
- Agreement correlated with LLM confidence levels
- Llama 3.3 was a consistent outlier, agreeing with the other models far less often
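
A minimal sketch of one way to compute such agreement rates, assuming the labels from each model have already been collected into a CSV with one column per model; the file and column names are placeholders:

```python
from itertools import combinations
import pandas as pd

# Hypothetical file: one row per project, one column of labels per model
df = pd.read_csv("model_labels.csv")  # columns: project_id, deepseek, claude, gpt4o_mini, llama
models = ["deepseek", "claude", "gpt4o_mini", "llama"]

# Pairwise agreement: share of projects where two models assign the same category
for a, b in combinations(models, 2):
    rate = (df[a] == df[b]).mean()
    print(f"{a} vs {b}: {rate:.1%}")

# Agreement by category: how often all models give the same label, grouped by the modal label
majority = df[models].mode(axis=1)[0]           # modal label across models, per project
all_agree = df[models].nunique(axis=1) == 1     # True when every model gives the same label
print(all_agree.groupby(majority).mean())
```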

### Stage 2: Human Validation

After establishing inter-model agreement patterns, my co-author and I manually classified 300 projects to test the models against human judgment:

| Model | Overall Agreement | Green Projects Agreement | Cost (Full Dataset) | Time |
|-------|-------------------|--------------------------|---------------------|------|
| **Deepseek v3** | 91.8% | 95.5% | $1.58 | 15 hours |
| Claude 3.5 Sonnet | 85.9% | 90.9% | ~$4,700 | 16 hours |
| GPT-4o mini | 87.3% | 88.4% | ~$54 | 11 hours |
| Llama 3.3 (local) | 70.1% | 76.2% | $0 | 338 hours |
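
Checking a model against the hand-coded sample is a simple accuracy calculation. A sketch, again with placeholder file and column names, where `human` holds our manual labels and `model` holds one model's labels for the same 300 projects:

```python
import pandas as pd

# Hypothetical validation file: 300 hand-coded projects joined to one model's labels
sample = pd.read_csv("validation_sample.csv")  # columns: project_id, human, model

overall = (sample["human"] == sample["model"]).mean()
print(f"Overall agreement: {overall:.1%}")

# Agreement within each human-assigned category (e.g. how often GREEN calls match)
per_category = (
    sample.assign(match=sample["human"] == sample["model"])
          .groupby("human")["match"]
          .mean()
)
print(per_category)
```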

::: callout-note
## Open Source Models: Promise vs. Reality

We tested two open source options:

- **Deepseek v3**: Technically open source but too large to run locally. We used their API, which performed excellently.
- **Llama 3.3**: Small enough to run on a Mac Mini with 64GB RAM. Performance was poor (70% agreement with our hand-coded sample) and glacially slow (2 weeks for the full dataset).

The gap between frontier models (whether closed like Claude or API-accessible like Deepseek) and truly local models remains substantial.
:::

## The Iterative Development Process

::: callout-tip
## Start Small, Then Scale

When developing your classification system:

1. **Test with 5-10 examples** to refine your prompt
2. **Validate on 50-100 projects** to catch edge cases
3. **Run a larger test (500-1,000)** to identify infrastructure issues
4. **Only then process your full dataset**

This approach saves time, money, and frustration. We caught several bugs and prompt improvements during small-scale testing that would have been expensive mistakes at full scale.
:::
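
One way to put this advice into practice is to time and review a small random pilot before committing to the full run. This sketch assumes the `classify_project` helper from earlier and a hypothetical CSV export of the project descriptions:

```python
import time
import pandas as pd

projects = pd.read_csv("gcdf_projects.csv")  # hypothetical export with a 'description' column

# Classify a small random sample first and read every result by hand
pilot = projects.sample(n=100, random_state=42).copy()
start = time.time()
pilot["label"] = pilot["description"].apply(classify_project)  # helper sketched earlier
elapsed = time.time() - start

# Extrapolate runtime to the full dataset before committing to it
est_hours = elapsed / len(pilot) * len(projects) / 3600
print(f"Pilot took {elapsed:.0f}s; a full run would take roughly {est_hours:.1f} hours at this rate")

# Save the pilot for manual review; only scale up once the labels look sensible
pilot.to_csv("pilot_labels.csv", index=False)
```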

## The Unexpected Challenge: Content Moderation

Everything ran smoothly until 56 projects repeatedly failed with "Content Exists Risk" errors. Investigation revealed that the failing projects mentioned politically sensitive Chinese figures, such as Xi Jinping's wife and disgraced former officials.

Since these names were incidental to the project descriptions, we replaced them with "a Chinese official," and classification resumed without issues.
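
A sketch of the workaround: catch the moderation error, redact the incidental names, and retry. The name list, the error-message check, and the exception handling are illustrative assumptions rather than the provider's documented behaviour or our exact code:

```python
# Illustrative only: the sensitive names and the provider's error message will vary
SENSITIVE_NAMES = ["Example Official A", "Example Official B"]

def redact(description: str) -> str:
    """Replace incidental mentions of sensitive figures with a neutral placeholder."""
    for name in SENSITIVE_NAMES:
        description = description.replace(name, "a Chinese official")
    return description

def classify_with_retry(description: str) -> str:
    """Classify a project, retrying once with redacted names if moderation blocks it."""
    try:
        return classify_project(description)           # helper sketched earlier
    except Exception as err:                            # provider-specific moderation error
        if "Content Exists Risk" in str(err):
            return classify_project(redact(description))
        raise
```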

::: callout-note
## Content Moderation Reality

All LLM providers implement content moderation—not just Chinese companies. I once asked Gemini about the constitutionality of a Trump administration policy, and it declined, saying it did not want to give a potentially incorrect answer to a politically sensitive question.

For researchers using public datasets like AidData's GCDF 3.0, these issues are manageable. Those working with sensitive data should carefully evaluate provider policies.
:::

## Policy-Relevant Findings

The classification revealed surprising insights:

### Finding 1: Limited Green Investment

- Only $86.5 billion in green investments (5.8% of total Chinese lending)
- Dominated by large hydropower (71.6%) and nuclear (12.0%)
- Minimal solar (3.2%) and wind (3.7%) despite the rhetoric

### Finding 2: No Green Surge Over Time

Despite talk of a "Green BRI," our data through 2021 showed no significant increase in renewable energy financing.

### Finding 3: Bifurcated Co-financing Networks

Green projects rely on public development banks, while commercial co-financing focuses on traditional infrastructure—with little overlap between these networks.

## Transparency and Reproducibility

We published everything:

- **[27-page methodological appendix](https://odi.org/en/publications/greener-on-the-other-side-mapping-chinas-overseas-co-financing-and-financial-innovation/)**
- **[Complete code on GitHub](https://github.com/Teal-Insights/odi_china_lending_llm_classification)**
- **All prompts and validation data**

This transparency serves multiple purposes:

1. **Exposes assumptions to scrutiny**: Our definition of "green" is contentious. By sharing our classification criteria, others can challenge or adapt it.

2. **Enables others to build on our work**: The name standardization alone took enormous effort. Why should others reinvent that wheel?

3. **Raises the bar on reproducibility**: While we didn't achieve full reproducibility (that would require packaging an open source model in a Docker container), we took significant steps toward transparency.

As two authors with a limited budget, we relied on LLM coding assistance to achieve this level of documentation and code quality.

## The Transformation of Research Possibilities

This project doesn't represent doing the impossible—someone with large grant funding could have hired teams to classify these projects manually. Instead, it shows how LLMs dramatically expand what's possible for researchers with limited resources.

We face difficult policy challenges with constrained budgets. Tools that allow us to do more ambitious research with less funding are exciting and important. That's why we're working to push this conversation forward through transparent documentation and workshops like this one.

## Key Takeaways

1. **LLMs offer consistency at scale** that humans can't sustain for repetitive tasks
2. **Multi-stage validation builds confidence**—test models against each other, then against human judgment
3. **Iterative development saves time and money**—start small, catch bugs early
4. **Transparency enables progress**—share your methods so others can build on them
5. **Perfect is the enemy of good**—focus on enabling research that wouldn't happen otherwise

## What You Can Do Now

**For your own research:**

1. Identify classification bottlenecks in your work
2. Start with 10-20 examples to test feasibility
3. Build validation into your process from the beginning
4. Share your methods and code openly
5. Focus on research questions that matter, not perfect methods

**For the field:**

- Contribute to emerging best practices
- Build on others' work rather than starting from scratch
- Be transparent about both successes and limitations
- Remember: we're all figuring this out together

## The Bigger Picture

This project demonstrates how LLMs can transform resource-constrained research. We moved from assumptions about China's role in the energy transition to evidence-based analysis that informs real policy decisions—all with two researchers and minimal budget.

The technology doesn't replace human judgment. It amplifies human expertise, allowing us to tackle questions at a scale that reveals patterns invisible to traditional methods. That's the promise worth pursuing.

---

*This concludes our workshop on AI for the Skeptical Scholar. Thank you for joining us on this journey toward more ambitious, transparent, and impactful research.*