
Commit 99d5144

Added initial version of Case study
1 parent b0b0035 commit 99d5144

31 files changed · +4,758 -6 lines changed

_quarto.yml

Lines changed: 4 additions & 6 deletions
@@ -22,12 +22,10 @@ book:
     - apps_lit_review.qmd # Chapter 2.3
     - apps_other_tools.qmd # Chapter 2.6
     - apps_coding.qmd # Chapter 2.5
-    #
-    # - part: "Advanced Possibilities"
-    #   chapters:
-    #     - advanced_casestudy.qmd # Chapter 3.1
-    #     - advanced_programmatic.qmd # Chapter 3.2
-    #     - advanced_validation.qmd # Chapter 3.3
+
+    - part: "Advanced Possibilities"
+      chapters:
+        - advanced_casestudy.qmd # Chapter 3.1
     #
     # - part: "Q&A and Discussion"
     #   chapters:

advanced_casestudy.qmd

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
# Case Study: Large-Scale Text Classification with LLMs

*Tentative time: 10 minutes*

## Learning Objectives

By the end of this section, you will be able to:

- **Understand how LLMs enable ambitious policy research** with limited resources
- **Apply practical validation strategies** to ensure research integrity when using LLMs at scale
- **Recognize common challenges** in LLM classification projects (including unexpected censorship issues)
- **Appreciate the importance of transparency** in documenting methods for others to build upon
- **Implement an iterative approach** to developing and testing classification systems

## The Policy Challenge: Understanding China's Role in the Energy Transition

Last year, Yunnan Chen (Research Fellow at ODI) and I set out to answer critical questions about China's evolving role in global development finance. China has been a key source of lending to developing countries, but recent policy pronouncements suggested major shifts:

- Movement toward a "Green Belt and Road Initiative"
- Emphasis on "small and beautiful" projects
- Transition from policy bank lending to co-financing with state-owned commercial banks (SOCBs)

We needed empirical evidence: Was China actually supporting the green transition in developing countries? As lending shifted toward co-financing models, who exactly was participating in green projects? What types of projects were being funded, and at what scale?

These weren't academic questions. Understanding China's actual role—not just the rhetoric—was essential for policymakers working on climate finance and energy transition in developing countries.

## The Classification Challenge

We needed to classify 18,000 Chinese overseas lending projects from AidData's GCDF 3.0 dataset into environmental categories:

- **🟢 Green**: Solar, wind, hydro, nuclear, and other renewable energy
- **🟫 Brown**: Coal, oil, and fossil fuel infrastructure
- **🔘 Grey**: Projects with indirect impacts (transmission lines, natural gas)
- **⚪ Neutral**: Non-energy projects
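One way to make a rubric like this concrete is to pin the category definitions down as data before any prompting begins. The snippet below is a minimal, illustrative sketch; the wording is a paraphrase for this workshop, not the exact definitions from the published appendix.

```python
# Illustrative category rubric: labels mirror the four categories above,
# descriptions are short paraphrases for demonstration purposes only.
CATEGORIES = {
    "GREEN": "Renewable or low-carbon energy: solar, wind, hydro, nuclear",
    "BROWN": "Coal, oil, and other fossil fuel infrastructure",
    "GREY": "Indirect energy impacts, e.g. transmission lines or natural gas",
    "NEUTRAL": "Non-energy projects",
}

def rubric_text() -> str:
    """Render the rubric as a block of text that can be pasted into a prompt."""
    return "\n".join(f"- {label}: {desc}" for label, desc in CATEGORIES.items())
```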
Traditional approaches would have required:

- 1,500 hours of work (5 minutes per project × 18,000 projects)
- $22,500 in research assistant costs
- Large grant funding to support such an effort

We completed it in 15 hours for $1.58.
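For skeptics who like to check the arithmetic, the back-of-envelope comparison is simple; the $15/hour research-assistant rate below is the one implied by the figures above.

```python
projects = 18_000
minutes_per_project = 5          # manual-coding estimate used above
ra_hourly_rate = 15              # USD/hour, implied by the $22,500 figure

manual_hours = projects * minutes_per_project / 60   # 1,500 hours
manual_cost = manual_hours * ra_hourly_rate          # $22,500

print(f"Manual: {manual_hours:,.0f} hours, ${manual_cost:,.0f}")
print("LLM (Deepseek v3): 15 hours, $1.58")
```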
## The Reality of Human vs. LLM Classification

Let's be honest about manual classification at scale. I've done this work myself. After a few hours of coding projects, your eyes glaze over. You start questioning whether you're applying criteria consistently. Are you coding things the same way you did yesterday? Last week?

Research assistants face the same challenges—and who can blame them if attention wanders during hour six of classifying infrastructure projects? This isn't about capability; it's about the mind-numbing nature of repetitive classification tasks.

LLMs bring something humans can't sustain: endless patience and perfect consistency. They apply the same criteria to project 17,000 as they did to project 1. No fatigue, no drift in standards, no bad days.

The question isn't whether LLMs are perfect—they're not. It's whether they can achieve good-enough accuracy with perfect consistency at a scale that makes ambitious research possible.

## From Keywords to Context: Why LLMs Were Essential

### The Keyword Approach Failed

I started where most researchers would: keyword searches. I wrote regular expressions to find "solar," "wind," "coal," and other energy terms.

It quickly became clear this wouldn't work:

**Example**: "Development of 500MW solar power plant with backup diesel generator"

- Keyword search sees: "diesel" → classifies as brown
- Reality: This is a green project with minimal fossil fuel backup

Keywords couldn't understand context. They couldn't distinguish between a solar plant with diesel backup (green) and a diesel plant with solar panels on the roof (brown).
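To make the failure mode concrete, here is roughly what that first attempt looked like. This is a simplified reconstruction for illustration, not the exact patterns from the project code.

```python
import re

# Simplified keyword lists, for illustration only
GREEN_TERMS = re.compile(r"\b(solar|wind|hydro|nuclear)\b", re.IGNORECASE)
BROWN_TERMS = re.compile(r"\b(coal|oil|diesel|gas)\b", re.IGNORECASE)

def keyword_classify(description: str) -> str:
    """Naive rule: any fossil-fuel term makes the project 'brown'."""
    if BROWN_TERMS.search(description):
        return "BROWN"
    if GREEN_TERMS.search(description):
        return "GREEN"
    return "NEUTRAL"

# The example from above: a green project with a diesel backup generator
print(keyword_classify(
    "Development of 500MW solar power plant with backup diesel generator"
))
# Prints "BROWN": the rule latches onto "diesel" and misses the project's primary purpose
```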
### LLMs Understand Context

Large Language Models can read an entire project description and understand the primary purpose. This contextual understanding was exactly what we needed for accurate classification at scale.
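In practice, "asking the model" is just a structured API call per project description. The sketch below uses the OpenAI-compatible Python client that providers such as Deepseek accept; the endpoint, model name, and prompt wording are illustrative placeholders rather than the exact configuration from our study.

```python
import os
from openai import OpenAI  # OpenAI-compatible client; many providers accept it

# Illustrative configuration: endpoint and model name are placeholders
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

PROMPT = """You are classifying Chinese overseas lending projects.
Categories: GREEN (renewables/nuclear), BROWN (fossil fuels),
GREY (indirect impacts such as transmission or natural gas), NEUTRAL (non-energy).
Consider the project's primary purpose, not incidental keywords.
Reply with exactly one category label.

Project description: {description}"""

def classify(description: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",   # placeholder model name
        temperature=0,           # keep classification output as deterministic as possible
        messages=[{"role": "user", "content": PROMPT.format(description=description)}],
    )
    return response.choices[0].message.content.strip()

print(classify("Development of 500MW solar power plant with backup diesel generator"))
# Expected: "GREEN" — the model weighs the primary purpose rather than the word "diesel"
```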
## Building a Validation Framework

Finding no established best practices for validating LLM classifications in policy research, we developed our own multi-stage approach that attempted to balance pragmatism and rigor.

### Stage 1: Inter-Model Agreement

First, we tested how well different LLMs agreed with each other on classifications. This revealed important patterns:

![LLM Agreement Analysis](images/llm_agreement_analysis.png)

**Key insights from inter-model agreement** (a sketch of how these pairwise rates can be computed follows the list):

- High agreement on NEUTRAL (94.4%) and GREEN (94.8%) projects
- Lower agreement on GREY (84.1%) and BROWN (85.5%) categories
- Agreement correlated with LLM confidence levels
- Llama 3.3 was an outlier, agreeing with the other models least often
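Computing these agreement rates requires nothing exotic once each model's labels sit in a data frame. A minimal sketch, assuming one column of labels per model (the column names and toy data are hypothetical):

```python
import pandas as pd
from itertools import combinations

# Hypothetical frame: one row per project, one label column per model
df = pd.DataFrame({
    "deepseek_v3": ["GREEN", "BROWN", "GREY", "NEUTRAL"],
    "claude_35":   ["GREEN", "BROWN", "GREY", "NEUTRAL"],
    "gpt4o_mini":  ["GREEN", "GREY",  "GREY", "NEUTRAL"],
    "llama_33":    ["GREEN", "BROWN", "BROWN", "NEUTRAL"],
})

# Pairwise agreement: share of projects where two models give the same label
for a, b in combinations(df.columns, 2):
    rate = (df[a] == df[b]).mean()
    print(f"{a} vs {b}: {rate:.1%}")

# Agreement by category, using one model's labels to define the category
all_agree = df.eq(df["deepseek_v3"], axis=0).all(axis=1)
print(all_agree.groupby(df["deepseek_v3"]).mean())
```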
### Stage 2: Human Validation

After establishing inter-model agreement patterns, my co-author and I manually classified 300 projects to test against human judgment:

| Model | Overall Agreement | Green Projects Agreement | Cost (Full Dataset) | Time |
|-------|-------------------|--------------------------|---------------------|------|
| **Deepseek v3** | 91.8% | 95.5% | $1.58 | 15 hours |
| Claude Sonnet 3.5 | 85.9% | 90.9% | ~$4,700 | 16 hours |
| GPT-4o mini | 87.3% | 88.4% | ~$54 | 11 hours |
| Llama 3.3 (local) | 70.1% | 76.2% | $0 | 338 hours |
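Scoring a model against the human-coded sample is the same one-line comparison once the manually labelled projects are lined up against the model output. A sketch with hypothetical column names and toy data:

```python
import pandas as pd

# Hypothetical merge of human-coded projects with one model's labels
validation = pd.DataFrame({
    "human_label": ["GREEN", "BROWN", "GREY", "NEUTRAL", "GREEN"],
    "model_label": ["GREEN", "BROWN", "BROWN", "NEUTRAL", "GREEN"],
})

overall = (validation["human_label"] == validation["model_label"]).mean()

green = validation[validation["human_label"] == "GREEN"]
green_agreement = (green["human_label"] == green["model_label"]).mean()

print(f"Overall agreement: {overall:.1%}")        # share of projects where human and model match
print(f"Green-project agreement: {green_agreement:.1%}")
```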
::: callout-note
## Open Source Models: Promise vs. Reality

We tested two open source options:

- **Deepseek v3**: Technically open source but too large to run locally. We used their API, which performed excellently.
- **Llama 3.3**: Small enough to run on a Mac Mini with 64GB RAM. Performance was poor (70% accuracy) and glacially slow (2 weeks for full dataset).

The gap between frontier models (whether closed like Claude or API-accessible like Deepseek) and truly local models remains substantial.
:::

## The Iterative Development Process

::: callout-tip
## Start Small, Then Scale

When developing your classification system:

1. **Test with 5-10 examples** to refine your prompt
2. **Validate on 50-100 projects** to catch edge cases
3. **Run a larger test (500-1000)** to identify infrastructure issues
4. **Only then process your full dataset**

This approach saves time, money, and frustration. We caught several bugs and found several prompt improvements during small-scale testing, avoiding mistakes that would have been expensive at full scale. (A minimal sketch of this staged approach follows below.)
:::
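One way to hard-wire that discipline into a script is to gate each run on an explicit sample size, so the full dataset is only ever processed deliberately. A minimal sketch; the stage names are our own illustration, and it reuses the hypothetical `classify()` helper sketched earlier.

```python
import random

# Stage sizes mirroring the callout above; None means "the full dataset"
STAGES = {"prompt_dev": 10, "edge_cases": 100, "infrastructure": 1_000, "full": None}

def run_stage(descriptions: list[str], stage: str) -> dict[str, str]:
    """Classify a random sample sized for the given stage, using classify() from the earlier sketch."""
    n = STAGES[stage]
    sample = descriptions if n is None else random.sample(descriptions, n)
    return {d: classify(d) for d in sample}

# Typical progression: inspect the output of each stage before moving to the next
# results = run_stage(all_descriptions, "prompt_dev")
# results = run_stage(all_descriptions, "edge_cases")
```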
## The Unexpected Challenge: Content Moderation

Everything ran smoothly until 56 projects repeatedly failed with "Content Exists Risk" errors. Investigation revealed the failing projects mentioned politically sensitive Chinese figures like Xi Jinping's wife and disgraced former officials.

Since these names were incidental to project descriptions, we replaced them with "a Chinese official." The classification resumed without issues.
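The workaround amounted to a redaction pass on the descriptions that triggered the filter. A simplified sketch of the idea; the name list and helper are illustrative (the actual substitutions were made in the project data, not hard-coded like this), and it again assumes the `classify()` helper from earlier.

```python
# Illustrative redaction pass: replace incidental mentions of sensitive names
# before re-submitting descriptions that failed with a content-moderation error.
SENSITIVE_NAMES = ["<name redacted>", "<another name>"]   # placeholders, not the real list

def redact(description: str) -> str:
    for name in SENSITIVE_NAMES:
        description = description.replace(name, "a Chinese official")
    return description

def classify_with_retry(description: str) -> str:
    try:
        return classify(description)                      # classify() as sketched earlier
    except Exception as err:                              # provider raised a moderation error
        if "Content Exists Risk" in str(err):
            return classify(redact(description))
        raise
```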
!["Non Traditional" Debugging](images/non_traditional_debugging.png)

::: callout-note
## Content Moderation Reality

All LLM providers implement content moderation—not just Chinese companies. I once asked Gemini about a Trump administration policy's constitutionality, and it refused to answer because it said it didn't want to provide a potentially incorrect answer to a politically sensitive question.

For researchers using public datasets like AidData's GCDF 3.0, these issues are manageable. Those working with sensitive data should carefully evaluate provider policies.
:::
## Policy-Relevant Findings

The classification revealed surprising insights:

### Finding 1: Limited Green Investment

- Only $86.5 billion in green investments (5.8% of total Chinese lending)
- Dominated by large hydropower (71.6%) and nuclear (12.0%)
- Minimal solar (3.2%) and wind (3.7%) despite rhetoric

### Finding 2: No Green Surge Over Time

Despite talk of a "Green BRI," our data through 2021 showed no significant increase in renewable energy financing.

### Finding 3: Bifurcated Co-financing Networks

Green projects rely on public development banks while commercial co-financing focuses on traditional infrastructure—with little overlap between these networks.
## Transparency and Reproducibility

We published everything:

- **[27-page methodological appendix](https://odi.org/en/publications/greener-on-the-other-side-mapping-chinas-overseas-co-financing-and-financial-innovation/)**
- **[Complete code on GitHub](https://github.com/Teal-Insights/odi_china_lending_llm_classification)**
- **All prompts and validation data**

This transparency serves multiple purposes:

1. **Exposes assumptions to scrutiny**: Our definition of "green" is contentious. By sharing our classification criteria, others can challenge or adapt it.

2. **Enables others to build on our work**: The name standardization alone took enormous effort. Why should others reinvent that wheel?

3. **Raises the bar on reproducibility**: While we didn't achieve full reproducibility (that would require packaging an open source model in a Docker container), we took significant steps toward transparency.

As two authors with a limited budget, we relied on LLM coding assistance to achieve this level of documentation and code quality.
## The Transformation of Research Possibilities

This project doesn't represent doing the impossible—someone with large grant funding could have hired teams to classify these projects manually. Instead, it shows how LLMs dramatically expand what's possible for researchers with limited resources.

We face difficult policy challenges with constrained budgets. Tools that allow us to do more ambitious research with less funding are exciting and important. That's why we're working to push this conversation forward through transparent documentation and workshops like this one.

## Key Takeaways

1. **LLMs offer consistency at scale** that humans can't sustain for repetitive tasks
2. **Multi-stage validation builds confidence**—test models against each other, then against human judgment
3. **Iterative development saves time and money**—start small, catch bugs early
4. **Transparency enables progress**—share your methods so others can build on them
5. **Perfect is the enemy of good**—focus on enabling research that wouldn't happen otherwise

## What You Can Do Now

**For your own research:**

1. Identify classification bottlenecks in your work
2. Start with 10-20 examples to test feasibility
3. Build validation into your process from the beginning
4. Share your methods and code openly
5. Focus on research questions that matter, not perfect methods

**For the field:**

- Contribute to emerging best practices
- Build on others' work rather than starting from scratch
- Be transparent about both successes and limitations
- Remember: we're all figuring this out together

## The Bigger Picture

This project demonstrates how LLMs can transform resource-constrained research. We moved from assumptions about China's role in the energy transition to evidence-based analysis that informs real policy decisions—all with two researchers and minimal budget.

The technology doesn't replace human judgment. It amplifies human expertise, allowing us to tackle questions at a scale that reveals patterns invisible to traditional methods. That's the promise worth pursuing.

---

*This concludes our workshop on AI for the Skeptical Scholar. Thank you for joining us on this journey toward more ambitious, transparent, and impactful research.*

advanced_casestudy_files/libs/bootstrap/bootstrap-e19dc0c07aeef78048e587c3f1edba7a.min.css

Lines changed: 12 additions & 0 deletions
Some generated files are not rendered by default.
