|
1 | | -# Data Classification Research - Parent Entity Analysis |
2 | | -<img width="420" height="360" alt="project-thumbnail" src="https://github.com/user-attachments/assets/0482247c-10eb-4615-afca-7e2eb1a37b90" /> |
| 1 | +# Parent Entity Classification |
3 | 2 |
|
4 | | -## Project Overview |
| 3 | + |
5 | 4 |
|
6 | | -This research project focuses on identifying individuals vs. companies vs. family firms in vertical ownership patterns across global markets. The dataset contains approximately 1.4 million unique parent entities from over 120 countries, classified into four categories: |
| 5 | +Three-stage cascade: Rule-Based, LLM API, XLM-RoBERTa. Each stage runs only on entities the previous stage could not classify with confidence. |
7 | 6 |
|
8 | | -- **Individual**: Personal names (e.g., "John Smith") |
9 | | -- **Company**: Corporate entities (e.g., "Midwest Cargo Inc.") |
10 | | -- **Family Firm**: Family-owned businesses (e.g., "John Smith & Sons Inc.") |
11 | | -- **Government**: Government branches and agencies (e.g., "Metropolis Water Agency") |
| 7 | +## Overview |
12 | 8 |
|
13 | | -## Research Objective |
| 9 | +Classification of parent entities in vertical ownership chains across global markets. ~1.4M entities from 120+ countries, four categories: |
14 | 10 |
|
15 | | -The goal is to study different vertical ownership patterns of firms across the world, where a firm "A" indirectly controls a firm "C" through an intermediary company "B" (A → B → C). The main research question is whether widely controlled firms behave differently from family firms or government-owned firms. |
| 11 | +- **Individual** – Personal names |
| 12 | +- **Company** – Corporate entities (Inc., Ltd., GmbH, etc.) |
| 13 | +- **Family Firm** – Family-owned (e.g. Smith & Sons Ltd) |
| 14 | +- **Government** – Government agencies and public institutions |
16 | 15 |
|
17 | | -## Web Application Features |
| 16 | +Research question: do vertical ownership patterns, where firm A controls firm C through an intermediary B (A → B → C), differ among widely held firms, family firms, and government-owned firms? |
18 | 17 |
|
19 | | -### 🏠 Dashboard |
20 | | -- **Overview Statistics**: Countries analyzed, total files, entity count, and languages |
21 | | -- **Interactive Charts**: Entity type distribution and top countries by entity count |
22 | | -- **Project Information**: Research background and objectives |
| 18 | +## Model Workflow |
23 | 19 |
|
24 | | -### 📁 File Browser |
25 | | -- **Complete File List**: All processed CSV and stats files from 120+ countries |
26 | | -- **Search & Filter**: Find files by country code or file type |
27 | | -- **File Viewer**: View CSV data in formatted tables and stats in readable format |
28 | | -- **Download**: Download individual files for further analysis |
| 20 | +1. **Rule-based** – Keywords, suffixes (GMBH, INC, LTD), government patterns. 56.5% coverage. |
| 21 | +2. **LLM API** – Claude Haiku via Anthropic Batches. 8.8% coverage. |
| 22 | +3. **XLM-RoBERTa** – Trained on labels produced by stages 1 and 2. No per-call API cost at inference. Target ~89% coverage. |
29 | 23 |
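The fall-through behaviour of the three stages listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the confidence values, the `0.9` threshold, the "SONS" family rule, and the `llm_stub` / `model_stub` placeholders are all assumptions, and only the suffixes GMBH, INC, and LTD come from the README.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A stage returns (label, confidence) or None when it has no opinion.
StageResult = Optional[Tuple[str, float]]
Stage = Callable[[str], StageResult]


@dataclass
class Prediction:
    label: str         # individual, company, family_firm, or government
    confidence: float  # hypothetical score in [0, 1]
    stage: str         # name of the stage that produced the label


# Suffixes named in the README; a real rule set would be much larger.
COMPANY_SUFFIXES = {"GMBH", "INC", "LTD"}


def rule_based(name: str) -> StageResult:
    """Toy rule stage: family indicator first, then corporate suffix."""
    tokens = name.upper().replace(".", "").replace(",", " ").split()
    if "SONS" in tokens:  # hypothetical family-firm rule
        return ("family_firm", 0.95)
    if tokens and tokens[-1] in COMPANY_SUFFIXES:
        return ("company", 0.95)
    return None


def llm_stub(name: str) -> StageResult:
    """Placeholder for the LLM API stage; always abstains here."""
    return None


def model_stub(name: str) -> StageResult:
    """Placeholder for the XLM-RoBERTa stage; returns a fixed guess."""
    return ("individual", 0.92)


def classify(name: str, stages: List[Tuple[str, Stage]],
             threshold: float = 0.9) -> Optional[Prediction]:
    """Return the first prediction that meets the confidence threshold."""
    for stage_name, stage in stages:
        result = stage(name)
        if result is not None and result[1] >= threshold:
            label, confidence = result
            return Prediction(label, confidence, stage_name)
    return None  # no stage was confident enough


STAGES = [
    ("rule_based", rule_based),
    ("llm_api", llm_stub),
    ("xlm_roberta", model_stub),
]
```

With these stubs, `classify("Smith & Sons Ltd", STAGES)` is resolved by the rule stage, while a bare personal name falls through to the last stage. Ordering the stages cheapest-first means the costlier ones only see the remainder.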
|
30 | | -### 📊 Data Visualizations |
31 | | -- **Interactive Charts**: Generate custom visualizations for any country |
32 | | -- **Multiple Chart Types**: Entity type, language, and city distribution |
33 | | -- **Real-time Generation**: Select country and chart type to create visualizations |
| 24 | +## Data |
34 | 25 |
|
35 | | -### 📝 Research Notes & Feedback |
36 | | -- **Note Management**: Add, categorize, and manage research notes |
37 | | -- **Categories**: General, Methodology, Data Quality, Corrections, Improvements |
38 | | -- **Persistent Storage**: Notes saved locally for continued collaboration |
39 | | -- **Filtering**: Filter notes by category for organized review |
40 | | - |
41 | | -## File Structure |
| 26 | +### File structure |
42 | 27 |
|
43 | 28 | ``` |
44 | 29 | August/ |
45 | | -├── done_processed_AD_data.csv # Andorra processed data |
46 | | -├── done_processed_AD_data_stats.txt # Andorra statistics |
47 | | -├── done_processed_AF_data.csv # Afghanistan processed data |
48 | | -├── done_processed_AF_data_stats.txt # Afghanistan statistics |
| 30 | +├── done_processed_{CC}_data.csv # Per-country CSV |
| 31 | +├── done_processed_{CC}_data_stats.txt |
49 | 32 | └── ... (120+ countries) |
50 | 33 | ``` |
51 | 34 |
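As a small sketch of how the naming scheme above could be used, the helper below extracts the country code from a file name. It assumes `{CC}` is a two-letter uppercase code (as in the AD and AF examples); the function name `parse_data_file` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches done_processed_{CC}_data.csv and done_processed_{CC}_data_stats.txt.
# Assumption: {CC} is exactly two uppercase letters.
_FILE_RE = re.compile(
    r"^done_processed_(?P<cc>[A-Z]{2})_data(?P<kind>\.csv|_stats\.txt)$"
)


def parse_data_file(filename: str) -> Optional[Tuple[str, str]]:
    """Return (country_code, kind) where kind is 'csv' or 'stats', else None."""
    match = _FILE_RE.match(filename)
    if match is None:
        return None
    kind = "csv" if match.group("kind") == ".csv" else "stats"
    return match.group("cc"), kind
```

Returning `None` for non-matching names lets a caller skip unrelated files in the directory instead of failing on them.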
|
52 | | -### CSV File Format |
53 | | -Each CSV file contains the following columns: |
54 | | -- `parent_name`: Name of the parent entity |
55 | | -- `parent_id`: Unique identifier |
56 | | -- `parent_city`: City location |
57 | | -- `language`: Detected language code |
58 | | -- `entity_type`: Classification (individual, company, family_firm, government) |
59 | | - |
60 | | -### Stats File Format |
61 | | -Each stats file contains: |
62 | | -- Total record count |
63 | | -- Language distribution |
64 | | -- Entity type distribution |
65 | | -- Corrections made during processing |
66 | | -- Research notes and observations |
67 | | - |
68 | | -## How to Use the Web Application |
69 | | - |
70 | | -### For Professors/Researchers |
71 | | - |
72 | | -1. **View Dashboard**: Start with the overview to understand the scope and scale of the research |
73 | | -2. **Browse Files**: Use the file browser to explore specific countries or regions of interest |
74 | | -3. **Analyze Data**: View CSV files in formatted tables and read detailed statistics |
75 | | -4. **Create Visualizations**: Generate custom charts for specific countries or data aspects |
76 | | -5. **Add Notes**: Leave feedback, corrections, or suggestions using the notes system |
77 | | - |
78 | | -### Navigation Tips |
79 | | - |
80 | | -- **Search Files**: Use the search bar to find specific countries (e.g., "US", "UK", "DE") |
81 | | -- **Filter by Type**: Choose to view only CSV files or stats files |
82 | | -- **View Details**: Click "View" on any file to see its contents in a modal window |
83 | | -- **Generate Charts**: Select a country and chart type to create custom visualizations |
84 | | -- **Manage Notes**: Add categorized notes and filter them for organized review |
85 | | - |
86 | | -## Technical Implementation |
87 | | - |
88 | | -### Frontend Technologies |
89 | | -- **HTML5**: Semantic structure and accessibility |
90 | | -- **CSS3**: Modern styling with gradients, animations, and responsive design |
91 | | -- **JavaScript (ES6+)**: Interactive functionality and data processing |
92 | | -- **Bootstrap 5**: Responsive UI framework |
93 | | -- **Chart.js**: Interactive data visualizations |
94 | | -- **PapaParse**: CSV parsing and display |
95 | | - |
96 | | -### Key Features |
97 | | -- **Responsive Design**: Works on desktop, tablet, and mobile devices |
98 | | -- **Local Storage**: Notes persist between sessions |
99 | | -- **Real-time Search**: Instant file filtering and search |
100 | | -- **Interactive Charts**: Dynamic data visualizations |
101 | | -- **Modal File Viewer**: Clean file viewing experience |
102 | | - |
103 | | -## Research Methodology |
104 | | - |
105 | | -### Data Processing Pipeline |
106 | | -1. **Raw Data Collection**: 1.4M+ parent entities from 120+ countries |
107 | | -2. **Language Detection**: Automatic language identification for multi-language support |
108 | | -3. **Entity Classification**: ML-based classification into four categories |
109 | | -4. **Quality Control**: Manual review and correction of classifications |
110 | | -5. **Statistical Analysis**: Generation of comprehensive statistics per country |
111 | | - |
112 | | -### Classification Criteria |
113 | | -- **Individual**: Personal names, typically first and last names |
114 | | -- **Company**: Corporate entities with business suffixes (Inc., Ltd., Corp., etc.) |
115 | | -- **Family Firm**: Family-owned businesses with family indicators (& Sons, & Co., etc.) |
116 | | -- **Government**: Government agencies, public institutions, state-owned entities |
117 | | - |
118 | | -## Future Enhancements |
119 | | - |
120 | | -### Planned Features |
121 | | -- **Real-time Data Updates**: Live data synchronization |
122 | | -- **Advanced Analytics**: Statistical significance testing |
123 | | -- **Export Functionality**: PDF reports and data exports |
124 | | -- **Collaborative Features**: Multi-user note sharing |
125 | | -- **API Integration**: Direct database connectivity |
126 | | - |
127 | | -### Research Extensions |
128 | | -- **Temporal Analysis**: Ownership pattern changes over time |
129 | | -- **Geographic Clustering**: Regional ownership pattern analysis |
130 | | -- **Industry Classification**: Sector-specific ownership patterns |
131 | | -- **Cross-border Analysis**: International ownership structures |
132 | | - |
133 | | -## Contact & Support |
134 | | - |
135 | | -For questions about the research methodology, data processing, or web application functionality, please contact the research team. |
136 | | - |
137 | | ---- |
138 | | - |
139 | | -**Note**: This web application is designed for academic research purposes and provides a comprehensive interface for exploring and analyzing the processed parent entity classification data. |
140 | | -======= |
141 | | -# Research-Data |
142 | | -Parent entity classification analysis |
143 | | ->>>>>>> bbbdbfa48999d1cfb257fc8d2015f26ec38ac749 |
| 35 | +### CSV columns |
| 36 | + |
| 37 | +| Column | Description | |
| 38 | +|--------------|----------------------| |
| 39 | +| parent_name | Entity name | |
| 40 | +| parent_id | Orbis ID | |
| 41 | +| parent_city | City | |
| 42 | +| language | Detected language | |
| 43 | +| entity_type | individual, company, family_firm, government | |
| 44 | + |
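The column layout above could be checked and summarised with the standard `csv` module, roughly as below. This is a sketch under assumptions: it presumes a header row and comma delimiters, and the function name `entity_type_counts` is hypothetical.

```python
import csv
from collections import Counter
from typing import TextIO

EXPECTED_COLUMNS = ["parent_name", "parent_id", "parent_city",
                    "language", "entity_type"]
ENTITY_TYPES = {"individual", "company", "family_firm", "government"}


def entity_type_counts(handle: TextIO) -> Counter:
    """Validate the header and labels, then count rows per entity type."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    counts: Counter = Counter()
    for row in reader:
        label = row["entity_type"].strip().lower()
        if label not in ENTITY_TYPES:
            raise ValueError(f"unexpected entity_type: {label!r}")
        counts[label] += 1
    return counts
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")`; the encoding is also an assumption about how the CSVs were written.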
| 45 | +### Stats file |
| 46 | + |
| 47 | +Per-country totals, language distribution, entity type distribution, corrections log. |
| 48 | + |
| 49 | +## Web interface |
| 50 | + |
| 51 | +- **Report** – Summary metrics, workflow diagram, classification results |
| 52 | +- **Simulation** – Entity classification walkthrough |
| 53 | +- **Data by Country** – Browse by country, view CSV, stats, charts |
| 54 | + |
| 55 | +Hosted at [puneethkotha.github.io/Research-Data](https://puneethkotha.github.io/Research-Data/). |
| 56 | + |
| 57 | +## Stack |
| 58 | + |
| 59 | +HTML, CSS, JavaScript, Chart.js. No backend; data served as static files. |
| 60 | + |
| 61 | +## Contact |
| 62 | + |
| 63 | +Puneeth Kotha · Prof. Belen Villalonga (Supervisor) · NYU Stern |