
Commit 5576614

Author: Puneeth
README: professional copy, Model Workflow diagram, no emojis
Made-with: Cursor
1 parent fd645e0 commit 5576614

2 files changed

Lines changed: 47 additions & 127 deletions


README.md

@@ -1,143 +1,63 @@
1-
# Data Classification Research - Parent Entity Analysis
2-
<img width="420" height="360" alt="project-thumbnail" src="https://github.com/user-attachments/assets/0482247c-10eb-4615-afca-7e2eb1a37b90" />
1+
# Parent Entity Classification
32

4-
## Project Overview
3+
![Model Workflow](model-workflow.png)
54

6-
This research project focuses on identifying individuals vs. companies vs. family firms in vertical ownership patterns across global markets. The dataset contains approximately 1.4 million unique parent entities from over 120 countries, classified into four categories:
5+
Three-stage cascade: rule-based, LLM API, XLM-RoBERTa. Each stage runs only when the previous stage cannot classify an entity with confidence.
76

8-
- **Individual**: Personal names (e.g., "John Smith")
9-
- **Company**: Corporate entities (e.g., "Midwest Cargo Inc.")
10-
- **Family Firm**: Family-owned businesses (e.g., "John Smith & Sons Inc.")
11-
- **Government**: Government branches and agencies (e.g., "Metropolis Water Agency")
7+
## Overview
128

13-
## Research Objective
9+
Classification of parent entities in vertical ownership chains across global markets. ~1.4M entities from 120+ countries, four categories:
1410

15-
The goal is to study different vertical ownership patterns of firms across the world, where a firm "A" indirectly controls a firm "C" through an intermediary company "B" (A → B → C). The main research question is whether widely controlled firms behave differently from family firms or government-owned firms.
11+
- **Individual** – Personal names
12+
- **Company** – Corporate entities (Inc., Ltd., GmbH, etc.)
13+
- **Family Firm** – Family-owned (e.g. Smith & Sons Ltd)
14+
- **Government** – Government agencies and public institutions
1615

17-
## Web Application Features
16+
Research question: do ownership patterns (A → B → C chains) differ between widely held firms, family firms, and government-owned firms?
1817

19-
### 🏠 Dashboard
20-
- **Overview Statistics**: Countries analyzed, total files, entity count, and languages
21-
- **Interactive Charts**: Entity type distribution and top countries by entity count
22-
- **Project Information**: Research background and objectives
18+
## Model Workflow
2319

24-
### 📁 File Browser
25-
- **Complete File List**: All processed CSV and stats files from 120+ countries
26-
- **Search & Filter**: Find files by country code or file type
27-
- **File Viewer**: View CSV data in formatted tables and stats in readable format
28-
- **Download**: Download individual files for further analysis
20+
1. **Rule-based** – Keywords, suffixes (GMBH, INC, LTD), government patterns. 56.5% coverage.
21+
2. **LLM API** – Claude Haiku via Anthropic Batches. 8.8% coverage.
22+
3. **XLM-RoBERTa** – Trained on the output of steps 1 and 2. Zero-cost inference. Target ~89% coverage.
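
The cascade described above can be sketched as a list of stages tried in order, where each stage either returns a label or declines. This is a minimal illustration, not the project's implementation: the keyword lists, the `rule_based` logic, and the two stub stages (`llm_stub`, `model_stub`) are assumptions that stand in for the real rules, the Anthropic batch call, and the trained XLM-RoBERTa model.

```python
from typing import Callable, Optional

# A stage returns a label, or None when it cannot classify with confidence.
Stage = Callable[[str], Optional[str]]

# Hypothetical keyword lists for illustration only; the project's real rules
# are not part of this README.
COMPANY_SUFFIXES = ("GMBH", "INC", "LTD")
GOVERNMENT_KEYWORDS = ("MINISTRY", "AGENCY", "MUNICIPALITY")


def rule_based(name: str) -> Optional[str]:
    """Stage 1 sketch: match government keywords, family markers, suffixes."""
    tokens = name.upper().replace(".", "").replace(",", "").split()
    if any(tok in GOVERNMENT_KEYWORDS for tok in tokens):
        return "government"
    if "SONS" in tokens:
        return "family_firm"
    if tokens and tokens[-1] in COMPANY_SUFFIXES:
        return "company"
    return None  # defer to the next stage


def llm_stub(name: str) -> Optional[str]:
    """Placeholder for the LLM API stage; always defers in this sketch."""
    return None


def model_stub(name: str) -> Optional[str]:
    """Placeholder for the XLM-RoBERTa stage; always answers in this sketch."""
    return "individual"


STAGES: list[tuple[str, Stage]] = [
    ("rule_based", rule_based),
    ("llm_api", llm_stub),
    ("xlm_roberta", model_stub),
]


def classify(name: str, stages: list[tuple[str, Stage]]) -> tuple[str, str]:
    """Return (label, stage_name) from the first stage that gives an answer."""
    for stage_name, stage in stages:
        label = stage(name)
        if label is not None:
            return label, stage_name
    return "unclassified", "none"
```

Returning the stage name alongside the label makes it straightforward to tally per-stage coverage figures like the ones listed above.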
2923

30-
### 📊 Data Visualizations
31-
- **Interactive Charts**: Generate custom visualizations for any country
32-
- **Multiple Chart Types**: Entity type, language, and city distribution
33-
- **Real-time Generation**: Select country and chart type to create visualizations
24+
## Data
3425

35-
### 📝 Research Notes & Feedback
36-
- **Note Management**: Add, categorize, and manage research notes
37-
- **Categories**: General, Methodology, Data Quality, Corrections, Improvements
38-
- **Persistent Storage**: Notes saved locally for continued collaboration
39-
- **Filtering**: Filter notes by category for organized review
40-
41-
## File Structure
26+
### File structure
4227

4328
```
4429
August/
45-
├── done_processed_AD_data.csv # Andorra processed data
46-
├── done_processed_AD_data_stats.txt # Andorra statistics
47-
├── done_processed_AF_data.csv # Afghanistan processed data
48-
├── done_processed_AF_data_stats.txt # Afghanistan statistics
30+
├── done_processed_{CC}_data.csv # Per-country CSV
31+
├── done_processed_{CC}_data_stats.txt
4932
└── ... (120+ countries)
5033
```
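
The naming pattern in the tree above can be handled with a small helper. This sketch assumes `{CC}` is a two-letter uppercase country code (as in the earlier `AD` and `AF` examples); `country_files` and `parse_file_name` are hypothetical names, not functions from the repository.

```python
import re
from pathlib import Path
from typing import Optional

DATA_DIR = Path("August")

# Assumption: {CC} is a two-letter uppercase country code.
FILE_PATTERN = re.compile(r"^done_processed_([A-Z]{2})_data(_stats\.txt|\.csv)$")


def country_files(country_code: str) -> tuple[Path, Path]:
    """Build the CSV and stats paths for one country code."""
    cc = country_code.upper()
    return (
        DATA_DIR / f"done_processed_{cc}_data.csv",
        DATA_DIR / f"done_processed_{cc}_data_stats.txt",
    )


def parse_file_name(file_name: str) -> Optional[tuple[str, str]]:
    """Return (country_code, kind) for a matching file name, else None."""
    match = FILE_PATTERN.match(file_name)
    if match is None:
        return None
    kind = "csv" if match.group(2) == ".csv" else "stats"
    return match.group(1), kind
```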
5134

52-
### CSV File Format
53-
Each CSV file contains the following columns:
54-
- `parent_name`: Name of the parent entity
55-
- `parent_id`: Unique identifier
56-
- `parent_city`: City location
57-
- `language`: Detected language code
58-
- `entity_type`: Classification (individual, company, family_firm, government)
59-
60-
### Stats File Format
61-
Each stats file contains:
62-
- Total record count
63-
- Language distribution
64-
- Entity type distribution
65-
- Corrections made during processing
66-
- Research notes and observations
67-
68-
## How to Use the Web Application
69-
70-
### For Professors/Researchers
71-
72-
1. **View Dashboard**: Start with the overview to understand the scope and scale of the research
73-
2. **Browse Files**: Use the file browser to explore specific countries or regions of interest
74-
3. **Analyze Data**: View CSV files in formatted tables and read detailed statistics
75-
4. **Create Visualizations**: Generate custom charts for specific countries or data aspects
76-
5. **Add Notes**: Leave feedback, corrections, or suggestions using the notes system
77-
78-
### Navigation Tips
79-
80-
- **Search Files**: Use the search bar to find specific countries (e.g., "US", "UK", "DE")
81-
- **Filter by Type**: Choose to view only CSV files or stats files
82-
- **View Details**: Click "View" on any file to see its contents in a modal window
83-
- **Generate Charts**: Select a country and chart type to create custom visualizations
84-
- **Manage Notes**: Add categorized notes and filter them for organized review
85-
86-
## Technical Implementation
87-
88-
### Frontend Technologies
89-
- **HTML5**: Semantic structure and accessibility
90-
- **CSS3**: Modern styling with gradients, animations, and responsive design
91-
- **JavaScript (ES6+)**: Interactive functionality and data processing
92-
- **Bootstrap 5**: Responsive UI framework
93-
- **Chart.js**: Interactive data visualizations
94-
- **PapaParse**: CSV parsing and display
95-
96-
### Key Features
97-
- **Responsive Design**: Works on desktop, tablet, and mobile devices
98-
- **Local Storage**: Notes persist between sessions
99-
- **Real-time Search**: Instant file filtering and search
100-
- **Interactive Charts**: Dynamic data visualizations
101-
- **Modal File Viewer**: Clean file viewing experience
102-
103-
## Research Methodology
104-
105-
### Data Processing Pipeline
106-
1. **Raw Data Collection**: 1.4M+ parent entities from 120+ countries
107-
2. **Language Detection**: Automatic language identification for multi-language support
108-
3. **Entity Classification**: ML-based classification into four categories
109-
4. **Quality Control**: Manual review and correction of classifications
110-
5. **Statistical Analysis**: Generation of comprehensive statistics per country
111-
112-
### Classification Criteria
113-
- **Individual**: Personal names, typically first and last names
114-
- **Company**: Corporate entities with business suffixes (Inc., Ltd., Corp., etc.)
115-
- **Family Firm**: Family-owned businesses with family indicators (& Sons, & Co., etc.)
116-
- **Government**: Government agencies, public institutions, state-owned entities
117-
118-
## Future Enhancements
119-
120-
### Planned Features
121-
- **Real-time Data Updates**: Live data synchronization
122-
- **Advanced Analytics**: Statistical significance testing
123-
- **Export Functionality**: PDF reports and data exports
124-
- **Collaborative Features**: Multi-user note sharing
125-
- **API Integration**: Direct database connectivity
126-
127-
### Research Extensions
128-
- **Temporal Analysis**: Ownership pattern changes over time
129-
- **Geographic Clustering**: Regional ownership pattern analysis
130-
- **Industry Classification**: Sector-specific ownership patterns
131-
- **Cross-border Analysis**: International ownership structures
132-
133-
## Contact & Support
134-
135-
For questions about the research methodology, data processing, or web application functionality, please contact the research team.
136-
137-
---
138-
139-
**Note**: This web application is designed for academic research purposes and provides a comprehensive interface for exploring and analyzing the processed parent entity classification data.
140-
=======
141-
# Research-Data
142-
Parent entity classification analysis
143-
>>>>>>> bbbdbfa48999d1cfb257fc8d2015f26ec38ac749
35+
### CSV columns
36+
37+
| Column | Description |
38+
|--------------|----------------------|
39+
| parent_name | Entity name |
40+
| parent_id | Orbis ID |
41+
| parent_city | City |
42+
| language | Detected language |
43+
| entity_type | individual, company, family_firm, government |
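
As a rough illustration of working with these columns, the sketch below checks that a CSV has the five columns from the table and tallies `entity_type` values. The function name `entity_type_counts` and the sample rows (including the IDs and cities) are made up for the example; the files are assumed to be comma-separated with a header row.

```python
import csv
import io
from collections import Counter
from typing import TextIO

EXPECTED_COLUMNS = ["parent_name", "parent_id", "parent_city", "language", "entity_type"]
ENTITY_TYPES = {"individual", "company", "family_firm", "government"}


def entity_type_counts(csv_file: TextIO) -> Counter:
    """Count rows per entity_type; unrecognized labels are counted as 'unknown'."""
    reader = csv.DictReader(csv_file)
    missing = [col for col in EXPECTED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    counts: Counter = Counter()
    for row in reader:
        label = (row["entity_type"] or "").strip().lower()
        counts[label if label in ENTITY_TYPES else "unknown"] += 1
    return counts


# Made-up rows for illustration; not taken from the dataset.
SAMPLE = """parent_name,parent_id,parent_city,language,entity_type
John Smith,ID001,Boston,en,individual
Midwest Cargo Inc.,ID002,Chicago,en,company
Smith & Sons Ltd,ID003,Leeds,en,family_firm
"""

sample_counts = entity_type_counts(io.StringIO(SAMPLE))
```

For a real file, pass an open handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`.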
44+
45+
### Stats file
46+
47+
Per-country totals, language distribution, entity type distribution, corrections log.
48+
49+
## Web interface
50+
51+
- **Report** – Summary metrics, workflow diagram, classification results
52+
- **Simulation** – Entity classification walkthrough
53+
- **Data by Country** – Browse by country, view CSV, stats, charts
54+
55+
Hosted at [puneethkotha.github.io/Research-Data](https://puneethkotha.github.io/Research-Data/).
56+
57+
## Stack
58+
59+
HTML, CSS, JavaScript, Chart.js. No backend; data served as static files.
60+
61+
## Contact
62+
63+
Puneeth Kotha · Prof. Belen Villalonga (Supervisor) · NYU Stern

model-workflow.png (116 KB)
