Pavan chandru 119 request data: Synthetic data generation for request_comments and volunteer_rating by PavanChandru · Pull Request #164 · saayam-for-all/data

PavanChandru · 2026-05-04T21:18:34Z

Implemented synthetic data generation for request_comments and volunteer_rating tables.

- Generated realistic mock data using Faker
- Maintained referential integrity for user_id and request_id
- Followed schema defined in db_info.json
- Created CSV outputs in mock_db
- Updated README with usage instructions

Ready for review.


…and volunteer_rating (saayam-for-all#133)
* saayam-for-all#119 Add synthetic Data generator for request_comments and volunteer_rating
* saayam-for-all#119 Upgrade synthetic data generator with 3-tier ML-grade text pipeline
  - Added 3-tier text generation: compositional grammar, stochastic perturbation, and diversity validator for ML-training-grade diversity
  - Added standalone usage mode (--rows flag)
  - Regenerated request_comments.csv and volunteer_rating.csv with higher-quality diverse data


Copilot

Pull request overview

This PR substantially broadens the data repo: it adds synthetic database assets, new data-engineering/data-analytics code paths, deployment/infrastructure scaffolding, and a large documentation/onboarding rewrite. Within the overall codebase, it looks like an attempt to bootstrap the database mock-data workflow plus several backend/analytics utilities, but the scope is much larger than the PR title/description suggest.

Changes:

- Added mock-data generators and CSV outputs for volunteer/request-related tables plus many database lookup tables.
- Added new data-engineering/data-analytics application code, Lambdas, scrapers, models, deployment scripts, and infrastructure manifests.
- Replaced/added extensive documentation for onboarding, workflows, and project context.

Reviewed changes

Copilot reviewed 74 out of 96 changed files in this pull request and generated 44 comments.


File	Description
`README.md`	Replaced top-level repo README with onboarding/team handbook.
`LICENSE`	Deleted repository license file.
`database/README.MD`	Added placeholder database README.
`database/mock-data-generation/volunteer_applications.py`	Added volunteer application mock-data generator.
`database/mock-data-generation/utils.py`	Added shared mock-data utilities/constants.
`database/mock-data-generation/user_skills.py`	Added user-skills derivation script.
`database/mock-data-generation/user_skills.csv`	Added generated user-skills CSV.
`database/mock-data-generation/README.md`	Added mock-data generation documentation.
`database/mock-data-generation/generate_request_data.py`	Added request comments / volunteer rating generator.
`database/mock-data-generation/generate_mock_data.py`	Added volunteer mock-data entrypoint.
`database/mock_db/volunteer_rating.csv`	Added generated volunteer rating CSV.
`database/mock_db/users.csv`	Added placeholder users CSV.
`database/mock_db/request_comments.csv`	Added generated request comments CSV.
`database/lookup_tables/user_status.csv`	Added user-status lookup data.
`database/lookup_tables/user_category.csv`	Added user-category lookup data.
`database/lookup_tables/supporting_languages.csv`	Added languages lookup data.
`database/lookup_tables/request_type.csv`	Added request-type lookup data.
`database/lookup_tables/request_status.csv`	Added request-status lookup data.
`database/lookup_tables/request_priority.csv`	Added request-priority lookup data.
`database/lookup_tables/request_isleadvol.csv`	Added lead-volunteer lookup data.
`database/lookup_tables/request_for.csv`	Added request-for lookup data.
`database/lookup_tables/req_add_info_metadata.csv`	Added request additional-info metadata.
`database/lookup_tables/notification_types.csv`	Added notification-types lookup data.
`database/lookup_tables/notification_channels.csv`	Added notification-channel lookup data.
`database/lookup_tables/help_categories.csv`	Added help-category lookup data.
`database/lookup_tables/help_categories_map.csv`	Added help-category hierarchy map.
`database/lookup_tables/country.csv`	Added country lookup data.
`database/lookup_tables/.gitkeep`	Added lookup-tables placeholder.
`database/.gitkeep`	Added database directory placeholder.
`data-engineering/tests/.gitkeep`	Added tests directory placeholder.
`data-engineering/tests/__init__.py`	Added tests package marker.
`data-engineering/test_categorizer.py`	Added categorizer test script.
`data-engineering/TASK_TRACKER.md`	Added team task tracker.
`data-engineering/src/utils/get_tables_info.py`	Added DB table-inspection utility.
`data-engineering/src/utils/__init__.py`	Added utils package marker.
`data-engineering/src/translation/lang_detection.py`	Added language detection/translation helper.
`data-engineering/src/translation/__init__.py`	Added translation package marker.
`data-engineering/src/scrapers/ngo/malaysia.py`	Added Malaysia NGO scraper.
`data-engineering/src/scrapers/ngo/india.py`	Added India NGO scraper.
`data-engineering/src/scrapers/ngo/afghanistan.py`	Added Afghanistan NGO scraper.
`data-engineering/src/scrapers/ngo/__init__.py`	Added NGO scrapers package marker.
`data-engineering/src/scrapers/emergency_contacts/scraper.py`	Added emergency-contacts scraper.
`data-engineering/src/scrapers/emergency_contacts/loader.py`	Added emergency-contacts DB loader.
`data-engineering/src/scrapers/emergency_contacts/cleaner.py`	Added emergency-contacts cleaner.
`data-engineering/src/scrapers/emergency_contacts/__init__.py`	Added emergency-contacts package marker.
`data-engineering/src/scrapers/__init__.py`	Added scrapers package marker.
`data-engineering/src/saayam-org-aggregator/requirements.txt`	Added aggregator Lambda dependency notes.
`data-engineering/src/saayam-org-aggregator/lambda_function.py`	Added org-aggregator Lambda entrypoint.
`data-engineering/src/saayam-org-aggregator/helpers.py`	Added org-aggregator DB/GenAI helpers.
`data-engineering/src/models/fraud_requests.py`	Added fraud-requests model.
`data-engineering/src/models/__init__.py`	Added models package marker.
`data-engineering/src/main.py`	Added FastAPI analytics/auth app.
`data-engineering/src/extensions.py`	Added Flask SQLAlchemy extension holder.
`data-engineering/src/config.py`	Added Flask config module.
`data-engineering/src/categorizer/requirements.txt`	Added categorizer Lambda dependencies.
`data-engineering/src/categorizer/handler.py`	Added categorizer Lambda handler.
`data-engineering/src/categorizer/classifier.py`	Added OpenAI-based classifier.
`data-engineering/src/categorizer/categories.py`	Added classifier category map.
`data-engineering/src/categorizer/__init__.py`	Added categorizer package marker.
`data-engineering/src/app.py`	Added Flask app for fraud/translation APIs.
`data-engineering/src/aggregate-daily-metrics/lambda_function.py`	Added daily-metrics Lambda entrypoint.
`data-engineering/src/aggregate-daily-metrics/helpers.py`	Added daily-metrics DB/S3 helpers.
`data-engineering/src/aggregate-daily-metrics/__init__.py`	Added metrics package marker.
`data-engineering/src/__init__.py`	Added src package marker.
`data-engineering/scripts/deploy/deploy_categorizer.sh`	Added categorizer deploy script.
`data-engineering/scripts/deploy/deploy_aggregator.sh`	Added aggregator deploy script.
`data-engineering/requirements.txt`	Added Python dependency list.
`data-engineering/KNOWLEDGE_TRANSFER.md`	Added engineering knowledge-transfer doc.
`data-engineering/infrastructure/service.yaml`	Added Kubernetes Service manifest.
`data-engineering/infrastructure/Dockerfile`	Added container build config.
`data-engineering/infrastructure/docker-compose.yml`	Added local compose config.
`data-engineering/infrastructure/deployment.yaml`	Added Kubernetes Deployment manifest.
`data-engineering/deploy-lambda.yml`	Added non-active workflow-like deploy file.
`data-engineering/datasets/raw/emergency_numbers.csv`	Added raw emergency numbers dataset.
`data-engineering/datasets/cleaned/cleaned_emergency_numbers.csv`	Added cleaned emergency numbers dataset.
`data-engineering/CONTRIBUTING.md`	Added engineering contribution guide.
`data-engineering/.gitignore`	Added engineering gitignore.
`data-engineering/.env.example`	Added environment template.
`data-analytics/sql/my_query.sql`	Added placeholder SQL file.
`data-analytics/sql/.gitkeep`	Added SQL directory placeholder.
`data-analytics/README.md`	Added analytics README.
`data-analytics/notebooks/.gitkeep`	Added notebooks placeholder.
`data-analytics/mock-data-generation/readme.md`	Added placeholder analytics mock-data README.
`data-analytics/lambda_functions/volunteer_application_analytics.py`	Added volunteer analytics Lambda.
`data-analytics/lambda_functions/beneficiariesTrendAnalysis.py`	Added beneficiaries/request trend Lambda.
`data-analytics/lambda_functions/application_analytics_request.py.py`	Added request analytics Lambda file.
`data-analytics/docs/.gitkeep`	Added docs placeholder.
`data-analytics/dashboards/.gitkeep`	Added dashboards placeholder.
`.github/workflows/deploy-lambda.yml`	Added GitHub Actions Lambda deploy workflow.

Comments suppressed due to low confidence (1)

LICENSE:1

This PR deletes the repository's license file without any replacement. That changes the legal terms for every downstream user and contributor and is unrelated to the mock-data work described in the PR.


+user_ids = [f"user_{i}" for i in range(1, 501)]
+request_ids = [f"req_{i}" for i in range(1, 1001)]
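The two lists above build IDs from a fixed pattern. If the generated rows are meant to join against the other mock tables, one option is to read the IDs from the parent tables' CSV files instead. A minimal sketch, assuming a parent CSV with a `user_id` column; the `load_ids` helper and the sample contents are hypothetical and not taken from the PR:

```python
import csv
import io
import random


def load_ids(csv_file, column):
    """Return the distinct, non-empty values of `column`, preserving file order."""
    reader = csv.DictReader(csv_file)
    values = ((row.get(column) or "").strip() for row in reader)
    # dict.fromkeys de-duplicates while keeping first-seen order.
    return list(dict.fromkeys(v for v in values if v))


# In-memory stand-in for a parent table such as users.csv (contents are illustrative).
users_csv = io.StringIO("user_id,name\nuser_1,A\nuser_2,B\nuser_2,B\n")
user_ids = load_ids(users_csv, "user_id")

rng = random.Random(42)  # fixed seed so reruns draw the same sample
sampled = [rng.choice(user_ids) for _ in range(5)]
```

Every sampled value is guaranteed to exist in the parent file, which is the property a foreign-key check would later verify.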


+## Output files
+
+Generated files will be in:
+database/mock_db/


+## Features
+- Generates realistic text using Faker
+- Maintains referential integrity for user_id and request_id
+- Ensures logical timestamp ordering (created_at ≤ last_updated_at)
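The timestamp guarantee listed above can be met by construction rather than by checking after the fact. A minimal sketch using only the standard library; the function name, the 90-day window, and the 72-hour maximum gap are illustrative assumptions, not values from the PR:

```python
import random
from datetime import datetime, timedelta


def random_timestamps(rng, start, window_days=90, max_gap_hours=72):
    """Return (created_at, last_updated_at) with created_at <= last_updated_at."""
    created_at = start + timedelta(seconds=rng.randrange(window_days * 24 * 3600))
    # A non-negative offset keeps the ordering true by construction.
    gap = timedelta(seconds=rng.randrange(max_gap_hours * 3600 + 1))
    return created_at, created_at + gap


rng = random.Random(0)
pairs = [random_timestamps(rng, datetime(2026, 1, 1)) for _ in range(100)]
```

Because the update time is derived from the creation time, no row can violate the ordering regardless of the seed.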


+# will keep all url and configurations
+
+
+SQLALCHEMY_DATABASE_URI = 'postgresql://postgres:password@host.docker.internal:5432/Saayam'
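The URI above embeds a database password in source. A common alternative is to read the connection string from the environment and fail loudly when it is missing. A minimal sketch; the variable name `DATABASE_URL` matches the commented-out `ENV` line in the Dockerfile excerpt from this PR, while the `get_database_uri` helper and the example credentials are hypothetical:

```python
import os


def get_database_uri(env=os.environ):
    """Return the connection string from DATABASE_URL, raising if it is unset."""
    uri = env.get("DATABASE_URL")
    if not uri:
        raise RuntimeError("DATABASE_URL is not set")
    return uri


# Example with an explicit mapping instead of the real process environment.
SQLALCHEMY_DATABASE_URI = get_database_uri(
    {"DATABASE_URL": "postgresql://user:secret@db.example:5432/Saayam"}
)
```

Passing the mapping as a parameter keeps the function easy to exercise in tests without touching the real environment.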


+@app.post("/token")
+def login(form_data: OAuth2PasswordRequestForm = Depends()):
+    user = USER_DB.get(form_data.username)
+    if not user:
+        raise HTTPException(status_code=401, detail="Invalid credentials")
+    token_data = {"sub": user["username"], "role": user["role"]}
+    access_token = create_jwt_token(token_data)
+    return {"access_token": access_token, "token_type": "bearer"}
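As excerpted, `login` looks up the user but never inspects `form_data.password`, so any known username would receive a token. A minimal sketch of a password check that could run before the token is issued, using only the standard library. The `salt` and `password_hash` fields, the iteration count, and both helper functions are assumptions for illustration; the actual shape of `USER_DB` is not shown in the excerpt:

```python
import hashlib
import hmac
import os


def hash_password(password, salt, iterations=100_000):
    """Derive a key from the password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def verify_password(user, password):
    """Return True only if the user exists and the password matches."""
    if user is None:
        return False
    candidate = hash_password(password, user["salt"])
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(candidate, user["password_hash"])


# Hypothetical user store with one account.
salt = os.urandom(16)
USER_DB = {
    "alice": {
        "username": "alice",
        "role": "admin",
        "salt": salt,
        "password_hash": hash_password("s3cret", salt),
    }
}

ok = verify_password(USER_DB.get("alice"), "s3cret")
bad = verify_password(USER_DB.get("alice"), "wrong")
missing = verify_password(USER_DB.get("bob"), "s3cret")
```

In the endpoint, a `False` result would map to the same 401 response already used for unknown usernames, so callers cannot distinguish the two cases.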


+fake = Faker()
+
+# -----------------------------
+# CONFIG
+# -----------------------------
+NUM_ROWS = 100   # keep small as per requirement
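The row count above is a module-level constant. If a command-line override is wanted, `argparse` can expose it while keeping 100 as the default. A minimal sketch; the `--rows` flag name mirrors the one mentioned in an earlier commit message quoted on this PR, but this parser and the `positive_int` helper are hypothetical and not the PR's code:

```python
import argparse

NUM_ROWS = 100  # default row count, as in the excerpt above


def positive_int(text):
    """argparse type: accept only integers greater than zero."""
    value = int(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("rows must be a positive integer")
    return value


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate mock CSV rows.")
    parser.add_argument(
        "--rows",
        type=positive_int,
        default=NUM_ROWS,
        help="number of rows to generate per table",
    )
    return parser.parse_args(argv)


default_args = parse_args([])
custom_args = parse_args(["--rows", "250"])
```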


+
+# Set environment variables
+#ENV DATABASE_URL="postgresql://postgres:password@localhost:5432/Saayam"
+ENV FLASK_APP=app.py 


+@app.get("/analytics/total_requestors", response_model=List[UserCategoryCount], dependencies=[Depends(check_user_role("admin"))])
+def get_total_users():
+    conn = get_db_connection()
+    if not conn:
+        raise HTTPException(status_code=500, detail="DB connection failed")
+    try:
+        cur = conn.cursor()
+        cur.execute("""
+            SELECT uc.user_category, COUNT(u.user_id) AS total_users
+            FROM user_category uc
+            LEFT JOIN users u ON u.user_category_id = uc.user_category_id
+            GROUP BY uc.user_category
+            ORDER BY total_users DESC;
+        """)
+        result = cur.fetchall()
+        return [{"user_category": row[0], "total_users": row[1]} for row in result]
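The excerpt ends inside the `try` block, so the cleanup path is not visible. One way to make sure the cursor and connection are released on every path is `contextlib.closing`. The sketch below swaps in an in-memory SQLite database so it runs without PostgreSQL; the table contents are made up, and `get_db_connection` here is a stand-in, not the PR's helper:

```python
import sqlite3
from contextlib import closing


def get_db_connection():
    """Stand-in connection factory backed by an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE user_category (user_category_id INTEGER PRIMARY KEY, user_category TEXT);
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_category_id INTEGER);
        INSERT INTO user_category VALUES (1, 'volunteer'), (2, 'requestor');
        INSERT INTO users VALUES (1, 1), (2, 1), (3, 2);
    """)
    return conn


def get_total_users():
    # closing() calls .close() on exit, even if the query raises.
    with closing(get_db_connection()) as conn, closing(conn.cursor()) as cur:
        cur.execute("""
            SELECT uc.user_category, COUNT(u.user_id) AS total_users
            FROM user_category uc
            LEFT JOIN users u ON u.user_category_id = uc.user_category_id
            GROUP BY uc.user_category
            ORDER BY total_users DESC;
        """)
        return [{"user_category": row[0], "total_users": row[1]} for row in cur.fetchall()]


totals = get_total_users()
```

The same `with closing(...)` pattern applies to any DB-API connection whose object exposes `close()`.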


+def merge_organizations(db_organizations, genAI_organizations):
+    try:
+        db_organizations = db_organizations.rename(columns={
+            'org_name': 'name',
+            'city_name': 'location',
+            'phone': 'contact'
+        })[['name', 'location', 'contact', 'email', 'web_url', 'mission', 'source', "db_or_ai"]]
+
+        genAI_organizations = genAI_organizations.rename(columns={
+            'organization_name': 'name'
+        })[['name', 'location', 'contact', 'email', 'web_url', 'mission', 'source', "db_or_ai"]]
+
+        return pd.concat([db_organizations, genAI_organizations], ignore_index=True)


+  pull_request:
+    types: [closed]
+    branches: [main]
+    paths:
+      - 'data-engineering/src/saayam-org-*/**'
+


shubhamnarkhede and others added 30 commits May 4, 2026 16:10

Init

da002b1

Initial Fraud Detection Logic (30 mins multiple request restriction)

86ce4e6

Containerized the application in Docker

34f1374

Minikube Implementation

4b19c7e

Adding Emergency Numbers dataset

74b7dd8

Adding data to db

94332f8

Lang Translation WIP

cb9ab27

Lang Detect and Translation API

f276cd4

Updating Test Branch

13709c2

Adding scripts for 3 countries

169ad53

SQL queries and visualizations for Subtask saayam-for-all#33

1507cf1

Includes tested SQL outputs and Matplotlib charts covering key metrics

Add FastAPI analytics endpoints with RBAC (Subtask saayam-for-all#35 …

4274eac

…of Task saayam-for-all#30) Includes complete implementation of JWT-based authentication and role-based access control (RBAC). Endpoints are protected based on user roles (admin, volunteer, requestor)

Create ONBOARDING.md

363feca

Update ONBOARDING.md

11780c5

Update ONBOARDING.md

0e22111

Update ONBOARDING.md

5dd6f2a

Create KnowledgeTransfer.md

8f87dd6

Restructure: split repo into data-engineering/ and data-analytics/ at…

57dc1d7

… root

Clean up: remove pycache/venv, update .gitignore

53e1463

Restructure: organize into src/, datasets/, infrastructure/, notebook…

163f665

…s/, tests/, scripts/

Add .env.example, per-Lambda requirements, deploy scripts, and CONTRI…

94c2c2d

…BUTING.md

Update imports and file paths for new project structure

9b8b87d

Add file creation guidelines to CONTRIBUTING.md

422b745

Move ONBOARDING.md to root as DATA_ENGINEERING_ONBOARDING.md

a715b48

Update DATA_ENGINEERING_ONBOARDING.md

8a40534

Delete DATA_ENGINEERING_ONBOARDING.md

e3186be

Create README.md

267837f

Delete data-engineering/README.md

648f4bf

Update CONTRIBUTING.md

98efa86

Update KnowledgeTransfer.md

9d37b7c

saquibb8 and others added 27 commits May 4, 2026 16:15

Rename readme.me to readme.md

eba2ec6

Add get_tables_info utility and mock-data-generation

0d09ba6

Update TASK_TRACKER.md

d7b9959

Update TASK_TRACKER.md

a798157

Add scripts and csv's

8d41a49

This PR includes: - Mock data generation script - Generated CSV files for volunteer_applications and user_skills -

Update README

85632b7

saayam-for-all#100: Add categories config with all predefined categor…

74ae25a

…ies and subcategories

saayam-for-all#100: Add OpenAI classifier logic for auto-categorizing…

f391abf

… help requests

saayam-for-all#100: Add Lambda handler entry point for categorizer

72fb812

saayam-for-all#100: Add openai to categorizer Lambda dependencies

6062d97

saayam-for-all#100: Add python-dotenv to local dev dependencies

ef0fff7

saayam-for-all#100: Add 10 diverse test cases for categorizer Lambda

45f2684

code for volunteer details and volunteers assigned

6017dc8

Add mock data generation for request_guest_details and req_add_info

108c3bb

Revert "Add mock data generation for request_guest_details and req_ad…

c152320

…d_info" This reverts commit 83bcaf6.

Revert "saayam-for-all#119 Add synthetic Data generator for request_c…

da99de3

…omments and volunteer_rating (saayam-for-all#133)" This reverts commit 93ac4c6.

Create beneficiariesTrendAnalysis.py

3f5fadc

Create request_application_analytics.py

130cfd0

Lambda function for Request_application_analytics

Rename request_application_analytics.py to application_analytics_requ…

24d729b

…est.py.

Create volunteer_application_analytics.py

9c0b970

Update volunteer_application_analytics.py

390e220

Update application_analytics_request.py.py

7bb5020

volunteer details and assigned tables scripts updated

06c884e

Revert "volunteer details and assigned tables scripts updated"

3241e10

This reverts commit 1793457.

Revert "code for volunteer details and volunteers assigned"

d7d1194

This reverts commit edfdb67.

saayam-for-all#119: Added synthetic data generation for request_comme…

0305e35

…nts and volunteer_rating tables

Copilot AI review requested due to automatic review settings May 4, 2026 21:18

Copilot started reviewing on behalf of PavanChandru May 4, 2026 21:19

Copilot AI reviewed May 4, 2026

PavanChandru wants to merge 93 commits into saayam-for-all:dev from PavanChandru:PavanChandru_119_request_data


16 participants
