Pavan chandru 119 request data: Synthetic data generation for request_comments and volunteer_rating #164
Open
PavanChandru wants to merge 93 commits into
Includes tested SQL outputs and Matplotlib charts covering key metrics
…of Task saayam-for-all#30) Includes complete implementation of JWT-based authentication and role-based access control (RBAC). Endpoints are protected based on user roles (admin, volunteer, requestor)
…s/, tests/, scripts/
This PR includes: - Mock data generation script - Generated CSV files for volunteer_applications and user_skills -
…ies and subcategories
…d_info" This reverts commit 83bcaf6.
…and volunteer_rating (saayam-for-all#133)
* saayam-for-all#119 Add synthetic Data generator for request_comments and volunteer_rating
* saayam-for-all#119 Upgrade synthetic data generator with 3-tier ML-grade text pipeline
  - Added 3-tier text generation: compositional grammar, stochastic perturbation, and diversity validator for ML-training-grade diversity
  - Added standalone usage mode (--rows flag)
  - Regenerated request_comments.csv and volunteer_rating.csv with higher-quality diverse data
…omments and volunteer_rating (saayam-for-all#133)" This reverts commit 93ac4c6.
Lambda function for Request_application_analytics
This reverts commit 1793457.
This reverts commit edfdb67.
…nts and volunteer_rating tables
Pull request overview
This PR substantially broadens the data repo: it adds synthetic database assets, new data-engineering/data-analytics code paths, deployment/infrastructure scaffolding, and a large documentation/onboarding rewrite. Within the overall codebase, it looks like an attempt to bootstrap the database mock-data workflow plus several backend/analytics utilities, but the scope is much larger than the PR title/description suggest.
Changes:
- Added mock-data generators and CSV outputs for volunteer/request-related tables plus many database lookup tables.
- Added new data-engineering/data-analytics application code, Lambdas, scrapers, models, deployment scripts, and infrastructure manifests.
- Replaced/added extensive documentation for onboarding, workflows, and project context.
Reviewed changes
Copilot reviewed 74 out of 96 changed files in this pull request and generated 44 comments.
| File | Description |
|---|---|
| `README.md` | Replaced top-level repo README with onboarding/team handbook. |
| `LICENSE` | Deleted repository license file. |
| `database/README.MD` | Added placeholder database README. |
| `database/mock-data-generation/volunteer_applications.py` | Added volunteer application mock-data generator. |
| `database/mock-data-generation/utils.py` | Added shared mock-data utilities/constants. |
| `database/mock-data-generation/user_skills.py` | Added user-skills derivation script. |
| `database/mock-data-generation/user_skills.csv` | Added generated user-skills CSV. |
| `database/mock-data-generation/README.md` | Added mock-data generation documentation. |
| `database/mock-data-generation/generate_request_data.py` | Added request comments / volunteer rating generator. |
| `database/mock-data-generation/generate_mock_data.py` | Added volunteer mock-data entrypoint. |
| `database/mock_db/volunteer_rating.csv` | Added generated volunteer rating CSV. |
| `database/mock_db/users.csv` | Added placeholder users CSV. |
| `database/mock_db/request_comments.csv` | Added generated request comments CSV. |
| `database/lookup_tables/user_status.csv` | Added user-status lookup data. |
| `database/lookup_tables/user_category.csv` | Added user-category lookup data. |
| `database/lookup_tables/supporting_languages.csv` | Added languages lookup data. |
| `database/lookup_tables/request_type.csv` | Added request-type lookup data. |
| `database/lookup_tables/request_status.csv` | Added request-status lookup data. |
| `database/lookup_tables/request_priority.csv` | Added request-priority lookup data. |
| `database/lookup_tables/request_isleadvol.csv` | Added lead-volunteer lookup data. |
| `database/lookup_tables/request_for.csv` | Added request-for lookup data. |
| `database/lookup_tables/req_add_info_metadata.csv` | Added request additional-info metadata. |
| `database/lookup_tables/notification_types.csv` | Added notification-types lookup data. |
| `database/lookup_tables/notification_channels.csv` | Added notification-channel lookup data. |
| `database/lookup_tables/help_categories.csv` | Added help-category lookup data. |
| `database/lookup_tables/help_categories_map.csv` | Added help-category hierarchy map. |
| `database/lookup_tables/country.csv` | Added country lookup data. |
| `database/lookup_tables/.gitkeep` | Added lookup-tables placeholder. |
| `database/.gitkeep` | Added database directory placeholder. |
| `data-engineering/tests/.gitkeep` | Added tests directory placeholder. |
| `data-engineering/tests/__init__.py` | Added tests package marker. |
| `data-engineering/test_categorizer.py` | Added categorizer test script. |
| `data-engineering/TASK_TRACKER.md` | Added team task tracker. |
| `data-engineering/src/utils/get_tables_info.py` | Added DB table-inspection utility. |
| `data-engineering/src/utils/__init__.py` | Added utils package marker. |
| `data-engineering/src/translation/lang_detection.py` | Added language detection/translation helper. |
| `data-engineering/src/translation/__init__.py` | Added translation package marker. |
| `data-engineering/src/scrapers/ngo/malaysia.py` | Added Malaysia NGO scraper. |
| `data-engineering/src/scrapers/ngo/india.py` | Added India NGO scraper. |
| `data-engineering/src/scrapers/ngo/afghanistan.py` | Added Afghanistan NGO scraper. |
| `data-engineering/src/scrapers/ngo/__init__.py` | Added NGO scrapers package marker. |
| `data-engineering/src/scrapers/emergency_contacts/scraper.py` | Added emergency-contacts scraper. |
| `data-engineering/src/scrapers/emergency_contacts/loader.py` | Added emergency-contacts DB loader. |
| `data-engineering/src/scrapers/emergency_contacts/cleaner.py` | Added emergency-contacts cleaner. |
| `data-engineering/src/scrapers/emergency_contacts/__init__.py` | Added emergency-contacts package marker. |
| `data-engineering/src/scrapers/__init__.py` | Added scrapers package marker. |
| `data-engineering/src/saayam-org-aggregator/requirements.txt` | Added aggregator Lambda dependency notes. |
| `data-engineering/src/saayam-org-aggregator/lambda_function.py` | Added org-aggregator Lambda entrypoint. |
| `data-engineering/src/saayam-org-aggregator/helpers.py` | Added org-aggregator DB/GenAI helpers. |
| `data-engineering/src/models/fraud_requests.py` | Added fraud-requests model. |
| `data-engineering/src/models/__init__.py` | Added models package marker. |
| `data-engineering/src/main.py` | Added FastAPI analytics/auth app. |
| `data-engineering/src/extensions.py` | Added Flask SQLAlchemy extension holder. |
| `data-engineering/src/config.py` | Added Flask config module. |
| `data-engineering/src/categorizer/requirements.txt` | Added categorizer Lambda dependencies. |
| `data-engineering/src/categorizer/handler.py` | Added categorizer Lambda handler. |
| `data-engineering/src/categorizer/classifier.py` | Added OpenAI-based classifier. |
| `data-engineering/src/categorizer/categories.py` | Added classifier category map. |
| `data-engineering/src/categorizer/__init__.py` | Added categorizer package marker. |
| `data-engineering/src/app.py` | Added Flask app for fraud/translation APIs. |
| `data-engineering/src/aggregate-daily-metrics/lambda_function.py` | Added daily-metrics Lambda entrypoint. |
| `data-engineering/src/aggregate-daily-metrics/helpers.py` | Added daily-metrics DB/S3 helpers. |
| `data-engineering/src/aggregate-daily-metrics/__init__.py` | Added metrics package marker. |
| `data-engineering/src/__init__.py` | Added src package marker. |
| `data-engineering/scripts/deploy/deploy_categorizer.sh` | Added categorizer deploy script. |
| `data-engineering/scripts/deploy/deploy_aggregator.sh` | Added aggregator deploy script. |
| `data-engineering/requirements.txt` | Added Python dependency list. |
| `data-engineering/KNOWLEDGE_TRANSFER.md` | Added engineering knowledge-transfer doc. |
| `data-engineering/infrastructure/service.yaml` | Added Kubernetes Service manifest. |
| `data-engineering/infrastructure/Dockerfile` | Added container build config. |
| `data-engineering/infrastructure/docker-compose.yml` | Added local compose config. |
| `data-engineering/infrastructure/deployment.yaml` | Added Kubernetes Deployment manifest. |
| `data-engineering/deploy-lambda.yml` | Added non-active workflow-like deploy file. |
| `data-engineering/datasets/raw/emergency_numbers.csv` | Added raw emergency numbers dataset. |
| `data-engineering/datasets/cleaned/cleaned_emergency_numbers.csv` | Added cleaned emergency numbers dataset. |
| `data-engineering/CONTRIBUTING.md` | Added engineering contribution guide. |
| `data-engineering/.gitignore` | Added engineering gitignore. |
| `data-engineering/.env.example` | Added environment template. |
| `data-analytics/sql/my_query.sql` | Added placeholder SQL file. |
| `data-analytics/sql/.gitkeep` | Added SQL directory placeholder. |
| `data-analytics/README.md` | Added analytics README. |
| `data-analytics/notebooks/.gitkeep` | Added notebooks placeholder. |
| `data-analytics/mock-data-generation/readme.md` | Added placeholder analytics mock-data README. |
| `data-analytics/lambda_functions/volunteer_application_analytics.py` | Added volunteer analytics Lambda. |
| `data-analytics/lambda_functions/beneficiariesTrendAnalysis.py` | Added beneficiaries/request trend Lambda. |
| `data-analytics/lambda_functions/application_analytics_request.py.py` | Added request analytics Lambda file. |
| `data-analytics/docs/.gitkeep` | Added docs placeholder. |
| `data-analytics/dashboards/.gitkeep` | Added dashboards placeholder. |
| `.github/workflows/deploy-lambda.yml` | Added GitHub Actions Lambda deploy workflow. |
Comments suppressed due to low confidence (1)
LICENSE:1
- This PR deletes the repository's license file without any replacement. That changes the legal terms for every downstream user and contributor and is unrelated to the mock-data work described in the PR.
Comment on lines +14 to +15

    user_ids = [f"user_{i}" for i in range(1, 501)]
    request_ids = [f"req_{i}" for i in range(1, 1001)]
Comment on lines +168 to +171

    ## output Files:

    Genereated files will be in :
    database/mock_db/
Comment on lines +152 to +155

    ## Features
    - Generates realistic text using Faker
    - Maintains referential integrity for user_id and request_id
    - Ensures logical timestamp ordering (created_at ≤ last_updated_at)
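
The two integrity guarantees listed above can be sketched as follows. This is a minimal illustration, not the PR's generator: it uses only the standard library in place of Faker, and `make_comment_rows`, the `comment_id` column, and the time windows are assumptions made for the example.

```python
import random
from datetime import datetime, timedelta

# ID pools in the same style as the generator excerpt (user_1..user_500, req_1..req_1000).
USER_IDS = [f"user_{i}" for i in range(1, 501)]
REQUEST_IDS = [f"req_{i}" for i in range(1, 1001)]


def make_comment_rows(num_rows, seed=0):
    """Build rows whose foreign keys and timestamps are consistent by construction."""
    rng = random.Random(seed)
    base = datetime(2024, 1, 1)  # arbitrary start date, assumed for the example
    rows = []
    for comment_id in range(1, num_rows + 1):
        created_at = base + timedelta(minutes=rng.randint(0, 30 * 24 * 60))
        # Adding a non-negative delta guarantees created_at <= last_updated_at.
        last_updated_at = created_at + timedelta(minutes=rng.randint(0, 7 * 24 * 60))
        rows.append({
            "comment_id": comment_id,  # hypothetical column name
            # Drawing only from the known pools keeps every foreign key valid.
            "user_id": rng.choice(USER_IDS),
            "request_id": rng.choice(REQUEST_IDS),
            "created_at": created_at,
            "last_updated_at": last_updated_at,
        })
    return rows
```

Sampling IDs from a fixed pool and deriving the second timestamp from the first means neither property needs a separate validation pass.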

    # will keep all url and configurations

    SQLALCHEMY_DATABASE_URI = 'postgresql://postgres:password@host.docker.internal:5432/Saayam'
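
The URI above embeds a password in source control. One common alternative, sketched here under assumptions, is to read the URI from the environment. `DATABASE_URL` is the variable name that appears in the commented-out Dockerfile line in this PR; `get_database_uri` is a hypothetical helper, not part of the PR.

```python
import os


def get_database_uri(environ=None):
    """Return the database URI from DATABASE_URL, failing loudly if it is unset."""
    env = os.environ if environ is None else environ
    uri = env.get("DATABASE_URL")
    if not uri:
        # Refusing to start is safer than silently falling back to a baked-in password.
        raise RuntimeError("DATABASE_URL is not set")
    return uri
```

A config module could then assign `SQLALCHEMY_DATABASE_URI = get_database_uri()` at import time, with the actual value supplied by the container or deployment environment.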
Comment on lines +94 to +101

    @app.post("/token")
    def login(form_data: OAuth2PasswordRequestForm = Depends()):
        user = USER_DB.get(form_data.username)
        if not user:
            raise HTTPException(status_code=401, detail="Invalid credentials")
        token_data = {"sub": user["username"], "role": user["role"]}
        access_token = create_jwt_token(token_data)
        return {"access_token": access_token, "token_type": "bearer"}
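
As excerpted, the handler issues a token for any known username and shows no password check. A minimal sketch of adding one with the standard library follows. The `salt` and `digest` fields, the sample user, and the helper names are assumptions for illustration; a production service would more likely rely on a dedicated password-hashing library.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=100_000):
    """Derive a PBKDF2-HMAC-SHA256 digest; returns (salt, digest)."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password, salt, expected_digest, iterations=100_000):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected_digest)


# Hypothetical in-memory store shaped like the USER_DB lookup in the excerpt.
_salt, _digest = hash_password("s3cret")
USER_DB = {"alice": {"username": "alice", "role": "admin", "salt": _salt, "digest": _digest}}


def authenticate(username, password):
    """Return the user record only if the user exists and the password matches."""
    user = USER_DB.get(username)
    if user is None or not verify_password(password, user["salt"], user["digest"]):
        return None
    return user
```

In the endpoint, a `None` result from `authenticate` would map to the same 401 response for both unknown users and wrong passwords, so the response does not reveal which usernames exist.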
Comment on lines +6 to +11

    fake = Faker()

    # -----------------------------
    # CONFIG
    # -----------------------------
    NUM_ROWS = 100  # keep small as per requirement

    # Set environment variables
    #ENV DATABASE_URL="postgresql://postgres:password@localhost:5432/Saayam"
    ENV FLASK_APP=app.py
Comment on lines +106 to +121

    @app.get("/analytics/total_requestors", response_model=List[UserCategoryCount], dependencies=[Depends(check_user_role("admin"))])
    def get_total_users():
        conn = get_db_connection()
        if not conn:
            raise HTTPException(status_code=500, detail="DB connection failed")
        try:
            cur = conn.cursor()
            cur.execute("""
                SELECT uc.user_category, COUNT(u.user_id) AS total_users
                FROM user_category uc
                LEFT JOIN users u ON u.user_category_id = uc.user_category_id
                GROUP BY uc.user_category
                ORDER BY total_users DESC;
            """)
            result = cur.fetchall()
            return [{"user_category": row[0], "total_users": row[1]} for row in result]
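
The excerpt ends inside the `try` block, so it does not show whether the cursor and connection are released. One way to guarantee cleanup is sketched below, with an in-memory SQLite database standing in for the real one so the example runs on its own. The sample rows and the `fetch_category_counts` and `demo` names are assumptions.

```python
import sqlite3
from contextlib import closing

QUERY = """
    SELECT uc.user_category, COUNT(u.user_id) AS total_users
    FROM user_category uc
    LEFT JOIN users u ON u.user_category_id = uc.user_category_id
    GROUP BY uc.user_category
    ORDER BY total_users DESC;
"""


def fetch_category_counts(conn):
    """Run the aggregate query; closing() releases the cursor even on error."""
    with closing(conn.cursor()) as cur:
        cur.execute(QUERY)
        return [{"user_category": row[0], "total_users": row[1]} for row in cur.fetchall()]


def demo():
    # In-memory SQLite stands in for the real database (assumption for the sketch).
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.executescript("""
            CREATE TABLE user_category (user_category_id INTEGER PRIMARY KEY, user_category TEXT);
            CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_category_id INTEGER);
            INSERT INTO user_category VALUES (1, 'volunteer'), (2, 'requestor'), (3, 'admin');
            INSERT INTO users VALUES (1, 1), (2, 1), (3, 2);
        """)
        return fetch_category_counts(conn)
```

Because the query uses a `LEFT JOIN` and counts `u.user_id`, a category with no users still appears in the result with a count of zero.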
Comment on lines +68 to +80

    def merge_organizations(db_organizations, genAI_organizations):
        try:
            db_organizations = db_organizations.rename(columns={
                'org_name': 'name',
                'city_name': 'location',
                'phone': 'contact'
            })[['name', 'location', 'contact', 'email', 'web_url', 'mission', 'source', "db_or_ai"]]

            genAI_organizations = genAI_organizations.rename(columns={
                'organization_name': 'name'
            })[['name', 'location', 'contact', 'email', 'web_url', 'mission', 'source', "db_or_ai"]]

            return pd.concat([db_organizations, genAI_organizations], ignore_index=True)
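
The function above selects a fixed column list from both frames, which in pandas raises `KeyError` when a source lacks one of those columns. The sketch below shows the same rename-then-project idea in a form that tolerates missing fields. It is a plain-Python stand-in that works on lists of dicts rather than DataFrames, and `normalize` and the constant names are assumptions.

```python
# Shared output schema, taken from the column list in the excerpt.
SCHEMA = ["name", "location", "contact", "email", "web_url", "mission", "source", "db_or_ai"]

# Rename maps, taken from the excerpt.
DB_RENAMES = {"org_name": "name", "city_name": "location", "phone": "contact"}
GENAI_RENAMES = {"organization_name": "name"}


def normalize(records, renames):
    """Rename keys, then project each record onto SCHEMA, filling gaps with None."""
    normalized = []
    for record in records:
        renamed = {renames.get(key, key): value for key, value in record.items()}
        normalized.append({column: renamed.get(column) for column in SCHEMA})
    return normalized


def merge_organizations(db_records, genai_records):
    """Concatenate both sources after normalizing them to the same columns."""
    return normalize(db_records, DB_RENAMES) + normalize(genai_records, GENAI_RENAMES)
```

With DataFrames, a comparable effect could be had by reindexing each frame to the shared column list before concatenating, at the cost of introducing missing values that downstream code must handle.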
Comment on lines +4 to +9

    pull_request:
      types: [closed]
      branches: [main]
      paths:
        - 'data-engineering/src/saayam-org-*/**'
Implemented synthetic data generation for request_comments and volunteer_rating tables.
Ready for review.
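
One of the commit messages in this PR describes a three-tier text pipeline: compositional grammar, stochastic perturbation, and a diversity validator. The sketch below shows one plausible shape for such a pipeline. The vocabulary, probabilities, and function names are all assumptions and are not taken from `generate_request_data.py`.

```python
import random

# Tier 1: compositional grammar (slot fillers are illustrative, not the PR's vocabulary).
OPENERS = ["Thanks for", "Following up on", "Quick update on", "Question about"]
TOPICS = ["the grocery delivery", "the ride request", "the tutoring session", "the medical visit"]
CLOSERS = ["Please confirm.", "Let me know.", "Much appreciated.", "Will check back soon."]


def compose(rng):
    """Fill each grammar slot independently to build a base sentence."""
    return f"{rng.choice(OPENERS)} {rng.choice(TOPICS)}. {rng.choice(CLOSERS)}"


# Tier 2: stochastic perturbation (random casing and punctuation tweaks).
def perturb(text, rng):
    if rng.random() < 0.3:
        text = text.lower()
    if rng.random() < 0.3:
        text = text.rstrip(".") + "!"
    return text


# Tier 3: diversity validator (rejects case-insensitive exact duplicates).
def generate_comments(num_rows, seed=0, max_attempts=10_000):
    rng = random.Random(seed)
    seen, comments = set(), []
    for _ in range(max_attempts):
        if len(comments) == num_rows:
            break
        candidate = perturb(compose(rng), rng)
        key = candidate.lower()
        if key not in seen:
            seen.add(key)
            comments.append(candidate)
    return comments
```

The validator here only rejects exact repeats; a stricter version might compare token overlap between candidates, which this sketch does not attempt.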