
Pavan chandru 119 request data: Synthetic data generation for request_comments and volunteer_rating #164

Open
PavanChandru wants to merge 93 commits into saayam-for-all:dev from PavanChandru:PavanChandru_119_request_data

Conversation

@PavanChandru


Implemented synthetic data generation for request_comments and volunteer_rating tables.

  • Generated realistic mock data using Faker
  • Maintained referential integrity for user_id and request_id
  • Followed schema defined in db_info.json
  • Created CSV outputs in mock_db
  • Updated README with usage instructions

Ready for review.

shubhamnarkhede and others added 30 commits May 4, 2026 16:10
Includes tested SQL outputs and Matplotlib charts covering key metrics
…of Task saayam-for-all#30)

Includes complete implementation of JWT-based authentication and role-based access control (RBAC). Endpoints are protected based on user roles (admin, volunteer, requestor)
saquibb8 and others added 27 commits May 4, 2026 16:15
This PR includes:
- Mock data generation script 
- Generated CSV files for volunteer_applications and user_skills
-
…and volunteer_rating (saayam-for-all#133)

* saayam-for-all#119 Add synthetic Data generator for request_comments and volunteer_rating

* saayam-for-all#119 Upgrade synthetic data generator with 3-tier ML-grade text pipeline

- Added 3-tier text generation: compositional grammar, stochastic perturbation, and diversity validator for ML-training-grade diversity
- Added standalone usage mode (--rows flag)
- Regenerated request_comments.csv and volunteer_rating.csv with higher-quality diverse data
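
The three tiers named in this commit message (compositional grammar, stochastic perturbation, diversity validator) could look roughly like the sketch below. This is an illustration only: the generator's actual implementation is not shown on this page, and the word lists, function names, and the exact-match uniqueness check are all assumptions. It uses only the standard library rather than Faker so that it runs anywhere.

```python
import random

# Hypothetical vocabulary; the real generator's phrase lists are not shown here.
OPENERS = ["Thanks for the update", "I can help with this", "Following up on this request"]
DETAILS = ["the pickup is scheduled", "the volunteer has been notified", "more information is needed"]
CLOSERS = ["Please confirm.", "Will report back soon.", "Let me know if anything changes."]
FILLERS = ["honestly", "just", "quickly"]


def compose(rng: random.Random) -> str:
    """Tier 1: fill fixed grammar slots with randomly chosen phrases."""
    return f"{rng.choice(OPENERS)}, {rng.choice(DETAILS)}. {rng.choice(CLOSERS)}"


def perturb(text: str, rng: random.Random) -> str:
    """Tier 2: apply small random edits (filler insertion, lowercasing)."""
    words = text.split()
    if rng.random() < 0.5:
        words.insert(rng.randrange(1, len(words)), rng.choice(FILLERS))
    out = " ".join(words)
    if rng.random() < 0.2:
        out = out.lower()
    return out


def generate_diverse(n: int, seed: int = 0, max_attempts: int = 10_000) -> list[str]:
    """Tier 3: accept a candidate only if its lowercase form is new."""
    rng = random.Random(seed)
    seen: set[str] = set()
    results: list[str] = []
    attempts = 0
    while len(results) < n and attempts < max_attempts:
        attempts += 1
        candidate = perturb(compose(rng), rng)
        key = candidate.lower()
        if key not in seen:
            seen.add(key)
            results.append(candidate)
    return results


comments = generate_diverse(20)
```

An exact-match check is the weakest possible diversity validator; a production pipeline would more likely compare n-gram overlap or embeddings, which is outside the scope of this sketch.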
Lambda function for Request_application_analytics
Copilot AI review requested due to automatic review settings May 4, 2026 21:18

Copilot AI left a comment


Pull request overview

This PR substantially broadens the data repo: it adds synthetic database assets, new data-engineering/data-analytics code paths, deployment/infrastructure scaffolding, and a large documentation/onboarding rewrite. Within the overall codebase, it looks like an attempt to bootstrap the database mock-data workflow plus several backend/analytics utilities, but the scope is much larger than the PR title/description suggest.

Changes:

  • Added mock-data generators and CSV outputs for volunteer/request-related tables plus many database lookup tables.
  • Added new data-engineering/data-analytics application code, Lambdas, scrapers, models, deployment scripts, and infrastructure manifests.
  • Replaced/added extensive documentation for onboarding, workflows, and project context.

Reviewed changes

Copilot reviewed 74 out of 96 changed files in this pull request and generated 44 comments.

Show a summary per file
| File | Description |
| --- | --- |
| README.md | Replaced top-level repo README with onboarding/team handbook. |
| LICENSE | Deleted repository license file. |
| database/README.MD | Added placeholder database README. |
| database/mock-data-generation/volunteer_applications.py | Added volunteer application mock-data generator. |
| database/mock-data-generation/utils.py | Added shared mock-data utilities/constants. |
| database/mock-data-generation/user_skills.py | Added user-skills derivation script. |
| database/mock-data-generation/user_skills.csv | Added generated user-skills CSV. |
| database/mock-data-generation/README.md | Added mock-data generation documentation. |
| database/mock-data-generation/generate_request_data.py | Added request comments / volunteer rating generator. |
| database/mock-data-generation/generate_mock_data.py | Added volunteer mock-data entrypoint. |
| database/mock_db/volunteer_rating.csv | Added generated volunteer rating CSV. |
| database/mock_db/users.csv | Added placeholder users CSV. |
| database/mock_db/request_comments.csv | Added generated request comments CSV. |
| database/lookup_tables/user_status.csv | Added user-status lookup data. |
| database/lookup_tables/user_category.csv | Added user-category lookup data. |
| database/lookup_tables/supporting_languages.csv | Added languages lookup data. |
| database/lookup_tables/request_type.csv | Added request-type lookup data. |
| database/lookup_tables/request_status.csv | Added request-status lookup data. |
| database/lookup_tables/request_priority.csv | Added request-priority lookup data. |
| database/lookup_tables/request_isleadvol.csv | Added lead-volunteer lookup data. |
| database/lookup_tables/request_for.csv | Added request-for lookup data. |
| database/lookup_tables/req_add_info_metadata.csv | Added request additional-info metadata. |
| database/lookup_tables/notification_types.csv | Added notification-types lookup data. |
| database/lookup_tables/notification_channels.csv | Added notification-channel lookup data. |
| database/lookup_tables/help_categories.csv | Added help-category lookup data. |
| database/lookup_tables/help_categories_map.csv | Added help-category hierarchy map. |
| database/lookup_tables/country.csv | Added country lookup data. |
| database/lookup_tables/.gitkeep | Added lookup-tables placeholder. |
| database/.gitkeep | Added database directory placeholder. |
| data-engineering/tests/.gitkeep | Added tests directory placeholder. |
| data-engineering/tests/__init__.py | Added tests package marker. |
| data-engineering/test_categorizer.py | Added categorizer test script. |
| data-engineering/TASK_TRACKER.md | Added team task tracker. |
| data-engineering/src/utils/get_tables_info.py | Added DB table-inspection utility. |
| data-engineering/src/utils/__init__.py | Added utils package marker. |
| data-engineering/src/translation/lang_detection.py | Added language detection/translation helper. |
| data-engineering/src/translation/__init__.py | Added translation package marker. |
| data-engineering/src/scrapers/ngo/malaysia.py | Added Malaysia NGO scraper. |
| data-engineering/src/scrapers/ngo/india.py | Added India NGO scraper. |
| data-engineering/src/scrapers/ngo/afghanistan.py | Added Afghanistan NGO scraper. |
| data-engineering/src/scrapers/ngo/__init__.py | Added NGO scrapers package marker. |
| data-engineering/src/scrapers/emergency_contacts/scraper.py | Added emergency-contacts scraper. |
| data-engineering/src/scrapers/emergency_contacts/loader.py | Added emergency-contacts DB loader. |
| data-engineering/src/scrapers/emergency_contacts/cleaner.py | Added emergency-contacts cleaner. |
| data-engineering/src/scrapers/emergency_contacts/__init__.py | Added emergency-contacts package marker. |
| data-engineering/src/scrapers/__init__.py | Added scrapers package marker. |
| data-engineering/src/saayam-org-aggregator/requirements.txt | Added aggregator Lambda dependency notes. |
| data-engineering/src/saayam-org-aggregator/lambda_function.py | Added org-aggregator Lambda entrypoint. |
| data-engineering/src/saayam-org-aggregator/helpers.py | Added org-aggregator DB/GenAI helpers. |
| data-engineering/src/models/fraud_requests.py | Added fraud-requests model. |
| data-engineering/src/models/__init__.py | Added models package marker. |
| data-engineering/src/main.py | Added FastAPI analytics/auth app. |
| data-engineering/src/extensions.py | Added Flask SQLAlchemy extension holder. |
| data-engineering/src/config.py | Added Flask config module. |
| data-engineering/src/categorizer/requirements.txt | Added categorizer Lambda dependencies. |
| data-engineering/src/categorizer/handler.py | Added categorizer Lambda handler. |
| data-engineering/src/categorizer/classifier.py | Added OpenAI-based classifier. |
| data-engineering/src/categorizer/categories.py | Added classifier category map. |
| data-engineering/src/categorizer/__init__.py | Added categorizer package marker. |
| data-engineering/src/app.py | Added Flask app for fraud/translation APIs. |
| data-engineering/src/aggregate-daily-metrics/lambda_function.py | Added daily-metrics Lambda entrypoint. |
| data-engineering/src/aggregate-daily-metrics/helpers.py | Added daily-metrics DB/S3 helpers. |
| data-engineering/src/aggregate-daily-metrics/__init__.py | Added metrics package marker. |
| data-engineering/src/__init__.py | Added src package marker. |
| data-engineering/scripts/deploy/deploy_categorizer.sh | Added categorizer deploy script. |
| data-engineering/scripts/deploy/deploy_aggregator.sh | Added aggregator deploy script. |
| data-engineering/requirements.txt | Added Python dependency list. |
| data-engineering/KNOWLEDGE_TRANSFER.md | Added engineering knowledge-transfer doc. |
| data-engineering/infrastructure/service.yaml | Added Kubernetes Service manifest. |
| data-engineering/infrastructure/Dockerfile | Added container build config. |
| data-engineering/infrastructure/docker-compose.yml | Added local compose config. |
| data-engineering/infrastructure/deployment.yaml | Added Kubernetes Deployment manifest. |
| data-engineering/deploy-lambda.yml | Added non-active workflow-like deploy file. |
| data-engineering/datasets/raw/emergency_numbers.csv | Added raw emergency numbers dataset. |
| data-engineering/datasets/cleaned/cleaned_emergency_numbers.csv | Added cleaned emergency numbers dataset. |
| data-engineering/CONTRIBUTING.md | Added engineering contribution guide. |
| data-engineering/.gitignore | Added engineering gitignore. |
| data-engineering/.env.example | Added environment template. |
| data-analytics/sql/my_query.sql | Added placeholder SQL file. |
| data-analytics/sql/.gitkeep | Added SQL directory placeholder. |
| data-analytics/README.md | Added analytics README. |
| data-analytics/notebooks/.gitkeep | Added notebooks placeholder. |
| data-analytics/mock-data-generation/readme.md | Added placeholder analytics mock-data README. |
| data-analytics/lambda_functions/volunteer_application_analytics.py | Added volunteer analytics Lambda. |
| data-analytics/lambda_functions/beneficiariesTrendAnalysis.py | Added beneficiaries/request trend Lambda. |
| data-analytics/lambda_functions/application_analytics_request.py.py | Added request analytics Lambda file. |
| data-analytics/docs/.gitkeep | Added docs placeholder. |
| data-analytics/dashboards/.gitkeep | Added dashboards placeholder. |
| .github/workflows/deploy-lambda.yml | Added GitHub Actions Lambda deploy workflow. |
Comments suppressed due to low confidence (1)

LICENSE:1

  • This PR deletes the repository's license file without any replacement. That changes the legal terms for every downstream user and contributor and is unrelated to the mock-data work described in the PR.


Comment on lines +14 to +15
user_ids = [f"user_{i}" for i in range(1, 501)]
request_ids = [f"req_{i}" for i in range(1, 1001)]
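
The two lines above build ID pools from a fixed numeric range. If the intent is for generated rows to reference rows that exist in the parent tables, one option is to read the IDs from the parent CSVs instead. The sketch below is a hedged illustration: the column names `user_id` and `request_id`, the inline sample data standing in for files such as database/mock_db/users.csv, and the helper names are all assumptions, not code from this PR.

```python
import csv
import io
import random

# Inline stand-ins for parent-table CSV files (assumed column names).
USERS_CSV = "user_id,name\nU001,Asha\nU002,Ben\nU003,Chen\n"
REQUESTS_CSV = "request_id,user_id\nR001,U001\nR002,U003\n"


def load_ids(csv_text: str, column: str) -> list[str]:
    """Read one column of IDs from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]


def make_comment_rows(n: int, user_ids: list[str], request_ids: list[str], seed: int = 0) -> list[dict]:
    """Create n child rows whose foreign keys are drawn only from the parent ID pools."""
    rng = random.Random(seed)
    return [
        {
            "comment_id": i,
            "user_id": rng.choice(user_ids),
            "request_id": rng.choice(request_ids),
        }
        for i in range(1, n + 1)
    ]


user_ids = load_ids(USERS_CSV, "user_id")
request_ids = load_ids(REQUESTS_CSV, "request_id")
rows = make_comment_rows(10, user_ids, request_ids)
```

With real files, `io.StringIO(csv_text)` would be replaced by an opened file handle; the rest stays the same.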
Comment on lines +168 to +171
## output Files:

Genereated files will be in :
database/mock_db/
Comment on lines +152 to +155
## Features
- Generates realistic text using Faker
- Maintains referential integrity for user_id and request_id
- Ensures logical timestamp ordering (created_at ≤ last_updated_at)
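
The timestamp guarantee in the last feature (created_at ≤ last_updated_at) is easiest to enforce by construction: draw `created_at` first, then add a non-negative offset. A minimal sketch, assuming a one-year window and a 72-hour maximum edit delay (both values and the function name are hypothetical, not taken from the PR):

```python
import random
from datetime import datetime, timedelta


def make_timestamps(
    rng: random.Random,
    start: datetime,
    window_days: int = 365,
    max_edit_delay_hours: int = 72,
) -> tuple[datetime, datetime]:
    """Return (created_at, last_updated_at) with created_at <= last_updated_at."""
    created_at = start + timedelta(seconds=rng.randrange(window_days * 86_400))
    # The offset is never negative, so the ordering holds without a later check.
    delay = timedelta(seconds=rng.randrange(max_edit_delay_hours * 3_600 + 1))
    return created_at, created_at + delay


rng = random.Random(42)
pairs = [make_timestamps(rng, datetime(2025, 1, 1)) for _ in range(100)]
```

Generating the two values independently and swapping them when out of order also works, but it skews the distribution of the gap between them.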
# will keep all url and configurations


SQLALCHEMY_DATABASE_URI = 'postgresql://postgres:password@host.docker.internal:5432/Saayam'
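
The line above embeds database credentials in source. A common alternative is to read the URI from the environment, for instance from the `DATABASE_URL` variable that appears (commented out) in the Dockerfile snippet in this review. This is a sketch under that assumption; the helper name and the example URI are illustrative.

```python
import os
from typing import Mapping, Optional


def get_database_uri(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the database URI from DATABASE_URL, failing loudly if it is unset."""
    env = os.environ if environ is None else environ
    uri = env.get("DATABASE_URL", "").strip()
    if not uri:
        raise RuntimeError("DATABASE_URL is not set")
    return uri


# An explicit mapping is passed here so the sketch runs without touching
# the real environment; in config.py the call would take no argument.
SQLALCHEMY_DATABASE_URI = get_database_uri(
    {"DATABASE_URL": "postgresql://user:secret@db.example.invalid:5432/Saayam"}
)
```

Failing at startup when the variable is missing avoids silently falling back to a default password.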
Comment on lines +94 to +101
@app.post("/token")
def login(form_data: OAuth2PasswordRequestForm = Depends()):
    user = USER_DB.get(form_data.username)
    if not user:
        raise HTTPException(status_code=401, detail="Invalid credentials")
    token_data = {"sub": user["username"], "role": user["role"]}
    access_token = create_jwt_token(token_data)
    return {"access_token": access_token, "token_type": "bearer"}
Comment on lines +6 to +11
fake = Faker()

# -----------------------------
# CONFIG
# -----------------------------
NUM_ROWS = 100 # keep small as per requirement

# Set environment variables
#ENV DATABASE_URL="postgresql://postgres:password@localhost:5432/Saayam"
ENV FLASK_APP=app.py
Comment on lines +106 to +121
@app.get("/analytics/total_requestors", response_model=List[UserCategoryCount], dependencies=[Depends(check_user_role("admin"))])
def get_total_users():
    conn = get_db_connection()
    if not conn:
        raise HTTPException(status_code=500, detail="DB connection failed")
    try:
        cur = conn.cursor()
        cur.execute("""
            SELECT uc.user_category, COUNT(u.user_id) AS total_users
            FROM user_category uc
            LEFT JOIN users u ON u.user_category_id = uc.user_category_id
            GROUP BY uc.user_category
            ORDER BY total_users DESC;
        """)
        result = cur.fetchall()
        return [{"user_category": row[0], "total_users": row[1]} for row in result]
Comment on lines +68 to +80
def merge_organizations(db_organizations, genAI_organizations):
    try:
        db_organizations = db_organizations.rename(columns={
            'org_name': 'name',
            'city_name': 'location',
            'phone': 'contact'
        })[['name', 'location', 'contact', 'email', 'web_url', 'mission', 'source', "db_or_ai"]]

        genAI_organizations = genAI_organizations.rename(columns={
            'organization_name': 'name'
        })[['name', 'location', 'contact', 'email', 'web_url', 'mission', 'source', "db_or_ai"]]

        return pd.concat([db_organizations, genAI_organizations], ignore_index=True)
Comment on lines +4 to +9
pull_request:
  types: [closed]
  branches: [main]
  paths:
    - 'data-engineering/src/saayam-org-*/**'
