chore: Upgrade Glue v5 and add Terraform #40
Conversation
- Created environment-specific variable files for dev and staging environments.
- Added README documentation for environment configurations.
- Implemented IAM roles and policies for Glue job execution.
- Configured S3 bucket for Glue artifacts with versioning and lifecycle rules.
- Defined Glue job configurations for PySpark and Python shell jobs.
- Established networking resources including security groups and Glue connections.
- Added outputs for important resource identifiers and deployment summary.
- Created example Terraform variables for Glue deployment POC.
- Implemented local variables for common tags and resource naming conventions.
- Added main Terraform configuration file as an entry point.
- Included version constraints for Terraform and required providers.
Walkthrough

This update migrates AWS Glue ETL infrastructure provisioning from the Serverless Framework to a Terraform-based workflow. It removes all Serverless-related files and configuration, introduces comprehensive Terraform modules for Glue jobs, IAM, networking, S3, and logging, and updates documentation and scripts to reflect the new deployment process. Python, Node.js, and Docker environment versions are modernized, and code is refactored for improved logging, configuration management, and import consistency.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Terraform
    participant AWS
    participant S3
    participant Glue
    User->>Terraform: terraform init / plan / apply
    Terraform->>AWS: Provision S3, IAM, Networking, Glue Jobs, CloudWatch
    AWS->>S3: Create S3 bucket and upload scripts/support files
    AWS->>IAM: Create Glue job role and attach policies
    AWS->>Glue: Create Glue jobs (PySpark, PythonShell) using S3 scripts
    AWS->>CloudWatch: Create log groups for Glue jobs and service
    User->>Glue: Trigger Glue job (via AWS Console or CLI)
    Glue->>S3: Read scripts and support files
    Glue->>CloudWatch: Stream logs during execution
```
Estimated code review effort: 4 (60–120 minutes)
Actionable comments posted: 28
🔭 Outside diff range comments (1)
libs/io/helpers.py (1)
55-58: `{database}` and TLS flag never interpolated – missing `f` prefix

Only the first line of the multi-line string is an f-string. The second line is a plain string, so `{database}` and the TLS ternary remain literally in the URI.

```diff
-    url = (
-        f"mongodb://{encoded_user}:{encoded_pass}@{host}:{port}/"
-        "{database}?tls={'true' if ssl else 'false'}"
-    )
+    url = (
+        f"mongodb://{encoded_user}:{encoded_pass}@{host}:{port}/"
+        f"{database}?tls={'true' if ssl else 'false'}"
+    )
```
🧹 Nitpick comments (25)
libs/io/helpers.py (1)
2-4: Redundant blank line – follow project style guide

A single blank line between standard-library imports is already present; adding a second one is unnecessary and trips some linters (e.g. ruff E303).

terraform/.gitignore (1)
1-4: Consider broader ignore patterns for Terraform artefacts

Add wildcard variants to avoid accidental commits of env-specific state files or crash logs:

```diff
+*.tfstate*
+crash.log
+.terraformrc
```

libs/config/__init__.py (1)
1-1: Silence Ruff F401 by explicitly re-exporting public symbols

`Config` and `get_config` are purposefully re-exported, but Ruff flags them as "imported-but-unused". Add `__all__` to make the intent explicit:

```diff
 from .config import Config, get_config
+
+# Re-export for library consumers / tests.
+__all__ = ["Config", "get_config"]
```

libs/common/__init__.py (1)
2-2: Mark re-exports to avoid lint noise

`cached_property` is likely re-exported for downstream imports. Add it to `__all__` (or remove the import if that's not the intent) to silence F401:

```diff
 from .perm import load_tls_ca_bundle
 from .util import cached_property
+
+__all__ = ["load_tls_ca_bundle", "cached_property"]
```

libs/etl/__init__.py (1)
3-4: Expose public ETL helpers via `__all__`

Several helpers are imported only to surface a clean package API, triggering Ruff F401. Clarify with:

```diff
 from .extract import extract
 from .load_documentdb import load_to_document_db
 from .load_postgresql import load_to_postgresql_db
 from .load_s3 import load_to_s3
+
+__all__ = [
+    "extract",
+    "load_to_document_db",
+    "load_to_postgresql_db",
+    "load_to_s3",
+]
```

jobs/pythonshell_hello_world.py (1)
3-3: Consider using a module-specific logger instead of configuring the root logger.

Setting the root logger level globally can affect logging behavior of imported libraries. For production code, consider using a module-specific logger:

```diff
-logging.getLogger().setLevel(logging.INFO)
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
```

However, for a simple "hello world" script like this, the current approach is acceptable.
libs/etl/extract.py (1)
4-6: Import refactor is fine, but the `List` import may now be unused

Switching to relative imports is correct. Static linters will now flag `from typing import List` as unused; consider removing it in a follow-up.

libs/etl/load_s3.py (2)
1-1: Remove unused import.

`GlueContext` is imported but never used in this function.

```diff
-from awsglue.context import GlueContext, DynamicFrame
+from awsglue.context import DynamicFrame
```
14-16: Improve path validation.

The current validation only checks for `None` but should also handle empty strings and whitespace-only strings.

```diff
-    if not path:
+    if not path or not path.strip():
         raise ValueError("Missing destination path for S3 write")
```

jobs/pyspark_hello_world.py (1)
1-1: Remove unused import.

The `json` import is not used anywhere in the script and should be removed to keep the code clean.

```diff
-import json
```

terraform/s3.tf (1)
4-47: Excellent S3 security configuration!

The S3 bucket configuration follows security best practices with private ACL, public access restrictions, server-side encryption, and appropriate lifecycle rules for cost optimization. The versioning and cleanup policies are well-configured for artifact management.

Optional improvement: Consider pinning the module source to a specific commit hash for production environments to ensure reproducible deployments:

```diff
 module "s3_bucket" {
   source  = "cloudposse/s3-bucket/aws"
-  version = "~> 4.0"
+  version = "4.2.0" # or specific commit hash
```

local/localstack/start-jupyter.sh (1)
2-2: Consider using stricter error handling.

The `set +X` option disables command tracing but doesn't enable strict error handling. Consider using `set -euo pipefail` for better error handling in production environments.

```diff
-set +X
+set -euo pipefail
```

libs/logger.py (2)
19-26: Consider preventing duplicate handlers.

The current logic only checks if handlers exist, but doesn't prevent adding duplicate handlers if the function is called multiple times with the same logger name.

```diff
     # Configure a handler if none exists
     if not logger.handlers:
         handler = logging.StreamHandler()
         formatter = logging.Formatter(
             "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
         )
         handler.setFormatter(formatter)
         logger.addHandler(handler)
+        # Prevent propagation to avoid duplicate logs
+        logger.propagate = False
```
42-48: Enhance sensitive value detection.

The current implementation only checks if any sensitive word is in the key, but doesn't handle nested structures or more complex patterns.

Consider improving the detection logic:

```diff
 def mask_sensitive_value(key, value):
     """
     Mask sensitive values in logs
     """
-    sensitive_keys = ["password", "secret", "token", "key"]
+    sensitive_keys = ["password", "secret", "token", "key", "credential", "auth"]
     if (
         any(sensitive_word in key.lower() for sensitive_word in sensitive_keys)
-        and value
+        and value is not None and str(value).strip()
     ):
         return "********"
     return value
```

terraform/terraform.tfvars.example (2)
23-23: Fix formatting inconsistency.

There's extra whitespace in the `type` field assignment.

```diff
-    type    = "spark"  # Job type: "spark" or "pythonshell"
+    type = "spark" # Job type: "spark" or "pythonshell"
```
40-44: Additional support file adds complexity.

The second support file (`jobs.zip`) in this example differs from the first tfvars file. Ensure this is intentional and document the purpose of each support file.

Consider adding comments to clarify the purpose of each support file:

```diff
 support_files = [
   {
-    local_path = "libs.zip" # Path to local file relative to terraform directory
+    local_path = "libs.zip" # Shared libraries and dependencies
     s3_prefix  = ""         # S3 prefix where the file will be uploaded
   },
   {
-    local_path = "jobs.zip"
+    local_path = "jobs.zip" # Job scripts and configurations
     s3_prefix  = ""
   }
 ]
```

terraform/environments/dev.tfvars (1)
78-80: Consider more restrictive CIDR blocks for development.

The CIDR block `10.0.0.0/8` is very broad and may grant excessive network access for a development environment.

Consider using a more restrictive CIDR block:

```diff
-database_cidr_blocks = ["10.0.0.0/8"]
+database_cidr_blocks = ["10.0.0.0/16"] # More restrictive for development
```

terraform/cloudwatch.tf (1)
6-21: Logs are un-encrypted; consider KMS for compliance

CloudWatch Log Groups default to AWS-managed encryption. If your org requires customer-managed keys, add `kms_key_id`.

```diff
 resource "aws_cloudwatch_log_group" "glue_job_logs" {
   for_each = var.glue_jobs
-
+  kms_key_id = var.kms_key_id # new variable
```

The same change applies to the service & crawler log groups.
terraform/networking.tf (1)
28-43: Outbound HTTP/HTTPS to 0.0.0.0/0 may breach least-privilege

If Glue jobs only need to reach AWS services, restrict to AWS IP ranges or VPC endpoints; otherwise, document the requirement.
Dockerfile (1)
10-14: Consider pinning package versions for reproducibility.

While the cleanup is good, running `yum update -y` without version constraints could lead to non-reproducible builds if package versions change between builds.

Consider specifying versions for critical packages to ensure build reproducibility:

```diff
-RUN yum update -y && \
-    yum install -y gcc python3-devel wget && \
+RUN yum install -y \
+    gcc \
+    python3-devel \
+    wget && \
     yum clean all && \
     rm -rf /var/cache/yum
```

terraform/glue_jobs.tf (1)
84-103: Consider removing explicit aws_region parameter.

AWS Glue jobs automatically inherit the region from their execution environment. Explicitly passing it as a job argument could cause issues if the job is copied to another region.

```diff
       # Support files - reference the uploaded zip files
       "--extra-py-files" = "s3://${module.s3_bucket.bucket_id}/libs.zip,s3://${module.s3_bucket.bucket_id}/jobs.zip"
-      "--aws_region"     = var.aws_region
     },
     each.value.default_arguments
   )
```

If your job code specifically needs the region, it should use the AWS SDK's region detection instead of relying on a passed parameter.
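For illustration only, a minimal sketch of SDK-side region detection in job code (the helper name is made up; boto3's standard resolution chain does the work):

```python
import boto3


def current_region(fallback: str = "us-east-1") -> str:
    """Best-effort region lookup via boto3's resolution chain.

    boto3 reads AWS_REGION / AWS_DEFAULT_REGION or the shared config,
    so the job does not need an --aws_region argument from Terraform.
    """
    return boto3.session.Session().region_name or fallback


if __name__ == "__main__":
    print(f"Running in region: {current_region()}")
```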
libs/config/config.py (1)
207-236: Consider using consistent log levels for parameter sources.

Currently, default value usage is logged at DEBUG level while other sources use INFO level. This inconsistency could make troubleshooting difficult.

```diff
     # Return default if not found in either place
-    logger.debug(f"Using default value for {env_param_name}/{job_param_name}")
+    logger.info(f"Using default value for {env_param_name}/{job_param_name}")
     return default
```

This ensures all parameter source decisions are visible at the same log level.
terraform/variables.tf (3)
4-8: Avoid hard-coding the AWS region

Hard-wiring `us-east-1` as the default can lead to accidental multi-region drift or bill surprises when the variable is not explicitly overridden. Either remove the default or set it to `null` and compute a fallback via `data.aws_region` inside the module.

```diff
-variable "aws_region" {
-  description = "AWS region for resources"
-  type        = string
-  default     = "us-east-1"
-}
+variable "aws_region" {
+  description = "AWS region for resources"
+  type        = string
+  default     = null # Require the caller to pass it or fall back to the current provider region
+}
```
44-64: Strengthen validation of the `glue_jobs` object

Nothing prevents typos such as `"sparkk"` or a negative `max_concurrent_runs`. Add a top-level variable validation to keep bad input from reaching the provider.

```hcl
# example validation – place under the variable block
validation {
  condition = alltrue([
    for job in values(var.glue_jobs) :
    contains(["spark", "pythonshell"], job.type) &&
    job.max_concurrent_runs >= 1 &&
    (job.number_of_workers == null || job.number_of_workers > 0)
  ])
  error_message = "Each glue_jobs entry must have type 'spark' or 'pythonshell', positive concurrency, and (if set) a positive number_of_workers."
}
```
86-108: Duplicate literal lists – extract to a local to avoid drift

The permissible CloudWatch retention values are duplicated for `log_retention_days` and `service_log_retention_days`. Move the list to `locals` and reference it in both validations to reduce maintenance overhead.

```diff
+locals {
+  cw_retention_days = [
+    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653
+  ]
+}
 …
-  validation {
-    condition = contains([
-      1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653
-    ], var.log_retention_days)
+  validation {
+    condition = contains(local.cw_retention_days, var.log_retention_days)
 …
-  validation {
-    condition = contains([
-      1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653
-    ], var.service_log_retention_days)
+  validation {
+    condition = contains(local.cw_retention_days, var.service_log_retention_days)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `Pipfile.lock` is excluded by `!**/*.lock`
- `package-lock.json` is excluded by `!**/package-lock.json`
📒 Files selected for processing (49)
- `.gitignore` (1 hunks)
- `.node-version` (1 hunks)
- `.python-version` (1 hunks)
- `Dockerfile` (1 hunks)
- `Pipfile` (1 hunks)
- `README.md` (4 hunks)
- `docs/TERRAFORM_DEPLOYMENT.md` (1 hunks)
- `glue.yml` (0 hunks)
- `jobs/etl/load_s3.py` (0 hunks)
- `jobs/pyspark_hello_world.py` (2 hunks)
- `jobs/pythonshell_hello_world.py` (1 hunks)
- `libs/aws/s3.py` (2 hunks)
- `libs/common/__init__.py` (1 hunks)
- `libs/config/__init__.py` (1 hunks)
- `libs/config/config.py` (6 hunks)
- `libs/config/secrets.py` (1 hunks)
- `libs/db/docdb_mongo.py` (1 hunks)
- `libs/db/mongo.py` (1 hunks)
- `libs/etl/__init__.py` (1 hunks)
- `libs/etl/extract.py` (1 hunks)
- `libs/etl/load_documentdb.py` (1 hunks)
- `libs/etl/load_postgresql.py` (1 hunks)
- `libs/etl/load_s3.py` (1 hunks)
- `libs/io/__init__.py` (1 hunks)
- `libs/io/helpers.py` (1 hunks)
- `libs/io/reader.py` (1 hunks)
- `libs/io/writer.py` (1 hunks)
- `libs/logger.py` (1 hunks)
- `local/compose.yml` (2 hunks)
- `local/localstack/start-jupyter.sh` (1 hunks)
- `package.json` (2 hunks)
- `serverless.yml` (0 hunks)
- `terraform/.gitignore` (1 hunks)
- `terraform/cloudwatch.tf` (1 hunks)
- `terraform/environments/README.md` (1 hunks)
- `terraform/environments/dev.tfvars` (1 hunks)
- `terraform/environments/staging.tfvars` (1 hunks)
- `terraform/glue_jobs.tf` (1 hunks)
- `terraform/iam.tf` (1 hunks)
- `terraform/locals.tf` (1 hunks)
- `terraform/main.tf` (1 hunks)
- `terraform/networking.tf` (1 hunks)
- `terraform/outputs.tf` (1 hunks)
- `terraform/s3.tf` (1 hunks)
- `terraform/scripts.tf` (1 hunks)
- `terraform/terraform.tfvars` (1 hunks)
- `terraform/terraform.tfvars.example` (1 hunks)
- `terraform/variables.tf` (1 hunks)
- `terraform/versions.tf` (1 hunks)
💤 Files with no reviewable changes (3)
- serverless.yml
- jobs/etl/load_s3.py
- glue.yml
🧰 Additional context used
🧬 Code Graph Analysis (9)
- libs/db/docdb_mongo.py (1)
  - libs/common/perm.py (1): `load_tls_ca_bundle` (7-10)
- libs/io/__init__.py (1)
  - libs/io/helpers.py (1): `get_connection_options` (152-250)
- libs/etl/__init__.py (1)
  - libs/etl/load_postgresql.py (1): `load_to_postgresql_db` (7-11)
- libs/etl/extract.py (3)
  - libs/io/reader.py (1): `read_from_options` (6-34)
  - libs/config/config.py (1): `Config` (100-265)
  - libs/aws/s3.py (1): `AwsS3Client` (8-110)
- libs/common/__init__.py (1)
  - libs/common/util.py (1): `cached_property` (4-14)
- libs/aws/s3.py (1)
  - libs/config/config.py (2): `get_config` (271-290), `s3_vars` (136-147)
- libs/etl/load_s3.py (2)
  - libs/io/writer.py (1): `write_from_options` (6-22)
  - libs/config/config.py (2): `Config` (100-265), `s3_vars` (136-147)
- libs/config/config.py (3)
  - libs/common/util.py (1): `cached_property` (4-14)
  - libs/config/env.py (2): `get_envs` (41-53), `get_var` (18-35)
  - libs/config/secrets.py (1): `get_secrets_resolver` (12-27)
- libs/logger.py (1)
  - libs/config/config.py (2): `aws_client_vars` (113-133), `s3_vars` (136-147)
🪛 Ruff (0.12.2)
libs/config/__init__.py
1-1: .config.Config imported but unused; consider removing, adding to __all__, or using a redundant alias
(F401)
1-1: .config.get_config imported but unused; consider removing, adding to __all__, or using a redundant alias
(F401)
libs/io/__init__.py
1-1: .helpers.get_connection_options imported but unused; consider removing, adding to __all__, or using a redundant alias
(F401)
libs/etl/__init__.py
3-3: .load_postgresql.load_to_postgresql_db imported but unused; consider removing, adding to __all__, or using a redundant alias
(F401)
libs/common/__init__.py
2-2: .util.cached_property imported but unused; consider removing, adding to __all__, or using a redundant alias
(F401)
libs/io/writer.py
1-1: awsglue.context.GlueContext imported but unused
Remove unused import: awsglue.context.GlueContext
(F401)
libs/etl/load_s3.py
1-1: awsglue.context.GlueContext imported but unused
Remove unused import: awsglue.context.GlueContext
(F401)
jobs/pyspark_hello_world.py
1-1: json imported but unused
Remove unused import: json
(F401)
libs/config/config.py
3-3: os imported but unused
Remove unused import: os
(F401)
🪛 Shellcheck (0.10.0)
local/localstack/start-jupyter.sh
[warning] 10-10: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[warning] 15-15: This $? refers to a condition, not a command. Assign to a variable to avoid it being overwritten.
(SC2319)
🪛 Checkov (3.2.334)
terraform/s3.tf
[MEDIUM] 4-47: Ensure Terraform module sources use a commit hash
(CKV_TF_1)
terraform/iam.tf
[HIGH] 62-90: Ensure no IAM policies documents allow "*" as a statement's resource for restrictable actions
(CKV_AWS_356)
[MEDIUM] 118-143: Ensure Terraform module sources use a commit hash
(CKV_TF_1)
🪛 Gitleaks (8.27.2)
terraform/networking.tf
81-81: Identified a HashiCorp Terraform password field, risking unauthorized infrastructure configuration and security breaches.
(hashicorp-tf-password)
🔇 Additional comments (38)
.node-version (1)
1-1: Verify runtimes using Node v22.11.0

Jumping from v16 → v22 is a breaking change. Ensure:
- Local dev containers have been rebuilt.
- Any Lambda / Glue-shell jobs that package Node won’t exceed currently supported AWS runtimes (latest ≈ v20).
Pipfile (1)
21-23: Python 3.11 bump – double-check Glue job base images

Glue 5 images currently ship with CPython 3.10. Deploying wheels compiled for 3.11 will fail at runtime. Verify that:

- Glue job Docker images are also updated to a 3.11 variant, or
- `python_version` remains 3.10 for compatibility.

.python-version (1)
1-1: Verify Glue-runtime compatibility before locking to Python 3.11

AWS Glue v5 currently ships with Python 3.10. Bumping the local environment to 3.11 risks introducing language features / wheel builds that will fail at job runtime.

Double-check Glue's runtime image or pin the local version to 3.10 to stay aligned.

libs/config/secrets.py (1)
2-7: Import grouping is now PEP-8 compliant – good to go
No functional impact was introduced by re-ordering the `boto3` import under the third-party section.

LGTM.

libs/io/reader.py (1)

1-2: LGTM! Clean import organization.

The import reordering maintains consistency with the codebase refactoring, and both imports are properly utilized in the function.
libs/db/mongo.py (1)
5-5: Good improvement! Relative imports enhance package portability.

The change from absolute to relative imports follows Python packaging best practices and improves module portability.
libs/db/docdb_mongo.py (1)
3-3: Excellent! Consistent relative import usage.

The change to relative imports aligns with the package-wide refactoring effort and maintains the same functionality for TLS certificate handling.
local/compose.yml (1)
4-4: Confirm `start-jupyter.sh` exists inside the final image and is executable

Changing the entrypoint path to `/home/hadoop/start-jupyter.sh` will fail fast if the Dockerfile does not copy + chmod this script at exactly that location with the correct user/group. Please double-check the Dockerfile layer (or add one) to `COPY` the script and `RUN chmod +x`.

libs/etl/load_postgresql.py (1)
3-4: 👍 Relative import switch looks correct
The new `..io.writer` and `..config` paths resolve cleanly one level up from `libs/etl`. Ensure `libs/__init__.py` and `libs/io/__init__.py` exist so the package is importable by Glue.

libs/etl/load_documentdb.py (1)
3-4: Consistent relative imports adopted
Matches the pattern used in other ETL modules; improves portability inside zipped Glue job packages. No further action.

.gitignore (1)

202-210: LGTM! Good additions to .gitignore.

These are standard ignore patterns for macOS system files, Vim editor files, and common backup/temporary files. This improves repository hygiene by preventing accidental commits of temporary files.
libs/aws/s3.py (1)
5-5: LGTM! Consistent with relative import refactoring.

The change to relative imports aligns with the broader refactoring across the libs package.
terraform/locals.tf (1)
1-14: Excellent Terraform locals configuration.

This locals block follows best practices by:

- Centralizing common tags for consistent resource tagging
- Establishing a standardized naming convention with `name_prefix`
- Creating reusable variables that promote DRY principles
The structure supports maintainable and consistent infrastructure deployment.
README.md (2)
3-3: LGTM! Updated to reflect Terraform deployment.

The description accurately reflects the migration to Terraform-based infrastructure deployment.

25-25: Comprehensive documentation update for Terraform migration.

All references have been systematically updated from Serverless Framework to Terraform, maintaining consistency throughout the documentation. The simplified `git clone` installation approach is more straightforward than the previous `npx serverless install` method.

Also applies to: 32-32, 39-39, 75-75, 93-93, 99-99
libs/etl/load_s3.py (1)
7-19: Well-implemented S3 loading function.

The function demonstrates good practices:
- Proper error handling for missing required parameters
- Safe copying of config to avoid mutation
- Correct S3 URI construction
- Integration with the writer utility
The logic is sound and follows the established patterns in the codebase.
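As a side note, the pattern those bullets describe can be sketched roughly like this (illustrative only; the function name and config keys are assumptions, not the repository's actual `load_to_s3` code):

```python
import copy


def build_s3_write_options(config: dict) -> dict:
    """Illustrative sketch: validate the destination, copy the config so the
    caller's dict is never mutated, and build the s3:// URI for the writer."""
    options = copy.deepcopy(config)              # safe copy, original untouched
    path = options.get("path")
    if not path or not path.strip():             # reject missing/blank paths
        raise ValueError("Missing destination path for S3 write")
    options["path"] = f"s3://{options['bucket']}/{path.lstrip('/')}"
    return options


print(build_s3_write_options({"bucket": "my-artifacts", "path": "exports/run1/"}))
```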
terraform/environments/README.md (1)
1-62: Excellent documentation for environment management!

This README provides comprehensive and well-structured guidance for managing environment-specific Terraform configurations. The inclusion of usage examples, step-by-step instructions for creating new environments, and the variable precedence explanation makes it very user-friendly.
terraform/versions.tf (1)
1-22: Solid Terraform provider configuration!

The version constraints and provider configuration follow best practices. The AWS provider v5.x requirement aligns with the Glue v5 upgrade objective, and the default tags setup ensures consistent resource tagging across the infrastructure.
jobs/pyspark_hello_world.py (1)
9-40: Excellent logging enhancements!

The addition of structured logging throughout the job execution provides valuable observability for production monitoring. The logger setup and comprehensive logging at key stages (job start, data processing, S3 operations, completion) will greatly help with debugging and monitoring job performance.
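Purely as an illustration of that logging pattern (not the actual job code; the bucket URI and record counts are placeholders):

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("pyspark_hello_world")

logger.info("Job starting")                                    # job start
start = time.time()

records = [{"id": i} for i in range(3)]                        # stand-in for real processing
logger.info("Processed %d records", len(records))              # data processing

logger.info("Writing results to s3://example-bucket/output/")  # S3 operation (placeholder)
logger.info("Job completed in %.2fs", time.time() - start)     # completion
```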
terraform/scripts.tf (1)
4-28: Well-structured script upload configuration!

The S3 object resources are properly configured with appropriate etag handling for change detection, a consistent tagging strategy, and clear organization. The use of `filemd5()` ensures scripts are only re-uploaded when changed, which is efficient for deployment pipelines.

terraform/terraform.tfvars (1)
26-26: Verify AWS Glue 5.0 worker type compatibility

Please confirm that the specified worker type (`G.1X`) is supported by AWS Glue version 5.0. As of current public information, AWS hasn't officially documented Glue 5.0's available worker types. To ensure your Terraform configuration won't fail at deployment:

- Check the AWS Glue 5.0 documentation or the AWS Console for the list of valid worker types.
- Update `worker_type` in terraform/terraform.tfvars if Glue 5.0 introduces new options or deprecates existing ones.

terraform/environments/dev.tfvars (3)
32-34: Excellent development-specific configurations.

The inclusion of `--enable-spark-ui`, `--enable-metrics`, and `--job-bookmark-disable` is appropriate for development environments, as these options enhance debugging capabilities and ensure clean test runs.

83-85: Appropriate logging configuration for development.

The shorter retention periods (7 and 3 days) and enabled crawler logs are well-suited for a development environment, balancing debugging needs with cost efficiency.
24-25: Confirm AWS Glue 5.0 support for “Standard” worker type
I searched the repo for any references to supported worker types and didn't find guidance on using "Standard" with Spark jobs in Glue 5.0. Please verify against the official AWS Glue 5.0 documentation to ensure this worker type is valid for Spark workloads.

- File requiring review: `terraform/environments/dev.tfvars` (lines 24–25)
- Configuration to check:

```hcl
worker_type       = "Standard"
number_of_workers = 2
```

terraform/environments/staging.tfvars (1)
49-56: Check Python-Shell worker class against Glue 5 docs
`Standard` is still accepted for Python-Shell, but Glue 5 introduced `Z.0X` workers with higher memory. If you need >1 GB RAM or Spark UI, consider switching:

```hcl
worker_type = "Z.0X"
```

Confirm with the team's sizing requirements before changing.
terraform/cloudwatch.tf (1)
23-35: Retention of operational logs is hard-coded
`var.service_log_retention_days` might differ from compliance requirements per env. Ensure the variable is surfaced in every tfvars file (dev/staging/prod).

terraform/iam.tf (1)
118-121: Pin module source to a commit or tag

Relying on a floating version `~> 0.19` can introduce breaking changes.

```hcl
source  = "cloudposse/iam-role/aws"
version = "0.19.2" # or specific commit SHA
```

Dockerfile (2)

1-2: LGTM! Good platform support for multi-architecture.

The upgrade to AWS Glue 5.0 and explicit platform support for both amd64 and arm64 architectures is a good practice for cross-platform compatibility.

25-29: LGTM! Good version pinning for JupyterLab components.

Explicitly pinning versions for JupyterLab and related packages ensures consistent behavior across deployments.
docs/TERRAFORM_DEPLOYMENT.md (2)
117-130: LGTM! Good use of Terraform workspaces for environment management.

The documentation correctly demonstrates using Terraform workspaces with environment-specific variable files for proper environment isolation.

407-421: LGTM! Correct Terraform import syntax.

The import commands correctly handle the escaping needed for map keys in Terraform resource addresses.
terraform/outputs.tf (3)
8-11: LGTM! Good use of preconditions for output validation.

The precondition ensures the S3 bucket was created successfully before exposing its name, providing early failure detection.

70-88: LGTM! Well-structured output for Glue jobs.

The output elegantly merges both PySpark and Python shell jobs into a single map with consistent structure, making it easy for consumers to iterate over all jobs regardless of type.

144-152: LGTM! Proper handling of conditional resource outputs.

The outputs correctly return `null` for resources that aren't created when VPC is not configured, preventing confusion about resource availability.

terraform/glue_jobs.tf (1)

119-166: LGTM! Proper configuration for Python shell jobs.

The Python shell job configuration correctly omits Spark-specific settings while maintaining the same logging and monitoring capabilities as PySpark jobs.
libs/config/config.py (2)
56-83: LGTM! Robust parameter retrieval with format flexibility.

The function handles AWS Glue's parameter naming conventions well, trying multiple formats to accommodate different naming styles.

85-98: LGTM! Good security practice for logging sensitive data.

The function appropriately masks sensitive values based on parameter names, preventing accidental exposure in logs.
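To illustrate the kind of multi-format lookup described in the first comment above (a generic sketch, not the repository's actual `Config` implementation):

```python
def resolve_param(args: dict, name: str, default=None):
    """Try several common spellings of a Glue job argument before giving up.

    Glue passes arguments like "--my_param"; callers sometimes use
    MY_PARAM or my-param, so a few candidate keys are checked in order.
    """
    candidates = [
        f"--{name}",
        f"--{name.upper()}",
        name,
        name.upper(),
        name.replace("_", "-"),
    ]
    for key in candidates:
        if key in args and args[key] not in (None, ""):
            return args[key]
    return default


# Example arguments as they might arrive from sys.argv parsing
job_args = {"--SOURCE_TABLE": "customers", "--env": "dev"}
print(resolve_param(job_args, "source_table"))             # -> "customers"
print(resolve_param(job_args, "batch_size", default=100))  # -> 100
```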
terraform/variables.tf (1)
73-77: Default CIDR block is overly permissive
`10.0.0.0/8` opens the security group to every private IP. Double-check if that is intentional; consider scoping to the actual database subnets.
Dockerfile
Outdated
```dockerfile
# Install Python tools and extract requirements
COPY Pipfile Pipfile.lock ./
RUN pip3 install --upgrade pip && \
    pip3 install pipenv && \
    # Use the newer syntax for exporting requirements
    pipenv requirements > requirements.txt && \
    pip3 install -r requirements.txt
```
🛠️ Refactor suggestion
Optimize Python dependency installation process.
Installing pipenv just to export requirements adds unnecessary overhead to the Docker image. Consider pre-exporting requirements.txt in your build process.
Option 1 (Preferred): Pre-export requirements.txt in your CI/CD pipeline and copy it directly:
```diff
-# Install Python tools and extract requirements
-COPY Pipfile Pipfile.lock ./
-RUN pip3 install --upgrade pip && \
-    pip3 install pipenv && \
-    # Use the newer syntax for exporting requirements
-    pipenv requirements > requirements.txt && \
-    pip3 install -r requirements.txt
+# Install Python dependencies from pre-exported requirements
+COPY requirements.txt ./
+RUN pip3 install --upgrade pip && \
+    pip3 install -r requirements.txt
```

Option 2: Add error handling if keeping the current approach:
```diff
 RUN pip3 install --upgrade pip && \
     pip3 install pipenv && \
-    pipenv requirements > requirements.txt && \
+    pipenv requirements > requirements.txt || (echo "Failed to export requirements" && exit 1) && \
     pip3 install -r requirements.txt
```

🤖 Prompt for AI Agents
In Dockerfile lines 16 to 22, avoid installing pipenv inside the Docker image
just to export requirements.txt, as it adds unnecessary overhead. Instead,
pre-export the requirements.txt file during your CI/CD pipeline and copy it
directly into the image. If you must keep the current approach, add error
handling around the pipenv export command to prevent build failures and improve
robustness.
```dockerfile
# Install extra Spark drivers (PostgreSQL + MongoDB)
RUN mkdir -p /opt/spark/jars && \
    wget https://jdbc.postgresql.org/download/postgresql-42.2.12.jar -O /opt/spark/jars/postgresql-42.2.12.jar && \
    wget https://repo1.maven.org/maven2/org/mongodb/mongodb-driver-sync/5.5.1/mongodb-driver-sync-5.5.1.jar -O /opt/spark/jars/mongodb-driver-sync-5.5.1.jar
```
Add checksum verification for downloaded JAR files.
Downloading JAR files without checksum verification poses a security risk. If the download URLs are compromised, malicious JARs could be introduced into your build.
Add checksum verification for the downloaded files:
```diff
-RUN mkdir -p /opt/spark/jars && \
-    wget https://jdbc.postgresql.org/download/postgresql-42.2.12.jar -O /opt/spark/jars/postgresql-42.2.12.jar && \
-    wget https://repo1.maven.org/maven2/org/mongodb/mongodb-driver-sync/5.5.1/mongodb-driver-sync-5.5.1.jar -O /opt/spark/jars/mongodb-driver-sync-5.5.1.jar
+RUN mkdir -p /opt/spark/jars && \
+    wget https://jdbc.postgresql.org/download/postgresql-42.2.12.jar -O /opt/spark/jars/postgresql-42.2.12.jar && \
+    echo "1f06e20e7bcbef36ebf539f0ed3e9b3c5337eeb9 /opt/spark/jars/postgresql-42.2.12.jar" | sha1sum -c - && \
+    wget https://repo1.maven.org/maven2/org/mongodb/mongodb-driver-sync/5.5.1/mongodb-driver-sync-5.5.1.jar -O /opt/spark/jars/mongodb-driver-sync-5.5.1.jar && \
+    echo "expected_sha1_here /opt/spark/jars/mongodb-driver-sync-5.5.1.jar" | sha1sum -c -
```

Note: Replace expected_sha1_here with the actual SHA1 checksum of the MongoDB driver.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```dockerfile
# Install extra Spark drivers (PostgreSQL + MongoDB)
RUN mkdir -p /opt/spark/jars && \
    wget https://jdbc.postgresql.org/download/postgresql-42.2.12.jar -O /opt/spark/jars/postgresql-42.2.12.jar && \
    echo "1f06e20e7bcbef36ebf539f0ed3e9b3c5337eeb9 /opt/spark/jars/postgresql-42.2.12.jar" | sha1sum -c - && \
    wget https://repo1.maven.org/maven2/org/mongodb/mongodb-driver-sync/5.5.1/mongodb-driver-sync-5.5.1.jar -O /opt/spark/jars/mongodb-driver-sync-5.5.1.jar && \
    echo "expected_sha1_here /opt/spark/jars/mongodb-driver-sync-5.5.1.jar" | sha1sum -c -
```
🤖 Prompt for AI Agents
In Dockerfile lines 34 to 37, the downloaded PostgreSQL and MongoDB JAR files
lack checksum verification, which is a security risk. To fix this, after each
wget command, add a step to verify the SHA1 checksum of the downloaded file
using a tool like sha1sum and compare it against the expected checksum. If the
checksum does not match, the build should fail to prevent using potentially
compromised files. Obtain the correct SHA1 checksums for both JARs and include
them in the verification commands.
terraform/networking.tf
Outdated
```hcl
  connection_properties = {
    JDBC_CONNECTION_URL = "jdbc:postgresql://localhost:5432/postgres"
    USERNAME            = "postgres"
    PASSWORD            = "password"
  }
```
Hard-coded JDBC password committed to VCS
Storing credentials in plain text violates security best practices and triggers secret-scanning tools.
```diff
-    JDBC_CONNECTION_URL = "jdbc:postgresql://localhost:5432/postgres"
-    USERNAME            = "postgres"
-    PASSWORD            = "password"
+    # Replace with reference to Secrets Manager / SSM
+    JDBC_CONNECTION_URL = data.aws_secretsmanager_secret_version.db_creds.secret_string["url"]
+    USERNAME            = data.aws_secretsmanager_secret_version.db_creds.secret_string["username"]
+    PASSWORD            = data.aws_secretsmanager_secret_version.db_creds.secret_string["password"]
```
+ PASSWORD = data.aws_secretsmanager_secret_version.db_creds.secret_string["password"]Also add the secret data source and IAM permissions.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
   connection_properties = {
+    # Replace with reference to Secrets Manager / SSM
-    JDBC_CONNECTION_URL = "jdbc:postgresql://localhost:5432/postgres"
-    USERNAME            = "postgres"
-    PASSWORD            = "password"
+    JDBC_CONNECTION_URL = data.aws_secretsmanager_secret_version.db_creds.secret_string["url"]
+    USERNAME            = data.aws_secretsmanager_secret_version.db_creds.secret_string["username"]
+    PASSWORD            = data.aws_secretsmanager_secret_version.db_creds.secret_string["password"]
   }
```
🧰 Tools
🪛 Gitleaks (8.27.2)
81-81: Identified a HashiCorp Terraform password field, risking unauthorized infrastructure configuration and security breaches.
(hashicorp-tf-password)
🤖 Prompt for AI Agents
In terraform/networking.tf around lines 78 to 83, the JDBC password is
hard-coded in plain text, which is a security risk. Remove the hard-coded
password and instead reference it securely using a secret data source such as
AWS Secrets Manager or HashiCorp Vault. Additionally, update the Terraform
configuration to include the necessary IAM permissions to allow access to the
secret. This will ensure credentials are not exposed in version control and
comply with security best practices.
```hcl
  pythonshell_hello_world = {
    script_path    = "jobs/pythonshell_hello_world.py"
    type           = "pythonshell"
    python_version = "3.11"
```
🛠️ Refactor suggestion
Missing glue_version for Python shell job.
The Python shell job example is missing the glue_version parameter, which is required for consistency with the PySpark job configuration.
```diff
   pythonshell_hello_world = {
     script_path    = "jobs/pythonshell_hello_world.py"
     type           = "pythonshell"
+    glue_version   = "5.0"
     python_version = "3.11"
```

🤖 Prompt for AI Agents
In terraform/terraform.tfvars at line 47, the Python shell job configuration is
missing the required glue_version parameter. Add a glue_version entry with the
appropriate version value to ensure consistency with the PySpark job
configuration and proper job execution.
```hcl
variable "bucket_name" {
  description = "Name for the S3 bucket storing Glue artifacts"
  type        = string
  default     = "glue-artifacts"
}
```
🛠️ Refactor suggestion
Bucket name is not globally unique – expect deployment collisions
"glue-artifacts" will already exist in most accounts. Either (a) require callers to supply the bucket name, or (b) derive it dynamically with account/region suffixes to guarantee uniqueness.
```diff
-variable "bucket_name" {
-  description = "Name for the S3 bucket storing Glue artifacts"
-  type        = string
-  default     = "glue-artifacts"
-}
+# The caller must supply a globally-unique bucket name. No default provided.
+variable "bucket_name" {
+  description = "Globally unique S3 bucket name to store Glue artifacts"
+  type        = string
+}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```hcl
variable "bucket_name" {
  description = "Globally unique S3 bucket name to store Glue artifacts"
  type        = string
}
```
🤖 Prompt for AI Agents
In terraform/variables.tf around lines 32 to 36, the bucket_name variable has a
default value "glue-artifacts" which is not globally unique and may cause
deployment collisions. To fix this, remove the default value so the bucket_name
must be supplied by the caller, or modify the default to dynamically include
unique identifiers like the AWS account ID and region as suffixes to ensure
uniqueness.
Actionable comments posted: 1
♻️ Duplicate comments (1)
docs/TERRAFORM_DEPLOYMENT.md (1)
110-115: Still missing a safety check before executing the upload script

A prior review asked to verify that `test_script_upload.sh` exists before `chmod` / execution. That safeguard has not been applied, so running these commands in the wrong directory will fail silently.

```diff
-# Make the upload script executable
-chmod +x test_script_upload.sh
-
-# Upload scripts to S3
-./test_script_upload.sh
+# Verify the script exists
+if [ ! -f "test_script_upload.sh" ]; then
+  echo "Error: test_script_upload.sh not found. Ensure you are in the correct directory." >&2
+  exit 1
+fi
+
+# Make the upload script executable
+chmod +x test_script_upload.sh
+
+# Upload scripts to S3
+./test_script_upload.sh
```
🧹 Nitpick comments (1)
docs/TERRAFORM_DEPLOYMENT.md (1)
179-181: Malformed markdown heading

Lines 179-180 split the `## Migration from Serverless Framework` heading over two lines, causing an empty "##" header. Merge them onto one line.

```diff
-##
-Migration from Serverless Framework
+## Migration from Serverless Framework
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- `Dockerfile` (1 hunks)
- `docs/TERRAFORM_DEPLOYMENT.md` (1 hunks)
- `terraform/networking.tf` (1 hunks)
- `terraform/outputs.tf` (1 hunks)
- `terraform/terraform.tfvars` (1 hunks)
- `terraform/terraform.tfvars.example` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- terraform/terraform.tfvars
🚧 Files skipped from review as they are similar to previous changes (4)
- terraform/networking.tf
- terraform/terraform.tfvars.example
- Dockerfile
- terraform/outputs.tf
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: super-linter
```hcl
# Optional variables
glue_version = "4.0"
python_version = "3"
```
🛠️ Refactor suggestion
Default Glue version contradicts PR objective
The sample terraform.tfvars still recommends glue_version = "4.0", while this PR explicitly upgrades to Glue v5. Keeping an outdated default will mislead users and cause runtime errors (Glue 4 → Python 3.7, Glue 5 → Python 3.10).
```diff
-# Optional variables
-glue_version = "4.0"
+# Optional variables
+# Use Glue v5 unless you have a hard reason to stay on 4.x
+glue_version = "5.0"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```hcl
# Optional variables
# Use Glue v5 unless you have a hard reason to stay on 4.x
glue_version = "5.0"
python_version = "3"
```
🤖 Prompt for AI Agents
In docs/TERRAFORM_DEPLOYMENT.md around lines 82 to 85, update the default
glue_version value from "4.0" to "5.0" in the sample terraform.tfvars to align
with the PR's upgrade to Glue v5 and avoid user confusion or runtime errors.
What's this PR do?
This PR upgrades Glue to v5, and since the Serverless Framework does not support this version, it switches the deployment to Terraform.
@ulises-jeremias @edamico @Tsemenzato
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Chores