Skip to content

Main#45

Merged
huypq02 merged 16 commits intodevelopfrom
main
Jan 17, 2026
Merged

Main#45
huypq02 merged 16 commits intodevelopfrom
main

Conversation

@huypq02
Copy link
Owner

@huypq02 huypq02 commented Jan 17, 2026

No description provided.

hotfix: add some env for setting cache directories to /tmp to avoid p…
hotfix: add command to install packages of nltk
hotfix: import ntlk after installing required libraries, esp. nlk
…E_HOME linux, adding github.event.workflow_run.conclusion == 'success' to prevent triggering cd.yml when ci.yml failed
hotfix: removing the ensure_nltk_resources() call and adding XDG_CACH…
hotfix: the permission error related to NLTK packages  when requestin…
Copilot AI review requested due to automatic review settings January 17, 2026 16:54
@huypq02 huypq02 merged commit 2b89891 into develop Jan 17, 2026
11 of 13 checks passed
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses NLTK resource management and deployment reliability for a text analysis application. The changes add a runtime fallback for NLTK data downloads, pre-download NLTK resources in the Docker image, configure cache directories for non-root user execution, and add a safety check to the deployment workflow.

Changes:

  • Added ensure_nltk_resources() function to download NLTK data at runtime if missing
  • Modified Dockerfile to pre-download NLTK resources during build and configure environment variables for cache directories
  • Added conditional check to GitHub Actions CD workflow to only deploy when CI succeeds

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/pipeline/text_stats.py Added function to ensure NLTK resources are available at runtime
Dockerfile Pre-download NLTK data during build and set environment variables for cache directories to support non-root user
.github/workflows/cd.yml Added condition to only deploy when CI workflow succeeds

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

nltk.data.find("corpora/stopwords")
except LookupError:
nltk.download("stopwords")

Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ensure_nltk_resources function is defined but never called in the codebase. This function appears to be intended as a fallback mechanism to download NLTK resources at runtime if they're missing, but without being invoked, it serves no purpose. Consider either calling this function at module initialization (e.g., at the end of the module or in an init block) or removing it if the Dockerfile download strategy is sufficient.

Suggested change
ensure_nltk_resources()

Copilot uses AI. Check for mistakes.
# Set cache directories to /tmp to avoid permission issues
ENV MPLCONFIGDIR=/tmp/matplotlib
ENV XDG_CACHE_HOME=/tmp/.cache
ENV NLTK_DATA=/tmp/nltk_data
Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a mismatch between the NLTK data download location and the runtime NLTK_DATA environment variable. NLTK resources are downloaded to '/usr/local/share/nltk_data' during the build (lines 8-10), but the NLTK_DATA environment variable is set to '/tmp/nltk_data' at runtime (line 23). This will cause NLTK to be unable to find the downloaded resources, leading to runtime errors. Either download NLTK data to '/tmp/nltk_data' during build, or change the NLTK_DATA environment variable to '/usr/local/share/nltk_data'.

Suggested change
ENV NLTK_DATA=/tmp/nltk_data
ENV NLTK_DATA=/usr/local/share/nltk_data

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants