Conversation
Releases/v1.0.0
Releases/v2.0.0
Releases/v2.1.0
hotfix: publish host 0.0.0.0
Hotfix/stopwords not found
hotfix: add some env for setting cache directories to /tmp to avoid p…
hotfix: add command to install packages of nltk
hotfix: import ntlk after installing required libraries, esp. nlk
…E_HOME linux, adding github.event.workflow_run.conclusion == 'success' to prevent triggering cd.yml when ci.yml failed
hotfix: removing the ensure_nltk_resources() call and adding XDG_CACH…
hotfix: the permission error related to NLTK packages when requestin…
There was a problem hiding this comment.
Pull request overview
This PR addresses NLTK resource management and deployment reliability for a text analysis application. The changes add a runtime fallback for NLTK data downloads, pre-download NLTK resources in the Docker image, configure cache directories for non-root user execution, and add a safety check to the deployment workflow.
Changes:
- Added
ensure_nltk_resources()function to download NLTK data at runtime if missing - Modified Dockerfile to pre-download NLTK resources during build and configure environment variables for cache directories
- Added conditional check to GitHub Actions CD workflow to only deploy when CI succeeds
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/pipeline/text_stats.py | Added function to ensure NLTK resources are available at runtime |
| Dockerfile | Pre-download NLTK data during build and set environment variables for cache directories to support non-root user |
| .github/workflows/cd.yml | Added condition to only deploy when CI workflow succeeds |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| nltk.data.find("corpora/stopwords") | ||
| except LookupError: | ||
| nltk.download("stopwords") | ||
|
|
There was a problem hiding this comment.
The ensure_nltk_resources function is defined but never called in the codebase. This function appears to be intended as a fallback mechanism to download NLTK resources at runtime if they're missing, but without being invoked, it serves no purpose. Consider either calling this function at module initialization (e.g., at the end of the module or in an init block) or removing it if the Dockerfile download strategy is sufficient.
| ensure_nltk_resources() |
| # Set cache directories to /tmp to avoid permission issues | ||
| ENV MPLCONFIGDIR=/tmp/matplotlib | ||
| ENV XDG_CACHE_HOME=/tmp/.cache | ||
| ENV NLTK_DATA=/tmp/nltk_data |
There was a problem hiding this comment.
There is a mismatch between the NLTK data download location and the runtime NLTK_DATA environment variable. NLTK resources are downloaded to '/usr/local/share/nltk_data' during the build (lines 8-10), but the NLTK_DATA environment variable is set to '/tmp/nltk_data' at runtime (line 23). This will cause NLTK to be unable to find the downloaded resources, leading to runtime errors. Either download NLTK data to '/tmp/nltk_data' during build, or change the NLTK_DATA environment variable to '/usr/local/share/nltk_data'.
| ENV NLTK_DATA=/tmp/nltk_data | |
| ENV NLTK_DATA=/usr/local/share/nltk_data |
No description provided.