Merged

Main #45

16 commits
2 changes: 2 additions & 0 deletions .github/workflows/cd.yml
@@ -16,6 +16,8 @@ on:
 jobs:
   deployment:
     runs-on: ubuntu-latest
+    # Only deploy if CI workflow succeeded
+    if: ${{ github.event.workflow_run.conclusion == 'success' }}
     environment:
       name: production
       url: ${{ vars.RENDER_APP_URL }}
11 changes: 10 additions & 1 deletion Dockerfile
@@ -3,7 +3,11 @@ WORKDIR /usr/local/app
 
 # Install the application dependencies
 COPY requirements.txt ./
-RUN pip install --no-cache-dir -r requirements.txt
+RUN pip install --no-cache-dir -r requirements.txt && \
+    python -c "import nltk; \
+    nltk.download('punkt', download_dir='/usr/local/share/nltk_data'); \
+    nltk.download('punkt_tab', download_dir='/usr/local/share/nltk_data'); \
+    nltk.download('stopwords', download_dir='/usr/local/share/nltk_data')"
 
 # Copy in the source code
 COPY src ./src
@@ -13,4 +17,9 @@ EXPOSE 5000
 RUN useradd app
 USER app
+
+# Set cache directories to /tmp to avoid permission issues
+ENV MPLCONFIGDIR=/tmp/matplotlib
+ENV XDG_CACHE_HOME=/tmp/.cache
+ENV NLTK_DATA=/tmp/nltk_data
Copilot AI Jan 17, 2026

The NLTK data download location and the runtime NLTK_DATA environment variable do not match. The build downloads NLTK resources to '/usr/local/share/nltk_data' (lines 8-10), but NLTK_DATA is set to '/tmp/nltk_data' at runtime (line 23). NLTK checks the NLTK_DATA directories first, and '/tmp/nltk_data' is empty in the image, so whether the build-time copy is still found depends on NLTK's default search path (which on Linux normally includes '/usr/local/share/nltk_data') rather than on this configuration. To make the lookup explicit and avoid runtime errors, either download the NLTK data to '/tmp/nltk_data' during the build or set NLTK_DATA to '/usr/local/share/nltk_data'.

Suggested change
-ENV NLTK_DATA=/tmp/nltk_data
+ENV NLTK_DATA=/usr/local/share/nltk_data

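The mismatch described in this comment can be caught with a small check before the image ships. The following is a minimal sketch, not code from this PR: `env_search_dirs` and `download_dir_matches_env` are hypothetical helper names, and the check only models the NLTK_DATA variable (which NLTK splits on `os.pathsep`), not NLTK's built-in default directories.

```python
import os


def env_search_dirs(environ):
    """Return the directories listed in NLTK_DATA, dropping empty entries."""
    return [d for d in environ.get("NLTK_DATA", "").split(os.pathsep) if d]


def download_dir_matches_env(download_dir, environ):
    """Return True if download_dir is one of the NLTK_DATA directories."""
    target = os.path.normpath(download_dir)
    return any(os.path.normpath(d) == target for d in env_search_dirs(environ))


# Build-time download directory used by the Dockerfile in this PR.
BUILD_DOWNLOAD_DIR = "/usr/local/share/nltk_data"

# Current Dockerfile value: the variable points somewhere else.
print(download_dir_matches_env(BUILD_DOWNLOAD_DIR, {"NLTK_DATA": "/tmp/nltk_data"}))

# Suggested value: the variable and the download directory agree.
print(download_dir_matches_env(BUILD_DOWNLOAD_DIR, {"NLTK_DATA": BUILD_DOWNLOAD_DIR}))
```

Passing the environment as a plain mapping keeps the check testable; in a container it could be called with `os.environ`.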

CMD [ "python", "-m", "src.app.main"]
12 changes: 12 additions & 0 deletions src/pipeline/text_stats.py
@@ -8,6 +8,18 @@
 import os
 import glob
 
+def ensure_nltk_resources():
+    try:
+        nltk.data.find("tokenizers/punkt")
+    except LookupError:
+        nltk.download("punkt")
+        nltk.download("punkt_tab")
+
+    try:
+        nltk.data.find("corpora/stopwords")
+    except LookupError:
+        nltk.download("stopwords")
+
Copilot AI Jan 17, 2026

The ensure_nltk_resources function is defined but never called in the codebase. This function appears to be intended as a fallback mechanism to download NLTK resources at runtime if they're missing, but without being invoked, it serves no purpose. Consider either calling this function at module initialization (e.g., at the end of the module or in an init block) or removing it if the Dockerfile download strategy is sufficient.

Suggested change
+ensure_nltk_resources()

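If the runtime fallback is kept, one way to structure it is to check each resource separately and download only what is missing. This is a sketch under assumptions, not the PR's implementation: `ensure_resources` and `REQUIRED` are hypothetical names, and the lookup and download functions are passed in so the logic can be exercised without NLTK installed.

```python
def ensure_resources(required, find, download):
    """Download every package whose lookup path cannot be found.

    required: mapping of lookup path -> package name
    find:     callable that raises LookupError when a path is missing
    download: callable that fetches a package by name
    Returns the list of package names that were downloaded.
    """
    downloaded = []
    for lookup_path, package in required.items():
        try:
            find(lookup_path)
        except LookupError:
            download(package)
            downloaded.append(package)
    return downloaded


# Resources this PR downloads in the Dockerfile, keyed by NLTK lookup path.
REQUIRED = {
    "tokenizers/punkt": "punkt",
    "tokenizers/punkt_tab": "punkt_tab",
    "corpora/stopwords": "stopwords",
}

# With NLTK installed, a call at module import time might look like:
#   import nltk
#   ensure_resources(REQUIRED, nltk.data.find, nltk.download)
```

Checking `punkt_tab` on its own path means it is fetched even when `punkt` is already present, a case the single `tokenizers/punkt` check would skip.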
 def import_data(filename, rootpath):
     """Load a text file from rootpath and return its contents as a DataFrame."""
     fullpath = os.path.join(rootpath, filename)