Skip to content

Historical football data collection#34

Open
thequinn wants to merge 27 commits into
vibing-ai:mainfrom
thequinn:historical-football-data-collection/wk1
Open

Historical football data collection#34
thequinn wants to merge 27 commits into
vibing-ai:mainfrom
thequinn:historical-football-data-collection/wk1

Conversation

@thequinn

@thequinn thequinn commented Aug 8, 2025

Copy link
Copy Markdown

Description

Goal:
Build essential historical dataset for top 3 football leagues using free data sources below, , covering 5 recent seasons with focus on MVP delivery.

Data sources:
football-data.co.uk
fbref.com

Step completed:

  1. Created the folders and files structure
  2. Added functionalities to download multiple formats, scrape the data, clean up and process the data before saving to new csv files.

Steps to complete:
3. Merge the data. ex. merge same league of the same season from different resources
4. Insert into Supabase

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • [ x ] ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔧 Maintenance (dependencies, CI, build tools, etc.)
  • ♻️ Refactoring (no functional changes)
  • ⚡ Performance improvement

Changes Made

  • Change 1: data-collection/collectors/football_data_collector.py
  • Change 2: data-collection/collectors/fbref_collector.py
  • Change 3: data-collection/scripts/collect_all.py

Testing

How has this been tested?

  • Unit tests added/updated
  • Integration tests added/updated
  • [ x ] Manual testing performed
  • No testing required (documentation, etc.)

Test Configuration:

  • Python version: [e.g. 3.11]
  • Node.js version: [e.g. 18.17.0]
  • Browser (if applicable): [e.g. Chrome 119]

Platform Impact

Which parts of the system are affected?

  • [ x ] AI Backend (Python agents)
  • Web Platform (Next.js frontend)
  • Database schema
  • [ ] API endpoints
  • Documentation
  • CI/CD workflows
  • Other: ___

Breaking Changes

Does this PR introduce any breaking changes?

  • [ x ] No breaking changes
  • Yes, breaking changes (please describe below)

If yes, describe the breaking changes and migration path:

Checklist

Before requesting a review, please ensure:

  • [ x ] Code follows the project's style guidelines
  • [ x ] Self-review of code has been performed
  • [ x ] Code is commented, particularly in hard-to-understand areas
  • Corresponding changes to documentation have been made
  • [ x ] Changes generate no new warnings
  • [ x ] Tests pass locally
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Add screenshots to help explain your changes.

Related Issues

Closes #(issue_number)
Related to #(issue_number)

Additional Notes

Any additional information that reviewers should know.

Summary by CodeRabbit

  • New Features

    • Introduced new utilities for collecting and processing football match data from FBref.com and football-data.co.uk, including handling bot protection and data normalization.
    • Added scripts to automate data collection and processing from multiple online sources.
    • Implemented asynchronous API clients for fetching football data and testing API authentication.
    • Enhanced API client for fetching fixtures and teams with input validation, error handling, and detailed logging.
  • Tests

    • Added comprehensive tests for API clients, OpenAI API connectivity, environment setup, and football data processing.
  • Configuration

    • Updated static type checking, pre-commit, and pytest configurations for improved code quality and coverage.
    • Expanded .gitignore to exclude HTML and CSV files.
  • Style

    • Applied formatting improvements and code style consistency across multiple modules.
  • Documentation

    • Added docstrings and comments to clarify new modules and package purposes.

@coderabbitai

coderabbitai Bot commented Aug 8, 2025

Copy link
Copy Markdown

Walkthrough

This update introduces new data collection modules and scripts for football match data, implements and tests API clients for sports data, and enhances configuration and testing infrastructure. Key changes include new collectors for FBref and football-data.co.uk, expanded API-Football client functionality, additional test suites, stricter type and test configurations, and various formatting and stylistic improvements across the codebase.

Changes

Cohort / File(s) Change Summary
Data Collection Modules
data-collection/collectors/__init__.py, data-collection/collectors/fbref_collector.py, data-collection/collectors/football_data_collector.py
Introduced new modules for collecting and processing football match data from FBref.com and football-data.co.uk, including utilities for URL generation, HTML/CSV downloading, data extraction, cleaning, normalization, and saving. The FBref collector includes Selenium-based downloading to bypass bot protections.
Data Collection Script
data-collection/scripts/collect_all.py
Added a script to automate downloading, processing, and saving football data from multiple sources using the new collectors, with error handling and field mapping utilities.
API-Football Client Enhancements
ai-backend/tools/sports_apis.py
Expanded the APIFootballClient with full implementations for fetching fixtures and teams, input validation, error handling, and logging. Added new helper methods for payload building and HTTP requests.
API-Football Auth Test Client
ai-backend/tools/test_api_auth.py
Introduced an async client for testing API-Football authentication and data retrieval, with context management, error handling, and demonstration usage.
OpenAI and Quality Tools Tests
ai-backend/tests/test_openai.py, ai-backend/tests/test_quality_tools.py
Added tests for OpenAI API connectivity and a sample football data processor class, including async and sync methods for data processing and simulated fetching.
Fixtures API Tests
ai-backend/tests/test_fixtures.py
Added async pytest functions to test APIFootballClient's get_fixtures method under various filter conditions using mocked HTTP responses.
Environment Test Script
ai-backend/tests/test_environment.py
Added a script to verify installation of all required dependencies by attempting imports and printing results.
Configuration and Type Checking
.gitignore, ai-backend/.pre-commit-config.yaml, ai-backend/mypy.ini, ai-backend/pytest.ini, mypy.ini
Updated ignore patterns to include HTML and CSV files, changed quoting style in pre-commit config, enhanced pytest options with coverage and markers, and introduced/expanded mypy configuration for stricter type checking and third-party import handling.
Settings and Formatting Updates
ai-backend/config/settings.py, ai-backend/agents/editor.py, ai-backend/agents/researcher.py, ai-backend/main.py, scripts/seed-data.py
Updated settings to use uppercase env vars and new Pydantic validators; reformatted method signatures, logging, and data declarations for consistency; no logic changes.

Sequence Diagram(s)

sequenceDiagram
    participant Script as collect_all.py
    participant FBref as fbref_collector.py
    participant FootballData as football_data_collector.py
    participant Selenium as Selenium WebDriver
    participant Requests as requests

    Script->>FootballData: generate_football_data_url()
    Script->>FootballData: download_csv()
    Script->>FootballData: read_csv()
    Script->>FootballData: rename_columns(), add_new_columns_to_football_data()
    Script->>FootballData: reorder_df(), normalize_date(), save_df_to_csv()

    Script->>FBref: generate_fbref_url()
    Script->>Selenium: download_with_selenium()
    Selenium-->>FBref: HTML content
    Script->>FBref: save_html_to_file()
    Script->>FBref: extract_columns()
    Script->>FBref: add_new_columns_to_fbref(), reorder_df(), normalize_date(), create_csv()
Loading
sequenceDiagram
    participant Test as test_fixtures.py
    participant MockSession as MockSession
    participant API as APIFootballClient

    Test->>API: get_fixtures(...)
    API->>MockSession: get()
    MockSession-->>API: MockResponse (JSON)
    API-->>Test: List of fixtures
Loading
sequenceDiagram
    participant APIAuthTest as test_api_auth.py
    participant APIAuth as APIAuth
    participant aiohttp as aiohttp.ClientSession
    participant RapidAPI as API-Football

    APIAuthTest->>APIAuth: __aenter__()
    APIAuth->>aiohttp: ClientSession()
    APIAuthTest->>APIAuth: test_api_auth()
    APIAuth->>RapidAPI: GET /fixtures
    RapidAPI-->>APIAuth: JSON response
    APIAuth-->>APIAuthTest: Parsed data
    APIAuthTest->>APIAuth: __aexit__()
    APIAuth->>aiohttp: Close session
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇
With paws on keys and nose a-twitch,
I gathered football data, quick!
From FBref and CSV streams,
Selenium danced through bot-filled dreams.
API clients tested, settings refined—
This bunny leaves no bug behind!
🏆⚽

Note

🔌 MCP (Model Context Protocol) integration is now available in Early Access!

Pro users can now connect to remote MCP servers under the Integrations page to get reviews and chat conversations that understand additional development context.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🔭 Outside diff range comments (1)
ai-backend/.pre-commit-config.yaml (1)

25-26: Align Python versions across Black and mypy

There’s a mismatch between the Python target in your pre-commit Black hook and mypy config, which can lead to inconsistent parsing or type-checking behavior. Please align both tools to the same version.

• ai-backend/.pre-commit-config.yaml (line 25):

      language_version: python3.12

• ai-backend/mypy.ini (line 3):

python_version = 3.11

Suggested fix (choose one):

– Update mypy.ini to 3.12:

 [mypy]
-python_version = 3.11
+python_version = 3.12
 ...

– Or downgrade Black target to 3.11:

-      language_version: python3.12
+      language_version: python3.11
🧹 Nitpick comments (21)
data-collection/collectors/__init__.py (1)

1-3: Clarify scope in package docstring

If this package is focused on football data (per PR), reflect that for precision.

Apply this diff:

-"""
-Collectors package for sports data collection.
-"""
+"""
+Collectors for historical football data (fbref, football-data.co.uk).
+"""
ai-backend/mypy.ini (1)

46-67: Broad ignore_missing_imports reduces type safety; consider targeted alternatives

Good to unblock CI, but you’ll lose valuable checks for fastapi/pydantic/pytest, which generally have decent typing. Consider:

  • Remove ignores for fastapi/pydantic/uvicorn if possible.
  • Enable the Pydantic mypy plugin to improve model typing.

Add the Pydantic plugin at the top-level (outside this hunk):

[mypy]
plugins = pydantic.mypy

Then gradually remove these sections:

  • [mypy-fastapi.], [mypy-pydantic.], [mypy-uvicorn.*]
    Verify locally which ones are still needed. If certain submodules lack stubs, prefer narrow, module-specific ignores or inline # type: ignore[import] over package-wide ignores.
ai-backend/agents/researcher.py (1)

24-37: Consider stronger typing for return structure

Instead of dict[str, Any], define a TypedDict or dataclass for the research payload. It improves IDE assistance and catches schema drift early.

Example:

from typing import TypedDict

class TeamHistory(TypedDict, total=False):
    head_to_head_record: dict[str, int]  # e.g., {"wins": 10, "draws": 5, "losses": 7}
    recent_form: list[str]               # e.g., ["W", "D", "L", ...]
    key_matches: list[dict[str, str]]    # e.g., [{"date": "...", "score": "2-1", ...}]

Then annotate: async def research_team_history(...) -> TeamHistory:

ai-backend/main.py (1)

270-279: Prefer structured logging instead of “%s” placeholders

logger.info("Starting server on %s:%s", host, port) is fine, but elsewhere the codebase uses keyword/value pairs (extra= or logger.info(key=value)). Aligning styles improves log-parsing consistency.

mypy.ini (1)

25-46: Consider using official stub packages instead of blanket ignore_missing_imports

beautifulsoup4, aiohttp, openai, etc., all have published stubs (types-beautifulsoup4, aiohttp-types, …). Installing them lets you keep strict checking without silencing whole modules.

ai-backend/config/settings.py (2)

55-63: Validator duplicates min_length constraint

min_length=20 plus explicit len checks is redundant. Either rely on the field constraint or keep the validator but drop min_length to avoid double-error messages.


114-115: model_config may shadow future attributes

model_config is Pydantic-v2 specific. To remain forward-compatible (e.g., if pinning v2.5+), consider class Config with model_config = … inside, as recommended in docs, to avoid naming collisions.

data-collection/collectors/football_data_collector.py (4)

22-45: Unused league_name parameter

league_name is never referenced; drop it or incorporate it to avoid API confusion.

-def generate_football_data_url(league_id, league_name, season):
+def generate_football_data_url(league_id: str, season: int) -> str:

129-146: Replace print with proper logging

Raw print() hampers library reuse and log aggregation. Inject a module-level logger instead.

-import pprint
-
-print("After getting essential columns:")
-print(df_processed.head())
+logger = logging.getLogger(__name__)
+
+logger.debug("Essential columns sample:\n%s", df_processed.head())

149-157: Date normalization swallows errors silently

Returning invalid dates as NaT cast to "NaT" leads to string "NaT" values downstream. Consider dropping rows with NaT or at least logging count of failed parses.


160-165: Ensure processed directory exists and use os.path.splitext

-processed_filename = raw_filename.split(".")[0] + "_processed.csv"
+base, _ = os.path.splitext(raw_filename)
+processed_filename = f"{base}_processed.csv"
+os.makedirs(get_data_processed_folder(), exist_ok=True)

Return the saved path so callers can chain operations.

data-collection/collectors/fbref_collector.py (2)

100-111: Filename parsing is brittle

file.split("_") assumes exactly three underscores and no extra “_” in league names (e.g., “Champions_League”). Safer:

parts = file.rsplit("_", maxsplit=2)
if len(parts) != 3:
    raise ValueError(f"Unexpected filename format: {file}")
_, league, season_part = parts
season = season_part.split(".")[0]

158-180: Heavy code duplication with football_data_collector.py

reorder_df, get_columns, normalize_date, and CSV helpers are copy-pasted. Extract them into a shared util and import from both collectors to reduce maintenance cost.

data-collection/scripts/collect_all.py (2)

110-113: Variable reuse blurs types

df_cleaned = save_df_to_csv(df_cleaned, raw_csv_filename) reassigns the DataFrame variable to the path string returned by save_df_to_csv.
Prefer discarding the return value or storing it in a clearly-named variable to avoid accidental misuse later.


138-146: Path built via string concatenation – use os.path.join

html_filepath = os.path.join(get_data_folder(), "raw", html_filename)

avoids platform-specific separator bugs.

ai-backend/tools/sports_apis.py (1)

182-193: Logging entire team payload may leak unnecessary data

logger.info(f"Teams data: {json.dumps(data_response, indent=2)}") can emit thousands of lines and potentially sensitive info to logs.

Log the count instead, or guard behind a debug flag.

ai-backend/tests/test_openai.py (1)

16-30: Convert print-based skip to pytest skip

Use pytest.skip() so the test suite reports a skipped test instead of silently passing.

import pytest

if not os.getenv("OPENAI_API_KEY"):
    pytest.skip("OPENAI_API_KEY not set", allow_module_level=True)
ai-backend/tests/test_quality_tools.py (2)

12-13: processed_games is never used

The attribute is initialised but never updated; either populate it in process_game_data or drop it to avoid dead code.


60-63: Remove ad-hoc print from test code

Printing in test modules clutters CI logs. Rely on assertions instead, or move this demo into docs/ or an example script.

ai-backend/tests/test_fixtures.py (1)

36-39: monkeypatch fixture is declared but unused

Pytest will warn on this; simply drop the parameter.

ai-backend/tools/test_api_auth.py (1)

1-3: File name will be collected by pytest

A module named test_*.py under tools/ will be imported during test discovery even though it isn’t a test.
Rename the file (e.g. api_auth_demo.py) or add __test__ = False at module level to suppress collection.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 807bf41 and e7604e4.

📒 Files selected for processing (20)
  • .gitignore (2 hunks)
  • ai-backend/.pre-commit-config.yaml (1 hunks)
  • ai-backend/agents/editor.py (1 hunks)
  • ai-backend/agents/researcher.py (2 hunks)
  • ai-backend/config/settings.py (5 hunks)
  • ai-backend/main.py (3 hunks)
  • ai-backend/mypy.ini (1 hunks)
  • ai-backend/pytest.ini (1 hunks)
  • ai-backend/tests/test_environment.py (1 hunks)
  • ai-backend/tests/test_fixtures.py (1 hunks)
  • ai-backend/tests/test_openai.py (1 hunks)
  • ai-backend/tests/test_quality_tools.py (1 hunks)
  • ai-backend/tools/sports_apis.py (7 hunks)
  • ai-backend/tools/test_api_auth.py (1 hunks)
  • data-collection/collectors/__init__.py (1 hunks)
  • data-collection/collectors/fbref_collector.py (1 hunks)
  • data-collection/collectors/football_data_collector.py (1 hunks)
  • data-collection/scripts/collect_all.py (1 hunks)
  • mypy.ini (1 hunks)
  • scripts/seed-data.py (10 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
data-collection/collectors/football_data_collector.py (1)
data-collection/collectors/fbref_collector.py (6)
  • _here (25-26)
  • get_data_raw_folder (29-32)
  • get_data_processed_folder (35-38)
  • get_columns (177-180)
  • reorder_df (158-173)
  • normalize_date (147-155)
data-collection/collectors/fbref_collector.py (1)
data-collection/collectors/football_data_collector.py (6)
  • _here (48-49)
  • get_data_raw_folder (52-55)
  • get_data_processed_folder (58-61)
  • normalize_date (149-157)
  • reorder_df (129-146)
  • get_columns (123-126)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (10)
ai-backend/.pre-commit-config.yaml (1)

31-31: No-op quote style change — OK to keep

Switching to single quotes in args is fine; behavior unchanged.

.gitignore (1)

8-10: Section header for data artifacts — looks good

The new section marker improves readability.

ai-backend/agents/researcher.py (1)

24-24: Formatting change only — safe

Signature/logging compaction is non-functional and consistent with the module’s style. Sanitization preserved.

Also applies to: 36-36

ai-backend/agents/editor.py (1)

37-49: Formatting change looks good

Signature compacting is fine and keeps consistency with other agents. No further action required.

ai-backend/main.py (2)

114-117: Re-raising with HTTPException is fine

Wrapping the original exception preserves context via from e. No concerns.


148-156: Signature now fits 79-char line – OK

Only formatting changed. Behaviour untouched.

ai-backend/config/settings.py (1)

24-30: Verify callers after the ENV-VAR rename

Changing to OPENAI_API_KEY, SUPABASE_URL, etc., will break any code, CI secrets, or infra still exporting lowercase names. Double-check docker-compose, deployment manifests, and docs.

data-collection/collectors/fbref_collector.py (1)

226-232: Hard-coded Chrome binary path breaks non-macOS environments

Setting chrome_options.binary_location to a mac-specific path will fail on Linux/CI or Windows boxes. Detect the OS or allow an env-var override instead of hard-coding.

ai-backend/pytest.ini (2)

7-14: Declare pytest-cov as an explicit dev dependency

The new --cov=* options require the pytest-cov plugin; ensure it’s listed in your dev/CI requirements files, otherwise the test run will error out on unknown flags.


15-19: Make sure the custom markers are actually used

--strict-markers will fail if any of these markers are unused. Double-check that at least one test is decorated with each of slow, integration, unit, and asyncio, or drop the unused entries.

Comment thread .gitignore
Comment on lines +95 to +100
# HTMLs
*.html

# CSVs
*.csv

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Scope data artifact ignores to data-collection only (avoid hiding docs/fixtures)

Ignoring all HTML/CSV globally is risky; it can unintentionally hide documentation, static site outputs, or CSV test fixtures. Limit ignores to the data-collection tree (or specific output dirs).

Suggested replacement:

-# HTMLs
-*.html
-
-# CSVs
-*.csv
+# Data outputs (restricted to data-collection)
+data-collection/**/*.html
+data-collection/**/*.csv

If your outputs live in known subfolders (e.g., data-collection/raw and data-collection/processed), tighten further:

  • data-collection/raw/**/*.csv
  • data-collection/processed/**/*.csv
  • data-collection/raw/**/*.html
  • data-collection/processed/**/*.html

🏁 Script executed:

#!/bin/bash
echo "Checking for .html files outside data-collection directory..."
rg --files --glob '*.html' | grep -v '^data-collection/'
echo ""
echo "Checking for .csv files outside data-collection directory..."
rg --files --glob '*.csv' | grep -v '^data-collection/'

Length of output: 391


🏁 Script executed:

#!/bin/bash
echo "Checking for a data-collection directory..."
fd -t d data-collection || echo ">> data-collection directory not found"

echo
echo "Searching for HTML files under data-collection..."
rg --files --glob 'data-collection/**/*.html' || echo ">> No HTML files found in data-collection"

echo
echo "Searching for CSV files under data-collection..."
rg --files --glob 'data-collection/**/*.csv' || echo ">> No CSV files found in data-collection"

Length of output: 650


Restrict HTML/CSV ignores to data-collection only

To avoid inadvertently hiding docs or test fixtures elsewhere in the repo, replace the global *.html and *.csv ignores with patterns scoped to your data-collection outputs.

• File: .gitignore (lines 95–100)
• Remove:

-# HTMLs
-*.html
-
-# CSVs
-*.csv

• Add:

+# Data-collection outputs
+data-collection/**/*.html
+data-collection/**/*.csv

If you know your outputs live in subfolders (e.g. raw/ and processed/), you can tighten further:

data-collection/raw/**/*.csv
data-collection/processed/**/*.csv
data-collection/raw/**/*.html
data-collection/processed/**/*.html
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# HTMLs
*.html
# CSVs
*.csv
# Data-collection outputs
data-collection/**/*.html
data-collection/**/*.csv
🤖 Prompt for AI Agents
In the .gitignore file around lines 95 to 100, the current patterns '*.html' and
'*.csv' ignore these file types globally, which may hide important docs or test
fixtures elsewhere. Replace these global ignores with more specific patterns
scoped to the data-collection directory, such as 'data-collection/raw/**/*.csv',
'data-collection/processed/**/*.csv', 'data-collection/raw/**/*.html', and
'data-collection/processed/**/*.html' to restrict ignoring only to your
data-collection outputs.

Comment on lines +7 to +32
# Test core dependencies
try:
print("✅ OpenAI package imported successfully")
except ImportError as e:
print(f"❌ OpenAI import failed: {e}")

try:
print("✅ OpenAI Agents package imported successfully")
except ImportError as e:
print(f"❌ OpenAI Agents import failed: {e}")

try:
print("✅ FastAPI package imported successfully")
except ImportError as e:
print(f"❌ FastAPI import failed: {e}")

try:
print("✅ Pydantic package imported successfully")
except ImportError as e:
print(f"❌ Pydantic import failed: {e}")

try:
print("✅ Supabase package imported successfully")
except ImportError as e:
print(f"❌ Supabase import failed: {e}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Test never imports the modules it claims to verify

Each try block only prints a success message; the actual import statements are missing, so failures will never surface.

Example fix for one block (apply to all):

try:
    import openai  # noqa: F401
    print("✅ OpenAI package imported successfully")
except ImportError as e:
    print(f"❌ OpenAI import failed: {e}")
    raise

Also add assertions to fail the test when an import is missing so CI can detect environment problems.

🤖 Prompt for AI Agents
In ai-backend/tests/test_environment.py between lines 7 and 32, the try blocks
only print success messages without actually importing the modules, so import
failures won't be detected. Fix this by adding the appropriate import statements
inside each try block for the respective packages (e.g., import openai, import
fastapi, etc.). Also, add assertions or raise the ImportError in the except
blocks to ensure the test fails and CI detects missing dependencies.

@@ -0,0 +1,142 @@
import pytest

from tools.sports_apis import APIFootballClient

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Import path likely wrong

from tools.sports_apis import … assumes a top-level tools package.
From the repo layout it should be:

-from tools.sports_apis import APIFootballClient
+from ai_backend.tools.sports_apis import APIFootballClient

(or update PYTHONPATH).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from tools.sports_apis import APIFootballClient
from ai_backend.tools.sports_apis import APIFootballClient
🤖 Prompt for AI Agents
In ai-backend/tests/test_fixtures.py at line 3, the import statement uses 'from
tools.sports_apis import APIFootballClient', which assumes 'tools' is a
top-level package. To fix this, adjust the import path to reflect the actual
relative location of the 'tools' package within the repo structure, or
alternatively update the PYTHONPATH environment variable to include the
directory containing 'tools' so the import resolves correctly.

Comment on lines +82 to +86
data = await response.json()
data_stringified = json.dumps(data, indent=2)

# logger.info(f"Response data: {data_stringified}")
return data_stringified

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Return type mismatches the annotation

_fetch_data (and therefore test_api_auth) claim to return list[dict] but actually return a JSON string. Either adjust the annotation or return the parsed list/dict:

-                data = await response.json()
-                data_stringified = json.dumps(data, indent=2)
-                return data_stringified
+                return await response.json()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
data = await response.json()
data_stringified = json.dumps(data, indent=2)
# logger.info(f"Response data: {data_stringified}")
return data_stringified
# logger.info(f"Response data: {data_stringified}")
return await response.json()
🤖 Prompt for AI Agents
In ai-backend/tools/test_api_auth.py around lines 82 to 86, the function
_fetch_data is annotated to return a list of dictionaries but currently returns
a JSON string. To fix this, modify the return statement to return the parsed
JSON data directly (the list/dict) instead of the stringified JSON, or
alternatively update the function's return type annotation to indicate it
returns a string. Choose the approach that aligns with the expected usage of the
function.

Comment on lines +15 to +21
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Unconditional Selenium import contradicts “no hard dependency” claim

selenium is imported at module load time. If the package isn’t installed (CI, head-less servers), simply importing this module will raise ModuleNotFoundError, even when Selenium features aren’t used.

Move the import inside download_with_selenium() (guarded by try/except ImportError) or wrap the top-level import in a soft-fail block.

🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 15 to 21, the
selenium imports are unconditional, causing a ModuleNotFoundError if selenium is
not installed even when its features are not used. To fix this, move all
selenium-related imports inside the download_with_selenium() function and wrap
them in a try/except ImportError block to softly handle the absence of selenium
and avoid import errors at module load time.

Comment on lines +67 to +96
def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame:
"""Extract specified columns from HTML using BeautifulSoup."""
parsed_rows = []
# Find the specific table we care about using BeautifulSoup
table = soup.find("table", class_="stats_table")

if table:
for row in table.find("tbody").find_all("tr"):
# Find the specific cells '<td>' by their 'data-stat' attribute
fields = {"date": "", "score": "", "home_team": "", "away_team": ""}
for field in fields:
# Find the specific cells '<td>' by their 'data-stat' attribute
cell = row.find("td", {"data-stat": field})
# Extract the text from each cell, checking if the cell was found
fields[field] = cell.text.strip() if cell else ""

# Create a dictionary for each row
parsed_rows.append(fields)

# Convert our list of dictionaries into a pandas DataFrame
if parsed_rows:
# 2nd arg preserves the order of the columns
df_bs = pd.DataFrame(
parsed_rows, columns=["date", "score", "home_team", "away_team"]
)
print("\nSuccess! Extracted columns using BeautifulSoup:\n")
else:
print("\nFailed to extract columns using BeautifulSoup.")

return df_bs

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

extract_columns() can raise UnboundLocalError and misses null-safety

Issues:

  1. When parsed_rows is empty the variable df_bs is never defined, yet it’s returned – this will crash.
  2. table.find("tbody") may return None; chaining .find_all() without a check causes AttributeError.
  3. Parameter data_fields is accepted but never used → dead code / confusion.

Minimal patch:

-    if table:
-        for row in table.find("tbody").find_all("tr"):
+    if table and table.tbody:
+        for row in table.tbody.find_all("tr"):
@@
-    if parsed_rows:
+    if parsed_rows:
         df_bs = pd.DataFrame(
             parsed_rows, columns=["date", "score", "home_team", "away_team"]
         )
         print("\nSuccess! Extracted columns using BeautifulSoup:\n")
-    else:
-        print("\nFailed to extract columns using BeautifulSoup.")
-
-    return df_bs
+        return df_bs
+
+    print("\nFailed to extract columns using BeautifulSoup.")
+    return pd.DataFrame(columns=["date", "score", "home_team", "away_team"])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame:
"""Extract specified columns from HTML using BeautifulSoup."""
parsed_rows = []
# Find the specific table we care about using BeautifulSoup
table = soup.find("table", class_="stats_table")
if table:
for row in table.find("tbody").find_all("tr"):
# Find the specific cells '<td>' by their 'data-stat' attribute
fields = {"date": "", "score": "", "home_team": "", "away_team": ""}
for field in fields:
# Find the specific cells '<td>' by their 'data-stat' attribute
cell = row.find("td", {"data-stat": field})
# Extract the text from each cell, checking if the cell was found
fields[field] = cell.text.strip() if cell else ""
# Create a dictionary for each row
parsed_rows.append(fields)
# Convert our list of dictionaries into a pandas DataFrame
if parsed_rows:
# 2nd arg preserves the order of the columns
df_bs = pd.DataFrame(
parsed_rows, columns=["date", "score", "home_team", "away_team"]
)
print("\nSuccess! Extracted columns using BeautifulSoup:\n")
else:
print("\nFailed to extract columns using BeautifulSoup.")
return df_bs
def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame:
"""Extract specified columns from HTML using BeautifulSoup."""
parsed_rows = []
# Find the specific table we care about using BeautifulSoup
table = soup.find("table", class_="stats_table")
if table and table.tbody:
for row in table.tbody.find_all("tr"):
# Find the specific cells '<td>' by their 'data-stat' attribute
fields = {"date": "", "score": "", "home_team": "", "away_team": ""}
for field in fields:
# Find the specific cells '<td>' by their 'data-stat' attribute
cell = row.find("td", {"data-stat": field})
# Extract the text from each cell, checking if the cell was found
fields[field] = cell.text.strip() if cell else ""
# Create a dictionary for each row
parsed_rows.append(fields)
if parsed_rows:
df_bs = pd.DataFrame(
parsed_rows, columns=["date", "score", "home_team", "away_team"]
)
print("\nSuccess! Extracted columns using BeautifulSoup:\n")
return df_bs
print("\nFailed to extract columns using BeautifulSoup.")
return pd.DataFrame(columns=["date", "score", "home_team", "away_team"])
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 67 to 96, fix the
UnboundLocalError by ensuring df_bs is always defined before return, even if
parsed_rows is empty, by initializing df_bs to an empty DataFrame. Add a null
check for table.find("tbody") before calling find_all to avoid AttributeError.
Remove the unused parameter data_fields from the function signature and any
related code to eliminate dead code and confusion.

Comment thread data-collection/collectors/football_data_collector.py
Comment on lines +111 to +119
def add_new_columns_to_football_data(df_clean, file):
"""Add new columns to df_clean."""
df_clean["match_id"] = df_clean.index

tmp = file.split("_")
df_clean["league"] = tmp[1]
df_clean["season"] = tmp[2].split(".")[0]
df_clean["source"] = tmp[0]
return df_clean

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Shadowing built-in file and fragile filename parsing

Rename parameter and parse safely:

-def add_new_columns_to_football_data(df_clean, file):
+def add_new_columns_to_football_data(df_clean: pd.DataFrame, filename: str) -> pd.DataFrame:
@@
-    tmp = file.split("_")
+    tmp = os.path.basename(filename).split("_")

Also use a stable match_id (e.g., hash of league/date/teams) instead of the index to guarantee cross-file uniqueness.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def add_new_columns_to_football_data(df_clean, file):
"""Add new columns to df_clean."""
df_clean["match_id"] = df_clean.index
tmp = file.split("_")
df_clean["league"] = tmp[1]
df_clean["season"] = tmp[2].split(".")[0]
df_clean["source"] = tmp[0]
return df_clean
def add_new_columns_to_football_data(df_clean: pd.DataFrame, filename: str) -> pd.DataFrame:
"""Add new columns to df_clean."""
df_clean["match_id"] = df_clean.index
tmp = os.path.basename(filename).split("_")
df_clean["league"] = tmp[1]
df_clean["season"] = tmp[2].split(".")[0]
df_clean["source"] = tmp[0]
return df_clean
🤖 Prompt for AI Agents
In data-collection/collectors/football_data_collector.py around lines 111 to
119, rename the parameter 'file' to avoid shadowing the built-in name, and
replace the fragile filename parsing with a more robust method such as using
pathlib or regex to extract parts safely. Additionally, replace the current
match_id assignment from the dataframe index with a stable unique identifier
generated by hashing a combination of league, date, and teams to ensure
uniqueness across files.

Comment thread data-collection/scripts/collect_all.py
Comment thread scripts/seed-data.py

# Add the shared directory to the Python path
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'shared'))
sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Avoid run-time sys.path manipulation

Appending ../shared to sys.path is brittle and can hide packaging/import problems. Prefer:

-# Add the shared directory to the Python path
-sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared"))
+import importlib.util, pathlib, sys
+shared_path = pathlib.Path(__file__).resolve().parent.parent / "shared"
+sys.path.insert(0, str(shared_path))

…or, better, convert shared into a proper Python package and install it with your project.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared"))
import importlib.util, pathlib, sys
shared_path = pathlib.Path(__file__).resolve().parent.parent / "shared"
sys.path.insert(0, str(shared_path))
🤖 Prompt for AI Agents
In scripts/seed-data.py at line 14, avoid modifying sys.path at runtime by
appending ../shared, as this is brittle and can mask import issues. Instead,
refactor the project structure to make shared a proper Python package with an
__init__.py file, and install it as part of your project dependencies so it can
be imported normally without sys.path hacks.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
data-collection/collectors/football_data_collector.py (1)

159-165: save_df_to_csv() should return useful info

Script relies on its return value but the function returns None.
Return either the processed DataFrame or the saved file path.

-    df.to_csv(filepath, index=False)
-    print(f"Processed CSV saved as '{filepath}'.")
+    df.to_csv(filepath, index=False)
+    print(f"Processed CSV saved as '{filepath}'.")
+    return filepath
♻️ Duplicate comments (4)
data-collection/collectors/fbref_collector.py (2)

14-18: Hard dependency on Selenium contradicts doc-string

The module promises “no hard Selenium dependencies”, yet selenium.* is imported at module import time. On environments without Selenium this raises ModuleNotFoundError, blocking even non-Selenium code paths.
Move these imports inside download_with_selenium() and guard them with a try/except ImportError soft-fail.


60-89: extract_columns() can still crash & misses null-safety

table.find("tbody") may return NoneAttributeError.
• When parsed_rows is empty df_bs is undefined → UnboundLocalError on return.
data_fields parameter is unused – dead code / confusion.

-    if table:
-        for row in table.find("tbody").find_all("tr"):
+    if table and table.tbody:
+        for row in table.tbody.find_all("tr"):
@@
-    if parsed_rows:
+    if parsed_rows:
         df_bs = pd.DataFrame(
             parsed_rows, columns=["date", "score", "home_team", "away_team"]
         )
         print("\nSuccess! Extracted columns using BeautifulSoup:\n")
-    else:
-        print("\nFailed to extract columns using BeautifulSoup.")
-
-    return df_bs
+        return df_bs
+
+    print("\nFailed to extract columns using BeautifulSoup.")
+    return pd.DataFrame(columns=["date", "score", "home_team", "away_team"])

Also drop the unused data_fields arg from the signature.

data-collection/collectors/football_data_collector.py (2)

63-88: download_csv(): missing directory creation, timeout & overwrite guard

Same issues flagged earlier remain: no os.makedirs, requests.get without timeout, and doc-string mentions overwrite flag that is not implemented.


110-118: Shadowing built-in name & fragile filename parsing

Parameter file shadows the historic built-in and split("_") fails on paths containing underscores or directories.
Rename to filename and use os.path.basename() or pathlib for safe parsing.

🧹 Nitpick comments (1)
data-collection/collectors/fbref_collector.py (1)

206-210: Mac-specific Chrome binary hard-coded

The explicit /Applications/... path breaks on Linux/CI. Detect the OS or let ChromeDriver pick the installed binary instead.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7604e4 and 4b49438.

📒 Files selected for processing (3)
  • data-collection/collectors/fbref_collector.py (1 hunks)
  • data-collection/collectors/football_data_collector.py (1 hunks)
  • data-collection/scripts/collect_all.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
data-collection/collectors/fbref_collector.py (1)
data-collection/collectors/football_data_collector.py (5)
  • _here (47-48)
  • get_data_raw_folder (51-54)
  • get_data_processed_folder (57-60)
  • normalize_date (148-156)
  • get_columns (122-125)
data-collection/collectors/football_data_collector.py (1)
data-collection/collectors/fbref_collector.py (5)
  • _here (21-22)
  • get_data_raw_folder (25-28)
  • get_data_processed_folder (31-34)
  • get_columns (170-173)
  • normalize_date (140-148)
🪛 GitHub Check: Codacy Static Code Analysis
data-collection/scripts/collect_all.py

[warning] 116-116: data-collection/scripts/collect_all.py#L116
Assigning result of a function call, where the function has no return

data-collection/collectors/fbref_collector.py

[warning] 11-11: data-collection/collectors/fbref_collector.py#L11
Unused BeautifulSoup imported from bs4


[warning] 17-17: data-collection/collectors/fbref_collector.py#L17
Unused expected_conditions imported from selenium.webdriver.support as EC


[warning] 145-145: data-collection/collectors/fbref_collector.py#L145
Try, Except, Pass detected.


[warning] 259-259: data-collection/collectors/fbref_collector.py#L259
Standard pseudo-random generators are not suitable for security/cryptographic purposes.

data-collection/collectors/football_data_collector.py

[warning] 80-80: data-collection/collectors/football_data_collector.py#L80
Call to requests without timeout


[warning] 153-153: data-collection/collectors/football_data_collector.py#L153
Try, Except, Pass detected.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis

Comment thread data-collection/scripts/collect_all.py Outdated
Comment on lines +138 to +146
html_filepath = get_data_folder() + "/raw/" + html_filename
try:
with open(html_filepath, "r", encoding="utf-8") as f:
html_content = f.read()
soup = BeautifulSoup(html_content, "lxml")
except (FileNotFoundError, PermissionError, UnicodeDecodeError) as e:
print(f"Error reading file: {e}")
# Return an empty soup as fallback
soup = BeautifulSoup("", "lxml")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Non-portable path concatenation to html_filepath

Using string + "/raw/" + bypasses os.path.join, breaks on Windows, and risks double slashes.

-html_filepath = get_data_folder() + "/raw/" + html_filename
+html_filepath = os.path.join(get_data_folder(), "raw", html_filename)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
html_filepath = get_data_folder() + "/raw/" + html_filename
try:
with open(html_filepath, "r", encoding="utf-8") as f:
html_content = f.read()
soup = BeautifulSoup(html_content, "lxml")
except (FileNotFoundError, PermissionError, UnicodeDecodeError) as e:
print(f"Error reading file: {e}")
# Return an empty soup as fallback
soup = BeautifulSoup("", "lxml")
html_filepath = os.path.join(get_data_folder(), "raw", html_filename)
try:
with open(html_filepath, "r", encoding="utf-8") as f:
html_content = f.read()
soup = BeautifulSoup(html_content, "lxml")
except (FileNotFoundError, PermissionError, UnicodeDecodeError) as e:
print(f"Error reading file: {e}")
# Return an empty soup as fallback
soup = BeautifulSoup("", "lxml")
🤖 Prompt for AI Agents
In data-collection/scripts/collect_all.py around lines 138 to 146, the
construction of html_filepath uses string concatenation with "+ '/raw/' +",
which is non-portable and can cause issues on Windows or result in double
slashes. Replace this concatenation with os.path.join to correctly and portably
join the directory paths and filename, ensuring compatibility across operating
systems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🔭 Outside diff range comments (1)
data-collection/collectors/football_data_collector.py (1)

160-166: Make save robust: ensure folder exists, use splitext, and return filepath

Current code may fail if folder is missing and doesn’t return the path.

-def save_df_to_csv(df, raw_filename):
-    """Save DataFrame to CSV"""
-    processed_filename = raw_filename.split(".")[0] + "_processed.csv"
-    filepath = os.path.join(get_data_processed_folder(), processed_filename)
-    df.to_csv(filepath, index=False)
-    print(f"Processed CSV saved as '{filepath}'.")
+def save_df_to_csv(df: pd.DataFrame, raw_filename: str) -> str:
+    """Save DataFrame to CSV and return the output path."""
+    root, _ = os.path.splitext(os.path.basename(raw_filename))
+    processed_filename = f"{root}_processed.csv"
+    out_dir = get_data_processed_folder()
+    os.makedirs(out_dir, exist_ok=True)
+    filepath = os.path.join(out_dir, processed_filename)
+    df.to_csv(filepath, index=False)
+    print(f"Processed CSV saved as '{filepath}'.")
+    return filepath
♻️ Duplicate comments (4)
data-collection/collectors/fbref_collector.py (2)

13-17: Unconditional Selenium import contradicts “no hard dependency” and breaks import if Selenium isn’t installed

Move Selenium imports into the functions that need them and gate type hints to avoid runtime import. Also implement the documented requests-based fallback.

Apply:

@@
-import random
-from selenium import webdriver
-from selenium.webdriver.chrome.options import Options
-from selenium.webdriver.support.ui import WebDriverWait
-from selenium.common.exceptions import TimeoutException, WebDriverException
+import random
+import hashlib
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    # For type hints only; does not enforce Selenium at runtime.
+    from selenium.webdriver.chrome.options import Options

58-88: Fix UnboundLocalError, add null-safety, and actually use data_fields

When no rows are parsed, df_bs is undefined. Also table.tbody can be None. Parameter data_fields is unused.

-def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame:
+def extract_columns(soup, data_fields: list[str] | None, html_filepath: str) -> pd.DataFrame:
     """Extract specified columns from HTML using BeautifulSoup."""
-    parsed_rows = []
-    # Find the specific table we care about using BeautifulSoup
-    table = soup.find("table", class_="stats_table")
+    parsed_rows: list[dict] = []
+    # Find the specific table we care about using BeautifulSoup
+    table = soup.find("table", class_="stats_table")
 
-    if table:
-        for row in table.find("tbody").find_all("tr"):
+    fields_to_extract = data_fields or ["date", "score", "home_team", "away_team"]
+
+    if table and getattr(table, "tbody", None):
+        for row in table.tbody.find_all("tr"):
             # Find the specific cells '<td>' by their 'data-stat' attribute
-            fields = {"date": "", "score": "", "home_team": "", "away_team": ""}
-            for field in fields:
+            fields = {k: "" for k in fields_to_extract}
+            for field in fields_to_extract:
                 # Find the specific cells '<td>' by their 'data-stat' attribute
                 cell = row.find("td", {"data-stat": field})
                 # Extract the text from each cell, checking if the cell was found
                 fields[field] = cell.text.strip() if cell else ""
 
             # Create a dictionary for each row
             parsed_rows.append(fields)
 
     # Convert our list of dictionaries into a pandas DataFrame
-    if parsed_rows:
-        # 2nd arg preserves the order of the columns
-        df_bs = pd.DataFrame(
-            parsed_rows, columns=["date", "score", "home_team", "away_team"]
-        )
-        print("\nSuccess! Extracted columns using BeautifulSoup:\n")
-    else:
-        print("\nFailed to extract columns using BeautifulSoup.")
-
-    return df_bs
+    if parsed_rows:
+        df_bs = pd.DataFrame(parsed_rows, columns=fields_to_extract)
+        print("\nSuccess! Extracted columns using BeautifulSoup:\n")
+        return df_bs
+
+    print("\nFailed to extract columns using BeautifulSoup.")
+    return pd.DataFrame(columns=fields_to_extract)
data-collection/collectors/football_data_collector.py (2)

63-90: Implement timeout, directory creation, and real overwrite semantics (docstring mismatch)

Function can hang without timeout, fail if directory doesn’t exist, and mentions overwrite that doesn’t exist.

-def download_csv(url: str, filename: str = "football-data_data.csv"):
+def download_csv(
+    url: str,
+    filename: str = "football-data_data.csv",
+    *,
+    timeout: int = 30,
+    overwrite: bool = True,
+) -> str:
@@
-    - filename (str): Desired name for the saved file. Defaults to 'data.csv'.
-    - overwrite (bool): Whether to overwrite existing file. Defaults to False.
+    - filename (str): Desired name for the saved file.
+    - timeout (int): Network timeout in seconds.
+    - overwrite (bool): Whether to overwrite existing file. Defaults to True.
@@
-    filepath = os.path.join(get_data_raw_folder(), filename)
+    raw_dir = get_data_raw_folder()
+    os.makedirs(raw_dir, exist_ok=True)
+    filepath = os.path.join(raw_dir, filename)
+
+    if not overwrite and os.path.exists(filepath):
+        raise FileExistsError(f"{filepath} already exists. Set overwrite=True to replace.")
@@
-        response = requests.get(url)
+        response = requests.get(url, timeout=timeout)
         response.raise_for_status()  # Raise HTTPError for bad responses
@@
     # Write to data/raw/filename
     with open(filepath, "wb") as f:
         f.write(response.content)
     print(f"Raw CSV saved as {filepath}\n")
+    return filepath

110-119: Avoid fragile filename parsing and non-unique IDs; don’t use file as a parameter name

Use basename parsing and a stable deterministic match_id.

-def add_new_columns_to_football_data(df_clean, file):
+def add_new_columns_to_football_data(df_clean: pd.DataFrame, filename: str) -> pd.DataFrame:
     """Add new columns to df_clean."""
-    df_clean["match_id"] = df_clean.index
-
-    tmp = file.split("_")
-    df_clean["league"] = tmp[1]
-    df_clean["season"] = tmp[2].split(".")[0]
-    df_clean["source"] = tmp[0]
+    base = os.path.basename(filename)
+    parts = base.split("_")
+    df_clean["league"] = parts[1] if len(parts) > 1 else ""
+    df_clean["season"] = parts[2].split(".")[0] if len(parts) > 2 else ""
+    df_clean["source"] = parts[0] if parts else ""
+
+    # Stable, cross-file unique match_id
+    import hashlib
+    def _mk_id(row) -> str:
+        key = f"{row.get('date','')}|{row.get('league','')}|{row.get('season','')}|{row.get('home_team','')}|{row.get('away_team','')}"
+        return hashlib.sha1(key.encode('utf-8')).hexdigest()[:16]
+    df_clean["match_id"] = df_clean.apply(_mk_id, axis=1)
     return df_clean
🧹 Nitpick comments (6)
data-collection/collectors/fbref_collector.py (3)

35-46: League slugging may be fragile

Replacing spaces with '-' won’t cover punctuation/diacritics (e.g., “Liga Portugal”, “Ligue 1 Uber Eats”). Consider a mapping table for FBref slugs or accept an explicit league_slug parameter to avoid 404s.


258-259: Static analysis warning on random is a false positive in this context

Use of random.uniform here is not for crypto; it’s fine. If your linter enforces this rule, suppress it or switch to secrets/SystemRandom for appeasement.


212-223: Minor: remove strict dependency on Selenium types in annotations

To avoid import-time failures, keep annotations as strings or drop them. You’ve already gated imports; consider adding from future import annotations or using TYPE_CHECKING.

data-collection/collectors/football_data_collector.py (3)

21-45: Clarify API: rename league_id to league_code, drop unused league_name, and fix examples

league_name is unused; parameter name suggests numeric id, but path expects codes like “E0”, “SP1”. Examples for Champions League/La Liga show season code “2425” while text says 2023–2024 (should be “2324”).

-def generate_football_data_url(league_id, league_name, season):
+def generate_football_data_url(league_code: str, season: int) -> str:
@@
-    ex. Champions League 2023-2024
-        https://www.football-data.co.uk/mmz4281/2425/E1.csv
+    ex. Champions League 2023-2024
+        https://www.football-data.co.uk/mmz4281/2324/E1.csv
@@
-    ex. La Liga 2023-2024
-        https://www.football-data.co.uk/mmz4281/2425/SP1.csv
+    ex. La Liga 2023-2024
+        https://www.football-data.co.uk/mmz4281/2324/SP1.csv
@@
-    url = base_url + f"/{season_code}/{league_id}.csv"
+    url = base_url + f"/{season_code}/{league_code}.csv"

128-146: Don’t print DataFrame heads in library code

Noisy logs and potential perf hit. Return the processed DataFrame; leave logging to callers or add a verbose flag.

 def reorder_df_football_data(df):
     # Get specified columns and reorder them
     df_processed = get_columns(
@@
     )
-    print("After getting essential columns:")
-    print(df_processed.head())
-    print("df_processed.columns:", df_processed.columns)
     return df_processed

47-61: DRY: centralize shared path and column helpers

Both collectors duplicate _here, get_data_raw_folder, get_data_processed_folder, get_columns, normalize_date. Extract into a shared module (e.g., data-collection/common/io_utils.py) and import from there to keep behavior consistent.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b49438 and 3be68e3.

📒 Files selected for processing (3)
  • data-collection/collectors/fbref_collector.py (1 hunks)
  • data-collection/collectors/football_data_collector.py (1 hunks)
  • data-collection/scripts/collect_all.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • data-collection/scripts/collect_all.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
data-collection/collectors/fbref_collector.py (1)
data-collection/collectors/football_data_collector.py (5)
  • _here (47-48)
  • get_data_raw_folder (51-54)
  • get_data_processed_folder (57-60)
  • normalize_date (148-157)
  • get_columns (122-125)
🪛 GitHub Check: Codacy Static Code Analysis
data-collection/collectors/fbref_collector.py

[warning] 258-258: data-collection/collectors/fbref_collector.py#L258
Standard pseudo-random generators are not suitable for security/cryptographic purposes.

data-collection/collectors/football_data_collector.py

[warning] 80-80: data-collection/collectors/football_data_collector.py#L80
Call to requests without timeout

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis

Comment on lines +90 to +103
def add_new_columns_to_fbref(df: pd.DataFrame, file) -> pd.DataFrame:
# Fill in match_id column
df["match_id"] = range(len(df))
# Split score into home_score and away_score
df["home_score"], df["away_score"] = convert_score_to_home_score_and_away_score(
df["score"]
)

tmp = file.split("_")
df["league"] = tmp[1]
df["season"] = tmp[2].split(".")[0]
df["source"] = tmp[0]

return df

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Use stable match_id and robust filename parsing; avoid file param name

file.split("_") is fragile, and index-based match_id isn’t globally unique.

-def add_new_columns_to_fbref(df: pd.DataFrame, file) -> pd.DataFrame:
-    # Fill in match_id column
-    df["match_id"] = range(len(df))
+def add_new_columns_to_fbref(df: pd.DataFrame, filename: str) -> pd.DataFrame:
+    # Robust filename parsing
+    base = os.path.basename(filename)
+    parts = base.split("_")
+    # Expecting: source_league_season.html or similar
+    league = parts[1] if len(parts) > 1 else ""
+    season = parts[2].split(".")[0] if len(parts) > 2 else ""
@@
-    tmp = file.split("_")
-    df["league"] = tmp[1]
-    df["season"] = tmp[2].split(".")[0]
-    df["source"] = tmp[0]
+    df["league"] = league
+    df["season"] = season
+    df["source"] = parts[0] if parts else ""
+
+    # Stable deterministic match_id
+    def _mk_id(row) -> str:
+        key = f"{row.get('date','')}|{row.get('league','')}|{row.get('season','')}|{row.get('home_team','')}|{row.get('away_team','')}"
+        return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
+    df["match_id"] = df.apply(_mk_id, axis=1)
 
     return df

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 90 to 103, the
current code uses a non-unique index-based match_id and fragile filename parsing
by splitting on underscores. To fix this, generate a stable, unique match_id
based on consistent match attributes instead of the row index, and replace the
filename parsing with a more robust method such as using regex or pathlib to
extract league, season, and source reliably. Also, rename the parameter from
'file' to avoid shadowing built-in names.

Comment on lines +175 to +186
def create_csv(df: pd.DataFrame, filename: str) -> str:
"""Save DataFrame to CSV under data/processed/filename"""

filepath = os.path.join(get_data_processed_folder(), filename)

# If output_path is not absolute, write under processed folder
if not os.path.isabs(filepath):
os.makedirs(os.path.dirname(filepath), exist_ok=True)
df.to_csv(filepath, index=False)
print(f"Processed CSV saved as '{filepath}'.")
return filepath

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Directory may not exist; ensure creation before writing CSV

filepath is absolute, so the current conditional skips makedirs. Create the directory unconditionally.

 def create_csv(df: pd.DataFrame, filename: str) -> str:
     """Save DataFrame to CSV under data/processed/filename"""
 
     filepath = os.path.join(get_data_processed_folder(), filename)
 
-    # If output_path is not absolute, write under processed folder
-    if not os.path.isabs(filepath):
-        os.makedirs(os.path.dirname(filepath), exist_ok=True)
+    os.makedirs(os.path.dirname(filepath), exist_ok=True)
     df.to_csv(filepath, index=False)
     print(f"Processed CSV saved as '{filepath}'.")
     return filepath
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def create_csv(df: pd.DataFrame, filename: str) -> str:
"""Save DataFrame to CSV under data/processed/filename"""
filepath = os.path.join(get_data_processed_folder(), filename)
# If output_path is not absolute, write under processed folder
if not os.path.isabs(filepath):
os.makedirs(os.path.dirname(filepath), exist_ok=True)
df.to_csv(filepath, index=False)
print(f"Processed CSV saved as '{filepath}'.")
return filepath
def create_csv(df: pd.DataFrame, filename: str) -> str:
"""Save DataFrame to CSV under data/processed/filename"""
filepath = os.path.join(get_data_processed_folder(), filename)
os.makedirs(os.path.dirname(filepath), exist_ok=True)
df.to_csv(filepath, index=False)
print(f"Processed CSV saved as '{filepath}'.")
return filepath
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 175 to 186, the
directory creation is conditional on filepath not being absolute, but filepath
is always absolute, so the directory may not be created before writing the CSV.
Fix this by removing the condition and always calling os.makedirs on the
directory of filepath with exist_ok=True before saving the CSV.

Comment on lines +188 to +216
def _configure_chrome_options() -> Options:
"""Create and return a configured Chrome Options object."""
chrome_options = Options()
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
# chrome_options.add_argument('--headless') # Optional: enable headless mode

# User agent to mimic real browser
chrome_options.add_argument(
"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# Set Chrome binary location for macOS if available
chrome_bin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
if os.path.exists(chrome_bin):
chrome_options.binary_location = chrome_bin
return chrome_options


def _create_driver(chrome_options: Options):
"""Initialize and return a Chrome WebDriver instance."""
print("Starting Chrome WebDriver...")
return webdriver.Chrome(options=chrome_options)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Localize Selenium imports and avoid runtime type dependency in annotations

Import Selenium types inside functions and use string annotations to prevent ImportError at module import.

-def _configure_chrome_options() -> Options:
-    """Create and return a configured Chrome Options object."""
-    chrome_options = Options()
+def _configure_chrome_options() -> "Options":
+    """Create and return a configured Chrome Options object."""
+    from selenium.webdriver.chrome.options import Options
+    chrome_options = Options()
@@
-def _create_driver(chrome_options: Options):
-    """Initialize and return a Chrome WebDriver instance."""
-    print("Starting Chrome WebDriver...")
-    return webdriver.Chrome(options=chrome_options)
+def _create_driver(chrome_options):
+    """Initialize and return a Chrome WebDriver instance."""
+    print("Starting Chrome WebDriver...")
+    from selenium import webdriver
+    return webdriver.Chrome(options=chrome_options)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _configure_chrome_options() -> Options:
"""Create and return a configured Chrome Options object."""
chrome_options = Options()
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
# chrome_options.add_argument('--headless') # Optional: enable headless mode
# User agent to mimic real browser
chrome_options.add_argument(
"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
# Set Chrome binary location for macOS if available
chrome_bin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
if os.path.exists(chrome_bin):
chrome_options.binary_location = chrome_bin
return chrome_options
def _create_driver(chrome_options: Options):
"""Initialize and return a Chrome WebDriver instance."""
print("Starting Chrome WebDriver...")
return webdriver.Chrome(options=chrome_options)
def _configure_chrome_options() -> "Options":
"""Create and return a configured Chrome Options object."""
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
# chrome_options.add_argument('--headless') # Optional: enable headless mode
# User agent to mimic real browser
chrome_options.add_argument(
"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
# Set Chrome binary location for macOS if available
chrome_bin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
if os.path.exists(chrome_bin):
chrome_options.binary_location = chrome_bin
return chrome_options
def _create_driver(chrome_options):
"""Initialize and return a Chrome WebDriver instance."""
print("Starting Chrome WebDriver...")
from selenium import webdriver
return webdriver.Chrome(options=chrome_options)
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 188 to 216, the
Selenium imports and type annotations are currently at the module level, which
can cause ImportError during module import. Move all Selenium-related imports
inside the functions that use them, and change type annotations referencing
Selenium types to use string literals instead. This localizes dependencies and
avoids runtime import errors.

Comment on lines +225 to +236
def _wait_out_challenge(driver, timeout: int = 30) -> None:
"""Wait for common anti-bot challenge pages to clear, within timeout."""
print("Waiting for page to load and any challenges to complete...")
try:
WebDriverWait(driver, timeout).until(
lambda d: "Just a moment" not in d.title
and "Checking your browser" not in d.page_source
)
print("Challenge completed or page loaded successfully.")
except TimeoutException:
print("Timeout waiting for challenge completion, but proceeding anyway...")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Guard against missing Selenium and avoid NameError for TimeoutException

Import WebDriverWait/TimeoutException inside the function.

 def _wait_out_challenge(driver, timeout: int = 30) -> None:
     """Wait for common anti-bot challenge pages to clear, within timeout."""
     print("Waiting for page to load and any challenges to complete...")
     try:
+        from selenium.webdriver.support.ui import WebDriverWait
+        from selenium.common.exceptions import TimeoutException
         WebDriverWait(driver, timeout).until(
             lambda d: "Just a moment" not in d.title
             and "Checking your browser" not in d.page_source
         )
         print("Challenge completed or page loaded successfully.")
     except TimeoutException:
         print("Timeout waiting for challenge completion, but proceeding anyway...")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _wait_out_challenge(driver, timeout: int = 30) -> None:
"""Wait for common anti-bot challenge pages to clear, within timeout."""
print("Waiting for page to load and any challenges to complete...")
try:
WebDriverWait(driver, timeout).until(
lambda d: "Just a moment" not in d.title
and "Checking your browser" not in d.page_source
)
print("Challenge completed or page loaded successfully.")
except TimeoutException:
print("Timeout waiting for challenge completion, but proceeding anyway...")
def _wait_out_challenge(driver, timeout: int = 30) -> None:
"""Wait for common anti-bot challenge pages to clear, within timeout."""
print("Waiting for page to load and any challenges to complete...")
try:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
WebDriverWait(driver, timeout).until(
lambda d: "Just a moment" not in d.title
and "Checking your browser" not in d.page_source
)
print("Challenge completed or page loaded successfully.")
except TimeoutException:
print("Timeout waiting for challenge completion, but proceeding anyway...")
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 225 to 236, the
function _wait_out_challenge uses WebDriverWait and TimeoutException without
importing them inside the function, which can cause NameError if Selenium is
missing. To fix this, move the imports of WebDriverWait and TimeoutException
inside the _wait_out_challenge function to guard against missing Selenium and
avoid NameError.

Comment on lines +246 to +275
def download_with_selenium(url):
"""Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
driver = None
try:
options = _configure_chrome_options()
driver = _create_driver(options)
_remove_webdriver_flag(driver)

print(f"Navigating to {url} ...")
driver.get(url)

_wait_out_challenge(driver, timeout=30)
time.sleep(random.uniform(2, 5)) # small buffer to ensure full render

html_content = _get_html_or_raise(driver)
print("Successfully retrieved page.")
return html_content

except WebDriverException as e:
print(f"WebDriver error: {e}")
print("\nTroubleshooting:")
print("1. Make sure Chrome browser is installed")
print("2. Install ChromeDriver: brew install --cask chromedriver")
print("3. Or use: pip install webdriver-manager")
raise ValueError(f"WebDriver error: {e}")

except Exception as e:
print(f"Unexpected error: {e}")
raise ValueError(f"Failed to scrape with Selenium: {e}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add true “requests” fallback and avoid hard Selenium dependency

If Selenium isn’t available, gracefully fall back to a plain GET with headers and timeout. Also localize Selenium imports to this function and helpers.

 def download_with_selenium(url):
     """Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
     driver = None
     try:
-        options = _configure_chrome_options()
-        driver = _create_driver(options)
-        _remove_webdriver_flag(driver)
+        # Try Selenium path first; import lazily to avoid hard dependency.
+        try:
+            from selenium.common.exceptions import WebDriverException
+        except ImportError:
+            WebDriverException = Exception  # sentinel for isinstance checks
+
+        options = _configure_chrome_options()
+        driver = _create_driver(options)
+        _remove_webdriver_flag(driver)
@@
-        _wait_out_challenge(driver, timeout=30)
+        _wait_out_challenge(driver, timeout=30)
         time.sleep(random.uniform(2, 5))  # small buffer to ensure full render
@@
-        html_content = _get_html_or_raise(driver)
+        html_content = _get_html_or_raise(driver)
         print("Successfully retrieved page.")
         return html_content
-
-    except WebDriverException as e:
-        print(f"WebDriver error: {e}")
-        print("\nTroubleshooting:")
-        print("1. Make sure Chrome browser is installed")
-        print("2. Install ChromeDriver: brew install --cask chromedriver")
-        print("3. Or use: pip install webdriver-manager")
-        raise ValueError(f"WebDriver error: {e}")
+    except Exception as e:
+        # If Selenium is not installed or fails, try a lightweight fallback.
+        try:
+            import requests
+            print("Selenium unavailable/failed. Falling back to requests (static HTML only)...")
+            resp = requests.get(
+                url,
+                timeout=30,
+                headers={
+                    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+                    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+                },
+            )
+            resp.raise_for_status()
+            return resp.text
+        except Exception as re:
+            print(f"Fallback fetch failed: {re}")
+            raise ValueError(f"Failed to fetch page via Selenium and requests: {e}") from re
@@
     finally:
         if driver:
             print("Closing browser...")
             driver.quit()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def download_with_selenium(url):
"""Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
driver = None
try:
options = _configure_chrome_options()
driver = _create_driver(options)
_remove_webdriver_flag(driver)
print(f"Navigating to {url} ...")
driver.get(url)
_wait_out_challenge(driver, timeout=30)
time.sleep(random.uniform(2, 5)) # small buffer to ensure full render
html_content = _get_html_or_raise(driver)
print("Successfully retrieved page.")
return html_content
except WebDriverException as e:
print(f"WebDriver error: {e}")
print("\nTroubleshooting:")
print("1. Make sure Chrome browser is installed")
print("2. Install ChromeDriver: brew install --cask chromedriver")
print("3. Or use: pip install webdriver-manager")
raise ValueError(f"WebDriver error: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
raise ValueError(f"Failed to scrape with Selenium: {e}")
def download_with_selenium(url):
"""Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
driver = None
try:
# Try Selenium path first; import lazily to avoid hard dependency.
try:
from selenium.common.exceptions import WebDriverException
except ImportError:
WebDriverException = Exception # sentinel for isinstance checks
options = _configure_chrome_options()
driver = _create_driver(options)
_remove_webdriver_flag(driver)
print(f"Navigating to {url} ...")
driver.get(url)
_wait_out_challenge(driver, timeout=30)
time.sleep(random.uniform(2, 5)) # small buffer to ensure full render
html_content = _get_html_or_raise(driver)
print("Successfully retrieved page.")
return html_content
except Exception as e:
# If Selenium is not installed or fails, try a lightweight fallback.
try:
import requests
print("Selenium unavailable/failed. Falling back to requests (static HTML only)...")
resp = requests.get(
url,
timeout=30,
headers={
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
)
},
)
resp.raise_for_status()
return resp.text
except Exception as re:
print(f"Fallback fetch failed: {re}")
raise ValueError(
f"Failed to fetch page via Selenium and requests: {e}"
) from re
finally:
if driver:
print("Closing browser...")
driver.quit()
🧰 Tools
🪛 GitHub Check: Codacy Static Code Analysis

[warning] 258-258: data-collection/collectors/fbref_collector.py#L258
Standard pseudo-random generators are not suitable for security/cryptographic purposes.

🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 246 to 275, the
download_with_selenium function currently depends directly on Selenium without
fallback. Modify the function to localize Selenium imports inside it and its
helpers, then add a fallback to perform a plain HTTP GET request with
appropriate headers and timeout if Selenium is not installed or fails to run.
This avoids a hard dependency on Selenium and improves robustness.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (6)
data-collection/collectors/football_data_collector.py (2)

110-118: Fragile filename parsing & shadowing built-in file
Same issues flagged earlier: rename param, use os.path.basename, generate stable match_id.


63-88: Directory creation, overwrite flag still missing

download_csv():

  1. Fails if data/raw doesn’t exist (open() will raise).
  2. Docstring mentions overwrite but the flag is not implemented.
  3. No guard against silently clobbering an existing file.
+raw_dir = get_data_raw_folder()
+os.makedirs(raw_dir, exist_ok=True)
+filepath = os.path.join(raw_dir, filename)
+
+if not overwrite and os.path.exists(filepath):
+    raise FileExistsError(f"{filepath} exists. Set overwrite=True to replace.")
data-collection/collectors/fbref_collector.py (4)

12-17: Hard dependency on Selenium contradicts docstring

Top-level imports (selenium.*) will crash on environments without Selenium despite the “avoids hard dependency” claim. Move all Selenium imports inside the functions that require them and guard with try/except.


58-88: extract_columns() can crash & ignores data_fields

table.find("tbody") may be NoneAttributeError.
• When no rows parsed, df_bs is undefined → UnboundLocalError.
data_fields param is never used.
Harden null-checks, always return a DataFrame, or drop the unused param.


90-103: Non-unique match_id, fragile parsing, file shadowing

Issues previously noted remain: use deterministic hash for match_id, robust os.path.basename parsing, and rename file param.


175-185: Directory may not exist

create_csv() only calls os.makedirs when the path is not absolute, but filepath is always absolute—directory creation is skipped. Remove the condition.

🧹 Nitpick comments (1)
data-collection/collectors/football_data_collector.py (1)

21-44: Remove or use the league_name parameter

generate_football_data_url() ignores its league_name argument, which is misleading and invites bugs. Either incorporate it (e.g., slugify for sanity checks) or drop the parameter.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3be68e3 and de3f134.

📒 Files selected for processing (2)
  • data-collection/collectors/fbref_collector.py (1 hunks)
  • data-collection/collectors/football_data_collector.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
data-collection/collectors/football_data_collector.py (1)
data-collection/collectors/fbref_collector.py (5)
  • _here (19-20)
  • get_data_raw_folder (23-26)
  • get_data_processed_folder (29-32)
  • get_columns (169-172)
  • normalize_date (138-147)
data-collection/collectors/fbref_collector.py (1)
data-collection/collectors/football_data_collector.py (5)
  • _here (47-48)
  • get_data_raw_folder (51-54)
  • get_data_processed_folder (57-60)
  • normalize_date (148-157)
  • get_columns (122-125)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis

Comment on lines +246 to +277
def download_with_selenium(url):
"""Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
driver = None
try:
options = _configure_chrome_options()
driver = _create_driver(options)
_remove_webdriver_flag(driver)

print(f"Navigating to {url} ...")
driver.get(url)

_wait_out_challenge(driver, timeout=30)

# Using SystemRandom() is a secure random number generator. Itis a best practice for security-sensitive applications.
time.sleep(SystemRandom().uniform(2, 5)) # small buffer to ensure full render

html_content = _get_html_or_raise(driver)
print("Successfully retrieved page.")
return html_content

except WebDriverException as e:
print(f"WebDriver error: {e}")
print("\nTroubleshooting:")
print("1. Make sure Chrome browser is installed")
print("2. Install ChromeDriver: brew install --cask chromedriver")
print("3. Or use: pip install webdriver-manager")
raise ValueError(f"WebDriver error: {e}")

except Exception as e:
print(f"Unexpected error: {e}")
raise ValueError(f"Failed to scrape with Selenium: {e}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

No fallback despite promise

download_with_selenium() claims a “requests-based fallback” yet catches WebDriverException and merely re-raises. Implement the fallback or revise the docstring.

🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 246 to 277, the
function download_with_selenium() claims to have a requests-based fallback in
its docstring but currently does not implement any fallback logic and just
re-raises exceptions. To fix this, either implement the fallback by catching
WebDriverException and then attempting to fetch the page content using requests
as a backup, or update the docstring to remove the mention of a requests-based
fallback to accurately reflect the current behavior.

Comment on lines +160 to +165
def save_df_to_csv(df, raw_filename):
"""Save DataFrame to CSV"""
processed_filename = raw_filename.split(".")[0] + "_processed.csv"
filepath = os.path.join(get_data_processed_folder(), processed_filename)
df.to_csv(filepath, index=False)
print(f"Processed CSV saved as '{filepath}'.")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Ensure processed directory exists before write

save_df_to_csv() writes to data/processed but never os.makedirs(..., exist_ok=True). Add it to avoid FileNotFoundError.

🤖 Prompt for AI Agents
In data-collection/collectors/football_data_collector.py around lines 160 to
165, the function save_df_to_csv writes the DataFrame to a file in the processed
data directory but does not ensure the directory exists, which can cause a
FileNotFoundError. Before calling df.to_csv, add a call to os.makedirs with the
processed directory path and exist_ok=True to create the directory if it does
not exist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant