Historical football data collection by thequinn · Pull Request #34 · vibing-ai/sports-scribe

thequinn · 2025-08-08T20:25:18Z

Description

Goal:
Build an essential historical dataset for the top 3 football leagues using the free data sources below, covering the 5 most recent seasons, with a focus on MVP delivery.

Data sources:
football-data.co.uk
fbref.com

Steps completed:

1. Created the folder and file structure.
2. Added functionality to download multiple file formats, scrape the data, and clean and process it before saving to new CSV files.

Steps to complete:
3. Merge the data, e.g. merge the same league and season from the different sources.
4. Insert into Supabase

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
[ x ] ✨ New feature (non-breaking change which adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📚 Documentation update
🔧 Maintenance (dependencies, CI, build tools, etc.)
♻️ Refactoring (no functional changes)
⚡ Performance improvement

Changes Made

Change 1: data-collection/collectors/football_data_collector.py
Change 2: data-collection/collectors/fbref_collector.py
Change 3: data-collection/scripts/collect_all.py

Testing

How has this been tested?

Unit tests added/updated
Integration tests added/updated
[ x ] Manual testing performed
No testing required (documentation, etc.)

Test Configuration:

Python version: [e.g. 3.11]
Node.js version: [e.g. 18.17.0]
Browser (if applicable): [e.g. Chrome 119]

Platform Impact

Which parts of the system are affected?

Breaking Changes

Does this PR introduce any breaking changes?

[ x ] No breaking changes
Yes, breaking changes (please describe below)

If yes, describe the breaking changes and migration path:

Checklist

Before requesting a review, please ensure:

[ x ] Code follows the project's style guidelines
[ x ] Self-review of code has been performed
[ x ] Code is commented, particularly in hard-to-understand areas
Corresponding changes to documentation have been made
[ x ] Changes generate no new warnings
[ x ] Tests pass locally
Any dependent changes have been merged and published

Screenshots (if applicable)

Add screenshots to help explain your changes.

Related Issues

Closes #(issue_number)
Related to #(issue_number)

Additional Notes

Any additional information that reviewers should know.

Summary by CodeRabbit

New Features
- Introduced new utilities for collecting and processing football match data from FBref.com and football-data.co.uk, including handling bot protection and data normalization.
- Added scripts to automate data collection and processing from multiple online sources.
- Implemented asynchronous API clients for fetching football data and testing API authentication.
- Enhanced API client for fetching fixtures and teams with input validation, error handling, and detailed logging.
Tests
- Added comprehensive tests for API clients, OpenAI API connectivity, environment setup, and football data processing.
Configuration
- Updated static type checking, pre-commit, and pytest configurations for improved code quality and coverage.
- Expanded .gitignore to exclude HTML and CSV files.
Style
- Applied formatting improvements and code style consistency across multiple modules.
Documentation
- Added docstrings and comments to clarify new modules and package purposes.

…t 5 seasons

…taframe

coderabbitai · 2025-08-08T20:25:28Z

Walkthrough

This update introduces new data collection modules and scripts for football match data, implements and tests API clients for sports data, and enhances configuration and testing infrastructure. Key changes include new collectors for FBref and football-data.co.uk, expanded API-Football client functionality, additional test suites, stricter type and test configurations, and various formatting and stylistic improvements across the codebase.

Changes

Cohort / File(s)	Change Summary
Data Collection Modules `data-collection/collectors/__init__.py`, `data-collection/collectors/fbref_collector.py`, `data-collection/collectors/football_data_collector.py`	Introduced new modules for collecting and processing football match data from FBref.com and football-data.co.uk, including utilities for URL generation, HTML/CSV downloading, data extraction, cleaning, normalization, and saving. The FBref collector includes Selenium-based downloading to bypass bot protections.
Data Collection Script `data-collection/scripts/collect_all.py`	Added a script to automate downloading, processing, and saving football data from multiple sources using the new collectors, with error handling and field mapping utilities.
API-Football Client Enhancements `ai-backend/tools/sports_apis.py`	Expanded the APIFootballClient with full implementations for fetching fixtures and teams, input validation, error handling, and logging. Added new helper methods for payload building and HTTP requests.
API-Football Auth Test Client `ai-backend/tools/test_api_auth.py`	Introduced an async client for testing API-Football authentication and data retrieval, with context management, error handling, and demonstration usage.
OpenAI and Quality Tools Tests `ai-backend/tests/test_openai.py`, `ai-backend/tests/test_quality_tools.py`	Added tests for OpenAI API connectivity and a sample football data processor class, including async and sync methods for data processing and simulated fetching.
Fixtures API Tests `ai-backend/tests/test_fixtures.py`	Added async pytest functions to test APIFootballClient's get_fixtures method under various filter conditions using mocked HTTP responses.
Environment Test Script `ai-backend/tests/test_environment.py`	Added a script to verify installation of all required dependencies by attempting imports and printing results.
Configuration and Type Checking `.gitignore`, `ai-backend/.pre-commit-config.yaml`, `ai-backend/mypy.ini`, `ai-backend/pytest.ini`, `mypy.ini`	Updated ignore patterns to include HTML and CSV files, changed quoting style in pre-commit config, enhanced pytest options with coverage and markers, and introduced/expanded mypy configuration for stricter type checking and third-party import handling.
Settings and Formatting Updates `ai-backend/config/settings.py`, `ai-backend/agents/editor.py`, `ai-backend/agents/researcher.py`, `ai-backend/main.py`, `scripts/seed-data.py`	Updated settings to use uppercase env vars and new Pydantic validators; reformatted method signatures, logging, and data declarations for consistency; no logic changes.

Sequence Diagram(s)

sequenceDiagram
    participant Script as collect_all.py
    participant FBref as fbref_collector.py
    participant FootballData as football_data_collector.py
    participant Selenium as Selenium WebDriver
    participant Requests as requests

    Script->>FootballData: generate_football_data_url()
    Script->>FootballData: download_csv()
    Script->>FootballData: read_csv()
    Script->>FootballData: rename_columns(), add_new_columns_to_football_data()
    Script->>FootballData: reorder_df(), normalize_date(), save_df_to_csv()

    Script->>FBref: generate_fbref_url()
    Script->>Selenium: download_with_selenium()
    Selenium-->>FBref: HTML content
    Script->>FBref: save_html_to_file()
    Script->>FBref: extract_columns()
    Script->>FBref: add_new_columns_to_fbref(), reorder_df(), normalize_date(), create_csv()

sequenceDiagram
    participant Test as test_fixtures.py
    participant MockSession as MockSession
    participant API as APIFootballClient

    Test->>API: get_fixtures(...)
    API->>MockSession: get()
    MockSession-->>API: MockResponse (JSON)
    API-->>Test: List of fixtures

sequenceDiagram
    participant APIAuthTest as test_api_auth.py
    participant APIAuth as APIAuth
    participant aiohttp as aiohttp.ClientSession
    participant RapidAPI as API-Football

    APIAuthTest->>APIAuth: __aenter__()
    APIAuth->>aiohttp: ClientSession()
    APIAuthTest->>APIAuth: test_api_auth()
    APIAuth->>RapidAPI: GET /fixtures
    RapidAPI-->>APIAuth: JSON response
    APIAuth-->>APIAuthTest: Parsed data
    APIAuthTest->>APIAuth: __aexit__()
    APIAuth->>aiohttp: Close session

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇
With paws on keys and nose a-twitch,
I gathered football data, quick!
From FBref and CSV streams,
Selenium danced through bot-filled dreams.
API clients tested, settings refined—
This bunny leaves no bug behind!
🏆⚽


coderabbitai

Actionable comments posted: 10

🔭 Outside diff range comments (1)

ai-backend/.pre-commit-config.yaml (1)
25-26: Align Python versions across Black and mypy

There’s a mismatch between the Python target in your pre-commit Black hook and mypy config, which can lead to inconsistent parsing or type-checking behavior. Please align both tools to the same version.

• ai-backend/.pre-commit-config.yaml (line 25):
      language_version: python3.12
• ai-backend/mypy.ini (line 3):
python_version = 3.11
Suggested fix (choose one):

– Update mypy.ini to 3.12:
 [mypy]
-python_version = 3.11
+python_version = 3.12
 ...
– Or downgrade Black target to 3.11:
-      language_version: python3.12
+      language_version: python3.11

🧹 Nitpick comments (21)

data-collection/collectors/__init__.py (1)
1-3: Clarify scope in package docstring

If this package is focused on football data (per PR), reflect that for precision.

Apply this diff:
-"""
-Collectors package for sports data collection.
-"""
+"""
+Collectors for historical football data (fbref, football-data.co.uk).
+"""
ai-backend/mypy.ini (1)
46-67: Broad ignore_missing_imports reduces type safety; consider targeted alternatives

Good to unblock CI, but you’ll lose valuable checks for fastapi/pydantic/pytest, which generally have decent typing. Consider:

Remove ignores for fastapi/pydantic/uvicorn if possible.

Enable the Pydantic mypy plugin to improve model typing.

Add the Pydantic plugin at the top-level (outside this hunk):
[mypy]
plugins = pydantic.mypy
Then gradually remove these sections:

[mypy-fastapi.*], [mypy-pydantic.*], [mypy-uvicorn.*]
Verify locally which ones are still needed. If certain submodules lack stubs, prefer narrow, module-specific ignores or inline # type: ignore[import] over package-wide ignores.
ai-backend/agents/researcher.py (1)
24-37: Consider stronger typing for return structure

Instead of dict[str, Any], define a TypedDict or dataclass for the research payload. It improves IDE assistance and catches schema drift early.

Example:
from typing import TypedDict

class TeamHistory(TypedDict, total=False):
    head_to_head_record: dict[str, int]  # e.g., {"wins": 10, "draws": 5, "losses": 7}
    recent_form: list[str]               # e.g., ["W", "D", "L", ...]
    key_matches: list[dict[str, str]]    # e.g., [{"date": "...", "score": "2-1", ...}]
Then annotate: async def research_team_history(...) -> TeamHistory:
ai-backend/main.py (1)

270-279: Prefer structured logging instead of “%s” placeholders

logger.info("Starting server on %s:%s", host, port) is fine, but elsewhere the codebase uses keyword/value pairs (extra= or logger.info(key=value)). Aligning styles improves log-parsing consistency.

mypy.ini (1)

25-46: Consider using official stub packages instead of blanket ignore_missing_imports

beautifulsoup4 has a published stub package (types-beautifulsoup4), while aiohttp and openai ship inline type hints and should not need an ignore at all. Installing stubs only where they are actually missing lets you keep strict checking without silencing whole modules.

ai-backend/config/settings.py (2)

55-63: Validator duplicates min_length constraint

min_length=20 plus explicit len checks is redundant. Either rely on the field constraint or keep the validator but drop min_length to avoid double-error messages.

114-115: model_config is the Pydantic v2 configuration mechanism

model_config is Pydantic-v2 specific, and it is the form the v2 docs recommend; the nested class Config is the v1 style and is deprecated in v2. Keep model_config as is, and make sure the project pins pydantic>=2 so the setting is honored.
data-collection/collectors/football_data_collector.py (4)
22-45: Unused league_name parameter

league_name is never referenced; drop it or incorporate it to avoid API confusion.
-def generate_football_data_url(league_id, league_name, season):
+def generate_football_data_url(league_id: str, season: int) -> str:
129-146: Replace print with proper logging

Raw print() hampers library reuse and log aggregation. Inject a module-level logger instead.
-import pprint
+import logging
 …
-print("After getting essential columns:")
-print(df_processed.head())
+logger = logging.getLogger(__name__)
+…
+logger.debug("Essential columns sample:\n%s", df_processed.head())
149-157: Date normalization swallows errors silently

Returning invalid dates as NaT cast to "NaT" leads to string "NaT" values downstream. Consider dropping rows with NaT or at least logging count of failed parses.

160-165: Ensure processed directory exists and use os.path.splitext
-processed_filename = raw_filename.split(".")[0] + "_processed.csv"
+base, _ = os.path.splitext(raw_filename)
+processed_filename = f"{base}_processed.csv"
+os.makedirs(get_data_processed_folder(), exist_ok=True)
Return the saved path so callers can chain operations.
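The filename handling described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `processed_path` and its `processed_dir` argument are hypothetical names standing in for the collector's own helper and `get_data_processed_folder()`.

```python
import os
import tempfile


def processed_path(raw_filename: str, processed_dir: str) -> str:
    """Return the target path for a processed CSV, creating the folder if needed.

    `processed_path` and `processed_dir` are hypothetical names for this sketch.
    """
    # splitext only strips the final extension, so extra dots in the name survive.
    base, _ext = os.path.splitext(os.path.basename(raw_filename))
    os.makedirs(processed_dir, exist_ok=True)
    return os.path.join(processed_dir, f"{base}_processed.csv")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target_dir = os.path.join(tmp, "processed")
        # split(".")[0] would reduce this name to just "premier".
        print(processed_path("premier.league_2023.csv", target_dir))
```

Returning the path, rather than nothing, is what lets a caller pass the result straight to the next step.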
data-collection/collectors/fbref_collector.py (2)
100-111: Filename parsing is brittle

file.split("_") assumes exactly three underscores and no extra “_” in league names (e.g., “Champions_League”). Safer:
parts = file.rsplit("_", maxsplit=2)
if len(parts) != 3:
    raise ValueError(f"Unexpected filename format: {file}")
_, league, season_part = parts
season = season_part.split(".")[0]
158-180: Heavy code duplication with football_data_collector.py

reorder_df, get_columns, normalize_date, and CSV helpers are copy-pasted. Extract them into a shared util and import from both collectors to reduce maintenance cost.
data-collection/scripts/collect_all.py (2)
110-113: Variable reuse blurs types

df_cleaned = save_df_to_csv(df_cleaned, raw_csv_filename) reassigns the DataFrame variable to the path string returned by save_df_to_csv.
Prefer discarding the return value or storing it in a clearly-named variable to avoid accidental misuse later.

138-146: Path built via string concatenation – use os.path.join
html_filepath = os.path.join(get_data_folder(), "raw", html_filename)
avoids platform-specific separator bugs.
ai-backend/tools/sports_apis.py (1)

182-193: Logging entire team payload may leak unnecessary data

logger.info(f"Teams data: {json.dumps(data_response, indent=2)}") can emit thousands of lines and potentially sensitive info to logs.

Log the count instead, or guard behind a debug flag.
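One possible shape for the "log the count" suggestion is sketched below. The function name `log_teams_summary` is hypothetical, and the assumption that the items sit under a "response" key is not confirmed by the PR.

```python
import logging

logger = logging.getLogger(__name__)


def log_teams_summary(data_response: dict) -> int:
    """Log how many teams were returned instead of dumping the payload.

    Assumes the items live under a "response" key (an assumption of this sketch).
    """
    teams = data_response.get("response", [])
    logger.info("Fetched %d teams", len(teams))
    # The full payload is only emitted when DEBUG logging is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("Teams payload: %r", data_response)
    return len(teams)
```

The `isEnabledFor` guard avoids building a large log message when nobody will read it.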
ai-backend/tests/test_openai.py (1)
16-30: Convert print-based skip to pytest skip

Use pytest.skip() so the test suite reports a skipped test instead of silently passing.
import pytest

if not os.getenv("OPENAI_API_KEY"):
    pytest.skip("OPENAI_API_KEY not set", allow_module_level=True)
ai-backend/tests/test_quality_tools.py (2)

12-13: processed_games is never used

The attribute is initialised but never updated; either populate it in process_game_data or drop it to avoid dead code.

60-63: Remove ad-hoc print from test code

Printing in test modules clutters CI logs. Rely on assertions instead, or move this demo into docs/ or an example script.

ai-backend/tests/test_fixtures.py (1)

36-39: monkeypatch fixture is declared but unused

Pytest itself does not warn about an unused fixture argument, but linters will flag it and it misleads readers; simply drop the parameter.

ai-backend/tools/test_api_auth.py (1)

1-3: File name will be collected by pytest

A module named test_*.py under tools/ will be imported during test discovery even though it isn’t a test.
Rename the file (e.g. api_auth_demo.py) or add __test__ = False at module level to suppress collection.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 807bf41 and e7604e4.

📒 Files selected for processing (20)

.gitignore (2 hunks)
ai-backend/.pre-commit-config.yaml (1 hunks)
ai-backend/agents/editor.py (1 hunks)
ai-backend/agents/researcher.py (2 hunks)
ai-backend/config/settings.py (5 hunks)
ai-backend/main.py (3 hunks)
ai-backend/mypy.ini (1 hunks)
ai-backend/pytest.ini (1 hunks)
ai-backend/tests/test_environment.py (1 hunks)
ai-backend/tests/test_fixtures.py (1 hunks)
ai-backend/tests/test_openai.py (1 hunks)
ai-backend/tests/test_quality_tools.py (1 hunks)
ai-backend/tools/sports_apis.py (7 hunks)
ai-backend/tools/test_api_auth.py (1 hunks)
data-collection/collectors/__init__.py (1 hunks)
data-collection/collectors/fbref_collector.py (1 hunks)
data-collection/collectors/football_data_collector.py (1 hunks)
data-collection/scripts/collect_all.py (1 hunks)
mypy.ini (1 hunks)
scripts/seed-data.py (10 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (2)

data-collection/collectors/football_data_collector.py (1)

data-collection/collectors/fbref_collector.py (6)

_here (25-26)

get_data_raw_folder (29-32)

get_data_processed_folder (35-38)

get_columns (177-180)

reorder_df (158-173)

normalize_date (147-155)

data-collection/collectors/fbref_collector.py (1)

data-collection/collectors/football_data_collector.py (6)

_here (48-49)

get_data_raw_folder (52-55)

get_data_processed_folder (58-61)

normalize_date (149-157)

reorder_df (129-146)

get_columns (123-126)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Codacy Static Code Analysis

🔇 Additional comments (10)

ai-backend/.pre-commit-config.yaml (1)

31-31: No-op quote style change — OK to keep

Switching to single quotes in args is fine; behavior unchanged.

.gitignore (1)

8-10: Section header for data artifacts — looks good

The new section marker improves readability.

ai-backend/agents/researcher.py (1)

24-24: Formatting change only — safe

Signature/logging compaction is non-functional and consistent with the module’s style. Sanitization preserved.

Also applies to: 36-36

ai-backend/agents/editor.py (1)

37-49: Formatting change looks good

Signature compacting is fine and keeps consistency with other agents. No further action required.

ai-backend/main.py (2)

114-117: Re-raising with HTTPException is fine

Wrapping the original exception preserves context via from e. No concerns.

148-156: Signature now fits 79-char line – OK

Only formatting changed. Behaviour untouched.

ai-backend/config/settings.py (1)

24-30: Verify callers after the ENV-VAR rename

Changing to OPENAI_API_KEY, SUPABASE_URL, etc., will break any code, CI secrets, or infra still exporting lowercase names. Double-check docker-compose, deployment manifests, and docs.

data-collection/collectors/fbref_collector.py (1)

226-232: Hard-coded Chrome binary path breaks non-macOS environments

Setting chrome_options.binary_location to a mac-specific path will fail on Linux/CI or Windows boxes. Detect the OS or allow an env-var override instead of hard-coding.
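A hedged sketch of the env-var override follows. The `CHROME_BINARY` variable name, the helper `resolve_chrome_binary`, and the list of candidate executable names are all assumptions made for illustration.

```python
import os
import shutil
from typing import Optional


def resolve_chrome_binary(env_var: str = "CHROME_BINARY") -> Optional[str]:
    """Pick a Chrome binary without hard-coding an OS-specific path.

    The env-var name and the candidate names below are assumptions.
    """
    override = os.environ.get(env_var)
    if override:
        return override
    # Fall back to whatever is on PATH for the current platform.
    for name in ("google-chrome", "chromium", "chromium-browser", "chrome"):
        found = shutil.which(name)
        if found:
            return found
    # None means: leave binary_location unset and let Selenium decide.
    return None
```

A caller would set `chrome_options.binary_location` only when the function returns a path, which keeps the default Selenium behaviour on machines where nothing is configured.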

ai-backend/pytest.ini (2)

7-14: Declare pytest-cov as an explicit dev dependency

The new --cov=* options require the pytest-cov plugin; ensure it’s listed in your dev/CI requirements files, otherwise the test run will error out on unknown flags.

15-19: Make sure the registered markers match what the tests use

--strict-markers makes pytest fail on any marker that a test applies but that is not registered here; registered markers that no test uses are harmless. Double-check that every marker applied in the tests (slow, integration, unit, asyncio) is declared, and drop entries that nothing uses.

coderabbitai · 2025-08-08T20:33:46Z

+# HTMLs
+*.html
+
+# CSVs
+*.csv
+


💡 Verification agent

🧩 Analysis chain

Scope data artifact ignores to data-collection only (avoid hiding docs/fixtures)

Ignoring all HTML/CSV globally is risky; it can unintentionally hide documentation, static site outputs, or CSV test fixtures. Limit ignores to the data-collection tree (or specific output dirs).

Suggested replacement:

-# HTMLs
-*.html
-
-# CSVs
-*.csv
+# Data outputs (restricted to data-collection)
+data-collection/**/*.html
+data-collection/**/*.csv

If your outputs live in known subfolders (e.g., data-collection/raw and data-collection/processed), tighten further:

data-collection/raw/**/*.csv

data-collection/processed/**/*.csv

data-collection/raw/**/*.html

data-collection/processed/**/*.html

🏁 Script executed:

#!/bin/bash
echo "Checking for .html files outside data-collection directory..."
rg --files --glob '*.html' | grep -v '^data-collection/'
echo ""
echo "Checking for .csv files outside data-collection directory..."
rg --files --glob '*.csv' | grep -v '^data-collection/'

Length of output: 391

🏁 Script executed:

#!/bin/bash
echo "Checking for a data-collection directory..."
fd -t d data-collection || echo ">> data-collection directory not found"
echo
echo "Searching for HTML files under data-collection..."
rg --files --glob 'data-collection/**/*.html' || echo ">> No HTML files found in data-collection"
echo
echo "Searching for CSV files under data-collection..."
rg --files --glob 'data-collection/**/*.csv' || echo ">> No CSV files found in data-collection"

Length of output: 650

Restrict HTML/CSV ignores to data-collection only

To avoid inadvertently hiding docs or test fixtures elsewhere in the repo, replace the global *.html and *.csv ignores with patterns scoped to your data-collection outputs.

• File: .gitignore (lines 95–100)
• Remove:

-# HTMLs
-*.html
-
-# CSVs
-*.csv

• Add:

+# Data-collection outputs
+data-collection/**/*.html
+data-collection/**/*.csv

If you know your outputs live in subfolders (e.g. raw/ and processed/), you can tighten further:

data-collection/raw/**/*.csv
data-collection/processed/**/*.csv
data-collection/raw/**/*.html
data-collection/processed/**/*.html

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-# HTMLs
-*.html
-
-# CSVs
-*.csv
+# Data-collection outputs
+data-collection/**/*.html
+data-collection/**/*.csv

🤖 Prompt for AI Agents

In the .gitignore file around lines 95 to 100, the current patterns '*.html' and '*.csv' ignore these file types globally, which may hide important docs or test fixtures elsewhere. Replace these global ignores with more specific patterns scoped to the data-collection directory, such as 'data-collection/raw/**/*.csv', 'data-collection/processed/**/*.csv', 'data-collection/raw/**/*.html', and 'data-collection/processed/**/*.html' to restrict ignoring only to your data-collection outputs.

coderabbitai · 2025-08-08T20:33:46Z

+# Test core dependencies
+try:
+    print("✅ OpenAI package imported successfully")
+except ImportError as e:
+    print(f"❌ OpenAI import failed: {e}")
+
+try:
+    print("✅ OpenAI Agents package imported successfully")
+except ImportError as e:
+    print(f"❌ OpenAI Agents import failed: {e}")
+
+try:
+    print("✅ FastAPI package imported successfully")
+except ImportError as e:
+    print(f"❌ FastAPI import failed: {e}")
+
+try:
+    print("✅ Pydantic package imported successfully")
+except ImportError as e:
+    print(f"❌ Pydantic import failed: {e}")
+
+try:
+    print("✅ Supabase package imported successfully")
+except ImportError as e:
+    print(f"❌ Supabase import failed: {e}")
+


⚠️ Potential issue

Test never imports the modules it claims to verify

Each try block only prints a success message; the actual import statements are missing, so failures will never surface.

Example fix for one block (apply to all):

try:
    import openai  # noqa: F401
    print("✅ OpenAI package imported successfully")
except ImportError as e:
    print(f"❌ OpenAI import failed: {e}")
    raise

Also add assertions to fail the test when an import is missing so CI can detect environment problems.

🤖 Prompt for AI Agents

In ai-backend/tests/test_environment.py between lines 7 and 32, the try blocks only print success messages without actually importing the modules, so import failures won't be detected. Fix this by adding the appropriate import statements inside each try block for the respective packages (e.g., import openai, import fastapi, etc.). Also, add assertions or raise the ImportError in the except blocks to ensure the test fails and CI detects missing dependencies.
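The same check can be written as one loop rather than five near-identical try blocks. This is a sketch under assumptions: `find_missing` and `assert_importable` are hypothetical helpers, and the module names are guessed from the messages printed in the diff (the OpenAI Agents module is left out because the diff does not show its import name).

```python
import importlib


def find_missing(modules: list[str]) -> list[str]:
    """Return the names from `modules` that cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def assert_importable(modules: list[str]) -> None:
    """Fail loudly, so CI notices, when any required module is missing."""
    missing = find_missing(modules)
    assert not missing, f"Missing packages: {missing}"


# Assumed import names; adjust to the project's real dependency list.
REQUIRED = ["openai", "fastapi", "pydantic", "supabase"]
```

A pytest test would then consist of a single call, `assert_importable(REQUIRED)`, and the failure message lists every missing package at once.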

coderabbitai · 2025-08-08T20:33:47Z

@@ -0,0 +1,142 @@
+import pytest
+
+from tools.sports_apis import APIFootballClient


⚠️ Potential issue

Import path likely wrong

from tools.sports_apis import … assumes a top-level tools package.
From the repo layout it should be:

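-from tools.sports_apis import APIFootballClient
+from ai_backend.tools.sports_apis import APIFootballClient

(or add the ai-backend directory to PYTHONPATH so the existing import resolves). Note that ai_backend is only importable if the package is installed or exposed under that name, since the directory itself is named ai-backend.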

📝 Committable suggestion


Suggested change

-from tools.sports_apis import APIFootballClient
+from ai_backend.tools.sports_apis import APIFootballClient

🤖 Prompt for AI Agents

In ai-backend/tests/test_fixtures.py at line 3, the import statement uses 'from tools.sports_apis import APIFootballClient', which assumes 'tools' is a top-level package. To fix this, adjust the import path to reflect the actual relative location of the 'tools' package within the repo structure, or alternatively update the PYTHONPATH environment variable to include the directory containing 'tools' so the import resolves correctly.

coderabbitai · 2025-08-08T20:33:47Z

+                data = await response.json()
+                data_stringified = json.dumps(data, indent=2)
+
+                # logger.info(f"Response data: {data_stringified}")
+                return data_stringified


⚠️ Potential issue

Return type mismatches the annotation

_fetch_data (and therefore test_api_auth) claim to return list[dict] but actually return a JSON string. Either adjust the annotation or return the parsed list/dict:

-                data = await response.json()
-                data_stringified = json.dumps(data, indent=2)
-                return data_stringified
+                return await response.json()

📝 Committable suggestion


Suggested change

-data = await response.json()
-data_stringified = json.dumps(data, indent=2)
-# logger.info(f"Response data: {data_stringified}")
-return data_stringified
+# logger.info(f"Response data: {data_stringified}")
+return await response.json()

🤖 Prompt for AI Agents

In ai-backend/tools/test_api_auth.py around lines 82 to 86, the function _fetch_data is annotated to return a list of dictionaries but currently returns a JSON string. To fix this, modify the return statement to return the parsed JSON data directly (the list/dict) instead of the stringified JSON, or alternatively update the function's return type annotation to indicate it returns a string. Choose the approach that aligns with the expected usage of the function.

coderabbitai · 2025-08-08T20:33:47Z

+from selenium import webdriver
+from selenium.webdriver.chrome.options import Options
+from selenium.webdriver.common.by import By
+from selenium.webdriver.support.ui import WebDriverWait
+from selenium.webdriver.support import expected_conditions as EC
+from selenium.common.exceptions import TimeoutException, WebDriverException
+


🛠️ Refactor suggestion

Unconditional Selenium import contradicts “no hard dependency” claim

selenium is imported at module load time. If the package isn’t installed (CI, head-less servers), simply importing this module will raise ModuleNotFoundError, even when Selenium features aren’t used.

Move the import inside download_with_selenium() (guarded by try/except ImportError) or wrap the top-level import in a soft-fail block.

🤖 Prompt for AI Agents

In data-collection/collectors/fbref_collector.py around lines 15 to 21, the selenium imports are unconditional, causing a ModuleNotFoundError if selenium is not installed even when its features are not used. To fix this, move all selenium-related imports inside the download_with_selenium() function and wrap them in a try/except ImportError block to softly handle the absence of selenium and avoid import errors at module load time.
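A deferred import along the lines suggested above might look like the following. `load_webdriver` is a hypothetical helper name; the PR's `download_with_selenium()` would call it (or inline the same try/except) instead of importing Selenium at module load time.

```python
def load_webdriver():
    """Import selenium.webdriver only when a Selenium download is requested.

    `load_webdriver` is a hypothetical helper for this sketch.
    """
    try:
        from selenium import webdriver
    except ImportError as exc:
        # Turn the bare import failure into an actionable message.
        raise RuntimeError(
            "selenium is required for FBref downloads; install it to use this feature"
        ) from exc
    return webdriver
```

With this shape, importing the collector module succeeds on machines without Selenium, and the error only appears when the Selenium path is actually exercised.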

coderabbitai · 2025-08-08T20:33:47Z

+def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame:
+    """Extract specified columns from HTML using BeautifulSoup."""
+    parsed_rows = []
+    # Find the specific table we care about using BeautifulSoup
+    table = soup.find("table", class_="stats_table")
+
+    if table:
+        for row in table.find("tbody").find_all("tr"):
+            # Find the specific cells '<td>' by their 'data-stat' attribute
+            fields = {"date": "", "score": "", "home_team": "", "away_team": ""}
+            for field in fields:
+                # Find the specific cells '<td>' by their 'data-stat' attribute
+                cell = row.find("td", {"data-stat": field})
+                # Extract the text from each cell, checking if the cell was found
+                fields[field] = cell.text.strip() if cell else ""
+
+            # Create a dictionary for each row
+            parsed_rows.append(fields)
+
+    # Convert our list of dictionaries into a pandas DataFrame
+    if parsed_rows:
+        # 2nd arg preserves the order of the columns
+        df_bs = pd.DataFrame(
+            parsed_rows, columns=["date", "score", "home_team", "away_team"]
+        )
+        print("\nSuccess! Extracted columns using BeautifulSoup:\n")
+    else:
+        print("\nFailed to extract columns using BeautifulSoup.")
+
+    return df_bs


⚠️ Potential issue

extract_columns() can raise UnboundLocalError and misses null-safety

Issues:

When parsed_rows is empty the variable df_bs is never defined, yet it’s returned – this will crash.

table.find("tbody") may return None; chaining .find_all() without a check causes AttributeError.

Parameter data_fields is accepted but never used → dead code / confusion.

Minimal patch:

-    if table:
-        for row in table.find("tbody").find_all("tr"):
+    if table and table.tbody:
+        for row in table.tbody.find_all("tr"):
@@
-    if parsed_rows:
+    if parsed_rows:
         df_bs = pd.DataFrame(
             parsed_rows, columns=["date", "score", "home_team", "away_team"]
         )
         print("\nSuccess! Extracted columns using BeautifulSoup:\n")
-    else:
-        print("\nFailed to extract columns using BeautifulSoup.")
-
-    return df_bs
+        return df_bs
+
+    print("\nFailed to extract columns using BeautifulSoup.")
+    return pd.DataFrame(columns=["date", "score", "home_team", "away_team"])

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame:
     """Extract specified columns from HTML using BeautifulSoup."""
     parsed_rows = []
     # Find the specific table we care about using BeautifulSoup
     table = soup.find("table", class_="stats_table")

-    if table:
-        for row in table.find("tbody").find_all("tr"):
+    if table and table.tbody:
+        for row in table.tbody.find_all("tr"):
             # Find the specific cells '<td>' by their 'data-stat' attribute
             fields = {"date": "", "score": "", "home_team": "", "away_team": ""}
             for field in fields:
                 # Find the specific cells '<td>' by their 'data-stat' attribute
                 cell = row.find("td", {"data-stat": field})
                 # Extract the text from each cell, checking if the cell was found
                 fields[field] = cell.text.strip() if cell else ""

             # Create a dictionary for each row
             parsed_rows.append(fields)

-    # Convert our list of dictionaries into a pandas DataFrame
     if parsed_rows:
-        # 2nd arg preserves the order of the columns
         df_bs = pd.DataFrame(
             parsed_rows, columns=["date", "score", "home_team", "away_team"]
         )
         print("\nSuccess! Extracted columns using BeautifulSoup:\n")
-    else:
-        print("\nFailed to extract columns using BeautifulSoup.")
-
-    return df_bs
+        return df_bs
+
+    print("\nFailed to extract columns using BeautifulSoup.")
+    return pd.DataFrame(columns=["date", "score", "home_team", "away_team"])

🤖 Prompt for AI Agents

In data-collection/collectors/fbref_collector.py around lines 67 to 96, fix the UnboundLocalError by ensuring df_bs is always defined before return, even if parsed_rows is empty, by initializing df_bs to an empty DataFrame. Add a null check for table.find("tbody") before calling find_all to avoid AttributeError. Remove the unused parameter data_fields from the function signature and any related code to eliminate dead code and confusion.
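
The failure mode can be reproduced without pandas or BeautifulSoup. The sketch below uses plain lists as a stand-in for the DataFrame; `collect_unsafe` and `collect_safe` are hypothetical names used only to contrast the two control flows.

```python
def collect_unsafe(rows: list[dict]) -> list[dict]:
    """Mirror of the flagged pattern: `result` is bound only when rows exist."""
    if rows:
        result = list(rows)
    return result  # raises UnboundLocalError when rows is empty


def collect_safe(rows: list[dict]) -> list[dict]:
    """Every path returns a defined value, so empty input yields an empty list."""
    if rows:
        return list(rows)
    return []
```

Returning an empty container with the expected shape (here `[]`, in the PR an empty DataFrame with the four column names) lets callers handle "no rows" without a try/except.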

coderabbitai · 2025-08-08T20:33:47Z

+def add_new_columns_to_football_data(df_clean, file):
+    """Add new columns to df_clean."""
+    df_clean["match_id"] = df_clean.index
+
+    tmp = file.split("_")
+    df_clean["league"] = tmp[1]
+    df_clean["season"] = tmp[2].split(".")[0]
+    df_clean["source"] = tmp[0]
+    return df_clean


🛠️ Refactor suggestion

Shadowing the historic built-in file and fragile filename parsing

Rename parameter and parse safely:

-def add_new_columns_to_football_data(df_clean, file):
+def add_new_columns_to_football_data(df_clean: pd.DataFrame, filename: str) -> pd.DataFrame:
@@
-    tmp = file.split("_")
+    tmp = os.path.basename(filename).split("_")

Also use a stable match_id (e.g., hash of league/date/teams) instead of the index to guarantee cross-file uniqueness.

📝 Committable suggestion

Suggested change

-def add_new_columns_to_football_data(df_clean, file):
+def add_new_columns_to_football_data(df_clean: pd.DataFrame, filename: str) -> pd.DataFrame:
     """Add new columns to df_clean."""
     df_clean["match_id"] = df_clean.index

-    tmp = file.split("_")
+    tmp = os.path.basename(filename).split("_")
     df_clean["league"] = tmp[1]
     df_clean["season"] = tmp[2].split(".")[0]
     df_clean["source"] = tmp[0]
     return df_clean

🤖 Prompt for AI Agents

In data-collection/collectors/football_data_collector.py around lines 111 to 119, rename the parameter 'file' to avoid shadowing the built-in name, and replace the fragile filename parsing with a more robust method such as using pathlib or regex to extract parts safely. Additionally, replace the current match_id assignment from the dataframe index with a stable unique identifier generated by hashing a combination of league, date, and teams to ensure uniqueness across files.
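
A standard-library sketch of both ideas: regex-based filename parsing and a hashed match ID. It assumes files are named `<source>_<league>_<season>.<ext>`; the function names and the exact set of hashed fields are illustrative choices, not taken from the PR.

```python
import hashlib
import re
from pathlib import Path

# Assumed naming scheme: <source>_<league>_<season>.<ext>
_NAME_RE = re.compile(r"^(?P<source>[^_]+)_(?P<league>.+)_(?P<season>[^_]+)$")


def parse_collector_filename(filename: str) -> dict[str, str]:
    """Split a collector filename into source, league, and season."""
    match = _NAME_RE.match(Path(filename).stem)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename!r}")
    return match.groupdict()


def make_match_id(
    league: str, season: str, date: str, home_team: str, away_team: str
) -> str:
    """Derive a deterministic ID from the fields that identify a match."""
    key = "|".join([league, season, date, home_team, away_team])
    # Truncated to 16 hex characters (64 bits); used as an identifier only.
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
```

Because the ID depends only on match attributes, the same fixture gets the same `match_id` regardless of which file or row it came from, provided team names and dates are normalized before hashing.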

coderabbitai · 2025-08-08T20:33:47Z


 # Add the shared directory to the Python path
-sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'shared'))
+sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared"))


🛠️ Refactor suggestion

Avoid run-time sys.path manipulation

Appending ../shared to sys.path is brittle and can hide packaging/import problems. Prefer:

-# Add the shared directory to the Python path
-sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared"))
+import importlib.util, pathlib, sys
+shared_path = pathlib.Path(__file__).resolve().parent.parent / "shared"
+sys.path.insert(0, str(shared_path))

…or, better, convert shared into a proper Python package and install it with your project.

📝 Committable suggestion

Suggested change

-sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared"))
+import importlib.util, pathlib, sys
+shared_path = pathlib.Path(__file__).resolve().parent.parent / "shared"
+sys.path.insert(0, str(shared_path))

🤖 Prompt for AI Agents

In scripts/seed-data.py at line 14, avoid modifying sys.path at runtime by appending ../shared, as this is brittle and can mask import issues. Instead, refactor the project structure to make shared a proper Python package with an __init__.py file, and install it as part of your project dependencies so it can be imported normally without sys.path hacks.

coderabbitai

Actionable comments posted: 2

🔭 Outside diff range comments (1)

data-collection/collectors/football_data_collector.py (1)
159-165: save_df_to_csv() should return useful info

Script relies on its return value but the function returns None.
Return either the processed DataFrame or the saved file path.
-    df.to_csv(filepath, index=False)
-    print(f"Processed CSV saved as '{filepath}'.")
+    df.to_csv(filepath, index=False)
+    print(f"Processed CSV saved as '{filepath}'.")
+    return filepath

♻️ Duplicate comments (4)

data-collection/collectors/fbref_collector.py (2)
14-18: Hard dependency on Selenium contradicts doc-string

The module promises “no hard Selenium dependencies”, yet selenium.* is imported at module import time. On environments without Selenium this raises ModuleNotFoundError, blocking even non-Selenium code paths.
Move these imports inside download_with_selenium() and guard them with a try/except ImportError soft-fail.

60-89: extract_columns() can still crash & misses null-safety

• table.find("tbody") may return None → AttributeError.
• When parsed_rows is empty df_bs is undefined → UnboundLocalError on return.
• data_fields parameter is unused – dead code / confusion.
-    if table:
-        for row in table.find("tbody").find_all("tr"):
+    if table and table.tbody:
+        for row in table.tbody.find_all("tr"):
@@
-    if parsed_rows:
+    if parsed_rows:
         df_bs = pd.DataFrame(
             parsed_rows, columns=["date", "score", "home_team", "away_team"]
         )
         print("\nSuccess! Extracted columns using BeautifulSoup:\n")
-    else:
-        print("\nFailed to extract columns using BeautifulSoup.")
-
-    return df_bs
+        return df_bs
+
+    print("\nFailed to extract columns using BeautifulSoup.")
+    return pd.DataFrame(columns=["date", "score", "home_team", "away_team"])
Also drop the unused data_fields arg from the signature.
data-collection/collectors/football_data_collector.py (2)

63-88: download_csv(): missing directory creation, timeout & overwrite guard

Same issues flagged earlier remain: no os.makedirs, requests.get without timeout, and doc-string mentions overwrite flag that is not implemented.

110-118: Shadowing built-in name & fragile filename parsing

Parameter file shadows the historic built-in and split("_") fails on paths containing underscores or directories.
Rename to filename and use os.path.basename() or pathlib for safe parsing.

🧹 Nitpick comments (1)

data-collection/collectors/fbref_collector.py (1)

206-210: Mac-specific Chrome binary hard-coded

The explicit /Applications/... path breaks on Linux/CI. Detect the OS or let ChromeDriver pick the installed binary instead.
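
A hedged sketch of OS-aware detection. The macOS path is the one from the PR; the Linux executable names are assumptions and vary by distribution. Returning `None` means "do not set `binary_location`" so the driver can locate the browser itself.

```python
import os
import platform
import shutil
from typing import Optional

MAC_CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def find_chrome_binary() -> Optional[str]:
    """Return a Chrome/Chromium path if one is found, else None."""
    if platform.system() == "Darwin" and os.path.exists(MAC_CHROME):
        return MAC_CHROME
    # Assumed executable names; adjust for the target environment.
    for name in ("google-chrome", "chromium", "chromium-browser"):
        found = shutil.which(name)
        if found:
            return found
    return None  # leave binary_location unset
```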

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7604e4 and 4b49438.

📒 Files selected for processing (3)

data-collection/collectors/fbref_collector.py (1 hunks)
data-collection/collectors/football_data_collector.py (1 hunks)
data-collection/scripts/collect_all.py (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (2)

data-collection/collectors/fbref_collector.py (1)

data-collection/collectors/football_data_collector.py (5)

_here (47-48)

get_data_raw_folder (51-54)

get_data_processed_folder (57-60)

normalize_date (148-156)

get_columns (122-125)

data-collection/collectors/football_data_collector.py (1)

data-collection/collectors/fbref_collector.py (5)

_here (21-22)

get_data_raw_folder (25-28)

get_data_processed_folder (31-34)

get_columns (170-173)

normalize_date (140-148)

🪛 GitHub Check: Codacy Static Code Analysis

data-collection/scripts/collect_all.py

[warning] 116-116: data-collection/scripts/collect_all.py#L116
Assigning result of a function call, where the function has no return

data-collection/collectors/fbref_collector.py

[warning] 11-11: data-collection/collectors/fbref_collector.py#L11
Unused BeautifulSoup imported from bs4

[warning] 17-17: data-collection/collectors/fbref_collector.py#L17
Unused expected_conditions imported from selenium.webdriver.support as EC

[warning] 145-145: data-collection/collectors/fbref_collector.py#L145
Try, Except, Pass detected.

[warning] 259-259: data-collection/collectors/fbref_collector.py#L259
Standard pseudo-random generators are not suitable for security/cryptographic purposes.

data-collection/collectors/football_data_collector.py

[warning] 80-80: data-collection/collectors/football_data_collector.py#L80
Call to requests without timeout

[warning] 153-153: data-collection/collectors/football_data_collector.py#L153
Try, Except, Pass detected.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Codacy Static Code Analysis

coderabbitai · 2025-08-08T21:23:31Z

+            html_filepath = get_data_folder() + "/raw/" + html_filename
+            try:
+                with open(html_filepath, "r", encoding="utf-8") as f:
+                    html_content = f.read()
+                soup = BeautifulSoup(html_content, "lxml")
+            except (FileNotFoundError, PermissionError, UnicodeDecodeError) as e:
+                print(f"Error reading file: {e}")
+                # Return an empty soup as fallback
+                soup = BeautifulSoup("", "lxml")


🛠️ Refactor suggestion

Non-portable path concatenation to html_filepath

Using string + "/raw/" + bypasses os.path.join, breaks on Windows, and risks double slashes.

-html_filepath = get_data_folder() + "/raw/" + html_filename
+html_filepath = os.path.join(get_data_folder(), "raw", html_filename)

📝 Committable suggestion

Suggested change

-            html_filepath = get_data_folder() + "/raw/" + html_filename
+            html_filepath = os.path.join(get_data_folder(), "raw", html_filename)
             try:
                 with open(html_filepath, "r", encoding="utf-8") as f:
                     html_content = f.read()
                 soup = BeautifulSoup(html_content, "lxml")
             except (FileNotFoundError, PermissionError, UnicodeDecodeError) as e:
                 print(f"Error reading file: {e}")
                 # Return an empty soup as fallback
                 soup = BeautifulSoup("", "lxml")

🤖 Prompt for AI Agents

In data-collection/scripts/collect_all.py around lines 138 to 146, the construction of html_filepath uses string concatenation with "+ '/raw/' +", which is non-portable and can cause issues on Windows or result in double slashes. Replace this concatenation with os.path.join to correctly and portably join the directory paths and filename, ensuring compatibility across operating systems.

coderabbitai

Actionable comments posted: 5

🔭 Outside diff range comments (1)

data-collection/collectors/football_data_collector.py (1)

160-166: Make save robust: ensure folder exists, use splitext, and return filepath

Current code may fail if folder is missing and doesn’t return the path.

-def save_df_to_csv(df, raw_filename):
-    """Save DataFrame to CSV"""
-    processed_filename = raw_filename.split(".")[0] + "_processed.csv"
-    filepath = os.path.join(get_data_processed_folder(), processed_filename)
-    df.to_csv(filepath, index=False)
-    print(f"Processed CSV saved as '{filepath}'.")
+def save_df_to_csv(df: pd.DataFrame, raw_filename: str) -> str:
+    """Save DataFrame to CSV and return the output path."""
+    root, _ = os.path.splitext(os.path.basename(raw_filename))
+    processed_filename = f"{root}_processed.csv"
+    out_dir = get_data_processed_folder()
+    os.makedirs(out_dir, exist_ok=True)
+    filepath = os.path.join(out_dir, processed_filename)
+    df.to_csv(filepath, index=False)
+    print(f"Processed CSV saved as '{filepath}'.")
+    return filepath
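
To illustrate why `os.path.splitext` is preferred over `split(".")[0]` here, a small sketch; `processed_name` is a hypothetical helper that isolates just the naming step of `save_df_to_csv`.

```python
import os


def processed_name(raw_filename: str) -> str:
    """Build '<root>_processed.csv' from a raw filename or path."""
    # basename drops any directory part; splitext removes only the last extension.
    root, _ = os.path.splitext(os.path.basename(raw_filename))
    return f"{root}_processed.csv"
```

For a path such as `./raw/a.b.csv`, `split(".")[0]` yields an empty string because the path starts with a dot, while the version above yields `a.b_processed.csv`.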

♻️ Duplicate comments (4)

data-collection/collectors/fbref_collector.py (2)

13-17: Unconditional Selenium import contradicts “no hard dependency” and breaks import if Selenium isn’t installed

Move Selenium imports into the functions that need them and gate type hints to avoid runtime import. Also implement the documented requests-based fallback.

Apply:

@@
-import random
-from selenium import webdriver
-from selenium.webdriver.chrome.options import Options
-from selenium.webdriver.support.ui import WebDriverWait
-from selenium.common.exceptions import TimeoutException, WebDriverException
+import random
+import hashlib
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    # For type hints only; does not enforce Selenium at runtime.
+    from selenium.webdriver.chrome.options import Options

58-88: Fix UnboundLocalError, add null-safety, and actually use data_fields

When no rows are parsed, df_bs is undefined. Also table.tbody can be None. Parameter data_fields is unused.

-def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame:
+def extract_columns(soup, data_fields: list[str] | None, html_filepath: str) -> pd.DataFrame:
     """Extract specified columns from HTML using BeautifulSoup."""
-    parsed_rows = []
-    # Find the specific table we care about using BeautifulSoup
-    table = soup.find("table", class_="stats_table")
+    parsed_rows: list[dict] = []
+    # Find the specific table we care about using BeautifulSoup
+    table = soup.find("table", class_="stats_table")
 
-    if table:
-        for row in table.find("tbody").find_all("tr"):
+    fields_to_extract = data_fields or ["date", "score", "home_team", "away_team"]
+
+    if table and getattr(table, "tbody", None):
+        for row in table.tbody.find_all("tr"):
             # Find the specific cells '<td>' by their 'data-stat' attribute
-            fields = {"date": "", "score": "", "home_team": "", "away_team": ""}
-            for field in fields:
+            fields = {k: "" for k in fields_to_extract}
+            for field in fields_to_extract:
                 # Find the specific cells '<td>' by their 'data-stat' attribute
                 cell = row.find("td", {"data-stat": field})
                 # Extract the text from each cell, checking if the cell was found
                 fields[field] = cell.text.strip() if cell else ""
 
             # Create a dictionary for each row
             parsed_rows.append(fields)
 
     # Convert our list of dictionaries into a pandas DataFrame
-    if parsed_rows:
-        # 2nd arg preserves the order of the columns
-        df_bs = pd.DataFrame(
-            parsed_rows, columns=["date", "score", "home_team", "away_team"]
-        )
-        print("\nSuccess! Extracted columns using BeautifulSoup:\n")
-    else:
-        print("\nFailed to extract columns using BeautifulSoup.")
-
-    return df_bs
+    if parsed_rows:
+        df_bs = pd.DataFrame(parsed_rows, columns=fields_to_extract)
+        print("\nSuccess! Extracted columns using BeautifulSoup:\n")
+        return df_bs
+
+    print("\nFailed to extract columns using BeautifulSoup.")
+    return pd.DataFrame(columns=fields_to_extract)

data-collection/collectors/football_data_collector.py (2)

63-90: Implement timeout, directory creation, and real overwrite semantics (docstring mismatch)

Function can hang without timeout, fail if directory doesn’t exist, and mentions overwrite that doesn’t exist.

-def download_csv(url: str, filename: str = "football-data_data.csv"):
+def download_csv(
+    url: str,
+    filename: str = "football-data_data.csv",
+    *,
+    timeout: int = 30,
+    overwrite: bool = True,
+) -> str:
@@
-    - filename (str): Desired name for the saved file. Defaults to 'data.csv'.
-    - overwrite (bool): Whether to overwrite existing file. Defaults to False.
+    - filename (str): Desired name for the saved file.
+    - timeout (int): Network timeout in seconds.
+    - overwrite (bool): Whether to overwrite existing file. Defaults to True.
@@
-    filepath = os.path.join(get_data_raw_folder(), filename)
+    raw_dir = get_data_raw_folder()
+    os.makedirs(raw_dir, exist_ok=True)
+    filepath = os.path.join(raw_dir, filename)
+
+    if not overwrite and os.path.exists(filepath):
+        raise FileExistsError(f"{filepath} already exists. Set overwrite=True to replace.")
@@
-        response = requests.get(url)
+        response = requests.get(url, timeout=timeout)
         response.raise_for_status()  # Raise HTTPError for bad responses
@@
     # Write to data/raw/filename
     with open(filepath, "wb") as f:
         f.write(response.content)
     print(f"Raw CSV saved as {filepath}\n")
+    return filepath
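
The same three safeguards (timeout, directory creation, overwrite guard) can be sketched with the standard library. This version deliberately swaps `requests` for `urllib.request` so it has no third-party dependency; `download_file` and its parameters are illustrative names, not the PR's API.

```python
import os
import urllib.request


def download_file(
    url: str,
    dest_dir: str,
    filename: str,
    *,
    timeout: float = 30.0,
    overwrite: bool = True,
) -> str:
    """Download url into dest_dir/filename and return the saved path."""
    os.makedirs(dest_dir, exist_ok=True)
    filepath = os.path.join(dest_dir, filename)

    # Check before any network traffic so the guard is cheap.
    if not overwrite and os.path.exists(filepath):
        raise FileExistsError(f"{filepath} already exists; pass overwrite=True.")

    # urlopen raises urllib.error.HTTPError for HTTP error statuses.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = response.read()

    with open(filepath, "wb") as f:
        f.write(payload)
    return filepath
```

Returning the path also addresses the earlier comment about callers that assign the function's result.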

110-119: Avoid fragile filename parsing and non-unique IDs; don’t use file as a parameter name

Use basename parsing and a stable deterministic match_id.

-def add_new_columns_to_football_data(df_clean, file):
+def add_new_columns_to_football_data(df_clean: pd.DataFrame, filename: str) -> pd.DataFrame:
     """Add new columns to df_clean."""
-    df_clean["match_id"] = df_clean.index
-
-    tmp = file.split("_")
-    df_clean["league"] = tmp[1]
-    df_clean["season"] = tmp[2].split(".")[0]
-    df_clean["source"] = tmp[0]
+    base = os.path.basename(filename)
+    parts = base.split("_")
+    df_clean["league"] = parts[1] if len(parts) > 1 else ""
+    df_clean["season"] = parts[2].split(".")[0] if len(parts) > 2 else ""
+    df_clean["source"] = parts[0] if parts else ""
+
+    # Stable, cross-file unique match_id
+    import hashlib
+    def _mk_id(row) -> str:
+        key = f"{row.get('date','')}|{row.get('league','')}|{row.get('season','')}|{row.get('home_team','')}|{row.get('away_team','')}"
+        return hashlib.sha1(key.encode('utf-8')).hexdigest()[:16]
+    df_clean["match_id"] = df_clean.apply(_mk_id, axis=1)
     return df_clean

🧹 Nitpick comments (6)

data-collection/collectors/fbref_collector.py (3)

35-46: League slugging may be fragile

Replacing spaces with '-' won’t cover punctuation/diacritics (e.g., “Liga Portugal”, “Ligue 1 Uber Eats”). Consider a mapping table for FBref slugs or accept an explicit league_slug parameter to avoid 404s.
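
A sketch of the mapping-table approach with an explicit override. The slug values below are illustrative assumptions and should be confirmed against live FBref URLs; `resolve_league_slug` is a hypothetical name.

```python
from typing import Optional

# Illustrative mapping only; verify each slug against the actual site.
LEAGUE_SLUGS = {
    "Premier League": "Premier-League",
    "La Liga": "La-Liga",
    "Serie A": "Serie-A",
}


def resolve_league_slug(league_name: str, league_slug: Optional[str] = None) -> str:
    """Prefer an explicit slug, then the mapping; never guess."""
    if league_slug:
        return league_slug
    try:
        return LEAGUE_SLUGS[league_name]
    except KeyError:
        raise ValueError(f"No slug registered for league {league_name!r}") from None
```

Failing loudly on an unknown league surfaces a configuration gap before any request is made, instead of producing a 404 at download time.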

258-259: Static analysis warning on random is a false positive in this context

Use of random.uniform here is not for crypto; it’s fine. If your linter enforces this rule, suppress it or switch to secrets/SystemRandom for appeasement.

212-223: Minor: remove strict dependency on Selenium types in annotations

To avoid import-time failures, keep annotations as strings or drop them. You’ve already gated imports; consider adding from __future__ import annotations or using TYPE_CHECKING.
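
A minimal sketch of the TYPE_CHECKING pattern with a string annotation. The function name and the two arguments are taken loosely from the PR for illustration; the point is that the module imports cleanly even when Selenium is absent.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; never executed at runtime.
    from selenium.webdriver.chrome.options import Options


def configure_chrome_options() -> "Options":
    """Build Chrome options, importing Selenium only when called."""
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--disable-extensions")
    options.add_argument("--no-sandbox")
    return options
```

The quoted return type is stored as a plain string, so defining the function never touches Selenium; only calling it does.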
data-collection/collectors/football_data_collector.py (3)
21-45: Clarify API: rename league_id to league_code, drop unused league_name, and fix examples

league_name is unused; parameter name suggests numeric id, but path expects codes like “E0”, “SP1”. Examples for Champions League/La Liga show season code “2425” while text says 2023–2024 (should be “2324”).
-def generate_football_data_url(league_id, league_name, season):
+def generate_football_data_url(league_code: str, season: int) -> str:
@@
-    ex. Champions League 2023-2024
-        https://www.football-data.co.uk/mmz4281/2425/E1.csv
+    ex. Champions League 2023-2024
+        https://www.football-data.co.uk/mmz4281/2324/E1.csv
@@
-    ex. La Liga 2023-2024
-        https://www.football-data.co.uk/mmz4281/2425/SP1.csv
+    ex. La Liga 2023-2024
+        https://www.football-data.co.uk/mmz4281/2324/SP1.csv
@@
-    url = base_url + f"/{season_code}/{league_id}.csv"
+    url = base_url + f"/{season_code}/{league_code}.csv"
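
The season code in these URLs is the last two digits of the start year followed by the last two digits of the end year. A sketch, assuming the caller passes the season's start year (how the PR derives `season_code` is not shown here):

```python
BASE_URL = "https://www.football-data.co.uk/mmz4281"


def season_code(start_year: int) -> str:
    """Return the four-digit season code for a season starting in start_year."""
    return f"{start_year % 100:02d}{(start_year + 1) % 100:02d}"


def generate_football_data_url(league_code: str, start_year: int) -> str:
    """Build the CSV URL for a league code such as 'E0' or 'SP1'."""
    return f"{BASE_URL}/{season_code(start_year)}/{league_code}.csv"
```

For the 2023-2024 season this produces the `2324` code that the corrected docstring examples use.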
128-146: Don’t print DataFrame heads in library code

Noisy logs and potential perf hit. Return the processed DataFrame; leave logging to callers or add a verbose flag.
 def reorder_df_football_data(df):
     # Get specified columns and reorder them
     df_processed = get_columns(
@@
     )
-    print("After getting essential columns:")
-    print(df_processed.head())
-    print("df_processed.columns:", df_processed.columns)
     return df_processed
47-61: DRY: centralize shared path and column helpers

Both collectors duplicate _here, get_data_raw_folder, get_data_processed_folder, get_columns, normalize_date. Extract into a shared module (e.g., data-collection/common/io_utils.py) and import from there to keep behavior consistent.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 4b49438 and 3be68e3.

📒 Files selected for processing (3)

data-collection/collectors/fbref_collector.py (1 hunks)
data-collection/collectors/football_data_collector.py (1 hunks)
data-collection/scripts/collect_all.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

data-collection/scripts/collect_all.py

🧰 Additional context used

🧬 Code Graph Analysis (1)

data-collection/collectors/fbref_collector.py (1)

data-collection/collectors/football_data_collector.py (5)

_here (47-48)

get_data_raw_folder (51-54)

get_data_processed_folder (57-60)

normalize_date (148-157)

get_columns (122-125)

🪛 GitHub Check: Codacy Static Code Analysis

data-collection/collectors/fbref_collector.py

[warning] 258-258: data-collection/collectors/fbref_collector.py#L258
Standard pseudo-random generators are not suitable for security/cryptographic purposes.

data-collection/collectors/football_data_collector.py

[warning] 80-80: data-collection/collectors/football_data_collector.py#L80
Call to requests without timeout

coderabbitai · 2025-08-08T21:32:48Z

+def add_new_columns_to_fbref(df: pd.DataFrame, file) -> pd.DataFrame:
+    # Fill in match_id column
+    df["match_id"] = range(len(df))
+    # Split score into home_score and away_score
+    df["home_score"], df["away_score"] = convert_score_to_home_score_and_away_score(
+        df["score"]
+    )
+
+    tmp = file.split("_")
+    df["league"] = tmp[1]
+    df["season"] = tmp[2].split(".")[0]
+    df["source"] = tmp[0]
+
+    return df


🛠️ Refactor suggestion

Use stable match_id and robust filename parsing; avoid file param name

file.split("_") is fragile, and index-based match_id isn’t globally unique.

-def add_new_columns_to_fbref(df: pd.DataFrame, file) -> pd.DataFrame:
-    # Fill in match_id column
-    df["match_id"] = range(len(df))
+def add_new_columns_to_fbref(df: pd.DataFrame, filename: str) -> pd.DataFrame:
+    # Robust filename parsing
+    base = os.path.basename(filename)
+    parts = base.split("_")
+    # Expecting: source_league_season.html or similar
+    league = parts[1] if len(parts) > 1 else ""
+    season = parts[2].split(".")[0] if len(parts) > 2 else ""
@@
-    tmp = file.split("_")
-    df["league"] = tmp[1]
-    df["season"] = tmp[2].split(".")[0]
-    df["source"] = tmp[0]
+    df["league"] = league
+    df["season"] = season
+    df["source"] = parts[0] if parts else ""
+
+    # Stable deterministic match_id
+    def _mk_id(row) -> str:
+        key = f"{row.get('date','')}|{row.get('league','')}|{row.get('season','')}|{row.get('home_team','')}|{row.get('away_team','')}"
+        return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
+    df["match_id"] = df.apply(_mk_id, axis=1)
     return df

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In data-collection/collectors/fbref_collector.py around lines 90 to 103, the current code uses a non-unique index-based match_id and fragile filename parsing by splitting on underscores. To fix this, generate a stable, unique match_id based on consistent match attributes instead of the row index, and replace the filename parsing with a more robust method such as using regex or pathlib to extract league, season, and source reliably. Also, rename the parameter from 'file' to avoid shadowing built-in names.

coderabbitai · 2025-08-08T21:32:48Z

+def create_csv(df: pd.DataFrame, filename: str) -> str:
+    """Save DataFrame to CSV under data/processed/filename"""
+
+    filepath = os.path.join(get_data_processed_folder(), filename)
+
+    # If output_path is not absolute, write under processed folder
+    if not os.path.isabs(filepath):
+        os.makedirs(os.path.dirname(filepath), exist_ok=True)
+    df.to_csv(filepath, index=False)
+    print(f"Processed CSV saved as '{filepath}'.")
+    return filepath
+


⚠️ Potential issue

Directory may not exist; ensure creation before writing CSV

filepath is absolute, so the current conditional skips makedirs. Create the directory unconditionally.

 def create_csv(df: pd.DataFrame, filename: str) -> str:
     """Save DataFrame to CSV under data/processed/filename"""

     filepath = os.path.join(get_data_processed_folder(), filename)

-    # If output_path is not absolute, write under processed folder
-    if not os.path.isabs(filepath):
-        os.makedirs(os.path.dirname(filepath), exist_ok=True)
+    os.makedirs(os.path.dirname(filepath), exist_ok=True)
     df.to_csv(filepath, index=False)
     print(f"Processed CSV saved as '{filepath}'.")
     return filepath

📝 Committable suggestion

Suggested change

 def create_csv(df: pd.DataFrame, filename: str) -> str:
     """Save DataFrame to CSV under data/processed/filename"""

     filepath = os.path.join(get_data_processed_folder(), filename)

-    # If output_path is not absolute, write under processed folder
-    if not os.path.isabs(filepath):
-        os.makedirs(os.path.dirname(filepath), exist_ok=True)
+    os.makedirs(os.path.dirname(filepath), exist_ok=True)
     df.to_csv(filepath, index=False)
     print(f"Processed CSV saved as '{filepath}'.")
     return filepath

🤖 Prompt for AI Agents

In data-collection/collectors/fbref_collector.py around lines 175 to 186, the directory creation is conditional on filepath not being absolute, but filepath is always absolute, so the directory may not be created before writing the CSV. Fix this by removing the condition and always calling os.makedirs on the directory of filepath with exist_ok=True before saving the CSV.
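
A small sketch of unconditional directory creation, with one extra guard the suggestion does not cover: `os.path.dirname` returns an empty string for a bare filename, and `os.makedirs("")` raises. The helper name `ensure_parent_dir` is hypothetical.

```python
import os


def ensure_parent_dir(filepath: str) -> str:
    """Create the parent directory of filepath if needed and return filepath."""
    parent = os.path.dirname(filepath)
    if parent:  # a bare filename has no parent component to create
        os.makedirs(parent, exist_ok=True)
    return filepath
```

Calling this immediately before `df.to_csv(filepath, ...)` makes the write independent of whether the processed folder was created earlier in the run.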

coderabbitai · 2025-08-08T21:32:48Z

+def _configure_chrome_options() -> Options:
+    """Create and return a configured Chrome Options object."""
+    chrome_options = Options()
+    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
+    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
+    chrome_options.add_experimental_option("useAutomationExtension", False)
+    chrome_options.add_argument("--disable-extensions")
+    chrome_options.add_argument("--no-sandbox")
+    chrome_options.add_argument("--disable-dev-shm-usage")
+    # chrome_options.add_argument('--headless')  # Optional: enable headless mode
+
+    # User agent to mimic real browser
+    chrome_options.add_argument(
+        "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+    )
+
+    # Set Chrome binary location for macOS if available
+    chrome_bin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
+    if os.path.exists(chrome_bin):
+        chrome_options.binary_location = chrome_bin
+    return chrome_options
+
+
+def _create_driver(chrome_options: Options):
+    """Initialize and return a Chrome WebDriver instance."""
+    print("Starting Chrome WebDriver...")
+    return webdriver.Chrome(options=chrome_options)
+


🛠️ Refactor suggestion

Localize Selenium imports and avoid runtime type dependency in annotations

Import Selenium types inside functions and use string annotations to prevent ImportError at module import.

-def _configure_chrome_options() -> Options:
-    """Create and return a configured Chrome Options object."""
-    chrome_options = Options()
+def _configure_chrome_options() -> "Options":
+    """Create and return a configured Chrome Options object."""
+    from selenium.webdriver.chrome.options import Options
+    chrome_options = Options()
@@
-def _create_driver(chrome_options: Options):
-    """Initialize and return a Chrome WebDriver instance."""
-    print("Starting Chrome WebDriver...")
-    return webdriver.Chrome(options=chrome_options)
+def _create_driver(chrome_options):
+    """Initialize and return a Chrome WebDriver instance."""
+    print("Starting Chrome WebDriver...")
+    from selenium import webdriver
+    return webdriver.Chrome(options=chrome_options)

📝 Committable suggestion

Suggested change

-def _configure_chrome_options() -> Options:
-    """Create and return a configured Chrome Options object."""
-    chrome_options = Options()
-    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
-    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
-    chrome_options.add_experimental_option("useAutomationExtension", False)
-    chrome_options.add_argument("--disable-extensions")
-    chrome_options.add_argument("--no-sandbox")
-    chrome_options.add_argument("--disable-dev-shm-usage")
-    # chrome_options.add_argument('--headless')  # Optional: enable headless mode
-
-    # User agent to mimic real browser
-    chrome_options.add_argument(
-        "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
-        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
-    )
-
-    # Set Chrome binary location for macOS if available
-    chrome_bin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
-    if os.path.exists(chrome_bin):
-        chrome_options.binary_location = chrome_bin
-    return chrome_options
-
-
-def _create_driver(chrome_options: Options):
-    """Initialize and return a Chrome WebDriver instance."""
-    print("Starting Chrome WebDriver...")
-    return webdriver.Chrome(options=chrome_options)
+def _configure_chrome_options() -> "Options":
+    """Create and return a configured Chrome Options object."""
+    from selenium.webdriver.chrome.options import Options
+
+    chrome_options = Options()
+    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
+    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
+    chrome_options.add_experimental_option("useAutomationExtension", False)
+    chrome_options.add_argument("--disable-extensions")
+    chrome_options.add_argument("--no-sandbox")
+    chrome_options.add_argument("--disable-dev-shm-usage")
+    # chrome_options.add_argument('--headless')  # Optional: enable headless mode
+
+    # User agent to mimic real browser
+    chrome_options.add_argument(
+        "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+    )
+
+    # Set Chrome binary location for macOS if available
+    chrome_bin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
+    if os.path.exists(chrome_bin):
+        chrome_options.binary_location = chrome_bin
+    return chrome_options
+
+
+def _create_driver(chrome_options):
+    """Initialize and return a Chrome WebDriver instance."""
+    print("Starting Chrome WebDriver...")
+    from selenium import webdriver
+
+    return webdriver.Chrome(options=chrome_options)

🤖 Prompt for AI Agents

In data-collection/collectors/fbref_collector.py around lines 188 to 216, the Selenium imports and type annotations are currently at the module level, which can cause ImportError during module import. Move all Selenium-related imports inside the functions that use them, and change type annotations referencing Selenium types to use string literals instead. This localizes dependencies and avoids runtime import errors.
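The lazy-import idea in this comment can be sketched with the standard library alone. The example below is illustrative, not the PR's code: `load_optional` and `create_driver_lazily` are hypothetical helpers, and the only Selenium names assumed are `selenium.webdriver` and `webdriver.Chrome`, both of which already appear in the reviewed code.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import a module on demand; return None instead of raising if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def create_driver_lazily():
    """Hypothetical helper: Selenium is only imported when this function is called."""
    webdriver = load_optional("selenium.webdriver")
    if webdriver is None:
        raise RuntimeError("Selenium is not installed; install it to scrape dynamic pages.")
    return webdriver.Chrome()


if __name__ == "__main__":
    # Importing this module never touches Selenium, so it works without it installed.
    print(load_optional("json") is not None)
```

Because nothing Selenium-related runs at module import time, code that only needs the CSV helpers in the same file can import it in an environment without Selenium.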

coderabbitai · 2025-08-08T21:32:49Z

+def _wait_out_challenge(driver, timeout: int = 30) -> None:
+    """Wait for common anti-bot challenge pages to clear, within timeout."""
+    print("Waiting for page to load and any challenges to complete...")
+    try:
+        WebDriverWait(driver, timeout).until(
+            lambda d: "Just a moment" not in d.title
+            and "Checking your browser" not in d.page_source
+        )
+        print("Challenge completed or page loaded successfully.")
+    except TimeoutException:
+        print("Timeout waiting for challenge completion, but proceeding anyway...")
+


🛠️ Refactor suggestion

Guard against missing Selenium and avoid NameError for TimeoutException

Import WebDriverWait/TimeoutException inside the function.

 def _wait_out_challenge(driver, timeout: int = 30) -> None:
     """Wait for common anti-bot challenge pages to clear, within timeout."""
     print("Waiting for page to load and any challenges to complete...")
     try:
+        from selenium.webdriver.support.ui import WebDriverWait
+        from selenium.common.exceptions import TimeoutException
         WebDriverWait(driver, timeout).until(
             lambda d: "Just a moment" not in d.title
             and "Checking your browser" not in d.page_source
         )
         print("Challenge completed or page loaded successfully.")
     except TimeoutException:
         print("Timeout waiting for challenge completion, but proceeding anyway...")

📝 Committable suggestion

Suggested change

-def _wait_out_challenge(driver, timeout: int = 30) -> None:
-    """Wait for common anti-bot challenge pages to clear, within timeout."""
-    print("Waiting for page to load and any challenges to complete...")
-    try:
-        WebDriverWait(driver, timeout).until(
-            lambda d: "Just a moment" not in d.title
-            and "Checking your browser" not in d.page_source
-        )
-        print("Challenge completed or page loaded successfully.")
-    except TimeoutException:
-        print("Timeout waiting for challenge completion, but proceeding anyway...")
+def _wait_out_challenge(driver, timeout: int = 30) -> None:
+    """Wait for common anti-bot challenge pages to clear, within timeout."""
+    print("Waiting for page to load and any challenges to complete...")
+    try:
+        from selenium.webdriver.support.ui import WebDriverWait
+        from selenium.common.exceptions import TimeoutException
+
+        WebDriverWait(driver, timeout).until(
+            lambda d: "Just a moment" not in d.title
+            and "Checking your browser" not in d.page_source
+        )
+        print("Challenge completed or page loaded successfully.")
+    except TimeoutException:
+        print("Timeout waiting for challenge completion, but proceeding anyway...")

🤖 Prompt for AI Agents

In data-collection/collectors/fbref_collector.py around lines 225 to 236, the function _wait_out_challenge uses WebDriverWait and TimeoutException without importing them inside the function, which can cause NameError if Selenium is missing. To fix this, move the imports of WebDriverWait and TimeoutException inside the _wait_out_challenge function to guard against missing Selenium and avoid NameError.
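One detail worth checking when moving imports into the function: if the import sits inside the same `try` whose `except` clause names the imported exception, a failed import leaves that name unbound, and evaluating the `except` clause raises `UnboundLocalError` (a `NameError` subclass) rather than a clean `ImportError`. The sketch below shows both placements using a deliberately nonexistent package; `missing_pkg_for_demo_xyz` and `DemoTimeout` are made-up names standing in for the Selenium imports.

```python
def wait_imports_inside_try() -> str:
    """Import inside the guarded block: the except clause depends on the import."""
    try:
        from missing_pkg_for_demo_xyz.exceptions import DemoTimeout  # fails here
        return "waited"
    except DemoTimeout:  # name was never bound, so matching this clause fails
        return "timed out"


def wait_imports_before_try() -> str:
    """Import in its own guarded step, then use the exception type afterwards."""
    try:
        from missing_pkg_for_demo_xyz.exceptions import DemoTimeout
    except ImportError:
        return "optional package unavailable"
    try:
        return "waited"
    except DemoTimeout:
        return "timed out"


if __name__ == "__main__":
    try:
        wait_imports_inside_try()
    except NameError as err:
        print(f"inside-try variant failed with {type(err).__name__}")
    print(wait_imports_before_try())
```

Under this reading, placing the two Selenium imports just above the `try:` (or handling `ImportError` separately) keeps the missing-dependency case distinguishable from a timeout.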

coderabbitai · 2025-08-08T21:32:49Z

+def download_with_selenium(url):
+    """Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
+    driver = None
+    try:
+        options = _configure_chrome_options()
+        driver = _create_driver(options)
+        _remove_webdriver_flag(driver)
+
+        print(f"Navigating to {url} ...")
+        driver.get(url)
+
+        _wait_out_challenge(driver, timeout=30)
+        time.sleep(random.uniform(2, 5))  # small buffer to ensure full render
+
+        html_content = _get_html_or_raise(driver)
+        print("Successfully retrieved page.")
+        return html_content
+
+    except WebDriverException as e:
+        print(f"WebDriver error: {e}")
+        print("\nTroubleshooting:")
+        print("1. Make sure Chrome browser is installed")
+        print("2. Install ChromeDriver: brew install --cask chromedriver")
+        print("3. Or use: pip install webdriver-manager")
+        raise ValueError(f"WebDriver error: {e}")
+
+    except Exception as e:
+        print(f"Unexpected error: {e}")
+        raise ValueError(f"Failed to scrape with Selenium: {e}")
+


🛠️ Refactor suggestion

Add true “requests” fallback and avoid hard Selenium dependency

If Selenium isn’t available, gracefully fall back to a plain GET with headers and timeout. Also localize Selenium imports to this function and helpers.

 def download_with_selenium(url):
     """Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
     driver = None
     try:
-        options = _configure_chrome_options()
-        driver = _create_driver(options)
-        _remove_webdriver_flag(driver)
+        # Try Selenium path first; import lazily to avoid hard dependency.
+        try:
+            from selenium.common.exceptions import WebDriverException
+        except ImportError:
+            WebDriverException = Exception  # sentinel for isinstance checks
+
+        options = _configure_chrome_options()
+        driver = _create_driver(options)
+        _remove_webdriver_flag(driver)
@@
-        _wait_out_challenge(driver, timeout=30)
+        _wait_out_challenge(driver, timeout=30)
         time.sleep(random.uniform(2, 5))  # small buffer to ensure full render
@@
-        html_content = _get_html_or_raise(driver)
+        html_content = _get_html_or_raise(driver)
         print("Successfully retrieved page.")
         return html_content
-
-    except WebDriverException as e:
-        print(f"WebDriver error: {e}")
-        print("\nTroubleshooting:")
-        print("1. Make sure Chrome browser is installed")
-        print("2. Install ChromeDriver: brew install --cask chromedriver")
-        print("3. Or use: pip install webdriver-manager")
-        raise ValueError(f"WebDriver error: {e}")
+    except Exception as e:
+        # If Selenium is not installed or fails, try a lightweight fallback.
+        try:
+            import requests
+            print("Selenium unavailable/failed. Falling back to requests (static HTML only)...")
+            resp = requests.get(
+                url,
+                timeout=30,
+                headers={
+                    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+                    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+                },
+            )
+            resp.raise_for_status()
+            return resp.text
+        except Exception as re:
+            print(f"Fallback fetch failed: {re}")
+            raise ValueError(f"Failed to fetch page via Selenium and requests: {e}") from re
@@
     finally:
         if driver:
             print("Closing browser...")
             driver.quit()

📝 Committable suggestion

Suggested change

-def download_with_selenium(url):
-    """Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
-    driver = None
-    try:
-        options = _configure_chrome_options()
-        driver = _create_driver(options)
-        _remove_webdriver_flag(driver)
-
-        print(f"Navigating to {url} ...")
-        driver.get(url)
-
-        _wait_out_challenge(driver, timeout=30)
-        time.sleep(random.uniform(2, 5))  # small buffer to ensure full render
-
-        html_content = _get_html_or_raise(driver)
-        print("Successfully retrieved page.")
-        return html_content
-
-    except WebDriverException as e:
-        print(f"WebDriver error: {e}")
-        print("\nTroubleshooting:")
-        print("1. Make sure Chrome browser is installed")
-        print("2. Install ChromeDriver: brew install --cask chromedriver")
-        print("3. Or use: pip install webdriver-manager")
-        raise ValueError(f"WebDriver error: {e}")
-
-    except Exception as e:
-        print(f"Unexpected error: {e}")
-        raise ValueError(f"Failed to scrape with Selenium: {e}")
+def download_with_selenium(url):
+    """Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
+    driver = None
+    try:
+        # Try Selenium path first; import lazily to avoid hard dependency.
+        try:
+            from selenium.common.exceptions import WebDriverException
+        except ImportError:
+            WebDriverException = Exception  # sentinel for isinstance checks
+
+        options = _configure_chrome_options()
+        driver = _create_driver(options)
+        _remove_webdriver_flag(driver)
+
+        print(f"Navigating to {url} ...")
+        driver.get(url)
+
+        _wait_out_challenge(driver, timeout=30)
+        time.sleep(random.uniform(2, 5))  # small buffer to ensure full render
+
+        html_content = _get_html_or_raise(driver)
+        print("Successfully retrieved page.")
+        return html_content
+    except Exception as e:
+        # If Selenium is not installed or fails, try a lightweight fallback.
+        try:
+            import requests
+
+            print("Selenium unavailable/failed. Falling back to requests (static HTML only)...")
+            resp = requests.get(
+                url,
+                timeout=30,
+                headers={
+                    "User-Agent": (
+                        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+                        "AppleWebKit/537.36 (KHTML, like Gecko) "
+                        "Chrome/120.0.0.0 Safari/537.36"
+                    )
+                },
+            )
+            resp.raise_for_status()
+            return resp.text
+        except Exception as re:
+            print(f"Fallback fetch failed: {re}")
+            raise ValueError(
+                f"Failed to fetch page via Selenium and requests: {e}"
+            ) from re
+    finally:
+        if driver:
+            print("Closing browser...")
+            driver.quit()

🧰 Tools

🪛 GitHub Check: Codacy Static Code Analysis

[warning] 258-258: data-collection/collectors/fbref_collector.py#L258
Standard pseudo-random generators are not suitable for security/cryptographic purposes.

🤖 Prompt for AI Agents

In data-collection/collectors/fbref_collector.py around lines 246 to 275, the download_with_selenium function currently depends directly on Selenium without fallback. Modify the function to localize Selenium imports inside it and its helpers, then add a fallback to perform a plain HTTP GET request with appropriate headers and timeout if Selenium is not installed or fails to run. This avoids a hard dependency on Selenium and improves robustness.
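The fallback shape described here can also be expressed independently of Selenium and requests, which makes it easy to test without a browser or network. This is a sketch under that assumption; `fetch_with_fallback` and its `primary` and `fallback` parameters are hypothetical names, with the callables standing in for the Selenium fetch and the plain HTTP GET.

```python
from typing import Callable


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Try primary(url); on any failure try fallback(url); chain errors if both fail."""
    try:
        return primary(url)
    except Exception as primary_err:
        try:
            return fallback(url)
        except Exception as fallback_err:
            # Keep the primary error in the message and the fallback error as the cause.
            raise ValueError(
                f"Both fetchers failed for {url}: {primary_err}"
            ) from fallback_err


if __name__ == "__main__":
    def broken_browser(url: str) -> str:
        raise RuntimeError("no browser available")

    def static_get(url: str) -> str:
        return f"<html>static copy of {url}</html>"

    print(fetch_with_fallback("https://example.com", broken_browser, static_get))
```

Passing the fetchers in as callables keeps the retry logic separate from the Selenium setup, so the error-chaining behaviour can be unit tested with stubs.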

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (6)

data-collection/collectors/football_data_collector.py (2)
110-118: Fragile filename parsing & shadowing built-in file
Same issues flagged earlier: rename param, use os.path.basename, generate stable match_id.

63-88: Directory creation, overwrite flag still missing

download_csv():

• Fails if data/raw doesn’t exist (open() will raise).
• Docstring mentions overwrite but the flag is not implemented.
• No guard against silently clobbering an existing file.

+raw_dir = get_data_raw_folder()
+os.makedirs(raw_dir, exist_ok=True)
+filepath = os.path.join(raw_dir, filename)
+
+if not overwrite and os.path.exists(filepath):
+    raise FileExistsError(f"{filepath} exists. Set overwrite=True to replace.")

data-collection/collectors/fbref_collector.py (4)

12-17: Hard dependency on Selenium contradicts docstring

Top-level imports (selenium.*) will crash on environments without Selenium despite the “avoids hard dependency” claim. Move all Selenium imports inside the functions that require them and guard with try/except.

58-88: extract_columns() can crash & ignores data_fields

• table.find("tbody") may be None → AttributeError.
• When no rows parsed, df_bs is undefined → UnboundLocalError.
• data_fields param is never used.
Harden null-checks, always return a DataFrame, or drop the unused param.

90-103: Non-unique match_id, fragile parsing, file shadowing

Issues previously noted remain: use deterministic hash for match_id, robust os.path.basename parsing, and rename file param.

175-185: Directory may not exist

create_csv() only calls os.makedirs when the path is not absolute, but filepath is always absolute—directory creation is skipped. Remove the condition.
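The "deterministic hash for match_id" and "robust os.path.basename parsing" items raised for both collectors could look roughly like the sketch below. The choice of fields (league, season, date, home, away), the SHA-256 hash, and the 16-character truncation are assumptions for illustration; the review does not specify them, and `parse_raw_filename` and `make_match_id` are hypothetical names.

```python
import hashlib
import os


def parse_raw_filename(path: str) -> str:
    """Return the filename stem, ignoring directories (even ones containing dots)."""
    return os.path.splitext(os.path.basename(path))[0]


def make_match_id(league: str, season: str, date: str, home: str, away: str) -> str:
    """Build a stable ID: the same normalized inputs always give the same string."""
    key = "|".join(part.strip().lower() for part in (league, season, date, home, away))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    print(parse_raw_filename("/tmp/data.raw/premier_league_2023.csv"))
    print(make_match_id("Premier League", "2023-24", "2023-08-12", "Arsenal", "Forest"))
```

A content-derived ID like this stays the same across reruns and across sources, which should help with the planned merge step, provided both sources normalize team names and dates the same way.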

🧹 Nitpick comments (1)

data-collection/collectors/football_data_collector.py (1)

21-44: Remove or use the league_name parameter

generate_football_data_url() ignores its league_name argument, which is misleading and invites bugs. Either incorporate it (e.g., slugify for sanity checks) or drop the parameter.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3be68e3 and de3f134.

📒 Files selected for processing (2)

data-collection/collectors/fbref_collector.py (1 hunks)
data-collection/collectors/football_data_collector.py (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (2)

data-collection/collectors/football_data_collector.py (1)

data-collection/collectors/fbref_collector.py (5)

_here (19-20)

get_data_raw_folder (23-26)

get_data_processed_folder (29-32)

get_columns (169-172)

normalize_date (138-147)

data-collection/collectors/fbref_collector.py (1)

data-collection/collectors/football_data_collector.py (5)

_here (47-48)

get_data_raw_folder (51-54)

get_data_processed_folder (57-60)

normalize_date (148-157)

get_columns (122-125)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Codacy Static Code Analysis

coderabbitai · 2025-08-08T23:13:24Z

+def download_with_selenium(url):
+    """Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
+    driver = None
+    try:
+        options = _configure_chrome_options()
+        driver = _create_driver(options)
+        _remove_webdriver_flag(driver)
+
+        print(f"Navigating to {url} ...")
+        driver.get(url)
+
+        _wait_out_challenge(driver, timeout=30)
+
+        # SystemRandom() is a secure random number generator. It is a best practice for security-sensitive applications.
+        time.sleep(SystemRandom().uniform(2, 5))  # small buffer to ensure full render
+
+        html_content = _get_html_or_raise(driver)
+        print("Successfully retrieved page.")
+        return html_content
+
+    except WebDriverException as e:
+        print(f"WebDriver error: {e}")
+        print("\nTroubleshooting:")
+        print("1. Make sure Chrome browser is installed")
+        print("2. Install ChromeDriver: brew install --cask chromedriver")
+        print("3. Or use: pip install webdriver-manager")
+        raise ValueError(f"WebDriver error: {e}")
+
+    except Exception as e:
+        print(f"Unexpected error: {e}")
+        raise ValueError(f"Failed to scrape with Selenium: {e}")
+


⚠️ Potential issue

No fallback despite promise

download_with_selenium() claims a “requests-based fallback” yet catches WebDriverException and merely re-raises. Implement the fallback or revise the docstring.

🤖 Prompt for AI Agents

In data-collection/collectors/fbref_collector.py around lines 246 to 277, the function download_with_selenium() claims to have a requests-based fallback in its docstring but currently does not implement any fallback logic and just re-raises exceptions. To fix this, either implement the fallback by catching WebDriverException and then attempting to fetch the page content using requests as a backup, or update the docstring to remove the mention of a requests-based fallback to accurately reflect the current behavior.

coderabbitai · 2025-08-08T23:13:24Z

+def save_df_to_csv(df, raw_filename):
+    """Save DataFrame to CSV"""
+    processed_filename = raw_filename.split(".")[0] + "_processed.csv"
+    filepath = os.path.join(get_data_processed_folder(), processed_filename)
+    df.to_csv(filepath, index=False)
+    print(f"Processed CSV saved as '{filepath}'.")


🛠️ Refactor suggestion

Ensure processed directory exists before write

save_df_to_csv() writes to data/processed but never os.makedirs(..., exist_ok=True). Add it to avoid FileNotFoundError.

🤖 Prompt for AI Agents

In data-collection/collectors/football_data_collector.py around lines 160 to 165, the function save_df_to_csv writes the DataFrame to a file in the processed data directory but does not ensure the directory exists, which can cause a FileNotFoundError. Before calling df.to_csv, add a call to os.makedirs with the processed directory path and exist_ok=True to create the directory if it does not exist.
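A small path helper illustrates the fix without needing pandas. This is a sketch, not the PR's function: `build_processed_path` is a hypothetical name, and it uses `os.path.splitext` instead of the original `split(".")[0]`, which is a deliberate assumption since the two differ for filenames containing more than one dot.

```python
import os
import tempfile


def build_processed_path(processed_dir: str, raw_filename: str) -> str:
    """Create processed_dir if needed and return the '<stem>_processed.csv' path in it."""
    os.makedirs(processed_dir, exist_ok=True)
    stem, _ext = os.path.splitext(os.path.basename(raw_filename))
    return os.path.join(processed_dir, f"{stem}_processed.csv")


if __name__ == "__main__":
    target_dir = os.path.join(tempfile.mkdtemp(), "data", "processed")
    print(build_processed_path(target_dir, "E0_2023.csv"))
```

In `save_df_to_csv()` the returned path would then be passed to `df.to_csv(path, index=False)`, so the directory is guaranteed to exist before the write.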

Anna S added 24 commits July 6, 2025 14:27

Issue vibing-ai#6 completed

7e98368

moved to ai-backend/tests/

54a7273

Completed Issue-7

44c5914

implement ed get_fixtures method and created test script

d1e4a83

Recommit changegd made in Issue-7.

9a5e989

added get_teams function

afe9985

update

6188847

successfully downloaded .csv files of the Premier League from the last 5 seasons

2a5a72d

almost done with data cleaning function

4ec99da

Finished data cleansing from downloaded ..csv files to new .csv files

aed95fb

Changed file paths from relative to absolute

9a49b86

expand download to Champions Leagus and La Liga

8a874e4

added get_columns() to allow for dymanic selection of columns in a dataframe

e654696

remove unused import

dbecda7

Removed calling requests to retrieve the .html pages

4a550f5

Moved code from fbref_collector.py to collect_all.py

dd31560

Removed download_single_html_page()

aca6524

filled in empty columns before creating a csv file

ace2e88

Rewrite extract_elements() and rename it to extract_columns()

80287b5

Rearrange columns in dataframe before writing to csv

7c43ec2

Refactoring code

cf5c766

Refactor

f324cc5

update

09f7eaa

Code refractored and cleaned up

e7604e4

coderabbitai Bot reviewed Aug 8, 2025

Refractor

4b49438

coderabbitai Bot reviewed Aug 8, 2025

refractor

3be68e3

coderabbitai Bot reviewed Aug 8, 2025

Added SystemRandom() in fbref_collector.py for security/cryptographic purpose

de3f134

coderabbitai Bot reviewed Aug 8, 2025

+              # HTMLs
+              *.html
+              # CSVs
+              *.csv

-# HTMLs
-*.html
-# CSVs
-*.csv
+# Data-collection outputs
+data-collection/**/*.html
+data-collection/**/*.csv

		@@ -0,0 +1,142 @@
		import pytest

		from tools.sports_apis import APIFootballClient

	from tools.sports_apis import APIFootballClient
	from ai_backend.tools.sports_apis import APIFootballClient

-sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared"))
+import importlib.util, pathlib, sys
+shared_path = pathlib.Path(__file__).resolve().parent.parent / "shared"
+sys.path.insert(0, str(shared_path))
