Historical football data collection#34
Conversation
WalkthroughThis update introduces new data collection modules and scripts for football match data, implements and tests API clients for sports data, and enhances configuration and testing infrastructure. Key changes include new collectors for FBref and football-data.co.uk, expanded API-Football client functionality, additional test suites, stricter type and test configurations, and various formatting and stylistic improvements across the codebase. Changes
Sequence Diagram(s)sequenceDiagram
participant Script as collect_all.py
participant FBref as fbref_collector.py
participant FootballData as football_data_collector.py
participant Selenium as Selenium WebDriver
participant Requests as requests
Script->>FootballData: generate_football_data_url()
Script->>FootballData: download_csv()
Script->>FootballData: read_csv()
Script->>FootballData: rename_columns(), add_new_columns_to_football_data()
Script->>FootballData: reorder_df(), normalize_date(), save_df_to_csv()
Script->>FBref: generate_fbref_url()
Script->>Selenium: download_with_selenium()
Selenium-->>FBref: HTML content
Script->>FBref: save_html_to_file()
Script->>FBref: extract_columns()
Script->>FBref: add_new_columns_to_fbref(), reorder_df(), normalize_date(), create_csv()
sequenceDiagram
participant Test as test_fixtures.py
participant MockSession as MockSession
participant API as APIFootballClient
Test->>API: get_fixtures(...)
API->>MockSession: get()
MockSession-->>API: MockResponse (JSON)
API-->>Test: List of fixtures
sequenceDiagram
participant APIAuthTest as test_api_auth.py
participant APIAuth as APIAuth
participant aiohttp as aiohttp.ClientSession
participant RapidAPI as API-Football
APIAuthTest->>APIAuth: __aenter__()
APIAuth->>aiohttp: ClientSession()
APIAuthTest->>APIAuth: test_api_auth()
APIAuth->>RapidAPI: GET /fixtures
RapidAPI-->>APIAuth: JSON response
APIAuth-->>APIAuthTest: Parsed data
APIAuthTest->>APIAuth: __aexit__()
APIAuth->>aiohttp: Close session
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
Note 🔌 MCP (Model Context Protocol) integration is now available in Early Access!Pro users can now connect to remote MCP servers under the Integrations page to get reviews and chat conversations that understand additional development context. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 10
🔭 Outside diff range comments (1)
ai-backend/.pre-commit-config.yaml (1)
25-26: Align Python versions across Black and mypyThere’s a mismatch between the Python target in your pre-commit Black hook and mypy config, which can lead to inconsistent parsing or type-checking behavior. Please align both tools to the same version.
• ai-backend/.pre-commit-config.yaml (line 25):
language_version: python3.12• ai-backend/mypy.ini (line 3):
python_version = 3.11Suggested fix (choose one):
– Update mypy.ini to 3.12:
[mypy] -python_version = 3.11 +python_version = 3.12 ...– Or downgrade Black target to 3.11:
- language_version: python3.12 + language_version: python3.11
🧹 Nitpick comments (21)
data-collection/collectors/__init__.py (1)
1-3: Clarify scope in package docstringIf this package is focused on football data (per PR), reflect that for precision.
Apply this diff:
-""" -Collectors package for sports data collection. -""" +""" +Collectors for historical football data (fbref, football-data.co.uk). +"""ai-backend/mypy.ini (1)
46-67: Broad ignore_missing_imports reduces type safety; consider targeted alternativesGood to unblock CI, but you’ll lose valuable checks for fastapi/pydantic/pytest, which generally have decent typing. Consider:
- Remove ignores for fastapi/pydantic/uvicorn if possible.
- Enable the Pydantic mypy plugin to improve model typing.
Add the Pydantic plugin at the top-level (outside this hunk):
[mypy] plugins = pydantic.mypyThen gradually remove these sections:
- [mypy-fastapi.], [mypy-pydantic.], [mypy-uvicorn.*]
Verify locally which ones are still needed. If certain submodules lack stubs, prefer narrow, module-specific ignores or inline# type: ignore[import]over package-wide ignores.ai-backend/agents/researcher.py (1)
24-37: Consider stronger typing for return structureInstead of dict[str, Any], define a TypedDict or dataclass for the research payload. It improves IDE assistance and catches schema drift early.
Example:
from typing import TypedDict class TeamHistory(TypedDict, total=False): head_to_head_record: dict[str, int] # e.g., {"wins": 10, "draws": 5, "losses": 7} recent_form: list[str] # e.g., ["W", "D", "L", ...] key_matches: list[dict[str, str]] # e.g., [{"date": "...", "score": "2-1", ...}]Then annotate: async def research_team_history(...) -> TeamHistory:
ai-backend/main.py (1)
270-279: Prefer structured logging instead of “%s” placeholders
logger.info("Starting server on %s:%s", host, port)is fine, but elsewhere the codebase uses keyword/value pairs (extra=orlogger.info(key=value)). Aligning styles improves log-parsing consistency.mypy.ini (1)
25-46: Consider using official stub packages instead of blanketignore_missing_imports
beautifulsoup4,aiohttp,openai, etc., all have published stubs (types-beautifulsoup4,aiohttp-types, …). Installing them lets you keep strict checking without silencing whole modules.ai-backend/config/settings.py (2)
55-63: Validator duplicatesmin_lengthconstraint
min_length=20plus explicitlenchecks is redundant. Either rely on the field constraint or keep the validator but dropmin_lengthto avoid double-error messages.
114-115:model_configmay shadow future attributes
model_configis Pydantic-v2 specific. To remain forward-compatible (e.g., if pinning v2.5+), considerclass Configwithmodel_config = …inside, as recommended in docs, to avoid naming collisions.data-collection/collectors/football_data_collector.py (4)
22-45: Unusedleague_nameparameter
league_nameis never referenced; drop it or incorporate it to avoid API confusion.-def generate_football_data_url(league_id, league_name, season): +def generate_football_data_url(league_id: str, season: int) -> str:
129-146: ReplaceRaw
print()hampers library reuse and log aggregation. Inject a module-level logger instead.-import pprint -… -print("After getting essential columns:") -print(df_processed.head()) +logger = logging.getLogger(__name__) +… +logger.debug("Essential columns sample:\n%s", df_processed.head())
149-157: Date normalization swallows errors silentlyReturning invalid dates as
NaTcast to"NaT"leads to string"NaT"values downstream. Consider dropping rows withNaTor at least logging count of failed parses.
160-165: Ensure processed directory exists and useos.path.splitext-processed_filename = raw_filename.split(".")[0] + "_processed.csv" +base, _ = os.path.splitext(raw_filename) +processed_filename = f"{base}_processed.csv" +os.makedirs(get_data_processed_folder(), exist_ok=True)Return the saved path so callers can chain operations.
data-collection/collectors/fbref_collector.py (2)
100-111: Filename parsing is brittle
file.split("_")assumes exactly three underscores and no extra “_” in league names (e.g., “Champions_League”). Safer:parts = file.rsplit("_", maxsplit=2) if len(parts) != 3: raise ValueError(f"Unexpected filename format: {file}") _, league, season_part = parts season = season_part.split(".")[0]
158-180: Heavy code duplication withfootball_data_collector.py
reorder_df,get_columns,normalize_date, and CSV helpers are copy-pasted. Extract them into a shared util and import from both collectors to reduce maintenance cost.data-collection/scripts/collect_all.py (2)
110-113: Variable reuse blurs types
df_cleaned = save_df_to_csv(df_cleaned, raw_csv_filename)reassigns the DataFrame variable to the path string returned bysave_df_to_csv.
Prefer discarding the return value or storing it in a clearly-named variable to avoid accidental misuse later.
138-146: Path built via string concatenation – useos.path.joinhtml_filepath = os.path.join(get_data_folder(), "raw", html_filename)avoids platform-specific separator bugs.
ai-backend/tools/sports_apis.py (1)
182-193: Logging entire team payload may leak unnecessary data
logger.info(f"Teams data: {json.dumps(data_response, indent=2)}")can emit thousands of lines and potentially sensitive info to logs.Log the count instead, or guard behind a debug flag.
ai-backend/tests/test_openai.py (1)
16-30: Convert print-based skip to pytest skipUse
pytest.skip()so the test suite reports a skipped test instead of silently passing.import pytest if not os.getenv("OPENAI_API_KEY"): pytest.skip("OPENAI_API_KEY not set", allow_module_level=True)ai-backend/tests/test_quality_tools.py (2)
12-13:processed_gamesis never usedThe attribute is initialised but never updated; either populate it in
process_game_dataor drop it to avoid dead code.
60-63: Remove ad-hocPrinting in test modules clutters CI logs. Rely on assertions instead, or move this demo into
docs/or an example script.ai-backend/tests/test_fixtures.py (1)
36-39:monkeypatchfixture is declared but unusedPytest will warn on this; simply drop the parameter.
ai-backend/tools/test_api_auth.py (1)
1-3: File name will be collected by pytestA module named
test_*.pyundertools/will be imported during test discovery even though it isn’t a test.
Rename the file (e.g.api_auth_demo.py) or add__test__ = Falseat module level to suppress collection.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
.gitignore(2 hunks)ai-backend/.pre-commit-config.yaml(1 hunks)ai-backend/agents/editor.py(1 hunks)ai-backend/agents/researcher.py(2 hunks)ai-backend/config/settings.py(5 hunks)ai-backend/main.py(3 hunks)ai-backend/mypy.ini(1 hunks)ai-backend/pytest.ini(1 hunks)ai-backend/tests/test_environment.py(1 hunks)ai-backend/tests/test_fixtures.py(1 hunks)ai-backend/tests/test_openai.py(1 hunks)ai-backend/tests/test_quality_tools.py(1 hunks)ai-backend/tools/sports_apis.py(7 hunks)ai-backend/tools/test_api_auth.py(1 hunks)data-collection/collectors/__init__.py(1 hunks)data-collection/collectors/fbref_collector.py(1 hunks)data-collection/collectors/football_data_collector.py(1 hunks)data-collection/scripts/collect_all.py(1 hunks)mypy.ini(1 hunks)scripts/seed-data.py(10 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
data-collection/collectors/football_data_collector.py (1)
data-collection/collectors/fbref_collector.py (6)
_here(25-26)get_data_raw_folder(29-32)get_data_processed_folder(35-38)get_columns(177-180)reorder_df(158-173)normalize_date(147-155)
data-collection/collectors/fbref_collector.py (1)
data-collection/collectors/football_data_collector.py (6)
_here(48-49)get_data_raw_folder(52-55)get_data_processed_folder(58-61)normalize_date(149-157)reorder_df(129-146)get_columns(123-126)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (10)
ai-backend/.pre-commit-config.yaml (1)
31-31: No-op quote style change — OK to keepSwitching to single quotes in args is fine; behavior unchanged.
.gitignore (1)
8-10: Section header for data artifacts — looks goodThe new section marker improves readability.
ai-backend/agents/researcher.py (1)
24-24: Formatting change only — safeSignature/logging compaction is non-functional and consistent with the module’s style. Sanitization preserved.
Also applies to: 36-36
ai-backend/agents/editor.py (1)
37-49: Formatting change looks goodSignature compacting is fine and keeps consistency with other agents. No further action required.
ai-backend/main.py (2)
114-117: Re-raising with HTTPException is fineWrapping the original exception preserves context via
from e. No concerns.
148-156: Signature now fits 79-char line – OKOnly formatting changed. Behaviour untouched.
ai-backend/config/settings.py (1)
24-30: Verify callers after the ENV-VAR renameChanging to
OPENAI_API_KEY,SUPABASE_URL, etc., will break any code, CI secrets, or infra still exporting lowercase names. Double-check docker-compose, deployment manifests, and docs.data-collection/collectors/fbref_collector.py (1)
226-232: Hard-coded Chrome binary path breaks non-macOS environmentsSetting
chrome_options.binary_locationto a mac-specific path will fail on Linux/CI or Windows boxes. Detect the OS or allow an env-var override instead of hard-coding.ai-backend/pytest.ini (2)
7-14: Declarepytest-covas an explicit dev dependencyThe new
--cov=*options require thepytest-covplugin; ensure it’s listed in your dev/CI requirements files, otherwise the test run will error out on unknown flags.
15-19: Make sure the custom markers are actually used
--strict-markerswill fail if any of these markers are unused. Double-check that at least one test is decorated with each ofslow,integration,unit, andasyncio, or drop the unused entries.
| # HTMLs | ||
| *.html | ||
|
|
||
| # CSVs | ||
| *.csv | ||
|
|
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Scope data artifact ignores to data-collection only (avoid hiding docs/fixtures)
Ignoring all HTML/CSV globally is risky; it can unintentionally hide documentation, static site outputs, or CSV test fixtures. Limit ignores to the data-collection tree (or specific output dirs).
Suggested replacement:
-# HTMLs
-*.html
-
-# CSVs
-*.csv
+# Data outputs (restricted to data-collection)
+data-collection/**/*.html
+data-collection/**/*.csvIf your outputs live in known subfolders (e.g., data-collection/raw and data-collection/processed), tighten further:
- data-collection/raw/**/*.csv
- data-collection/processed/**/*.csv
- data-collection/raw/**/*.html
- data-collection/processed/**/*.html
🏁 Script executed:
#!/bin/bash
echo "Checking for .html files outside data-collection directory..."
rg --files --glob '*.html' | grep -v '^data-collection/'
echo ""
echo "Checking for .csv files outside data-collection directory..."
rg --files --glob '*.csv' | grep -v '^data-collection/'Length of output: 391
🏁 Script executed:
#!/bin/bash
echo "Checking for a data-collection directory..."
fd -t d data-collection || echo ">> data-collection directory not found"
echo
echo "Searching for HTML files under data-collection..."
rg --files --glob 'data-collection/**/*.html' || echo ">> No HTML files found in data-collection"
echo
echo "Searching for CSV files under data-collection..."
rg --files --glob 'data-collection/**/*.csv' || echo ">> No CSV files found in data-collection"Length of output: 650
Restrict HTML/CSV ignores to data-collection only
To avoid inadvertently hiding docs or test fixtures elsewhere in the repo, replace the global *.html and *.csv ignores with patterns scoped to your data-collection outputs.
• File: .gitignore (lines 95–100)
• Remove:
-# HTMLs
-*.html
-
-# CSVs
-*.csv• Add:
+# Data-collection outputs
+data-collection/**/*.html
+data-collection/**/*.csvIf you know your outputs live in subfolders (e.g. raw/ and processed/), you can tighten further:
data-collection/raw/**/*.csv
data-collection/processed/**/*.csv
data-collection/raw/**/*.html
data-collection/processed/**/*.html
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # HTMLs | |
| *.html | |
| # CSVs | |
| *.csv | |
| # Data-collection outputs | |
| data-collection/**/*.html | |
| data-collection/**/*.csv |
🤖 Prompt for AI Agents
In the .gitignore file around lines 95 to 100, the current patterns '*.html' and
'*.csv' ignore these file types globally, which may hide important docs or test
fixtures elsewhere. Replace these global ignores with more specific patterns
scoped to the data-collection directory, such as 'data-collection/raw/**/*.csv',
'data-collection/processed/**/*.csv', 'data-collection/raw/**/*.html', and
'data-collection/processed/**/*.html' to restrict ignoring only to your
data-collection outputs.
| # Test core dependencies | ||
| try: | ||
| print("✅ OpenAI package imported successfully") | ||
| except ImportError as e: | ||
| print(f"❌ OpenAI import failed: {e}") | ||
|
|
||
| try: | ||
| print("✅ OpenAI Agents package imported successfully") | ||
| except ImportError as e: | ||
| print(f"❌ OpenAI Agents import failed: {e}") | ||
|
|
||
| try: | ||
| print("✅ FastAPI package imported successfully") | ||
| except ImportError as e: | ||
| print(f"❌ FastAPI import failed: {e}") | ||
|
|
||
| try: | ||
| print("✅ Pydantic package imported successfully") | ||
| except ImportError as e: | ||
| print(f"❌ Pydantic import failed: {e}") | ||
|
|
||
| try: | ||
| print("✅ Supabase package imported successfully") | ||
| except ImportError as e: | ||
| print(f"❌ Supabase import failed: {e}") | ||
|
|
There was a problem hiding this comment.
Test never imports the modules it claims to verify
Each try block only prints a success message; the actual import statements are missing, so failures will never surface.
Example fix for one block (apply to all):
try:
import openai # noqa: F401
print("✅ OpenAI package imported successfully")
except ImportError as e:
print(f"❌ OpenAI import failed: {e}")
raiseAlso add assertions to fail the test when an import is missing so CI can detect environment problems.
🤖 Prompt for AI Agents
In ai-backend/tests/test_environment.py between lines 7 and 32, the try blocks
only print success messages without actually importing the modules, so import
failures won't be detected. Fix this by adding the appropriate import statements
inside each try block for the respective packages (e.g., import openai, import
fastapi, etc.). Also, add assertions or raise the ImportError in the except
blocks to ensure the test fails and CI detects missing dependencies.
| @@ -0,0 +1,142 @@ | |||
| import pytest | |||
|
|
|||
| from tools.sports_apis import APIFootballClient | |||
There was a problem hiding this comment.
Import path likely wrong
from tools.sports_apis import … assumes a top-level tools package.
From the repo layout it should be:
-from tools.sports_apis import APIFootballClient
+from ai_backend.tools.sports_apis import APIFootballClient(or update PYTHONPATH).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from tools.sports_apis import APIFootballClient | |
| from ai_backend.tools.sports_apis import APIFootballClient |
🤖 Prompt for AI Agents
In ai-backend/tests/test_fixtures.py at line 3, the import statement uses 'from
tools.sports_apis import APIFootballClient', which assumes 'tools' is a
top-level package. To fix this, adjust the import path to reflect the actual
relative location of the 'tools' package within the repo structure, or
alternatively update the PYTHONPATH environment variable to include the
directory containing 'tools' so the import resolves correctly.
| data = await response.json() | ||
| data_stringified = json.dumps(data, indent=2) | ||
|
|
||
| # logger.info(f"Response data: {data_stringified}") | ||
| return data_stringified |
There was a problem hiding this comment.
Return type mismatches the annotation
_fetch_data (and therefore test_api_auth) claim to return list[dict] but actually return a JSON string. Either adjust the annotation or return the parsed list/dict:
- data = await response.json()
- data_stringified = json.dumps(data, indent=2)
- return data_stringified
+ return await response.json()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| data = await response.json() | |
| data_stringified = json.dumps(data, indent=2) | |
| # logger.info(f"Response data: {data_stringified}") | |
| return data_stringified | |
| # logger.info(f"Response data: {data_stringified}") | |
| return await response.json() |
🤖 Prompt for AI Agents
In ai-backend/tools/test_api_auth.py around lines 82 to 86, the function
_fetch_data is annotated to return a list of dictionaries but currently returns
a JSON string. To fix this, modify the return statement to return the parsed
JSON data directly (the list/dict) instead of the stringified JSON, or
alternatively update the function's return type annotation to indicate it
returns a string. Choose the approach that aligns with the expected usage of the
function.
| from selenium import webdriver | ||
| from selenium.webdriver.chrome.options import Options | ||
| from selenium.webdriver.common.by import By | ||
| from selenium.webdriver.support.ui import WebDriverWait | ||
| from selenium.webdriver.support import expected_conditions as EC | ||
| from selenium.common.exceptions import TimeoutException, WebDriverException | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
Unconditional Selenium import contradicts “no hard dependency” claim
selenium is imported at module load time. If the package isn’t installed (CI, head-less servers), simply importing this module will raise ModuleNotFoundError, even when Selenium features aren’t used.
Move the import inside download_with_selenium() (guarded by try/except ImportError) or wrap the top-level import in a soft-fail block.
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 15 to 21, the
selenium imports are unconditional, causing a ModuleNotFoundError if selenium is
not installed even when its features are not used. To fix this, move all
selenium-related imports inside the download_with_selenium() function and wrap
them in a try/except ImportError block to softly handle the absence of selenium
and avoid import errors at module load time.
| def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame: | ||
| """Extract specified columns from HTML using BeautifulSoup.""" | ||
| parsed_rows = [] | ||
| # Find the specific table we care about using BeautifulSoup | ||
| table = soup.find("table", class_="stats_table") | ||
|
|
||
| if table: | ||
| for row in table.find("tbody").find_all("tr"): | ||
| # Find the specific cells '<td>' by their 'data-stat' attribute | ||
| fields = {"date": "", "score": "", "home_team": "", "away_team": ""} | ||
| for field in fields: | ||
| # Find the specific cells '<td>' by their 'data-stat' attribute | ||
| cell = row.find("td", {"data-stat": field}) | ||
| # Extract the text from each cell, checking if the cell was found | ||
| fields[field] = cell.text.strip() if cell else "" | ||
|
|
||
| # Create a dictionary for each row | ||
| parsed_rows.append(fields) | ||
|
|
||
| # Convert our list of dictionaries into a pandas DataFrame | ||
| if parsed_rows: | ||
| # 2nd arg preserves the order of the columns | ||
| df_bs = pd.DataFrame( | ||
| parsed_rows, columns=["date", "score", "home_team", "away_team"] | ||
| ) | ||
| print("\nSuccess! Extracted columns using BeautifulSoup:\n") | ||
| else: | ||
| print("\nFailed to extract columns using BeautifulSoup.") | ||
|
|
||
| return df_bs |
There was a problem hiding this comment.
extract_columns() can raise UnboundLocalError and misses null-safety
Issues:
- When
parsed_rowsis empty the variabledf_bsis never defined, yet it’s returned – this will crash. table.find("tbody")may returnNone; chaining.find_all()without a check causesAttributeError.- Parameter
data_fieldsis accepted but never used → dead code / confusion.
Minimal patch:
- if table:
- for row in table.find("tbody").find_all("tr"):
+ if table and table.tbody:
+ for row in table.tbody.find_all("tr"):
@@
- if parsed_rows:
+ if parsed_rows:
df_bs = pd.DataFrame(
parsed_rows, columns=["date", "score", "home_team", "away_team"]
)
print("\nSuccess! Extracted columns using BeautifulSoup:\n")
- else:
- print("\nFailed to extract columns using BeautifulSoup.")
-
- return df_bs
+ return df_bs
+
+ print("\nFailed to extract columns using BeautifulSoup.")
+ return pd.DataFrame(columns=["date", "score", "home_team", "away_team"])📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame: | |
| """Extract specified columns from HTML using BeautifulSoup.""" | |
| parsed_rows = [] | |
| # Find the specific table we care about using BeautifulSoup | |
| table = soup.find("table", class_="stats_table") | |
| if table: | |
| for row in table.find("tbody").find_all("tr"): | |
| # Find the specific cells '<td>' by their 'data-stat' attribute | |
| fields = {"date": "", "score": "", "home_team": "", "away_team": ""} | |
| for field in fields: | |
| # Find the specific cells '<td>' by their 'data-stat' attribute | |
| cell = row.find("td", {"data-stat": field}) | |
| # Extract the text from each cell, checking if the cell was found | |
| fields[field] = cell.text.strip() if cell else "" | |
| # Create a dictionary for each row | |
| parsed_rows.append(fields) | |
| # Convert our list of dictionaries into a pandas DataFrame | |
| if parsed_rows: | |
| # 2nd arg preserves the order of the columns | |
| df_bs = pd.DataFrame( | |
| parsed_rows, columns=["date", "score", "home_team", "away_team"] | |
| ) | |
| print("\nSuccess! Extracted columns using BeautifulSoup:\n") | |
| else: | |
| print("\nFailed to extract columns using BeautifulSoup.") | |
| return df_bs | |
| def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame: | |
| """Extract specified columns from HTML using BeautifulSoup.""" | |
| parsed_rows = [] | |
| # Find the specific table we care about using BeautifulSoup | |
| table = soup.find("table", class_="stats_table") | |
| if table and table.tbody: | |
| for row in table.tbody.find_all("tr"): | |
| # Find the specific cells '<td>' by their 'data-stat' attribute | |
| fields = {"date": "", "score": "", "home_team": "", "away_team": ""} | |
| for field in fields: | |
| # Find the specific cells '<td>' by their 'data-stat' attribute | |
| cell = row.find("td", {"data-stat": field}) | |
| # Extract the text from each cell, checking if the cell was found | |
| fields[field] = cell.text.strip() if cell else "" | |
| # Create a dictionary for each row | |
| parsed_rows.append(fields) | |
| if parsed_rows: | |
| df_bs = pd.DataFrame( | |
| parsed_rows, columns=["date", "score", "home_team", "away_team"] | |
| ) | |
| print("\nSuccess! Extracted columns using BeautifulSoup:\n") | |
| return df_bs | |
| print("\nFailed to extract columns using BeautifulSoup.") | |
| return pd.DataFrame(columns=["date", "score", "home_team", "away_team"]) |
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 67 to 96, fix the
UnboundLocalError by ensuring df_bs is always defined before return, even if
parsed_rows is empty, by initializing df_bs to an empty DataFrame. Add a null
check for table.find("tbody") before calling find_all to avoid AttributeError.
Remove the unused parameter data_fields from the function signature and any
related code to eliminate dead code and confusion.
| def add_new_columns_to_football_data(df_clean, file): | ||
| """Add new columns to df_clean.""" | ||
| df_clean["match_id"] = df_clean.index | ||
|
|
||
| tmp = file.split("_") | ||
| df_clean["league"] = tmp[1] | ||
| df_clean["season"] = tmp[2].split(".")[0] | ||
| df_clean["source"] = tmp[0] | ||
| return df_clean |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Shadowing built-in file and fragile filename parsing
Rename parameter and parse safely:
-def add_new_columns_to_football_data(df_clean, file):
+def add_new_columns_to_football_data(df_clean: pd.DataFrame, filename: str) -> pd.DataFrame:
@@
- tmp = file.split("_")
+ tmp = os.path.basename(filename).split("_")Also use a stable match_id (e.g., hash of league/date/teams) instead of the index to guarantee cross-file uniqueness.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def add_new_columns_to_football_data(df_clean, file): | |
| """Add new columns to df_clean.""" | |
| df_clean["match_id"] = df_clean.index | |
| tmp = file.split("_") | |
| df_clean["league"] = tmp[1] | |
| df_clean["season"] = tmp[2].split(".")[0] | |
| df_clean["source"] = tmp[0] | |
| return df_clean | |
| def add_new_columns_to_football_data(df_clean: pd.DataFrame, filename: str) -> pd.DataFrame: | |
| """Add new columns to df_clean.""" | |
| df_clean["match_id"] = df_clean.index | |
| tmp = os.path.basename(filename).split("_") | |
| df_clean["league"] = tmp[1] | |
| df_clean["season"] = tmp[2].split(".")[0] | |
| df_clean["source"] = tmp[0] | |
| return df_clean |
🤖 Prompt for AI Agents
In data-collection/collectors/football_data_collector.py around lines 111 to
119, rename the parameter 'file' to avoid shadowing the built-in name, and
replace the fragile filename parsing with a more robust method such as using
pathlib or regex to extract parts safely. Additionally, replace the current
match_id assignment from the dataframe index with a stable unique identifier
generated by hashing a combination of league, date, and teams to ensure
uniqueness across files.
|
|
||
| # Add the shared directory to the Python path | ||
| sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'shared')) | ||
| sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared")) |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Avoid run-time sys.path manipulation
Appending ../shared to sys.path is brittle and can hide packaging/import problems. Prefer:
-# Add the shared directory to the Python path
-sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared"))
+import importlib.util, pathlib, sys
+shared_path = pathlib.Path(__file__).resolve().parent.parent / "shared"
+sys.path.insert(0, str(shared_path))…or, better, convert shared into a proper Python package and install it with your project.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sys.path.append(os.path.join(os.path.dirname(__file__), "..", "shared")) | |
| import importlib.util, pathlib, sys | |
| shared_path = pathlib.Path(__file__).resolve().parent.parent / "shared" | |
| sys.path.insert(0, str(shared_path)) |
🤖 Prompt for AI Agents
In scripts/seed-data.py at line 14, avoid modifying sys.path at runtime by
appending ../shared, as this is brittle and can mask import issues. Instead,
refactor the project structure to make shared a proper Python package with an
__init__.py file, and install it as part of your project dependencies so it can
be imported normally without sys.path hacks.
There was a problem hiding this comment.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
data-collection/collectors/football_data_collector.py (1)
159-165:save_df_to_csv()should return useful infoScript relies on its return value but the function returns
None.
Return either the processedDataFrameor the saved file path.- df.to_csv(filepath, index=False) - print(f"Processed CSV saved as '{filepath}'.") + df.to_csv(filepath, index=False) + print(f"Processed CSV saved as '{filepath}'.") + return filepath
♻️ Duplicate comments (4)
data-collection/collectors/fbref_collector.py (2)
14-18: Hard dependency on Selenium contradicts doc-stringThe module promises “no hard Selenium dependencies”, yet
selenium.*is imported at module import time. On environments without Selenium this raisesModuleNotFoundError, blocking even non-Selenium code paths.
Move these imports insidedownload_with_selenium()and guard them with atry/except ImportErrorsoft-fail.
60-89:extract_columns()can still crash & misses null-safety•
table.find("tbody")may returnNone→AttributeError.
• Whenparsed_rowsis emptydf_bsis undefined →UnboundLocalErroron return.
•data_fieldsparameter is unused – dead code / confusion.- if table: - for row in table.find("tbody").find_all("tr"): + if table and table.tbody: + for row in table.tbody.find_all("tr"): @@ - if parsed_rows: + if parsed_rows: df_bs = pd.DataFrame( parsed_rows, columns=["date", "score", "home_team", "away_team"] ) print("\nSuccess! Extracted columns using BeautifulSoup:\n") - else: - print("\nFailed to extract columns using BeautifulSoup.") - - return df_bs + return df_bs + + print("\nFailed to extract columns using BeautifulSoup.") + return pd.DataFrame(columns=["date", "score", "home_team", "away_team"])Also drop the unused
data_fieldsarg from the signature.data-collection/collectors/football_data_collector.py (2)
63-88:download_csv(): missing directory creation, timeout & overwrite guardSame issues flagged earlier remain: no
os.makedirs,requests.getwithouttimeout, and doc-string mentionsoverwriteflag that is not implemented.
110-118: Shadowing built-in name & fragile filename parsingParameter
fileshadows the historic built-in andsplit("_")fails on paths containing underscores or directories.
Rename tofilenameand useos.path.basename()orpathlibfor safe parsing.
🧹 Nitpick comments (1)
data-collection/collectors/fbref_collector.py (1)
206-210: Mac-specific Chrome binary hard-codedThe explicit
/Applications/...path breaks on Linux/CI. Detect the OS or let ChromeDriver pick the installed binary instead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
data-collection/collectors/fbref_collector.py(1 hunks)data-collection/collectors/football_data_collector.py(1 hunks)data-collection/scripts/collect_all.py(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
data-collection/collectors/fbref_collector.py (1)
data-collection/collectors/football_data_collector.py (5)
_here(47-48)get_data_raw_folder(51-54)get_data_processed_folder(57-60)normalize_date(148-156)get_columns(122-125)
data-collection/collectors/football_data_collector.py (1)
data-collection/collectors/fbref_collector.py (5)
_here(21-22)get_data_raw_folder(25-28)get_data_processed_folder(31-34)get_columns(170-173)normalize_date(140-148)
🪛 GitHub Check: Codacy Static Code Analysis
data-collection/scripts/collect_all.py
[warning] 116-116: data-collection/scripts/collect_all.py#L116
Assigning result of a function call, where the function has no return
data-collection/collectors/fbref_collector.py
[warning] 11-11: data-collection/collectors/fbref_collector.py#L11
Unused BeautifulSoup imported from bs4
[warning] 17-17: data-collection/collectors/fbref_collector.py#L17
Unused expected_conditions imported from selenium.webdriver.support as EC
[warning] 145-145: data-collection/collectors/fbref_collector.py#L145
Try, Except, Pass detected.
[warning] 259-259: data-collection/collectors/fbref_collector.py#L259
Standard pseudo-random generators are not suitable for security/cryptographic purposes.
data-collection/collectors/football_data_collector.py
[warning] 80-80: data-collection/collectors/football_data_collector.py#L80
Call to requests without timeout
[warning] 153-153: data-collection/collectors/football_data_collector.py#L153
Try, Except, Pass detected.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
| html_filepath = get_data_folder() + "/raw/" + html_filename | ||
| try: | ||
| with open(html_filepath, "r", encoding="utf-8") as f: | ||
| html_content = f.read() | ||
| soup = BeautifulSoup(html_content, "lxml") | ||
| except (FileNotFoundError, PermissionError, UnicodeDecodeError) as e: | ||
| print(f"Error reading file: {e}") | ||
| # Return an empty soup as fallback | ||
| soup = BeautifulSoup("", "lxml") |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Non-portable path concatenation to html_filepath
Using string + "/raw/" + bypasses os.path.join, breaks on Windows, and risks double slashes.
-html_filepath = get_data_folder() + "/raw/" + html_filename
+html_filepath = os.path.join(get_data_folder(), "raw", html_filename)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| html_filepath = get_data_folder() + "/raw/" + html_filename | |
| try: | |
| with open(html_filepath, "r", encoding="utf-8") as f: | |
| html_content = f.read() | |
| soup = BeautifulSoup(html_content, "lxml") | |
| except (FileNotFoundError, PermissionError, UnicodeDecodeError) as e: | |
| print(f"Error reading file: {e}") | |
| # Return an empty soup as fallback | |
| soup = BeautifulSoup("", "lxml") | |
| html_filepath = os.path.join(get_data_folder(), "raw", html_filename) | |
| try: | |
| with open(html_filepath, "r", encoding="utf-8") as f: | |
| html_content = f.read() | |
| soup = BeautifulSoup(html_content, "lxml") | |
| except (FileNotFoundError, PermissionError, UnicodeDecodeError) as e: | |
| print(f"Error reading file: {e}") | |
| # Return an empty soup as fallback | |
| soup = BeautifulSoup("", "lxml") |
🤖 Prompt for AI Agents
In data-collection/scripts/collect_all.py around lines 138 to 146, the
construction of html_filepath uses string concatenation with "+ '/raw/' +",
which is non-portable and can cause issues on Windows or result in double
slashes. Replace this concatenation with os.path.join to correctly and portably
join the directory paths and filename, ensuring compatibility across operating
systems.
There was a problem hiding this comment.
Actionable comments posted: 5
🔭 Outside diff range comments (1)
data-collection/collectors/football_data_collector.py (1)
160-166: Make save robust: ensure folder exists, use splitext, and return filepathCurrent code may fail if folder is missing and doesn’t return the path.
-def save_df_to_csv(df, raw_filename): - """Save DataFrame to CSV""" - processed_filename = raw_filename.split(".")[0] + "_processed.csv" - filepath = os.path.join(get_data_processed_folder(), processed_filename) - df.to_csv(filepath, index=False) - print(f"Processed CSV saved as '{filepath}'.") +def save_df_to_csv(df: pd.DataFrame, raw_filename: str) -> str: + """Save DataFrame to CSV and return the output path.""" + root, _ = os.path.splitext(os.path.basename(raw_filename)) + processed_filename = f"{root}_processed.csv" + out_dir = get_data_processed_folder() + os.makedirs(out_dir, exist_ok=True) + filepath = os.path.join(out_dir, processed_filename) + df.to_csv(filepath, index=False) + print(f"Processed CSV saved as '{filepath}'.") + return filepath
♻️ Duplicate comments (4)
data-collection/collectors/fbref_collector.py (2)
13-17: Unconditional Selenium import contradicts “no hard dependency” and breaks import if Selenium isn’t installedMove Selenium imports into the functions that need them and gate type hints to avoid runtime import. Also implement the documented requests-based fallback.
Apply:
@@ -import random -from selenium import webdriver -from selenium.webdriver.chrome.options import Options -from selenium.webdriver.support.ui import WebDriverWait -from selenium.common.exceptions import TimeoutException, WebDriverException +import random +import hashlib +from typing import TYPE_CHECKING + +if TYPE_CHECKING: + # For type hints only; does not enforce Selenium at runtime. + from selenium.webdriver.chrome.options import Options
58-88: Fix UnboundLocalError, add null-safety, and actually usedata_fieldsWhen no rows are parsed,
df_bsis undefined. Alsotable.tbodycan be None. Parameterdata_fieldsis unused.-def extract_columns(soup, data_fields: list[str], html_filepath: str) -> pd.DataFrame: +def extract_columns(soup, data_fields: list[str] | None, html_filepath: str) -> pd.DataFrame: """Extract specified columns from HTML using BeautifulSoup.""" - parsed_rows = [] - # Find the specific table we care about using BeautifulSoup - table = soup.find("table", class_="stats_table") + parsed_rows: list[dict] = [] + # Find the specific table we care about using BeautifulSoup + table = soup.find("table", class_="stats_table") - if table: - for row in table.find("tbody").find_all("tr"): + fields_to_extract = data_fields or ["date", "score", "home_team", "away_team"] + + if table and getattr(table, "tbody", None): + for row in table.tbody.find_all("tr"): # Find the specific cells '<td>' by their 'data-stat' attribute - fields = {"date": "", "score": "", "home_team": "", "away_team": ""} - for field in fields: + fields = {k: "" for k in fields_to_extract} + for field in fields_to_extract: # Find the specific cells '<td>' by their 'data-stat' attribute cell = row.find("td", {"data-stat": field}) # Extract the text from each cell, checking if the cell was found fields[field] = cell.text.strip() if cell else "" # Create a dictionary for each row parsed_rows.append(fields) # Convert our list of dictionaries into a pandas DataFrame - if parsed_rows: - # 2nd arg preserves the order of the columns - df_bs = pd.DataFrame( - parsed_rows, columns=["date", "score", "home_team", "away_team"] - ) - print("\nSuccess! Extracted columns using BeautifulSoup:\n") - else: - print("\nFailed to extract columns using BeautifulSoup.") - - return df_bs + if parsed_rows: + df_bs = pd.DataFrame(parsed_rows, columns=fields_to_extract) + print("\nSuccess! Extracted columns using BeautifulSoup:\n") + return df_bs + + print("\nFailed to extract columns using BeautifulSoup.") + return pd.DataFrame(columns=fields_to_extract)data-collection/collectors/football_data_collector.py (2)
63-90: Implement timeout, directory creation, and real overwrite semantics (docstring mismatch)Function can hang without timeout, fail if directory doesn’t exist, and mentions
overwritethat doesn’t exist.-def download_csv(url: str, filename: str = "football-data_data.csv"): +def download_csv( + url: str, + filename: str = "football-data_data.csv", + *, + timeout: int = 30, + overwrite: bool = True, +) -> str: @@ - - filename (str): Desired name for the saved file. Defaults to 'data.csv'. - - overwrite (bool): Whether to overwrite existing file. Defaults to False. + - filename (str): Desired name for the saved file. + - timeout (int): Network timeout in seconds. + - overwrite (bool): Whether to overwrite existing file. Defaults to True. @@ - filepath = os.path.join(get_data_raw_folder(), filename) + raw_dir = get_data_raw_folder() + os.makedirs(raw_dir, exist_ok=True) + filepath = os.path.join(raw_dir, filename) + + if not overwrite and os.path.exists(filepath): + raise FileExistsError(f"{filepath} already exists. Set overwrite=True to replace.") @@ - response = requests.get(url) + response = requests.get(url, timeout=timeout) response.raise_for_status() # Raise HTTPError for bad responses @@ # Write to data/raw/filename with open(filepath, "wb") as f: f.write(response.content) print(f"Raw CSV saved as {filepath}\n") + return filepath
110-119: Avoid fragile filename parsing and non-unique IDs; don’t usefileas a parameter nameUse basename parsing and a stable deterministic
match_id.-def add_new_columns_to_football_data(df_clean, file): +def add_new_columns_to_football_data(df_clean: pd.DataFrame, filename: str) -> pd.DataFrame: """Add new columns to df_clean.""" - df_clean["match_id"] = df_clean.index - - tmp = file.split("_") - df_clean["league"] = tmp[1] - df_clean["season"] = tmp[2].split(".")[0] - df_clean["source"] = tmp[0] + base = os.path.basename(filename) + parts = base.split("_") + df_clean["league"] = parts[1] if len(parts) > 1 else "" + df_clean["season"] = parts[2].split(".")[0] if len(parts) > 2 else "" + df_clean["source"] = parts[0] if parts else "" + + # Stable, cross-file unique match_id + import hashlib + def _mk_id(row) -> str: + key = f"{row.get('date','')}|{row.get('league','')}|{row.get('season','')}|{row.get('home_team','')}|{row.get('away_team','')}" + return hashlib.sha1(key.encode('utf-8')).hexdigest()[:16] + df_clean["match_id"] = df_clean.apply(_mk_id, axis=1) return df_clean
🧹 Nitpick comments (6)
data-collection/collectors/fbref_collector.py (3)
35-46: League slugging may be fragileReplacing spaces with '-' won’t cover punctuation/diacritics (e.g., “Liga Portugal”, “Ligue 1 Uber Eats”). Consider a mapping table for FBref slugs or accept an explicit
league_slugparameter to avoid 404s.
258-259: Static analysis warning on random is a false positive in this contextUse of
random.uniformhere is not for crypto; it’s fine. If your linter enforces this rule, suppress it or switch tosecrets/SystemRandomfor appeasement.
212-223: Minor: remove strict dependency on Selenium types in annotationsTo avoid import-time failures, keep annotations as strings or drop them. You’ve already gated imports; consider adding from future import annotations or using TYPE_CHECKING.
data-collection/collectors/football_data_collector.py (3)
21-45: Clarify API: renameleague_idtoleague_code, drop unusedleague_name, and fix examples
league_nameis unused; parameter name suggests numeric id, but path expects codes like “E0”, “SP1”. Examples for Champions League/La Liga show season code “2425” while text says 2023–2024 (should be “2324”).-def generate_football_data_url(league_id, league_name, season): +def generate_football_data_url(league_code: str, season: int) -> str: @@ - ex. Champions League 2023-2024 - https://www.football-data.co.uk/mmz4281/2425/E1.csv + ex. Champions League 2023-2024 + https://www.football-data.co.uk/mmz4281/2324/E1.csv @@ - ex. La Liga 2023-2024 - https://www.football-data.co.uk/mmz4281/2425/SP1.csv + ex. La Liga 2023-2024 + https://www.football-data.co.uk/mmz4281/2324/SP1.csv @@ - url = base_url + f"/{season_code}/{league_id}.csv" + url = base_url + f"/{season_code}/{league_code}.csv"
128-146: Don’t print DataFrame heads in library codeNoisy logs and potential perf hit. Return the processed DataFrame; leave logging to callers or add a verbose flag.
def reorder_df_football_data(df): # Get specified columns and reorder them df_processed = get_columns( @@ ) - print("After getting essential columns:") - print(df_processed.head()) - print("df_processed.columns:", df_processed.columns) return df_processed
47-61: DRY: centralize shared path and column helpersBoth collectors duplicate _here, get_data_raw_folder, get_data_processed_folder, get_columns, normalize_date. Extract into a shared module (e.g., data-collection/common/io_utils.py) and import from there to keep behavior consistent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
data-collection/collectors/fbref_collector.py(1 hunks)data-collection/collectors/football_data_collector.py(1 hunks)data-collection/scripts/collect_all.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- data-collection/scripts/collect_all.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
data-collection/collectors/fbref_collector.py (1)
data-collection/collectors/football_data_collector.py (5)
_here(47-48)get_data_raw_folder(51-54)get_data_processed_folder(57-60)normalize_date(148-157)get_columns(122-125)
🪛 GitHub Check: Codacy Static Code Analysis
data-collection/collectors/fbref_collector.py
[warning] 258-258: data-collection/collectors/fbref_collector.py#L258
Standard pseudo-random generators are not suitable for security/cryptographic purposes.
data-collection/collectors/football_data_collector.py
[warning] 80-80: data-collection/collectors/football_data_collector.py#L80
Call to requests without timeout
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
| def add_new_columns_to_fbref(df: pd.DataFrame, file) -> pd.DataFrame: | ||
| # Fill in match_id column | ||
| df["match_id"] = range(len(df)) | ||
| # Split score into home_score and away_score | ||
| df["home_score"], df["away_score"] = convert_score_to_home_score_and_away_score( | ||
| df["score"] | ||
| ) | ||
|
|
||
| tmp = file.split("_") | ||
| df["league"] = tmp[1] | ||
| df["season"] = tmp[2].split(".")[0] | ||
| df["source"] = tmp[0] | ||
|
|
||
| return df |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Use stable match_id and robust filename parsing; avoid file param name
file.split("_") is fragile, and index-based match_id isn’t globally unique.
-def add_new_columns_to_fbref(df: pd.DataFrame, file) -> pd.DataFrame:
- # Fill in match_id column
- df["match_id"] = range(len(df))
+def add_new_columns_to_fbref(df: pd.DataFrame, filename: str) -> pd.DataFrame:
+ # Robust filename parsing
+ base = os.path.basename(filename)
+ parts = base.split("_")
+ # Expecting: source_league_season.html or similar
+ league = parts[1] if len(parts) > 1 else ""
+ season = parts[2].split(".")[0] if len(parts) > 2 else ""
@@
- tmp = file.split("_")
- df["league"] = tmp[1]
- df["season"] = tmp[2].split(".")[0]
- df["source"] = tmp[0]
+ df["league"] = league
+ df["season"] = season
+ df["source"] = parts[0] if parts else ""
+
+ # Stable deterministic match_id
+ def _mk_id(row) -> str:
+ key = f"{row.get('date','')}|{row.get('league','')}|{row.get('season','')}|{row.get('home_team','')}|{row.get('away_team','')}"
+ return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
+ df["match_id"] = df.apply(_mk_id, axis=1)
return dfCommittable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 90 to 103, the
current code uses a non-unique index-based match_id and fragile filename parsing
by splitting on underscores. To fix this, generate a stable, unique match_id
based on consistent match attributes instead of the row index, and replace the
filename parsing with a more robust method such as using regex or pathlib to
extract league, season, and source reliably. Also, rename the parameter from
'file' to avoid shadowing built-in names.
| def create_csv(df: pd.DataFrame, filename: str) -> str: | ||
| """Save DataFrame to CSV under data/processed/filename""" | ||
|
|
||
| filepath = os.path.join(get_data_processed_folder(), filename) | ||
|
|
||
| # If output_path is not absolute, write under processed folder | ||
| if not os.path.isabs(filepath): | ||
| os.makedirs(os.path.dirname(filepath), exist_ok=True) | ||
| df.to_csv(filepath, index=False) | ||
| print(f"Processed CSV saved as '{filepath}'.") | ||
| return filepath | ||
|
|
There was a problem hiding this comment.
Directory may not exist; ensure creation before writing CSV
filepath is absolute, so the current conditional skips makedirs. Create the directory unconditionally.
def create_csv(df: pd.DataFrame, filename: str) -> str:
"""Save DataFrame to CSV under data/processed/filename"""
filepath = os.path.join(get_data_processed_folder(), filename)
- # If output_path is not absolute, write under processed folder
- if not os.path.isabs(filepath):
- os.makedirs(os.path.dirname(filepath), exist_ok=True)
+ os.makedirs(os.path.dirname(filepath), exist_ok=True)
df.to_csv(filepath, index=False)
print(f"Processed CSV saved as '{filepath}'.")
return filepath📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def create_csv(df: pd.DataFrame, filename: str) -> str: | |
| """Save DataFrame to CSV under data/processed/filename""" | |
| filepath = os.path.join(get_data_processed_folder(), filename) | |
| # If output_path is not absolute, write under processed folder | |
| if not os.path.isabs(filepath): | |
| os.makedirs(os.path.dirname(filepath), exist_ok=True) | |
| df.to_csv(filepath, index=False) | |
| print(f"Processed CSV saved as '{filepath}'.") | |
| return filepath | |
| def create_csv(df: pd.DataFrame, filename: str) -> str: | |
| """Save DataFrame to CSV under data/processed/filename""" | |
| filepath = os.path.join(get_data_processed_folder(), filename) | |
| os.makedirs(os.path.dirname(filepath), exist_ok=True) | |
| df.to_csv(filepath, index=False) | |
| print(f"Processed CSV saved as '{filepath}'.") | |
| return filepath |
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 175 to 186, the
directory creation is conditional on filepath not being absolute, but filepath
is always absolute, so the directory may not be created before writing the CSV.
Fix this by removing the condition and always calling os.makedirs on the
directory of filepath with exist_ok=True before saving the CSV.
| def _configure_chrome_options() -> Options: | ||
| """Create and return a configured Chrome Options object.""" | ||
| chrome_options = Options() | ||
| chrome_options.add_argument("--disable-blink-features=AutomationControlled") | ||
| chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"]) | ||
| chrome_options.add_experimental_option("useAutomationExtension", False) | ||
| chrome_options.add_argument("--disable-extensions") | ||
| chrome_options.add_argument("--no-sandbox") | ||
| chrome_options.add_argument("--disable-dev-shm-usage") | ||
| # chrome_options.add_argument('--headless') # Optional: enable headless mode | ||
|
|
||
| # User agent to mimic real browser | ||
| chrome_options.add_argument( | ||
| "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " | ||
| "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" | ||
| ) | ||
|
|
||
| # Set Chrome binary location for macOS if available | ||
| chrome_bin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" | ||
| if os.path.exists(chrome_bin): | ||
| chrome_options.binary_location = chrome_bin | ||
| return chrome_options | ||
|
|
||
|
|
||
| def _create_driver(chrome_options: Options): | ||
| """Initialize and return a Chrome WebDriver instance.""" | ||
| print("Starting Chrome WebDriver...") | ||
| return webdriver.Chrome(options=chrome_options) | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
Localize Selenium imports and avoid runtime type dependency in annotations
Import Selenium types inside functions and use string annotations to prevent ImportError at module import.
-def _configure_chrome_options() -> Options:
- """Create and return a configured Chrome Options object."""
- chrome_options = Options()
+def _configure_chrome_options() -> "Options":
+ """Create and return a configured Chrome Options object."""
+ from selenium.webdriver.chrome.options import Options
+ chrome_options = Options()
@@
-def _create_driver(chrome_options: Options):
- """Initialize and return a Chrome WebDriver instance."""
- print("Starting Chrome WebDriver...")
- return webdriver.Chrome(options=chrome_options)
+def _create_driver(chrome_options):
+ """Initialize and return a Chrome WebDriver instance."""
+ print("Starting Chrome WebDriver...")
+ from selenium import webdriver
+ return webdriver.Chrome(options=chrome_options)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _configure_chrome_options() -> Options: | |
| """Create and return a configured Chrome Options object.""" | |
| chrome_options = Options() | |
| chrome_options.add_argument("--disable-blink-features=AutomationControlled") | |
| chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"]) | |
| chrome_options.add_experimental_option("useAutomationExtension", False) | |
| chrome_options.add_argument("--disable-extensions") | |
| chrome_options.add_argument("--no-sandbox") | |
| chrome_options.add_argument("--disable-dev-shm-usage") | |
| # chrome_options.add_argument('--headless') # Optional: enable headless mode | |
| # User agent to mimic real browser | |
| chrome_options.add_argument( | |
| "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " | |
| "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" | |
| ) | |
| # Set Chrome binary location for macOS if available | |
| chrome_bin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" | |
| if os.path.exists(chrome_bin): | |
| chrome_options.binary_location = chrome_bin | |
| return chrome_options | |
| def _create_driver(chrome_options: Options): | |
| """Initialize and return a Chrome WebDriver instance.""" | |
| print("Starting Chrome WebDriver...") | |
| return webdriver.Chrome(options=chrome_options) | |
| def _configure_chrome_options() -> "Options": | |
| """Create and return a configured Chrome Options object.""" | |
| from selenium.webdriver.chrome.options import Options | |
| chrome_options = Options() | |
| chrome_options.add_argument("--disable-blink-features=AutomationControlled") | |
| chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"]) | |
| chrome_options.add_experimental_option("useAutomationExtension", False) | |
| chrome_options.add_argument("--disable-extensions") | |
| chrome_options.add_argument("--no-sandbox") | |
| chrome_options.add_argument("--disable-dev-shm-usage") | |
| # chrome_options.add_argument('--headless') # Optional: enable headless mode | |
| # User agent to mimic real browser | |
| chrome_options.add_argument( | |
| "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " | |
| "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" | |
| ) | |
| # Set Chrome binary location for macOS if available | |
| chrome_bin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" | |
| if os.path.exists(chrome_bin): | |
| chrome_options.binary_location = chrome_bin | |
| return chrome_options | |
| def _create_driver(chrome_options): | |
| """Initialize and return a Chrome WebDriver instance.""" | |
| print("Starting Chrome WebDriver...") | |
| from selenium import webdriver | |
| return webdriver.Chrome(options=chrome_options) |
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 188 to 216, the
Selenium imports and type annotations are currently at the module level, which
can cause ImportError during module import. Move all Selenium-related imports
inside the functions that use them, and change type annotations referencing
Selenium types to use string literals instead. This localizes dependencies and
avoids runtime import errors.
| def _wait_out_challenge(driver, timeout: int = 30) -> None: | ||
| """Wait for common anti-bot challenge pages to clear, within timeout.""" | ||
| print("Waiting for page to load and any challenges to complete...") | ||
| try: | ||
| WebDriverWait(driver, timeout).until( | ||
| lambda d: "Just a moment" not in d.title | ||
| and "Checking your browser" not in d.page_source | ||
| ) | ||
| print("Challenge completed or page loaded successfully.") | ||
| except TimeoutException: | ||
| print("Timeout waiting for challenge completion, but proceeding anyway...") | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
Guard against missing Selenium and avoid NameError for TimeoutException
Import WebDriverWait/TimeoutException inside the function.
def _wait_out_challenge(driver, timeout: int = 30) -> None:
"""Wait for common anti-bot challenge pages to clear, within timeout."""
print("Waiting for page to load and any challenges to complete...")
try:
+ from selenium.webdriver.support.ui import WebDriverWait
+ from selenium.common.exceptions import TimeoutException
WebDriverWait(driver, timeout).until(
lambda d: "Just a moment" not in d.title
and "Checking your browser" not in d.page_source
)
print("Challenge completed or page loaded successfully.")
except TimeoutException:
print("Timeout waiting for challenge completion, but proceeding anyway...")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _wait_out_challenge(driver, timeout: int = 30) -> None: | |
| """Wait for common anti-bot challenge pages to clear, within timeout.""" | |
| print("Waiting for page to load and any challenges to complete...") | |
| try: | |
| WebDriverWait(driver, timeout).until( | |
| lambda d: "Just a moment" not in d.title | |
| and "Checking your browser" not in d.page_source | |
| ) | |
| print("Challenge completed or page loaded successfully.") | |
| except TimeoutException: | |
| print("Timeout waiting for challenge completion, but proceeding anyway...") | |
| def _wait_out_challenge(driver, timeout: int = 30) -> None: | |
| """Wait for common anti-bot challenge pages to clear, within timeout.""" | |
| print("Waiting for page to load and any challenges to complete...") | |
| try: | |
| from selenium.webdriver.support.ui import WebDriverWait | |
| from selenium.common.exceptions import TimeoutException | |
| WebDriverWait(driver, timeout).until( | |
| lambda d: "Just a moment" not in d.title | |
| and "Checking your browser" not in d.page_source | |
| ) | |
| print("Challenge completed or page loaded successfully.") | |
| except TimeoutException: | |
| print("Timeout waiting for challenge completion, but proceeding anyway...") |
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 225 to 236, the
function _wait_out_challenge uses WebDriverWait and TimeoutException without
importing them inside the function, which can cause NameError if Selenium is
missing. To fix this, move the imports of WebDriverWait and TimeoutException
inside the _wait_out_challenge function to guard against missing Selenium and
avoid NameError.
| def download_with_selenium(url): | ||
| """Fetch dynamic page content using Selenium and return HTML (<=50 lines).""" | ||
| driver = None | ||
| try: | ||
| options = _configure_chrome_options() | ||
| driver = _create_driver(options) | ||
| _remove_webdriver_flag(driver) | ||
|
|
||
| print(f"Navigating to {url} ...") | ||
| driver.get(url) | ||
|
|
||
| _wait_out_challenge(driver, timeout=30) | ||
| time.sleep(random.uniform(2, 5)) # small buffer to ensure full render | ||
|
|
||
| html_content = _get_html_or_raise(driver) | ||
| print("Successfully retrieved page.") | ||
| return html_content | ||
|
|
||
| except WebDriverException as e: | ||
| print(f"WebDriver error: {e}") | ||
| print("\nTroubleshooting:") | ||
| print("1. Make sure Chrome browser is installed") | ||
| print("2. Install ChromeDriver: brew install --cask chromedriver") | ||
| print("3. Or use: pip install webdriver-manager") | ||
| raise ValueError(f"WebDriver error: {e}") | ||
|
|
||
| except Exception as e: | ||
| print(f"Unexpected error: {e}") | ||
| raise ValueError(f"Failed to scrape with Selenium: {e}") | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add true “requests” fallback and avoid hard Selenium dependency
If Selenium isn’t available, gracefully fall back to a plain GET with headers and timeout. Also localize Selenium imports to this function and helpers.
def download_with_selenium(url):
"""Fetch dynamic page content using Selenium and return HTML (<=50 lines)."""
driver = None
try:
- options = _configure_chrome_options()
- driver = _create_driver(options)
- _remove_webdriver_flag(driver)
+ # Try Selenium path first; import lazily to avoid hard dependency.
+ try:
+ from selenium.common.exceptions import WebDriverException
+ except ImportError:
+ WebDriverException = Exception # sentinel for isinstance checks
+
+ options = _configure_chrome_options()
+ driver = _create_driver(options)
+ _remove_webdriver_flag(driver)
@@
- _wait_out_challenge(driver, timeout=30)
+ _wait_out_challenge(driver, timeout=30)
time.sleep(random.uniform(2, 5)) # small buffer to ensure full render
@@
- html_content = _get_html_or_raise(driver)
+ html_content = _get_html_or_raise(driver)
print("Successfully retrieved page.")
return html_content
-
- except WebDriverException as e:
- print(f"WebDriver error: {e}")
- print("\nTroubleshooting:")
- print("1. Make sure Chrome browser is installed")
- print("2. Install ChromeDriver: brew install --cask chromedriver")
- print("3. Or use: pip install webdriver-manager")
- raise ValueError(f"WebDriver error: {e}")
+ except Exception as e:
+ # If Selenium is not installed or fails, try a lightweight fallback.
+ try:
+ import requests
+ print("Selenium unavailable/failed. Falling back to requests (static HTML only)...")
+ resp = requests.get(
+ url,
+ timeout=30,
+ headers={
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+ "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+ },
+ )
+ resp.raise_for_status()
+ return resp.text
+ except Exception as re:
+ print(f"Fallback fetch failed: {re}")
+ raise ValueError(f"Failed to fetch page via Selenium and requests: {e}") from re
@@
finally:
if driver:
print("Closing browser...")
driver.quit()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def download_with_selenium(url): | |
| """Fetch dynamic page content using Selenium and return HTML (<=50 lines).""" | |
| driver = None | |
| try: | |
| options = _configure_chrome_options() | |
| driver = _create_driver(options) | |
| _remove_webdriver_flag(driver) | |
| print(f"Navigating to {url} ...") | |
| driver.get(url) | |
| _wait_out_challenge(driver, timeout=30) | |
| time.sleep(random.uniform(2, 5)) # small buffer to ensure full render | |
| html_content = _get_html_or_raise(driver) | |
| print("Successfully retrieved page.") | |
| return html_content | |
| except WebDriverException as e: | |
| print(f"WebDriver error: {e}") | |
| print("\nTroubleshooting:") | |
| print("1. Make sure Chrome browser is installed") | |
| print("2. Install ChromeDriver: brew install --cask chromedriver") | |
| print("3. Or use: pip install webdriver-manager") | |
| raise ValueError(f"WebDriver error: {e}") | |
| except Exception as e: | |
| print(f"Unexpected error: {e}") | |
| raise ValueError(f"Failed to scrape with Selenium: {e}") | |
| def download_with_selenium(url): | |
| """Fetch dynamic page content using Selenium and return HTML (<=50 lines).""" | |
| driver = None | |
| try: | |
| # Try Selenium path first; import lazily to avoid hard dependency. | |
| try: | |
| from selenium.common.exceptions import WebDriverException | |
| except ImportError: | |
| WebDriverException = Exception # sentinel for isinstance checks | |
| options = _configure_chrome_options() | |
| driver = _create_driver(options) | |
| _remove_webdriver_flag(driver) | |
| print(f"Navigating to {url} ...") | |
| driver.get(url) | |
| _wait_out_challenge(driver, timeout=30) | |
| time.sleep(random.uniform(2, 5)) # small buffer to ensure full render | |
| html_content = _get_html_or_raise(driver) | |
| print("Successfully retrieved page.") | |
| return html_content | |
| except Exception as e: | |
| # If Selenium is not installed or fails, try a lightweight fallback. | |
| try: | |
| import requests | |
| print("Selenium unavailable/failed. Falling back to requests (static HTML only)...") | |
| resp = requests.get( | |
| url, | |
| timeout=30, | |
| headers={ | |
| "User-Agent": ( | |
| "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " | |
| "AppleWebKit/537.36 (KHTML, like Gecko) " | |
| "Chrome/120.0.0.0 Safari/537.36" | |
| ) | |
| }, | |
| ) | |
| resp.raise_for_status() | |
| return resp.text | |
| except Exception as re: | |
| print(f"Fallback fetch failed: {re}") | |
| raise ValueError( | |
| f"Failed to fetch page via Selenium and requests: {e}" | |
| ) from re | |
| finally: | |
| if driver: | |
| print("Closing browser...") | |
| driver.quit() |
🧰 Tools
🪛 GitHub Check: Codacy Static Code Analysis
[warning] 258-258: data-collection/collectors/fbref_collector.py#L258
Standard pseudo-random generators are not suitable for security/cryptographic purposes.
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 246 to 275, the
download_with_selenium function currently depends directly on Selenium without
fallback. Modify the function to localize Selenium imports inside it and its
helpers, then add a fallback to perform a plain HTTP GET request with
appropriate headers and timeout if Selenium is not installed or fails to run.
This avoids a hard dependency on Selenium and improves robustness.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (6)
data-collection/collectors/football_data_collector.py (2)
110-118: Fragile filename parsing & shadowing built-infile
Same issues flagged earlier: rename param, useos.path.basename, generate stablematch_id.
63-88: Directory creation,overwriteflag still missing
download_csv():
- Fails if
data/rawdoesn’t exist (open()will raise).- Docstring mentions
overwritebut the flag is not implemented.- No guard against silently clobbering an existing file.
+raw_dir = get_data_raw_folder() +os.makedirs(raw_dir, exist_ok=True) +filepath = os.path.join(raw_dir, filename) + +if not overwrite and os.path.exists(filepath): + raise FileExistsError(f"{filepath} exists. Set overwrite=True to replace.")data-collection/collectors/fbref_collector.py (4)
12-17: Hard dependency on Selenium contradicts docstringTop-level imports (
selenium.*) will crash on environments without Selenium despite the “avoids hard dependency” claim. Move all Selenium imports inside the functions that require them and guard withtry/except.
58-88:extract_columns()can crash & ignoresdata_fields•
table.find("tbody")may beNone→AttributeError.
• When no rows parsed,df_bsis undefined →UnboundLocalError.
•data_fieldsparam is never used.
Harden null-checks, always return a DataFrame, or drop the unused param.
90-103: Non-uniquematch_id, fragile parsing,fileshadowingIssues previously noted remain: use deterministic hash for
match_id, robustos.path.basenameparsing, and renamefileparam.
175-185: Directory may not exist
create_csv()only callsos.makedirswhen the path is not absolute, butfilepathis always absolute—directory creation is skipped. Remove the condition.
🧹 Nitpick comments (1)
data-collection/collectors/football_data_collector.py (1)
21-44: Remove or use theleague_nameparameter
generate_football_data_url()ignores itsleague_nameargument, which is misleading and invites bugs. Either incorporate it (e.g., slugify for sanity checks) or drop the parameter.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
data-collection/collectors/fbref_collector.py(1 hunks)data-collection/collectors/football_data_collector.py(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
data-collection/collectors/football_data_collector.py (1)
data-collection/collectors/fbref_collector.py (5)
_here(19-20)get_data_raw_folder(23-26)get_data_processed_folder(29-32)get_columns(169-172)normalize_date(138-147)
data-collection/collectors/fbref_collector.py (1)
data-collection/collectors/football_data_collector.py (5)
_here(47-48)get_data_raw_folder(51-54)get_data_processed_folder(57-60)normalize_date(148-157)get_columns(122-125)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
| def download_with_selenium(url): | ||
| """Fetch dynamic page content using Selenium and return HTML (<=50 lines).""" | ||
| driver = None | ||
| try: | ||
| options = _configure_chrome_options() | ||
| driver = _create_driver(options) | ||
| _remove_webdriver_flag(driver) | ||
|
|
||
| print(f"Navigating to {url} ...") | ||
| driver.get(url) | ||
|
|
||
| _wait_out_challenge(driver, timeout=30) | ||
|
|
||
| # Using SystemRandom() is a secure random number generator. Itis a best practice for security-sensitive applications. | ||
| time.sleep(SystemRandom().uniform(2, 5)) # small buffer to ensure full render | ||
|
|
||
| html_content = _get_html_or_raise(driver) | ||
| print("Successfully retrieved page.") | ||
| return html_content | ||
|
|
||
| except WebDriverException as e: | ||
| print(f"WebDriver error: {e}") | ||
| print("\nTroubleshooting:") | ||
| print("1. Make sure Chrome browser is installed") | ||
| print("2. Install ChromeDriver: brew install --cask chromedriver") | ||
| print("3. Or use: pip install webdriver-manager") | ||
| raise ValueError(f"WebDriver error: {e}") | ||
|
|
||
| except Exception as e: | ||
| print(f"Unexpected error: {e}") | ||
| raise ValueError(f"Failed to scrape with Selenium: {e}") | ||
|
|
There was a problem hiding this comment.
No fallback despite promise
download_with_selenium() claims a “requests-based fallback” yet catches WebDriverException and merely re-raises. Implement the fallback or revise the docstring.
🤖 Prompt for AI Agents
In data-collection/collectors/fbref_collector.py around lines 246 to 277, the
function download_with_selenium() claims to have a requests-based fallback in
its docstring but currently does not implement any fallback logic and just
re-raises exceptions. To fix this, either implement the fallback by catching
WebDriverException and then attempting to fetch the page content using requests
as a backup, or update the docstring to remove the mention of a requests-based
fallback to accurately reflect the current behavior.
| def save_df_to_csv(df, raw_filename): | ||
| """Save DataFrame to CSV""" | ||
| processed_filename = raw_filename.split(".")[0] + "_processed.csv" | ||
| filepath = os.path.join(get_data_processed_folder(), processed_filename) | ||
| df.to_csv(filepath, index=False) | ||
| print(f"Processed CSV saved as '{filepath}'.") |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Ensure processed directory exists before write
save_df_to_csv() writes to data/processed but never os.makedirs(..., exist_ok=True). Add it to avoid FileNotFoundError.
🤖 Prompt for AI Agents
In data-collection/collectors/football_data_collector.py around lines 160 to
165, the function save_df_to_csv writes the DataFrame to a file in the processed
data directory but does not ensure the directory exists, which can cause a
FileNotFoundError. Before calling df.to_csv, add a call to os.makedirs with the
processed directory path and exist_ok=True to create the directory if it does
not exist.
Description
Goal:
Build essential historical dataset for top 3 football leagues using free data sources below, , covering 5 recent seasons with focus on MVP delivery.
Data sources:
football-data.co.uk
fbref.com
Step completed:
Steps to complete:
3. Merge the data. ex. merge same league of the same season from different resources
4. Insert into Supabase
Type of Change
Changes Made
Testing
How has this been tested?
Test Configuration:
Platform Impact
Which parts of the system are affected?
Breaking Changes
Does this PR introduce any breaking changes?
If yes, describe the breaking changes and migration path:
Checklist
Before requesting a review, please ensure:
Screenshots (if applicable)
Add screenshots to help explain your changes.
Related Issues
Closes #(issue_number)
Related to #(issue_number)
Additional Notes
Any additional information that reviewers should know.
Summary by CodeRabbit
New Features
Tests
Configuration
.gitignoreto exclude HTML and CSV files.Style
Documentation