Description
I have been analyzing UEFA Champions League (CL) football data by filtering matches from the 2024 season onwards. My objective is to identify all clubs that participated in these CL matches by collecting unique club IDs from both home and away teams.
During this analysis, I observed that certain clubs are missing from the clubs.csv file. As an example, all Swiss clubs appear to be absent from the dataset: the domestic_competition_id of Switzerland's main competition is missing.
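To check directly whether a whole competition is absent from clubs.csv, one option is to compare the distinct `domestic_competition_id` values against a list of expected ones. The sketch below uses a toy DataFrame rather than the real file; the IDs (including the placeholder `XX1`) and the helper name `competitions_absent` are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Toy stand-in for clubs.csv; these IDs are illustrative, not real data.
df_clubs_sample = pd.DataFrame({
    "club_id": [1, 2, 3],
    "domestic_competition_id": ["GB1", "ES1", "GB1"],
})

def competitions_absent(df_clubs, expected_ids):
    """Return the expected competition IDs that no club row refers to."""
    present = set(df_clubs["domestic_competition_id"].dropna().unique())
    return sorted(set(expected_ids) - present)

# "XX1" is a placeholder for the competition suspected to be missing.
absent = competitions_absent(df_clubs_sample, ["GB1", "ES1", "XX1"])
print(absent)
```

Running the same helper against the real `df_clubs` with the actual ID of the Swiss top division would show whether that competition has no clubs at all or only some clubs are missing.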
Code Implementation
Below is the Python code I have been using for this analysis:
    import kagglehub
    import pandas as pd

    # Download the datasets
    df_games = pd.read_csv(kagglehub.dataset_download("davidcariboo/player-scores", path='games.csv'))
    df_clubs = pd.read_csv(kagglehub.dataset_download("davidcariboo/player-scores", path='clubs.csv'))

    # Filter for Champions League games from 2024 season onwards
    cl_games = df_games[(df_games['competition_id'] == 'CL') & (df_games['season'] >= 2024)]

    # Get all club IDs (both home and away) that participated in CL games
    home_clubs = cl_games['home_club_id'].unique().tolist()
    away_clubs = cl_games['away_club_id'].unique().tolist()

    # Combine the two lists and get unique values using pandas
    all_clubs = home_clubs + away_clubs
    all_cl_clubs = pd.Series(all_clubs).unique().tolist()

    # Check if all CL club IDs exist in the clubs dataframe
    missing_club_ids = [club_id for club_id in all_cl_clubs if club_id not in df_clubs['club_id'].values]
    print(f"Number of club IDs not found in clubs dataframe: {len(missing_club_ids)}")

    # Try to get names for missing clubs from the games dataframe
    missing_clubs_info = []
    for club_id in missing_club_ids:
        # Look the club up as both home and away side
        home_matches = cl_games[cl_games['home_club_id'] == club_id]
        away_matches = cl_games[cl_games['away_club_id'] == club_id]
        if not home_matches.empty:
            club_name = home_matches['home_club_name'].iloc[0]
        elif not away_matches.empty:
            club_name = away_matches['away_club_name'].iloc[0]
        else:
            club_name = "Unknown"
        missing_clubs_info.append({
            'club_id': club_id,
            'club_name': club_name,
            'home_matches': len(home_matches),
            'away_matches': len(away_matches)
        })

    # Print information about the missing clubs
    if missing_clubs_info:
        print("\nMissing clubs information:")
        for club in missing_clubs_info:
            print(f"ID: {club['club_id']}, Name: {club['club_name']}, Home matches: {club['home_matches']}, Away matches: {club['away_matches']}")

Data Discrepancy in Competition Files
To further investigate the issue, I examined the two main files containing competition data. I found a discrepancy between the number of entries in these files:
- The file transfermarkt-datasets/data/competitions.json has 44 entries.
- The file transfermarkt-scraper/samples/competitions.json has 62 entries.
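To see which competitions account for the difference, the two files could be compared by ID. The following is only a sketch: it assumes each file holds one JSON object per line and that every record carries a `competition_id` key, neither of which I have verified for these files. The helpers `load_json_lines` and `diff_ids` are hypothetical names, and the sample records are made up for illustration.

```python
import json

def load_json_lines(path):
    """Read one JSON object per line (assumed file layout, not verified)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def diff_ids(datasets_records, scraper_records, key="competition_id"):
    """Return (IDs only in the scraper file, IDs only in the datasets file)."""
    datasets_ids = {rec[key] for rec in datasets_records}
    scraper_ids = {rec[key] for rec in scraper_records}
    return sorted(scraper_ids - datasets_ids), sorted(datasets_ids - scraper_ids)

# Made-up records standing in for the two competitions.json files.
datasets_sample = [{"competition_id": "A"}, {"competition_id": "B"}]
scraper_sample = [
    {"competition_id": "A"},
    {"competition_id": "B"},
    {"competition_id": "C"},
]

only_in_scraper, only_in_datasets = diff_ids(datasets_sample, scraper_sample)
print(only_in_scraper, only_in_datasets)
```

With the real files, the records would come from `load_json_lines(...)` on each path, with the key name adjusted to whatever field the files actually use.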