Missing clubs in clubs.csv and discrepancy in competitions.json files #321

@LucaPolese

I have been analyzing UEFA Champions League (CL) football data by filtering matches from the 2024 season onwards. My objective is to identify all clubs that participated in these CL matches by collecting unique club IDs from both home and away teams.

During this analysis, I observed that certain clubs are missing from the clubs.csv file. For example, all Swiss clubs appear to be absent from the dataset: no row in clubs.csv seems to carry the domestic_competition_id of Switzerland's main competition.

Code Implementation

Below is the Python code I have been using for this analysis:

import kagglehub
import pandas as pd

# Download the datasets
df_games = pd.read_csv(kagglehub.dataset_download("davidcariboo/player-scores", path='games.csv'))
df_clubs = pd.read_csv(kagglehub.dataset_download("davidcariboo/player-scores", path='clubs.csv'))

# Filter for Champions League games from 2024 season onwards
cl_games = df_games[(df_games['competition_id'] == 'CL') & (df_games['season'] >= 2024)]

# Get all club IDs (both home and away) that participated in CL games
home_clubs = cl_games['home_club_id'].unique().tolist()
away_clubs = cl_games['away_club_id'].unique().tolist()

# Combine the two lists and get unique values using pandas
all_clubs = home_clubs + away_clubs
all_cl_clubs = pd.Series(all_clubs).unique().tolist()

# Check if all CL club IDs exist in the clubs dataframe
missing_club_ids = [club_id for club_id in all_cl_clubs if club_id not in df_clubs['club_id'].values]
print(f"Number of club IDs not found in clubs dataframe: {len(missing_club_ids)}")

# Try to get names for missing clubs from the games dataframe
missing_clubs_info = []

for club_id in missing_club_ids:
    # Check if it appears as home club
    home_matches = cl_games[cl_games['home_club_id'] == club_id]
    away_matches = cl_games[cl_games['away_club_id'] == club_id]

    if not home_matches.empty:
        club_name = home_matches['home_club_name'].iloc[0]
    elif not away_matches.empty:
        club_name = away_matches['away_club_name'].iloc[0]
    else:
        club_name = "Unknown"

    missing_clubs_info.append({
        'club_id': club_id,
        'club_name': club_name,
        'home_matches': len(home_matches),
        'away_matches': len(away_matches)
    })

# Print information about the missing clubs
if missing_clubs_info:
    print("\nMissing clubs information:")
    for club in missing_clubs_info:
        print(f"ID: {club['club_id']}, Name: {club['club_name']}, Home matches: {club['home_matches']}, Away matches: {club['away_matches']}")
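
The same check can also be written without the explicit loop, which may be easier to rerun on other competitions. The following is only a sketch: the function name `find_missing_clubs` and the toy DataFrames are hypothetical, and it assumes the column names used in the code above (`home_club_id`, `away_club_id`, `home_club_name`, `away_club_name`, `club_id`).

```python
import pandas as pd


def find_missing_clubs(cl_games: pd.DataFrame, df_clubs: pd.DataFrame) -> pd.DataFrame:
    """Summarise clubs that appear in the games table but not in the clubs table."""
    # Reshape home and away sides into one long table with a per-row match counter
    home = cl_games[['home_club_id', 'home_club_name']].rename(
        columns={'home_club_id': 'club_id', 'home_club_name': 'club_name'}
    ).assign(home_matches=1, away_matches=0)
    away = cl_games[['away_club_id', 'away_club_name']].rename(
        columns={'away_club_id': 'club_id', 'away_club_name': 'club_name'}
    ).assign(home_matches=0, away_matches=1)
    appearances = pd.concat([home, away], ignore_index=True)

    # Keep only appearances whose club_id has no row in the clubs table
    missing = appearances[~appearances['club_id'].isin(df_clubs['club_id'])]

    # One row per missing club: first seen name plus home/away match counts
    return missing.groupby('club_id', as_index=False).agg(
        club_name=('club_name', 'first'),
        home_matches=('home_matches', 'sum'),
        away_matches=('away_matches', 'sum'),
    )


# Toy data (hypothetical, not taken from the real dataset)
toy_games = pd.DataFrame({
    'home_club_id': [1, 2, 3],
    'away_club_id': [2, 3, 1],
    'home_club_name': ['Club A', 'Club B', 'Club C'],
    'away_club_name': ['Club B', 'Club C', 'Club A'],
})
toy_clubs = pd.DataFrame({'club_id': [1, 2]})

missing_summary = find_missing_clubs(toy_games, toy_clubs)
print(missing_summary)
```

On the real data, the call would be `find_missing_clubs(cl_games, df_clubs)`, reusing the two DataFrames defined earlier.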

Data Discrepancy in Competition Files

To further investigate the issue, I examined the two main files containing competition data. I found a discrepancy between the number of entries in these files:
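
For reference, a discrepancy like this can be narrowed down with a set difference on the competition IDs. The sketch below is an illustration under stated assumptions, not the exact check I ran: it assumes the second file is a CSV counterpart of competitions.json, that the JSON file holds a list of objects, and that both use a `competition_id` field. The helper name `competition_id_diff` and the sample records are hypothetical.

```python
import csv
import io
import json


def competition_id_diff(csv_text: str, json_text: str, key: str = "competition_id"):
    """Return (ids only in the JSON file, ids only in the CSV file), both sorted."""
    # Assumption: the CSV has a header row containing `key`
    csv_ids = {row[key] for row in csv.DictReader(io.StringIO(csv_text))}
    # Assumption: the JSON document is a list of objects, each with `key`
    json_ids = {record[key] for record in json.loads(json_text)}
    return sorted(json_ids - csv_ids), sorted(csv_ids - json_ids)


# Hypothetical sample contents; "AAA" and "BBB" are made-up IDs
sample_csv = "competition_id,name\nCL,Champions League\nAAA,Example League\n"
sample_json = json.dumps([
    {"competition_id": "CL"},
    {"competition_id": "AAA"},
    {"competition_id": "BBB"},
])

only_in_json, only_in_csv = competition_id_diff(sample_csv, sample_json)
print("Only in JSON:", only_in_json)
print("Only in CSV:", only_in_csv)
```

With the real files, the two text arguments would be the contents read from disk. If the JSON file turns out to be line-delimited rather than a single list, the parsing step would need to change accordingly.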
