[SoFIFA] Read_player_ratings return only 1 record #889

@mttam

Description

Describe the bug
The read_player_ratings method returns only the last player. The cause appears to be incorrect indentation: the XPath extraction and the ratings.append() call sit outside the player loop, so only the last player's scores are processed and appended.
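To illustrate the kind of indentation mistake described above, here is a minimal sketch that does not use soccerdata at all. The function names and player labels are made up for the demonstration, and the buggy variant is based on the description in this report rather than on the library's actual source.

```python
def collect_buggy(players):
    """Mimics the reported bug: the append sits outside the loop."""
    ratings = []
    for player in players:
        scores = {"player": player}
    # Wrong indentation: this runs once, after the loop has finished,
    # so only the scores of the last player are kept.
    ratings.append(scores)
    return ratings


def collect_fixed(players):
    """The append is inside the loop, so every player yields one record."""
    ratings = []
    for player in players:
        scores = {"player": player}
        ratings.append(scores)
    return ratings


players = ["Player A", "Player B", "Player C"]  # hypothetical labels
print(len(collect_buggy(players)))  # 1
print(len(collect_fixed(players)))  # 3
```

The buggy variant returns a single record for the last player, which matches the one-row output shown in this report.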

Python Version
Python 3.11.4

Affected scrapers
This affects the following scrapers:

  • SoFIFA

Code example

import soccerdata as sd
sofifa = sd.SoFIFA(leagues="ENG-Premier League", versions="latest")
print(sofifa.read_player_ratings(team="Arsenal"))

Error message

No error message is raised.

Error output

                  fifa_edition        update overallrating  ... gk_kicking gk_positioning gk_reflexes
player                                                      ...
Takehiro Tomiyasu        FC 25  Jul 17, 2025            78  ...          6              5          11

[1 rows x 38 columns]

Additional context
I fixed the problem with GPT-5 mini, but I'm not sure whether this is the correct approach (or whether it is an actual issue), because I only downloaded the collection.

Code fix sofifa.py

def read_player_ratings(
        self,
        team: Optional[Union[str, list[str]]] = None,
        player: Optional[Union[int, list[int]]] = None,
    ) -> pd.DataFrame:
        """Retrieve ratings for players.

        Parameters
        ----------
        team: str or list of str, optional
            Team(s) to retrieve. If None, will retrieve all teams.
        player: int or list of int, optional
            Player(s) to retrieve. If None, will retrieve all players.

        Returns
        -------
        pd.DataFrame
        """
        # build url
        urlmask = SO_FIFA_API + "/player/{}/?r={}&set=true"
        filemask = "player_{}_{}.html"

        # get player IDs
        if player is None:
            players = self.read_players(team=team).index.unique()
        elif isinstance(player, int):
            players = [player]
        else:
            players = player

        # prepare empty data frame
        ratings = []

        # define labels to use for score extraction from player profile pages
        score_labels = [
            "Overall rating",
            "Potential",
            "Crossing",
            "Finishing",
            "Heading accuracy",
            "Short passing",
            "Volleys",
            "Dribbling",
            "Curve",
            "FK Accuracy",
            "Long passing",
            "Ball control",
            "Acceleration",
            "Sprint speed",
            "Agility",
            "Reactions",
            "Balance",
            "Shot power",
            "Jumping",
            "Stamina",
            "Strength",
            "Long shots",
            "Aggression",
            "Interceptions",
            "Positioning",
            "Vision",
            "Penalties",
            "Composure",
            "Defensive awareness",
            "Standing tackle",
            "Sliding tackle",
            "GK Diving",
            "GK Handling",
            "GK Kicking",
            "GK Positioning",
            "GK Reflexes",
        ]

        iterator = list(product(self.versions.iterrows(), players))
        for i, ((version_id, version), player) in enumerate(iterator):
            logger.info(
                "[%s/%s] Retrieving ratings for player with ID %s in %s edition",
                i + 1,
                len(iterator),
                player,
                version["update"],
            )

            # read html page (player overview)
            filepath = self.data_dir / filemask.format(player, version_id)
            url = urlmask.format(player, version_id)
            reader = self.get(url, filepath)

            # extract scores one-by-one
            tree = html.parse(reader, parser=html.HTMLParser(encoding="utf8"))

            # get player name safely
            node_player_name_nodes = tree.xpath("//div[contains(@class, 'profile')]/h1")
            if node_player_name_nodes:
                node_player_name = node_player_name_nodes[0]
                # Extract what is before <br>
                before_br = node_player_name.xpath("string(./text()[1])").strip()
                # Extract what is after <br>
                after_br = node_player_name.xpath(
                    "string(./br/following-sibling::text()[1])"
                ).strip()
                player_name = before_br if before_br else after_br
            else:
                player_name = None

            scores = {"player": player_name, **version.to_dict()}

            # Try each XPath until one returns a result
            for s in score_labels:
                value = None
                xpaths = [
                    f"//p[.//text()[contains(.,'{s}')]]/span/em",
                    f"//div[contains(.,'{s}')]/em",
                    f"//li[not(self::script)][.//text()[contains(.,'{s}')]]/em",
                ]
                for xpath in xpaths:
                    nodes = tree.xpath(xpath)
                    if nodes:  # If at least one match is found
                        text = nodes[0].text
                        value = text.strip() if text is not None else None
                        break  # Stop checking other XPaths once we find a valid value

                scores[s] = value  # will be None if not found

            ratings.append(scores)
        # return data frame
        return pd.DataFrame(ratings).pipe(standardize_colnames).set_index(["player"]).sort_index()
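The fallback lookup in the proposed fix (try several page layouts in order and take the first hit, otherwise None) can be sketched in isolation. This is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, because ElementTree's XPath subset has no contains(), and the markup below is invented for the example, not copied from SoFIFA. The helper name find_score is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup; real SoFIFA pages are not reproduced here.
PAGE = """
<html><body>
  <p><span><em>78</em></span> Overall rating</p>
  <ul><li><em>81</em> Potential</li></ul>
</body></html>
"""


def find_score(root, label):
    """Return the first score found for `label`, or None if no layout matches."""
    # Candidate layouts, tried in order: (container tag, path to the <em> value).
    layouts = [("p", "span/em"), ("div", "em"), ("li", "em")]
    for tag, em_path in layouts:
        for container in root.iter(tag):
            # Match the label against all text nested in the container.
            if label in "".join(container.itertext()):
                em = container.find(em_path)
                if em is not None:
                    return em.text.strip() if em.text is not None else None
    return None


root = ET.fromstring(PAGE.strip())
for label in ["Overall rating", "Potential", "Crossing"]:
    print(label, find_score(root, label))
```

A label that matches none of the layouts yields None, which is how the proposed fix leaves a missing score empty instead of raising.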

Contributor Action Plan

  • I’m unsure how to fix this, but I'm willing to work on it with guidance.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
