I'll be collecting data from FBref and building a database of players in the Women's Super League (WSL), starting from the 2021-22 season and running up to the current one. I decided to build this database so that I have recent data to use alongside the free WSL data offered by StatsBomb. The free data ends at the 2020-21 season, so this database starts from the season after.
I'll collect only player-level data and no team statistics such as total points, goal difference, or final standings. To get team-level performance statistics, I'll aggregate the player data by team.
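As a rough illustration of that aggregation, here is a minimal pandas sketch. The team names, player names, column names (`goals`, `shots`) and values are made up for the example and are not FBref's actual headers or data. It also assumes the table keeps raw counting columns, since per-90 rates cannot simply be summed across players.

```python
import pandas as pd

# Hypothetical player-level rows; names and numbers are invented for illustration.
players = pd.DataFrame(
    {
        "season": ["2023-24", "2023-24", "2023-24", "2023-24"],
        "team": ["Team A", "Team A", "Team B", "Team B"],
        "player": ["Player 1", "Player 2", "Player 3", "Player 4"],
        "goals": [12, 5, 9, 3],
        "shots": [60, 30, 45, 20],
    }
)

# Sum the player totals within each season and team to get team-level figures.
team_totals = players.groupby(["season", "team"], as_index=False)[
    ["goals", "shots"]
].sum()

print(team_totals)
```

Grouping on both `season` and `team` keeps a club's seasons separate, which matters once several seasons sit in the same table.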
Here's a link to an example page and the kind of data I'll be collecting from it: Manchester City Women 2023-24
Tables:
- Goalkeeping and advanced goalkeeping, which I'll combine
- Shooting
- Passing
- Pass Types
- Goal and shot creation
- Possession
- Some columns from the playing time table
- Defensive actions
- Miscellaneous stats
I'll collect the per-90-minutes data, as it provides a fairer comparison between players who have played different numbers of minutes.
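To show why per-90 figures are fairer, here is a small sketch of the usual calculation: a season total scaled to a rate per 90 minutes played. The function name `per_90` and the numbers are hypothetical and only illustrate the idea; this is not a description of how FBref computes its own columns.

```python
def per_90(total: float, minutes: float) -> float:
    """Scale a season total to a rate per 90 minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return total * 90 / minutes


# Invented example: a regular starter and a substitute.
starter = per_90(total=10, minutes=1800)
substitute = per_90(total=4, minutes=450)

print(starter, substitute)
```

In this made-up case the substitute has fewer goals in total but a higher rate per 90 minutes, which raw totals would hide.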
I'll collect the data by scraping the webpages for each team across their seasons in the league. As the performance data are already in tables, I'll combine each table with its counterparts across all teams and seasons. For example, the final shooting table will hold the shooting data for all players in all teams across all seasons. This way, it's easy to query and see a player's performance over time.
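A minimal sketch of that combining step, assuming each scraped table is already a pandas DataFrame. The `scraped` dictionary, the `shots_per90` column, and all names and values are hypothetical stand-ins for the real scraped tables.

```python
import pandas as pd

# Hypothetical scraped shooting tables, keyed by (team, season).
scraped = {
    ("Team A", "2022-23"): pd.DataFrame(
        {"player": ["Player 1", "Player 2"], "shots_per90": [3.1, 1.2]}
    ),
    ("Team A", "2023-24"): pd.DataFrame(
        {"player": ["Player 1"], "shots_per90": [3.4]}
    ),
    ("Team B", "2023-24"): pd.DataFrame(
        {"player": ["Player 3"], "shots_per90": [2.0]}
    ),
}

# Tag each table with its team and season, then stack them into one table.
frames = [
    table.assign(team=team, season=season)
    for (team, season), table in scraped.items()
]
shooting = pd.concat(frames, ignore_index=True)

# One player's shooting record over time.
history = shooting[shooting["player"] == "Player 1"].sort_values("season")
print(history)
```

Adding the `team` and `season` columns before concatenating is what makes the combined table queryable by player over time.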
I'll first build the database with data up to last season (2023-24) and, once it's fully running, add data for the current season. Because the current season is still being played, the database will need monthly updates, so I'll take time to model the collection and update process.
Another reason is to make the active season a sort of breakpoint for the database, so that any update affects this season only.
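One way to picture the breakpoint idea is an update that replaces only the active season's rows and leaves earlier seasons untouched. The sketch below is an assumption about how such an update could look in pandas, not the project's actual update code; `refresh_season` and the sample data are hypothetical.

```python
import pandas as pd


def refresh_season(
    existing: pd.DataFrame, fresh: pd.DataFrame, season: str
) -> pd.DataFrame:
    """Replace the rows for one season with freshly scraped rows."""
    kept = existing[existing["season"] != season]        # completed seasons stay as-is
    incoming = fresh[fresh["season"] == season]          # ignore rows from other seasons
    return pd.concat([kept, incoming], ignore_index=True)


# Invented example: last month's snapshot and a newer scrape of the active season.
existing = pd.DataFrame(
    {
        "season": ["2023-24", "2024-25"],
        "player": ["Player 1", "Player 1"],
        "shots_per90": [3.4, 2.0],
    }
)
fresh = pd.DataFrame(
    {"season": ["2024-25"], "player": ["Player 1"], "shots_per90": [2.6]}
)

updated = refresh_season(existing, fresh, season="2024-25")
print(updated)
```

Because the function only ever drops and re-adds rows for the given season, rerunning it each month cannot alter the historical data.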
- All code is in the scripts folder.
- I'll build the main database in the first_iteration subfolder.
- I'll add updates in the updates subfolder.
- Data Processing: pandas, re (regex), io (StringIO for tables).
- Web Scraping: BeautifulSoup, Selenium.
- Cloud Storage: Google Cloud Client Library.
- Automation: Chrome WebDriver.
The list of functions notebook contains all the custom functions used in this project.