transfermarkt-scraper

A web scraper for collecting data from the Transfermarkt website. It recurses into the Transfermarkt hierarchy to find competitions, games, clubs, players and appearances, and extracts them as JSON objects.

====> Confederations ====> Competitions ====> (Clubs, Games) ====> Players ====> Appearances

Each one of these entities can be discovered and refreshed separately by invoking the corresponding crawler.

run

This is a scrapy project, so it needs to be run with the scrapy command line utility. A conda environment.yml file is provided with a definition of the environment needed to run the scraper.

# create and activate conda environment
conda env create -f environment.yml
conda activate transfermarkt-scraper

# discover confederations and competitions in separate invocations
scrapy crawl confederations > confederations.json
scrapy crawl competitions -a parents=confederations.json > competitions.json

# you can use intermediate files or pipe crawlers one after the other to traverse the hierarchy
cat competitions.json | head -2 \
    | scrapy crawl clubs \
    | scrapy crawl players \
    | scrapy crawl appearances

Alternatively, you can use the dcaribou/transfermarkt-scraper Docker image

docker run \
    -ti -v "$(pwd)"/.:/app \
    dcaribou/transfermarkt-scraper:main \
    scrapy crawl competitions -a parents=samples/confederations.json

⚠️ When using this scraper, please identify your project by setting a custom user agent. You can pass the user agent string via the USER_AGENT scrapy setting. For example: scrapy crawl players -s USER_AGENT=<your user agent>

Items are extracted in JSON format with one JSON object per item (confederation, league, club, player or appearance), which is printed to stdout. Samples of extracted data are provided in the samples folder.
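Because the output is one JSON object per line (the JSON Lines convention), it is easy to post-process. The helper below is a minimal sketch, assuming you have redirected a crawler's stdout to a file; the field names used in the usage note are illustrative, not guaranteed by the scraper:

```python
import json

# Parse the JSON Lines output of a crawler run, e.g. the file produced by
# `scrapy crawl confederations > confederations.json`.
# Each non-blank line is one scraped item.
def read_items(path):
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

You could then filter or count items in plain Python, for example `len(read_items("competitions.json"))`, or feed them into pandas with `pandas.read_json(path, lines=True)`.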

Check out transfermarkt-datasets to see transfermarkt-scraper in action on a real analytics project.

config

Check settings.py for a reference of available configuration options.
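Any scrapy setting can also be overridden for a single run with the -s flag, as with USER_AGENT above. The example below is a sketch; the user agent string and log level shown are illustrative values, not project defaults:

```shell
# override settings for one run only (values are illustrative)
scrapy crawl competitions \
    -a parents=samples/confederations.json \
    -s USER_AGENT="my-project/1.0 (contact@example.com)" \
    -s LOG_LEVEL=INFO \
    > competitions.json
```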

contribute

Extending the existing crawlers in this project to scrape additional data, or even creating new crawlers, is quite straightforward. If you want to contribute an enhancement to transfermarkt-scraper, I suggest a workflow similar to the following:

  1. Fork the repository
  2. Modify or add new crawlers to tfmkt/spiders. Here is an example PR that extends the games crawler to scrape a few additional fields from the Transfermarkt games page.
  3. Create a PR with your changes and a short description for the enhancement and send it over 🚀

It is usually also a good idea to have a short discussion about the enhancement beforehand. If you want to propose a change and collect some feedback before you start coding, you can do so by creating an issue with your idea in the Issues section.
