IMDb Movie Crawler

A Python-based web crawler that collects movie information from IMDb using both the GraphQL API and web scraping. It gathers basic details, reviews, and ratings for titles selected by customizable filters.

Features

Advanced movie filtering capabilities similar to IMDb's search:
- By country of origin
- By year range
- By genre
- By rating range
- By number of votes
- By title type (movie, TV series, etc.)
- By language
- By user reviews count
Collects detailed movie information including:
- Basic details (title, original title, year, runtime)
- Ratings and reviews
- Plot summaries
- Genre information
- Country of origin
- Popularity rankings
- Certificate ratings
Comprehensive user review collection
JSON to CSV conversion for easy data analysis
Robust logging system
Rate limiting to prevent server overload
Progress saving and error handling
Session management with automatic retry

Project Structure

IMDb-Crawler/
├── imdb_crawler.py           # Main crawler for basic movie information
├── movie_detail_crawler.py   # Detailed movie information crawler
├── user_review_crawler.py    # Movie reviews crawler
├── filter_movies.py          # Movie filtering script
├── json_to_csv_converter.py  # JSON to CSV conversion utility
├── utils/
│   └── logger.py            # Logging utility
├── logs/                    # Log files directory
├── output/                  # Output files directory
└── error_logs/             # Error logging directory

Requirements

Python 3.x
Chrome Browser
Selenium WebDriver

Installation

Clone the repository:

git clone https://github.com/DAN3002/IMDb-Crawler.git
cd IMDb-Crawler

Install required packages:

pip install -r requirements.txt

Install the Chrome WebDriver (ChromeDriver) release that matches your installed Chrome browser version.

Usage

Basic Movie Crawling:

python imdb_crawler.py

Detailed Movie Information:

python movie_detail_crawler.py

User Reviews Collection:

python user_review_crawler.py

Custom Filtering:

# Example in filter_movies.py
filter_criteria = {
    'votes_min': 1000,           # Minimum votes
    'rating_min': 7.0,           # Minimum rating
    'year_range': (2000, 2024),  # Year range
    'countries': ['US', 'UK'],   # Countries
    'genres': ['Action', 'Drama'],  # Genres
    'reviews_min': 5             # Minimum reviews
}
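As a rough illustration of how criteria like these could be applied to one movie record, here is a minimal sketch. The function name `matches_criteria` and the record fields (`votes`, `rating`, `year`, `countries`, `genres`, `review_count`) are assumptions made for this example; the actual keys used in filter_movies.py may differ.

```python
def matches_criteria(movie, criteria):
    """Return True if a movie dict satisfies every criterion that is present.

    The field names read from `movie` are hypothetical, not taken from the project.
    """
    if movie.get('votes', 0) < criteria.get('votes_min', 0):
        return False
    if movie.get('rating', 0.0) < criteria.get('rating_min', 0.0):
        return False
    if movie.get('review_count', 0) < criteria.get('reviews_min', 0):
        return False

    year_range = criteria.get('year_range')
    if year_range is not None:
        first, last = year_range
        year = movie.get('year')
        if year is None or not first <= year <= last:
            return False

    # List criteria match when the movie shares at least one value.
    for key in ('countries', 'genres'):
        wanted = criteria.get(key)
        if wanted and not set(wanted) & set(movie.get(key, [])):
            return False
    return True


filter_criteria = {
    'votes_min': 1000,
    'rating_min': 7.0,
    'year_range': (2000, 2024),
    'countries': ['US', 'UK'],
    'genres': ['Action', 'Drama'],
    'reviews_min': 5,
}

# Hypothetical record used only to exercise the function.
sample_movie = {
    'votes': 2500, 'rating': 7.8, 'year': 2010,
    'countries': ['US'], 'genres': ['Drama', 'Thriller'], 'review_count': 12,
}
result = matches_criteria(sample_movie, filter_criteria)
```

In this sketch a missing criterion is treated as "no restriction", so a partial `filter_criteria` dictionary still works.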

Convert Results to CSV:

python json_to_csv_converter.py
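The conversion step can be pictured with the short sketch below, which uses only the standard `json` and `csv` modules. It assumes the JSON file holds a list of objects and that list values (such as genres) should be joined into a single cell; the function name `json_to_csv` is hypothetical and the real script may handle nesting differently.

```python
import csv
import json


def json_to_csv(json_path, csv_path):
    """Write a JSON array of objects to CSV and return the number of rows."""
    with open(json_path, encoding='utf-8') as f:
        records = json.load(f)

    # Collect the union of keys, in first-seen order, for the header row.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
        writer.writeheader()
        for record in records:
            # Flatten list values into one "; "-separated cell.
            row = {key: '; '.join(map(str, value)) if isinstance(value, list) else value
                   for key, value in record.items()}
            writer.writerow(row)
    return len(records)
```

Records that lack a key get an empty cell for that column, so files with uneven records still produce a rectangular CSV.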

Customizing Filter Criteria When Crawling Movies

# Example in imdb_crawler.py
variables = {
    "first": self.PAGE_SIZE,
    "locale": "vi-VN",
    "originCountryConstraint": {
        "anyPrimaryCountries": ["VN"]
    },
    "titleTypeConstraint": {
        "anyTitleTypeIds": ["movie"],
        "excludeTitleTypeIds": []
    },
    "sortBy": "POPULARITY",
    "sortOrder": "ASC"
}
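To crawl a different country or title type, only the constraint values need to change. The helper below is a sketch that rebuilds the same dictionary from parameters. `build_variables`, its arguments, and the page size of 50 are hypothetical; the helper reuses only the keys shown in the example, and which other values the GraphQL endpoint accepts for each key is not confirmed here.

```python
def build_variables(page_size, countries, title_types=("movie",),
                    locale="vi-VN", sort_by="POPULARITY", sort_order="ASC"):
    """Build a variables dict with the same shape as the example above."""
    return {
        "first": page_size,
        "locale": locale,
        "originCountryConstraint": {"anyPrimaryCountries": list(countries)},
        "titleTypeConstraint": {
            "anyTitleTypeIds": list(title_types),
            "excludeTitleTypeIds": [],
        },
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }


# Page size 50 is an illustrative value, not the project's PAGE_SIZE.
variables = build_variables(page_size=50, countries=["VN"])
```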

Output Formats

The crawler generates several output files:

movie_details.json: Complete movie information
filtered_movies.json: Filtered movie results
movie_reviews.json: User reviews data
Corresponding CSV files for each JSON file

Error Handling

The crawler includes comprehensive error handling and logging:

Automatic session refresh on connection issues
Rate limiting to prevent IP blocking
Progress saving for long-running crawls
Detailed error logs in error_logs directory
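The rate limiting and retry behaviour listed above can be sketched with the standard library alone. This is an illustration of the general technique, not the project's implementation: `RateLimiter`, `with_retry`, and their parameters are hypothetical names, and the exponential backoff is an assumed retry policy.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, back off exponentially and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A crawler loop would call `limiter.wait()` before each request and wrap the request itself in `with_retry(...)`. The `clock` and `sleep` parameters exist only so the timing can be replaced in tests.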

Author

This project was created by @DAN3002.

Contributing

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Disclaimer

This tool is for educational purposes only. Please review IMDb's terms of service and robots.txt before using this crawler. Ensure you comply with IMDb's usage policies and implement appropriate rate limiting.
