Reddit-dl

A focused fork of gallery-dl that uses OAuth2 for secure Reddit API access and MD5-based content deduplication to avoid re-downloading. It automatically detects and removes duplicate media files by content hash, so you only store unique files. Built for speed (concurrent workers plus polite rate limiting) and efficiency, while keeping logs and output organized.

Features

  • OAuth2 Authentication - Secure Reddit API access using script app credentials
  • MD5-Based Deduplication - Automatically detects and deletes duplicate files by content hash
  • Simple & Reliable - Persistent SQLite database tracks seen content across all runs
  • Gallery Support - Automatic expansion and host-specific URL normalization
  • Flexible Output - Organized downloads with customizable directory structure
  • Comprehensive Logging - Detailed audit trails for all download activities
  • High Performance - Parallel, rate-limited downloads for maximum speed

Installation

Requirements

  • Python 3.8 or higher
  • requests library (automatically installed)

Development Installation (Recommended)

For local development with editable installation:

git clone https://github.com/qasimbilalstack/reddit-dl.git
cd reddit-dl
python -m pip install -e .

Direct Installation from GitHub

Install the latest version directly:

python -m pip install "git+https://github.com/qasimbilalstack/reddit-dl.git"

Virtual Environment Installation

Recommended approach to avoid dependency conflicts:

# Create and activate virtual environment
python -m venv reddit-dl-env
source reddit-dl-env/bin/activate  # On Windows: reddit-dl-env\Scripts\activate

# Install reddit-dl
python -m pip install -e .

Using pipx (Isolated Installation)

Install as an isolated command-line tool:

python -m pip install --user pipx
python -m pipx ensurepath
pipx install git+https://github.com/qasimbilalstack/reddit-dl.git

Running Without Installation

Execute directly as a Python module:

python -m reddit_dl.extractor --config config.json <urls>

Updating

Update Development Installation

If you installed using the development method (git clone + pip install -e .):

cd reddit-dl
git pull origin main
python -m pip install -e . --upgrade

Update Direct GitHub Installation

If you installed directly from GitHub:

python -m pip install --upgrade "git+https://github.com/qasimbilalstack/reddit-dl.git"

Update pipx Installation

pipx upgrade reddit-dl

Or reinstall to ensure latest version:

pipx uninstall reddit-dl
pipx install git+https://github.com/qasimbilalstack/reddit-dl.git

Quickstart

1. Create Reddit Application

  1. Visit Reddit App Preferences (https://www.reddit.com/prefs/apps)
  2. Click "Create App" or "Create Another App"
  3. Select "script" as the application type
  4. Note your client ID (under the app name) and client secret

2. Configure Authentication

Copy the example configuration and add your credentials:

cp config.example.json config.json

Edit config.json with your Reddit app credentials:
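
For example, a minimal config.json containing only the required credentials might look like this (same nested layout as the full example in the Configuration section):

```json
{
  "extractor": {
    "reddit": {
      "oauth": {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "username": "YOUR_REDDIT_USERNAME",
        "password": "YOUR_REDDIT_PASSWORD"
      },
      "user_agent": "reddit-dl/0.1 by YOUR_USERNAME"
    }
  }
}
```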

3. Start Downloading

Download media from a Reddit user:

reddit-dl --config config.json "https://www.reddit.com/user/SomeUser/"

Files will be saved to downloads/ with organized subfolders and detailed logs in downloads/logs.txt.

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| client_id | Reddit app client ID | Required |
| client_secret | Reddit app client secret | Required |
| username | Reddit username | Required |
| password | Reddit password | Required |
| user_agent | Custom user agent string | reddit-dl/0.1 |
| output_dir | Download directory | downloads |
| token_cache | Path to OAuth token cache file | ~/.reddit_dl_tokens.json |
| max_posts | Default maximum posts per source | Unlimited |
| default_max_posts | Default max posts when no --max-posts or --all | 1000 |
| md5_save_interval | MD5 database checkpoint frequency (saves after N downloads) | 10 |
| parallel_downloads | Number of parallel downloads | 4 |
| requests_per_second | Rate limit for download requests (per second) | 4.0 |

Recommended conservative presets (choose one based on your environment):

  • Gentle (very low load): parallel_downloads: 1, requests_per_second: 1.0
  • Conservative (recommended): parallel_downloads: 2, requests_per_second: 1.0
  • Balanced (default): parallel_downloads: 4, requests_per_second: 4.0
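
The parallel_downloads and requests_per_second settings combine a worker pool with a shared rate limiter. The sketch below illustrates the general technique of sharing one limiter across download threads; it is not reddit-dl's actual implementation:

```python
import threading
import time


class RateLimiter:
    """Allow at most `rate` acquisitions per second across all worker threads."""

    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self):
        # Reserve the next available time slot, then sleep until it arrives.
        # Holding the lock only while reserving keeps workers from serializing
        # on the sleep itself.
        with self.lock:
            now = time.monotonic()
            wait = self.next_slot - now
            self.next_slot = max(now, self.next_slot) + self.interval
        if wait > 0:
            time.sleep(wait)
```

Each worker would call acquire() before issuing an HTTP request, so with requests_per_second: 1.0 even four parallel workers make at most one request per second in total.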

Configuration example with all available options:

{
  "extractor": {
    "reddit": {
      "oauth": {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "username": "YOUR_REDDIT_USERNAME",
        "password": "YOUR_REDDIT_PASSWORD"
      },
      "user_agent": "reddit-dl/0.1 by YOUR_USERNAME",
      "output_dir": "downloads",
      "md5_save_interval": 10,
      "token_cache": "~/.reddit_dl_tokens.json",
      "default_max_posts": 1000,
      "parallel_downloads": 2,
      "requests_per_second": 1.0
    }
  }
}

CLI Reference

Usage

reddit-dl [OPTIONS] URLS...

Positional Arguments

urls - One or more Reddit URLs to process

Supported URL formats:

  • User pages: https://www.reddit.com/user/USERNAME/
  • Subreddits: https://www.reddit.com/r/SUBREDDIT/
  • Individual posts: https://www.reddit.com/r/SUBREDDIT/comments/POST_ID/
  • Shortened URLs: https://redd.it/POST_ID

Options

General Options

| Option | Description |
| --- | --- |
| -h, --help | Show help message and exit |
| -c, --config CONFIG | Path to configuration JSON file |
| --debug | Enable debug logging output (also bypasses MD5 deduplication to see all downloads) |

Source Selection

| Option | Description |
| --- | --- |
| -u, --user USER | Reddit username(s) to fetch (comma-separated or repeat flag) |
| -r, --subreddit SUBREDDIT | Subreddit name(s) to fetch (comma-separated or repeat flag) |
| -p, --postid POSTID | Post ID(s) to fetch (comma-separated or repeat flag) |

Download Control

| Option | Description |
| --- | --- |
| -o, --output OUTPUT_DIR | Output directory for downloads (overrides config file setting) |
| --max-posts MAX_POSTS | Maximum number of posts to fetch |
| --all | Fetch all available posts (follow pagination) |
| --per-page N | Number of posts to request per page when paginating (default: 100, max: 100) |
| --sort {hot,new,top,rising,best} | Listing sort order to request from Reddit (default: new) |
| --force | Retry previously failed downloads (note: MD5 deduplication always runs) |
| --retry-failed | Retry previously failed downloads |
| --clear-failed | Clear the failed URLs tracking database |
| --prefer-mp4 | Prefer MP4 video format when available (adds ?format=mp4 to compatible URLs) |

Performance Options

| Option | Description |
| --- | --- |
| --save-interval N | Save MD5 database every N downloads (default: 10) |

Content Control

| Option | Description |
| --- | --- |
| --save-json | Save per-post metadata JSON files (disabled by default for faster downloads) |
| --save-meta-only | Only save per-post metadata JSON files; do not download media files |
| --comments | Fetch comments in addition to submissions (disabled by default; without this flag only submissions are fetched, via /submitted/ URLs) |

User Profile Options

| Option | Description |
| --- | --- |
| --save-bio | Fetch user profile bio(s) and save compact JSON into <outdir>/user_bio (for --user) |
| --only-verified | When specified with --user or --subreddit, only process users/posts whose profile has verified: true |

Examples

Basic Usage

Download recent posts from a user:

reddit-dl --config config.json "https://www.reddit.com/user/SomeUser/"
# Or using the --user flag:
reddit-dl --config config.json --user SomeUser

Download to a specific directory:

# Use short form (-o)
reddit-dl --config config.json --user SomeUser -o /path/to/downloads

# Use long form (--output)
reddit-dl --config config.json --user SomeUser --output ./my_reddit_content

Download from a subreddit:

reddit-dl --config config.json "https://www.reddit.com/r/earthporn/"
# Or using the --subreddit flag:
reddit-dl --config config.json --subreddit earthporn

# Download top posts from a subreddit:
reddit-dl --config config.json --sort top --subreddit earthporn

Download a specific post:

reddit-dl --config config.json "https://www.reddit.com/r/pics/comments/abc123/..."
# Or using the --postid flag:
reddit-dl --config config.json --postid abc123

Advanced Usage

Download all available posts from multiple sources:

reddit-dl --config config.json --all \
  "https://www.reddit.com/user/User1/" \
  "https://www.reddit.com/r/subreddit1/" \
  "https://www.reddit.com/r/subreddit2/"
# Or using flags (can mix and match):
reddit-dl --config config.json --all \
  --user User1,User2 \
  --subreddit subreddit1,subreddit2

Download from multiple users and subreddits:

# Using comma-separated lists (recommended):
reddit-dl --config config.json \
  --user User1,User2,User3 \
  --subreddit pics,funny,aww \
  --postid abc123,def456

# Or using repeated flags:
reddit-dl --config config.json \
  --user User1 --user User2 \
  --subreddit pics --subreddit funny \
  --postid abc123 --postid def456

Limit downloads and enable debug logging:

reddit-dl --config config.json --max-posts 50 --debug \
  "https://www.reddit.com/user/SomeUser/"
# Or with flags:
reddit-dl --config config.json --max-posts 50 --debug --user SomeUser

Force re-download with custom save interval:

reddit-dl --config config.json --force --save-interval 1 \
  "https://www.reddit.com/user/SomeUser/"

Retry failed downloads from previous sessions:

reddit-dl --config config.json --retry-failed

Download with custom sort order and pagination:

# Download top posts with custom page size
reddit-dl --config config.json --sort top --per-page 50 \
  "https://www.reddit.com/r/earthporn/"

# Download hot posts without metadata JSON files (faster)
reddit-dl --config config.json --sort hot \
  "https://www.reddit.com/user/SomeUser/"

# Download only submissions (comments disabled by default) from multiple users
reddit-dl --config config.json \
  --user User1,User2,User3

Batch Processing

Process multiple URLs from a file:

# Create URL list
cat > urls.txt << EOF
https://www.reddit.com/user/User1/
https://www.reddit.com/user/User2/
https://www.reddit.com/r/subreddit1/
EOF

# Process all URLs
xargs -I {} reddit-dl --config config.json {} < urls.txt

Process multiple sources using flags:

# Download from multiple users and subreddits in one command (recommended):
reddit-dl --config config.json \
  --user User1,User2,User3 \
  --subreddit pics,funny,aww

# Or using repeated flags:
reddit-dl --config config.json \
  --user User1 --user User2 --user User3 \
  --subreddit pics --subreddit funny --subreddit aww

# Mix URLs and flags:
reddit-dl --config config.json \
  "https://www.reddit.com/user/SpecialUser/" \
  --subreddit earthporn,wallpapers \
  --postid abc123,def456

Output Structure

Downloaded files are organized as follows:

downloads/
├── .md5_index.sqlite        # MD5 deduplication database
├── logs.txt                 # Comprehensive download logs
├── u_USERNAME/              # User downloads
│   ├── POST_ID.jpg         # Media files
│   ├── POST_ID.json        # Metadata
│   └── POST_ID_1.jpg       # Additional media from galleries
└── r_SUBREDDIT/            # Subreddit downloads
    ├── POST_ID.mp4
    ├── POST_ID.json
    └── ...

How MD5 Deduplication Works

reddit-dl uses content-based deduplication to ensure you never store duplicate media files:

Process Flow

  1. Download - File is downloaded to disk
  2. Calculate MD5 - Content hash is computed for the file
  3. Check Database - MD5 is looked up in .md5_index.sqlite
  4. Decision:
    • If MD5 exists → File is deleted immediately (duplicate detected)
    • If MD5 is new → File is kept, MD5 added to database
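
The steps above can be sketched in Python as follows (the table name `hashes` is a hypothetical stand-in; reddit-dl's actual schema may differ):

```python
import hashlib
import os
import sqlite3


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 of a file's content, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def keep_if_unique(path, db):
    """Delete `path` if its content hash is already known; otherwise record it."""
    md5 = md5_of(path)
    if db.execute("SELECT 1 FROM hashes WHERE md5 = ?", (md5,)).fetchone():
        os.remove(path)  # duplicate content: delete the fresh download
        return False
    db.execute("INSERT INTO hashes (md5) VALUES (?)", (md5,))
    db.commit()
    return True
```

Because the check runs on content rather than filenames, two different posts pointing at byte-identical media collapse to a single stored file.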

Key Features

  • Content-Based - Detects duplicates even if filenames differ
  • Persistent - Database survives across all runs
  • Automatic - No configuration needed, always active
  • Efficient - Only unique content stored on disk
  • Cross-Post Detection - The same image posted to multiple subreddits is stored once

Example Behavior

First Run:

reddit-dl --config config.json --all --user SomeUser
# Result: 102 items → 30 unique files kept, 72 duplicates deleted
# Files on disk: 30 (all unique)
# Database: 30 MD5 hashes

Second Run (Same Command):

reddit-dl --config config.json --all --user SomeUser
# Result: 102 items downloaded, all detected as duplicates, all deleted
# Files on disk: 30 (unchanged)
# Note: Files are downloaded then immediately deleted if duplicate

Database Location

The MD5 index is stored at downloads/.md5_index.sqlite by default. This file:

  • Tracks all MD5 hashes of downloaded content
  • Persists across runs and system restarts
  • Can be safely deleted to reset deduplication tracking
  • Is automatically checkpointed based on --save-interval setting
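
If you want to peek inside the index, Python's standard sqlite3 module is enough. This small helper (illustrative, not part of reddit-dl) enumerates whatever tables the file actually contains rather than assuming a particular schema:

```python
import sqlite3


def index_summary(db_path="downloads/.md5_index.sqlite"):
    """Return {table_name: row_count} for every table in the SQLite index."""
    db = sqlite3.connect(db_path)
    summary = {}
    for (name,) in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ):
        # Table names come from the database itself, so quoting them is safe here.
        summary[name] = db.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    db.close()
    return summary
```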

Force and Debug Options

--force Flag:

  • Bypasses the failed URL check (retries previously failed downloads)
  • MD5 deduplication still runs normally
  • Useful for recovering from incomplete downloads

--debug Flag:

  • Enables verbose logging output
  • Also bypasses MD5 deduplication (keeps all files even if duplicates)
  • Useful for testing, debugging, and verifying content differences
  • Files are downloaded and kept on disk without duplicate deletion
  • MD5 hashes are still recorded in database for future runs

Important: In normal operation (without --debug), MD5 deduplication always runs to ensure:

  • You never store duplicate content
  • Storage remains efficient
  • Only unique files are kept

Troubleshooting

Common Issues

Command Not Found

reddit-dl: command not found

Solutions:

  • Ensure virtual environment is activated: source venv/bin/activate
  • Verify installation: pip list | grep reddit-dl
  • Check PATH configuration for pipx installations
  • Use module execution: python -m reddit_dl.extractor

Authentication Errors

HTTP 403: Forbidden

Solutions:

  • Verify Reddit app credentials in config.json
  • Ensure app type is set to "script" in Reddit preferences
  • Check username and password are correct
  • Confirm client ID and secret are accurate

Download Issues

Files appear to re-download unnecessarily

Solutions:

  • Check downloads/logs.txt for detailed information
  • Verify MD5 database integrity
  • Use --debug for verbose output

Performance Problems

Downloads are slow or timing out

Solutions:

  • Reduce --max-posts for testing
  • Omit --save-json to skip metadata writing (faster)
  • Use --per-page with smaller values (e.g., 25) for better rate limiting
  • Check network connectivity
  • Monitor Reddit API rate limits

Debug Mode

Enable comprehensive logging:

reddit-dl --config config.json --debug "https://www.reddit.com/user/SomeUser/"

This provides detailed information about:

  • Authentication status
  • URL processing
  • File deduplication decisions
  • Download progress
  • Error conditions

Log Analysis

Check downloads/logs.txt for audit trails:

# View recent activity
tail -f downloads/logs.txt

# Search for errors
grep -i error downloads/logs.txt

# Check specific user downloads
grep "u_SomeUser" downloads/logs.txt

Getting Help

If you encounter issues:

  1. Enable debug mode and check logs
  2. Verify configuration against config.example.json
  3. Test with a small --max-posts value
  4. Check Reddit app settings and permissions
  5. Review GitHub issues for similar problems
  6. Create a new issue with debug output

Contributing

We welcome contributions! Please follow these guidelines:

Development Setup

  1. Fork the repository
  2. Clone your fork locally
  3. Create a virtual environment
  4. Install in development mode

git clone https://github.com/YOUR_USERNAME/reddit-dl.git
cd reddit-dl
python -m venv venv
source venv/bin/activate
python -m pip install -e .

Code Standards

  • Follow PEP 8 style guidelines
  • Add docstrings for public functions
  • Include type hints where appropriate
  • Write descriptive commit messages

Testing

  • Test changes with various Reddit URL types
  • Verify OAuth authentication works
  • Check deduplication functionality
  • Test edge cases and error conditions

Pull Request Process

  1. Create a feature branch from main
  2. Make focused, atomic commits
  3. Include tests for new functionality
  4. Update documentation as needed
  5. Submit pull request with clear description

Issue Reporting

When reporting bugs, include:

  • Python version and operating system
  • Full command used and configuration
  • Complete error output with --debug
  • Steps to reproduce the issue

License

This project is derived from gallery-dl and maintains compatibility with its licensing. The original gallery-dl project is licensed under the GNU General Public License v2.0.

License Details

  • This project: GPL-2.0 License (following gallery-dl)
  • Dependencies: Various licenses (see requirements)
  • Reddit API: Subject to Reddit's Terms of Service

Attribution

Special thanks to the gallery-dl project and its contributors for providing the foundation for this focused Reddit extractor.

For complete license text, see the LICENSE file in the repository.
