Reddit-dl

A focused fork of gallery-dl that uses OAuth2 for secure Reddit API access and MD5-based content deduplication to avoid re-downloading. It automatically detects and removes duplicate media files by content hash, so you only store unique files. Built for speed (concurrent workers plus polite rate limiting) and efficiency, while keeping logs and output organized.

Features

  • OAuth2 Authentication - Secure Reddit API access using script app credentials
  • MD5-Based Deduplication - Automatically detects and deletes duplicate files by content hash
  • Simple & Reliable - Persistent SQLite database tracks seen content across all runs
  • Gallery Support - Automatic expansion and host-specific URL normalization
  • Flexible Output - Organized downloads with customizable directory structure
  • Comprehensive Logging - Detailed audit trails for all download activities
  • High Performance - Parallel, rate-limited downloads for maximum speed

Installation

Requirements

  • Python 3.8 or higher
  • requests library (automatically installed)

Development Installation (Recommended)

For local development with editable installation:

git clone https://github.com/qasimbilalstack/reddit-dl.git
cd reddit-dl
python -m pip install -e .

Direct Installation from GitHub

Install the latest version directly:

python -m pip install "git+https://github.com/qasimbilalstack/reddit-dl.git"

Virtual Environment Installation

Recommended approach to avoid dependency conflicts:

# Create and activate virtual environment
python -m venv reddit-dl-env
source reddit-dl-env/bin/activate  # On Windows: reddit-dl-env\Scripts\activate

# Install reddit-dl
python -m pip install -e .

Using pipx (Isolated Installation)

Install as an isolated command-line tool:

python -m pip install --user pipx
python -m pipx ensurepath
pipx install git+https://github.com/qasimbilalstack/reddit-dl.git

Running Without Installation

Execute directly as a Python module:

python -m reddit_dl.extractor --config config.json <urls>

Updating

Update Development Installation

If you installed using the development method (git clone + pip install -e .):

cd reddit-dl
git pull origin main
python -m pip install -e . --upgrade

Update Direct GitHub Installation

If you installed directly from GitHub:

python -m pip install --upgrade "git+https://github.com/qasimbilalstack/reddit-dl.git"

Update pipx Installation

pipx upgrade reddit-dl

Or reinstall to ensure latest version:

pipx uninstall reddit-dl
pipx install git+https://github.com/qasimbilalstack/reddit-dl.git

Quickstart

1. Create Reddit Application

  1. Visit Reddit App Preferences (https://www.reddit.com/prefs/apps)
  2. Click "Create App" or "Create Another App"
  3. Select "script" as the application type
  4. Note your client ID (under the app name) and client secret

2. Configure Authentication

Copy the example configuration and add your credentials:

cp config.example.json config.json

Edit config.json with your Reddit app credentials:
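
For example, a minimal config.json containing only the required credentials might look like this (same nested layout as the full example in the Configuration section):

```json
{
  "extractor": {
    "reddit": {
      "oauth": {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "username": "YOUR_REDDIT_USERNAME",
        "password": "YOUR_REDDIT_PASSWORD"
      },
      "user_agent": "reddit-dl/0.1 by YOUR_USERNAME"
    }
  }
}
```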

3. Start Downloading

Download media from a Reddit user:

reddit-dl --config config.json "https://www.reddit.com/user/SomeUser/"

Files will be saved to downloads/ with organized subfolders and detailed logs in downloads/logs.txt.

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| client_id | Reddit app client ID | Required |
| client_secret | Reddit app client secret | Required |
| username | Reddit username | Required |
| password | Reddit password | Required |
| user_agent | Custom user agent string | reddit-dl/0.1 |
| output_dir | Download directory | downloads |
| token_cache | Path to OAuth token cache file | ~/.reddit_dl_tokens.json |
| max_posts | Default maximum posts per source | Unlimited |
| default_max_posts | Default max posts when no --max-posts or --all | 1000 |
| md5_save_interval | MD5 database checkpoint frequency (saves after N downloads) | 10 |
| parallel_downloads | Number of parallel downloads | 4 |
| requests_per_second | Rate limit for download requests (per second) | 4.0 |

Recommended conservative presets (choose one based on your environment):

  • Gentle (very low load): parallel_downloads: 1, requests_per_second: 1.0
  • Conservative (recommended): parallel_downloads: 2, requests_per_second: 1.0
  • Balanced (default): parallel_downloads: 4, requests_per_second: 4.0
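
The parallel_downloads and requests_per_second settings combine a worker pool with a shared rate limiter. The sketch below illustrates the general technique of sharing one limiter across download threads; it is not reddit-dl's actual implementation:

```python
import threading
import time


class RateLimiter:
    """Allow at most `rate` acquisitions per second across all worker threads."""

    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self):
        # Reserve the next available time slot, then sleep until it arrives.
        # Holding the lock only while reserving keeps workers from serializing
        # on the sleep itself.
        with self.lock:
            now = time.monotonic()
            wait = self.next_slot - now
            self.next_slot = max(now, self.next_slot) + self.interval
        if wait > 0:
            time.sleep(wait)
```

Each worker would call acquire() before issuing an HTTP request, so with requests_per_second: 1.0 even four parallel workers make at most one request per second in total.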

Configuration example with all available options:

{
  "extractor": {
    "reddit": {
      "oauth": {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "username": "YOUR_REDDIT_USERNAME",
        "password": "YOUR_REDDIT_PASSWORD"
      },
      "user_agent": "reddit-dl/0.1 by YOUR_USERNAME",
      "output_dir": "downloads",
      "md5_save_interval": 10,
      "token_cache": "~/.reddit_dl_tokens.json",
      "default_max_posts": 1000,
      "parallel_downloads": 2,
      "requests_per_second": 1.0
    }
  }
}

CLI Reference

Usage

reddit-dl [OPTIONS] URLS...

Positional Arguments

urls - One or more Reddit URLs to process

Supported URL formats:

  • User pages: https://www.reddit.com/user/USERNAME/
  • Subreddits: https://www.reddit.com/r/SUBREDDIT/
  • Individual posts: https://www.reddit.com/r/SUBREDDIT/comments/POST_ID/
  • Shortened URLs: https://redd.it/POST_ID

Options

General Options

| Option | Description |
| --- | --- |
| -h, --help | Show help message and exit |
| -c, --config CONFIG | Path to configuration JSON file |
| --debug | Enable debug logging output (also bypasses MD5 deduplication to see all downloads) |

Source Selection

| Option | Description |
| --- | --- |
| -u, --user USER | Reddit username(s) to fetch (comma-separated or repeat flag) |
| -r, --subreddit SUBREDDIT | Subreddit name(s) to fetch (comma-separated or repeat flag) |
| -p, --postid POSTID | Post ID(s) to fetch (comma-separated or repeat flag) |

Download Control

| Option | Description |
| --- | --- |
| -o, --output OUTPUT_DIR | Output directory for downloads (overrides config file setting) |
| --max-posts MAX_POSTS | Maximum number of posts to fetch |
| --all | Fetch all available posts (follow pagination) |
| --per-page N | Number of posts to request per page when paginating (default: 100, max: 100) |
| --sort {hot,new,top,rising,best} | Listing sort order to request from Reddit (default: new) |
| --force | Retry previously failed downloads (note: MD5 deduplication always runs) |
| --retry-failed | Retry previously failed downloads |
| --clear-failed | Clear the failed URLs tracking database |
| --prefer-mp4 | Prefer MP4 video format when available (adds ?format=mp4 to compatible URLs) |

Performance Options

| Option | Description |
| --- | --- |
| --save-interval N | Save MD5 database every N downloads (default: 10) |

Content Control

| Option | Description |
| --- | --- |
| --save-json | Save per-post metadata JSON files (disabled by default for faster downloads) |
| --save-meta-only | Only save per-post metadata JSON files; do not download media files |
| --comments | Fetch comments in addition to submissions (disabled by default; without this flag only submissions are fetched, via /submitted/ URLs) |

User Profile Options

| Option | Description |
| --- | --- |
| --save-bio | Fetch user profile bio(s) and save compact JSON into <outdir>/user_bio (for --user) |
| --only-verified | When specified with --user or --subreddit, only process users/posts whose profile has verified: true |

Examples

Basic Usage

Download recent posts from a user:

reddit-dl --config config.json "https://www.reddit.com/user/SomeUser/"
# Or using the --user flag:
reddit-dl --config config.json --user SomeUser

Download to a specific directory:

# Use short form (-o)
reddit-dl --config config.json --user SomeUser -o /path/to/downloads

# Use long form (--output)
reddit-dl --config config.json --user SomeUser --output ./my_reddit_content

Download from a subreddit:

reddit-dl --config config.json "https://www.reddit.com/r/earthporn/"
# Or using the --subreddit flag:
reddit-dl --config config.json --subreddit earthporn

# Download top posts from a subreddit:
reddit-dl --config config.json --sort top --subreddit earthporn

Download a specific post:

reddit-dl --config config.json "https://www.reddit.com/r/pics/comments/abc123/..."
# Or using the --postid flag:
reddit-dl --config config.json --postid abc123

Advanced Usage

Download all available posts from multiple sources:

reddit-dl --config config.json --all \
  "https://www.reddit.com/user/User1/" \
  "https://www.reddit.com/r/subreddit1/" \
  "https://www.reddit.com/r/subreddit2/"
# Or using flags (can mix and match):
reddit-dl --config config.json --all \
  --user User1,User2 \
  --subreddit subreddit1,subreddit2

Download from multiple users and subreddits:

# Using comma-separated lists (recommended):
reddit-dl --config config.json \
  --user User1,User2,User3 \
  --subreddit pics,funny,aww \
  --postid abc123,def456

# Or using repeated flags:
reddit-dl --config config.json \
  --user User1 --user User2 \
  --subreddit pics --subreddit funny \
  --postid abc123 --postid def456

Limit downloads and enable debug logging:

reddit-dl --config config.json --max-posts 50 --debug \
  "https://www.reddit.com/user/SomeUser/"
# Or with flags:
reddit-dl --config config.json --max-posts 50 --debug --user SomeUser

Force re-download with custom save interval:

reddit-dl --config config.json --force --save-interval 1 \
  "https://www.reddit.com/user/SomeUser/"

Retry failed downloads from previous sessions:

reddit-dl --config config.json --retry-failed

Download with custom sort order and pagination:

# Download top posts with custom page size
reddit-dl --config config.json --sort top --per-page 50 \
  "https://www.reddit.com/r/earthporn/"

# Download hot posts without metadata JSON files (faster)
reddit-dl --config config.json --sort hot \
  "https://www.reddit.com/user/SomeUser/"

# Download only submissions (comments disabled by default) from multiple users
reddit-dl --config config.json \
  --user User1,User2,User3

Batch Processing

Process multiple URLs from a file:

# Create URL list
cat > urls.txt << EOF
https://www.reddit.com/user/User1/
https://www.reddit.com/user/User2/
https://www.reddit.com/r/subreddit1/
EOF

# Process all URLs
xargs -I {} reddit-dl --config config.json {} < urls.txt

Process multiple sources using flags:

# Download from multiple users and subreddits in one command (recommended):
reddit-dl --config config.json \
  --user User1,User2,User3 \
  --subreddit pics,funny,aww

# Or using repeated flags:
reddit-dl --config config.json \
  --user User1 --user User2 --user User3 \
  --subreddit pics --subreddit funny --subreddit aww

# Mix URLs and flags:
reddit-dl --config config.json \
  "https://www.reddit.com/user/SpecialUser/" \
  --subreddit earthporn,wallpapers \
  --postid abc123,def456

Output Structure

Downloaded files are organized as follows:

downloads/
├── .md5_index.sqlite        # MD5 deduplication database
├── logs.txt                 # Comprehensive download logs
├── u_USERNAME/              # User downloads
│   ├── POST_ID.jpg         # Media files
│   ├── POST_ID.json        # Metadata
│   └── POST_ID_1.jpg       # Additional media from galleries
└── r_SUBREDDIT/            # Subreddit downloads
    ├── POST_ID.mp4
    ├── POST_ID.json
    └── ...

How MD5 Deduplication Works

reddit-dl uses content-based deduplication to ensure you never store duplicate media files:

Process Flow

  1. Download - File is downloaded to disk
  2. Calculate MD5 - Content hash is computed for the file
  3. Check Database - MD5 is looked up in .md5_index.sqlite
  4. Decision:
    • If MD5 exists → File is deleted immediately (duplicate detected)
    • If MD5 is new → File is kept, MD5 added to database
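
The steps above can be sketched in Python as follows (the table name `hashes` is a hypothetical stand-in; reddit-dl's actual schema may differ):

```python
import hashlib
import os
import sqlite3


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 of a file's content, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def keep_if_unique(path, db):
    """Delete `path` if its content hash is already known; otherwise record it."""
    md5 = md5_of(path)
    if db.execute("SELECT 1 FROM hashes WHERE md5 = ?", (md5,)).fetchone():
        os.remove(path)  # duplicate content: delete the fresh download
        return False
    db.execute("INSERT INTO hashes (md5) VALUES (?)", (md5,))
    db.commit()
    return True
```

Because the check runs on content rather than filenames, two different posts pointing at byte-identical media collapse to a single stored file.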

Key Features

  • Content-Based - Detects duplicates even if filenames differ
  • Persistent - Database survives across all runs
  • Automatic - No configuration needed, always active
  • Efficient - Only unique content stored on disk
  • Cross-Post Detection - The same image posted to multiple subreddits is stored once

Example Behavior

First Run:

reddit-dl --config config.json --all --user SomeUser
# Result: 102 items → 30 unique files kept, 72 duplicates deleted
# Files on disk: 30 (all unique)
# Database: 30 MD5 hashes

Second Run (Same Command):

reddit-dl --config config.json --all --user SomeUser
# Result: 102 items downloaded, all detected as duplicates, all deleted
# Files on disk: 30 (unchanged)
# Note: Files are downloaded then immediately deleted if duplicate

Database Location

The MD5 index is stored at downloads/.md5_index.sqlite by default. This file:

  • Tracks all MD5 hashes of downloaded content
  • Persists across runs and system restarts
  • Can be safely deleted to reset deduplication tracking
  • Is automatically checkpointed based on --save-interval setting
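
If you want to peek inside the index, Python's standard sqlite3 module is enough. This small helper (illustrative, not part of reddit-dl) enumerates whatever tables the file actually contains rather than assuming a particular schema:

```python
import sqlite3


def index_summary(db_path="downloads/.md5_index.sqlite"):
    """Return {table_name: row_count} for every table in the SQLite index."""
    db = sqlite3.connect(db_path)
    summary = {}
    for (name,) in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ):
        # Table names come from the database itself, so quoting them is safe here.
        summary[name] = db.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    db.close()
    return summary
```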

Force and Debug Options

--force Flag:

  • Bypasses the failed URL check (retries previously failed downloads)
  • MD5 deduplication still runs normally
  • Useful for recovering from incomplete downloads

--debug Flag:

  • Enables verbose logging output
  • Also bypasses MD5 deduplication (keeps all files even if duplicates)
  • Useful for testing, debugging, and verifying content differences
  • Files are downloaded and kept on disk without duplicate deletion
  • MD5 hashes are still recorded in database for future runs

Important: In normal operation (without --debug), MD5 deduplication always runs to ensure:

  • You never store duplicate content
  • Storage remains efficient
  • Only unique files are kept

Troubleshooting

Common Issues

Command Not Found

reddit-dl: command not found

Solutions:

  • Ensure virtual environment is activated: source venv/bin/activate
  • Verify installation: pip list | grep reddit-dl
  • Check PATH configuration for pipx installations
  • Use module execution: python -m reddit_dl.extractor

Authentication Errors

HTTP 403: Forbidden

Solutions:

  • Verify Reddit app credentials in config.json
  • Ensure app type is set to "script" in Reddit preferences
  • Check username and password are correct
  • Confirm client ID and secret are accurate

Download Issues

Files appear to re-download unnecessarily

Solutions:

  • Check downloads/logs.txt for detailed information
  • Verify MD5 database integrity
  • Use --debug for verbose output

Performance Problems

Downloads are slow or timing out

Solutions:

  • Reduce --max-posts for testing
  • Omit --save-json to skip metadata writing (faster)
  • Use --per-page with smaller values (e.g., 25) for better rate limiting
  • Check network connectivity
  • Monitor Reddit API rate limits

Debug Mode

Enable comprehensive logging:

reddit-dl --config config.json --debug "https://www.reddit.com/user/SomeUser/"

This provides detailed information about:

  • Authentication status
  • URL processing
  • File deduplication decisions
  • Download progress
  • Error conditions

Log Analysis

Check downloads/logs.txt for audit trails:

# View recent activity
tail -f downloads/logs.txt

# Search for errors
grep -i error downloads/logs.txt

# Check specific user downloads
grep "u_SomeUser" downloads/logs.txt

Getting Help

If you encounter issues:

  1. Enable debug mode and check logs
  2. Verify configuration against config.example.json
  3. Test with a small --max-posts value
  4. Check Reddit app settings and permissions
  5. Review GitHub issues for similar problems
  6. Create a new issue with debug output

Contributing

We welcome contributions! Please follow these guidelines:

Development Setup

  1. Fork the repository
  2. Clone your fork locally
  3. Create a virtual environment
  4. Install in development mode

git clone https://github.com/YOUR_USERNAME/reddit-dl.git
cd reddit-dl
python -m venv venv
source venv/bin/activate
python -m pip install -e .

Code Standards

  • Follow PEP 8 style guidelines
  • Add docstrings for public functions
  • Include type hints where appropriate
  • Write descriptive commit messages

Testing

  • Test changes with various Reddit URL types
  • Verify OAuth authentication works
  • Check deduplication functionality
  • Test edge cases and error conditions

Pull Request Process

  1. Create a feature branch from main
  2. Make focused, atomic commits
  3. Include tests for new functionality
  4. Update documentation as needed
  5. Submit pull request with clear description

Issue Reporting

When reporting bugs, include:

  • Python version and operating system
  • Full command used and configuration
  • Complete error output with --debug
  • Steps to reproduce the issue

License

This project is derived from gallery-dl and maintains compatibility with its licensing. The original gallery-dl project is licensed under the GNU General Public License v2.0.

License Details

  • This project: GPL-2.0 License (following gallery-dl)
  • Dependencies: Various licenses (see requirements)
  • Reddit API: Subject to Reddit's Terms of Service

Attribution

Special thanks to the gallery-dl project and its contributors for providing the foundation for this focused Reddit extractor.

For complete license text, see the LICENSE file in the repository.
