A Python script that converts Reddit JSONL files (Reddit API format) into a beautiful HTML visualization with embedded media. The script downloads images and videos, merges comments from multiple sources, and exports cleaned data for training.
- Media-Aware: Automatically downloads images and videos from posts and comments
- Giphy Support: Extracts and downloads giphy links (e.g., `giphy|ID` references)
- Preview Images: Embeds preview and thumbnail images even if not in post body text
- Comment Merging: Combines comments from multiple JSONL files (useful when users block each other)
- Reddit-Style UI: Beautiful HTML output that looks like Reddit
- Training Data Export: Exports cleaned JSONL format with local media paths for training
- Parallel Downloads: Fast media downloads using multiple workers
The only third-party dependency is `requests`; everything else ships with Python. The script uses:

- `json` - JSON parsing
- `html` - HTML escaping
- `requests` - Media downloads
- `argparse` - Command-line arguments
- `pathlib` - Path handling
- `concurrent.futures` - Parallel downloads
Process specific JSONL files:

    python3 jsonl_to_html.py -i test.jsonl test3.jsonl

Process all JSONL files in a directory:

    python3 jsonl_to_html.py -i data/

Process files and directories together:

    python3 jsonl_to_html.py -i file1.jsonl file2.jsonl data/ -o output/

-i, --input      Input JSONL file(s) or directory/directories (required)
-o, --output Output directory (default: current directory)
--html-name Output HTML filename (default: media_aware_visualization.html)
--jsonl-name Output JSONL filename (default: conversation_data_cleaned.jsonl)
--media-dir Directory for downloaded media (default: downloaded_media)
--workers Number of parallel workers for downloads (default: 50)
# Process files and save to custom output directory
python3 jsonl_to_html.py -i test.jsonl test3.jsonl -o results/
# Process directory with custom filenames
python3 jsonl_to_html.py -i data/ -o output/ --html-name reddit_threads.html --jsonl-name training_data.jsonl
# Use more workers for faster downloads
python3 jsonl_to_html.py -i large_dataset/ -o output/ --workers 100

The script expects JSONL files in Reddit API format. Each line should be a JSON array with:
- [0]: Listing containing posts (kind="Listing", children with kind="t3")
- [1]: Listing containing comments (kind="Listing", children with kind="t1")
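As a sketch, each line can be split into its two listings as below; `parse_line` is an illustrative name, not necessarily the script's internal API:

```python
import json

def parse_line(line: str):
    """Split one Reddit-API JSONL line into (posts, comments) data dicts."""
    post_listing, comment_listing = json.loads(line)
    posts = [c["data"] for c in post_listing["data"]["children"] if c["kind"] == "t3"]
    comments = [c["data"] for c in comment_listing["data"]["children"] if c["kind"] == "t1"]
    return posts, comments
```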
Example structure:
[
{
"kind": "Listing",
"data": {
"children": [
{
"kind": "t3",
"data": {
"name": "t3_xxxxx",
"title": "Post Title",
"selftext": "Post body text",
"author": "username",
"score": 100,
"subreddit": "subredditname",
"created_utc": 1234567890,
...
}
}
]
}
},
{
"kind": "Listing",
"data": {
"children": [
{
"kind": "t1",
"data": {
"name": "t1_xxxxx",
"body": "Comment text",
"author": "username",
"score": 50,
"created_utc": 1234567890,
"replies": { ... }
}
}
]
}
}
]

A Reddit-style HTML page showing:
- Post titles, authors, scores, timestamps
- Post bodies with embedded images/videos
- Nested comment threads with proper indentation
- All media files embedded inline
Training-ready format with:
- One post per line
- Cleaned text (HTML tags removed, media paths preserved as local file references)
- Nested comment structure preserved
- All metadata (scores, timestamps, authors)
- Local media paths for training (e.g., `downloaded_media/filename.jpg`)
Example output line:
{
"id": "t3_xxxxx",
"title": "Post Title",
"author": "username",
"body": "Post body text downloaded_media/image1.jpg",
"score": 100,
"created_at": 1234567890,
"subreddit": "subredditname",
"comment_count": 5,
"comments": [
{
"id": "t1_xxxxx",
"author": "commenter",
"body": "Comment text downloaded_media/image2.jpg",
"score": 50,
"created_at": 1234567890,
"replies": [...]
}
]
}

All downloaded images and videos are saved to the media directory (`downloaded_media/` by default) with MD5-based filenames to avoid duplicates.
- Reads all specified JSONL files
- Extracts posts and comments from Reddit API format
- Collects all media URLs from posts and comments
- Downloads all unique media files in parallel
- Saves them to the `downloaded_media/` directory
- Creates a mapping from original URLs to local paths
- Merges comments from multiple files (by matching comment IDs)
- Replaces media URLs with local file paths
- Converts giphy links to local file paths
- Embeds all images/videos as HTML tags (including preview/thumbnail images)
- Processes nested comment threads recursively
- Generates Reddit-style HTML visualization
- Exports cleaned JSONL for training data
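The download step (parallel workers, retries, MD5-based deduplicated filenames) can be sketched roughly as follows; `download_media` and the pluggable `fetch` hook are illustrative names, not the script's actual interface:

```python
import concurrent.futures
import hashlib
from pathlib import Path

def _fetch(url: str) -> bytes:
    import requests  # the script's download dependency
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.content

def download_media(urls, media_dir="downloaded_media", workers=50, retries=3, fetch=_fetch):
    """Download each unique URL once, in parallel; return {url: local path}."""
    out = Path(media_dir)
    out.mkdir(parents=True, exist_ok=True)

    def task(url):
        # MD5 of the URL gives a stable, deduplicated filename.
        name = hashlib.md5(url.encode("utf-8")).hexdigest() + Path(url).suffix
        path = out / name
        for attempt in range(retries):
            try:
                path.write_bytes(fetch(url))
                return url, str(path)
            except Exception:
                if attempt == retries - 1:
                    return url, None  # failure is logged, not fatal
        return url, None

    mapping = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for url, path in pool.map(task, set(urls)):  # set() deduplicates by URL
            if path is not None:
                mapping[url] = path
    return mapping
```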
When multiple JSONL files contain the same post, the script:
- Matches comments by their Reddit ID
- Merges replies recursively
- Combines all unique comments from all sources
- Useful when users block each other and different files show different parts of the conversation
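Using the cleaned-output shape shown earlier (`id`, with `replies` as a list), the ID-based merge can be sketched as below; `merge_comment_trees` is an illustrative name:

```python
def merge_comment_trees(a: list, b: list) -> list:
    """Merge two comment lists by Reddit ID, recursing into replies,
    so both sources contribute unique comments to each thread."""
    by_id = {c["id"]: c for c in a}
    for c in b:
        if c["id"] in by_id:
            seen = by_id[c["id"]]
            seen["replies"] = merge_comment_trees(seen.get("replies", []),
                                                  c.get("replies", []))
        else:
            by_id[c["id"]] = c
    return list(by_id.values())
```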
Media handling works in three stages:

- Extraction: Finds media URLs in:
  - Post `url` and `url_overridden_by_dest` fields
  - Post `preview` images (automatically embedded even if not in body)
  - Post `thumbnail` images (automatically embedded even if not in body)
  - Gallery images (`gallery_data`/`media_metadata`)
  - URLs in post/comment body text
  - Giphy links, including the `giphy|ID` format
- Download:
  - Parallel downloads using ThreadPoolExecutor
  - Retry logic for failed downloads
  - Deduplication by URL
  - Progress tracking
  - Giphy GIFs are downloaded and saved locally
- Embedding:
  - All downloaded images are embedded as `<img>` tags in HTML
  - Videos embedded as `<video>` tags with controls
  - Preview/thumbnail images included even if not in post body text
  - Giphy links converted to local image references
  - Responsive styling for all media
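The embedding rule amounts to choosing a tag by file type; a minimal sketch, where the extension set and inline styling are assumptions rather than the script's exact output:

```python
import html
from pathlib import Path

VIDEO_EXTS = {".mp4", ".webm", ".mov"}  # assumed set of video extensions

def media_tag(local_path: str) -> str:
    """Render a downloaded media file as a responsive HTML tag."""
    safe = html.escape(local_path, quote=True)
    if Path(local_path).suffix.lower() in VIDEO_EXTS:
        return f'<video src="{safe}" controls style="max-width:100%"></video>'
    return f'<img src="{safe}" style="max-width:100%">'
```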
- Check that input paths are correct
- Ensure JSONL files have the `.jsonl` extension
- Verify file permissions
- Check internet connection
- Some URLs may be expired or require authentication
- Failed downloads are logged but don't stop processing
- Ensure the `downloaded_media/` directory is in the same location as the HTML file
- Check that media files were actually downloaded
- Verify relative paths in the HTML source
- All downloaded files should be embedded as `<img>` tags; check the HTML source to confirm
- Giphy links are automatically extracted and downloaded
- Format: a full giphy link, or just `giphy|ID` in text
- Downloaded giphy files are saved with MD5-based filenames
- If a giphy fails to download, the output falls back to the external URL
This project is licensed under the MIT License - see the LICENSE file for details.