This document summarizes the primary classes and functions exposed by the Scrapi Reddit package. Import them from scrapi_reddit unless noted otherwise.
build_session creates a preconfigured requests.Session with Reddit-friendly headers. Set verify=False to bypass TLS validation when testing on restricted networks.
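A minimal usage sketch; the listing URL is only an example, and disabling TLS verification is shown here via the standard requests session attribute rather than assuming a particular build_session argument:

```python
from scrapi_reddit import build_session

session = build_session()
# For restricted test networks only: the equivalent of verify=False.
# session.verify = False

resp = session.get("https://www.reddit.com/r/python/hot.json", params={"limit": 5})
resp.raise_for_status()
print(resp.json()["data"]["children"][0]["data"]["title"])
```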
ScrapeOptions is the configuration dataclass passed to process_listing / process_post.
Key fields (an example construction follows the list):
- output_root (Path): Root directory for all artifacts.
- listing_limit (int | None): Maximum items per listing; None means unlimited.
- comment_limit (int): Maximum comments per post (capped at 500).
- delay (float): Seconds between post fetches; the enforced minimum is 1.0.
- time_filter (str): Default timeframe for "top" listings (hour/day/week/month/year/all).
- output_formats (set[str]): Any combination of {"json", "csv"}.
- fetch_comments (bool): Whether to follow each listing post with a comment pull.
- resume (bool): Reuse cached JSON/media to skip existing work.
- download_media (bool): Save discovered media files under the target's media/ directory.
- media_filters (set[str] | None): Allowed media tokens (categories or extensions).
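A construction sketch, assuming every field above is accepted as a keyword argument (defaults, if any, are not documented here):

```python
from pathlib import Path
from scrapi_reddit import ScrapeOptions

options = ScrapeOptions(
    output_root=Path("./reddit_data"),
    listing_limit=100,          # None for unlimited
    comment_limit=200,          # capped at 500
    delay=1.5,                  # enforced minimum is 1.0
    time_filter="week",
    output_formats={"json", "csv"},
    fetch_comments=True,
    resume=True,                # reuse cached JSON/media
    download_media=False,
    media_filters=None,
)
```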
ListingTarget is a dataclass describing a listing endpoint (subreddit, search query, front page, etc.). Use the helper factories (e.g. _build_targets inside the CLI or build_search_target) to create instances.
Fields (a construction sketch follows the list):
- label: Human-readable name for logs.
- output_segments: Tuple mapping to directories inside output_root.
- url: Fully-qualified Reddit JSON URL to fetch.
- params: Default query parameters for the listing.
- context: Short identifier (e.g. python, search).
- allow_limit: If False, use the server-provided limit unchanged.
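Normally the factories build these for you, but a hand-rolled instance might look like the following sketch; the field names follow the list above, and the URL and segment values are purely illustrative:

```python
from scrapi_reddit import ListingTarget

target = ListingTarget(
    label="r/python hot",
    output_segments=("python", "hot"),
    url="https://www.reddit.com/r/python/hot.json",
    params={"limit": 100},
    context="python",
    allow_limit=True,
)
```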
PostTarget describes a single post (permalink) JSON endpoint. It is useful when you only need a single thread but still want cached JSON/CSV artifacts.
build_search_target is a utility for constructing search listings. It accepts keyword text plus optional type filters, sort order, time filter, and subreddit restriction, and it returns a ListingTarget that plugs into process_listing.
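A usage sketch; the keyword-argument names for the optional filters (subreddit, sort, time_filter) are assumptions based on the description above:

```python
from scrapi_reddit import build_search_target

# Hypothetical parameter names for the optional filters.
target = build_search_target(
    "dataclasses vs attrs",
    subreddit="python",
    sort="top",
    time_filter="month",
)
```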
Fetches a JSON document with retry/backoff on transient failures (rate limits, gateway errors). Raises the original exception after all retries.
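The helper itself is internal, but the retry/backoff behavior it describes looks roughly like this generic sketch; the function name, retry count, and backoff schedule are illustrative and not the library's implementation:

```python
import time
import requests

def fetch_json_with_retries(session: requests.Session, url: str,
                            params: dict | None = None,
                            retries: int = 3, backoff: float = 2.0) -> dict:
    """Generic retry/backoff loop for transient failures (429, 5xx)."""
    last_exc: Exception | None = None
    for attempt in range(retries):
        try:
            resp = session.get(url, params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_exc = exc
            time.sleep(backoff * (attempt + 1))   # back off between attempts
    raise last_exc  # re-raise the original exception after all retries
```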
Pulls the children array out of a listing response, yielding normalized link dictionaries with rank, IDs, permalinks, and JSON URLs ready for downstream processing.
process_listing is the primary worker for listings. It handles pagination, caching, comment fetching, CSV writing, and optional media downloads, and it respects the listing_limit, delay, and resume options.
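Putting the pieces together, a minimal end-to-end run might look like this; the call signature process_listing(target, options) is an assumption, and the ScrapeOptions values mirror the earlier example:

```python
from pathlib import Path
from scrapi_reddit import ScrapeOptions, build_search_target, process_listing

options = ScrapeOptions(
    output_root=Path("./reddit_data"),
    listing_limit=50, comment_limit=100, delay=1.0,
    time_filter="month", output_formats={"json", "csv"},
    fetch_comments=True, resume=True,
    download_media=False, media_filters=None,
)
target = build_search_target("asyncio tutorial", subreddit="python")
process_listing(target, options)  # assumed (target, options) argument order
```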
process_post processes a single post target. It is useful when combining listing data with ad-hoc thread fetches.
rebuild_csv_from_cache recreates posts.csv and comments.csv from previously cached post_jsons plus links.json. It is handy after adjusting the CSV schema or when recovering from an interrupted run.
Helpers that convert nested JSON structures into flat, CSV-ready dictionaries. They are used internally but are available if you want to roll your own writers.
Parses user-provided media filters (category names or extensions) into canonical tokens used by the downloader.
- Network and rate-limit failures bubble up after the configured retry attempts. Wrap top-level calls in a try/except block if you need custom recovery; see the sketch after this list.
- File-system operations (write_csv, media downloads) create parent directories on demand and raise a regular OSError if permissions fail.
- rebuild_csv_from_cache raises FileNotFoundError when cached JSON artifacts are missing. Check target.output_dir(output_root) before calling.
- build_search_target validates sort/type/time options and raises ValueError for unsupported combinations.
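For example, a top-level wrapper might catch the failure modes listed above; the requests exception class is an assumption based on the requests.Session transport, and the (target, options) call signature is assumed as in the earlier sketches:

```python
import requests
from scrapi_reddit import ScrapeOptions, build_search_target, process_listing

def run(query: str, options: ScrapeOptions) -> None:
    try:
        target = build_search_target(query)   # may raise ValueError for bad filters
        process_listing(target, options)      # assumed (target, options) signature
    except ValueError as exc:
        print(f"unsupported search option: {exc}")
    except requests.RequestException as exc:
        print(f"network or rate-limit failure after retries: {exc}")
    except OSError as exc:
        print(f"file-system error while writing artifacts: {exc}")
```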
The package exports the most common helpers at the top level:
```python
from scrapi_reddit import (
    build_session,
    build_search_target,
    ListingTarget,
    PostTarget,
    ScrapeOptions,
    process_listing,
    process_post,
    rebuild_csv_from_cache,
)
```

Refer to the source (scrapi_reddit/core.py) for advanced internals that are not part of the stable API.