This is a Python script that collects posts from the Bluesky firehose and saves them to a JSONL file. This tool is designed to be easy to set up and use, making it accessible for anyone interested in archiving Bluesky posts.
- Collects posts from the Bluesky firehose.
- Saves posts to a JSONL file with details such as text, creation time, author, URI, image presence, and reply information.
- Uses a cache for efficient author handle resolution.
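The author-handle cache mentioned above can be sketched as a memoized lookup. This is a minimal sketch, not the script's actual implementation: the real scraper resolves a DID to a handle over the network, and the static table below is a hypothetical stand-in for that call.

```python
from functools import lru_cache

# Hypothetical stand-in for a network lookup: maps a DID to a handle.
_FAKE_DIRECTORY = {"did:plc:abc123": "alice.bsky.social"}

@lru_cache(maxsize=10_000)
def resolve_handle(did: str) -> str:
    # Only the first call per DID pays the (simulated) lookup cost;
    # repeated calls are served from the cache.
    return _FAKE_DIRECTORY.get(did, "handle.invalid")

resolve_handle("did:plc:abc123")
resolve_handle("did:plc:abc123")
print(resolve_handle.cache_info().hits)  # second call was a cache hit: 1
```

Because the firehose emits many posts from the same authors in a short window, memoizing the resolver avoids repeating the same lookup for every post.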
- Python >=3.11,<3.14
- Poetry for dependency management
- Clone the repository:

  ```bash
  git clone https://github.com/deepfates/bsky-scraper.git
  cd bsky-scraper
  ```

- Install dependencies:

  Use Poetry to install the required packages:

  ```bash
  poetry install
  ```
- Run the script:

  You can start collecting posts by running the script with Poetry:

  ```bash
  poetry run python scrape.py
  ```

  By default, the script collects posts for 30 seconds. You can adjust the duration by modifying the `duration_seconds` parameter in the `start_collection` method.

- Output:

  The collected posts are saved to `bluesky_posts.jsonl` in the project directory. Each line in the file is a JSON object representing a post.
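Since each line of the output file is an independent JSON object, it can be read back one line at a time. The field names in this sketch (`text`, `author`) are assumptions based on the description above, not the script's exact schema.

```python
import json

def load_posts(path: str) -> list[dict]:
    """Read a JSONL file of posts, one JSON object per line."""
    posts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip any blank lines
                posts.append(json.loads(line))
    return posts

# Demo with a small file in the same line-per-record shape.
with open("demo_posts.jsonl", "w", encoding="utf-8") as f:
    f.write('{"text": "hello", "author": "alice.bsky.social"}\n')
    f.write('{"text": "world", "author": "bob.bsky.social"}\n')

posts = load_posts("demo_posts.jsonl")
print(len(posts))  # 2
```

Reading line by line keeps memory use flat even for large archives, which is the usual reason to prefer JSONL over a single JSON array.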
- Output File: You can change the output file by passing a different filename to the `FirehoseScraper` constructor.
- Collection Duration: Modify the `duration_seconds` parameter in the `start_collection` method to change how long the script collects posts.
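Putting both options together, a configured run might look like the sketch below. The import path and exact signatures are assumptions based on the descriptions above; check `scrape.py` for the actual interface.

```python
# Assumed import path: FirehoseScraper defined in scrape.py at the project root.
from scrape import FirehoseScraper

# Write to a custom file instead of the default bluesky_posts.jsonl.
scraper = FirehoseScraper("my_posts.jsonl")

# Collect for 60 seconds instead of the default 30.
scraper.start_collection(duration_seconds=60)
```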
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
For questions or feedback, please contact deepfates on Bluesky.