Research Digest Toolkit

Tests · Python 3.9+ · Code style: black · License: MIT

Automated research aggregation for software leadership, innovation, and academic research

🤖 Vibecoded Project: This project was developed through AI-assisted pair programming with Claude (Anthropic). The human provided requirements, domain expertise, and direction; Claude provided implementation, best practices, and code structure.


🎯 What This Is

A complete toolkit for automating research content discovery, aggregation, and organization. Designed for researchers, team leads, and knowledge workers who need to stay current across multiple sources.

Single-command research digest:

./research_digest.py

Automatically discovers, scrapes, and organizes content from:

  • HackerNews discussions
  • RSS feeds (blogs, journals)
  • Reddit communities
  • Twitter/X threads (manual curation)
  • YouTube transcripts

Output: Organized by date, formatted for Obsidian, ready for NotebookLM analysis.


✨ Key Features

  • 10 specialized tools - Each optimized for specific content sources
  • Automated orchestration - One command to rule them all
  • Native tool integration - Uses pandoc & pdftotext for quality
  • Obsidian-ready - YAML frontmatter + auto-tagging
  • NotebookLM-ready - Automatic file splitting at the 400k-character limit (see the sketch after this list)
  • Deduplication - Smart duplicate removal by URL/title
  • Configurable - YAML config for topics, sources, thresholds
  • Schedulable - Cron/systemd examples included
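
As an illustration of the NotebookLM splitting step, here is a minimal sketch of character-based chunking; the function name and paragraph-boundary strategy are assumptions, not file_splitter.py's actual implementation:

NOTEBOOKLM_CHAR_LIMIT = 400_000

def split_text(text: str, limit: int = NOTEBOOKLM_CHAR_LIMIT) -> list[str]:
    """Split text into chunks under the limit, breaking on paragraph boundaries."""
    chunks, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        # +2 accounts for the separator re-inserted on join; this sketch
        # assumes no single paragraph exceeds the limit on its own
        if size + len(paragraph) + 2 > limit and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(paragraph)
        size += len(paragraph) + 2
    if current:
        chunks.append("\n\n".join(current))
    return chunks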

🛠️ The Toolkit

The toolkit is composed of a central orchestrator, a set of scraper plugins, and several standalone utility tools.

Orchestration

  • research_digest.py - Runs the entire pipeline by loading and executing enabled scraper plugins.

Scraper Plugins (in scrapers/ directory)

  • ArxivScraper (arXiv.org) - Fetches scientific pre-prints based on keywords.
  • HNScraper (HackerNews) - Fetches discussions based on keywords and score.
  • RSSScraper (RSS/Atom) - Monitors blog and news feeds.
  • RedditScraper (Reddit) - Fetches posts from specified subreddits.

Utility Tools

  • web_scraper.py - Manually scrape articles from web pages.
  • youtube_transcript.py - Manually download YouTube video transcripts.
  • thread_reader.py - Manually download Twitter/X threads.
  • obsidian_prep.py - Format any text content for Obsidian with auto-tagging (sketched below).
  • file_splitter.py - Split large text files into smaller chunks for NotebookLM.
  • file_converter.py - Convert between document formats (e.g., PDF to text).
  • convert_documents.sh - A wrapper script for higher-quality native document conversion.
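
To illustrate the Obsidian formatting step, a minimal sketch of YAML-frontmatter generation; the function and frontmatter fields are assumptions, not obsidian_prep.py's actual interface:

from datetime import date

def to_obsidian_note(title: str, body: str, tags: list[str], source_url: str) -> str:
    """Wrap raw content in YAML frontmatter so Obsidian can index and tag it."""
    frontmatter = "\n".join([
        "---",
        f"title: {title}",
        f"date: {date.today().isoformat()}",
        f"source: {source_url}",
        "tags: [" + ", ".join(tags) + "]",
        "---",
        "",
    ])
    return frontmatter + body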

🚀 Quick Start

1. Install Dependencies

# Python packages
pip install --user -r requirements.txt

# Native tools (optional but recommended for quality)
sudo dnf install pandoc poppler-utils  # Fedora/RHEL
# or
sudo apt install pandoc poppler-utils  # Debian/Ubuntu

2. Configure Your Scrapers

Edit research_config.yaml to enable and configure your desired scrapers:

# In research_config.yaml
scrapers:
  rss:
    enabled: true
    feeds:
      - url: "https://charity.wtf/feed/"
        name: "Charity Majors"
        tags: ["leadership"]
  hackernews:
    enabled: true
    search_topics: ["engineering culture"]
    min_points: 50
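
Internally, a config like this can be read with PyYAML; a minimal loading sketch (the exact code in research_digest.py may differ):

import yaml

with open("research_config.yaml") as f:
    config = yaml.safe_load(f)

# Iterate the central scrapers block and report what is enabled
for name, settings in config["scrapers"].items():
    if settings.get("enabled"):
        print(f"scraper enabled: {name}")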

3. Run

# Make scripts executable
chmod +x *.py *.sh

# Run the digest
./research_digest.py

# Check results
cat research_digest/$(date +%Y-%m-%d)/REPORT.md

4. Automate (Optional)

# Weekly digest every Monday at 9 AM
crontab -e

# Add this line:
0 9 * * 1 cd /path/to/Scripts && ./research_digest.py

📖 Documentation

  • AUTOMATION_GUIDE.md - Detailed configuration options
  • tests/README.md - Testing documentation
  • .github/CI_SETUP.md - CI/CD documentation

🎓 Use Cases

Software Team Lead

  • Track engineering culture discussions (HN, Twitter, Reddit)
  • Monitor platform engineering trends
  • Aggregate developer productivity insights
  • Auto-organized for weekly review

University Researcher

  • Monitor academic RSS feeds
  • Track higher education discussions
  • Aggregate EdTech trends
  • Format for literature review workflow

Knowledge Worker

  • Curate topic-specific content
  • Build personal knowledge base
  • Feed NotebookLM for synthesis
  • Integrate with Obsidian vault

💡 Example Workflow

Weekly Research Digest

Monday Morning (Automated):

# Cron runs automatically
./research_digest.py

# Outputs to:
research_digest/2024-12-17/
├── raw/           # Original content
├── obsidian/      # Formatted & tagged
└── REPORT.md      # Summary

Review:

# Read summary
cat research_digest/$(date +%Y-%m-%d)/REPORT.md

# Import to Obsidian
cp -r research_digest/$(date +%Y-%m-%d)/obsidian/* \
      ~/Documents/Obsidian/Research/

# Upload to NotebookLM for analysis
# All files in obsidian/ are pre-formatted and split

🏗️ Architecture

Design Philosophy

Modular and Extensible: The toolkit is designed around a plugin architecture. The core orchestrator dynamically loads and runs scraper plugins, making it easy to add new sources without modifying the main application.
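
A minimal sketch of that dynamic loading pattern, assuming a BaseScraper class in scrapers/base.py; this illustrates the approach, not the orchestrator's exact code:

import importlib
import pkgutil

import scrapers
from scrapers.base import BaseScraper  # assumed class name

def discover_plugins() -> list[type]:
    """Import every module under scrapers/ and collect BaseScraper subclasses."""
    plugins = []
    for info in pkgutil.iter_modules(scrapers.__path__):
        module = importlib.import_module(f"scrapers.{info.name}")
        for obj in vars(module).values():
            if (isinstance(obj, type) and issubclass(obj, BaseScraper)
                    and obj is not BaseScraper):
                plugins.append(obj)
    return plugins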

Hybrid Approach: Uses the best tool for each job.

  • Python for APIs, scraping, and orchestration logic.
  • Native tools (pandoc, pdftotext) for high-quality document conversion.
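
A sketch of how the Python side can delegate to those native tools via subprocess; the wrapper names are illustrative, though both commands ship with pandoc and poppler-utils:

import subprocess
from pathlib import Path

def pdf_to_text(pdf_path: Path) -> str:
    """Extract text with poppler's pdftotext; '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def docx_to_markdown(docx_path: Path, out_path: Path) -> None:
    """Convert a .docx file to Markdown with pandoc."""
    subprocess.run(
        ["pandoc", str(docx_path), "-t", "markdown", "-o", str(out_path)],
        check=True,
    )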

Pipeline

1. Discovery & Execution (research_digest.py)
   ├─ Loads `research_config.yaml`
   ├─ Discovers scraper plugins in `scrapers/`
   └─ Runs enabled plugins in sequence

2. Scraping (Plugins)
   ├─ Plugin (e.g., HNScraper) fetches content.
   ├─ Checks `research_digest_state.db` to see if the item is new.
   ├─ If new, saves raw content to `research_digest/DATE/raw/`
   └─ Adds new item's ID to the database.

3. Processing (research_digest.py)
   ├─ Document Conversion (optional, uses native tools)
   ├─ Obsidian Formatting (`obsidian_prep.py`)
   └─ File Splitting (`file_splitter.py`)

4. Output
   ├─ Date-organized folders
   ├─ Clean, tagged markdown in `obsidian/`
   └─ Summary `REPORT.md`
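
The deduplication check in stage 2 can be as simple as a single-table SQLite lookup; a hedged sketch, where the table and column names inside research_digest_state.db are assumptions:

import sqlite3

def is_new_item(db_path: str, item_id: str) -> bool:
    """Return True and record the ID if this item has not been seen before."""
    con = sqlite3.connect(db_path)
    try:
        con.execute("CREATE TABLE IF NOT EXISTS seen (item_id TEXT PRIMARY KEY)")
        if con.execute("SELECT 1 FROM seen WHERE item_id = ?", (item_id,)).fetchone():
            return False
        con.execute("INSERT INTO seen (item_id) VALUES (?)", (item_id,))
        con.commit()
        return True
    finally:
        con.close()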

🔧 Configuration

research_config.yaml is the main configuration file, organized around a central scrapers block.

Key sections:

  • scrapers: Enable, disable, and configure each plugin (e.g., hackernews, rss).
  • topics: Your research keywords for auto-tagging.
  • processing: Enable features like auto-tagging and file splitting.
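
Putting those sections together, a sketch of the overall layout; the keys under topics and processing are assumptions, so check AUTOMATION_GUIDE.md for the real options:

scrapers:
  hackernews:
    enabled: true
    min_points: 50
topics:
  - "engineering culture"
  - "platform engineering"
processing:
  auto_tagging: true     # assumed key name
  file_splitting: true   # assumed key name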

See AUTOMATION_GUIDE.md for detailed configuration options.


🧪 Testing & Quality

Automated Testing

The project ships with 86 automated tests and 89%+ coverage on core modules.

Run tests:

# All tests
pytest tests/

# With coverage report
pytest tests/ --cov=. --cov-report=term-missing

# Quick tests only
pytest tests/ -m "not slow"

Test coverage:

  • database.py - 89% (deduplication logic)
  • utils.py - 83% (filename generation, HTML cleaning)
  • scrapers/base.py - 100% (plugin architecture)
  • 86 total tests across 4 test modules

See tests/README.md for detailed testing documentation.

Continuous Integration

GitHub Actions automatically runs tests on:

  • Every push to main/develop branches
  • All pull requests
  • Weekly scheduled runs (regression testing)

CI workflows:

  • Full Test Suite - Tests on Python 3.9, 3.10, 3.11, 3.12
  • Quick Check - Fast validation for feature branches
  • Security Scan - Dependency vulnerabilities and code security
  • Code Quality - Linting with ruff, black, isort
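
A minimal sketch of what such a matrix workflow can look like in GitHub Actions; the repository's actual files under .github/workflows/ may differ:

name: tests
on:
  push:
    branches: [main, develop]
  pull_request:
  schedule:
    - cron: "0 9 * * 1"   # weekly regression run; the time is illustrative
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -r requirements.txt pytest pytest-cov
      - run: pytest tests/ --cov=.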

See .github/CI_SETUP.md for CI/CD documentation.

Code Quality

  • Linting: Ruff for code quality
  • Formatting: Black for consistent style
  • Security: Bandit for security scanning
  • Dependencies: Dependabot for automatic updates

🤝 Contributing

This is a personal research toolkit, but contributions are welcome!

If you:

  • Add a new source (Mastodon, etc.)
  • Improve auto-tagging for specific domains
  • Add new output formats
  • Fix bugs

Please submit PRs with clear descriptions.


📜 License

MIT License - See LICENSE file


🙏 Acknowledgments

Development

  • Vibecoded with: Claude (Anthropic) - AI pair programming assistant
  • Human: Doug - Domain expertise, requirements, use case definition
  • Approach: Collaborative AI-assisted development

Inspiration

  • Obsidian - Personal knowledge management
  • NotebookLM - AI-powered research synthesis
  • Reddit academic workflows - Community inspiration

Tools

  • Python - Primary language
  • pandoc - Universal document converter
  • poppler-utils - PDF text extraction
  • Various Python libraries (see requirements.txt)

🔬 About Vibecoding

Vibecoding (AI-assisted development) was used extensively in this project:

What the human provided:

  • Problem definition and use case
  • Domain expertise (software leadership, academia)
  • Requirements and feature requests
  • Workflow design
  • Testing and feedback

What Claude provided:

  • Code implementation
  • Best practices and patterns
  • Documentation
  • Error handling and edge cases
  • Tool selection and integration

Result: A production-ready toolkit built faster than solo development, with better code quality through AI-assisted review.


📊 Stats

  • 10 CLI tools - Each with --help documentation
  • 3,500+ lines of production code
  • 86 automated tests - 89%+ coverage on core modules
  • CI/CD pipelines - GitHub Actions for quality assurance
  • Native tool integration - Higher-quality output via pandoc and pdftotext
  • YAML configuration - Easy customization
  • Cron-ready - Set and forget automation
  • Obsidian + NotebookLM - Seamless workflow integration

🚦 Status

Production Ready

All tools are:

  • ✅ Fully functional
  • ✅ Documented
  • ✅ Comprehensively tested (86 automated tests)
  • ✅ CI/CD enabled (GitHub Actions)
  • ✅ Errors handled gracefully
  • ✅ Tested on Linux
  • ✅ Ready for automation

💬 Support

  • Documentation: See guides in this repo
  • Issues: Use GitHub issues for bugs/features
  • Discussion: For general questions about setup

🗺️ Roadmap

Potential future additions:

  • Mastodon/Fediverse integration
  • Google Scholar alerts → RSS
  • Slack/Discord notifications
  • Email digest formatting
  • Web UI for configuration
  • Docker containerization

Built with ❤️ and 🤖 for researchers who want to focus on thinking, not searching.


Last updated: December 2024
Development approach: Vibecoded (AI-assisted)
