    Repositories list

    • sitehound

      Public
      This is the facade for installation and access to the individual components
      Shell
      61530 Updated Jan 22, 2026
    • Scrapy extension which writes crawled items to Kafka
      Python
      83050 Updated Jan 22, 2026
    • privoxy

      Public
      Privoxy HTTP Proxy based on jess/privoxy
      Dockerfile
      2630 Updated Jan 22, 2026
    • Show summary of a large number of URLs in a Jupyter Notebook
      Python
      71730 Updated Jan 22, 2026
    • Scrapy middleware that reads proxy config from settings
      Python
      4430 Updated Jan 22, 2026
    • tor-proxy

      Public
      A Tor SOCKS proxy Docker image
      Dockerfile
      41130 Updated Jan 22, 2026
    • Sitehound's backend
      HTML
      4630 Updated Jan 22, 2026
    • sitehound-frontend

      Public
      Site Hound (previously THH) is a Domain Discovery Tool
      HTML
      92351 Updated Jan 22, 2026
    • program-index

      Public
      A list of memex-related tools and their repository URLs
      57600 Updated Jan 22, 2026
    • scrapy-crawl-once

      Public
      Scrapy middleware that crawls only new content
      Python
      217972 Updated Jan 22, 2026
    • Extract the difference between two HTML pages
      HTML
      43250 Updated Jan 22, 2026
    • fuzzyset

      Public
      A simple fuzzy matching set for Python strings
      Python
      48100 Updated Jan 22, 2026
    • Use multiple proxies with Scrapy
      Python
      160771485 Updated Jan 22, 2026
    • linkdepth

      Public
      [UNMAINTAINED] Scrapy spider to check link depth over time
      Python
      1430 Updated Jan 22, 2026
    • memex-cdr

      Public
      This repository hosts code and schema information related to the Memex Crawl Data Repository (CDR)
      Python
      8100 Updated Jan 22, 2026
    • Log TensorBoard events without touching TensorFlow
      Python
      49628121 Updated Jan 22, 2026
    • THH ↔ deep-deep integration
      Python
      2320 Updated Jan 22, 2026
    • Headless Horseman Page Classifier service
      Python
      4630 Updated Jan 22, 2026
    • html-text

      Public
      Extract text from HTML
      HTML
      22134150 Updated Jan 22, 2026
    • eli5

      Public
      A library for debugging/inspecting machine learning classifiers and explaining their predictions
      Jupyter Notebook
      3282.8k14814 Updated Jan 22, 2026
    • soft404

      Public
      A classifier for detecting soft 404 pages
      Jupyter Notebook
      135861 Updated Jan 22, 2026
    • Publish Scrapy stats to statsd daemon
      Python
      5100 Updated Jan 22, 2026
    • linkrot

      Public
      [UNMAINTAINED] A script (Scrapy spider) to check a list of URLs.
      Jupyter Notebook
      3430 Updated Jan 22, 2026
    • Annotate parts of web pages in the browser
      Python
      31140 Updated Jan 22, 2026
    • Read JSON lines (jl) files, including gzipped and broken
      Python
      83550 Updated Jan 22, 2026
    • Item definition and utilities for storing items in CDR format for Scrapy
      Python
      5730 Updated Jan 22, 2026
    • Broad crawler for domain discovery
      Python
      81950 Updated Jan 22, 2026
    • Python
      27100 Updated Jan 22, 2026
    • A simple tool to add a new user with OpenSSH keys.
      Python
      0330 Updated Jan 22, 2026
    • Broad crawl of onion sites in search for captchas
      Python
      2320 Updated Jan 22, 2026