Sift is a text extraction tool for the command line. It's a composable tool for building data pipelines for your LLM workflows.

chriscorrea/sift

🧑‍🍳 Sift


A prep tool for your text-based recipes

sift is a text extraction tool for the command line. Use its search to pinpoint relevant information, or simply extract clean, structured content from URLs, files, or stdin. It's a composable tool for building data pipelines for your LLM workflows.

✨ Highlights

  • Smart Content Extraction: Automatically removes HTML, ads, and boilerplate to isolate the main content using Mozilla's Readability algorithm. You can also target specific elements with CSS selectors.

  • Field-Aware Search: Pinpoint relevant information with a keyword search that understands document structure.

  • Flexible I/O: Process content from URLs, local files, or standard input with automatic source detection. Output formats include Markdown, plain text, or JSON.

  • Precise Output Sizing: Control output size with token-level precision for LLM workflows, using the cl100k_base tokenizer, or by word and character counts.

  • Composable by Design: Built as a native command-line tool, so you can easily pipe content in and chain with other tools to create powerful text processing workflows.

Installation

From Release Binaries

You can download a pre-compiled binary for your operating system from the latest releases page.

Go Install

If you have a Go environment set up, you can install sift directly:

go install github.com/chriscorrea/sift/cmd/sift@latest

Quick Start

Sift the main content from a webpage:

sift https://www.recipetineats.com/carrot-cake/

Target specific content with CSS selectors:

sift https://www.recipetineats.com/carrot-cake/ --selector ".wprm-recipe"

Find the most relevant content using keyword search (and limit to 200 tokens):

sift https://www.marcuse.org/herbert/pubs/64onedim/odmintro.html --search "technology" -t 200

Chain with other command line tools, such as slop for LLMs:

sift https://www.recipetineats.com/carrot-cake/ | \
slop --yaml "build a shopping list, organized by aisle"
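
The same chaining can be done from a script. The following is a minimal sketch, assuming the sift binary is on your PATH and writes its result to stdout (as the pipe example above implies). The helper names `build_sift_command` and `run_sift` are hypothetical and not part of sift; only the flags documented in this README are used.

```python
import subprocess
from typing import List, Optional


def build_sift_command(
    source: str,
    search: Optional[str] = None,
    token_limit: Optional[int] = None,
    output_format: str = "md",
) -> List[str]:
    """Assemble a sift argument list (hypothetical helper, not part of sift)."""
    if output_format not in ("md", "text", "json"):
        raise ValueError(f"unsupported format: {output_format}")
    # --quiet suppresses informational messages and spinners, which suits scripted use.
    cmd = ["sift", source, f"--{output_format}", "--quiet"]
    if search is not None:
        cmd += ["--search", search]
    if token_limit is not None:
        cmd += ["--token-limit", str(token_limit)]
    return cmd


def run_sift(cmd: List[str]) -> str:
    """Run sift and return its stdout; assumes the binary is installed and on PATH."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Example (not executed here, since it needs sift and network access):
# text = run_sift(build_sift_command("https://www.recipetineats.com/carrot-cake/", token_limit=500))
```

Passing the arguments as a list avoids shell quoting problems when the search string contains spaces.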

Usage

Flags

Extraction & Search

| Flag | Short | Description |
| --- | --- | --- |
| --search | | Search for keywords and extract relevant context. |
| --context-tokens | | Token budget for smart context around search results (default is 200). |
| --selector | -s | CSS selector for content extraction. |
| --include-all | -i | Include all content without readability filtering. |
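
The roadmap names BM25 as the ranking method behind --search. To give a feel for how that kind of ranking orders text, here is a rough sketch of plain Okapi BM25 over a list of text chunks. It is not sift's implementation: it ignores document fields entirely, and the tokenizer, the function names, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, chunks: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per chunk for the given query."""
    docs = [tokenize(chunk) for chunk in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0

    # Document frequency: in how many chunks does each term appear?
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and longer chunks are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Chunks that never mention a query term score zero, while repeated mentions raise the score with diminishing returns.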

Output Sizing

| Flag | Short | Description |
| --- | --- | --- |
| --token-limit | -t | Maximum number of tokens for output (effective default is 2500). |
| --word-limit | -w | Maximum number of words for output. |
| --character-limit | -c | Maximum number of characters for output. |
| --beginning | | Select content from the document's beginning (default). |
| --middle | | Select content from the document's middle, expanding outward. |
| --end | | Select content from the document's end, working backward. |
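
To illustrate how the three position flags differ when a limit applies, here is a small sketch that picks a window of words from the beginning, middle, or end of a document. It is an approximation for intuition only: sift may select larger units than single words, and the function name `select_words` is hypothetical.

```python
from typing import List


def select_words(words: List[str], limit: int, strategy: str = "beginning") -> List[str]:
    """Return at most `limit` words using one of three selection strategies."""
    if strategy not in ("beginning", "middle", "end"):
        raise ValueError(f"unknown strategy: {strategy}")
    if limit <= 0:
        return []
    if limit >= len(words):
        return list(words)
    if strategy == "beginning":
        return words[:limit]
    if strategy == "end":
        return words[-limit:]
    # "middle": a window centered on the document, as if expanding outward
    # from the midpoint until the limit is reached.
    start = (len(words) - limit) // 2
    return words[start:start + limit]


words = "a b c d e f g h i j".split()
print(select_words(words, 4, "beginning"))  # ['a', 'b', 'c', 'd']
print(select_words(words, 4, "middle"))     # ['d', 'e', 'f', 'g']
print(select_words(words, 4, "end"))        # ['g', 'h', 'i', 'j']
```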

Formatting & Behavior

| Flag | Short | Description |
| --- | --- | --- |
| --md | | Output in Markdown format (default). |
| --text | | Output in plain text format. |
| --json | | Output in JSON format. |

Other

| Flag | Short | Description |
| --- | --- | --- |
| --quiet | -q | Suppress informational messages and progress spinners. |
| --help | -h | Show help information. |

Contributing

Contributions and issues are welcome – please see the issues page.

License

This project is licensed under the BSD-3 License.

Roadmap

  • Content fetching from multiple sources
  • CSS selector support
  • Multiple output formats (Markdown, text, JSON)
  • Text search with BM25 field-aware text ranking
  • Content deduplication across sources
  • Recursive chunking
  • Streaming content processing for arbitrarily large files
  • Additional tokenizer support beyond cl100k_base
  • Improve smart content extraction through zero-shot classification or other NLP approaches
  • Semantic search via local ONNX embedding model (under evaluation)
