🧑‍🍳 Sift

A prep tool for your text-based recipes

sift is a text extraction tool for the command line. Use its search to pinpoint relevant information, or simply extract clean, structured content from URLs, files, or stdin. It is composable, so it slots into data pipelines for your LLM workflows.

✨ Highlights

Smart Content Extraction: Automatically removes HTML, ads, and boilerplate to isolate the main content using Mozilla's Readability algorithm. You can also target specific elements with CSS selectors.
Field-Aware Search: Pinpoint relevant information with a keyword search that understands document structure.
Flexible I/O: Process content from URLs, local files, or standard input with automatic source detection. Output formats include Markdown, plain text, or JSON.
Precise Output Sizing: Control output size with token-level precision for LLM workflows, using the cl100k_base tokenizer, or by word and character counts.
Composable by Design: Built as a native command-line tool, so you can easily pipe content in and chain with other tools to create powerful text processing workflows.

Installation

From Release Binaries

You can download a pre-compiled binary for your operating system from the latest releases.

Go Install

If you have a Go environment set up, you can install sift directly:

go install github.com/chriscorrea/sift/cmd/sift@latest

Quick Start

Sift the main content from a webpage:

sift https://www.recipetineats.com/carrot-cake/

Target specific content with CSS selectors:

sift https://www.recipetineats.com/carrot-cake/ --selector ".wprm-recipe"

Find the most relevant content using keyword search, and limit the output to 200 tokens:

sift https://www.marcuse.org/herbert/pubs/64onedim/odmintro.html --search "technology" -t 200

Chain with other command-line tools, such as slop for LLMs:

sift https://www.recipetineats.com/carrot-cake/ | \
slop --yaml "build a shopping list, organized by aisle"

Usage

Flags

Extraction & Search

Flag	Short	Description
`--search`		Search for keywords and extract relevant context.
`--context-tokens`		Token budget for smart context around search results (default is 200).
`--selector`	`-s`	CSS selector for content extraction.
`--include-all`	`-i`	Include all content without readability filtering.
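
The flags in this table can be combined in a single call. As a minimal sketch, the helper below restricts extraction to one CSS selector and then searches within it. The function name `sift_find` and its argument order are hypothetical (not part of sift), and it is an assumption that `--selector` and `--search` compose this way; `sift --help` is the authority.

```shell
# Hypothetical wrapper (not part of sift): search one region of a page.
# Usage: sift_find SOURCE SELECTOR QUERY [CONTEXT_TOKENS]
sift_find() {
  if [ "$#" -lt 3 ]; then
    echo "usage: sift_find SOURCE SELECTOR QUERY [CONTEXT_TOKENS]" >&2
    return 2
  fi
  # Fall back to the documented default of 200 context tokens.
  sift "$1" --selector "$2" --search "$3" --context-tokens "${4:-200}"
}
```

For example, `sift_find https://www.recipetineats.com/carrot-cake/ ".wprm-recipe" "frosting" 100` would ask for roughly 100 tokens of context around each match inside the recipe card.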

Output Sizing

Flag	Short	Description
`--token-limit`	`-t`	Maximum number of tokens for output (effective default is 2500).
`--word-limit`	`-w`	Maximum number of words for output.
`--character-limit`	`-c`	Maximum number of characters for output.
`--beginning`		Select content from the document's beginning (default).
`--middle`		Select content from the document's middle, expanding outward.
`--end`		Select content from the document's end, working backward.
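
To see how the position flags interact with a size limit, it can help to sample the same document three ways. The sketch below is one way to do that using only flags from the tables in this README. The function name `sift_sample` is hypothetical, and it assumes a position flag can be combined with `--word-limit` in one call.

```shell
# Hypothetical helper (not part of sift): print a short sample from the
# beginning, middle, and end of one source.
# Usage: sift_sample SOURCE [WORDS]
sift_sample() {
  if [ "$#" -lt 1 ]; then
    echo "usage: sift_sample SOURCE [WORDS]" >&2
    return 2
  fi
  sample_src="$1"
  sample_words="${2:-50}"   # illustrative default, not a sift default
  for where in --beginning --middle --end; do
    echo "== ${where#--} =="   # strip the leading "--" for the label
    sift "$sample_src" --word-limit "$sample_words" "$where" --text --quiet
  done
}
```

Running `sift_sample README.md 30` would print three labelled samples of at most 30 words each.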

Formatting & Behavior

Flag	Short	Description
`--md`		Output in Markdown format (default).
`--text`		Output in plain text format.
`--json`		Output in JSON format.
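
When saving output to a file, the format flag can be chosen from the file extension. The helper below is a small sketch of that idea. The name `sift_save` and the extension mapping are hypothetical conveniences, not sift behavior, and the structure of the `--json` output is not described here, so the sketch only writes it to disk.

```shell
# Hypothetical helper (not part of sift): pick the output flag from the
# destination file's extension, then write sift's output there.
# Usage: sift_save SOURCE OUTFILE
sift_save() {
  if [ "$#" -ne 2 ]; then
    echo "usage: sift_save SOURCE OUTFILE" >&2
    return 2
  fi
  case "$2" in
    *.json) save_fmt=--json ;;
    *.txt)  save_fmt=--text ;;
    *)      save_fmt=--md ;;    # Markdown is sift's documented default
  esac
  sift "$1" "$save_fmt" --quiet > "$2"
}
```

For example, `sift_save https://www.recipetineats.com/carrot-cake/ cake.json` would write JSON output, while `cake.md` or any other extension falls back to Markdown.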

Other

Flag	Short	Description
`--quiet`	`-q`	Suppress informational messages and progress spinners.
`--help`	`-h`	Show help information.
