SubjectSpotter

SubjectSpotter is a Ruby module that processes text files, identifies relevant subjects from a predefined list, and annotates them in the output (currently HTML only). It relies on an LLM service (currently OpenAI) to generate the annotated output.

Installation

Run the setup script to install dependencies and configure environment variables, then display the help for the `process` command:

bin/setup
bin/subject_spotter help process

Parameters

Parameter	Description
`text_path`	Path to the text file. Can be a local path or a remote URL.
`subject_listing_path`	Path to the subject listing JSON file. Can be a local path or a remote URL.
`user_context`	Optional string that provides additional context about the text being processed.
`prompt_template`	Either `:template1` or `:template2`. Refer to `/subject_spotter/prompt/templates` for more details.
`llm_service`	Currently supports only `openai`.
`output_format`	Only `html` is supported at the moment.
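
Since `text_path` and `subject_listing_path` accept either a local path or a remote URL, it may help to see how such a value could be resolved. The following is only a sketch: `remote_source?` and `read_source` are hypothetical helper names and are not taken from SubjectSpotter itself, whose loading code may work differently.

```ruby
require "open-uri"
require "uri"

# Hypothetical helpers, not part of SubjectSpotter's documented API.

# Returns true when the value parses as an http:// or https:// URL.
def remote_source?(path)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  URI.parse(path).is_a?(URI::HTTP)
rescue URI::InvalidURIError
  # Values that are not valid URIs (e.g. paths with spaces) are treated as local.
  false
end

# Reads the content from a remote URL or a local file.
def read_source(path)
  if remote_source?(path)
    URI.open(path, &:read)
  else
    File.read(path)
  end
end
```

With helpers like these, `read_source("docs/letter.txt")` would read from disk, while a `https://raw.githubusercontent.com/...` value would be fetched over HTTP.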

Environment Variables for OpenAI

To use OpenAI as the LLM service, set the following environment variables:

OPENAI_ACCESS_TOKEN=your_access_token
OPENAI_MODEL=your_model_name
OPENAI_LOG_ERRORS=true/false
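
Before running the engine, it can be useful to verify that these variables are present. The check below is a minimal sketch under two assumptions that this README does not confirm: that `OPENAI_ACCESS_TOKEN` and `OPENAI_MODEL` are both required, and that `OPENAI_LOG_ERRORS` is optional and defaults to false. The method names are hypothetical.

```ruby
# Assumption: these two variables are required; the README does not say so explicitly.
REQUIRED_OPENAI_VARS = %w[OPENAI_ACCESS_TOKEN OPENAI_MODEL].freeze

# Returns the names of required variables that are unset or blank.
# Accepts any Hash-like object so it can be checked without touching ENV.
def missing_openai_vars(env = ENV)
  REQUIRED_OPENAI_VARS.select { |name| env[name].to_s.strip.empty? }
end

# Interprets OPENAI_LOG_ERRORS as a boolean.
# Assumption: anything other than "true" (case-insensitive) means false.
def log_errors?(env = ENV)
  env.fetch("OPENAI_LOG_ERRORS", "false").strip.casecmp?("true")
end

missing = missing_openai_vars
warn "Missing environment variables: #{missing.join(', ')}" unless missing.empty?
```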

Example Usage

subject_spotter = SubjectSpotter::Engine.new(
  text_path: "https://raw.githubusercontent.com/SubjectSpotter/sample-documents/refs/heads/main/application-from-stephen-mclendon-jesse-wallace-and-w-a-barr-june-2-1864/plaintext/verbatim_transcript.txt",
  subject_listing_path: "https://raw.githubusercontent.com/SubjectSpotter/sample-documents/main/all_subjects.json"
)

# Generate annotated HTML output
subject_spotter.process

# Inspect the generated prompt
subject_spotter.prompt

# Inspect the raw text
subject_spotter.text

# Modify user context dynamically
subject_spotter.user_context = "New context here"

# Changing these parameters invalidates the previous prompt and requires re-processing
subject_spotter.text_path = 'new_text_path'
subject_spotter.subject_listing_path = 'new_subject_listing_path'
subject_spotter.prompt_template = :template2
subject_spotter.llm_service = :openai

# Re-generate output after modifications
subject_spotter.process

# View cached outputs
subject_spotter.cached_outputs

Features

Supports text processing and entity annotation.
Uses OpenAI for entity recognition and formatting.
Provides prompt customization with templates.
Caches previous outputs for reference.

TODO

Support additional LLM services.
Improve output formatting and text cleaning.

Notes

Changing any of the key attributes (text_path, subject_listing_path, prompt_template, llm_service) invalidates the cached prompt and requires re-processing.
Cached outputs are stored in an instance variable and are lost when the object is destroyed.
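
The invalidation behaviour described in these notes can be illustrated with a small stand-alone class. This is a sketch of the general pattern (memoize the prompt, reset it in each setter), not SubjectSpotter's actual implementation; `SketchEngine`, its prompt format, and the fake output are all made up for illustration, and no LLM is called.

```ruby
# Hypothetical illustration of prompt invalidation and in-memory output caching.
class SketchEngine
  attr_reader :text_path, :prompt_template, :cached_outputs

  def initialize(text_path:, prompt_template: :template1)
    @text_path = text_path
    @prompt_template = prompt_template
    @cached_outputs = [] # lives only as long as this object
  end

  # Each setter clears the memoized prompt so it is rebuilt on next use.
  def text_path=(value)
    @text_path = value
    @prompt = nil
  end

  def prompt_template=(value)
    @prompt_template = value
    @prompt = nil
  end

  # Built lazily and reused until one of the setters above resets it.
  def prompt
    @prompt ||= "#{prompt_template}: annotate #{text_path}"
  end

  # Stand-in for the LLM call: wraps the prompt and remembers the result.
  def process
    output = "<p>#{prompt}</p>"
    @cached_outputs << output
    output
  end
end
```

Because the outputs live in an instance variable, anything that should outlive the object would need to be written elsewhere (for example to a file) by the caller.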

For more details, refer to the /subject_spotter directory in the repository.
