Releases: lanl/T-ELF
v0.0.44
v0.0.44 includes:
- New pipeline blocks: spacey NER block, semantic hnmfk block, collect hnmfk leaf block, peacock stats block, beaver codependency matrix block, ocelot filter block, wolf block, and sbatch block, along with supporting helpers such as KernelServer, affiliation and author partitioners, hnmfk paths, and a peacock renderer.
- A full Termite application stack for knowledge graph and vector workflows: Neo4j injection through DataInjector and constants, pluggable embedding stores for Milvus and OpenSearch, vector injection, and the corresponding pipeline blocks termite neo4j block and termite vector block.
- Lynx frontend updates: a document view and a new project data loader.
- Post-install changes: the telf post install console script replaces the Python post install script, with updated examples and HPC flags.
- Preprocessing and factorization refinements, including an Orca update to better handle missing affiliations.
- Versioning and archiving scripts under a new post install package.
- Expanded `.gitignore` to exclude additional caches, example outputs, and backup files.
v0.0.43
v0.0.43 adds the Ocelot sub-module to Cheetah for filtering documents using n-grams; updates AutoBunny to support simpler block-style usage within pipelines; introduces a block for summarizing term matches when applying Ocelot filtering; and includes accompanying example notebooks.
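The core idea of n-gram document filtering can be sketched as below. This is an illustrative toy, not Ocelot's actual implementation; the function names and tokenization are assumptions.

```python
def ngrams(tokens, n):
    """Return the n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def filter_documents(docs, keep_ngrams, n=2):
    """Keep only documents containing at least one of the given n-grams."""
    kept = []
    for doc in docs:
        tokens = doc.lower().split()
        if any(g in keep_ngrams for g in ngrams(tokens, n)):
            kept.append(doc)
    return kept
```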
v0.0.42
Version 0.0.42 introduces a new block-based pipeline framework that allows users to design and execute custom data flows across TELF modules, enabling greater flexibility and experimentation with pipeline configurations. This release also includes Lynx, a lightweight Streamlit-based frontend for visualizing post-processing results, offering a way to explore outcomes from scientific literature analysis and link prediction.
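A block-based pipeline of this kind can be sketched as follows. The class names and interfaces here are illustrative assumptions, not TELF's actual pipeline API; the point is that each block's output feeds the next block's input.

```python
class Block:
    """Base class: a pipeline stage that transforms data."""
    def run(self, data):
        raise NotImplementedError

class Lowercase(Block):
    def run(self, data):
        return [d.lower() for d in data]

class DropShort(Block):
    def __init__(self, min_len):
        self.min_len = min_len
    def run(self, data):
        return [d for d in data if len(d) >= self.min_len]

class Pipeline:
    def __init__(self, blocks):
        self.blocks = blocks
    def run(self, data):
        # Each block's output becomes the next block's input.
        for block in self.blocks:
            data = block.run(data)
        return data
```

Custom flows are then just different orderings and selections of blocks.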
v0.0.41
New Pre-Processing Module: Squirrel 🚀
Version 0.0.41 introduces a new pre-processing module called Squirrel, designed for automated document pruning. Squirrel streamlines the process of accepting or rejecting documents by applying predefined rules and thresholds, eliminating the need for manual review. Squirrel supports the use of multiple pruning strategies. In this release, we include both embedding-based pruning and LLM-based pruning:
- Embedding-Based Pruning: This method filters documents based on their distance from a reference centroid in embedding space. Only documents within a specified threshold are retained, ensuring higher data quality.
- LLM-Based Pruning: Squirrel leverages large language models to further refine the pruning process. It conducts multiple voting trials using LLM evaluations to determine whether a document should be accepted or rejected.
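Embedding-based pruning as described above can be sketched in a few lines. This is a minimal illustration of the centroid-distance idea, not Squirrel's implementation; function names, the choice of cosine distance, and the centroid definition are assumptions.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def prune_by_embedding(doc_vectors, reference_vectors, threshold):
    """Keep indices of documents within `threshold` distance of the reference centroid."""
    c = centroid(reference_vectors)
    return [i for i, v in enumerate(doc_vectors)
            if cosine_distance(v, c) <= threshold]
```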
v0.0.40
🔧 Vulture Enhancements
- Expanded Standard Cleaning Functions: Added the following to Vulture's standard text cleaning pipeline:
  - `remove_numbers`: Removes stand-alone numbers.
  - `remove_alphanumeric`: Removes mixed alphanumeric terms (e.g., `abc123`).
  - `remove_roman_numerals`: Removes Roman numeral listings.
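Cleaners like these are typically small regex substitutions. The sketch below is a naive re-implementation of the described behavior, not Vulture's code; in particular, the Roman-numeral pattern will also strip ordinary words made only of Roman-numeral letters (e.g., "mild", "civil"), an edge case a real implementation would have to handle.

```python
import re

def remove_numbers(text):
    """Remove stand-alone numbers like '12', but not 'abc123'."""
    return re.sub(r"\b\d+\b", " ", text)

def remove_alphanumeric(text):
    """Remove terms mixing letters and digits, such as 'abc123'."""
    return re.sub(r"\b(?=\w*\d)(?=\w*[a-zA-Z])\w+\b", " ", text)

def remove_roman_numerals(text):
    """Remove Roman numeral listings such as 'iv' or 'xii' (naive)."""
    return re.sub(r"\b[ivxlcdm]+\b", " ", text, flags=re.IGNORECASE)
```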
🐆 Cheetah Additions
- `term_generator`:
  - Extracts top keywords from cleaned text using TF-IDF.
  - Pairs keywords with nearby support terms based on a co-occurrence matrix.
  - Saves output as a structured markdown file of search terms.
- `CheetahTermFormatter`:
  - Parses markdown search term files into structured blocks with optional filters (e.g., `positives`, `negatives`).
  - Supports plain string output or category-based filtering.
  - Can generate substitution maps to convert multi-word phrases into underscored versions and back.
- `convert_txt_to_cheetah_markdown`:
  - Converts plain `.txt` files or structured term dictionaries into Cheetah-compatible markdown format.
  - Facilitates easier programmatic creation and editing of search term files.
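The substitution-map idea — turning multi-word phrases into single underscored tokens and back — can be sketched as below. These helper names and the longest-phrase-first ordering are illustrative assumptions, not `CheetahTermFormatter`'s actual API.

```python
def make_substitution_map(phrases):
    """Map multi-word phrases to underscored tokens, plus the inverse map."""
    forward = {p: p.replace(" ", "_") for p in phrases}
    backward = {v: k for k, v in forward.items()}
    return forward, backward

def apply_substitutions(text, mapping):
    # Replace longer phrases first so sub-phrases don't clobber longer ones.
    for phrase in sorted(mapping, key=len, reverse=True):
        text = text.replace(phrase, mapping[phrase])
    return text
```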
🧹 Refactoring and Fixes
- Code Refactoring:
  - Consolidated several duplicated functions across modules into shared helper utilities at a higher level.
- Bug Fixes:
  - Vulture:
    - Fixed path-saving logic in operator pipelines.
    - Fixed bugs in the NER and Vocabulary Consolidator operators.
  - Beaver:
    - Resolved a file-saving issue that also affected Wolf's visualization routines.
  - Fixed the README under examples to have the correct module links.
- `.gitignore` Updates:
  - Added more output files and example notebook directories to `.gitignore`.
📁 New Example: NM Law Data Pipeline
Added the NM Law Data/ folder, containing the data processing pipeline used in the paper:
“Legal Document Analysis with HNMFk” (arXiv:2502.20364)
- `00_data_collection/`: Scrapes and formats legal documents (statutes, constitution, court cases) from Justia.
- `01_hnmfk_operation/`: Constructs document-word matrices and runs Hierarchical Nonnegative Matrix Factorization (HNMFk).
- `02_benchmarking/`: Evaluates LLM-generated content using factual accuracy, entailment, and summarization metrics.
- `03_visualizations/`: Visualizes legal trends, knowledge graphs, and model evaluation results.
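Constructing a document-word matrix, the input to factorizations like HNMFk, can be sketched as a simple count matrix. This is a minimal illustration under the assumption of whitespace tokenization; the pipeline's actual matrix construction (e.g., TF-IDF weighting) is more involved.

```python
def build_doc_word_matrix(docs):
    """Build a document-word count matrix and its sorted vocabulary."""
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for r, doc in enumerate(docs):
        for w in doc.lower().split():
            matrix[r][index[w]] += 1  # rows = documents, columns = words
    return matrix, vocab
```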
v0.0.39
🚀 New Features
New Modules Added
- Fox: Report generation tool for text data from NMFk using OpenAI.
- ArcticFox: Report generation tool for text data from HNMFk using local LLMs.
- SPLIT: Joint NMFk factorization of multiple data via SPLIT.
- SPLITTransfer: Supervised transfer learning method via SPLIT and NMFk.
Beaver Enhancements
- Added support for automatically creating the directory specified by `save_path` when saving objects.
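The behavior described above is typically a one-line `os.makedirs` guard before writing. A minimal sketch (the function name and signature are illustrative, not Beaver's API):

```python
import os

def save_text(save_path, filename, text):
    """Create save_path (including parents) if needed, then write the file."""
    os.makedirs(save_path, exist_ok=True)  # no error if it already exists
    full = os.path.join(save_path, filename)
    with open(full, "w", encoding="utf-8") as f:
        f.write(text)
    return full
```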
🐛 Bug Fixes
Beaver: Highlighting & Vocabulary Logic
- Fixed an issue where tokens used in highlighting but not present in the provided vocabulary would fail to trigger re-vectorization.
- The logic now ensures that the vocabulary is properly expanded and documents are re-vectorized accordingly.
Beaver: Trailing Newline in Output Files
- Fixed a bug where an extra newline was added at the end of output text files such as `Vocabulary.txt`.
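This class of bug usually comes from writing `line + "\n"` for every token; joining the tokens instead avoids the extra final newline. A sketch of the fixed pattern (the function name is illustrative):

```python
def write_vocabulary(path, tokens):
    """Write one token per line, with no trailing newline after the last entry."""
    with open(path, "w", encoding="utf-8") as f:
        # "\n".join places newlines only between tokens, never after the last.
        f.write("\n".join(tokens))
```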
HNMFk: Model Loading Path
- Fixed a bug with incorrect handling of the model path and name when loading an existing model.
Vulture: Module Imports
- Resolved inconsistent module imports.
Conda Installation
- Fixed `.yml` files by adding missing dependencies for proper conda installation.
v0.0.38
New Modules
Adds Penguin, Bunny, Peacock, and SeaLion modules:
- Penguin: Text storage tool.
- Bunny: Dataset generation tool for documents and their citations/references.
- Peacock: Data visualization and generation of actionable statistics.
- SeaLion: Generic report generation tool.
Bugs
- Fixes a query index issue in Cheetah.
v0.0.37
Adds three new modules: Orca and iPenguin for pre-processing text, and Wolf for post-processing text.
- Wolf: Graph centrality and ranking tool.
- iPenguin: Online information retrieval tool for Scopus, SemanticScholar, and OSTI.
- Orca: Duplicate author detector for text mining and information retrieval.
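Duplicate author detection often starts from a name-normalization heuristic like the one sketched below — this is a deliberately naive illustration (last name plus first initial), not Orca's actual algorithm, and all names here are hypothetical.

```python
import re
from collections import defaultdict

def normalize_author(name):
    """Reduce an author name to 'lastname f' form (naive heuristic)."""
    name = re.sub(r"[^\w\s,]", "", name.lower())  # drop punctuation except commas
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
    else:
        parts = name.split()
        last, first = parts[-1], " ".join(parts[:-1])
    return f"{last} {first[:1]}".strip()

def group_possible_duplicates(names):
    """Group names whose normalized forms collide."""
    groups = defaultdict(list)
    for n in names:
        groups[normalize_author(n)].append(n)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Real systems also use affiliations and co-author overlap, since initials collide often.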
v0.0.36
HNMFk graph post-processing & root node naming
- Added the ability to post-process HNMFk graphs based on the number of documents in leaf nodes.
  - New functions:
    - `model.traverse_tiny_leaf_topics(threshold: int)`: Identifies outlier clusters where the number of documents is below the given threshold.
    - `model.get_tiny_leaf_topics()`: Retrieves tiny leaf nodes (processed separately).
    - `model.process_tiny_leaf_topics(threshold: int)`: Processes the graph to separate tiny nodes based on the given threshold.
  - Resetting the graph by setting `threshold=None` restores the tiny nodes.
- Added the option to specify a root node name in HNMFk using `root_node_name="Root"`.
  - The default is now `"Root"` instead of `"*"` to resolve Windows compatibility issues.
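The tiny-leaf separation logic can be sketched on a toy graph representation. The dict-based graph structure and function names below are assumptions for illustration, not HNMFk's internal model:

```python
def find_tiny_leaf_topics(graph, threshold):
    """Return leaf node ids whose document count is below threshold.

    `graph` maps node id -> {'children': [...], 'documents': [...]}.
    """
    return [node_id for node_id, node in graph.items()
            if not node["children"] and len(node["documents"]) < threshold]

def process_tiny_leaf_topics(graph, threshold):
    """Split the graph into (kept nodes, tiny leaves); threshold=None keeps everything."""
    if threshold is None:
        return dict(graph), {}
    tiny_ids = set(find_tiny_leaf_topics(graph, threshold))
    kept = {k: v for k, v in graph.items() if k not in tiny_ids}
    tiny = {k: v for k, v in graph.items() if k in tiny_ids}
    return kept, tiny
```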
Bug(s)
- Fixed a bug in Beaver where mismatched indexes caused incorrect highlighting.
v0.0.35
- Fixes a bug in Cheetah when setting the default to an empty string.
- Adds Logistic Matrix Factorization (LMF).
- Adds developer script to change versioning automatically.
- Updates documentation.