Releases: lanl/T-ELF
v0.0.44
v0.0.44 includes:
- New pipeline blocks: spacey NER block, semantic hnmfk block, collect hnmfk leaf block, peacock stats block, beaver codependency matrix block, ocelot filter block, wolf block, and sbatch block, along with supporting helpers such as KernelServer, affiliation and author partitioners, hnmfk paths, and a peacock renderer.
- A full Termite application stack for knowledge graph and vector workflows: Neo4j injection through DataInjector and constants, pluggable embedding stores for Milvus and OpenSearch, vector injection, and the corresponding pipeline blocks termite neo4j block and termite vector block.
- Lynx frontend updates: a document view and a new project data loader.
- Post-install changes: the telf post install console script replaces the Python post install script, with updated examples and HPC flags.
- Preprocessing and factorization refinements, including an Orca update to better handle missing affiliations.
- Versioning and archiving scripts under a new post install package.
- Expanded `.gitignore` to exclude additional caches, example outputs, and backup files.
v0.0.43
v0.0.43 adds the Ocelot sub-module to Cheetah for filtering documents using n-grams; updates AutoBunny to support simpler block-style usage within pipelines; introduces a block for summarizing term matches when applying Ocelot filtering; and includes accompanying example notebooks.
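The core idea of n-gram document filtering can be sketched as below. This is an illustrative toy, not Ocelot's actual implementation; the function names and tokenization are assumptions.

```python
def ngrams(tokens, n):
    """Return the n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def filter_documents(docs, keep_ngrams, n=2):
    """Keep only documents containing at least one of the given n-grams."""
    kept = []
    for doc in docs:
        tokens = doc.lower().split()
        if any(g in keep_ngrams for g in ngrams(tokens, n)):
            kept.append(doc)
    return kept
```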
v0.0.42
Version 0.0.42 introduces a new block-based pipeline framework that allows users to design and execute custom data flows across TELF modules, enabling greater flexibility and experimentation with pipeline configurations. This release also includes Lynx, a lightweight Streamlit-based frontend for visualizing post-processing results, offering a way to explore outcomes from scientific literature analysis and link prediction.
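A block-based pipeline of this kind can be sketched as follows. The class names and interfaces here are illustrative assumptions, not TELF's actual pipeline API; the point is that each block's output feeds the next block's input.

```python
class Block:
    """Base class: a pipeline stage that transforms data."""
    def run(self, data):
        raise NotImplementedError

class Lowercase(Block):
    def run(self, data):
        return [d.lower() for d in data]

class DropShort(Block):
    def __init__(self, min_len):
        self.min_len = min_len
    def run(self, data):
        return [d for d in data if len(d) >= self.min_len]

class Pipeline:
    def __init__(self, blocks):
        self.blocks = blocks
    def run(self, data):
        # Each block's output becomes the next block's input.
        for block in self.blocks:
            data = block.run(data)
        return data
```

Custom flows are then just different orderings and selections of blocks.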
v0.0.41
New Pre-Processing Module: Squirrel 🚀
Version 0.0.41 introduces a new pre-processing module called Squirrel, designed for automated document pruning. Squirrel streamlines the process of accepting or rejecting documents by applying predefined rules and thresholds, eliminating the need for manual review. Squirrel supports the use of multiple pruning strategies. In this release, we include both embedding-based pruning and LLM-based pruning:
- Embedding-Based Pruning: This method filters documents based on their distance from a reference centroid in embedding space. Only documents within a specified threshold are retained, ensuring higher data quality.
- LLM-Based Pruning: Squirrel leverages large language models to further refine the pruning process. It conducts multiple voting trials using LLM evaluations to determine whether a document should be accepted or rejected.
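Embedding-based pruning as described above can be sketched in a few lines. This is a minimal illustration of the centroid-distance idea, not Squirrel's implementation; function names, the choice of cosine distance, and the centroid definition are assumptions.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def prune_by_embedding(doc_vectors, reference_vectors, threshold):
    """Keep indices of documents within `threshold` distance of the reference centroid."""
    c = centroid(reference_vectors)
    return [i for i, v in enumerate(doc_vectors)
            if cosine_distance(v, c) <= threshold]
```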
v0.0.40
🔧 Vulture Enhancements
- Expanded Standard Cleaning Functions: Added the following to Vulture's standard text cleaning pipeline:
  - `remove_numbers`: Removes stand-alone numbers.
  - `remove_alphanumeric`: Removes mixed alphanumeric terms (e.g., `abc123`).
  - `remove_roman_numerals`: Removes Roman numeral listings.
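Cleaners like these are typically small regex substitutions. The sketch below is a naive re-implementation of the described behavior, not Vulture's code; in particular, the Roman-numeral pattern will also strip ordinary words made only of Roman-numeral letters (e.g., "mild", "civil"), an edge case a real implementation would have to handle.

```python
import re

def remove_numbers(text):
    """Remove stand-alone numbers like '12', but not 'abc123'."""
    return re.sub(r"\b\d+\b", " ", text)

def remove_alphanumeric(text):
    """Remove terms mixing letters and digits, such as 'abc123'."""
    return re.sub(r"\b(?=\w*\d)(?=\w*[a-zA-Z])\w+\b", " ", text)

def remove_roman_numerals(text):
    """Remove Roman numeral listings such as 'iv' or 'xii' (naive)."""
    return re.sub(r"\b[ivxlcdm]+\b", " ", text, flags=re.IGNORECASE)
```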
🐆 Cheetah Additions
- `term_generator`:
  - Extracts top keywords from cleaned text using TF-IDF.
  - Pairs keywords with nearby support terms based on a co-occurrence matrix.
  - Saves output as a structured markdown file of search terms.
- `CheetahTermFormatter`:
  - Parses markdown search term files into structured blocks with optional filters (e.g., `positives`, `negatives`).
  - Supports plain string output or category-based filtering.
  - Can generate substitution maps to convert multi-word phrases into underscored versions and back.
- `convert_txt_to_cheetah_markdown`:
  - Converts plain `.txt` files or structured term dictionaries into Cheetah-compatible markdown format.
  - Facilitates easier programmatic creation and editing of search term files.
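The substitution-map idea — turning multi-word phrases into single underscored tokens and back — can be sketched as below. These helper names and the longest-phrase-first ordering are illustrative assumptions, not `CheetahTermFormatter`'s actual API.

```python
def make_substitution_map(phrases):
    """Map multi-word phrases to underscored tokens, plus the inverse map."""
    forward = {p: p.replace(" ", "_") for p in phrases}
    backward = {v: k for k, v in forward.items()}
    return forward, backward

def apply_substitutions(text, mapping):
    # Replace longer phrases first so sub-phrases don't clobber longer ones.
    for phrase in sorted(mapping, key=len, reverse=True):
        text = text.replace(phrase, mapping[phrase])
    return text
```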
🧹 Refactoring and Fixes
- Code Refactoring:
  - Consolidated several duplicated functions across modules into shared helper utilities at a higher level.
- Bug Fixes:
  - Vulture:
    - Fixed path-saving logic in operator pipelines.
    - Fixed bugs in the NER and Vocabulary Consolidator operators.
  - Beaver:
    - Resolved a file-saving issue that also affected Wolf's visualization routines.
  - Fixed the README under examples to have the correct module links.
- `.gitignore` Updates:
  - Added more output files and example notebook directories to `.gitignore`.
📁 New Example: NM Law Data Pipeline
Added the NM Law Data/ folder, containing the data processing pipeline used in the paper:
“Legal Document Analysis with HNMFk” (arXiv:2502.20364)
- `00_data_collection/`: Scrapes and formats legal documents (statutes, constitution, court cases) from Justia.
- `01_hnmfk_operation/`: Constructs document-word matrices and runs Hierarchical Nonnegative Matrix Factorization (HNMFk).
- `02_benchmarking/`: Evaluates LLM-generated content using factual accuracy, entailment, and summarization metrics.
- `03_visualizations/`: Visualizes legal trends, knowledge graphs, and model evaluation results.
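Constructing a document-word matrix, the input to factorizations like HNMFk, can be sketched as a simple count matrix. This is a minimal illustration under the assumption of whitespace tokenization; the pipeline's actual matrix construction (e.g., TF-IDF weighting) is more involved.

```python
def build_doc_word_matrix(docs):
    """Build a document-word count matrix and its sorted vocabulary."""
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for r, doc in enumerate(docs):
        for w in doc.lower().split():
            matrix[r][index[w]] += 1  # rows = documents, columns = words
    return matrix, vocab
```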
v0.0.39
🚀 New Features
New Modules Added
- Fox: Report generation tool for text data from NMFk using OpenAI.
- ArcticFox: Report generation tool for text data from HNMFk using local LLMs.
- SPLIT: Joint NMFk factorization of multiple data via SPLIT.
- SPLITTransfer: Supervised transfer learning method via SPLIT and NMFk.
Beaver Enhancements
- Added support for automatically creating the directory specified by `save_path` when saving objects.
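The behavior described above is typically a one-line `os.makedirs` guard before writing. A minimal sketch (the function name and signature are illustrative, not Beaver's API):

```python
import os

def save_text(save_path, filename, text):
    """Create save_path (including parents) if needed, then write the file."""
    os.makedirs(save_path, exist_ok=True)  # no error if it already exists
    full = os.path.join(save_path, filename)
    with open(full, "w", encoding="utf-8") as f:
        f.write(text)
    return full
```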
🐛 Bug Fixes
Beaver: Highlighting & Vocabulary Logic
- Fixed an issue where tokens used in highlighting but not present in the provided vocabulary would fail to trigger re-vectorization.
- The logic now ensures that the vocabulary is properly expanded and documents are re-vectorized accordingly.
Beaver: Trailing Newline in Output Files
- Fixed a bug where an extra newline was added at the end of output text files such as `Vocabulary.txt`.
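This class of bug usually comes from writing `line + "\n"` for every token; joining the tokens instead avoids the extra final newline. A sketch of the fixed pattern (the function name is illustrative):

```python
def write_vocabulary(path, tokens):
    """Write one token per line, with no trailing newline after the last entry."""
    with open(path, "w", encoding="utf-8") as f:
        # "\n".join places newlines only between tokens, never after the last.
        f.write("\n".join(tokens))
```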
HNMFk: Model Loading Path
- Fixed a bug with incorrect handling of the model path and name when loading an existing model.
Vulture: Module Imports
- Resolved inconsistent module imports.
Conda Installation
- Fixed `.yml` files by adding missing dependencies for proper conda installation.
v0.0.38
New Modules
Adds Penguin, Bunny, Peacock, and SeaLion modules:
- Penguin: Text storage tool.
- Bunny: Dataset generation tool for documents and their citations/references.
- Peacock: Data visualization and generation of actionable statistics.
- SeaLion: Generic report generation tool.
Bugs
- Fixes a query index issue in Cheetah.
v0.0.37
Adds three new modules: Orca and iPenguin for pre-processing text, and Wolf for post-processing text.
- Wolf: Graph centrality and ranking tool.
- iPenguin: Online information retrieval tool for Scopus, SemanticScholar, and OSTI.
- Orca: Duplicate author detector for text mining and information retrieval.
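Duplicate author detection often starts from a name-normalization heuristic like the one sketched below — this is a deliberately naive illustration (last name plus first initial), not Orca's actual algorithm, and all names here are hypothetical.

```python
import re
from collections import defaultdict

def normalize_author(name):
    """Reduce an author name to 'lastname f' form (naive heuristic)."""
    name = re.sub(r"[^\w\s,]", "", name.lower())  # drop punctuation except commas
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
    else:
        parts = name.split()
        last, first = parts[-1], " ".join(parts[:-1])
    return f"{last} {first[:1]}".strip()

def group_possible_duplicates(names):
    """Group names whose normalized forms collide."""
    groups = defaultdict(list)
    for n in names:
        groups[normalize_author(n)].append(n)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Real systems also use affiliations and co-author overlap, since initials collide often.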
v0.0.36
HNMFk graph post-processing & root node naming
- Added the ability to post-process HNMFk graphs based on the number of documents in leaf nodes.
  - New functions:
    - `model.traverse_tiny_leaf_topics(threshold: int)`: Identifies outlier clusters where the number of documents is below the given threshold.
    - `model.get_tiny_leaf_topics()`: Retrieves tiny leaf nodes (processed separately).
    - `model.process_tiny_leaf_topics(threshold: int)`: Processes the graph to separate tiny nodes based on the given threshold.
  - Resetting the graph by setting `threshold=None` restores the tiny nodes.
- Added the option to specify a root node name in HNMFk using `root_node_name="Root"`.
  - The default is now `"Root"` instead of `"*"` to resolve Windows compatibility issues.
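The tiny-leaf separation logic can be sketched on a toy graph representation. The dict-based graph structure and function names below are assumptions for illustration, not HNMFk's internal model:

```python
def find_tiny_leaf_topics(graph, threshold):
    """Return leaf node ids whose document count is below threshold.

    `graph` maps node id -> {'children': [...], 'documents': [...]}.
    """
    return [node_id for node_id, node in graph.items()
            if not node["children"] and len(node["documents"]) < threshold]

def process_tiny_leaf_topics(graph, threshold):
    """Split the graph into (kept nodes, tiny leaves); threshold=None keeps everything."""
    if threshold is None:
        return dict(graph), {}
    tiny_ids = set(find_tiny_leaf_topics(graph, threshold))
    kept = {k: v for k, v in graph.items() if k not in tiny_ids}
    tiny = {k: v for k, v in graph.items() if k in tiny_ids}
    return kept, tiny
```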
Bug(s)
- Fixed a bug in Beaver where mismatched indexes caused incorrect highlighting.
v0.0.35
- Fixes a bug in Cheetah when setting the default to an empty string.
- Adds Logistic Matrix Factorization (LMF).
- Adds developer script to change versioning automatically.
- Updates documentation.