use:
- uv sync
- uv run -m src.main

What it does:
- Downloads the HF dataset `RJT1990/GeneralThoughtArchive` (first 100 rows by default).
- Loads the NER model `Ihor/gliner-biomed-large-v1.0`.
- Scans each question (first 150 characters) and detects biomedical entities across predefined labels.
- Classifies a question as biomedical if:
  - it contains the token "bio" (case-insensitive), or
  - the model predicts at least one entity above the threshold (default 0.80).
- Writes matched records to `Final_Bio_terms.txt`.
- Reconstructs the matching rows from the original HF dataset and saves them to `hf_med_filtered_data/` via `datasets.save_to_disk`.
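
The classification rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in `src/main.py`: the helper name `is_biomedical` and the constant names are hypothetical, `entities` stands for the list of dicts (each with a `score` key) that the NER model returns for one question, and the "bio" check is assumed to be a substring test on the truncated text.

```python
MAX_CHARS = 150   # questions are truncated to this length before scanning
THRESHOLD = 0.80  # default score threshold


def is_biomedical(question, entities, threshold=THRESHOLD):
    """Apply the two rules: a "bio" match, or any entity above the threshold.

    Hypothetical helper: `entities` is a list of dicts with a "score" key.
    """
    text = question[:MAX_CHARS]
    # Rule 1 (assumed to be a case-insensitive substring test).
    if "bio" in text.lower():
        return True
    # Rule 2: at least one predicted entity scores above the threshold.
    return any(e["score"] > threshold for e in entities)
```

With this rule, a question such as "What is BIOlogy?" matches even when the model returns no entities, while a question with a single entity scored 0.5 does not.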

- Entry point: `src/main.py`
  - Seeds and picks device (`cuda` if available).
  - Iterates over `dataset['question']`, truncates to 150 chars, runs `model.predict_entities(...)`.
  - Keeps only `{"label", "score"}` per entity for logging.
  - Aggregates matched questions into `bio_term` and writes one JSON-like dict per line to `Final_Bio_terms.txt`.
  - Calls `filtered_med_data(...)` to map truncated questions back to full HF rows and saves to `hf_med_filtered_data/`.
- NER model: `src/ner_model.py`
  - Loads `Ihor/gliner-biomed-large-v1.0` to CPU/GPU and exposes `model` and `ner_labels`.
- Dataset: `src/hf_datasets.py`
  - Loads split `train` from `RJT1990/GeneralThoughtArchive`.
- Final HF: `src/filtered_data.py`
  - Matches records by truncated question text to reconstruct the original rows.
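
Two of the steps above, trimming each entity to its label and score and mapping truncated questions back to full rows, might look roughly like the sketch below. It is written against plain lists of dicts so it runs without the dataset or the model; the function names `trim_entities` and `match_rows` and the sample rows are hypothetical and are not taken from `src/main.py` or `src/filtered_data.py`.

```python
MAX_CHARS = 150  # truncation length used when scanning questions


def trim_entities(entities):
    """Keep only the label and score of each predicted entity."""
    return [{"label": e["label"], "score": e["score"]} for e in entities]


def match_rows(rows, matched_questions, max_chars=MAX_CHARS):
    """Return the full rows whose truncated question was matched."""
    wanted = set(matched_questions)  # set lookup keeps the scan linear
    return [row for row in rows if row["question"][:max_chars] in wanted]


# Made-up sample rows standing in for the HF dataset.
rows = [
    {"question": "Which enzyme unwinds DNA? " + "x" * 200, "answer": "helicase"},
    {"question": "What is 2 + 2?", "answer": "4"},
]
matched = [rows[0]["question"][:MAX_CHARS]]
full_rows = match_rows(rows, matched)
print(len(full_rows), full_rows[0]["answer"])  # prints: 1 helicase
```

Because the match key is only the first 150 characters, any rows that share the same prefix are all returned together. With a `datasets.Dataset`, the same predicate could be passed to `Dataset.filter` before calling `save_to_disk`.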