A modular, end-to-end Natural Language Processing (NLP) pipeline for extracting and evaluating treatment response endpoints from Non-Small Cell Lung Cancer (NSCLC) clinical notes. This repository provides data processing, model training, inference, evaluation, and a lightweight web API for demonstration.
- Data Ingestion & Preprocessing: Load raw clinical notes and apply cleaning, tokenization, and formatting for downstream NLP tasks.
- Model Training & Inference: Train custom NER and classification models; trained weights are serialized under `models/`.
- Evaluation Suite: Standard metrics (precision, recall, F1) and custom RECIST-style reporting under `eval/`.
- Reusable Utilities: Helper functions for data loading, metric computation, and result visualization in `utils/`.
- Scripts & Automation: Command-line entry points to run each stage of the pipeline in `scripts/`.
- Demo API: A minimal Flask (or FastAPI) app (`app.py`) that exposes the model as a RESTful service.
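
As a rough illustration of the ingestion and preprocessing step, the sketch below reads JSONL records, normalizes whitespace, optionally lowercases, and splits the text into tokens. It is not the code in `scripts/preprocess.py`: the `text` field name, the `tokens` output field, and the tokenization regex are all assumptions made for this example, and PII removal is deliberately left out.

```python
from __future__ import annotations

import json
import re
from typing import Iterable, Iterator

# Assumed tokenization rule: decimal numbers stay whole ("2.5"),
# words are kept intact, and each punctuation mark is its own token.
_TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def clean_text(text: str, lowercase: bool = True) -> str:
    """Collapse runs of whitespace and optionally lowercase the note."""
    collapsed = " ".join(text.split())
    return collapsed.lower() if lowercase else collapsed


def preprocess_jsonl(lines: Iterable[str], lowercase: bool = True) -> Iterator[dict]:
    """Yield one cleaned, tokenized record per non-empty JSONL line.

    Assumes every record has a "text" field (hypothetical schema).
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        cleaned = clean_text(record["text"], lowercase)
        yield {**record, "text": cleaned, "tokens": _TOKEN_RE.findall(cleaned)}
```

Passing an open file handle as `lines` streams the notes one record at a time, so large note collections do not need to fit in memory.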
.
├── eval/ # Evaluation scripts & reports
│ ├── evaluate.py
│ └── reports/
├── models/ # Trained model checkpoints
│ ├── ner_model.pt
│ └── classifier_model.pt
├── scripts/ # Standalone scripts for each pipeline stage
│ ├── preprocess.py
│ ├── train_ner.py
│ ├── train_classifier.py
│ └── infer.py
├── utils/ # Utility modules
│ ├── data_loader.py
│ ├── metrics.py
│ └── text_processing.py
├── .gitignore
├── README.md
├── requirements.txt # Python dependencies
├── app.py # Demo REST API for inference
└── run_pipeline.py # Orchestrator: runs full pipeline end-to-end
- Clone the repo
    git clone https://github.com/PittNAIL/nlp-lc-recist.git
    cd nlp-lc-recist
- Create & activate a virtual environment
    python3 -m venv venv
    source venv/bin/activate
- Install dependencies
    pip install --upgrade pip
    pip install -r requirements.txt
Use the orchestrator to go from raw data → predictions → evaluation in one command:
python run_pipeline.py \
  --input_dir data/raw/ \
  --output_dir outputs/ \
  --config config/pipeline.yaml
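
For readers curious what an orchestrator of this kind might do internally, here is a minimal sketch that runs stage commands one after another and stops at the first failure. It is an assumption about the general pattern, not the contents of `run_pipeline.py`; the `run_stages` helper and `EXAMPLE_STAGES` list are hypothetical, and the example commands simply reuse the per-stage invocations documented in this README.

```python
from __future__ import annotations

import subprocess
import sys
from typing import Sequence


def run_stages(stages: Sequence[tuple[str, list[str]]]) -> list[str]:
    """Run each (name, argv) stage in order and stop at the first failure.

    Returns the names of the stages that exited with status 0.
    """
    completed: list[str] = []
    for name, argv in stages:
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"stage {name!r} failed with exit code {result.returncode}",
                  file=sys.stderr)
            break
        completed.append(name)
    return completed


# Hypothetical stage list; defined for illustration only and not executed here.
EXAMPLE_STAGES = [
    ("preprocess", [sys.executable, "scripts/preprocess.py",
                    "--input", "data/raw/notes.jsonl",
                    "--output", "data/processed/notes_tok.jsonl"]),
    ("infer", [sys.executable, "scripts/infer.py",
               "--model", "models/ner_model.pt",
               "--input", "data/processed/test.jsonl",
               "--output", "predictions/ner_preds.jsonl"]),
    ("evaluate", [sys.executable, "eval/evaluate.py",
                  "--predictions", "predictions/",
                  "--gold", "data/processed/test_gold.jsonl",
                  "--report", "eval/reports/metrics.json"]),
]
```

Stopping at the first non-zero exit code keeps a failed preprocessing run from feeding stale files into inference or evaluation.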
- Preprocess
    python scripts/preprocess.py \
      --input data/raw/notes.jsonl \
      --output data/processed/notes_tok.jsonl
- Train NER
    python scripts/train_ner.py \
      --train data/processed/train.jsonl \
      --dev data/processed/dev.jsonl \
      --output models/ner_model.pt
- Train Classifier
    python scripts/train_classifier.py \
      --features data/processed/features.npz \
      --labels data/processed/labels.npy \
      --output models/classifier_model.pt
- Inference
    python scripts/infer.py \
      --model models/ner_model.pt \
      --input data/processed/test.jsonl \
      --output predictions/ner_preds.jsonl
- Evaluate
    python eval/evaluate.py \
      --predictions predictions/ \
      --gold data/processed/test_gold.jsonl \
      --report eval/reports/metrics.json
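
To make the evaluation step more concrete, the sketch below computes exact-match precision, recall, and F1 over entity spans. It is one common way to score NER output and is offered as an assumption; `eval/evaluate.py` may use a different matching rule (for example, partial overlap) or per-label averaging. The `span_prf` name and the `(start, end, label)` tuple layout are hypothetical, chosen to mirror the entity fields shown in the API response.

```python
from __future__ import annotations

Span = tuple[int, int, str]  # (start, end, label), an assumed representation


def span_prf(gold: set[Span], pred: set[Span]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over entity spans.

    A predicted span counts as correct only if its start, end, and label
    all match a gold span.
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# One of two gold spans is recovered, and one prediction is spurious.
gold = {(10, 25, "TUMOR_SIZE"), (40, 52, "TUMOR_SIZE")}
pred = {(10, 25, "TUMOR_SIZE"), (60, 70, "TUMOR_SIZE")}
scores = span_prf(gold, pred)  # precision 0.5, recall 0.5, f1 0.5
```

Returning 0.0 when there are no predictions or no gold spans avoids a division by zero on empty notes.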
Start the REST API for on-the-fly inference:
python app.py --host 0.0.0.0 --port 5000

- Endpoint: `POST /predict`
- Payload: `{ "text": "...clinical note text..." }`
- Response: `{ "entities": [ { "start": 10, "end": 25, "label": "TUMOR_SIZE", ... } ], "recist_call": "Stable Disease" }`
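
A small client for the endpoint above might look like the following sketch, which uses only the Python standard library. It assumes the server is reachable at `http://localhost:5000` (matching the `--port 5000` flag) and accepts a JSON body with a `Content-Type: application/json` header; the README does not state the content type, so treat that as an assumption. The helper names `build_predict_request` and `predict` are hypothetical.

```python
import json
import urllib.request


def build_predict_request(
    text: str, base_url: str = "http://localhost:5000"
) -> urllib.request.Request:
    """Build the POST /predict request without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


def predict(
    text: str, base_url: str = "http://localhost:5000", timeout: float = 10.0
) -> dict:
    """Send a note to the demo API and return the decoded JSON response."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the demo server running, `predict("...clinical note text...")` should return a dictionary with the `entities` and `recist_call` keys shown in the response above.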
All hyperparameters and filepaths can be set via the top-level `config/pipeline.yaml`. Example:
preprocessing:
  lowercase: true
  remove_pii: true
ner:
  learning_rate: 3e-5
  batch_size: 16
  epochs: 10
classifier:
  hidden_dim: 256
  dropout: 0.1
paths:
  raw_data: data/raw/
  processed_data: data/processed/
  model_dir: models/
  output_dir: outputs/

- Fork the repo
- Create a feature branch (`git checkout -b feature/my-new-feature`)
- Commit your changes (`git commit -am 'Add feature'`)
- Push to the branch (`git push origin feature/my-new-feature`)
- Open a Pull Request
Please follow the existing code style and add tests where appropriate.
This project is licensed under the MIT License. See the LICENSE file for details.
Sonish Sivarajkumar
– PhD Candidate, PittNAIL Lab, University of Pittsburgh
– [email protected]
Yanshan Wang, PhD
– PittNAIL Lab, University of Pittsburgh
– [email protected]
Feel free to open issues or reach out with questions!