NSCLC Treatment-Response Extraction Pipeline

A modular, end-to-end Natural Language Processing (NLP) pipeline for extracting and evaluating treatment response endpoints from Non-Small Cell Lung Cancer (NSCLC) clinical notes. This repository provides data processing, model training, inference, evaluation, and a lightweight web API for demonstration.

Features

Data Ingestion & Preprocessing
Load raw clinical notes, apply cleaning, tokenization, and formatting for downstream NLP tasks.
Model Training & Inference
Train custom NER and classification models; serialize trained weights under models/.
Evaluation Suite
Standard metrics (precision, recall, F1) and custom RECIST-style reporting under eval/.
Reusable Utilities
Helper functions for data loading, metric computation, and result visualization in utils/.
Scripts & Automation
Command-line entry points to run each stage of the pipeline in scripts/.
Demo API
A minimal Flask (or FastAPI) app (app.py) that exposes the model as a RESTful service.

Repository Structure

.
├── eval/                  # Evaluation scripts & reports
│   ├── evaluate.py
│   └── reports/
├── models/                # Trained model checkpoints
│   ├── ner_model.pt
│   └── classifier_model.pt
├── scripts/               # Standalone scripts for each pipeline stage
│   ├── preprocess.py
│   ├── train_ner.py
│   ├── train_classifier.py
│   └── infer.py
├── utils/                 # Utility modules
│   ├── data_loader.py
│   ├── metrics.py
│   └── text_processing.py
├── .gitignore
├── README.md
├── requirements.txt       # Python dependencies
├── app.py                 # Demo REST API for inference
└── run_pipeline.py        # Orchestrator: runs full pipeline end-to-end

Installation

Clone the repo

git clone https://github.com/PittNAIL/nlp-lc-recist.git
cd nlp-lc-recist

Create & activate a virtual environment

python3 -m venv venv
source venv/bin/activate

Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Usage

1. Run Full Pipeline

Use the orchestrator to go from raw data → predictions → evaluation in one command:

python run_pipeline.py \
  --input_dir data/raw/ \
  --output_dir outputs/ \
  --config config/pipeline.yaml
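
The README does not show how run_pipeline.py reads these flags. As a minimal sketch only, a parser that accepts the three flags from the command above could look like this; marking the directories as required and using the example path as the default for `--config` are assumptions, not the repository's actual behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the three flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Run the full pipeline end-to-end.")
    # Making these two flags required is an assumption.
    parser.add_argument("--input_dir", required=True, help="Directory of raw clinical notes")
    parser.add_argument("--output_dir", required=True, help="Directory for predictions and reports")
    # The default is copied from the example command; treating it as a default is an assumption.
    parser.add_argument("--config", default="config/pipeline.yaml", help="Pipeline YAML config")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--input_dir", "data/raw/", "--output_dir", "outputs/"])
print(args.input_dir, args.output_dir, args.config)
```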

2. Individual Stages

Preprocess

python scripts/preprocess.py \
  --input data/raw/notes.jsonl \
  --output data/processed/notes_tok.jsonl
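
The internals of preprocess.py are not documented here. As a rough sketch under stated assumptions, a cleaning and tokenization pass over a JSONL file might look like the following; the `text` and `tokens` field names, the whitespace normalization, and the regex tokenizer are all hypothetical and stand in for whatever the script actually does.

```python
import json
import re

# Hypothetical tokenizer: decimal numbers, then words, then single punctuation marks.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def clean_text(text: str, lowercase: bool = True) -> str:
    """Collapse runs of whitespace and optionally lowercase the note."""
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lowercase else text


def preprocess_line(line: str) -> str:
    """Turn one raw JSONL record into a record with cleaned text and tokens."""
    record = json.loads(line)
    cleaned = clean_text(record["text"])  # the "text" key is an assumption
    record["text"] = cleaned
    record["tokens"] = TOKEN_RE.findall(cleaned)
    return json.dumps(record)


def preprocess_file(src: str, dst: str) -> None:
    """Stream src to dst one record per line, skipping blank lines."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():
                fout.write(preprocess_line(line) + "\n")
```

Processing one line at a time keeps memory use flat regardless of how many notes the input file holds.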

Train NER

python scripts/train_ner.py \
  --train data/processed/train.jsonl \
  --dev data/processed/dev.jsonl \
  --output models/ner_model.pt

Train Classifier

python scripts/train_classifier.py \
  --features data/processed/features.npz \
  --labels data/processed/labels.npy \
  --output models/classifier_model.pt

Inference

python scripts/infer.py \
  --model models/ner_model.pt \
  --input data/processed/test.jsonl \
  --output predictions/ner_preds.jsonl

Evaluate

python eval/evaluate.py \
  --predictions predictions/ \
  --gold data/processed/test_gold.jsonl \
  --report eval/reports/metrics.json
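
For orientation, entity-level precision, recall, and F1 can be computed by exact span matching, as sketched below. This is not taken from eval/evaluate.py; representing an entity as a `(start, end, label)` tuple and requiring an exact match on all three fields are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end, label); this representation is an assumption


def span_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [(10, 25, "TUMOR_SIZE"), (30, 40, "TUMOR_SIZE")]
pred = [(10, 25, "TUMOR_SIZE"), (50, 60, "TUMOR_SIZE")]
print(span_prf(gold, pred))
```

Exact matching is the strictest choice; a span that is off by one character counts as both a false positive and a false negative.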

3. Launch Demo API

Start the REST API for on-the-fly inference:

python app.py --host 0.0.0.0 --port 5000

Endpoint
`POST /predict`
Payload

{ "text": "...clinical note text..." }

Response

{
  "entities": [ { "start": 10, "end": 25, "label": "TUMOR_SIZE", ... } ],
  "recist_call": "Stable Disease"
}
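
A client for this endpoint can be sketched with the standard library alone. The base URL assumes the server was started locally on port 5000 as in the command above, and the `Content-Type: application/json` header is an assumption about what app.py expects; the helper names are hypothetical.

```python
import json
import urllib.request


def build_predict_request(text: str, base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build a POST /predict request carrying the note as a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed to be required
        method="POST",
    )


def predict(text: str, base_url: str = "http://localhost:5000", timeout: float = 30.0) -> dict:
    """Send the note to a running demo API and return the decoded JSON response."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `predict("...clinical note text...")` should return a dictionary with the `entities` and `recist_call` keys shown in the response above.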

Configuration

All hyperparameters and file paths can be set in the top-level `config/pipeline.yaml`. Example:

preprocessing:
  lowercase: true
  remove_pii: true

ner:
  learning_rate: 3e-5
  batch_size: 16
  epochs: 10

classifier:
  hidden_dim: 256
  dropout: 0.1

paths:
  raw_data: data/raw/
  processed_data: data/processed/
  model_dir: models/
  output_dir: outputs/
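
One practical caveat when reading this file: if it is parsed with PyYAML (an assumption; the README does not name the loader), a value written as `3e-5` is resolved as a string rather than a float, because PyYAML's float pattern expects a decimal point (for example `3.0e-5`). The sketch below coerces the `ner` values to their intended types after loading; the `NerSettings` class and `ner_settings` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class NerSettings:
    learning_rate: float
    batch_size: int
    epochs: int


def ner_settings(config: Mapping[str, Any]) -> NerSettings:
    """Coerce the `ner` section to typed values, whatever the YAML loader returned."""
    ner = config["ner"]
    return NerSettings(
        learning_rate=float(ner["learning_rate"]),  # handles both "3e-5" and 3e-05
        batch_size=int(ner["batch_size"]),
        epochs=int(ner["epochs"]),
    )


# What the `ner` section above looks like after PyYAML parsing: note the string.
parsed = {"ner": {"learning_rate": "3e-5", "batch_size": 16, "epochs": 10}}
settings = ner_settings(parsed)
print(settings)
```

Coercing once at load time keeps the type handling out of the training code, which can then rely on `settings.learning_rate` being a float.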

Contributing

Fork the repo
Create a feature branch (`git checkout -b feature/my-new-feature`)
Commit your changes (`git commit -am 'Add feature'`)
Push to the branch (`git push origin feature/my-new-feature`)
Open a Pull Request

Please follow the existing code style and add tests where appropriate.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

Sonish Sivarajkumar, PhD Candidate, PittNAIL Lab, University of Pittsburgh
– [email protected]

Yanshan Wang, PhD, PittNAIL Lab, University of Pittsburgh
– [email protected]

Feel free to open issues or reach out with questions!
