The goal is to make this a general program for text multi-label classification tasks.

## Usage
### Python Env
``` shell
micromamba env create -f environment.yaml -p ./_pyenv --yes
micromamba activate ./_pyenv
pip install -r requirements.txt
```
### Run Tests
``` shell
python -m pytest ./test --cov=./src/plm_icd_multi_label_classifier --durations=0 -v
```
And the `dict.json` is for bidirectional mapping between label names and IDs:
```
{
    "label2id": {
        "label_0": 0,
        "label_1": 1,
        "label_2": 2,
        ...
        "label_n": n
    },
    "id2label": {
        "0": "label_0",
        "1": "label_1",
        ...
        "n": "label_n"
    }
}
```
**As the label ID will also be used as the index into the one-hot vector, it must start from 0.**
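For illustration, here is a minimal Python sketch (a hypothetical helper, not the repo's actual ETL code) of building such a `dict.json` from a list of label names, with IDs starting from 0:
``` python
import json

# Minimal sketch, not the repo's actual ETL code: build the bidirectional
# label mapping shown above. IDs start from 0 because they double as
# indices into the one-hot label vector.
labels = ["label_0", "label_1", "label_2"]  # placeholder label names

label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {str(idx): label for idx, label in enumerate(labels)}

with open("dict.json", "w") as f:
    json.dump({"label2id": label2id, "id2label": id2label}, f, indent=2)
```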

As the original paper uses MIMIC-III as its dataset, a
[pre-built ETL](https://github.com/innerNULL/PLM-ICD-multi-label-classifier/blob/main/doc/mimic-iii-dataset-etl.md)
is also provided to generate training data from MIMIC-III data.


### Training and Evaluation
``` shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ${TRAIN_CONFIG_JSON_FILE_PATH}
```

The config file format is JSON; most parameters are easy to understand
if you are an MLE/data scientist/researcher:
* `chunk_size`: The number of token IDs in each chunk.
* `chunk_num`: The number of chunks each text/document should have; short texts are padded first.
* `hf_lm`: HuggingFace language model name/path; each `hf_lm` may have a different `lm_hidden_dim`,
* `ckpt_dir`: Checkpoint directory name.
* `log_period`: How many **batches** pass between evaluation log prints.
* `dump_period`: How many **steps** pass between checkpoint dumps.
* `label_splitter`: The separator used to split the concatenated label string into a list of label names.
* `eval.label_confidence_threshold`: Each label's confidence threshold; labels whose confidence is higher are set as positive during evaluation (see the sketch below).
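To make the last two parameters concrete, here is a hedged Python sketch (illustrative names and values, not the repo's actual code) of how a concatenated label string is split with `label_splitter` and how `eval.label_confidence_threshold` turns confidences into positive labels:
``` python
# Illustrative only: shows the intended semantics of `label_splitter` and
# `eval.label_confidence_threshold`; not the repo's actual code.
label_splitter = ","
label_confidence_threshold = 0.5

# Multi-label samples store their labels as one concatenated string.
record = {"text": "this is a fake text.", "label": "label1,label2,label3"}
label_names = record["label"].split(label_splitter)
print(label_names)  # ['label1', 'label2', 'label3']

# During evaluation, a label counts as positive when its confidence
# exceeds the threshold.
confidences = {"label1": 0.91, "label2": 0.42, "label3": 0.77}
positives = [name for name, conf in confidences.items()
             if conf > label_confidence_threshold]
print(positives)  # ['label1', 'label3']
```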

### Inference
``` shell
python inf.py inf.json
```
Most parameter explanations are already in `inf.json`.

### Evaluation
``` shell
python eval.py eval.json
```
Most parameter explanations are already in `eval.json`.


## Examples
### Training Examples
* [ICD10 prediction based on MIMIC-III data](https://github.com/innerNULL/PLM-ICD-multi-label-classifier/blob/main/doc/mimic-iii-train-example.md)

## Other Implementation Details
* After `chunk_size` and `chunk_num` are defined, each text's token ID length is fixed to `chunk_size * chunk_num`; texts that are not long enough are automatically padded first (see the sketch below).
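A hedged sketch of this fixed-length chunking (the function name and the truncation side are assumptions, not the repo's actual code):
``` python
from typing import List

def chunk_token_ids(
    token_ids: List[int], chunk_size: int, chunk_num: int, pad_id: int = 0
) -> List[List[int]]:
    """Pad (at the front) or truncate token IDs to chunk_size * chunk_num,
    then split them into chunk_num chunks of chunk_size tokens each."""
    target_len = chunk_size * chunk_num
    if len(token_ids) < target_len:
        # "padding first": prepend pad tokens for short texts
        token_ids = [pad_id] * (target_len - len(token_ids)) + token_ids
    else:
        token_ids = token_ids[:target_len]  # truncation side is an assumption
    return [token_ids[i * chunk_size:(i + 1) * chunk_size]
            for i in range(chunk_num)]

print(chunk_token_ids([7, 8, 9], chunk_size=2, chunk_num=2))
# [[0, 7], [8, 9]]
```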