Commit e849006

Merge pull request #1813 from lwiklendt/note-pr
MIMIC-IV-NOTE buildmimic for duckdb
2 parents 78d12aa + 713ff66 commit e849006

3 files changed: +241 -2 lines changed
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# DuckDB

The script in this folder creates the schema for MIMIC-IV-NOTE and
loads the data into the appropriate tables for
[DuckDB](https://duckdb.org/).
DuckDB, like SQLite, is serverless and
stores all information in a single file.
Unlike SQLite, an OLTP database,
DuckDB is an OLAP database, and therefore optimized for analytical queries.
Researchers querying MIMIC-IV-NOTE should therefore generally see
faster analytical queries with DuckDB than with SQLite.
To learn more, please read their ["why duckdb"](https://duckdb.org/docs/why_duckdb)
page.

Loading MIMIC-IV-NOTE into a DuckDB database
requires only:
1. DuckDB to be installed and
2. Your computer to have a POSIX-compliant terminal shell,
which is available by default on any macOS, Linux, or BSD installation.

To use these instructions on Windows,
you need a Unix command line environment,
which you can obtain by installing either
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
or [Cygwin](https://www.cygwin.com/).

## Set-up

### Quick overview

1. [Install](https://duckdb.org/docs/installation/) the CLI version of DuckDB
2. [Download](https://physionet.org/content/mimic-iv-note/2.2/) the MIMIC-IV-NOTE files
3. Create DuckDB database and load data

### Install DuckDB

Follow the instructions on their website to
[install](https://duckdb.org/docs/installation/)
the CLI version of DuckDB.

You will need to place the `duckdb` binary in a folder on your environment path,
e.g. `/usr/local/bin`.

### Download MIMIC-IV-NOTE files

Download the CSV files for [MIMIC-IV-NOTE](https://physionet.org/content/mimic-iv-note/2.2/)
by any method you wish.
These instructions were tested with MIMIC-IV-NOTE v2.2.

The CSV files should be in a folder structure as follows:

```
mimic_data_dir
    note
        discharge.csv.gz
        ...
        radiology_detail.csv.gz
```

The CSV files can be uncompressed (end in `.csv`) or compressed (end in `.csv.gz`).
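
A quick pre-flight check of this layout can save a failed import. The following is a minimal sketch and not part of the repository: the function name `check_note_layout` is hypothetical, and it only looks for the two files named in the listing above, accepting either the `.csv` or the `.csv.gz` form.

```shell
# Hypothetical helper (not part of the repository): check that a data
# directory has a note/ subfolder containing the two files named above.
check_note_layout () {
    dir=${1%/}                      # trim one trailing slash, if present
    if [ ! -d "$dir/note" ]; then
        echo "missing folder: $dir/note"
        return 1
    fi
    missing=0
    for name in discharge radiology_detail; do
        # accept either the compressed or the uncompressed file
        if [ -f "$dir/note/$name.csv.gz" ] || [ -f "$dir/note/$name.csv" ]; then
            echo "found: $name"
        else
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_note_layout mimic_data_dir` returns a non-zero status if the `note` folder or either file is absent, so it can gate a later import step.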

The easiest way to download them is to open a terminal and run:

```
wget -r -N -c -np --user YOURUSERNAME --ask-password https://physionet.org/files/mimic-iv-note/2.2/
```

Replace `YOURUSERNAME` with your PhysioNet username.

This will make your `mimic_data_dir` be `physionet.org/files/mimic-iv-note/2.2`.

### Create DuckDB database and load data

The last step requires creating a DuckDB database and
loading the data into it.

You can do all of this with one shell script, `import_duckdb.sh`,
located in this repository.

See the help for it below:

```sh
$ ./import_duckdb.sh -h
./import_duckdb.sh:
USAGE: ./import_duckdb.sh mimic_data_dir [output_db]
WHERE:
    mimic_data_dir        directory that contains csv.gz or csv files
    output_db: optional filename for duckdb file (default: mimic4_note.db)
$
```

The script will print out progress as it goes.
Be patient: loading can take minutes to hours,
depending on your computer's configuration.

* It took about a minute in an Ubuntu 24.04 container with duckdb v1.0.0 on a Windows 11 host (8-core i9, 32 GB RAM).

## Help

Please see the [issues page](https://github.com/MIT-LCP/mimic-code/issues) to discuss other issues you may be having.
Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
#!/bin/sh

# Copyright (c) 2023 MIT Laboratory for Computational Physiology
# Copyright (c) 2021 Thomas Ward <[email protected]>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

yell () { echo "$0: $*" >&2; }
die () { yell "$*"; exit 111; }
try () { "$@" || die "Exiting. Failed to run: \"$*\""; }

usage () {
    die "
USAGE: ./import_duckdb.sh mimic_data_dir [output_db]
WHERE:
    mimic_data_dir        directory that contains csv.gz or csv files
    output_db: optional filename for duckdb file (default: mimic4_note.db)\
"
}

# Print help if requested
echo "$0 $* " | grep -Eq " -h | --help " && usage

# rename CLI positional args to more friendly variable names
MIMIC_DIR=$1
# allow optional specification of duckdb name, otherwise default to mimic4_note.db
OUTFILE=mimic4_note.db
if [ -n "$2" ]; then
    OUTFILE=$2
fi


# basic error checking before running
if [ -z "$MIMIC_DIR" ]; then
    yell "Please specify a mimic data directory"
    die "Usage: ./import_duckdb.sh mimic_data_dir [output_db]"
elif [ ! -d "$MIMIC_DIR" ]; then
    yell "Specified directory \"$MIMIC_DIR\" does not exist."
    die "Usage: ./import_duckdb.sh mimic_data_dir [output_db]"
elif [ -n "$3" ]; then
    yell "import_duckdb.sh takes a maximum of two arguments."
    die "Usage: ./import_duckdb.sh mimic_data_dir [output_db]"
elif [ -s "$OUTFILE" ]; then
    yell "File \"$OUTFILE\" already exists."
    # print the prompt with printf: `read -p` is not available in every POSIX sh
    printf "Continue? (y/d/n) 'y' continues, 'd' deletes original file, 'n' stops: "
    read -r yn
    case $yn in
        [Yy]* ) ;; # OK
        [Nn]* ) exit;;
        [Dd]* ) rm "$OUTFILE";;
        * ) die "Unrecognized input.";;
    esac
fi

# trim trailing slash from MIMIC_DIR, if present
MIMIC_DIR=${MIMIC_DIR%/}

# we will copy the postgresql create.sql file, and apply regex
# to fix the following issues:
# 1. Remove optional precision value from TIMESTAMP(NN) -> TIMESTAMP
#    duckdb does not support this.
export REGEX_TIMESTAMP='s/TIMESTAMP\([0-9]+\)/TIMESTAMP/g'

# use sed + above regex to create tables within db
sed -r -e "${REGEX_TIMESTAMP}" ../postgres/create.sql | \
    duckdb "$OUTFILE"

# goal: get path from find, e.g., ./2.2/note/discharge.csv.gz
# and return database table name for it, e.g., mimiciv_note.discharge
make_table_name () {
    # strip leading directories (e.g., ./note/hello.csv.gz -> hello.csv.gz)
    BASENAME=${1##*/}
    # strip suffix (e.g., hello.csv.gz -> hello; hello.csv -> hello)
    TABLE_NAME=${BASENAME%%.*}
    # strip basename (e.g., ./note/hello.csv.gz -> ./note)
    PATHNAME=${1%/*}
    # strip leading directories from PATHNAME (e.g. ./note -> note)
    DIRNAME=${PATHNAME##*/}
    TABLE_NAME="mimiciv_$DIRNAME.$TABLE_NAME"
}


# load data into database
# match both uncompressed (.csv) and compressed (.csv.gz) files
find "$MIMIC_DIR" -type f \( -name '*.csv' -o -name '*.csv.gz' \) | sort | while IFS= read -r FILE; do
    make_table_name "$FILE"

    # skip directories which we do not expect in mimic-iv-note
    # avoids syntax errors if mimic-iv in the same dir
    case $DIRNAME in
        (note) ;; # OK
        (*) continue;
    esac
    echo "Loading $FILE .."
    # heredoc body and terminator are left unindented so the terminator is always recognized
    OUTPUT=$(duckdb "$OUTFILE" 2>&1 <<-EOSQL
COPY $TABLE_NAME FROM '$FILE' (HEADER, DELIM ',', QUOTE '"', ESCAPE '"');
EOSQL
    )
    # If the table is missing in the DB, we emit a warning and continue.
    # Otherwise, the script repeats the error and exits.
    STATUS=$?
    if [ $STATUS -ne 0 ]; then
        echo "$OUTPUT" | grep -qiE 'table .* does not exist' && {
            echo "skipped (table $TABLE_NAME not found)";
            continue;
        }
        yell "Failed loading $FILE into $TABLE_NAME"
        yell "$OUTPUT"
        die "Exiting due to load error."
    fi
    echo "done!"
done && echo "Successfully finished loading data into $OUTFILE."
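
To make the two text transformations in this script concrete, here is a small standalone sketch. It applies the same `sed` expression to a made-up DDL line and derives a table name with the same parameter expansions as `make_table_name`. The helper names `strip_precision` and `table_name_for`, the sample column definition, and the sample path are illustrative assumptions, not taken from the repository or from `create.sql`. `sed -E` is used here instead of `-r` only because both GNU and BSD sed accept it.

```shell
# Same substitution as REGEX_TIMESTAMP in the script.
REGEX_TIMESTAMP='s/TIMESTAMP\([0-9]+\)/TIMESTAMP/g'

# Hypothetical helper: drop the precision from TIMESTAMP(NN) in one line of SQL.
strip_precision () {
    printf '%s\n' "$1" | sed -E -e "$REGEX_TIMESTAMP"
}

# Hypothetical helper: same expansions as make_table_name, but printing the result.
table_name_for () {
    base=${1##*/}        # ./note/discharge.csv.gz -> discharge.csv.gz
    table=${base%%.*}    # discharge.csv.gz -> discharge
    parent=${1%/*}       # ./note/discharge.csv.gz -> ./note
    dir=${parent##*/}    # ./note -> note
    printf 'mimiciv_%s.%s\n' "$dir" "$table"
}

# Illustrative inputs (assumptions, not copied from create.sql).
strip_precision 'charttime TIMESTAMP(3) NOT NULL,'   # prints: charttime TIMESTAMP NOT NULL,
table_name_for './note/discharge.csv.gz'             # prints: mimiciv_note.discharge
```

Because the table name is built from the immediate parent folder, the files must sit directly inside a folder named `note` for the loop's `case $DIRNAME` filter to accept them.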

mimic-iv/buildmimic/duckdb/import_duckdb.sh

Lines changed: 15 additions & 2 deletions
@@ -111,8 +111,21 @@ find "$MIMIC_DIR" -type f -name '*.csv???' | sort | while IFS= read -r FILE; do
         (*) continue;
     esac
     echo "Loading $FILE .. \c"
-    try duckdb "$OUTFILE" <<-EOSQL
-COPY $TABLE_NAME FROM '$FILE' (HEADER);
+    OUTPUT=$(duckdb "$OUTFILE" 2>&1 <<-EOSQL
+COPY $TABLE_NAME FROM '$FILE' (HEADER, DELIM ',', QUOTE '"', ESCAPE '"');
 EOSQL
+    )
+    # If the table is missing in the DB, we emit a warning and continue.
+    # Otherwise, the script repeats the error and exits.
+    STATUS=$?
+    if [ $STATUS -ne 0 ]; then
+        echo "$OUTPUT" | grep -qiE 'table .* does not exist' && {
+            echo "skipped (table $TABLE_NAME not found)";
+            continue;
+        }
+        yell "Failed loading $FILE into $TABLE_NAME"
+        yell "$OUTPUT"
+        die "Exiting due to load error."
+    fi
     echo "done!"
 done && echo "Successfully finished loading data into $OUTFILE."
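
The control flow introduced by this hunk (capture the command's combined output, save its exit status, skip on a missing table, fail on anything else) can be exercised without DuckDB. The sketch below is a stand-in under stated assumptions: `fake_loader` is a hypothetical stub that replaces the `duckdb "$OUTFILE"` call, `load_one` is a hypothetical wrapper, and the error messages are placeholders written for the demonstration rather than captured from DuckDB.

```shell
# Hypothetical stub standing in for `duckdb "$OUTFILE"`; messages are placeholders.
fake_loader () {
    case $1 in
        ok)      echo "loaded"; return 0 ;;
        missing) echo "Catalog Error: Table with name $1 does not exist!" >&2; return 1 ;;
        *)       echo "Parser Error: something else went wrong" >&2; return 1 ;;
    esac
}

# Capture stdout and stderr together, then branch on exit status and message.
load_one () {
    output=$(fake_loader "$1" 2>&1)
    rc=$?                           # exit status of the command substitution
    if [ "$rc" -ne 0 ]; then
        # a missing table is treated as a warning, not an error
        if printf '%s\n' "$output" | grep -qiE 'table .* does not exist'; then
            echo "skipped (table $1 not found)"
            return 0
        fi
        echo "failed: $output" >&2
        return 1
    fi
    echo "done!"
}

load_one ok        # prints: done!
load_one missing   # prints: skipped (table missing not found)
```

Saving `$?` immediately after the assignment matters: any command run in between would overwrite the status of the loader.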
