Commit 78d12aa

Merge pull request #1812 from lwiklendt/ed-pr
MIMIC-IV-ED buildmimic for duckdb
2 parents c34baed + 527d6a7 commit 78d12aa

2 files changed (+206, -0 lines changed)

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# DuckDB

The script in this folder creates the schema for MIMIC-IV-ED and
loads the data into the appropriate tables for
[DuckDB](https://duckdb.org/).
DuckDB, like SQLite, is serverless and
stores all information in a single file.
Unlike SQLite, an OLTP database,
DuckDB is an OLAP database, and is therefore optimized for analytical queries.
Researchers using MIMIC-IV-ED should generally see faster analytical queries
with DuckDB than with SQLite.
To learn more, please read the ["why duckdb"](https://duckdb.org/docs/why_duckdb)
page.

Loading MIMIC-IV-ED into a DuckDB database
requires only:
1. DuckDB to be installed and
2. Your computer to have a POSIX-compliant terminal shell,
which is available by default on any macOS, Linux, or BSD installation.

To use these instructions on Windows,
you need a Unix command line environment,
which you can obtain by installing either
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
or [Cygwin](https://www.cygwin.com/).

## Set-up

### Quick overview

1. [Install](https://duckdb.org/docs/installation/) the CLI version of DuckDB
2. [Download](https://physionet.org/content/mimic-iv-ed/2.2/) the MIMIC-IV-ED files
3. Create DuckDB database and load data

### Install DuckDB

Follow instructions on their website to
[install](https://duckdb.org/docs/installation/)
the CLI version of DuckDB.

You will need to place the `duckdb` binary in a folder on your environment path,
e.g. `/usr/local/bin`.
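
As a quick sanity check after installing, the following minimal sketch reports whether a `duckdb` executable can be found on your path. The helper name `check_duckdb` is made up for this example and is not part of `import_duckdb.sh`.

```sh
#!/bin/sh
# Sketch: report whether a `duckdb` executable can be found on PATH.
# `check_duckdb` is a hypothetical helper name used only in this example.
check_duckdb () {
    if command -v duckdb >/dev/null 2>&1; then
        echo "duckdb found at: $(command -v duckdb)"
    else
        echo "duckdb not found on PATH; move the binary into a folder listed in PATH" >&2
        return 1
    fi
}

# Run the check, but do not abort the shell if duckdb is missing
check_duckdb || true
```

If the check fails, inspect `echo "$PATH"` and move the binary into one of the listed folders.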

### Download MIMIC-IV-ED files

Download the CSV files for [MIMIC-IV-ED](https://physionet.org/content/mimic-iv-ed/2.2/)
by any method you wish.
These instructions were tested with MIMIC-IV-ED v2.2.

The CSV files should be in a folder structure as follows:

```
mimic_data_dir
    ed
        diagnosis.csv.gz
        ...
        vitalsign.csv.gz
```

The CSV files can be uncompressed (end in `.csv`) or compressed (end in `.csv.gz`).
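
To confirm that a download matches this layout, a small sketch like the one below lists the CSV files found under the `ed` folder. The helper name `list_ed_csvs` is an assumption made for illustration and is not part of this repository.

```sh
#!/bin/sh
# Sketch: list the CSV files found under <mimic_data_dir>/ed.
# `list_ed_csvs` is a hypothetical helper, not part of import_duckdb.sh.
list_ed_csvs () {
    # $1: mimic_data_dir
    if [ ! -d "$1/ed" ]; then
        echo "no 'ed' folder found in \"$1\"" >&2
        return 1
    fi
    # accept both compressed (.csv.gz) and uncompressed (.csv) files
    find "$1/ed" -type f \( -name '*.csv' -o -name '*.csv.gz' \) | sort
}

# Example (replace the path with your own mimic_data_dir):
# list_ed_csvs /path/to/mimic_data_dir
```

An empty listing, or the "no 'ed' folder" message, suggests that `mimic_data_dir` points one level too high or too low.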

The easiest way to download them is to open a terminal and then run:

```
wget -r -N -c -np --user YOURUSERNAME --ask-password https://physionet.org/files/mimic-iv-ed/2.2/
```

Replace `YOURUSERNAME` with your PhysioNet username.

This will make your `mimic_data_dir` be `physionet.org/files/mimic-iv-ed/2.2`.
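
For orientation, `import_duckdb.sh` turns each file path under `mimic_data_dir` into a table name with shell parameter expansion (see `make_table_name` in the script). The sketch below reproduces that mapping as a standalone function; the name `table_name_for` and its local variable names are chosen here for illustration only.

```sh
#!/bin/sh
# Sketch of how a CSV path maps to a table name, mirroring the parameter
# expansions used by make_table_name in import_duckdb.sh.
# `table_name_for` is a hypothetical name used only in this example.
table_name_for () {
    base=${1##*/}        # strip leading directories: vitalsign.csv.gz
    table=${base%%.*}    # strip everything from the first dot: vitalsign
    path=${1%/*}         # strip the file name: .../2.2/ed
    dir=${path##*/}      # keep only the last directory: ed
    echo "mimiciv_$dir.$table"
}

table_name_for physionet.org/files/mimic-iv-ed/2.2/ed/vitalsign.csv.gz
# prints: mimiciv_ed.vitalsign
```

Because the schema name comes from the parent folder, the files need to sit in a folder named `ed` for the script to load them.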

### Create DuckDB database and load data

The last step requires creating a DuckDB database and
loading the data into it.

You can do all of this with one shell script, `import_duckdb.sh`,
located in this repository.
Run it from the folder that contains it, because it reads the table definitions
from `../postgres/create.sql` using a relative path.

See the help for it below:

```sh
$ ./import_duckdb.sh -h
./import_duckdb.sh:
USAGE: ./import_duckdb.sh mimic_data_dir [output_db]
WHERE:
mimic_data_dir directory that contains csv.gz or csv files
output_db: optional filename for duckdb file (default: mimic4_ed.db)
$
```

The script will print out progress as it goes. It should only take a few seconds to load.
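
Once loading finishes, one way to spot-check the result is to count the rows in a single table. The sketch below assumes the default output file name (`mimic4_ed.db`) and uses the `mimiciv_ed.vitalsign` table as an example; the helper names `row_count_sql` and `show_row_count` are hypothetical. It passes SQL to `duckdb` on standard input, the same way `import_duckdb.sh` does.

```sh
#!/bin/sh
# Sketch: print the row count of one loaded table.
# `row_count_sql` and `show_row_count` are hypothetical helper names.
row_count_sql () {
    # $1: schema-qualified table name, e.g. mimiciv_ed.vitalsign
    echo "SELECT COUNT(*) AS n_rows FROM $1;"
}

show_row_count () {
    # $1: database file, $2: schema-qualified table name
    # SQL is passed on standard input, as import_duckdb.sh does
    row_count_sql "$2" | duckdb "$1"
}

# Example (requires duckdb on PATH and a loaded database):
# show_row_count mimic4_ed.db mimiciv_ed.vitalsign
```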

## Help

Please see the [issues page](https://github.com/MIT-LCP/mimic-code/issues) to discuss other issues you may be having.
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
#!/bin/sh

# Copyright (c) 2023 MIT Laboratory for Computational Physiology
# Copyright (c) 2021 Thomas Ward <[email protected]>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

yell () { echo "$0: $*" >&2; }
die () { yell "$*"; exit 111; }
try () { "$@" || die "Exiting. Failed to run: \"$*\""; }

usage () {
    die "
USAGE: ./import_duckdb.sh mimic_data_dir [output_db]
WHERE:
mimic_data_dir directory that contains csv.gz or csv files
output_db: optional filename for duckdb file (default: mimic4_ed.db)\
"
}

# Print help if requested
echo "$0 $* " | grep -Eq " -h | --help " && usage

# rename CLI positional args to more friendly variable names
MIMIC_DIR=$1
# allow optional specification of duckdb name, otherwise default to mimic4_ed.db
OUTFILE=mimic4_ed.db
if [ -n "$2" ]; then
    OUTFILE=$2
fi


# basic error checking before running
if [ -z "$MIMIC_DIR" ]; then
    yell "Please specify a mimic data directory"
    die "Usage: ./import_duckdb.sh mimic_data_dir [output_db]"
elif [ ! -d "$MIMIC_DIR" ]; then
    yell "Specified directory \"$MIMIC_DIR\" does not exist."
    die "Usage: ./import_duckdb.sh mimic_data_dir [output_db]"
elif [ -n "$3" ]; then
    yell "import_duckdb.sh takes a maximum of two arguments."
    die "Usage: ./import_duckdb.sh mimic_data_dir [output_db]"
elif [ -s "$OUTFILE" ]; then
    yell "File \"$OUTFILE\" already exists."
    # 'read -p' is not POSIX, so print the prompt with printf instead
    printf "Continue? (y/d/n) 'y' continues, 'd' deletes original file, 'n' stops: "
    read -r yn
    case $yn in
        [Yy]* ) ;; # OK
        [Nn]* ) exit;;
        [Dd]* ) rm "$OUTFILE";;
        * ) die "Unrecognized input.";;
    esac
fi

# we will copy the postgresql create.sql file, and apply regex
# to fix the following issues:
#   1. Remove optional precision value from TIMESTAMP(NN) -> TIMESTAMP
#      duckdb does not support this.
export REGEX_TIMESTAMP='s/TIMESTAMP\([0-9]+\)/TIMESTAMP/g'

# use sed + above regex to create tables within db
# (-E selects extended regular expressions on both GNU and BSD sed)
sed -E -e "${REGEX_TIMESTAMP}" ../postgres/create.sql |
    duckdb "$OUTFILE"

# goal: get path from find, e.g., ./2.2/ed/vitalsign.csv.gz
# and return database table name for it, e.g., mimiciv_ed.vitalsign
make_table_name () {
    # strip leading directories (e.g., ./ed/hello.csv.gz -> hello.csv.gz)
    BASENAME=${1##*/}
    # strip suffix (e.g., hello.csv.gz -> hello; hello.csv -> hello)
    TABLE_NAME=${BASENAME%%.*}
    # strip basename (e.g., ./ed/hello.csv.gz -> ./ed)
    PATHNAME=${1%/*}
    # strip leading directories from PATHNAME (e.g. ./ed -> ed)
    DIRNAME=${PATHNAME##*/}
    TABLE_NAME="mimiciv_$DIRNAME.$TABLE_NAME"
}


# load data into database
# match both compressed (.csv.gz) and uncompressed (.csv) files
find "$MIMIC_DIR" -type f \( -name '*.csv' -o -name '*.csv.gz' \) | sort | while IFS= read -r FILE; do
    make_table_name "$FILE"

    # skip directories which we do not expect in mimic-iv-ed
    # avoids syntax errors if mimic-iv in the same dir
    case $DIRNAME in
        (ed) ;; # OK
        (*) continue;;
    esac
    echo "Loading $FILE .. "
    try duckdb "$OUTFILE" <<-EOSQL
COPY $TABLE_NAME FROM '$FILE' (HEADER, DELIM ',', QUOTE '"', ESCAPE '"');
EOSQL
    echo "done!"
done && echo "Successfully finished loading data into $OUTFILE."
