Commit e9f3da6 (parent: df3976d)

Update README and docs

Signed-off-by: Ben Sherman <[email protected]>

3 files changed: +121, -130 lines


Diff for: README.md (+85, -86)
# SQL DB plugin for Nextflow

This plugin provides support for interacting with SQL databases in Nextflow scripts.

The following databases are currently supported:

* [AWS Athena](https://aws.amazon.com/athena/) (Setup guide [here](docs/aws-athena.md))
* [DuckDB](https://duckdb.org/)
* [Google BigQuery](https://cloud.google.com/bigquery) (Setup guide [here](docs/google-bigquery.md))
* [H2](https://www.h2database.com)
* [MySQL](https://www.mysql.com/)
* [MariaDB](https://mariadb.org/)
* [PostgreSQL](https://www.postgresql.org/)
* [SQLite](https://www.sqlite.org/index.html)

NOTE: THIS IS A PREVIEW TECHNOLOGY, FEATURES AND CONFIGURATION SETTINGS CAN CHANGE IN FUTURE RELEASES.

## Getting started

This plugin requires Nextflow `22.08.1-edge` or later. You can enable the plugin by adding the following snippet to your `nextflow.config` file:

```groovy
plugins {
    id 'nf-sqldb'
}
```

Support for BigQuery is provided in a separate plugin:

```groovy
plugins {
    id 'nf-bigquery'
}
```
## Configuration

You can configure any number of databases under the `sql.db` configuration scope. For example:

```groovy
sql {
    db {
        foo {
            url = 'jdbc:mysql://localhost:3306/demo'
            user = 'my-user'
            password = 'my-password'
        }
    }
}
```

The above example defines a database named `foo` that connects to a MySQL server running locally on port 3306, using the `demo` schema, with `my-user` and `my-password` as credentials.

The following options are available:

`sql.db.'<DB-NAME>'.url`
: The database connection URL based on the [JDBC standard](https://docs.oracle.com/javase/tutorial/jdbc/basics/connecting.html#db_connection_url).

`sql.db.'<DB-NAME>'.driver`
: The database driver class name (optional).

`sql.db.'<DB-NAME>'.user`
: The database connection user name.

`sql.db.'<DB-NAME>'.password`
: The database connection password.
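For reference, a JDBC URL like the one above packs the driver scheme, host, port, and schema into one string. The following Python sketch is a hypothetical illustration using only the standard library (the plugin itself hands the URL to the JDBC driver rather than parsing it this way); it shows how the example URL decomposes:

```python
from urllib.parse import urlparse

jdbc_url = "jdbc:mysql://localhost:3306/demo"

# Strip the leading 'jdbc:' prefix, then parse the rest as an ordinary URL.
rest = jdbc_url.split(":", 1)[1]  # 'mysql://localhost:3306/demo'
parsed = urlparse(rest)

print(parsed.scheme)    # mysql
print(parsed.hostname)  # localhost
print(parsed.port)      # 3306
print(parsed.path)      # /demo
```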
## Dataflow Operators

This plugin provides the following dataflow operators for querying from and inserting into database tables.

### fromQuery

The `fromQuery` factory method queries a SQL database and creates a channel that emits a tuple for each row in the corresponding result set. For example:

```nextflow
include { fromQuery } from 'plugin/nf-sqldb'

channel.fromQuery('select alpha, delta, omega from SAMPLE', db: 'foo').view()
```

The following options are available:

`db`
: The database handle. It must be defined under `sql.db` in the Nextflow configuration.

`batchSize`
: Query the data in batches of the given size. This option is recommended for queries that may return a large result set, so that the entire result set is not loaded into memory at once.
: *NOTE:* this feature requires that the underlying SQL database supports `LIMIT` and `OFFSET`.

`emitColumns`
: When `true`, the column names in the `SELECT` statement are emitted as the first tuple in the resulting channel.
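The paging strategy behind `batchSize` can be sketched with plain SQL outside Nextflow. The following Python sketch is a hypothetical illustration using the standard `sqlite3` module (not the plugin's actual implementation); it fetches a result set in batches via `LIMIT` and `OFFSET` so that the full result set is never held by a single query:

```python
import sqlite3

# In-memory database with a SAMPLE table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SAMPLE (alpha INTEGER, delta INTEGER, omega INTEGER)")
conn.executemany("INSERT INTO SAMPLE VALUES (?, ?, ?)",
                 [(i, i * 2, i * 3) for i in range(25)])

def from_query(conn, query, batch_size):
    """Yield rows one at a time, fetching them in LIMIT/OFFSET batches."""
    offset = 0
    while True:
        batch = conn.execute(
            f"{query} LIMIT {batch_size} OFFSET {offset}").fetchall()
        if not batch:
            break
        yield from batch
        offset += batch_size

rows = list(from_query(conn, "SELECT alpha, delta, omega FROM SAMPLE", 10))
print(len(rows))  # 25 rows, fetched in batches of 10
```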
### sqlInsert

The `sqlInsert` operator collects the items in a source channel and inserts them into a SQL database. For example:

```nextflow
include { sqlInsert } from 'plugin/nf-sqldb'

channel
    .of('Hello','world!')
    .map { it -> tuple(it, it.length()) }
    .sqlInsert( into: 'SAMPLE', columns: 'NAME, LEN', db: 'foo' )
```

The above example executes the following SQL statements against the database `foo` (as defined in the Nextflow configuration):

```sql
INSERT INTO SAMPLE (NAME, LEN) VALUES ('HELLO', 5);
INSERT INTO SAMPLE (NAME, LEN) VALUES ('WORLD!', 6);
```

*NOTE:* the target table (e.g. `SAMPLE` in the above example) must be created beforehand.

The following options are available:

`db`
: The database handle. It must be defined under `sql.db` in the Nextflow configuration.

`into`
: The target table for inserting the data.

`columns`
: The database table column names to be filled with the channel data. The order and cardinality of the column names must match the tuple values emitted by the channel. The columns can be specified as a list or as a string of comma-separated values.

`statement`
: The SQL `INSERT` statement to execute, using `?` as a placeholder for the actual values, for example: `insert into SAMPLE(X,Y) values (?,?)`. The `into` and `columns` options are ignored when this option is provided.

`batchSize`
: Insert the data in batches of the given size (default: `10`).

`setup`
: A SQL statement that is executed before inserting the data, e.g. to create the target table.
: *NOTE:* the underlying database should support the *create table if not exist* idiom, as the plugin will execute this statement every time the script is run.
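The interplay of the `setup` statement and `?` placeholders can be sketched with plain SQL outside Nextflow. The following Python sketch is a hypothetical illustration using the standard `sqlite3` module (not the plugin code): it runs a *create table if not exists* setup statement, then a parameterized batch insert over the same tuples as the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 'setup'-style statement: safe to execute on every run.
conn.execute("CREATE TABLE IF NOT EXISTS SAMPLE (NAME TEXT, LEN INTEGER)")

# Channel items as (value, length) tuples, as in the example above.
items = [(s, len(s)) for s in ("Hello", "world!")]

# Parameterized insert with '?' placeholders; executemany batches the values.
conn.executemany("INSERT INTO SAMPLE (NAME, LEN) VALUES (?, ?)", items)

for row in conn.execute("SELECT NAME, LEN FROM SAMPLE"):
    print(row)
# ('Hello', 5)
# ('world!', 6)
```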
## Querying CSV files

This plugin supports the [H2](https://www.h2database.com/html/main.html) database engine, which can query CSV files like database tables using SQL statements.

For example, create a CSV file using the snippet below (middle rows elided in this diff):

```bash
cat <<EOF > test.csv
foo,bar
1,hello
...
EOF
```

Then query it in a Nextflow script:

```nextflow
include { fromQuery } from 'plugin/nf-sqldb'

channel
    .fromQuery("SELECT * FROM CSVREAD('test.csv') where foo>=2;")
    .view()
```

The `CSVREAD` function provided by the H2 database engine allows you to query any CSV file in your filesystem. As shown in the example, you can use standard SQL clauses like `SELECT` and `WHERE` to define your query.
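The same file-as-table idea can be approximated outside H2. The following Python sketch is a hypothetical stand-in using only the standard library (not H2's `CSVREAD`): it loads CSV data into an in-memory SQLite table and applies the same `foo>=2` filter. The sample rows are invented here, since the diff abbreviates the file:

```python
import csv
import io
import sqlite3

# Stand-in for test.csv; rows after 'hello' are invented for illustration.
csv_text = "foo,bar\n1,hello\n2,ciao\n3,hola\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csvfile (foo INTEGER, bar TEXT)")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
conn.executemany("INSERT INTO csvfile VALUES (?, ?)", list(reader))

# Equivalent of: SELECT * FROM CSVREAD('test.csv') where foo>=2
for row in conn.execute("SELECT * FROM csvfile WHERE foo >= 2"):
    print(row)
# (2, 'ciao')
# (3, 'hola')
```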
## Caveats

Like all dataflow operators in Nextflow, the operators provided by this plugin are executed asynchronously.

In particular, data inserted using the `sqlInsert` operator is *not* guaranteed to be available to any subsequent queries using the `fromQuery` operator, as it is not possible to make a channel factory operation dependent on some upstream operation.

Diff for: docs/aws-athena.md (+15, -21)
# AWS Athena integration

## Pre-requisites

...

## Usage

The following example uses the [NCBI SRA Metadata](https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/) as the data source. Refer to the [tutorial from NCBI](https://www.youtube.com/watch?v=_F4FhcDWSJg&ab_channel=TheNationalLibraryofMedicine) for setting up the AWS resources correctly.

### Configuration

Adjust the following configuration to match your setup:

```nextflow config
params {
    aws_glue_db = 'sra-glue-db'
    aws_glue_db_table = 'metadata'
}

plugins {
    id 'nf-sqldb'
}

sql {
    db {
        athena {
            // connection settings elided in this diff
        }
    }
}
```

### Pipeline

Execute the following Nextflow pipeline:

```nextflow
include { fromQuery } from 'plugin/nf-sqldb'

def sqlQuery = """
    SELECT *
    FROM \"${params.aws_glue_db}\".${params.aws_glue_db_table}
    WHERE organism = 'Mycobacterium tuberculosis'
    LIMIT 10;
    """

Channel.fromQuery(sqlQuery, db: 'athena').view()
```

### Output

The pipeline script will print the query results to the console:

```console
[SRR6797500, WGS, SAN RAFFAELE, public, SRX3756197, 131677, Illumina HiSeq 2500, PAIRED, RANDOM, GENOMIC, ILLUMINA, SRS3011891, SAMN08629009, Mycobacterium tuberculosis, SRP128089, 2018-03-02, PRJNA428596, 165, null, 201, 383, null, 131677_WGS, Pathogen.cl, null, uncalculated, uncalculated, null, null, null, bam, sra, s3, s3.us-east-1, {k=assemblyname, v=GCF_000195955.2}, {k=bases, v=383901808}, {k=bytes, v=173931377}, {k=biosample_sam, v=MTB131677}, {k=collected_by_sam, v=missing}, {k=collection_date_sam, v=2010/2014}, {k=host_disease_sam, v=Tuberculosis}, {k=host_sam, v=Homo sapiens}, {k=isolate_sam, v=Clinical isolate18}, {k=isolation_source_sam_ss_dpl262, v=Not applicable}, {k=lat_lon_sam, v=Not collected}, {k=primary_search, v=131677}, {k=primary_search, v=131677_210916_BGD_210916_100.gatk.bam}, {k=primary_search, v=131677_WGS}, {k=primary_search, v=428596}, {k=primary_search, v=8629009}, {k=primary_search, v=PRJNA428596}, {k=primary_search, v=SAMN08629009}, {k=primary_search, v=SRP128089}, {k=primary_search, v=SRR6797500}, {k=primary_search, v=SRS3011891}, {k=primary_search, v=SRX3756197}, {k=primary_search, v=bp0}, {"assemblyname": "GCF_000195955.2", "bases": 383901808, "bytes": 173931377, "biosample_sam": "MTB131677", "collected_by_sam": ["missing"], "collection_date_sam": ["2010/2014"], "host_disease_sam": ["Tuberculosis"], "host_sam": ["Homo sapiens"], "isolate_sam": ["Clinical isolate18"], "isolation_source_sam_ss_dpl262": ["Not applicable"], "lat_lon_sam": ["Not collected"], "primary_search": "131677"}]
```
