Commit df3976d

Merge pull request #13 from nextflow-io/abhinav/google-bigquery
Add Google BigQuery support
2 parents 3a0ac32 + 9f6ad93 commit df3976d

18 files changed, +539 -28 lines changed

Diff for: .github/workflows/build.yml

+1
@@ -42,6 +42,7 @@ jobs:
       run: ./gradlew check
       env:
         GRADLE_OPTS: '-Dorg.gradle.daemon=false'
+        NXF_SMOKE: 1

     - name: Publish
       if: failure()

Diff for: README.md

+10
@@ -12,6 +12,7 @@ The current version provides out-of-the-box support for the following databases:
 * [SQLite](https://www.sqlite.org/index.html)
 * [DuckDB](https://duckdb.org/)
 * [AWS Athena](https://aws.amazon.com/athena/) (Setup guide [here](/docs/aws-athena.md))
+* [Google BigQuery](https://cloud.google.com/bigquery) (Setup guide [here](/docs/google-bigquery.md))

 NOTE: THIS IS A PREVIEW TECHNOLOGY, FEATURES AND CONFIGURATION SETTINGS CAN CHANGE IN FUTURE RELEASES.

@@ -30,6 +31,15 @@ plugins {
 The above declaration allows the use of the SQL plugin functionalities in your Nextflow pipelines.
 See the section below to configure the connection properties with a database instance.

+For BigQuery datasource you need to use the nf-bigquery plugin
+
+```
+plugins {
+
+}
+```
+
+
 ## Configuration

 The target database connection coordinates are specified in the `nextflow.config` file using the

Diff for: docs/aws-athena.md

+3 -3
@@ -22,7 +22,7 @@ params {


 plugins {
-  id 'nf-sqldb@0.5.0'
+  id 'nf-sqldb@0.6.0'
 }


@@ -40,7 +40,7 @@ sql {

 ### Pipeline

-Once the configuration has been setup correctly, you can use it in the Nextlow code as shown below
+Once the configuration has been setup correctly, you can use it in the Nextflow code as shown below

 ```nextflow
 include { fromQuery } from 'plugin/nf-sqldb'
@@ -59,7 +59,7 @@ Channel.fromQuery(sqlQuery, db: 'athena')

 ### Output

-When you execute the above code, you'll see the AWS Athena query results on the console
+When you execute the above code, you'll see the query results on the console

 ```console
[SRR6797500, WGS, SAN RAFFAELE, public, SRX3756197, 131677, Illumina HiSeq 2500, PAIRED, RANDOM, GENOMIC, ILLUMINA, SRS3011891, SAMN08629009, Mycobacterium tuberculosis, SRP128089, 2018-03-02, PRJNA428596, 165, null, 201, 383, null, 131677_WGS, Pathogen.cl, null, uncalculated, uncalculated, null, null, null, bam, sra, s3, s3.us-east-1, {k=assemblyname, v=GCF_000195955.2}, {k=bases, v=383901808}, {k=bytes, v=173931377}, {k=biosample_sam, v=MTB131677}, {k=collected_by_sam, v=missing}, {k=collection_date_sam, v=2010/2014}, {k=host_disease_sam, v=Tuberculosis}, {k=host_sam, v=Homo sapiens}, {k=isolate_sam, v=Clinical isolate18}, {k=isolation_source_sam_ss_dpl262, v=Not applicable}, {k=lat_lon_sam, v=Not collected}, {k=primary_search, v=131677}, {k=primary_search, v=131677_210916_BGD_210916_100.gatk.bam}, {k=primary_search, v=131677_WGS}, {k=primary_search, v=428596}, {k=primary_search, v=8629009}, {k=primary_search, v=PRJNA428596}, {k=primary_search, v=SAMN08629009}, {k=primary_search, v=SRP128089}, {k=primary_search, v=SRR6797500}, {k=primary_search, v=SRS3011891}, {k=primary_search, v=SRX3756197}, {k=primary_search, v=bp0}, {"assemblyname": "GCF_000195955.2", "bases": 383901808, "bytes": 173931377, "biosample_sam": "MTB131677", "collected_by_sam": ["missing"], "collection_date_sam": ["2010/2014"], "host_disease_sam": ["Tuberculosis"], "host_sam": ["Homo sapiens"], "isolate_sam": ["Clinical isolate18"], "isolation_source_sam_ss_dpl262": ["Not applicable"], "lat_lon_sam": ["Not collected"], "primary_search": "131677"}]

Diff for: docs/google-bigquery.md

+65
@@ -0,0 +1,65 @@
# Google BigQuery integration setup

## Pre-requisites

1. A Google Cloud project with BigQuery APIs enabled
2. A service account with sufficient permissions

## Usage

In the example below, it is assumed that the [NCBI SRA Metadata](https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery/) has been used as the data source. You can refer the official [NCBI docs](https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery/) for setting up the `nih-sra-datastore` within your BigQuery console.

*NOTE*: For Google BiqQuery you do not need to specify the `user` and `password` fields as these are provided by your service account credentials JSON file.

### Configuration

```nextflow config
//NOTE: Replace the values in the config file as per your setup

params {
  google_bigquery_db = "nih-sra-datastore.sra.metadata"
  google_project_id = "<YOUR_GOOGLE_PROJECT_ID>"
  google_service_account_email = "<YOUR_GOOGLE_SERVICE_ACCOUNT_EMAIL>"
  google_service_account_key = "<YOUR_GOOGLE_SERVICE_ACCOUNT_KEY_LOCATION>"
}

plugins {

}

sql {
  db {
    googlebigquery {
      url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=${params.google_project_id};OAuthType=0;OAuthServiceAcctEmail=${params.google_service_account_email};OAuthPvtKeyPath=${params.google_service_account_key};"
    }
  }
}
```

### Pipeline

Once the configuration has been setup correctly, you can use it in the Nextflow code as shown below

```nextflow
include { fromQuery } from 'plugin/nf-bigquery'

def googleSqlQuery = """
  SELECT *
  FROM `nih-sra-datastore.sra.metadata`
  WHERE organism = 'Mycobacterium tuberculosis'
  AND bioproject = 'PRJNA670836'
  LIMIT 2;
"""

Channel.fromQuery(googleSqlQuery, db: 'googlebigquery')
  .view()

```

### Output

When you execute the above code, you'll see the query results on the console

```console
[SRR6797500, WGS, SAN RAFFAELE, public, SRX3756197, 131677, Illumina HiSeq 2500, PAIRED, RANDOM, GENOMIC, ILLUMINA, SRS3011891, SAMN08629009, Mycobacterium tuberculosis, SRP128089, 2018-03-02, PRJNA428596, 165, null, 201, 383, null, 131677_WGS, Pathogen.cl, null, uncalculated, uncalculated, null, null, null, bam, sra, s3, s3.us-east-1, {k=assemblyname, v=GCF_000195955.2}, {k=bases, v=383901808}, {k=bytes, v=173931377}, {k=biosample_sam, v=MTB131677}, {k=collected_by_sam, v=missing}, {k=collection_date_sam, v=2010/2014}, {k=host_disease_sam, v=Tuberculosis}, {k=host_sam, v=Homo sapiens}, {k=isolate_sam, v=Clinical isolate18}, {k=isolation_source_sam_ss_dpl262, v=Not applicable}, {k=lat_lon_sam, v=Not collected}, {k=primary_search, v=131677}, {k=primary_search, v=131677_210916_BGD_210916_100.gatk.bam}, {k=primary_search, v=131677_WGS}, {k=primary_search, v=428596}, {k=primary_search, v=8629009}, {k=primary_search, v=PRJNA428596}, {k=primary_search, v=SAMN08629009}, {k=primary_search, v=SRP128089}, {k=primary_search, v=SRR6797500}, {k=primary_search, v=SRS3011891}, {k=primary_search, v=SRX3756197}, {k=primary_search, v=bp0}, {"assemblyname": "GCF_000195955.2", "bases": 383901808, "bytes": 173931377, "biosample_sam": "MTB131677", "collected_by_sam": ["missing"], "collection_date_sam": ["2010/2014"], "host_disease_sam": ["Tuberculosis"], "host_sam": ["Homo sapiens"], "isolate_sam": ["Clinical isolate18"], "isolation_source_sam_ss_dpl262": ["Not applicable"], "lat_lon_sam": ["Not collected"], "primary_search": "131677"}]
```
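
The `url` in the configuration above is a fixed endpoint followed by semicolon-separated `Key=Value` properties. As a rough illustration of that layout, here is a minimal Java sketch that assembles the same string from its three variable parts. The class `BigQueryJdbcUrl` and its methods are hypothetical helpers written for this note, not part of the plugin, and the rejection of `;` in values is an added assumption to keep one value from spilling into the next property.

```java
// Hypothetical helper (not part of nf-bigquery): builds the JDBC URL layout
// used in the configuration example above.
final class BigQueryJdbcUrl {

    // Endpoint copied from the configuration example.
    private static final String ENDPOINT =
            "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443";

    // Assumption: values must be non-empty and must not contain ';',
    // because ';' separates the connection properties.
    private static String requireValue(String name, String value) {
        if (value == null || value.trim().isEmpty() || value.contains(";")) {
            throw new IllegalArgumentException("Invalid value for " + name);
        }
        return value;
    }

    static String build(String projectId, String serviceAccountEmail, String keyPath) {
        return ENDPOINT
                + ";ProjectId=" + requireValue("ProjectId", projectId)
                + ";OAuthType=0" // value taken from the configuration example
                + ";OAuthServiceAcctEmail=" + requireValue("OAuthServiceAcctEmail", serviceAccountEmail)
                + ";OAuthPvtKeyPath=" + requireValue("OAuthPvtKeyPath", keyPath)
                + ";";
    }

    public static void main(String[] args) {
        // Placeholder values, for illustration only.
        System.out.println(build("my-project",
                "reader@my-project.iam.gserviceaccount.com",
                "/path/to/key.json"));
    }
}
```

In a pipeline the same string would normally come from the `${params...}` interpolation in `nextflow.config`; the helper only makes the structure explicit.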

Diff for: plugins/nf-bigquery/build.gradle

+140
@@ -0,0 +1,140 @@
/*
 * Copyright 2020-2022, Seqera Labs
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

plugins {
    id 'java-library'
    id 'groovy'
    id 'idea'
    id 'de.undercouch.download' version '4.1.2'
}

group = 'io.nextflow'
// DO NOT SET THE VERSION HERE
// THE VERSION FOR PLUGINS IS DEFINED IN THE `/resources/META-INF/MANIFEST.NF` file
java {
    toolchain {
        languageVersion = JavaLanguageVersion.of(11)
    }
}

idea {
    module.inheritOutputDirs = true
}

repositories {
    mavenCentral()
    maven { url = 'https://s3-eu-west-1.amazonaws.com/maven.seqera.io/releases' }
    maven { url = 'https://s3-eu-west-1.amazonaws.com/maven.seqera.io/snapshots' }
}

configurations {
    // see https://docs.gradle.org/4.1/userguide/dependency_management.html#sub:exclude_transitive_dependencies
    runtimeClasspath.exclude group: 'org.slf4j', module: 'slf4j-api'
}

sourceSets {
    main.java.srcDirs = []
    main.groovy.srcDirs = ['src/main']
    main.resources.srcDirs = ['src/resources']
    test.groovy.srcDirs = ['src/test']
    test.java.srcDirs = []
    test.resources.srcDirs = []
}

ext{
    nextflowVersion = '22.08.1-edge'
}

dependencies {
    compileOnly "io.nextflow:nextflow:$nextflowVersion"
    compileOnly 'org.slf4j:slf4j-api:1.7.10'
    compileOnly 'org.pf4j:pf4j:3.4.1'

    api("org.codehaus.groovy:groovy-sql:3.0.10") { transitive = false }

    api project(":plugins:nf-sqldb")

    // JDBC driver setup for Google BigQuery - the 3rd party JAR are being downloaded and setup as gradle tasks below.
    // Reference https://cloud.google.com/bigquery/docs/reference/odbc-jdbc-drivers
    api files('src/dist/lib/GoogleBigQueryJDBC42.jar')
    //NOTE: Had to remove the slf4j jar due to a conflict
    implementation fileTree(dir: 'src/dist/lib/libs', include: '*.jar')


    testImplementation "io.nextflow:nextflow:$nextflowVersion"
    testImplementation "org.codehaus.groovy:groovy:3.0.10"
    testImplementation "org.codehaus.groovy:groovy-nio:3.0.10"
    testImplementation("org.codehaus.groovy:groovy-test:3.0.10") { exclude group: 'org.codehaus.groovy' }
    testImplementation("cglib:cglib-nodep:3.3.0")
    testImplementation("org.objenesis:objenesis:3.2")
    testImplementation("org.spockframework:spock-core:2.1-groovy-3.0") {
        exclude group: 'org.codehaus.groovy';
        exclude group: 'net.bytebuddy'
    }
    testImplementation('org.spockframework:spock-junit4:2.1-groovy-3.0') {
        exclude group: 'org.codehaus.groovy';
        exclude group: 'net.bytebuddy'
    }
    testImplementation('com.google.jimfs:jimfs:1.1')

    testImplementation(testFixtures("io.nextflow:nextflow:$nextflowVersion"))
    testImplementation(testFixtures("io.nextflow:nf-commons:$nextflowVersion"))
}

test {
    useJUnitPlatform()
}

/**
 * Google BigQuery
 * The following tasks download and confirm the MD5 checksum of the ZIP archive
 * for Simba BigQuery JDBC driver and extract its contents to the build directory
 * Reference: https://cloud.google.com/bigquery/docs/reference/odbc-jdbc-drivers
 */
task downloadBigqueryDep(type: Download) {
    src 'https://storage.googleapis.com/simba-bq-release/jdbc/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip'
    dest new File(buildDir, 'downloads/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip')
    overwrite false
}

task verifyBigqueryDep(type: Verify, dependsOn: downloadBigqueryDep) {
    src new File(buildDir, 'downloads/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip')
    algorithm 'MD5'
    checksum '2e54169cfba2050f0a0f01bcf12c8aa7'
}

task unzipBigqueryDep(dependsOn: verifyBigqueryDep, type: Copy) {
    from zipTree(new File(buildDir, 'downloads/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip'))
    into "${buildDir}/downloads/unzip/googlebigquery"
}
unzipBigqueryDep.doLast{
    file("${buildDir}/downloads/unzip/googlebigquery/libs/slf4j-api-1.7.36.jar").delete()
}

// Files under src/dist are included into the distribution zip
// https://docs.gradle.org/current/userguide/application_plugin.html
task copyBigqueryDep(dependsOn: unzipBigqueryDep, type: Copy) {
    from file(new File(buildDir, '/downloads/unzip/googlebigquery/GoogleBigQueryJDBC42.jar'))
    into "src/dist/lib"
}

task copyBigqueryLibs(dependsOn: copyBigqueryDep, type: Copy) {
    from file(new File(buildDir, '/downloads/unzip/googlebigquery/libs'))
    into "src/dist/lib/libs"
}

project.copyPluginLibs.dependsOn('copyBigqueryLibs')
project.compileGroovy.dependsOn('copyBigqueryLibs')
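
The build chains four tasks: download the driver ZIP, verify its MD5 checksum, unzip it, then copy the JARs into `src/dist/lib`. The verify step boils down to hashing the downloaded bytes and comparing the hex digest with the pinned value. The Java sketch below shows that comparison using only the JDK's `MessageDigest`. The class `ChecksumCheck` and its method names are hypothetical; this is not how the `de.undercouch.download` plugin is implemented internally, only an illustration of the check it performs.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of the check performed by the verify task.
final class ChecksumCheck {

    // Returns the lowercase hex MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True when the data hashes to the expected checksum (case-insensitive).
    static boolean matches(byte[] data, String expectedHex) {
        return md5Hex(data).equalsIgnoreCase(expectedHex);
    }

    public static void main(String[] args) {
        // The MD5 of empty input is a well-known constant.
        System.out.println(md5Hex(new byte[0])); // d41d8cd98f00b204e9800998ecf8427e
    }
}
```

To check the real archive, one could pass `Files.readAllBytes(...)` of the downloaded ZIP together with the checksum pinned in `verifyBigqueryDep`. Note that MD5 guards against corrupted downloads but is not collision-resistant, so it offers limited protection against a deliberately tampered file.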
@@ -0,0 +1,17 @@
package nextflow.sql

import nextflow.sql.config.DriverRegistry


/**
 * @author : jorge <[email protected]>
 *
 */
class BigQueryDriverRegistry extends DriverRegistry {

    BigQueryDriverRegistry(){
        super()
        addDriver('bigquery','com.simba.googlebigquery.jdbc.Driver')
    }

}
@@ -0,0 +1,34 @@
/*
 * Copyright 2020-2022, Seqera Labs
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */

package nextflow.sql

import nextflow.sql.config.DriverRegistry
import org.pf4j.PluginWrapper

/**
 * Implements BigQuerySQL plugin for Nextflow
 *
 * @author : jorge <[email protected]>
 */
class BigQuerySqlPlugin extends SqlPlugin {

    BigQuerySqlPlugin(PluginWrapper wrapper) {
        super(wrapper)
        DriverRegistry.DEFAULT = new BigQueryDriverRegistry()
    }
}
+6
@@ -0,0 +1,6 @@
Manifest-Version: 1.0
Plugin-Class: nextflow.sql.BigQuerySqlPlugin
Plugin-Id: nf-bigquery
Plugin-Provider: Seqera Labs
Plugin-Version: 0.0.1
Plugin-Requires: >=22.08.1-edge
+17
@@ -0,0 +1,17 @@
#
# Copyright 2020-2022, Seqera Labs
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

nextflow.sql.ChannelSqlExtension
@@ -0,0 +1,49 @@
/*
 * Copyright 2020-2022, Seqera Labs
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */

package nextflow.sql.bigquery

import nextflow.sql.BigQueryDriverRegistry
import nextflow.sql.config.DriverRegistry
import nextflow.sql.config.SqlDataSource
import spock.lang.Specification
/**
 *
 * @author Paolo Di Tommaso <[email protected]>
 */
class BigQuerySqlDataSourceTest extends Specification {

    def 'should map url to driver' () {
        given:
        DriverRegistry.DEFAULT = new BigQueryDriverRegistry()
        def helper = new SqlDataSource([:])

        expect:
        helper.urlToDriver(JBDC_URL) == DRIVER
        where:
        JBDC_URL                    | DRIVER
        'jdbc:postgresql:database'  | 'org.postgresql.Driver'
        'jdbc:sqlite:database'      | 'org.sqlite.JDBC'
        'jdbc:h2:mem:'              | 'org.h2.Driver'
        'jdbc:mysql:some-host'      | 'com.mysql.cj.jdbc.Driver'
        'jdbc:mariadb:other-host'   | 'org.mariadb.jdbc.Driver'
        'jdbc:duckdb:'              | 'org.duckdb.DuckDBDriver'
        'jdbc:awsathena:'           | 'com.simba.athena.jdbc.Driver'
        'jdbc:bigquery:'            | 'com.simba.googlebigquery.jdbc.Driver'
    }

}
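
The test above expects a JDBC URL such as `jdbc:bigquery:` to resolve to the driver class registered under the name `bigquery`. The internals of `DriverRegistry` and `SqlDataSource.urlToDriver` are not shown in this diff, so the Java sketch below is only an assumption about how such a lookup could work: take the sub-protocol between `jdbc:` and the next colon and use it as a map key. The class `DriverRegistrySketch` is hypothetical; its method names merely mirror those visible in the diff.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a name-to-driver registry; the real DriverRegistry
// in nf-sqldb may be implemented differently.
final class DriverRegistrySketch {

    private final Map<String, String> drivers = new HashMap<>();

    DriverRegistrySketch() {
        // Two of the defaults listed in the test table above.
        drivers.put("sqlite", "org.sqlite.JDBC");
        drivers.put("duckdb", "org.duckdb.DuckDBDriver");
    }

    // Registers (or overrides) the driver class for a JDBC sub-protocol name.
    void addDriver(String name, String driverClass) {
        drivers.put(name, driverClass);
    }

    // Resolves "jdbc:<name>:..." to the registered driver class,
    // or returns null when no driver is registered for <name>.
    String urlToDriver(String jdbcUrl) {
        if (jdbcUrl == null || !jdbcUrl.startsWith("jdbc:")) {
            throw new IllegalArgumentException("Not a JDBC URL: " + jdbcUrl);
        }
        String rest = jdbcUrl.substring("jdbc:".length());
        int colon = rest.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Missing sub-protocol separator: " + jdbcUrl);
        }
        return drivers.get(rest.substring(0, colon));
    }

    public static void main(String[] args) {
        DriverRegistrySketch registry = new DriverRegistrySketch();
        registry.addDriver("bigquery", "com.simba.googlebigquery.jdbc.Driver");
        System.out.println(registry.urlToDriver("jdbc:bigquery:"));
    }
}
```

Under this reading, `BigQueryDriverRegistry` only needs to call `addDriver` once in its constructor, which matches the one-line registration shown in the diff.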
