- This is an evolving guide for developers interested in developing and testing this project. This guide assumes that you
- have cloned this repository to your local workstation.
+ This guide covers how to develop and test this project. It assumes that you have cloned this repository to your local
+ workstation.

- # Do this first!
+ Due to the use of the Sonar plugin for Gradle, you must use Java 11 or higher for developing and testing the project.
+ The `build.gradle` file for this project ensures that the connector is built to run on Java 8 or higher.

- In order to develop and/or test the connector, or to try out the PySpark instructions below, you first
- need to deploy the test application in this project to MarkLogic. You can do so either on your own installation of
- MarkLogic, or you can use `docker-compose` to install MarkLogic, optionally as a 3-node cluster with a load balancer
- in front of it.
+ # Setup
+
+ To begin, you need to deploy the test application in this project to MarkLogic. You can do so either on your own
+ installation of MarkLogic, or you can use `docker-compose` to install MarkLogic, optionally as a 3-node cluster with
+ a load balancer in front of it.

## Installing MarkLogic with docker-compose

@@ -22,9 +24,9 @@ The above will result in a new MarkLogic instance with a single node.
Alternatively, if you would like to test against a 3-node MarkLogic cluster with a load balancer in front of it,
run `docker-compose -f docker-compose-3nodes.yaml up -d --build`.

- ### Accessing MarkLogic logs in Grafana
+ ## Accessing MarkLogic logs in Grafana

- This project's `docker-compose.yaml` file includes
+ This project's `docker-compose-3nodes.yaml` file includes
[Grafana, Loki, and promtail services](https://grafana.com/docs/loki/latest/clients/promtail/) for the primary reason of
collecting MarkLogic log files and allowing them to be viewed and searched via Grafana.

@@ -75,6 +77,46 @@ You can then run the tests from within the Docker environment via the following
./gradlew dockerTest

+ ## Generating code quality reports with SonarQube
+
+ In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file and you must
+ have the services in that file running.
+
+ To configure the SonarQube service, perform the following steps:
+
+ 1. Go to http://localhost:9000.
+ 2. Log in as admin/admin. SonarQube will ask you to change this password; you can choose whatever you want ("password" works).
+ 3. Click on "Create project manually".
+ 4. Enter "marklogic-spark" for the Project Name; use that as the Project Key too.
+ 5. Enter "develop" as the main branch name.
+ 6. Click on "Next".
+ 7. Click on "Use the global setting" and then "Create project".
+ 8. On the "Analysis Method" page, click on "Locally".
+ 9. In the "Provide a token" panel, click on "Generate". Copy the token.
+ 10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
+ that file if it does not exist yet.
+
+ To run SonarQube, run the following Gradle tasks, which will run all the tests with code coverage and then generate
+ a quality report with SonarQube:
+
+ ./gradlew test sonar
+
+ If you do not add `systemProp.sonar.token` to your `gradle-local.properties` file, you can specify the token via the
+ following:
+
+ ./gradlew test sonar -Dsonar.token=paste your token here
+
+ When that completes, you will see a line like this near the end of the logging:
+
+ ANALYSIS SUCCESSFUL, you can find the results at: http://localhost:9000/dashboard?id=marklogic-spark
+
+ Click on that link. If it's the first time you've run the report, you'll see all issues. If you've run the report
+ before, then SonarQube will show "New Code" by default. That's handy, as you can use that to quickly see any issues
+ you've introduced on the feature branch you're working on. You can then click on "Overall Code" to see all issues.
+
+ Note that if you only need results on code smells and vulnerabilities, you can repeatedly run `./gradlew sonar`
+ without having to re-run the tests.
+

# Testing with PySpark

The documentation for this project
@@ -89,19 +131,16 @@ This will produce a single jar file for the connector in the `./build/libs` dire

You can then launch PySpark with the connector available via:

- pyspark --jars build/libs/marklogic-spark-connector-2.1.0.jar
+ pyspark --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

The command below is an example of loading data from the test application deployed via the instructions at the top of
this page.

```
- df = spark.read.format("com.marklogic.spark")\
-     .option("spark.marklogic.client.host", "localhost")\
-     .option("spark.marklogic.client.port", "8016")\
-     .option("spark.marklogic.client.username", "admin")\
-     .option("spark.marklogic.client.password", "admin")\
-     .option("spark.marklogic.client.authType", "digest")\
+ df = spark.read.format("marklogic")\
+     .option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016")\
    .option("spark.marklogic.read.opticQuery", "op.fromView('Medical', 'Authors')")\
+     .option("spark.marklogic.read.numPartitions", 8)\
    .load()
```

@@ -114,6 +153,74 @@ You now have a Spark dataframe - try some commands out on it:
Check out the [PySpark docs](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html) for
more commands you can try out.

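For example, here are a few schema-agnostic operations you can run on the dataframe created above; this is just a quick sketch, and the exact columns you see depend on the `Medical/Authors` view defined by the test application.

```
# Assumes `df` is the dataframe created by the Optic read example above.
df.printSchema()   # columns projected from the 'Medical'/'Authors' Optic view
df.count()         # number of rows returned by the Optic query
df.show(5)         # display the first 5 rows
```
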
+ You can query for documents as well - the following shows a simple example along with a technique for converting the
+ binary content of each document into a string of JSON.
+
+ ```
+ import json
+ from pyspark.sql import functions as F
+
+ df = spark.read.format("marklogic")\
+     .option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016")\
+     .option("spark.marklogic.read.documents.collections", "author")\
+     .load()
+ df.show()
+
+ df2 = df.select(F.col("content").cast("string"))
+ df2.head()
+ json.loads(df2.head()['content'])
+ ```
+
+
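If you want Python dictionaries for every matching document rather than just the first row, one approach is sketched below; it builds on the `df2` dataframe from the example above and is fine for the small test dataset, though `collect()` should be avoided on large collections.

```
# Convert each document's JSON content string into a Python dict.
docs = [json.loads(row['content']) for row in df2.collect()]
print(len(docs))
print(docs[0])
```
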
+ # Testing against a local Spark cluster
+
+ When you run PySpark, it will create its own Spark cluster. If you'd like to try against a separate Spark cluster
+ that still runs on your local machine, perform the following steps:
+
+ 1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.4.1` since we are currently
+ building against Spark 3.4.1.
+ 2. `cd ~/.sdkman/candidates/spark/current/sbin`, which is where sdkman will install Spark.
+ 3. Run `./start-master.sh` to start a master Spark node.
+ 4. `cd ../logs` and open the master log file that was created to find the address for the master node. It will be in a
+ log message similar to `Starting Spark master at spark://NYWHYC3G0W:7077` - copy that address at the end of the message.
+ 5. `cd ../sbin`.
+ 6. Run `./start-worker.sh spark://NYWHYC3G0W:7077`, changing that address as necessary.
+
+ You can of course simplify the above steps by adding `SPARK_HOME` to your environment and adding `$SPARK_HOME/sbin` to your
+ path, which avoids having to change directories. The log files in `./logs` are useful to tail as well.
+
+ The Spark master GUI is at <http://localhost:8080>. You can use this to view details about jobs running in the cluster.
+
+ Now that you have a Spark cluster running, you just need to tell PySpark to connect to it:
+
+ pyspark --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar
+
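Once PySpark starts, a quick way to confirm the session is attached to your standalone cluster rather than a local master (assuming the same master URL as above):

```
# Prints the master URL the session is connected to, e.g. spark://NYWHYC3G0W:7077
print(spark.sparkContext.master)
```
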
+ You can then run the same commands as shown in the PySpark section above. The Spark master GUI will allow you to
+ examine details of each of the commands that you run.
+
+ The above approach is ultimately a sanity check to ensure that the connector works properly with a separate cluster
+ process.
+
+ ## Testing spark-submit
+
+ Once you have the above Spark cluster running, you can test out
+ [spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html), which enables submitting a program
+ and an optional set of jars to a Spark cluster for execution.
+
+ You will need the connector jar available, so run `./gradlew clean shadowJar` if you have not already.
+
+ You can then run a test Python program in this repository via the following (again, change the master address as
+ needed); note that you run this outside of PySpark, and `spark-submit` is available after having installed PySpark:
+
+ spark-submit --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar src/test/python/test_program.py
+
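The program submitted above lives at `src/test/python/test_program.py`. As a rough sketch of what such a program can look like (not necessarily the actual contents of that file), it creates its own SparkSession and runs the same kind of Optic read shown earlier:

```
from pyspark.sql import SparkSession

# The connector jar is supplied via --jars on the spark-submit command line.
spark = SparkSession.builder.appName("marklogic-spark-test").getOrCreate()

df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016") \
    .option("spark.marklogic.read.opticQuery", "op.fromView('Medical', 'Authors')") \
    .load()

print(df.count())
spark.stop()
```
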
+ You can also test a Java program. To do so, first move the `com.marklogic.spark.TestProgram` class from `src/test/java`
+ to `src/main/java`. Then run `./gradlew clean shadowJar` to rebuild the connector jar and run the following:
+
+ spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar
+
+ Be sure to move `TestProgram` back to `src/test/java` when you are done.
+
# Testing the documentation locally

See the section with the same name in the