
Commit e804715

Merge pull request #652 from datastax/SPARKC-134
Sparkc 134
2 parents e425144 + 701d3fb commit e804715


4 files changed: +202 -1 lines changed


Diff for: README.md

+2
@@ -70,6 +70,8 @@ See [Building And Artifacts](doc/12_building_and_artifacts.md)
 - [The spark-cassandra-connector-embedded Artifact](doc/10_embedded.md)
 - [Performance monitoring](doc/11_metrics.md)
 - [Building And Artifacts](doc/12_building_and_artifacts.md)
+- [The Spark Shell](doc/13_spark_shell.md)
+- [Frequently Asked Questions](doc/FAQ.md)

 ## Community
 ### Reporting Bugs

Diff for: doc/12_building_and_artifacts.md

+3 -1
@@ -66,7 +66,7 @@ The documentation will be generated to:
 - `spark-cassandra-connector/target/scala-{binary.version}/api/`
 - `spark-cassandra-connector-java/target/scala-{binary.version}/api/`

-##### Build Tasks
+#### Using the Assembly Jar With Spark Submit
 The easiest way to do this is to make the assembled connector jar using

     sbt assembly
@@ -85,3 +85,5 @@ Then add this jar to your Spark executor classpath by adding the following line

 This driver is also compatible with the Spark distribution provided in
 [DataStax Enterprise](http://datastax.com/docs/latest-dse/).
+
+[Next - The Spark Shell](13_spark_shell.md)

Diff for: doc/13_spark_shell.md

+78
@@ -0,0 +1,78 @@
# Documentation

## The Spark Shell and Spark Cassandra Connector

These instructions were last confirmed with C* 2.0.11, Spark 1.2.1 and Connector 1.2.0-rc3.

### Setting up Cassandra

The easiest way to get started quickly with Cassandra is to follow the instructions provided by
[DataStax](http://docs.datastax.com/en/cassandra/2.1/cassandra/install/install_cassandraTOC.html).

### Setting up Spark

#### Download Spark

Download a pre-built Spark package from https://spark.apache.org/downloads.html and untar the
downloaded tar.gz with

    tar -xvf spark-*.tgz

#### Start Spark in Standalone Mode (Optional)

[Official Spark Instructions](https://spark.apache.org/docs/latest/spark-standalone.html)

If you would like to run against a separate executor JVM, you need a running Spark Master and Worker.
By default the spark-shell runs in local mode (driver, master, and executor share a single JVM).

Go to the newly created directory and start Spark in standalone mode bound to localhost:

    cd spark*
    ./sbin/start-all.sh

At this point you should be able to access the Spark UI at localhost:8080. The page should show a
single worker, and at the top a URL for the Spark master. Save the master address (the entire
spark://something:7077) if you would like to connect the shell to this standalone Spark master
(used as sparkMasterAddress below).

### Clone and assemble the Spark Cassandra Connector

    git clone git@github.com:datastax/spark-cassandra-connector.git
    cd spark-cassandra-connector
    git checkout b1.2 ## Replace this with the version of the connector you would like to use
    ./sbt/sbt assembly
    ls spark-cassandra-connector/target/scala-2.10/*
    ## There should be a spark-cassandra-connector-assembly-*.jar here; copy its full path to use as yourAssemblyJar below

### Start the Spark Shell

If you don't include the master address below, the spark-shell will run in local mode.

    cd ~/spark-*

    # Include --master only if you want to run against a standalone Spark master instead of local mode
    ./bin/spark-shell [--master sparkMasterAddress] --jars yourAssemblyJar --conf spark.cassandra.connection.host=yourCassandraClusterIp

By default Spark logs everything to the console, which can be a bit of an overload. To change this,
copy and modify the log4j.properties template file:

    cp conf/log4j.properties.template conf/log4j.properties

Changing the root logger at the top of that file from INFO to WARN will significantly reduce the verbosity.

### Import connector classes

```scala
import com.datastax.spark.connector._     // Imports basic RDD functions
import com.datastax.spark.connector.cql._ // (Optional) Imports Java driver helper functions
```

### Test it out

```scala
val c = CassandraConnector(sc.getConf)
c.withSessionDo ( session => session.execute("CREATE KEYSPACE test WITH replication={'class':'SimpleStrategy', 'replication_factor':1}"))
c.withSessionDo ( session => session.execute("CREATE TABLE test.fun (k int PRIMARY KEY, v int)"))
sc.parallelize(1 to 100).map( x => (x,x)).saveToCassandra("test","fun")
sc.cassandraTable("test","fun").take(3)
// Your results may differ
// res1: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{k: 60, v: 60}, CassandraRow{k: 67, v: 67}, CassandraRow{k: 10, v: 10})
```

Diff for: doc/FAQ.md

+119
@@ -0,0 +1,119 @@
# Documentation

## Frequently Asked Questions

### Why is my job running on a single executor? Why am I not seeing any parallelism?

The first thing to check when a Spark job is not being parallelized is how many tasks have been
generated. To check this, look at the UI for your Spark job and see how many tasks are being run.
In the current shell a small progress bar is shown when running stages; the numbers represent
(Completed Tasks + Running Tasks) / Total Tasks.

    [Stage 2:=============================================> (121 + 1) / 200]0

If only a single task has been created, the Cassandra token range has not been split into enough
tasks to be well parallelized on your cluster. The number of Spark partitions (tasks) created is
directly controlled by the setting `spark.cassandra.input.split.size`. This number reflects the
approximate number of live Cassandra partitions in a given Spark partition. To increase the number
of Spark partitions, decrease this number from the default (100k) to one that will sufficiently
break up your C* token range. This can also be adjusted on a per cassandraTable basis with the
function `withReadConf` by specifying a new `ReadConf` object, as sketched below.
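
A rough sketch, assuming connector 1.2 (the `ReadConf` field names may differ in other versions)
and the `test.fun` table from the Spark Shell walk-through:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

// Illustrative only: target roughly 10k C* partitions per Spark partition for this
// one table, instead of lowering spark.cassandra.input.split.size globally.
val rdd = sc.cassandraTable("test", "fun")
  .withReadConf(ReadConf(splitSize = 10000))
```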

If there is more than one task but only a single machine is working, make sure that the job itself
has been allocated multiple executor slots to work with. This is set at the time of SparkContext
creation with `spark.cores.max` in the `SparkConf` and cannot be changed during the job.
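
A minimal sketch of setting this at context creation (the master URL, application name, and core
count below are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: allow the job to claim up to 8 cores across the cluster.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("parallelism-example")
  .set("spark.cores.max", "8")

val sc = new SparkContext(conf)
```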

One last thing to check is whether there is a `where` clause with a partition-key predicate.
Currently the Spark Cassandra Connector creates Spark tasks which contain entire C* partitions.
This method ensures that a request for a single C* partition will always create a single Spark
task. `where` clauses with an `in` will also generate a single Spark partition.

### Why can't the Spark job find Spark Cassandra Connector classes? (ClassNotFound exceptions for SCC classes)

The most common cause for this is that the executor classpath does not contain the Spark Cassandra
Connector jars. The simplest way to add these to the classpath is to use spark-submit with the
`--jars` option pointing to your Spark Cassandra Connector assembly jar. If this is impossible, the
second best option is to manually distribute the jar to all of your executors and add the jar's
location to `spark.executor.extraClassPath` in the SparkConf or spark-defaults.conf.
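
A sketch of the fallback option (the jar path is a placeholder for wherever you copied the
assembly on each executor machine):

```scala
import org.apache.spark.SparkConf

// Illustrative only: the jar must already exist at this path on every executor.
val conf = new SparkConf()
  .set("spark.executor.extraClassPath", "/opt/jars/spark-cassandra-connector-assembly-1.2.0.jar")
```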

### Where should I set configuration options for the connector?

The suggested location is the `spark-defaults.conf` file in your spark/conf directory, but this
file is ONLY used by spark-submit. Any applications not running through spark-submit will ignore
this file. You can also specify spark-submit conf options with `--conf option=value` on the
command line.

For applications not running through spark-submit, set the options in the `SparkConf` object used
to create your `SparkContext`. Usually this will take the form of a series of statements that look
like

```scala
val conf = new SparkConf()
  .set("Option", "Value")
  ...

val sc = new SparkContext(conf)
```

### Why are my write tasks timing out/failing?

The most common cause of this is that Spark is able to issue write requests much more quickly than
Cassandra can handle them. This can lead to GC issues and a build-up of hints. If this is the case
with your application, try lowering the number of concurrent writes and the current batch size
using the following options:

    spark.cassandra.output.batch.size.rows
    spark.cassandra.output.concurrent.writes

or, in versions of the Spark Cassandra Connector greater than or equal to 1.2.0, set

    spark.cassandra.output.throughput_mb_per_sec

which will allow you to control the amount of data written to C* per Spark core per second.
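
A minimal sketch of throttling writes through the `SparkConf` (the values are illustrative only
and should be tuned to your cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: smaller batches and fewer in-flight writes per task.
val conf = new SparkConf()
  .setAppName("write-tuning-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.output.batch.size.rows", "500")
  .set("spark.cassandra.output.concurrent.writes", "2")

val sc = new SparkContext(conf)
```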

### Why are my executors throwing `OutOfMemoryException`s while reading from Cassandra?

This usually means that the size of the partitions you are attempting to create is larger than the
executor's heap can handle. Remember that all of the executors run in the same JVM, so the size of
the data is multiplied by the number of executor slots.

To fix this, either increase the heap size of the executors (`spark.executor.memory`) or shrink
the size of the partitions by decreasing `spark.cassandra.input.split.size`.
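
A short sketch of both knobs (illustrative values only):

```scala
import org.apache.spark.SparkConf

// Illustrative only: larger executor heap, smaller Spark partitions.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.cassandra.input.split.size", "10000")
```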

### Why can't my Spark job find my application classes / anonymous functions?

This occurs when your application code hasn't been placed on the classpath of the Spark executor.
When using spark-submit, make sure that the jar contains all of the classes and dependencies for
running your code. To build a fat jar, look into using sbt assembly, or look for instructions for
your build tool of choice.

If you are not using the recommended approach with spark-submit, make sure that your dependencies
have been set in the `SparkConf` using `setJars` or by distributing the jars yourself and
modifying the executor classpath.
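
A sketch of the `setJars` route (the master URL and jar path are placeholders for your own
environment and build output):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: ship the application's fat jar to every executor at startup.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("my-app")
  .setJars(Seq("target/scala-2.10/my-app-assembly-0.1.jar"))

val sc = new SparkContext(conf)
```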

### Why don't my case classes work?

Usually this is because they have been defined within another object/class. Try moving the
definition outside of the scope of other classes.
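
For example (a sketch assuming the `test.fun` table from the Spark Shell walk-through and the
basic connector imports), the case class should live at the top level:

```scala
import com.datastax.spark.connector._

// Defined at the top level, not nested inside another class or object.
case class FunRow(k: Int, v: Int)

sc.parallelize(1 to 10).map(x => FunRow(x, x)).saveToCassandra("test", "fun")
val rows = sc.cassandraTable[FunRow]("test", "fun")
```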

### Why can't my Spark job connect to Cassandra?

Check that your Cassandra instance is on and responds to cqlsh. Make sure that Cassandra accepts
incoming connections on the interface you are setting as `rpc_address` in the cassandra.yaml file,
and that you are setting the `spark.cassandra.connection.host` property to the interface which the
rpc_address is set to.

When troubleshooting Cassandra connections it is sometimes useful to set the rpc_address in the
C* yaml file to `0.0.0.0` so any incoming connection will work.
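
A quick connectivity check from the shell (a sketch: the host below is a placeholder for your
Cassandra rpc_address):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector.cql.CassandraConnector

// Illustrative only: build a context pointed at the C* node and run a trivial query.
val conf = new SparkConf().set("spark.cassandra.connection.host", "10.0.0.5")
val sc = new SparkContext("local[*]", "connectivity-check", conf)
val version = CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute("SELECT release_version FROM system.local").one.getString("release_version")
}
```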

### Can I contribute to the Spark Cassandra Connector?

YES! Feel free to start a Jira and detail the changes you would like to make or the feature you
would like to add. We would be happy to discuss it with you and see your work. Feel free to create
a Jira before you have started any work if you would like feedback on an idea. When you have a
branch that you are satisfied with and that passes all the tests (`/dev/run_tests.sh`), make a
GitHub PR against your target Connector version and set your Jira to Reviewing.

### What should I do if I find a bug?

Feel free to post a repro on the Mailing List or, if you are feeling ambitious, file a Jira with
steps for reproduction and we'll get to it as soon as possible. Please remember to include a full
stack trace (if any) and the versions of Spark, the Connector, and Cassandra that you are using.
