
Commit 7f7eaa4

Merge pull request #656 from datastax/SPARKC-142
SPARKC-142: New Defaults Based on Perf Tests
2 parents e804715 + 8258a51 commit 7f7eaa4

2 files changed: +9 −6 lines changed


doc/5_saving.md (+7 −4)
@@ -126,19 +126,22 @@ collection.saveAsCassandraTableEx(table2, SomeColumns("word", "count"))
 
 ## Tuning
-The following properties set in `SparkConf` can be used to fine-tune the saving process:
+The following properties set in `SparkConf` can be used to fine-tune the saving process.
+These values have been set to achieve stability, not performance; changing them may
+increase performance depending on your workload:
 
 - `spark.cassandra.output.batch.size.rows`: number of rows per single batch; default is 'auto' which means the connector
   will adjust the number of rows based on the amount of data in each row
-- `spark.cassandra.output.batch.size.bytes`: maximum total size of the batch in bytes; defaults to 16 kB.
+- `spark.cassandra.output.batch.size.bytes`: maximum total size of the batch in bytes; defaults to 1 kB.
 - `spark.cassandra.output.batch.grouping.key`: determines how insert statements are grouped into batches; available values are:
   - `none`: a batch may contain any statements
   - `replica_set`: a batch may contain only statements to be written to the same replica set
   - `partition` (default): a batch may contain only statements for rows sharing the same partition key value
 - `spark.cassandra.output.batch.buffer.size`: how many batches per single Spark task can be stored in memory before sending to Cassandra; default 1000
 - `spark.cassandra.output.concurrent.writes`: maximum number of batches executed in parallel by a single Spark task; defaults to 5
 - `spark.cassandra.output.consistency.level`: consistency level for writing; defaults to LOCAL_ONE.
-- `spark.cassandra.output.throughput_mb_per_sec`: maximum write throughput allowed per single core in MB/s;
-  throughput limiting needs `spark.cassandra.output.metrics` enabled
+- `spark.cassandra.output.throughput_mb_per_sec`: maximum write throughput allowed per single core in MB/s;
+  on long (8+ hour) runs, limit this to 70% of your max throughput as seen on a
+  smaller job, for stability
 
 [Next - Customizing the object mapping](6_advanced_mapper.md)
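As context for the new defaults, here is a minimal sketch of setting these properties through `SparkConf`; the connection host, keyspace/table names, and chosen values are illustrative only, not part of this commit:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Sketch only: tune write batching before creating the SparkContext.
// Host, keyspace/table, and values below are illustrative.
val conf = new SparkConf()
  .setAppName("write-tuning-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.output.batch.size.bytes", "1024")      // new default: 1 kB
  .set("spark.cassandra.output.concurrent.writes", "5")        // new default: 5
  .set("spark.cassandra.output.batch.grouping.key", "partition")

val sc = new SparkContext(conf)

// Writes issued via saveToCassandra pick up the settings above.
sc.parallelize(Seq(("cat", 30), ("fox", 40)))
  .saveToCassandra("test", "words", SomeColumns("word", "count"))
```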

spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/writer/WriteConf.scala (+2 −2)
@@ -74,8 +74,8 @@ object WriteConf {
   )
 
   val DefaultConsistencyLevel = ConsistencyLevel.LOCAL_ONE
-  val DefaultBatchSizeInBytes = 16 * 1024
-  val DefaultParallelismLevel = 8
+  val DefaultBatchSizeInBytes = 1024
+  val DefaultParallelismLevel = 5
   val DefaultBatchGroupingBufferSize = 1000
   val DefaultBatchGroupingKey = BatchGroupingKey.Partition
   val DefaultThroughputMiBPS = Int.MaxValue
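The constants above feed the default `WriteConf`. A sketch of overriding them programmatically for a single save, assuming the `WriteConf` case class exposes `parallelismLevel` and `throughputMiBPS` fields mirroring these constants (check the signature in your connector version):

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf

// Assumption: WriteConf has parallelismLevel and throughputMiBPS fields
// matching the defaults above; verify against your connector version.
val writeConf = WriteConf(
  parallelismLevel = 8, // revert to the previous default of 8 parallel batches
  throughputMiBPS = 70  // throttle long-running jobs below the observed maximum
)

// sc is a SparkContext as in the earlier sketch.
val rdd = sc.parallelize(Seq(("dog", 50)))
rdd.saveToCassandra("test", "words", SomeColumns("word", "count"), writeConf)
```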
