Tried to reproduce a benchmark test made here:
https://dzone.com/articles/joining-a-billion-rows-20x-faster-than-apache-spar
Basically, it is about 26x slower (25.95 - 26.31 sec) than plain Apache Spark (0.97 - 0.98 sec) on my laptop (a rough reconstruction of the benchmark is sketched after the environment details below):
macOS: Catalina (10.15)
Processor Name: Quad-Core Intel Core i7
Processor Speed: 2.9 GHz
Total Number of Cores: 4
Memory: 16 GB
Oracle jdk1.8.0_201.jdk
scala-sdk-2.11.8
snappydata-cluster_2.11:1.2.0 or 1.1.0
RuntimeMemoryManager org.apache.spark.memory.SnappyUnifiedMemoryManager@4c398c80 configuration:
Total Usable Heap = 2.9 GB (3082926162)
Storage Pool = 1470.1 MB (1541463081)
Execution Pool = 1470.1 MB (1541463081)
Max Storage Pool Size = 2.3 GB (2466340929)
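For context, here is a minimal sketch of the kind of billion-row range join the article describes. The exact benchmark code is not reproduced in this issue, so the table sizes, the id column and the timing harness below are my own assumptions; it also assumes a spark session is already available in the shell, as in the snippet further down.

// hypothetical reconstruction of the benchmark: join a billion-row range
// against a smaller range on the shared "id" column and count the result
val large = spark.range(1000L * 1000 * 1000).toDF("id")   // 1 billion rows
val small = spark.range(1000L * 1000).toDF("id")          // 1 million rows

val start = System.nanoTime()
val joined = large.join(small, "id")
println(s"rows = ${joined.count()}")
println(s"elapsed = ${(System.nanoTime() - start) / 1e9} s")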
I'm sure I have not tuned my environment well enough, but I thought it was still important to post this issue, since it does not happen with plain Spark, only with the SnappyData Spark distribution:
// build a one-billion-row DataFrame, cache it, and force materialization
val rangeData = spark.range(1000L * 1000 * 1000).toDF()
rangeData.cache()
rangeData.count()
leads to the error:
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:863)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:102)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:90)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1366)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:104)
at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:468)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:704)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:41)
...
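The failing frame is the memory-mapped read in DiskStore.getBytes, and FileChannel.map cannot map more than Integer.MAX_VALUE bytes, so any single cached block over ~2 GB hits this exception. A possible workaround, which is only an untested assumption on my side, is to spread the data over more partitions before caching so that no individual block gets that large:

// untested sketch: more, smaller partitions keep each cached block well
// below the 2 GB memory-mapping limit (400 is an arbitrary choice)
val rangeData = spark.range(1000L * 1000 * 1000)
  .repartition(400)
  .toDF()
rangeData.cache()
rangeData.count()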