This repository was archived by the owner on Jun 14, 2024. It is now read-only.
MinMax analysis util throws exception on large dataset #528
Open
Labels
untriaged: This is the default tag for a newly created issue
Description
Describe the issue
I tried to use MinMaxAnalysisUtil to analyze the distribution of a column. It worked well on a small dataset, but threw an exception on my TPC-H dataset, which has around 10 GB of data in 1k partitions.
To Reproduce
Run the analysis tool on the TPC-H 10 GB dataset:
scala> println(MinMaxAnalysisUtil.analyze(df, Seq("l_discount", "l_quantity"), format = "text"))
java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal
at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.compare(Decimal.scala:688)
at scala.math.Ordering.equiv(Ordering.scala:103)
at scala.math.Ordering.equiv$(Ordering.scala:103)
at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.equiv(Decimal.scala:688)
at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyzeMinMaxHistogram$2(MinMaxAnalysisUtil.scala:661)
at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyzeMinMaxHistogram$2$adapted(MinMaxAnalysisUtil.scala:661)
at scala.math.Ordering$$anon$6.compare(Ordering.scala:203)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:234)
at java.util.Arrays.sort(Arrays.java:1438)
at scala.collection.SeqLike.sorted(SeqLike.scala:659)
at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
at scala.collection.AbstractSeq.sorted(Seq.scala:45)
at scala.collection.SeqLike.sortWith(SeqLike.scala:612)
at scala.collection.SeqLike.sortWith$(SeqLike.scala:612)
at scala.collection.AbstractSeq.sortWith(Seq.scala:45)
at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyzeMinMaxHistogram(MinMaxAnalysisUtil.scala:661)
at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyzeMinMaxHistogram$(MinMaxAnalysisUtil.scala:635)
at com.microsoft.hyperspace.util.DataframeMinMaxAnalyzer.analyzeMinMaxHistogram(MinMaxAnalysisUtil.scala:735)
at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyze$1(MinMaxAnalysisUtil.scala:630)
at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyze$1$adapted(MinMaxAnalysisUtil.scala:629)
at scala.collection.immutable.List.foreach(List.scala:392)
at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyze(MinMaxAnalysisUtil.scala:629)
at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyze$(MinMaxAnalysisUtil.scala:624)
at com.microsoft.hyperspace.util.DataframeMinMaxAnalyzer.analyze(MinMaxAnalysisUtil.scala:735)
at com.microsoft.hyperspace.util.MinMaxAnalysis.analyzeDataframe(MinMaxAnalysisUtil.scala:763)
at com.microsoft.hyperspace.util.MinMaxAnalysis.analyzeDataframe$(MinMaxAnalysisUtil.scala:760)
at com.microsoft.hyperspace.util.MinMaxAnalysisUtil$.analyzeDataframe(MinMaxAnalysisUtil.scala:768)
at com.microsoft.hyperspace.util.MinMaxAnalysisUtil$.analyze(MinMaxAnalysisUtil.scala:774)
... 49 elided
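A note on the likely cause, as a sketch rather than a confirmed diagnosis: the trace shows `Decimal$DecimalIsFractional$.compare` receiving a `java.math.BigDecimal`. Values read from an external `Row` for a `DecimalType` column are `java.math.BigDecimal`, not Spark's internal `Decimal`, so an ordering for `Decimal` fails when applied to them. The snippet below assumes a spark-shell session with Spark 3.1.x on the classpath. `toSparkDecimal` is a hypothetical helper, not part of Hyperspace, and the sample values are made up.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.Decimal

// Hypothetical helper (not part of Hyperspace): normalize a value taken from
// an external Row into Spark's internal Decimal before comparing it.
def toSparkDecimal(value: Any): Decimal = value match {
  case d: Decimal                => d
  case jd: java.math.BigDecimal  => Decimal(jd)
  case sd: scala.math.BigDecimal => Decimal(sd)
  case other =>
    throw new IllegalArgumentException(s"Not a decimal value: $other")
}

// Made-up sample row standing in for collected min/max values of a decimal column.
val row = Row(new java.math.BigDecimal("0.04"), new java.math.BigDecimal("0.10"))

// Comparing after conversion avoids casting java.math.BigDecimal to Decimal.
val min = toSparkDecimal(row.get(0))
val max = toSparkDecimal(row.get(1))
val cmp = min.compare(max)
```

Converting at the boundary keeps the existing `Decimal` ordering usable. Whether this matches the code at `MinMaxAnalysisUtil.scala:661` is an assumption.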
Expected behavior
The tool prints the diagram, as it does on the small dataset.
Environment
Please complete the following information if applicable:
- OS: iOS
- Apache Spark Version: 3.1.2
- Platform: local with master branch code