This repository was archived by the owner on Jun 14, 2024. It is now read-only.

MinMax analysis util throws exception on large dataset #528

@dai-chen

Describe the issue

I tried to use MinMaxAnalysisUtil to analyze the distribution of a column. It worked well on a small dataset, but threw an exception on my TPC-H dataset, which has around 10GB of data in 1k partitions.

To Reproduce

Run the analysis tool on the TPC-H 10GB dataset:

scala> println(MinMaxAnalysisUtil.analyze(df, Seq("l_discount", "l_quantity"), format = "text"))
java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal
  at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.compare(Decimal.scala:688)
  at scala.math.Ordering.equiv(Ordering.scala:103)
  at scala.math.Ordering.equiv$(Ordering.scala:103)
  at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.equiv(Decimal.scala:688)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyzeMinMaxHistogram$2(MinMaxAnalysisUtil.scala:661)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyzeMinMaxHistogram$2$adapted(MinMaxAnalysisUtil.scala:661)
  at scala.math.Ordering$$anon$6.compare(Ordering.scala:203)
  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
  at java.util.TimSort.sort(TimSort.java:234)
  at java.util.Arrays.sort(Arrays.java:1438)
  at scala.collection.SeqLike.sorted(SeqLike.scala:659)
  at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
  at scala.collection.AbstractSeq.sorted(Seq.scala:45)
  at scala.collection.SeqLike.sortWith(SeqLike.scala:612)
  at scala.collection.SeqLike.sortWith$(SeqLike.scala:612)
  at scala.collection.AbstractSeq.sortWith(Seq.scala:45)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyzeMinMaxHistogram(MinMaxAnalysisUtil.scala:661)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyzeMinMaxHistogram$(MinMaxAnalysisUtil.scala:635)
  at com.microsoft.hyperspace.util.DataframeMinMaxAnalyzer.analyzeMinMaxHistogram(MinMaxAnalysisUtil.scala:735)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyze$1(MinMaxAnalysisUtil.scala:630)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyze$1$adapted(MinMaxAnalysisUtil.scala:629)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyze(MinMaxAnalysisUtil.scala:629)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyze$(MinMaxAnalysisUtil.scala:624)
  at com.microsoft.hyperspace.util.DataframeMinMaxAnalyzer.analyze(MinMaxAnalysisUtil.scala:735)
  at com.microsoft.hyperspace.util.MinMaxAnalysis.analyzeDataframe(MinMaxAnalysisUtil.scala:763)
  at com.microsoft.hyperspace.util.MinMaxAnalysis.analyzeDataframe$(MinMaxAnalysisUtil.scala:760)
  at com.microsoft.hyperspace.util.MinMaxAnalysisUtil$.analyzeDataframe(MinMaxAnalysisUtil.scala:768)
  at com.microsoft.hyperspace.util.MinMaxAnalysisUtil$.analyze(MinMaxAnalysisUtil.scala:774)
  ... 49 elided
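
The trace shows an ordering written for one decimal type (`org.apache.spark.sql.types.Decimal`) being applied to values of another (`java.math.BigDecimal`) during a sort. The following is a minimal sketch of that mechanism only, not the Hyperspace code: it uses no Spark classes, and `InternalDecimal`, `InternalDecimalOrdering`, and `OrderingMismatchSketch` are hypothetical names invented for illustration. It also shows one possible remedy, converting values to the type the ordering expects before sorting. Whether that is the right fix for `MinMaxAnalysisUtil` is an assumption.

```scala
import java.math.{BigDecimal => JBigDecimal}
import scala.util.Try

// Hypothetical stand-in for an engine-internal decimal type (not Spark's Decimal).
final case class InternalDecimal(value: JBigDecimal)

// An ordering that only knows how to compare the internal type.
object InternalDecimalOrdering extends Ordering[InternalDecimal] {
  def compare(a: InternalDecimal, b: InternalDecimal): Int =
    a.value.compareTo(b.value)
}

object OrderingMismatchSketch {
  // Generic code that handles columns of any type often holds the ordering
  // in a type-erased form, so the compiler can no longer catch a mismatch.
  val erased: Ordering[Any] = InternalDecimalOrdering.asInstanceOf[Ordering[Any]]

  // Sorting raw java.math.BigDecimal values with the erased ordering fails at
  // runtime: compare casts its arguments to InternalDecimal.
  def sortUnsafe(values: Seq[Any]): Try[Seq[Any]] =
    Try(values.sorted(erased))

  // Converting each value to the type the ordering expects avoids the cast error.
  def sortConverted(values: Seq[Any]): Seq[Any] =
    values.map {
      case d: JBigDecimal => InternalDecimal(d)
      case other          => other
    }.sorted(erased)
}
```

Calling `OrderingMismatchSketch.sortUnsafe` with two or more `java.math.BigDecimal` values yields a `Failure` wrapping a `ClassCastException`, while `sortConverted` returns the values in ascending order. In Spark itself, `Decimal.apply` accepts a `java.math.BigDecimal`, so a similar conversion may be applicable there, though this has not been verified against the code at `MinMaxAnalysisUtil.scala:661`.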

Expected behavior

Print the diagram, as it does for the small dataset.

Environment

  • OS: iOS
  • Apache Spark Version: 3.1.2
  • Platform: local with master branch code

Labels: untriaged