EMR zero performance with spark udf #769

vineeetska · 2020-11-09T10:29:26Z

Hi,

I am working on AWS EMR with target OS ubuntu.16.04-x64 and Microsoft.Spark 1.0.0 on .NET Core 3.1.
We use a DataFrame with a UDF to match X regexes against Y problem statements, where X and Y are each more than 5K.

Its execution time is more than 2 hours when run on AWS EMR.

Execution steps:

1. Load text file into DataFrame.
2. Apply filters.
3. Create temp view.
4. Execute the UDF for each regex to find matches.

The same job completes within 5 minutes when implemented in Java with the DataFrame API.

Could anyone suggest an approach to achieve performance similar to Java?

Thanks,

imback82 · 2020-11-09T19:19:18Z

@vineeetska, for a C# UDF to be executed, the values have to be moved out of the JVM and processed on the CLR, which incurs some overhead. If you are just doing regex matching, I suggest checking the built-in regex SQL function, which is executed on the JVM:

spark/src/csharp/Microsoft.Spark/Sql/Functions.cs

Line 2430 in b8e2259

public static Column RegexpExtract(Column column, string exp, int groupIdx)
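As a rough illustration of the built-in route, here is a minimal sketch that keeps the matching on the JVM by using `Column.RLike` for filtering and `RegexpExtract` for pulling out a capture group. The app name, the input path taken from `args[0]`, the output column name `error_count`, and the pattern itself are all hypothetical placeholders, not taken from the original job. Since these expressions are evaluated by Spark on the JVM, the pattern follows Java regex syntax, which may differ from .NET in some details.

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class BuiltInRegexSketch
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("builtin-regex-sketch") // hypothetical app name
            .GetOrCreate();

        // Hypothetical input: a text file path passed as the first argument.
        // The text reader produces a single string column named "value".
        DataFrame lines = spark.Read().Text(args[0]);

        // Hypothetical pattern, written in Java regex syntax.
        string pattern = @"(\d+)\s+errors?";

        // Both RLike and RegexpExtract become Spark SQL expressions,
        // so no row data is shipped to the .NET worker for matching.
        DataFrame matches = lines
            .Filter(Col("value").RLike(pattern))
            .WithColumn("error_count", RegexpExtract(Col("value"), pattern, 1));

        matches.Show();
        spark.Stop();
    }
}
```

With thousands of patterns, the same idea would be applied per pattern (or to a combined pattern); whether that fits depends on what each regex needs to report.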

If that is not an option, there are a few things you can try to optimize:

Create only one instance of the compiled Regex object, if you are not already doing so. Example:

spark/benchmark/csharp/Tpch/TpchFunctionalQueries.cs

Line 403 in b8e2259

private static readonly Regex s_q13SpecialRegex = new Regex("^.*special.*requests.*", RegexOptions.Compiled);
Use VectorUdf to utilize Arrow instead of pickling for serialization/deserialization. Example of using VectorUdf:

spark/src/csharp/Microsoft.Spark.E2ETest/IpcTests/Sql/DataFrameTests.cs

Line 162 in b8e2259

public void TestVectorUdf()
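To make the first suggestion concrete, here is a minimal sketch of a UDF that reuses a single compiled `Regex` held in a static field, in the same spirit as the `s_q13SpecialRegex` line referenced above. The class names `SpecialRequestMatcher` and `CompiledRegexUdfSketch`, the app name, and the input path from `args[0]` are hypothetical; only the pattern is borrowed from the referenced benchmark line. This is an illustration under those assumptions, not the benchmark's actual code.

```csharp
using System;
using System.Text.RegularExpressions;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

static class SpecialRequestMatcher
{
    // Constructed once per process when the type is first used,
    // so the UDF does not build a new Regex for every row.
    private static readonly Regex s_regex =
        new Regex("^.*special.*requests.*", RegexOptions.Compiled);

    // Null inputs are treated as non-matching.
    public static bool IsMatch(string input) =>
        input != null && s_regex.IsMatch(input);
}

class CompiledRegexUdfSketch
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("compiled-regex-udf-sketch") // hypothetical app name
            .GetOrCreate();

        // Hypothetical input: a text file path passed as the first argument.
        DataFrame lines = spark.Read().Text(args[0]);

        // Wrap the static matcher in a UDF; the Regex lives in the static field.
        Func<Column, Column> isMatch =
            Udf<string, bool>(s => SpecialRequestMatcher.IsMatch(s));

        lines.Filter(isMatch(lines["value"])).Show();
        spark.Stop();
    }
}
```

The values still cross from the JVM to the CLR for each row, so this only removes the regex construction cost, not the serialization overhead described earlier.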

Please let me know how it goes.

vineeetska · 2020-11-18T15:44:21Z

Thanks @imback82 for the response, and sorry for the late reply.

I have used the Regex cache to avoid recompiling in Regex.IsMatch, with Regex.CacheSize=6000.

I am using the spark-submit command below on AWS EMR. Could you point me to the docs so I can make sure the C# UDFs are executed in the CLR only?

spark-submit --deploy-mode client --master yarn --driver-memory 10G --class org.apache.spark.deploy.dotnet.DotnetRunner <s3location>/microsoft-spark-2-4_2.11-1.0.0.jar <zip> args

I haven't tried VectorUdf yet; I will try it and let you know.

Thanks,
