EMR zero performance with spark udf #769
Replies: 2 comments
-
@vineeetska, for a C# UDF to be executed, the values have to be moved out of the JVM and processed on the CLR, which incurs some overhead. If you are only doing regex matching, I suggest checking the built-in regex SQL function, which is executed on the JVM: spark/src/csharp/Microsoft.Spark/Sql/Functions.cs Line 2430 in b8e2259 If that is not an option, there are a few things you can try to optimize:
Please let me know how it goes.
-
Thanks @imback82 for the response, and sorry for the late reply. I have used the Regex cache, with Regex.CacheSize=6000, to avoid recompiling the patterns on every Regex.IsMatch call. I am using the spark-submit command below on AWS EMR; could you point me to the docs so I can make sure the C# UDFs are getting executed in the CLR only?
I haven't tried VectorUdf yet; I will try it and let you know. Thanks,
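For readers following the Regex cache point: the idea is to compile each pattern once and reuse the compiled object for every statement, rather than recompiling inside the matching loop. Below is a minimal sketch of that idea in plain Java (chosen because the Java job is the baseline in this thread), with no Spark involved. The class name `RegexPrecompile`, the helpers `compileAll` and `countMatches`, and the sample patterns are all hypothetical and not taken from the original job. In .NET the analogous step would be holding on to `Regex` instances, or relying on the static cache sized by `Regex.CacheSize` as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration: compile patterns once, then reuse them for all inputs.
public class RegexPrecompile {

    // Compile every raw pattern exactly once, up front.
    static List<Pattern> compileAll(List<String> rawPatterns) {
        List<Pattern> compiled = new ArrayList<>(rawPatterns.size());
        for (String raw : rawPatterns) {
            compiled.add(Pattern.compile(raw));
        }
        return compiled;
    }

    // Count (pattern, statement) pairs where the pattern occurs anywhere in the statement.
    // find() searches for a match anywhere in the input rather than requiring a full match.
    static long countMatches(List<Pattern> patterns, List<String> statements) {
        long hits = 0;
        for (String statement : statements) {
            for (Pattern pattern : patterns) {
                if (pattern.matcher(statement).find()) {
                    hits++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample data is made up for the sketch.
        List<Pattern> patterns = compileAll(List.of("err(or)?", "^warn", "\\d{3}"));
        List<String> statements = List.of("error 500 on node", "warn: disk", "all good");
        System.out.println(countMatches(patterns, statements));
    }
}
```

With X patterns and Y statements the inner loop still runs X × Y times, so precompiling only removes the per-call compilation cost; it does not reduce the number of match attempts.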
-
Hi,
I am working on AWS EMR infrastructure with target OS ubuntu.16.04-x64 and Microsoft.Spark 1.0.0 on .NET Core 3.1.
We are using a DataFrame with a UDF to match X regexes against Y problem statements, where X and Y are each more than 5K.
Its execution time is more than 2 hours when run on AWS EMR.
Execution steps,
Meanwhile, the same job completes within 5 minutes when implemented in Java with the DataFrame API.
Could anyone suggest an approach that would achieve performance similar to Java?
Thanks,