Open
Description
Describe the bug
I am reading a csv file and a adding a column using Udf. This csv file has around 17M rows. I am using 8 executors (8 cores * 56GB)
using System;
Func<Column, Column> UDFfunction = Udf<string, string>( url => {
if(!String.IsNullOrEmpty(url)) return "Dummy";
return "";
});
var path = "<CSV file path>"
var df = spark.Read().Option("header", false).Option("inferShchema", true).Option("delimiter", "\t").Csv(path);
//Display(df);
df = df.WithColumn("newCol", UDFfunction(df["_c0"]));
Display(df);
When I do display, just after reading CSV, at line no (2), I don't get any error but if I do display at line no 4, it throws me below error.
22/03/14 16:35:04 INFO FileScanRDD [stdout writer for /usr/local/bin/sparkdotnet/Microsoft.Spark.Worker/2.0.0/Microsoft.Spark.Worker]: Reading File path: adl://<txtpath>, range: 0-134217728, partition values: [empty row]
22/03/14 16:35:04 INFO LineRecordReader [stdout writer for /usr/local/bin/sparkdotnet/Microsoft.Spark.Worker/2.0.0/Microsoft.Spark.Worker]: Found UTF-8 BOM and skipped it
[2022-03-14T16:35:04.9692566Z] [vm-8f973732] [Warn] [AssemblyLoader] Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found 'fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f-1-49[.dll,.ni.dll]' in '.,/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1647274291204_0002/container_1647274291204_0002_01_000002,/usr/local/bin/sparkdotnet/Microsoft.Spark.Worker/2.0.0/'
[2022-03-14T16:35:04.9698518Z] [vm-8f973732] [Error] [TaskRunner] [17] ProcessStream() failed with exception: System.IO.FileNotFoundException: Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found ''
at Microsoft.Spark.Utils.UdfSerDe.<>c.b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 270
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 261
at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 333
at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 306
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 187
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 101
at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 43
at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 83
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 144
[2022-03-14T16:35:04.9704949Z] [vm-8f973732] [Error] [TaskRunner] [17] Exiting with exception: System.IO.FileNotFoundException: Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found ''
at Microsoft.Spark.Utils.UdfSerDe.<>c.b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 270
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 261
at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 333
at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 306
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 187
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 101
at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 43
at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 83
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 144
at Microsoft.Spark.Worker.TaskRunner.Run() in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 66
22/03/14 16:35:04 WARN BlockManager [Executor task launch worker for task 0.3 in stage 19.0 (TID 43)]: Putting block rdd_147_0 failed due to exception org.apache.spark.api.python.PythonException: System.IO.FileNotFoundException: Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found ''
at Microsoft.Spark.Utils.UdfSerDe.<>c.b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 270
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 261
at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 333
at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 306
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 187
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 101
at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 43
at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 83
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 144.
[2022-03-14T16:35:04.9707821Z] [vm-8f973732] [Info] [TaskRunner] [17] Finished running 0 task(s).
22/03/14 16:35:04 WARN BlockManager [Executor task launch worker for task 0.3 in stage 19.0 (TID 43)]: Block rdd_147_0 could not be removed as it was not found on disk or in memory
22/03/14 16:35:04 ERROR Executor [Executor task launch worker for task 0.3 in stage 19.0 (TID 43)]: Exception in task 0.3 in stage 19.0 (TID 43)
org.apache.spark.api.python.PythonException: System.IO.FileNotFoundException: Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found ''
at Microsoft.Spark.Utils.UdfSerDe.<>c.b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 270
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 261
at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 333
at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 306
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 187
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 101
at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 43
at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 83
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 144
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:84)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:67)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:237)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:315)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1424)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1351)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1415)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1238)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
End of LogType:stderr.This log file belongs to a running container (container_1647274291204_0002_01_000002) and so may not be complete.
***********************************************************************
I have been getting this error intermittently. Can you please help me with what is the root cause for this problem? Is it a memory issue?
This Assembly/File is auto generated and name is changed every time job runs.