
[BUG]: Spark Worker throws System.IO.FileNotFoundException for a randomly named DLL #1043

Open

Description

@DeepikaPrabha

Describe the bug
I am reading a CSV file and adding a column using a UDF. The CSV file has around 17M rows, and I am using 8 executors (8 cores * 56 GB each).

using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// UDF that replaces any non-empty URL with a dummy value.
Func<Column, Column> UDFfunction = Udf<string, string>(url =>
    !string.IsNullOrEmpty(url) ? "Dummy" : "");


var path = "<CSV file path>";

var df = spark.Read().Option("header", false).Option("inferSchema", true).Option("delimiter", "\t").Csv(path);
//Display(df);
df = df.WithColumn("newCol", UDFfunction(df["_c0"]));
Display(df);

Calling Display(df) immediately after reading the CSV (the commented-out call above) works fine, but calling Display(df) after WithColumn applies the UDF throws the error below.

22/03/14 16:35:04 INFO FileScanRDD [stdout writer for /usr/local/bin/sparkdotnet/Microsoft.Spark.Worker/2.0.0/Microsoft.Spark.Worker]: Reading File path: adl://<txtpath>, range: 0-134217728, partition values: [empty row]
22/03/14 16:35:04 INFO LineRecordReader [stdout writer for /usr/local/bin/sparkdotnet/Microsoft.Spark.Worker/2.0.0/Microsoft.Spark.Worker]: Found UTF-8 BOM and skipped it
[2022-03-14T16:35:04.9692566Z] [vm-8f973732] [Warn] [AssemblyLoader] Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found 'fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f-1-49[.dll,.ni.dll]' in '.,/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1647274291204_0002/container_1647274291204_0002_01_000002,/usr/local/bin/sparkdotnet/Microsoft.Spark.Worker/2.0.0/'
[2022-03-14T16:35:04.9698518Z] [vm-8f973732] [Error] [TaskRunner] [17] ProcessStream() failed with exception: System.IO.FileNotFoundException: Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found ''
   at Microsoft.Spark.Utils.UdfSerDe.<>c.b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 270
   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
   at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 261
   at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
   at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 333
   at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 306
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 187
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 101
   at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 43
   at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 83
   at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 144
[2022-03-14T16:35:04.9704949Z] [vm-8f973732] [Error] [TaskRunner] [17] Exiting with exception: System.IO.FileNotFoundException: Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found ''
   at Microsoft.Spark.Utils.UdfSerDe.<>c.b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 270
   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
   at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 261
   at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
   at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 333
   at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 306
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 187
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 101
   at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 43
   at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 83
   at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 144
   at Microsoft.Spark.Worker.TaskRunner.Run() in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 66
22/03/14 16:35:04 WARN BlockManager [Executor task launch worker for task 0.3 in stage 19.0 (TID 43)]: Putting block rdd_147_0 failed due to exception org.apache.spark.api.python.PythonException: System.IO.FileNotFoundException: Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found ''
   at Microsoft.Spark.Utils.UdfSerDe.<>c.b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 270
   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
   at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 261
   at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
   at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 333
   at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 306
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 187
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 101
   at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 43
   at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 83
   at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 144.
[2022-03-14T16:35:04.9707821Z] [vm-8f973732] [Info] [TaskRunner] [17] Finished running 0 task(s).
22/03/14 16:35:04 WARN BlockManager [Executor task launch worker for task 0.3 in stage 19.0 (TID 43)]: Block rdd_147_0 could not be removed as it was not found on disk or in memory
22/03/14 16:35:04 ERROR Executor [Executor task launch worker for task 0.3 in stage 19.0 (TID 43)]: Exception in task 0.3 in stage 19.0 (TID 43)
org.apache.spark.api.python.PythonException: System.IO.FileNotFoundException: Assembly 'ℛ*fcdc26d4-b2cf-49c1-a497-de2fa26e8c4f#1-49, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' file not found ''
   at Microsoft.Spark.Utils.UdfSerDe.<>c.b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 270
   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
   at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 261
   at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
   at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 333
   at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 306
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 187
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 101
   at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 43
   at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 83
   at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 144
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:84)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:67)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:237)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:315)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1424)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1351)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1415)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1238)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
End of LogType:stderr.This log file belongs to a running container (container_1647274291204_0002_01_000002) and so may not be complete.
***********************************************************************

I have been getting this error intermittently. Could you please help me identify the root cause? Is it a memory issue?

This assembly/file is auto-generated, and its name changes every time the job runs.
