Error from many DataFrame methods after UDF called in DataFrame.WithColumn  #1137

Open
@dogulas-accip

Description

I'm a longtime C# programmer, but I'm just getting my feet wet with .NET for Apache Spark. Following several "getting started" guides and videos, I installed:

7-Zip
Java 8
Apache Spark (downloaded from https://spark.apache.org/downloads.html)
.NET for Apache Spark v2.1.1
WinUtils.exe

I'm running this on Windows 10.

Problem:
When I call DataFrame.Show() after a DataFrame.WithColumn() call that uses a UDF, I always get this error:

[2023-02-07T15:45:31.3903664Z] [DESKTOP-H37P8Q0] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.ArgumentNullException: Value cannot be null. Parameter name: type

TestCases.csv looks like this:
TestCases.csv

OrderList.csv looks like this:
OrderList.csv

Here is the Program class of the TestSparkApp console project:
Program.cs.txt
and supporting classes:
Player.cs.txt
Collector.cs.txt

Here is the output of the above app:
TestSpartAppOutput.txt

Note that the same error appears when executing many different methods on the DataFrame object, but only after a call to the WithColumn method that uses a UDF. In this case, the code looks like this:

    // user-defined function
    Func<Column, Column, Column> GetSubst = Udf<string, string, int>(
        (strOrder, strPlayers) =>
        {
            return GetSubstance(strOrder, strPlayers);
        });

    // call the user-defined function and add a new column to the DataFrame
    ordersFrame = ordersFrame.WithColumn("substance", GetSubst(ordersFrame["names"], ordersFrame["players"]).Cast("Integer"));

    // *** This is where the error is thrown; if I comment it out, the same error is thrown later
    // print out the data
    ordersFrame.Show(20, 20, false);

However, I've tried it with other UDFs followed by other DataFrame method calls, and I always get the same error. In the Main() function, you will see a later foreach loop. If I comment out the ordersFrame.Show() call and uncomment the contents of the loop, I get the same error when I access row.Values[0].ToString().

I wonder if I have missed something in my installation?

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: n/a
  • Version: see above
