RDD is removing null columns on fuzzy linking #258

@saumyasuhagiya

Describe the bug
Fuzzy linking removes columns whose values are null from the rows of the resulting RDD.

To Reproduce

  1. Take a sample RDD with null values in some column.
  2. Perform a fuzzy join using the link method.

-- Code --

    <dependency>
        <groupId>org.zouzias</groupId>
        <artifactId>spark-lucenerdd_2.11</artifactId>
        <version>0.3.7</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>8.5.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>8.5.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-codecs</artifactId>
        <version>8.5.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>8.5.2</version>
    </dependency>

/------------------------------------/

    // Column names used on each side of the link.
    String leftColumn = "a";
    String rightColumn = "b";

    ClassTag<Row> simpleRowTag = scala.reflect.ClassTag$.MODULE$.apply(Row.class);

    // Index the right-hand dataset, lower-casing the linked column first.
    LuceneRDD<Row> rightDsLuceneRDD = LuceneRDD.apply(
        rightDs.withColumn(rightColumn, lower(col(rightColumn))),
        "org.apache.lucene.analysis.standard.ClassicAnalyzer",
        "org.apache.lucene.analysis.standard.ClassicAnalyzer",
        "org.apache.lucene.search.similarities.BM25Similarity");

    // fuzziness and noOfResults are not shown in this snippet.
    RDD<Tuple2<Row, Row[]>> fuzzyJoinResults =
        rightDsLuceneRDD.link(leftDs.rdd(), new SearchQuery<Row, String>() {
            @Override
            public String apply(Row input) {
                // Build a Lucene fuzzy query of the form field:term~fuzziness.
                String leftRDDValue = input.getAs(leftColumn).toString();
                return rightColumn + ":"
                    + QueryParser.escape(leftRDDValue.toLowerCase()) + "~" + fuzziness;
            }
        }, noOfResults, null, simpleRowTag);
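
As a side note on the snippet above: the query generator calls toString() on the left column value, which throws a NullPointerException if that column itself holds a null. The following is a minimal null-safe sketch using only the JDK. The class FuzzyQueries and its methods are hypothetical names for illustration, and escapeLucene is a hand-written stand-in for Lucene's QueryParser.escape, not the library call itself.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper; not part of spark-lucenerdd or Lucene.
final class FuzzyQueries {
    // Characters that Lucene's classic query syntax treats as special.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    private FuzzyQueries() {}

    // Stand-in for QueryParser.escape: prefixes each special character with a backslash.
    static String escapeLucene(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Returns field:term~fuzziness, or empty when there is no usable value to search for.
    static Optional<String> build(String field, Object value, int fuzziness) {
        if (value == null || value.toString().isEmpty()) {
            return Optional.empty();
        }
        String term = escapeLucene(value.toString().toLowerCase(Locale.ROOT));
        return Optional.of(field + ":" + term + "~" + fuzziness);
    }
}
```

With a helper like this, rows whose left column is null could be filtered out before calling link, since an empty result means no query can be built for them.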

Expected behavior
Linking should not remove columns whose values are null; each returned row should contain all the fields that were present in the original RDD.
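
To make the expected shape concrete, here is a small sketch of padding a linked result back to the full set of columns. It assumes the results come back with the null-valued fields simply absent, as described in this report. RowPadding and padMissing are hypothetical names, and plain maps stand in for Spark Row objects.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; plain maps stand in for Spark Rows.
final class RowPadding {
    private RowPadding() {}

    // Returns a copy of result that has every expected column, in schema order.
    // Columns missing from result are re-added with a null value.
    static Map<String, Object> padMissing(List<String> columns, Map<String, Object> result) {
        Map<String, Object> padded = new LinkedHashMap<>();
        for (String column : columns) {
            padded.put(column, result.get(column)); // null when the field was dropped
        }
        // Keep any extra fields the result carries that are not in the schema.
        for (Map.Entry<String, Object> e : result.entrySet()) {
            if (!padded.containsKey(e.getKey())) {
                padded.put(e.getKey(), e.getValue());
            }
        }
        return padded;
    }
}
```

For a schema of a, b, c and a result holding only a and c, the padded map would contain all three keys, with b mapped to null.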

Versions (please complete the following information):

  • spark-lucenerdd version: 0.3.7
  • Spark version: 2.4.5
  • Java version: Java 8

Additional context
I am using the library from Java.
