RDD is removing null columns on fuzzy linking #258
Open
Description
Describe the bug
When linking with a fuzzy query, the rows returned by LuceneRDD are missing every column whose value was null in the original RDD.
To Reproduce
- Take a sample RDD with null values in some column.
- Do a fuzzy join using the `link` method.
-- Code --
`<dependency>
<groupId>org.zouzias</groupId>
<artifactId>spark-lucenerdd_2.11</artifactId>
<version>0.3.7</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
<version>8.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>8.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-codecs</artifactId>
<version>8.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>8.5.2</version>
</dependency>
/------------------------------------/
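// Columns to compare: leftColumn in leftDs, rightColumn in rightDs.
// Declared first because rightColumn is used when building the index below.
String leftColumn = "a";
String rightColumn = "b";
// fuzziness and noOfResults are defined elsewhere (not shown).
ClassTag simpleRowTag = scala.reflect.ClassTag$.MODULE$.apply(Row.class);
// Index the right-hand dataset, lower-casing the join column first.
LuceneRDD<Row> rightDsLuceneRDD = LuceneRDD.apply(rightDs
    .withColumn(rightColumn, lower(col(rightColumn))),
    "org.apache.lucene.analysis.standard.ClassicAnalyzer",
    "org.apache.lucene.analysis.standard.ClassicAnalyzer",
    "org.apache.lucene.search.similarities.BM25Similarity");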
RDD<Tuple2<Row, Row[]>> fuzzyJoinResults =
rightDsLuceneRDD.link(leftDs.rdd(), new SearchQuery<Row, String>() {
@Override
public String apply(Row input) {
Row row = (Row) input;
String leftRDDValue = row.getAs(leftColumn).toString();
String rightRDDColumn = rightColumn;
String query = rightRDDColumn + ":" + QueryParser
.escape(leftRDDValue.toLowerCase()) + "~" + fuzziness;
return query;
}
}, noOfResults, null, simpleRowTag);
`
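A side note on the query builder above: `row.getAs(leftColumn).toString()` throws a `NullPointerException` if the left column itself holds a null. The sketch below shows one way the query string could be built null-safely. It is only an illustration in plain Java: `FuzzyQueryBuilder`, `build` and the local `escape` helper are hypothetical names, and `escape` is a simplified stand-in for Lucene's `QueryParser.escape`, not the library call itself. It does not address the dropped null columns in the results.

```java
import java.util.Locale;
import java.util.Optional;

public class FuzzyQueryBuilder {
    // Characters the Lucene classic query syntax treats as special.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Simplified stand-in for QueryParser.escape (assumption):
    // prefixes each special character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds "field:value~fuzziness", or returns empty when the value
    // is null or blank so the caller can skip that row.
    static Optional<String> build(String field, Object value, int fuzziness) {
        if (value == null) {
            return Optional.empty();
        }
        String text = value.toString().trim().toLowerCase(Locale.ROOT);
        if (text.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(field + ":" + escape(text) + "~" + fuzziness);
    }

    public static void main(String[] args) {
        System.out.println(build("b", "Acme", 2).orElse("<skip>")); // b:acme~2
        System.out.println(build("b", null, 2).orElse("<skip>"));   // <skip>
    }
}
```

In the `SearchQuery.apply` callback, the empty case would need some fallback query string of the caller's choosing, since the callback has to return a `String`.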
Expected behavior
Linking should not drop columns that contain null values. Each returned row should include every field that was present in the original RDD.
Versions (please complete the following information):
- spark-lucenerdd version: 0.3.7
- Spark version: 2.4.5
- Java version: Java 8
Additional context
I am calling the library from Java, not Scala.