Currently, there is no way to track progress when reading a batch, apart from Spark job stages, which are unrelated to the number of files in the underlying folder.

Let's say we proceed with ingesting the entire folder:
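import os
from pyspark.sql.functions import col, md5

data = (
    spark
    .read
    .format("tika")
    # descend into every subfolder of the mount
    .option("recursiveFileLookup", "true")
    .load("file:" + os.path.abspath("/dbfs/mnt/where_my_files_are"))
    # keep only a hash of the extracted content, then drop the content itself
    .withColumn("contentHash", md5(col("content")))
    .drop("content")
)
data.show()

@nfx @arcaputo3 @aamend @JCZuurmond, based on your expertise, would it be technically feasible to implement such a feature?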
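To illustrate the kind of progress signal I have in mind, here is a rough sketch of a workaround that lives outside the reader: list the files up front, split them into fixed-size batches, and report the fraction of files covered after each batch. The helper names `list_files` and `batches_with_progress` are hypothetical and not part of this library, and I have not verified that the `tika` source accepts a list of paths in `.load(...)` the way the built-in file sources do.

```python
import os
from typing import Iterator, List, Tuple


def list_files(root: str) -> List[str]:
    """Recursively collect every file path under `root`, in sorted order."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def batches_with_progress(
    paths: List[str], batch_size: int
) -> Iterator[Tuple[List[str], float]]:
    """Yield (batch, fraction_done); fraction_done includes the current batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = len(paths)
    for start in range(0, total, batch_size):
        batch = paths[start:start + batch_size]
        yield batch, (start + len(batch)) / total
```

Each batch could then be read separately (assuming a list of `"file:" + path` entries is accepted by `.load(...)`), with the fraction printed or logged once the batch's action completes. This only gives per-batch granularity and adds scheduling overhead for small batches, which is why built-in support would be preferable.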
Appreciate your help!