h2o import function is not working when we have a _SUCCESS file present. #16768

@iamakalia

Description

H2O version, Operating System and Environment
Running on Hadoop CDH 7.1.9

Actual behavior
The h2o import function does not work when a _SUCCESS file is present in the folder being imported.

As a workaround, they were informed:
"The Hadoop _SUCCESS file is a parquet file, and unfortunately H2O3 doesn't have a mechanism to know that it's not a relevant file in the folder. There is, however, a way to not write this file while saving parquet files into folders. For example, if you are using a Spark job to write these files, you can use spark.conf.set("parquet.enable.summary-metadata", "false") to not write this file."

Expected behavior
The h2o import function should work when a _SUCCESS file is present.
Their strong preference is for H2O to be able to ignore _SUCCESS when reading from HDFS, either by default or via a configuration option that can be set up front. They do not want to remove the _SUCCESS file, since it is useful for checking whether the upstream task succeeded.
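Until such an option exists, the requested behavior could be approximated on the client side. The sketch below is only an illustration, not H2O's implementation: `is_data_file`, `select_data_files`, and `import_data_files` are hypothetical helper names, and the code assumes the caller already has a listing of the files in the folder. It follows the Hadoop convention of treating names that start with `_` or `.` as non-data files. It also assumes that `h2o.import_file` accepts a list of paths and that the cluster can read them, which should be verified for the H2O version in use.

```python
from posixpath import basename


def is_data_file(path: str) -> bool:
    """Return True unless the file name starts with '_' or '.'.

    Hadoop input formats conventionally skip such names
    (for example _SUCCESS and .crc files).
    """
    name = basename(path.rstrip("/"))
    return bool(name) and not name.startswith(("_", "."))


def select_data_files(paths):
    """Keep only the paths that look like data files, preserving order."""
    return [p for p in paths if is_data_file(p)]


def import_data_files(paths):
    """Hypothetical wrapper: import only the data files into H2O.

    Assumes h2o.init() or h2o.connect() has already been called and that
    h2o.import_file accepts a list of paths (an assumption to verify).
    """
    import h2o  # imported lazily so the filter works without h2o installed

    return h2o.import_file(path=select_data_files(paths))
```

With a listing such as `["hdfs://nn/out/_SUCCESS", "hdfs://nn/out/part-00000.parquet"]` (a made-up example), `select_data_files` keeps only the `part-00000.parquet` entry, so the _SUCCESS marker can stay in place for upstream checks.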

Upload logs
If you can, please upload the H2O logs. You can use the h2o.downloadAllLogs() function in R or the h2o.download_all_logs() function in Python.

Additional context
The import worked before 23 November 2024, so there might be a regression.
