master the converter should only read all data once #31
    cols = data_frame.schema.fields
    for col in cols:
        if isinstance(col.dataType, NullType):
            data_frame = data_frame.drop(col.name)
It looks like this is dropping the whole column? DropNullFields is intended to convert missing values to null values.
Yes, it is dropping the whole column, just as DropNullFields does, as far as I know. The first line in the doc link you posted:
"Drops all null fields in a DynamicFrame whose type is NullType. These are fields with missing or null values in every record in the DynamicFrame dataset."
Don't know if you remember it, but the rationale for dropping them is that the writer doesn't handle NullType columns.
Ya know, I think I entirely misunderstood the purpose of the DropNullFields function. 🤦‍♂️
If that's the case, this makes total sense and I don't even know if we want to drop null columns...will have to think about that some more. :)
Yep, do it. I can add, though, that when you're writing Parquet you need to remove NullType columns, since Parquet has no such data type and the writer will fail.
There is no impact while you're reading, though. If you have all the columns in the table schema but some of the columns are missing from the Parquet files, they will be read as null.
Issue #, if available: N/A
Description of changes:
As far as I can see, the current implementation reads all the data twice.
The DynamicFrame DropNullFields call causes recomputeSchema to be triggered in
the toDF call.
There are surely many ways of achieving this. I just added something that solves the problem quickly,
because I don't know if this project is still alive.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.