Skip to content
This repository was archived by the owner on Feb 1, 2024. It is now read-only.

master the converter should only read all data once#31

Open
NiklasMolin wants to merge 1 commit into
awslabs:masterfrom
NiklasMolin:master
Open

master the converter should only read all data once#31
NiklasMolin wants to merge 1 commit into
awslabs:masterfrom
NiklasMolin:master

Conversation

@NiklasMolin
Copy link
Copy Markdown

Issue #, if available: N/A

Description of changes:
The current implementation reads all data twice as far as I can see.
The dynamicframe dropNull causes recomputeSchema to be triggered in
the toDF call.
Guess there is a thousand ways of achieving it. Just added something that solves the matter, in a quick way
cause I don't know if this project is still alive.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

cols = data_frame.schema.fields
for col in cols:
if isinstance(col.dataType,NullType):
data_frame = data_frame.drop(col.name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is dropping the whole column? DropNullFields is intended to convert missing values to null values.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, It is dropping the whole column. Just as DropNullFields as far as I know. The first row in the doc link you posted:
"Drops all null fields in a DynamicFrame whose type is NullType. These are fields with missing or null values in every record in the DynamicFrame dataset."

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know if you remember it but the rationale for dropping them is that the writer doesn't handle NoneType columns

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya know, I think I entirely misunderstood the purpose of the DropNullFields function. 🤦‍♂️

If that's the case, this makes total sense and I don't even know if we want to drop null columns...will have to think about that some more. :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yepp, do it. But can add, that when you're outputting parquet, you need to remove NoneType columns since there is no such datatype and the writer will fail.
There is no impact while you're reading though. So if you have all the columns in the table schema but lack some of the coulmn in the parquet files they will be read as null.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants